[arch-dev-public] Problem with web dashboard: massive orphaning of packages
Hi, I don't know if you remember, but a while ago a huge part of extra i686 (IIRC it was all packages from L to Z) was orphaned and erroneously showing up as recently updated on the web site. This just happened again with packages in extra x86_64. I don't know what could have caused that, but it's very annoying as we have to readopt all our packages.

Eric

-- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean.
On Fri, Sep 12, 2008 at 2:00 PM, Eric Belanger <belanger@astro.umontreal.ca> wrote:
Hi,
I don't know if you remember, but a while ago a huge part of extra i686 (IIRC it was all packages from L to Z) was orphaned and erroneously showing up as recently updated on the web site. This just happened again with packages in extra x86_64. I don't know what could have caused that, but it's very annoying as we have to readopt all our packages.
Did it *just* happen, or did it happen last night?
On Fri, 12 Sep 2008, Aaron Griffin wrote:
On Fri, Sep 12, 2008 at 2:00 PM, Eric Belanger <belanger@astro.umontreal.ca> wrote:
Hi,
I don't know if you remember, but a while ago a huge part of extra i686 (IIRC it was all packages from L to Z) was orphaned and erroneously showing up as recently updated on the web site. This just happened again with packages in extra x86_64. I don't know what could have caused that, but it's very annoying as we have to readopt all our packages.
Did it *just* happen, or did it happen last night?
It was happening when I wrote the email. I think it was still happening after I sent it. I had readopted all my packages, but they were all orphaned again. I guess it was still in the process of re-adding them or doing something.
2008/9/12 Eric Belanger <belanger@astro.umontreal.ca>:
Hi,
I don't know if you remember, but a while ago a huge part of extra i686 (IIRC it was all packages from L to Z) was orphaned and erroneously showing up as recently updated on the web site. This just happened again with packages in extra x86_64. I don't know what could have caused that, but it's very annoying as we have to readopt all our packages.
Fuck.

I remember Judd telling me not to swear at users, but it's OK to swear at scripts, right?

This has to be happening in reporead.py. Fucking reporead.py. To the best of my knowledge, no other script updates the web database in any way; am I wrong?

The actual db_update script splits the packages into those that are in the database and those that are not, and processes them separately. Packages that are not currently in the database get added as orphans because apparently it's hard to interrogate the maintainer from the db.tar.gz. At first, I assumed that it was doing an add when it should be doing an update, which would add new packages with an orphan maintainer. But this doesn't appear to be the case, because there are not currently any duplicate x86_64 packages (that aren't in testing).

My second, more likely hypothesis is a race condition. I don't know how the db scripts update exactly, but I suspect reporead is reading a db.tar.gz file that is either broken or not yet fully uploaded. It sees this broken db file and drops all the packages in the web interface that are not in that file. Then x minutes later (crontab), it runs again on a proper db and sees the missing packages again. It adds them to the database and sets the maintainer to orphan.

Are such broken dbs possible/likely/happening? If it's a race condition, we need to put a lock on the database (maybe dbtools does this already) so that reporead isn't accessing it at the same time as dbtools. If it's just that when the database gets updated it sometimes breaks the database, well... that just needs to be fixed.

Dusty
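If the race-condition hypothesis holds, the lock Dusty proposes could be sketched as an flock(2)-style advisory lock that both dbtools and reporead agree to take. This is a hypothetical illustration, not actual dbtools or reporead code; the lock path and class name are made up:

```python
import fcntl
import os

# Hypothetical lock path; dbtools and reporead would both have to agree on it.
LOCK_PATH = "/tmp/archrepo.lock"

class RepoLock:
    """Advisory exclusive lock so reporead never reads a db mid-update."""

    def __init__(self, path=LOCK_PATH, blocking=True):
        self.path = path
        self.blocking = blocking
        self.fd = None

    def __enter__(self):
        self.fd = os.open(self.path, os.O_CREAT | os.O_RDWR, 0o644)
        flags = fcntl.LOCK_EX if self.blocking else fcntl.LOCK_EX | fcntl.LOCK_NB
        # In non-blocking mode this raises BlockingIOError if someone holds the lock.
        fcntl.flock(self.fd, flags)
        return self

    def __exit__(self, *exc):
        fcntl.flock(self.fd, fcntl.LOCK_UN)
        os.close(self.fd)
```

The idea would be for dbtools to hold the lock while it rebuilds and moves the db, and for reporead to take it (or skip the run entirely) before parsing.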
On Fri, Sep 12, 2008 at 3:23 PM, Dusty Phillips <buchuki@gmail.com> wrote:
2008/9/12 Eric Belanger <belanger@astro.umontreal.ca>:
Hi,
I don't know if you remember, but a while ago a huge part of extra i686 (IIRC it was all packages from L to Z) was orphaned and erroneously showing up as recently updated on the web site. This just happened again with packages in extra x86_64. I don't know what could have caused that, but it's very annoying as we have to readopt all our packages.
Fuck.
I remember Judd telling me not to swear at users, but it's OK to swear at scripts, right?
This has to be happening in reporead.py. Fucking reporead.py. To the best of my knowledge, no other script updates the web database in any way; am I wrong?
The actual db_update script splits the packages into those that are in the database and those that are not, and processes them separately. Packages that are not currently in the database get added as orphans because apparently it's hard to interrogate the maintainer from the db.tar.gz. At first, I assumed that it was doing an add when it should be doing an update, which would add new packages with an orphan maintainer. But this doesn't appear to be the case, because there are not currently any duplicate x86_64 packages (that aren't in testing).
My second, more likely hypothesis is a race condition. I don't know how the db scripts update exactly, but I suspect reporead is reading a db.tar.gz file that is either broken or not yet fully uploaded. It sees this broken db file and drops all the packages in the web interface that are not in that file. Then x minutes later (crontab), it runs again on a proper db and sees the missing packages again. It adds them to the database and sets the maintainer to orphan.
Are such broken dbs possible/likely/happening? If it's a race condition, we need to put a lock on the database (maybe dbtools does this already) so that reporead isn't accessing it at the same time as dbtools. If it's just that when the database gets updated it sometimes breaks the database, well... that just needs to be fixed.
Hmmm, the DBs are constructed in /tmp and then moved live to /home/ftp/whatever. It's possible that reporead may be opening it mid-move, but that doesn't seem right. It's gzipped. Wouldn't that balk if you took half of a DB file and tried to gunzip it?
2008/9/12 Aaron Griffin <aaronmgriffin@gmail.com>:
On Fri, Sep 12, 2008 at 3:23 PM, Dusty Phillips <buchuki@gmail.com> wrote:
2008/9/12 Eric Belanger <belanger@astro.umontreal.ca>:
Hi,
I don't know if you remember, but a while ago a huge part of extra i686 (IIRC it was all packages from L to Z) was orphaned and erroneously showing up as recently updated on the web site. This just happened again with packages in extra x86_64. I don't know what could have caused that, but it's very annoying as we have to readopt all our packages.
Fuck.
I remember Judd telling me not to swear at users, but it's OK to swear at scripts, right?
This has to be happening in reporead.py. Fucking reporead.py. To the best of my knowledge, no other script updates the web database in any way; am I wrong?
The actual db_update script splits the packages into those that are in the database and those that are not, and processes them separately. Packages that are not currently in the database get added as orphans because apparently it's hard to interrogate the maintainer from the db.tar.gz. At first, I assumed that it was doing an add when it should be doing an update, which would add new packages with an orphan maintainer. But this doesn't appear to be the case, because there are not currently any duplicate x86_64 packages (that aren't in testing).
My second, more likely hypothesis is a race condition. I don't know how the db scripts update exactly, but I suspect reporead is reading a db.tar.gz file that is either broken or not yet fully uploaded. It sees this broken db file and drops all the packages in the web interface that are not in that file. Then x minutes later (crontab), it runs again on a proper db and sees the missing packages again. It adds them to the database and sets the maintainer to orphan.
Are such broken dbs possible/likely/happening? If it's a race condition, we need to put a lock on the database (maybe dbtools does this already) so that reporead isn't accessing it at the same time as dbtools. If it's just that when the database gets updated it sometimes breaks the database, well... that just needs to be fixed.
Hmmm, the DBs are constructed in /tmp and then moved live to /home/ftp/whatever. It's possible that reporead may be opening it mid-move, but that doesn't seem right. It's gzipped. Wouldn't that balk if you took half of a DB file and tried to gunzip it?
I think so... not sure if this is a proper test of it, but it fails:

dusty:x86_64 $ head -c 10000 extra.db.tar.gz | tar -xz
gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

reporead does some great stuff with logger (debug and info). Do you know if any of those logged messages are saved?

I haven't checked this time, but IIRC last time it was all packages after the letter L that got orphaned, or something. This indicates that for some reason reporead is not processing all the packages in the file. Either the db does not contain all the files because a half-full db got uploaded, or it is reading part of the db and then exiting for some reason. Why either of these would occur is beyond me.

Dusty
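The same truncation behaviour can be checked from Python's tarfile module, which reporead itself uses. A minimal sketch (is_complete_db is a made-up helper, not part of reporead) — reading a truncated gzipped tarball to the end raises, just like the tar invocation above:

```python
import io
import tarfile

def is_complete_db(data: bytes) -> bool:
    """Return True only if the gzipped tarball can be read to the very end."""
    try:
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tf:
            for member in tf:
                extracted = tf.extractfile(member)
                if extracted is not None:
                    extracted.read()  # force decompression of the payload
        return True
    except (tarfile.TarError, EOFError, OSError):
        # truncated gzip stream or damaged tar headers
        return False
```

So reporead could, in principle, refuse to touch the web db at all when the sync db fails this kind of read-through check.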
Dusty Phillips schrieb:
I think so... not sure if this is a proper test of it but it fails:
dusty:x86_64 $ head -c 10000 extra.db.tar.gz | tar -xz
gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
reporead does some great stuff with logger (debug and info). Do you know if any of those logged messages are saved?
debug and info level messages are saved in /var/log/everything.log.

@Aaron, may I suggest that you add many (all?) devs to the "log" group so we can read the stuff in /var/log. It is often useful, as it would be now for Dusty.
On Fri, Sep 12, 2008 at 5:11 PM, Thomas Bächler <thomas@archlinux.org> wrote:
Dusty Phillips schrieb:
I think so... not sure if this is a proper test of it but it fails:
dusty:x86_64 $ head -c 10000 extra.db.tar.gz | tar -xz
gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
reporead does some great stuff with logger (debug and info). Do you know if any of those logged messages are saved?
debug and info level messages are saved in /var/log/everything.log.
@Aaron, may I suggest that you add many (all?) devs to the "log" group so we can read the stuff in /var/log. It is often useful, as it would be now for Dusty.
I added a few - either people with sysadmin experience, or people who check up on things like this regularly (i.e. you, Thomas)
On Fri, Sep 12, 2008 at 4:40 PM, Dusty Phillips <buchuki@gmail.com> wrote:
reporead does some great stuff with logger (debug and info). Do you know if any of those logged messages are saved?
They're stored in /tmp/archweb_update.log and emailed to me once a day. This is all done in the cron script located at /etc/cron.hourly/update_web_db.sh

Looking at it, I noticed lots of this:

2008-09-12 18:02:38 -> INFO: Finished repo parsing
2008-09-12 18:02:38 -> INFO: Starting database updates.
2008-09-12 18:02:38 -> INFO: Updating Arch: x86_64
2008-09-12 18:02:47 -> INFO: Finished updating Arch: x86_64
2008-09-12 18:02:47 -> INFO: Updating Arch: i686
2008-09-12 18:02:47 -> INFO: Removing package kde-l10n-ca from database
2008-09-12 18:02:47 -> INFO: Removing package xalan-java from database
2008-09-12 18:02:47 -> INFO: Removing package fcgi from database
2008-09-12 18:02:47 -> INFO: Removing package enblend-enfuse from database
2008-09-12 18:02:47 -> INFO: Removing package netcdf from database
2008-09-12 18:02:47 -> INFO: Removing package mirage from database
2008-09-12 18:02:47 -> INFO: Removing package glhack from database
..... lots and lots of "Removing package" lines ....

I wonder a) why those were removed and b) if that is related to the x86_64 orphaning
2008/9/12 Aaron Griffin <aaronmgriffin@gmail.com>:
They're stored in /tmp/archweb_update.log and emailed to me once a day. This is all done in the cron script located at /etc/cron.hourly/update_web_db.sh
What about debug level messages?
2008-09-12 18:02:38 -> INFO: Finished repo parsing
2008-09-12 18:02:38 -> INFO: Starting database updates.
2008-09-12 18:02:38 -> INFO: Updating Arch: x86_64
2008-09-12 18:02:47 -> INFO: Finished updating Arch: x86_64
2008-09-12 18:02:47 -> INFO: Updating Arch: i686
2008-09-12 18:02:47 -> INFO: Removing package kde-l10n-ca from database
2008-09-12 18:02:47 -> INFO: Removing package xalan-java from database
2008-09-12 18:02:47 -> INFO: Removing package fcgi from database
2008-09-12 18:02:47 -> INFO: Removing package enblend-enfuse from database
2008-09-12 18:02:47 -> INFO: Removing package netcdf from database
2008-09-12 18:02:47 -> INFO: Removing package mirage from database
2008-09-12 18:02:47 -> INFO: Removing package glhack from database
..... lots and lots of "Removing package" lines ....

I wonder a) why those were removed and b) if that is related to the x86_64 orphaning
b) is almost certainly yes. The packages get removed and then presumably get added again later with orphan status. This must be thoroughly fucking up the web interface new package notification.

a) is WTF. I just checked the current state of the db.tar.gz files and they seem to contain packages that reporead claims were removed. So it doesn't look like anything is breaking the db.tar.gz. It seems more like reporead is not reading the whole file. But it's still possible the db.tar.gz has been fixed since the error occurred.

I have added some logging info to say how many packages are currently in the web db and how many are in the new sync db. If these are disparate, the problem is in the code that loads the repo.db.tar.gz. Otherwise it's in the code that adds/removes packages.

I also implemented a check to warn or raise an exception if these numbers hit 75% or 50%, as Paul suggested.

I don't have time to look for anything else right now; hopefully it will keep happening so I can track it down.

Does somebody want to give me a quick rundown or wiki article of how the database tools move packages from svn to release in repo.db.tar.gz? I'm thinking if reporead wants to be this anal, maybe we should add some hooks to whatever script says 'I just released a package, please update the database' and sync up the web database at the time things get updated.

Sorry I don't know what's causing this, folks. I'm just praying it's a long-standing bug and I can blame it on cactus instead of having to come back to y'all and say "well, here's the thing, I introduced this really really stupid bug into reporead.py....." ;-)

Dusty
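The warn/abort check Dusty mentions could look something like this sketch. The exception name appears later in the thread, but the exact thresholds, function name, and return shape here are assumptions, not the actual archweb code:

```python
class SomethingFishyException(Exception):
    """The incoming sync db looks implausibly different from the web db."""

def sanity_check(web_count: int, sync_count: int) -> list:
    """Compare package counts before committing any removals to the web db."""
    warnings = []
    if web_count == 0:
        return warnings  # first import; nothing to compare against
    ratio = sync_count / web_count
    # Assumed policy: abort outright on a wildly wrong count,
    # warn when the sync db drops below 75% of the known packages.
    if ratio < 0.50 or ratio > 2.0:
        raise SomethingFishyException(
            "sync db has %d packages but the web db has %d" % (sync_count, web_count))
    if ratio < 0.75:
        warnings.append("sync db has only %d of %d known packages" % (sync_count, web_count))
    return warnings
```

Running such a check before the remove/add loop would have turned the mass orphaning into a loud failure instead of silent data loss.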
On Fri, Sep 12, 2008 at 7:44 PM, Dusty Phillips <buchuki@gmail.com> wrote:
2008/9/12 Aaron Griffin <aaronmgriffin@gmail.com>:
They're stored in /tmp/archweb_update.log and emailed to me once a day. This is all done in the cron script located at /etc/cron.hourly/update_web_db.sh
What about debug level messages?
I'm fairly certain those *don't* go to the syslog, and they're all output to the same script. Maybe the level can be adjusted (at the top it looks like it sets something to WARNING).
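The level Aaron is describing is the standard Python logging threshold: a module-level WARNING default silently drops the debug/info lines. A small sketch (the logger name and format string are illustrative, mimicking the log excerpt below):

```python
import logging

logger = logging.getLogger("reporead_demo")
handler = logging.StreamHandler()
# Format chosen to resemble the "2008-09-12 18:02:38 -> INFO: ..." lines.
handler.setFormatter(logging.Formatter("%(asctime)s -> %(levelname)s: %(message)s"))
logger.addHandler(handler)

logger.setLevel(logging.WARNING)            # the WARNING default at the top
logger.debug("parsed %d packages", 42)      # dropped: below the threshold
logger.info("Starting database updates.")   # dropped too

logger.setLevel(logging.DEBUG)              # lower the threshold
logger.debug("parsed %d packages", 42)      # now emitted
```

So turning on the debug messages would be a one-line change to that level setting, without touching any of the logger.debug() calls.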
2008-09-12 18:02:38 -> INFO: Finished repo parsing
2008-09-12 18:02:38 -> INFO: Starting database updates.
2008-09-12 18:02:38 -> INFO: Updating Arch: x86_64
2008-09-12 18:02:47 -> INFO: Finished updating Arch: x86_64
2008-09-12 18:02:47 -> INFO: Updating Arch: i686
2008-09-12 18:02:47 -> INFO: Removing package kde-l10n-ca from database
2008-09-12 18:02:47 -> INFO: Removing package xalan-java from database
2008-09-12 18:02:47 -> INFO: Removing package fcgi from database
2008-09-12 18:02:47 -> INFO: Removing package enblend-enfuse from database
2008-09-12 18:02:47 -> INFO: Removing package netcdf from database
2008-09-12 18:02:47 -> INFO: Removing package mirage from database
2008-09-12 18:02:47 -> INFO: Removing package glhack from database
..... lots and lots of "Removing package" lines ....

I wonder a) why those were removed and b) if that is related to the x86_64 orphaning
b) is almost certainly yes. The packages get removed and then presumably get added again later with orphan status. This must be thoroughly fucking up the web interface new package notification.
a) is WTF. I just checked the current state of the db.tar.gz and they seem to contain packages that reporead claims were removed. So it doesn't look like anything is breaking the db.tar.gz. It seems more like reporead is not reading the whole file. But it's still possible the db.tar.gz has been fixed since the error occurred.
I have added some logging info to say how many packages are currently in the web db and how many are in the new sync db. If these are disparate the problem is in the code that loads the repo.db.tar.gz. Otherwise its in the code that adds/removes packages.
I also implemented a check to warn or exception if these numbers are 75% or 50%, as Paul suggested.
I don't have time to look for anything else right now, hopefully it will keep happening so I can track it down.
Does somebody want to give me a quick rundown or wiki article of how the database tools move packages from svn to release in repo.db.tar.gz? I'm thinking if reporead wants to be this anal, maybe we should add some hooks to whatever script says 'I just released a package, please update the database' and sync up the web database at the time things get updated.
That's actually what we tried to get away from by doing this. The old DB scripts were so tightly coupled to gerolde, it was near impossible to test them. We actually had binaries that did mysql work. I don't want to go back to that way of doing things. This should all be as decoupled as possible....
Sorry I don't know what's causing this folks. I'm just praying its a long standing bug and can blame it on cactus instead of having to come back to y'all and say "well here's the thing, I introduced this really really stupid bug into reporead.py....." ;-)
I plan on looking into this on my saturday sprint too. I can do some testing and maybe some improvements of reporead.py too. Should be straightforward - setup a DB, grab the django code, wget the extra DB file, and bam.... Anyone else willing to work with me on testing this one?
2008/9/13 Aaron Griffin <aaronmgriffin@gmail.com>:
On Fri, Sep 12, 2008 at 7:44 PM, Dusty Phillips <buchuki@gmail.com> wrote:
2008/9/12 Aaron Griffin <aaronmgriffin@gmail.com>:
They're stored in /tmp/archweb_update.log and emailed to me once a day. This is all done in the cron script located at /etc/cron.hourly/update_web_db.sh
What about debug level messages?
I'm fairly certain those *don't* go to the syslog, and they're all output to the same script. Maybe the level can be adjusted (at the top it looks like it sets something to WARNING).
2008-09-12 18:02:38 -> INFO: Finished repo parsing
2008-09-12 18:02:38 -> INFO: Starting database updates.
2008-09-12 18:02:38 -> INFO: Updating Arch: x86_64
2008-09-12 18:02:47 -> INFO: Finished updating Arch: x86_64
2008-09-12 18:02:47 -> INFO: Updating Arch: i686
2008-09-12 18:02:47 -> INFO: Removing package kde-l10n-ca from database
2008-09-12 18:02:47 -> INFO: Removing package xalan-java from database
2008-09-12 18:02:47 -> INFO: Removing package fcgi from database
2008-09-12 18:02:47 -> INFO: Removing package enblend-enfuse from database
2008-09-12 18:02:47 -> INFO: Removing package netcdf from database
2008-09-12 18:02:47 -> INFO: Removing package mirage from database
2008-09-12 18:02:47 -> INFO: Removing package glhack from database
..... lots and lots of "Removing package" lines ....

I wonder a) why those were removed and b) if that is related to the x86_64 orphaning
b) is almost certainly yes. The packages get removed and then presumably get added again later with orphan status. This must be thoroughly fucking up the web interface new package notification.
a) is WTF. I just checked the current state of the db.tar.gz and they seem to contain packages that reporead claims were removed. So it doesn't look like anything is breaking the db.tar.gz. It seems more like reporead is not reading the whole file. But it's still possible the db.tar.gz has been fixed since the error occurred.
I have added some logging info to say how many packages are currently in the web db and how many are in the new sync db. If these are disparate the problem is in the code that loads the repo.db.tar.gz. Otherwise its in the code that adds/removes packages.
I also implemented a check to warn or exception if these numbers are 75% or 50%, as Paul suggested.
I don't have time to look for anything else right now, hopefully it will keep happening so I can track it down.
Does somebody want to give me a quick rundown or wiki article of how the database tools move packages from svn to release in repo.db.tar.gz? I'm thinking if reporead wants to be this anal, maybe we should add some hooks to whatever script says 'I just released a package, please update the database' and sync up the web database at the time things get updated.
That's actually what we tried to get away from by doing this. The old DB scripts were so tightly coupled to gerolde, it was near impossible to test them. We actually had binaries that did mysql work. I don't want to go back to that way of doing things. This should all be as decoupled as possible....
Sorry I don't know what's causing this folks. I'm just praying its a long standing bug and can blame it on cactus instead of having to come back to y'all and say "well here's the thing, I introduced this really really stupid bug into reporead.py....." ;-)
I plan on looking into this on my saturday sprint too. I can do some testing and maybe some improvements of reporead.py too. Should be straightforward - setup a DB, grab the django code, wget the extra DB file, and bam....
Should work, but you'll also need to import a working extra.db.tar.gz so that you have the 'correct' packages already in the db. Not sure if you have backups or if you can snag one from a mirror somewhere. I have a copy of the current broken i686 extra db if it happens to get fixed at some point and you want it.

Currently, if you run reporead on this db, it raises:

SomethingFishyException: it looks like the syncdb is twice as big as the newpackages. WTF?

The problem is happening before db_update in reporead.py is called and is thus probably in tarfile. The database appears to contain the packages.

I'll be working on this for the next hour or so, then I have to give up and try to make some money today. It's shaping up to be that day in 2008. You know, that day every year where absolutely everything goes wrong? It's that day for me. My toast landed butter side down this morning... So don't expect anything good from me. Anyway, if I don't track it down, I'll pass it to you with any findings.

Dusty
Hey guys,

I tracked down the problem and fixed it. The basic problem is that there are two packages in i686 right now that have their arch set to x86_64:

texlive-core
texlive-htmlxml

ATTENTION: The dbscripts should be updated to ensure that this doesn't happen. The only valid arches for any db.tar.gz file are the name of the arch and 'any'.

When reporead found these, it was parsing the i686 packages and then the x86_64 packages it found, wiping out any packages from the previous run. I am guessing the reverse happened in the x86_64 orphaning: in that case, it was parsing the few i686 packages, which wiped out the x86_64 packages, and then processing x86_64, which re-added the missing packages, orphaning them.

I changed it so that if there are packages with the wrong arch, they are ignored and a warning is printed. Another option would be to coerce the wrong-arch packages to the correct arch for that db.tar.gz, but I think the optimal solution is to ensure the dbtools don't allow adding packages with the wrong arch. The web interface *should* sync up properly with the next run of reporead.

Here's the fix:
http://projects.archlinux.org/?p=archweb_dev.git;a=commitdiff;h=765c6c0cd089...

Hopefully it should all work now. I didn't test it all that thoroughly, but it did update my local db.

Hmm, I just realized there's still a chance that 'any' packages from i686 could clobber 'any' packages from x86_64. I didn't look too closely, but this may be occurring. Aaron, if you're bored today you might want to look into the possibility of that. If not, say hi to Cynthia.

One other thing: I see that the package info files do have references to the maintainer's name and e-mail. I don't know if these are all correct or up to date. If they are, I could probably extract the maintainer info from that and automatically set the package maintainer instead of setting them to orphan.

The problem with this is that if you orphan a package in the web interface and forget to drop your name from the maintainer field in the PKGBUILD, you'll get the package back. I think it's safer to stick with the manual way; anybody else have thoughts on this?

Dusty

2008/9/13 Dusty Phillips <buchuki@gmail.com>:
2008/9/13 Aaron Griffin <aaronmgriffin@gmail.com>:
On Fri, Sep 12, 2008 at 7:44 PM, Dusty Phillips <buchuki@gmail.com> wrote:
2008/9/12 Aaron Griffin <aaronmgriffin@gmail.com>:
They're stored in /tmp/archweb_update.log and emailed to me once a day. This is all done in the cron script located at /etc/cron.hourly/update_web_db.sh
What about debug level messages?
I'm fairly certain those *don't* go to the syslog, and they're all output to the same script. Maybe the level can be adjusted (at the top it looks like it sets something to WARNING).
2008-09-12 18:02:38 -> INFO: Finished repo parsing
2008-09-12 18:02:38 -> INFO: Starting database updates.
2008-09-12 18:02:38 -> INFO: Updating Arch: x86_64
2008-09-12 18:02:47 -> INFO: Finished updating Arch: x86_64
2008-09-12 18:02:47 -> INFO: Updating Arch: i686
2008-09-12 18:02:47 -> INFO: Removing package kde-l10n-ca from database
2008-09-12 18:02:47 -> INFO: Removing package xalan-java from database
2008-09-12 18:02:47 -> INFO: Removing package fcgi from database
2008-09-12 18:02:47 -> INFO: Removing package enblend-enfuse from database
2008-09-12 18:02:47 -> INFO: Removing package netcdf from database
2008-09-12 18:02:47 -> INFO: Removing package mirage from database
2008-09-12 18:02:47 -> INFO: Removing package glhack from database
..... lots and lots of "Removing package" lines ....

I wonder a) why those were removed and b) if that is related to the x86_64 orphaning
b) is almost certainly yes. The packages get removed and then presumably get added again later with orphan status. This must be thoroughly fucking up the web interface new package notification.
a) is WTF. I just checked the current state of the db.tar.gz and they seem to contain packages that reporead claims were removed. So it doesn't look like anything is breaking the db.tar.gz. It seems more like reporead is not reading the whole file. But it's still possible the db.tar.gz has been fixed since the error occurred.
I have added some logging info to say how many packages are currently in the web db and how many are in the new sync db. If these are disparate the problem is in the code that loads the repo.db.tar.gz. Otherwise its in the code that adds/removes packages.
I also implemented a check to warn or exception if these numbers are 75% or 50%, as Paul suggested.
I don't have time to look for anything else right now, hopefully it will keep happening so I can track it down.
Does somebody want to give me a quick rundown or wiki article of how the database tools move packages from svn to release in repo.db.tar.gz? I'm thinking if reporead wants to be this anal, maybe we should add some hooks to whatever script says 'I just released a package, please update the database' and sync up the web database at the time things get updated.
That's actually what we tried to get away from by doing this. The old DB scripts were so tightly coupled to gerolde, it was near impossible to test them. We actually had binaries that did mysql work. I don't want to go back to that way of doing things. This should all be as decoupled as possible....
Sorry I don't know what's causing this folks. I'm just praying its a long standing bug and can blame it on cactus instead of having to come back to y'all and say "well here's the thing, I introduced this really really stupid bug into reporead.py....." ;-)
I plan on looking into this on my saturday sprint too. I can do some testing and maybe some improvements of reporead.py too. Should be straightforward - setup a DB, grab the django code, wget the extra DB file, and bam....
Should work, but you'll also need to import a working extra.db.tar.gz so that you have the 'correct' packages already in the db. Not sure if you have backups or if you can snag one from a mirror somewhere. I have a copy of the current broken i686 extra db if it happens to get fixed at some point and you want it.
Currently if you run reporead on this db it raises:
SomethingFishyException: it looks like the syncdb is twice as big as the newpackages. WTF?
The problem is happening before db_update in reporead.py is called and is thus probably in tarfile. The database appears to contain the packages.
I'll be working on this for the next hour or so, then I have to give up and try to make some money today. It's shaping up to be that day in 2008. You know, that day every year where absolutely everything goes wrong? It's that day for me. My toast landed butter side down this morning... So don't expect anything good from me. Anyway, if I don't track it down, I'll pass it to you with any findings.
Dusty
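The arch-filtering fix Dusty describes (skip wrong-arch entries and print a warning) might look roughly like this sketch. The function name and the dict-based package representation are hypothetical, not the actual archweb commit:

```python
import logging

logger = logging.getLogger("reporead_sketch")

def filter_arch(packages, repo_arch):
    """Keep only entries valid for this db: the repo's own arch or 'any'."""
    valid = {repo_arch, "any"}
    kept = []
    for pkg in packages:
        if pkg["arch"] in valid:
            kept.append(pkg)
        else:
            # e.g. texlive-core carrying arch=x86_64 inside the i686 db
            logger.warning("package %s has arch %s in the %s db; skipping",
                           pkg["name"], pkg["arch"], repo_arch)
    return kept
```

Without a filter like this, the stray x86_64 entries started a second per-arch pass that clobbered everything the i686 pass had just written, which is exactly the remove-then-readd-as-orphan pattern in the logs.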
On Sat, Sep 13, 2008 at 10:24 AM, Dusty Phillips <buchuki@gmail.com> wrote:
Hey guys,
I tracked down the problem and fixed it. The basic problem is that there are two packages in i686 right now that have their arch set to x86_64:
texlive-core texlive-htmlxml
Looks like the little "hack" of renaming files instead of rebuilding them bit us in the ass here. The DB scripts currently look for the arch in the filename, but don't validate the PKGINFO.
ATTENTION: The dbscripts should be updated to ensure that this doesn't happen. The only valid arches for any db.tar.gz file are the name of the arch and 'any'.
Please see this (untested) patch which validates the PKGINFO for the correct architecture: http://projects.archlinux.org/?p=dbscripts.git;a=commitdiff;h=a0f73ceca409fa... In the future, perhaps I'll validate more before adding it to the DB.
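On the validation side, the check amounts to reading the arch field out of a package's .PKGINFO and comparing it against the repo db being built. The actual dbscripts patch is shell; this is a hedged Python sketch of the same idea, with a made-up helper name:

```python
import tarfile

def pkginfo_arch(pkg_path):
    """Return the arch= value from a package's .PKGINFO, or None if absent."""
    with tarfile.open(pkg_path) as tf:  # transparent gz/bz2 detection
        try:
            info = tf.extractfile(".PKGINFO")
        except KeyError:
            return None  # no .PKGINFO member at all
        if info is None:
            return None
        for raw in info:
            line = raw.decode("utf-8", "replace").strip()
            if line.startswith("arch"):
                _, _, value = line.partition("=")
                return value.strip()
    return None
```

A db-update script could then refuse any package whose reported arch is neither the repo's arch nor 'any', instead of trusting the filename.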
Aaron Griffin schrieb:
On Sat, Sep 13, 2008 at 10:24 AM, Dusty Phillips <buchuki@gmail.com> wrote:
Hey guys,
I tracked down the problem and fixed it. The basic problem is that there are two packages in i686 right now that have their arch set to x86_64:
texlive-core texlive-htmlxml
Looks like the little "hack" of renaming files instead of rebuilding them bit us in the ass here. The DB scripts currently look for the arch in the filename, but don't validate the PKGINFO.
They should. Instead of renaming the file, Francois should have used makepkg -R with CARCH set to i686.
Dusty Phillips schrieb:
a) is WTF. I just checked the current state of the db.tar.gz and they seem to contain packages that reporead claims were removed. So it doesn't look like anything is breaking the db.tar.gz. It seems more like reporead is not reading the whole file. But its still possible the db.tar.gz has been fixed since the error occurred.
Just an idea here: instead of removing deleted packages from the db, we could add a "deleted on" column with a date in it. archweb will only display the rows which have this set to NULL. When we delete a package, we set this to the current date/time. When we re-add a package, all the maintainer info (and probably other stuff) is still there, as re-adding is simply setting the field back to NULL. Only when a package has been deleted for at least 2 weeks does a cleanup script remove it from the database. This will save us MUCH trouble next time reporead has a bug like this.
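The soft-delete scheme Thomas describes can be sketched without any schema machinery. Assume each package row carries a nullable deleted_on timestamp; the field and function names here are illustrative, not an actual archweb model:

```python
from datetime import datetime, timedelta

GRACE = timedelta(weeks=2)  # the two-week window before real removal

def mark_deleted(pkg, now):
    pkg["deleted_on"] = now      # hide it, but keep maintainer info intact

def readd(pkg):
    pkg["deleted_on"] = None     # re-adding is just clearing the flag

def visible(packages):
    """What archweb would display: only rows with deleted_on still NULL."""
    return [p for p in packages if p["deleted_on"] is None]

def purge(packages, now):
    """The cleanup cron: drop only rows deleted more than GRACE ago."""
    return [p for p in packages
            if p["deleted_on"] is None or now - p["deleted_on"] <= GRACE]
```

The key property is that a buggy reporead run can only flip flags, never destroy maintainer assignments, so a later run (or a human) can undo the damage.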
On Fri, Sep 12, 2008 at 3:23 PM, Dusty Phillips <buchuki@gmail.com> wrote:
2008/9/12 Eric Belanger <belanger@astro.umontreal.ca>:
Hi,
I don't know if you remember but a while ago a huge part of extra i686 (IIRC it was all packages from L to Z) were orphaned and erroneously showing up as recently updated on the web site. This just happened again with packages in extra x86_64. I don't know what could have caused that but it's very annoying as we have to readopt all our packages.
Fuck.
I remember Judd telling me not to swear at users, but it's OK to swear at scripts, right?
This has to be happening in reporead.py. Fucking reporead.py. To the best of my knowledge, no other script updates the web database in any way, am I wrong?
The actual db_update script splits the packages into those that are in the database and those that are not, and processes them separately. Packages that are not currently in the database get added as orphans because apparently it's hard to determine the maintainer from the db.tar.gz. At first, I assumed that it was doing an add when it should be doing an update, which would add new packages with an orphan maintainer. But this doesn't appear to be the case because there are not currently any duplicate x86_64 packages (that aren't in testing).
My second more likely hypothesis is race conditions. I don't know how the db scripts update exactly, but I suspect reporead is reading a db.tar.gz file that is either broken or not yet fully uploaded. It sees this broken db file and drops all the packages in the web interface that are not in that file. Then x minutes later (crontab), it runs again on a proper db and sees the missing packages again. It adds them to the database and sets the maintainer to orphan.
Are such broken dbs possible/likely/happening? If it's a race condition, we need to put a lock on the database (maybe dbtools does this already) so that reporead isn't accessing it at the same time as dbtools. If it's just that updating the database sometimes breaks it, well... that just needs to be fixed.
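One way to implement the lock suggested above: both dbtools and reporead would take an exclusive flock on a shared lockfile before touching the db, so a reader can never see a half-written file. The lockfile path and helper name are made up for illustration.

```python
# Hypothetical mutual-exclusion helper around db access, using flock(2).
import fcntl

def with_db_lock(lock_path, action):
    """Run `action` while holding an exclusive lock on lock_path."""
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)   # blocks until the other side finishes
        try:
            return action()
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```

dbtools would wrap its rebuild in `with_db_lock(...)`, and reporead would take the same lock before reading; whichever gets there second simply waits.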
This would be a hell of a race condition: to make a database, we first unzip it to a temp location, make our changes and updates, and then rezip it. Thus reporead.py would have to open the db while it is being zipped, which is a very short period of time, but I guess theoretically possible.
Without looking at the repo-add code, I don't know if we do this now, but we probably should:
1. unzip the db to a temp location
2. make changes
3. rezip it to db.tar.gz.new
4. move old db to db.tar.gz.old
5. move new db to db.tar.gz
This would make the "db replacement" portion atomic in the sense that we would never have a partial DB; we would only have a short period of time where no db existed in that location. If really necessary, we could avoid even this by copying the old db to one with the .old extension instead of moving it. -Dan
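Steps 3-5 can be sketched as follows, using the copy variant Dan mentions at the end (back up the old db by copying rather than moving, so there is never a moment with no db at the published path). The function name and paths are illustrative.

```python
# Sketch of an atomic db publish: readers either see the complete old db
# or the complete new one, never a partial file.
import os
import shutil

def replace_db(staged_db, db_path):
    """Publish a freshly rebuilt db. `staged_db` is the rebuilt file in a
    temp location; `db_path` is e.g. .../extra.db.tar.gz."""
    new_path = db_path + ".new"
    old_path = db_path + ".old"
    shutil.copy(staged_db, new_path)     # 3. land the new file next to the old one
    if os.path.exists(db_path):
        shutil.copy(db_path, old_path)   # 4. back up by copying, not moving,
                                         #    so the live db never disappears
    os.rename(new_path, db_path)         # 5. rename(2) is atomic on one filesystem
```

The key property is that the final step is a single rename on the same filesystem, which POSIX guarantees to be atomic.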
On Fri, Sep 12, 2008 at 3:57 PM, Dan McGee <dpmcgee@gmail.com> wrote:
On Fri, Sep 12, 2008 at 3:23 PM, Dusty Phillips <buchuki@gmail.com> wrote:
2008/9/12 Eric Belanger <belanger@astro.umontreal.ca>:
Hi,
I don't know if you remember but a while ago a huge part of extra i686 (IIRC it was all packages from L to Z) were orphaned and erroneously showing up as recently updated on the web site. This just happened again with packages in extra x86_64. I don't know what could have caused that but it's very annoying as we have to readopt all our packages.
Fuck.
I remember Judd telling me not to swear at users, but it's OK to swear at scripts, right?
This has to be happening in reporead.py. Fucking reporead.py. To the best of my knowledge, no other script updates the web database in any way, am I wrong?
The actual db_update script splits the packages into those that are in the database and those that are not, and processes them separately. Packages that are not currently in the database get added as orphans because apparently it's hard to determine the maintainer from the db.tar.gz. At first, I assumed that it was doing an add when it should be doing an update, which would add new packages with an orphan maintainer. But this doesn't appear to be the case because there are not currently any duplicate x86_64 packages (that aren't in testing).
My second more likely hypothesis is race conditions. I don't know how the db scripts update exactly, but I suspect reporead is reading a db.tar.gz file that is either broken or not yet fully uploaded. It sees this broken db file and drops all the packages in the web interface that are not in that file. Then x minutes later (crontab), it runs again on a proper db and sees the missing packages again. It adds them to the database and sets the maintainer to orphan.
Are such broken dbs possible/likely/happening? If it's a race condition, we need to put a lock on the database (maybe dbtools does this already) so that reporead isn't accessing it at the same time as dbtools. If it's just that updating the database sometimes breaks it, well... that just needs to be fixed.
This would be a hell of a race condition: to make a database, we first unzip it to a temp location, make our changes and updates, and then rezip it. Thus reporead.py would have to open the db while it is being zipped, which is a very short period of time, but I guess theoretically possible.
Without looking at the repo-add code, I don't know if we do this now, but we probably should: 1. unzip the db to a temp location, 2. make changes, 3. rezip it to db.tar.gz.new, 4. move old db to db.tar.gz.old, 5. move new db to db.tar.gz.
This would make the "db replacement" portion atomic in the sense that we would never have a partial DB; we would only have a short period of time where no db existed in that location. If really necessary, we could avoid even this by copying the old db to one with the .old extension instead of moving it.
Well, all the repo-add stuff is done in a subdir of /tmp too, then it's simply 'mv'ed to /home/ftp, so it *should* be fairly atomic... well, it would be if /tmp were on the same filesystem; then it's just a matter of moving inodes.
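That distinction matters: rename is only atomic within one filesystem, and a cross-device mv degrades to copy-then-delete, which is exactly the window where a reader can see a partial file. A hedged sketch of handling both cases (the function name is illustrative):

```python
# Publish a file atomically even when the source lives on another
# filesystem (e.g. a /tmp workdir): stage next to the destination first,
# then do the final swap with a same-filesystem rename.
import errno
import os
import shutil

def publish(src, dst):
    staged = dst + ".new"
    try:
        os.rename(src, staged)        # same filesystem: just moves the inode
    except OSError as e:
        if e.errno != errno.EXDEV:
            raise
        shutil.copy2(src, staged)     # cross-device: copy first...
        os.remove(src)                # ...then drop the source
    os.rename(staged, dst)            # final swap is atomic either way
```

Readers watching `dst` only ever see the old complete file or the new complete one; the non-atomic copy happens on the `.new` name that nothing reads.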
Dusty Phillips wrote:
2008/9/12 Eric Belanger <belanger@astro.umontreal.ca>:
Hi,
I don't know if you remember but a while ago a huge part of extra i686 (IIRC it was all packages from L to Z) were orphaned and erroneouly showing up as recently updated on the web site. This just happened again with packages in extra x86_64. I don't know what could caused that but it's very annoying as we has to readopt all our packages back.
Fuck.
I remember Judd telling me not to swear at users but its ok to swear at scripts right?
This has to be happening in reporead.py. Fucking reporead.py. To the best of my knowledge, no other script updates the web database in anyway, am I wrong?
The actual db_update script splits the packages into those that are in the database and those that are not and processes them separately. Packages that are not currently in the database get added as orphans because apparently its hard to interrogate the maintainer from the db.tar.gz. At first, I assumed that it is doing an add when it should be doing an update, which would add new packages with orphan maintainer. But this doesn't appear to be the case because there are not currently any duplicate x86_64 packages (that aren't in testing).
My second more likely hypothesis is race conditions. I don't know how the db scripts update exactly, but I suspect reporead is reading a db.tar.gz file that is either broken or not yet fully uploaded. It sees this broken db file and drops all the packages in the web interface that are not in that file. Then x minutes later (crontab), it runs again on a proper db and sees the missing packages again. It adds them to the database and sets the maintainer to orphan.
Are such broken dbs possible/likely/happening? If its a race condition, we need to put a lock on the database (maybe dbtools does this already) so that reporead isn't accessing it at the same time as dbtools. If its just that when the database gets updated it sometimes breaks the database well.. that just needs to be fixed.
Hey, Dusty-- feeling your pain. Don't worry, there's no permanent harm done. One thing I considered doing with the AUR scripts, and which might work here, is just setting a threshold: if more than X packages seem to be orphaned since the last run, assume there's an error somewhere, and yell loudly to several people with root access on the machine. Even if you set that number really high, this should work-- it's usually when the whole db is messed up that these sorts of bad things happen. It could check if more than 50% of the packages are being orphaned. A simple flag to override the check, for the rare case when we really mean it, would require a simple manual step. Just a thought. - P
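The safeguard Paul describes fits in a few lines. This is a sketch, not code from the AUR scripts or reporead; the function name and the `force` override flag are made up here.

```python
# Sanity check before committing a sync run: refuse to orphan more than
# a given fraction of all packages unless explicitly overridden.
def safe_to_apply(total_pkgs, to_orphan, force=False, limit=0.5):
    """Return False if this run would orphan more than `limit` of all
    packages; `force` is the manual override for the rare legit case."""
    if force:
        return True
    if total_pkgs and to_orphan / total_pkgs > limit:
        return False   # almost certainly a broken db.tar.gz: alert admins instead
    return True
```

On a run that returned False, reporead would skip the update and mail the admins rather than wiping maintainer info for half the repo.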
The x86_64 orphaning has happened again some minutes ago. Moreover, the i686 packages from extra are now completely gone from the web interface. Alex
On Sat, 13 Sep 2008 01:12:46 +0200 Alexander Fehr <pizzapunk@gmail.com> wrote:
The x86_64 orphaning has happened again some minutes ago. Moreover, the i686 packages from extra are now completely gone from the web interface.
Alex
I would suggest that no developer commit anything or update the db until this thing is fixed. The complete i686 extra repo is gone from the web, as Alex said. Daniel
2008/9/13 Daniel Isenmann <daniel.isenmann@gmx.de>:
On Sat, 13 Sep 2008 01:12:46 +0200 Alexander Fehr <pizzapunk@gmail.com> wrote:
The x86_64 orphaning has happened again some minutes ago. Moreover, the i686 packages from extra are now completely gone from the web interface.
Alex
I would suggest that no developer commit anything or update the db until this thing is fixed. The complete i686 extra repo is gone from the web, as Alex said.
This shouldn't be necessary; because of the decoupling Aaron mentioned, the web view database will sync itself properly once it's fixed. People will be confused, but they're going to be confused anyway. Dusty
participants (8)
- Aaron Griffin
- Alexander Fehr
- Dan McGee
- Daniel Isenmann
- Dusty Phillips
- Eric Belanger
- Paul Mattal
- Thomas Bächler