Re: [arch-devops] [arch-dev-public] Uploading old packages on archive.org (Was: Archive cleanup)
Hi Baptiste,
On Thu, Jun 14, 2018 at 10:28:17AM +0200, Florian Pritz via arch-dev-public <arch-dev-public@archlinux.org> wrote:
On 10.06.2018 01:35, Baptiste Jonglez wrote:
Archival of all packages between September 2013 and December 2016 is finished:
https://archive.org/details/archlinuxarchive
Here is some documentation on this "Historical Archive" hosted on archive.org:
https://wiki.archlinux.org/index.php/Arch_Linux_Archive#Historical_Archive
So the archive is still growing and we'll need to clean it up again soonish. Where do you keep the archive.org uploader, and is it in a state where we (the devops team) can also easily use it? If not, could you help us get it to that point?
We also have an archive cleanup script here [1]. Maybe the uploader can be integrated there? I don't know how complicated that would be.
[1] https://github.com/archlinux/archivetools/blob/master/archive-cleaner
Florian
Hi,
On 23-01-19, Florian Pritz wrote:
Hi Baptiste,
On Thu, Jun 14, 2018 at 10:28:17AM +0200, Florian Pritz via arch-dev-public <arch-dev-public@archlinux.org> wrote:
On 10.06.2018 01:35, Baptiste Jonglez wrote:
Archival of all packages between September 2013 and December 2016 is finished:
https://archive.org/details/archlinuxarchive
Here is some documentation on this "Historical Archive" hosted on archive.org:
https://wiki.archlinux.org/index.php/Arch_Linux_Archive#Historical_Archive
So the archive is still growing and we'll need to clean it up again soonish. Where do you keep the archive.org uploader, and is it in a state where we (the devops team) can also easily use it? If not, could you help us get it to that point?
I have just pushed the script I wrote last time:
https://github.com/zorun/arch-historical-archive
It's a bit hackish and requires some manual work to correctly upload all packages for a given year, because archive.org rate-limits quite aggressively when it is overloaded.
We also have an archive cleanup script here[1]. Maybe the uploader can be integrated there? I don't know how complicated it is.
[1] https://github.com/archlinux/archivetools/blob/master/archive-cleaner
What about uploading to archive.org as soon as we archive packages on orion?
https://github.com/archlinux/archivetools/blob/master/archive.sh
It would avoid hammering the archive.org server, because we would only send one package at a time. In any case, we need a retry mechanism to cope with the case where the upload fails.
Baptiste
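To make the retry idea concrete, here is a minimal sketch of what such a mechanism could look like. It assumes a hypothetical upload_package() helper that shells out to the `ia` command-line tool from the internetarchive package; the item identifier, attempt count and backoff delays are purely illustrative, not part of any existing script.

import subprocess
import time


def upload_package(path):
    """Upload one package file to archive.org; raises on failure.

    This shells out to the `ia` command-line tool as an example; the real
    uploader might use the internetarchive Python library directly instead.
    The item identifier below is purely illustrative.
    """
    subprocess.run(["ia", "upload", "archlinux_pkg_example", path], check=True)


def upload_with_retries(path, attempts=5, base_delay=60):
    """Retry with exponential backoff to cope with archive.org rate limiting."""
    for attempt in range(attempts):
        try:
            upload_package(path)
            return True
        except subprocess.CalledProcessError:
            if attempt + 1 == attempts:
                return False
            # Back off progressively: 60s, 120s, 240s, ... before the next try.
            time.sleep(base_delay * 2 ** attempt)
    return False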
On Thu, Jan 24, 2019 at 09:27:23AM +0100, Baptiste Jonglez <baptiste@bitsofnetworks.org> wrote:
I have just pushed the script I wrote last time:
https://github.com/zorun/arch-historical-archive
It's a bit hackish and requires some manual work to correctly upload all packages for a given year, because archive.org rate-limits quite aggressively when it is overloaded.
Thanks!
We also have an archive cleanup script here[1]. Maybe the uploader can be integrated there? I don't know how complicated it is.
[1] https://github.com/archlinux/archivetools/blob/master/archive-cleaner
What about uploading to archive.org as soon as we archive packages on orion?
https://github.com/archlinux/archivetools/blob/master/archive.sh
While we still use this archive.sh script, dbscripts has recently also been extended to populate the archive continuously. So uploading could be integrated there with a queue file and a background job that performs the upload.
Alternatively, the uploader could be kept standalone and just adapted to run more often and to maintain its own database/list to know which packages have already been successfully uploaded and which haven't. I'll call this the "state database". Then we could run it every hour or so via a systemd timer and it could upload all new and all previously failed packages. One thing I'd want to have in this context is that the uploader should exit with an error, to let the systemd service fail, if a package fails to upload multiple times.
I think I'd actually prefer this to be standalone for simplicity.
It would avoid hammering the archive.org server, because we would only send one package at a time.
Avoiding load spikes for archive.org certainly sounds like a good idea and for us it's easier to monitor and maintain services that run more often too.
In any case, we need a retry mechanism to cope with the case where the upload fails.
This could use the state database I mentioned above. As for the implementation of such a database, I'd suggest sqlite instead of rolling your own text-based list or whatever. It's fast and simple, but you get all the fancy stuff, like transactions, for free. You also don't have to deal with recovering the database if the script crashes; sqlite just rolls back uncommitted transactions for you.
Would you be interested in adapting the uploader like this and making it an automated service? If you're interested, I can help with the deployment part and provide feedback on the scripting side. If you want, we can also discuss this on IRC.
PS: I've whitelisted you on the arch-devops ML so that your replies also get archived.
Florian
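As a rough illustration of the state database idea, a minimal sketch using Python's built-in sqlite3 module could look like the following. The database path, table layout and failure threshold are assumptions made up for this sketch, not anything that exists yet.

import sqlite3

# Illustrative schema: one row per package file, with an "uploaded" flag and a
# failure counter so that the service can give up (and fail visibly) after
# repeated errors. The database path is an assumption.
db = sqlite3.connect("/srv/archive/uploader-state.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS uploads (
        filename TEXT PRIMARY KEY,
        uploaded INTEGER NOT NULL DEFAULT 0,
        failures INTEGER NOT NULL DEFAULT 0
    )
""")


def record_result(filename, ok, max_failures=3):
    """Record one upload attempt; exit non-zero after too many failures."""
    with db:  # transaction: rolled back automatically if we crash half-way
        if ok:
            db.execute(
                "INSERT INTO uploads (filename, uploaded) VALUES (?, 1) "
                "ON CONFLICT(filename) DO UPDATE SET uploaded = 1",
                (filename,),
            )
        else:
            db.execute(
                "INSERT INTO uploads (filename, failures) VALUES (?, 1) "
                "ON CONFLICT(filename) DO UPDATE SET failures = failures + 1",
                (filename,),
            )
    row = db.execute(
        "SELECT failures FROM uploads WHERE filename = ?", (filename,)
    ).fetchone()
    if row and row[0] >= max_failures:
        raise SystemExit(f"{filename} failed {row[0]} times, giving up")

A systemd timer would then simply run the uploader script on a schedule; the SystemExit above is what would make the service show up as failed after repeated upload errors.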
On 24-01-19, Florian Pritz wrote:
What about uploading to archive.org as soon as we archive packages on orion?
https://github.com/archlinux/archivetools/blob/master/archive.sh
While we still use this archive.sh script, dbscripts has recently also been extended to populate the archive continuously. So uploading could be integrated there with a queue file and a background job that performs the upload.
Alternatively, the uploader could be kept standalone and just adapted to run more often and to maintain its own database/list to know which packages have already been successfully uploaded and which haven't. I'll call this the "state database". Then we could run it every hour or so via a systemd timer and it could upload all new and all previously failed packages. One thing I'd want to have in this context is that the uploader should exit with an error, to let the systemd service fail, if a package fails to upload multiple times. I think I'd actually prefer this to be standalone for simplicity.
There is one argument against a standalone tool: each time it runs, it will need to scan the whole filesystem hierarchy to detect new packages, which can be quite slow. One solution is to have dbscripts build a queue of new packages to upload, but then the upload tool would not be completely standalone (this is basically your first suggestion above). A simpler but less robust way would be to scan only the current year (along with the previous year, for a while).
Other than this issue, it indeed looks like a good idea to clearly separate this tool from dbscripts.
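For the queue-file variant, the consumer side could be as simple as the sketch below; the queue path is hypothetical, and a real implementation would need locking (or a rename trick) so that dbscripts appending concurrently cannot lose entries.

import os

QUEUE = "/srv/archive/upload-queue"  # hypothetical path that dbscripts would append to


def drain_queue():
    """Return queued package paths and empty the queue.

    Only a sketch: concurrent appends by dbscripts are not handled here.
    """
    try:
        with open(QUEUE) as f:
            paths = [line.strip() for line in f if line.strip()]
    except FileNotFoundError:
        return []
    os.truncate(QUEUE, 0)
    return paths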
In any case, we need a retry mechanism to cope with the case where the upload fails.
This could use the state database I mentioned above. As for the implementation of such a database, I'd suggest sqlite instead of rolling your own text-based list or whatever. It's fast and simple, but you get all the fancy stuff, like transactions, for free. You also don't have to deal with recovering the database if the script crashes; sqlite just rolls back uncommitted transactions for you.
Would you be interested in adapting the uploader like this and making it an automated service? If you're interested I can help with the deployment part and provide feedback on the scripting side. If you want, we can also discuss this on IRC.
I don't have a lot of time to work on this at the moment, but I'll see what I can do. How urgent is the cleanup on orion? Is it ok to do it in a few weeks/months?
PS: I've whitelisted you on the arch-devops ML so that your replies also get archived.
OK, thanks!
Baptiste
On Sun, Jan 27, 2019 at 10:37:23PM +0100, Baptiste Jonglez <baptiste@bitsofnetworks.org> wrote:
There is one argument against a standalone tool: each time it runs, it will need to scan the whole filesystem hierarchy to detect new packages, which can be quite slow.
You can focus on the /srv/archive/packages/* directories. I've just run find on those and, once cached, it takes about 0.33 seconds. Uncached it's slightly slower, but below 5 seconds. I forgot to redirect the output the first time and having it output the list makes it quite a bit slower.
A simpler but less robust way would be to scan only the current year (along with the previous year for a while).
The ./packages/ subtree contains all unique packages, no matter when they were released. If you just record the filenames of all packages that have already been uploaded, you can easily detect new ones. I don't see a need for scanning each of the year/month/day trees. Also, the README in your repo already uses the packages/ tree and does not scan the other directories.
Right now, there are a little under 500k packages and signatures in the packages tree. So that's 250k package filenames you'd need to check against the database. I'll ignore signatures and assume that we only add the package file name when the signature has also been uploaded.
I've performed some very quick testing against an sqlite db with 2.5M rows. Running a select statement that searches for matches with a set of 10 strings (some of which never match) completed in ~0.2ms. Multiplied by 25k batches (250k / 10, since we have batches of 10 strings), that's roughly 5 seconds. You will probably get better performance with a smaller database and with larger batches of, say, 100 file names, so I'd say that's perfectly fine. I've also tried matching only a single path and that took slightly under 0.2ms. With a batch of 100 strings it took 0.6ms, which puts the total at around 1-2 seconds.
If you need to further reduce the number of db queries, you could also just check the modification time of the files and skip checking files that are older than some cutoff (let's say 1 month). I'd advise against this though, unless it's really necessary.
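For illustration, the batched check described above could look roughly like this in Python; the paths, table name, glob pattern and batch size are assumptions based on the discussion, not existing code.

import sqlite3
from pathlib import Path

ARCHIVE = Path("/srv/archive/packages")                 # tree layout is an assumption
db = sqlite3.connect("/srv/archive/uploader-state.db")  # same hypothetical state db


def new_packages(batch_size=100):
    """Yield package filenames under packages/ not yet recorded in the uploads table."""
    names = [
        p.name
        for p in ARCHIVE.rglob("*.pkg.tar.*")
        if not p.name.endswith(".sig")  # signatures are handled together with their package
    ]
    for i in range(0, len(names), batch_size):
        batch = names[i:i + batch_size]
        placeholders = ",".join("?" * len(batch))
        known = {
            row[0]
            for row in db.execute(
                f"SELECT filename FROM uploads WHERE filename IN ({placeholders})",
                batch,
            )
        }
        yield from (name for name in batch if name not in known)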
How urgent is the cleanup on orion? Is it ok to do it in a few weeks/months?
Looking at your script, I see that you already seem to have uploaded 2016; is that correct? In that case we could go ahead and remove those packages to buy us some more time (probably another year). Last time we only removed up to 2015.
Florian
Hi Florian,
So actually, I don't have enough time available to work on this project alone. If you or somebody else wants to start writing this tool (especially the database part), please go ahead. I can help with the archive.org part if needed. The rest is inline.
On 27-01-19, Florian Pritz wrote:
On Sun, Jan 27, 2019 at 10:37:23PM +0100, Baptiste Jonglez <baptiste@bitsofnetworks.org> wrote:
There is one argument against a standalone tool: each time it runs, it will need to scan the whole filesystem hierarchy to detect new packages, which can be quite slow.
You can focus on the /srv/archive/packages/* directories. I've just run find on those and, once cached, it takes about 0.33 seconds. Uncached it's slightly slower, but below 5 seconds. I forgot to redirect the output the first time and having it output the list makes it quite a bit slower.
(...)
Right now, there are a little under 500k packages and signatures in the packages tree. So that's 250k package filenames you'd need to check against the database. I'll ignore signatures and assume that we only add the package file name when the signature has also been uploaded.
I've performed some very quick testing against an sqlite db with 2.5M rows. Running a select statement that searches for matches with a set of 10 strings (some of which never match) completed in ~0.2ms. Multiplied by 25k batches (250k / 10, since we have batches of 10 strings), that's roughly 5 seconds. You will probably get better performance with a smaller database and with larger batches of, say, 100 file names, so I'd say that's perfectly fine. I've also tried matching only a single path and that took slightly under 0.2ms. With a batch of 100 strings it took 0.6ms, which puts the total at around 1-2 seconds.
If you need to further reduce the number of db queries, you could also just check the modification time of the files and skip checking files that are older than some cutoff (let's say 1 month). I'd advise against this though, unless it's really necessary.
Thanks for checking the actual numbers so precisely. From what I remember, accessing the filesystem was really slow whenever the archive script was running (100% CPU for several minutes, probably checking/creating hardlinks). But I don't have precise measurements and that's probably not a major issue then.
How urgent is the cleanup on orion? Is it ok to do it in a few weeks/months?
Looking at your script, I see that you already seem to have uploaded 2016; is that correct? In that case we could go ahead and remove those packages to buy us some more time (probably another year). Last time we only removed up to 2015.
Yes, I had already uploaded everything up to 2016:
https://lists.archlinux.org/pipermail/arch-dev-public/2018-June/029274.html
When you delete the packages, don't forget to keep all the database files so that the redirection to archive.org keeps working!
Baptiste
Hi Baptiste,
On Mon, Mar 04, 2019 at 12:43:19PM +0100, Baptiste Jonglez <baptiste@bitsofnetworks.org> wrote:
So actually I don't have enough time available to work on this project alone. If you or somebody else wants to start writing this tool (especially the database part), please go ahead. I can help with the archive.org part if needed.
Thanks! This project could benefit from a volunteer so I'll try to find someone interested in helping out.
Thanks for checking the actual numbers so precisely. From what I remember, accessing the filesystem was really slow whenever the archive script was running (100% CPU for several minutes, probably checking/creating hardlinks). But I don't have precise measurements and that's probably not a major issue then.
Yeah, hardlinks can lead to really bad IO performance. It should be much smoother once we get rid of them.
When you delete the packages, don't forget to keep all database files so that the redirection to archive.org keeps working!
We already removed 2016 a few days ago with the same script we used before. The databases have been kept. Thanks for the pointer though!
Florian