Hi Florian,

So actually I don't have enough time available to work on this project alone. If you or somebody else wants to start writing this tool (especially the database part), please go ahead. I can help with the archive.org part if needed.

The rest is inline:

On 27-01-19, Florian Pritz wrote:
On Sun, Jan 27, 2019 at 10:37:23PM +0100, Baptiste Jonglez <baptiste@bitsofnetworks.org> wrote:
There is one argument against a standalone tool: each time it runs, it will need to scan the whole filesystem hierarchy to detect new packages, which can be quite slow.
You can focus on the /srv/archive/packages/* directories. I've just run find on those and, once cached, it takes about 0.33 seconds. Uncached it's slightly slower, but below 5 seconds. I forgot to redirect the output the first time and having it output the list makes it quite a bit slower.
(...)
Right now, there are a little under 500k packages and signatures in the packages tree. So that's 250k package filenames you'd need to check against the database. I'll ignore signatures and assume that we only add the package file name when the signature has also been uploaded.
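As a rough illustration of the scan described above, here is a minimal Python sketch. It walks a directory tree and yields only package files whose signature is already present. The function name, the ".sig" suffix convention, and the rule "anything that is not a signature is a package" are assumptions made for the example, not a description of the actual archive layout.

```python
import os
from typing import Iterator


def iter_signed_packages(root: str) -> Iterator[str]:
    """Yield paths of package files under `root` that have a signature.

    Assumptions (hypothetical, for illustration only):
    - a signature lives next to its package and is named <package>.sig
    - every file that does not end in ".sig" is treated as a package
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        present = set(filenames)
        for name in filenames:
            if name.endswith(".sig"):
                continue  # signatures are not checked against the database
            if name + ".sig" in present:
                yield os.path.join(dirpath, name)
```

It would be called with the packages tree as root, e.g. `iter_signed_packages("/srv/archive/packages")`. Packages without a signature are simply picked up on a later run, once the signature exists.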
I've performed some very quick testing against an sqlite db with 2.5M rows. Running a select statement that searches for matches with a set of 10 strings (some of which never match) completed in ~0.2ms. Multiplied by 25k (250k / 10 since we have batches of 10 strings) that's roughly 5 seconds. You will probably get better performance with a smaller database and with larger batches of, say, 100 file names, so I'd say that's perfectly fine. I've also tried matching only a single path and that took slightly under 0.2ms. With a batch of 100 strings it took 0.6ms, which puts the total (2.5k batches) around 1-2 seconds.
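The batched lookup could look roughly like the sketch below, using Python's sqlite3 module. The table name "packages" and column name "filename" are hypothetical; the real schema has not been decided in this thread. The function returns the file names that the database does not know yet.

```python
import sqlite3
from typing import Iterable, Iterator, List, Set


def chunks(items: List[str], size: int) -> Iterator[List[str]]:
    """Split a list into consecutive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def find_missing(conn: sqlite3.Connection,
                 filenames: Iterable[str],
                 batch_size: int = 100) -> Set[str]:
    """Return the file names that are not yet in the database.

    Assumes a hypothetical table `packages` with a `filename` column.
    One SELECT ... IN (...) query is issued per batch.
    """
    missing: Set[str] = set()
    for batch in chunks(list(filenames), batch_size):
        # One "?" placeholder per file name in the batch.
        placeholders = ",".join("?" * len(batch))
        rows = conn.execute(
            f"SELECT filename FROM packages WHERE filename IN ({placeholders})",
            batch,
        )
        known = {row[0] for row in rows}
        missing.update(set(batch) - known)
    return missing
```

With 250k file names and batches of 100, this issues 2.5k queries. An index (or primary key) on the file name column is assumed, otherwise each query would scan the whole table.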
If you need to further reduce the number of db queries, you could also just check the modification time of the files and skip the check for files that are older than some cutoff (let's say 1 month). I'd advise against this though, unless it's really necessary.
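For completeness, the modification-time cutoff could be sketched as follows. This is only an illustration of the optional optimisation, which is advised against unless really necessary; the function name and the 30-day default are assumptions.

```python
import os
import time
from typing import Iterable, List, Optional


def recent_files(paths: Iterable[str],
                 max_age_days: float = 30,
                 now: Optional[float] = None) -> List[str]:
    """Keep only the paths modified within the last `max_age_days` days.

    Older files are skipped, on the assumption that they were already
    recorded in the database by an earlier run.
    """
    if now is None:
        now = time.time()
    cutoff = now - max_age_days * 86400  # seconds per day
    return [p for p in paths if os.stat(p).st_mtime >= cutoff]
```

The drawback is that a file with an old modification time that was never recorded would be missed silently, which is one reason to prefer checking everything against the database.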
Thanks for checking the actual numbers so precisely. From what I remember, accessing the filesystem was really slow whenever the archive script was running (100% CPU for several minutes, probably checking/creating hardlinks). But I don't have precise measurements and that's probably not a major issue then.
How urgent is the cleanup on orion? Is it ok to do it in a few weeks/months?
Looking at your script, I see that you already seem to have uploaded 2016, is that correct? In that case we could go ahead and remove those packages to buy us some more time (probably another year). Last time we only removed up to 2015.
Yes, I had already uploaded everything up to 2016:

https://lists.archlinux.org/pipermail/arch-dev-public/2018-June/029274.html

When you delete the packages, don't forget to keep all database files so that the redirection to archive.org keeps working!

Baptiste