On Sun, Jan 27, 2019 at 10:37:23PM +0100, Baptiste Jonglez <baptiste@bitsofnetworks.org> wrote:
> There is one argument against a standalone tool: each time it runs, it will need to scan the whole filesystem hierarchy to detect new packages, which can be quite slow.
You can focus on the /srv/archive/packages/* directories. I've just run find on those and, once cached, it takes about 0.33 seconds. Uncached it's slightly slower, but below 5 seconds. I forgot to redirect the output the first time and having it output the list makes it quite a bit slower.
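For illustration, a rough Python equivalent of that find run could look like the sketch below. It is only a sketch under assumptions: the function name iter_package_files is made up, and I'm assuming signature files can be recognised by a ".sig" suffix.

```python
import os
from typing import Iterator


def iter_package_files(root: str, sig_suffix: str = ".sig") -> Iterator[str]:
    """Yield the path of every non-signature file below `root`.

    Assumption: signatures are identified purely by `sig_suffix`.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Skip detached signatures; only package files are of interest.
            if name.endswith(sig_suffix):
                continue
            yield os.path.join(dirpath, name)


if __name__ == "__main__":
    # Path taken from the discussion above; adjust as needed.
    count = sum(1 for _ in iter_package_files("/srv/archive/packages"))
    print(count)
```

Printing only the count avoids the slowdown from writing the full list to the terminal.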
> A simpler but less robust way would be to scan only the current year (along with the previous year for a while).
The ./packages/ subtree contains all unique packages no matter when they were released. If you just record the filenames of all packages that were already uploaded, you can easily detect new ones. I don't see a need for scanning each of the year/month/day trees. Also, the README in your repo already uses the packages/ tree and does not scan the other directories.

Right now, there are a little under 500k packages and signatures in the packages tree. So that's roughly 250k package filenames you'd need to check against the database. I'll ignore signatures and assume that we only add the package file name when the signature has also been uploaded.

I've performed some very quick testing against an sqlite db with 2.5M rows. Running a select statement that searches for matches with a set of 10 strings (some of which never match) completed in ~0.2ms. Multiplied by 25k queries (250k / 10, since we have batches of 10 strings), that's roughly 5 seconds. I've also tried matching only a single path and that took slightly under 0.2ms. With a batch of 100 strings it took 0.6ms, which gives 2.5k queries and puts the total around 1-2 seconds. You will probably get even better performance with a smaller database, so I'd say that's perfectly fine.

If you need to further reduce the number of db queries, you could also just check the modification time of the files and skip the check for files that are older than some cutoff (let's say 1 month). I'd advise against this though, unless it's really necessary.
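To make the batched lookup concrete, here is a minimal sketch using Python's sqlite3 module. It is not what I ran for the timings above. The table name "uploaded", its "filename" column and the function names are all assumptions made up for this example.

```python
import sqlite3
from typing import Iterable, Iterator, List


def batched(names: Iterable[str], size: int) -> Iterator[List[str]]:
    """Split `names` into lists of at most `size` items."""
    batch: List[str] = []
    for name in names:
        batch.append(name)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def find_new_packages(conn: sqlite3.Connection, names: Iterable[str],
                      batch_size: int = 100) -> List[str]:
    """Return the names that are not yet recorded in the database.

    Assumes a hypothetical table `uploaded(filename TEXT PRIMARY KEY)`.
    """
    new: List[str] = []
    for batch in batched(names, batch_size):
        # One "?" placeholder per name, so a whole batch is one query.
        placeholders = ",".join("?" * len(batch))
        rows = conn.execute(
            f"SELECT filename FROM uploaded WHERE filename IN ({placeholders})",
            batch,
        )
        known = {row[0] for row in rows}
        new.extend(n for n in batch if n not in known)
    return new
```

With 250k file names and a batch size of 100 this issues 2.5k queries. The primary key on the filename column gives sqlite an index to match against, which is what keeps each query cheap.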
> How urgent is the cleanup on orion? Is it ok to do it in a few weeks/months?
Looking at your script, I see that you already seem to have uploaded 2016, is that correct? In that case we could go ahead and remove those packages to buy us some more time (probably another year). Last time we only removed up to 2015.

Florian