[arch-devops] [arch-dev-public] Uploading old packages on archive.org (Was: Archive cleanup)
baptiste at bitsofnetworks.org
Mon Mar 4 11:43:19 UTC 2019
So actually I don't have enough time available to work on this project
alone. If you or somebody else wants to start writing this tool
(especially the database part), please go ahead. I can help with the
archive.org part if needed.
The rest is inline:
On 27-01-19, Florian Pritz wrote:
> On Sun, Jan 27, 2019 at 10:37:23PM +0100, Baptiste Jonglez <baptiste at bitsofnetworks.org> wrote:
> > There is one argument against a standalone tool: each time it runs, it
> > will need to scan the whole filesystem hierarchy to detect new packages,
> > which can be quite slow.
> You can focus on the /srv/archive/packages/* directories. I've just run
> find on those and, once cached, it takes about 0.33 seconds. Uncached it's
> slightly slower, but below 5 seconds. I forgot to redirect the output
> the first time and having it output the list makes it quite a bit slower.
> Right now, there are a little under 500k packages and signatures in the
> packages tree. So that's 250k package filenames you'd need to check
> against the database. I'll ignore signatures and assume that we only add
> the package file name when the signature has also been uploaded.
> I've performed some very quick testing against an sqlite db with 2.5M
> rows. Running a select statement that searches for matches with a set of
> 10 strings (some of which never match) completed in ~0.2ms. Multiplied
> by 25k (250k / 10 since we have batches of 10 strings) that's roughly 5
> seconds. You will probably get better performance with a smaller
> database and with larger batches of like 100 file names so I'd say
> that's perfectly fine. I've also tried matching only a single path and
> that took slightly under 0.2ms. With a batch of 100 strings it took
> 0.6ms which puts the total around 1-2 seconds.
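For whoever picks this up, the batched lookup described above could look roughly like the sketch below. It is only an illustration: the table name `uploaded`, the column `filename` and the function name are made up here, since no schema has been decided. With 250k file names and batches of 100, this issues about 2500 queries, which at 0.6ms each matches the 1-2 second estimate.

```python
import sqlite3


def find_new_packages(conn, filenames, batch_size=100):
    """Return the file names that are not yet recorded in the database.

    Assumes a hypothetical table `uploaded` with a `filename` column.
    """
    missing = []
    for start in range(0, len(filenames), batch_size):
        batch = filenames[start:start + batch_size]
        # One "?" placeholder per file name in this batch.
        placeholders = ",".join("?" * len(batch))
        rows = conn.execute(
            f"SELECT filename FROM uploaded WHERE filename IN ({placeholders})",
            batch,
        )
        known = {row[0] for row in rows}
        # Keep the input order for names the database does not know about.
        missing.extend(name for name in batch if name not in known)
    return missing
```

A batch size of 100 stays well under SQLite's default limit on bound parameters, so larger batches would need that limit checked first.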
> If you need to further reduce the number of db queries, you could also
> just check the modification time of the files and skip checking for
> files that are older than some cutoff (let's say 1 month). I'd advise
> against this though, unless it's really necessary.
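If the cutoff ever turns out to be necessary, a minimal sketch of the modification-time filter could be the following. The function name and the 30-day default are placeholders, and skipping `.sig` files simply mirrors the assumption above that signatures are not tracked separately.

```python
import os
import time


def recent_packages(root, max_age_days=30, now=None):
    """Yield package files under `root` modified within the last `max_age_days`."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".sig"):
                continue  # signatures are ignored, only package files are checked
            path = os.path.join(dirpath, name)
            # Files older than the cutoff are skipped without any database query.
            if os.stat(path).st_mtime >= cutoff:
                yield path
```

The drawback is the one already hinted at: a package whose upload failed more than a month ago would never be retried, so the full scan is the safer default.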
Thanks for checking the actual numbers so precisely. From what I
remember, accessing the filesystem was really slow whenever the archive
script was running (100% CPU for several minutes, probably
checking/creating hardlinks). But I don't have precise measurements and
that's probably not a major issue then.
> > How urgent is the cleanup on orion? Is it ok to do it in a few weeks/months?
> Looking at your script, I see that you already seem to have uploaded
> 2016, is that correct? In that case we could go ahead and remove those
> packages to buy us some more time (probably another year). Last time we
> only removed up to 2015.
Yes, I had already uploaded everything up to 2016: https://lists.archlinux.org/pipermail/arch-dev-public/2018-June/029274.html
When you delete the packages, don't forget to keep all database files so
that the redirection to archive.org keeps working!