[arch-devops] [arch-dev-public] Uploading old packages on archive.org (Was: Archive cleanup)

Baptiste Jonglez baptiste at bitsofnetworks.org
Mon Mar 4 11:43:19 UTC 2019


Hi Florian,

As it turns out, I don't have enough time available to work on this project
alone.  If you or somebody else wants to start writing this tool
(especially the database part), please go ahead.  I can help with the
archive.org part if needed.

The rest is inline:

On 27-01-19, Florian Pritz wrote:
> On Sun, Jan 27, 2019 at 10:37:23PM +0100, Baptiste Jonglez <baptiste at bitsofnetworks.org> wrote:
> > There is one argument against a standalone tool: each time it runs, it
> > will need to scan the whole filesystem hierarchy to detect new packages,
> > which can be quite slow.
> 
> You can focus on the /srv/archive/packages/* directories. I've just run
> find on those and, once cached, it takes about 0.33 seconds. Uncached it's
> slightly slower, but below 5 seconds. I forgot to redirect the output
> the first time and having it output the list makes it quite a bit
> slower.
>
> (...)
>
> Right now, there are a little under 500k packages and signatures in the
> packages tree. So that's 250k package filenames you'd need to check
> against the database. I'll ignore signatures and assume that we only add
> the package file name when the signature has also been uploaded.
> 
> I've performed some very quick testing against an sqlite db with 2.5M
> rows. Running a select statement that searches for matches with a set of
> 10 strings (some of which never match) completed in ~0.2ms. Multiplied
> by 25k (250k / 10 since we have batches of 10 strings) that's roughly 5
> seconds. You will probably get better performance with a smaller
> database and with larger batches of like 100 file names so I'd say
> that's perfectly fine. I've also tried matching only a single path and
> that took slightly under 0.2ms. With a batch of 100 strings it took
> 0.6ms which puts the total around 1-2 seconds.
> 
> If you need to further reduce the number of db queries, you could also
> just check the modification time of the files and skip checking for
> files that are older than some cutoff (let's say 1 month). I'd advise
> against this though, unless it's really necessary.

Thanks for checking the actual numbers so precisely.  From what I
remember, accessing the filesystem was really slow whenever the archive
script was running (100% CPU for several minutes, probably
checking/creating hardlinks).  But I don't have precise measurements and
that's probably not a major issue then.
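
For reference, the approach you describe (walk /srv/archive/packages,
skip the signatures, then check the file names against the database in
batches) could be sketched roughly as follows.  This is only an
illustration under assumptions: the `uploaded` table, its `filename`
column and all function names are hypothetical and not part of any
existing tool.

```python
import os
import sqlite3


def iter_package_names(root):
    """Yield package file names found under root, skipping .sig files.

    Assumes file names are unique across the tree, so the bare name
    can serve as the database key.
    """
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".sig"):
                yield name


def batches(items, size):
    """Group an iterable into lists of at most `size` elements."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def find_missing(conn, names, batch_size=100):
    """Return the names that are not yet recorded in `uploaded`.

    One SELECT ... IN (...) query is issued per batch, so 250k names
    with batch_size=100 means 2500 queries.
    """
    missing = []
    for batch in batches(names, batch_size):
        placeholders = ",".join("?" * len(batch))
        rows = conn.execute(
            f"SELECT filename FROM uploaded WHERE filename IN ({placeholders})",
            batch,
        )
        known = {row[0] for row in rows}
        missing.extend(name for name in batch if name not in known)
    return missing
```

It would be called as find_missing(conn, iter_package_names(root)), with
conn an open sqlite3 connection.  A batch size of 100 stays well below
SQLite's default limit on bound parameters per statement.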

> > How urgent is the cleanup on orion?  Is it ok to do it in a few weeks/months?
> 
> Looking at your script, I see that you already seem to have uploaded
> 2016, is that correct? In that case we could go ahead and remove those
> packages to buy us some more time (probably another year). Last time we
> only removed up to 2015.

Yes, I had already uploaded everything up to 2016: https://lists.archlinux.org/pipermail/arch-dev-public/2018-June/029274.html

When you delete the packages, don't forget to keep all database files so
that the redirection to archive.org keeps working!

Baptiste