[arch-devops] [arch-dev-public] Uploading old packages on archive.org (Was: Archive cleanup)

Florian Pritz bluewind at xinu.at
Sun Jan 27 22:18:47 UTC 2019


On Sun, Jan 27, 2019 at 10:37:23PM +0100, Baptiste Jonglez <baptiste at bitsofnetworks.org> wrote:
> There is one argument against a standalone tool: each time it runs, it
> will need to scan the whole filesystem hierarchy to detect new packages,
> which can be quite slow.

You can focus on the /srv/archive/packages/* directories. I've just run
find on those and, once cached, it takes about 0.33 seconds. Uncached
it's slower, but still below 5 seconds. (I forgot to redirect the output
the first time, and printing the list to the terminal makes it quite a
bit slower.)
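
As a rough illustration of such a scan, here is a minimal Python sketch
that walks a packages tree and collects package file names while
skipping signatures. The function name list_package_files and the
".sig" suffix test are assumptions made for this example; they are not
taken from the actual uploader script.

```python
import os


def list_package_files(root):
    """Return the sorted file names found under root, skipping signatures.

    Assumption: signature files are recognisable by a ".sig" suffix.
    """
    names = []
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".sig"):
                names.append(name)
    return sorted(names)


if __name__ == "__main__":
    # Path as mentioned in this thread; adjust for your own setup.
    for name in list_package_files("/srv/archive/packages"):
        print(name)
```

As with find, redirecting the output to a file rather than the terminal
should keep the run time down.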

> A simpler but less robust way would be to scan only the current year
> (along with the previous year for a while).

The ./packages/ subtree contains all unique packages no matter when they
were released. If you just record the filenames of all packages that
were already uploaded, you can easily detect new ones. I don't see a
need for scanning each of the year/month/day trees. Also, the README in
your repo already uses the packages/ tree and does not scan the other
directories.
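
To make the idea concrete, here is a minimal Python sketch that keeps
the record of uploaded file names in a plain text file, one name per
line, and computes which names are new. The helper names and the text
file format are hypothetical and chosen only for this example.

```python
def load_uploaded(path):
    """Read the recorded file names; a missing record means nothing was uploaded."""
    try:
        with open(path, encoding="utf-8") as fh:
            return {line.rstrip("\n") for line in fh if line.strip()}
    except FileNotFoundError:
        return set()


def detect_new(current_names, uploaded_names):
    """Return the names present in the tree but not yet recorded as uploaded."""
    return sorted(set(current_names) - set(uploaded_names))


def record_uploaded(path, names):
    """Append file names to the record once their upload has succeeded."""
    with open(path, "a", encoding="utf-8") as fh:
        for name in names:
            fh.write(name + "\n")
```

A run would then call load_uploaded, pass the result to detect_new
together with the names found under packages/, upload the new files and
finally call record_uploaded for those that succeeded.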

Right now, there are a little under 500k packages and signatures in the
packages tree. So that's roughly 250k package file names you'd need to
check against the database. I'll ignore signatures and assume that we
only add the package file name once the signature has also been
uploaded.

I've performed some very quick testing against an sqlite db with 2.5M
rows. Running a select statement that searches for matches with a set of
10 strings (some of which never match) completed in ~0.2ms. Multiplied
by 25k (250k / 10 since we have batches of 10 strings) that's roughly 5
seconds. You will probably get better performance with a smaller
database and with larger batches of, say, 100 file names, so I'd say
that's perfectly fine. I've also tried matching only a single path and
that took slightly under 0.2ms. With a batch of 100 strings it took
0.6ms, which puts the total (2.5k batches) around 1-2 seconds.
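
A batched lookup of this kind could look roughly like the following
Python sketch using the standard sqlite3 module. The table name
"uploaded", the column name "filename" and both function names are
assumptions for illustration; this is not the query used in the test
above.

```python
import sqlite3


def open_db(path):
    """Open the database and create the (hypothetical) uploaded table if needed."""
    conn = sqlite3.connect(path)
    # The primary key gives an index on filename, which the IN lookups use.
    conn.execute("CREATE TABLE IF NOT EXISTS uploaded (filename TEXT PRIMARY KEY)")
    return conn


def find_new_names(conn, names, batch_size=100):
    """Return the names that are not yet in the uploaded table, in input order."""
    new = []
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        # One bound parameter per name; 100 stays well below SQLite's
        # default limit on bound parameters (999 in older builds).
        placeholders = ",".join("?" * len(batch))
        rows = conn.execute(
            f"SELECT filename FROM uploaded WHERE filename IN ({placeholders})",
            batch,
        )
        known = {row[0] for row in rows}
        new.extend(name for name in batch if name not in known)
    return new
```

With 250k file names and batch_size=100 this issues 2,500 queries; at
the 0.6ms per query measured above, that is about 1.5 seconds.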

If you need to further reduce the number of db queries, you could also
just check the modification time of the files and skip the check for
files that are older than some cutoff (let's say 1 month). I'd advise
against this though, unless it's really necessary.
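
Should the cutoff turn out to be necessary, a minimal Python sketch of
the filter might look like this. The function name recent_files and the
reading of "1 month" as 30 days are assumptions for this example.

```python
import os
import time


def recent_files(paths, max_age_days=30, now=None):
    """Return the paths whose modification time is within max_age_days of now."""
    if now is None:
        now = time.time()
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    return [p for p in paths if os.stat(p).st_mtime >= cutoff]
```

Only the paths returned here would then be checked against the
database. Note that a file whose mtime is older than the cutoff, for
instance because the uploader did not run for a while, would be skipped
silently.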

> How urgent is the cleanup on orion?  Is it ok to do it in a few weeks/months?

Looking at your script, I see that you already seem to have uploaded
2016. Is that correct? In that case we could go ahead and remove those
packages to buy us some more time (probably another year). Last time we
only removed up to 2015.

Florian