[aur-general] AUR package metadata dump

Florian Pritz bluewind at xinu.at
Thu Nov 15 19:24:39 UTC 2018


On Thu, Nov 15, 2018 at 09:25:02PM +0300, Dmitry Marakasov <amdmi3 at amdmi3.ru> wrote:
> The way Repology currently fetches AUR package data is as follows:
> - fetch https://aur.archlinux.org/packages.gz
> - split packages into 100 item packs
> - fetch JSON data for packages in each pack from https://aur.archlinux.org/rpc/?v=5&type=info&arg[]=<packages>
> 
> While fetching data from the API, Repology pauses for 1 second between
> requests to avoid putting excess load on the server, but there are still
> frequent 429 errors. I've tried 2 second delays, but the 429s are still
> there, and fetch time increases dramatically since we have to make more
> than 500 requests. The API is probably loaded by other clients as well.
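
For illustration, a minimal sketch of the batched fetch flow described
above (Python, standard library only; the pack size of 100 and the 1
second delay match the description, the rest of the names and the lack
of error handling are just illustrative):

    import gzip
    import json
    import time
    import urllib.parse
    import urllib.request

    BASE = "https://aur.archlinux.org"

    def fetch_package_names():
        # packages.gz is a gzipped, newline-separated list of package
        # names (skip any comment header lines).
        with urllib.request.urlopen(BASE + "/packages.gz") as resp:
            text = gzip.decompress(resp.read()).decode()
        return [line for line in text.splitlines()
                if line and not line.startswith("#")]

    def fetch_info(names):
        # /rpc/?v=5&type=info&arg[]=<pkg>&arg[]=<pkg>...
        query = urllib.parse.urlencode(
            [("v", 5), ("type", "info")] + [("arg[]", n) for n in names])
        with urllib.request.urlopen(BASE + "/rpc/?" + query) as resp:
            return json.loads(resp.read().decode())["results"]

    names = fetch_package_names()
    results = []
    for i in range(0, len(names), 100):        # 100-item packs
        results.extend(fetch_info(names[i:i + 100]))
        time.sleep(1)                          # pause between requests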

The rate limit allows 4000 API requests per source IP in a 24 hour
window. It does not matter which type of request you send or how many
packages you request information for. Spreading out requests is still
appreciated, but it mostly won't influence your rate limit.

The packages.gz file currently contains around 53000 packages. If you
split those into packs of 100 each and then perform a single API request
for each pack to fetch all the details, you end up with roughly 530
requests. Given that you hit the limit, you probably check multiple
times each day, correct? I'd suggest spreading the checks over a 6-hour
period or longer. This should keep you well below the limit.
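
As a rough back-of-the-envelope sketch (assuming ~53000 packages and
packs of 100, as above):

    TOTAL_PACKAGES = 53000          # approximate size of packages.gz
    PACK_SIZE = 100
    DAILY_LIMIT = 4000              # API requests per source IP per 24h
    SPREAD_SECONDS = 6 * 60 * 60    # spread one full check over 6 hours

    requests_per_check = -(-TOTAL_PACKAGES // PACK_SIZE)  # ~530
    checks_per_day = DAILY_LIMIT // requests_per_check    # ~7 checks fit
    delay = SPREAD_SECONDS / requests_per_check           # ~41 s apart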

> I suggest implementing a regularly updated JSON dump of information
> on all packages and making it available on the site, like packages.gz is.
> The content should be similar to what https://aur.archlinux.org/rpc/?v=5&type=info
> would return for all packages at once.
>
> This would eliminate the need to access the API and thus the load on
> it, and would simplify and dramatically speed up fetching for both
> Repology and other potential clients.

It may also generate much more network traffic, since the problem that
prompted the creation of the rate limit was people running update-check
scripts every 5 or 10 seconds via conky. Some of those resulted in up to
40 million requests in a single day due to inefficient clients and a
huge number of checked packages. I'm somewhat worried that a central
dump may just invite people to write clients that fetch it, and then we
start this whole thing again. Granted, it's only a single request per
check, but the response is likely quite big. Maybe the best way to do
this is to actually implement it as an API call and thus share the rate
limit with the rest of the API to prevent abuse.

Apart from all that, I'd suggest that you propose the idea (or a patch)
on the aur-dev mailing list, assuming that there isn't a huge discussion
about it here first.

Florian