[aur-general] AUR package metadata dump

Dmitry Marakasov amdmi3 at amdmi3.ru
Fri Nov 16 14:35:28 UTC 2018


* Florian Pritz via aur-general (aur-general at archlinux.org) wrote:

> > The way Repology currently fetches AUR package data is as follows:
> > - fetch https://aur.archlinux.org/packages.gz
> > - split packages into 100 item packs
> > - fetch JSON data for packages in each pack from https://aur.archlinux.org/rpc/?v=5&type=info&arg[]=<packages>
> > 
> > While fetching data from the API, Repology pauses for 1 second between
> > requests to avoid creating excess load on the server, but there are still
> > frequent 429 errors. I've tried 2 second delays, but the 429s are still
> > there, and fetch time increases dramatically as we have to do more than
> > 500 requests. The API is probably loaded by other clients as well.
> 
> The rate limit allows 4000 API requests per source IP in a 24 hour
> window. It does not matter which type of request you send or how many
> packages you request information for. Spreading out requests is still
> appreciated, but it mostly won't influence your rate limit.
> 
> The packages.gz file currently contains around 53000 packages. If you
> split those into packs of 100 each and then perform a single API request
> for each pack to fetch all the details, you end up with roughly 530
> requests. Given you hit the limit, you probably check multiple times
> each day, correct? I'd suggest to spread the checks over a 6 hour period
> or longer. This should keep you well below the limit.

Thanks for the clarification! Correct, I'm doing multiple updates a day.
The rate varies, but it's about one update every 2 hours. I guess I can
stuff more packages into a single request for now. Proper update
scheduling will be implemented later (which will allow e.g. setting AUR
to update no faster than every 3 hours), but I hope to facilitate making
a JSON dump, which would allow both faster and more frequent updates.
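
For reference, here's roughly what the current fetch flow boils down to
(a minimal Python sketch; the chunk size, delay and function names are
just illustrative, only the packages.gz and /rpc/ endpoints are real):

    import gzip
    import json
    import time
    import urllib.parse
    import urllib.request

    AUR = "https://aur.archlinux.org"
    CHUNK = 100  # packages per info request; limited mainly by URL length
    DELAY = 1    # seconds between requests, to be polite

    def fetch_package_names():
        # packages.gz is a gzipped plain-text list, one package name per line
        with urllib.request.urlopen(AUR + "/packages.gz") as resp:
            text = gzip.decompress(resp.read()).decode("utf-8")
        return [line for line in text.splitlines()
                if line and not line.startswith("#")]

    def fetch_info(names):
        # /rpc/?v=5&type=info&arg[]=foo&arg[]=bar returns details for the
        # listed packages in a single request
        query = urllib.parse.urlencode(
            [("v", 5), ("type", "info")] + [("arg[]", name) for name in names]
        )
        with urllib.request.urlopen(AUR + "/rpc/?" + query) as resp:
            return json.load(resp)["results"]

    def fetch_all():
        names = fetch_package_names()
        results = []
        for i in range(0, len(names), CHUNK):
            results.extend(fetch_info(names[i:i + CHUNK]))
            time.sleep(DELAY)
        return results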

> > I suggest to implement a regularly updated JSON dump of information
> > on all packages and make it available for the site, like packages.gz is.
> > The content should be similar to what https://aur.archlinux.org/rpc/?v=5&type=info
> > would return for all packages at once.
> >
> > This will eliminate the need to access the API and generate load
> > on it, simplify and speed up fetching dramatically for both Repology
> > and possible other clients.
> 
> It may also generate much more network traffic since the problem that
> prompted the creation of the rate limit was that people ran update check
> scripts every 5 or 10 seconds via conky. Some of those resulted in up to
> 40 million requests in a single day due to inefficient clients and a
> huge number of checked packages. I'm somewhat worried that a central
> dump may just invite people to write clients that fetch it and then we
> start this whole thing again. Granted, it's only a single request per
> check, but the response is likely quite big. Maybe the best way to do
> this is to actually implement it as an API call and thus share the rate
> limit with the rest of the API to prevent abuse.

As I've already replied to Eli, implementing this as an API call is a
strange thing to suggest, as it would make it much easier to generate
more load on the server and more traffic.

The benefits of the dump as I see it are:

- Much less load on the server.

  I've looked through the API code, and it does an extra SQL query per
  package to get extended data such as dependencies and licenses, which
  consists of multiple unions and joins involving 10 tables. That looks
  extremely heavy, and getting a dump through the API is equivalent to
  issuing this heavy query 53k times (i.e. once for each package).

  The dump, OTOH, may be generated hourly, and it will eliminate the need
  for clients to resort to these heavy queries.

- Less traffic usage, as the static dump can be

  - Compressed
  - Cached
  - Not transferred at all if it hasn't changed since the previous
    request, e.g. based on If-Modified-Since or a related header
    (see the sketch below)
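
To illustrate the last point, a conditional fetch is trivial on the
client side. A sketch (the dump URL and file names here are hypothetical,
just to show the mechanism):

    import email.utils
    import os
    import urllib.error
    import urllib.request

    # Hypothetical name/location of the dump; to be decided on the AUR side.
    DUMP_URL = "https://aur.archlinux.org/packages-meta.json.gz"
    CACHE = "aur-packages-meta.json.gz"

    def fetch_dump_if_changed():
        req = urllib.request.Request(DUMP_URL)
        if os.path.exists(CACHE):
            # ask the server to send the dump only if it changed since our copy
            stamp = email.utils.formatdate(os.path.getmtime(CACHE), usegmt=True)
            req.add_header("If-Modified-Since", stamp)
        try:
            with urllib.request.urlopen(req) as resp:
                data = resp.read()
        except urllib.error.HTTPError as e:
            if e.code == 304:
                return False  # not modified, nothing was transferred
            raise
        with open(CACHE, "wb") as out:
            out.write(data)
        return True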

I don't think that the existence of the dump will encourage clients
which don't need ALL the data, as it's still heavier to download and
decompress - doing that every X seconds will create noticeable load on
the clients themselves, prompting them to redo it in a proper way.

It can also still be rate limited (separately from the API, and probably
with a much lower rate, e.g. 4 RPH looks reasonable) - I see that
aur.archlinux.org uses nginx, which supports such rate limiting pretty well.
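
Something along these lines would do (a rough nginx sketch with a
hypothetical dump path; note that nginx expresses limit_req rates per
second or per minute, so an exact 4 RPH would have to be approximated or
enforced by other means):

    # Per-IP limit for the dump only, separate from the API rate limit.
    limit_req_zone $binary_remote_addr zone=aur_dump:10m rate=1r/m;

    server {
        location = /packages-meta.json.gz {  # hypothetical dump path
            # allows a short burst, then 1 request per minute per IP
            limit_req zone=aur_dump burst=3 nodelay;
            limit_req_status 429;
        }
    }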

> Apart from all that, I'd suggest that you propose the idea (or a patch)
> on the aur-dev mailing list, assuming that there isn't a huge discussion
> about it here first.

-- 
Dmitry Marakasov   .   55B5 0596 FF1E 8D84 5F56  9510 D35A 80DD F9D2 F77D
amdmi3 at amdmi3.ru  ..:              https://github.com/AMDmi3

