On Thu, Nov 15, 2018 at 09:25:02PM +0300, Dmitry Marakasov <amdmi3@amdmi3.ru> wrote:
> The way Repology currently fetches AUR package data is as follows:
> - fetch https://aur.archlinux.org/packages.gz
> - split the packages into packs of 100 items
> - fetch JSON data for the packages in each pack from
>   https://aur.archlinux.org/rpc/?v=5&type=info&arg[]=<packages>
> While fetching data from the API, Repology pauses for 1 second between requests so as not to put excess load on the server, but there are still frequent 429 errors. I've tried 2 second delays, but the 429s are still there, and the fetch time increases dramatically since we have to make more than 500 requests. The API is probably loaded by other clients as well.
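For reference, the fetch flow described above boils down to roughly the following Python sketch; the packages.gz URL, the RPC parameters, the pack size of 100 and the 1 second pause are taken from the description, everything else is illustrative:

    import gzip
    import json
    import time
    import urllib.parse
    import urllib.request

    # Fetch the package name list (gzip-compressed, one name per line).
    with urllib.request.urlopen("https://aur.archlinux.org/packages.gz") as resp:
        names = [line for line in gzip.decompress(resp.read()).decode().splitlines()
                 if line and not line.startswith("#")]

    # Query the RPC interface in packs of 100, pausing between requests.
    packages = []
    for i in range(0, len(names), 100):
        pack = names[i:i + 100]
        query = urllib.parse.urlencode(
            [("v", 5), ("type", "info")] + [("arg[]", name) for name in pack])
        with urllib.request.urlopen("https://aur.archlinux.org/rpc/?" + query) as resp:
            packages.extend(json.load(resp)["results"])
        time.sleep(1)  # pause between requests to avoid hammering the server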
The rate limit allows 4000 API requests per source IP in a 24 hour window. It does not matter which type of request you send or how many packages you request information for. Spreading out requests is still appreciated, but it mostly won't influence your rate limit.

The packages.gz file currently contains around 53000 packages. If you split those into packs of 100 each and then perform a single API request per pack to fetch all the details, you end up with roughly 530 requests. Given that you hit the limit, you probably check multiple times each day, correct? I'd suggest spreading the checks over a period of 6 hours or longer. This should keep you well below the limit.
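Back of the envelope, that budget works out like this (a small illustrative calculation; the run counts are arbitrary examples):

    LIMIT_PER_DAY = 4000      # API requests allowed per source IP per 24 hours
    REQUESTS_PER_RUN = 530    # ~53000 packages / 100 packages per request

    for runs_per_day in (1, 4, 7, 8):
        total = runs_per_day * REQUESTS_PER_RUN
        print(runs_per_day, total, "ok" if total <= LIMIT_PER_DAY else "over the limit")
    # 7 runs a day (3710 requests) still fits, 8 runs a day (4240) does not.

    # Spreading a single run over 6 hours means one request roughly every 40 seconds:
    print(6 * 3600 / REQUESTS_PER_RUN)   # ~40.8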
> I suggest implementing a regularly updated JSON dump of information on all packages and making it available on the site, like packages.gz is. The content should be similar to what https://aur.archlinux.org/rpc/?v=5&type=info would return for all packages at once.
>
> This would eliminate the need to access the API and generate load on it, and would dramatically simplify and speed up fetching for both Repology and any other clients.
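For illustration, consuming such a dump could be as simple as the sketch below; the dump URL, file name and compression are hypothetical assumptions, only the idea of serving the RPC info output for all packages in one file comes from the proposal:

    import gzip
    import json
    import urllib.request

    # Hypothetical dump location -- no such file exists yet; the format is
    # assumed to match the "results" array of an RPC info call.
    DUMP_URL = "https://aur.archlinux.org/packages-meta.json.gz"

    with urllib.request.urlopen(DUMP_URL) as resp:
        packages = json.loads(gzip.decompress(resp.read()))

    # One request instead of ~530, with the same per-package fields as the RPC.
    print(len(packages), "packages in the dump")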
Such a dump may also generate much more network traffic: the problem that prompted the creation of the rate limit was that people ran update check scripts every 5 or 10 seconds via conky, and some of those resulted in up to 40 million requests in a single day due to inefficient clients and a huge number of checked packages. I'm somewhat worried that a central dump may just invite people to write clients that fetch it, and then we start this whole thing again. Granted, it's only a single request per check, but the response is likely quite big. Maybe the best way to do this is to actually implement it as an API call and thus share the rate limit with the rest of the API to prevent abuse.

Apart from all that, I'd suggest that you propose the idea (or a patch) on the aur-dev mailing list, assuming that there isn't a huge discussion about it here first.

Florian