On Fri, Nov 16, 2018 at 05:35:28PM +0300, Dmitry Marakasov <amdmi3@amdmi3.ru> wrote:
> - Much less load on the server.
> I've looked through the API code, and it does an extra SQL query per package to get extended data such as dependencies and licenses, consisting of multiple unions and joins involving 10 tables. That looks extremely heavy, and getting a dump through the API is equivalent to issuing this heavy query 53k times (i.e. once for each package).
Actually, the database load of the current API is so low (possibly due to the MySQL query cache) that we failed to measure a difference when we put it behind a 10 minute cache via nginx. The most noticeable effect of API requests is the size of the log file; a log file that filled up the disk was the primary trigger to look into this. Apart from that, I am opposed on principle to runaway scripts that generate unnecessary requests.

My idea is to either generate the results on demand or cache them in the code. If cached in the code, there would be no database load; requests would just pass through the code so we can perform rate limiting. Granted, if we could implement the rate limit in nginx (see below), that would be essentially the same and fine too. Then we/you could indeed just dump it to a file and serve that.
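For illustration, the nginx-side cache I mentioned amounts to something like this (the location path, cache directory and backend address are placeholders, not our actual setup):

# proxy_cache_path belongs in the http block
proxy_cache_path /var/cache/nginx/aur keys_zone=aurapi:10m max_size=100m;

server {
    # placeholder location for the API endpoint
    location /rpc {
        proxy_cache aurapi;
        proxy_cache_valid 200 10m;         # serve cached 200 responses for 10 minutes
        proxy_pass http://127.0.0.1:8080;  # placeholder backend address
    }
}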
> I don't think the existence of the dump will encourage clients which don't need ALL the data, as it's still heavier to download and decompress - doing that every X seconds will create noticeable load on the clients themselves, which should prompt them to redo it in a proper way.
You'll be amazed what ideas people come up with and what they don't notice. Someone once thought it would be a good idea to have a script that regularly (I think daily) fetches a sorted mirror list from our web site and then reuses it without modification. Obviously, if many people use that solution and all use the same sort order (as intended by the script author), they all end up with the same mirror on the first line, and that mirror becomes overloaded quite quickly.
> It can also still be rate limited (separately from the API, and probably with a much lower rate; e.g. 4 requests per hour looks reasonable). I see aur.archlinux.org uses nginx, which supports such rate limiting pretty well.
How would you configure a limit of 4/hour? Last time I checked, nginx only supported limits per second and per minute, with no arbitrary time frames and no non-integer values. This still seems to be the case after a quick check of the documentation [1]. Thus, the lowest limit I could configure is 1r/m, but that is something entirely different from what I/we want. If you have a solution for configuring arbitrary limits directly in nginx, I'd love to know about it.

[1] http://nginx.org/en/docs/http/ngx_http_limit_req_module.html#limit_req_zone

Florian
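P.S.: For reference, the closest approximation nginx's syntax allows looks like this (the zone name and location path are made up for illustration):

# rate accepts only r/s and r/m, so 1r/m is the minimum expressible rate,
# i.e. 60 requests/hour - there is no way to write 4 requests/hour.
# limit_req_zone belongs in the http block.
limit_req_zone $binary_remote_addr zone=apidump:10m rate=1r/m;

server {
    # illustrative path for the dump file
    location /packages.json.gz {
        limit_req zone=apidump;
    }
}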