On Fri, Nov 16, 2018 at 05:35:28PM +0300, Dmitry Marakasov <amdmi3@amdmi3.ru> wrote:
> - Much less load on the server.
> I've looked through the API code, and it does an extra SQL query per package to get extended data such as dependencies and licenses, consisting of multiple unions and joins involving 10 tables. That looks extremely heavy, and getting a dump through the API is equivalent to issuing this heavy query 53k times (i.e. once for each package).
Actually, the database load of the current API is so low (possibly due to the MySQL query cache) that we failed to measure a difference when we put it behind a 10 minute cache via nginx. The most noticeable effect of API requests is the size of the log file; a log file that filled up the disk was the primary trigger to look into this. Apart from that, I am opposed on principle to runaway scripts that generate unnecessary requests.

My idea is to either generate the results on demand or cache them in the code. If cached in the code, there would be no database load; requests would just pass through the code so we can perform rate limiting. Granted, if we could implement the rate limit in nginx (see below), that would be essentially the same and fine too. Then we/you could indeed just dump it to a file and serve that.
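For illustration, the nginx-side cache I mentioned amounts to something like this (the location path, cache directory and backend address are placeholders, not our actual setup):

# proxy_cache_path belongs in the http block
proxy_cache_path /var/cache/nginx/aur keys_zone=aurapi:10m max_size=100m;

server {
    # placeholder location for the API endpoint
    location /rpc {
        proxy_cache aurapi;
        proxy_cache_valid 200 10m;         # serve cached 200 responses for 10 minutes
        proxy_pass http://127.0.0.1:8080;  # placeholder backend address
    }
}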
> I don't think the existence of the dump will encourage clients which don't need ALL the data, as it's still heavier to download and decompress - doing that every X seconds will create noticeable load on the clients themselves, which should prompt them to redo it in a proper way.
You'll be amazed what ideas people come up with and what they don't notice. Someone once thought it would be a good idea to have a script that regularly (I think daily) fetches a sorted mirror list from our web site and then reuses it without modification. Obviously, if many people use that solution and all use the same sort order (as intended by the script author), they all end up with the same mirror on the first line, and that mirror becomes overloaded quite quickly.
> It can also still be rate limited (separately from the API, and probably with a much lower rate; e.g. 4 requests per hour looks reasonable). I see aur.archlinux.org uses nginx, which supports such rate limiting pretty well.
How would you configure a limit of 4/hour? Last time I checked, nginx only supported limits per second and per minute, with no arbitrary time frames and no non-integer values. This still seems to be the case after a quick check of the documentation [1]. Thus, the lowest limit I could configure is 1r/m, but that is something entirely different from what I/we want. If you have a solution for configuring arbitrary limits directly in nginx, I'd love to know about it.

[1] http://nginx.org/en/docs/http/ngx_http_limit_req_module.html#limit_req_zone

Florian
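P.S.: For reference, the closest approximation nginx's syntax allows looks like this (the zone name and location path are made up for illustration):

# rate accepts only r/s and r/m, so 1r/m is the minimum expressible rate,
# i.e. 60 requests/hour - there is no way to write 4 requests/hour.
# limit_req_zone belongs in the http block.
limit_req_zone $binary_remote_addr zone=apidump:10m rate=1r/m;

server {
    # illustrative path for the dump file
    location /packages.json.gz {
        limit_req zone=apidump;
    }
}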