[aur-general] AUR package metadata dump
Hi!

I'm the maintainer of Repology.org, a service which monitors, aggregates and compares package versions across 200+ package repositories, with the purpose of simplifying package maintainers' work by discovering new versions faster, improving collaboration between maintainers and giving software authors a complete overview of how well their projects are packaged.

Repology does obviously support AUR, however there were some problems with retrieving information on AUR packages and I think this could be improved.

The way Repology currently fetches AUR package data is as follows:
- fetch https://aur.archlinux.org/packages.gz
- split packages into 100 item packs
- fetch JSON data for packages in each pack from https://aur.archlinux.org/rpc/?v=5&type=info&arg[]=<packages>

While fetching data from the API, Repology does a 1 second pause between requests to not create excess load on the server, but there are still frequent 429 errors. I've tried 2 second delays, but the 429s are still there, and fetch time increases dramatically as we have to do more than 500 requests. Probably the API is loaded by other clients as well.

I suggest implementing a regularly updated JSON dump of information on all packages and making it available on the site, like packages.gz is. The content should be similar to what https://aur.archlinux.org/rpc/?v=5&type=info would return for all packages at once.

This will eliminate the need to access the API and generate load on it, and will simplify and speed up fetching dramatically for both Repology and possibly other clients.

Additionally, I'd like to suggest adding information on distfiles to the dump (and probably to the API as well, for consistency). For instance, Repology checks availability of all (homepage and download) links it retrieves from package repositories and reports broken ones so the packages can be fixed.

--
Dmitry Marakasov  .  55B5 0596 FF1E 8D84 5F56 9510 D35A 80DD F9D2 F77D
amdmi3@amdmi3.ru  ..:  https://github.com/AMDmi3
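For illustration, a minimal sketch of the fetch flow described above, assuming Python with the requests library; the batch size and delay are the values from the mail, while the function names and error handling are hypothetical:

import gzip
import time

import requests

AUR = "https://aur.archlinux.org"

def fetch_package_names():
    # packages.gz is a gzipped list of package names, one per line
    resp = requests.get(f"{AUR}/packages.gz", timeout=60)
    resp.raise_for_status()
    data = resp.content
    if data[:2] == b"\x1f\x8b":  # still gzip-compressed on the wire
        data = gzip.decompress(data)
    return [line for line in data.decode().splitlines()
            if line and not line.startswith("#")]

def fetch_info(names, batch_size=100, delay=1.0):
    packages = []
    for i in range(0, len(names), batch_size):
        batch = names[i:i + batch_size]
        params = [("v", "5"), ("type", "info")] + [("arg[]", n) for n in batch]
        resp = requests.get(f"{AUR}/rpc/", params=params, timeout=60)
        resp.raise_for_status()  # this is where the 429s show up
        packages.extend(resp.json()["results"])
        time.sleep(delay)  # pause between requests to be gentle on the server
    return packages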
On Thu, Nov 15, 2018 at 09:25:02PM +0300, Dmitry Marakasov <amdmi3@amdmi3.ru> wrote:
The way Repology currently fetches AUR package data is as follows:
- fetch https://aur.archlinux.org/packages.gz
- split packages into 100 item packs
- fetch JSON data for packages in each pack from https://aur.archlinux.org/rpc/?v=5&type=info&arg[]=<packages>
While fetching data from the API, Repology does a 1 second pause between requests to not create excess load on the server, but there are still frequent 429 errors. I've tried 2 second delays, but the 429s are still there, and fetch time increases dramatically as we have to do more than 500 requests. Probably the API is loaded by other clients as well.
The rate limit allows 4000 API requests per source IP in a 24 hour window. It does not matter which type of request you send or how many packages you request information for. Spreading out requests is still appreciated, but it mostly won't influence your rate limit.

The packages.gz file currently contains around 53000 packages. If you split those into packs of 100 each and then perform a single API request for each pack to fetch all the details, you end up with roughly 530 requests. Given you hit the limit, you probably check multiple times each day, correct? I'd suggest spreading the checks over a 6 hour period or longer. This should keep you well below the limit.
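To put rough numbers on that suggestion, a quick back-of-the-envelope check in Python; the figures are the ones quoted above:

PACKAGES = 53000        # approximate size of packages.gz
BATCH_SIZE = 100        # packages per info request
RATE_LIMIT = 4000       # requests per source IP per 24 hours
SPREAD_HOURS = 6        # suggested window for one full refresh

requests_per_refresh = -(-PACKAGES // BATCH_SIZE)    # ceiling division: 530
delay_between_requests = SPREAD_HOURS * 3600 / requests_per_refresh
refreshes_per_day = RATE_LIMIT // requests_per_refresh

print(requests_per_refresh)             # 530 requests per full refresh
print(round(delay_between_requests))    # roughly 41 seconds between requests
print(refreshes_per_day)                # at most 7 full refreshes per day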
I suggest implementing a regularly updated JSON dump of information on all packages and making it available on the site, like packages.gz is. The content should be similar to what https://aur.archlinux.org/rpc/?v=5&type=info would return for all packages at once.
This will eliminate the need to access the API and generate load on it, and will simplify and speed up fetching dramatically for both Repology and possibly other clients.
It may also generate much more network traffic, since the problem that prompted the creation of the rate limit was that people ran update check scripts every 5 or 10 seconds via conky. Some of those resulted in up to 40 million requests in a single day due to inefficient clients and a huge number of checked packages. I'm somewhat worried that a central dump may just invite people to write clients that fetch it and then we start this whole thing again. Granted, it's only a single request per check, but the response is likely quite big. Maybe the best way to do this is to actually implement it as an API call and thus share the rate limit with the rest of the API to prevent abuse.

Apart from all that, I'd suggest that you propose the idea (or a patch) on the aur-dev mailing list, assuming that there isn't a huge discussion about it here first.

Florian
* Florian Pritz via aur-general (aur-general@archlinux.org) wrote:
The way Repology currently fetches AUR package data is as follows:
- fetch https://aur.archlinux.org/packages.gz
- split packages into 100 item packs
- fetch JSON data for packages in each pack from https://aur.archlinux.org/rpc/?v=5&type=info&arg[]=<packages>
While fetching data from the API, Repology does a 1 second pause between requests to not create excess load on the server, but there are still frequent 429 errors. I've tried 2 second delays, but the 429s are still there, and fetch time increases dramatically as we have to do more than 500 requests. Probably the API is loaded by other clients as well.
The rate limit allows 4000 API requests per source IP in a 24 hour window. It does not matter which type of request you send or how many packages you request information for. Spreading out requests is still appreciated, but it mostly won't influence your rate limit.
The packages.gz file currently contains around 53000 packages. If you split those into packs of 100 each and then perform a single API request for each pack to fetch all the details, you end up with roughly 530 requests. Given you hit the limit, you probably check multiple times each day, correct? I'd suggest spreading the checks over a 6 hour period or longer. This should keep you well below the limit.
Thanks for the clarification! Correct, I'm doing multiple updates a day. The rate varies, but it's about once every 2 hours. I guess I can stuff more packages into a single request for now. Later, proper update scheduling will be implemented (which will allow, e.g., setting AUR to update no faster than every 3 hours), but I hope to push for a JSON dump, which would allow both faster and more frequent updates.
I suggest implementing a regularly updated JSON dump of information on all packages and making it available on the site, like packages.gz is. The content should be similar to what https://aur.archlinux.org/rpc/?v=5&type=info would return for all packages at once.
This will eliminate the need to access the API and generate load on it, and will simplify and speed up fetching dramatically for both Repology and possibly other clients.
It may also generate much more network traffic, since the problem that prompted the creation of the rate limit was that people ran update check scripts every 5 or 10 seconds via conky. Some of those resulted in up to 40 million requests in a single day due to inefficient clients and a huge number of checked packages. I'm somewhat worried that a central dump may just invite people to write clients that fetch it and then we start this whole thing again. Granted, it's only a single request per check, but the response is likely quite big. Maybe the best way to do this is to actually implement it as an API call and thus share the rate limit with the rest of the API to prevent abuse.
As I've already replied to Eli, implementing this as an API call is a strange thing to suggest, as it'll make it much easier to generate more load on the server and more traffic.

The benefits of the dump as I see them are:

- Much less load on the server. I've looked through the API code and it does an extra SQL query per package to get extended data such as dependencies and licenses, which consists of multiple unions and joins involving 10 tables. That looks extremely heavy, and getting a dump through the API is equivalent to issuing this heavy query 53k times (i.e. once for each package). The dump, OTOH, may be generated hourly, and it will eliminate the need for clients to resort to these heavy queries.
- Less traffic usage, as the static dump can be:
  - Compressed
  - Cached
  - Not transferred at all if it hasn't changed since the previous request, e.g. based on If-Modified-Since or related headers (see the sketch below)

I don't think that the existence of the dump will encourage clients which don't need ALL the data, as it's still heavier to download and decompress; doing that every X seconds will create noticeable load on the clients, prompting them to redo it in a proper way. It can also still be rate limited (separately from the API, and probably with a much lower rate, e.g. 4 RPH looks reasonable). I see aur.archlinux.org uses nginx, and it supports such rate limiting pretty well.
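For illustration, a minimal sketch of such a conditional fetch, assuming Python with the requests library; the dump URL and local file name are hypothetical, since no such dump exists yet:

import email.utils
import os

import requests

# Hypothetical location of the dump; no such file exists on the AUR today.
DUMP_URL = "https://aur.archlinux.org/packages-meta.json.gz"
LOCAL_COPY = "packages-meta.json.gz"

def fetch_dump_if_changed():
    headers = {}
    if os.path.exists(LOCAL_COPY):
        # Ask the server to send the body only if it changed since our copy
        mtime = os.path.getmtime(LOCAL_COPY)
        headers["If-Modified-Since"] = email.utils.formatdate(mtime, usegmt=True)

    resp = requests.get(DUMP_URL, headers=headers, timeout=60)
    if resp.status_code == 304:
        return False  # not modified: nothing transferred except headers
    resp.raise_for_status()

    with open(LOCAL_COPY, "wb") as f:
        f.write(resp.content)
    return True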
Apart from all that, I'd suggest that you propose the idea (or a patch) on the aur-dev mailing list, assuming that there isn't a huge discussion about it here first.
--
Dmitry Marakasov  .  55B5 0596 FF1E 8D84 5F56 9510 D35A 80DD F9D2 F77D
amdmi3@amdmi3.ru  ..:  https://github.com/AMDmi3
On Fri, Nov 16, 2018 at 05:35:28PM +0300, Dmitry Marakasov <amdmi3@amdmi3.ru> wrote:
- Much less load on the server.
I've looked through the API code and it does an extra SQL query per package to get extended data such as dependencies and licenses, which consists of multiple unions and joins involving 10 tables. That looks extremely heavy, and getting a dump through the API is equivalent to issuing this heavy query 53k times (i.e. once for each package).
Actually the database load of the current API is so low (possibly due to the mysql query cache) that we failed to measure a difference when we put it behind a 10 minute cache via nginx. The most noticeable effect of API requests is the size of the log file. Well, and then, I am on principle against runaway scripts that generate unnecessary requests. The log file that filled up the disk was the primary trigger to look into this, though.

My idea is to either generate the results on demand or cache them in the code. If cached in the code, there would be no database load. It would just pass through the code so we can perform rate limiting. Granted, if we can implement the rate limit in nginx (see below) that would be essentially the same and fine too. Then we/you could indeed just dump it to a file and serve that.
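A minimal sketch of what "cache them in the code" could look like, assuming Python; build_results is a hypothetical placeholder for whatever would query the database and assemble the full package list, nothing like this exists in aurweb today:

import json
import threading
import time

CACHE_TTL = 600  # seconds; e.g. the 10 minute window mentioned above
_cache = {"payload": None, "generated": 0.0}
_lock = threading.Lock()

def get_all_packages_json(build_results):
    # Return a cached JSON blob, regenerating it at most once per CACHE_TTL.
    # Between regenerations, requests never touch the database.
    with _lock:
        now = time.time()
        if _cache["payload"] is None or now - _cache["generated"] > CACHE_TTL:
            _cache["payload"] = json.dumps(build_results())
            _cache["generated"] = now
        return _cache["payload"]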
I don't think that the existence of the dump will encourage clients which don't need ALL the data, as it's still heavier to download and decompress; doing that every X seconds will create noticeable load on the clients, prompting them to redo it in a proper way.
You'll be amazed what ideas people come up with and what they don't notice. Someone once thought that it would be a good idea to have a script that regularly (I think daily) fetches a sorted mirror list from our web site and then reuses that without modification. Obviously, if many people use that solution and all use the same sort order, which was intended by the script author, they all have the same mirror in the first line and thus that mirror becomes overloaded quite quickly.
It can also still be rate limited (separately from the API, and probably with a much lower rate, e.g. 4 RPH looks reasonable). I see aur.archlinux.org uses nginx, and it supports such rate limiting pretty well.
How would you configure a limit of 4/hour? Last time I checked, nginx only supported limits per second and per minute, with no arbitrary time frames nor non-integer values. This still seems to be the case after a quick check in the documentation[1]. Thus, the lowest limit that I could configure is 1/m, but that's something totally different from what I/we want. If you have a solution to configure arbitrary limits directly in nginx, I'd love to know about it.

[1] http://nginx.org/en/docs/http/ngx_http_limit_req_module.html#limit_req_zone

Florian
On 16.11.18 - 18:27, Florian Pritz via aur-general wrote:
My idea is to either generate the results on demand or cache them in the code. If cached in the code, there would be no database load. It would just pass through the code so we can perform rate limiting. Granted, if we can implement the rate limit in nginx (see below) that would be essentially the same and fine too. Then we/you could indeed just dump it to a file and serve that.
I've just been discussing an idea with Florian, which might provide a reasonable way forward for both sides:

The clients could send a timestamp to the API that implies some sort of "give me all updates since *that*". The update timestamps for the packages are already tracked in the database anyway, and putting an index on that column would make requesting various ranges quite efficient. When there were no changes since the client-supplied timestamp, the API could respond with HTTP 304 "Not Modified" (possibly without a body even), which would provide a suitable meaning and a very tiny response. We could actually think about not logging those 304 responses then, dunno what the general opinion on that is.

If a new client wants to get started and build its own archive, it could supply the timestamp "since=0" (talking about unix timestamps here), which would simply result in a response with all packages. To prevent abuse of such (very large) deltas, we could implement some sort of shared rate limit, like Florian mentioned. A first idea would be to use a shared rate limit for all requests with a timestamp older than 48 hours, for example. We could allow something like 200 of such requests per hour, and if that limit was exceeded, the API would reply with maybe HTTP 400 "Bad Request" or HTTP 412 "Precondition Failed", along with a "Retry-After: [0-9]+" header to tell the client when to try again.

Anywho, I just wanted to put this out there and gather some thoughts, feedback and opinions on this.

Cheers,
Thore

--
Thore Bödecker
GPG ID: 0xD622431AF8DB80F3
GPG FP: 0F96 559D 3556 24FC 2226 A864 D622 431A F8DB 80F3
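From the client side, a rough sketch of what consuming such a delta API could look like, assuming Python with the requests library; the "since" parameter, the 304/400/412 responses and the Retry-After handling are all taken from the proposal above, none of this exists in the current RPC interface:

import time

import requests

RPC_URL = "https://aur.archlinux.org/rpc/"

def fetch_updates(last_sync, session=None):
    # Fetch packages changed since the unix timestamp last_sync.
    # Returns (changed_packages, new_sync_timestamp).
    session = session or requests.Session()
    while True:
        now = int(time.time())
        resp = session.get(RPC_URL,
                           params={"v": "5", "type": "info", "since": last_sync},
                           timeout=60)
        if resp.status_code == 304:
            return [], last_sync  # nothing changed, nearly no traffic
        if resp.status_code in (400, 412):
            # shared rate limit for very large deltas exceeded; honour Retry-After
            time.sleep(int(resp.headers.get("Retry-After", "3600")))
            continue
        resp.raise_for_status()
        return resp.json()["results"], now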
On 11/16/18 2:11 PM, Thore Bödecker via aur-general wrote:
Anywho, I just wanted to put this out there and gather some thoughts, feedback and opinions on this.
This discussion is, at this point, no longer "hey, do you know what happened and why this doesn't work?" I think at this point (we're beginning to propose high-level code concepts and make highly specific requests) we are ready to move the discussion to the aur-dev mailing list. :)

Thanks.

--
Eli Schwartz
Bug Wrangler and Trusted User
On 11/15/18 1:25 PM, Dmitry Marakasov wrote:
Hi!
I'm the maintainer of Repology.org, a service which monitors, aggregates and compares package versions across 200+ package repositories, with the purpose of simplifying package maintainers' work by discovering new versions faster, improving collaboration between maintainers and giving software authors a complete overview of how well their projects are packaged.
Repology does obviously support AUR, however there were some problems with retrieving information on AUR packages and I think this could be improved.
The way Repology currently fetches AUR package data is as follows:
- fetch https://aur.archlinux.org/packages.gz
- split packages into 100 item packs
- fetch JSON data for packages in each pack from https://aur.archlinux.org/rpc/?v=5&type=info&arg[]=<packages>
While fetching data from the API, Repology does a 1 second pause between requests to not create excess load on the server, but there are still frequent 429 errors. I've tried 2 second delays, but the 429s are still there, and fetch time increases dramatically as we have to do more than 500 requests. Probably the API is loaded by other clients as well.
Our rate limit is 4000 per 24 hours. One-second pauses aren't taken into account, and our initial motivation to add rate limiting was to ban users who were using 5-second delays...

Please read our documentation on the limits here:
https://wiki.archlinux.org/index.php/Aurweb_RPC_interface#Limitations

A single request should be able to return as many packages as needed as long as it conforms to the limitations imposed by the URI length.
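For illustration, a sketch of batching that packs as many names as fit under a URI length budget, assuming Python with the requests library; the 4443-byte figure is the one quoted from the wiki page later in this thread, and package names are assumed to need no URL encoding:

import requests

RPC_PREFIX = "https://aur.archlinux.org/rpc/?v=5&type=info"
MAX_URI_LENGTH = 4443  # limit documented on the wiki page above

def make_batches(names, max_len=MAX_URI_LENGTH):
    # Greedily pack package names into batches whose request URI stays under max_len.
    batches, batch, length = [], [], len(RPC_PREFIX)
    for name in names:
        extra = len("&arg[]=") + len(name)
        if batch and length + extra > max_len:
            batches.append(batch)
            batch, length = [], len(RPC_PREFIX)
        batch.append(name)
        length += extra
    if batch:
        batches.append(batch)
    return batches

def fetch_batch(batch):
    params = [("v", "5"), ("type", "info")] + [("arg[]", name) for name in batch]
    resp = requests.get("https://aur.archlinux.org/rpc/", params=params, timeout=60)
    resp.raise_for_status()
    return resp.json()["results"]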
I suggest implementing a regularly updated JSON dump of information on all packages and making it available on the site, like packages.gz is. The content should be similar to what https://aur.archlinux.org/rpc/?v=5&type=info would return for all packages at once.
If the RPC interface had a parameter to circumvent the arg[]=pkg1&arg[]=pkg2 search, and simply request all packages, that would already do what you want, I guess.
This will eliminate the need to access the API and generate load on it, and will simplify and speed up fetching dramatically for both Repology and possibly other clients.
Additionally, I'd like to suggest adding information on distfiles to the dump (and probably to the API as well, for consistency). For instance, Repology checks availability of all (homepage and download) links it retrieves from package repositories and reports broken ones so the packages can be fixed.
The source code running the website is here:
https://git.archlinux.org/aurweb.git/about/

We currently provide the url, but not the sources for download, since the use case for our community has not (yet?) proposed that the latter is something needed. I'm unsure who would use it other than repology.

If you would like to submit a patch to implement the API that would help you, feel free (I'm open to discussion on merging it). However, I don't know if any current aurweb contributors are interested in doing the work. I know I'm not.

--
Eli Schwartz
Bug Wrangler and Trusted User
On 11/15/18 14:26, Eli Schwartz via aur-general wrote:
On 11/15/18 1:25 PM, Dmitry Marakasov wrote:
Hi!
(SNIP)
While fetching data from the API, Repology does a 1 second pause between requests to not create excess load on the server, but there are still frequent 429 errors. I've tried 2 second delays, but the 429s are still there, and fetch time increases dramatically as we have to do more than 500 requests. Probably the API is loaded by other clients as well.
Our rate limit is 4000 per 24 hours. One-second pauses aren't taken into account, and our initial motivation to add rate limiting was to ban users who were using 5-second delays...
(SNIP)

don't forget about the URI max length, too. staggering into requests of 100 pkgs would work fine, but worth noting the max length is 4443 bytes:
https://wiki.archlinux.org/index.php/Aurweb_RPC_interface#Limitations

--
brent saner
https://square-r00t.net/
GPG info: https://square-r00t.net/gpg-info
On 11/15/18 2:50 PM, brent s. wrote:
On 11/15/18 14:26, Eli Schwartz via aur-general wrote:
On 11/15/18 1:25 PM, Dmitry Marakasov wrote:
Hi!
(SNIP)
While fetching data from the API, Repology does a 1 second pause between requests to not create excess load on the server, but there are still frequent 429 errors. I've tried 2 second delays, but the 429s are still there, and fetch time increases dramatically as we have to do more than 500 requests. Probably the API is loaded by other clients as well.
Our rate limit is 4000 per 24 hours. One-second pauses aren't taken into account, and our initial motivation to add rate limiting was to ban users who were using 5-second delays...
(SNIP)
don't forget about the URI max length, too. staggering into requests of 100 pkgs would work fine, but worth noting the max length is 4443 bytes
https://wiki.archlinux.org/index.php/Aurweb_RPC_interface#Limitations
It's a pity that I forgot to reply with the exact same link and almost the exact same caveat in the very next paragraph, isn't it? The paragraph which you quoted as "(SNIP)".

--
Eli Schwartz
Bug Wrangler and Trusted User
On 11/15/18 2:58 PM, Eli Schwartz via aur-general wrote:
It's a pity that I forgot to reply with the exact same link and almost the exact same caveat in the very next paragraph, isn't it?
The paragraph which you quoted as "(SNIP)".
it most likely would have been more noticeable if you trimmed the quoted content down to the relevant bits instead of including it whole.

--
brent saner
https://square-r00t.net/
GPG info: https://square-r00t.net/gpg-info
* brent s. (bts@square-r00t.net) wrote:
(SNIP)
While fetching data from the API, Repology does a 1 second pause between requests to not create excess load on the server, but there are still frequent 429 errors. I've tried 2 second delays, but the 429s are still there, and fetch time increases dramatically as we have to do more than 500 requests. Probably the API is loaded by other clients as well.
Our rate limit is 4000 per 24 hours. One-second pauses aren't taken into account, and our initial motivation to add rate limiting was to ban users who were using 5-second delays...
(SNIP)
don't forget about the URI max length, too. staggering into requests of 100 pkgs would work fine, but worth noting the max length is 4443 bytes
https://wiki.archlinux.org/index.php/Aurweb_RPC_interface#Limitations
Actually, it does work fine with URL lengths up to 8k.

--
Dmitry Marakasov  .  55B5 0596 FF1E 8D84 5F56 9510 D35A 80DD F9D2 F77D
amdmi3@amdmi3.ru  ..:  https://github.com/AMDmi3
Hi Eli,

On 15.11.18 at 20:26, Eli Schwartz via aur-general wrote:
The source code running the website is here: https://git.archlinux.org/aurweb.git/about/
We currently provide the url, but not the sources for download, since the use case for our community has not (yet?) proposed that the latter is something needed. I'm unsure who would use it other than repology.
I don't understand what "url" and "sources" refer to. Obviously its not the Sourcecode of aurweb, because that's available in the linked git repo, isn't it? If both refer to something inside the quote, then the reference is very far from its destination ... Uwe
On 2018/11/16 7:31, Uwe Koloska wrote:
Hi Eli,
On 15.11.18 at 20:26, Eli Schwartz via aur-general wrote:

The source code running the website is here: https://git.archlinux.org/aurweb.git/about/
We currently provide the url, but not the sources for download, since the use case for our community has not (yet?) proposed that the latter is something needed. I'm unsure who would use it other than repology.

I don't understand what "url" and "sources" refer to. Obviously it's not the source code of aurweb, because that's available in the linked git repo, isn't it?

If both refer to something inside the quote, then the reference is very far from its destination ...

Uwe
Hi Uwe,

First of all, thank you for Repology, an interesting and useful project.

I think "url" and "sources" refer to the two variables in a PKGBUILD. "url" is the url of the upstream homepage, and "sources" are the urls to download the source code, or in the case of VCS packages, the urls to fetch the VCS repositories. I think "sources" is the closest thing to the "distfiles" you asked about in your first message. Please see the manpage of PKGBUILD for details [1]. These are defined in the PKGBUILD and generated into .SRCINFO for AUR packages. And currently we only have the "URL" field exposed in the AUR RPC API.

[1]: https://www.archlinux.org/pacman/PKGBUILD.5.html#_options_and_directives

farseerfc
* Eli Schwartz via aur-general (aur-general@archlinux.org) wrote:
I'm the maintainer of Repology.org, a service which monitors, aggregates and compares package versions across 200+ package repositories, with the purpose of simplifying package maintainers' work by discovering new versions faster, improving collaboration between maintainers and giving software authors a complete overview of how well their projects are packaged.
Repology does obviously support AUR, however there were some problems with retrieving information on AUR packages and I think this could be improved.
The way Repology currently fetches AUR package data is as follows:
- fetch https://aur.archlinux.org/packages.gz
- split packages into 100 item packs
- fetch JSON data for packages in each pack from https://aur.archlinux.org/rpc/?v=5&type=info&arg[]=<packages>
While fetching data from the API, Repology does a 1 second pause between requests to not create excess load on the server, but there are still frequent 429 errors. I've tried 2 second delays, but the 429s are still there, and fetch time increases dramatically as we have to do more than 500 requests. Probably the API is loaded by other clients as well.
Our rate limit is 4000 per 24 hours. One-second pauses aren't taken into account, and our initial motivation to add rate limiting was to ban users who were using 5-second delays...
Please read our documentation on the limits here: https://wiki.archlinux.org/index.php/Aurweb_RPC_interface#Limitations
Got it, thanks for the clarification.
A single request should be able to return as many packages as needed as long as it conforms to the limitations imposed by the URI length.
There's also the 5000 max_rpc_results limit. But requesting more packages in each request should fix my problem for now. Later I'll implement finer update frequency control too, so that e.g. AUR is updated no more frequently than every 3 hours or so.
I suggest implementing a regularly updated JSON dump of information on all packages and making it available on the site, like packages.gz is. The content should be similar to what https://aur.archlinux.org/rpc/?v=5&type=info would return for all packages at once.
If the RPC interface had a parameter to circumvent the arg[]=pkg1&arg[]=pkg2 search, and simply request all packages, that would already do what you want, I guess.
That's a strange thing to suggest. Obviously there was a reason for the API rate limiting, probably excess CPU load or traffic usage. Allowing all packages to be fetched from the API will make generating that kind of load even easier, without hitting the rate limit. It will also require more memory, as the server has to accumulate all the data before sending it to the client.
This will eliminate the need to access the API and generate load on it, and will simplify and speed up fetching dramatically for both Repology and possibly other clients.
Additionally, I'd like to suggest adding information on distfiles to the dump (and probably to the API as well, for consistency). For instance, Repology checks availability of all (homepage and download) links it retrieves from package repositories and reports broken ones so the packages can be fixed.
The source code running the website is here: https://git.archlinux.org/aurweb.git/about/
We currently provide the url, but not the sources for download, since the use case for our community has not (yet?) proposed that the latter is something needed. I'm unsure who would use it other than repology.
If you would like to submit a patch to implement the API that would help you, feel free (I'm open to discussion on merging it). However, I don't know if any current aurweb contributors are interested in doing the work. I know I'm not.
How about this?
https://github.com/AMDmi3/aurweb/compare/expose-package-sources

Not tested though, as I'd have to install an Arch VM for proper testing and that can take time.

--
Dmitry Marakasov  .  55B5 0596 FF1E 8D84 5F56 9510 D35A 80DD F9D2 F77D
amdmi3@amdmi3.ru  ..:  https://github.com/AMDmi3
participants (7)
- brent s.
- Dmitry Marakasov
- Eli Schwartz
- Florian Pritz
- Jiachen YANG
- Thore Bödecker
- Uwe Koloska