[aur-general] AUR package metadata dump

Eli Schwartz eschwartz at archlinux.org
Thu Nov 15 19:26:06 UTC 2018


On 11/15/18 1:25 PM, Dmitry Marakasov wrote:
> Hi!
> 
> I'm the maintainer of Repology.org, a service which monitors, aggregates
> and compares package versions across 200+ package repositories with
> the purpose of simplifying package maintainers' work by discovering
> new versions faster, improving collaboration between maintainers
> and giving software authors a complete overview of how well their
> projects are packaged.
> 
> Repology does obviously support AUR, however there were some problems
> with retrieving information on AUR packages and I think this could
> be improved.
> 
> The way Repology currently fetches AUR package data is as follows:
> - fetch https://aur.archlinux.org/packages.gz
> - split packages into 100 item packs
> - fetch JSON data for packages in each pack from https://aur.archlinux.org/rpc/?v=5&type=info&arg[]=<packages>
> 
> While fetching data from the API, Repology pauses for 1 second between
> requests to avoid creating excess load on the server, but there are still
> frequent 429 errors. I've tried 2 second delays, but the 429s are still
> there, and fetch time increases dramatically as we have to do more than
> 500 requests. The API is probably loaded by other clients as well.

Our rate limit is 4000 per 24 hours. One-second pauses aren't taken into
account, and our initial motivation to add rate limiting was to ban
users who were using 5-second delays...
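To make the numbers concrete, here is a rough sketch of the arithmetic in
Python. The 4000-per-24-hours figure and the "more than 500 requests" per
fetch come from this thread; the function names are made up for
illustration and are not part of aurweb.

```python
# Back-of-the-envelope budget for a limit of 4000 requests per 24 hours.
DAILY_LIMIT = 4000               # figure quoted in this message
SECONDS_PER_DAY = 24 * 60 * 60   # 86400


def min_average_interval(limit=DAILY_LIMIT):
    """Average spacing in seconds that keeps a steady client under the limit."""
    return SECONDS_PER_DAY / limit


def seconds_until_exhausted(delay, limit=DAILY_LIMIT):
    """How long a client with a fixed delay between requests lasts before
    it has used up the whole daily budget."""
    return limit * delay


def full_passes_per_day(requests_per_pass, limit=DAILY_LIMIT):
    """How many complete fetches fit in one day's budget."""
    return limit // requests_per_pass


if __name__ == "__main__":
    print(min_average_interval())            # seconds between requests
    print(seconds_until_exhausted(1) / 60)   # minutes, with a 1-second delay
    print(full_passes_per_day(500))          # passes at 500 requests each
```

In other words, a client that polls continuously would need to average
roughly one request every 21.6 seconds, and a fetch that needs 500
requests can run about 8 times a day regardless of the delay between
individual requests.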

Please read our documentation on the limits here:
https://wiki.archlinux.org/index.php/Aurweb_RPC_interface#Limitations

A single request should be able to return as many packages as needed as
long as it conforms to the limitations imposed by the URI length.
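As a sketch of what that could look like on the client side, the
following Python packs names into requests by URI length instead of a
fixed count of 100. The MAX_URI_LEN value is an assumption that should be
checked against the wiki page above, and batch_by_uri_length and
build_url are hypothetical helper names, not anything aurweb provides.

```python
from urllib.parse import quote

BASE = "https://aur.archlinux.org/rpc/?v=5&type=info"
MAX_URI_LEN = 4443  # assumption: verify against the documented limit


def batch_by_uri_length(names, base=BASE, max_len=MAX_URI_LEN):
    """Group package names so each resulting URL stays within max_len.

    A single name that is too long on its own still gets its own batch;
    the caller has to decide what to do with it.
    """
    batches, current, length = [], [], len(base)
    for name in names:
        param = "&arg[]=" + quote(name, safe="")
        # Start a new batch when adding this name would exceed the limit.
        if current and length + len(param) > max_len:
            batches.append(current)
            current, length = [], len(base)
        current.append(name)
        length += len(param)
    if current:
        batches.append(current)
    return batches


def build_url(batch, base=BASE):
    """Build the info request URL for one batch of package names."""
    return base + "".join("&arg[]=" + quote(n, safe="") for n in batch)
```

Fewer, larger requests mean fewer hits counted against the daily limit
for the same list of packages.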

> I suggest implementing a regularly updated JSON dump of information
> on all packages and making it available on the site, like packages.gz is.
> The content should be similar to what https://aur.archlinux.org/rpc/?v=5&type=info
> would return for all packages at once.

If the RPC interface had a parameter to circumvent the
arg[]=pkg1&arg[]=pkg2 search, and simply request all packages, that
would already do what you want, I guess.

> This will eliminate the need to access the API and generate load
> on it, and will simplify and speed up fetching dramatically for both
> Repology and possibly other clients.
> 
> Additionally, I'd like to suggest adding information on distfiles to the
> dump (and probably to the API as well, for consistency). For instance,
> Repology checks availability for all (homepage and download) links
> it retrieves from package repositories and reports broken ones so
> the packages can be fixed.

The source code running the website is here:
https://git.archlinux.org/aurweb.git/about/

We currently provide the url, but not the sources for download, since
no use case from our community has (yet?) suggested that the latter
is needed. I'm unsure who would use it other than repology.

If you would like to submit a patch to implement the API that would help
you, feel free (I'm open to discussion on merging it). However, I don't
know if any current aurweb contributors are interested in doing the
work. I know I'm not.

-- 
Eli Schwartz
Bug Wrangler and Trusted User
