[aur-general] AUR package metadata dump
amdmi3 at amdmi3.ru
Fri Nov 16 14:19:40 UTC 2018
* Eli Schwartz via aur-general (aur-general at archlinux.org) wrote:
> > I'm the maintainer of Repology.org, a service which monitors, aggregates
> > and compares package versions across 200+ package repositories with
> > the purpose of simplifying package maintainers' work by discovering
> > new versions faster, improving collaboration between maintainers
> > and giving software authors a complete overview of how well their
> > projects are packaged.
> > Repology obviously supports AUR; however, there have been some problems
> > with retrieving information on AUR packages, and I think this could
> > be improved.
> > The way Repology currently fetches AUR package data is as follows:
> > - fetch https://aur.archlinux.org/packages.gz
> > - split packages into 100 item packs
> > - fetch JSON data for packages in each pack from https://aur.archlinux.org/rpc/?v=5&type=info&arg=<packages>
> > While fetching data from the API, Repology pauses for 1 second between
> > requests to avoid creating excess load on the server, but there are still
> > frequent 429 errors. I've tried 2-second delays, but the 429s are still
> > there, and fetch time increases dramatically as we have to do more than
> > 500 requests. Probably the API is loaded by other clients as well.
> Our rate limit is 4000 per 24 hours. One-second pauses aren't taken into
> account, and our initial motivation to add rate limiting was to ban
> users who were using 5-second delays...
> Please read our documentation on the limits here:
Got it, thanks for clarification.
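For reference, the fetch procedure described above boils down to something
like the following sketch (the `arg[]` parameter encoding is my reading of
the v=5 info endpoint, so treat that detail as an assumption):

```python
import gzip
import json
import time
import urllib.parse
import urllib.request

AUR = "https://aur.archlinux.org"

def fetch_package_names():
    """Download and decompress the newline-separated package list."""
    with urllib.request.urlopen(f"{AUR}/packages.gz") as resp:
        return gzip.decompress(resp.read()).decode().split()

def chunked(items, size):
    """Split a list into packs of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def fetch_info(names):
    """Query the RPC info endpoint for one pack of packages."""
    query = urllib.parse.urlencode([("arg[]", n) for n in names])
    url = f"{AUR}/rpc/?v=5&type=info&{query}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["results"]

def fetch_all(pack_size=100, delay=1.0):
    """Fetch metadata for all packages, pausing between requests."""
    results = []
    for pack in chunked(fetch_package_names(), pack_size):
        results.extend(fetch_info(pack))
        time.sleep(delay)  # pause to reduce server load
    return results
```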
> A single request should be able to return as many packages as needed as
> long as it conforms to the limitations imposed by the URI length.
There's also a 5000 max_rpc_results limit.
But requesting more packages in each request should fix my problem
for now. Later I'll implement finer update-frequency control too, so
that e.g. AUR is updated no more often than every 3 hours or so.
> > I suggest implementing a regularly updated JSON dump of information
> > on all packages and making it available on the site, like packages.gz is.
> > The content should be similar to what https://aur.archlinux.org/rpc/?v=5&type=info
> > would return for all packages at once.
> If the RPC interface had a parameter to circumvent the
> arg=pkg1&arg=pkg2 search, and simply request all packages, that
> would already do what you want, I guess.
That's a strange thing to suggest.
Obviously there was a reason for the API rate limiting, probably excess
CPU load or traffic usage, and allowing clients to fetch all packages
from the API would make creating that kind of load even easier, without
ever hitting the rate limit. It would also require more memory, as the
server accumulates all the data before sending it to the client.
> > This will eliminate the need to access the API and generate load
> > on it, simplify and speed up fetching dramatically for both Repology
> > and possible other clients.
> > Additionally, I'd like to suggest adding information on distfiles to the
> > dump (and probably to the API as well, for consistency). For instance,
> > Repology checks the availability of all (homepage and download) links
> > it retrieves from package repositories and reports broken ones so
> > the packages can be fixed.
> The source code running the website is here:
> We currently provide the url, but not the sources for download, since
> the use case for our community has not (yet?) proposed that the latter
> is something needed. I'm unsure who would use it other than repology.
> If you would like to submit a patch to implement the API that would help
> you, feel free (I'm open to discussion on merging it). However, I don't
> know if any current aurweb contributors are interested in doing the
> work. I know I'm not.
How about this?
Not tested, though, as I'd have to install an Arch VM for proper testing,
and that can take time.
Dmitry Marakasov . 55B5 0596 FF1E 8D84 5F56 9510 D35A 80DD F9D2 F77D
amdmi3 at amdmi3.ru ..: https://github.com/AMDmi3