list of package names, versions[, descriptions]
hi. i'd like to keep a copy of the list of AUR packages, version numbers, and descriptions on my machine. the list can be somewhat out of date (say, as of the last time i did "pacman -Syu", which, for me, is every week or two).

my question is how to do it with minimal overhead? i know packages.gz. that will give me the list of packages. however, without the version numbers, i won't be able to tell whether my cached information for a given package is up to date or not.

would it be possible to (put on the list to) provide at some point a "packages-versions.gz"? or, even, "packages-versions-descriptions.gz"? (though the former is probably of more general use.)

below is my motivation for wanting this. apologies if i've missed some already existing way of doing this.

cheers, Greg

----------------------------------------------------------------------

motivation:

for the last several years i've been using a script i "wrote" (inspired by something similar from fink) that, from local information, lists available packages, install status, version numbers, and descriptions [1]. (i'm often in a disconnected, or badly connected, world, so i try to avoid relying on the web.)

having converted recently to arch, i pulled my script up to pacman [2], and would like to do an aur version, as well.

for "packages-versions.gz", i'd download that, then use the package names and versions to make queries to bring a local database up to date with what AUR has. i'd damp this process to once a week or so.

(i've spent some time in the last week, offline, playing with figuring out how efficiently -- in terms of both number of requests and amount of [duplicated] data transmitted -- i can download all the information via repeated RPC searches on frequently-appearing search terms -- using packages.gz as a way of knowing all the entries. but, as fun as that is, it's a hack.)

[1] https://github.com/greg-minshall/apt-list
[2] https://gitlab.com/minshall/pac-list
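For concreteness: packages.gz itself is easy to consume programmatically. A minimal Python sketch, assuming the file's one-name-per-line format with a leading "#" comment header, as it looked at the time:

    import gzip
    import urllib.request

    URL = "https://aur.archlinux.org/packages.gz"

    # download and decompress the name list
    with urllib.request.urlopen(URL) as resp:
        raw = gzip.decompress(resp.read())

    # one package name per line; "#" lines are comments
    names = {line.strip() for line in raw.decode().splitlines()
             if line.strip() and not line.startswith("#")}
    print(len(names), "package names")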
What about something like [1]?

Mark

[1] https://wiki.archlinux.org/index.php/Aurweb_RPC_interface
With the AUR RPC interface you need to specify packages by exact match, so you are in the same scenario as Greg.

It would be very useful to have an official query with which the server can return the packages inside the AUR, core, community, and extra (taking into consideration the possible DDoS effects if there is no cache server for it).

On Tue, Oct 1, 2019 at 14:34, Mark Weiman (mark.weiman@markzz.com) wrote:
What about something like [1]?
Mark
[1] https://wiki.archlinux.org/index.php/Aurweb_RPC_interface
--
Kind regards,
Joaquín Manuel Crespo
On Tue, 2019-10-01 at 14:49 -0300, Joaquin Manuel Crespo wrote:
With the AUR RPC interface you need to specify packages by exact match, so you are in the same scenario as Greg.

It would be very useful to have an official query with which the server can return the packages inside the AUR, core, community, and extra (taking into consideration the possible DDoS effects if there is no cache server for it).
The RPC gives you information on the packages you want from the AUR, and you should have copies of the official databases that you can run a utility like expac against.

Mark
Mark,

thanks. if i understand correctly, if i do an RPC "search", i will get back pretty much the entire "database" entry for a given package. and, if i do an "info", i will likely get that plus more information for the given package.

i think this means that, given a list of packages (from packages.gz), in order to find out what has changed i would need to issue a search for each of the packages. i.e., i'd have to download the entire database, at a cost of some 55,000 RPC calls -- which i worry would be a heavier load on the server than downloading a file with the same information.

if there were something like a packages-vers.gz file, then one could keep track of which package entries one currently had and which were changed, and only do RPC calls for the changed ones. (assuming, by the way, that when a package's database entry changes, its *version* also changes.)

or, maybe i'm missing something?

cheers, Greg
You need information on *every* package? Not just the ones you use?

Technically in your script, you can query pacman for the foreign packages, then compare to an RPC response (and info can take multiple packages, just check the wiki), then take action on those specific packages. I don't see the need to have information on every single PKGBUILD on the AUR. I'm pretty sure this is how most AUR helpers work these days when checking for updates.

Mark
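A minimal sketch of the update check Mark describes: list foreign (non-repo) packages with "pacman -Qm", fetch their AUR records in one batched v5 "info" request, and let pacman's vercmp(8) decide which are outdated. Error handling and request-size limits are omitted; this is a sketch, not a full AUR helper.

    import json
    import subprocess
    import urllib.parse
    import urllib.request

    RPC = "https://aur.archlinux.org/rpc/?v=5&type=info"

    # name -> installed version, for packages not found in the sync repos
    qm = subprocess.run(["pacman", "-Qm"], capture_output=True,
                        text=True, check=True)
    local = dict(line.split(maxsplit=1) for line in qm.stdout.splitlines())

    # one batched request; "arg[]" repeats once per package name
    query = "".join("&arg[]=" + urllib.parse.quote(n) for n in local)
    with urllib.request.urlopen(RPC + query) as resp:
        results = json.load(resp)["results"]

    for pkg in results:
        name, remote = pkg["Name"], pkg["Version"]
        # vercmp prints a number < 0 if the first version is older
        cmp = subprocess.run(["vercmp", local[name], remote],
                             capture_output=True, text=True).stdout.strip()
        if int(cmp) < 0:
            print(f"{name}: {local[name]} -> {remote}")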
Mark, thanks again.
You need information on *every* package? Not just the ones you use? Technically in your script, you can query pacman for the foreign packages, then compare to an RPC response (and info can take multiple packages, just check the wiki), then take action on those specific packages. I don't see the need to have information on every single PKGBUILD on the AUR.
yes, my *desire* is to have info on all packages, so i can search locally (especially the descriptions) when wondering what might be available. i don't know if that seems unreasonable, but it is what i was hoping to achieve.

and, thanks for reminding me -- i did once know that "info" requests allow multiple packages. there is some limit, though i don't remember what. still, the number of separate RPCs would likely be in the thousands.

cheers, Greg
The limitations are described on the wiki page you were referred to:
https://wiki.archlinux.org/index.php/Aurweb_RPC_interface#Limitations

(Also, it is "multiinfo", not "info" requests, that you use for multiple packages.)

--
Eli Schwartz
Bug Wrangler and Trusted User
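For the batching itself, here is a sketch of chunking "info" queries to stay under the URI-length limit that page describes; MAX_URI is a conservative placeholder, not the documented figure:

    import urllib.parse

    RPC_BASE = "https://aur.archlinux.org/rpc/?v=5&type=info"
    MAX_URI = 4000  # placeholder; see the wiki's Limitations section

    def chunked_queries(names):
        """Yield RPC URLs, each staying under MAX_URI bytes."""
        url = RPC_BASE
        for name in names:
            arg = "&arg[]=" + urllib.parse.quote(name)
            if len(url) + len(arg) > MAX_URI and url != RPC_BASE:
                yield url
                url = RPC_BASE
            url += arg
        if url != RPC_BASE:
            yield url

Feeding the ~55,000 names from packages.gz through this yields on the order of hundreds to a few thousand requests, rather than one request per package.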
On Wed, Oct 2, 2019 at 12:22 PM Eli Schwartz <eschwartz@archlinux.org> wrote:
(Also, it is "multiinfo", not "info" requests, that you use for multiple packages.) I thought they're no longer different since some version, and have just checked it: https://git.archlinux.org/aurweb.git/tree/web/lib/aurjson.class.php#n109
Eli,
The limitations are described on the wiki page you were referred to: https://wiki.archlinux.org/index.php/Aurweb_RPC_interface#Limitations
thanks. i had missed that.

Greg
On 9/30/19 9:44 PM, Greg Minshall wrote:
hi. i'd like to keep a copy of the list of AUR packages, version numbers, and descriptions on my machine. the list can be somewhat out of date (say, as of the last time i did "pacman -Syu", which, for me, is every week or two).
my question is how to do it with minimal overhead? i know packages.gz. that will give me the list of packages. however, without the version numbers, i won't be able to tell whether my cached information for a given package is up to date or not.
packages.gz only gives you names; the list of names is up to date if the snapshot date says it is up to date, and version numbers don't enter into it. If you have additional cached information for a given package, then that presumably includes the version. So I'm not sure I understand the comparison to packages.gz. At any rate...
would it be possible to (put on the list to) provide at some point a "packages-versions.gz"? or, even, "packages-versions-descriptions.gz"? (though the former is probably of more general use.)
Neither one is very useful for general use; "general use" would mean providing the information people want for arbitrary offline queries, so what you're actually asking for is a regular database dump of all packages and their multiinfo descriptions.

It may be that for your specific use, you only need name, version and description... other people have had similar offline requests where they wanted everything from dependency information to the name of the maintainer (off the top of my head, https://repology.org is currently importing the name, version, maintainer, url, description, and license -- all via repeated RPC calls for the multiinfo data).
below is my motivation for wanting this. apologies if i've missed some already existing way of doing this.
cheers, Greg ---------------------------------------------------------------------- motivation:
for the last several years i've been using a script i "wrote" (inspired by something similar from fink) that, from local information, lists available packages, install status, version numbers, and descriptions [1]. (i'm often in a disconnected, or badly connected, world, so i try to avoid relying on the web.)
But in order to actually use the information from the AUR, you would need to be online, because you need to be online in order to retrieve the PKGBUILD and build the desired package.
having converted recently to arch, i pulled my script up to pacman [2], and would like to do an aur version, as well.
For local packages I would just use expac, a printf-style formatter for libalpm database information, which will be both faster and more accurate than running pacman -Si/-Qi in a tight loop in a python script followed by using regular expressions on the output.
for "packages-versions.gz", i'd download that, then use the package names and versions to make queries to bring a local database up to date with what AUR has. i'd damp this process to once a week or so.
(i've spent some time in the last week, offline, playing with figuring out how efficiently -- in terms of both number of requests and amount of [duplicated] data transmitted -- i can download all the information via repeated RPC searches on frequently-appearing search terms -- using packages.gz as a way of knowing all the entries. but, as fun as that is, it's a hack.)
For AUR packages, I'm still not entirely sure why you need all this information for the 99% of packages that you won't ever use, and for the ones you do use you likely want the PKGBUILD too.

In a general sort of way, what you want reminds me of the old AUR-mirror project, which was a giant git repo of every PKGBUILD in the AUR (this was before the migration to git, so everything was tarballs, and this was meant to pull them all together and provide history for the daily snapshots). Today, I guess you could clone 55603 package bases from git, thus ensuring you had fully up to date offline information for anything you cared to do. That would sort of slam our server, though.

--
Eli Schwartz
Bug Wrangler and Trusted User
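For the official repos, Eli's expac suggestion might look like the following sketch (it assumes expac is installed; %n, %v and %d are expac's name, version and description format specifiers):

    import subprocess

    # -S queries the sync databases; one call covers every repo package
    out = subprocess.run(["expac", "-S", "%n\t%v\t%d"],
                         capture_output=True, text=True, check=True)
    for line in out.stdout.splitlines():
        name, version, desc = line.split("\t", 2)
        # ... feed into whatever local index you keep ...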
Eli,
That would sort of slam our server, though.
yes, my whole goal is to not slam the servers.

i'm not sure i can explain why i find having the complete list, with descriptions, local on my machine useful. but, i do. "search locally, build globally" somehow works well for me. (one rationalization might be that searching is inherently more interactive than building, so random network latencies, etc., during building are less annoying than during searching.)

anyway, grant me the desire to maintain, offline, a complete list of AUR packages, version numbers, descriptions.

let's say that i've managed, over a period of a week or so, to download the entire database (or, at least, the "rows" in which i am interested: package name, version, description) into my own local database.

then, a week later, i'd like to *update* my local database with what's changed in the AUR repository. how would i proceed? as things currently stand, iiuc (always a dubious proposition), i'd need to again download the entire database.

on the other hand, if there were a packages-vers.gz (*), i could download that, then compare the package names and versions in it with those in my local database, and schedule to download the database entries for those packages whose version numbers had changed (as well as those packages in packages-vers.gz that are new; and at the same time delete those packages in my local database that are no longer in packages-vers.gz); one can visualize this code.

my presumption is that this would be much lighter on server resources than downloading the entire database each week. and, maybe (you'll know the "churn" in the repository) would even be very light.

and, i think this could be useful for general use. i may only care about descriptions, but if someone cares about dependencies, maintainers, etc., they would still use the version-number mechanism (again, see (*) below) to determine which packages have changed, and only download the information from those changed packages.

thanks again. cheers, Greg

ps -- thanks for the pointer to expac. i'll look at converting to that. no one ever accused me of writing overly-efficient code... :)

(*) NB: note that, for "true consistency", using "version" depends on the assumption, likely to be at least occasionally, maybe often, invalid, that if the *metadata* for a package in the database changes then the *version* of the package itself also changes.

if "last modified" time in the database is updated when any of the metadata changes, that would be better to use than package version number.

if "last modified" time isn't updated when (some) metadata is updated, one could also run an md5sum(1) over (a textual representation of) each package's database entry, and provide packages-md5sums.gz, say. i'll note that a simple test shows that adding an md5sum to each line inflates the size of the file considerably:

    % ls -skh packages*.gz
    1.5M packages-md5sums.gz
    344K packages.gz

the inflation for version numbers and/or "last modified time" (as seconds since the epoch) would probably be less, maybe double the size of packages.gz?
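The code Greg asks the reader to visualize could be sketched like this, assuming a *hypothetical* packages-vers.gz with one "name version" pair per line (the local cache is a plain dict here; any real store would do):

    import gzip

    def load_vers(path):
        """Parse a (hypothetical) packages-vers.gz into {name: version}."""
        with gzip.open(path, "rt") as f:
            return dict(line.split() for line in f if line.strip())

    local = load_vers("cache/packages-vers.gz")   # last week's snapshot
    remote = load_vers("fresh/packages-vers.gz")  # just downloaded

    new = remote.keys() - local.keys()        # fetch these via RPC
    deleted = local.keys() - remote.keys()    # purge from local db
    changed = {n for n in local.keys() & remote.keys()
               if local[n] != remote[n]}      # re-fetch these via RPC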
On 10/2/19 11:32 AM, Greg Minshall wrote:
i'm not sure i can explain why i find having the complete list, with descriptions, local on my machine useful. but, i do. "search locally, build globally" somehow works well for me. (one rationalization might be that searching is inherently more interactive than building, so random network latencies, etc., during building are less annoying than during searching.)
anyway, grant me the desire to maintain, offline, a complete list of AUR packages, version numbers, descriptions.
Could be, I dunno. All I know is what I would consider personally useful -- your use cases remain your own, regardless of my opinions or expressed doubts. :)
let's say that i've managed, over a period of a week or so, to download the entire database (or, at least, the "rows" in which i am interested: package name, version, description) into my own local database.
then, a week later, i'd like to *update* my local database with what's changed in the AUR repository. how would i proceed? as things currently stand, iiuc (always a dubious proposition), i'd need to again download the entire database.
on the other hand, if there were a packages-vers.gz (*), i could download that, then compare the package names and versions in it with those in my local database, and schedule to download the database entries for those packages whose version numbers had changed (as well as those packages in packages-vers.gz that are new; and at the same time delete those packages in my local database that are no longer in packages-vers.gz); one can visualize this code.
my presumption is that this would be much lighter on server resources than downloading the entire database each week. and, maybe (you'll know the "churn" in the repository) would even be very light.
and, i think this could be useful for general use. i may only care about descriptions, but if someone cares about dependencies, maintainers, etc., they would still use the version-number mechanism (again, see (*) below) to determine which packages have changed, and only download the information from those changed packages.
Well, I guess I could hear the argument you make for providing a way to invalidate offline assumptions about a package. Even if providing a dump of names-versions is not strictly useful itself.
ps -- thanks for the pointer to expac. i'll look at converting to that. no one ever accused me of writing overly-efficient code... :)
(*) NB:
note that, for "true consistency", using "version" depends on the assumption, likely to be at least occasionally, maybe often, invalid, that if the *metadata* for a package in the database changes then the *version* of the package itself also changes.
This is "supposed to be true", as in, it's generally considered pretty bad if people update a PKGBUILD so that it creates a different package but don't update the pkgrel for metadata or package content changes. It isn't guaranteed, sure, but I guess there are worse things than simply failing to detect a cache invalidation for that package. On the other hand...
if "last modified" time in the database is updated when any of the metadata changes, that would be better to use than package version number.
if "last modified" time isn't updated when (some) metadata is updated, one could also run an md5sum(1) over (a textual representation of) each package's database entry, and provide packages-md5sums.gz, say. i'll note that a simple test shows that adding an md5sum to each line inflates the size of the file considerably : % ls -skh packages*.gz : 1.5M packages-md5sums.gz 344K packages.gz
the inflation for version numbers and/or "last modified time" (as seconds since the epoch) would probably be less, maybe double the size of packages.gz?
The package details key "last modified" is indeed updated to the time of the latest push to the package's git repository, see https://git.archlinux.org/aurweb.git/tree/aurweb/git/update.py#n92

So this would be a valid method for guaranteeing cache invalidation.

--
Eli Schwartz
Bug Wrangler and Trusted User
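Given that, a cache-invalidation sketch can key off the LastModified timestamp that v5 "info" results already carry, rather than the version:

    import json
    import urllib.parse
    import urllib.request

    def fetch_info(names):
        """One v5 info request; each result carries LastModified."""
        url = ("https://aur.archlinux.org/rpc/?v=5&type=info"
               + "".join("&arg[]=" + urllib.parse.quote(n) for n in names))
        with urllib.request.urlopen(url) as resp:
            return {p["Name"]: p for p in json.load(resp)["results"]}

    # cache entries are info dicts stored from a previous run
    def stale(cache, fresh):
        return [n for n, p in fresh.items()
                if n not in cache
                or p["LastModified"] > cache[n]["LastModified"]]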
Eli, thanks.
The package details key "last modified" is indeed updated to the time of the latest push to the package's git repository, see https://git.archlinux.org/aurweb.git/tree/aurweb/git/update.py#n92
So this would be a valid method for guaranteeing cache invalidation.
ah, great. so, my amended request would be for a packages-time.gz file (or, something of that ilk). if possible, that would be great.

cheers, Greg

ps -- some sizes:

    280K  packages-sorted.gz   packages.gz gunzip'd, sort'd, gzip'd
    308K  packages-date.gz     the 8 "date" digits are exactly the same on each line
    344K  packages.gz          a recent packages.gz
    636K  packages-8md5sum.gz  the 8 "date" digits are *completely* random (from md5sum)
    1.5M  packages-md5sums.gz  the 36 "md5sum" digits are completely random
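Greg's size experiment can be approximated along these lines (the appended fields are illustrative stand-ins, not a proposed file format):

    import gzip
    import hashlib

    with gzip.open("packages.gz", "rt") as f:
        names = [l.strip() for l in f if l.strip() and not l.startswith("#")]

    def gz_size(lines):
        return len(gzip.compress("\n".join(lines).encode()))

    # md5 of the name stands in for a random per-package checksum
    print("names only :", gz_size(names))
    print("fixed date :", gz_size([f"{n} 20191006" for n in names]))
    print("md5 suffix :", gz_size([f"{n} {hashlib.md5(n.encode()).hexdigest()}"
                                   for n in names]))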
On Mon, 30 Sep 2019 at 21:44:32, Greg Minshall wrote:
would it be possible to (put on the list to) provide at some point a "packages-versions.gz"? or, even, "packages-versions-descriptions.gz"? (though the former is probably of more general use.)
It's possible and I am fine with doing that. However, we won't regenerate those files too often and you'll have to accept that they will be slightly out of date. If a lot of people start downloading these files, we might need to consider using CDN/mirrors.
Lukas,
It's possible and I am fine with doing that. However, we won't regenerate those files too often and you'll have to accept that they will be slightly out of date.
yes, that's perfectly fine. great, actually.
If a lot of people start downloading these files, we might need to consider using CDN/mirrors.
curiosity: packages.gz seems to come just from aur.archlinux.org? (is that a single machine, or is there a load-balancer, or some such, in front of it?)

cheers, Greg
On 10/6/19 1:45 PM, Greg Minshall wrote:
curiosity: packages.gz seems to come just from aur.archlinux.org? (is that a single machine, or is there a load-balancer, or some such, in front of it?)
The gz files are simply on-disk files. A cron job regenerates them, and IIRC the job fires every 5 minutes. The AUR website simply serves the contents as-is.

As for load balancing, we only use one machine. (It is located at luna.archlinux.org)

--
Eli Schwartz
Bug Wrangler and Trusted User
participants (6)

- Eli Schwartz
- Greg Minshall
- Joaquin Manuel Crespo
- LI YU YU
- Lukas Fleischer
- Mark Weiman