[arch-projects] [dbscripts] [PATCH] Don't parse .db files ourselves; use pyalpm instead
From: Luke Shumaker <lukeshu@parabola.nu> In a patchset that I recently submitted, Eli was concerned that I was parsing .db files with bsdtar+awk, when the format of .db files isn't "public"; the only guarantees made about it are that libalpm can parse it. https://lists.archlinux.org/pipermail/arch-projects/2018-June/004932.html I wasn't too concerned, because `ftpdir-cleanup` and `sourceballs` already parse the .db files in the same way. Nonetheless, I think Eli is right: we shouldn't be parsing these files ourselves. So, add a `dbquery` function that uses pyalpm to parse the .db files: - It takes as arguments Python 3 expressions; 1. one that that returns a bool deciding whether we want to print information on a package, and 2. another that returns the string to print for a package. Currently, all callers use "True" for the decider expression, as ftpdir-cleanup and sourceballs operate on *every* package. However, I'm including a way to filter packages because, I'm coming at this from the context that I want to parse .db files in other places too. - libalpm doesn't offer an easy way to say "parse this DB file for me"; instead, we must construct a configuration that has a syncdb pointing to that file, which we then have it sync in to a temporary directory. As a final note, when re-writing the bit of sourceballs to use dbquery instead of AWK, I realized that it does not correctly handle licenses that have a space in them (as of 2018-07-07 there are 67 packages in the Arch repos that have license containing a space). I did not fix this bug; I merely translated it from AWK to Python, as the program would also need to be adjusted elsewhere. --- cron-jobs/ftpdir-cleanup | 2 +- cron-jobs/sourceballs | 14 ++------------ db-functions | 25 +++++++++++++++++++++++++ test/Dockerfile | 2 +- 4 files changed, 29 insertions(+), 14 deletions(-) diff --git a/cron-jobs/ftpdir-cleanup b/cron-jobs/ftpdir-cleanup index 9df5f99..77e49c8 100755 --- a/cron-jobs/ftpdir-cleanup +++ b/cron-jobs/ftpdir-cleanup @@ -44,7 +44,7 @@ for repo in "${PKGREPOS[@]}"; do fi done | sort > "${WORKDIR}/repo-${repo}-${arch}" # get a list of package files defined in the repo db - bsdtar -xOf "${FTP_BASE}/${repo}/os/${arch}/${repo}${DBEXT}" | awk '/^%FILENAME%/{getline;print}' | sort > "${WORKDIR}/db-${repo}-${arch}" + dbquery "$repo" "$arch" True pkg.filename | sort > "${WORKDIR}/db-${repo}-${arch}" missing_pkgs=($(comm -13 "${WORKDIR}/repo-${repo}-${arch}" "${WORKDIR}/db-${repo}-${arch}")) if (( ${#missing_pkgs[@]} >= 1 )); then diff --git a/cron-jobs/sourceballs b/cron-jobs/sourceballs index 6be28ab..784b48b 100755 --- a/cron-jobs/sourceballs +++ b/cron-jobs/sourceballs @@ -24,18 +24,8 @@ for repo in "${PKGREPOS[@]}"; do if [[ ! -f ${FTP_BASE}/${repo}/os/${arch}/${repo}${DBEXT} ]]; then continue fi - bsdtar -xOf "${FTP_BASE}/${repo}/os/${arch}/${repo}${DBEXT}" \ - | awk '/^%NAME%/ { getline b }; - /^%BASE%/ { getline b }; - /^%VERSION%/ { getline v }; - /^%LICENSE%/,/^$/ { - if ( !/^%LICENSE%/ ) { l=l" "$0 } - }; - /^%ARCH%/ { - getline a; - printf "%s %s %s %s\n", b, v, a, l; - l=""; - }' + dbquery "$repo" "$arch" True \ + 'f"{pkg.base or pkg.name} {pkg.version} {pkg.arch} {'\'' '\''.join(pkg.licenses)}"' done | sort -u > "${WORKDIR}/db-${repo}" done diff --git a/db-functions b/db-functions index 0491c22..f1d821a 100644 --- a/db-functions +++ b/db-functions @@ -294,6 +294,31 @@ getpkgfiles() { echo "${files[@]}" } +# usage: dbquery repo arch filter_expr output_expr +dbquery() { + local repo=$1 + local arch=$2 + local filter=$3 + local output=$4 + local dbfile="${FTP_BASE}/${repo}/os/${arch}/${repo}.db" + + python3 - "$dbfile" "$filter" "$output" <<-'EOT' + import os.path + import sys + import tempfile + import pyalpm + db_dir, db_file = os.path.split(os.path.abspath(sys.argv[1])) + with tempfile.TemporaryDirectory() as tmpdirname: + handle = pyalpm.Handle(tmpdirname, tmpdirname) + db = handle.register_syncdb(db_file[:-3], 0) + db.servers = ["file://{}".format(db_dir)] + db.update(False) + for pkg in db.search(".*"): + if eval(sys.argv[2], {}, {"pkg": pkg}): + print(eval(sys.argv[3], {}, {"pkg": pkg})) + EOT +} + check_pkgfile() { local pkgfile=$1 diff --git a/test/Dockerfile b/test/Dockerfile index 83c8449..0d01a75 100644 --- a/test/Dockerfile +++ b/test/Dockerfile @@ -1,5 +1,5 @@ FROM archlinux/base -RUN pacman -Syu --noconfirm --needed sudo fakeroot awk subversion make kcov bash-bats gettext grep +RUN pacman -Syu --noconfirm --needed sudo fakeroot awk subversion make kcov bash-bats gettext grep pyalpm RUN pacman-key --init RUN echo '%wheel ALL=(ALL) NOPASSWD: ALL' > /etc/sudoers.d/wheel RUN useradd -N -g users -G wheel -d /build -m tester -- 2.17.1
On 07/08/2018 09:14 PM, Luke Shumaker wrote:
From: Luke Shumaker <lukeshu@parabola.nu>
In a patchset that I recently submitted, Eli was concerned that I was parsing .db files with bsdtar+awk, when the format of .db files isn't "public"; the only guarantees made about it are that libalpm can parse it.
https://lists.archlinux.org/pipermail/arch-projects/2018-June/004932.html
I wasn't too concerned, because `ftpdir-cleanup` and `sourceballs` already parse the .db files in the same way. Nonetheless, I think Eli is right: we shouldn't be parsing these files ourselves.
So, add a `dbquery` function that uses pyalpm to parse the .db files:
- It takes as arguments Python 3 expressions; 1. one that that returns a bool deciding whether we want to print information on a package, and 2. another that returns the string to print for a package.
Currently, all callers use "True" for the decider expression, as ftpdir-cleanup and sourceballs operate on *every* package. However, I'm including a way to filter packages because, I'm coming at this from the context that I want to parse .db files in other places too.
- libalpm doesn't offer an easy way to say "parse this DB file for me"; instead, we must construct a configuration that has a syncdb pointing to that file, which we then have it sync in to a temporary directory.
As a final note, when re-writing the bit of sourceballs to use dbquery instead of AWK, I realized that it does not correctly handle licenses that have a space in them (as of 2018-07-07 there are 67 packages in the Arch repos that have license containing a space). I did not fix this bug; I merely translated it from AWK to Python, as the program would also need to be adjusted elsewhere. Keeping in mind the ones we're looking for are a whitelist of strictly-defined license types... I think those are all ad-hoc custom
What's wrong with expac? expac --config ${dbscripts_root}/pacman-community.conf -S '%f' expac is not only super elegant, there's pending patches to provide it in pacman 6 as part of the core project. This is what I'm waiting for, actually. I see no reason to add an external dependency on both python and pyalpm, in order to run a small python program which evals its arguments in order to inject database queries, when a tool with a simple API can do the same and will eventually be guaranteed to be everywhere pacman itself is. (Let's ignore for a moment, the defunct integrity checks service which is written in python, but not pyalpm. pyalpm is not currently installed on the dbscripts server ATM.) licenses, none of which we're interested in in the primary sourceballs deployment. -- Eli Schwartz Bug Wrangler and Trusted User
On Sun, 08 Jul 2018 22:38:06 -0400, Eli Schwartz wrote:
On 07/08/2018 09:14 PM, Luke Shumaker wrote:
From: Luke Shumaker <lukeshu@parabola.nu>
In a patchset that I recently submitted, Eli was concerned that I was parsing .db files with bsdtar+awk, when the format of .db files isn't "public"; the only guarantees made about it are that libalpm can parse it.
https://lists.archlinux.org/pipermail/arch-projects/2018-June/004932.html
I wasn't too concerned, because `ftpdir-cleanup` and `sourceballs` already parse the .db files in the same way. Nonetheless, I think Eli is right: we shouldn't be parsing these files ourselves.
So, add a `dbquery` function that uses pyalpm to parse the .db files:
What's wrong with expac?
expac --config ${dbscripts_root}/pacman-community.conf -S '%f'
expac is not only super elegant, there's pending patches to provide it in pacman 6 as part of the core project. This is what I'm waiting for, actually.
I see no reason to add an external dependency on both python and pyalpm, in order to run a small python program which evals its arguments in order to inject database queries, when a tool with a simple API can do the same and will eventually be guaranteed to be everywhere pacman itself is.
With the "True" filter that ftpdir-cleanup and sourceballs both use, you're right; this could be done with expac. But, with the context that this patch exists to enable me to address the concern you had with the other patchset: AFAICT, with expac there's no way to do a query like: dbquery core x86_64 \ "(pkg.base or pkg.name) == '$pkgbase'" \ ... Which is what most (all?) of the queries in the other patchset would become. (Drat, it seems that discussing this separately from the other patchset won't work after all.)
(Let's ignore for a moment, the defunct integrity checks service which is written in python, but not pyalpm. pyalpm is not currently installed on the dbscripts server ATM.)
Good call; check_packages.py is python2, the pyalpm dep does add a new dependency on python3.
As a final note, when re-writing the bit of sourceballs to use dbquery instead of AWK, I realized that it does not correctly handle licenses that have a space in them (as of 2018-07-07 there are 67 packages in the Arch repos that have license containing a space). I did not fix this bug; I merely translated it from AWK to Python, as the program would also need to be adjusted elsewhere. Keeping in mind the ones we're looking for are a whitelist of strictly-defined license types... I think those are all ad-hoc custom licenses, none of which we're interested in in the primary sourceballs deployment.
Indeed; if I thought it were a serious problem, I would have written a patch for it :) -- Happy hacking, ~ Luke Shumaker
On 07/09/2018 01:32 PM, Luke Shumaker wrote:
With the "True" filter that ftpdir-cleanup and sourceballs both use, you're right; this could be done with expac. But, with the context that this patch exists to enable me to address the concern you had with the other patchset:
AFAICT, with expac there's no way to do a query like:
dbquery core x86_64 \ "(pkg.base or pkg.name) == '$pkgbase'" \ ...
Which is what most (all?) of the queries in the other patchset would become.
(Drat, it seems that discussing this separately from the other patchset won't work after all.)
Well, it's not as complex as you think. No matter what tool is used, we'd need to manually reassemble the repository layout vs. pacman's system DBPath. (Whether by doing database loading by hand with low-level APIs or by providing a custom pacman.conf). But: Checking a specific arch is as simple as using setarch first. Current versions of makepkg set the pkgbase regardless of whether the pkgname is different (with the rationale that there's no reason not to, and it makes parsing package metadata easier), so we can add this to the long list of other reasons we want to do a complete rebuild of every package (cf. reproducible builds, PIE, and more). And expac supports "repo/package" -- or printing the repo and filtering that while parsing the filename or what-have-you. That being said, I've discussed on our development IRC channel, having expac's -Ss mode to support 'repo/.*' filtering which might be useful too... -- Eli Schwartz Bug Wrangler and Trusted User
On Mon, 09 Jul 2018 18:37:04 -0400, Eli Schwartz wrote:
On 07/09/2018 01:32 PM, Luke Shumaker wrote:
AFAICT, with expac there's no way to do a query like:
dbquery core x86_64 \ "(pkg.base or pkg.name) == '$pkgbase'" \ ...
Well, it's not as complex as you think. No matter what tool is used, we'd need to manually reassemble the repository layout vs. pacman's system DBPath. (Whether by doing database loading by hand with low-level APIs or by providing a custom pacman.conf). But:
Checking a specific arch is as simple as using setarch first.
The part I'm concerned about is being able to do is filtering by pkgbase rather than pkgname; setting up repository layout is simple enough either way (it just comes down to setting DBpath (and RootDir?) to a temporary directory and setting `$repo.Server=file://$FTP_BASE/$repo/os/$arch`).
Current versions of makepkg set the pkgbase regardless of whether the pkgname is different (with the rationale that there's no reason not to, and it makes parsing package metadata easier), so we can add this to the long list of other reasons we want to do a complete rebuild of every package (cf. reproducible builds, PIE, and more).
Ok, but even if/when all packages have pkgbase, expac still doesn't have the ability to search based on pkgbase instead of pkgname, does it?
And expac supports "repo/package" -- or printing the repo and filtering that while parsing the filename or what-have-you. That being said, I've discussed on our development IRC channel, having expac's -Ss mode to support 'repo/.*' filtering which might be useful too...
I guess we could have it print all packages for the repo, include pkgbase in the output, then filter that with grep... -- Happy hacking, ~ Luke Shumaker
participants (2)
-
Eli Schwartz
-
Luke Shumaker