[repod][pacman/makepkg] artix feedback
Hi arch team,

after receiving your email on repod, here is an idea that I have been asking myself about since the python db scripts arrived at the arch gitlab instance:

"Why doesn't arch consider writing yaml files with makepkg instead of the various formats in pkg.tar, db.tar, files.tar and links.tar?"

For the sake of argument I assume yaml to be the file format, but it could be json too.

So here is the idea, which would certainly not materialize overnight if it was implemented. Maybe for pacman 7+? It is not about implementation details atm, just about a general idea to cure some shortcomings in the build process.

What if:

1. PKGBUILD was in yaml
2. makepkg/pacman read and wrote yaml files, i.e. the file formats were standardized to yaml
   * pkg.tar: PKGINFO yaml (would contain the MTREE as a node), BUILDINFO yaml, maybe refactor these if there is redundant data
   * it might even be possible to use parts of the PKGBUILD yaml in code for the above; once read, it can be queried and/or reused
   * db.tar: a unified db consisting of one yaml for each package
3. *repod directly handled* these yaml files and tar'ed them in a new db instead of writing the json from the various formats and files, as is the case currently
   * there is a lot of redundancy in the current build process up to some repo operation in terms of writing and reading files in different formats

The idea would be to *base everything on yaml*, which eventually raises the question of the viability of makepkg/repo-add in bash. Maybe python? However, yaml writing could even be done in bash, replacing the current formats.

Afaik, arch plans have been to have one git repo per package, with these connected to a CI? It is much easier to feed some structured data format to a CI in my view, so that's the main thought behind the idea.
We have been running a CI/CD for some years now, and the shortcoming is that it's not easy to hook into the makepkg process to have, say, a dedicated CI test stage for the check() function in PKGBUILD. A yaml based PKGBUILD would make these makepkg stages easily accessible to the CI, which could run each of them in its own stage with proper makepkg flags.

To give a little glimpse of what a PKGBUILD in yaml might look like, here it is:

    $ pkg2yaml artixlinux/main/udev/trunk
    ---
    pkgbase:
      name: udev
      pkgver: 251.2
      pkgrel: 2
      url: https://www.github.com/systemd/systemd
      arch:
        - x86_64
      license:
        - GPL2
        - LGPL2.1
      makedepends:
        - acl
        - libacl.so
        - kmod
        - libkmod.so
        - util-linux
        - libblkid.so
        - hwdata
        - libcap
        - libcap.so
        - kbd
        - gperf
        - intltool
        - git
        - meson
        - docbook-xsl
        - rsync
        - python-jinja
    packages:
      - pkgname: udev
        depends:
          - acl
          - libacl.so
          - kmod
          - libkmod.so
          - util-linux
          - libblkid.so
          - libudev
          - hwdata
          - kbd
        provides:
          - udev=251.2
      - pkgname: libudev
        depends:
          - gcc-libs
        provides:
          - libudev.so
      - pkgname: esysusers
        groups:
          - base-devel
        depends:
          - gcc-libs
          - libxcrypt
      - pkgname: etmpfiles
        groups:
          - base-devel
        depends:
          - acl
          - libacl.so
          - libcap
          - libcap.so
    version: 251.2-2
    files:
      - udev-251.2-2-x86_64.pkg.tar.zst
      - libudev-251.2-2-x86_64.pkg.tar.zst
      - esysusers-251.2-2-x86_64.pkg.tar.zst
      - etmpfiles-251.2-2-x86_64.pkg.tar.zst
    debug:

The yaml representation doesn't contain any functions so far, and thus there is no separate makepkg implementation, but I have been considering it for our CI for quite a while. Our CI gets all necessary information from the yaml. In theory, build jobs could use PKGBUILD dependencies and do queue checks, for example, before building a given package.

This is the PKGBUILD side of things, thus better implemented upstream in pacman.

In my humble opinion, addressing the repod first is the wrong end to start, if you want a CI based build system.
If everything was in yaml, repod could simply handle these yaml files directly and tar them as a db, a very streamlined approach.

In its current state, repod would only be of limited use for us, since we don't add packages asynchronously to the repo. They are added on build success by our CI via a repo-add/links-add wrapper called in a stage.

I hope it doesn't sound all too crazy. It's a pretty radical and work-intensive idea, but that would be my vision for a complete set of build and repo management tools that are easy to connect to some CI. An async repo management tool could be built on top of such a structure.

Kind regards
Artoo
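
As a rough illustration of how a CI stage could take "all necessary information from the yaml", the sketch below derives the list of package files that a build of the udev example should produce, i.e. the entries under `files:`. It assumes the yaml has already been parsed into a dict (for instance with PyYAML's `yaml.safe_load`); the dict layout, the function name `expected_files` and the fixed `.pkg.tar.zst` extension are assumptions made for this example, and epochs are ignored.

```python
# Hypothetical result of parsing the udev yaml shown earlier in this message.
# Versions are kept as strings so that values such as "1.10" are not
# mangled into the float 1.1 by a yaml loader.
metadata = {
    "pkgbase": {
        "name": "udev",
        "pkgver": "251.2",
        "pkgrel": "2",
        "arch": ["x86_64"],
    },
    "packages": [
        {"pkgname": "udev"},
        {"pkgname": "libudev"},
        {"pkgname": "esysusers"},
        {"pkgname": "etmpfiles"},
    ],
}


def expected_files(meta, ext=".pkg.tar.zst"):
    """Return the package file names a build of this pkgbase should produce."""
    base = meta["pkgbase"]
    version = f"{base['pkgver']}-{base['pkgrel']}"
    return [
        f"{pkg['pkgname']}-{version}-{arch}{ext}"
        for pkg in meta["packages"]
        for arch in base["arch"]
    ]


if __name__ == "__main__":
    # A CI stage could compare this list with the build output before
    # handing the files to a repo-add style wrapper.
    for name in expected_files(metadata):
        print(name)
```

A stage that runs after a successful build could fail early if any of these files is missing, instead of discovering the problem during the repo operation.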
Hi artoo,

thanks for your input! There seem to be a few misconceptions about repod and I'll try to untangle them below.

On 2022-06-24 20:33:15 (+0200), artoo@artixlinux.org wrote:
> Hi arch team,
> after receiving your email on repod, here is an idea.
> What I have been asking myself since the python db scripts arrived at the arch gitlab instance.
> "Why doesn't arch consider writing yaml files with makepkg instead of the various formats in pkg.tar, db.tar, files.tar and links.tar?"
The scope of repod is to eventually create an alternative to dbscripts, which currently handles the binary package repository state while being tied to our svn mono repos that contain our package build sources.

Rewriting pacman or makepkg is out of scope for the repod project, as it is meant to consume package files (which have sort of well-established/defined metadata) and their potential signatures while outputting machine-readable state files from which repository sync databases can be created reproducibly.

This type of setup allows us to recreate the entire set of repository sync databases from existing packages in their package pools and the state data (in this scenario e.g. the repository sync databases had been damaged or needed to be reset), or even to completely rebuild all packages from their respective package build sources and recreate everything from scratch (in this scenario we lost all or some package files and their potential signatures and/or the repository sync databases). All actions in the management state are meant to be tracked in a git repository (plus additional caching), to allow for maximum transparency.

As such, repod does replace/supersede parts of the functionality shipped with pacman (e.g. repo-add/repo-remove) and implements a very basic - yet powerful - approach to managing binary package repositories, which can replace our use of dbscripts (if we move to package build sources in git in the future). You can find a few thoughts on this in this article [1] and in the current repod documentation [2].

This all being said: the project's usefulness is currently still quite limited and in the beginning it will only expose a few CLI tools for conversion and validation of packages and repository sync databases.
Going forward, the idea is to expose functionality via an API, that can be integrated into different authentication schemes, so that we can move away from a scenario in which we "call a script as some user on some host" to one where we make an authenticated call to an endpoint to trigger an asynchronous action. As you can imagine, data validation alone takes quite some time to figure out (this part is reaching a first milestone with 0.1.0 though), as many things in pacman/ makepkg serve as reference implementation and might not offer a versioned approach to adding or deprecating data fields.
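
To make the "consume package metadata, output machine-readable state" step a bit more concrete, here is a minimal sketch that parses `key = value` lines in the style of a .PKGINFO file and serializes them as JSON. It is not repod's actual code or output format; the sample content, the `LIST_KEYS` subset and the function names are assumptions chosen for illustration.

```python
import json

# Made-up .PKGINFO style content for illustration; real files carry more fields.
SAMPLE_PKGINFO = """\
# Generated by makepkg
pkgname = libudev
pkgbase = udev
pkgver = 251.2-2
arch = x86_64
depend = gcc-libs
provides = libudev.so
license = GPL2
license = LGPL2.1
"""

# Keys that may repeat and are therefore collected into lists (assumed subset).
LIST_KEYS = {"depend", "provides", "license"}


def parse_pkginfo(text):
    """Parse 'key = value' lines into a dict, gathering repeated keys into lists."""
    data = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition(" = ")
        if key in LIST_KEYS:
            data.setdefault(key, []).append(value)
        else:
            data[key] = value
    return data


def to_state(text):
    """Serialize with sorted keys so the same input always yields the same output."""
    return json.dumps(parse_pkginfo(text), indent=2, sort_keys=True)


if __name__ == "__main__":
    print(to_state(SAMPLE_PKGINFO))
```

Sorting the keys keeps the output stable across runs, which matters if such state files are tracked in a git repository and diffs should only show real changes.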
> In theory, build jobs could use PKGBUILD dependencies and do queue checks for example before building a given package.
The automation of everything from simple version bumps to mass rebuilds in CI is something that we are looking at as well, and there are many different approaches by now. In the future we can imagine a workflow in which packages are built on guarded build machines and signed by a signing enclave, after which the build process e.g. hands the files to repod for consumption.
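
The queue check mentioned in the quoted text could, under some assumptions, look like the following sketch: given structured metadata for every pkgbase waiting in a build queue, it orders the builds so that anything providing a makedepends entry of another queued pkgbase is built first. The queue contents are trimmed and hypothetical, the function name `build_order` is invented for this example, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical build queue: pkgbase -> packages it produces and its makedepends.
# Dependencies that are not in the queue (e.g. hwdata) are assumed to be
# available in the repositories already.
queue = {
    "udev": {
        "packages": ["udev", "libudev", "esysusers", "etmpfiles"],
        "makedepends": ["kmod", "util-linux", "hwdata"],
    },
    "kmod": {"packages": ["kmod"], "makedepends": []},
    "util-linux": {"packages": ["util-linux"], "makedepends": []},
}


def build_order(queue):
    """Return pkgbases ordered so that queued makedepends are built first."""
    # Map every package name to the pkgbase that produces it.
    owner = {
        pkg: base for base, info in queue.items() for pkg in info["packages"]
    }
    sorter = TopologicalSorter()
    for base, info in queue.items():
        deps = {
            owner[dep]
            for dep in info["makedepends"]
            if dep in owner and owner[dep] != base
        }
        sorter.add(base, *deps)
    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(sorter.static_order())


if __name__ == "__main__":
    print(build_order(queue))
```

A circular dependency surfaces as an exception before any build starts, which is one of the checks a CI could run on the queue up front.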
> This is the PKGBUILD side of things, thus better implemented upstream pacman.
Implementing data handling in a completely new format while maintaining compatibility with, or a migration path from, old formats is a huge undertaking. There is no "clean split" scenario for these things. Doing something like that in the context of thousands of existing packages and an already existing code base that uses an established structured data format is complicated. Starting to define versioned approaches, as we do with repod, is the first step towards being able to introduce and react to change more easily (e.g. new fields in .PKGINFO), while also making it possible to search for the use of certain keywords (e.g. "how many/which packages still use that deprecated keyword?").
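
A small sketch of what a "versioned approach" can buy: each record names the schema version it was written against, validation checks the fields against that version, and a query lists the packages that still use a given keyword. The schema versions, the field `legacy_option` and all function names here are made up for illustration and do not reflect repod's actual models.

```python
# Hypothetical schema versions: each maps to the set of fields it allows.
# Here the invented field 'legacy_option' exists in v1 and was dropped in v2.
SCHEMAS = {
    1: {"pkgname", "pkgver", "legacy_option"},
    2: {"pkgname", "pkgver"},
}

# Made-up records, each tagged with the schema version it was written against.
records = [
    {"schema_version": 1,
     "data": {"pkgname": "foo", "pkgver": "1.0-1", "legacy_option": "x"}},
    {"schema_version": 2,
     "data": {"pkgname": "bar", "pkgver": "2.0-1"}},
]


def validate(record):
    """Raise ValueError if a record uses fields unknown to its schema version."""
    allowed = SCHEMAS[record["schema_version"]]
    unknown = set(record["data"]) - allowed
    if unknown:
        raise ValueError(
            f"{record['data'].get('pkgname')}: unknown fields {sorted(unknown)}"
        )


def packages_using(records, keyword):
    """Answer 'which packages still use that keyword?' over all records."""
    return sorted(r["data"]["pkgname"] for r in records if keyword in r["data"])


if __name__ == "__main__":
    for record in records:
        validate(record)
    print(packages_using(records, "legacy_option"))
```

Because old and new records can coexist, a field can be deprecated in a new version without breaking existing data, and the query shows how much work remains before it can be removed.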
> In my humble opinion, addressing the repod first is the wrong end to start, if you want a CI based build system. If everything was in yaml, repod could simply directly handle these yaml files and tar them as db, very streamlined approach.
Unfortunately, everything is rather complicated/entangled, and implementing this in the various pieces of software at the same time, as you suggest, would require a lot of time. Hence starting with the binary package repository management system is a reasonable approach, as it allows us to decouple our package build sources from the existing system and manage a secure standalone solution separate from the package build and signing process.

I hope I could answer some of your questions and clear up some of the misconceptions.

Best,
David

[1] https://sleepmap.de/2022/packaging-for-arch-linux/
[2] https://repod.readthedocs.io

--
https://sleepmap.de
participants (2)
- artoo@artixlinux.org
- David Runge