I like the idea of having a metadata-based system in json/yaml, but our CI/CD is not based on asynchronous actions and probably won't be. Our pipeline also performs add/remove/move repo operations, calculated from the commit sha status.

However, I think that repod should not do any data validation at all; makepkg or namcap could do it instead, for example via yaml schema validation, assuming a yaml PKGBUILD and yaml meta files inside the pkg.tar. The ideal scenario would be that the repo tool could pass the PKGINFO meta directly to the meta db, appending e.g. a signature or checksum if necessary, so that you would only ever deal with more or less one file format that is schema validated. The consuming repo tool would not, and should not, have to bother with such a task at all in my opinion, but could perform another schema validation if necessary.

What I am saying is: the repo tool should rely on a reference meta yaml implementation in pacman/makepkg, rather than adding another layer of complexity, and thus potential errors, in the transformation between these formats, including the archiving or decompression steps. If pacman changes the meta files, your tool has to implement those changes, instead of merely updating a schema validation. The last thing I'd want on a server is a mismatch between db.tar.gz and meta db, for whatever reason. That is my biggest concern with the current implementation. This should move into pacman, I think, with the repod tool just handling the meta files directly in whatever fashion is suitable.

In short, I would decouple the nice idea of metadata files in json (I'd prefer yaml) from repod, and move that into pacman upstream, possibly tied to schema validations so the yaml is guaranteed to be valid (see the sketch at the end of this mail). As an idea, perhaps consider different pacman meta file backends: the classic one, and maybe a new json/yaml based one that could be tested and developed this way? I would imagine pyalpm would also profit greatly from a yaml based approach, as well as namcap and maybe even archweb.

On 24.06.22 at 23:42, David Runge wrote:
> Hi artoo,
>
> thanks for your input!
>
> There seem to be a few misconceptions about repod and I'll try to untangle them below.
>
> On 2022-06-24 20:33:15 (+0200), artoo@artixlinux.org wrote:
>> Hi arch team,
>>
>> after receiving your email on repod, here is an idea.
>>
>> What I have been asking myself since the python db scripts arrived at the arch gitlab instance:
>>
>> "Why doesn't arch consider writing yaml files with makepkg instead of the various formats in pkg.tar, db.tar, files.tar and links.tar?"
>
> The scope of repod is to eventually create an alternative to dbscripts, which currently handles the binary package repository state, while being tied to our svn mono repos that contain our package build sources.
> Rewriting pacman or makepkg is out of scope for the repod project, as it is meant to consume package files (which have more or less well-established/defined metadata) and their potential signatures, while outputting machine readable state files that allow reproducibly creating repository sync databases from them.
>
> This type of setup allows us to recreate the entire set of repository sync databases from existing packages in their package pools and the state data (in this scenario e.g. the repository sync databases had been damaged or needed to be reset), or even to completely rebuild all packages from their respective package build sources and recreate everything from scratch (in this scenario we lost all or some package files and their potential signatures and/or the repository sync databases).
>
> All actions in the management state are meant to be tracked in a git repository (plus additional caching), to allow for maximum transparency.
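> To make this more concrete, here is a deliberately simplified sketch of the "state files in, sync database out" idea. The file layout and field names below are made up for illustration and are not repod's actual format:
>
>     import io
>     import json
>     import tarfile
>     from pathlib import Path
>
>     def rebuild_sync_db(state_dir: Path, db_path: Path) -> None:
>         # Hypothetical layout: one JSON state file per package.
>         with tarfile.open(db_path, "w:gz") as db:
>             for state_file in sorted(state_dir.glob("*.json")):
>                 meta = json.loads(state_file.read_text())
>                 desc = (
>                     f"%NAME%\n{meta['name']}\n\n"
>                     f"%VERSION%\n{meta['version']}\n\n"
>                     f"%FILENAME%\n{meta['filename']}\n"
>                 ).encode()
>                 info = tarfile.TarInfo(f"{meta['name']}-{meta['version']}/desc")
>                 info.size = len(desc)
>                 db.addfile(info, io.BytesIO(desc))
>
> Because the state files fully determine the database contents, running this twice over the same state yields the same database, which is what makes resets and rebuilds safe.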
> As such, repod does replace/supersede parts of the functionality shipped with pacman (e.g. repo-add/repo-remove) and implements a very basic, yet powerful, approach to managing binary package repositories, which can replace our use of dbscripts (if we move to package build sources in git in the future).
> You can find a few thoughts on this in this article [1] and in the current repod documentation [2].
>
> This all being said: The project's usefulness is currently still quite limited and in the beginning it will only expose a few CLI tools for conversion and validation of packages and repository sync databases.
> Going forward, the idea is to expose functionality via an API that can be integrated into different authentication schemes, so that we can move away from a scenario in which we "call a script as some user on some host" to one where we make an authenticated call to an endpoint to trigger an asynchronous action.
> As you can imagine, data validation alone takes quite some time to figure out (this part is reaching a first milestone with 0.1.0 though), as many things in pacman/makepkg serve as the reference implementation and might not offer a versioned approach to adding or deprecating data fields.
> In theory, build jobs could, for example, use PKGBUILD dependencies and do queue checks before building a given package, as sketched below. The automation of everything from simple version bumps to mass rebuilds in CI is something that we are looking at as well, and there are many different approaches by now.
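> As a purely hypothetical sketch (none of this exists yet), such a queue check could boil down to a topological ordering of the rebuild set, derived from the depends/makedepends arrays of the PKGBUILDs involved:
>
>     from graphlib import TopologicalSorter
>
>     # Made-up rebuild set; the edges would really come from PKGBUILDs.
>     depends = {
>         "python-example": {"python"},
>         "python": {"openssl"},
>         "openssl": set(),
>     }
>
>     # Dependencies first: ['openssl', 'python', 'python-example']
>     build_order = list(TopologicalSorter(depends).static_order())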
> In the future we can imagine a workflow in which packages can be built on guarded build machines and signed by a signing enclave, after which the build process e.g. hands the files to repod for consumption.
>> This is the PKGBUILD side of things, thus better implemented upstream in pacman.
>
> Implementing data handling in a completely new format, while being able to maintain compatibility or a migration path from old formats, is a huge undertaking. There is no "clean split" scenario for these things. Doing something like that in the context of a set of thousands of existing packages and an already existing code base that uses an established structured data format is complicated.
> Starting to define versioned approaches, as we do with repod, is the first step in the direction of being able to more easily introduce change and react to change (e.g. new fields in .PKGINFO), while allowing to search through the use of certain keywords (e.g. "how many/which packages still use that deprecated keyword?").
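> To illustrate (the keys below do appear in .PKGINFO, but the versioning scheme itself is only a sketch, not repod's actual model):
>
>     # Versioned sets of known .PKGINFO keys (illustrative only).
>     PKGINFO_KEYS = {
>         1: {"pkgname", "pkgver", "pkgdesc", "depend", "makepkgopt"},
>         2: {"pkgname", "pkgver", "pkgdesc", "depend", "xdata"},
>     }
>
>     def deprecated_keys(pkginfo: dict, version: int = 2) -> set:
>         # Keys known to an older version but dropped in `version`.
>         older = set().union(
>             *(keys for v, keys in PKGINFO_KEYS.items() if v < version)
>         )
>         return (set(pkginfo) & older) - PKGINFO_KEYS[version]
>
> Running such a check across a package pool answers the "which packages still use that deprecated keyword?" question mechanically.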
>> In my humble opinion, addressing repod first is the wrong end to start from, if you want a CI based build system. If everything was in yaml, repod could simply handle these yaml files directly and tar them as db, a very streamlined approach.
>
> Unfortunately, everything is rather complicated/entangled, and implementing this in the various pieces of software at the same time, as suggested by you, would require a lot of time.
> Hence starting with the binary package repository management system is a reasonable approach, as it allows us to decouple our package build sources from the existing system and manage a secure standalone solution separate from the package build and signing process.
> I hope I could answer some of your questions and clear up some of the misconceptions.
>
> Best,
> David
>
> [1] https://sleepmap.de/2022/packaging-for-arch-linux/
> [2] https://repod.readthedocs.io
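P.S.: To sketch the schema validation idea from the top of this mail: assuming a YAML .PKGINFO and the pyyaml and jsonschema libraries, the validation step on the packaging side could be as small as this (the schema fields and file name are made up; the real schema would live upstream in pacman/makepkg):

    import yaml  # pyyaml
    from jsonschema import validate

    # Made-up minimal schema; the real one would be maintained upstream.
    PKGINFO_SCHEMA = {
        "type": "object",
        "required": ["pkgname", "pkgver", "arch"],
        "properties": {
            "pkgname": {"type": "string"},
            "pkgver": {"type": "string"},
            "arch": {"type": "string"},
        },
    }

    with open(".PKGINFO.yaml") as f:
        validate(instance=yaml.safe_load(f), schema=PKGINFO_SCHEMA)

Any tool in the chain (makepkg, namcap, repod) could then run the exact same check against the exact same schema, instead of each implementing its own parser.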