I like the idea of having a metadata-based system in json/yaml, but our CI/CD is not based on asynchronous actions and probably won't be. Our pipeline also performs add/remove/move repo operations, calculated from the commit sha status.

However, I think that repod should not do any data validation at all; makepkg or namcap could do it instead, for example via yaml schema validation, assuming a yaml PKGBUILD and yaml meta files inside the pkg.tar. The ideal scenario would be that the repo tool could pass the PKGINFO meta directly to the meta db, appending e.g. a signature or checksum if necessary, so that you would only ever deal with more or less one file format that is schema validated. The consuming repo tool would not, and should not, have to bother with such a task at all in my opinion, but could perform another schema validation if necessary.

What I am saying is: the repo tool should rely on a reference meta yaml implementation in pacman/makepkg, rather than adding another layer of complexity, and thus potential errors, in the transformation between these formats, including the archiving or decompression steps. If pacman changes the meta files, your tool has to implement those changes, instead of merely updating a schema validation. The last thing I'd want on a server is a mismatch between db.tar.gz and meta db, for whatever reason. That is my biggest concern with the current implementation. This should move into pacman, I think, with the repod tool just handling the meta files directly in whatever fashion is suitable.

In short, I would decouple the nice idea of metadata files in json (I'd prefer yaml) from repod, and move that into pacman upstream, possibly tied to schema validations so the yaml is guaranteed to be valid (see the sketch at the end of this mail). As an idea, perhaps consider different pacman meta file backends: the classic one, and maybe a new json/yaml based one that could be tested and developed this way? I would imagine pyalpm would also profit greatly from a yaml based approach, as well as namcap and maybe even archweb.

On 24.06.22 at 23:42, David Runge wrote:
> Hi artoo,
>
> thanks for your input!
>
> There seem to be a few misconceptions about repod and I'll try to untangle them below.
>
> On 2022-06-24 20:33:15 (+0200), artoo@artixlinux.org wrote:
>> Hi arch team,
>>
>> after receiving your email on repod, here is an idea.
>>
>> What I have been asking myself since the python db scripts arrived at the arch gitlab instance:
>>
>> "Why doesn't arch consider writing yaml files with makepkg instead of the various formats in pkg.tar, db.tar, files.tar and links.tar?"
>
> The scope of repod is to eventually create an alternative to dbscripts, which currently handles the binary package repository state, while being tied to our svn mono repos that contain our package build sources.
> Rewriting pacman or makepkg is out of scope for the repod project, as it is meant to consume package files (which have more or less well-established/defined metadata) and their potential signatures, while outputting machine readable state files that allow reproducibly creating repository sync databases from them.
>
> This type of setup allows us to recreate the entire set of repository sync databases from existing packages in their package pools and the state data (in this scenario e.g. the repository sync databases had been damaged or needed to be reset), or even to completely rebuild all packages from their respective package build sources and recreate everything from scratch (in this scenario we lost all or some package files and their potential signatures and/or the repository sync databases).
>
> All actions in the management state are meant to be tracked in a git repository (plus additional caching), to allow for maximum transparency.
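> To make this more concrete, here is a deliberately simplified sketch of the "state files in, sync database out" idea. The file layout and field names below are made up for illustration and are not repod's actual format:
>
>     import io
>     import json
>     import tarfile
>     from pathlib import Path
>
>     def rebuild_sync_db(state_dir: Path, db_path: Path) -> None:
>         # Hypothetical layout: one JSON state file per package.
>         with tarfile.open(db_path, "w:gz") as db:
>             for state_file in sorted(state_dir.glob("*.json")):
>                 meta = json.loads(state_file.read_text())
>                 desc = (
>                     f"%NAME%\n{meta['name']}\n\n"
>                     f"%VERSION%\n{meta['version']}\n\n"
>                     f"%FILENAME%\n{meta['filename']}\n"
>                 ).encode()
>                 info = tarfile.TarInfo(f"{meta['name']}-{meta['version']}/desc")
>                 info.size = len(desc)
>                 db.addfile(info, io.BytesIO(desc))
>
> Because the state files fully determine the database contents, running this twice over the same state yields the same database, which is what makes resets and rebuilds safe.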
> As such, repod does replace/supersede parts of the functionality shipped with pacman (e.g. repo-add/repo-remove) and implements a very basic, yet powerful, approach to managing binary package repositories, which can replace our use of dbscripts (if we move to package build sources in git in the future).
> You can find a few thoughts on this in this article [1] and in the current repod documentation [2].
>
> This all being said: The project's usefulness is currently still quite limited and in the beginning it will only expose a few CLI tools for conversion and validation of packages and repository sync databases.
> Going forward, the idea is to expose functionality via an API that can be integrated into different authentication schemes, so that we can move away from a scenario in which we "call a script as some user on some host" to one where we make an authenticated call to an endpoint to trigger an asynchronous action.
> As you can imagine, data validation alone takes quite some time to figure out (this part is reaching a first milestone with 0.1.0 though), as many things in pacman/makepkg serve as the reference implementation and might not offer a versioned approach to adding or deprecating data fields.
> In theory, build jobs could, for example, use PKGBUILD dependencies and do queue checks before building a given package, as sketched below. The automation of everything from simple version bumps to mass rebuilds in CI is something that we are looking at as well, and there are many different approaches by now.
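> As a purely hypothetical sketch (none of this exists yet), such a queue check could boil down to a topological ordering of the rebuild set, derived from the depends/makedepends arrays of the PKGBUILDs involved:
>
>     from graphlib import TopologicalSorter
>
>     # Made-up rebuild set; the edges would really come from PKGBUILDs.
>     depends = {
>         "python-example": {"python"},
>         "python": {"openssl"},
>         "openssl": set(),
>     }
>
>     # Dependencies first: ['openssl', 'python', 'python-example']
>     build_order = list(TopologicalSorter(depends).static_order())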
> In the future we can imagine a workflow in which packages can be built on guarded build machines and signed by a signing enclave, after which the build process e.g. hands the files to repod for consumption.
>> This is the PKGBUILD side of things, thus better implemented upstream in pacman.
>
> Implementing data handling in a completely new format, while being able to maintain compatibility or a migration path from old formats, is a huge undertaking. There is no "clean split" scenario for these things. Doing something like that in the context of a set of thousands of existing packages and an already existing code base that uses an established structured data format is complicated.
> Starting to define versioned approaches, as we do with repod, is the first step in the direction of being able to more easily introduce change and react to change (e.g. new fields in .PKGINFO), while allowing to search through the use of certain keywords (e.g. "how many/which packages still use that deprecated keyword?").
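> To illustrate (the keys below do appear in .PKGINFO, but the versioning scheme itself is only a sketch, not repod's actual model):
>
>     # Versioned sets of known .PKGINFO keys (illustrative only).
>     PKGINFO_KEYS = {
>         1: {"pkgname", "pkgver", "pkgdesc", "depend", "makepkgopt"},
>         2: {"pkgname", "pkgver", "pkgdesc", "depend", "xdata"},
>     }
>
>     def deprecated_keys(pkginfo: dict, version: int = 2) -> set:
>         # Keys known to an older version but dropped in `version`.
>         older = set().union(
>             *(keys for v, keys in PKGINFO_KEYS.items() if v < version)
>         )
>         return (set(pkginfo) & older) - PKGINFO_KEYS[version]
>
> Running such a check across a package pool answers the "which packages still use that deprecated keyword?" question mechanically.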
>> In my humble opinion, addressing repod first is the wrong end to start from, if you want a CI based build system. If everything was in yaml, repod could simply handle these yaml files directly and tar them as db, a very streamlined approach.
>
> Unfortunately, everything is rather complicated/entangled, and implementing this in the various pieces of software at the same time, as suggested by you, would require a lot of time.
> Hence starting with the binary package repository management system is a reasonable approach, as it allows us to decouple our package build sources from the existing system and manage a secure standalone solution separate from the package build and signing process.
> I hope I could answer some of your questions and clear up some of the misconceptions.
>
> Best,
> David
>
> [1] https://sleepmap.de/2022/packaging-for-arch-linux/
> [2] https://repod.readthedocs.io
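P.S.: To sketch the schema validation idea from the top of this mail: assuming a YAML .PKGINFO and the pyyaml and jsonschema libraries, the validation step on the packaging side could be as small as this (the schema fields and file name are made up; the real schema would live upstream in pacman/makepkg):

    import yaml  # pyyaml
    from jsonschema import validate

    # Made-up minimal schema; the real one would be maintained upstream.
    PKGINFO_SCHEMA = {
        "type": "object",
        "required": ["pkgname", "pkgver", "arch"],
        "properties": {
            "pkgname": {"type": "string"},
            "pkgver": {"type": "string"},
            "arch": {"type": "string"},
        },
    }

    with open(".PKGINFO.yaml") as f:
        validate(instance=yaml.safe_load(f), schema=PKGINFO_SCHEMA)

Any tool in the chain (makepkg, namcap, repod) could then run the exact same check against the exact same schema, instead of each implementing its own parser.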