Re: [pacman-dev] Repository management

16 May 2017

On 2017-05-09 22:54 +1000
Allan McRae wrote:
...
I am looking for ideas here.  Please brainstorm to your heart's content.
Ok :)
...
So two points up for discussion:
1) Sync repository layout?  I don't see any point in leaving the tar
based format, as reading of sync databases is not a bottleneck.  (The
local db format can be a bottleneck, but that is a separate discussion...)
Do we split the information in .db out of .files and add a .full db with
complete information?  Then any .src db could follow suit and just have
source package information.  How do we get around the out of sync issue
(e.g., a package is removed from .db, but we have an old .files database
with it).  Do we add timestamps, and print a warning on -F operations
when the two are out of sync?
Add a timestamp inside each database (*.db, *.files, *.src). When pacman
downloads a database, instead of saving it as <repo>.<ext> and squashing the
previous database, save it as <repo>-<timestamp>.<ext>. Each refresh operation
(pacman -Sy, pacman -Fy) is associated with a particular database (*.db and
*.files, respectively). Create an untimestamped symlink to that database, e.g.

$ pacman -Sy...
# retrieve <repo>.db and save as <repo>-<timestamp_1>.db
# ln -s <repo>-<timestamp_1>.db <repo>.db

$ pacman -Fy...
# retrieve <repo>.db and save as <repo>-<timestamp_2>.db
# retrieve <repo>.files and save as <repo>-<timestamp_2>.files
# ln -s <repo>-<timestamp_2>.files <repo>.files

# something similar for *.src files
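
The two refresh paths above could be sketched roughly as follows. This is only
an illustration in Python, not pacman code: the function names, the flat
database directory and the integer timestamp are all assumptions, and the
download step is replaced by passing the bytes in directly. It assumes a POSIX
filesystem where a symlink can be renamed over an existing one.

```python
import os


def save_timestamped(dbdir, repo, ext, timestamp, data):
    """Write <repo>-<timestamp>.<ext> and return its file name (hypothetical helper)."""
    name = f"{repo}-{timestamp}.{ext}"
    with open(os.path.join(dbdir, name), "wb") as f:
        f.write(data)
    return name


def point_symlink(dbdir, repo, ext, target_name):
    """Point <repo>.<ext> at target_name, replacing any previous link."""
    link = os.path.join(dbdir, f"{repo}.{ext}")
    tmp = link + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target_name, tmp)  # relative target in the same directory
    os.replace(tmp, link)         # rename the new link over the old one


def refresh_sync(dbdir, repo, timestamp, db_bytes):
    """Rough equivalent of 'pacman -Sy' for one repo."""
    name = save_timestamped(dbdir, repo, "db", timestamp, db_bytes)
    point_symlink(dbdir, repo, "db", name)


def refresh_files(dbdir, repo, timestamp, db_bytes, files_bytes):
    """Rough equivalent of 'pacman -Fy' for one repo."""
    # The matching .db is stored next to the .files db, but <repo>.db is
    # deliberately left pointing at whatever the last -Sy selected.
    save_timestamped(dbdir, repo, "db", timestamp, db_bytes)
    name = save_timestamped(dbdir, repo, "files", timestamp, files_bytes)
    point_symlink(dbdir, repo, "files", name)
```

Creating the link under a temporary name and renaming it means a reader never
sees a missing <repo>.<ext>, which seemed like a sensible default for the
sketch.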

For operations that only involve the current <repo>.db files, no change is
needed for loading the database.

For loading <repo>.files, you will need to dereference <repo>.files first,
grab <timestamp_2> from <repo>-<timestamp_2>.files in the example above, and
then use it to load <repo>-<timestamp_2>.db instead of <repo>.db. Same method
for *.src files.
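
As a rough sketch of that lookup (Python for illustration only; matching_db is
a made-up name, and the timestamp is assumed to be a plain decimal integer such
as Unix epoch seconds, which the proposal does not specify):

```python
import os
import re


def matching_db(dbdir, repo, ext="files"):
    """Return the path of the <repo>-<timestamp>.db that belongs to <repo>.<ext>."""
    link = os.path.join(dbdir, f"{repo}.{ext}")
    # Dereference the untimestamped symlink, e.g. core.files -> core-200.files
    target = os.path.basename(os.readlink(link))
    pattern = re.escape(repo) + r"-(\d+)\." + re.escape(ext)
    m = re.fullmatch(pattern, target)
    if m is None:
        raise ValueError(f"{link} does not point at a timestamped database")
    # Reuse the extracted timestamp to pick the .db from the same refresh
    return os.path.join(dbdir, f"{repo}-{m.group(1)}.db")
```

The same function would cover *.src by passing ext="src".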

For cleanup of the timestamped files, collect the valid timestamps from the
untimestamped symlinks and then remove anything that doesn't match them. This
should probably be done with each database refresh. Maybe you can use the same
function that you use to clean up the package cache with -Sc while leaving
installed packages.
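
A minimal sketch of that cleanup pass, again in Python and only under
assumptions of mine: all databases live in one directory, timestamps are
decimal integers, and the extensions are limited to db, files and src. The
cleanup name and the return value are illustrative.

```python
import os
import re

# <repo>-<timestamp>.<ext>; the integer timestamp format is an assumption
STAMPED = re.compile(r"(?P<repo>.+)-(?P<ts>\d+)\.(?P<ext>db|files|src)")


def cleanup(dbdir):
    """Remove timestamped databases that no untimestamped symlink refers to."""
    keep = set()  # (repo, timestamp) pairs still referenced by a symlink
    for name in os.listdir(dbdir):
        path = os.path.join(dbdir, name)
        if os.path.islink(path):
            m = STAMPED.fullmatch(os.path.basename(os.readlink(path)))
            if m:
                keep.add((m["repo"], m["ts"]))

    removed = []
    for name in sorted(os.listdir(dbdir)):
        path = os.path.join(dbdir, name)
        m = STAMPED.fullmatch(name)
        if m and not os.path.islink(path) and (m["repo"], m["ts"]) not in keep:
            os.remove(path)
            removed.append(name)
    return removed
```

Because the match is on the timestamp rather than on the exact file name, the
<repo>-<timestamp>.db that belongs to a live <repo>.files link survives even
though <repo>.db points elsewhere.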

Obviously there will be some redundancy, with up to 3 copies of
<repo>-<timestamp>.db (one each for the *.db, *.files and *.src refreshes), but
I think that's better than e.g. breaking pkgfile searches after an upgrade.

With this approach you could also download the latest version of the sync
databases as <repo>-<timestamp>.db without symlinking <repo>.db to it, and then
use that to query upgradable packages and other info from the mirror.

For propagating the database to the servers, nothing changes. Whenever the
database is updated, generate <repo>.db, <repo>.files, <repo>.src and whatever
else at the same time with the same internal timestamp and then just push them
out as usual.
...
2) Do we need a better (read "more easily maintainable") tool for
handling database generation and updates?  libalpm already can read in
information from package files, so we could add libalpm/db_write.c with the
database creation functions.   Should we unify our repo format with our
local database format which we already write?
Yes for unification, preferably in a standardized format (e.g. YAML). Having
the functionality to read and write the files in libalpm would be useful for
third-party tool developers.

On 2017-05-10 12:54 -0400
Dave Reisner wrote:
...
WRT replacing repo-add, I'd suggest we come up with the use cases we
want to support, design an interface to meet them, and then come up with
the implementation. Might be nice to start with the Arch Linux
repository layout as an example that we'd want to support (pooled
packages with symlinks into repo dirs).
What about using a relative subpath instead of a filename in the database? That
would enable transparent freeform repo layouts (e.g. pooled packages without
symlinks, package groups in different subdirs, etc.).
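
To illustrate, a client could resolve such an entry against the server URL
along these lines. This is a Python sketch with invented names (package_url,
the example URLs); rejecting paths that climb out of the repo root is my own
assumption about what a client would want, not part of the proposal.

```python
import posixpath


def package_url(server, relpath):
    """Join a repo server URL with the relative subpath stored in the database."""
    norm = posixpath.normpath(relpath)
    # Refuse entries that are empty, absolute, or escape the repo root
    if (norm == "." or posixpath.isabs(norm)
            or norm == ".." or norm.startswith("../")):
        raise ValueError("invalid package path in database: " + relpath)
    return server.rstrip("/") + "/" + norm
```

With this, a pooled layout would only need entries such as
pool/foo-1.0-1-x86_64.pkg.tar.xz in the database, with no symlinks on the
server side.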

You could also avoid the need for subdirectories by adding the architecture
to the database filename, e.g. <repo>.<arch>.<ext>

To simplify repo-add, you could include .SRCINFO directly to avoid parsing and
reformatting/rewriting that metadata. Keep it as a separate file then add a new
one (call it PKGINFO?) for information about the *.pkg.* file itself (build
date, packager, signature, checksum, size, relative filepath, etc.). Add other
files to contain related information (e.g. INSTALLINFO with install time, file
list, install origin?). That way, each step copies existing files and adds a
new one with the new info (repo-add: collect SRCINFO, add PKGINFO; install a
package: copy SRCINFO and PKGINFO to local db, create INSTALLINFO etc.)
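
The copy-and-add flow could look something like the sketch below. It is Python
for illustration only; the one-directory-per-package layout, the "key = value"
line format and the function names are all assumptions on my part, and the
field names passed in are placeholders.

```python
import os
import shutil


def write_info(path, fields):
    """Write a metadata file as 'key = value' lines (format is an assumption)."""
    with open(path, "w") as f:
        for key, value in fields.items():
            f.write(f"{key} = {value}\n")


def repo_add(src_dir, repo_entry, pkg_fields):
    """repo-add step: collect SRCINFO unchanged, add PKGINFO."""
    os.makedirs(repo_entry, exist_ok=True)
    shutil.copy(os.path.join(src_dir, "SRCINFO"), repo_entry)
    write_info(os.path.join(repo_entry, "PKGINFO"), pkg_fields)


def install(repo_entry, local_entry, install_fields):
    """Install step: copy SRCINFO and PKGINFO to the local db, add INSTALLINFO."""
    os.makedirs(local_entry, exist_ok=True)
    for name in ("SRCINFO", "PKGINFO"):
        shutil.copy(os.path.join(repo_entry, name), local_entry)
    write_info(os.path.join(local_entry, "INSTALLINFO"), install_fields)
```

Neither step parses or rewrites the files it inherits, which is the point of
keeping them separate.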

A repo metadata file would also be required in the root directory with the repo
timestamp for the timestamped databases described above. The file could also
collect other metadata such as package providers and maybe replacements to
speed up some operations. 

Regards,
Xyne