[pacman-dev] Repository management
Hi all,
Every time I attempt to work on repo-add, I find it to be a very difficult endeavour. Even though it is half the size of makepkg (without even including any of libmakepkg), it is much more convoluted to work on.
We also have a weird repository database system. We have:
- .db dbs with package information, signatures and delta information
- .files dbs that are the same as .db dbs but additionally include filelists
There are two reasons the .files dbs replicate all information in the .db dbs:
- .db and .files dbs getting out of sync could cause issues
- a complete database is useful for things like archweb, mostly to avoid the above
I would also like to include information on source packages to these databases. The files information is separate due to wanting our primary database to be small. Likewise, source package information needs to be separate (the signatures take most of the size in the .db dbs, so adding source package signatures effectively doubles the size).
So two points up for discussion:
1) Sync repository layout?
I don't see any point in leaving the tar based format, as reading of sync databases is not a bottleneck. (The local db format can be a bottleneck, but that is a separate discussion...)
Do we split the information in .db out of .files and add a .full db with complete information? Then any .src db could follow suit and just have source package information. How do we get around the out of sync issue (e.g., a package is removed from .db, but we have an old .files database with it). Do we add timestamps, and print a warning on -F operations when the two are out of sync?
2) Do we need a better (read "more easily maintainable") tool for handling database generation and updates?
libalpm already can read in information package files, so we could add libalpm/db_write.c with the database creation functions. Should we unify our repo format with our local database format which we already write?
I am looking for ideas here. Please brainstorm to your heart's content.
Cheers, Allan
On Tue, May 09, 2017 at 10:54:44PM +1000, Allan McRae wrote:
1) Sync repository layout? I don't see any point in leaving the tar based format, as reading of sync databases is not a bottleneck. (The local db format can be a bottleneck, but that is a separate discussion...)
Isn't this a historical reversal? IIRC, the sync DBs used to be expanded onto disk, and we decided to leave them as tarballs to address performance/fragmentation concerns.
2) Do we need a better (read "more easily maintainable") tool for handling database generation and updates? libalpm already can read in information package files, so we could add libalpm/db_write.c with the database creation functions. Should we unify our repo format with our local database format which we already write?
I'd urge you not to make this a part of pacman. It's too far off the beaten path for most users to make it a part of an already complicated tool.
I am looking for ideas here. Please brainstorm to your heart's content.
WRT replacing repo-add, I'd suggest we come up with the use cases we want to support, design an interface to meet them, and then come up with the implementation. Might be nice to start with the Arch Linux repository layout as an example that we'd want to support (pooled packages with symlinks into repo dirs).
On 11/05/17 02:54, Dave Reisner wrote:
On Tue, May 09, 2017 at 10:54:44PM +1000, Allan McRae wrote:
1) Sync repository layout? I don't see any point in leaving the tar based format, as reading of sync databases is not a bottleneck. (The local db format can be a bottleneck, but that is a separate discussion...)
Isn't this a historical reversal? IIRC, the sync DBs used to be expanded onto disk, and we decided to leave them as tarballs to address performance/fragmentation concerns.
To be clear, I was saying to stay tar based and not to move to something else.
2) Do we need a better (read "more easily maintainable") tool for handling database generation and updates? libalpm already can read in information package files, so we could add libalpm/db_write.c with the database creation functions. Should we unify our repo format with our local database format which we already write?
I'd urge you not to make this a part of pacman. It's too far off the beaten path for most users to make it a part of an already complicated tool.
Definitely not part of pacman. I was suggesting another program with a libalpm backend.
On 05/09/17 at 10:54pm, Allan McRae wrote:
I would also like to include information on source packages to these databases. The files information is separate due to wanting our primary database to be small. Likewise, source package information needs to be separate (the signatures take most of the size in the .db dbs, so adding source package signatures effectively doubles the size).
So two points up for discussion:
1) Sync repository layout? I don't see any point in leaving the tar based format, as reading of sync databases is not a bottleneck. (The local db format can be a bottleneck, but that is a separate discussion...)
Do we split the information in .db out of .files and add a .full db with complete information? Then any .src db could follow suit and just have source package information. How do we get around the out of sync issue (e.g., a package is removed from .db, but we have an old .files database with it). Do we add timestamps, and print a warning on -F operations when the two are out of sync?
What about just not including the signature in the database? Make the inclusion of the signature optional and have pacman (or whatever downloads the source package) also look for a corresponding .sig file if it's not in the db. pacman -U already looks for a .sig file when downloading a package and you have a feature request to download .sig files even with -S, so code-wise this seems like a pretty clean solution. Then you can include the source information right in the primary DB and Arch's devtools can opt to omit the signature from the db.
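The lookup order suggested here (use the signature from the database if present, otherwise look for a detached .sig file) can be sketched as follows. This is a minimal illustration and not pacman code: the entry is modelled as a plain dict using the database field names PGPSIG and FILENAME, and `fetch` is a hypothetical callback that returns the bytes of a file next to the package, or None if it does not exist.

```python
import base64
from typing import Callable, Optional


def find_signature(entry: dict, fetch: Callable[[str], Optional[bytes]]) -> Optional[bytes]:
    """Return detached-signature bytes for a database entry, or None."""
    embedded = entry.get("PGPSIG")
    if embedded:
        # The repo chose to embed the signature (base64) in the database.
        return base64.b64decode(embedded)
    # Otherwise fall back to a <filename>.sig file alongside the package.
    return fetch(entry["FILENAME"] + ".sig")


# Example: one entry with an embedded signature, one relying on a .sig file.
mirror = {"bar-1.0-1-any.pkg.tar.xz.sig": b"detached"}
with_sig = {"FILENAME": "foo-1.0-1-any.pkg.tar.xz",
            "PGPSIG": base64.b64encode(b"embedded").decode()}
without_sig = {"FILENAME": "bar-1.0-1-any.pkg.tar.xz"}
print(find_signature(with_sig, mirror.get))
print(find_signature(without_sig, mirror.get))
```

With this ordering, a repository that omits signatures from the database costs one extra request per package, while repositories that embed them behave as they do today.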
2) Do we need a better (read "more easily maintainable") tool for handling database generation and updates? libalpm already can read in information package files, so we could add libalpm/db_write.c with the database creation functions. Should we unify our repo format with our local database format which we already write?
I would love to see us drop the ini-style .PKGINFO format, if that's what you mean. Even without adding a database writer to libalpm, having two formats for the exact same data is unnecessary and leads to inconsistencies between the two.
apg
On 11/05/17 07:54, Andrew Gregory wrote:
2) Do we need a better (read "more easily maintainable") tool for handling database generation and updates? libalpm already can read in information package files, so we could add libalpm/db_write.c with the database creation functions. Should we unify our repo format with our local database format which we already write?
I would love to see us drop the ini-style .PKGINFO format, if that's what you mean. Even without adding a database writer to libalpm, having two formats for the exact same data is unnecessary and leads to inconsistencies between the two.
I was not considering .PKGINFO when I wrote that, although it is a good point...
Currently we have the following:
https://wiki.archlinux.org/index.php/User:Allan/Pacman_DB_Format
Notice the local and sync database formats are near identical (there are some field differences), but we use two different functions to read them, where the main difference is which fgets variant gets used - this is what I was talking about unifying. We already have the ability to write that format in libalpm given we write the local db entry, so we could extend that as the basis of writing repo databases via a libalpm tool too.
So, expanding on this idea: it would be great to have a single package information reader that covered .PKGINFO files, local database files, and sync database files. To do this, .PKGINFO files (and presumably .BUILDINFO) would need to change to the same format as the database files.
How would we make such a transition? Add a new file into the package (e.g. .PACKAGE) that has the new format. Have pacman read the new format if available, but fall back to the old format if it is not. Then wait a release or two to remove support for the old format?
Allan
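The transition described above (read the new-format file if the package has one, otherwise fall back to .PKGINFO) might look roughly like this. It is only a sketch: package contents are modelled as a dict of member name to text, the .PACKAGE name is the example from the message, and the two parsers cover just the basic shape of each format (%FIELD% blocks versus "key = value" lines). Note that the two formats use different key names, so a real reader would also need a mapping between them.

```python
def parse_desc(text: str) -> dict:
    """Parse database-style text: a %FIELD% line, its values, then a blank line."""
    fields, current = {}, None
    for line in text.splitlines():
        if len(line) > 2 and line.startswith("%") and line.endswith("%"):
            current = line.strip("%").lower()
            fields[current] = []
        elif line == "":
            current = None
        elif current is not None:
            fields[current].append(line)
    return fields


def parse_pkginfo(text: str) -> dict:
    """Parse ini-style 'key = value' lines; repeated keys accumulate."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(" = ")
        if sep:
            fields.setdefault(key, []).append(value)
    return fields


def read_package_info(members: dict) -> dict:
    """Prefer the new-format file; fall back to the old .PKGINFO."""
    if ".PACKAGE" in members:
        return parse_desc(members[".PACKAGE"])
    return parse_pkginfo(members[".PKGINFO"])


old_style = "# generated by makepkg\npkgname = foo\npkgver = 1.0-1\ndepend = glibc\n"
new_style = "%NAME%\nfoo\n\n%VERSION%\n1.0-1\n\n%DEPENDS%\nglibc\n"
print(read_package_info({".PKGINFO": old_style}))
print(read_package_info({".PACKAGE": new_style, ".PKGINFO": old_style}))
```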
On 2017-05-09 22:54 +1000 Allan McRae wrote:
I am looking for ideas here. Please brainstorm to your heart's content.
Ok :)
So two points up for discussion:
1) Sync repository layout? I don't see any point in leaving the tar based format, as reading of sync databases is not a bottleneck. (The local db format can be a bottleneck, but that is a separate discussion...)
Do we split the information in .db out of .files and add a .full db with complete information? Then any .src db could follow suit and just have source package information. How do we get around the out of sync issue (e.g., a package is removed from .db, but we have an old .files database with it). Do we add timestamps, and print a warning on -F operations when the two are out of sync?
Add a timestamp inside each database (*.db, *.files, *.src). When pacman downloads a database, instead of saving it as <repo>.<ext> and squashing the previous database, save it as <repo>-<timestamp>.<ext>.
Each refresh operation (pacman -Sy, pacman -Fy) is associated with a particular database (*.db and *.files, respectively). Create an untimestamped symlink to that database, e.g.
  $ pacman -Sy...
  # retrieve <repo>.db and save as <repo>-<timestamp_1>.db
  # ln -s <repo>-<timestamp_1>.db <repo>.db
  $ pacman -Fy...
  # retrieve <repo>.db and save as <repo>-<timestamp_2>.db
  # retrieve <repo>.files and save as <repo>-<timestamp_2>.files
  # ln -s <repo>-<timestamp_2>.files <repo>.files
  # something similar for *.src files
For operations that only involve the current <repo>.db files, no change is needed for loading the database. For loading <repo>.files, you will need to dereference <repo>.files first, grab <timestamp_2> from <repo>-<timestamp_2>.files in the example above, and then use it to load <repo>-<timestamp_2>.db instead of <repo>.db. Same method for *.src files.
For cleanup of the timestamped files, collect the valid timestamps from the untimestamped symlinks and then remove anything that doesn't match them. This should probably be done with each database refresh. Maybe you can use the same function that you use to clean up the package cache with -Sc while leaving installed packages.
Obviously there will be some redundancy in the up to 3 copies of <repo>-<timestamp>.db but I think that's better than e.g. breaking pkgfile searches after an upgrade.
With this approach you could also download the latest version of the sync databases as <repo>-<timestamp>.db without symlinking <repo>.db to it, and then use that to query upgradable packages and other info from the mirror.
For propagating the database to the servers, nothing changes. Whenever the database is updated, generate <repo>.db, <repo>.files, <repo>.src and whatever else at the same time with the same internal timestamp and then just push them out as usual.
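The symlink scheme above can be sketched as three small helpers: one that points <repo>.<ext> at a timestamped file, one that dereferences a symlink to find the .db with the same timestamp, and one that removes timestamped files no symlink refers to. This is an illustration only, assuming integer timestamps and a flat database directory; the function names are made up for this sketch and nothing here is existing pacman behaviour.

```python
import os
import re

# Matches names such as core-1494334484.db (timestamps assumed to be integers).
STAMPED = re.compile(r"^(?P<repo>.+)-(?P<stamp>\d+)\.(?P<ext>db|files|src)$")


def activate(dbdir, repo, stamp, ext):
    """Point <repo>.<ext> at <repo>-<stamp>.<ext>, replacing any old link."""
    link = os.path.join(dbdir, f"{repo}.{ext}")
    tmp = link + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(f"{repo}-{stamp}.{ext}", tmp)
    os.replace(tmp, link)  # renames the link itself, so the switch is atomic


def matching_db(dbdir, repo, ext):
    """Dereference <repo>.<ext> and return the .db with the same timestamp."""
    target = os.readlink(os.path.join(dbdir, f"{repo}.{ext}"))
    match = STAMPED.match(os.path.basename(target))
    if match is None:
        raise ValueError(f"{target} is not a timestamped database")
    return os.path.join(dbdir, f"{repo}-{match['stamp']}.db")


def cleanup(dbdir):
    """Remove timestamped files whose timestamp no symlink refers to."""
    live = set()
    for name in os.listdir(dbdir):
        path = os.path.join(dbdir, name)
        if os.path.islink(path):
            match = STAMPED.match(os.path.basename(os.readlink(path)))
            if match:
                live.add((match["repo"], match["stamp"]))
    removed = []
    for name in sorted(os.listdir(dbdir)):
        path = os.path.join(dbdir, name)
        match = STAMPED.match(name)
        if match and not os.path.islink(path) and (match["repo"], match["stamp"]) not in live:
            os.remove(path)
            removed.append(name)
    return removed
```

Because `cleanup` keeps every file sharing a timestamp with a live symlink, the <repo>-<timestamp_2>.db downloaded by a -Fy refresh survives even though <repo>.db still points at <timestamp_1>.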
2) Do we need a better (read "more easily maintainable") tool for handling database generation and updates? libalpm already can read in information package files, so we could add libalpm/db_write.c with the database creation functions. Should we unify our repo format with our local database format which we already write?
Yes for unification, preferably in a standardized format (e.g. yaml). Having the functionality to read and write the files in libalpm would be useful for third-party tool developers.
On 2017-05-10 12:54 -0400 Dave Reisner wrote:
WRT replacing repo-add, I'd suggest we come up with the use cases we want to support, design an interface to meet them, and then come up with the implementation. Might be nice to start with the Arch Linux repository layout as an example that we'd want to support (pooled packages with symlinks into repo dirs).
What about using a relative subpath instead of a filename in the database. That would enable transparent freeform repo layouts (e.g. pooled packages without symlinks, package groups in different subdirs, etc.). You could also avoid the need for subdirectories by adding the architecture to the database filename, e.g. <repo>.<arch>.<ext>
To simplify repo-add, you could include .SRCINFO directly to avoid parsing and reformatting/rewriting that metadata. Keep it as a separate file then add a new one (call it PKGINFO?) for information about the *.pkg.* file itself (build date, packager, signature, checksum, size, relative filepath, etc.). Add other files to contain related information (e.g. INSTALLINFO with install time, file list, install origin?). That way, each step copies existing files and adds a new one with the new info (repo-add: collect SRCINFO, add PKGINFO; install a package: copy SRCINFO AND PKGINFO to local db, create INSTALLINFO etc.)
A repo metadata file would also be required in the root directory with the repo timestamp for the timestamped databases described above. The file could also collect other metadata such as package providers and maybe replacements to speed up some operations.
Regards,
Xyne
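To illustrate the relative-subpath idea: if the database stored a path relative to the repository directory instead of a bare filename, a client could resolve the download URL with ordinary URL joining. A small sketch, where the URLs and the `package_url` helper are hypothetical:

```python
from urllib.parse import urljoin


def package_url(repo_url: str, location: str) -> str:
    """Resolve a package location (relative to the repo dir) to a full URL."""
    # A trailing slash makes urljoin treat repo_url as a directory.
    return urljoin(repo_url.rstrip("/") + "/", location)


# A bare filename behaves as it does today...
print(package_url("https://example.org/repo", "foo-1.0-1-any.pkg.tar.xz"))
# ...while a subpath can point into a shared pool without any symlinks.
print(package_url("https://example.org/repo", "../pool/foo-1.0-1-any.pkg.tar.xz"))
```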
Xyne wrote:
Obviously there will be some redundancy in the up to 3 copies of <repo>-<timestamp>.db but I think that's better than e.g. breaking pkgfile searches after an upgrade.
Just to expand on that, the worst case scenario leads to the same level of redundancy as we currently have with complete *.files databases, while the best case leads to no redundancy, all the while preserving the independence of pacman -S... and pacman -F... (and whatever else you want to add).
With this approach you could also download the latest version of the sync databases as <repo>-<timestamp>.db without symlinking <repo>.db to it, and then use that to query upgradable packages and other info from the mirror.
To make that work with my suggestion for cleaning up old timestamped databases, add a symlink named e.g. <repo>.future, <repo>.next or <repo>.remote. That could be used by e.g. checkupdates or pre-emptive package downloading scripts.
There may even be cases where the cleanup is unwanted, such as for a script that regularly downloads databases and upgradable packages to provide an incremental upgrade path at a later date (obviously regular updates are preferred, but maybe useful and reasonable in some rare cases).
In my previous reply, I had forgotten that pacman -Sc prompts for the database and pkgcache cleanups independently. Forget what I said about automatic cleanups. Offload that to pacman -Sc.
Regards,
Xyne
On Tue, 2017-05-09 at 22:54 +1000, Allan McRae wrote:
1) Sync repository layout? I don't see any point in leaving the tar based format, as reading of sync databases is not a bottleneck. (The local db format can be a bottleneck, but that is a separate discussion...)
Do we split the information in .db out of .files and add a .full db with complete information? Then any .src db could follow suit and just have source package information. How do we get around the out of sync issue (e.g., a package is removed from .db, but we have an old .files database with it). Do we add timestamps, and print a warning on -F operations when the two are out of sync?
Perhaps instead of timestamps, how about adding a .DBINFO file that includes a hash shared between the .db and .files databases (and perhaps the source db as well)? This way, when something checks the .files, you can tell if it doesn't match the .db (because in my opinion, the .db is more important, so that's what I would compare anything to).
I'm not really sure what good a .full db would do for us though. It just seems to me like extra stuff to download.
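A check along these lines could be sketched as follows, assuming each database is a tar archive containing a top-level .DBINFO member whose entire content is the shared value. What the hash is computed over is not specified in the thread, so it is treated as an opaque token here; `make_db` exists only to build tiny in-memory archives for the demonstration.

```python
import io
import tarfile


def read_dbinfo(archive: bytes) -> str:
    """Return the content of the .DBINFO member of a tar database."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:*") as tar:
        return tar.extractfile(".DBINFO").read().decode().strip()


def in_sync(db: bytes, files: bytes) -> bool:
    """True if both databases carry the same .DBINFO token."""
    return read_dbinfo(db) == read_dbinfo(files)


def make_db(token: str) -> bytes:
    """Build a minimal gzip-compressed tar holding only .DBINFO (demo helper)."""
    data = token.encode()
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(".DBINFO")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


if not in_sync(make_db("abc123"), make_db("def456")):
    print("warning: files database does not match the sync database")
```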
2) Do we need a better (read "more easily maintainable") tool for handling database generation and updates? libalpm already can read in information package files, so we could add libalpm/db_write.c with the database creation functions. Should we unify our repo format with our local database format which we already write?
I think this would be great. Especially the part of implementing something in libalpm to do this. It would allow projects like pyalpm or my own php-alpm to be used to also create repos.
I am looking for ideas here. Please brainstorm to your heart's content.
I know this is two months after the fact, but here's my take on it.
Mark
participants (5)
- Allan McRae
- Andrew Gregory
- Dave Reisner
- Mark Weiman
- Xyne