I'm a long-time user of Linux/Unix, but am relatively new to Arch. (I switched my home desktop to Manjaro a few years ago when I finally got completely fed up with Ubuntu/Canonical.) Lately, I've been looking at the pacman/alpm source code in relation to a couple of projects. I noticed this post a few days ago, and have an idea for a possible implementation. Since this is relatively new to me, I might be missing some key points and the proposal below might be entirely unworkable. If so, apologies in advance, and feel free to just ignore it.

I'll start by saying that I think the comment about "perfect being the enemy of the good" really "hits the nail on the head". (I've always enjoyed mixing metaphors :-). Other attempts at binary-diff updates that I've seen have fallen by the wayside. (Does Red Hat even use jigdo any more?) In general, rsync seems to be the only effective system that's widely used. But it doesn't work well with most compressed files, and it imposes a load on the server. Doing incremental updates at the granularity of files instead of data blocks seems like a really good idea.

One of the main things that influenced my decision to move to Arch is the fact that packages are distributed as simple tar files containing the package's directory structure plus a few simple metadata files. Even better, they are transparently labeled as such (foo.pkg.tar.zst). By contrast, rpm is a binary format that requires at least rpm2cpio to start with, and .deb files are 'ar' archives containing several tar archives. (A really obtuse decision that is unfortunately in line with Debian's general philosophy :-). I think that combining file-level granularity with the straightforward format of pacman archives could lead to a flexible scheme that wouldn't need specific "foo-q.r-s-to-foo-x.y-z.pkgdiff" files. It would also allow upgrading from several older versions, not just the previous one.
The scheme relies on the following assumptions:

A) The order of the files in the uncompressed pkg.tar file doesn't matter. I'm pretty sure this is correct. The alpm library seems to have its own general list-sorter, which is used when the order is critical (i.e., so that all the package-files are deleted from under a directory before the directory itself is considered). I ran a trial update of chromium from a randomized pkg.tar file that seems to have worked fine.

B) Pacman currently uses tar for the underlying archive and zstd for compression. This might change, but I'm assuming that the choices will always be "streamable" in some sense. Archives would never become something like zip files, which can't be processed at all without reading the entire file to the end.

C) A system that is interested in upgrading its 'foo' pkg from version q.r-s to the current version x.y-z still has a copy of foo-q.r-s.pkg.tar.zst in its package cache. (For simplicity, I'm leaving 'x86_64' out of file names.)

---

The core of the idea is to change the ordering of the package archives so that the files required for an upgrade will be at the beginning. Ideally, they would be ordered according to the package-version in which they were last changed. Each pkg.tar file would start with the files that are different in the current version than in the previous one. The next section would be those that are the same in the last two versions but differed in the one before that, etc. This would turn the tar file into something like a multi-volume incremental backup tape, except that all of the outdated files have been expunged. There would be no redundant files (and no way to restore previous versions). If I'm correct about (A), these files would work fine with current pacman/alpm software and be completely backwards compatible.
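To make the reordering step concrete, here is a rough sketch of how a repacking tool might do it. Everything here is made up for illustration ('prev' and 'curr' stand in for unpacked trees of foo-q.r-s and foo-x.y-z); the only real mechanism it relies on is that tar archives its command-line arguments in the order given, so listing the changed files first puts them at the front of the archive:

```shell
#!/bin/bash
# Sketch: build the new package tar so that files which differ from the
# previous version come first.  'prev' and 'curr' are hypothetical
# unpacked package trees; the file names are purely illustrative.
set -e
dir=$(mktemp -d); cd "$dir"

mkdir prev curr
echo "same" > prev/aa-stable.txt;  echo "v1" > prev/zz-updated.txt
echo "same" > curr/aa-stable.txt;  echo "v2" > curr/zz-updated.txt

changed=(); unchanged=()
while IFS= read -r f; do
    if cmp -s "prev/$f" "curr/$f"; then
        unchanged+=("$f")
    else
        changed+=("$f")   # differs from prev, or is new in curr
    fi
done < <(cd curr && find . -type f | sed 's|^\./||' | sort)

# tar preserves argument order, so the changed files land at the front
# of the archive even though they sort last alphabetically.
tar -cf curr.pkg.tar -C curr "${changed[@]}" "${unchanged[@]}"
tar -tf curr.pkg.tar
```

Listing the result shows zz-updated.txt first, ahead of aa-stable.txt, even though a plain sorted archive would have it last. A real tool would compare checksums from the .MTREE files rather than unpacked trees, and would need the multi-generation ordering described above, but the principle is the same.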
This would make it possible to do an incremental upgrade by retrieving just an initial part of the archive, which could be done by adding an 'HTTP range' header to the request. A client that wants to upgrade foo from q.r-s to x.y-z would need two additional pieces of information in order to try this. It would need the exact size N of an initial portion of the uncompressed foo-x.y-z.pkg.tar that contains all blocks for the x.y-z files that differ from the q.r-s ones. (This initial portion would, in fact, be a valid tar file.) It would also need a number Nz such that piping the first Nz bytes of the compressed pkg.tar.zst package file to zstd would produce at least N valid bytes of output before zstd failed upon receiving an early eof. If the client decides that Nz is sufficiently smaller than the full download size, it would do something like the following:

  curl --range 0-Nz https://server/path/to/foo-x.y-z.pkg.tar.zst \
    | zstd -d 2>/dev/null | head -c N > foo-x.y-z.pkg.part.tar

The client could make its decision on whether this is worthwhile based on its network bandwidth, computing resources, etc.

The pkg.part.tar file has the updated .MTREE and other metadata files. The client would compare the shasums from the x.y-z mtree to the ones from the q.r-s mtree in order to verify that pkg.part.tar really does have everything it needs to do an upgrade. In fact, it should be possible to just append blocks to the pkg.part.tar file in order to re-create a full foo-x.y-z.pkg.tar.zst that is byte-for-byte identical to the one on the server. This could then be verified, used to easily re-install the package, to upgrade other systems on the local network, etc. In order for this to work, the client would also need to know what versions of bsdtar and zstd were used to create the server's archive.
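To illustrate the claim that the initial portion is itself a valid tar file, here is a self-contained simulation, leaving the server and the zstd layer out of the picture and truncating the uncompressed tar directly. All names and the value of N are made up; a real client would get N (and Nz) from repo metadata and fetch the bytes with a ranged request as above. N is just the changed file's 512-byte header plus its data rounded up to a 512-byte block; the only liberty taken is appending the two zero end-of-archive blocks so that tar tools accept the truncated file without complaint:

```shell
#!/bin/bash
# Sketch: a truncated prefix of a (reordered) package tar is a usable
# tar file.  File names and sizes are illustrative only.
set -e
dir=$(mktemp -d); cd "$dir"

mkdir pkg
echo "updated in x.y-z" > pkg/changed.conf          # 17 bytes of data
head -c 4096 /dev/urandom > pkg/unchanged.bin       # stands in for old files
# Changed file first, per the proposed ordering; force plain ustar so
# the entry sizes below are exact.
tar --format=ustar -cf full.tar -C pkg changed.conf unchanged.bin

# N covers just the changed file: 512 (header) + 512 (17 bytes of data,
# rounded up to one block) = 1024.
N=1024
head -c "$N" full.tar > part.tar
# Pad with the two zero end-of-archive blocks.
dd if=/dev/zero bs=512 count=2 >> part.tar 2>/dev/null

tar -tf part.tar      # lists only changed.conf
tar -xf part.tar      # and it extracts cleanly
```

The untruncated remainder of full.tar is exactly what the client would later append (here, the blocks for unchanged.bin plus the trailer) to rebuild the complete archive from its cached copy of the old package.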
I now suspect that minor differences in the bsdtar version would usually be ok in this regard, but that zstd is likely to be more troublesome (as are a great many things that come from facebook :-).

There are various details to consider, and I have a couple of different ideas about how to compute and communicate appropriate values of 'N' and 'Nz', but I think I'll stop here. If you think this idea might have some merit, I'll be happy to continue working on it and run some tests. If not, it has been an interesting exercise that helped me figure out some things about the alpm library that I probably wouldn't have otherwise.

-Jeff Norden

PS: I also noticed that the link to this mailing list on the pacman "home page" (https://archlinux.org/pacman/#_mailing_list) needs to be updated. It points to the "pipermail" page, which is no longer current.

On 11/23/22 06:27, Allan McRae wrote:
> The idea of package deltas just won't go away...
> ...
> I wondered if this was a case of perfect being the enemy of good, so I
> have investigated a different, very lazy approach.  Instead of taking a
> binary diff, we could just provide the files that have changed between
> package versions.
> ...