RFC: Simple package diff
The idea of package deltas just won't go away... However, binary diffs really are not ideal with pacman verifying the compressed package - that means we need to reconstruct the package on the user's system to verify. Also, our old approach using xdelta3 somewhat died when moving packages away from gz (or xz?) compression. Other binary diff approaches really suffered the same issue. In general, I find the approach of reconstructing the full package to be suboptimal. I also don't particularly want to verify uncompressed packages.

I wondered if this was a case of perfect being the enemy of good, so I have investigated a different, very lazy approach. Instead of taking a binary diff, we could just provide the files that have changed between package versions. This is super easy to do, as we have checksums for all files in the mtree file. We could then extract this "diff" package directly, and use the mtree file to adjust timestamps/permissions/etc(?) on kept files, and it would be just like the full package had been installed. (A rough sketch of this comparison is given below.)

I ran some numbers to see if this was worthwhile. The results for the last bunch of updates for bash, coreutils, qt5-base and systemd are given here:
https://wiki.archlinux.org/title/User:Allan/Pkgdiff

On major version updates, this approach is a waste of time. But for minor updates, the bash download would average 25% of the size, coreutils about 36% (though was ~1% for simple rebuilds!), qt5-base about 40% and systemd 60%. Not shown, but worth noting, is that when Arch changes gcc/binutils versions or updates CFLAGS etc., this can stop any binary diff from being as useful.

If we implemented these diffs but only allowed them for updates from the previous package version (i.e. no diffs to package (current - 2) or earlier, and no diff chaining), then this would be rather simple to implement (at least from the pacman side...).

Any comments before I invest time implementing?

Allan
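As a rough illustration of the mtree comparison sketched above (assuming the usual gzip-compressed .MTREE with sha256digest fields; the package filenames are placeholders):

    # sketch only: list (path, sha256digest) pairs from each package's .MTREE,
    # then keep the paths whose digest is new or has changed
    for p in old new; do
        bsdtar -xOf "foo-$p.pkg.tar.zst" .MTREE | zcat |
            awk '{ for (i = 2; i <= NF; i++)
                       if ($i ~ /^sha256digest=/) print $1, $i }' |
            sort > "$p.digests"
    done
    # lines only in the new list = files to ship in the "diff" package
    comm -13 old.digests new.digests | cut -d' ' -f1 > changed-files

Files deleted between versions fall out of the same comparison in the other direction (comm -23), and the diff package would carry the full new .MTREE regardless, so the client has everything needed to fix up the kept files.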
On 11/23/22 at 10:27pm, Allan McRae wrote:
The idea of package deltas just won't go away... However, binary diffs really are not ideal with pacman verifying the compressed package - that means we need to reconstruct the package on the user's system to verify. Also, our old approach using xdelta3 somewhat died when moving packages away from gz (or xz?) compression. Other binary diff approaches really suffered the same issue. In general, I find the approach of reconstructing the full package to be suboptimal. I also don't particularly want to verify uncompressed packages.

I wondered if this was a case of perfect being the enemy of good, so I have investigated a different, very lazy approach. Instead of taking a binary diff, we could just provide the files that have changed between package versions. This is super easy to do, as we have checksums for all files in the mtree file. We could then extract this "diff" package directly, and use the mtree file to adjust timestamps/permissions/etc(?) on kept files, and it would be just like the full package had been installed.
As I understand your intended approach, operations using a diff package would be fundamentally different than those involving a full package. Files changed on the system but unchanged in the package would not be restored. Once upgraded, the cached diff package would be useless for reinstallation/downgrading without downgrading to the previous version first, then upgrading again using the diff. `pacman -S foo` to reinstall would no longer work without downloading the full package.

It's not ideal, but I think those are reasonable caveats. People generally shouldn't be messing with non-backup files anyway, and as long as they manage their cache properly, reinstallation and downgrading using the cache are still possible.
I ran some numbers to see if this was worthwhile. The results for the last bunch of updates for bash, coreutils, qt5-base and systemd are given here:
https://wiki.archlinux.org/title/User:Allan/Pkgdiff
On major version updates, this approach is a waste of time. But for minor updates, the bash download would average 25% of the size, coreutils about 36% (though was ~1% for simple rebuilds!), qt5-base about 40% and systemd 60%. Not shown, but worth noting, is that when Arch changes gcc/binutils versions or updates CFLAGS etc., this can stop any binary diff from being as useful.
If we implemented these diffs but only allowed them for updates from the previous package version (i.e. no diffs to package (current - 2) or earlier, and no diff chaining), then this would be rather simple to implement (at least from the pacman side...).
I agree with no diff chaining; keeping them as separate partial packages instead of reconstructing a full package would make chaining a little complicated. I'm not sure about the previous-version-only rule though. The db is going to have to know the base version for the partial package either way, so the cost of supporting multiple bases seems low as far as we're concerned; just a simple search through the available partial files for one based on the currently installed version.
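For illustration, that search could be as simple as the following sketch, assuming a hypothetical naming scheme of <name>-<base>_to_<new>-<arch>.pkgdiff.tar.zst for the partial packages (the versions shown are placeholders):

    # sketch: pick a usable partial package for the installed -> target jump
    name=bash
    arch=x86_64
    installed=5.1.016-2     # version recorded in the local db
    target=5.1.016-4        # version offered by the sync db
    for f in "$name"-*"_to_${target}-${arch}.pkgdiff.tar.zst"; do
        base=${f#"$name"-}
        base=${base%%_to_*}
        if [ "$base" = "$installed" ]; then
            echo "usable partial package: $f"
            break
        fi
    done

If nothing matches, the client simply falls back to downloading the full package.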
On 24/11/22 04:51, Andrew Gregory wrote:
On 11/23/22 at 10:27pm, Allan McRae wrote:
The idea of package deltas just won't go away... However, binary diffs really are not ideal with pacman verifying the compressed package - that means we need to reconstruct the package on the user's system to verify. Also, our old approach using xdelta3 somewhat died when moving packages away from gz (or xz?) compression. Other binary diff approaches really suffered the same issue. In general, I find the approach of reconstructing the full package to be suboptimal. I also don't particularly want to verify uncompressed packages.

I wondered if this was a case of perfect being the enemy of good, so I have investigated a different, very lazy approach. Instead of taking a binary diff, we could just provide the files that have changed between package versions. This is super easy to do, as we have checksums for all files in the mtree file. We could then extract this "diff" package directly, and use the mtree file to adjust timestamps/permissions/etc(?) on kept files, and it would be just like the full package had been installed.
As I understand your intended approach, operations using a diff package would be fundamentally different than those involving a full package. Files changed on the system but unchanged in the package would not be restored. Once upgraded, the cached diff package would be useless for reinstallation/downgrading without downgrading to the previous version first, then upgrading again using the diff. `pacman -S foo` to reinstall would no longer work without downloading the full package.

It's not ideal, but I think those are reasonable caveats. People generally shouldn't be messing with non-backup files anyway, and as long as they manage their cache properly, reinstallation and downgrading using the cache are still possible.
That is correct. I thought about special-casing "pacman -S pkg" when it is a reinstall to always use the full package. But maybe a flag is better.
I ran some numbers to see if this was worthwhile. The results for the last bunch of updates for bash, coreutils, qt5-base and systemd are given here:
https://wiki.archlinux.org/title/User:Allan/Pkgdiff

On major version updates, this approach is a waste of time. But for minor updates, the bash download would average 25% of the size, coreutils about 36% (though was ~1% for simple rebuilds!), qt5-base about 40% and systemd 60%. Not shown, but worth noting, is that when Arch changes gcc/binutils versions or updates CFLAGS etc., this can stop any binary diff from being as useful.

If we implemented these diffs but only allowed them for updates from the previous package version (i.e. no diffs to package (current - 2) or earlier, and no diff chaining), then this would be rather simple to implement (at least from the pacman side...).
I agree with no diff chaining; keeping them as separate partial packages instead of reconstructing a full package would make chaining a little complicated. I'm not sure about the previous-version-only rule though. The db is going to have to know the base version for the partial package either way, so the cost of supporting multiple bases seems low as far as we're concerned; just a simple search through the available partial files for one based on the currently installed version.
This was a thought to keep the database from ballooning in size. I guess the minimum per pkgdiff is a filename and sha256sum - probably a signature too if --include-sigs is used (a hypothetical entry is sketched below). Also, hosting the pkgdiffs would get big if many past versions were offered - I guess that is a distro problem. In practice, I guess distros only providing pkgdiffs up to a threshold of a package's size, and only within a given update window, would remove this as an issue.

Allan
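For a sense of scale, the extra db entry might look something like the following. The field names and layout here are purely hypothetical, just mimicking the %FIELD% style of the existing sync db desc entries:

    %PKGDIFF%
    bash-5.1.016-2_to_5.1.016-3-x86_64.pkgdiff.tar.zst

    %PKGDIFF_BASE%
    5.1.016-2

    %PKGDIFF_SHA256SUM%
    <sha256 of the pkgdiff file>

That is on the order of 150 bytes per offered diff, plus a signature if those are included.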
I'm a long-time user of Linux/unix, but am relatively new to Arch. (I switched my home desktop to manjaro a few years ago when I finally got completely fed up with ubuntu/canonical.) Lately, I've been looking at the pacman/alpm source code in relation to a couple of projects. I noticed this post a few days ago, and have an idea for a possible implementation. Since this is relatively new to me, I might be missing some key points and the proposal below might be entirely unworkable. If so, apologies in advance, and feel free to just ignore it.

I'll start by saying that I think the comment about "perfect being the enemy of the good" really "hits the nail on the head". (I've always enjoyed mixing metaphors :-). Other attempts at binary-diff updates that I've seen have fallen by the wayside. (Does redhat even use jigdo any more?). In general, rsync seems to be the only effective system that's widely used. But it doesn't work well with most compressed files, and imposes a load on the server. Doing incremental updates at the granularity of files instead of data blocks seems like a really good idea.

One of the main things that influenced my decision to move to Arch is the fact that packages are distributed as simple tar files containing the package's directory structure plus a few simple metadata files. Even better, they are transparently labeled as such (foo.pkg.tar.zst). By contrast, rpm is a binary format that requires at least rpm2cpio to start with. And .deb files are 'ar' archives containing several tar archives. (A really obtuse decision that is unfortunately in line with debian's general philosophy :-).

I think that combining file-level granularity with the straightforward format of pacman archives could lead to a flexible scheme that wouldn't need specific "foo-q.r-s-to-foo-x.y-z.pkgdiff" files. It would also allow upgrading from several older versions, not just the previous one. The scheme relies on the following assumptions:

A) The order of the files in the uncompressed pkg.tar file doesn't matter. I'm pretty sure this is correct. The alpm library seems to have its own general list-sorter, which is used when the order is critical (i.e., so that all the package files are deleted from under a directory before the directory itself is considered). I ran a trial update of chromium from a randomized pkg.tar file, which seems to have worked fine.

B) Pacman currently uses tar for the underlying archive and zstd for compression. This might change, but I'm assuming that the choices will always be "streamable" in some sense. Archives would never become something like zip files, which can't be processed at all without reading the entire file to the end.

C) A system that is interested in upgrading its 'foo' pkg from version q.r-s to the current version x.y-z still has a copy of foo-q.r-s.pkg.tar.zst in its package cache. (For simplicity, I'm leaving 'x86_64' out of file names.)

---

The core of the idea is to change the ordering of the package archives so that the files required for an upgrade will be at the beginning. Ideally, they would be ordered according to the package version in which they were last changed. Each pkg.tar file would start with files that are different in the current version than in the previous one. The next section would be those that are the same in the last two versions but differed in the one previous to that, etc. This would turn the tar file into something like a multi-volume incremental backup tape, except that all of the outdated files have been expunged. There would be no redundant files (and no way to restore previous versions). If I'm correct about (A), these files would work fine with current pacman/alpm software and be completely backwards compatible.

This would make it possible to do an incremental upgrade by retrieving just an initial part of the archive, which could be done by adding an 'HTTP range' header to the request. A client that wants to upgrade foo from q.r-s to x.y-z would need to have two additional pieces of information in order to try this. It would need the exact size N of an initial portion of the uncompressed foo-x.y-z.pkg.tar that contains all blocks for the x.y-z files that differ from the q.r-s ones. (This initial portion would, in fact, be a valid tar file.) It would also need a number Nz such that piping the first Nz bytes of the compressed pkg.tar.zst package file to zstd would produce at least N valid bytes of output before zstd failed upon receiving an early eof.

If the client decides that Nz is sufficiently smaller than the full download size, it would do something like the following:

    curl --range 0-Nz https://server/path/to/foo-x.y-z.pkg.tar.zst \
        | zstd -d 2>/dev/null | head -c N > foo-x.y-z.pkg.part.tar

The client could make its decision on whether this is worthwhile based on its network bandwidth, computing resources, etc.

The pkg.part.tar file has the updated .MTREE and other metadata files. The client would compare the shasums from the x.y-z mtree to the ones from the q.r-s mtree in order to verify that pkg.part.tar really does have everything that it needs to do an upgrade. In fact, it should be possible to just append blocks to the pkg.part.tar file in order to re-create a full foo-x.y-z.pkg.tar.zst that is byte-for-byte identical to the one on the server. This could then be verified, used to easily re-install the package, to upgrade other systems on the local network, etc. In order for this to work, the client would also need to know what versions of bsdtar and zstd were used to create the server's archive. I now suspect that minor differences in the bsdtar version would usually be ok in this regard, but that zstd is likely to be more troublesome (as are a great many things that come from facebook :-).

There are various details to consider, and I have a couple of different ideas about how to compute and communicate appropriate values of 'N' and 'Nz' (one possibility is sketched after this message), but I think I'll stop here. If you think this idea might have some merit, I'll be happy to continue working on it, and run some tests. If not, it has been an interesting exercise that helped me to figure out some things about the alpm library that I probably wouldn't have otherwise.

-Jeff Norden

PS: Also, I noticed that the link to this mailing list on the pacman "home page" (https://archlinux.org/pacman/#_mailing_list) needs to be updated. It points to the "pipermail" page that is no longer current.

On 11/23/22 06:27, Allan McRae wrote:
The idea of package deltas just won't go away... ... I wondered if this was a case of perfect being the enemy of good, so I have investigated a different, very lazy approach. Instead of taking a binary diff, we could just provide the files that have changed between package versions. ...
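One compression-agnostic way to compute the Nz that Jeff mentions (a sketch only, with hypothetical filenames): because a longer compressed prefix can only ever decode to more output, the smallest workable Nz can be found by binary search when the package is built:

    # sketch: smallest compressed prefix whose decompression yields >= N bytes
    pkg=foo-x.y-z.pkg.tar.zst            # hypothetical filename
    N=${1:?uncompressed bytes needed}    # known at build time
    lo=1
    hi=$(wc -c < "$pkg")
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi) / 2 ))
        # a truncated stream still emits the blocks it could decode before failing
        got=$(head -c "$mid" "$pkg" | zstd -d 2>/dev/null | wc -c)
        if [ "$got" -ge "$N" ]; then hi=$mid; else lo=$((mid + 1)); fi
    done
    echo "Nz=$lo"

Treating the decompressor as a black box means the same loop would work for any compression method the repo uses, at the cost of O(log size) decompression runs at build time.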
On 7/12/22 03:51, Jeff Norden wrote:
<snip>
The core of the idea is to change the ordering of the package archives so that the files required for an upgrade will be at the beginning. Ideally, they would be ordered according to the package version in which they were last changed. Each pkg.tar file would start with files that are different in the current version than in the previous one. The next section would be those that are the same in the last two versions but differed in the one previous to that, etc. This would turn the tar file into something like a multi-volume incremental backup tape, except that all of the outdated files have been expunged. There would be no redundant files (and no way to restore previous versions). If I'm correct about (A), these files would work fine with current pacman/alpm software and be completely backwards compatible. This would make it possible to do an incremental upgrade by retrieving just an initial part of the archive, which could be done by adding an 'HTTP range' header to the request.

A client that wants to upgrade foo from q.r-s to x.y-z would need to have two additional pieces of information in order to try this. It would need the exact size N of an initial portion of the uncompressed foo-x.y-z.pkg.tar that contains all blocks for the x.y-z files that differ from the q.r-s ones. (This initial portion would, in fact, be a valid tar file.) It would also need a number Nz such that piping the first Nz bytes of the compressed pkg.tar.zst package file to zstd would produce at least N valid bytes of output before zstd failed upon receiving an early eof.
If the client decides that Nz is sufficiently smaller than the full download size, it would do something like the following:
    curl --range 0-Nz https://server/path/to/foo-x.y-z.pkg.tar.zst \
        | zstd -d 2>/dev/null | head -c N > foo-x.y-z.pkg.part.tar
The client could make its decision on whether this is worthwhile based on its network bandwidth, computing resources, etc.
<snip>

Thanks for the very detailed suggestion. I think the downfall is how to calculate the size needed to download. Pacman/makepkg supports at least 9 compression methods, and I don't think this will be an easy calculation for any of them.

Allan