On 11/23/22 at 10:27pm, Allan McRae wrote:
The idea of package deltas just won't go away... However, binary diffs really are not ideal with pacman verifying the compressed package - that means we need to reconstruct the package on the user's system to verify. Also our old approach using xdelta3 somewhat died when moving packages away from gz (or xz?) compression. Other binary diff approaches really suffered the same issue. In general, I find the approach of reconstructing the full package to be suboptimal. I also don't particularly want to verify uncompressed packages.
I wondered if this was a case of perfect being the enemy of good, so I have investigated a different, very lazy approach. Instead of taking a binary diff, we could just provide the files that have changed between package versions. This is super easy to do as we have checksums for all files in the mtree file. We could then extract this "diff" package directly, and use the mtree file to adjust timestamps/permissions/etc(?) on kept files, and it would be just like the full package had been installed.
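A rough sketch of that file selection in Python, to check my understanding. It assumes mtree entries of the form `path key=value ...` with a `sha256digest` keyword on regular files; the function names and the sample entries are made up for illustration, and a real .MTREE parser would also need to handle `/set` defaults and escaped paths.

```python
def parse_digests(mtree_text):
    """Map path -> sha256digest for entries that carry one (regular files)."""
    digests = {}
    for line in mtree_text.splitlines():
        fields = line.split()
        # Skip blank lines, comments, and /set or /unset directives.
        if not fields or fields[0].startswith(("#", "/")):
            continue
        attrs = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
        if "sha256digest" in attrs:
            digests[fields[0]] = attrs["sha256digest"]
    return digests


def files_for_diff(old_mtree, new_mtree):
    """Paths in the new package that are new or whose checksum changed."""
    old = parse_digests(old_mtree)
    new = parse_digests(new_mtree)
    return sorted(p for p, d in new.items() if old.get(p) != d)


# Hypothetical sample data, not taken from a real package.
OLD = """#mtree
/set type=file uid=0 gid=0 mode=644
./usr/bin/foo time=1.0 mode=755 size=10 sha256digest=aaa
./usr/share/doc/foo/README time=1.0 size=5 sha256digest=bbb
"""
NEW = """#mtree
/set type=file uid=0 gid=0 mode=644
./usr/bin/foo time=2.0 mode=755 size=12 sha256digest=ccc
./usr/share/doc/foo/README time=2.0 size=5 sha256digest=bbb
./usr/share/foo/new.txt time=2.0 size=3 sha256digest=ddd
"""

print(files_for_diff(OLD, NEW))
```

Note that README is left out of the diff even though its timestamp changed, which is the case where the mtree file would be used to fix up metadata on kept files.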
As I understand your intended approach, operations using a diff package would be fundamentally different than those involving a full package. Files changed on the system but unchanged in the package would not be restored. Once upgraded, the cached diff package would be useless for reinstallation/downgrading without downgrading to the previous version first then upgrading again using the diff. `pacman -S foo` to reinstall would no longer work without downloading the full package. It's not ideal, but I think those are reasonable caveats. People generally shouldn't be messing with non-backup files anyway and as long as they manage their cache properly, reinstallation and downgrading using the cache are still possible.
I ran some numbers to see if this was worthwhile. The results for the last bunch of updates for bash, coreutils, qt5-base and systemd are given here: https://wiki.archlinux.org/title/User:Allan/Pkgdiff
On major version updates, this approach is a waste of time. But for minor updates bash download would average 25% of the size, coreutils about 36% (though was ~1% for simple rebuilds!), qt5-base about 40% and systemd 60%. Not shown but worth noting that when Arch changes gcc/binutils versions or updates CFLAGS etc, this can stop any binary diff being as useful.
If we implemented using these diffs but only allowed it for updates from the previous package version (i.e. no diffs to package (current - 2) or earlier, or diff chaining), then this would be rather simple to implement (at least from the pacman side...).
I agree with no diff chaining; keeping them as separate partial packages instead of reconstructing a full package would make chaining a little complicated. I'm not sure about the previous-version-only rule though. The db is going to have to know the base version for the partial package either way, so the cost of supporting multiple bases seems low as far as we're concerned; just a simple search through the available partial files for one based on the currently installed version.
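Roughly what I have in mind for that search, as a Python sketch. The `Partial` record, its fields, and the file names are all hypothetical; none of this reflects an existing db format or libalpm API.

```python
from typing import NamedTuple, Optional, Sequence


class Partial(NamedTuple):
    base: str      # installed version the partial package applies on top of
    target: str    # version the system ends up at after applying it
    filename: str  # hypothetical file name, for illustration only


def pick_partial(partials: Sequence[Partial], installed: str,
                 target: str) -> Optional[Partial]:
    """Return the partial based on the installed version, or None.

    No chaining: a partial is only usable if it goes directly from the
    installed version to the target version.
    """
    for p in partials:
        if p.base == installed and p.target == target:
            return p
    return None  # caller falls back to downloading the full package


available = [
    Partial("1.0-1", "1.0-3", "foo-1.0-1_to_1.0-3.pkgdiff"),
    Partial("1.0-2", "1.0-3", "foo-1.0-2_to_1.0-3.pkgdiff"),
]

print(pick_partial(available, installed="1.0-1", target="1.0-3"))
print(pick_partial(available, installed="0.9-1", target="1.0-3"))
```

With previous-version-only, `available` would just never hold more than one entry per target, so the lookup itself wouldn't change.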