I'm a long-time user of Linux/Unix, but am relatively new to Arch. (I switched my home desktop to Manjaro a few years ago when I finally got completely fed up with Ubuntu/Canonical.) Lately, I've been looking at the pacman/alpm source code in relation to a couple of projects. I noticed this post a few days ago, and have an idea for a possible implementation. Since this is relatively new to me, I might be missing some key points and the proposal below might be entirely unworkable. If so, apologies in advance, and feel free to just ignore it.

I'll start by saying that I think the comment about "perfect being the enemy of the good" really "hits the nail on the head". (I've always enjoyed mixing metaphors :-). Other attempts at binary-diff updates that I've seen have fallen by the wayside. (Does Red Hat even use jigdo any more?) In general, rsync seems to be the only effective system that's widely used. But it doesn't work well with most compressed files, and it imposes a load on the server. Doing incremental updates at the granularity of files instead of data blocks seems like a really good idea.

One of the main things that influenced my decision to move to Arch is the fact that packages are distributed as simple tar files containing the package's directory structure plus a few simple metadata files. Even better, they are transparently labeled as such (foo.pkg.tar.zst). By contrast, rpm is a binary format that requires at least rpm2cpio to start with, and .deb files are 'ar' archives containing several tar archives. (A really obtuse decision that is unfortunately in line with Debian's general philosophy :-). I think that combining file-level granularity with the straightforward format of pacman archives could lead to a flexible scheme that wouldn't need specific "foo-q.r-s-to-foo-x.y-z.pkgdiff" files. It would also allow upgrading from several older versions, not just the previous one.
The scheme relies on the following assumptions:

A) The order of the files in the uncompressed pkg.tar file doesn't matter. I'm pretty sure this is correct. The alpm library seems to have its own general list-sorter, which is used when the order is critical (i.e., so that all the package-files are deleted from under a directory before the directory itself is considered). I ran a trial update of chromium from a randomized pkg.tar file that seems to have worked fine.

B) Pacman currently uses tar for the underlying archive and zstd for compression. This might change, but I'm assuming that the choices will always be "streamable" in some sense. Archives would never become something like zip files, which can't be processed at all without reading the entire file to the end.

C) A system that is interested in upgrading its 'foo' pkg from version q.r-s to the current version x.y-z still has a copy of foo-q.r-s.pkg.tar.zst in its package cache. (For simplicity, I'm leaving 'x86_64' out of file names.)

---

The core of the idea is to change the ordering of the package archives so that the files required for an upgrade will be at the beginning. Ideally, they would be ordered according to the package-version in which they were last changed. Each pkg.tar file would start with the files that are different in the current version than in the previous one. The next section would be those that are the same in the last two versions but differed in the one before that, etc. This would turn the tar file into something like a multi-volume incremental backup tape, except that all of the outdated files have been expunged. There would be no redundant files (and no way to restore previous versions). If I'm correct about (A), these files would work fine with current pacman/alpm software and be completely backwards compatible.
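To make the reordering step concrete, here is a rough sketch of how a repacking tool might do it. Everything here is made up for illustration ('prev' and 'curr' stand in for unpacked trees of foo-q.r-s and foo-x.y-z); the only real mechanism it relies on is that tar archives its command-line arguments in the order given, so listing the changed files first puts them at the front of the archive:

```shell
#!/bin/bash
# Sketch: build the new package tar so that files which differ from the
# previous version come first.  'prev' and 'curr' are hypothetical
# unpacked package trees; the file names are purely illustrative.
set -e
dir=$(mktemp -d); cd "$dir"

mkdir prev curr
echo "same" > prev/aa-stable.txt;  echo "v1" > prev/zz-updated.txt
echo "same" > curr/aa-stable.txt;  echo "v2" > curr/zz-updated.txt

changed=(); unchanged=()
while IFS= read -r f; do
    if cmp -s "prev/$f" "curr/$f"; then
        unchanged+=("$f")
    else
        changed+=("$f")   # differs from prev, or is new in curr
    fi
done < <(cd curr && find . -type f | sed 's|^\./||' | sort)

# tar preserves argument order, so the changed files land at the front
# of the archive even though they sort last alphabetically.
tar -cf curr.pkg.tar -C curr "${changed[@]}" "${unchanged[@]}"
tar -tf curr.pkg.tar
```

Listing the result shows zz-updated.txt first, ahead of aa-stable.txt, even though a plain sorted archive would have it last. A real tool would compare checksums from the .MTREE files rather than unpacked trees, and would need the multi-generation ordering described above, but the principle is the same.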
This would make it possible to do an incremental upgrade by retrieving just an initial part of the archive, which could be done by adding an 'HTTP range' header to the request. A client that wants to upgrade foo from q.r-s to x.y-z would need two additional pieces of information in order to try this. It would need the exact size N of an initial portion of the uncompressed foo-x.y-z.pkg.tar that contains all blocks for the x.y-z files that differ from the q.r-s ones. (This initial portion would, in fact, be a valid tar file.) It would also need a number Nz such that piping the first Nz bytes of the compressed pkg.tar.zst package file to zstd would produce at least N valid bytes of output before zstd failed upon receiving an early eof. If the client decides that Nz is sufficiently smaller than the full download size, it would do something like the following:

  curl --range 0-Nz https://server/path/to/foo-x.y-z.pkg.tar.zst \
    | zstd -d 2>/dev/null | head -c N > foo-x.y-z.pkg.part.tar

The client could make its decision on whether this is worthwhile based on its network bandwidth, computing resources, etc.

The pkg.part.tar file has the updated .MTREE and other metadata files. The client would compare the shasums from the x.y-z mtree to the ones from the q.r-s mtree in order to verify that pkg.part.tar really does have everything it needs to do an upgrade. In fact, it should be possible to just append blocks to the pkg.part.tar file in order to re-create a full foo-x.y-z.pkg.tar.zst that is byte-for-byte identical to the one on the server. This could then be verified, used to easily re-install the package, to upgrade other systems on the local network, etc. In order for this to work, the client would also need to know what versions of bsdtar and zstd were used to create the server's archive.
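To illustrate the claim that the initial portion is itself a valid tar file, here is a self-contained simulation, leaving the server and the zstd layer out of the picture and truncating the uncompressed tar directly. All names and the value of N are made up; a real client would get N (and Nz) from repo metadata and fetch the bytes with a ranged request as above. N is just the changed file's 512-byte header plus its data rounded up to a 512-byte block; the only liberty taken is appending the two zero end-of-archive blocks so that tar tools accept the truncated file without complaint:

```shell
#!/bin/bash
# Sketch: a truncated prefix of a (reordered) package tar is a usable
# tar file.  File names and sizes are illustrative only.
set -e
dir=$(mktemp -d); cd "$dir"

mkdir pkg
echo "updated in x.y-z" > pkg/changed.conf          # 17 bytes of data
head -c 4096 /dev/urandom > pkg/unchanged.bin       # stands in for old files
# Changed file first, per the proposed ordering; force plain ustar so
# the entry sizes below are exact.
tar --format=ustar -cf full.tar -C pkg changed.conf unchanged.bin

# N covers just the changed file: 512 (header) + 512 (17 bytes of data,
# rounded up to one block) = 1024.
N=1024
head -c "$N" full.tar > part.tar
# Pad with the two zero end-of-archive blocks.
dd if=/dev/zero bs=512 count=2 >> part.tar 2>/dev/null

tar -tf part.tar      # lists only changed.conf
tar -xf part.tar      # and it extracts cleanly
```

The untruncated remainder of full.tar is exactly what the client would later append (here, the blocks for unchanged.bin plus the trailer) to rebuild the complete archive from its cached copy of the old package.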
I now suspect that minor differences in the bsdtar version would usually be ok in this regard, but that zstd is likely to be more troublesome (as are a great many things that come from facebook :-).

There are various details to consider, and I have a couple of different ideas about how to compute and communicate appropriate values of 'N' and 'Nz', but I think I'll stop here. If you think this idea might have some merit, I'll be happy to continue working on it and run some tests. If not, it has been an interesting exercise that helped me figure out some things about the alpm library that I probably wouldn't have otherwise.

-Jeff Norden

PS: I also noticed that the link to this mailing list on the pacman "home page" (https://archlinux.org/pacman/#_mailing_list) needs to be updated. It points to the "pipermail" page, which is no longer current.

On 11/23/22 06:27, Allan McRae wrote:
> The idea of package deltas just won't go away...
> ...
> I wondered if this was a case of perfect being the enemy of good, so I
> have investigated a different, very lazy approach.  Instead of taking a
> binary diff, we could just provide the files that have changed between
> package versions.
> ...