new Delta Update support

Ruben Kelevra cyrond at gmail.com
Sat May 7 23:10:58 UTC 2022


Hey guys,

pacman previously had delta update functionality, which was (to my
understanding) removed because of safety concerns and poor performance.

Since the old delta implementation was written (and removed), zstd has
gained the ability to create delta patches.

I was running some tests on that, and it seems super promising in terms of
performance for creation as well as applying those patches, and the
deduplication ratio is phenomenal.

I created the patches on an arch-vm running on an AMD EPYC 7702P with 10
unshared cores and 32 GB memory, the storage is an SSD RAID10.

# Example: Widelands

Version 1.0-1 is 731M, and it received 4 successive updates within ~1 year
with only the pkgrel increasing.

Total for the 4 additional versions is 2.9G.

Creating all 4 delta patches:
1->5
2->5
3->5
4->5

Creating them takes a total of real 0m24.383s, user 0m19.814s, sys
0m5.859s, and all patches combined come to 39M.
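
A minimal sketch of how such patches can be created with the zstd
command-line tool, wrapped in Python. `--patch-from` and `--long` are
existing zstd options; the helper names, the file names, and the choice of
`--long=31` are assumptions for illustration and not necessarily the exact
settings used for the numbers above.

```python
import subprocess


def build_patch_cmd(old_tar, new_tar, patch_file, window_log=31):
    """Return the zstd command that encodes new_tar as a patch against old_tar."""
    return [
        "zstd",
        f"--patch-from={old_tar}",  # old uncompressed archive used as reference
        f"--long={window_log}",     # large window so the whole reference is usable
        new_tar,
        "-o", patch_file,
    ]


def patch_commands(old_tars, new_tar):
    """One command per old version, all targeting the same new version."""
    # The "<old>-to-<new>.zst" naming scheme is hypothetical.
    return [
        build_patch_cmd(old, new_tar, f"{old}-to-{new_tar}.zst")
        for old in old_tars
    ]


def create_patches(old_tars, new_tar):
    """Run zstd once per old version (requires the zstd binary on PATH)."""
    for cmd in patch_commands(old_tars, new_tar):
        subprocess.run(cmd, check=True)
```

For the Widelands case this would be called with the four older tar
archives as `old_tars` and the newest one as `new_tar`.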

I tested applying them on a very old netbook (the slowest device I have on
hand): 2 GB memory, Intel Atom x5-Z8300, and Arch on an MMC-SSD. The 4->5
patch, for example, takes real 1m32,850s user 0m2,325s sys 0m16,562s.

On a more modern system (Intel Core i5-1135G7 / 16 GB memory / NVMe SSD)
this takes just real 0m1.208s user 0m0.255s sys 0m0.748s.

Downloading the full package over LTE, on the other hand, takes 9m22s;
downloading just the delta patch takes 0m5s.
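
To put the timings side by side, here is a small worked calculation. It
assumes the same LTE link for both devices and ignores any time needed to
decompress the cached package first, so it is a rough comparison only. The
variable names are made up for illustration.

```python
def seconds(minutes, secs):
    """Convert a 'XmY.Zs' style timing into seconds."""
    return minutes * 60 + secs


full_download = seconds(9, 22)      # full package over LTE
delta_download = seconds(0, 5)      # delta patch over LTE
apply_netbook = seconds(1, 32.850)  # applying 4->5 on the Atom netbook
apply_modern = seconds(0, 1.208)    # applying 4->5 on the i5 laptop

# Download plus apply time for the delta path on each device.
netbook_total = delta_download + apply_netbook
modern_total = delta_download + apply_modern

# How many times faster the delta path is than the full download.
netbook_speedup = full_download / netbook_total
modern_speedup = full_download / modern_total
```

Even on the netbook, download plus patching (about 98 seconds) stays well
below the 562 seconds for the full download.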

# Other examples

While I chose a fairly big package with (obviously) few changes between the
pkgrel versions, here are some other examples:

libreoffice-fresh-7.2.5-5 to 7.3.0-1 saving 40%
libreoffice-fresh-7.3.0-1 to 7.3.0-2 saving 45%
0ad-data-a25-1 to a25-2 saving 99.96% *
0ad-data-a25-2 to a25.b-1 saving 99.84% *
glibc-2.35-3 to 2.35-4 saving 70%
glibc-2.32-5 to 2.35-1 saving 51.8%
opencv-4.5.5-4 to 4.5.5-5 saving 93.3%
opencv-4.5.4-9 to 4.5.5-1 saving 37%

* I had to split the tar archive after 1.6 GB and make patches for each
part, since zstd can only handle 2 GB files.
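
A minimal sketch of the splitting step, assuming the parts are cut at a
fixed byte offset and patched one by one afterwards. The function name, the
`.partNNN` suffix, and the default of 1600 MiB are hypothetical; the split
used for the numbers above may have been done differently.

```python
import os


def split_file(path, chunk_size=1600 * 1024**2, buf_size=1024**2):
    """Split path into path.part000, path.part001, ... and return their names.

    Every part holds at most chunk_size bytes; data is copied in blocks of
    at most buf_size bytes so large archives never sit in memory at once.
    """
    parts = []
    index = 0
    with open(path, "rb") as src:
        while True:
            part_name = f"{path}.part{index:03d}"
            written = 0
            with open(part_name, "wb") as dst:
                while written < chunk_size:
                    block = src.read(min(buf_size, chunk_size - written))
                    if not block:
                        break  # end of the source file
                    dst.write(block)
                    written += len(block)
            if written == 0:
                os.remove(part_name)  # nothing left, drop the empty part
                break
            parts.append(part_name)
            index += 1
            if written < chunk_size:
                break  # short part means the source is exhausted
    return parts
```

Old and new archive would both be split the same way, and a patch created
for each pair of parts with the same index.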

# What would need to be done, to get this going?

Well, the packages in the repo need to get a second signature, for the
uncompressed tar package. Pacman could first try to fetch this tar
signature. If it's on the server, the server supports delta updates – if
not, the full update would be loaded.
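
The probing logic described above could look roughly like this. It is a
sketch only: `fetch` stands in for whatever download routine is used and is
assumed to return `None` when the file is missing on the server, and the
`.pkg.tar.sig` naming for the tar signature is a hypothetical convention,
not an existing pacman one.

```python
from typing import Callable, Optional


def tar_signature_name(pkg_file: str) -> str:
    """Map 'foo-1.0-1-x86_64.pkg.tar.zst' to 'foo-1.0-1-x86_64.pkg.tar.sig'."""
    marker = ".pkg.tar"
    idx = pkg_file.rfind(marker)
    if idx == -1:
        raise ValueError(f"not a package file name: {pkg_file}")
    return pkg_file[: idx + len(marker)] + ".sig"


def choose_update_path(fetch: Callable[[str], Optional[bytes]], pkg_file: str) -> str:
    """Return 'delta' if the server offers a tar signature, else 'full'."""
    tar_sig = fetch(tar_signature_name(pkg_file))
    return "delta" if tar_sig is not None else "full"
```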

The database files would need to be extended to include the checksums for
the uncompressed tar archives as well as the signatures.

Now pacman can fetch the patch file that leads from its stored version to
the latest version, decompress the package stored locally, and apply the
patch file to it. Then pacman would check the signature/checksum of the
resulting tar archive, read it, and discard the uncompressed files
afterwards.
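
The client-side steps above can be sketched as follows. This assumes the
cached package is zstd-compressed and uses a plain SHA-256 checksum as a
stand-in for the real signature check; the function names are hypothetical.
`-d`, `-f`, `-o`, `--patch-from`, and `--long` are existing zstd options.

```python
import hashlib
import subprocess


def sha256_of(path, buf_size=1024 * 1024):
    """Hash a file in blocks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(buf_size), b""):
            digest.update(block)
    return digest.hexdigest()


def rebuild_tar(cached_pkg, patch_file, old_tar, new_tar, expected_sha256,
                run=subprocess.run):
    """Decompress the cached package, apply the patch, verify the result."""
    # 1. Decompress the locally stored package to a plain tar archive.
    run(["zstd", "-d", "-f", cached_pkg, "-o", old_tar], check=True)
    # 2. Apply the delta patch, using the old tar as the reference.
    run(["zstd", "-d", "-f", f"--patch-from={old_tar}", "--long=31",
         patch_file, "-o", new_tar], check=True)
    # 3. Refuse to continue if the rebuilt archive is not the expected one.
    if sha256_of(new_tar) != expected_sha256:
        raise ValueError("checksum mismatch for rebuilt archive")
    return new_tar
```

The `run` parameter only exists so the external zstd call can be replaced
in tests; after a successful return the caller would install from `new_tar`
and delete both uncompressed files.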

# Caveats

This would obviously result in the pkg cache containing one full package
file followed by a 1->2 delta, a 2->3 delta, a 3->4 delta, and so on over
time. This could be cleaned up by determining the latest version that
should be stored, compressing that file locally, and storing it with a
dedicated extension so it does not clash with the regular packages (and
their signatures).
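
One possible way to plan such a cleanup, as a sketch. All file names and
the `.pkg.tar.local.zst` extension are made up for illustration; the
function only computes which deltas to apply and which files to delete, it
does not touch the cache itself.

```python
def plan_cleanup(base, deltas, keep):
    """Plan the consolidation of a cache holding one full package plus deltas.

    base   -- version of the full package in the cache, e.g. "pkg-1"
    deltas -- ordered (from_version, to_version) pairs starting at base
    keep   -- version that should remain as the new locally built base
    """
    if keep == base:
        return {"apply": [], "delete": [], "store_as": None}
    apply, current = [], base
    for src, dst in deltas:
        if src != current:
            raise ValueError(f"delta chain is broken at {src}->{dst}")
        apply.append(f"{src}_to_{dst}.delta.zst")
        current = dst
        if current == keep:
            return {
                "apply": apply,  # patches to apply, in order
                "delete": [f"{base}.pkg.tar.zst"] + apply,  # obsolete afterwards
                "store_as": f"{keep}.pkg.tar.local.zst",  # dedicated extension
            }
    raise ValueError(f"version {keep} is not reachable from {base}")
```

Deltas newer than `keep` stay in the cache untouched, since they still
apply on top of the new local base.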

# Database files

This could also work for database updates, obviously. But it would need a
bit more work as the database files would need to be versioned (or maybe
the timestamp is enough?).

On a daily update of the community db, we could for example save 88.3%
(2022-05-06 to 2022-05-07).

On intra-day updates, this would be down to just a couple of K: the last
update of the community repo was just 40K as a patch file. That means
saving 99.4%, while applying takes only 0m0.075s.

---

Hope that's interesting and something you could look into :)


Best regards,

Ruben

