On Sat, Nov 8, 2008 at 12:47 AM, Henning Garus <henning.garus@googlemail.com> wrote:
Hi,
I have been looking through the current delta implementation in libalpm and have put some thought into changing makepkg/repo-add to support delta creation. However, I'm running into some problems, mostly due to md5sums and gzip.
The current implementation works as follows. On a sync operation, libalpm checks whether a valid delta path exists and whether the summed file size of the deltas is smaller than the file size of the whole download. If so, the deltas are downloaded and applied to the old file. After that, the patched file is treated as if it had been downloaded normally, which includes a check of the md5sum.
Gzip files have a header containing a timestamp, which will screw with this md5sum. When xdelta applies a patch to a gzipped file, it unzips the file, applies the patch and then rezips the file. The author of xdelta was obviously aware of the problems with the timestamp, because he decided to leave it empty. The same can be achieved with the -n option of gzip. But here comes the next problem: xdelta uses zlib for compression, while gzip implements compression itself, and files created by gzip can differ from files created by zlib. Bsdtar uses zlib as well, but writes the timestamp, and there is no option to prevent this (at least none that I can see).
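To make the timestamp problem concrete, here is a minimal sketch in Python (chosen only for illustration; it is not libalpm, xdelta or bsdtar code). It uses the standard `gzip` module, which sits on top of zlib, so it shows only the header timestamp effect and not the gzip-versus-zlib difference. The helper names `gzip_bytes` and `md5_hex` and the payload are made up for the example.

```python
import gzip
import hashlib
import io


def gzip_bytes(payload: bytes, mtime: int) -> bytes:
    """Gzip payload with an explicit header timestamp (mtime=0 leaves it empty)."""
    buf = io.BytesIO()
    # filename="" keeps the optional file name field out of the header as well.
    with gzip.GzipFile(filename="", mode="wb", fileobj=buf, mtime=mtime) as gz:
        gz.write(payload)
    return buf.getvalue()


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


# Hypothetical stand-in for the uncompressed package tarball.
payload = b"stand-in for the contents of a package tarball\n" * 100

stamped_a = gzip_bytes(payload, mtime=1226101620)
stamped_b = gzip_bytes(payload, mtime=1226101621)  # compressed one second later
blank_a = gzip_bytes(payload, mtime=0)
blank_b = gzip_bytes(payload, mtime=0)

# Same content, different header timestamp: the md5sums no longer match.
print("stamped md5sums equal:", md5_hex(stamped_a) == md5_hex(stamped_b))
# Empty timestamp (what gzip -n does): the output is reproducible.
print("blank md5sums equal:  ", md5_hex(blank_a) == md5_hex(blank_b))
# The decompressed content is identical in every case.
print("payload unchanged:    ", gzip.decompress(stamped_a) == payload)
```

The two stamped archives differ only in the four timestamp bytes of the header, which is enough to change the md5sum of the whole file.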
There are four ways around this that I can think of:
1. create the package, then create the delta, apply the delta to the old version, remove the original new package and present the patched package as output
I think this sucks: it ties delta creation to makepkg (more about that later) and has an incredibly huge and useless overhead (countless unzips and rezips, plus applying the patch).
2. create the package, but don't compress it with bsdtar, use gzip -n instead. This means we have to use gzip again, in libalpm, when we apply the delta.
Seems better than 1, but makes makepkg and libalpm rely on gzip. Not sure if this is a good thing, especially for libalpm.
3. save the md5sums of the unzipped tars in the sync db and change libalpm to check those
Seems reasonable, but I don't see a way to do this with libarchive, so this would require using zlib directly, and pacman would lose the ability to handle tar.bz2
4. Skip checking the md5sum for deltas
OK during the initial sync, as long as we trust xdelta to do its job (the md5sums of both the old and the new file are in the delta file). But the created package will have the wrong md5sum and can't be used to reinstall, etc., which makes this look like a bad idea.
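The idea behind option 3 (checksum the unzipped tar instead of the compressed file) can be sketched as follows. This is a Python illustration using only the standard library, not libalpm code, and it says nothing about whether libarchive exposes the decompressed stream, which is the open question above. The names `gzip_with_mtime`, `open_decompressed` and `md5_of_uncompressed` are hypothetical.

```python
import bz2
import gzip
import hashlib
import io


def gzip_with_mtime(data: bytes, mtime: int) -> bytes:
    """Helper for the demo only: gzip data with a chosen header timestamp."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename="", mode="wb", fileobj=buf, mtime=mtime) as gz:
        gz.write(data)
    return buf.getvalue()


def open_decompressed(raw: bytes):
    """Pick a decompressor from the magic bytes (gzip or bzip2 only)."""
    if raw[:2] == b"\x1f\x8b":
        return gzip.GzipFile(fileobj=io.BytesIO(raw), mode="rb")
    if raw[:3] == b"BZh":
        return bz2.BZ2File(io.BytesIO(raw), mode="rb")
    raise ValueError("unrecognised compression format")


def md5_of_uncompressed(raw: bytes, chunk_size: int = 64 * 1024) -> str:
    """md5 of the decompressed stream, read in chunks."""
    digest = hashlib.md5()
    with open_decompressed(raw) as plain:
        for chunk in iter(lambda: plain.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical stand-in for an uncompressed package tarball.
tar_bytes = b"stand-in for an uncompressed package tarball\n" * 1000

built = gzip_with_mtime(tar_bytes, mtime=1226101620)  # as written at build time
patched = gzip_with_mtime(tar_bytes, mtime=0)         # as rewritten after patching
as_bz2 = bz2.compress(tar_bytes)

print("compressed md5sums equal:  ",
      hashlib.md5(built).hexdigest() == hashlib.md5(patched).hexdigest())
print("uncompressed md5sums equal:",
      md5_of_uncompressed(built) == md5_of_uncompressed(patched)
      == md5_of_uncompressed(as_bz2))
```

The checksum of the decompressed stream is the same no matter which compressor, timestamp or format produced the outer file. That makes the scheme attractive, provided the library reading the package can hand over that stream.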
In a previous mail Xavier toyed with the idea of putting delta creation into repo-add. I have given this some thought, as it seems nice in principle, but there are drawbacks. For Arch this would mean creating deltas on Gerolde, which seems to be fairly strained already, according to the dev list. Furthermore, this introduces some new variables to repo-add (at least a repo location and an output location). This would be manageable, but doesn't look very nice.
Delta creation in makepkg seems somehow OK (it's already in there, after all). But what I would really like is a separate tool for delta creation, which would allow separating package building from delta creation and setting up a separate delta server. This leaves us with options 2 and 3, and I am not really sure which way to go.
Looking forward to your comments.
A very small bump on this :)

1) gzip -n usage

But first, in the last discussion we had, which started with the above mail, it seems we were more in favor of option 2):
2. create the package, but don't compress it with bsdtar, use gzip -n instead. This means we have to use gzip again, in libalpm, when we apply the delta.
In fact, Nathan already made a patch for that. I think this patch looks fine:
http://archive.netbsd.se/?ml=pacman-dev&a=2008-02&m=6427986

2) repo-add vs makepkg support

Nathan even made one to add support to repo-add too, but this patch looked a bit more scary:
http://archive.netbsd.se/?ml=pacman-dev&a=2008-02&m=6427987
It was more complex than I hoped. The simpler way I was thinking about was to get delta support only in repo-add, instead of both makepkg and repo-add:
http://archive.netbsd.se/?ml=pacman-dev&a=2008-02&m=6601225
Dan seemed to think it was better in repo-add, and Henning seems to think it is better in makepkg. We need more discussion on this and then finally make a decision :)

2.1) About Nathan's patch to support both

If we do want to have the functionality in both makepkg and repo-add, it would be cool to try to clean up the code a bit, for example this:

+# create_xdelta_file - will create a delta for the package filename given.
+#
+# params:
+# $1 - the filename of the package
+# $2 - the arch of the package
+# $3 - the version and release of the package
+# $4 - the directory where the package is located
+# $5 - the extension of packages
+# $6 - 0 if an existing delta file should not be overwritten
+# $7 - the filename of the previous package (blank if not known)
+# $8 - the version of the previous package (blank if not known)

That's a lot of params :)

3) format of delta in the database

However, I don't think there is any repo-add / makepkg patch to support the new format. Henning also made a comment about the format:
http://bugs.archlinux.org/task/12000#comment34162
"So basically the current delta implementation is working. Only the support in makepkg/repo-add is wrong. I am not exactly sure though, why libalpm expects the md5sums of the old and the new package. I am not sure if these are even used anywhere. I would feel safe enough with xdelta checking those and then libalpm checking the md5sum of the final patched package."
I guess Dan added these two md5sums for safety, but yes, they might not be needed. I would also be fine with dropping them, even if they don't hurt.