[pacman-dev] delta support in libalpm
Hi,
I have been looking through the current delta implementation in libalpm and have put some thought into changing makepkg/repo-add to support delta creation. However, I'm running into some problems, mostly due to md5sums and gzip.
The current implementation works as follows. On a sync operation, libalpm checks whether a valid delta path exists and whether the summed filesize of the deltas is smaller than the filesize of the whole download. When this is the case, the deltas are downloaded and applied to the old file. After that, the patched file is treated as if it had been downloaded normally, which includes a check of the md5sum. Gzip files have a header containing a timestamp, which interferes with this md5sum. When a patch is applied to a gzipped file by xdelta, xdelta will unzip the file, apply the patch and then rezip the file. The author of xdelta was obviously aware of the problems with the timestamp, because he decided to leave it empty. The same can be achieved with the -n option of gzip. But then comes the next problem: xdelta uses zlib for compression, while gzip implements compression itself, and files created by gzip can differ from files created by zlib. Bsdtar uses zlib as well, but writes the timestamp, and there is no option to prevent this (at least none that I can see).
There are four ways around this that I can think of:
1. Create the package, then create the delta, apply the delta to the old version, remove the original new package and present the patched package as output.
I think this sucks: it ties delta creation to makepkg (more about that later) and has an incredibly huge and useless overhead (countless unzips and rezips and applying the patch).
2. Create the package, but don't compress it with bsdtar; use gzip -n instead. This means we have to use gzip again, in libalpm, when we apply the delta.
Seems better than 1, but makes makepkg and libalpm rely on gzip. Not sure if this is a good thing, especially for libalpm.
3. Save the md5sums of the unzipped tars in the sync db and change libalpm to check those.
Seems reasonable, but I don't see a way to do this with libarchive, so this would require using zlib directly, and pacman would lose the ability to handle tar.bz2.
4. Skip checking the md5sum for deltas.
OK during the initial sync, as long as we trust xdelta to do its job (the md5sums of both the old and the new file are in the delta file). But the created package will have the wrong md5sum and can't be used to reinstall, etc., which makes this look like a bad idea.
In a previous mail Xavier toyed with the idea of putting delta creation into repo-add. I have given this some thought; it seems nice in principle, but there are drawbacks. For Arch this would mean creating deltas on Gerolde, which seems to be fairly strained already, according to the dev list. Furthermore, this introduces some new variables to repo-add (at least a repo location and an output location); this would be manageable, but doesn't look very nice.
Delta creation in makepkg seems somehow OK (it's already in there, after all). But what I would really like is a separate tool for delta creation, which would allow separating building packages from creating deltas, and setting up a separate delta server. This leaves us with options 2 and 3, and I am not really sure which way to go.
Looking forward to your comments
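The timestamp problem described above is easy to reproduce with Python's stdlib gzip module (a zlib-based compressor, like bsdtar; the payload here is made up). Identical contents, different header timestamps, different md5sums:

```python
import gzip
import hashlib

data = b"fake package payload"

# Same input, two different header timestamps: the compressed files differ,
# so their md5sums differ even though the contents are identical.
with_ts = gzip.compress(data, mtime=1226102400)   # arbitrary nonzero timestamp
no_ts = gzip.compress(data, mtime=0)              # what "gzip -n" produces

assert hashlib.md5(with_ts).hexdigest() != hashlib.md5(no_ts).hexdigest()
assert gzip.decompress(with_ts) == gzip.decompress(no_ts)

# The MTIME field occupies bytes 4-7 of the gzip header (little-endian);
# "gzip -n" leaves it zeroed.
assert no_ts[4:8] == b"\x00\x00\x00\x00"
```

Note this only demonstrates the timestamp part of the problem; the zlib-vs-gzip encoding differences are a separate source of mismatch.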
Hi,
I have been looking through the current delta implementation in libalpm and have put some thought into changing makepkg/repo-add to support delta creation. However, I'm running into some problems, mostly due to md5sums and gzip.
The current implementation works as follows. On a sync operation it is checked, whether a valid delta path exists and if the summed filesize of the deltas is smaller than the filesize of the whole download. When this is the case the deltas are downloaded and applied to the old file. After that the patched file is treated as if it was downloaded normally, this includes a check of the md5sum. Gzip files have a header, that has a timestamp, which will screw with this md5sum. When a patch is applied to a gzipped file by xdelta, xdelta will unzip the file, apply the patch and then rezip the file. The author of xdelta was obviously aware of the problems with the timestamp, because he decided to leave it empty. The same can be achieved by the -n option of gzip. But there comes the next problem, xdelta uses zlib for compression, gzip implements compression itself. And files created by gzip can differ from files created by zlib. Bsdtar uses zlib as well, but writes the timestamp and there is no option to prevent this (at least none that I can see).
First of all, our current delta implementation doesn't work at all atm, see FS#12000. So any maintainers are welcome ;-)
There are four ways around this, that I can think of:
1. create the package, then create the delta, apply the delta to the old version, remove the original new package and present the patched package as output
I think this sucks, this ties delta creation to makepkg (more about that later) and has an incredibly huge and useless overhead (countless unzips and rezips and applying the patch).
-1
2. create the package, but don't compress it with bsdtar, use gzip -n instead. This means we have to use gzip again, in libalpm, when we apply the delta.
Seems better than 1, but makes makepkg and libalpm rely on gzip. Not sure if this is a good thing, especially for libalpm.
maybe. But I don't see why we should use gzip in libalpm. Iirc we never compress things in alpm.
3. save the md5sums of the unzipped tars in the synchdb and change libalpm to check those
Seems reasonable, but I don't see a way to do this with libarchive, so this would require using zlib directly and pacman would lose the ability to handle tar.bz2
-1
4. Skip checking the md5sum for deltas
OK during the initial synch, as long as we trust xdelta to do its job (the md5sums of both the old and the new file are in the delta file). But the created package will have the wrong md5sum and can't be used to reinstall, etc. which makes this look like a bad idea.
-1 Although, xdelta has its own md5sum mechanism, it won't help here, as you said.
In a previous mail Xavier toyed with the idea to put delta creation into repo-add, I have given this some thought, as it seems nice in principle, but there are drawbacks. For Arch this would mean creating deltas on Gerolde, which seems to be fairly strained already, according to the dev list. Furthermore this introduces some new variables to repo-add (at least repo location and an output location) this would be manageable, but doesn't look very nice.
I don't even understand why we create deltas in makepkg. But if we create deltas with repo-add, makepkg should be changed as well (the resultant pkg.tar.gz should not contain timestamp).
Delta creation in makepkg seems somehow ok (its already in there after all). But what I would really like is a separate tool for delta creation, which would allow the separation of building packages and creating deltas and setting up a separated delta server. This leaves us with options 2 and 3 and I am not really sure, which way to go.
Bye
------------------------------------------------------
SZTE Egyetemi Konyvtar - http://www.bibl.u-szeged.hu
This message was sent using IMP: http://horde.org/imp/
On Sat, Nov 8, 2008 at 1:38 PM, Nagy Gabor <ngaba@bibl.u-szeged.hu> wrote:
First of all, our current delta implementation doesn't work at all atm, see FS#12000. So any maintainers are welcome ;-)
I have yet to test this, but I think this comes down to repo-add not being in line with the current implementation, as you already pointed out in the discussion.
2. create the package, but don't compress it with bsdtar, use gzip -n instead. This means we have to use gzip again, in libalpm, when we apply the delta.
Seems better than 1, but makes makepkg and libalpm rely on gzip. Not sure if this is a good thing, especially for libalpm.
maybe. But I don't see why we should use gzip in libalpm. Iirc we never compress things in alpm.
Yes you do. libalpm uses system() to execute:
xdelta patch [deltafile] [oldpkg] [newpkg]
xdelta will unzip the old package, apply the patch and rezip the new package. Due to the zlib/gzip inconsistencies, the md5sum of the patched package can differ from the md5sum of the new package, which was zipped with gzip. Unless that line is changed to something like:
xdelta patch -0 [deltafile] [oldpkg] - | gzip -cn > [newpkg]
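The point of that pipeline is that the md5sum becomes reproducible when one fixed compressor with a zeroed timestamp always performs the final recompression. A minimal sketch with Python's stdlib gzip module standing in for the "| gzip -cn" step (the package contents are made up):

```python
import gzip
import hashlib

# Stand-in for the uncompressed tar that xdelta would emit with -0.
new_tar = b"contents of the new package, as an uncompressed tar"

def canonical_gzip(payload: bytes) -> bytes:
    # One fixed compressor plus a zeroed timestamp gives byte-identical
    # output for identical input, which is what "gzip -cn" is meant to do.
    return gzip.compress(payload, mtime=0)

# Two independent runs that both end in the canonical recompression:
run1 = canonical_gzip(new_tar)
run2 = canonical_gzip(gzip.decompress(canonical_gzip(new_tar)))

assert hashlib.md5(run1).hexdigest() == hashlib.md5(run2).hexdigest()
```

The same property fails if one side compresses with GNU gzip and the other with zlib, which is exactly the inconsistency the mail describes.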
Henning Garus wrote:
On Sat, Nov 8, 2008 at 1:38 PM, Nagy Gabor <ngaba@bibl.u-szeged.hu> wrote:
First of all, our current delta implementation doesn't work at all atm, see FS#12000. So any maintainers are welcome ;-)
I have yet to test this, but I think this comes down to repo-add not being in line with the current implementation, as you already pointed out in the discussion.
2. create the package, but don't compress it with bsdtar, use gzip -n instead. This means we have to use gzip again, in libalpm, when we apply the delta.
Seems better than 1, but makes makepkg and libalpm rely on gzip. Not sure if this is a good thing, especially for libalpm.
maybe. But I don't see why we should use gzip in libalpm. Iirc we never compress things in alpm.
Yes you do. libalpm uses system() to execute: xdelta patch [deltafile] [oldpkg] [newpkg]
xdelta will unzip the old package, apply the patch and rezip the new package. Due to the zlib/gzip inconsistencies, the md5sum of the patched package can differ from the md5sum of the new package, which was zipped with gzip. Unless that line is changed to something like:
xdelta patch -0 [deltafile] [oldpkg] - | gzip -cn > [newpkg]
Is any of this fixed by using the xdelta3 branch? From memory that does not use gzip/bzip2.
Allan
On Mon, Nov 10, 2008 at 8:38 AM, Allan McRae <allan@archlinux.org> wrote:
Henning Garus wrote:
Yes you do. libalpm uses system() to execute: xdelta patch [deltafile] [oldpkg] [newpkg]
xdelta will unzip the old package, apply the patch and rezip the new package. Due to the zlib/gzip inconsistencies the md5sum for the patched package can differ from the md5sum of the new package, which was zipped with gzip. Unless that line is changed to something like xdelta patch -0 [deltafile] [oldpkg] - | gzip -cn > [newpkg]
Is any of this fixed by using the xdelta3 branch? From memory that does not use gzip/bzip2.
According to http://xdelta.org/xdelta3.html xdelta3 uses a built-in compression to compress the delta files (xdelta1 uses zlib). However, you won't get around decompression and recompression when using deltas with compressed files. When xdelta3 gets compressed files as input, it will use the appropriate external compression engine to decompress the inputs and compute a delta. It does that again to compress the output after patching. So basically it will do the same as my original proposal 2, only internally. It could be interesting nonetheless, because with xdelta3 deltas would probably work for bzip2 compressed packages, without any further changes in pacman. Henning
On Thu, Nov 13, 2008 at 9:31 PM, Henning Garus <henning.garus@googlemail.com> wrote:
On Mon, Nov 10, 2008 at 8:38 AM, Allan McRae <allan@archlinux.org> wrote:
Henning Garus wrote:
Yes you do. libalpm uses system() to execute: xdelta patch [deltafile] [oldpkg] [newpkg]
xdelta will unzip the old package, apply the patch and rezip the new package. Due to the zlib/gzip inconsistencies the md5sum for the patched package can differ from the md5sum of the new package, which was zipped with gzip. Unless that line is changed to something like xdelta patch -0 [deltafile] [oldpkg] - | gzip -cn > [newpkg]
Is any of this fixed by using the xdelta3 branch? From memory that does not use gzip/bzip2.
According to http://xdelta.org/xdelta3.html xdelta3 uses a built-in compression to compress the delta files (xdelta1 uses zlib). However, you won't get around decompression and recompression when using deltas with compressed files. When xdelta3 gets compressed files as input, it will use the appropriate external compression engine to decompress the inputs and compute a delta. It does that again to compress the output after patching. So basically it will do the same as my original proposal 2, only internally. It could be interesting nonetheless, because with xdelta3 deltas would probably work for bzip2 compressed packages, without any further changes in pacman.
Henning
I guess I spoke too soon. I was right that xdelta3 uses gzip for handling gzipped files; however, it doesn't use the -n flag. This gives us the same behaviour as xdelta1, with one minor difference: method 1 stops working.
I guess I spoke too soon. I was right that xdelta3 uses gzip for handling gzipped files; however, it doesn't use the -n flag. This gives us the same behaviour as xdelta1, with one minor difference: method 1 stops working.
Wait, I don't understand something. If it doesn't use the -n flag, how can we produce an md5sum-identical patched file? (The mtime is unpredictable.) Doesn't this pose an extra problem? I just did an actual test, xdelta3-diffing .tar.gz files, and I saw that the patched md5sum does indeed differ from the original one :-( Maybe we should search for a gzip header manipulation tool...
On Fri, Nov 14, 2008 at 1:02 PM, Nagy Gabor <ngaba@bibl.u-szeged.hu> wrote:
I guess I spoke too soon. I was right that xdelta3 uses gzip for handling gzipped files; however, it doesn't use the -n flag. This gives us the same behaviour as xdelta1, with one minor difference: method 1 stops working.
Wait. I don't understand something. If it doesn't use the -n flag, how can we produce an md5sum-identical patched file? (The mtime is unpredictable.) This poses an extra problem, or not? I just did an effective test on xdelta3-diffing .tar.gz files and I saw that the patched md5sum indeed differ from the original one :-( Maybe we should search for a gzip header manipulation tool...
One of the proposals was to use gzip -n, which should fix this problem.
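The "gzip header manipulation tool" idea floated above is small in practice: the MTIME field occupies bytes 4-7 of the gzip header, and zeroing it does not invalidate the file. A sketch in Python (stdlib only); note this only produces byte-identical files when everything except the timestamp already matches, so it fixes the timestamp problem but not zlib-vs-gzip encoding differences:

```python
import gzip

def zero_gzip_mtime(blob: bytes) -> bytes:
    """Return the gzip member with its MTIME field (bytes 4-7) zeroed."""
    if blob[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip file")
    return blob[:4] + b"\x00\x00\x00\x00" + blob[8:]

original = gzip.compress(b"some tarball", mtime=1226102400)
fixed = zero_gzip_mtime(original)

# Still a valid gzip file with the same contents...
assert gzip.decompress(fixed) == b"some tarball"
# ...and now byte-identical to compressing with a zeroed timestamp
# (true here because the same compressor produced both files).
assert fixed == gzip.compress(b"some tarball", mtime=0)
```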
Quoting Xavier <shiningxc@gmail.com>:
On Fri, Nov 14, 2008 at 1:02 PM, Nagy Gabor <ngaba@bibl.u-szeged.hu> wrote:
I guess I spoke too soon. I was right that xdelta3 uses gzip for handling gzipped files; however, it doesn't use the -n flag. This gives us the same behaviour as xdelta1, with one minor difference: method 1 stops working.
Wait. I don't understand something. If it doesn't use the -n flag, how can we produce an md5sum-identical patched file? (The mtime is unpredictable.) This poses an extra problem, or not? I just did an effective test on xdelta3-diffing .tar.gz files and I saw that the patched md5sum indeed differ from the original one :-( Maybe we should search for a gzip header manipulation tool...
One of the proposals was to use gzip -n, which should fix this problem.
xdelta1 used(?) "gzip -n" for the _patched_ .tar.gz file; xdelta3 doesn't use "-n", so setting "-n" for the original .tar.gz file won't help any more. xdelta3 can probably also be configured or patched to use "-n"... Or I may have completely misunderstood something...
2008/11/14 Nagy Gabor <ngaba@bibl.u-szeged.hu>:
Quoting Xavier <shiningxc@gmail.com>:
On Fri, Nov 14, 2008 at 1:02 PM, Nagy Gabor <ngaba@bibl.u-szeged.hu> wrote:
I guess I spoke too soon. I was right that xdelta3 uses gzip for handling gzipped files; however, it doesn't use the -n flag. This gives us the same behaviour as xdelta1, with one minor difference: method 1 stops working.
Wait. I don't understand something. If it doesn't use the -n flag, how can we produce an md5sum-identical patched file? (The mtime is unpredictable.) This poses an extra problem, or not? I just did an effective test on xdelta3-diffing .tar.gz files and I saw that the patched md5sum indeed differ from the original one :-( Maybe we should search for a gzip header manipulation tool...
One of the proposals was to use gzip -n, which should fix this problem.
xdelta1 used(?) "gzip -n" for the _patched_ .tar.gz file; xdelta3 doesn't use "-n", so setting "-n" for the original .tar.gz file won't help any more. xdelta3 can probably also be configured or patched to use "-n"... Or I may have completely misunderstood something...
xdelta1 uses zlib; my proposal was taking the uncompressed output of xdelta1 and using gzip -n to compress it. xdelta3 uses gzip, but without -n. I can't find any options for external compression, but the easiest thing to do would be deactivating external compression in xdelta3 and applying the compression manually by piping the output through gzip -n.
On Sat, Nov 8, 2008 at 12:47 AM, Henning Garus <henning.garus@googlemail.com> wrote:
2. create the package, but don't compress it with bsdtar, use gzip -n instead. This means we have to use gzip again, in libalpm, when we apply the delta.
Seems better than 1, but makes makepkg and libalpm rely on gzip. Not sure if this is a good thing, especially for libalpm.
Delta creation in makepkg seems somehow ok (its already in there after all). But what I would really like is a separate tool for delta creation, which would allow the separation of building packages and creating deltas and setting up a separated delta server. This leaves us with options 2 and 3 and I am not really sure, which way to go.
looking forward to your comments
I am very glad you looked into this, you seem to have a very good understanding of the situation, possibly better than me, so it would be great if you could fix and maintain this part. I would just go with option 2. When deltas are used, libalpm already relies on xdelta, so why not on gzip as well.
On Sat, Nov 8, 2008 at 12:47 AM, Henning Garus <henning.garus@googlemail.com> wrote:
Hi,
I have been looking through the current delta implementation in libalpm and have put some thought into changing makepkg/repo-add to support delta creation. However, I'm running into some problems, mostly due to md5sums and gzip.
The current implementation works as follows. On a sync operation it is checked, whether a valid delta path exists and if the summed filesize of the deltas is smaller than the filesize of the whole download. When this is the case the deltas are downloaded and applied to the old file. After that the patched file is treated as if it was downloaded normally, this includes a check of the md5sum. Gzip files have a header, that has a timestamp, which will screw with this md5sum. When a patch is applied to a gzipped file by xdelta, xdelta will unzip the file, apply the patch and then rezip the file. The author of xdelta was obviously aware of the problems with the timestamp, because he decided to leave it empty. The same can be achieved by the -n option of gzip. But there comes the next problem, xdelta uses zlib for compression, gzip implements compression itself. And files created by gzip can differ from files created by zlib. Bsdtar uses zlib as well, but writes the timestamp and there is no option to prevent this (at least none that I can see).
A very small bump on this :)

1) gzip -n usage

But first, in the last discussion we had, which started with the above mail, it seems we were more in favor of option 2):
2. create the package, but don't compress it with bsdtar, use gzip -n instead. This means we have to use gzip again, in libalpm, when we apply the delta.
In fact, Nathan already made a patch for that. I think this patch looks fine: http://archive.netbsd.se/?ml=pacman-dev&a=2008-02&m=6427986

2) repo-add vs makepkg support

Nathan even made one to add support to repo-add too, but this patch looked a bit more scary: http://archive.netbsd.se/?ml=pacman-dev&a=2008-02&m=6427987
It was more complex than I hoped. But the simpler way I was thinking about was to get delta support only in repo-add, instead of both makepkg and repo-add: http://archive.netbsd.se/?ml=pacman-dev&a=2008-02&m=6601225
Dan seemed to think it was better in repo-add, and Henning seems to think it is better in makepkg. We need more discussion on this and finally take a decision :)

2.1) About Nathan's patch to support both

If we do want to have the functionality in both makepkg and repo-add, it would be cool to try to clean up the code a bit, for example this:

+# create_xdelta_file - will create a delta for the package filename given.
+#
+# params:
+#   $1 - the filename of the package
+#   $2 - the arch of the package
+#   $3 - the version and release of the package
+#   $4 - the directory where the package is located
+#   $5 - the extension of packages
+#   $6 - 0 if an existing delta file should not be overwritten
+#   $7 - the filename of the previous package (blank if not known)
+#   $8 - the version of the previous package (blank if not known)

That's a lot of params :)

3) format of delta in the database

However, I don't think there is any repo-add / makepkg patch to support the new format. Henning also made a comment about the format: http://bugs.archlinux.org/task/12000#comment34162
"So basically the current delta implementation is working. Only the support in makepkg/repo-add is wrong. I am not exactly sure though, why libalpm expects the md5sums of the old and the new package. I am not sure if these are even used anywhere. I would feel safe enough with xdelta checking those and then libalpm checking the md5sum of the final patched package."
I guess Dan added these two md5sums for safety but yes, they might not be needed, I would also be fine with dropping them, even if they don't hurt.
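One way to tame the eight positional parameters of create_xdelta_file discussed above is to bundle them into a single record. The real scripts are shell, so this is only an illustration of the idea, sketched in Python with hypothetical field names:

```python
from dataclasses import dataclass

# Hypothetical refactoring sketch: bundle create_xdelta_file's eight
# positional parameters into one record so call sites stay readable.
@dataclass
class DeltaJob:
    pkg_file: str               # $1 - filename of the package
    arch: str                   # $2 - architecture of the package
    pkgver: str                 # $3 - version and release of the package
    pkg_dir: str                # $4 - directory where the package is located
    ext: str = ".pkg.tar.gz"    # $5 - extension of packages
    overwrite: bool = False     # $6 - overwrite an existing delta file?
    prev_file: str = ""         # $7 - previous package filename, if known
    prev_ver: str = ""          # $8 - previous package version, if known

job = DeltaJob("libxml2-2.7.3-1-x86_64.pkg.tar.gz", "x86_64", "2.7.3-1", ".")
assert job.overwrite is False
```

In shell the equivalent would be a handful of well-named variables set before the call rather than eight positional arguments.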
On Sat, Nov 8, 2008 at 12:47 AM, Henning Garus <henning.garus@googlemail.com> wrote:
2. create the package, but don't compress it with bsdtar, use gzip -n instead. This means we have to use gzip again, in libalpm, when we apply the delta.
Seems better than 1, but makes makepkg and libalpm rely on gzip. Not sure if this is a good thing, especially for libalpm.
That sounds alright. I just noticed that xdelta3 has an option to disable the external recompression: -R. So we don't even have extra decompression/recompression steps, there is no loss.
+ snprintf(command, PATH_MAX, "xdelta3 -d -R -c -s %s %s | gzip -n > %s", from, delta, to);
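The shape of that command can be factored into a small builder; a hypothetical sketch only (libalpm itself uses snprintf()/system() as above), mirroring the same flags, with shlex.quote guarding the paths:

```python
import shlex

def xdelta3_patch_command(from_pkg: str, delta: str, to_pkg: str) -> str:
    # Mirrors the snprintf() line above: decode (-d) with external
    # recompression disabled (-R), write to stdout (-c), source file via -s,
    # then recompress deterministically with "gzip -n".
    return "xdelta3 -d -R -c -s {} {} | gzip -n > {}".format(
        shlex.quote(from_pkg), shlex.quote(delta), shlex.quote(to_pkg))

cmd = xdelta3_patch_command("old.pkg.tar.gz", "pkg.delta", "new.pkg.tar.gz")
assert cmd == "xdelta3 -d -R -c -s old.pkg.tar.gz pkg.delta | gzip -n > new.pkg.tar.gz"
```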
In a previous mail Xavier toyed with the idea to put delta creation into repo-add, I have given this some thought, as it seems nice in principle, but there are drawbacks. For Arch this would mean creating deltas on Gerolde, which seems to be fairly strained already, according to the dev list. Furthermore this introduces some new variables to repo-add (at least repo location and an output location) this would be manageable, but doesn't look very nice.
Delta creation in makepkg seems somehow ok (its already in there after all). But what I would really like is a separate tool for delta creation, which would allow the separation of building packages and creating deltas and setting up a separated delta server. This leaves us with options 2 and 3 and I am not really sure, which way to go.
looking forward to your comments
I just went further than I ever did on this task; I seriously considered a separate tool and spent some time thinking about all the possibilities. I am still not sure what is best. My thought was that it was very easy to generate the database info during the delta creation. However, how do we keep this info? Originally it was stored in the delta filename, but this was before the database change. Now we need two filenames and two md5sums (old pkg and new pkg); it does not seem realistic to store all this in the delta filename.

Here are the options I considered:
1) delta support only in repo-add
No problem of temporary storage of the info here, it goes directly into the database. But maybe not flexible enough.
2) embed the .delta files into another format, eg delta.tar.gz archive = delta file + DELTA metafile
Might be overkill? And we lose the ability of using xdelta directly.
3) a separate tool creates the delta, generates the delta metainfo and stores it in a file
This file can then be given to repo-add, which basically just adds its contents to $pkgname-*/deltas.

I gave a try to that third option. It's clearly not finished yet, but I am attaching the script in its current state to give an idea and to know if I should move forward. Example of usage:

$ create-xdelta libxml2-2.7.2-1-x86_64.pkg.tar.gz libxml2-2.7.3-1-x86_64.pkg.tar.gz
$ create-xdelta libxml2-2.7.3-1-x86_64.pkg.tar.gz libxml2-2.7.3-1.1-x86_64.pkg.tar.gz
(these two commands added one line each to a libxml2.pacdelta file)
$ repo-add db.tar.gz libxml2-2.7.3-1.1-x86_64.pkg.tar.gz
$ repo-add db.tar.gz libxml2.pacdelta
(pkg.tar.gz and pacdelta can be added together, but the order is important: we need a package entry for libxml2 before adding deltas)

Now we can upgrade from 2.7.2-1 or 2.7.3-1 to 2.7.3-1.1 using deltas.
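Generating one metadata line per delta, as the create-xdelta script above does, could look roughly like this. The field order and layout here are hypothetical, since the exact format of the $pkgname-*/deltas entries is not shown in this thread; the mail only says two filenames and two md5sums are needed:

```python
import hashlib

def pacdelta_line(delta_file: str,
                  old_name: str, old_pkg: bytes,
                  new_name: str, new_pkg: bytes) -> str:
    # Hypothetical .pacdelta record: delta filename plus the old/new package
    # filenames and their md5sums, space-separated on one line.
    old_md5 = hashlib.md5(old_pkg).hexdigest()
    new_md5 = hashlib.md5(new_pkg).hexdigest()
    return " ".join([delta_file, old_name, old_md5, new_name, new_md5])

line = pacdelta_line("libxml2-2.7.2-1_to_2.7.3-1-x86_64.delta",
                     "libxml2-2.7.2-1-x86_64.pkg.tar.gz", b"old package bytes",
                     "libxml2-2.7.3-1-x86_64.pkg.tar.gz", b"new package bytes")
assert len(line.split()) == 5
```

repo-add would then only need to append such lines verbatim to the matching package's deltas entry.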
Hi guys, I'm new here, so I'm asking in advance that you forgive my ignorance. Even before having clicked send, I feel like I'm spamming... o.O

Xavier wrote:
On Sat, Nov 8, 2008 at 12:47 AM, Henning Garus <henning.garus@googlemail.com> wrote:
2. create the package, but don't compress it with bsdtar, use gzip -n instead. This means we have to use gzip again, in libalpm, when we apply the delta.

That sounds alright. I just noticed that xdelta3 has an option to disable the external recompression: -R. So we don't even have extra decompression/recompression steps; there is no loss.
+ snprintf(command, PATH_MAX, "xdelta3 -d -R -c -s %s %s | gzip -n > %s", from, delta, to);
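The gzip timestamp problem driving this whole patch can be demonstrated with gzip alone; a minimal sketch (the file names are made up for illustration):

```shell
# Show why a gzip header timestamp breaks md5sum-based verification,
# and why "gzip -n" makes the output reproducible.
printf 'pretend this is a tarball\n' > payload

gzip -c payload > a.gz               # header stores payload's mtime
touch -d '2000-01-01' payload        # simulate rebuilding at another time
gzip -c payload > b.gz               # same content, different header
cmp -s a.gz b.gz || echo "plain gzip: outputs differ"

gzip -nc payload > c.gz              # -n: omit name and timestamp
gzip -nc payload > d.gz
cmp -s c.gz d.gz && echo "gzip -n: outputs identical"
```

This is exactly why the patched command pipes xdelta3's output through `gzip -n`: the recompressed package can then match the md5sum recorded when the package was built, provided the original was also compressed the same way.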
The way I understand xdelta3's -R and -D options:

-D: disable external decompression (encode/decode).
When applying a delta, same behaviour as -R. When creating a delta, even when given two compressed files, do not detect that they are compressed, i.e. given two .tar.gz files, pretend they are .bin files.

-R: disable external recompression (decode).
When applying a delta to a compressed file: decompress *if* the delta's metadata indicates the file was decompressed in the encode process, apply the delta, and, if decompression occurred while applying the delta, do not bother to recompress. I.e., given a .tar.gz and a .xd3, create a .tar.

Unless my understanding above is completely wrong, using -R is going to help, but not without -D in the encoding process. Also, since we're doing md5s of the .tar.gz instead of the .tar, we'd also need to change some of the housekeeping, perhaps doing md5s of the .tar as well as (or instead of) the .tar.gz.

There was also a bit of recent discussion on the Arch forum about this. Some statistics indicate that vanilla -D isn't really worth it: http://bbs.archlinux.org/viewtopic.php?pid=496539#p496539 shows a 10% bandwidth saving with -D versus an 85% saving without. I mentioned a kludge workaround there, gzip --rsyncable, giving a 77% saving. The kludge probably isn't the right way to go anyway.

So, um... how does this change the way forward? Or is my understanding of the -R parameter completely wrong?
--
__________
Brendan Hide
Xavier wrote:
So we don't even have extra decompression/recompression steps, there is no loss.
+ snprintf(command, PATH_MAX, "xdelta3 -d -R -c -s %s %s | gzip -n > %s", from, delta, to);

Sorry, I only see the logic now. :/ The md5sums are originally generated after compressing with "gzip -n", so Xavier is specifically using "gzip -n" to avoid generating a "different" md5sum after applying the delta.
-- __________ Brendan Hide
On Thu, Feb 19, 2009 at 11:06 AM, Brendan Hide <brendan@swiftspirit.co.za> wrote:
Xavier wrote:
So we don't even have extra decompression/recompression steps, there is no loss.
+ snprintf(command, PATH_MAX, "xdelta3 -d -R -c -s %s %s | gzip -n > %s", from, delta, to);
Sorry, I only see the logic now. :/ The md5sums are originally generated after compressing with "gzip -n", so Xavier is specifically using "gzip -n" to avoid generating a "different" md5sum after applying the delta.
Ah, no problem. To be honest, I wasn't sure how to answer your previous mail: every statement you made seemed correct, only you drew a wrong conclusion from them :) So yup, this will only work with packages generated by a patched makepkg that uses gzip -n for compression. Or well, packages could also be decompressed and re-compressed with gzip -n.
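That last point can be sketched as a one-line normalization pass (file names hypothetical; a stand-in "package" is created first so the sketch is self-contained). Note this only yields a stable md5sum if everyone recompresses with the same gzip implementation and level, the very caveat raised at the start of the thread:

```shell
# Re-compress an existing .tar.gz with a deterministic gzip header.
printf 'fake package contents\n' | gzip > pkg.tar.gz   # stand-in package

gzip -dc pkg.tar.gz | gzip -n > pkg-normalized.tar.gz
gzip -dc pkg.tar.gz | gzip -n > pkg-normalized-2.tar.gz

# Two normalization passes produce byte-identical output:
cmp -s pkg-normalized.tar.gz pkg-normalized-2.tar.gz && echo reproducible
```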
On Sat, Nov 8, 2008 at 12:47 AM, Henning Garus <henning.garus@googlemail.com> wrote:
Delta creation in makepkg seems somehow OK (it's already in there, after all). But what I would really like is a separate tool for delta creation, which would allow separating building packages from creating deltas, and setting up a separate delta server. This leaves us with options 2 and 3, and I am not really sure which way to go.
looking forward to your comments
Sorry for answering this mail for the 10th time, but I always have different points to discuss. Now I am very curious about "a separate delta server". What does this mean exactly? A different delta database (delta.tar.gz) plus corresponding tools to deal with it (delta-add / delta-remove)?

I am not so happy about adding support in repo-add, for several reasons; I always run into many problems and issues. For example, deltas are not tied to one particular package version, but rather to one package name. And when we remove or upgrade a package entry, the delta index gets lost.

So I liked the idea of a separate delta database. The problem is that it might lead to a lot of code duplication in pacman if we need to handle a pmdeltadb_t besides pmdb_t. So I am not so happy about that either.
Xavier wrote:
On Sat, Nov 8, 2008 at 12:47 AM, Henning Garus <henning.garus@googlemail.com> wrote:
Delta creation in makepkg seems somehow OK (it's already in there, after all). But what I would really like is a separate tool for delta creation, which would allow separating building packages from creating deltas, and setting up a separate delta server. This leaves us with options 2 and 3, and I am not really sure which way to go.
looking forward to your comments
Sorry for answering this mail for the 10th time, but I always have different points to discuss. Now I am very curious about "a separate delta server". What does this mean exactly? A different delta database (delta.tar.gz) plus corresponding tools to deal with it (delta-add / delta-remove)?
I am not so happy about adding support in repo-add, for several reasons; I always run into many problems and issues. For example, deltas are not tied to one particular package version, but rather to one package name. And when we remove or upgrade a package entry, the delta index gets lost.
So I liked the idea of a separate delta database. The problem is that it might lead to a lot of code duplication in pacman if we need to handle a pmdeltadb_t besides pmdb_t. So I am not so happy about that either.
Would it be useful if I put xdelta3 into the repos to help testing things out for this? Allan
On Fri, Feb 20, 2009 at 8:04 AM, Allan McRae <allan@archlinux.org> wrote:
Would it be useful if I put xdelta3 into the repos to help testing things out for this?
Short answer: yes, I believe it would be nice to have xdelta3 in the repos.

Long answer: well, I did have something running, but I realized I have way too many problems and questions to go into a testing phase. I have no idea where this is going and where it should go.

There has never been any real official interest in deltas. This seems to make the ability to set up a separate delta server a requirement. This seems to require a separate delta database. This implies a new level of complexity and code bloat in pacman. Now maybe it is worth it, I don't know; it still makes me wonder why we put all this delta stuff in pacman to begin with. What was the problem with XferCommand? It seemed like a great idea. Now, that wget-xdelta.sh script is just a toy, but a much more powerful python script could be written that has basically the same logic as pacman currently has, plus the ability to fetch and parse a separate delta database.
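For reference, the XferCommand hook mentioned here is just a pacman.conf setting; a delta-aware wrapper script would be wired in like this (the wrapper's path and name are illustrative; %u and %o are pacman's standard placeholders for the download URL and the local output file):

```ini
# /etc/pacman.conf, in the [options] section
# %u is replaced by the download URL, %o by the destination file path.
XferCommand = /usr/local/bin/wget-xdelta.sh %u %o
```

pacman then invokes the wrapper for every download, so such a script could transparently fetch and apply deltas instead of full packages, with no changes to pacman itself.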
Xavier wrote:
There has never been any real official interest in deltas. This seems to make the ability to set up a separate delta server a requirement. This seems to require a separate delta database. This implies a new level of complexity and code bloat in pacman. Now maybe it is worth it, I don't know; it still makes me wonder why we put all this delta stuff in pacman to begin with. What was the problem with XferCommand? It seemed like a great idea. Now, that wget-xdelta.sh script is just a toy, but a much more powerful python script could be written that has basically the same logic as pacman currently has, plus the ability to fetch and parse a separate delta database.

Unless the server is out of disk space, I'm not too sure exactly why there's a requirement for a separate server. If pacman is distributed with the delta option turned on by default, the server doing the actual "serving" of the updates is probably going to have 60 to 85% less work to do.

I will grant that there would be a new level of complexity involved; for example, if I've missed 4 updates, we'd have to "chain link" the tar.gz in my cache via 4 delta patches to get the current tar.gz.

I believe that the following would be the simplest implementation, both in terms of how much implementation work is needed and the probable effectiveness: put delta files into a separate folder (thus also avoiding a snapshot containing the deltas):
http://archlinux.mirror.ac.za/delta/core/os/x86_64/kernel26-2.6.28.4-1-x86_6...

Thus, I could do the following (bash pseudocode):

curl http://archlinux.mirror.ac.za/delta/core/os/x86_64/ > tmpfile
grep $pkgname < tmpfile > listing
failed=false
cat listing | while read delta
do
    [ $pkgname-$currentpkgversion-$pkgarch.xd3.tar.gz *within* $delta ] && start=true
    if [ start=true ]
    then
        while read delta
        do
            wget http://archlinux.mirror.ac.za/delta/core/os/x86_64/$delta && applydelta $delta $curfile
            [ $output=$pkgname-$newpkgversion-$pkgarch.tar.gz ] && break
            curfile=`ls -rt | tail -n 1`
        done
    fi
    [ $output=$pkgname-$newpkgversion-$pkgarch.tar.gz ] && break
done

The above requires no db implementation at all and can work well even using the above very simple logic. And yes, by my own standards, the above is very bad bash pseudo-code. :P

Of the above, what is already implemented in pacman?
__________
Brendan Hide
On Mon, Feb 23, 2009 at 9:57 AM, Brendan Hide <brendan@swiftspirit.co.za> wrote:
Xavier wrote:
There has never been any real official interest in deltas. This seems to make the ability to set up a separate delta server a requirement. This seems to require a separate delta database. This implies a new level of complexity and code bloat in pacman. Now maybe it is worth it, I don't know; it still makes me wonder why we put all this delta stuff in pacman to begin with. What was the problem with XferCommand? It seemed like a great idea. Now, that wget-xdelta.sh script is just a toy, but a much more powerful python script could be written that has basically the same logic as pacman currently has, plus the ability to fetch and parse a separate delta database.
Unless the server is out of disk space, I'm not too sure exactly why there's a requirement for a separate server. If pacman is distributed with the delta option turned on by default, the server doing the actual "serving" of the updates is probably going to have 60 to 85% less work to do.
I will grant that there would be a new level of complexity involved, for example, if I've missed 4 updates, we'd have to "chain link" the tar.gz in my cache via 4 delta patches to get the current tar.gz.
I believe that the following would be the simplest implementation, both in terms of how much implementation work is needed and the probable effectiveness: put delta files into a separate folder (thus also avoiding a snapshot containing the deltas):
http://archlinux.mirror.ac.za/delta/core/os/x86_64/kernel26-2.6.28.4-1-x86_6...

Thus, I could do the following (bash pseudocode):

curl http://archlinux.mirror.ac.za/delta/core/os/x86_64/ > tmpfile
grep $pkgname < tmpfile > listing
failed=false
cat listing | while read delta
do
    [ $pkgname-$currentpkgversion-$pkgarch.xd3.tar.gz *within* $delta ] && start=true
    if [ start=true ]
    then
        while read delta
        do
            wget http://archlinux.mirror.ac.za/delta/core/os/x86_64/$delta && applydelta $delta $curfile
            [ $output=$pkgname-$newpkgversion-$pkgarch.tar.gz ] && break
            curfile=`ls -rt | tail -n 1`
        done
    fi
    [ $output=$pkgname-$newpkgversion-$pkgarch.tar.gz ] && break
done
The above requires no db implementation at all and can work well even using the above very simple logic. And yes, by my own standards, the above is very bad bash pseudo-code. :P
Of the above, what is already implemented in pacman?
Everything is already implemented in pacman, with a more complex logic (which might be totally useless after all). For each package in a sync db, there is a deltas file besides the depends and desc ones, which basically contains the list of deltas for that package and their sizes. With this information, and the contents of the filecache, it computes the shortest path (in terms of download size) to the final package.

That logic applied to an example: if you have file v1 in your cache, you want to upgrade to v3, and there are three deltas for this package, v1tov2, v2tov3 and v1tov3. If v1tov2 + v2tov3 is smaller than v1tov3, it will download the first two deltas and apply them to get v3. Otherwise it will download the third one.

The problem with this implementation (besides being probably overkill) is that it requires information in the sync databases. So either it requires a big official effort to integrate this stuff and add deltas to all the official databases, or you have to do everything yourself: fully mirror the repository you want to add deltas to, generate deltas (maybe during mirror sync), add the deltas to your database, and then host everything somewhere (the packages + the deltas + the database with delta info).
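The real code computes a shortest path over all available delta chains; for the three-delta example above, the decision collapses to a single cost comparison, which can be sketched in shell (the sizes below are invented for illustration):

```shell
# Pick the cheaper route from v1 to v3, given delta sizes in bytes.
v1tov2=300000
v2tov3=250000
v1tov3=700000

chain=$((v1tov2 + v2tov3))
if [ "$chain" -lt "$v1tov3" ]; then
    echo "download v1tov2 + v2tov3 ($chain bytes)"
else
    echo "download v1tov3 ($v1tov3 bytes)"
fi
```

With these numbers the chained route wins (550000 < 700000 bytes); with more versions and cross-version deltas the same idea generalizes to a shortest-path search over the delta graph.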
Xavier wrote:
Everything is already implemented in pacman, with a more complex logic (which might be totally useless after all). For each package in a sync db, there is a deltas file besides the depends and desc ones, which basically contains the list of deltas for that package and their sizes. With this information, and the contents of the filecache, it computes the shortest path (in terms of download size) to the final package. That logic applied to an example: if you have file v1 in your cache, you want to upgrade to v3, and there are three deltas for this package, v1tov2, v2tov3 and v1tov3. If v1tov2 + v2tov3 is smaller than v1tov3, it will download the first two deltas and apply them to get v3. Otherwise it will download the third one.
The problem with this implementation (besides being probably overkill) is that it requires information in the sync databases. So either it requires a big official effort to integrate this stuff and add deltas to all the official databases, or you have to do everything yourself: fully mirror the repository you want to add deltas to, generate deltas (maybe during mirror sync), add the deltas to your database, and then host everything somewhere (the packages + the deltas + the database with delta info).
This makes a lot more sense to me now. Thank you for the clarification, Xavier. It is the most efficient way, end-user-wise, despite the possibly excessive metadata. It isn't necessarily efficient for the server. :/

Looking at the logistics, the best time to make the delta is right after the new .pkg.tar.(gz|bz2) is uploaded to the repo. I assume this is also about the time the db is updated. This could be implemented repo-wide as packages are updated and delta'd, without any individual package maker's direct involvement in the delta process: a "passive" change that won't need to change anyone's habits.

If you really want to be able to make lots of delta versions, i.e. v1tov2, v1tov3, v1tov4, v2tov3, v2tov4, v3tov4, then you'd probably have to keep at least 4 older (full) versions, which will take up a lot of disk space, or you'll need to regenerate all the other versions and use up a *lot* of IO / RAM / CPU during the generation of the new deltas.

If you only keep v1tov2, v2tov3, v3tov4, you only need to keep v4 and the 3 deltas. When v5 gets uploaded, you create v4tov5 and delete v4 from the server, thus saving disk space. This is much simpler and more implementable than the current "brief".

Mirror servers can mirror the old way, inefficiently, however they should mirror the deltas across too. I guess that the mirror servers use a lot less bandwidth from the official repository than the end users do.

The net result, I believe, is a much simpler implementation despite achieving 99% of the original brief's goal.

Your thoughts?
__________
Brendan Hide
On Mon, Feb 23, 2009 at 12:27 PM, Brendan Hide <brendan@swiftspirit.co.za> wrote:
This makes a lot more sense to me now. Thank you for the clarification, Xavier. It is the most efficient way, end-user-wise, despite the possibly-excessive metadata. It isn't necessarily efficient for the server. :/
Looking at the logistics, the best time to make the delta is after the new .pkg.tar.(gz|bz2) is uploaded to the repo. I assume this is also about the time the db is updated. This could be implemented repo-wide as packages are updated and delta'd without any individual package maker's direct involvement in the delta process - a "passive" change that won't need to change anyone's habits.
If you really want to be able to make lots of delta versions, i.e. v1tov2, v1tov3, v1tov4, v2tov3, v2tov4, v3tov4, then you'd probably have to keep at least 4 older (full) versions, which will take up a lot of disk space, or you'll need to regenerate all the other versions and use up a *lot* of IO / RAM / CPU during the generation of the new deltas.
If you only take v1tov2, v2tov3, v3tov4, you only need to keep v4 and the 3 deltas. When v5 gets uploaded, you create v4tov5 and delete v4 from the server thus saving disk space. This is much simpler and more implementable than the current "brief".
Mirror servers can mirror the old way, inefficiently, however they should mirror the deltas across too. I guess that the mirror servers use a lot less bandwidth from the official repository than the end users do.
The net result I believe is a much simpler implementation despite achieving 99% of the original brief's goal.
Your thoughts?
These were my first thoughts too, but here is how Garus answered them:
""
In a previous mail Xavier toyed with the idea of putting delta creation into repo-add. I have given this some thought; it seems nice in principle, but there are drawbacks. For Arch this would mean creating deltas on Gerolde, which seems to be fairly strained already, according to the dev list. Furthermore this introduces some new variables to repo-add (at least a repo location and an output location); this would be manageable, but doesn't look very nice.
""
http://www.archlinux.org/pipermail/pacman-dev/2008-November/007672.html
Xavier wrote:
how Garus answered them: ... For Arch this would mean creating deltas on Gerolde, which seems to be fairly strained already. ... http://www.archlinux.org/pipermail/pacman-dev/2008-November/007672.htm

Is Gerolde separate from the server that serves the FTP and HTTP traffic? If it is separate then I can't argue for the deltas' improvement on the server's performance. If it *is* the same server, then Garus's argument is illogical.

What else is Gerolde doing for Arch, and can it be moved to another server?
__________
Brendan Hide
On Mon, Feb 23, 2009 at 12:43 PM, Brendan Hide <brendan@swiftspirit.co.za> wrote:
Xavier wrote:
how Garus answered them: ... For Arch this would mean creating deltas on Gerolde, which seems to be fairly strained already. ... http://www.archlinux.org/pipermail/pacman-dev/2008-November/007672.htm
Is Gerolde separate from the server that serves the FTP and HTTP traffic? If it is separate then I can't argue for the deltas' improvement on the server's performance. If it *is* the same server, then Garus's argument is illogical.
What else is Gerolde doing for Arch and can it be moved to another server?
This is not the only problem. Another big problem is that it would require real interest and work from official developers, and this is clearly nonexistent :) For example, dbscripts would require some work as well: http://projects.archlinux.org/?p=dbscripts.git;a=tree
If these two problems can be fixed, we still have some technical issues about how adding deltas to a database with repo-add should work.
On Mon, Feb 23, 2009 at 5:58 AM, Xavier <shiningxc@gmail.com> wrote:
On Mon, Feb 23, 2009 at 12:43 PM, Brendan Hide <brendan@swiftspirit.co.za> wrote:
Xavier wrote:
how Garus answered them: ... For Arch this would mean creating deltas on Gerolde, which seems to be fairly strained already. ... http://www.archlinux.org/pipermail/pacman-dev/2008-November/007672.htm
Is Gerolde separate from the server that serves the FTP and HTTP traffic? If it is separate then I can't argue for the deltas' improvement on the server's performance. If it *is* the same server, then Garus's argument is illogical.
What else is Gerolde doing for Arch and can it be moved to another server?
Gerolde does everything - every service that has an archlinux.org domain name is hosted on gerolde (except ftp.archlinux.org). It can't be "moved to another server" because we don't have another and don't have the finances to get another at this time, nor do we have the manpower to maintain multiple servers.
This is not the only problem. Another big problem is that it would require real interest and work from official developers, and this is clearly nonexistent :) For example, dbscripts would require some work as well: http://projects.archlinux.org/?p=dbscripts.git;a=tree
I wouldn't say the interest is non-existent; it's just that the implementation is so complex at this point in time, and most of us are of the opinion that "bandwidth is cheap", so we go the easier route.

Questions which make the implementation complex:
* When do we generate deltas? As part of the db scripts?
* How long do we keep them? 10 previous versions? 5?
* How much additional space is this going to take? How do we set it up so that space-constrained mirrors can opt out of the deltas?

I'm sure there's more, but that's just "off the cuff". In my eyes, this is a complex change that doesn't really seem to benefit too many people. If you download 3 megs instead of 7, it's not that big of a deal, and it has so many more points of failure to contend with.
On Mon, Feb 23, 2009 at 6:48 PM, Aaron Griffin <aaronmgriffin@gmail.com> wrote:
Questions which make the implementation complex: * When do we generate deltas? As part of the db scripts?
Well I think that would be practical. When a new package is being added, grab the old one, generate a delta, and add it to the database. This could be doable.
* How long do we keep them? 10 previous versions? 5?
I would think 5 is more than enough. Allan suggested more complicated ways of cleaning deltas, but we could indeed just use a simple limit like that. There is still the problem of finding which are the 5 newest deltas to be kept.
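A naive version of that cleanup could simply order the deltas and keep the newest few. This sketch uses a hypothetical flat directory of *.delta files for one package, with dummy files and staggered modification times standing in for real version ordering (a proper implementation would sort by package version, not mtime):

```shell
# Create dummy delta files with staggered timestamps for illustration.
for i in 1 2 3 4 5 6 7; do
    touch -d "2009-02-0$i" "pkg-v${i}tov$((i + 1)).delta"
done

keep=5
# ls -t lists newest first; everything after the first $keep entries
# is a candidate for removal (here: the two oldest deltas).
ls -t ./*.delta | tail -n +$((keep + 1)) > removal-candidates.txt
cat removal-candidates.txt
```

The candidates would then be handed to repo-remove (and deleted from disk) rather than just unlinked, so the database entries stay consistent.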
* How much additional space is this going to take? How do we set it up so that space-constrained mirrors can opt-out of the deltas?
That's a very good question I didn't consider. But well, I didn't expect to figure out and answer all the problems alone. I know nothing about mirror setup. And it seems there are quite a few users interested in deltas, so maybe some could help by providing results about how much space it could take.
I'm sure there's more, but that's just "off the cuff". In my eyes, this is a complex change that doesn't really seem to benefit too many people. If you download 3megs instead of 7, it's not that big of a deal and has so many more points of failure to contend with.
The benefit can be much greater than that. I just wrote a quick hack that generates a delta for each package upgrade on my box and stores it in a database. The first package that came in:
2.8M openjdk6-1.4-2_to_1.4.1-1-x86_64.delta
67M openjdk6-1.4.1-1-x86_64.pkg.tar.gz
On a decent 1MB/s line, that's a 1 minute difference for a single package. But yes, it is clearly more complex and there are clearly many more points of failure.
On Thu, Feb 26, 2009 at 4:19 PM, Xavier <shiningxc@gmail.com> wrote:
On Mon, Feb 23, 2009 at 6:48 PM, Aaron Griffin <aaronmgriffin@gmail.com> wrote:
Questions which make the implementation complex: * When do we generate deltas? As part of the db scripts?
Well I think that would be practical. When a new package is being added, grab the old one, generate a delta, and add it to the database. This could be doable.
* How long do we keep them? 10 previous versions? 5?
I would think 5 is more than enough. Allan suggested more complicated ways of cleaning deltas, but we could indeed just use a simple limit like that. There is still the problem of finding which are the 5 newest deltas to be kept.
* How much additional space is this going to take? How do we set it up so that space-constrained mirrors can opt-out of the deltas?
That's a very good question I didn't consider. But well, I didn't expect to figure out and answer all the problems alone. I know nothing about mirror setup. And it seems there are quite a few users interested in deltas, so maybe some could help by providing results about how much space it could take.
I'm sure there's more, but that's just "off the cuff". In my eyes, this is a complex change that doesn't really seem to benefit too many people. If you download 3megs instead of 7, it's not that big of a deal and has so many more points of failure to contend with.
The benefit can be much greater than that. I just wrote a quick hack that generates a delta for each package upgrade on my box and stores it in a database. The first package that came in:
2.8M openjdk6-1.4-2_to_1.4.1-1-x86_64.delta
67M openjdk6-1.4.1-1-x86_64.pkg.tar.gz
On a decent 1MB/s line, that's a 1 minute difference for a single package.
But yes, it is clearly more complex and there are clearly many more points of failure.
So, ok, from a db-scripts point of view, we're going to have to do the following when a new package is added:
copy old package file from ftp to build dir
generate delta from old file -> new file (in staging)
add new pkg and delta to DB? add new delta info _somewhere_?
copy new pkg and delta to ftp

Is this correct? If so, it's not all THAT complex. Less so if repo-add could simply spit out the deltas on its own; if it can, we can simply add the logic to copy old packages to the build dir before calling repo-add, and repo-add realizes there's another package there and uses it for deltas.

Additionally, we run a cleanup script every few hours to remove old and/or unused packages; this logic would simply need to be changed to scan deltas and leave $RETAINED_DELTAS for each package.

I haven't been following the delta stuff too much. Can we put the deltas in a totally unrelated directory? Is there delta information stored in the pacman DB? If so, does the cleanup get far more complicated?
Aaron Griffin wrote:
Is there delta information stored in the pacman DB? If so, the cleanup gets far more complicated?
Delta information is stored in the repo database, so removing a delta is not a simple file delete. As Xavier pointed out, my proposal for removing deltas was slightly more complicated, but I am beginning to see the need for a script to clean the deltas up, and so I can use my more complicated removal system :). I think whether that script is part of repo-add, or repo-add calls it when adding/removing a delta, depends on how complicated the script gets. Anyway, a simple removal system based on the number of deltas would be fine for now, and more complicated stuff could be added later. Allan
On Thu, Feb 26, 2009 at 4:53 PM, Allan McRae <allan@archlinux.org> wrote:
Aaron Griffin wrote:
Is there delta information stored in the pacman DB? If so, the cleanup gets far more complicated?
Delta information is stored in the repo so removing them is not a simple delete. As Xavier pointed out, my proposal for removing deltas was slightly more complicated but I am beginning to see the need for a script to clean the deltas up - and so I can use my more complicated removal system :). I think whether that script is part of repo-add, or repo-add calls it when adding/removing a delta depends on how complicated the script gets.
Anyway, a simple removal system based on number of deltas would be fine for now and more complicated stuff could be added later.
So... if I hack at this, what would be the process to remove a delta? Delete the file and then remove a line from a db entry that matches the file?
Aaron Griffin wrote:
On Thu, Feb 26, 2009 at 4:53 PM, Allan McRae <allan@archlinux.org> wrote:
Aaron Griffin wrote:
Is there delta information stored in the pacman DB? If so, the cleanup gets far more complicated?
Delta information is stored in the repo so removing them is not a simple delete. As Xavier pointed out, my proposal for removing deltas was slightly more complicated but I am beginning to see the need for a script to clean the deltas up - and so I can use my more complicated removal system :). I think whether that script is part of repo-add, or repo-add calls it when adding/removing a delta depends on how complicated the script gets.
Anyway, a simple removal system based on number of deltas would be fine for now and more complicated stuff could be added later.
So... if I hack at this, what would be the process to remove a delta? Delete the file and then remove a line from a db entry that matches the file?
Remove the delta from the repo using e.g.:
repo-remove repo/test.db.tar.gz libx11-1.1.5-2_to_1.1.99.2-2-x86_64.delta
Allan
On Thu, Feb 26, 2009 at 11:34 PM, Aaron Griffin <aaronmgriffin@gmail.com> wrote:
So, ok, from a db-scripts point of view, we're going to have to do the following:
when a new package is added:
copy old package file from ftp to build dir
generate delta from old file -> new file (in staging)
$ oldfile=tzdata-2009a-1-x86_64.pkg.tar.gz
$ newfile=tzdata-2009b-1-x86_64.pkg.tar.gz
$ pkgdelta $oldfile $newfile
==> Generating delta from version 2009a-1 to version 2009b-1
==> Generated delta : 'tzdata-2009a-1_to_2009b-1-x86_64.delta'
add new pkg and delta to DB ? add new delta info _somewhere_?
$ repo-add repo.db.tar.gz $newfile $pkgname*.delta

Adding the package will create a tzdata-2009b-1 entry in the database. Adding the delta will create a tzdata-2009b-1/deltas file with only one line, containing the information about tzdata-2009a-1_to_2009b-1-x86_64.delta (supposing it was the first delta for tzdata added to the database).
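To visualize the result, here is a sketch of what the extracted database might look like after both additions. The desc/depends/deltas file names come from earlier in this thread; the extraction directory and the file contents are invented for illustration:

```shell
# Hypothetical layout of the extracted repo.db.tar.gz after both additions.
mkdir -p repo.db/tzdata-2009b-1
: > repo.db/tzdata-2009b-1/desc        # package description entry
: > repo.db/tzdata-2009b-1/depends     # dependency entry
echo 'tzdata-2009a-1_to_2009b-1-x86_64.delta' \
    > repo.db/tzdata-2009b-1/deltas    # one line per delta for this package
find repo.db -type f | sort
```

Each subsequent pkgdelta/repo-add round would append one more line to that deltas file, which is exactly the index the shortest-path logic reads on the client side.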
copy new pkg and delta to ftp
Is this correct? If so, it's not all THAT complex.
Yup it is correct.
Less so if repo-add could simply spit out the deltas on its own; if it can, we can simply add the logic to copy old packages to the build dir before calling repo-add, and repo-add realizes there's another package there and uses it for deltas.
Indeed, it should be possible to put that in repo-add, I seriously considered that option a while ago. But using pkgdelta and repo-add on delta files should be easy enough, as you can see above.
Additionally, we run a cleanup script every few hours to remove old and/or unused packages this logic would simply need to be changed to scan deltas and leave $RETAINED_DELTAS for each package.
I haven't been following the delta stuff too much, can we put the deltas in a totally unrelated directory? Is there delta information stored in the pacman DB? If so, the cleanup gets far more complicated?
If you can get a list of the delta filenames you want to remove, just pass them to repo-remove to remove their information from the pacman DB:
$ repo-remove repo.db.tar.gz *.delta
On Thu, Feb 26, 2009 at 5:21 PM, Xavier <shiningxc@gmail.com> wrote:
On Thu, Feb 26, 2009 at 11:34 PM, Aaron Griffin <aaronmgriffin@gmail.com> wrote:
So, ok, from a db-scripts point of view, we're going to have to do the following:
when a new package is added: copy old package file from ftp to build dir generate delta from old file -> new file (in staging)
$ oldfile=tzdata-2009a-1-x86_64.pkg.tar.gz
$ newfile=tzdata-2009b-1-x86_64.pkg.tar.gz
$ pkgdelta $oldfile $newfile
==> Generating delta from version 2009a-1 to version 2009b-1
==> Generated delta : 'tzdata-2009a-1_to_2009b-1-x86_64.delta'
...snip...
Indeed, it should be possible to put that in repo-add, I seriously considered that option a while ago. But using pkgdelta and repo-add on delta files should be easy enough, as you can see above.
Hmm, actually this isn't as straightforward as it seems as we always have the possibility that multiple pkg files exist (haven't been cleaned up yet), so we need to determine the previous file's full name. That's why I mentioned doing it in repo-add, as repo-add knows the entry in the DB already and can very easily generate a new delta before it removes the old entry. To do this in the db-scripts, we need to open the existing DB, find the entry for this pkgname, and get the filename. Does pkgdelta correctly construct deltas in different directories, or do we also need to copy the file next to the new package before running pkgdelta need to be next to each other?
On Thu, Feb 26, 2009 at 5:37 PM, Aaron Griffin <aaronmgriffin@gmail.com> wrote:
correctly construct deltas in different directories, or do we also need to copy the file next to the new package before running pkgdelta need to be next to each other?
Ignore the last "need to be next to each other", I reworded my sentence and didn't delete the old part :)
On Fri, Feb 27, 2009 at 12:37 AM, Aaron Griffin <aaronmgriffin@gmail.com> wrote:
Hmm, actually this isn't as straightforward as it seems as we always have the possibility that multiple pkg files exist (haven't been cleaned up yet), so we need to determine the previous file's full name. That's why I mentioned doing it in repo-add, as repo-add knows the entry in the DB already and can very easily generate a new delta before it removes the old entry.
Hmm ok. So how do you control repo-add delta creation: do we need a new -d/--delta flag for it, or an environment variable REPO_DELTA=1? And where should repo-add assume that the old file is: the current working directory, the same directory as the database, or the same directory as the package being added? And where should the newly created delta be put? Finally, db_remove_entry supports the odd corner case that there could be multiple entries for the same pkgname. How should delta creation deal with that: only pick the first entry, or try to generate a delta for each entry? It is this kind of question that bothers me the most, because I am never totally sure what the use case will be. You could be of great help here :)
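Just to make the first question concrete, here is one hypothetical way a -d/--delta switch and a REPO_DELTA environment override could coexist (a sketch only; none of these names exist in repo-add today, and real option handling would be more careful about quoting):

```shell
# Hypothetical: split repo-add's arguments into a delta flag and the
# remaining args. DELTA and ARGS are set as globals for the caller.
parse_repo_add_args() {
    DELTA=0
    ARGS=""
    local arg
    for arg in "$@"; do
        case $arg in
            -d|--delta|--deltas) DELTA=1 ;;
            # Simple string accumulation; a real script would preserve
            # "$@" or use an array to stay whitespace-safe.
            *) ARGS="$ARGS $arg" ;;
        esac
    done
    # Alternatively honor an environment variable, as asked above.
    [ "${REPO_DELTA:-0}" = "1" ] && DELTA=1
    return 0
}
```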
To do this in the db-scripts, we need to open the existing DB, find the entry for this pkgname, and get the filename. Does pkgdelta correctly construct deltas in different directories, or do we also need to copy the old file next to the new package before running pkgdelta?
The old package and new package can be anywhere. But the new delta will be put next to the new package. What would be the best output location for the new delta?
On Fri, Feb 27, 2009 at 6:18 AM, Xavier <shiningxc@gmail.com> wrote:
On Fri, Feb 27, 2009 at 12:37 AM, Aaron Griffin <aaronmgriffin@gmail.com> wrote:
On Thu, Feb 26, 2009 at 5:21 PM, Xavier <shiningxc@gmail.com> wrote:
On Thu, Feb 26, 2009 at 11:34 PM, Aaron Griffin <aaronmgriffin@gmail.com> wrote:
So, ok, from a db-scripts point of view, we're going to have to do the following:
when a new package is added: copy old package file from ftp to build dir generate delta from old file -> new file (in staging)
$ oldfile=tzdata-2009a-1-x86_64.pkg.tar.gz $ newfile=tzdata-2009b-1-x86_64.pkg.tar.gz $ pkgdelta $oldfile $newfile ==> Generating delta from version 2009a-1 to version 2009b-1 ==> Generated delta : 'tzdata-2009a-1_to_2009b-1-x86_64.delta' ...snip... Indeed, it should be possible to put that in repo-add, I seriously considered that option a while ago. But using pkgdelta and repo-add on delta files should be easy enough, as you can see above.
Hmm, actually this isn't as straightforward as it seems as we always have the possibility that multiple pkg files exist (haven't been cleaned up yet), so we need to determine the previous file's full name. That's why I mentioned doing it in repo-add, as repo-add knows the entry in the DB already and can very easily generate a new delta before it removes the old entry.
Hmm ok. So how do you control repo-add delta creation, do we need a new -d/--delta flag for it, or an environment variable REPO_DELTA=1 ? And where should repo-add assume that the oldfile is? Current working directory, same directory as the database, same directory as the package being added? And where should the newly created delta be put? Finally, db_remove_entry supports the odd corner case that there could be multiple entries for the same pkgname. How should delta creation deal with that? Only pick the first entry, or try to generate a delta for each entry?
It is this kind of question that bothers me the most, because I am never totally sure what the use case will be. You could be of great help here :)
Hmm, good questions. I was thinking that we would simply run "repo-add --deltas" or something, but it still requires us to figure out the old package file and copy that as well. That's the main issue here - determining the old/existing package file. I guess it's not TOO hard, the more I think about it. I'll just have to extract the DB and parse through it, unless someone else can figure out a cleaner way to get the filename for an old package from a DB knowing the new package's name (and version and all the other meta info)
To do this in the db-scripts, we need to open the existing DB, find the entry for this pkgname, and get the filename. Does pkgdelta correctly construct deltas in different directories, or do we also need to copy the file next to the new package before running pkgdelta need to be next to each other?
The old package and new package can be anywhere. But the new delta will be put next to the new package. What would be the best output location for the new delta?
That works fine. The way the dbscripts do operations is by:
a) locking the existing DB (with a lock file)
b) copying the db file to a temp dir
c) copying all staging files to that temp dir
d) adding everything to the db
e) copying the entire temp dir to the existing db location
f) unlocking the db

So putting deltas in-place works fine - I was just hoping we wouldn't have to figure out additional files to copy to the temp dir for deltas.
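The a)-f) flow above could be sketched as shell roughly like this (the lock filename, the paths, and the commented-out repo-add call are illustrative assumptions, not the real dbscripts):

```shell
# db_update <dbpath> <stagingdir>: rough sketch of the dbscripts flow.
db_update() {
    local dbpath=$1 staging=$2
    local lock="$dbpath/db.lck"
    local tmpdir

    # a) lock the existing DB with a lock file (noclobber create fails
    #    if the lock already exists)
    if ! (set -C; : > "$lock") 2>/dev/null; then
        echo "repo is locked" >&2
        return 1
    fi
    tmpdir=$(mktemp -d)

    # b) copy the db file to a temp dir
    cp "$dbpath"/*.db.tar.gz "$tmpdir"/ 2>/dev/null
    # c) copy all staging files (packages, and now deltas) there too
    cp "$staging"/* "$tmpdir"/ 2>/dev/null
    # d) add everything to the db -- elided in this sketch, roughly:
    #    repo-add "$tmpdir"/*.db.tar.gz "$tmpdir"/*.pkg.tar.gz
    # e) copy the entire temp dir back to the existing db location
    cp "$tmpdir"/* "$dbpath"/
    # f) unlock the db
    rm -rf "$tmpdir"
    rm -f "$lock"
}
```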
On Fri, Feb 27, 2009 at 4:55 PM, Aaron Griffin <aaronmgriffin@gmail.com> wrote:
Hmm, good questions. I was thinking that we would simply run "repo-add --deltas" or something, but it still requires us to figure out the old package file and copy that as well.
That's the main issue here - determining the old/existing package file. I guess it's not TOO hard, the more I think about it. I'll just have to extract the DB and parse through it, unless someone else can figure out a cleaner way to get the filename for an old package from a DB knowing the new package's name (and version and all the other meta info)
To do this in the db-scripts, we need to open the existing DB, find the entry for this pkgname, and get the filename. Does pkgdelta correctly construct deltas in different directories, or do we also need to copy the old file next to the new package before running pkgdelta?
The old package and new package can be anywhere. But the new delta will be put next to the new package. What would be the best output location for the new delta?
That works fine. The way the dbscripts do operations is by:
a) locking the existing DB (with a lock file)
b) copying the db file to a temp dir
c) copying all staging files to that temp dir
d) adding everything to the db
e) copying the entire temp dir to the existing db location
f) unlocking the db
So putting deltas in-place works fine - I was just hoping we wouldn't have to figure out additional files to copy to the temp dir for deltas
So, even if repo-add finds the old filename, which is easy to do, it is not going to know in which directory that old package actually sits. That's a big problem; I started to add delta generation to repo-add, but we have a show stopper here.

Here is the code I used to find the old filename:

pkgentry=$(find_pkgentry $pkgname)
if [ -n "$pkgentry" ]; then
    oldfilename=$(grep -A1 FILENAME $pkgentry/desc | tail -n1)
    oldfile="$(dirname $1)/$oldfilename"
fi

So here I assumed that the old package file sits in the same directory as the new package file (dirname $1), which is not the case in dbscripts.

find_pkgentry was already defined in repo-add, so that was practical to re-use:

find_pkgentry() {
    local pkgname=$1
    local pkgentry
    for pkgentry in $gstmpdir/$pkgname*; do
        name=${pkgentry##*/}
        if [ "${name%-*-*}" = "$pkgname" ]; then
            echo $pkgentry
            return 0
        fi
    done
    return 1
}
On Fri, Feb 27, 2009 at 10:14 AM, Xavier <shiningxc@gmail.com> wrote:
So, even if repo-add finds the old filename, which is easy to do, it is not going to know in which directory that old package actually sits. That's a big problem; I started to add delta generation to repo-add, but we have a show stopper here.
Here is the code I used to find the old filename:

pkgentry=$(find_pkgentry $pkgname)
if [ -n "$pkgentry" ]; then
    oldfilename=$(grep -A1 FILENAME $pkgentry/desc | tail -n1)
    oldfile="$(dirname $1)/$oldfilename"
fi
So here I assumed that the old package file sits in the same directory as the new package file (dirname $1), which is not the case in dbscripts.
find_pkgentry was already defined in repo-add so that was practical to re-use.
find_pkgentry() {
    local pkgname=$1
    local pkgentry
    for pkgentry in $gstmpdir/$pkgname*; do
        name=${pkgentry##*/}
        if [ "${name%-*-*}" = "$pkgname" ]; then
            echo $pkgentry
            return 0
        fi
    done
    return 1
}
Yeah, this stuff probably fits better in the dbscripts, due to this discrepancy. The reason we don't do these operations directly in the DB dir is so that we can fail without too much cleanup, and so we can prevent things from rsyncing when we are in the middle of an update.
On Fri, Feb 27, 2009 at 10:14 AM, Xavier <shiningxc@gmail.com> wrote:
find_pkgentry() {
    local pkgname=$1
    local pkgentry
    for pkgentry in $gstmpdir/$pkgname*; do
        name=${pkgentry##*/}
        if [ "${name%-*-*}" = "$pkgname" ]; then
            echo $pkgentry
            return 0
        fi
    done
    return 1
}
Is this doable without unpacking the DB? Perhaps using bsdtar -tf ?
On Fri, Feb 27, 2009 at 5:24 PM, Aaron Griffin <aaronmgriffin@gmail.com> wrote:
Is this doable without unpacking the DB? Perhaps using bsdtar -tf ?
Ah yeah, good idea :)

getfilename() {
    if [ $# -ne 2 ]; then
        return 1
    fi
    repo=$1
    pkgname=$2
    bsdtar tf $repo "$pkgname*/desc" 2>/dev/null | while read entry; do
        name=${entry%%/*}
        if [ "${name%-*-*}" = "$pkgname" ]; then
            bsdtar xOf $repo $entry | grep -A1 FILENAME | tail -n1
            return 0
        fi
    done
    return 1
}
Xavier wrote:
On Fri, Feb 20, 2009 at 8:04 AM, Allan McRae <allan@archlinux.org> wrote:
Would it be useful if I put xdelta3 into the repos to help testing things out for this?
Short answer : yes, I believe it would be nice to have xdelta3 in the repos <snip>
xdelta3 is now in the repos. It was really, really annoying to get the python modules working though... Allan
participants (6)
- Aaron Griffin
- Allan McRae
- Brendan Hide
- Henning Garus
- Nagy Gabor
- Xavier