[pacman-dev] [PATCH] pkgdelta: use highest compression ratio when creating deltas with xdelta3
Looking at how pkgdelta works, I found this line: xdelta3 -q -f -s "$oldfile" "$newfile" "$deltafile" || ret=$? which seems to be responsible for the actual delta generation. However, man xdelta3 revealed that there are different compression levels (0-9); I am not sure which one is the default. To make it short: we could have had smaller deltas since pkgdelta was introduced!

Examples:

-9:      16660K  blender-12:2.69.c7ac0e-1_to_13:2.69.13290d-1-x86_64.delta
default: 17832K  blender-12:2.69.c7ac0e-1_to_13:2.69.13290d-1-x86_64.delta

-9:       504K   xbmc-12.3-10_to_12.3-11-x86_64.delta
default:  572K   xbmc-12.3-10_to_12.3-11-x86_64.delta

The attached patch adds the "-9" option to the line above.

Side note: it might be even more advantageous to use bsdiff instead of xdelta3. Comparing the /usr/bin/blender binaries of the above versions (12:2.69.c7ac0e-1 and 13:2.69.13290d-1):

xdelta3      10.4M
xdelta3 -9    9.9M
bsdiff        4.7M

However, bsdiff only does a direct diff of two files: bsdiff foo1.pkg.tar.xz foo2.pkg.tar.xz will create a diff between the archives and not their contents, so that would require further coding which I am unable to do (maybe someone else volunteers :) ).

Kind regards,
Matthias
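For reference, the patched line boils down to adding -9 to that call (a sketch only; the exact option placement in the committed patch may differ):

    # -9 selects xdelta3's highest compression level (levels range from 0 to 9)
    xdelta3 -9 -q -f -s "$oldfile" "$newfile" "$deltafile" || ret=$?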
On 06/03/14 09:25, Matthias Krüger wrote:
Looking at how pkgdelta works, I found this line: xdelta3 -q -f -s "$oldfile" "$newfile" "$deltafile" || ret=$? which seems to be responsible for the actual delta generation. However, man xdelta3 revealed that there are different compression levels (0-9); I am not sure which one is the default. To make it short: we could have had smaller deltas since pkgdelta was introduced!
Examples:
-9:      16660K  blender-12:2.69.c7ac0e-1_to_13:2.69.13290d-1-x86_64.delta
default: 17832K  blender-12:2.69.c7ac0e-1_to_13:2.69.13290d-1-x86_64.delta

-9:       504K   xbmc-12.3-10_to_12.3-11-x86_64.delta
default:  572K   xbmc-12.3-10_to_12.3-11-x86_64.delta
How is memory usage changed? Mainly when regenerating the package from deltas?
The attached patch adds the "-9" option to the line above.
Side note: it might be even more advantageous to use bsdiff instead of xdelta3. Comparing the /usr/bin/blender binaries of the above versions (12:2.69.c7ac0e-1 and 13:2.69.13290d-1):

xdelta3      10.4M
xdelta3 -9    9.9M
bsdiff        4.7M

However, bsdiff only does a direct diff of two files: bsdiff foo1.pkg.tar.xz foo2.pkg.tar.xz will create a diff between the archives and not their contents, so that would require further coding which I am unable to do (maybe someone else volunteers :) ).
What do you mean by "create a diff between the archives and not their contents"? That does seem a significant decrease.

Allan
On 03/06/2014 12:33 AM, Allan McRae wrote:
On 06/03/14 09:25, Matthias Krüger wrote:
Looking at how pkgdelta works, I found this line: xdelta3 -q -f -s "$oldfile" "$newfile" "$deltafile" || ret=$? which seems to be responsible for the actual delta generation. However, man xdelta3 revealed that there are different compression levels (0-9); I am not sure which one is the default. To make it short: we could have had smaller deltas since pkgdelta was introduced!
Examples:
-9:      16660K  blender-12:2.69.c7ac0e-1_to_13:2.69.13290d-1-x86_64.delta
default: 17832K  blender-12:2.69.c7ac0e-1_to_13:2.69.13290d-1-x86_64.delta

-9:       504K   xbmc-12.3-10_to_12.3-11-x86_64.delta
default:  572K   xbmc-12.3-10_to_12.3-11-x86_64.delta
How is memory usage changed? Mainly when regenerating the package from deltas?

Surprisingly, for blender both runs took ~96MB and 1:50m (+/- a second).
The attached patch adds the "-9" option to the line above.
Side note: it might be even more advantageous to use bsdiff instead of xdelta3. Comparing the /usr/bin/blender binaries of the above versions (12:2.69.c7ac0e-1 and 13:2.69.13290d-1):

xdelta3      10.4M
xdelta3 -9    9.9M
bsdiff        4.7M

However, bsdiff only does a direct diff of two files: bsdiff foo1.pkg.tar.xz foo2.pkg.tar.xz will create a diff between the archives and not their contents, so that would require further coding which I am unable to do (maybe someone else volunteers :) ).

What do you mean by "create a diff between the archives and not their contents"? That does seem a significant decrease.

To my understanding, xdelta opens the archive and diffs the files contained in it, while bsdiff will just generate a diff between the archives. At least that would fit the observations:
xdelta ${args} blender-12:2.69.c7ac0e-1-x86_64.pkg.tar.xz blender-13:2.69.13290d-1-x86_64.pkg.tar.xz out  =>  17 M
bsdiff blender-12:2.69.c7ac0e-1-x86_64.pkg.tar.xz blender-13:2.69.13290d-1-x86_64.pkg.tar.xz out          =>  37 M
(while creating this diff, bsdiff ate up to ~660M of RAM :/)

From man bsdiff: "bsdiff uses memory equal to 17 times the size of ⟨oldfile⟩, and requires an absolute minimum working set size of 8 times the size of oldfile."

Regards,
Matthias
On 06/03/14 10:14, Matthias Krüger wrote:
On 03/06/2014 12:33 AM, Allan McRae wrote:
On 06/03/14 09:25, Matthias Krüger wrote:
Looking at how pkgdelta works, I found this line: xdelta3 -q -f -s "$oldfile" "$newfile" "$deltafile" || ret=$? which seems to be responsible for the actual delta generation. However, man xdelta3 revealed that there are different compression levels (0-9); I am not sure which one is the default. To make it short: we could have had smaller deltas since pkgdelta was introduced!
Examples:
-9:      16660K  blender-12:2.69.c7ac0e-1_to_13:2.69.13290d-1-x86_64.delta
default: 17832K  blender-12:2.69.c7ac0e-1_to_13:2.69.13290d-1-x86_64.delta

-9:       504K   xbmc-12.3-10_to_12.3-11-x86_64.delta
default:  572K   xbmc-12.3-10_to_12.3-11-x86_64.delta
How is memory usage changed? Mainly when regenerating the package from deltas?

Surprisingly, for blender both runs took ~96MB and 1:50m (+/- a second).
I'm assuming that is because most of the time/memory is used applying the diff rather than decompressing it, given that the delta itself is quite small.

I know memory usage and speed were issues when adding -9 to package compression was considered. My concern is that this also applies to deltas. However, given that reconstructing a package from a delta is reasonably computationally intensive, I'm not sure this restriction applies here.

Anyone else care to comment?

A
On Thu, Mar 6, 2014 at 1:35 AM, Allan McRae <allan@archlinux.org> wrote:
I'm assuming that is because most of the time/memory is used applying the diff rather than decompressing it, given that the delta itself is quite small.
I know memory usage and speed were issues when adding -9 to package compression was considered. My concern is that this also applies to deltas. However, given that reconstructing a package from a delta is reasonably computationally intensive, I'm not sure this restriction applies here.
Anyone else care to comment?
I've been running repo-add -d with XDELTA="-9 -S djw" to tease out better ratios. That said, with the current system, delta + XZ is kinda suboptimal, since applying the delta requires recompressing the package and XZ's compression times are so awful.
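If I read that right, it amounts to something like the following (purely illustrative; it assumes the XDELTA variable is picked up from the environment by pkgdelta/repo-add, which may instead require a local patch, and the repo name is a placeholder):

    # hypothetical invocation, based on the description above
    XDELTA="-9 -S djw" repo-add -d myrepo.db.tar.gz *.pkg.tar.xz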
On 03/06/2014 01:14 AM, Matthias Krüger wrote:
On 03/06/2014 12:33 AM, Allan McRae wrote:
On 06/03/14 09:25, Matthias Krüger wrote:
Looking at how pkgdelta works, I found this line: xdelta3 -q -f -s "$oldfile" "$newfile" "$deltafile" || ret=$? which seems to be responsible for the actual delta generation. However, man xdelta3 revealed that there are different compression levels (0-9); I am not sure which one is the default. To make it short: we could have had smaller deltas since pkgdelta was introduced!
Examples:
-9:      16660K  blender-12:2.69.c7ac0e-1_to_13:2.69.13290d-1-x86_64.delta
default: 17832K  blender-12:2.69.c7ac0e-1_to_13:2.69.13290d-1-x86_64.delta

-9:       504K   xbmc-12.3-10_to_12.3-11-x86_64.delta
default:  572K   xbmc-12.3-10_to_12.3-11-x86_64.delta
How is memory usage changed? Mainly when regenerating the package from deltas?
Surprisingly, for blender both runs took ~96MB and 1:50m (+/- a second).

Mmh, just to clear up a potential misunderstanding: "both runs" means decompression of the normal and of the -9-compressed delta. Compression took around 140M and ~16s for the normal level, and 236M and ~26s for -9.
On 06/03/14 09:25, Matthias Krüger wrote:
Side note: it might be even more advantageous to use bsdiff instead of xdelta3. Comparing the /usr/bin/blender binaries of the above versions (12:2.69.c7ac0e-1 and 13:2.69.13290d-1):

xdelta3      10.4M
xdelta3 -9    9.9M
bsdiff        4.7M
I took a look, and changing from xdelta3 to bsdiff would be very simple. It looks like it is a five minute patch...

But what I need is for someone to generate deltas (with and without -9, maybe) for a whole bunch of packages, then generate diffs using bsdiff and compare the results. The comparison will need to include:

1) size of deltas/diffs
2) memory used when reconstructing the package
3) time taken to reconstruct the package

Once we have that information, we can make an informed decision.

Allan
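A rough sketch of how such a comparison could be scripted (illustrative only; the package file names are placeholders, and /usr/bin/time -v is GNU time's verbose mode, which reports peak memory as "Maximum resident set size"):

    # compare delta size, reconstruction time and peak memory, with and without -9
    for opt in "" "-9"; do
        xdelta3 $opt -q -f -s old.pkg.tar.xz new.pkg.tar.xz pkg.delta
        ls -l pkg.delta                                     # 1) size of the delta
        /usr/bin/time -v xdelta3 -d -q -f -s old.pkg.tar.xz pkg.delta rebuilt.pkg.tar.xz \
            2>&1 | grep -E 'Maximum resident|Elapsed'       # 2) memory and 3) time to reconstruct
    done

The same loop could be repeated with bsdiff/bspatch once there is a way to run them on the package contents.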
On 13/03/14 08:11, Allan McRae wrote:
On 06/03/14 09:25, Matthias Krüger wrote:
Side note: it might be even more advantageous to use bsdiff instead of xdelta3. Comparing the /usr/bin/blender binaries of the above versions (12:2.69.c7ac0e-1 and 13:2.69.13290d-1):

xdelta3      10.4M
xdelta3 -9    9.9M
bsdiff        4.7M
I took a look, and changing from xdelta3 to bsdiff would be very simple. It looks like it is a five minute patch...
But what I need is for someone to generate deltas (with and without -9 maybe) for a whole bunch of packages. Then generate diffs using bsdiff and compare the results. The comparison will need to include:
1) size of deltas/diffs
2) memory used when reconstructing the package
3) time taken to reconstruct the package.
Once we have that information, we can make an informed decision.
Allan
Hi Allan and Matthias,

Related: one of the possibilities we saw when originally researching performance was gzip --rsyncable, which I was using on CentOS to compress files for delta'd backups (completely unrelated to repo management, of course). The --rsyncable flag allows avoiding the decompress/recompress steps entirely, but it is not as good at reducing the total disk/network usage. The --rsyncable flag is not included in the gzip package, however. See https://bbs.archlinux.org/viewtopic.php?pid=496539

Unfortunately, if I do the testing you've mentioned from our mirror server, I'm not sure the performance results would be relevant, as our mirror server is now running OpenIndiana. :-/

--
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97
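For illustration, the workflow that flag enables looks roughly like this (a sketch; it assumes a gzip build carrying the --rsyncable patch, which the stock gzip package does not have, and the file and host names are made up):

    # compress with rsync-friendly block resets, then let rsync transfer only changed blocks
    gzip --rsyncable -c backup.tar > backup.tar.gz
    rsync -av backup.tar.gz backuphost:/backups/

The trade-off, as noted above, is a somewhat larger compressed file in exchange for skipping the decompress/recompress round trip.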
On 03/13/2014 07:11 AM, Allan McRae wrote:
On 06/03/14 09:25, Matthias Krüger wrote:
Side note: it might be even more advantageous to use bsdiff instead of xdelta3. Comparing the /usr/bin/blender binaries of the above versions (12:2.69.c7ac0e-1 and 13:2.69.13290d-1):

xdelta3      10.4M
xdelta3 -9    9.9M
bsdiff        4.7M
I took a look, and changing from xdelta3 to bsdiff would be very simple. It looks like it is a five minute patch...
But what I need is for someone to generate deltas (with and without -9 maybe) for a whole bunch of packages. Then generate diffs using bsdiff and compare the results. The comparison will need to include:
1) size of deltas/diffs
2) memory used when reconstructing the package
3) time taken to reconstruct the package.
Once we have that information, we can make an informed decision.
Allan
I got some numbers for xdelta (-9), see attached file. If someone provides some script or tool to run bsdiff on a package properly (so that it does not diff the archives themselves), I can offer to compute the numbers for bsdiff as well.
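A rough sketch of what such a manual run could look like (illustrative only; the patch file names are made up, and bsdiff/bspatch are used as described in their man pages, so this diffs the uncompressed tarballs rather than the individual files inside them):

    # decompress both packages, keeping the originals (-k)
    xz -dk blender-12:2.69.c7ac0e-1-x86_64.pkg.tar.xz
    xz -dk blender-13:2.69.13290d-1-x86_64.pkg.tar.xz
    # diff the package contents (the plain tarballs) rather than the .xz archives
    bsdiff blender-12:2.69.c7ac0e-1-x86_64.pkg.tar blender-13:2.69.13290d-1-x86_64.pkg.tar blender.bsdiff
    # reconstruction needs bspatch plus an xz recompression step
    bspatch blender-12:2.69.c7ac0e-1-x86_64.pkg.tar blender-rebuilt.pkg.tar blender.bsdiff
    xz blender-rebuilt.pkg.tar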
On 15/03/14 13:12, Matthias Krüger wrote:
On 03/13/2014 07:11 AM, Allan McRae wrote:
On 06/03/14 09:25, Matthias Krüger wrote:
Side note: it might be even more advantageous to use bsdiff instead of xdelta3. Comparing the /usr/bin/blender binaries of the above versions (12:2.69.c7ac0e-1 and 13:2.69.13290d-1):

xdelta3      10.4M
xdelta3 -9    9.9M
bsdiff        4.7M
I took a look, and changing from xdelta3 to bsdiff would be very simple. It looks like it is a five minute patch...
But what I need is for someone to generate deltas (with and without -9 maybe) for a whole bunch of packages. Then generate diffs using bsdiff and compare the results. The comparison will need to include:
1) size of deltas/diffs
2) memory used when reconstructing the package
3) time taken to reconstruct the package.
Once we have that information, we can make an informed decision.
Allan
I got some numbers for xdelta (-9), see attached file. If someone provides some script or tool to run bsdiff on a package properly (so that it does not diff the archives themselves), I can offer to compute the numbers for bsdiff as well.
The numbers for -9 look like there is no significant change. A quick look here also showed no change from adding -S djw. I'll accept a patch adding both -S djw and -9 to the diff creation.

I ran some tests on my system. bsdiff uses masses of memory when reconstructing the file (needs ~16x the size of the file). It can use less, but the performance penalty is massive. And it uses even more when creating the diff. Coupled with its lack of transparent decompression, I don't think we should consider that further.

Allan
On 03/16/2014 04:34 AM, Allan McRae wrote:
On 15/03/14 13:12, Matthias Krüger wrote:
On 03/13/2014 07:11 AM, Allan McRae wrote:
On 06/03/14 09:25, Matthias Krüger wrote:
Side note: it might be even more advantageous to use bsdiff instead of xdelta3. Comparing the /usr/bin/blender binaries of the above versions (12:2.69.c7ac0e-1 and 13:2.69.13290d-1):

xdelta3      10.4M
xdelta3 -9    9.9M
bsdiff        4.7M

I took a look, and changing from xdelta3 to bsdiff would be very simple. It looks like it is a five minute patch...
But what I need is for someone to generate deltas (with and without -9 maybe) for a whole bunch of packages. Then generate diffs using bsdiff and compare the results. The comparison will need to include:
1) size of deltas/diffs
2) memory used when reconstructing the package
3) time taken to reconstruct the package.
Once we have that information, we can make an informed decision.
Allan
I got some numbers for xdelta (-9), see attached file. If someone provides some script or tool to run bsdiff on a package properly (so that it does not diff the archives themselves), I can offer to compute the numbers for bsdiff as well.

The numbers for -9 look like there is no significant change. A quick look here also showed no change from adding -S djw. I'll accept a patch adding both -S djw and -9 to the diff creation.
I ran some tests on my system. bsdiff uses masses of memory when reconstructing the file (needs ~16x the size of the file). It can use less, but the performance penalty is massive. And it uses even more when creating the diff. Coupled with its lack of transparent decompression, I don't think we should consider that further.
Allan

Uhmm, according to the bsdiff website: "bsdiff is quite memory-hungry. It requires max(17*n, 9*n+m)+O(1) bytes of memory, where n is the size of the old file and m is the size of the new file. bspatch requires n+m+O(1) bytes."

I did a quick test: generating a 40 MB newfile (a binary executable) from a 40 MB oldfile and a 4 kB patch was done very quickly; ps_mem said bspatch needed around 98.2 MB the one time I managed to actually measure it.
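As a rough sanity check, those formulas line up with the numbers seen in this thread (the archive size below is inferred from the memory figure, not measured):

    bspatch:  n + m             ≈ 40 MB + 40 MB  ≈  80 MB   (measured ~98.2 MB, i.e. that plus some overhead)
    bsdiff:   max(17*n, 9*n+m)  ≈ 17 * ~39 MB    ≈ 660 MB   (matches the ~660M seen earlier when diffing the .pkg.tar.xz archives)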
Fedora seems to have a deltarpm tool (https://gitorious.org/deltarpm) which seems to make use of bsdiff; maybe this can be tweaked to be used for Arch packages?

Regards,
Matthias
On 21/03/14 12:28, Matthias Krüger wrote:
On 03/16/2014 04:34 AM, Allan McRae wrote:
On 15/03/14 13:12, Matthias Krüger wrote:
On 03/13/2014 07:11 AM, Allan McRae wrote:
On 06/03/14 09:25, Matthias Krüger wrote:
Side note: it might be even more advantageous to use bsdiff instead of xdelta3. Comparing the /usr/bin/blender binaries of the above versions (12:2.69.c7ac0e-1 and 13:2.69.13290d-1):

xdelta3      10.4M
xdelta3 -9    9.9M
bsdiff        4.7M

I took a look, and changing from xdelta3 to bsdiff would be very simple. It looks like it is a five minute patch...
But what I need is for someone to generate deltas (with and without -9 maybe) for a whole bunch of packages. Then generate diffs using bsdiff and compare the results. The comparison will need to include:
1) size of deltas/diffs
2) memory used when reconstructing the package
3) time taken to reconstruct the package.
Once we have that information, we can make an informed decision.
Allan
I got some numbers for xdelta (-9), see attached file. If someone provides some script or tool to run bsdiff on a package properly (so that it does not diff the archives themselves), I can offer to compute the numbers for bsdiff as well.

The numbers for -9 look like there is no significant change. A quick look here also showed no change from adding -S djw. I'll accept a patch adding both -S djw and -9 to the diff creation.
I ran some tests on my system. bsdiff uses masses of memory when reconstructing the file (needs ~16x the size of the file). It can use less, but the performance penalty is massive. And it uses even more when creating the diff. Coupled with its lack of transparent decompression, I don't think we should consider that further.
Allan

Uhmm, according to the bsdiff website: "bsdiff is quite memory-hungry. It requires max(17*n, 9*n+m)+O(1) bytes of memory, where n is the size of the old file and m is the size of the new file. bspatch requires n+m+O(1) bytes."

I did a quick test: generating a 40 MB newfile (a binary executable) from a 40 MB oldfile and a 4 kB patch was done very quickly; ps_mem said bspatch needed around 98.2 MB the one time I managed to actually measure it.
Looks like I got my numbers mixed up between creating and applying the diff.
Fedora seems to have a deltarpm tool (https://gitorious.org/deltarpm) which seems to make use of bsdiff; maybe this can be tweaked to be used for Arch packages?
Sure. If someone comes up with a tool that takes two packages, creates the diff, can reconstruct the new package from our package files (compressed tarballs), and uses bsdiff as its backend, then I will consider it. It looks like some of that could be guided by deltarpm. But as bsdiff stands, we would need to manually decompress the packages before making the diff, and before and after reconstructing the package. bsdiff is not worth further consideration without someone putting in that work.

For the moment, I will take a patch adding "-S djw -9" to our delta creation script.

Allan
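For reference, the resulting xdelta3 invocation would look roughly like this (a sketch; option order and the exact committed line may differ):

    # -9: highest compression level; -S djw: use the djw secondary compressor
    xdelta3 -9 -S djw -q -f -s "$oldfile" "$newfile" "$deltafile" || ret=$?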
participants (4):
- Allan McRae
- Brendan Hide
- Jan Alexander Steffens
- Matthias Krüger