Re: [pacman-dev] [arch-dev-public] doc size
On Wed, Dec 16, 2009 at 11:07 PM, Xavier <shiningxc@gmail.com> wrote:
The results are slightly different, mostly because makepkg overestimates the uncompressed size by not using --apparent-size with du, while namcap lotsofdocs computes the real size.
Now that I think about it, maybe the trick/hack I used in my first script would actually be a portable way to get the real uncompressed size:
  bsdtar tvf foo.pkg.tar.gz 2>/dev/null | awk '{ SUM += $5 } END { print SUM }'
But this is off-topic. I should dig up the makepkg bug report we had about this, which we probably rejected :P
http://bugs.archlinux.org/task/11225
I see, so we made a patch for repo-add: http://bugs.archlinux.org/task/11225?getfile=2429
I also made one for makepkg: http://bugs.archlinux.org/task/11225?getfile=2426
But Dan rejected it with this reason: "Note that I did not touch makepkg because our size there is not critical- a size to the nearest K is just fine, and switching to a find/stat way of doing it would cause all hard links to get double-counted."
It is a size to the nearest K for each file, and we accumulate the errors? I did not realize the difference was so big until today, when playing with docsize stuff.
e.g. for libsigc++-2.0:
  du -s                 = 12040 K
  du -s --apparent-size = 10723 K
So that's 1300K, and 12% error if I am not mistaken.
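The 12% figure can be checked with a little arithmetic, and the same comparison can be repeated on any directory. This is only a rough sketch: it assumes GNU du (--apparent-size is a GNU extension), and the function names rel_error_pct and size_bias are made up for illustration; they are not part of makepkg or namcap.

```shell
#!/bin/sh
# Relative overestimate, in percent, of the disk-usage figure against the
# apparent size. Both arguments are sizes in K.
rel_error_pct() {
    awk -v disk="$1" -v apparent="$2" \
        'BEGIN { printf "%.1f\n", (disk - apparent) * 100 / apparent }'
}

# Hypothetical helper: print both du figures for a directory.
# Assumes GNU du; BSD du does not know --apparent-size.
size_bias() {
    disk=$(du -sk "$1" | cut -f1)
    apparent=$(du -sk --apparent-size "$1" | cut -f1)
    echo "du -sk                 = $disk K"
    echo "du -sk --apparent-size = $apparent K"
}

# Figures quoted above for libsigc++-2.0
rel_error_pct 12040 10723    # prints 12.3
```

Taking the apparent size as the reference, 1317 K out of 10723 K is about 12.3%, which matches the estimate in the message.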
So.... anyone want to do the analysis of how much counting hardlinks twice biases the size versus how much bias there is using what we currently do?
As long as the bias is making the package appear bigger than it is and it is not orders of magnitude different, I really do not care that much.
Allan
On Fri, Dec 18, 2009 at 1:39 AM, Allan McRae <allan@archlinux.org> wrote:
So.... anyone want to do the analysis of how much counting hardlinks twice biases the size versus how much bias there is using what we currently do?
As long as the bias is making the package appear bigger than it is and it is not orders of magnitude different, I really do not care that much.
I had no idea about the number of hard links in packages; I thought there were not many. Note that it is not just twice: there can be one hundred hardlinks for the same file.
By running:
  find /usr -type f -printf "%p %n\n" | grep -v '1$'
I found one such extreme case, git :)
  /usr/lib/git-core/git-add 95
  /usr/lib/git-core/git-grep 95
  <93 others>
And well, even if git is an extreme case / exception (I don't know if it is, maybe), it is enough to make it necessary to handle hardlinks. Otherwise the size would be completely wrong, by a huge factor.
1. find . -exec stat -c %s '{}' ';' 2>/dev/null | awk '{sum+=$1} END {printf("%d\n", sum)}'
   104747 kB
2. du -sk --apparent-size .
   15233 kB
3. du -sk .
   16016 kB
4. bsdtar tvf git-1.6.5.6-1-x86_64.pkg.tar.gz 2>/dev/null | awk '{ SUM += $5 } END { print SUM }'
   15052 kB
So 1. clearly fails :D
However, the new bsdtar way I was proposing (4.) seems to work, because it apparently makes an arbitrary choice of one hardlink and shows all the other hardlinks as links to this one, with size 0.
  -rwxr-xr-x 0 root root 974232 Dec 12 23:12 usr/lib/git-core/git-rerere
  hrwxr-xr-x 0 root root 0 Dec 12 23:12 usr/lib/git-core/git-get-tar-commit-id link to usr/lib/git-core/git-rerere
  hrwxr-xr-x 0 root root 0 Dec 12 23:12 usr/lib/git-core/git-send-pack link to usr/lib/git-core/git-rerere
  etc
I do not know why I still get a size difference between 2 and 4. But at least 4 is more correct than 3.
But well, using bsdtar can look weird, and so can using the compressed archive to compute the uncompressed size. The metainfo files would also have to be excluded at this stage.
  bsdtar --exclude='.*' -tvf /home/pkg/git-1.6.5.6-1-x86_64.pkg.tar.gz
does not work because it also excludes usr/share/git/emacs/.gitignore
  bsdtar -tvf /home/pkg/git-1.6.5.6-1-x86_64.pkg.tar.gz 2>/dev/null | grep -v ' \..*$'
seems to work but it's getting quite ugly.
Anyway, let's go back to the beginning:
1) http://bugs.archlinux.org/task/10459
There was an easy fix for this: use different methods on different OSes, as we already do for various things. But we thought it was a bad idea, so we moved from du -b to the less accurate du -k.
2) http://bugs.archlinux.org/task/11225
Then we got a complaint, and we moved from du -k to an OS-specific stat for computing file size. But we kept du -k for computing dir size.
Well, we might as well just use an OS-specific 'du' to compute dir size too then... And maybe re-use 'du' for files too, like in the beginning, and kill stat.
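The gap between methods 1 and 2 comes from summing every path, so a file with 95 names is counted 95 times. One way to keep a find-based sum while counting each hard-linked file once is to key on device and inode number. What follows is an illustrative sketch only, not what makepkg does: it assumes GNU find (for -printf), and the function names naive_size_kb and unique_size_kb are invented here. It also counts regular files only, so it will not match du exactly, since du includes directories and symlinks.

```shell
#!/bin/sh
# Naive sum: every path is counted, so N hard links count N times.
# Assumes GNU find for -printf; %s is the file size in bytes.
naive_size_kb() {
    find "$1" -type f -printf '%s\n' |
        awk '{ sum += $1 } END { printf "%d\n", (sum + 1023) / 1024 }'
}

# De-duplicated sum: %D is the device number and %i the inode number,
# so each underlying file is added only the first time it is seen.
unique_size_kb() {
    find "$1" -type f -printf '%D:%i %s\n' |
        awk '!seen[$1]++ { sum += $2 } END { printf "%d\n", (sum + 1023) / 1024 }'
}
```

Both functions print a total in K, rounded up. On a tree such as the git package, the first should land near the inflated figure from method 1, while the second should stay close to the apparent size.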
Xavier wrote:
I do not know why I still get a size difference between 2 and 4. But at least 4 is more correct than 3.
Is this difference between 2. and 4. coming from rounding? As I pointed out earlier, an underestimate of the size is worse than an overestimate, so I actually prefer 3. even though it is more wrong...
Well, we might as well just use an OS-specific 'du' to compute dir size too then... And maybe re-use 'du' for files too, like in the beginning, and kill stat.
I do not mind OS specific stuff, as long as it is done during configure. I would readily accept a patch that does the correct thing in Linux/BSD/OSX/cygwin as long as there is no run-time detection (and to a lesser extent if the commands are not too different).
Allan
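To make the "done during configure" idea concrete, here is a rough sketch of choosing the file-size command once at build time and baking the result into the installed script. The function name pick_size_cmd and the variable SIZECMD are hypothetical, and the platform list is an assumption; the only facts relied on are that GNU stat prints the size with -c %s and BSD/OSX stat with -f %z.

```shell
#!/bin/sh
# Hypothetical build-time helper: map a system name (as printed by
# uname -s) to the stat invocation that prints a file size in bytes.
pick_size_cmd() {
    case "$1" in
        Linux|CYGWIN*|GNU*)
            echo 'stat -c %s' ;;    # GNU coreutils stat
        Darwin|*BSD|DragonFly)
            echo 'stat -f %z' ;;    # BSD stat
        *)
            return 1 ;;             # unknown platform: let the build fail
    esac
}

# Intended use at configure time (not at run time), for example:
#   SIZECMD=$(pick_size_cmd "$(uname -s)") || exit 1
# The chosen string would then be substituted into the generated script.
```

Because the choice is made once when the package is built, the script itself never has to probe the platform while running.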
participants (2)
- Allan McRae
- Xavier