[pacman-dev] [arch-dev-public] doc size

Xavier shiningxc at gmail.com
Fri Dec 18 11:15:30 EST 2009


On Fri, Dec 18, 2009 at 1:39 AM, Allan McRae <allan at archlinux.org> wrote:
> Xavier wrote:
>>
>> On Wed, Dec 16, 2009 at 11:07 PM, Xavier <shiningxc at gmail.com> wrote:
>>>
>>> The results are slightly different, mostly because makepkg
>>> overestimates uncompressed size by not using -a with du, while namcap
>>> lotsofdocs computes the real size.
>>>
>>> Now that I think about it, maybe the trick/hack I used in my first
>>> script would actually be a portable way to get the real uncompressed
>>> size :
>>> bsdtar tvf foo.pkg.tar.gz 2>/dev/null | awk '{ SUM += $5 } END { print SUM }'
>>>
>>> But this is offtopic, I should find again the makepkg bug report we
>>> had about this that we probably rejected :P
>>>
>>
>> http://bugs.archlinux.org/task/11225
>>
>> I see, so we made a patch for repo-add :
>> http://bugs.archlinux.org/task/11225?getfile=2429
>> I also made one for makepkg :
>> http://bugs.archlinux.org/task/11225?getfile=2426
>>
>> But Dan rejected it with this reason :
>> "Note that I did not touch makepkg because our size there is not
>> critical- a size to the nearest K is just fine, and switching to a
>> find/stat way of doing it would cause all hard links to get
>> double-counted."
>>
>> Is it a size to the nearest K for each file, so that we accumulate the
>> errors?
>> I did not realize the difference was so big until today, when playing
>> with docsize stuff.
>>
>> e.g. for libsigc++-2.0 :
>> du -s = 12040 K
>> du -s --apparent-size = 10723 K
>>
>> So that's about 1300K, or a 12% error if I am not mistaken.
>
> So....   anyone want to do the analysis of how much counting hardlinks twice
> biases the size versus how much bias there is using what we currently do?
>
> As long as the bias is making the package appear bigger than it is and it is
> not orders of magnitude different, I really do not care that much.
>

I had no idea how many hard links there are in packages; I thought
there were not many.
Note that it is not just twice: there can be a hundred hardlinks to
the same file.

By running : find /usr -type f -printf "%p %n\n" | grep -v ' 1$'
I found one such extreme case : git :)
/usr/lib/git-core/git-add 95
/usr/lib/git-core/git-grep 95
<93 others>

And well, even if git is an extreme case (I don't know whether it is an
exception), it is enough to make handling hardlinks necessary.
Otherwise the size would be wrong by almost an order of magnitude.
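
As a side note, the same search can be written without the grep filter. This is only a sketch: it assumes GNU find (for -printf), and the function name list_hardlinked is made up for illustration.

```shell
# List regular files under a directory that have more than one hard link.
# Assumption: GNU find is available (-printf is a GNU extension;
# -links +1 itself is standard find syntax).
list_hardlinked() {
    # %n = link count, %i = inode number, %p = path.
    # Sorting by inode groups all names of the same file together.
    find "$1" -type f -links +1 -printf '%n %i %p\n' | sort -k2,2n -k3
}
```

Running list_hardlinked /usr should show the git-core entries grouped under one inode, each with the same link count.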

1. find . -exec stat -c %s '{}' ';' 2>/dev/null | awk '{sum+=$1} END {printf("%d\n", sum)}'
104747 kB
2. du -sk --apparent-size  .
15233 kB
3. du -sk  .
16016 kB
4. bsdtar tvf git-1.6.5.6-1-x86_64.pkg.tar.gz 2>/dev/null | awk '{ SUM += $5 } END { print SUM }'
15052 kB

So 1. clearly fails :D
However, the new bsdtar way I was proposing (4.) seems to work, because
it apparently makes an arbitrary choice of one hardlink, and shows all
the other hardlinks as links to this one, with size 0.
-rwxr-xr-x  0 root   root   974232 Dec 12 23:12 usr/lib/git-core/git-rerere
hrwxr-xr-x  0 root   root        0 Dec 12 23:12 usr/lib/git-core/git-get-tar-commit-id link to usr/lib/git-core/git-rerere
hrwxr-xr-x  0 root   root        0 Dec 12 23:12 usr/lib/git-core/git-send-pack link to usr/lib/git-core/git-rerere
etc

I do not know why I still get a size difference between 2 and 4. But
at least 4 is more correct than 3.
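
For comparison, the find-based approach in 1. could be made hardlink-safe by counting each inode only once. The following is a rough sketch, not what makepkg does: it assumes GNU find, and the name real_size_bytes is hypothetical.

```shell
# Sum file sizes in bytes under a directory, counting each inode only
# once so that hard links are not double-counted.
# Assumption: GNU find (%D = device number, %i = inode, %s = size in bytes).
real_size_bytes() {
    find "$1" -type f -printf '%D:%i %s\n' |
        awk '!seen[$1]++ { sum += $2 } END { printf("%d\n", sum) }'
}
```

This only counts regular files. du also adds the size of the directories themselves, while tar lists directories with size 0, which may be one contributor to the gap between 2 and 4.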

But well, using bsdtar can look weird, and so can using the compressed
archive to compute the uncompressed size. Also, the metainfo files would
have to be excluded at this stage.
bsdtar --exclude='.*' -tvf /home/pkg/git-1.6.5.6-1-x86_64.pkg.tar.gz
does not work because it also excludes usr/share/git/emacs/.gitignore
bsdtar -tvf /home/pkg/git-1.6.5.6-1-x86_64.pkg.tar.gz 2>/dev/null | grep -v ' \..*$'
seems to work but it's getting quite ugly.
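
A possibly less ugly variant is to do the filtering and the summing in a single awk call. This is a sketch under assumptions: the path is field 9 and the size field 5, as in the listing above, archive paths have no leading "./", and sum_listing_bytes is a made-up name.

```shell
# Read `bsdtar -tvf` output on stdin and print the summed size in bytes,
# skipping only top-level dot entries (such as .PKGINFO).
# Assumption: field 5 is the size and field 9 is where the path starts.
sum_listing_bytes() {
    awk '$9 !~ /^\./ { sum += $5 } END { printf("%d\n", sum) }'
}
```

It would be used as: bsdtar -tvf foo.pkg.tar.gz 2>/dev/null | sum_listing_bytes. Nested dotfiles such as usr/share/git/emacs/.gitignore are kept because only the start of the path is tested.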

Anyway, let's go back to the beginning :
1) http://bugs.archlinux.org/task/10459
There was an easy fix for this : use different methods on different
OSes, as we already do for various things. But we thought it was a bad
idea, so we moved from du -b to the less accurate du -k.
2) http://bugs.archlinux.org/task/11225
Then we got a complaint, and we moved from du -k to an OS-specific stat
for computing file size, but we kept du -k for computing dir size.

Well, we might as well just use an OS-specific 'du' to compute dir size
too then ...
And maybe go back to using 'du' for files too, like in the beginning,
and kill stat.
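
A rough sketch of what such an OS-specific 'du' could look like. The assumptions here are that FreeBSD's du accepts -A for apparent size and that everything else has GNU du. Other systems would need their own branch, and dir_size_k is a hypothetical name.

```shell
# Print the apparent size of a directory tree in K, picking the du flags
# per OS. du counts hard-linked files once by default.
# Assumptions: FreeBSD du supports -A; the fallback branch assumes GNU du.
dir_size_k() {
    case "$(uname -s)" in
        FreeBSD) du -skA "$1" ;;
        *)       du -sk --apparent-size "$1" ;;
    esac | awk '{ print $1 }'
}
```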

