[pacman-dev] [arch-dev-public] doc size
Allan McRae
allan at archlinux.org
Fri Dec 18 17:25:53 EST 2009
Xavier wrote:
> On Fri, Dec 18, 2009 at 1:39 AM, Allan McRae <allan at archlinux.org> wrote:
>> Xavier wrote:
>>> On Wed, Dec 16, 2009 at 11:07 PM, Xavier <shiningxc at gmail.com> wrote:
>>>> The results are slightly different, mostly because makepkg
>>>> overestimates uncompressed size by not using -a with du, while namcap
>>>> lotsofdocs computes the real size.
>>>>
>>>> Now that I think about it, maybe the trick/hack I used in my first
>>>> script would actually be a portable way to get the real uncompressed
>>>> size :
>>>> bsdtar tvf foo.pkg.tar.gz 2>/dev/null | awk '{ SUM += $5 } END { print SUM }'
>>>>
>>>> But this is offtopic, I should find again the makepkg bug report we
>>>> had about this that we probably rejected :P
>>>>
>>> http://bugs.archlinux.org/task/11225
>>>
>>> I see, so we made a patch for repo-add :
>>> http://bugs.archlinux.org/task/11225?getfile=2429
>>> I also made one for makepkg :
>>> http://bugs.archlinux.org/task/11225?getfile=2426
>>>
>>> But Dan rejected it with this reason :
>>> "Note that I did not touch makepkg because our size there is not
>>> critical- a size to the nearest K is just fine, and switching to a
>>> find/stat way of doing it would cause all hard links to get
>>> double-counted."
>>>
>>> It is a size to the nearest K for each file, and we accumulate the errors?
>>> I did not realize the difference was so big until today, when playing
>>> with docsize stuff.
>>>
>>> e.g. for libsigc++-2.0 :
>>> du -s = 12040 K
>>> du -s --apparent-size = 10723 K
>>>
>>> So that's about 1300K, or a 12% error if I am not mistaken.
>> So.... anyone want to do the analysis of how much counting hardlinks twice
>> biases the size versus how much bias there is using what we currently do?
>>
>> As long as the bias is making the package appear bigger than it is and it is
>> not orders of magnitude different, I really do not care that much.
>>
>
> I had no idea about the number of hard links in packages, I thought
> there were not many.
> Note that it is not just twice, there can be one hundred hardlinks for
> the same file.
>
> By running : find /usr -type f -printf "%p %n\n" | grep -v '1$'
> I found one such extreme case : git :)
> /usr/lib/git-core/git-add 95
> /usr/lib/git-core/git-grep 95
> <93 others>
>
> And well, even if git is one extreme case / exception (I don't know if
> it is, maybe), it is enough to make it necessary to handle hardlinks.
> Otherwise the size would be badly wrong, by close to an order of magnitude.
>
> 1. find . -exec stat -c %s '{}' ';' 2>/dev/null | awk '{sum+=$1} END {printf("%d\n", sum)}'
> 104747 kB
> 2. du -sk --apparent-size .
> 15233 kB
> 3. du -sk .
> 16016 kB
> 4. bsdtar tvf git-1.6.5.6-1-x86_64.pkg.tar.gz 2>/dev/null | awk '{ SUM += $5 } END { print SUM }'
> 15052 kB
>
> So 1. clearly fails :D
> However, the new bsdtar way I was proposing, 4., seems to work, because
> it apparently makes an arbitrary choice of one hardlink, and shows all
> the other hardlinks as links to this one, with size 0.
> -rwxr-xr-x 0 root root 974232 Dec 12 23:12 usr/lib/git-core/git-rerere
> hrwxr-xr-x 0 root root 0 Dec 12 23:12 usr/lib/git-core/git-get-tar-commit-id link to usr/lib/git-core/git-rerere
> hrwxr-xr-x 0 root root 0 Dec 12 23:12 usr/lib/git-core/git-send-pack link to usr/lib/git-core/git-rerere
> etc
>
> I do not know why I still get a size difference between 2 and 4. But
> at least 4 is more correct than 3.
Is this difference between 2. and 4. coming from rounding? As I pointed
out earlier, an underestimate of the size is worse than an overestimate,
so I actually prefer 3. even though it is more wrong...
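
One way to check where the gap between 2. and 4. comes from would be to
break the apparent size down by file type, counting each inode only once.
A minimal sketch, assuming GNU find (for -printf); size_by_type and the
demo tree are made up for illustration:

```shell
#!/bin/sh
# Sketch: apparent size of a tree split by file type, each inode counted once.
# Assumes GNU find (-printf). size_by_type is a hypothetical helper name.
size_by_type() {
    # %D:%i = device:inode (identifies a file), %y = type (f, d, l, ...), %s = size in bytes
    find "$1" -printf '%D:%i %y %s\n' |
        awk '!seen[$1]++ { sum[$2] += $3 }
             END { for (t in sum) printf "%s %d\n", t, sum[t] }' |
        sort
}

# Demo tree: a 1000-byte file with one extra hard link, plus a 24-byte file.
demo=$(mktemp -d)
mkdir "$demo/sub"
head -c 1000 /dev/zero > "$demo/sub/a"
ln "$demo/sub/a" "$demo/sub/b"     # same inode as a, must not be counted twice
head -c 24 /dev/zero > "$demo/c"

report=$(size_by_type "$demo")
echo "$report"

# Total for regular files only (the "f" row of the report).
files=$(echo "$report" | awk '$1 == "f" { print $2 }')
rm -rf "$demo"
```

Run on an extracted package, the "d" row shows how much of the
du --apparent-size figure is made up of the directories themselves, and the
"f" row can be compared against the bsdtar sum.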
> But well, using bsdtar can look weird, and so can using the compressed
> archive to compute the uncompressed size. The metainfo files would also
> have to be excluded at this stage.
> bsdtar --exclude='.*' -tvf /home/pkg/git-1.6.5.6-1-x86_64.pkg.tar.gz
> does not work because it also excludes usr/share/git/emacs/.gitignore
> bsdtar -tvf /home/pkg/git-1.6.5.6-1-x86_64.pkg.tar.gz 2>/dev/null | grep -v ' \..*$'
> seems to work but it's getting quite ugly.
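
The filtering and the summing could perhaps be folded into a single awk
pass, which avoids the grep. This is only a sketch: sum_listing is a
hypothetical helper, it assumes the nine-column listing layout quoted above,
and the .PKGINFO and .gitignore sizes in the sample are made up. Entry names
containing spaces would need more care.

```shell
#!/bin/sh
# Sketch: sum the size column of a `bsdtar -tvf` style listing, skipping only
# top-level dot files (package metadata) and keeping nested ones like .gitignore.
# sum_listing is a hypothetical helper name.
sum_listing() {
    awk '{
        path = $9                                  # entry name in this layout
        # Top-level dot file: starts with "." and contains no "/".
        if (substr(path, 1, 1) == "." && index(path, "/") == 0) next
        sum += $5                                  # size in bytes
    } END { printf "%d\n", sum }'
}

# Sample listing; only the git-rerere line is taken from the real output.
total=$(sum_listing <<'EOF'
-rw-r--r-- 0 root root 400 Dec 12 23:12 .PKGINFO
-rwxr-xr-x 0 root root 974232 Dec 12 23:12 usr/lib/git-core/git-rerere
hrwxr-xr-x 0 root root 0 Dec 12 23:12 usr/lib/git-core/git-send-pack link to usr/lib/git-core/git-rerere
-rw-r--r-- 0 root root 20 Dec 12 23:12 usr/share/git/emacs/.gitignore
EOF
)
echo "$total"
```

With a real package the listing would come from
bsdtar -tvf foo.pkg.tar.gz 2>/dev/null | sum_listing instead of the here-document.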
>
> Anyway, let's go back to the beginning :
> 1) http://bugs.archlinux.org/task/10459
> There was an easy fix for this: use different methods on different OSes,
> as we already do for various things. But we thought it was a bad idea,
> so we moved from du -b to the less accurate du -k.
> 2) http://bugs.archlinux.org/task/11225
> Then we got a complaint. And we moved from du -k to os-specific stat
> for computing file size. But we kept du -k for computing dir size.
>
> Well, we might as well just use an OS-specific 'du' to compute dir size
> too then...
> And maybe re-use 'du' for files too, like in the beginning, and kill stat.
I do not mind OS specific stuff, as long as it is done during configure.
I would readily accept a patch that does the correct thing in
Linux/BSD/OSX/cygwin as long as there is no run-time detection (and to a
lesser extent if the commands are not too different).
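
As a rough illustration of the kind of check meant here, a probe run once at
configure time could pick the stat invocation and record it for the scripts
to use. This is only a sketch, not what configure currently does: size_cmd
is a hypothetical name, and the two spellings tried are the GNU coreutils
and BSD ones.

```shell
#!/bin/sh
# Sketch of a configure-time probe: decide once which stat invocation prints
# a file's size in bytes, so the scripts need no run-time OS detection.
# size_cmd is a hypothetical variable name.
probe=$(mktemp)
printf '12345' > "$probe"          # exactly 5 bytes

if [ "$(stat -c %s "$probe" 2>/dev/null)" = "5" ]; then
    size_cmd='stat -c %s'          # GNU coreutils stat
elif [ "$(stat -f %z "$probe" 2>/dev/null)" = "5" ]; then
    size_cmd='stat -f %z'          # BSD / OS X stat
else
    size_cmd=''                    # nothing usable found; configure would fail here
fi
rm -f "$probe"

echo "size_cmd=$size_cmd"
```

The result would then be substituted into the scripts at build time, in the
same way other tool paths are.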
Allan