On Tue, Nov 3, 2009 at 7:23 AM, Thomas Bächler <thomas@archlinux.org> wrote:
> Dan McGee wrote:
>> Realize that this has drawbacks; someone who is fetching (not cloning)
>> over HTTP will have to redownload the whole pack rather than just the
>> incremental changeset. You may want something more like the included
>> script, as it gives you the benefits of compressing objects without
>> creating one huge pack.
>>
>> -Dan
>>
>> $ cat bin/prunerepos
>> #!/bin/sh
>>
>> cwd=$(pwd)
>> for dir in $(ls | grep -F '.git'); do
>>     cd $cwd/$dir
>>     echo "pruning and packing $cwd/$dir..."
>>     git prune
>>     git repack -d
>> done
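For context on that drawback: a client fetching over plain HTTP (the "dumb" protocol) discovers packs by reading objects/info/packs, which git update-server-info regenerates (usually from a post-update hook). After a full gc the only pack listed is the single big one, so the client has to fetch it whole; after an incremental repack the old packs stay listed and only the new, small one needs downloading. Roughly, using the pack names from the AUR listing below:

$ cat objects/info/packs
P pack-2def16dc5d8361b8a7c11e60e10c503ba9874fdb.pack
P pack-c7bd96b6fc392799991ad88824f935c09d470efa.pack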
> I realize that, but is it something we should really be concerned about?
> With our small repositories, the overhead of downloading a bunch of small
> files might even outweigh the size of a big pack.
That is the whole point: repack doesn't create small files, it bundles them up for you. Downloading 3 packs is still quicker than downloading 1 big one if we do it once a week.

The AUR pack is quite huge and that repository is under active development, so I would feel bad gc-ing it when a simple repack (I just did one) will do, creating only a 230K pack:

$ ll objects/pack/
total 8.7M
-r--r--r-- 1 simo aur-git  22K 2009-11-03 08:28 pack-2def16dc5d8361b8a7c11e60e10c503ba9874fdb.idx
-r--r--r-- 1 simo aur-git 230K 2009-11-03 08:28 pack-2def16dc5d8361b8a7c11e60e10c503ba9874fdb.pack
-r--r--r-- 1 simo aur-git 139K 2009-01-22 21:38 pack-c7bd96b6fc392799991ad88824f935c09d470efa.idx
-r--r--r-- 1 simo aur-git 8.3M 2009-01-22 21:38 pack-c7bd96b6fc392799991ad88824f935c09d470efa.pack

And if it is still a problem we can always just switch to git-gc later; we don't need to skip this intermediate step.
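To spell out the difference, as I understand the flags (see git-repack(1)):

# incremental: pack loose objects into a new pack, leave existing packs alone
$ git prune && git repack -d

# full: -a folds everything, existing packs included, into one pack,
# which dumb-HTTP fetchers would then have to redownload wholesale
$ git repack -a -d

# gc is roughly the full repack plus pruning, reflog expiry, and friends
$ git gc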
> pacman.git is our biggest and currently has a 5.4MB pack when you gc it.
Note that this is an incredibly compacted initial pack; the repository will weigh in at around 9 MB if you pack it locally. I had to pull some tricks to get it that small.
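The tricks aren't spelled out here, but an aggressive repack along these lines is the usual way to squeeze a pack down, at the cost of CPU time (just an illustration, not necessarily what was done for pacman.git):

# -f throws away existing deltas and recomputes them; a large window and
# depth let git search harder for good deltas, trading CPU for pack size
$ git repack -a -d -f --window=250 --depth=250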
> Or maybe we should prune && repack them weekly, but gc them monthly or
> every 2 months?
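A rough sketch of that schedule as crontab entries (the /srv/git path and the times are made up):

# weekly: cheap incremental prune + repack, keeps dumb-HTTP fetches small
0 3 * * 0  cd /srv/git && $HOME/bin/prunerepos
# monthly: full gc, collapsing each repository into one well-packed pack
0 4 1 * *  for d in /srv/git/*.git; do git --git-dir="$d" gc --quiet; done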
> Last week, we had HTTP access to http://projects.archlinux.org/git/ (not
> counting 403s and 404s) from 12 different IPs; 66 the week before that,
> then 63 and 84. I hope most people use git://.
I also hope most people use git://, but I don't want to leave in the dust those who can't. They are also likely the ones with the worst internet connections, so watching out for them might be the nice thing to do.