[arch-dev-public] Cronjob for regular git garbage collection
When I broke our projects.archlinux.org vhost, I noticed that cloning git via http:// takes ages. This could be vastly improved by running a regular cronjob to 'git gc' all /srv/projects/git repositories. It would also speed up cloning/pulling via git://, as the "remote: compressing objects" stage will be much less work on the server. Are there any objections against setting this up?
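(For illustration only, a minimal sketch of what such a cronjob could run, assuming bare repositories living directly under /srv/projects/git; the script name and layout are assumptions, not an existing setup:)

#!/bin/sh
# gc-repos: run 'git gc' on every bare repository under /srv/projects/git
for repo in /srv/projects/git/*.git; do
    [ -d "$repo" ] || continue
    echo "garbage collecting $repo..."
    git --git-dir="$repo" gc --quiet
done

A weekly crontab entry along the lines of "0 4 * * 0 /usr/local/bin/gc-repos" would then cover the regular run.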
When I broke our projects.archlinux.org vhost, I noticed that cloning git via http:// takes ages. This could be vastly improved by running a regular cronjob to 'git gc' all /srv/projects/git repositories. It would also speed up cloning/pulling via git://, as the "remote: compressing objects" stage will be much less work on the server. Are there any objections against setting this up?
I'm not that familiar with git internals on the server side, but what you describe sounds reasonable to me. Even the documentation of "git gc" says: "Users are encouraged to run this task on a regular basis within each repository to maintain good disk space utilization and good operating performance." So, +1 from my (not-so-git-familiar) side.

Daniel
On Tue, Nov 3, 2009 at 3:49 AM, Thomas Bächler <thomas@archlinux.org> wrote:
When I broke our projects.archlinux.org vhost, I noticed that cloning git via http:// takes ages. This could be vastly improved by running a regular cronjob to 'git gc' all /srv/projects/git repositories. It would also speed up cloning/pulling via git://, as the "remote: compressing objects" stage will be much less work on the server. Are there any objections against setting this up?
I used to do this fairly often on the pacman.git repo; I did a few of the others as well. No objections here, just make sure running the cronjob doesn't make the repository unwritable for the people that need it.

Realize that this has drawbacks: someone that is fetching (not cloning) over HTTP will have to redownload the whole pack again and not just the incremental changeset. You may want something more like the included script, as it gives you the benefits of compressing objects but not creating one huge pack.

-Dan

$ cat bin/prunerepos
#!/bin/sh
cwd=$(pwd)
for dir in $(ls | grep -F '.git'); do
    cd $cwd/$dir
    echo "pruning and packing $cwd/$dir..."
    git prune
    git repack -d
done
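(On the write-access point: a minimal sketch of one way to keep the packs a cronjob writes group-writable, assuming shared repositories owned by a common group; the repository path is a placeholder and this is not necessarily how the server is configured:)

# make git honour group permissions when it writes new packs and objects
git --git-dir=/srv/projects/git/somerepo.git config core.sharedRepository group
# and run the cronjob with a group-writable umask, e.g. in the crontab entry:
#   0 3 * * 0  cd /srv/projects/git && umask 002 && ~/bin/prunerepos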
Dan McGee wrote:
Realize that this has drawbacks; someone that is fetching (not cloning) over HTTP will have to redownload the whole pack again and not just the incremental changeset. You may want something more like the included script as it gives you the benefits of compressing objects but not creating one huge pack.
-Dan
$ cat bin/prunerepos
#!/bin/sh
cwd=$(pwd)
for dir in $(ls | grep -F '.git'); do
    cd $cwd/$dir
    echo "pruning and packing $cwd/$dir..."
    git prune
    git repack -d
done
I realize that, but is it something we should really be concerned about? With our small repositories, the overhead of downloading a bunch of small files might even outweigh the size of a big pack.

pacman.git is our biggest and currently has a 5.4MB pack when you gc it.

Or maybe we should prune && repack them weekly, but gc them monthly or every 2 months?

Last week, we had http access to http://projects.archlinux.org/git/ (not counting 403s and 404s) from 12 different IPs, 66 the week before that, then 63 and 84. I hope most people use git://.
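(A sketch of what that split schedule could look like, using a hypothetical wrapper script that runs a given git command in every repository; the paths and times are assumptions:)

#!/bin/sh
# foreach-repo: run the given git command in every bare repo under /srv/projects/git
for repo in /srv/projects/git/*.git; do
    echo ">> $repo"
    git --git-dir="$repo" "$@"
done

With crontab entries roughly like:

# weekly: prune unreachable objects and do an incremental repack
0 3 * * 0  /srv/projects/bin/foreach-repo prune
30 3 * * 0 /srv/projects/bin/foreach-repo repack -d
# monthly: full gc, which consolidates everything into one big pack
0 4 1 * *  /srv/projects/bin/foreach-repo gc --quiet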
On Tue, Nov 3, 2009 at 7:23 AM, Thomas Bächler <thomas@archlinux.org> wrote:
Dan McGee wrote:
Realize that this has drawbacks; someone that is fetching (not cloning) over HTTP will have to redownload the whole pack again and not just the incremental changeset. You may want something more like the included script as it gives you the benefits of compressing objects but not creating one huge pack.
-Dan
$ cat bin/prunerepos
#!/bin/sh
cwd=$(pwd)
for dir in $(ls | grep -F '.git'); do
    cd $cwd/$dir
    echo "pruning and packing $cwd/$dir..."
    git prune
    git repack -d
done
I realize that, but is it something we should really be concerned about? With our small repositories, the overhead of downloading a bunch of small files might even outweigh the size of a big pack.
That is the whole point: repack doesn't create small files, it bundles them up for you. Downloading 3 packs is still quicker than downloading 1 big one if we do it once a week.

The AUR pack is quite huge and that repository is under active development, so I would feel bad gc-ing it when a simple repack (I just did one) will do, creating only a 230K pack:

$ ll objects/pack/
total 8.7M
-r--r--r-- 1 simo aur-git  22K 2009-11-03 08:28 pack-2def16dc5d8361b8a7c11e60e10c503ba9874fdb.idx
-r--r--r-- 1 simo aur-git 230K 2009-11-03 08:28 pack-2def16dc5d8361b8a7c11e60e10c503ba9874fdb.pack
-r--r--r-- 1 simo aur-git 139K 2009-01-22 21:38 pack-c7bd96b6fc392799991ad88824f935c09d470efa.idx
-r--r--r-- 1 simo aur-git 8.3M 2009-01-22 21:38 pack-c7bd96b6fc392799991ad88824f935c09d470efa.pack

And if it is still a problem we can always just switch to git-gc later; we don't need to skip this intermediate step.
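(For anyone wanting to reproduce that comparison locally, a rough sketch; the aur.git path is only a placeholder:)

# incremental: bundle loose objects into a new pack, existing packs stay as they are
git --git-dir=/srv/projects/git/aur.git repack -d
# full gc for comparison (left commented out): ends up with roughly one big pack,
# which dumb-HTTP fetchers would then have to redownload in full
# git --git-dir=/srv/projects/git/aur.git gc
ls -lh /srv/projects/git/aur.git/objects/pack/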
pacman.git is our biggest and currently has a 5.4MB pack when you gc it.
Note that this is an incredibly compacted initial pack: the repository will weigh in at around 9 MB if you pack it locally; I had to pull some tricks to get it that small.
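(The thread doesn't say which tricks those were; for reference, one common way to squeeze a pack down is an aggressive full repack with a larger delta search window, for example:)

# recompute all deltas with a much larger window/depth; slow and CPU-heavy,
# and not necessarily what was actually done for pacman.git
git repack -a -d -f --window=250 --depth=250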
Or maybe we should prune && repack them weekly, but gc them monthly or every 2 months?
Last week, we had http access to http://projects.archlinux.org/git/ (not counting 403s and 404s) from 12 different IPs, 66 the week before that, then 63 and 84. I hope most people use git://.
I also hope most people use git://, but I don't want to leave those that can't in the dust. They are also likely the ones with the worst internet connections, so watching out for them might be the nice thing to do.
Dan McGee wrote:
That is the whole point: repack doesn't create small files, it bundles them up for you. Downloading 3 packs is still quicker than downloading 1 big one if we do it once a week.
I just read the help for repack -d, and it totally makes sense to use it this way. We could generate weekly packs then. Is there also an option to repack these weekly packs into one big pack once they're older than 6 months or so?
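(As far as I know there is no built-in age-based option for that, but a full repack on a longer schedule would have the same effect; a sketch reusing the hypothetical foreach-repo wrapper from above:)

# hypothetical crontab entry: every 6 months, consolidate all packs into one
0 5 1 */6 *  /srv/projects/bin/foreach-repo repack -a -d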
pacman.git is our biggest and currently has a 5.4MB pack when you gc it.
Note that this is an incredibly compacted initial pack: the repository will weigh in at around 9 MB if you pack it locally; I had to pull some tricks to get it that small.
I don't understand. What did you do to it? I just ran "git gc" locally on it and it had that size.
I also hope most people use git://, but I don't want to leave those that can't in the dust. They are also likely the ones with the worst internet connections, so watching out for them might be the nice thing to do.
Agreed.
participants (3)
- Dan McGee
- Daniel Isenmann
- Thomas Bächler