On 13/03/10 03:05, Shridhar Daithankar wrote:
> Hi,
> Just wanted to share an interesting experience I had today.
Maybe you're looking for http://docs.python.org/library/filecmp.html

One cannot help but think that you took a disk-bound process and turned it into a CPU-bound one. Since you're only interested in which files are different, you should have used `cmp` instead of `md5sum`; the latter is overkill, and I'd assume calling an external command that many times can't be very nice either.

Here are some comparisons. They use /usr/lib - I figured 75000 files should be a good test... I made this as deliberately unfair/incomparable as possible, since I wanted to show the potential overhead of calling md5sum that many times.

[[ky] ~]# }} time python -c "import filecmp; print len(filecmp.dircmp('/usr/lib', '/temp/lib').diff_files)"
80

real    2m24.240s
user    0m10.123s
sys     0m10.669s

That looks reasonable on this crappy 5400 rpm (SATA) laptop hard disk with ext4.

You'll note the tests below are pretty much just there to see how much time calling md5sum takes: /tmp/a is a 1-byte file (it contains the character 'a', to give md5sum as simple a job as possible). /tmp is a tmpfs, not that it matters, as /tmp/a most likely stays cached the entire time.

[[ky] ~]# }} time find /temp/lib -type f | wc -l
75272

real    0m0.532s
user    0m0.140s
sys     0m0.383s

[[ky] ~]# }} time find /temp/lib -type f -exec md5sum /tmp/a \;

real    2m6.781s
user    0m2.200s
sys     0m15.409s

The disk-activity light didn't come on at all during those 2 minutes, meanwhile I could hear my CPU fan going crazy the whole time (1.6 GHz). I should note that the light stayed on the entire time during the filecmp run, and the CPU stayed low (800 MHz) for most of that time as well.
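By the way, dircmp.diff_files only covers files directly in the two top-level directories; to catch every differing file under the tree you have to recurse through the subdirs attribute. Here's a rough, untested sketch of what I mean - it reuses the same /usr/lib vs /temp/lib paths from above, and the helper name is just something I made up:

    import os
    import filecmp

    def diff_files_recursive(dcmp):
        # files present in both directories whose contents differ
        # (dircmp uses a "shallow", os.stat-based comparison by default)
        for name in dcmp.diff_files:
            yield os.path.join(dcmp.left, name)
        # descend into the sub-directories common to both sides
        for sub in dcmp.subdirs.values():
            for path in diff_files_recursive(sub):
                yield path

    dcmp = filecmp.dircmp('/usr/lib', '/temp/lib')
    for path in diff_files_recursive(dcmp):
        print path

That should stay disk-bound (no hashing, no external processes per file) and give the same kind of answer as the one-liner above, just over the whole tree instead of only the top directory.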