On Monday 15 March 2010 15:44:35 Nathan Wayde wrote:
On 13/03/10 03:05, Shridhar Daithankar wrote:
Hi,
Just wanted to share an interesting experience I had today.
Maybe you're looking for http://docs.python.org/library/filecmp.html
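For what it's worth, a minimal sketch of how filecmp could be used to list the files that differ between two trees. The function name `differing_files` and its arguments are made up for illustration; only `filecmp.cmp` and `os.walk` come from the standard library. Passing `shallow=False` asks for a byte-by-byte comparison rather than trusting size and mtime.

```python
import filecmp
import os


def differing_files(src_root, dst_root):
    """Yield paths (relative to src_root) that are missing or differ under dst_root.

    Hypothetical helper, not part of filecmp itself.
    """
    for dirpath, _dirs, filenames in os.walk(src_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            # shallow=False compares file contents, not just the stat() signature
            if not os.path.isfile(dst) or not filecmp.cmp(src, dst, shallow=False):
                yield rel
```

Like cmp, this has to read both copies, so it only helps when both trees are on hand.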
One cannot help but think that you took a disk-bound process and turned it into a CPU-bound one. Since you're just interested in which files are different, you should have just used `cmp` instead of `md5sum`; the latter is overkill, and I'd assume calling an external command that many times can't be very nice either.
Here are some comparisons. They use /usr/lib; I figured 75000 files should be a good test. I made this as deliberately unfair/incomparable as possible, since I wanted to show the potential overhead of calling md5sum that many times.
I didn't know of cmp, thanks. I tried the same thing with cmp in loops, and it agrees with your comment: it is totally I/O bound, not CPU bound at all.

However, even in the md5sum case, I/O was high too; the disk light was on all the time. Maybe that was down to the CPU speed difference. But as far as file system performance goes, the overhead should be identical for both runs, no?

Besides, I need to run the comparison (or rather, verification of file contents) many times over during the application life-cycle, and I cannot afford to bring in another copy from disk each time. The working set is expected to be 30-40GB at a time; 3GB is just the test setup. With md5sum, I can store the checksums in a database and verify against one copy only.

And finally, cmp is terrible on timings. Running md5sum is a lot faster, about 3 times in the best case.

    shridhar@bheem /mnt1/shridhar/tmp/importtest.big$ time for i in `find . -type f`;do cmp "$i" "/data/shridhar/tmp/4/$i";done

    real    21m30.137s
    user    0m27.665s
    sys     1m21.581s

    shridhar@bheem /data/shridhar/tmp/4$ time for i in `find . -type f`;do cmp "$i" "/mnt1/shridhar/tmp/importtest.big/$i";done

    real    6m26.988s
    user    0m40.721s
    sys     1m28.371s

    shridhar@bheem /mnt1/shridhar/tmp/importtest.big$ time for i in `find . -type f`;do cmp "$i" "/data/shridhar/tmp/4/$i";done

    real    16m27.541s
    user    0m37.281s
    sys     1m23.995s

So when the source file system is btrfs, it is still a couple of times faster at least.

--
Regards
Shridhar
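The "store the checksum in a database, verify on one copy only" idea could look roughly like the sketch below. It is only an illustration: the table name `checksums` and the helpers `md5_of`, `record_checksums` and `verify_checksums` are all invented here, and SQLite stands in for whatever database the application really uses.

```python
import hashlib
import os
import sqlite3


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, read in chunks so large files fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_checksums(db, root):
    """Store one (relative path, md5) row per file under root. Hypothetical schema."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS checksums (path TEXT PRIMARY KEY, md5 TEXT NOT NULL)"
    )
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            db.execute(
                "INSERT OR REPLACE INTO checksums VALUES (?, ?)",
                (os.path.relpath(full, root), md5_of(full)),
            )
    db.commit()


def verify_checksums(db, root):
    """Return the recorded paths that are missing or whose content changed under root."""
    bad = []
    for rel, expected in db.execute("SELECT path, md5 FROM checksums ORDER BY path"):
        full = os.path.join(root, rel)
        if not os.path.isfile(full) or md5_of(full) != expected:
            bad.append(rel)
    return bad
```

Usage would be something like `db = sqlite3.connect("checksums.db")`, then `record_checksums(db, root)` once at import time and `verify_checksums(db, root)` on each later pass. Each verification reads only the one copy on disk, which is the point of keeping the digests around; it also avoids spawning one md5sum process per file.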