Is a tool available to check the integrity of copied files?
Hi,
my Google search was "does linux diff compare data using a cache".
I'm trying to figure out what's going on. The first diff of 10 GiB of data copied from a SATA3 SSD to a USB 2 stick connected to a USB 3 port took around a minute, right after the copy finished. A second diff needed 3 seconds. Both returned exit status 0.
It's impossible to read 10 GiB of data in 3 seconds from a USB 2 stick. Does diff use cached data instead of comparing the "real" files line by line?
Google returned "diff isn't doing any caching. The OS is. If you are using Linux, you can flush the disk buffers and cache".
I expected diff to compare the "real" files line by line, but seemingly diff isn't meant for checking the integrity of data.
Does a command exist that compares the "real" files by default, not just the cached copies?
I experience weird things with Raptor Lake hardware, especially if USB is involved, and I want to check the integrity of files transferred to and saved on USB devices by using a tool, without clearing cached data manually.
Regards, Ralf
Try bsdiff, maybe? https://www.daemonology.net/bsdiff/
It will probably be very slow if you're not going through the filesystem cache.
On 4/14/23 16:59, Ralf Mardorf wrote:
Hi,
my google search was "does linux diff compare data using a cache".
It's not "diff" doing anything weird, it's simply the Linux kernel buffer cache, and it works great, doesn't it? And no, it's not 'stale' data; blocks that have not changed are fetched from the buffer cache. It's one of the nice features of the Linux kernel and is partly why memory usage often seems to be larger than you might otherwise guess.
best gewne
On Fri, 2023-04-14 at 17:34 -0400, Genes Lists wrote:
And no, it's not 'stale' data; blocks that have not changed are fetched from the buffer cache.
Hi,
the kernel only sees accesses made through software; it cannot know when something breaks in the hardware without any feedback. If some tiny structure in an external memory chip connected by USB breaks, e.g. on an SD card, external SSD etc., the cached block in the computer's main memory is still ok. A mechanism that is superb for performance can be a problem when trying to check the integrity of copied files.
It's even possible that the data is ok when it is read from the external memory chip and is corrupted by a hardware issue directly afterwards, but reading from the device is still more reliable than checking against cached data provided by the internal main memory.
Regards, Ralf
On Fri, Apr 14, 2023 at 10:59:13PM +0200, Ralf Mardorf wrote:
Hi Ralf,
Hi,
my google search was "does linux diff compare data using a cache".
I'm trying to figure out what's going on. The first diff of 10 GiB of data copied from a SATA3 SSD to an USB 2 stick connected to an USB 3 port took around a minute, right after the copy finished. A second diff needed 3 seconds. Both returned exit status 0.
It's impossible to read 10 GiB of data in 3 seconds from an USB 2 stick. Does diff use cached data instead of comparing the "real" files line by line? Google returned "diff isn't doing any caching. The OS is. If you are using Linux, you can flush the disk buffers and cache".
I expected that diff ensures to compare the "real" files line by line, but seemingly diff isn't aimed to check integrity of data.
Short answer: yes. diff can't choose. When it asks to open the file, it's the Linux kernel that returns the data, and it will do so from the cache if available.
Does a command exist that compares "real" files, not just cached files by default?
I'm not aware of any such software. You will have to manually clear the cache if you want this behavior. You can use this command, as root, to achieve this: "echo 3 > /proc/sys/vm/drop_caches". What makes you think that the cache isn't matching the file on disk? I'd argue that you can be confident that the read cache is true to the file on disk and would correctly be invalidated should the file ever change on disk.
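A minimal sketch of that cache drop, assuming a Linux system with /proc mounted. The function names cached_kib and drop_page_cache are made up for illustration. The kernel only drops clean pages, so the sketch runs sync first; the write to drop_caches needs root.

```shell
# cached_kib: size of the page cache in KiB, as reported by the kernel.
cached_kib() {
    awk '/^Cached:/ { print $2 }' /proc/meminfo
}

# drop_page_cache: write dirty data out, then ask the kernel to drop the
# clean page cache, dentries and inodes. Must be run as root.
drop_page_cache() {
    sync
    echo 3 > /proc/sys/vm/drop_caches
}

# Typical use, as root:
#   cached_kib; drop_page_cache; cached_kib
```

Comparing the two cached_kib values gives a rough idea of how much was dropped. This affects the cache for every device, not only the USB stick.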
I experience weird things with Raptor Lake hardware, especially if USB is involved and I want to check the integrity of USB transferred, saved files by using a tool, without manually clearing cached data manually.
As an alternative approach, I'd suggest checking file integrity with a hash tool, e.g. md5sum or similar. It's a more common approach than diff, as diff could produce a lot of output if there are many differences between two 10 GB files.
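A minimal sketch of the checksum approach, assuming GNU coreutils (sha256sum) and GNU findutils. The function name verify_tree is made up for illustration. Note that the checksums of the copy are still computed from whatever the kernel returns, so the copy has to be evicted from the cache first (for example by remounting the stick) for the check to exercise the device itself.

```shell
# verify_tree SRC DEST: hash every regular file under SRC, then check the
# same relative paths under DEST against those hashes.
# Prints "OK" and returns 0 if all files match; otherwise sha256sum lists
# the failing files and the function returns 1.
verify_tree() {
    src=$1
    dest=$2
    manifest=$(mktemp) || return 1

    # Paths are relative to SRC so the manifest can be replayed inside DEST.
    (cd "$src" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum) > "$manifest"

    # --quiet prints only the files that fail verification.
    if (cd "$dest" && sha256sum --check --quiet "$manifest"); then
        echo "OK"
        status=0
    else
        status=1
    fi
    rm -f "$manifest"
    return "$status"
}

# Typical use (hypothetical paths):
#   verify_tree ~/data /mnt/usb/data
```

The sketch only checks files that exist in SRC; extra files in DEST go unnoticed, and an empty SRC is reported as a failure because the manifest contains no checksum lines.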
Regards, Ralf
Br, Linus
Ralf Mardorf <ralf-mardorf@riseup.net> wrote:
Hi,
my google search was "does linux diff compare data using a cache".
I'm trying to figure out what's going on. The first diff of 10 GiB of data copied from a SATA3 SSD to an USB 2 stick connected to an USB 3 port took around a minute, right after the copy finished. A second diff needed 3 seconds. Both returned exit status 0.
It's impossible to read 10 GiB of data in 3 seconds from an USB 2 stick. Does diff use cached data instead of comparing the "real" files line by line?
Google returned "diff isn't doing any caching. The OS is. If you are using Linux, you can flush the disk buffers and cache".
I expected that diff ensures to compare the "real" files line by line, but seemingly diff isn't aimed to check integrity of data.
Does a command exist that compares "real" files, not just cached files by default?
To the best of my knowledge, Google gave you the correct answer. I got many results when searching for how to clear the filesystem buffer cache in Linux. I don't know which, if any, is better. One of the issues is distinguishing the relevant cache entries from the other entries in the cache. It could be that unmounting and pulling out the USB device after a sync command has completed is a step towards cleaning out unwanted entries. Perhaps there is also a way from the command line to disconnect and reconnect the power to the device, as if it had been physically pulled out and plugged back in.
I experience weird things with Raptor Lake hardware, especially if USB is involved and I want to check the integrity of USB transferred, saved files by using a tool, without manually clearing cached data manually.
It is a good approach to verify the integrity of copied data, in my opinion. Unfortunately, you cannot disregard the cache when doing that. -- u34
Regards, Ralf
Hi Ralf,
I want to check the integrity of USB transferred, saved files by using a tool, without manually clearing cached data manually.
Why do you want to avoid manually clearing the cache?
The problem with clearing the cache is it is the cache for every device. It is better to simply umount(8) and then mount(8) the USB stick so its blocks in the cache are discarded and reads must access the device.
With the cache empty, a simple ‘diff -qr src dest’ will show if any files' contents differ, reading from the USB stick. This is my recommendation.
If you want more detail checked, then ask rsync to show what would be updated, e.g. permissions, modification time, etc. Note the ‘-n’.
rsync -aciHAXSs --no-iconv -n src/ dest/
dd(1) can be told to use direct I/O, bypassing the cache, but that would be more useful if copying a disk or filesystem image to the USB stick and wanting to read it back from the block device.
dstat(1) can be used to observe reads from the device to show the cache isn't supplying everything. ‘dstat -dDsdb,sdc’ monitors two block devices: /dev/sdb and /dev/sdc.
-- Cheers, Ralph.
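The remount-then-diff recommendation could be sketched as follows, assuming util-linux (findmnt, mount, umount) and a diff that supports -q and -r. The function names remount and compare_trees and the paths in the usage comment are made up for illustration.

```shell
# remount MOUNTPOINT: unmount the filesystem and mount it again so the
# kernel discards its cached blocks. Usually needs root.
# Caveat: mounting by device alone does not restore custom mount options.
remount() {
    mp=$1
    dev=$(findmnt -n -o SOURCE "$mp") || return 1
    sync
    umount "$mp" && mount "$dev" "$mp"
}

# compare_trees SRC DEST: report whether any file contents differ.
# diff -q names the differing files instead of printing the differences.
compare_trees() {
    if diff -qr "$1" "$2"; then
        echo "identical"
    else
        echo "DIFFERENT"
        return 1
    fi
}

# Typical use (hypothetical paths, run as root):
#   remount /mnt/usb && compare_trees ~/data /mnt/usb/data
```

Only the destination needs remounting; reading the source from cache is harmless if the source is trusted.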
Ralph Corderoy <ralph@inputplus.co.uk> wrote:
Hi Ralf,
I want to check the integrity of USB transferred, saved files by using a tool, without manually clearing cached data manually.
Why do you want to avoid manually clearing the cache?
The problem with clearing the cache is it is the cache for every device. It is better to simply umount(8) and then mount(8) the USB stick so its blocks in the cache are discarded and reads must access the device.
Isn't it better to also turn the USB stick's power off and on? The stick could have its own built-in cache, couldn't it? Perhaps it, too, is reading from its cache, not from its permanent storage? -- u34
With the cache empty, a simple ‘diff -qr src dest’ will show if any files' contents differ, reading from the USB stick. This is my recommendation.
If you want more detail checked, then ask rsync to show what would be updated, e.g. permissions, modification time, etc. Note the ‘-n’.
rsync -aciHAXSs --no-iconv -n src/ dest/
dd(1) can be told to use direct I/O, bypassing the cache, but that would be more useful if copying a disk or filesystem image to the USB stick and wanting to read it back from the block device.
dstat(1) can be used to observe reads from the device to show the cache isn't supplying everything. ‘dstat -dDsdb,sdc’ monitors two block devices: /dev/sdb and /dev/sdc.
-- Cheers, Ralph.
Hi u34,
Isn't it better to also turn the USB stick power off and on? The stick could have its own, builtin, cache. Doesn't it? Perhaps it is, too, reading from its cache, not from its permanent storage?
True. Could it have a store of power, e.g. a capacitor, which takes a while to discharge? :-) -- Cheers, Ralph.
On Sat, 2023-04-15 at 12:47 +0100, Ralph Corderoy wrote:
Could it have a store of power, e.g. capacitor, which takes a while to discharge. :-)
No! An electrolytic capacitor that holds enough charge is way larger than a USB stick. When I was still working as a developer for recording studio stuff, we used to have the childish joke of throwing charged capacitors at each other and shouting "catch it".
On Sat, 15 Apr 2023 at 21:17, Ralf Mardorf <ralf-mardorf@riseup.net> wrote:
On Sat, 2023-04-15 at 12:47 +0100, Ralph Corderoy wrote:
Could it have a store of power, e.g. capacitor, which takes a while to discharge. :-)
No! An electrolytic capacitor that holds enough charge, is way larger than an USB stick.
How much power do you think is needed to hold this alleged cache? If the answer is minuscule, you don't need a capacitor as large as a AAA battery.
According to Reddit [1], you'd need about 528 microjoules per second per 128 kB block to write. Assuming it takes 1 second to write this 128 kB block, you expend 528 microwatts. The size of the capacitor you need to hold that charge is quite small. Even if you enlarge this to a fair bit more data, you're still working with minuscule values and don't need electrolytic capacitors. Almost any type will do. Assuming my calculations are correct, of course.
[1] https://www.reddit.com/r/askscience/comments/o9qtp3/what_is_the_amount_of_en...
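As a rough check of the capacitor size, taking the 528 microjoule figure above at face value and assuming (these are assumptions, not from the thread) a 5 V supply, the nominal USB bus voltage, and a capacitor that can be drained all the way to 0 V, the energy stored in a capacitor gives:

```latex
E = \tfrac{1}{2} C V^{2}
\quad\Longrightarrow\quad
C = \frac{2E}{V^{2}}
  = \frac{2 \times 528\,\mu\mathrm{J}}{(5\,\mathrm{V})^{2}}
  \approx 42\,\mu\mathrm{F}
```

A few tens of microfarads at 5 V does not require an electrolytic part. A controller running at a lower voltage, or one that stops working well above 0 V, would need a larger value for the same energy.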
Hi,
Regarding cutting power to the USB stick, besides physically pulling it out, it may be possible to ‘turn it off and on again’ by software.
https://www.kernel.org/doc/html/v4.16/driver-api/usb/power-management.html#u...
https://github.com/mvp/uhubctl#readme
https://aur.archlinux.org/packages/uhubctl
-- Cheers, Ralph.
On Sat, 2023-04-15 at 21:59 +0100, Andy Pieters wrote:
The size of the capacitor you need to hold that charge is quite small.
Hi,
maybe I'm mistaken. Even if the capacitors are able to provide enough energy, I still suspect a reset when powering a USB stick off and on again. It's unlikely that the controller trusts data that might still be in a cache after powering on. IOW I believe there's no need to wait a few seconds or minutes to ensure that the cache is cleared by discharged capacitors.
Regards, Ralf
participants (7)
- Andy Pieters
- Genes Lists
- Linus Probert
- Ralf Mardorf
- Ralph Corderoy
- u34@net9.ga
- Wesley Kerfoot