[arch-general] broken pipe
hello,

to avoid having to read the entire large file twice, I use the tee tool in this way to calculate two checksums "simultaneously":

    file_info(){
        echo -n ${1:=/dev/stdin}$'\t'
        ( tee < "${1}" \
            >( md5sum >&3 ) \
            >( sha1sum >&3 ) \
            >/dev/null ) 3>&1 | tr '\n' '\t'
        echo
    }

it works perfectly because both tools used, md5sum and sha1sum, consume all the data.

on the other hand, the function returns wrong fingerprints if I insert a tool like file:

    file_info(){
        echo -n ${1:=/dev/stdin}$'\t'
        ( tee < "${1}" \
            >( file --mime-type -b -e compress -e tar -e elf - >&3 ) \
            >( md5sum >&3 ) \
            >( sha1sum >&3 ) \
            >/dev/null ) 3>&1 | tr '\n' '\t'
        echo
    }

it no longer works because the data flow is quickly interrupted: file does not consume all the data, so tee stops early.

do you know a way to get the file type in parallel with the fingerprints? getting the type is just an example: the idea is to open the file only once...

regards, lacsaP.
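The fan-out above can be reduced to a small, self-contained sketch (the temporary file and sample data here are illustrative, not from the thread):

```shell
#!/bin/bash
# Sketch of the tee fan-out: the input is read once, each process
# substitution hashes the same stream, and both results are written
# to fd 3, which the outer 3>&1 redirection collects on stdout.
tmp=$(mktemp)
printf 'hello\n' > "$tmp"

( tee < "$tmp" \
    >( md5sum  >&3 ) \
    >( sha1sum >&3 ) \
    >/dev/null ) 3>&1

rm -f "$tmp"
```

The two digest lines may appear in either order, since the substituted processes finish asynchronously.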
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐ On Wednesday, December 18, 2019 3:20 PM, Pascal via arch-general <arch-general@archlinux.org> wrote:
> it works perfectly because both tools used, md5sum and sha1sum, consume all the data.
> on the other hand, the function returns wrong fingerprints if I insert a tool like file :
...just do >( file --mime-type -b -e compress -e tar -e elf - >&3; cat ) ?

Sent with ProtonMail Secure Email.
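A quick way to convince yourself the trailing cat works: below, head -c 16 stands in for a consumer like file that stops reading early; because cat drains the remainder, tee never hits a closed pipe and md5sum still hashes the complete stream (a sketch; the head stand-in and sample data are not from the thread):

```shell
#!/bin/bash
# head -c 16 quits after 16 bytes (mimicking file, which only reads
# the start of the data); the following cat consumes the rest, so
# tee can write the whole stream and md5sum sees every byte.
( seq 100000 | tee \
    >( head -c 16 >/dev/null; cat >/dev/null ) \
    >( md5sum >&3 ) \
    >/dev/null ) 3>&1
```

The digest printed matches `seq 100000 | md5sum`, confirming nothing was truncated.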
that's awesome, it works! it was so simple: cat takes over and consumes the data to the end! (I added a redirect to /dev/null to cat.) big thank you.

On Wed, 18 Dec 2019 at 16:19, mar77i via arch-general <arch-general@archlinux.org> wrote:
> ...just do >( file --mime-type -b -e compress -e tar -e elf - >&3; cat ) ?
On Wed, 18 Dec 2019 at 15:32, Pascal via arch-general <arch-general@archlinux.org> wrote:
> that's awesome, it works ! it was so simple with cat taking over and consuming the data until the end ! (I added a redirect to /dev/null to cat) big thank you.
I'm interested in the amount of effort you put into this. Isn't the overhead of the pipes etc going to negate any memory/speed improvements you may get from only opening the file once or is there another consideration at play here?
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐ On Wednesday, December 18, 2019 4:41 PM, Andy Pieters <arch-general@andypieters.me.uk> wrote:
> I'm interested in the amount of effort you put into this. Isn't the overhead of the pipes etc going to negate any memory/speed improvements you may get from only opening the file once or is there another consideration at play here?
IMHO, reading from memory is generally preferable to reading again from disk, that is, assuming a "worse" solution that relies on multiple reads. It makes sure that the data transmitted is actually the same data, and that no other process writes to the file in the meantime. It's also easier than fiddling with flock() and the like.

Also, I'd advise against reading binary data into environment variables, which I think is what the approach with tee is meant to avoid, therefore I really like it. Technically you should be able to tee yourself into a full set of HTTP headers.

cheers! mar77i

Sent with ProtonMail Secure Email.
it depends a little on the context in which the function will be used:
- with an SSD, reading a large file is very fast,
- with a lot of available memory, the file is cached on its first read and subsequent reads are almost instantaneous.

if these two conditions are met, indeed, I am not sure that the function brings a real benefit. otherwise, there is a real interest in using it. the tests below were performed on a large file placed on a USB 3.0 key.

    ll -hgG elementaryos-5.1-stable.20191202.iso
    -rw-r--r-- 1 1.4G Dec 17 14:22 elementaryos-5.1-stable.20191202.iso

here the large file is kept in cache:

    pv elementaryos-5.1-stable.20191202.iso > /dev/null
    1.37GiB 0:00:35 [39.4MiB/s] [================================>] 100%
    pv elementaryos-5.1-stable.20191202.iso > /dev/null
    1.37GiB 0:00:00 [4.51GiB/s] [================================>] 100%

    time ( file --mime-type -b -e compress -e tar -e elf elementaryos-5.1-stable.20191202.iso;
           md5sum elementaryos-5.1-stable.20191202.iso;
           sha1sum elementaryos-5.1-stable.20191202.iso )
    application/octet-stream
    d1addd17377aa73700680ff937d7a0f4  elementaryos-5.1-stable.20191202.iso
    cbe37d55c44db5bb0ab0ae6f5bf0eb96209bb16f  elementaryos-5.1-stable.20191202.iso
    real 0m4.680s  user 0m4.298s  sys 0m0.318s

    file_info(){
        echo -n ${1:=/dev/stdin}$'\t'
        ( tee < "${1}" \
            >( file --mime-type -b -e compress -e tar -e elf - >&3; cat >/dev/null ) \
            >( md5sum >&3 ) \
            >( sha1sum >&3 ) \
            >/dev/null ) 3>&1 | tr '\n' '\t'
        echo
    }

    time ( file_info elementaryos-5.1-stable.20191202.iso )
    elementaryos-5.1-stable.20191202.iso  application/octet-stream  cbe37d55c44db5bb0ab0ae6f5bf0eb96209bb16f -  d1addd17377aa73700680ff937d7a0f4 -
    real 0m3.435s  user 0m0.113s  sys 0m1.176s

now the cache is cleared between each access (eg. limited available memory):

    function flush { sync && sysctl -q vm.drop_caches=3; }

    flush; pv elementaryos-5.1-stable.20191202.iso > /dev/null
    1.37GiB 0:00:35 [39.4MiB/s] [================================>] 100%
    flush; pv elementaryos-5.1-stable.20191202.iso > /dev/null
    1.37GiB 0:00:35 [39.4MiB/s] [================================>] 100%

    flush; time ( file --mime-type -b -e compress -e tar -e elf elementaryos-5.1-stable.20191202.iso;
                  flush; md5sum elementaryos-5.1-stable.20191202.iso;
                  flush; sha1sum elementaryos-5.1-stable.20191202.iso )
    application/octet-stream
    d1addd17377aa73700680ff937d7a0f4  elementaryos-5.1-stable.20191202.iso
    cbe37d55c44db5bb0ab0ae6f5bf0eb96209bb16f  elementaryos-5.1-stable.20191202.iso
    real 1m11.054s  user 0m8.092s  sys 0m1.441s

    flush; time ( file_info elementaryos-5.1-stable.20191202.iso )
    elementaryos-5.1-stable.20191202.iso  application/octet-stream  cbe37d55c44db5bb0ab0ae6f5bf0eb96209bb16f -  d1addd17377aa73700680ff937d7a0f4 -
    real 0m35.217s  user 0m0.316s  sys 0m2.146s

On Wed, 18 Dec 2019 at 16:42, Andy Pieters <arch-general@andypieters.me.uk> wrote:
> I'm interested in the amount of effort you put into this. Isn't the overhead of the pipes etc going to negate any memory/speed improvements you may get from only opening the file once or is there another consideration at play here?
Hi Pascal,
> file_info(){
>     echo -n ${1:=/dev/stdin}$'\t'
>     ( tee < "${1}" \
>         >( file --mime-type -b -e compress -e tar -e elf - >&3 ) \
>         >( md5sum >&3 ) \
>         >( sha1sum >&3 ) \
>         >/dev/null ) 3>&1 | tr '\n' '\t'
>     echo
> }
>
> it no longer works because the data flow is quickly interrupted by tee which does not consume all the data.
You're missing the reason why. tee(1) receives a SIGPIPE because it writes to a pipe that's closed. Adding a cat(1) is a waste of CPU, as is discarding tee's stdout instead of using it for one of the workers.

Examine these differences.

    $ seq 31415 | wc
      31415   31415  177384
    $ seq 31415 | tee >(sed q) >(wc) > >(tr -d 42 | wc); sleep 1
    1
      14139   14109   62130
      12773   12774   65536
    $ seq 31415 | (trap '' pipe; tee >(sed q) >(wc) > >(tr -d 42 | wc)); sleep 1
    1
      31415   31369  142504
      31415   31415  177384
    $

Note the output of the commands can be in any order, and intermingle if they're long enough.

tee(1) has -p and --output-error but they're not as specific as stating SIGPIPE is expected for just one worker.

--
Cheers, Ralph.
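Ralph's trap can be verified mechanically: with SIGPIPE ignored, tee survives the early-quitting sed and delivers the full stream on its stdout (this sketch assumes GNU coreutils tee, which per POSIX keeps writing to its remaining outputs after a write error; the line counts are illustrative):

```shell
#!/bin/bash
# sed q closes its pipe after one line; with SIGPIPE ignored, tee's
# write to that dead pipe returns EPIPE (a mere write error, whose
# diagnostic is silenced on stderr here) and tee keeps feeding its
# stdout, so wc -l counts every line.
seq 100000 | ( trap '' PIPE
               tee >( sed q >/dev/null ) 2>/dev/null | wc -l )
# prints 100000
```

Without the trap, the count depends on how far tee gets before the fatal SIGPIPE arrives, which is exactly the race visible in the 14139/12773 figures above.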
hi Ralph,

thank you for that clarification. the function works a little faster with these changes.

    file_info(){
        echo -n ${1:=/dev/stdin}$'\t'
        ( tee < "${1}" \
            >( file --mime-type -b -e compress -e tar -e elf - >&3; cat >/dev/null ) \
            >( md5sum >&3 ) \
            >( sha1sum >&3 ) \
            >/dev/null ) 3>&1 | tr '\n' '\t'
        echo
    }

    pv big.tar > /dev/null
    1,71GiO 0:00:00 [5,32GiB/s] [================================>] 100%

    time ( for i in $( seq 24 ); do file_info big.tar; done )
    big.tar  application/x-gtar  53f0d0240e5ddc94266087ec96ebb802236fa0bc -  6989542fabd98b04086524d1106b7907 -
    ...
    big.tar  application/x-gtar  53f0d0240e5ddc94266087ec96ebb802236fa0bc -  6989542fabd98b04086524d1106b7907 -
    real 3m2,712s  user 0m14,988s  sys 1m13,303s

    file_info(){
        echo -n ${1:=/dev/stdin}$'\t'
        ( trap "" pipe
          tee < "${1}" \
              >( md5sum >&3 ) \
              >( sha1sum >&3 ) \
          | file --mime-type -b -e compress -e tar -e elf - ) 3>&1 | tr '\n' '\t'
        echo
    }

    pv big.tar > /dev/null
    1,71GiO 0:00:00 [5,37GiB/s] [================================>] 100%

    time ( for i in $( seq 24 ); do file_info big.tar; done )
    big.tar  application/x-gtar  53f0d0240e5ddc94266087ec96ebb802236fa0bc -  6989542fabd98b04086524d1106b7907 -
    ...
    big.tar  application/x-gtar  53f0d0240e5ddc94266087ec96ebb802236fa0bc -  6989542fabd98b04086524d1106b7907 -
    real 2m36,013s  user 0m9,349s  sys 0m50,257s

On Thu, 19 Dec 2019 at 11:59, Ralph Corderoy <ralph@inputplus.co.uk> wrote:
> You're missing the reason why. tee(1) receives a SIGPIPE because it writes to a pipe that's closed. Adding a cat(1) is a waste of CPU, as is discarding tee's stdout instead of using it for one of the workers.
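The final plumbing, where tee's stdout feeds one worker directly and the checksums ride on fd 3, can be checked with a deterministic stand-in for file (wc -c here, since file's output depends on the libmagic version; the sample data is illustrative):

```shell
#!/bin/bash
# Same plumbing as the trap-based file_info: tee reads the input
# once, fans it out to md5sum and sha1sum on fd 3, and pipes its
# stdout to a third consumer (wc -c standing in for file).
tmp=$(mktemp)
printf 'hello\n' > "$tmp"

( trap '' PIPE
  tee < "$tmp" \
      >( md5sum  >&3 ) \
      >( sha1sum >&3 ) \
  | wc -c ) 3>&1

rm -f "$tmp"
```

wc -c reports all 6 bytes alongside the two digests, showing that every consumer saw the whole stream; the three output lines can arrive in any order.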
hi,

https://github.com/patatetom/hashs

and merry Christmas.

On Thu, 19 Dec 2019 at 23:16, Pascal <patatetom@gmail.com> wrote:
> hi Ralph, thank you for that clarification. the function works a little faster.
participants (4)
- Andy Pieters
- mar77i@protonmail.ch
- Pascal
- Ralph Corderoy