[pacman-dev] [PATCH] [Idea] makepkg: Extract from any file bsdtar can recognize
If "file -bizL" does not return a supported type. Check if the dist file
is recognized by bsdtar and if yes extract from it.

Signed-off-by: Nezmer <git@nezmer.info>
---
bsdtar can recognize many containers and compression algorithms.

lzma is the best compression the GNU guys are using in their dist files.

% file -bizL wget-1.12.tar.lzma
application/octet-stream; charset=binary

"file" output is not really useful.

Using lzma can reduce the size of source packages by a good margin.

 scripts/makepkg.sh.in |    9 ++++++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/scripts/makepkg.sh.in b/scripts/makepkg.sh.in
index 76b6183..71b87ff 100644
--- a/scripts/makepkg.sh.in
+++ b/scripts/makepkg.sh.in
@@ -685,9 +685,12 @@ extract_sources() {
 					*) continue;;
 				esac ;;
 			*)
-				# Don't know what to use to extract this file,
-				# skip to the next file
-				continue;;
+				# Check If bsdtar can recognize the file
+				if bsdtar -tf "$file" &>/dev/null; then
+					cmd="bsdtar"
+				else
+					continue
+				fi ;;
 		esac
 
 		local ret=0
-- 
1.7.1
On Sat, May 29, 2010 at 03:44:07PM -0400, Nezmer wrote:
If "file -bizL" does not return a supported type. Check if the dist file is recognized by bsdtar and if yes extract from it.
This is great, but I think this check:
    if bsdtar -tf "$file" &>/dev/null; then
Should come before:
    local file_type=$(file -bizL "$file")
If bsdtar -tf is deemed as reliable, then it should make the file(1) check redundant, seeing that makepkg uses bsdtar to extract.
Andres P
On Sat, May 29, 2010 at 4:08 PM, Andres P <aepd87@gmail.com> wrote:
On Sat, May 29, 2010 at 03:44:07PM -0400, Nezmer wrote:
If "file -bizL" does not return a supported type. Check if the dist file is recognized by bsdtar and if yes extract from it.
This is great, but I think this check:
    if bsdtar -tf "$file" &>/dev/null; then
Should come before:
    local file_type=$(file -bizL "$file")
If bsdtar -tf is deemed as reliable, then it should make the file(1) check redundant, seeing that makepkg uses bsdtar to extract.
Except you've now introduced the overhead of reading every archive twice which is really stupid, since 95% of files will pass the "file" check.
It would be much more beneficial if someone could see if upstream bsdtar could add some command line flag to basically check if a file is a valid archive that bsdtar can process and gives a return code based off of that (without having to read through the entire archive).
-Dan
On Sat, May 29, 2010 at 04:17:16PM -0500, Dan McGee wrote:
If bsdtar -tf is deemed as reliable, then it should make the file(1) check redundant, seeing that makepkg uses bsdtar to extract.
Except you've now introduced the overhead of reading every archive twice which is really stupid, since 95% of files will pass the "file" check.
I realized that it was slower before I mentioned it:

    $ time for i in {1..1000}; do file -bizL neon-0.29.3-2-i686.pkg.tar.xz >/dev/null; done
    real    0m20.664s

    $ time for i in {1..1000}; do bsdtar -tf neon-0.29.3-2-i686.pkg.tar.xz >/dev/null; done
    real    1m16.193s

But it greatly simplifies code :/
It's the python dilemma...
Andres P
On Sat, May 29, 2010 at 4:22 PM, Andres P <aepd87@gmail.com> wrote:
On Sat, May 29, 2010 at 04:17:16PM -0500, Dan McGee wrote:
If bsdtar -tf is deemed as reliable, then it should make the file(1) check redundant, seeing that makepkg uses bsdtar to extract.
Except you've now introduced the overhead of reading every archive twice which is really stupid, since 95% of files will pass the "file" check.
I realized that it was slower before I mentioned it:

    $ time for i in {1..1000}; do file -bizL neon-0.29.3-2-i686.pkg.tar.xz >/dev/null; done
    real    0m20.664s

    $ time for i in {1..1000}; do bsdtar -tf neon-0.29.3-2-i686.pkg.tar.xz >/dev/null; done
    real    1m16.193s
But it greatly simplifies code :/
It's the python dilemma...
neon? What is that, 172K? Try something that would really suck. I don't care if it greatly simplifies code.

dmcgee@dublin ~ $ ll /var/cache/makepkg/src/linux-2.6.34.tar.bz2
-rw-r--r-- 1 dmcgee wheel 65M May 16 20:46 /var/cache/makepkg/src/linux-2.6.34.tar.bz2

This is *ONE* iteration:

dmcgee@dublin ~ $ time bsdtar -tf /var/cache/makepkg/src/linux-2.6.34.tar.bz2 >/dev/null

real    0m51.631s
user    0m49.940s
sys     0m0.287s

-Dan
On Sat, May 29, 2010 at 5:01 PM, Dan McGee <dpmcgee@gmail.com> wrote:
neon? What is that, 172K? Try something that would really suck. I don't care if it greatly simplifies code.
Heh, then you wouldn't be using Bash.
The past few of my patches happen to give better performance only by coincidence. If I had gotten the impression that performance was the priority, it wouldn't have been from reading the source code. ;)
Andres P
On Sat, May 29, 2010 at 04:31:23PM -0500, Dan McGee wrote:
On Sat, May 29, 2010 at 4:22 PM, Andres P <aepd87@gmail.com> wrote:
On Sat, May 29, 2010 at 04:17:16PM -0500, Dan McGee wrote:
If bsdtar -tf is deemed as reliable, then it should make the file(1) check redundant, seeing that makepkg uses bsdtar to extract.
Except you've now introduced the overhead of reading every archive twice which is really stupid, since 95% of files will pass the "file" check.
I realized that it was slower before I mentioned it:

    $ time for i in {1..1000}; do file -bizL neon-0.29.3-2-i686.pkg.tar.xz >/dev/null; done
    real    0m20.664s

    $ time for i in {1..1000}; do bsdtar -tf neon-0.29.3-2-i686.pkg.tar.xz >/dev/null; done
    real    1m16.193s
But it greatly simplifies code :/
It's the python dilemma...
neon? What is that, 172K? Try something that would really suck. I don't care if it greatly simplifies code.
dmcgee@dublin ~ $ ll /var/cache/makepkg/src/linux-2.6.34.tar.bz2
-rw-r--r-- 1 dmcgee wheel 65M May 16 20:46 /var/cache/makepkg/src/linux-2.6.34.tar.bz2
This is *ONE* iteration:
dmcgee@dublin ~ $ time bsdtar -tf /var/cache/makepkg/src/linux-2.6.34.tar.bz2 >/dev/null
real    0m51.631s
user    0m49.940s
sys     0m0.287s
Would caching help here if "bsdtar -tf" loads the archive into memory?
-Dan
On Sat, May 29, 2010 at 4:56 PM, Nezmer <git@nezmer.info> wrote:
On Sat, May 29, 2010 at 04:31:23PM -0500, Dan McGee wrote:
This is *ONE* iteration:
dmcgee@dublin ~ $ time bsdtar -tf /var/cache/makepkg/src/linux-2.6.34.tar.bz2 >/dev/null
real    0m51.631s
user    0m49.940s
sys     0m0.287s
Would caching help here if "bsdtar -tf" loads the archive into memory?
This was a cached run; it makes no difference though because the bottleneck here is CPU, not I/O.
-Dan
On Sat, May 29, 2010 at 04:17:16PM -0500, Dan McGee wrote:
On Sat, May 29, 2010 at 4:08 PM, Andres P <aepd87@gmail.com> wrote:
On Sat, May 29, 2010 at 03:44:07PM -0400, Nezmer wrote:
If "file -bizL" does not return a supported type. Check if the dist file is recognized by bsdtar and if yes extract from it.
This is great, but I think this check:
    if bsdtar -tf "$file" &>/dev/null; then
Should come before:
    local file_type=$(file -bizL "$file")
If bsdtar -tf is deemed as reliable, then it should make the file(1) check redundant, seeing that makepkg uses bsdtar to extract.
Except you've now introduced the overhead of reading every archive twice which is really stupid, since 95% of files will pass the "file" check.
Exactly
It would be much more beneficial if someone could see if upstream bsdtar could add some command line flag to basically check if a file is a valid archive that bsdtar can process and gives a return code based off of that (without having to read through the entire archive).
Unfortunately, "bsdtar -tqf" always returns 0 here. If I understand the man page correctly, this should not be the case as optimization should not bypass errors. The idea is a compromise after all. Is the automation of extraction from valid archives worth the overhead or not?
-Dan
From: Nezmer <git@nezmer.info>

If "file -bizL" does not return a supported type, check if the file is
recognized by bsdtar and if yes extract from it.

Dan: use '-q' option to prevent needing to seek the entire archive.

Signed-off-by: Nezmer <git@nezmer.info>
Signed-off-by: Dan McGee <dan@archlinux.org>
---
I think I got the '-q' option working just fine as long as you order the options correctly to bsdtar. This appears to work on random files I was testing it on from the command line.

$ time bsdtar -tf /var/cache/makepkg/src/linux-2.6.34.tar.bz2 -q '*'; echo return: $?
linux-2.6.34/

real    0m0.079s
user    0m0.063s
sys     0m0.010s
return: 0

-Dan

 scripts/makepkg.sh.in |    9 ++++++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/scripts/makepkg.sh.in b/scripts/makepkg.sh.in
index 76b6183..28e550b 100644
--- a/scripts/makepkg.sh.in
+++ b/scripts/makepkg.sh.in
@@ -685,9 +685,12 @@ extract_sources() {
 					*) continue;;
 				esac ;;
 			*)
-				# Don't know what to use to extract this file,
-				# skip to the next file
-				continue;;
+				# See if bsdtar can recognize the file
+				if bsdtar -tf "$file" -q '*' &>/dev/null; then
+					cmd="bsdtar"
+				else
+					continue
+				fi ;;
 		esac
 
 		local ret=0
-- 
1.7.1
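For readers following along, the probe from this patch can be isolated into a small helper. This is only a sketch: the function name bsdtar_can_read is hypothetical and does not exist in makepkg, and the redirection is written in the portable form rather than the "&>" used in the patch. The bsdtar options are the ones from the patch itself.

```shell
# Hypothetical helper, not part of makepkg.
# Succeeds if bsdtar recognizes "$1" as an archive it can process.
bsdtar_can_read() {
	local file=$1
	# -t lists the archive; -q '*' makes bsdtar stop after the first
	# entry matching the pattern, so the rest of the archive is not read.
	bsdtar -tf "$file" -q '*' >/dev/null 2>&1
}
```

With such a helper, the new branch in extract_sources() would reduce to testing bsdtar_can_read "$file" and setting cmd="bsdtar" on success.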
On 03/06/10 10:40, Dan McGee wrote:
From: Nezmer<git@nezmer.info>
If "file -bizL" does not return a supported type, check if the file is recognized by bsdtar and if yes extract from it.
Dan: use '-q' option to prevent needing to seek the entire archive.
Signed-off-by: Nezmer <git@nezmer.info>
Signed-off-by: Dan McGee <dan@archlinux.org>
---
I think I got the '-q' option working just fine as long as you order the options correctly to bsdtar. This appears to work on random files I was testing it on from the command line.
$ time bsdtar -tf /var/cache/makepkg/src/linux-2.6.34.tar.bz2 -q '*'; echo return: $?
linux-2.6.34/

real    0m0.079s
user    0m0.063s
sys     0m0.010s
return: 0
-Dan
Looks good. Pushed to my post-3.4 branch.
Allan
On 6/3/10, Allan McRae <allan@archlinux.org> wrote:
Looks good. Pushed to my post-3.4 branch.
Allan
$ time for i in {1..1000}; do bsdtar -tf /var/cache/pacman/pkg/kernel26-2.6.33.4-1-i686.pkg.tar.xz -q \* >/dev/null; done
real    0m18.170s
user    0m14.196s
sys     0m3.130s

$ time for i in {1..1000}; do file -bizL /var/cache/pacman/pkg/kernel26-2.6.33.4-1-i686.pkg.tar.xz >/dev/null; done
real    0m23.273s
user    0m7.603s
sys     0m9.956s

So, is the only reason the 'case in file -bizL' block is still there that no one has written a patch for it?
Because now you've got your performance sorted out.
Andres P
On 03/06/10 20:10, Andres P wrote:
On 6/3/10, Allan McRae<allan@archlinux.org> wrote:
Looks good. Pushed to my post-3.4 branch.
Allan
$ time for i in {1..1000}; do bsdtar -tf /var/cache/pacman/pkg/kernel26-2.6.33.4-1-i686.pkg.tar.xz -q \* >/dev/null; done
real    0m18.170s
user    0m14.196s
sys     0m3.130s

$ time for i in {1..1000}; do file -bizL /var/cache/pacman/pkg/kernel26-2.6.33.4-1-i686.pkg.tar.xz >/dev/null; done
real    0m23.273s
user    0m7.603s
sys     0m9.956s

So, is the only reason the 'case in file -bizL' block is still there that no one has written a patch for it?
Because now you've got your performance sorted out.
No... we would still need "file -bizL" to identify the non-tar compressed files and extract them.
Given these are about the same speed (time difference is a much smaller percentage on my machine), we need to guess which is more regularly used as a source: compressed non-tar files, or files bsdtar can extract that do not currently get extracted. So I see any further change giving little gain.
Allan
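To illustrate why both checks remain, here is a simplified sketch of the dispatch being discussed. It is not the actual makepkg code: the function name choose_extractor and the exact MIME patterns are assumptions made for illustration. Only the final fallback branch mirrors the patch in this thread.

```shell
# Simplified sketch, not the real extract_sources() from makepkg.
# choose_extractor and the MIME patterns are illustrative assumptions.
# $1: output of `file -bizL "$file"`, $2: path of the source file.
# Prints the tool to use; returns 1 if the file should be skipped.
choose_extractor() {
	local file_type=$1 file=$2
	case "$file_type" in
		*application/x-tar*|*application/zip*)
			# Archives (tar, possibly compressed): bsdtar handles them.
			echo bsdtar ;;
		*application/x-gzip*)
			# A compressed single file that is not a tar.
			echo gzip ;;
		*application/x-bzip2*)
			echo bzip2 ;;
		*application/x-xz*)
			echo xz ;;
		*)
			# file(1) was no help: ask bsdtar, as the patch does.
			if bsdtar -tf "$file" -q '*' >/dev/null 2>&1; then
				echo bsdtar
			else
				return 1
			fi ;;
	esac
}
```

The tar pattern is checked first because "file -z" looks inside the compression, so a compressed tarball matches there before any of the single-file compression branches. The non-tar compressed branches are the part that a bsdtar-only check would not replace.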
On 30/05/10 05:44, Nezmer wrote:
If "file -bizL" does not return a supported type. Check if the dist file is recognized by bsdtar and if yes extract from it.
Signed-off-by: Nezmer <git@nezmer.info>
---
bsdtar can recognize many containers and compression algorithms.
lzma is the best compression the GNU guys are using in their dist files.
% file -bizL wget-1.12.tar.lzma
application/octet-stream; charset=binary
"file" output is not really useful.
Using lzma can reduce the size of source packages by a good margin.
Is .lzma used by anyone these days? I have seen no GNU projects using it in recent times, given that .xz has completely taken over as its successor. There is no point in doing this for a dead format.
Allan
On Sun, May 30, 2010 at 09:01:51AM +1000, Allan McRae wrote:
On 30/05/10 05:44, Nezmer wrote:
If "file -bizL" does not return a supported type. Check if the dist file is recognized by bsdtar and if yes extract from it.
Signed-off-by: Nezmer <git@nezmer.info>
---
bsdtar can recognize many containers and compression algorithms.
lzma is the best compression the GNU guys are using in their dist files.
% file -bizL wget-1.12.tar.lzma
application/octet-stream; charset=binary
"file" output is not really useful.
Using lzma can reduce the size of source packages by a good margin.
Is .lzma used by anyone these days? I have seen no GNU projects using it in recent times, given that .xz has completely taken over as its successor. There is no point in doing this for a dead format.
I did a quick scan of repo packages that refer to "ftp.gnu.org/gnu" as a source. Apparently most projects still use, or used in their latest release, gz/bzip2 only. Some projects are starting to use xz (my dummy script counted 11). But automake, texinfo, libtool and wget used lzma as a modern(ish) compression in their latest releases.
Anyway, the idea was to not depend completely on hard-coded MIME types and extensions. bsdtar can do way more. Another example, which might be as bad as the lzma one, is all those AUR PKGBUILDs that depend on rpm sources.
Allan
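A scan like the one described above could look roughly like the following. This is a guess at the approach, not the script that was actually used: the function name count_source_formats is hypothetical, and it assumes the source URLs have already been collected one per line.

```shell
# Hypothetical sketch, not the script used for the scan above.
# Reads source URLs on stdin (one per line) and prints each final
# file extension with the number of URLs that end in it.
count_source_formats() {
	awk -F. 'NF > 1 { count[$NF]++ }
		END { for (ext in count) print ext, count[ext] }' | sort
}
```

Feeding it the source URLs of packages that point at ftp.gnu.org/gnu would give one line per extension (gz, bz2, xz, lzma) with its count.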