[pacman-dev] [PATCH] [Idea] makepkg: Extract from any file bsdtar can recognize
If "file -bizL" does not return a supported type. Check if the dist file
is recognized by bsdtar and if yes extract from it.

Signed-off-by: Nezmer <git@nezmer.info>
---
bsdtar can recognize many containers and compression algorithms.

lzma is the best compression the GNU guys are using in their dist files.

% file -bizL wget-1.12.tar.lzma
application/octet-stream; charset=binary

"file" output is not really useful. Using lzma can reduce the size of
source packages by a good margin.

 scripts/makepkg.sh.in | 9 ++++++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/scripts/makepkg.sh.in b/scripts/makepkg.sh.in
index 76b6183..71b87ff 100644
--- a/scripts/makepkg.sh.in
+++ b/scripts/makepkg.sh.in
@@ -685,9 +685,12 @@ extract_sources() {
 				*) continue;;
 			esac ;;
 		*)
-			# Don't know what to use to extract this file,
-			# skip to the next file
-			continue;;
+			# Check If bsdtar can recognize the file
+			if bsdtar -tf "$file" &>/dev/null; then
+				cmd="bsdtar"
+			else
+				continue
+			fi ;;
 		esac
 
 		local ret=0
-- 
1.7.1
On Sat, May 29, 2010 at 03:44:07PM -0400, Nezmer wrote:
If "file -bizL" does not return a supported type. Check if the dist file is recognized by bsdtar and if yes extract from it.
This is great, but I think this check:
If bsdtar -tf is deemed as reliable, then it should make the file(1) check redundant, seeing that makepkg uses bsdtar to extract. Andres P
On Sat, May 29, 2010 at 4:08 PM, Andres P <aepd87@gmail.com> wrote:
Except you've now introduced the overhead of reading every archive twice, which is really stupid, since 95% of files will pass the "file" check.

It would be much more beneficial if someone could see if upstream bsdtar could add some command line flag to basically check if a file is a valid archive that bsdtar can process and give a return code based off of that (without having to read through the entire archive).

-Dan
On Sat, May 29, 2010 at 04:17:16PM -0500, Dan McGee wrote:
I realized that it was slower before I mentioned it:

$ time for i in {1..1000}; do file -bizL neon-0.29.3-2-i686.pkg.tar.xz >/dev/null; done
real 0m20.664s

$ time for i in {1..1000}; do bsdtar -tf neon-0.29.3-2-i686.pkg.tar.xz >/dev/null; done
real 1m16.193s

But it greatly simplifies code :/ It's the python dilemma...

Andres P
On Sat, May 29, 2010 at 4:22 PM, Andres P <aepd87@gmail.com> wrote:
neon? What is that, 172K? Try something that would really suck. I don't care if it greatly simplifies code.

dmcgee@dublin ~
$ ll /var/cache/makepkg/src/linux-2.6.34.tar.bz2
-rw-r--r-- 1 dmcgee wheel 65M May 16 20:46 /var/cache/makepkg/src/linux-2.6.34.tar.bz2

This is *ONE* iteration:

dmcgee@dublin ~
$ time bsdtar -tf /var/cache/makepkg/src/linux-2.6.34.tar.bz2 >/dev/null

real 0m51.631s
user 0m49.940s
sys 0m0.287s

-Dan
On Sat, May 29, 2010 at 5:01 PM, Dan McGee <dpmcgee@gmail.com> wrote:
neon? What is that, 172K? Try something that would really suck. I don't care if it greatly simplifies code.
Heh, then you wouldn't be using Bash.

The past few of my patches happen to give better performance only by coincidence. If I had gotten the impression that performance was the priority, it wouldn't have been from reading the source code. ;)

Andres P
On Sat, May 29, 2010 at 4:56 PM, Nezmer <git@nezmer.info> wrote:
This was a cached run; it makes no difference though because the bottleneck here is CPU, not I/O. -Dan
On Sat, May 29, 2010 at 04:17:16PM -0500, Dan McGee wrote:
Exactly
Unfortunately, "bsdtar -tqf" always returns 0 here. If I understand the man page correctly, this should not be the case as optimization should not bypass errors. The idea is a compromise after all. Is the automation of extraction from valid archives worth the overhead or not?
-Dan
From: Nezmer <git@nezmer.info>

If "file -bizL" does not return a supported type, check if the file is
recognized by bsdtar and if yes extract from it.

Dan: use '-q' option to prevent needing to seek the entire archive.

Signed-off-by: Nezmer <git@nezmer.info>
Signed-off-by: Dan McGee <dan@archlinux.org>
---
I think I got the '-q' option working just fine as long as you order the
options correctly to bsdtar. This appears to work on random files I was
testing it on from the command line.

$ time bsdtar -tf /var/cache/makepkg/src/linux-2.6.34.tar.bz2 -q '*'; echo return: $?
linux-2.6.34/

real 0m0.079s
user 0m0.063s
sys 0m0.010s
return: 0

-Dan

 scripts/makepkg.sh.in | 9 ++++++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/scripts/makepkg.sh.in b/scripts/makepkg.sh.in
index 76b6183..28e550b 100644
--- a/scripts/makepkg.sh.in
+++ b/scripts/makepkg.sh.in
@@ -685,9 +685,12 @@ extract_sources() {
 				*) continue;;
 			esac ;;
 		*)
-			# Don't know what to use to extract this file,
-			# skip to the next file
-			continue;;
+			# See if bsdtar can recognize the file
+			if bsdtar -tf "$file" -q '*' &>/dev/null; then
+				cmd="bsdtar"
+			else
+				continue
+			fi ;;
 		esac
 
 		local ret=0
-- 
1.7.1
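For readers who want to try the fallback outside makepkg, here is a minimal standalone sketch of the two-stage check, under some assumptions: the function name choose_extractor is hypothetical and not part of makepkg, the MIME patterns are only a subset of what the real case statement handles, and the caller is expected to pass in the output of "file -bizL". Only the "bsdtar -tf FILE -q '*'" probe is taken from the patch itself.

```shell
# Hypothetical helper (not makepkg code).
#   $1: MIME string as printed by `file -bizL "$2"`
#   $2: path to the source file
# Prints the extractor to use, or returns 1 if the file should be skipped.
choose_extractor() {
    case "$1" in
        *application/x-tar*|*application/zip*)
            # Cheap path: file(1) already identified an archive type,
            # so the archive does not have to be opened a second time.
            echo bsdtar ;;
        *)
            # Fallback from the patch: ask bsdtar to list the archive.
            # With -q it stops after the first entry matching '*'
            # instead of reading the whole archive.
            if bsdtar -tf "$2" -q '*' >/dev/null 2>&1; then
                echo bsdtar
            else
                return 1
            fi ;;
    esac
}
```

It could be called as cmd=$(choose_extractor "$(file -bizL "$f")" "$f") || continue inside a loop over source files. The probe only runs for files that the MIME check did not already classify, which is the case Dan's objection about reading every archive twice was concerned with.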
On 6/3/10, Allan McRae <allan@archlinux.org> wrote:
Looks good. Pushed to my post-3.4 branch.
Allan
$ time for i in {1..1000}; do bsdtar -tf /var/cache/pacman/pkg/kernel26-2.6.33.4-1-i686.pkg.tar.xz -q \*
$ time for i in {1..1000}; do file -bizL /var/cache/pacman/pkg/kernel26-2.6.33.4-1-i686.pkg.tar.xz >/dev/null; done

real 0m23.273s
user 0m7.603s
sys 0m9.956s

So, is the only reason the 'case in file -bizL' block is still there that no one has written a patch for it? Because now you've got your performance sorted out.

Andres P
On 03/06/10 20:10, Andres P wrote:
No... we would still need "file -bizL" to identify the non-tar compressed files and extract them.

Given these are about the same speed (the time difference is a much smaller percentage on my machine), we need to guess which is more regularly used as a source: compressed non-tar files, or files bsdtar can extract that do not currently get extracted. So I see any further change giving little gain.

Allan
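To illustrate why the MIME check is still needed, here is a rough sketch of how a non-tar compressed file might be mapped to a decompressor. The function name single_file_decompressor is hypothetical, and the MIME strings are assumptions based on what "file -biz" commonly prints; they can differ between versions of file(1).

```shell
# Hypothetical helper (not makepkg code).
#   $1: MIME string as printed by `file -bizL`
# Prints the decompressor for a single compressed file, or returns 1 when
# the payload is a tar archive (bsdtar's job) or the type is unknown.
single_file_decompressor() {
    case "$1" in
        *application/x-tar*)   return 1 ;;    # tar inside: use bsdtar instead
        *application/x-gzip*)  echo gzip ;;
        *application/x-bzip2*) echo bzip2 ;;
        *application/x-xz*)    echo xz ;;
        *)                     return 1 ;;
    esac
}
```

A caller could then run "$cmd" -dc "$file" > "${file%.*}" to write the decompressed data next to the original, since gzip, bzip2 and xz all accept -d and -c.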
On 30/05/10 05:44, Nezmer wrote:
Is .lzma used by anyone these days? I have seen no GNU projects using it in recent times, given that .xz has completely taken over as its successor. There is no point in doing this for a dead format.

Allan
On Sun, May 30, 2010 at 09:01:51AM +1000, Allan McRae wrote:
I did a fast scan on repo packages that refer to "ftp.gnu.org/gnu" as a source. Apparently most projects still use, or used in their latest release, gz/bzip2 only. Some projects are starting to use xz (my dummy script counted 11). But automake, texinfo, libtool and wget used lzma as a modern(ish) compression in their latest releases.

Anyway, the idea was not to depend on hard-coded mime types and extensions completely. bsdtar can do way more. Another example, which might be as bad as the lzma one, is all those AUR PKGBUILDs that depend on rpm sources.
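A scan of this kind could be sketched roughly as follows. This is not the script used above; the function name count_gnu_sources is hypothetical, and it assumes PKGBUILD contents are fed on stdin with one source URL per line.

```shell
# Hypothetical helper: tally the compression extension of .tar.* sources
# hosted under ftp.gnu.org/gnu. Reads PKGBUILD text on stdin and prints
# "<extension> <count>" lines sorted by extension.
count_gnu_sources() {
    grep 'ftp\.gnu\.org/gnu' |
        sed -n 's/.*\.tar\.\([A-Za-z0-9]*\).*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

Lines that do not mention ftp.gnu.org/gnu, or that have no .tar.<ext> suffix, are ignored, so the counts cover tarballs only.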
Allan
participants (5)
- Allan McRae
- Andres P
- Dan McGee
- Dan McGee
- Nezmer