[pacman-dev] [arch-dev-public] Filename search for Arch

Nathan Jones nathanj at insightbb.com
Sun Jan 27 12:13:53 EST 2008


### query the SQLite database directly
$ time sqlite3 archweb.db "select path, pkgname from packages p, packages_files pf where p.id = pf.pkg_id and path like '%usr/bin/gcc%'"
usr/bin/gcc-3.3|gcc3
usr/bin/gccbug-3.3|gcc3
usr/bin/gcc-3.4|gcc34
usr/bin/gccbug-3.4|gcc34
usr/bin/gccbug|gcc
usr/bin/gcc|gcc
usr/bin/gccmakedep|imake
sqlite3 archweb.db   3.45s user 0.14s system 99% cpu 3.622 total


### create a greppable file (the "not like '%/'" clause drops directory entries)
$ time sqlite3 archweb.db "select path, pkgname from packages p, packages_files pf where p.id = pf.pkg_id and path not like '%/'" | sort > filelist 
sqlite3 archweb.db   5.80s user 0.25s system 86% cpu 6.981 total
sort > filelist  1.30s user 0.41s system 21% cpu 7.843 total

$ time gzip -9 < filelist > filelist.gz
gzip -9 < filelist > filelist.gz  3.06s user 0.03s system 98% cpu 3.119 total

$ ls -lh filelist*
-rw-r--r-- 1 nathanj users  32M 2008-01-27 11:46 filelist
-rw-r--r-- 1 nathanj users 2.7M 2008-01-27 11:46 filelist.gz


### query using grep
$ time grep usr/bin/gcc filelist
usr/bin/gcc-3.3|gcc3
usr/bin/gcc-3.4|gcc34
usr/bin/gccbug-3.3|gcc3
usr/bin/gccbug-3.4|gcc34
usr/bin/gccbug|gcc
usr/bin/gccmakedep|imake
usr/bin/gcc|gcc
grep usr/bin/gcc filelist  0.02s user 0.02s system 80% cpu 0.045 total

$ time zgrep usr/bin/gcc filelist.gz
usr/bin/gcc-3.3|gcc3
usr/bin/gcc-3.4|gcc34
usr/bin/gccbug-3.3|gcc3
usr/bin/gccbug-3.4|gcc34
usr/bin/gccbug|gcc
usr/bin/gccmakedep|imake
usr/bin/gcc|gcc
zgrep usr/bin/gcc filelist.gz  0.23s user 0.05s system 99% cpu 0.279 total


I think the best way to implement this would be to set up a cron job on
the main Arch Linux server to generate the greppable file. Generation is
fast enough to run daily if desired, but that is a bit overkill (FWIW, I
believe Debian's apt-file database regenerates weekly). I would not
serve the gzipped file for searching, since the zgrep run above is
roughly six times slower than plain grep on the uncompressed file.
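
As a rough sketch, the generation step the cron job runs could be a
short Python script like the one below. It uses the same query as the
sqlite3/sort pipeline above; the script name and the /srv/http/filelist
output path are placeholders, not real locations.

#!/usr/bin/env python
# Hypothetical generation script for the cron job: dump path|pkgname
# pairs from archweb.db (same query as the pipeline above), sort them,
# and write the plain-text file list.
import sqlite3

conn = sqlite3.connect('archweb.db')
rows = conn.execute(
    "select path, pkgname from packages p, packages_files pf "
    "where p.id = pf.pkg_id and path not like '%/'")
lines = sorted('%s|%s' % (path, pkgname) for path, pkgname in rows)
conn.close()

with open('/srv/http/filelist', 'w') as f:
    f.write('\n'.join(lines) + '\n')

A weekly crontab entry along the lines of "0 3 * * 0 python
genfilelist.py" (again, hypothetical) would match apt-file's schedule.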

All searches would go through the website. Server load shouldn't be a
problem: 50 people hitting the search page a few times each is still
cheaper than those same 50 people each downloading the 2.7MB compressed
file list. I doubt there would be that many searches anyway; this type
of search is very useful when it is needed, but it is not needed that
often.
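
To sketch what the server side might look like (a guess, not an
existing archweb endpoint; the filelist path and the "q" parameter name
are made up for illustration), a tiny WSGI handler that does the same
substring match as the grep runs above:

#!/usr/bin/env python
# Hypothetical server-side search: substring-match the query against
# the pregenerated file list, equivalent to grepping it by hand.
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server

FILELIST = '/srv/http/filelist'  # assumed location of the file list

def app(environ, start_response):
    query = parse_qs(environ.get('QUERY_STRING', '')).get('q', [''])[0]
    matches = []
    if query:
        with open(FILELIST) as f:
            # plain substring match, same result as grep on the file
            matches = [line for line in f if query in line]
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [''.join(matches).encode()]

if __name__ == '__main__':
    make_server('', 8000, app).serve_forever()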

For command line usage, it should be easy to write a ten-line Python
script that queries the server and prints the results; a sketch
follows.
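
Something along these lines would do (the search URL is a placeholder,
not an existing archweb endpoint):

#!/usr/bin/env python
# Hypothetical command line client: send the filename query to the web
# search and print the matching path|pkgname lines.
import sys
from urllib.parse import quote
from urllib.request import urlopen

SEARCH_URL = 'http://archlinux.org/packages/search/files/?q='  # placeholder

if len(sys.argv) != 2:
    sys.exit('usage: %s <filename>' % sys.argv[0])

with urlopen(SEARCH_URL + quote(sys.argv[1])) as response:
    sys.stdout.write(response.read().decode())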



