Re: [pacman-dev] [arch-dev-public] Filename search for Arch
On Jan 25, 2008 5:31 PM, eliott <eliott@cactuswax.net> wrote:
On 1/25/08, Aaron Griffin <aaronmgriffin@gmail.com> wrote:
On Jan 25, 2008 5:02 PM, Thomas Bächler <thomas@archlinux.org> wrote:
eliott wrote:
I guess I don't see where this script fits in, and how it is supposed to be used. thomas made mention of using zgrep for advanced users, but that seems just as difficult as opening a web browser and typing into a search box.
The purpose is to provide filelists for download, so they can be searched offline by pacman. My first idea (implementing an online search in pacman) was rejected, thus I thought about a "download the filelist and search it" offline solution.
Oh, I must have misunderstood too. If you're going to implement filelist search and all that stuff, we should: a) Move this to the pacman-dev mailing list b) Add external tools to do this as part of the "pacman source", i.e. as a patch to repo-add c) Not use this script until pacman actually has this feature.
If the intent is to let users zgrep it, then I agree with cactus that that is significantly more complex than actually using the website to provide a search interface.
Yeah. I wasn't opposed to having a file search mechanism on the site. I was opposed to having pacman query the website. If a user opens up a browser and searches, no problem. It was tying this to pacman that I felt was a *really bad idea*.
Alternatively, if there is a pacman only solution, that involves some mirrored meta in the repository, that is something else entirely, and should probably be talked about on the pacman dev list, so as to make it as distribution neutral as possible.
As I've said already, I really don't think this feature belongs in pacman. Obviously you can draw the connection with the -Ql operation, and the fact that we have -Ss, but this is something a bit different than that and I see it as feature creep. -Dan
On Jan 25, 2008 6:40 PM, Dan McGee <dpmcgee@gmail.com> wrote:
[...] this is something a bit different than that and I see it as feature creep.
-Dan
After a lil thinking, I agree that this doesn't belong in pacman. I also feel this is an important tool people should have offline. Maybe just an entirely new script that searches a provided filelist, or goes to a default location? Something like the following sounds appealing to me:
pacman -s findpkgfile # or whatever else would be a good name
findpkgfile -y # download filelists from some given mirror
findpkgfile '/lib/libyourmom.so.hot'
=> yourmom-5.2-2
Then we can add fun functionality and features to THAT script instead of pacman. One that comes to mind is asking if you'd like to install the found package that contains/owns the file, etc. Thought I'd document my thoughts.
// jeff
--
. : [ + carpe diem totus tuus + ] : .
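A minimal sketch of the lookup step such a script might perform, assuming the downloaded filelist is plain text with one `path|pkgname` entry per line (the format that appears later in this thread). The function name `find_owner` and the sample data are hypothetical, and unlike the example output above this sketch returns package names without versions.

```python
import io

def find_owner(filelist, path):
    """Return the package names that own `path`.

    `filelist` is any iterable of 'path|pkgname' lines (assumed format).
    Paths in the list carry no leading slash, so one is stripped from the query.
    """
    wanted = path.lstrip("/")
    owners = []
    for line in filelist:
        # Split on the last '|' so the package name is always the final field.
        entry, sep, pkgname = line.rstrip("\n").rpartition("|")
        if sep and entry == wanted:
            owners.append(pkgname)
    return owners

# Hypothetical sample data standing in for a downloaded filelist.
sample = io.StringIO("lib/libyourmom.so.hot|yourmom\nusr/bin/gcc|gcc\n")
print(find_owner(sample, "/lib/libyourmom.so.hot"))
```

Because the search only reads a local file, no server is contacted at query time; only the separate download step would touch a mirror.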
On Fri, Jan 25, 2008 at 07:48:37PM -0500, Jeff Mickey <jeff@archlinux.org> wrote:
After a lil thinking, I agree that this doesn't belong in pacman.
i think the bigger problem is that the file list for all pkgs is quite big (no, i haven't made a test on the size, just guessing).
I also feel this is an important tool people should have offline.
do you have a mirror on your own computer? if not, searching offline is not so important. (i think the bigger benefit is that you could search from command-line. - which can be handy when you don't want to suck with elinks and you don't have x.)
findpkgfile -y # download filelists from some given mirror
from where? if the filelist database is outdated, it makes no sense, so if it's not supported by the maintainers of the given repo keeping it up to date is quite problematic imho. - VMiklos
On Jan 26, 2008 6:59 PM, Miklos Vajna <vmiklos@frugalware.org> wrote:
On Fri, Jan 25, 2008 at 07:48:37PM -0500, Jeff Mickey <jeff@archlinux.org> wrote:
I also feel this is an important tool people should have offline.
do you have a mirror on your own computer? if not, searching offline is not so important. (i think the bigger benefit is that you could search from command-line. - which can be handy when you don't want to suck with elinks and you don't have x.)
By offline, I meant it shouldn't query a server every time someone does a search, I think it should be from a local file on your computer. Increasing the load on mirrors etc would be a bad idea.
findpkgfile -y # download filelists from some given mirror
from where? if the filelist database is outdated, it makes no sense, so if it's not supported by the maintainers of the given repo keeping it up to date is quite problematic imho.
Not to blow minds here or anything, but my idea is that we would have a .gz or .bz2 file that is hosted by all the mirrors, and you download it every now and then. You don't need to download it every day, and by it being a separate tool, we could do delta files instead of a straight download. And yes, even if my filelist is a day old, there is still a decent chance I'll find the package I'm looking for. A simple patch to repo-add fixes your "quite problematic" issue. // jeff -- . : [ + carpe diem totus tuus + ] : .
On Sat, Jan 26, 2008 at 07:37:12PM -0500, Jeff Mickey <jeff@archlinux.org> wrote:
it every now and then. You don't need to download it every day, and by it being a separate tool, we could do delta files instead of a straight download.
And yes, even if my filelist is a day old, there is still a decent chance I'll find the package I'm looking for. A simple patch to repo-add fixes your "quite problematic" issue.
i see only one design issue here: repo-add would then generate something that is only used by an external tool? - VMiklos
On Jan 26, 2008 7:47 PM, Miklos Vajna <vmiklos@frugalware.org> wrote:
i see only one design issue here: repo-add would then generate something that is only used by an external tool?
- VMiklos
External in the sense of outside of pacman, yes. But if it's an "official" tool of arch, I don't see the problem here. If anything, we could make yet another script that wraps around repo-add if you really don't want the functionality in repo-add.
On Jan 27, 2008 12:13 PM, Nathan Jones <nathanj@insightbb.com> wrote:
For the command line usage, it should be easy to create a 10 line python script to call to the server and print the results.
This is exactly what I think is a bad idea. I wouldn't want someone to query a server when they searched the filelist.. maybe I'm alone in this sentiment. // jeff -- . : [ + carpe diem totus tuus + ] : .
On Jan 27, 2008 12:31 PM, Jeff Mickey <jeff@archlinux.org> wrote:
This is exactly what I think is a bad idea. I wouldn't want someone to query a server when they searched the filelist.. maybe I'm alone in this sentiment.
I agree - if this is to be done, it's to be done in a downloadable, offline file.
On 1/27/08, Travis Willard <travis@archlinux.org> wrote:
On Jan 27, 2008 12:31 PM, Jeff Mickey <jeff@archlinux.org> wrote:
This is exactly what I think is a bad idea. I wouldn't want someone to query a server when they searched the filelist.. maybe I'm alone in this sentiment.
I agree - if this is to be done, it's to be done in a downloadable, offline file.
Crazy idea. Why not make it a pacman package?
$ pacman -S repo-filelist
that would be able to use our current mirror system, we can push out updates whenever we want, people can easily opt in, or remove it later if they need to, and the package size wouldn't be a whole lot larger than say...the kernel.
We could even include a query utility, that calls zgrep on a compressed file or something.
$ repo-filelist "*gcc*"
usr/bin/gcc-3.3|gcc3
usr/bin/gcc-3.4|gcc34
usr/bin/gccbug-3.3|gcc3
usr/bin/gccbug-3.4|gcc34
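A rough sketch of what such a query utility could look like, assuming the package ships a gzip-compressed text file of `path|pkgname` lines. It uses Python's `gzip` and `fnmatch` modules in place of zgrep; the function name `search_filelist`, the file name `filelist.gz` and the demo entries are assumptions made for illustration.

```python
import fnmatch
import gzip
import os
import tempfile

def search_filelist(gz_path, pattern):
    """Yield 'path|pkgname' lines whose path part matches a shell-style glob."""
    with gzip.open(gz_path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            path, sep, _pkgname = line.rpartition("|")
            # fnmatchcase is case-sensitive, and '*' also matches '/' here.
            if sep and fnmatch.fnmatchcase(path, pattern):
                yield line

# Build a tiny compressed filelist so the sketch runs on its own.
tmpdir = tempfile.mkdtemp()
demo = os.path.join(tmpdir, "filelist.gz")
with gzip.open(demo, "wt", encoding="utf-8") as fh:
    fh.write("usr/bin/gcc-3.3|gcc3\nusr/bin/make|make\nusr/bin/gcc-3.4|gcc34\n")

for hit in search_filelist(demo, "*gcc*"):
    print(hit)
```

Reading the compressed file line by line keeps memory use flat, at the cost of decompressing on every query.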
On Sun 2008-01-27 11:46 , eliott wrote:
On 1/27/08, Travis Willard <travis@archlinux.org> wrote:
On Jan 27, 2008 12:31 PM, Jeff Mickey <jeff@archlinux.org> wrote:
This is exactly what I think is a bad idea. I wouldn't want someone to query a server when they searched the filelist.. maybe I'm alone in this sentiment.
I agree - if this is to be done, it's to be done in a downloadable, offline file.
Crazy idea. Why not make it a pacman package?
$ pacman -S repo-filelist
that would be able to use our current mirror system, we can push out updates whenever we want, people can easily opt in, or remove it later if they need to, and the package size wouldn't be a whole lot larger than say...the kernel. [...]
I had the same idea, but I decided it wasn't a great one because to be useful it should be updated every time a new package is released, and IMHO that's not feasible (mainly for the user), considering its size. I still think the best implementation is to create a filelist with repo-add (or whatever script you use to generate the db) on the Arch server and then let it spread to the mirrors. As someone already wrote, xdelta would dramatically decrease the size of the filelist to download.
--
Alessio Bolognino
Please send personal email to themolok@gmail.com
Public Key http://pgp.mit.edu:11371/pks/lookup?op=get&search=0xFE0270FB
GPG Key ID = 1024D / FE0270FB 2007-04-11
Key Fingerprint = 9AF8 9011 F271 450D 59CF 2D7D 96C9 8F2A FE02 70FB
Alessio Bolognino wrote:
I had the same idea, but I decided it wasn't a great one because to be useful it should be updated every time a new package is released, and IMHO that's not feasible (mainly for the user), considering its size.
I still think the best implementation is to create a filelist with repo-add (or whatever script you use to generate the db) on the Arch server and then let it spread to the mirrors. As someone already wrote, xdelta would dramatically decrease the size of the filelist to download.
I don't see why it would need to be updated so often. Even if it was updated once a week, I would still find it very useful.
On Sun 2008-01-27 22:37 , Xavier wrote:
Alessio Bolognino wrote:
I had the same idea, but I decided it wasn't a great one because to be useful it should be updated every time a new package is released, and IMHO that's not feasible (mainly for the user), considering its size.
I still think the best implementation is to create a filelist with repo-add (or whatever script you use to generate the db) on the Arch server and then let it spread to the mirrors. As someone already wrote, xdelta would dramatically decrease the size of the filelist to download.
I don't see why it would need to be updated so often. Even if it was updated once a week, I would still find it very useful.
Well, updated, say, once a week, it would still be very useful, but if a user searches for /lib/libwhatever.so.7 and doesn't get any result, then he could ask himself whether this is because no package owns that file or because the db is outdated. Anyway, once a week is still better than nothing :)
--
Alessio Bolognino
Please send personal email to themolok@gmail.com
Public Key http://pgp.mit.edu:11371/pks/lookup?op=get&search=0xFE0270FB
GPG Key ID = 1024D / FE0270FB 2007-04-11
Key Fingerprint = 9AF8 9011 F271 450D 59CF 2D7D 96C9 8F2A FE02 70FB
Alessio Bolognino wrote:
Well, updated, say, once a week, it would still be very useful, but if a user searches for /lib/libwhatever.so.7 and doesn't get any result, then he could ask himself whether this is because no package owns that file or because the db is outdated. Anyway, once a week is still better than nothing :)
In that case, he could search for lib/libwhatever only, maybe find the package that owns lib/libwhatever.so.6 or whatever, and then see if the package has been recently updated or not.
On Jan 27, 2008 11:31 AM, Jeff Mickey <jeff@archlinux.org> wrote:
On Jan 26, 2008 7:47 PM, Miklos Vajna <vmiklos@frugalware.org> wrote:
i see only one design issue here: repo-add would then generate something that is only used by an external tool?
- VMiklos
External in the sense of outside of pacman, yes. But if it's an "official" tool of arch, I don't see the problem here. If anything, we could make yet another script that wraps around repo-add if you really don't want the functionality in repo-add.
I agree with vmiklos here. repo-add is not "an official tool of arch" - it's an official tool for pacman, which, while used by arch, we try to stay fairly generic with. So if the utility to do this search is arch specific, then repo-add should not be changed.
This idea sure has sparked a lot of discussion. I'm not sure if the original thread bled over from arch-dev-public, so here's my idea:
We allow repo-add to build additional tarballs (--make-filelist), pacman has an option like SyncFilelists that checks the server for those tarballs on an -Sy operation.
Now, here's where it gets fun. I'm not suggesting the one-giant-file format for this. I'm suggesting a tarball that can be unpacked right on top of the existing sync DB, that simply has the 'filelist' files in it. This way it's almost trivial to extend -Qo and other ops from the local DB to the sync DB.
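To illustrate why an owner query becomes simple once the filelists sit inside the sync DB, here is a small sketch. It assumes a hypothetical layout of one directory per package, named `<pkgname>-<version>`, each holding a `filelist` file with one path per line. The layout details, the function name `sync_owner`, and the package versions in the demo are all made up for illustration and are not pacman's actual format.

```python
import os
import tempfile

def sync_owner(syncdb_dir, path):
    """Return the sync DB entries whose 'filelist' contains `path`."""
    wanted = path.lstrip("/")
    owners = []
    for entry in sorted(os.listdir(syncdb_dir)):
        filelist = os.path.join(syncdb_dir, entry, "filelist")
        if not os.path.isfile(filelist):
            continue  # entry without an unpacked filelist: nothing to search
        with open(filelist, encoding="utf-8") as fh:
            if any(line.strip() == wanted for line in fh):
                owners.append(entry)
    return owners

# Build a throwaway sync DB with placeholder package versions.
root = tempfile.mkdtemp()
demo_pkgs = {
    "gcc-0.0-1": ["usr/bin/gcc", "usr/bin/gccbug"],
    "imake-0.0-1": ["usr/bin/gccmakedep"],
}
for pkg, files in demo_pkgs.items():
    os.makedirs(os.path.join(root, pkg))
    with open(os.path.join(root, pkg, "filelist"), "w", encoding="utf-8") as fh:
        fh.write("\n".join(files) + "\n")

print(sync_owner(root, "/usr/bin/gcc"))
```

The per-package layout means entries without a filelist are simply skipped, so the feature stays optional for repositories that do not generate the extra tarball.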
### query the sql directly
$ time sqlite3 archweb.db "select path, pkgname from packages p, packages_files pf where p.id = pf.pkg_id and path like '%usr/bin/gcc%'"
usr/bin/gcc-3.3|gcc3
usr/bin/gccbug-3.3|gcc3
usr/bin/gcc-3.4|gcc34
usr/bin/gccbug-3.4|gcc34
usr/bin/gccbug|gcc
usr/bin/gcc|gcc
usr/bin/gccmakedep|imake
sqlite3 archweb.db 3.45s user 0.14s system 99% cpu 3.622 total

### create a greppable file
$ time sqlite3 archweb.db "select path, pkgname from packages p, packages_files pf where p.id = pf.pkg_id and path not like '%/'" | sort > filelist
sqlite3 archweb.db 5.80s user 0.25s system 86% cpu 6.981 total
sort > filelist 1.30s user 0.41s system 21% cpu 7.843 total
$ time gzip -9 < filelist > filelist.gz
gzip -9 < filelist > filelist.gz 3.06s user 0.03s system 98% cpu 3.119 total
$ lh filelist*
-rw-r--r-- 1 nathanj users 32M 2008-01-27 11:46 filelist
-rw-r--r-- 1 nathanj users 2.7M 2008-01-27 11:46 filelist.gz

### query using grep
$ time grep usr/bin/gcc filelist
usr/bin/gcc-3.3|gcc3
usr/bin/gcc-3.4|gcc34
usr/bin/gccbug-3.3|gcc3
usr/bin/gccbug-3.4|gcc34
usr/bin/gccbug|gcc
usr/bin/gccmakedep|imake
usr/bin/gcc|gcc
grep usr/bin/gcc filelist 0.02s user 0.02s system 80% cpu 0.045 total
$ time zgrep usr/bin/gcc filelist.gz
usr/bin/gcc-3.3|gcc3
usr/bin/gcc-3.4|gcc34
usr/bin/gccbug-3.3|gcc3
usr/bin/gccbug-3.4|gcc34
usr/bin/gccbug|gcc
usr/bin/gccmakedep|imake
usr/bin/gcc|gcc
zgrep usr/bin/gcc filelist.gz 0.23s user 0.05s system 99% cpu 0.279 total

I think the best way to implement this would be to set a cronjob on the main archlinux server to generate the greppable file. The generation is fast enough to run every day if wanted, but that is a bit overkill (FWIW, debian's apt-file database regenerates weekly I believe). I would not gzip the file since the search is quite a bit slower that way. All searches would go through the website.
Server load shouldn't be a problem since having 50 people hit the page to search a few times each would still be more efficient than having those 50 people all download a 2.7MB file. I doubt there would be that many searches anyway; this type of search is very useful when needed, but it is not needed that often. For the command line usage, it should be easy to create a 10 line python script to call to the server and print the results.
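For reference, the "create a greppable file" step from the transcript above could also be done from a script, which may be convenient for a cronjob. This is only a sketch: the table and column names are inferred from the query quoted in the transcript, the rest of the archweb schema is not known here, and the in-memory database with its rows is invented so the example runs standalone.

```python
import sqlite3

# Same join and filter as the shell transcript: skip directory entries.
QUERY = """
    SELECT pf.path, p.pkgname
    FROM packages p JOIN packages_files pf ON p.id = pf.pkg_id
    WHERE pf.path NOT LIKE '%/'
"""

def export_filelist(conn):
    """Return sorted 'path|pkgname' lines, mirroring sqlite3's '|' output."""
    rows = conn.execute(QUERY).fetchall()
    return sorted(f"{path}|{pkgname}" for path, pkgname in rows)

# Invented stand-in for archweb.db with only the columns the query needs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE packages (id INTEGER PRIMARY KEY, pkgname TEXT);
    CREATE TABLE packages_files (pkg_id INTEGER, path TEXT);
    INSERT INTO packages VALUES (1, 'gcc'), (2, 'imake');
    INSERT INTO packages_files VALUES
        (1, 'usr/bin/'), (1, 'usr/bin/gcc'), (2, 'usr/bin/gccmakedep');
""")

lines = export_filelist(conn)
print("\n".join(lines))
```

Writing `lines` to a file, one per line, would give the same kind of greppable list as the `sort > filelist` pipeline; Python's byte-wise sort order may differ from a locale-aware `sort`.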
On Jan 25, 2008 6:40 PM, Dan McGee <dpmcgee@gmail.com> wrote:
On Jan 25, 2008 5:31 PM, eliott <eliott@cactuswax.net> wrote:
On 1/25/08, Aaron Griffin <aaronmgriffin@gmail.com> wrote:
On Jan 25, 2008 5:02 PM, Thomas Bächler <thomas@archlinux.org> wrote:
eliott wrote:
I guess I don't see where this script fits in, and how it is supposed to be used. thomas made mention of using zgrep for advanced users, but that seems just as difficult as opening a web browser and typing into a search box.
The purpose is to provide filelists for download, so they can be searched offline by pacman. My first idea (implementing an online search in pacman) was rejected, thus I thought about a "download the filelist and search it" offline solution.
Oh, I must have misunderstood too. If you're going to implement filelist search and all that stuff, we should: a) Move this to the pacman-dev mailing list b) Add external tools to do this as part of the "pacman source", i.e. as a patch to repo-add c) Not use this script until pacman actually has this feature.
If the intent is to let users zgrep it, then I agree with cactus that that is significantly more complex than actually using the website to provide a search interface.
Yeah. I wasn't opposed to having a file search mechanism on the site. I was opposed to having pacman query the website. If a user opens up a browser and searches, no problem. It was tying this to pacman that I felt was a *really bad idea*.
Alternatively, if there is a pacman only solution, that involves some mirrored meta in the repository, that is something else entirely, and should probably be talked about on the pacman dev list, so as to make it as distribution neutral as possible.
As I've said already, I really don't think this feature belongs in pacman. Obviously you can draw the connection with the -Ql operation, and the fact that we have -Ss, but this is something a bit different than that and I see it as feature creep.
I disagree - pacman is the tool for managing local and remote collections of packages, and knowing what files are inside what packages certainly falls in that realm. I don't see how this feature is any more feature-creepish than pacman -Ql or pacman -Qo. There've been many valid use-cases suggested already, so it's not a fluff request. Maybe I'm missing something here, but I don't see what's so horrible about including it, aside from the fact it means we need to download more meta-info from the repos. I've skimmed through the thread, and haven't seen this yet, so I'll ask - can those who are opposed (Dan, and Jeff for instance) give reasons why you think it's improper to place this functionality inside pacman itself?
Maybe I'm missing something here, but I don't see what's so horrible about including it, aside from the fact it means we need to download more meta-info from the repos.
... much more meta-info: I guess at least a 5-15x size increase in db files [and a memory usage boost with the current pkgcache implementation in case of "-So"]. However, the implementation is quite trivial, indeed.
I may have missed something, but you mention a webpage search. Is it implemented already? (Where?) Because I can agree that query_owner is needed somewhere for sync repos too (but imho not in pacman): http://aur.archlinux.org/packages.php?do_Details=1&ID=2952 Bye
participants (10)
- Aaron Griffin
- Alessio Bolognino
- Dan McGee
- eliott
- Jeff Mickey
- Miklos Vajna
- Nagy Gabor
- Nathan Jones
- Travis Willard
- Xavier