[aur-general] pkgconflict: a tool to find file conflicts when building packages
I wrote this code in anger. Hopefully, others might find it useful. There is a little story behind it.

I maintain the nmh package in unsupported. I just discovered that a package from extra provides /usr/bin/dp, which is also provided by nmh. We have this nice collection of files under /var/cache/pkgtools/lists, thanks to Daenyth's very useful pkgfile tool. We ought to be able to use that collection to detect conflicts like the one I just described. Hence the birth of pkgconflict.

Usage: pkgconflict <PACKAGEFILE>

Source is here: http://members.cox.net/cmbrannon/pkgconflict

If someone else has a better tool for this purpose, I'd love to hear about it!

Enjoy,
-- Chris

PS. pkgconflict already proved useful. It showed me that /usr/bin/scan, also from nmh, is provided by linuxtv-dvb-apps from community.
2009/4/5 Chris Brannon <cmbrannon@cox.net>:
I wrote this code in anger. Hopefully, others might find it useful. There is a little story behind it.
I maintain the nmh package in unsupported. I just discovered that a package from extra provides /usr/bin/dp, which is also provided by nmh. We have this nice collection of files under /var/cache/pkgtools/lists, thanks to Daenyth's very useful pkgfile tool. We ought to be able to use that collection to detect conflicts like the one I just described. Hence the birth of pkgconflict.
Usage: pkgconflict <PACKAGEFILE>
Source is here: http://members.cox.net/cmbrannon/pkgconflict
If someone else has a better tool for this purpose, I'd love to hear about it!
Enjoy, -- Chris
Thanks for the useful script! But shouldn't this line:

    known_files[entry] = (repo, package)

be

    known_files[entry].append((repo, package))

and then this check:

    (repo, package) = known_files[file]

could be modified to

    if len(known_files[file]) > 0:
        for repo, pkg in known_files[file]:
            if pkg != pkg_given:
                print "%s: conflicts %s/%s (%s)" % (pkg_given, repo, pkg, file)

Otherwise the previous entry for a file might be overwritten if there are conflicts and the package name specified on the command line occurs *after* the conflicting package name. The case you've described will always work since you're testing against a package from unsupported.
-- Abhishek
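The suggestion above can be sketched as a small standalone snippet. The variable names follow the snippets quoted in this thread, but the helper functions and the "somepkg" owner of /usr/bin/dp are hypothetical (the mail never names the extra package):

```python
from collections import defaultdict

# Keep a *list* of (repo, package) owners per file, so nothing is
# overwritten when several packages provide the same path.
known_files = defaultdict(list)

def record(repo, package, entry):
    """Register one (repo, package) pair as an owner of a file path."""
    known_files[entry].append((repo, package))

def find_conflicts(pkg_given, pkg_files):
    """Return a conflict message for every file of pkg_given owned elsewhere."""
    conflicts = []
    for f in pkg_files:
        for repo, pkg in known_files[f]:
            if pkg != pkg_given:
                conflicts.append("%s: conflicts %s/%s (%s)"
                                 % (pkg_given, repo, pkg, f))
    return conflicts

record("extra", "somepkg", "usr/bin/dp")  # hypothetical owner from the lists
record("community", "linuxtv-dvb-apps", "usr/bin/scan")
print(find_conflicts("nmh", ["usr/bin/dp", "usr/bin/scan"]))
```

With lists as values, the order in which packages are scanned no longer matters.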
Abhishek Dasgupta wrote:
Thanks for the useful script! But shouldn't this line: known_files[entry] = (repo, package) be known_files[entry].append((repo, package))
That is a nice suggestion! Using lists as hash values could make the tool useful for purposes other than checking packages built from unsupported. The program does quite a bit of I/O. I'm tempted to convert those file lists into a sqlite database, since it would really improve efficiency. -- Chris
On Sun, Apr 5, 2009 at 09:29, Chris Brannon <cmbrannon@cox.net> wrote:
Abhishek Dasgupta wrote:
Thanks for the useful script! But shouldn't this line: known_files[entry] = (repo, package) be known_files[entry].append((repo, package))
That is a nice suggestion! Using lists as hash values could make the tool useful for purposes other than checking packages built from unsupported.
The program does quite a bit of I/O. I'm tempted to convert those file lists into a sqlite database, since it would really improve efficiency.
-- Chris
That's not a bad idea, and it's one that's been recommended to me for pkgfile as well. Is there a simple (read: bash-usable) interface I could work with to speed up pkgfile? I think it would be smart for us to coordinate on this, and in fact I'd like to add pkgconflict into my pkgtools package. Are you currently using git for it? pkgtools is hosted on github[1] right now, though that might be a little ajax-y for you to navigate easily. If so, try using the "github" command line utility[2]. I'll be on IRC if you want to discuss the options with this, or jabber would be fine as well. [1] http://github.com/Daenyth/pkgtools [2] http://aur.archlinux.org/packages.php?ID=24484
2009/4/5 Daenyth Blank <daenyth+arch@gmail.com>:
That's not a bad idea, and it's one that's been recommended to me for pkgfile as well. Is there a simple (read: bash-usable) interface I could work with to speed up pkgfile?
I was trying out the command line sqlite3 tool to manage a database comprised of the following fields:

    key
    pkgname
    file

The problem is that insertion into the database is *slow*. It takes over an hour to generate the databases for the official repositories. Thus, if we're to use sqlite then only the differences should be updated. Otherwise these filelist databases could be generated on some other server and then downloaded by pkgfile. The advantage of sqlite is of course that searching is nearly instantaneous.
-- Abhishek
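For what it's worth, a common cause of slow bulk loads with sqlite is that each INSERT runs in its own implicit transaction, with a sync to disk per statement. Batching the whole load into one transaction is usually the first thing to try. A minimal Python sketch, with the table layout assumed from the fields above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a real database
conn.execute("CREATE TABLE files (key TEXT, pkgname TEXT, file TEXT)")

rows = [
    ("unsupported/nmh", "nmh", "usr/bin/dp"),
    ("community/linuxtv-dvb-apps", "linuxtv-dvb-apps", "usr/bin/scan"),
]

# One transaction for the whole batch: "with conn" commits on success.
# Without it, every INSERT is committed individually, which is what
# makes loading tens of thousands of rows take so long.
with conn:
    conn.executemany("INSERT INTO files VALUES (?, ?, ?)", rows)
```

The same applies to the sqlite3 CLI: wrapping the generated INSERT statements in BEGIN/COMMIT should have a similar effect.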
On Wed, Apr 8, 2009 at 14:04, Abhishek Dasgupta <abhidg@gmail.com> wrote:
2009/4/5 Daenyth Blank <daenyth+arch@gmail.com>:
That's not a bad idea, and it's one that's been recommended to me for pkgfile as well. Is there a simple (read: bash-usable) interface I could work with to speed up pkgfile?
I was trying out the command line sqlite3 tool to manage a database comprised of the following fields: key pkgname file The problem is that insertion into the database is *slow*. It takes over an hour to generate the databases for the official repositories. Thus, if we're to use sqlite then only the differences should be updated. Otherwise these filelist databases could be generated on some other server and then downloaded by pkgfile.
The advantage of sqlite is of course that searching is nearly instantaneous.
-- Abhishek
That sounds promising... If we can find some method to construct only the differences, then it would be doable. Have a method for the user to do initial generation, update an existing db, and in addition, to download a database from a known source (I could generate a db in a cronjob and then host it on twilightlair I think). I have no experience with this at all, what would you recommend for doing an incremental update?
2009/4/8 Daenyth Blank <daenyth+arch@gmail.com>:
That sounds promising... If we can find some method to construct only the differences, then it would be doable. Have a method for the user to do initial generation, update an existing db, and in addition, to download a database from a known source (I could generate a db in a cronjob and then host it on twilightlair I think). I have no experience with this at all, what would you recommend for doing an incremental update?
The initial generation is taking really long! It's over two hours and still generating...

As for differences, we could generate a text file containing all files with the corresponding pkgnames after each update. Then by doing a diff -u with the previous text we can delete and add accordingly. This step won't take too much time, hopefully.

Also, there has to be some way of escaping the filename so that sqlite does not give errors like this:

    SQL error: near "N": syntax error
    SQL error: near "n": syntax error
    SQL error: near "s_Eye": syntax error

Of course if the databases could be stored on some server, then the user is spared the problem of generating the database and updating it.
-- Abhishek
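Those "SQL error: near ..." messages look like what happens when filenames containing quotes are spliced directly into SQL text. If the inserts go through Python's sqlite3 module rather than the sqlite3 CLI, placeholders sidestep the escaping problem entirely. A sketch with an assumed two-column table and a made-up package name:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (pkgname TEXT, filename TEXT)")

# A path like this breaks naive string-built SQL (cf. the
# 'near "s_Eye": syntax error' above); the ? placeholder passes
# the value through verbatim, with no escaping needed.
tricky = "usr/share/games/Snake's_Eye"
conn.execute("INSERT INTO files VALUES (?, ?)", ("somegame", tricky))

row = conn.execute("SELECT filename FROM files WHERE pkgname = ?",
                   ("somegame",)).fetchone()
```

With the sqlite3 CLI the only option is to escape embedded single quotes by doubling them before building the statement.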
Also, I forgot to mention that the size of the dbs is quite large compared to the simple *files.tar.gz on ftp:

    core.files.db.tar.gz  220K
    core.db               2.1M

core.db has 35796 files. extra.db (still generating) is already at 13M!
-- Abhishek
Abhishek Dasgupta wrote:
The initial generation is taking really long! It's over two hours and still generating... *SNIP* Also, there has to be some way of escaping the filename so that sqlite does not give errors like this: SQL error: near "N": syntax error
I made a fork of pkgtools on github. Have a look at the sqlite branch of http://github.com/CMB/pkgtools I wrote a script which incrementally updates an SQLite database from a REPO.files.tar.gz tarball. It is in pkgtools/other/repofile2db.py The code is terribly raw! I'm afraid to try it, until I've done some more desk-checking. -- Chris
2009/4/9 Chris Brannon <cmbrannon@cox.net>:
I made a fork of pkgtools on github. Have a look at the sqlite branch of http://github.com/CMB/pkgtools I wrote a script which incrementally updates an SQLite database from a REPO.files.tar.gz tarball. It is in pkgtools/other/repofile2db.py The code is terribly raw! I'm afraid to try it, until I've done some more desk-checking.
I looked at the code and I couldn't find any obvious problems. Did you try creating a database? I made a sqlite-bash branch on github [1] with the create_db function which I'm using right now. It's not incremental yet but adding a separate table with package versions like you've done would solve the problem. [1]: http://github.com/abhidg/pkgtools/tree/sqlite-bash -- Abhishek
2009/4/9 Abhishek Dasgupta <abhidg@gmail.com>:
I made a sqlite-bash branch on github [1] with the create_db function which I'm using right now. It's not incremental yet but adding a separate table with package versions like you've done would solve the problem.
Just added incremental update support to sqlite-bash branch. I tested it out with the databases I already had [1] and it worked fine. Updating the filelists was far quicker and finished within a few minutes.

For anyone wishing to try out, download the databases and drop them into /var/cache/pkgtools/lists after gunzipping. Use the pkgfile from the git branch and run `sudo pkgfile -uv`. The new database versions.db contains the package names, versions and repository names and has the following format:

    pkgname TEXT, pkgver TEXT, repo TEXT

At the moment, /var/cache/pkgtools/lists/{reponame} is not deleted, though it's not really used (except for the listfiles() function which I haven't modified to use the sqlite db yet). Also the --binaries switch which allows one to search for files in bin/ or sbin/ might not work properly.

[1]: http://abhidg.mine.nu/arch/package-databases/
-- Abhishek
On Thu, Apr 9, 2009 at 12:31, Abhishek Dasgupta <abhidg@gmail.com> wrote:
2009/4/9 Abhishek Dasgupta <abhidg@gmail.com>:
I made a sqlite-bash branch on github [1] with the create_db function which I'm using right now. It's not incremental yet but adding a separate table with package versions like you've done would solve the problem.
Just added incremental update support to sqlite-bash branch. I tested it out with the databases I already had [1] and it worked fine. Updating the filelists was far quicker and finished within a few minutes.
For anyone wishing to try out, download the databases and drop them into /var/cache/pkgtools/lists after gunzipping. Use the pkgfile from the git branch and run `sudo pkgfile -uv`. The new database versions.db contains the package names, versions and repository names and has the following format: pkgname TEXT, pkgver TEXT, repo TEXT
At the moment, /var/cache/pkgtools/lists/{reponame} is not deleted, though it's not really used (except for the listfiles() function which I haven't modified to use the sqlite db yet). Also the --binaries switch which allows one to search for files in bin/ or sbin/ might not work properly.
[1]: http://abhidg.mine.nu/arch/package-databases/
-- Abhishek
How is the filesize of the REPO.db? CMB was reporting that community was taking up more than 75mb with his version. Regardless of how the sqlite features would be implemented, I will also keep the flat file version available, and configurable. How are the searching and updating speeds compared to before?
Daenyth Blank wrote:
How is the filesize of the REPO.db? CMB was reporting that community was taking up more than 75mb with his version. Regardless of how the sqlite features would be implemented, I will also keep the flat file version available, and configurable. How are the searching and updating speeds compared to before?
Here are the rest of my results from yesterday. Inserting all of core, extra, and community took less than 6 minutes, on a machine with a single 2.5 GHz Pentium 4 CPU and 512 MB of RAM. The database weighed in at a whopping 240 megabytes. I haven't tried an incremental update yet.

Here is what I know about search speeds. A query like "select * from files where filename = 'usr/bin/getmail'" is practically instantaneous, as one would expect. "select * from files where filename like '%/getmail'" takes 5 seconds or so, because it has to iterate through all the filenames in the table.

If sqlite were used for pkgfile, that second query would need to be quite a bit faster than it is now.
-- Chris
On Thu, Apr 9, 2009 at 13:40, Chris Brannon <cmbrannon@cox.net> wrote:
Daenyth Blank wrote:
How is the filesize of the REPO.db? CMB was reporting that community was taking up more than 75mb with his version. Regardless of how the sqlite features would be implemented, I will also keep the flat file version available, and configurable. How are the searching and updating speeds compared to before?
Here are the rest of my results from yesterday. Inserting all of core, extra, and community took less than 6 minutes, on a machine with a single 2.5 GHz Pentium 4 CPU and 512 MB of RAM. The database weighed in at a whopping 240 megabytes. I haven't tried an incremental update yet.
Here is what I know about search speeds. A query like "select * from files where filename = 'usr/bin/getmail'" is practically instantaneous, as one would expect. "select * from files where filename like '%/getmail'" takes 5 seconds or so, because it has to iterate through all the filenames in the table.
If sqlite were used for pkgfile, that second query would need to be quite a bit faster than it is now.
-- Chris
Can you elaborate on the structure of your database? Perhaps that is influencing it. Have a look at Abhidg's sqlite-bash branch to compare; his .db files seem MUCH smaller. I haven't tested any speeds yet, but I am poking at the code a bit.
Chris Brannon wrote:
"select * from files where filename like '%/getmail'" takes 5 seconds or so, because it has to iterate through all the filenames in the table.
If sqlite were used for pkgfile, that second query would need to be quite a bit faster than it is now.
-- Chris
Couldn't we simply introduce a new column, store the filename there and create an index on it?
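That idea can be sketched quickly (table layout and index name assumed, not taken from either branch): store the basename in its own indexed column, so the common "file ends with /getmail" lookup becomes an equality match on the index instead of a full-table LIKE scan:

```python
import posixpath
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (pkgname TEXT, filename TEXT, basename TEXT)")
# The index makes equality lookups on basename a B-tree search
# rather than a scan over every row.
conn.execute("CREATE INDEX files_basename_idx ON files (basename)")

for pkg, path in [("getmail", "usr/bin/getmail"),
                  ("nmh", "usr/bin/scan")]:
    conn.execute("INSERT INTO files VALUES (?, ?, ?)",
                 (pkg, path, posixpath.basename(path)))

# Equality on the indexed column replaces:
#   ... WHERE filename LIKE '%/getmail'
rows = conn.execute("SELECT pkgname FROM files WHERE basename = ?",
                    ("getmail",)).fetchall()
```

The extra column does cost space, though storing each basename once more is modest next to the full paths already in the table.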
2009/4/9 Evangelos Foutras <foutrelis@gmail.com>:
Couldn't we simply introduce a new column, store the filename there and create an index on it?
That's a nice idea. I'll try out and see how much the db size increases. -- Abhishek
On Thu, Apr 9, 2009 at 13:50, Abhishek Dasgupta <abhidg@gmail.com> wrote:
2009/4/9 Evangelos Foutras <foutrelis@gmail.com>:
Couldn't we simply introduce a new column, store the filename there and create an index on it?
That's a nice idea. I'll try out and see how much the db size increases. -- Abhishek
Can you pull from my sqlite-bash branch? I have some cleanup fixes that will be easier to apply before proceeding, though the function is still pretty much the same.
2009/4/9 Daenyth Blank <daenyth+arch@gmail.com>:
Can you pull from my sqlite-bash branch? I have some cleanup fixes that will be easier to apply before proceeding, though the function is still pretty much the same,
Done.
2009/4/9 Daenyth Blank <daenyth+arch@gmail.com>:
How is the filesize of the REPO.db? CMB was reporting that community was taking up more than 75mb with his version. Regardless of how the sqlite features would be implemented, I will also keep the flat file version available, and configurable. How are the searching and updating speeds compared to before?
Filesizes:

    core.db       2.1M  (440K)
    extra.db      42M   (7.2M)
    community.db  39M   (5.7M)

Filesizes in brackets are after gzip -9.

Searching (after):

    $ time pkgfile search getmail
    getmail usr/bin/getmail

    real    0m4.458s
    user    0m2.816s
    sys     0m0.297s

Searching (before):

    extra/getmail

    real    0m12.403s
    user    0m0.450s
    sys     0m1.380s

Updating speed is slower than previous since the previous method just extracted the files. However incremental update finishes within 5 mins.
-- Abhishek
participants (4): Abhishek Dasgupta, Chris Brannon, Daenyth Blank, Evangelos Foutras