On 10/26/06, Roman Kyrylych <roman.kyrylych@gmail.com> wrote:
2006/10/26, Aaron Griffin <aaronmgriffin@gmail.com>:
On 10/26/06, Roman Kyrylych <roman.kyrylych@gmail.com> wrote:
[offtopic on] BTW, about complex patches - I have a patch from Martin Devera that adds support for a gdbm database backend to pacman 2.9.8. I suggested he contact VMiklos for help getting it included in pacman, but I don't know if he did. If I adapt it for the pacman 3 API, is there a chance it will be included at least in CVS/DARCS? [offtopic off]
There's no chance that would get included in pacman 2.9.8 - it will die soon.
Converting it to pacman3 is entirely doable, and it's already set up to do so. You should be able to copy be_files.c to be_gdbm.c and consume the interface provided there.
FTR 'be_' means 'backend'. I still need to add compile-time flags for selecting a backend, but as we only provide _files at the moment, there's no need.
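To sketch the shape of it (the db_read entry point, the pmpkg_t struct, and the record format below are placeholders made up for illustration, not the actual interface - be_files.c is the authoritative reference):

    /* be_gdbm.c sketch -- illustrative only; the entry point and
     * the struct below are placeholders, not pacman's real
     * backend interface */
    #include <gdbm.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        char name[256];
        char version[64];
    } pmpkg_t;   /* placeholder */

    static int db_read(const char *dbpath, const char *pkgname, pmpkg_t *pkg)
    {
        GDBM_FILE dbf = gdbm_open((char *)dbpath, 0, GDBM_READER, 0644, NULL);
        datum key, val;

        if (dbf == NULL)
            return -1;

        key.dptr = (char *)pkgname;
        key.dsize = (int)strlen(pkgname);
        val = gdbm_fetch(dbf, key);   /* val.dptr is malloc'd, or NULL */
        gdbm_close(dbf);

        if (val.dptr == NULL)
            return -1;                /* no such package */

        /* parse the serialized record into pkg here -- the record
         * format is whatever the backend chooses to store */
        memset(pkg, 0, sizeof(*pkg));
        strncpy(pkg->name, pkgname, sizeof(pkg->name) - 1);
        free(val.dptr);
        return 0;
    }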
I worked on a 'compressed' backend that required no changes, but read directly from the downloaded db files... it caused problems on the local db, however, so I haven't completed it (yet).
As for backend schemes, I think we need to discuss avenues of attack here. The files backend is painfully slow on ext filesystems (11 seconds to parse [community]). There are many users touting a database backend as a good idea (I strongly disagree), and if we do this, we NEED a generic layer that can support any sane database format, not just one specific thing. Using a flat-file database is a better option, but even here people seem to think sqlite is a good idea (it's not) - there are many flat-file schemes that can drastically outperform sqlite.

There are many, many other options, and I think it's worth at least _supplying_ multiple backends, even if they're not used. So gdbm is cool.
Maybe we should start a separate thread for this? Personally, I think sqlite will be the best choice because:
1) it's small
2) it's fast
3) it's embeddable
Ok, sqlite is "fast and etc etc", right? So if I do [pacman -Ss foo], what does sqlite do? It opens the file (exactly as plaintext searching would), initializes some structures and things like that describing table data, then proceeds to sequentially search (FULL TABLE SCAN) through all rows. It is impossible, even in a good database, to index substrings. Yes, you can index entire strings, that's easy. The search operations, however, check internal substrings, not whole strings. You could make this faster by converting each entry into a suffix tree and storing that somewhere, but that gets irritating and gets into that crappy Computer Science stuff everyone hates.

Fact of the matter is, there are two main "slow" operations - searching, and loading the whole DB (in the case of an -Su). Loading the whole database can easily be sped up by using lazy package evaluation - that is, a simple readdir() of the db directory gives us package name, version, and release - all we need for an upgrade check. The additional data can be read when/if required. Substring searching will never be uber fast, but we can maintain a minor indexed file with all text that's searched... something like:

extra/package-name-1.0-1 : This is the description for the package
community/package-two-1.1.1-1 : This is the description for this thing
extra/another-99-1 : This is the description for some stuff

-Ss can check this file, and use the first entry before the colon to construct the path to open (though that may not even be needed, as all the information is right there).

Ok, I'm gonna stop myself before I rant too much. There are cries of "use a database" from many people. Frankly, I don't think these people have evaluated all the options; they're stuck in the "everyone uses databases for their websites" mentality. Take this for instance: the google engineers here in Chicago explained to me what they use to store whatever the hell google stores: a custom filesystem tooled to their spec. They don't use mysql or oracle or sqlite or any of that - the king of searching uses the simplest of all options.
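To make that indexed-file idea concrete, a rough sketch of the -Ss side (the index path here is made up, and -Ss patterns are POSIX regexes, so regexec does the matching):

    /* sketch: scan a one-line-per-package index with a POSIX regex.
     * The index path and line format follow the example above and
     * are purely illustrative. */
    #include <stdio.h>
    #include <regex.h>

    int search_index(const char *pattern)
    {
        FILE *fp;
        char line[1024];
        regex_t re;

        if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE | REG_NOSUB) != 0)
            return -1;

        fp = fopen("/var/lib/pacman/search.idx", "r");  /* hypothetical path */
        if (fp == NULL) {
            regfree(&re);
            return -1;
        }

        while (fgets(line, sizeof(line), fp) != NULL) {
            /* one regexec per line covers name and description at once */
            if (regexec(&re, line, 0, NULL, 0) == 0)
                fputs(line, stdout);
        }

        fclose(fp);
        regfree(&re);
        return 0;
    }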
4) it offers full power of SQL99 (with ACID transactions!!!)
Nope. Not at all. It doesn't even support SQL92 - a 14-year-old standard.
http://www.sqlite.org/omitted.html
http://www.sqlite.org/cvstrac/wiki?p=UnsupportedSql
They have said publicly that they do not plan on adding support. Standards compliance is HUGE in my book.
2006/10/26, Aaron Griffin <aaronmgriffin@gmail.com>:
Personally, I think sqlite will be the best choice because:
1) it's small
2) it's fast
3) it's embeddable
Ok, sqlite is "fast and etc etc", right? So if I do [pacman -Ss foo], what does sqlite do? It opens the file (exactly as plaintext searching would), initializes some structures and things like that describing table data, then proceeds to sequentially search (FULL TABLE SCAN) through all rows. It is impossible, even in a good database, to index substrings. Yes, you can index entire strings, that's easy. The search operations, however, check internal substrings, not whole strings. [...]

I don't get what you want to say here. Using SQLite will not make full-text search faster or slower, but easier, because all the needed functions are already there. You don't need to do complex indexing. Indexing by package name is enough. By "small, fast, embeddable" I mean that it's better than MySQL or something else. And it's better than dbm-style databases when it comes to random writing and transactions. And yes, it is faster than the filesystem backend, because there is a _simple_ way to use _in-memory_ databases! ;-)
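For example, a minimal sketch - the ":memory:" filename and the transaction statements are real SQLite API, but the pkg table is just made up for illustration:

    #include <sqlite3.h>
    #include <stdio.h>

    int main(void)
    {
        sqlite3 *db;

        /* ":memory:" gives a private, purely in-memory database */
        if (sqlite3_open(":memory:", &db) != SQLITE_OK) {
            fprintf(stderr, "open: %s\n", sqlite3_errmsg(db));
            return 1;
        }

        /* hypothetical schema - just to show the idea */
        sqlite3_exec(db, "CREATE TABLE pkg (name TEXT PRIMARY KEY,"
                         " ver TEXT, descr TEXT);", NULL, NULL, NULL);

        /* and ACID transactions come for free */
        sqlite3_exec(db, "BEGIN;", NULL, NULL, NULL);
        sqlite3_exec(db, "INSERT INTO pkg VALUES ('foo', '1.0-1',"
                         " 'an example package');", NULL, NULL, NULL);
        sqlite3_exec(db, "COMMIT;", NULL, NULL, NULL);

        sqlite3_close(db);
        return 0;
    }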
4) it offers full power of SQL99 (with ACID transactions!!!)
Nope. Not at all. It doesn't even support SQL92 - a 14-year-old standard.

Oops, my bad, I confused SQL99 with SQL92. :-p Anyway, SQLite offers many functions almost for free. And anyway, there should be a choice of files/gdbm/sqlite/whatever.
-- Roman Kyrylych (Роман Кирилич)
On 10/26/06, Roman Kyrylych <roman.kyrylych@gmail.com> wrote:
You don't need to do complex indexing. Indexing by package name is enough. By "small, fast, embeddable" I mean that it's better than MySQL or something else. And it's better than dbm-style databases when it comes to random writing and transactions.
What I'm saying is that you have the overhead of a full-scale database with little gain. Indexing by package names, yes, that's great and all, but it doesn't help with the slowest-of-the-slow -Ss operation. -Ss searches package names AND descriptions, allowing for regex matching. Sqlite (and most DBs) do not support regex matching. Indexing the search by package name means little, because a -Ss "foo" still SHOULD match a package named "barfoo" and a package with "foo" in the description. This means a sequential search.

Not only that, but because the DB will not support regex, one must iterate over each and every entry, get the values at the C level, apply a regex pattern, and note if it is a match or not. The only speed you gain would be in the initial opening of the files.

To me, this does not mean a DB backend is the solution. It may be better than the files backend, yes, but not the best, and outperforming the files backend is not hard... I was able to improve performance approximately 6 times by simply using the db.tar.gz files in place of disparate text files.
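(For completeness: you can register a regexp() user function with sqlite3_create_function so that REGEXP works in queries, but that changes nothing about the scan - sqlite still walks every row and calls back into C, exactly as described above. A rough sketch, with a hypothetical table:)

    #include <sqlite3.h>
    #include <regex.h>

    /* called by sqlite once per row: regexp(pattern, text) */
    static void regexp_func(sqlite3_context *ctx, int argc, sqlite3_value **argv)
    {
        regex_t re;
        const char *pat, *txt;

        if (argc != 2) {
            sqlite3_result_int(ctx, 0);
            return;
        }
        pat = (const char *)sqlite3_value_text(argv[0]);
        txt = (const char *)sqlite3_value_text(argv[1]);
        if (!pat || !txt || regcomp(&re, pat, REG_EXTENDED | REG_NOSUB) != 0) {
            sqlite3_result_int(ctx, 0);
            return;
        }
        sqlite3_result_int(ctx, regexec(&re, txt, 0, NULL, 0) == 0);
        regfree(&re);
    }

    /* after opening the db (table/column names are hypothetical):
     *   sqlite3_create_function(db, "regexp", 2, SQLITE_UTF8, NULL,
     *                           regexp_func, NULL, NULL);
     *   SELECT name FROM pkg WHERE name REGEXP ?1 OR descr REGEXP ?1;
     * -- still a full table scan, one C callback per row. */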
2006/10/26, Aaron Griffin <aaronmgriffin@gmail.com>:
What I'm saying is that you have the overhead of a full-scale database with little gain. Indexing by package names, yes, that's great and all, but it doesn't help with the slowest-of-the-slow -Ss operation. -Ss searches package names AND descriptions, allowing for regex matching. Sqlite (and most DBs) do not support regex matching. Indexing the search by package name means little, because a -Ss "foo" still SHOULD match a package named "barfoo" and a package with "foo" in the description. This means a sequential search.

Not only that, but because the DB will not support regex, one must iterate over each and every entry, get the values at the C level, apply a regex pattern, and note if it is a match or not. The only speed you gain would be in the initial opening of the files.

To me, this does not mean a DB backend is the solution. It may be better than the files backend, yes, but not the best, and outperforming the files backend is not hard... I was able to improve performance approximately 6 times by simply using the db.tar.gz files in place of disparate text files.
Yes, the main problem with the files backend is the huge number of files in /var/lib/pacman, which leads to slow performance on many filesystems. gdbm/sqlite/whatever has a big "+" in that everything can be in one file.

Plus, as I said, with SQLite you have the ability to easily use an in-memory database, plus easy ACID transaction support. This will be faster than seeking through /var/lib/pacman/ anyway. And there will be no need for pacman-optimize-like scripts for database backends. The overhead is not big. (A quote from sqlite.org: "Small code footprint: less than 250KiB fully configured or less than 150KiB with optional features omitted")

Anyway, nobody imposes the use of some database backend as the default and only one. But having an alternative is good. Users can test performance by themselves and choose the solution that fits their needs best. Otherwise, what's the point of ALPM's ability to have different backends? ;-)

About regexps - yes, SQLite doesn't support them, but adding this on top of it will not be slower than in the files backend anyway. I don't see a problem here.

I'm not "advertising" SQLite or other databases because "everybody uses databases for everything". I just think this will offer some new abilities for libalpm and will simplify (and speed up, yes - speed up) some things. That's IMHO, of course, but I think it's worth trying gdbm and sqlite as alternatives to the files backend.

-- Roman Kyrylych (Роман Кирилич)
On Fri, Oct 27, 2006 at 12:03:40AM +0300, Roman Kyrylych wrote:
2006/10/26, Aaron Griffin <aaronmgriffin@gmail.com>:
What I'm saying is that you have the overhead of a full-scale database with little gain. Indexing by package names, yes, that's great and all, but it doesn't help with the slowest-of-the-slow -Ss operation. -Ss searches package names AND descriptions, allowing for regex matching. Sqlite (and most DBs) do not support regex matching. Indexing the search by package name means little, because a -Ss "foo" still SHOULD match a package named "barfoo" and a package with "foo" in the description. This means a sequential search.

Not only that, but because the DB will not support regex, one must iterate over each and every entry, get the values at the C level, apply a regex pattern, and note if it is a match or not. The only speed you gain would be in the initial opening of the files.

To me, this does not mean a DB backend is the solution. It may be better than the files backend, yes, but not the best, and outperforming the files backend is not hard... I was able to improve performance approximately 6 times by simply using the db.tar.gz files in place of disparate text files.
Yes, the main problem with the files backend is the huge number of files in /var/lib/pacman, which leads to slow performance on many filesystems. gdbm/sqlite/whatever has a big "+" in that everything can be in one file.
This is a big "-", because unix lovers prefer multiple text files. This makes it easy to use powerful text/file-processing tools like "sed", "awk", "shell" and "find" to build package-management related scripts.
Plus, as I said, with SQLite you have the ability to easily use an in-memory database, plus easy ACID transaction support. This will be faster than seeking through /var/lib/pacman/ anyway. And there will be no need for pacman-optimize-like scripts for database backends. The overhead is not big. (A quote from sqlite.org: "Small code footprint: less than 250KiB fully configured or less than 150KiB with optional features omitted")
You will gain little performance (some seconds on a slow system?) at the expense of complexity.

Jürgen
2006/10/27, Jürgen Hötzel <juergen@hoetzel.info>:
Yes, the main problem with the files backend is the huge number of files in /var/lib/pacman, which leads to slow performance on many filesystems. gdbm/sqlite/whatever has a big "+" in that everything can be in one file.
This is a big "-", because unix lovers prefer multiple text files. This makes it easy to use powerful text/file-processing tools like "sed", "awk", "shell" and "find" to build package-management related scripts.
There's nothing preventing us from creating an "API" of shell functions for the most common operations, which other shell scripts could then use. And that way would be more backward compatible in case there are changes to the storage format.
Plus, as I said, with SQLite you have the ability to easily use an in-memory database, plus easy ACID transaction support. This will be faster than seeking through /var/lib/pacman/ anyway. And there will be no need for pacman-optimize-like scripts for database backends. The overhead is not big. (A quote from sqlite.org: "Small code footprint: less than 250KiB fully configured or less than 150KiB with optional features omitted")
You will gain little performance (some seconds on a slow system?) at the expense of complexity.
Why not make a benchmark? ;-)

Really, there are some users that complain about pacman's slowness, and it's not about really slow operations like -Ss, but about really trivial operations that shouldn't be so slow. That's because the files backend is highly dependent on the type of the underlying filesystem. A database backend may make these users happy, because it's not dependent on the filesystem type.

Personally, I don't think there's a problem _at_all_! The files backend will be the default in pacman3 anyway, and I doubt it will be replaced soon (if ever). Why not just have an _alternative_?

Keep in mind that ALPM can and will be used by other distros. Besides Arch Linux and Frugalware, there are Underground Linux (mostly the Arch Linux Install CD + KDE and other desktop software) and Rubix Linux (formerly simply Rubix; it's dead now anyway), and I'm sure there will be some more in the future (though I don't think more than 300 distros is a good thing - even the DistroWatch author thinks the same). Other distros may choose another backend. Of course, they can implement backends themselves, but there was a reason why the multiple-backend feature was included in ALPM. So why not use it? Especially when it does not impose anything. Having alternatives is good, IMHO.

-- Roman Kyrylych (Роман Кирилич)
On 10/26/06, Roman Kyrylych <roman.kyrylych@gmail.com> wrote:
Why not make a benchmark? ;-) Really, there are some users that complain about pacman's slowness, and it's not about really slow operations like -Ss, but about really trivial operations that shouldn't be so slow. That's because the files backend is highly dependent on the type of the underlying filesystem. A database backend may make these users happy, because it's not dependent on the filesystem type.
No no no.... you're missing the point. Because reading individual files is slow, we need a new backend. This does not imply a database backend is a good idea. It implies a single-file solution is ideal, yes. What I've been trying to show is that, considering the slow-down is from reading hundreds of files, that does not imply a database. ANY single-file solution would work.

That is why I was going on about string matching and things of that nature - because if we take "single file solution A" and compare it to "DB solution A", there are negligible differences, and the DB one adds much more complexity and security flaws (e.g. embedded SQL statement strings).
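To make "single file solution A" concrete, here's roughly what walking a db.tar.gz looks like with libarchive (illustrative only - this is not the code from my experiment):

    /* sketch: walk the entries of a repo db.tar.gz with libarchive --
     * illustrative only */
    #include <archive.h>
    #include <archive_entry.h>
    #include <stdio.h>

    int list_db(const char *path)   /* e.g. a sync db tarball */
    {
        struct archive *a = archive_read_new();
        struct archive_entry *entry;

        archive_read_support_compression_gzip(a);
        archive_read_support_format_tar(a);
        if (archive_read_open_filename(a, path, 10240) != ARCHIVE_OK) {
            archive_read_finish(a);
            return -1;
        }

        /* entry names look like "pkgname-ver-rel/desc", "/depends",
         * etc., so the names alone give a lazy name/version listing */
        while (archive_read_next_header(a, &entry) == ARCHIVE_OK) {
            printf("%s\n", archive_entry_pathname(entry));
            archive_read_data_skip(a);  /* only read file data when needed */
        }

        archive_read_finish(a);
        return 0;
    }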
2006/10/27, Aaron Griffin <aaronmgriffin@gmail.com>:
No no no.... you're missing the point. Because reading individual files is slow, we need a new backend. This does not imply a database backend is a good idea. It implies a single-file solution is ideal, yes. What I've been trying to show is that, considering the slow-down is from reading hundreds of files, that does not imply a database. ANY single-file solution would work.
Yes, I know this. You just took the words out of my mouth! :) But a single file will need some structure anyway. So either we'll have our own format, or use something like gdbm (or even simpler), or something more complex like sqlite.
That is why I was going on about string matching and things of that nature - because if we take "single file solution A" and compare it to "DB solution A", there are negligible differences, and the DB one adds much more complexity and security flaws (e.g. embedded SQL statement strings).
OK, so, what do you propose? Use a simple tar.gz and process it in memory, or what? Still, I think sqlite or gdbm can be tried, but that won't be in the near future anyway. There are more important things now anyway. :-)

-- Roman Kyrylych (Роман Кирилич)
On Fri, 27 Oct 2006 01:22:03 +0300 "Roman Kyrylych" <roman.kyrylych@gmail.com> wrote:
Still, I think sqlite or gdbm can be tried, but that won't be in the near future anyway. There are more important things now anyway. :-)
As for the gdbm backend, I have been working on adding it to pacman3. However, I have been working from some quite old pacman3 CVS sources (it seems updating will be quite painful, given the recent discussions on the list). Depending on the amount of free time I have, I may provide the sources in 2 weeks or so, but I would consider this a proof of concept rather than a release (or anything close to it).

Maciek
On Thu, 26 Oct 2006 17:11:05 -0500 "Aaron Griffin" <aaronmgriffin@gmail.com> wrote:
No no no.... you're missing the point. Because reading individual files is slow, we need a new backend. This does not imply a database backend is a good idea. It implies a single-file solution is ideal, yes. What I've been trying to show is that, considering the slow-down is from reading hundreds of files, that does not imply a database. ANY single-file solution would work.
Heh, well almost any. See pacman1 for an implementation that did not work. Well, it worked, but it was painfully slow -- sequential searching through a massive text file to find the package in question.

- J
On 10/27/06, Judd Vinet <jvinet@zeroflux.org> wrote:
On Thu, 26 Oct 2006 17:11:05 -0500 "Aaron Griffin" <aaronmgriffin@gmail.com> wrote:
No no no.... you're missing the point. Because reading individual files is slow, we need a new backend. This does not imply a database backend is a good idea. It implies a single-file solution is ideal, yes. What I've been trying to show is that, considering the slow-down is from reading hundreds of files, that does not imply a database. ANY single-file solution would work.
Heh, well almost any. See pacman1 for an implementation that did not work. Well, it worked, but it was painfully slow -- sequential searching through a massive text file to find the package in question.
Ok, "any SANE single-file solution" 8) Though I have seen some decent flat-file systems in the past (VMS's RMS files were fairly fast if you only used one file, though "joining" was terriblly slow)
participants (5)
- Aaron Griffin
- Judd Vinet
- Jürgen Hötzel
- Maciek Borzecki
- Roman Kyrylych