[pacman-dev] proof of concept code with bsd db4
hi.

I wanted to make a new BSD DB4 back-end for alpm, but I never reached my goal and will not. All I have is proof-of-concept code that uses the BSD DB4 API to store pmpkg_t, and I wanted to share it with anyone interested.

I have coded 3 utilities:
- one that converts pacman's db into a BSD DB4 file for each repo
- one that reads that new db format to perform queries as pacman does
- one that directly converts a tarball db (taken from a sync mirror) into a BSD DB4 file

If this proves useful for someone, great. More info at http://pagesperso-orange.fr/solstice.dhiver/alpmdb4.html and in the README of http://pagesperso-orange.fr/solstice.dhiver/data/readdb.tar.gz
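For readers who have not looked at the tarball, the conversion idea can be sketched roughly as follows. This is not the code from readdb.tar.gz: it is a Python stand-in that assumes the `%FIELD%`-header layout of pacman's `desc` files and uses the standard-library `dbm.dumb` module in place of BSD DB4. The names `parse_desc`, `convert`, and `SAMPLE_DESC`, and the sample package itself, are made up for illustration.

```python
import dbm.dumb
import json
import os
import tempfile

# Made-up sample entry in the %FIELD% layout assumed for pacman's desc files.
SAMPLE_DESC = """%NAME%
example-pkg

%VERSION%
1.0-1

%DEPENDS%
glibc
bash
"""


def parse_desc(text):
    """Split desc-style text into {field: [values]}."""
    fields = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            current = None  # a blank line ends the current section
        elif line.startswith("%") and line.endswith("%") and len(line) > 2:
            current = line[1:-1].lower()
            fields[current] = []
        elif current is not None:
            fields[current].append(line)
    return fields


def convert(desc_texts, db_path):
    """Store one key/value pair per package: name -> JSON-encoded fields."""
    with dbm.dumb.open(db_path, "c") as db:
        for text in desc_texts:
            fields = parse_desc(text)
            db[fields["name"][0]] = json.dumps(fields)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "repo")
    convert([SAMPLE_DESC], path)
    # Read the record back the way a query tool would.
    with dbm.dumb.open(path, "r") as db:
        record = json.loads(db["example-pkg"])
```

Storing the whole record under a single key is only one possible layout; the proof of concept may well split fields across several keys.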
On Sat, Oct 31, 2009 at 11:37 AM, solsTiCe d'Hiver <solstice.dhiver@gmail.com> wrote:
Nice work on actually doing something here and sharing the code! Thanks, as it might just make some wheels turn for some other people here on the list.

I grabbed your code and took it for a spin. I liked the fact that you had a README and all; I didn't have much trouble at all getting it running. I even found a real hotspot in readdb (add_sorted is a killer in a tight loop; it makes a lot more sense to do all your adds followed by an alpm_list_msort()).

For others on the list who haven't looked at it yet:
* Raw speed alone, this wins. Of course, pacman does a lot more (this isn't parsing conf files, reading mirrorlists, etc.) but a "-Ss pacman" search yielded times of 0.083 seconds vs 0.282 seconds (in the hot cache case, of course).
* BDB uses key/value pairs, for those who aren't familiar. The database layout could probably be simplified a bit: we could pack many attributes into one key/value pair for those we don't use all that often, or never search by but only do lookups.
* It didn't take all that much code to do this. That is encouraging.

What do people think about non-file-system-based backends? There are several options we could think about:
* BSD DB4, similar to what was done here (fast and pretty simple)
* SQLite, which might give us a bit more flexibility for querying/lookup
* Direct tarfile parsing each time, no conversion needed but likely rather inefficient
* ???

The biggest reason always raised in the past against non-file backends was corruption. If you get a corrupted localdb or something you can't recover from, you are in a bad place. With files, you have the lowest barrier to recovery. With a more binary format, it is a lot trickier. Thoughts?

-Dan
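The add_sorted remark can be illustrated with a small sketch. It is written in Python rather than against the alpm list API, so both functions below are hypothetical stand-ins: `build_with_sorted_insert` mimics a sorted add on a linked list (a linear walk from the head for every insertion), and `build_then_sort` does all the adds first and sorts once at the end.

```python
import random


def build_with_sorted_insert(names):
    """Keep the list sorted after every add; each add scans from the head."""
    result = []
    comparisons = 0
    for name in names:
        pos = 0
        # Walk forward until the first element greater than the new one.
        while pos < len(result):
            comparisons += 1
            if result[pos] > name:
                break
            pos += 1
        result.insert(pos, name)
    return result, comparisons


def build_then_sort(names):
    """Do all the adds first, then sort the whole list once."""
    result = list(names)
    result.sort()
    return result


# Made-up package names in a fixed pseudo-random order.
names = [f"pkg{n:05d}" for n in range(2000)]
random.Random(0).shuffle(names)

sorted_a, comparisons = build_with_sorted_insert(names)
sorted_b = build_then_sort(names)
```

Both routes yield the same ordering, but the first does work proportional to the current list length on every add, so its total cost grows quadratically with the number of packages, while sorting once stays at O(n log n).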
On Mon, Nov 9, 2009 at 5:50 AM, Dan McGee <dpmcgee@gmail.com> wrote:
Interesting. A quicker pacman should be a positive thing, right? :)

I vote for BerkeleyDB, because I've used it in previous projects, and besides performance it also brings data integrity and recoverability. (For example, what happens if there is a power outage during a pacman upgrade, just while pacman is writing to the file system? In the case of BerkeleyDB we have atomic operations, so this is not a problem.)

Another note: BerkeleyDB also supports indices, thus allowing us to search more efficiently for keys based on values (searching packages by fields). Also, newer versions of BerkeleyDB have a kind of SQL-like language for defining structures. [1]

About backups, there is a tool to dump and load a database, thus backups should be very easy.

So if someone needs some help with implementing this feature, I could also help.

Ciprian.

[1] http://www.oracle.com/technology/pub/articles/seltzer-berkeleydb-sql.html
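The two properties argued for here, all-or-nothing updates and lookups by field through an index, can be sketched with any transactional store. The example below uses Python's standard `sqlite3` module (SQLite being one of the options Dan listed), not the BerkeleyDB API. The table layout, the `upgrade` helper, and the package data are assumptions made for illustration, and a raised exception stands in for an interrupted upgrade; it does not prove anything about a real power loss.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE packages (name TEXT PRIMARY KEY, version TEXT NOT NULL, grp TEXT)"
)
# Index on a non-key field, so packages can be looked up by group.
conn.execute("CREATE INDEX packages_by_group ON packages (grp)")

with conn:  # commits on success, rolls back if an exception escapes
    conn.executemany(
        "INSERT INTO packages VALUES (?, ?, ?)",
        [("pacman", "1.0-1", "base"), ("glibc", "1.0-1", "base"), ("vim", "1.0-1", "editors")],
    )


def upgrade(conn, updates, fail_after=None):
    """Apply all version bumps in one transaction; any error undoes them all."""
    with conn:
        for count, (name, version) in enumerate(updates):
            if fail_after is not None and count == fail_after:
                raise RuntimeError("simulated interruption")
            conn.execute(
                "UPDATE packages SET version = ? WHERE name = ?", (version, name)
            )


updates = [("pacman", "2.0-1"), ("glibc", "2.0-1")]

try:
    upgrade(conn, updates, fail_after=1)  # fails after the first update
except RuntimeError:
    pass
versions_after_failure = dict(conn.execute("SELECT name, version FROM packages"))

upgrade(conn, updates)  # the same batch, uninterrupted
versions_after_success = dict(conn.execute("SELECT name, version FROM packages"))

base_group = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM packages WHERE grp = ? ORDER BY name", ("base",)
    )
]
```

After the failed run, the first update has been rolled back together with the rest, so the database never shows a half-applied upgrade.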
On Mon, Nov 9, 2009 at 8:54 AM, Ciprian Dorin, Craciun <ciprian.craciun@gmail.com> wrote:
Sorry for the wrong link (I searched for it in a hurry on Google). It's the following one: http://www.oracle.com/technology/documentation/berkeley-db/db/api_reference/...
participants (3)
- Ciprian Dorin, Craciun
- Dan McGee
- solsTiCe d'Hiver