[pacman-dev] proof of concept code with bsd db4

Mon Nov 9 01:54:28 EST 2009

On Mon, Nov 9, 2009 at 5:50 AM, Dan McGee <dpmcgee at gmail.com> wrote:
> On Sat, Oct 31, 2009 at 11:37 AM, solsTiCe d'Hiver
> <solstice.dhiver at gmail.com> wrote:
>> hi.
>> i wanted to make a new bsd db4 back-end for alpm. but i never reached my
>> goal. and will not
>> all i have is a proof of concept code that use bsd db4 api to store
>> pmpkg_t and wanted to share it with anyone (interested ?)
>>
>> i have coded 3 utilities:
>> - one that converts pacman's db into a bsd db4 file for each repo
>> - one that reads that new db format to perform query as pacman does
>> - one that converts directly a tarball db (taken from a sync mirror)
>> into a bsd db4 file
>>
>> if this proves useful for someone, great.
>> More info at http://pagesperso-orange.fr/solstice.dhiver/alpmdb4.html
>> and in the README of
>> http://pagesperso-orange.fr/solstice.dhiver/data/readdb.tar.gz
>
> Nice work on actually doing something here and sharing the code!
> Thanks, as it might just make some wheels turn for some other people
> here on the list.
>
> I grabbed your code and took it for a spin. I liked the fact that you
> had a README and all, I didn't have much trouble at all getting it
> running. I even found a real hotspot in readdb (add_sorted is a killer
> in a tight loop; it makes a lot more sense to do all your adds
> followed by an alpm_list_msort()).
>
> For others on the list who haven't looked at it yet:
> * Raw speed alone, this wins. Of course, pacman does a lot more (this
> isn't parsing conf files, reading mirrorlists, etc) but a "-Ss pacman"
> search yielded times of 0.083 seconds vs 0.282 seconds (in the hot
> cache case, of course).
> * BDB uses key/value pairs for those who aren't familiar. The database
> layout could probably be simplified a bit- we could pack many
> attributes into one key/value pair for those we don't use all that
> often, or never search by but only do lookups.
> * It didn't take all that much code to do this. That is encouraging.
>
> What do people think about non-file-system-based backends? There are
> several options we could think about:
> * BSD DB4, similar to what was done here (fast and pretty simple)
> * SQLite, which might give us a bit more flexibility for querying/lookup
> * Direct tarfile parsing each time, no conversion needed but likely
> rather inefficient
> * ???
>
> The biggest reason always raised in the past against non-file backends
> was corruption. If you get a corrupted localdb or something you can't
> recover from, you are in a bad place. With files, you have the lowest
> barrier to recovery. With a more binary format, it is a lot trickier.
> Thoughts?
>
> -Dan


    Interesting. A quicker pacman should be a positive thing, right? :)

    I vote for BerkeleyDB, because I've used it in previous projects,
and besides performance it also brings data integrity and
recoverability. (For example, what happens if the power goes out
during an upgrade, just while pacman is writing its database to the
file system? With BerkeleyDB's transactional mode the update is
atomic, so either all of it is applied or none of it is.)
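
    A minimal sketch of the all-or-nothing behaviour, not a proposal
for the actual backend. Python's standard library has no BerkeleyDB
binding, so sqlite3 (one of the other options Dan listed) stands in
here. The table layout, the open_db() and upgrade() helpers, and the
fail_after parameter are all made up for illustration, and the
"outage" is just a raised exception.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open a toy package database (hypothetical one-table layout)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pkg (name TEXT PRIMARY KEY, version TEXT)"
    )
    conn.commit()
    return conn


def upgrade(conn, packages, fail_after=None):
    """Write every (name, version) pair in one transaction.

    If anything fails part-way, the whole batch is rolled back, so the
    database never holds a half-applied upgrade.
    """
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for i, (name, version) in enumerate(packages):
                if fail_after is not None and i == fail_after:
                    # Stand-in for a crash in the middle of the upgrade.
                    raise RuntimeError("simulated failure")
                conn.execute(
                    "INSERT OR REPLACE INTO pkg (name, version) VALUES (?, ?)",
                    (name, version),
                )
    except RuntimeError:
        return False
    return True
```

    Calling upgrade() with fail_after=1 on a two-package batch leaves
the table exactly as it was before the call; calling it without
fail_after stores both rows.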

    Another note: BerkeleyDB also supports indices, which let us
search for keys by value more efficiently (for example, finding
packages by one of their fields). Newer versions of BerkeleyDB also
have a kind of SQL-like language for defining structures. [1]
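
    To make the index idea concrete, here is a rough sketch with plain
Python dicts standing in for the key/value databases. In BerkeleyDB
itself a secondary database is tied to the primary one with
DB->associate() and a callback that extracts the secondary key; the
PackageStore class, the "groups" field and the method names below are
only assumptions for illustration.

```python
from collections import defaultdict


class PackageStore:
    """Primary map (name -> record) plus one secondary index (group -> names)."""

    def __init__(self):
        self.primary = {}                  # package name -> record dict
        self.by_group = defaultdict(set)   # group -> set of package names

    def put(self, name, record):
        """Insert or replace a record and keep the index in step with it."""
        old = self.primary.get(name)
        if old is not None:
            # Drop stale index entries left over from the previous record.
            for group in old.get("groups", ()):
                self.by_group[group].discard(name)
        self.primary[name] = record
        for group in record.get("groups", ()):
            self.by_group[group].add(name)

    def find_by_group(self, group):
        """Look up package names by field value without scanning every record."""
        return sorted(self.by_group.get(group, ()))
```

    The point is that a lookup by field value goes through the index
and never walks the whole primary database, at the cost of updating
the index on every write.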

    As for backups, BerkeleyDB ships tools to dump and load a database
(db_dump and db_load), so backups should be very easy.

    So if someone needs help implementing this feature, I would be
glad to pitch in.

    Ciprian.

    [1] http://www.oracle.com/technology/pub/articles/seltzer-berkeleydb-sql.html