[pacman-dev] [PATCH] Database with index??
Hello to all.

After 2 weeks of intensive work (:P) I'm glad to post the first proof-of-concept code about using an index in our life.

Do you remember the "Backend DB" discussion we had in November? If not, see http://archlinux.org/pipermail/pacman-dev/2007-November/009938.html : Mikos proposed seeing how pacman works with one big database file, like pacman1, plus a smaller index that contains some information (like pkgname - pkgver) and where each entry starts and ends in the big database.

At first I thought about building the index every time a user runs pacman -Sy, but we don't have time to rebuild the index on every sync, so I coded everything on the "B-side" of pacman: repo-add.

The result is two files:

txtdb.idx ===> the big, big database file;
index.idx ===> a smaller file with this format:

pkgname - pkgver - line_start - line_end

Example:

chpax - 0.12 - 22 - 53

Reading the line that contains "chpax", I see that its entry in txtdb.idx starts on line 22 and ends on line 53.

Note that this patch doesn't modify the old way of building the database. It simply adds a new one :D

Now the only part left to code is the pacman side. Why haven't I coded it yet? Because the pacman side is (in my opinion) the more difficult part, and (last but not least) I want to hear your opinions about this.

Excuse my elementary English (I rewrote the introduction to the patch 3 times :D) and excuse me if I can't explain how the patch works (I didn't write any comments..). I will do both (comments and explanation) tomorrow.. now it's very late :)

Thanks
--
JJDaNiMoTh - ArchLinux Trusted User
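For readers following along, here is a minimal sketch of how a line of index.idx in the format above could be used to pull one entry out of txtdb.idx. It is not code from the patch: the function names `parse_index_line` and `read_entry` are hypothetical, the fields are assumed to be separated by exactly ' - ', and line numbers are assumed to be 1-based and inclusive.

```python
import io
from itertools import islice

def parse_index_line(line):
    """Split 'pkgname - pkgver - line_start - line_end' into typed fields."""
    # Split from the right first so the last two fields are always the
    # line numbers (ASSUMPTION: ' - ' is the separator, as in the example).
    rest, start, end = line.strip().rsplit(" - ", 2)
    name, ver = rest.split(" - ", 1)
    return name, ver, int(start), int(end)

def read_entry(db, start, end):
    """Return lines start..end (1-based, inclusive) from an open text file."""
    # islice stops reading as soon as line `end` has been consumed.
    return list(islice(db, start - 1, end))

name, ver, start, end = parse_index_line("chpax - 0.12 - 22 - 53")
```

Note that with line numbers, reaching line 22 still means reading the 21 lines before it.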
On Sat, Dec 08, 2007 at 11:21:40PM +0100, JJDaNiMoTh wrote:
> Hello to all.
> After 2 weeks of intensive work (:P) I'm glad to post the first proof-of-concept code about using index in our life.
A few suggestions:

1. Make sure to test with packages that contain hyphens, like 'gcc-libs'. Your regular expression does not work well with those packages.

2. Store the actual byte offsets in the index file rather than (or in addition to) the line numbers. It is easier to seek to a position than a line number; see the man page for fseek.

3. You call writeIndexEntry() n times (n = # of pkgs), and each call reads in the entire huge database file. Change it so that it is only read once. Once you do this, you should find that the tot_lines being passed to the script is unnecessary. Pseudocode:

    open_files()
    count = 0
    pkg = None
    for i in txtdb.readlines():
        if pkg is None:
            (pkg, ver) = parse_pkg()
            count = 0
        else:
            count += 1
        if i == "@@ENDS\n":
            write_index_line()
            pkg = None
    close_files()

I am interested in seeing what the performance differences would be between this, the current backend, and a tar backend (FS#8586), so keep it up and good luck.
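Suggestions 2 and 3 can be combined in one pass over the database. The following is only a sketch under stated assumptions, not the patch itself: `build_index` is a hypothetical name, the "@@ENDS" terminator is taken from the pseudocode, and the layout of an entry's first line ("<pkgname> <pkgver>") is an assumption, since the thread does not show the real txtdb.idx format. The file is read in binary mode so that line lengths are byte counts.

```python
import io

END_MARKER = b"@@ENDS\n"  # entry terminator, as used in the pseudocode

def build_index(db):
    """Scan a binary file object once; return (name, ver, start, end) tuples.

    start is the byte offset where the entry begins and end is the offset
    just past its terminator line.
    """
    index = []
    name = ver = None
    start = 0
    offset = 0  # running byte offset of the current line
    for line in db:
        if name is None:
            # ASSUMPTION: the first line of an entry is "<pkgname> <pkgver>".
            name, ver = line.decode().split()
            start = offset
        offset += len(line)
        if line == END_MARKER:
            index.append((name, ver, start, offset))
            name = None
    return index
```

Because offsets are accumulated while scanning, no total line count has to be passed in, which matches the remark about tot_lines.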
On Sat, 8 Dec 2007 21:36:54 -0500 Nathan Jones <nathanj@insightbb.com> wrote:
> A few suggestions:
> 1. Make sure to test with packages that contain hyphens, like 'gcc-libs'. Your regular expression does not work well with those packages.

OK. I tried ' - ' instead of '-', but in the end I opted to use '@' as the separator. I hope this will also be useful when I write the search in C.
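To illustrate why the separator matters, here is a small sketch. The names `format_entry` and `parse_entry` and the four-field layout are hypothetical (the patch's exact field order is not shown in the thread), and it assumes '@' never appears in a package name or version.

```python
SEP = "@"  # separator chosen in the reply above

def format_entry(name, ver, start, end):
    """Join the fields into one index line (without a trailing newline)."""
    return SEP.join([name, ver, str(start), str(end)])

def parse_entry(line):
    """Inverse of format_entry; ASSUMES '@' is absent from name and version."""
    name, ver, start, end = line.rstrip("\n").split(SEP)
    return name, ver, int(start), int(end)

# A bare '-' separator is ambiguous for hyphenated names such as gcc-libs:
ambiguous = "gcc-libs-4.2.2-22-53".split("-")   # five fields, not four
clean = parse_entry(format_entry("gcc-libs", "4.2.2", 22, 53))
```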
> 2. Store the actual byte offsets in the index file rather than (or in addition to) the line numbers. It is easier to seek to a position than a line number; see the man page for fseek.

Right. It now stores the byte offsets as well.
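With byte offsets in the index, a reader can jump directly to an entry. A minimal sketch, assuming the index stores a half-open byte range [start, end) and the database is opened in binary mode; `read_entry_at` is a hypothetical name, and Python's `seek` stands in for the `fseek` call the C side would use.

```python
import io

def read_entry_at(db, start, end):
    """Return the bytes in [start, end) from a file opened in binary mode."""
    db.seek(start)               # jump straight to the entry, no scanning
    return db.read(end - start)  # read exactly the entry's bytes
```

The cost of this lookup does not depend on how many entries precede the one requested, unlike counting lines from the top of the file.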
> 3. You call writeIndexEntry() n times (n = # of pkgs), and each call reads in the entire huge database file. Change it so that it is only read once. Once you do this, you should find that the tot_lines being passed to the script is unnecessary. Pseudocode:

[cut]

Done. The patch is attached, but it is based on the previous patch. Would you prefer a patch based directly on master? If so, tell me and I will redo it :D
> I am interested in seeing what the performance differences would be between this, the current backend, and a tar backend (FS#8586), so keep it up and good luck.

Thanks!
--
JJDaNiMoTh - ArchLinux Trusted User