[pacman-dev] Pacman, sqlite and dialectic of competent people
Maybe this is a better place to discuss this ( http://bbs.archlinux.org/viewtopic.php?id=42374 ), or at least I have to clarify some things.

1) toofishes says:
------------------------------------------------------------------------
Yes its faster. But does it matter all that much?
$ time pacman -Qg gnome > /dev/null
real 0m0.105s
user 0m0.047s
sys 0m0.037s
$ time pacman -Sg gnome > /dev/null
real 0m0.388s
user 0m0.160s
sys 0m0.203s
------------------------------------------------------------------------
Well... these are my results with Pacman 3.1.0 (Pentium III 933 MHz, PC133 512 MB, 120 GB 7200rpm with XFS filesystem, "pacman-optimize" regularly done):

[root@PC-ekerazha ekerazha]# time pacman -Qg gnome > /dev/null
real 0m8.083s
user 0m0.107s
sys 0m0.233s
[root@PC-ekerazha ekerazha]# time pacman -Sg gnome > /dev/null
real 0m44.482s
user 0m0.510s
sys 0m0.963s

So 8 and 44 seconds (44 seconds!). Apparently it really does matter...

toofishes says:
------------------------------------------------------------------------
Here are the things to "do" since you don't seem to understand we are "busy" developing "software" that works and doesn't "break" people's systems in the "name" of speed.
------------------------------------------------------------------------
Why should the sqlite approach "break people's systems in the name of speed"? You clearly don't know what you are talking about... I really hope there are also competent pacman devs there (as you say).

2) phrakture, people like "toofishes" are the living example of why "show you the code" (already partially done, but incomplete) is useless. As I've already said, <People will "do" when the things to "do" are *understood and accepted*; people won't waste their time vainly>, and it seems obvious that people like "toofishes" don't understand anything about this.

However... keep up the good work, because Arch is a great Linux distro (the best out there) and well...
pacman is a "decent" package manager (although poorly engineered and very slow). P.S. Excuse me for my bad English, but it isn't my native language.
Uh... I was forgetting this thing... toofishes says:
------------------------------------------------------------------------
I didn't know utilizing the kernel cache was against the law.
------------------------------------------------------------------------
I didn't know the cache was *always* populated... what about the "44 seconds"?
On Sat, 19 Jan 2008 11:35:08 +0100 "Manuel \"ekerazha\" C." <manuel@ekerazha.com> wrote:
[cut]
I don't want to troll there, just one suggestion (then I'll go write some code): effectively, pacman needs 40 seconds to perform a search... using the kernel cache is a workaround. It isn't against any law, but using it amounts to hiding pacman's inefficiency. -- JJDaNiMoTh - ArchLinux Trusted User
JJDaNiMoTh wrote:
On Sat, 19 Jan 2008 11:35:08 +0100 "Manuel \"ekerazha\" C." <manuel@ekerazha.com> wrote:
[cut]
I don't want to troll there, just one suggestion (then I'll go write some code):
Effectively, pacman needs 40 seconds to perform a search... using the kernel cache is a workaround. It isn't against any law, but using it amounts to hiding pacman's inefficiency.
Hi everybody, I agree with JJDaNiMoTh. Here are some tests on my machine too:

[root /home/giuseppe ]# sync; echo 3 > /proc/sys/vm/drop_caches
[root /home/giuseppe ]# time pacman -Qi > /dev/null
real 0m17.745s [not cached]
user 0m1.057s
sys 0m1.107s
[root /home/giuseppe ]# time pacman -Qi > /dev/null
real 0m1.166s [cached]
user 0m1.017s
sys 0m0.120s
---
[root /home/giuseppe ]# sync; echo 3 > /proc/sys/vm/drop_caches
[root /home/giuseppe ]# time pacman -Qg gnome > /dev/null
real 0m8.944s [not cached]
user 0m0.117s
sys 0m0.840s
[root /home/giuseppe ]# time pacman -Qg gnome > /dev/null
real 0m0.216s [cached]
user 0m0.127s
sys 0m0.063s
---
Tests done on an Asus A7D notebook.

Regards, Giuseppe
JJDaNiMoTh wrote:
On Sat, 19 Jan 2008 11:35:08 +0100 "Manuel \"ekerazha\" C."<manuel@ekerazha.com> wrote:
[cut]
I don't want to troll there, just one suggestion (then I'll go write some code):
Effectively, pacman needs 40 seconds to perform a search... using the kernel cache is a workaround. It isn't against any law, but using it amounts to hiding pacman's inefficiency.
For the record, I had huge performance issues with my filesystem a while ago. Obviously, this didn't cause slowdowns only in pacman, but in most apps I use. Reinstalling my system fixed it. http://archlinux.org/pipermail/arch-general/2007-October/015751.html

However, if you only care about pacman, you could use a separate partition for /var/lib/pacman (or a loopback filesystem).
However, if you only care about pacman, you could use a separate partition for /var/lib/pacman (or a loopback filesystem).
This is beating a dead horse. Somebody should use a loopback filesystem in order to *partially* compensate for a pacman "misimplementation"...?
Manuel "ekerazha" C. wrote:
However, if you only care about pacman, you could use a separate partition for /var/lib/pacman (or a loopback filesystem).
This is beating a dead horse. Somebody should use a loopback filesystem in order to *partially* compensate for a pacman "misimplementation"...?
I said you could use that, not should. The other way would be to experiment with other backends. In this case, an actual sqlite proof of concept (that is, working code) would help.
I said you could use that, not should.
The other way would be to experiment with other backends. In this case, an actual sqlite proof of concept (that is, working code) would help.
Well, independently of the code of a new pacman backend, just think... SQLite:

SELECT * FROM foo_bar WHERE "group" = 'gnome'

(note that group is an SQL keyword, so it has to be quoted) Do you think this would take "44 seconds" under the same conditions?
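As a concrete sketch of what that query looks like in practice, here is a minimal example using Python's stdlib sqlite3 bindings. The schema, table name, and package rows are all made up for illustration (the thread's foo_bar is a placeholder); the point is that an indexed group lookup is a B-tree search, not a directory walk:

```python
import sqlite3

# In-memory stand-in for a pacman sync db; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE packages (name TEXT PRIMARY KEY, "group" TEXT)')
conn.executemany('INSERT INTO packages VALUES (?, ?)',
                 [("gnome-session", "gnome"),
                  ("nautilus", "gnome"),
                  ("vim", "editors")])
# An index turns the group lookup into an indexed search instead of a full scan.
conn.execute('CREATE INDEX idx_group ON packages("group")')

# "group" is an SQL keyword, hence the double quotes.
rows = conn.execute('SELECT name FROM packages WHERE "group" = ?',
                    ("gnome",)).fetchall()
print(sorted(r[0] for r in rows))  # ['gnome-session', 'nautilus']
```

On a database of a few thousand packages this kind of query completes in milliseconds, cold cache or not.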
I said you could use that, not should.
The other way would be to experiment with other backends. In this case, an actual sqlite proof of concept (that is, working code) would help.
Hey. I got interested in this last week, and started breaking libalpm apart to try to fit in an sqlite-ish implementation. The code was new to me and I didn't have any consideration other than getting something working as fast as possible, so the result is nasty.

Basically, I first commented out treename from struct __pmdb_t so the compiler would tell me all (or most of) the places where the old db is used, and either disabled those functions or did the same with sqlite. The additions are mainly in be_sqlite.c (renamed from be_files.c), where _alpm_db_open opens the sqlite db, and in db.c: _alpm_db_search, which executes a simple SELECT * FROM packages WHERE name LIKE "%foo%" and populates the return list. So, I implemented about 40% of pacman -Ss.

If someone cares about timings (and you probably shouldn't, since my version doesn't do quite the same thing), here they are (running pacman -Ss g three times after a reboot):

pacman-3.1: 41.866s, 0.765s, 0.762s
mutilated-pacman-with-sqlite: 1.036s, 0.131s, 0.133s

pacman-3.1 probably shows rather worse performance in the worst run than it usually would, since my /var/ was 99% full at the time :)

Anyway. The timing is not the most important issue, I think. libalpm has a lot of code that is there merely because C sucks for things like string and directory manipulation, and we need to do a lot of that. My humble guess is that a proper implementation of libalpm done with sqlite could be at least 50% smaller, with a more understandable codebase.

If we want to do this, then how? Some options off the top of my head:

1) For the parts that deal with the db, start from scratch. With the talent you guys have, it shouldn't be a problem? libalpm isn't very large...
2) For the development phase, consider sqlite to be a cache for the filedb, and gradually move each piece of code to the other side. This way, the legacy code would weigh us down a bit, but the change might be more sustainable.
3) Just hack in the functionality somehow, anyhow.
4) Refactor alpm to support different backends and implement whatever backend du jour.

Ideas, praise, flames welcome. Code available by request.

--vk
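For anyone curious what that search path amounts to, here is a minimal sketch of the LIKE-based lookup in Python's stdlib sqlite3. The schema and sample rows are invented (the real code lives in be_sqlite.c in C); this just shows the shape of the query _alpm_db_search runs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE packages (name TEXT PRIMARY KEY, desc TEXT)")
conn.executemany("INSERT INTO packages VALUES (?, ?)",
                 [("gcc", "the GNU Compiler Collection"),
                  ("gimp", "image manipulation program"),
                  ("vim", "a text editor")])

# Equivalent of the SELECT ... WHERE name LIKE "%foo%" mentioned above,
# but parameterized rather than pasted into the SQL string.
needle = "g"
rows = conn.execute("SELECT name FROM packages WHERE name LIKE ?",
                    (f"%{needle}%",)).fetchall()
print(sorted(r[0] for r in rows))  # ['gcc', 'gimp']
```

A leading-wildcard LIKE still scans the table, but scanning one sqlite file beats stat()ing and opening thousands of small desc files.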
Hey. I got interested in this last week, and started breaking libalpm apart to try to fit in an sqlite-ish implementation. The code was new to me and I didn't have any consideration other than getting something working as fast as possible, so the result is nasty. Basically, I first commented out treename from struct __pmdb_t so the compiler would tell me all (or most of) the places where the old db is used, and either disabled those functions or did the same with sqlite. The additions are mainly in be_sqlite.c (renamed from be_files.c), where _alpm_db_open opens the sqlite db, and in db.c: _alpm_db_search, which executes a simple SELECT * FROM packages WHERE name LIKE "%foo%" and populates the return list.
So, I implemented about 40% of pacman -Ss. If someone cares about timings (and you probably shouldn't, since my version doesn't do quite the same thing), here they are:
Huh, this "sqlite backend idea" is quite popular nowadays; imho more people are working on its implementation, so I suggest co-operation ;-)
(running pacman -Ss g three times after a reboot)
pacman-3.1: 41.866s, 0.765s, 0.762s mutilated-pacman-with-sqlite: 1.036s, 0.131s, 0.133s
pacman-3.1 shows probably rather worse performance in the worst run than it usually would, since my /var/ was 99% full at the time :)
Anyway. The timing is not the most important issue, I think. libalpm has a lot of code that is merely there because C sucks for things like string and directory manipulation. And we need to do a lot of that. My humble guess is that a proper implementation of libalpm done with sqlite could be at least 50% smaller with a more understandable codebase.
If we want to do this, then how? Some options from the top of my head:
1) for the parts that deal with the db, start from scratch. With the talent you guys have, shouldn't be a problem? Libalpm isn't very large...
2) for the development phase, consider sqlite to be a cache for the filedb, and gradually move each piece of code to the other side. This way, the legacy code would weigh us down a bit, but the change might be more sustainable.
3) Just hack in the functionality somehow, anyhow.
4) Refactor alpm to support different backends and implement whatever backend du jour.
Ideas, praise, flames welcome. Code available by request.
First of all, I appreciate your work/attempt to make pacman better. Well, I'm pretty sure that 1. sqlite is faster and 2. it reduces the codebase (finding replacements, checking for provisions, groups etc. can be reduced to a simple SQL query), but we haven't got reassuring answers to our "database corruption" fear.

So I would like to ask you to convince us why sqlite is safe (well, personally I have very limited sql[ite] knowledge now). Please try to understand that for most of the pacman devels/contributors "stability" is more important than speed [obviously corrupted localdb == unusable system]. That's why I guess this idea won't be accepted until we see proof that the new db back-end is as safe as the old one (or even safer ;-P).

Bye
but we haven't got reassuring answers to our "database corruption" fear.
So I would like to ask you to convince us why sqlite is safe (well, personally I have very limited sql[ite] knowledge now). Please try to understand that for most of the pacman devels/contributors "stability" is more important than speed [obviously corrupted localdb == unusable system]. That's why I guess this idea won't be accepted until we see proof that the new db back-end is as safe as the old one (or even safer ;-P).
Of course. Although I hope you don't mean 'proof' as in 'mathematical proof'. It might take me a while to complete that...

Sqlite implements ACID except for Consistency, and that's because it ignores foreign key constraints. But otherwise, I believe that part would be immediately safer than what we have now.

As for filesystem corruption (how often does this even happen with current sensible filesystems?), one option would be to back up the localdb after every committing pacman operation, probably with some autoexpiration logic. For instance, extra.db is 337 KB unpacked (using sqlize.py -- it might be missing some fields for now) and extra.db.bz2 is 126 KB, so we could clone those quite a lot of times before using more space than /var/lib/pacman/ always does (which may be as high as 50 MB on a fresh install, ext3).

By the way, how safe do you consider the current db? I remember having at least a couple of disasters with the filedb when trees were truncated. Sure, it was my fault for trying out alternative filesystems, but still, this is exactly the sort of corruption you're afraid of, is it not?

--vk
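The backup-with-autoexpiration idea could be as small as this. A sketch only: the function name, the .bak naming scheme, and the keep count are all invented, and real code would live in libalpm's C, but it shows how little logic is involved:

```python
import shutil
import time
from pathlib import Path

def backup_localdb(db_path, keep=5):
    """Copy the sqlite localdb aside and prune all but the newest `keep` copies."""
    db = Path(db_path)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = db.with_name(f"{db.name}.{stamp}.bak")
    shutil.copy2(db, backup)
    # Autoexpiration: a plain sort works because the stamp is zero-padded,
    # so lexicographic order equals chronological order.
    for old in sorted(db.parent.glob(f"{db.name}.*.bak"))[:-keep]:
        old.unlink()
    return backup
```

Run after every committing pacman operation, this caps backup disk usage at keep times the db size, which per the numbers above stays well under what the file-per-package tree costs today.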
but we haven't got reassuring answers to our "database corruption" fear.
So I would like to ask you to convince us, why sqlite is safe (well, personally I have very limited sql[ite] knowledge now). Please try to understand that for most of the pacman devels/contributors "stability" is more important than speed [obviously corrupted localdb == unusable system]. That's why I can guess that this idea won't be accepted until we cannot see the proof of the fact that the new db back-end is as safe as the old one (or more safer ;-P).
Of course. Although I hope you don't mean 'proof' as in 'mathematical proof'. It might take me some while to complete that...
Sqlite implements ACID except for Consistency, and that's because it ignores foreign key constraints. But otherwise, I believe that part would be immediately safer than what we have now.
As for filesystem corruption (how often does this even happen with current sensible filesystems?), one option would be to back up the localdb after every committing pacman operation, probably with some autoexpiration logic. For instance, extra.db is 337 KB unpacked (using sqlize.py -- it might be missing some fields for now) and extra.db.bz2 is 126 KB, so we could clone those quite a lot of times before using more space than /var/lib/pacman/ always does (which may be as high as 50 MB on a fresh install, ext3).
By the way, how safe do you see the current DB currently? I remember having at least a couple of disasters with the filedb when trees were truncated. Sure it was my fault for trying out alternative filesystems, but still, this is exactly the sort of corruption you're afraid of, is it not?
First, I need some more info (deep links) about this ACID stuff :-P

Second, about the current db: you are in the middle of a db write, and your little sister accidentally presses the reset button. What will happen? At most one corrupted db "entry" [written desc, but missing depends, for example], which can easily be deleted by hand and fixed by reinstalling the package with -Udf. The other parts of the db are still readable and valid.

I don't know what happens in the same situation with sqlite: the worst thing I can imagine is that I get "database corrupted" after reboot, which would be a nightmare. <- This is where you must convince us. Or, since sqlite is a "black box", I get some spam/corrupt info inside the db which I cannot clean up so easily.

So if you can "guarantee" that with sqlite we get an "atomic update" capable (<- this is missing with the current method), "reset-button-safe" db, it might be worth thinking about the change.

Bye
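For what it's worth, the "reset button mid-write" scenario is exactly what sqlite's transactions address: either the whole update lands or none of it does, so a torn entry (desc written, depends missing) cannot occur. A minimal sketch, simulating the reset with an exception via Python's stdlib sqlite3 (the localdb schema is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE localdb (name TEXT PRIMARY KEY, desc TEXT, depends TEXT)")
conn.commit()

try:
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("INSERT INTO localdb VALUES ('foo', 'desc written', NULL)")
        raise RuntimeError("little sister presses the reset button")
        # depends would have been written here
except RuntimeError:
    pass

# No half-finished entry: the interrupted write was rolled back wholesale.
print(conn.execute("SELECT COUNT(*) FROM localdb").fetchone()[0])  # 0
```

On a real power loss the same guarantee comes from sqlite's rollback journal: on the next open, an interrupted transaction is undone automatically before any query runs.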
On Jan 28, 2008 9:34 AM, Nagy Gabor <ngaba@bibl.u-szeged.hu> wrote:
First, I need some more info (deeplinks) about this ACID stuff :-P
http://en.wikipedia.org/wiki/ACID - part of ACID (the "A") is atomicity.
sqlite also has transactions, which might be what Nagy is looking for in regard to 'power button pushed in the middle of an operation'. You can also set an sqlite pragma to perform operations synchronously for improved safety (it might make sqlite a bit slower on writes).

Historically, sqlite has had a few corruption bugs. I would consider most of them edge cases that are largely preventable; pacman has the benefit of single-thread locking on the db, so issues with contention and multiprocess journal rollback shouldn't be a problem.
http://www.sqlite.org/cvstrac/wiki?p=DatabaseCorruption

Also some more "how to corrupt" info (which also contains remediation information):
http://www.sqlite.org/lockingv3.html#how_to_corrupt

For improved safety, pacman could open the db in write mode only when it needs to (when run as root), and open the db in read-only mode the rest of the time.
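Both knobs can be sketched through Python's stdlib sqlite3. The file path is made up, and note the URI-style read-only open needs sqlite 3.7.7 or later (newer than this thread; back then one would enforce read-only via file permissions instead):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "pacman.db")

rw = sqlite3.connect(path)
rw.execute("PRAGMA synchronous = FULL")  # fsync on every commit: slower, safer
rw.execute("CREATE TABLE packages (name TEXT PRIMARY KEY)")
rw.commit()
rw.close()

# Non-root operations could use a read-only handle; writes are refused.
ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
try:
    ro.execute("INSERT INTO packages VALUES ('x')")
except sqlite3.OperationalError:
    print("write refused on read-only handle")
```

With synchronous = FULL, sqlite waits for the OS to confirm data is on disk before a commit returns, which is precisely the trade of write speed for reset-button safety discussed above.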
On Jan 28, 2008 9:34 AM, Nagy Gabor <ngaba@bibl.u-szeged.hu> wrote:
First, I need some more info (deeplinks) about this ACID stuff :-P
http://en.wikipedia.org/wiki/ACID - part of ACID (the "A") is atomicity.
Well, this looks impressive. I've also read this: http://www.sqlite.org/lockingv3.html#how_to_corrupt

The key question is what "corrupt" means here, and I don't really know [unable to open, or just inconsistent but easily rolled back]. In other words, could a wrong/missing bit (for simplicity: caused by a very rare hw failure) corrupt the db to an unusable state (cannot open, cannot roll back) or not?

Bye
On Jan 28, 2008 3:57 AM, Vesa Kaihlavirta <vpkaihla@gmail.com> wrote:
Ideas, praise, flames welcome. Code available by request.
I just wanted to throw my 2 cents in here - this is amazing. I actually got a lot of emails saying "look see! sqlite is teh better!" after you wrote this, but sadly, they didn't write it. You are the winner here. For the past 3 years this idea has been discussed, and you are the first person to actually do this. Consider yourself godlike 8)
toofishes says: ------------------------------------------------------------------------ Here are the things to "do" since you don't seem to understand we are "busy" developing "software" that works and doesn't "break" people's systems in the "name" of speed. ------------------------------------------------------------------------
Why should the sqlite approach "break people's systems in the name of speed"? You clearly don't know what are you talking about... I really hope there are also competent pacman devs there (as you say).
I think it was discussed here many times why we prefer the current backend: simply because we are paranoid; we are afraid of the following "message": "~Database is corrupted, restore it from backup", with the database left as a cryptic bunch of bytes. Calling someone not competent just because he doesn't agree with you is not a good argument. Well, we are probably not dbms experts, so instead of flaming you should convince us why your method is safe.

I'll add one argument here: maybe using a professional dbms would reduce memory usage; but I am incompetent here, indeed ;-)

Bye
----------------------------------------------------
SZTE Egyetemi Könyvtár - http://www.bibl.u-szeged.hu
This mail sent through IMP: http://horde.org/imp/
I think it was discussed here many times why we prefer the current backend: simply because we are paranoid; we are afraid of the following "message": "~Database is corrupted, restore it from backup", with the database left as a cryptic bunch of bytes. Calling someone not competent just because he doesn't agree with you is not a good argument. Well, we are probably not dbms experts, so instead of flaming you should convince us why your method is safe.
I'll add one argument here: maybe using a professional dbms would reduce memory usage; but I am incompetent here, indeed ;-)
Did you read the bbs thread I linked before? ;-) ( http://bbs.archlinux.org/viewtopic.php?id=42374 )
Wanted to get some relevant linkage in this thread:

http://www.archlinux.org/pipermail/pacman-dev/2006-October/006113.html
http://www.archlinux.org/pipermail/pacman-dev/2007-November/009936.html
http://www.archlinux.org/pipermail/pacman-dev/2007-November/010278.html

I am going to try really hard to keep this civil in here, so please do the same. I post the above links for this reason- this idea has come up many times before, and every time it doesn't seem to catch the attention of our devs. To find out why, you are going to have to do some reading of the above threads.

This is not to say it can't be done. I just don't think those of us that are currently coding a lot of things for pacman find this to be a priority or a big problem in our minds, and/or think there are other ways to better solve the problem, such as reading straight from a tar.gz database, which libarchive makes *really* easy but the current pacman code needs some work to support. I would be very interested in working on a refactoring so that multiple backends could be possible- the code as it currently stands makes that awfully hard.

-Dan
Wanted to get some relevant linkage in this thread:
http://www.archlinux.org/pipermail/pacman-dev/2006-October/006113.html http://www.archlinux.org/pipermail/pacman-dev/2007-November/009936.html http://www.archlinux.org/pipermail/pacman-dev/2007-November/010278.html
I am going to try really hard to keep this civil in here, so please do the same. I post the above links for this reason- this idea has come up many times before. And every time it doesn't seem to catch the attention of our devs. To find out why, you are going to have to do some reading of the above threads.
This is not to say it can't be done. I just don't think those of us that are currently coding a lot of things for pacman find this to be a priority or a big problem in our minds, and/or think there are other ways to better solve the problem, such as reading straight from a tar.gz database which libarchive makes *really* easy, but the current pacman code needs some work to support. I would be very interested in working on a refactoring so that multiple backends could be possible- the code as it currently stands makes that awfully hard.
As I've already pointed out in the previously linked thread ( http://bbs.archlinux.org/viewtopic.php?id=42374 ), I've already read all the previous "sqlite" discussions.

What about the "libarchive" way? Well... it IS a step forward compared to the current backend. I still think it's worse than a sqlite-based approach, but it IS definitely a good improvement.
On Jan 19, 2008 10:28 AM, Manuel ekerazha C. <manuel@ekerazha.com> wrote:
Wanted to get some relevant linkage in this thread:
http://www.archlinux.org/pipermail/pacman-dev/2006-October/006113.html http://www.archlinux.org/pipermail/pacman-dev/2007-November/009936.html http://www.archlinux.org/pipermail/pacman-dev/2007-November/010278.html
I am going to try really hard to keep this civil in here, so please do the same. I post the above links for this reason- this idea has come up many times before. And every time it doesn't seem to catch the attention of our devs. To find out why, you are going to have to do some reading of the above threads.
This is not to say it can't be done. I just don't think those of us that are currently coding a lot of things for pacman find this to be a priority or a big problem in our minds, and/or think there are other ways to better solve the problem, such as reading straight from a tar.gz database which libarchive makes *really* easy, but the current pacman code needs some work to support. I would be very interested in working on a refactoring so that multiple backends could be possible- the code as it currently stands makes that awfully hard.
As I've already pointed out in the previously linked thread ( http://bbs.archlinux.org/viewtopic.php?id=42374 ), I've already read all the previous "sqlite" discussions.
You clearly have not thought about them then. Stop linking the BBS here please.
What about the "libarchive" way? Well... it IS a step forward compared to the current backend. I still think it's worse than a sqlite-based approach, but it IS definitely a good improvement.
OK. Since you didn't seem to want to help me with refactoring, I'll do my thing and you do yours. Best of luck! -Dan
You clearly have not thought about them then. Stop linking the BBS here please.
Well... most of the "anti-sqlite" arguments on those linked pages are just ridiculous, and YES, I've already replied to them in the BBS page I linked before.
OK. Since you didn't seem to want to help me with refactoring, I'll do my thing and you do yours. Best of luck!
Good luck with your ideas, Dan!
participants (10)
- Aaron Griffin
- Dan McGee
- eliott
- Giuseppe Fuggiano
- JJDaNiMoTh
- Manuel "ekerazha" C.
- Nagy Gabor
- Travis Willard
- Vesa Kaihlavirta
- Xavier