[arch-general] Implement sql/sqlite database for pacman local database
I was curious why 'pacman -Q' operations take longer than their 'apt' counterparts. It seems that the local pacman database is just subdirectories with text files (desc, files) and gzipped text (mtree). No wonder local pacman databases tend to slow down over time and need to be optimized periodically. For the long-term pacman development road map, it would be better to use a single SQL-based database for tracking locally installed packages instead of keeping a directory for every installed package. This would provide faster access to the local database, as SQL databases are optimized for fast access.
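For illustration, the layout looks roughly like this (the package directory shown is hypothetical; names and versions vary per system):

$ ls /var/lib/pacman/local/linux-4.8.3-1/
desc  files  mtree

A query like 'pacman -Qs' ends up opening one such directory per installed package.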
Hi,
I was curious why 'pacman -Q' operations take longer than their 'apt' counterparts.
Sounds interesting, but I have a few questions about how you measured this and how big the difference is (it shouldn't be that big). It would be great if you could provide more information on the comparability of your systems and the tools you used for tracing. Maybe there are other reasons why it is slow on your installation?
For the long-term pacman development road map, it would be better to use a single SQL-based database for tracking locally installed packages instead of keeping a directory for every installed package.
I am not sure an SQL-based database would be a good solution even if you were right. It adds much more complexity and also a dependency on an $SQL backend. For me, as a semi-professional Arch user, that would be worse than a perhaps "not that fast" package DB query.
Regards, Robin
On Sat, Oct 22, 2016 at 1:54 AM, Robin via arch-general < arch-general@archlinux.org> wrote:
[...]
Sometimes I have a similar problem, too. When the system has just booted up, or I have just thrashed my disk (for example by building Firefox), pacman-related files are evicted from the disk cache, so it takes some time to read them all back from the disk. Here's a simple performance test:

$ sudo -v && time pacman -Q linux && sudo sync && sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches && time pacman -Q linux
[sudo] password for yen:
linux 4.8.3-1
pacman --color=auto -Q linux  0.00s user 0.00s system 2% cpu 0.121 total
3
linux 4.8.3-1
pacman --color=auto -Q linux  0.00s user 0.01s system 0% cpu 1.229 total

The difference is more than 10 times. I use a 5-year-old HDD; I guess on even older machines things are worse. Regards, Yen Chi Hsuan
On Sat, Oct 22, 2016 at 02:15:01AM +0800, Chi-Hsuan Yen via arch-general wrote:
[...]
The difference is more than 10 times. I use a 5-year-old HDD; I guess on even older machines things are worse.
My own test - before optimization, ``pacman -Qs linux`` took almost half a minute.

$ time pacman -Qs linux
real 0m26.716s
user 0m0.063s
sys 0m0.230s

After running ``pacman-optimize``, it runs instantly.

$ time pacman -Qs linux
real 0m0.048s
user 0m0.030s
sys 0m0.017s

Filesystem fragmentation is felt more deeply on slower, older HDDs.
On 22/10/16 14:06, Alive 4ever wrote:
[...]
My own test - before optimization, ``pacman -Qs linux`` took almost half a minute. After running ``pacman-optimize``, it runs instantly.
Isn't caching brilliant...
On Sat, 22 Oct 2016 04:06:31 +0000 Alive 4ever <alive4ever@live.com> wrote:
[...]
After running ``pacman-optimize``, it runs instantly.
Hmm, I just tried that and the results were exactly the same before and after pacman-optimize. All you have done on the second run is read the cache, which is obviously a lot quicker than hunting the data out on an actual HDD. Pete.
On Fri, Oct 21, 2016 at 17:20:53 +0000, Alive 4ever wrote:
[...] It seems that the local pacman database is just subdirectories with text files (desc, files) and gzipped text (mtree). No wonder local pacman databases tend to slow down over time and need to be optimized periodically.
This is a little contradictory: if it is just directories with text files (plain or compressed), how does it need "periodical optimisation"? What is optimised? And how?
This would provide faster access to the local database, as SQL databases are optimized for fast access.
This just adds complexity, and for what? A marginal performance gain (if any? honestly, `pacman -Q` runs almost instantly here).
Best, Tinu/ayekat
On Fri, Oct 21, 2016 at 08:03:53PM +0200, Tinu Weber wrote:
On Fri, Oct 21, 2016 at 17:20:53 +0000, Alive 4ever wrote:
[...] No wonder local pacman databases tend to slow down over time and need to be optimized periodically.
This is a little contradictory: if it is just directories with text files (plain or compressed), how does it need "periodical optimisation"? What is optimised? And how?
Local text files rely on the underlying filesystem's ability to optimize things. Currently, the pacman package includes a ``pacman-optimize`` script to do manual, periodic local database optimization.
Invoking ``pacman-optimize --help`` should give you a hint that the pacman developers recognize there is a problem with many small files: filesystem fragmentation. Basically, the optimization is just a filesystem rotation: the old database is archived into a tar file under tmp, the local database directory is moved/renamed, and the tar archive is extracted back to the original location. If everything goes fine, the renamed old database is deleted.
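A rough sketch of what that rotation amounts to (paths and tool invocations here are illustrative, not copied from the real script, which also verifies the database with checksums before and after):

# Illustrative sketch of the pacman-optimize rotation, not the real script.
dbroot=/var/lib/pacman/local
workdir=$(mktemp -d)
# 1. Archive the current local database into a tar file under tmp.
bsdtar -cf "$workdir/pacman-db.tar" -C "$dbroot" .
# 2. Move the old database aside.
mv "$dbroot" "$dbroot.bak"
# 3. Extract the archive back, so the entries land contiguously on disk.
mkdir "$dbroot"
bsdtar -xpf "$workdir/pacman-db.tar" -C "$dbroot"
# 4. If everything went fine, delete the renamed old database.
rm -rf "$dbroot.bak" "$workdir"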
This would provide faster access to the local database, as SQL databases are optimized for fast access.
This just adds complexity, and for what? A marginal performance gain (if any? honestly, `pacman -Q` runs almost instantly here).
Filesystem fragmentation is not a problem on faster disks.
On Sat, 22 Oct 2016 03:53:20 +0000 Alive 4ever <alive4ever@live.com> wrote:
On Fri, Oct 21, 2016 at 08:03:53PM +0200, Tinu Weber wrote:
Currently, the pacman package includes a ``pacman-optimize`` script to do manual, periodic local database optimization.
Not for long. https://git.archlinux.org/pacman.git/commit/?id=d590a45795b30a14cdb697754749...
On Fri, Oct 21, 2016 at 11:06:04PM -0500, Doug Newgard wrote:
[...]
Not for long. https://git.archlinux.org/pacman.git/commit/?id=d590a45795b30a14cdb697754749...
I was just referring to the locally installed pacman files, since the commit hasn't landed in the repository package yet.
On 10/22/2016 02:28 AM, Alive 4ever wrote:
[...]
Not for long. https://git.archlinux.org/pacman.git/commit/?id=d590a45795b30a14cdb697754749...
I was just referring to the locally installed pacman files, since the commit hasn't landed in the repository package yet.
Right... but if you look at the stated reason for its removal, you tend to get the impression that the lead pacman developer is actually explicitly calling you (rhet.) a blithering idiot if you (rhet.) think that pacman-optimize is actually optimizing anything. Particularly since the ext filesystem (like all Linux filesystems) was sort of designed around a principle meant to escape filesystem fragmentation altogether.

...

If you really think your filesystem fragmentation is a problem, then the problem is not limited to pacman. You should consider doing a full disk defragmentation. Also consider the fact that your pacman database is one of the least likely pieces of data to target for the sake of noticeably improving your computer's overall performance.

-- Eli Schwartz
On Sat, Oct 22, 2016 at 08:35:15PM -0400, Eli Schwartz via arch-general wrote:
Right... but if you look at the stated reason for its removal, you tend to get the impression that the lead pacman developer is actually explicitly calling you (rhet.) a blithering idiot if you (rhet.) think that pacman-optimize is actually optimizing anything.
I'd appreciate it if developers refrained from such abusive behavior, to keep a healthy community relationship.
Particularly, since the ext filesystem (like all linux filesystems) was sort of designed around a principle meant to escape filesystem fragmentation altogether.
...
If you really think your filesystem fragmentation is a problem, then the problem is not limited to pacman. You should consider doing a full disk defragmentation.
Even though the filesystem is as well designed as it could be, if the underlying hardware can't fully take advantage of it, fragmentation still occurs. In my case, it may have been caused by low free disk space.
Also consider the fact that your pacman database is one of the least likely pieces of data to target for the sake of noticeably improving your computer's overall performance.
Yeah, I am aware of it. Hardware components tend to degrade over time, especially mechanical hard drives.
On 10/23/2016 01:41 AM, Alive 4ever wrote:
Also consider the fact that your pacman database is one of the least likely pieces of data to target for the sake of noticeably improving your computer's overall performance.
Yeah, I am aware of it. Hardware components tend to degrade over time, especially mechanical hard drives.
What part of "least likely pieces of *data*" was not obvious enough in its reference to *data*, such that you felt the need to conflate data with hardware? If you are agreeing with me ("Yeah, I am aware of it") then stop mentioning hardware. If you are arguing with me, then please explain what you actually meant to say and what it has to do with hardware. ... I am still pretty sure that whatever problem you may have, it is not pacman-specific and it doesn't require a pacman-specific tool. So while you might disagree with the political commentary invoked in that commit message, the general idea that the script is a waste of space is something I can get behind! -- Eli Schwartz
On Sun, Oct 23, 2016 at 01:57:23AM -0400, Eli Schwartz via arch-general wrote:
[...]
If you are agreeing with me ("Yeah, I am aware of it") then stop mentioning hardware. If you are arguing with me, then please explain what you actually meant to say and what it has to do with hardware.
Back to the subject: I wanted to propose my idea on how to speed up pacman local database access - regardless of the hardware. Some folks replied with something like ``there is no need for this, just go get an SSD``, which is misleading. I posted the idea here as a suggestion for improving the pacman local database, not as a complaint about slow filesystem access on a mechanical drive.
I am still pretty sure that whatever problem you may have, it is not pacman-specific and it doesn't require a pacman-specific tool.
So while you might disagree with the political commentary invoked in that commit message, the general idea that the script is a waste of space is something I can get behind!
While using SQL for the pacman database could potentially provide faster access to the local database, there is a risk of database corruption that isn't easy to recover from. The current approach of using many small files also has its own advantages, as gsnijder explained:
One really big advantage of this approach is that you don't have to worry about corrupted databases. It's been a while since I used an RPM-based distro, but it always surprised me how quickly the db would fall over and need to be rebuilt. The beauty of the Arch approach is its robust simplicity. And yes, there are/were wrappers that use an SQL DB for specific operations. If such a db runs into problems, there is nothing to worry about: pacman just keeps working anyway.
I'll leave it to the developers to decide which method to use for the pacman local database.
On 2016/10/23 20:42, Alive 4ever <alive4ever@live.com> wrote:
[...]
I'll leave it to the developers to decide which method to use for the pacman local database.
In my mind, there is actually a good side and a bad side. On the good side of an SQL db: faster access, fast queries, and reading with simpler methods... On the bad side: if power is lost during a write, the DB will surely corrupt and is not easy to recover.
On 10/21/2016 01:20 PM, Alive 4ever wrote:
[...] This would provide faster access to the local database, as SQL databases are optimized for fast access.
The reason pacman uses a flat-file database as opposed to a relational database is the result of a deliberate design decision by the lead pacman developers. Therefore, I really, really, really doubt you will be able to convince them to change -- they already know all the arguments about speed, and have declared that they preferentially value the current lack of complexity.

-- Eli Schwartz
On 21.10.2016 at 21:48, Eli Schwartz via arch-general wrote:
The reason pacman uses a flat-file database as opposed to a relational database is the result of a deliberate design decision by the lead pacman developers.
Therefore, I really, really, really doubt you will be able to convince them to change -- they already know all the arguments about speed, and have declared that they preferentially value the current lack of complexity.
Maybe a solution more in line with that would be to implement a search database similar to updatedb/locate.
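Something along those lines, as a purely hypothetical sketch (the cache path and format are made up): concatenate the desc files into one flat cache after every transaction, and let searches grep that single file instead of thousands of small ones.

# Hypothetical updatedb-style cache; could be rebuilt from a pacman hook.
cache=/var/cache/pacman-desc.cache
for d in /var/lib/pacman/local/*/; do
    printf '== %s\n' "$(basename "$d")"
    cat "${d}desc"
done > "$cache"

# A search now reads one file:
grep -A1 '%NAME%' "$cache" | grep linux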
[...] This would provide faster access to the local database, as SQL databases are optimized for fast access.
optimisation ideas for the pacman database are (nearly) as old as this distribution, but none ever really convinced many. if you go back through the archives of the mailing lists and forums (as well as the outer parts of the interwebs) you will find many similar ideas. i specifically remember an approach to read the database once on startup and store it in a ramfs (roughly the sketch after this message). had its disadvantages as well…

for all actions regarding the database, i strongly advise you to do a full backup of the whole database before tinkering with it.

my advice: get an SSD for your system; that makes the pacman database lightning fast, and your data is as safe as can be.

georg
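A minimal sketch of that ramfs idea, assuming a tmpfs mount and a bind mount over the real location (/run/pacman-db is a made-up path) - note that anything pacman writes would then have to be synced back to disk, which is exactly where the disadvantages come in:

# Hypothetical boot-time setup: serve the local db from RAM.
mount -t tmpfs -o size=256M tmpfs /run/pacman-db
cp -a /var/lib/pacman/local /run/pacman-db/
mount --bind /run/pacman-db/local /var/lib/pacman/local
# Reads now hit RAM, but writes do too - they are lost on power-off
# unless copied back to the real /var/lib/pacman/local.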
On Fri, Oct 21, 2016 at 11:44:42PM +0200, G. Schlisio wrote:
[...]
my advice: get an SSD for your system; that makes the pacman database lightning fast, and your data is as safe as can be.
I suggested this idea thinking that pacman could be improved further. SQL queries should be faster than doing 'grep' over thousands of files, regardless of the underlying filesystem; that is one reason SQL was invented in the first place.

The cons of using SQL are that it adds more complexity and another dependency. Instead of just extracting .mtree, desc, and files into /var/lib/pacman/local/ subdirectories, a more complex SQL query would have to be performed to update the local database.

If the pacman developers choose simplicity over speed, text files make sense. If speed is preferred, an SQL database makes more sense, although the complexity of implementing it might be the reason not to. I want to discuss this so that, if more people get interested in this feature, it would make sense to implement it.

http://stackoverflow.com/questions/2356851/database-vs-flat-files
http://dba.stackexchange.com/questions/23124/whats-better-faster-mysql-or-fi...
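To make the grep-vs-SQL comparison concrete, here is a purely hypothetical sketch - nothing pacman provides - that loads name/version pairs from the local db into SQLite and queries them through the primary-key index:

# Hypothetical one-off import; /tmp/pacman-test.db is a scratch file.
sqlite3 /tmp/pacman-test.db 'CREATE TABLE IF NOT EXISTS packages (name TEXT PRIMARY KEY, version TEXT);'
for d in /var/lib/pacman/local/*/; do
    entry=$(basename "$d")              # e.g. linux-4.8.3-1
    name=${entry%-*}; name=${name%-*}   # strip -pkgrel, then -pkgver
    version=${entry#"$name"-}
    sqlite3 /tmp/pacman-test.db "INSERT OR REPLACE INTO packages VALUES ('$name', '$version');"
done

# One indexed lookup instead of a directory scan per query:
sqlite3 /tmp/pacman-test.db "SELECT name, version FROM packages WHERE name = 'linux';"

(Building it one sqlite3 process per package is of course slow; the point here is only the query side.)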
As the current lead pacman developer... we will never have an SQL (or other) database backend. When we did tests for the sync backends, using a single tar file gave the same speed-up as using some SQL variant (and we still have not optimised any reading from that - for the sync "dbs", we always load all information regardless of what proportion is needed). I intend to move the local db to a single tar file too at some stage.

Allan
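For illustration, the sync databases already work this way today - each is a single tar archive with a desc entry per package, readable in one sequential pass (the exact entry name below depends on the installed version):

# Extract one package's metadata from the single-file sync db:
bsdtar -xOf /var/lib/pacman/sync/core.db linux-4.8.3-1/desc

A single-tar local db would turn a query into one archive read instead of thousands of small file opens.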
On Sat, Oct 22, 2016 at 02:20:30PM +1000, Allan McRae wrote:
[...]
I intend to move the local db to a single tar file too at some stage.
Thanks for the clarification. If it achieves the same purpose, i.e. faster local database access, it doesn't matter which backend the local database uses. Be it SQL or a single tar file, as long as the performance is acceptable, I think it's OK.
participants (11)
- Alive 4ever
- Allan McRae
- Chi-Hsuan Yen
- Doug Newgard
- Dragon “floppy1” ryu
- Eli Schwartz
- G. Schlisio
- pete nikolic
- ProgAndy
- Robin
- Tinu Weber