[pacman-dev] pacman cold caches performance, too much stat()ing
Hello list, I have been investigating the slow performance of pacman regarding the cold caches scenario and I'm trying to write some proof of concept code that improves things a lot for some cases. However I need your help regarding some facts I might have misunderstood, and any pointers to the source code you also give me would also help a lot. I wouldn't like to lose time changing stuff that would break current functionality. So here are some first questions that come to mind, just by using strace: When doing "pacman -Q blah" I can see that besides the getdents() syscalls in /var/lib/pacman/local (probably caused by readdir()), there are also stat() and access() calls for every single subdirectory. Why are the last ones necessary? Isn't readdir enough? The same goes when doing "pacman -S blah". But in that case it stat()'s both 'local' and 'sync' directories, so worst case is really bad, it will stat() all contents of local, core, extra and community... In the case of "pacman -S" I measured that a great deal of time goes also to reading the "depends" files of all packages in local, please enlighten me what this is for. I have thought of a new way to store dependencies that should improve things, but I should first be sure it doesn't break anything and get some measurements myself. Thanks in advance, Dimitris P.S. Is there some option --pretend I might have missed? What I need is to get exactly the same actions of "pacman -S blah" or "pacman -Su" until the Y/N prompt, as non-root user.
Hello list,
I have been investigating the slow performance of pacman regarding the cold caches scenario and I'm trying to write some proof of concept code that improves things a lot for some cases. However I need your help regarding some facts I might have misunderstood, and any pointers to the source code you also give me would also help a lot. I wouldn't like to lose time changing stuff that would break current functionality. So here are some first questions that come to mind, just by using strace:
When doing "pacman -Q blah" I can see that besides the getdents() syscalls in /var/lib/pacman/local (probably caused by readdir()), there are also stat() and access() calls for every single subdirectory. Why are the last ones necessary? Isn't readdir enough?
The same goes when doing "pacman -S blah". But in that case it stat()'s both 'local' and 'sync' directories, so worst case is really bad, it will stat() all contents of local, core, extra and community...
In the case of "pacman -S" I measured that a great deal of time goes also to reading the "depends" files of all packages in local, please enlighten me what this is for. I have thought of a new way to store dependencies that should improve things, but I should first be sure it doesn't break anything and get some measurements myself.
depends files are read in order to ensure that the upgraded package won't break any "old" dependencies. Example: local foo requires bar=2.0 (which is installed) Then "pacman -S bar" is not allowed (if bar in sync has different version).
Thanks in advance, Dimitris
P.S. Is there some option --pretend I might have missed? What I need is to get exactly the same actions of "pacman -S blah" or "pacman -Su" until the Y/N prompt, as non-root user.
-Su is even worse: We have to read all desc files in sync (to get %REPLACES%) Bye
On Fri, 11 Dec 2009, Nagy Gabor wrote:
depends files are read in order to ensure that the upgraded package won't break any "old" dependencies.
Example: local foo requires bar=2.0 (which is installed) Then "pacman -S bar" is not allowed (if bar in sync has different version).
I just noticed that local depends are read even when installing a new package (not upgrading an old one). Why is this for?
By debugging I found one more reason: Conflict checking. %CONFLICTS% field is also stored in depends file. Bye
On Fri, 11 Dec 2009, Nagy Gabor wrote:
depends files are read in order to ensure that the upgraded package won't break any "old" dependencies.
Example: local foo requires bar=2.0 (which is installed) Then "pacman -S bar" is not allowed (if bar in sync has different version).
I see now, thanks! So if we somehow had an %RDEPENDS% field (reverse-dependencies) for every package in local that would not be necessary. I will see if this is doable during every install.
P.S. Is there some option --pretend I might have missed? What I need is to get exactly the same actions of "pacman -S blah" or "pacman -Su" until the Y/N prompt, as non-root user.
-Su is even worse: We have to read all desc files in sync (to get %REPLACES%)
Ah you are right, all desc files are being read. It really takes a lot of time on my old laptop. If this is only for the %REPLACES% field then a first action would be to put that info in a separate file or in the "depends" file. I'll try that too, thanks for your help! Dimitris
Dimitrios Apostolou wrote:
On Fri, 11 Dec 2009, Nagy Gabor wrote:
depends files are read in order to ensure that the upgraded package won't break any "old" dependencies.
Example: local foo requires bar=2.0 (which is installed) Then "pacman -S bar" is not allowed (if bar in sync has different version).
I see now, thanks! So if we somehow had an %RDEPENDS% field (reverse-dependencies) for every package in local that would not be necessary. I will see if this is doable during every install.
pacman used to do such a thing, but from my understanding it caused more issues than it solved so it was removed. Allan
On Fri, 11 Dec 2009, Nagy Gabor wrote:
depends files are read in order to ensure that the upgraded package won't break any "old" dependencies.
Example: local foo requires bar=2.0 (which is installed) Then "pacman -S bar" is not allowed (if bar in sync has different version).
I just noticed that local depends are read even when installing a new package (not upgrading an old one). Why is this for? Dimitris
Dimitrios Apostolou wrote:
On Fri, 11 Dec 2009, Nagy Gabor wrote:
depends files are read in order to ensure that the upgraded package won't break any "old" dependencies.
Example: local foo requires bar=2.0 (which is installed) Then "pacman -S bar" is not allowed (if bar in sync has different version).
I just noticed that local depends are read even when installing a new package (not upgrading an old one). Why is this for?
How about this example: pacman -S foo, foo replaces=bar and provides=bar, installed package baz depends on foo>=2.0. This should fail as bar only provides foo and so it is unknown whether the foo>=2.0 dependency for baz is solved. Allan
On Sat, 12 Dec 2009, Allan McRae wrote:
Dimitrios Apostolou wrote:
On Fri, 11 Dec 2009, Nagy Gabor wrote:
depends files are read in order to ensure that the upgraded package won't break any "old" dependencies.
Example: local foo requires bar=2.0 (which is installed) Then "pacman -S bar" is not allowed (if bar in sync has different version).
I just noticed that local depends are read even when installing a new package (not upgrading an old one). Why is this for?
How about this example:
pacman -S foo, foo replaces=bar and provides=bar, installed package baz depends on foo>=2.0.
If already installed package baz depends on foo, then foo is already installed. Perhaps you mean it depends on bar>=2?
This should fail as bar only provides foo and so it is unknown whether the foo>=2.0 dependency for baz is solved.
And here you probably mean that "foo" only provides "bar" without version. So doing a pacman -S foo should remove "bar" because of /replaces/, but should also keep it because of "baz" requiring the specific version already installed. Anyway I think I get it, it always needs to read local depends. It is getting far more complex than I initially thought... But even for this complex case, an RDEPENDS field for local packages would help significantly. Perhaps you remember when it was discussed again, date or pacman version? Dimitris
On Sat, 12 Dec 2009, Dimitrios Apostolou wrote:
P.S. Is there some option --pretend I might have missed? What I need is to get exactly the same actions of "pacman -S blah" or "pacman -Su" until the Y/N prompt, as non-root user.
Regarding this issue, it's the first I tried to fix since I didn't want to run my random changes as root. Fortunately I accidentaly found about -p option (--print-uris) which does exactly what I need, so I quickly hacked a -P (--pretend) version for sync operations, which runs as user and only outputs package list and sizes.
I /think/ that this functionality is required by packagekit, so that it can automatically notify the user when updates become available. I remember on fedora something similar is "yum check-update".
Ofcourse it would be much nicer if it worked for all operations, not only sync. For example in a recursive remove it could notify the user of all packages that would be removed, the size to be freed etc without needing root or locking the db. But I'll skip that part, since my focus is elsewhere.
What do you think? Dimitris
What do you think about Xav's --print patch: http://code.toofishes.net/cgit/xavier/pacman.git/log/?h=print Bye
On Sat, Dec 12, 2009 at 5:46 AM, Allan McRae <allan@archlinux.org> wrote:
Dimitrios Apostolou wrote:
On Fri, 11 Dec 2009, Nagy Gabor wrote:
depends files are read in order to ensure that the upgraded package won't break any "old" dependencies.
Example: local foo requires bar=2.0 (which is installed) Then "pacman -S bar" is not allowed (if bar in sync has different version).
I see now, thanks! So if we somehow had an %RDEPENDS% field (reverse-dependencies) for every package in local that would not be necessary. I will see if this is doable during every install.
pacman used to do such a thing, but from my understanding it caused more issues than it solved so it was removed.
This was the former %REQUIREDBY% field (http://code.toofishes.net/cgit/dan/pacman.git/commit/?id=7219326dd4d01d7e49b...). It did end up causing more problems than it solved, as they never seemed to be stored quite right. Instead, we switched to computing them on the fly. This is the reason for the delay on -Qi in the cold cache case, for instance. I think most of this thread is the wrong approach to the problem. Rather than try to meld the DB to fit pacman, we should just swap out DBs so it doesn't have these bad worst-case conditions. $ du -sh --apparent-size local/ sync/* 15M local/ 8.4M sync/community 31K sync/community-testing 812K sync/core 11M sync/extra 302K sync/testing With those numbers in mind, we're talking about ~30 to ~35 MB of raw text data here. That is not a lot of data; most hard drives have at least ~30 MB/sec performance so this whole mess could be read in under a second if it was stored in a single file. So that would be one way of thinking about the issue differently. Other ways of course have come up many times on this list. BDB, Sqlite, reading the compressed DB directly, etc. -Dan
On Sat, 12 Dec 2009, Dimitrios Apostolou wrote:
P.S. Is there some option --pretend I might have missed? What I need is to get exactly the same actions of "pacman -S blah" or "pacman -Su" until the Y/N prompt, as non-root user.
Regarding this issue, it's the first I tried to fix since I didn't want to run my random changes as root. Fortunately I accidentaly found about -p option (--print-uris) which does exactly what I need, so I quickly hacked a -P (--pretend) version for sync operations, which runs as user and only outputs package list and sizes. I /think/ that this functionality is required by packagekit, so that it can automatically notify the user when updates become available. I remember on fedora something similar is "yum check-update". Ofcourse it would be much nicer if it worked for all operations, not only sync. For example in a recursive remove it could notify the user of all packages that would be removed, the size to be freed etc without needing root or locking the db. But I'll skip that part, since my focus is elsewhere. What do you think? Dimitris
On Sat, 12 Dec 2009, Dimitrios Apostolou wrote:
Hello list,
I have been investigating the slow performance of pacman regarding the cold caches scenario and I'm trying to write some proof of concept code that improves things a lot for some cases. However I need your help regarding some facts I might have misunderstood, and any pointers to the source code you also give me would also help a lot. I wouldn't like to lose time changing stuff that would break current functionality. So here are some first questions that come to mind, just by using strace:
When doing "pacman -Q blah" I can see that besides the getdents() syscalls in /var/lib/pacman/local (probably caused by readdir()), there are also stat() and access() calls for every single subdirectory. Why are the last ones necessary? Isn't readdir enough?
The same goes when doing "pacman -S blah". But in that case it stat()'s both 'local' and 'sync' directories, so worst case is really bad, it will stat() all contents of local, core, extra and community...
Regarding the stat() and access() operations I finally found out why they happen exactly:
In case of corrupted db the sync, for example, directory might contain files, not subdirectories. So in that case _alpm_db_populate() just makes sure it's a directory. However stat()ing thousands of files is too much of a price to pay. Similarly, access() checks it is accessible by the user.
In the attached patch I have just removed the relevant lines, with the following rationale: In the rare case of corrupted db, even if we do open("sync/not_a_dir/depends") it will still fail and we'll catch the failure there, no need to investigate the cause further, just write a message like "couldn't access sync/not_a_dir/depends".
By dropping caches ("echo 3 > /proc/sys/vm/drop_caches") before running, I measure a nice performance boost on my old laptop: "pacman -Q gdb" time is reduced from about 7s to 2.5s.
Hm. This is a nice time boost... Did you test this with other operations, too?
What do you think? Is it possible to remove those checks? Dimitris
The best solution would be to rewrite our whole database crap as Dan said. I am pretty sure that this patch would not cause any harm irl, but our code would become a little bit more dangerous: As I see, db_read(INFRQ_BASE) would become a ~NOP function and db_populate would become a simple "ls" function (the only remaining sanity check is splitname). Bye
On Sat, 12 Dec 2009, Nagy Gabor wrote:
What do you think about Xav's --print patch: http://code.toofishes.net/cgit/xavier/pacman.git/log/?h=print
Looks ideal! Will it be included in the master branch? If only I new about that some hours ago... Dimitris
On Sat, 12 Dec 2009, Dimitrios Apostolou wrote:
Hello list,
I have been investigating the slow performance of pacman regarding the cold caches scenario and I'm trying to write some proof of concept code that improves things a lot for some cases. However I need your help regarding some facts I might have misunderstood, and any pointers to the source code you also give me would also help a lot. I wouldn't like to lose time changing stuff that would break current functionality. So here are some first questions that come to mind, just by using strace:
When doing "pacman -Q blah" I can see that besides the getdents() syscalls in /var/lib/pacman/local (probably caused by readdir()), there are also stat() and access() calls for every single subdirectory. Why are the last ones necessary? Isn't readdir enough?
The same goes when doing "pacman -S blah". But in that case it stat()'s both 'local' and 'sync' directories, so worst case is really bad, it will stat() all contents of local, core, extra and community...
Regarding the stat() and access() operations I finally found out why they happen exactly: In case of corrupted db the sync, for example, directory might contain files, not subdirectories. So in that case _alpm_db_populate() just makes sure it's a directory. However stat()ing thousands of files is too much of a price to pay. Similarly, access() checks it is accessible by the user. In the attached patch I have just removed the relevant lines, with the following rationale: In the rare case of corrupted db, even if we do open("sync/not_a_dir/depends") it will still fail and we'll catch the failure there, no need to investigate the cause further, just write a message like "couldn't access sync/not_a_dir/depends". By dropping caches ("echo 3 > /proc/sys/vm/drop_caches") before running, I measure a nice performance boost on my old laptop: "pacman -Q gdb" time is reduced from about 7s to 2.5s. What do you think? Is it possible to remove those checks? Dimitris P.S. Now all that remains is the depends/conflicts/requiredby stuff which is by far the hardest... I'm still trying to decipher the patch implementing REQUIREDBY that was posted earlier.
On Sat, 12 Dec 2009, Nagy Gabor wrote:
On Sat, 12 Dec 2009, Dimitrios Apostolou wrote: Regarding the stat() and access() operations I finally found out why they happen exactly:
In case of corrupted db the sync, for example, directory might contain files, not subdirectories. So in that case _alpm_db_populate() just makes sure it's a directory. However stat()ing thousands of files is too much of a price to pay. Similarly, access() checks it is accessible by the user.
In the attached patch I have just removed the relevant lines, with the following rationale: In the rare case of corrupted db, even if we do open("sync/not_a_dir/depends") it will still fail and we'll catch the failure there, no need to investigate the cause further, just write a message like "couldn't access sync/not_a_dir/depends".
By dropping caches ("echo 3 > /proc/sys/vm/drop_caches") before running, I measure a nice performance boost on my old laptop: "pacman -Q gdb" time is reduced from about 7s to 2.5s.
Hm. This is a nice time boost... Did you test this with other operations, too?
I didn't time it, but strace shows this improvement applies to -Qi, -Si, -Su as well. It doesn't show that much however because all these operations actually read() thousands of files (depends, desc) which is much worse than stat(). :-)
What do you think? Is it possible to remove those checks? Dimitris
The best solution would be to rewrite our whole database crap as Dan said. I am pretty sure that this patch would not cause any harm irl, but
Because I really like the ease of use of the current format, I'll try improving things with minimum changes to it. If we can avoid a complete backend rewrite with minor changes, that is a good think, isn't it?
our code would become a little bit more dangerous: As I see, db_read(INFRQ_BASE) would become a ~NOP function and db_populate would become a simple "ls" function (the only remaining sanity check is splitname).
Exactly! Just a simple ls should be necessary, that was my initial motivation. And I have thought of a way to even avoid that readdir(), but I should get some measurements first. Dimitris
Dan McGee wrote:
I think most of this thread is the wrong approach to the problem. Rather than try to meld the DB to fit pacman, we should just swap out DBs so it doesn't have these bad worst-case conditions.
$ du -sh --apparent-size local/ sync/* 15M local/ 8.4M sync/community 31K sync/community-testing 812K sync/core 11M sync/extra 302K sync/testing
With those numbers in mind, we're talking about ~30 to ~35 MB of raw text data here. That is not a lot of data; most hard drives have at least ~30 MB/sec performance so this whole mess could be read in under a second if it was stored in a single file. So that would be one way of thinking about the issue differently.
Has anyone looked at the idea of using a single tarball for each db recently? From memory, most of the issues of treating the reading from archives in a style like a file stream had been solved when looking at changelogs. Having said that, I have not looked at that code since it was implemented... Allan
On Sat, 12 Dec 2009, Nagy Gabor wrote:
(the only remaining sanity check is splitname).
Correct, and it has just caught an error for me: $ ./src/pacman/.libs/lt-pacman -Sup :: Starting full system upgrade... error: invalid name for database entry '.lastupdate' error: invalid name for database entry '.lastupdate' error: invalid name for database entry '.lastupdate' local database is up to date It seems there is a non-directory file inside the sync db, I missed this one among the heaps of strace output. Is this intentional? Is this actually used? If it is, and since _alpm_db_populate() knows about the db structure, it should avoid that specific case. The following patch takes care of that special case and also reserves all hidden files for special database purposes, not db entries: --- a/lib/libalpm/be_files.c +++ b/lib/libalpm/be_files.c @@ -238,7 +238,8 @@ int _alpm_db_populate(pmdb_t *db) const char *name = ent->d_name; pmpkg_t *pkg; - if(strcmp(name, ".") == 0 || strcmp(name, "..") == 0) { + /* skip hidden files and '.' and '..' subdirectories */ + if (name[0] == '.') { continue; } pkg = _alpm_pkg_new(); Dimitris
On 13/12/2009, at 2:31 AM, Nagy Gabor wrote:
On Sat, 12 Dec 2009, Dimitrios Apostolou wrote:
Hello list,
I have been investigating the slow performance of pacman regarding the cold caches scenario and I'm trying to write some proof of concept code that improves things a lot for some cases. However I need your help regarding some facts I might have misunderstood, and any pointers to the source code you also give me would also help a lot. I wouldn't like to lose time changing stuff that would break current functionality. So here are some first questions that come to mind, just by using strace:
When doing "pacman -Q blah" I can see that besides the getdents() syscalls in /var/lib/pacman/local (probably caused by readdir()), there are also stat() and access() calls for every single subdirectory. Why are the last ones necessary? Isn't readdir enough?
The same goes when doing "pacman -S blah". But in that case it stat()'s both 'local' and 'sync' directories, so worst case is really bad, it will stat() all contents of local, core, extra and community...
Regarding the stat() and access() operations I finally found out why they happen exactly:
In case of corrupted db the sync, for example, directory might contain files, not subdirectories. So in that case _alpm_db_populate() just makes sure it's a directory. However stat()ing thousands of files is too much of a price to pay. Similarly, access() checks it is accessible by the user.
In the attached patch I have just removed the relevant lines, with the following rationale: In the rare case of corrupted db, even if we do open("sync/not_a_dir/depends") it will still fail and we'll catch the failure there, no need to investigate the cause further, just write a message like "couldn't access sync/not_a_dir/depends".
By dropping caches ("echo 3 > /proc/sys/vm/drop_caches") before running, I measure a nice performance boost on my old laptop: "pacman -Q gdb" time is reduced from about 7s to 2.5s.
Hm. This is a nice time boost... Did you test this with other operations, too?
What do you think? Is it possible to remove those checks? Dimitris
The best solution would be to rewrite our whole database crap as Dan said. I am pretty sure that this patch would not cause any harm irl, but our code would become a little bit more dangerous: As I see, db_read(INFRQ_BASE) would become a ~NOP function and db_populate would become a simple "ls" function (the only remaining sanity check is splitname).
It occurs to me time and time again, that it would be a good idea to try and abstract the database functions, to allow different database backends to be "plugged in." This would make experimentation with backends a lot easier, since you just compile a different file in (the interface remains the same). A library called libpkg[1] does something along these lines, by leveraging function pointers. Unfortunately I don't have a lot of time to look into it further, but it's an interesting idea. [1]: http://libpkg.berlios.de/doc/trunk/html/pkg__db_8h_source.html
On Fri, 11 Dec 2009, Nagy Gabor wrote:
-Su is even worse: We have to read all desc files in sync (to get %REPLACES%)
I changed pacman DB using the following script, and applied the attached script to pacman so that "pacman -Su" is possible only by reading "depends" files, not "desc". for d in $PACDB/sync/*/* $PACDB/local/* do sed -ne '/%REPLACES%/,/^$/p' $d/desc >> $d/depends sed -i -e '/%REPLACES%/,/^$/d' $d/desc done The timings I got for the command "pacman -Sup -b $PACDB" are the following. Before: 2m15s (cold caches) 0.98s (hot caches) After: 2m11s (cold caches) 0.78s (hot caches) Unfortunately it didn't provide such a big improvement as I expected. I can now justify this because desc files in sync/ were being read *instead* of depends, not in addition to them. So I _replaced_ the read("*/desc") with read("*/depends"), but disk seeks are still happening, even for these smaller files. I believe big differences will show when people have many upgrades pending, in which case the already cached depends will come handy. Unfortunately I only had 6 upgrades pending. But I think it's vital that not all descriptions and stuff are read during a common operation like -Su, and with the provided script (run only for local, since sync should be upgraded on the server) the transition should be smooth for end-users. What's your opinion? Dimitris
participants (5)
-
Allan McRae
-
Dan McGee
-
Dimitrios Apostolou
-
Nagy Gabor
-
Sebastian Nowicki