[pacman-dev] [PATCH 0/4] Package list find performance improvements
This series of patches makes finding a package in our linked list implementation a whole lot faster, if that search is using the standard _alpm_pkg_find, which nearly all are (after the first patch). It does this by adding a hash function to util.c which is nothing too complicated and named after a publicly available algorithm. When packages are created, we fill in this hash value as soon as the pkgname is read. Finally, the _alpm_pkg_find function is rewritten to take advantage of this field, avoiding repeated strcmp() calls and only falling back to that if a hash is not available and to verify the hash value was not some sort of collision. Performance figures and numbers are available in the last patch. This actually speeds up operations by nearly 33%, so this is not a total waste of time to consider. :) Review and questions/comments/concerns welcome! -Dan Dan McGee (4): Use _alpm_pkg_find in deps search Add hash_sdbm function When setting package name, set hash value as well Used hashed package name in _alpm_pkg_find lib/libalpm/be_package.c | 1 + lib/libalpm/deps.c | 4 ++-- lib/libalpm/package.c | 16 ++++++++++++++-- lib/libalpm/package.h | 1 + lib/libalpm/util.c | 22 ++++++++++++++++++++++ lib/libalpm/util.h | 1 + 6 files changed, 41 insertions(+), 4 deletions(-) -- 1.7.3.3
Signed-off-by: Dan McGee <dan@archlinux.org> --- lib/libalpm/deps.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/libalpm/deps.c b/lib/libalpm/deps.c index cdaf524..2157829 100644 --- a/lib/libalpm/deps.c +++ b/lib/libalpm/deps.c @@ -236,8 +236,8 @@ alpm_list_t SYMEXPORT *alpm_checkdeps(alpm_list_t *pkglist, int reversedeps, targets = alpm_list_join(alpm_list_copy(remove), alpm_list_copy(upgrade)); for(i = pkglist; i; i = i->next) { - void *pkg = i->data; - if(alpm_list_find(targets, pkg, _alpm_pkg_cmp)) { + pmpkg_t *pkg = i->data; + if(_alpm_pkg_find(targets, pkg->name)) { modified = alpm_list_add(modified, pkg); } else { dblist = alpm_list_add(dblist, pkg); -- 1.7.3.3
This is prepping for the addition of a hash field to each package to greatly speed up the string comparisons we frequently do on package name in _alpm_pkg_find. Signed-off-by: Dan McGee <dan@archlinux.org> --- lib/libalpm/util.c | 21 +++++++++++++++++++++ lib/libalpm/util.h | 1 + 2 files changed, 22 insertions(+), 0 deletions(-) diff --git a/lib/libalpm/util.c b/lib/libalpm/util.c index e425fa4..8e83bda 100644 --- a/lib/libalpm/util.c +++ b/lib/libalpm/util.c @@ -845,4 +845,25 @@ int _alpm_splitname(const char *target, pmpkg_t *pkg) return(0); } +/** + * Hash the given string to an unsigned long value. + * This is the standard sdbm hashing algorithm. + * @param str string to hash + * @return the hash value of the given string + */ +unsigned long _alpm_hash_sdbm(const char *str) +{ + unsigned long hash = 0; + int c; + + if(!str) { + return(hash); + } + while((c = *str++)) { + hash = c + (hash << 6) + (hash << 16) - hash; + } + + return(hash); +} + /* vim: set ts=2 sw=2 noet: */ diff --git a/lib/libalpm/util.h b/lib/libalpm/util.h index 78877a2..0b80420 100644 --- a/lib/libalpm/util.h +++ b/lib/libalpm/util.h @@ -78,6 +78,7 @@ int _alpm_lstat(const char *path, struct stat *buf); int _alpm_test_md5sum(const char *filepath, const char *md5sum); char *_alpm_archive_fgets(char *line, size_t size, struct archive *a); int _alpm_splitname(const char *target, pmpkg_t *pkg); +unsigned long _alpm_hash_sdbm(const char *str); #ifndef HAVE_STRSEP char *strsep(char **, const char *); -- 1.7.3.3
Signed-off-by: Dan McGee <dan@archlinux.org> --- lib/libalpm/be_package.c | 1 + lib/libalpm/package.c | 1 + lib/libalpm/package.h | 1 + lib/libalpm/util.c | 1 + 4 files changed, 4 insertions(+), 0 deletions(-) diff --git a/lib/libalpm/be_package.c b/lib/libalpm/be_package.c index ae95194..32ec993 100644 --- a/lib/libalpm/be_package.c +++ b/lib/libalpm/be_package.c @@ -179,6 +179,7 @@ static int parse_descfile(struct archive *a, pmpkg_t *newpkg) ptr = _alpm_strtrim(ptr); if(strcmp(key, "pkgname") == 0) { STRDUP(newpkg->name, ptr, RET_ERR(PM_ERR_MEMORY, -1)); + newpkg->name_hash = _alpm_hash_sdbm(newpkg->name); } else if(strcmp(key, "pkgver") == 0) { STRDUP(newpkg->version, ptr, RET_ERR(PM_ERR_MEMORY, -1)); } else if(strcmp(key, "pkgdesc") == 0) { diff --git a/lib/libalpm/package.c b/lib/libalpm/package.c index 57ab9a6..5b479b4 100644 --- a/lib/libalpm/package.c +++ b/lib/libalpm/package.c @@ -414,6 +414,7 @@ pmpkg_t *_alpm_pkg_dup(pmpkg_t *pkg) CALLOC(newpkg, 1, sizeof(pmpkg_t), RET_ERR(PM_ERR_MEMORY, NULL)); + newpkg->name_hash = pkg->name_hash; STRDUP(newpkg->filename, pkg->filename, RET_ERR(PM_ERR_MEMORY, newpkg)); STRDUP(newpkg->name, pkg->name, RET_ERR(PM_ERR_MEMORY, newpkg)); STRDUP(newpkg->version, pkg->version, RET_ERR(PM_ERR_MEMORY, newpkg)); diff --git a/lib/libalpm/package.h b/lib/libalpm/package.h index 9a6f0cb..d1abb55 100644 --- a/lib/libalpm/package.h +++ b/lib/libalpm/package.h @@ -89,6 +89,7 @@ struct pkg_operations { extern struct pkg_operations default_pkg_ops; struct __pmpkg_t { + unsigned long name_hash; char *filename; char *name; char *version; diff --git a/lib/libalpm/util.c b/lib/libalpm/util.c index 8e83bda..1291ea0 100644 --- a/lib/libalpm/util.c +++ b/lib/libalpm/util.c @@ -840,6 +840,7 @@ int _alpm_splitname(const char *target, pmpkg_t *pkg) FREE(pkg->name); } STRDUP(pkg->name, tmp, RET_ERR(PM_ERR_MEMORY, -1)); + pkg->name_hash = _alpm_hash_sdbm(pkg->name); free(tmp); return(0); -- 1.7.3.3
This results in huge gains to a lot of our codepaths since this is the most frequent method of random access to packages in a list. The gains are seen in both profiling and real life. $ pacman -Sii zvbi real: 0.41 sec -> 0.32 sec strcmp: 16,669,760 calls -> 473,942 calls _alpm_pkg_find: 52.73% -> 26.31% of time $ pacman -Su (no upgrades found) real: 0.40 sec -> 0.50 sec strcmp: 19,497,226 calls -> 524,097 calls _alpm_pkg_find: 52.36% -> 26.15% of time There is some minor risk with this patch, but most of it should be avoided by falling back to strcmp() if we encounter a package with a '0' hash value (which we should not via any existing code path). We also do a strcmp once hash values match to ensure against hash collisions. The risk left is that a package name is modified once it was originally set, but the hash value is left alone. That would probably result in a lot of other problems anyway. Signed-off-by: Dan McGee <dan@archlinux.org> --- lib/libalpm/package.c | 15 +++++++++++++-- 1 files changed, 13 insertions(+), 2 deletions(-) diff --git a/lib/libalpm/package.c b/lib/libalpm/package.c index 5b479b4..2ea5125 100644 --- a/lib/libalpm/package.c +++ b/lib/libalpm/package.c @@ -552,6 +552,7 @@ int _alpm_pkg_cmp(const void *p1, const void *p2) pmpkg_t *_alpm_pkg_find(alpm_list_t *haystack, const char *needle) { alpm_list_t *lp; + unsigned long needle_hash; ALPM_LOG_FUNC; @@ -559,11 +560,21 @@ pmpkg_t *_alpm_pkg_find(alpm_list_t *haystack, const char *needle) return(NULL); } + needle_hash = _alpm_hash_sdbm(needle); + for(lp = haystack; lp; lp = lp->next) { pmpkg_t *info = lp->data; - if(info && strcmp(info->name, needle) == 0) { - return(info); + if(info) { + /* a zero hash will cause a fall-through just in case */ + if(info->name_hash && info->name_hash != needle_hash) { + continue; + } + + /* finally: we had hash match, verify string match */ + if(strcmp(info->name, needle) == 0) { + return(info); + } } } return(NULL); -- 1.7.3.3
On Tue, Dec 14, 2010 at 7:46 PM, Dan McGee <dan@archlinux.org> wrote:
This series of patches makes finding a package in our linked list implementation a whole lot faster, if that search is using the standard _alpm_pkg_find, which nearly all are (after the first patch).
It does this by adding a hash function to util.c which is nothing too complicated and named after a publicly available algorithm. When packages are created, we fill in this hash value as soon as the pkgname is read. Finally, the _alpm_pkg_find function is rewritten to take advantage of this field, avoiding repeated strcmp() calls and only falling back to that if a hash is not available and to verify the hash value was not some sort of collision.
Performance figures and numbers are available in the last patch. This actually speeds up operations by nearly 33%, so this is not a total waste of time to consider. :) Review and questions/comments/concerns welcome!
That's nice and short :)
-Dan
Dan McGee (4): Use _alpm_pkg_find in deps search Add hash_sdbm function When setting package name, set hash value as well Used hashed package name in _alpm_pkg_find
lib/libalpm/be_package.c | 1 + lib/libalpm/deps.c | 4 ++-- lib/libalpm/package.c | 16 ++++++++++++++-- lib/libalpm/package.h | 1 + lib/libalpm/util.c | 22 ++++++++++++++++++++++ lib/libalpm/util.h | 1 + 6 files changed, 41 insertions(+), 4 deletions(-)
-- 1.7.3.3
On 15/12/10 04:46, Dan McGee wrote:
This series of patches makes finding a package in our linked list implementation a whole lot faster, if that search is using the standard _alpm_pkg_find, which nearly all are (after the first patch).
It does this by adding a hash function to util.c which is nothing too complicated and named after a publicly available algorithm. When packages are created, we fill in this hash value as soon as the pkgname is read. Finally, the _alpm_pkg_find function is rewritten to take advantage of this field, avoiding repeated strcmp() calls and only falling back to that if a hash is not available and to verify the hash value was not some sort of collision.
Performance figures and numbers are available in the last patch. This actually speeds up operations by nearly 33%, so this is not a total waste of time to consider. :) Review and questions/comments/concerns welcome!
My only comment is more of a wondering whether would it be better to have an _alpm_pkg_set_name(pmpkg_t *) function that automatically updates the hash. It is a tradeoff between having to always remember to update the hash after adjusting pmpkg_t->name (seems likely to get missed at some stage) and complexity that I am undecided on. Allan
On Tue, Dec 14, 2010 at 5:00 PM, Allan McRae <allan@archlinux.org> wrote:
On 15/12/10 04:46, Dan McGee wrote:
This series of patches makes finding a package in our linked list implementation a whole lot faster, if that search is using the standard _alpm_pkg_find, which nearly all are (after the first patch).
It does this by adding a hash function to util.c which is nothing too complicated and named after a publicly available algorithm. When packages are created, we fill in this hash value as soon as the pkgname is read. Finally, the _alpm_pkg_find function is rewritten to take advantage of this field, avoiding repeated strcmp() calls and only falling back to that if a hash is not available and to verify the hash value was not some sort of collision.
Performance figures and numbers are available in the last patch. This actually speeds up operations by nearly 33%, so this is not a total waste of time to consider. :) Review and questions/comments/concerns welcome!
My only comment is more of a wondering whether would it be better to have an _alpm_pkg_set_name(pmpkg_t *) function that automatically updates the hash. It is a tradeoff between having to always remember to update the hash after adjusting pmpkg_t->name (seems likely to get missed at some stage) and complexity that I am undecided on.
I thought about that as well; realized I only had to update two places, "forgot" about it. If we did this we would want to do something like we did for db and path: rename the field to _pkgname so people realize it shouldn't be mucked with directly, and all the sudden the patch is huge. So I think it can be missed either way, but here's to automated testing to hopefully catching fallout when someone forgets it. -Dan
On Tue, Dec 14, 2010 at 12:46:15PM -0600, Dan McGee wrote:
This series of patches makes finding a package in our linked list implementation a whole lot faster, if that search is using the standard _alpm_pkg_find, which nearly all are (after the first patch).
It does this by adding a hash function to util.c which is nothing too complicated and named after a publicly available algorithm. When packages are created, we fill in this hash value as soon as the pkgname is read. Finally, the _alpm_pkg_find function is rewritten to take advantage of this field, avoiding repeated strcmp() calls and only falling back to that if a hash is not available and to verify the hash value was not some sort of collision.
Performance figures and numbers are available in the last patch. This actually speeds up operations by nearly 33%, so this is not a total waste of time to consider. :) Review and questions/comments/concerns welcome!
-Dan
Well, nothing's broken so far. Installed a few new packages with -U and ran an -Syu with no trouble. Speed improvement is there, but nothing significant on my smallish database of 500 packages. These patches are boring -- they just work. I thought you had to be brave to use pacman-git. Stop boring me, Dan. (very nice work though) dave
participants (5)
-
Allan McRae
-
Dan McGee
-
Dan McGee
-
Dave Reisner
-
Xavier Chantry