[pacman-dev] Parallelising integrity checks

Dan McGee dpmcgee at gmail.com
Fri Feb 25 11:13:37 EST 2011


On Sat, Feb 19, 2011 at 6:11 PM, Tavian Barnes
<tavianator at tavianator.com> wrote:
> Oh I get it.  Well, the code in both is pretty similar, but git seems
> to support HPUX while x264 doesn't.  Also, git just uses the number of
> online processors, so "taskset 0x1 git gc" runs N threads at once on 1
> core, while x264 uses the process's CPU affinity on Linux and Windows
> to behave better in that case.
Yeah, that seems nice but also overkill. git has a config var to limit
thread count; we could add one to our config file and just fall back
to online_cpus() if it is not provided.
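
A minimal sketch of that fallback, under stated assumptions:
pick_thread_count() and its "configured" parameter are hypothetical
names, and online_cpus() here is a stand-in built on
sysconf(_SC_NPROCESSORS_ONLN), a common extension on Linux and the BSDs
rather than something POSIX guarantees. git's own implementation covers
more platforms.

```c
#include <assert.h>
#include <unistd.h>

/* Stand-in for online_cpus(): number of online processors, never below 1. */
static long online_cpus(void)
{
	long n = sysconf(_SC_NPROCESSORS_ONLN);
	return n > 0 ? n : 1;
}

/* Hypothetical helper. A configured value <= 0 means the option was not
 * set in the config file, so fall back to the online CPU count. */
static long pick_thread_count(long configured)
{
	long n = configured > 0 ? configured : online_cpus();
	assert(n >= 1);
	return n;
}
```

With this shape, something like "taskset 0x1" can be accommodated by the
user setting the option to 1, without any affinity handling in the code.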

> Anyway, that's not the main point.  Are you guys interested in this
> change?  I'm almost done with a better version of the patch that adds
> an _alpm_for_each_cpu() function (to util.h) which takes a callback
> function to call in N threads.
>
> On a related note, I just tried running the test suite after entirely
> patching out integrity checks, and there weren't any regressions.
> Maybe the test suite should test the handling of corrupt packages?  I
> can add a test case myself if you want, once I've figured out how the
> test suite works.
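
For reference, a helper of the kind described in the quoted text could
look roughly like the sketch below. The real signature of
_alpm_for_each_cpu() is not given in this thread, so the name
for_each_cpu(), the callback type, and the mark_slots() example are all
assumptions; it uses plain pthreads and the online CPU count from
sysconf(_SC_NPROCESSORS_ONLN).

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical callback type: index is this thread's number in
 * [0, nthreads), ctx is shared caller data. */
typedef void (*cpu_cb)(int index, int nthreads, void *ctx);

struct worker_arg {
	cpu_cb cb;
	int index;
	int nthreads;
	void *ctx;
};

static void *worker_main(void *p)
{
	struct worker_arg *arg = p;
	arg->cb(arg->index, arg->nthreads, arg->ctx);
	return NULL;
}

/* Run cb once per online CPU, each call in its own thread, and wait for
 * all of them. Returns 0 on success, -1 if allocation or thread creation
 * failed (in which case some slices of the work were not processed). */
static int for_each_cpu(cpu_cb cb, void *ctx)
{
	long n = sysconf(_SC_NPROCESSORS_ONLN);
	int nthreads = n > 0 ? (int)n : 1;
	int started = 0, ret = 0;
	pthread_t *tids;
	struct worker_arg *args;

	assert(cb != NULL);
	tids = calloc((size_t)nthreads, sizeof(*tids));
	args = calloc((size_t)nthreads, sizeof(*args));
	if(!tids || !args) {
		free(tids);
		free(args);
		return -1;
	}
	for(int i = 0; i < nthreads; i++) {
		args[i] = (struct worker_arg){ cb, i, nthreads, ctx };
		if(pthread_create(&tids[i], NULL, worker_main, &args[i]) != 0) {
			ret = -1;
			break;
		}
		started++;
	}
	/* Join only the threads that were actually created. */
	for(int i = 0; i < started; i++) {
		pthread_join(tids[i], NULL);
	}
	free(tids);
	free(args);
	return ret;
}

/* Example callback: thread k handles items k, k + nthreads, ... */
struct mark_ctx {
	int *slots;
	int count;
};

static void mark_slots(int index, int nthreads, void *ctx)
{
	struct mark_ctx *m = ctx;
	for(int i = index; i < m->count; i += nthreads) {
		m->slots[i] = 1;
	}
}
```

Striding by nthreads keeps the callback free of locking as long as each
item is independent, which would be the case for per-package checksums.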

Tests for this would definitely be nice. You will probably have to
teach pactest how to create a broken package and/or database entry.

I don't want to review these just yet, as I want to focus my time on
getting 3.5.0 released. I will add the following, though, and maybe you
can take it into account in the patchset: we do a lot of things we
could parallelize and/or combine.

Steps I know of and notes about them:
* "Checking integrity" is really two things: md5sum iterations on the
file, and then an alpm_pkg_load() call to build the package object and
create the filelist. Not sure how you incorporated this, but it is at
least something to think about.
* If diskspace checking is enabled, we iterate over all of the package
contents yet again, reading through the whole archive. This could be
eliminated if we grabbed the necessary data in pkg_load, which I
believe is simply some parts of the stat buffer and the type of the
entry. This would also be hugely helpful in conflict checking, where
we don't have this info available, and you will see some comments
alluding to the "12 checks we do in add.c" or something.
* Downloads. I see calls to do this in parallel a lot and I continue
to think this is stupid, but maybe that is just me. If you can't find
a mirror that saturates your connection, look around; we have a lot.
* File conflicts. We've made this one pretty damn fast already, so it
is probably not worth parallelizing.

-Dan

