[pacman-dev] [PATCH] makepkg: parallelize integrity checks
This enables parallel integrity checks in makepkg within a given family of integrity sums. Subshell jobs for each source file are kicked off and run in parallel, and then we wait for each of them in turn to complete and print the same information as before.

Note that programming sense says this loop should be done differently for filesystem access reasons; doing all checks for a given file would make more sense rather than running through the filelist multiple times. However, that would be a very different patch than what this is trying to accomplish.

On a PKGBUILD completely suited to the task, containing md5sums and sha256sums of several large data files, as well as a failing integrity check, this brings execution time way down:

$ time makepkg -f 2>/dev/null
real    0m7.924s
user    0m7.293s
sys     0m0.480s

$ time ~/projects/pacman/scripts/makepkg -f 2>/dev/null
real    0m2.447s
user    0m7.470s
sys     0m0.537s

Signed-off-by: Dan McGee <dan@archlinux.org>
---
Said PKGBUILD from above:

pkgname=integspeed
pkgver=1.0
pkgrel=1
arch=('any')
source=(eclipse-3.6.1-1-x86_64.pkg.tar.xz
        eclipse-3.6.2-1-x86_64.pkg.tar.xz
        nltk-data-2.0-5-any.pkg.tar.xz
        openoffice-base-3.2.1-5-x86_64.pkg.tar.xz
        smc-1.9-8-x86_64.pkg.tar.xz
        texlive-core-2010.20954-2-any.pkg.tar.xz
        tremulous-data-1.1.0-1-any.pkg.tar.gz)
md5sums=('6ab0488636b4b52957a8d069c4330d3d'
         'e889ada205ab8b9b17c64de6c7b62956'
         '968bebb2a8e77b2cd2ee2bbb3d9122d4'
         '17d59366ef890dab62bebfe786a0cccb'
         '129fa6ed2208b355e8e55bdd38ae6677'
         '36437e05082be8c4e6764fc2bcfe4f4b'
         '107d367e2e0245b38402f457c1d735f7')
sha256sums=('d0fdac48f982f4f7794187c6652fb55ac6c63c439a62476bcf90addad25f9274'
            'fc30a73f6313ba202a6fb4fd112de93c88c28b3f7af68dd0789d3317afe89741'
            'aec68224fd31713ff0cee49fb3c27cf5be82ada03657b9d592ab029c5df1ad92'
            '1c9404f87a8b00bd07d226ca93c7254985af3195719a5ddb69be40faa78914c4'
            'bbde245ff84a3ec93f76a53b97d2d261193699a55881209ffae9310de00c6ad4'
            '10a0ba732253d89ef3b6d72486474a356179e4665efc6482ac3c47c891f7f3a2'
            '53d990b7123d12409290c3a3c8b6be32e2887c88a0650cd338a7f58cc951f4d9')

 scripts/makepkg.sh.in |   47 +++++++++++++++++++++++++++++++++--------------
 1 files changed, 33 insertions(+), 14 deletions(-)

diff --git a/scripts/makepkg.sh.in b/scripts/makepkg.sh.in
index 3d5184a..92b7597 100644
--- a/scripts/makepkg.sh.in
+++ b/scripts/makepkg.sh.in
@@ -627,33 +627,52 @@ check_checksums() {
         correlation=1
         local errors=0
         local idx=0
+        local -a jobs
+        local job
         local file
         for file in "${source[@]}"; do
-            local found=1
-            file="$(get_filename "$file")"
-            echo -n "    $file ... " >&2
+            (
+                file="$(get_filename "$file")"

-            if ! file="$(get_filepath "$file")"; then
-                echo "$(gettext "NOT FOUND")" >&2
-                errors=1
-                found=0
-            fi
+                if ! file="$(get_filepath "$file")"; then
+                    exit 2
+                fi

-            if (( $found )) ; then
                 local expectedsum=$(tr '[:upper:]' '[:lower:]' <<< "${integrity_sums[$idx]}")
                 local realsum="$(openssl dgst -${integ} "$file")"
                 realsum="${realsum##* }"
                 if [[ $expectedsum = $realsum ]]; then
-                    echo "$(gettext "Passed")" >&2
-                else
-                    echo "$(gettext "FAILED")" >&2
-                    errors=1
+                    exit 0
                 fi
-            fi
+                exit 1
+            ) &
+            jobs[$idx]=$!
             idx=$((idx + 1))
         done

+        idx=0
+        while [[ $idx -lt ${#source[@]} ]]; do
+            file="$(get_filename "${source[$idx]}")"
+            job=${jobs[$idx]}
+            status=0
+            wait $job || status=$?
+            echo -n "    $file ... " >&2
+            if [[ $status -eq 0 ]]; then
+                echo "$(gettext "Passed")" >&2
+            elif [[ $status -eq 1 ]]; then
+                echo "$(gettext "FAILED")" >&2
+                errors=1
+            elif [[ $status -eq 2 ]]; then
+                echo "$(gettext "NOT FOUND")" >&2
+                errors=1
+            else
+                echo "$(gettext "ERROR")" >&2
+                errors=1
+            fi
+            idx=$((idx + 1))
+        done
+
         if (( errors )); then
             error "$(gettext "One or more files did not pass the validity check!")"
             exit 1 # TODO: error code
--
1.7.4.4
On Fri, Apr 22, 2011 at 11:16 AM, Dan McGee <dan@archlinux.org> wrote:
This enables parallel integrity checks in makepkg within a given family of integrity sums. Subshell jobs for each source file are kicked off and run in parallel, and then we wait for each of them in turn to complete and print the same information as before.
Note that programming sense says this loop should be done differently for filesystem access reasons; doing all checks for a given file would make more sense rather than running through the filelist multiple times. However, that would be a very different patch than what this is trying to accomplish.
Two other things worth mentioning: 1. We don't limit the number of jobs here in any way, so in theory you could have a lot... 2. Applying this to source file extraction would be the next logical step, as that is a much slower part than this, and we might as well use more cores since all extraction programs we use are single-threaded. -Dan
On 23/04/11 02:22, Dan McGee wrote:
On Fri, Apr 22, 2011 at 11:16 AM, Dan McGee <dan@archlinux.org> wrote:
This enables parallel integrity checks in makepkg within a given family of integrity sums. Subshell jobs for each source file are kicked off and run in parallel, and then we wait for each of them in turn to complete and print the same information as before.
Note that programming sense says this loop should be done differently for filesystem access reasons; doing all checks for a given file would make more sense rather than running through the filelist multiple times. However, that would be a very different patch than what this is trying to accomplish.
Two other things worth mentioning: 1. We don't limit the number of jobs here in any way, so in theory you could have a lot... 2. Applying this to source file extraction would be the next logical step, as that is a much slower part than this, and we might as well use more cores since all extraction programs we use are single-threaded.
I'd be very careful about applying this to extraction. The main package I maintain where this would be useful is gcc, but there the source files have lots of overlapping directories and I would want to be sure no race condition occurred in extracting them. Allan
On Fri, Apr 22, 2011 at 3:35 PM, Allan McRae <allan@archlinux.org> wrote:
On 23/04/11 02:22, Dan McGee wrote:
On Fri, Apr 22, 2011 at 11:16 AM, Dan McGee <dan@archlinux.org> wrote:
This enables parallel integrity checks in makepkg within a given family of integrity sums. Subshell jobs for each source file are kicked off and run in parallel, and then we wait for each of them in turn to complete and print the same information as before.
Note that programming sense says this loop should be done differently for filesystem access reasons; doing all checks for a given file would make more sense rather than running through the filelist multiple times. However, that would be a very different patch than what this is trying to accomplish.
Two other things worth mentioning: 1. We don't limit the number of jobs here in any way, so in theory you could have a lot... 2. Applying this to source file extraction would be the next logical step, as that is a much slower part than this, and we might as well use more cores since all extraction programs we use are single-threaded.
I'd be very careful about applying this to extraction. The main package I maintain where this would be useful is gcc, but there the source files have lots of overlapping directories and I would want to be sure no race condition occurred in extracting them.
Wait, seriously? So both archive A and archive B have a file that extracts to the same path? Is it so insane that each archive doesn't end up in its own folder anyway? -Dan
This one could definitely benefit from limiting the number of parallel threads, especially when dealing with the GCC PKGBUILD. Definitely a lot of contention and times varied from twice as fast to a few seconds slower, depending on if the cache decided it needed to flush out some data. Limiting the number of threads to the number of CPUs would probably go a long way to resolving some of this contention for IO bandwidth.

Signed-off-by: Dan McGee <dan@archlinux.org>
---
 scripts/makepkg.sh.in |   23 ++++++++++++++++++-----
 1 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/scripts/makepkg.sh.in b/scripts/makepkg.sh.in
index 92b7597..b9588a1 100644
--- a/scripts/makepkg.sh.in
+++ b/scripts/makepkg.sh.in
@@ -692,6 +692,8 @@ check_checksums() {
 extract_sources() {
     msg "$(gettext "Extracting Sources...")"
     local netfile
+    local -a jobs
+    local job
     for netfile in "${source[@]}"; do
         local file=$(get_filename "$netfile")
         if in_array "$file" ${noextract[@]}; then
@@ -732,15 +734,26 @@ extract_sources() {
             fi ;;
         esac

-        local ret=0
         msg2 "$(gettext "Extracting %s with %s")" "$file" "$cmd"
         if [[ $cmd = bsdtar ]]; then
-            $cmd -xf "$file" || ret=$?
+            (
+                $cmd -xf "$file"
+            ) &
+            job=$!
         else
-            rm -f "${file%.*}"
-            $cmd -dcf "$file" > "${file%.*}" || ret=$?
+            (
+                rm -f "${file%.*}"
+                $cmd -dcf "$file" > "${file%.*}"
+            ) &
+            job=$!
         fi
-        if (( ret )); then
+        # push job id onto jobs stack
+        jobs[${#jobs[@]}]=$job
+    done
+
+    for job in ${jobs[@]}; do
+        wait $job
+        if (( $? )); then
             error "$(gettext "Failed to extract %s")" "$file"
             plain "$(gettext "Aborting...")"
             exit 1
--
1.7.4.4
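By way of illustration, a cap like the one described above could be built in plain bash roughly as follows. This is only a sketch, not part of either patch; the archive names, the extract_one helper, and the nproc fallback are all invented.

#!/bin/bash
# Sketch: cap the number of concurrent extraction subshells at the CPU count
# by waiting on the oldest outstanding job once the limit is reached.
max_jobs=$(nproc 2>/dev/null || echo 2)    # fall back to 2 if nproc is missing
declare -a pids
archives=(gcc-core.tar.bz2 gcc-gxx.tar.bz2 gcc-fortran.tar.bz2)

extract_one() {
    bsdtar -xf "$1"    # stand-in for the real per-format extraction command
}

for a in "${archives[@]}"; do
    if (( ${#pids[@]} >= max_jobs )); then
        wait "${pids[0]}"          # block on the oldest outstanding job
        pids=("${pids[@]:1}")      # drop it from the front of the queue
    fi
    ( extract_one "$a" ) &
    pids+=("$!")
done
wait    # drain whatever is still running

Waiting on the oldest job keeps the bookkeeping trivial, at the cost of occasionally idling a slot while a newer job has already finished.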
On 23/04/11 08:04, Dan McGee wrote:
This one could definitely benefit from limiting the number of parallel threads, especially when dealing with the GCC PKGBUILD. Definitely a lot of contention and times varied from twice as fast to a few seconds slower, depending on if the cache decided it needed to flush out some data. Limiting the number of threads to the number of CPUs would probably go a long way to resolving some of this contention for IO bandwidth.
Signed-off-by: Dan McGee <dan@archlinux.org>
Took this for a spin and it does make a difference for the GCC PKGBUILD. Reduced total extraction time from 35sec to 25sec on my laptop. I was actually surprised it made that much difference given I figured this would be more disk-speed bound than CPU bound...

My main concern is still what happens if two processes try to extract the same directory at the same time. I guess such an occurrence would be very rare, and perhaps bsdtar actually would gracefully handle this, but it is something to consider.

Also, the extraction actually seemed slower than it was because of the output. On the non-parallel version, you get a visual cue on how far through the extraction process you are (in terms of number of files extracted), but with the parallel extraction, all "Extracting" output is printed at once and then there is a big wait. I guess that could be adjusted.

So overall, I am +/-0 on this...

Allan
On Sat, Apr 23, 2011 at 1:10 AM, Allan McRae <allan@archlinux.org> wrote:
On 23/04/11 08:04, Dan McGee wrote:
This one could definitely benefit from limiting the number of parallel threads, especially when dealing with the GCC PKGBUILD. Definitely a lot of contention and times varied from twice as fast to a few seconds slower, depending on if the cache decided it needed to flush out some data. Limiting the number of threads to the number of CPUs would probably go a long way to resolving some of this contention for IO bandwidth.
Signed-off-by: Dan McGee <dan@archlinux.org>
Took this for a spin and it does make a difference for the GCC PKGBUILD. Reduced total extraction time from 35sec to 25sec on my laptop. I was actually surprised it made that much difference given I figured this would be more disk-speed bound than CPU bound...
My main concern is still what happens if two processes try to extract the same directory at the same time. I guess such an occurrence would be very rare, and perhaps bsdtar actually would gracefully handle this, but it is something to consider.

For the < 1% that would have a problem with this, I might say noextract=() is the answer?
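As a purely hypothetical illustration of that route (package and file names invented), a PKGBUILD with overlapping tarballs could list them in noextract and unpack them serially itself:

# Hypothetical PKGBUILD fragment: archives listed in noextract are skipped by
# makepkg's automatic extraction, so the build can unpack them one at a time.
noextract=('gcc-core-4.6.0.tar.bz2' 'gcc-g++-4.6.0.tar.bz2')

build() {
    cd "$srcdir"
    for f in "${noextract[@]}"; do
        bsdtar -xf "$f"    # unpacked serially, so no directory races
    done
    # ... rest of the build follows
}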
Also, the extraction actually seemed slower than it was because of the output. On the non-parallel version, you get a visual cue on how far through the extraction process you are (in terms of number of files extracted), but with the parallel extraction, all "Extracting" output is printed at once and then there is a big wait. I guess that could be adjusted.
Not quite true. We start X jobs, but then have to wait for those X jobs. For sanity, we wait on each job in the order it was started, and you are hitting the common case of the biggest file being first; thus, when the output from that job appears, we then notice that the X - 1 jobs following it have already finished and can immediately print the output for them. Not really easy to adjust, outside of iterating the jobs list backwards with the heuristic that most people tend to put big files first, or patching bash's wait to have better semantics. Thread handling capabilities aren't exactly stellar in a shell script, so I did the best I could here. -Dan
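One conceivable adjustment, sketched below with made-up file names and a sleep standing in for the real checksum work, is to let each subshell print its own result line as it finishes; output then arrives in completion order, at the cost of possible interleaving. This is not what the submitted patch does.

#!/bin/bash
files=(big-first.tar.xz small-a.tar.xz small-b.tar.xz)
declare -a pids

check_one() {
    sleep $(( RANDOM % 3 ))        # stand-in for the real checksum comparison
    echo "    $1 ... Passed" >&2   # each job reports as soon as it finishes
}

for f in "${files[@]}"; do
    ( check_one "$f" ) &
    pids+=("$!")
done

errors=0
for pid in "${pids[@]}"; do
    wait "$pid" || errors=1        # exit codes are still collected in start order
done
if (( errors )); then
    echo "one or more checks failed" >&2
fi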
On 23/04/11 02:16, Dan McGee wrote:
This enables parallel integrity checks in makepkg within a given family of integrity sums. Subshell jobs for each source file are kicked off and run in parallel, and then we wait for each of them in turn to complete and print the same information as before.
Note that programming sense says this loop should be done differently for filesystem access reasons; doing all checks for a given file would make more sense rather than running through the filelist multiple times. However, that would be a very different patch than what this is trying to accomplish.
On a PKGBUILD completely suited to the task, containing md5sums and sha256sums of several large data files, as well as a failing integrity check, this brings execution time way down:
$ time makepkg -f 2>/dev/null
real    0m7.924s
user    0m7.293s
sys     0m0.480s

$ time ~/projects/pacman/scripts/makepkg -f 2>/dev/null
real    0m2.447s
user    0m7.470s
sys     0m0.537s
Signed-off-by: Dan McGee <dan@archlinux.org>
Ack. I guess we should also do the same for generating the checksums with "makepkg -g". Allan
On Sat, Apr 23, 2011 at 1:22 AM, Allan McRae <allan@archlinux.org> wrote:
On 23/04/11 02:16, Dan McGee wrote:
This enables parallel integrity checks in makepkg within a given family of integrity sums. Subshell jobs for each source file are kicked off and run in parallel, and then we wait for each of them in turn to complete and print the same information as before.
Note that programming sense says this loop should be done differently for filesystem access reasons; doing all checks for a given file would make more sense rather than running through the filelist multiple times. However, that would be a very different patch than what this is trying to accomplish.
On a PKGBUILD completely suited to the task, containing md5sums and sha256sums of several large data files, as well as a failing integrity check, this brings execution time way down:
$ time makepkg -f 2>/dev/null
real    0m7.924s
user    0m7.293s
sys     0m0.480s

$ time ~/projects/pacman/scripts/makepkg -f 2>/dev/null
real    0m2.447s
user    0m7.470s
sys     0m0.537s
Signed-off-by: Dan McGee <dan@archlinux.org>
Ack. I guess we should also do the same for generating the checksums with "makepkg -g".
We could; I didn't do it because:
* this one would be a lot trickier (we need to somehow return a checksum string, not just a return code)
* in theory this is done less often than checking and extraction
-Dan
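For the record, one way the string-return problem could conceivably be worked around is to have each subshell write its digest to an index-named temp file and have the parent read them back in source order. The sketch below is an assumption-laden illustration, not makepkg's actual -g code; the file names, the md5 choice, and the output layout are invented.

#!/bin/bash
files=(data-a.tar.xz data-b.tar.xz data-c.tar.xz)
tmpdir=$(mktemp -d) || exit 1

idx=0
for f in "${files[@]}"; do
    (
        sum=$(openssl dgst -md5 "$f")               # same digest tool the check path uses
        printf '%s\n' "${sum##* }" > "$tmpdir/$idx" # keep only the hash, keyed by index
    ) &
    idx=$((idx + 1))
done
wait    # every digest is on disk once this returns

printf 'md5sums=('
for (( i = 0; i < ${#files[@]}; i++ )); do
    (( i > 0 )) && printf '\n         '
    printf "'%s'" "$(cat "$tmpdir/$i")"
done
printf ')\n'
rm -rf "$tmpdir"

Temp files avoid having to parse interleaved stdout from the background jobs, at the cost of a little extra disk traffic.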