[arch-dev-public] [PATCH 0/4] create-filelists patchset
This is a series of patches to make create-filelists a lot more efficient at what it does, and also make the files database a lot more useful. The end-user result is that the files database also includes the 'desc' and 'depends' entries found in a normal .db.tar.gz database.

As far as other changes, the package loop rework patch is the most important one. Rather than inefficiently unzip every package to get the .PKGINFO file and determine its name and version, we use the .db.tar.gz directly which saves us a ton of work in most cases.

Comments welcome; I'm not sure who is the head honcho that will pull these in, but it does help set up re-adding file support in archweb in addition to making this cron job suck a bit less. We might even think about running it more often than once a day now.

-Dan

Dan McGee (4):
  create-filelists: general cleanups
  create-filelists: s/REPO_DB_FILE/FILES_DB_FILE/g
  create-filelists: rework the package loop completely
  create-filelists: include desc/depends entries

 cron-jobs/create-filelists |   78 ++++++++++++++++++++++++++++----------------
 1 files changed, 50 insertions(+), 28 deletions(-)
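As a rough illustration of the loop rework described in the cover letter (this is not code from the patches themselves), the difference boils down to something like the sketch below; the repo path, package glob, and db file name are hypothetical examples:

#!/bin/bash
# Illustrative sketch only; the path, glob, and db name below are made up
# and are not taken from the patchset.
repodir="/srv/ftp/extra/os/x86_64"

# Old approach: crack open every package for its .PKGINFO just to learn
# its name and version (roughly what the getpkgname/getpkgver helpers do).
for pkg in "$repodir"/*.pkg.tar.gz; do
    bsdtar -xOf "$pkg" .PKGINFO | grep -E '^pkg(name|ver) = '
done

# New approach: the repo db already has one "name-version-release/" entry
# per package, so listing it tells us what exists without reading any
# package file.
bsdtar -tf "$repodir/extra.db.tar.gz" | sed -n 's|/$||p'

With the db-driven listing, only packages whose 'files' entry is missing from the cached files database still need a bsdtar -tf pass over the actual package file.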
[PATCH 1/4] create-filelists: general cleanups

* Specify lock name once
* Use new script name everywhere
* Clean up tabs/spaces and add a modeline. This isn't necessarily the
  one we wanted to standardize on, but I picked the one the entire file
  is written to at the moment.

Signed-off-by: Dan McGee <dan@archlinux.org>
---
 cron-jobs/create-filelists |   23 +++++++++++++----------
 1 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/cron-jobs/create-filelists b/cron-jobs/create-filelists
index c9d7db9..e90da00 100755
--- a/cron-jobs/create-filelists
+++ b/cron-jobs/create-filelists
@@ -3,27 +3,28 @@
 reposdir="/srv/ftp"
 targetdir="/srv/ftp"
 repos="core extra testing community community-testing"
+lock="/tmp/create-filelists.lock"
 
 . "$(dirname $0)/../db-functions"
 . "$(dirname $0)/../config"
 
-if [ -f "/tmp/createFileList.lock" ]; then
-    echo "Error: createFileList allready in progress."
+if [ -f "$lock" ]; then
+    echo "Error: create-filelists already in progress."
     exit 1
 fi
 
-touch "/tmp/createFileList.lock" || exit 1
-TMPDIR="$(mktemp -d /tmp/createFileList.XXXXXX)" || exit 1
-CACHEDIR="$(mktemp -d /tmp/createFileList.XXXXXX)" || exit 1
+touch "$lock" || exit 1
+TMPDIR="$(mktemp -d /tmp/create-filelists.XXXXXX)" || exit 1
+CACHEDIR="$(mktemp -d /tmp/create-filelists.XXXXXX)" || exit 1
 
 #adjust the nice level to run at a lower priority
 /usr/bin/renice +10 -p $$ > /dev/null
 
 case "${DBEXT}" in
-    *.gz) TAR_OPT="z" ;;
-    *.bz2) TAR_OPT="j" ;;
-    *.xz) TAR_OPT="J" ;;
-    *) echo "Unknown compression type for DBEXT=${DBEXT}" && exit 1 ;;
+    *.gz) TAR_OPT="z" ;;
+    *.bz2) TAR_OPT="j" ;;
+    *.xz) TAR_OPT="J" ;;
+    *) echo "Unknown compression type for DBEXT=${DBEXT}" && exit 1 ;;
 esac
 
 FILESEXT="${DBEXT//db/files}"
@@ -77,5 +78,7 @@ done
 cd - >/dev/null
 rm -rf "$TMPDIR" || exit 1
 rm -rf "$CACHEDIR" || exit 1
-rm -f "/tmp/createFileList.lock" || exit 1
+rm -f "$lock" || exit 1
 # echo 'done'
+
+# vim: set ts=4 sw=4 et ft=sh:
-- 
1.7.0
[PATCH 2/4] create-filelists: s/REPO_DB_FILE/FILES_DB_FILE/g

This will set up changes soon to come where we actually use the real
repos DB file so I don't want variable name confusion.

Signed-off-by: Dan McGee <dan@archlinux.org>
---
 cron-jobs/create-filelists |   14 +++++++-------
 1 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/cron-jobs/create-filelists b/cron-jobs/create-filelists
index e90da00..a0b6a57 100755
--- a/cron-jobs/create-filelists
+++ b/cron-jobs/create-filelists
@@ -30,7 +30,7 @@ esac
 FILESEXT="${DBEXT//db/files}"
 
 for repo in $repos; do
-    REPO_DB_FILE="${repo}$FILESEXT"
+    FILES_DB_FILE="${repo}$FILESEXT"
     for arch in ${ARCHES[@]}; do
         cd "$reposdir"
 
@@ -38,9 +38,9 @@ for repo in $repos; do
         cached="no"
 
         # extract old file archive
-        if [ -f "${targetdir}/${repodir}/${REPO_DB_FILE}" ]; then
+        if [ -f "${targetdir}/${repodir}/${FILES_DB_FILE}" ]; then
             mkdir -p "${CACHEDIR}/${repodir}"
-            bsdtar -xf "${targetdir}/${repodir}/${REPO_DB_FILE}" -C "${CACHEDIR}/${repodir}"
+            bsdtar -xf "${targetdir}/${repodir}/${FILES_DB_FILE}" -C "${CACHEDIR}/${repodir}"
             cached="yes"
         fi
 
@@ -64,13 +64,13 @@ for repo in $repos; do
         # create new file archive
         if [ "$cached" == "no" ]; then
             # at least one package has changed, so let's rebuild the archive
-#            echo "creating ${REPO_DB_FILE}/${arch}"
+#            echo "creating ${FILES_DB_FILE}/${arch}"
             pkgdir="${targetdir}/${repodir}"
             mkdir -p "$pkgdir"
             cd "${TMPDIR}/${repodir}"
-            [ -f "${pkgdir}/${REPO_DB_FILE}.old" ] && rm "${pkgdir}/${REPO_DB_FILE}.old"
-            [ -f "${pkgdir}/${REPO_DB_FILE}" ] && mv "${pkgdir}/${REPO_DB_FILE}" "${pkgdir}/${REPO_DB_FILE}.old"
-            bsdtar --exclude=*${DBEXT//\.db/} -c${TAR_OPT}f "${pkgdir}/${REPO_DB_FILE}" *
+            [ -f "${pkgdir}/${FILES_DB_FILE}.old" ] && rm "${pkgdir}/${FILES_DB_FILE}.old"
+            [ -f "${pkgdir}/${FILES_DB_FILE}" ] && mv "${pkgdir}/${FILES_DB_FILE}" "${pkgdir}/${FILES_DB_FILE}.old"
+            bsdtar --exclude=*${DBEXT//\.db/} -c${TAR_OPT}f "${pkgdir}/${FILES_DB_FILE}" *
         fi
     done
 done
-- 
1.7.0
[PATCH 3/4] create-filelists: rework the package loop completely

Instead of wasting time extracting .PKGINFO twice from every single
package in the repos, use the package DB to eliminate most of the heavy
lifting. This way we only need to worry about looking at the packages
that actually have changed since the last time we built the package
database. This should give a noticeable performance increase to this
job in addition to reducing IO load and unnecessary reading of every
package file.

Signed-off-by: Dan McGee <dan@archlinux.org>
---
 cron-jobs/create-filelists |   41 ++++++++++++++++++++++++++++-------------
 1 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/cron-jobs/create-filelists b/cron-jobs/create-filelists
index a0b6a57..6091bf4 100755
--- a/cron-jobs/create-filelists
+++ b/cron-jobs/create-filelists
@@ -14,8 +14,12 @@ if [ -f "$lock" ]; then
 fi
 
 touch "$lock" || exit 1
-TMPDIR="$(mktemp -d /tmp/create-filelists.XXXXXX)" || exit 1
-CACHEDIR="$(mktemp -d /tmp/create-filelists.XXXXXX)" || exit 1
+# location where the package DB is extracted so we know what to include
+DBDIR="$(mktemp -d /tmp/create-filelists.dbdir.XXXXXX)" || exit 1
+# location where the old files DB is extracted to save us some work
+CACHEDIR="$(mktemp -d /tmp/create-filelists.cachedir.XXXXXX)" || exit 1
+# location where the new files DB is built up and eventually zipped
+TMPDIR="$(mktemp -d /tmp/create-filelists.tmpdir.XXXXXX)" || exit 1
 
 #adjust the nice level to run at a lower priority
 /usr/bin/renice +10 -p $$ > /dev/null
@@ -30,33 +34,45 @@ esac
 FILESEXT="${DBEXT//db/files}"
 
 for repo in $repos; do
+    REPO_DB_FILE="${repo}$DBEXT"
     FILES_DB_FILE="${repo}$FILESEXT"
     for arch in ${ARCHES[@]}; do
+#        echo "Running for architecture $arch, repo $repo"
         cd "$reposdir"
 
         repodir="${repo}/os/${arch}"
         cached="no"
 
+        # extract package db archive
+        if [ -f "${targetdir}/${repodir}/${REPO_DB_FILE}" ]; then
+            mkdir -p "${DBDIR}/${repodir}"
+#            echo "extracting $REPO_DB_FILE"
+            bsdtar -xf "${targetdir}/${repodir}/${REPO_DB_FILE}" -C "${DBDIR}/${repodir}"
+        else
+            echo "Fail! Does the repo $repo with arch $arch even exist?"
+            continue
+        fi
+
         # extract old file archive
         if [ -f "${targetdir}/${repodir}/${FILES_DB_FILE}" ]; then
            mkdir -p "${CACHEDIR}/${repodir}"
+#            echo "extracting $FILES_DB_FILE"
            bsdtar -xf "${targetdir}/${repodir}/${FILES_DB_FILE}" -C "${CACHEDIR}/${repodir}"
            cached="yes"
         fi
 
         # create file lists
-        for pkg in $repodir/*${PKGEXT}; do
-            pkgname="$(getpkgname "$pkg")"
-            pkgver="$(getpkgver "$pkg")"
-            tmppkgdir="${TMPDIR}/${repodir}/${pkgname}-${pkgver}"
+        for pkg in $(ls ${DBDIR}/${repodir}); do
+            tmppkgdir="${TMPDIR}/${repodir}/${pkg}"
             mkdir -p "$tmppkgdir"
-            if [ -f "${CACHEDIR}/${repodir}/${pkgname}-${pkgver}/files" ]; then
-#                echo "cache: $pkgname"
-                mv "${CACHEDIR}/${repodir}/${pkgname}-${pkgver}/files" "${tmppkgdir}/files"
+            if [ -f "${CACHEDIR}/${repodir}/${pkg}/files" ]; then
+#                echo "cache: $pkg"
+                mv "${CACHEDIR}/${repodir}/${pkg}/files" "${tmppkgdir}/files"
             else
-#                echo "$repo/$arch: $pkgname"
+#                echo "not cache: $repo/$arch: $pkg"
+                filename=$(grep -A1 '^%FILENAME%$' "${DBDIR}/${repodir}/${pkg}/desc" | tail -n1)
                 echo '%FILES%' > "${tmppkgdir}/files"
-                bsdtar --exclude=.* -tf "$pkg" >> "${tmppkgdir}/files"
+                bsdtar --exclude=.* -tf "$repodir/$filename" >> "${tmppkgdir}/files"
                 cached="no"
             fi
         done
@@ -76,8 +92,7 @@ for repo in $repos; do
 done
 
 cd - >/dev/null
-rm -rf "$TMPDIR" || exit 1
-rm -rf "$CACHEDIR" || exit 1
+rm -rf "$TMPDIR" "$CACHEDIR" "$DBDIR"
 rm -f "$lock" || exit 1
 # echo 'done'
-- 
1.7.0
[PATCH 4/4] create-filelists: include desc/depends entries

Make the files DB include everything the original packages DB includes
instead of just being 'files' entries. This will allow tools to do more
with these generated files and they can be used as a drop-in replacement
for a regular package database.

Signed-off-by: Dan McGee <dan@archlinux.org>
---
 cron-jobs/create-filelists |   10 +++++++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/cron-jobs/create-filelists b/cron-jobs/create-filelists
index 6091bf4..84867d8 100755
--- a/cron-jobs/create-filelists
+++ b/cron-jobs/create-filelists
@@ -63,14 +63,18 @@ for repo in $repos; do
         # create file lists
         for pkg in $(ls ${DBDIR}/${repodir}); do
+            dbpkgdir="${DBDIR}/${repodir}/${pkg}"
+            cachepkgdir="${CACHEDIR}/${repodir}/${pkg}"
             tmppkgdir="${TMPDIR}/${repodir}/${pkg}"
             mkdir -p "$tmppkgdir"
-            if [ -f "${CACHEDIR}/${repodir}/${pkg}/files" ]; then
+            ln "${dbpkgdir}/desc" "${tmppkgdir}/desc"
+            ln "${dbpkgdir}/depends" "${tmppkgdir}/depends"
+            if [ -f "${cachepkgdir}/files" ]; then
 #                echo "cache: $pkg"
-                mv "${CACHEDIR}/${repodir}/${pkg}/files" "${tmppkgdir}/files"
+                ln "${cachepkgdir}/files" "${tmppkgdir}/files"
             else
 #                echo "not cache: $repo/$arch: $pkg"
-                filename=$(grep -A1 '^%FILENAME%$' "${DBDIR}/${repodir}/${pkg}/desc" | tail -n1)
+                filename=$(grep -A1 '^%FILENAME%$' "${dbpkgdir}/desc" | tail -n1)
                 echo '%FILES%' > "${tmppkgdir}/files"
                 bsdtar --exclude=.* -tf "$repodir/$filename" >> "${tmppkgdir}/files"
                 cached="no"
             fi
-- 
1.7.0
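For a sense of what the end result looks like, here is a hypothetical peek at a generated files database; the path and the foo-1.0-1 entry are invented examples, but the desc/depends/files layout is what the patch above produces:

# Hypothetical inspection commands; "foo-1.0-1" and the path are made up.
filesdb="/srv/ftp/extra/os/x86_64/extra.files.tar.gz"

# Each package entry now carries desc and depends (hard-linked from the
# extracted package db) in addition to the %FILES% list.
bsdtar -tf "$filesdb" | head
bsdtar -xOf "$filesdb" foo-1.0-1/files | head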
Pierre/Aaron/Thomas, you guys have worked on these the most of anyone. Any thoughts, or can I just push these?

-Dan

On Sat, Feb 27, 2010 at 12:01 PM, Dan McGee <dan@archlinux.org> wrote:
This is a series of patches to make create-filelists a lot more efficient at what it does, and also make the files database a lot more useful. The end-user result is that the files database also includes the 'desc' and 'depends' entries found in a normal .db.tar.gz database.
As far as other changes, the package loop rework patch is the most important one. Rather than inefficiently unzip every package to get the .PKGINFO file and determine its name and version, we use the .db.tar.gz directly which saves us a ton of work in most cases.
Comments welcome; I'm not sure who is the head honcho that will pull these in, but it does help set up re-adding file support in archweb in addition to making this cron job suck a bit less. We might even think about running it more often than once a day now.
-Dan
Dan McGee (4):
  create-filelists: general cleanups
  create-filelists: s/REPO_DB_FILE/FILES_DB_FILE/g
  create-filelists: rework the package loop completely
  create-filelists: include desc/depends entries
 cron-jobs/create-filelists |   78 ++++++++++++++++++++++++++++----------------
 1 files changed, 50 insertions(+), 28 deletions(-)
On Sun, 28 Feb 2010 19:16:10 -0600, Dan McGee <dan@archlinux.org> wrote:
Pierre/Aaron/Thomas, you guys have worked on these the most of anyone. Any thoughts, or can I just push these?
Didn't have time to really think about it. It looks fine, though I don't remember why I didn't implement it that way in the first place.

Maybe in the long term we should add this to repo-add as an optional task. This would be the most efficient and consistent way of doing this instead of running this script via cron job. However, you can push these changes if you like.

And don't worry about the coding style right now; I'll take care of it once I have finished the "Bash Coding Style" document. From that time on we will only accept patches that follow these guidelines. :-)

-- 
Pierre Schmitz, https://users.archlinux.de/~pierre
On Mon, Mar 1, 2010 at 01:16, Pierre Schmitz <pierre@archlinux.de> wrote:
Maybe in the long term we should add this to repo-add as an optional task. This would be the most efficient and consistent way of doing this instead of running this script via cron job.

http://bugs.archlinux.org/task/11302

That would be the best. I haven't implemented it yet, though...
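repo-add does not do this today; purely as a sketch of what such an optional step could look like (placeholder names, not real repo-add internals), it would only need to reuse the %FILES% format the cron job already writes:

# Hypothetical sketch: write a "files" entry next to desc/depends while a
# package's db entry is being generated. $pkgfile and $entrydir are
# placeholders, not actual repo-add variables.
write_files_entry() {
    local pkgfile=$1    # e.g. foo-1.0-1-x86_64.pkg.tar.gz
    local entrydir=$2   # e.g. the foo-1.0-1 directory inside the extracted db

    {
        echo '%FILES%'
        bsdtar --exclude='.*' -tf "$pkgfile"
    } > "$entrydir/files"
}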
On Saturday, 27 February 2010 19:01:32, Dan McGee wrote:
As far as other changes, the package loop rework patch is the most important one. Rather than inefficiently unzip every package to get the .PKGINFO file and determine its name and version, we use the .db.tar.gz directly which saves us a ton of work in most cases.
Do you think this script should lock the repo while reading its content? Otherwise there is a chance it reads a package or db file while it's being modified. We might even lose all previous data, which would cause the script to reread all packages on the next run.

The problem with all this might be that we end up with a bunch of scripts and operations which block each other. ATM, if we move a lot of packages, db-move will lock and unlock the repo for each of them. If one of our cron jobs gets lucky and catches the lock between two packages, the move script will fail and we'll end up with an inconsistent repo.

I guess one step to solve this is this (hard to read) patch:
http://code.phraktured.net/cgit.cgi/dbscripts/commit/?h=working&id=bfaec9eb47c1fe042b83f9539f81dca1cad609a2

And maybe we should even have read and write locks.

-- 
Pierre Schmitz, https://users.archlinux.de/~pierre
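To make the read/write lock idea concrete, here is a minimal flock(1) sketch; the lock path and helper names are assumptions for illustration and nothing below comes from the dbscripts patches:

#!/bin/bash
# Hypothetical shared/exclusive repo locking with flock(1); the lock file
# location is made up for illustration.
repolock="/srv/ftp/extra/os/x86_64/.repolock"

# Readers (e.g. create-filelists) take a shared lock: any number of
# readers may run at once, but they wait while a writer holds the lock.
with_read_lock() {
    (
        flock -s 9 || exit 1
        "$@"
    ) 9>"$repolock"
}

# Writers (e.g. db-move) take one exclusive lock for the whole batch, so
# a reader can never catch the repo between two package moves.
with_write_lock() {
    (
        flock -x 9 || exit 1
        "$@"
    ) 9>"$repolock"
}

# Example: list the package db under a read lock.
with_read_lock bsdtar -tf /srv/ftp/extra/os/x86_64/extra.db.tar.gz > /dev/null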
participants (3):
- Daenyth Blank
- Dan McGee
- Pierre Schmitz