[pacman-dev] [PATCH] Write mtree of package files to local database
When installing a package, write an mtree of the package files into the local database. This will be useful for doing validation of all files on a system. Signed-off-by: Allan McRae <allan@archlinux.org> --- Query: should we keep the info on .INSTALL and .CHANGELOG files? Changing a .INSTALL file would be an interesting tactic, but if someone is doing that then they can already adjust the mtree file... Also, from http://goo.gl/Uq6X5 it appears that this could be made more efficient by reusing the file descriptor, but I could not get that working after many, many, many attempts. lib/libalpm/add.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 68 insertions(+), 0 deletions(-) diff --git a/lib/libalpm/add.c b/lib/libalpm/add.c index b007766..31313a7 100644 --- a/lib/libalpm/add.c +++ b/lib/libalpm/add.c @@ -449,6 +449,68 @@ needbackup_cleanup: return errors; } +static int write_package_mtree(alpm_handle_t *handle, alpm_pkg_t *pkg, const char *file) +{ + char *path; + struct archive *archive, *mtree; + struct archive_entry *entry; + struct stat buf; + char buffer[ALPM_BUFFER_SIZE]; + int fd, ret = 0, read = 0; + + _alpm_log(handle, ALPM_LOG_DEBUG, "creating package mtree\n"); + + fd = _alpm_open_archive(handle, file, &buf, + &archive, ALPM_ERR_PKG_OPEN); + if(fd < 0) { + ret = ALPM_ERR_PKG_OPEN; + goto error; + } + + if((mtree = archive_write_new()) == NULL) { + ret = ALPM_ERR_LIBARCHIVE; + archive_read_finish(archive); + goto error; + } + + archive_write_set_format_mtree(mtree); + archive_write_set_compression_gzip(mtree); + + /* output the type, uid, gid, mode, size, time, md5 and link fields */ + archive_write_set_options(mtree, "use-set,!device,!flags,!gname,!nlink,!uname,md5"); + + path = _alpm_local_db_pkgpath(handle->db_local, pkg, "files.mtree"); + + if(archive_write_open_filename(mtree, path) != ARCHIVE_OK) { + _alpm_log(handle, ALPM_LOG_ERROR, _("could not open file %s: %s\n"), + path, archive_error_string(archive)); + ret = ALPM_ERR_DB_WRITE; + goto cleanup; + } + + while(archive_read_next_header(archive, &entry) == ARCHIVE_OK) { + const char *entryname = archive_entry_pathname(entry); + if(*entryname != '.') { + archive_write_header(mtree, entry); + while ((read = archive_read_data(archive, &buffer, ALPM_BUFFER_SIZE)) > 0) { + archive_write_data(mtree, buffer, read); + } + } + } + +cleanup: + archive_read_finish(archive); + archive_write_finish(mtree); + free(path); + + if(ret == 0) { + return 0; + } + +error: + RET_ERR(handle, ret, -1); +} + static int commit_single_pkg(alpm_handle_t *handle, alpm_pkg_t *newpkg, size_t pkg_current, size_t pkg_count) { @@ -589,6 +651,12 @@ static int commit_single_pkg(alpm_handle_t *handle, alpm_pkg_t *newpkg, archive_read_finish(archive); CLOSE(fd); + if(write_package_mtree(handle, newpkg, pkgfile) == -1) + { + _alpm_log(handle, ALPM_LOG_WARNING, + _("could not create mtree file (%s)\n"), alpm_strerror(handle->pm_errno)); + } + /* restore the old cwd if we have it */ if(cwdfd >= 0) { if(fchdir(cwdfd) != 0) { -- 1.7.9.1
On Tue, Feb 21, 2012 at 10:02 PM, Allan McRae <allan@archlinux.org> wrote:
When installing a package, write an mtree of the package files into the local database. This will be useful for doing validation of all files on a system.
Signed-off-by: Allan McRae <allan@archlinux.org> ---
Query: should we keep the info on .INSTALL and .CHANGELOG files? Changing a .INSTALL file would be an interesting tactic, but if someone is doing that then they can already adjust the mtree file...
Also, from http://goo.gl/Uq6X5 it appears that this could be made more efficient by reusing the file descriptor, but I could not get that working after many, many, many attempts. Did you rewind the file descriptor? You should just have to call `lseek(fd, 0, SEEK_SET)` first. Of course, since the current version of _alpm_open_archive does both the open() and archive_read_new() business, the abstraction there would have to change.
With that said, not having to decompress everything twice would also be a win; I saw some chatter about this on IRC but I would definitely prefer to not iterate again; removing the iteration from the diskspace sped it up enough that I enabled that by default; I don't want to lose those gains. -Dan
lib/libalpm/add.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 68 insertions(+), 0 deletions(-)
diff --git a/lib/libalpm/add.c b/lib/libalpm/add.c index b007766..31313a7 100644 --- a/lib/libalpm/add.c +++ b/lib/libalpm/add.c @@ -449,6 +449,68 @@ needbackup_cleanup: return errors; }
+static int write_package_mtree(alpm_handle_t *handle, alpm_pkg_t *pkg, const char *file) +{ + char *path; + struct archive *archive, *mtree; + struct archive_entry *entry; + struct stat buf; + char buffer[ALPM_BUFFER_SIZE]; + int fd, ret = 0, read = 0; + + _alpm_log(handle, ALPM_LOG_DEBUG, "creating package mtree\n"); + + fd = _alpm_open_archive(handle, file, &buf, + &archive, ALPM_ERR_PKG_OPEN); + if(fd < 0) { + ret = ALPM_ERR_PKG_OPEN; + goto error; + } + + if((mtree = archive_write_new()) == NULL) { + ret = ALPM_ERR_LIBARCHIVE; + archive_read_finish(archive); + goto error; You're leaking the read archive here.
+ } + + archive_write_set_format_mtree(mtree); + archive_write_set_compression_gzip(mtree); + + /* output the type, uid, gid, mode, size, time, md5 and link fields */ + archive_write_set_options(mtree, "use-set,!device,!flags,!gname,!nlink,!uname,md5"); Did 'use-set' end up being a net-win on size and/or speed?
+ + path = _alpm_local_db_pkgpath(handle->db_local, pkg, "files.mtree"); + + if(archive_write_open_filename(mtree, path) != ARCHIVE_OK) { + _alpm_log(handle, ALPM_LOG_ERROR, _("could not open file %s: %s\n"), + path, archive_error_string(archive)); + ret = ALPM_ERR_DB_WRITE; Extra space snuck in here.
+ goto cleanup; + } + + while(archive_read_next_header(archive, &entry) == ARCHIVE_OK) { + const char *entryname = archive_entry_pathname(entry); + if(*entryname != '.') { + archive_write_header(mtree, entry); + while ((read = archive_read_data(archive, &buffer, ALPM_BUFFER_SIZE)) > 0) { + archive_write_data(mtree, buffer, read); Something tells me we should be checking the return value of this call and handling it accordingly. We check the read calls every time, but I would expect the write calls to have a higher chance of failing in real life. + } + } + } + +cleanup: + archive_read_finish(archive); + archive_write_finish(mtree); + free(path); You're leaking the fd.
+ + if(ret == 0) { + return 0; + } + +error: + RET_ERR(handle, ret, -1); This had me confused for a bit; I wouldn't call the variable that ends up being stashed on pm_errno "ret"; when in fact -1 is always returned from here, not ret. +} + static int commit_single_pkg(alpm_handle_t *handle, alpm_pkg_t *newpkg, size_t pkg_current, size_t pkg_count) { @@ -589,6 +651,12 @@ static int commit_single_pkg(alpm_handle_t *handle, alpm_pkg_t *newpkg, archive_read_finish(archive); CLOSE(fd);
+ if(write_package_mtree(handle, newpkg, pkgfile) == -1) + { + _alpm_log(handle, ALPM_LOG_WARNING, + _("could not create mtree file (%s)\n"), alpm_strerror(handle->pm_errno)); + } + /* restore the old cwd if we have it */ if(cwdfd >= 0) { if(fchdir(cwdfd) != 0) { -- 1.7.9.1
On 23/02/12 17:09, Dan McGee wrote:
On Tue, Feb 21, 2012 at 10:02 PM, Allan McRae <allan@archlinux.org> wrote:
When installing a package, write an mtree of the package files into the local database. This will be useful for doing validation of all files on a system.
Signed-off-by: Allan McRae <allan@archlinux.org> ---
Query: should we keep the info on .INSTALL and .CHANGELOG files? Changing a .INSTALL file would be an interesting tactic, but if someone is doing that then they can already adjust the mtree file...
Also, from http://goo.gl/Uq6X5 it appears that this could be made more efficient by reusing the file descriptor, but I could not get that working after many, many, many attempts. Did you rewind the file descriptor? You should just have to call `lseek(fd, 0, SEEK_SET)` first. Of course, since the current version of _alpm_open_archive does both the open() and archive_read_new() business, the abstraction there would have to change.
Ah... lseek was the key. I can do that and make the abstraction to _alpm_open archive(). But it will not be needed if...
With that said, not having to decompress everything twice would also be a win; I saw some chatter about this on IRC but I would definitely prefer to not iterate again; removing the iteration from the diskspace sped it up enough that I enabled that by default; I don't want to lose those gains.
I think this can be done. But it is far from simple. It involves us doing an archive_read_data() to read the data into a buffer, duplicating that buffer and then passing one copy to the archive_write_data() for the file on disk and the other to the write for the mtree archive. It means that we can not use the convenience function archive_read_extract() and that is a big convenience... archive_read_extract(), archive_read_extract_set_skip_file(): A convenience function that wraps the corresponding archive_write_disk(3) interfaces. The first call to archive_read_extract() creates a restore object using archive_write_disk_new(3) and archive_write_disk_set_standard_lookup(3), then transparently invokes archive_write_disk_set_options(3), archive_write_header(3), archive_write_data(3), and archive_write_finish_entry(3) to create the entry on disk and copy data into it. The flags argument is passed unmodified to archive_write_disk_set_options(3). So we would have to duplicate that entire functionality... <snip>
+ /* output the type, uid, gid, mode, size, time, md5 and link fields */ + archive_write_set_options(mtree, "use-set,!device,!flags,!gname,!nlink,!uname,md5"); Did 'use-set' end up being a net-win on size and/or speed?
The size is much small for the raw file when using 'use-set' but that difference entirely disappears when compressing with gzip. In the brief tests I did, the reading was slightly faster using 'use-set'. So, should I go ahead and write a version of archive_read_extract into a function that does both the extraction and mtree creation? Or do people see another way around this? Allan
On 28/02/12 14:35, Allan McRae wrote:
On 23/02/12 17:09, Dan McGee wrote:
On Tue, Feb 21, 2012 at 10:02 PM, Allan McRae <allan@archlinux.org> wrote:
When installing a package, write an mtree of the package files into the local database. This will be useful for doing validation of all files on a system.
Signed-off-by: Allan McRae <allan@archlinux.org> ---
Query: should we keep the info on .INSTALL and .CHANGELOG files? Changing a .INSTALL file would be an interesting tactic, but if someone is doing that then they can already adjust the mtree file...
Also, from http://goo.gl/Uq6X5 it appears that this could be made more efficient by reusing the file descriptor, but I could not get that working after many, many, many attempts. Did you rewind the file descriptor? You should just have to call `lseek(fd, 0, SEEK_SET)` first. Of course, since the current version of _alpm_open_archive does both the open() and archive_read_new() business, the abstraction there would have to change.
Ah... lseek was the key. I can do that and make the abstraction to _alpm_open archive(). But it will not be needed if...
With that said, not having to decompress everything twice would also be a win; I saw some chatter about this on IRC but I would definitely prefer to not iterate again; removing the iteration from the diskspace sped it up enough that I enabled that by default; I don't want to lose those gains.
I think this can be done. But it is far from simple. It involves us doing an archive_read_data() to read the data into a buffer, duplicating that buffer and then passing one copy to the archive_write_data() for the file on disk and the other to the write for the mtree archive. It means that we can not use the convenience function archive_read_extract() and that is a big convenience...
archive_read_extract(), archive_read_extract_set_skip_file(): A convenience function that wraps the corresponding archive_write_disk(3) interfaces. The first call to archive_read_extract() creates a restore object using archive_write_disk_new(3) and archive_write_disk_set_standard_lookup(3), then transparently invokes archive_write_disk_set_options(3), archive_write_header(3), archive_write_data(3), and archive_write_finish_entry(3) to create the entry on disk and copy data into it. The flags argument is passed unmodified to archive_write_disk_set_options(3).
So we would have to duplicate that entire functionality...
<snip>
+ /* output the type, uid, gid, mode, size, time, md5 and link fields */ + archive_write_set_options(mtree, "use-set,!device,!flags,!gname,!nlink,!uname,md5"); Did 'use-set' end up being a net-win on size and/or speed?
The size is much small for the raw file when using 'use-set' but that difference entirely disappears when compressing with gzip. In the brief tests I did, the reading was slightly faster using 'use-set'.
So, should I go ahead and write a version of archive_read_extract into a function that does both the extraction and mtree creation? Or do people see another way around this?
In fact... thinking about this further, we do not need to duplicate the read buffer at all, given it is not destroyed during the write operation. So that simplifies things a bit. Also, I was going to need to do all the checks for file writing for the mtree file anyway so that can be abstracted used for the extraction checking too. So the only real addition is all the checking for file creation and setting its mode etc. I'll probably take a stab at this in the next couple of weeks. Allan
participants (2)
-
Allan McRae
-
Dan McGee