On 28/02/12 14:35, Allan McRae wrote:
On 23/02/12 17:09, Dan McGee wrote:
On Tue, Feb 21, 2012 at 10:02 PM, Allan McRae <allan@archlinux.org> wrote:
When installing a package, write an mtree of the package files into the local database. This will be useful for doing validation of all files on a system.
Signed-off-by: Allan McRae <allan@archlinux.org> ---
Query: should we keep the info on .INSTALL and .CHANGELOG files? Changing a .INSTALL file would be an interesting tactic, but if someone is doing that then they can already adjust the mtree file...
Also, from http://goo.gl/Uq6X5 it appears that this could be made more efficient by reusing the file descriptor, but I could not get that working after many, many, many attempts.
Did you rewind the file descriptor? You should just have to call `lseek(fd, 0, SEEK_SET)` first. Of course, since the current version of _alpm_open_archive does both the open() and archive_read_new() business, the abstraction there would have to change.
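For anyone following along, the rewind can be sketched in isolation. This is only a minimal illustration, not the pacman code: `drain_fd` and `two_passes` are hypothetical names, and a plain read() loop stands in for the two libarchive passes over the package file.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read fd to EOF and return the byte count, or -1. */
long drain_fd(int fd)
{
	char buf[4096];
	long total = 0;
	ssize_t n;

	while((n = read(fd, buf, sizeof(buf))) > 0) {
		total += n;
	}
	return n < 0 ? -1 : total;
}

/* Hypothetical helper: make two passes over one file with a single open(). */
int two_passes(const char *path, long *first, long *second)
{
	int fd = open(path, O_RDONLY);
	if(fd < 0) {
		return -1;
	}

	/* the first pass leaves the file offset at EOF */
	*first = drain_fd(fd);

	/* rewind; without this the second pass would read zero bytes */
	if(lseek(fd, 0, SEEK_SET) == (off_t)-1) {
		close(fd);
		return -1;
	}

	*second = drain_fd(fd);
	close(fd);
	return (*first < 0 || *second < 0) ? -1 : 0;
}
```

Both passes should report the same byte count. Dropping the lseek() call makes the second count zero, which would be consistent with the reuse attempts failing.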
Ah... lseek was the key. I can do that and make the abstraction change to _alpm_open_archive(). But it will not be needed if...
With that said, not having to decompress everything twice would also be a win. I saw some chatter about this on IRC, but I would definitely prefer not to iterate again: removing the iteration from the diskspace check sped it up enough that I enabled it by default, and I don't want to lose those gains.
I think this can be done, but it is far from simple. It involves us doing an archive_read_data() to read the data into a buffer, duplicating that buffer, and then passing one copy to archive_write_data() for the file on disk and the other to the write for the mtree archive. It means that we cannot use the convenience function archive_read_extract(), and that is a big convenience...
archive_read_extract(), archive_read_extract_set_skip_file(): A convenience function that wraps the corresponding archive_write_disk(3) interfaces. The first call to archive_read_extract() creates a restore object using archive_write_disk_new(3) and archive_write_disk_set_standard_lookup(3), then transparently invokes archive_write_disk_set_options(3), archive_write_header(3), archive_write_data(3), and archive_write_finish_entry(3) to create the entry on disk and copy data into it. The flags argument is passed unmodified to archive_write_disk_set_options(3).
So we would have to duplicate that entire functionality...
<snip>
+ /* output the type, uid, gid, mode, size, time, md5 and link fields */
+ archive_write_set_options(mtree, "use-set,!device,!flags,!gname,!nlink,!uname,md5");
Did 'use-set' end up being a net-win on size and/or speed?
The raw file is much smaller when using 'use-set', but that difference entirely disappears when compressing with gzip. In the brief tests I did, reading was slightly faster using 'use-set'.
So, should I go ahead and write a version of archive_read_extract into a function that does both the extraction and mtree creation? Or do people see another way around this?
In fact... thinking about this further, we do not need to duplicate the read buffer at all, given it is not destroyed during the write operation. So that simplifies things a bit. Also, I was going to need to do all the checks for file writing for the mtree file anyway, so that can be abstracted and used for the extraction checking too. So the only real addition is all the checking for file creation, setting its mode, etc. I'll probably take a stab at this in the next couple of weeks.
Allan
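To illustrate the point about not duplicating the buffer, here is a rough sketch of the read-once, write-twice loop. It is an assumption-laden stand-in rather than the eventual implementation: `tee_stream` is a hypothetical name, and stdio streams take the place of the libarchive handles, so fread() plays the role of archive_read_data() and the two fwrite() calls play the role of archive_write_data() on the disk and mtree writers.

```c
#include <stdio.h>

/* Hypothetical helper: copy src to two sinks in a single pass.
 * Each chunk is read once and the same buffer is handed to both writers;
 * fwrite() takes a const pointer and does not modify it, so no copy is made. */
int tee_stream(FILE *src, FILE *disk, FILE *mtree)
{
	char buf[8192];
	size_t n;

	while((n = fread(buf, 1, sizeof(buf), src)) > 0) {
		/* first consumer: the file being extracted */
		if(fwrite(buf, 1, n, disk) != n) {
			return -1;
		}
		/* second consumer: the mtree writer, fed the very same bytes */
		if(fwrite(buf, 1, n, mtree) != n) {
			return -1;
		}
	}
	return ferror(src) ? -1 : 0;
}
```

The data is only read (and so only decompressed) once per chunk. What this sketch leaves out is exactly the part mentioned above: the checks around file creation, mode setting and so on that archive_read_extract() currently handles.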