[pacman-dev] [PATCH] Write mtree of package files to local database

Mon Feb 27 23:35:53 EST 2012

On 23/02/12 17:09, Dan McGee wrote:
> On Tue, Feb 21, 2012 at 10:02 PM, Allan McRae <allan at archlinux.org> wrote:
>> When installing a package, write an mtree of the package files into
>> the local database.  This will be useful for doing validation of all
>> files on a system.
>>
>> Signed-off-by: Allan McRae <allan at archlinux.org>
>> ---
>>
>> Query: should we keep the info on .INSTALL and .CHANGELOG files?  Changing a
>> .INSTALL file would be an interesting tactic, but if someone is doing that then
>> they can already adjust the mtree file...
>>
>> Also, from http://goo.gl/Uq6X5 it appears that this could be made more efficient
>> by reusing the file descriptor, but I could not get that working after many, many,
>> many attempts.
> Did you rewind the file descriptor? You should just have to call
> `lseek(fd, 0, SEEK_SET)` first. Of course, since the current version
> of _alpm_open_archive does both the open() and archive_read_new()
> business, the abstraction there would have to change.
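
For reference, a minimal sketch of the rewind Dan describes, using only POSIX calls. The names read_from_start() and rewind_demo() are hypothetical and exist only for illustration; they are not pacman or libarchive functions, and the real change would still need the _alpm_open_archive() rework mentioned above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: rewind fd to offset 0, then read up to len bytes.
 * Lets a second pass reuse a descriptor that an earlier pass consumed. */
static ssize_t read_from_start(int fd, char *buf, size_t len)
{
    assert(buf != NULL);
    /* move the file offset back to the beginning */
    if(lseek(fd, 0, SEEK_SET) == (off_t)-1) {
        return -1;
    }
    return read(fd, buf, len);
}

/* Demo: two passes over one descriptor.
 * Returns 0 if both passes read the same five bytes, -1 otherwise. */
static int rewind_demo(void)
{
    char first[5], second[5];
    FILE *tmp = tmpfile();
    int fd, ok;

    if(tmp == NULL) {
        return -1;
    }
    fd = fileno(tmp);
    ok = write(fd, "mtree", 5) == 5
        && read_from_start(fd, first, 5) == 5
        && read_from_start(fd, second, 5) == 5
        && first[0] == second[0] && first[4] == second[4];
    fclose(tmp);
    return ok ? 0 : -1;
}
```

Without the lseek() call, the second read would start at the offset where the first pass stopped, which for a fully consumed file means end-of-file.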

Ah... lseek was the key.  I can do that and rework the abstraction in
_alpm_open_archive().  But it will not be needed if...

> With that said, not having to decompress everything twice would also
> be a win; I saw some chatter about this on IRC but I would definitely
> prefer to not iterate again; removing the iteration from the diskspace
> sped it up enough that I enabled that by default; I don't want to lose
> those gains.

I think this can be done, but it is far from simple.  It involves
calling archive_read_data() to read the data into a buffer, duplicating
that buffer, and then passing one copy to archive_write_data() for the
file on disk and the other to the write for the mtree archive.  It
means that we cannot use the convenience function
archive_read_extract(), and that is a big convenience...

archive_read_extract(), archive_read_extract_set_skip_file():
    A convenience function that wraps the corresponding
    archive_write_disk(3) interfaces. The first call to
    archive_read_extract() creates a restore object using
    archive_write_disk_new(3) and
    archive_write_disk_set_standard_lookup(3), then transparently
    invokes archive_write_disk_set_options(3),
    archive_write_header(3), archive_write_data(3), and
    archive_write_finish_entry(3) to create the entry on disk and copy
    data into it. The flags argument is passed unmodified to
    archive_write_disk_set_options(3).

So we would have to duplicate that entire functionality...
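
The single-pass idea can be sketched with plain POSIX descriptors standing in for the libarchive handles: each block is read once and handed to two writers. This is only an analogue under stated assumptions. In the real change the read would be archive_read_data() and the writes archive_write_data(); tee_copy() and write_all() are hypothetical names; and the sketch passes the same buffer to both sinks in turn instead of duplicating it, which is safe for write(2) but has not been checked against the libarchive writers.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write all len bytes to fd, retrying short writes. */
static int write_all(int fd, const char *buf, size_t len)
{
    while(len > 0) {
        ssize_t n = write(fd, buf, len);
        if(n < 0) {
            if(errno == EINTR) {
                continue;
            }
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: read src once and send every block to both sinks.
 * Returns the total number of bytes copied, or -1 on error. */
static long tee_copy(int src, int out_a, int out_b)
{
    char buf[8192];
    long total = 0;
    ssize_t n;

    assert(src >= 0 && out_a >= 0 && out_b >= 0);
    while((n = read(src, buf, sizeof(buf))) != 0) {
        if(n < 0) {
            if(errno == EINTR) {
                continue;
            }
            return -1;
        }
        /* the same block goes to the "file on disk" and the "mtree" sink */
        if(write_all(out_a, buf, (size_t)n) != 0
                || write_all(out_b, buf, (size_t)n) != 0) {
            return -1;
        }
        total += n;
    }
    return total;
}
```

The source is only decompressed (here: read) once, which is the gain Dan is asking for; the cost is that the per-entry header, data, and finish steps that archive_read_extract() hides would have to be driven by hand.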

<snip>

>> +       /* output the type, uid, gid, mode, size, time, md5 and link fields */
>> +       archive_write_set_options(mtree, "use-set,!device,!flags,!gname,!nlink,!uname,md5");
> Did 'use-set' end up being a net-win on size and/or speed?

The raw file is much smaller when using 'use-set', but that
difference entirely disappears when compressing with gzip.  In the brief
tests I did, reading was slightly faster using 'use-set'.

So, should I go ahead and write our own version of archive_read_extract()
as a function that does both the extraction and the mtree creation?  Or do
people see another way around this?

Allan