[pacman-dev] [PATCH] Write mtree of package files to local database

Tue Feb 28 00:17:20 EST 2012

On 28/02/12 14:35, Allan McRae wrote:
> On 23/02/12 17:09, Dan McGee wrote:
>> On Tue, Feb 21, 2012 at 10:02 PM, Allan McRae <allan at archlinux.org> wrote:
>>> When installing a package, write an mtree of the package files into
>>> the local database.  This will be useful for doing validation of all
>>> files on a system.
>>>
>>> Signed-off-by: Allan McRae <allan at archlinux.org>
>>> ---
>>>
>>> Query: should we keep the info on .INSTALL and .CHANGELOG files?  Changing a
>>> .INSTALL file would be an interesting tactic, but if someone is doing that then
>>> they can already adjust the mtree file...
>>>
>>> Also, from http://goo.gl/Uq6X5 it appears that this could be made more efficient
>>> by reusing the file descriptor, but I could not get that working after many, many,
>>> many attempts.
>> Did you rewind the file descriptor? You should just have to call
>> `lseek(fd, 0, SEEK_SET)` first. Of course, since the current version
>> of _alpm_open_archive does both the open() and archive_read_new()
>> business, the abstraction there would have to change.
> 
> Ah... lseek was the key.  I can do that and adjust the abstraction in
> _alpm_open_archive().  But it will not be needed if...
> 
>> With that said, not having to decompress everything twice would also
>> be a win; I saw some chatter about this on IRC but I would definitely
>> prefer to not iterate again; removing the iteration from the diskspace
>> sped it up enough that I enabled that by default; I don't want to lose
>> those gains.
> 
> I think this can be done.  But it is far from simple.  It involves us
> doing an archive_read_data() to read the data into a buffer, duplicating
> that buffer and then passing one copy to the archive_write_data() for
> the file on disk and the other to the write for the mtree archive.  It
> means that we can not use the convenience function
> archive_read_extract() and that is a big convenience...
> 
> archive_read_extract(), archive_read_extract_set_skip_file():
>     A convenience function that wraps the corresponding
> archive_write_disk(3) interfaces. The first call to
> archive_read_extract() creates a restore object using
> archive_write_disk_new(3) and archive_write_disk_set_standard_lookup(3),
> then transparently invokes archive_write_disk_set_options(3),
> archive_write_header(3), archive_write_data(3), and
> archive_write_finish_entry(3) to create the entry on disk and copy data
> into it. The flags argument is passed unmodified to
> archive_write_disk_set_options(3).
> 
> So we would have to duplicate that entire functionality...
> 
> 
> <snip>
> 
>>> +       /* output the type, uid, gid, mode, size, time, md5 and link fields */
>>> +       archive_write_set_options(mtree, "use-set,!device,!flags,!gname,!nlink,!uname,md5");
>> Did 'use-set' end up being a net-win on size and/or speed?
> 
> The raw file is much smaller when using 'use-set', but that
> difference entirely disappears when compressing with gzip.  In the brief
> tests I did, reading was slightly faster using 'use-set'.
> 
> 
> So, should I go ahead and write a version of archive_read_extract into a
> function that does both the extraction and mtree creation?  Or do people
> see another way around this?
> 

In fact...  thinking about this further, we do not need to duplicate the
read buffer at all, given it is not destroyed during the write
operation.  So that simplifies things a bit.
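
To illustrate the point about reusing the read buffer, here is a
minimal sketch using plain POSIX read()/write() rather than the
libarchive calls discussed above.  The names tee_fd() and write_all()
are hypothetical and do not exist in pacman or libarchive; the only
idea shown is that one buffer can feed two writers in turn, because
writing does not modify it.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Write all n bytes of buf to fd, retrying on short writes and EINTR.
 * Returns 0 on success, -1 on error. */
int write_all(int fd, const char *buf, size_t n)
{
    assert(buf != NULL);
    while (n > 0) {
        ssize_t w = write(fd, buf, n);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += w;
        n -= (size_t)w;
    }
    return 0;
}

/* Read in_fd until EOF and send every block to both out_a and out_b.
 * The same buffer is handed to both writers; no copy is made.
 * Returns the total number of bytes read, or -1 on error. */
long tee_fd(int in_fd, int out_a, int out_b)
{
    char buf[8192];
    long total = 0;

    for (;;) {
        ssize_t r = read(in_fd, buf, sizeof(buf));
        if (r < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (r == 0)
            break; /* EOF */
        if (write_all(out_a, buf, (size_t)r) != 0 ||
            write_all(out_b, buf, (size_t)r) != 0)
            return -1;
        total += r;
    }
    return total;
}
```

In the real change the reader would be archive_read_data() and the
two sinks would be the on-disk file and the mtree writer, but the
shape of the loop should be much the same.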

Also, I was going to need to do all the checks for file writing for the
mtree file anyway, so that can be abstracted and used for the extraction
checking too.  So the only real addition is all the checking for file
creation and setting its mode etc.
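
As a rough idea of what "file creation and setting its mode" involves
at the POSIX level, here is a small sketch.  create_with_mode() is a
hypothetical helper, not an existing pacman or libarchive function,
and a real replacement for archive_read_extract() would also have to
handle ownership, timestamps, symlinks, directories and so on.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create path exclusively and force its permission bits to mode.
 * Returns an open write-only descriptor, or -1 on failure. */
int create_with_mode(const char *path, mode_t mode)
{
    int fd;

    assert(path != NULL);

    /* O_EXCL makes this fail rather than clobber an existing file. */
    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    /* fchmod() is not filtered by the umask, unlike the mode given
     * to open(), so the file ends up with exactly the bits asked for. */
    if (fchmod(fd, mode) != 0) {
        close(fd);
        unlink(path);
        return -1;
    }
    return fd;
}
```

The exclusive create is only one possible policy; package extraction
normally has to overwrite existing files, so the real checks would
differ.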

I'll probably take a stab at this in the next couple of weeks.

Allan