Re: [pacman-dev] [PATCH] Write mtree of package files to local database

28 Feb 2012

On 28/02/12 14:35, Allan McRae wrote:
...
On 23/02/12 17:09, Dan McGee wrote:
...
On Tue, Feb 21, 2012 at 10:02 PM, Allan McRae <allan@archlinux.org> wrote:
...
When installing a package, write an mtree of the package files into
the local database.  This will be useful for doing validation of all
files on a system.
Signed-off-by: Allan McRae <allan@archlinux.org>
---
Query: should we keep the info on .INSTALL and .CHANGELOG files?  Changing a
.INSTALL file would be an interesting tactic, but if someone is doing that then
they can already adjust the mtree file...
Also, from http://goo.gl/Uq6X5 it appears that this could be made more efficient
by reusing the file descriptor, but I could not get that working after many, many,
many attempts.
Did you rewind the file descriptor? You should just have to call
`lseek(fd, 0, SEEK_SET)` first. Of course, since the current version
of _alpm_open_archive does both the open() and archive_read_new()
business, the abstraction there would have to change.
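
The rewind suggested above can be sketched roughly as follows. This is only an illustration with plain POSIX calls, not pacman code: `sum_bytes()` and `read_twice()` are hypothetical helpers, and a byte sum stands in for one full pass over the package archive. After the first pass the descriptor's offset sits at end-of-file, so a second pass sees nothing unless `lseek(fd, 0, SEEK_SET)` is called first.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: sum every byte readable from fd, starting at the
 * current offset. Stands in for one complete pass over an archive. */
long sum_bytes(int fd)
{
    unsigned char buf[4096];
    long total = 0;
    ssize_t n;

    assert(fd >= 0);
    while((n = read(fd, buf, sizeof(buf))) > 0) {
        for(ssize_t i = 0; i < n; i++) {
            total += buf[i];
        }
    }
    return n < 0 ? -1 : total;
}

/* Hypothetical helper: open path once and make two passes over the same
 * descriptor, rewinding in between. Returns 0 on success. */
int read_twice(const char *path, long *first, long *second)
{
    int ret = -1;
    int fd = open(path, O_RDONLY);

    if(fd < 0) {
        return -1;
    }
    *first = sum_bytes(fd);
    /* the offset is now at EOF; rewind before the second pass */
    if(lseek(fd, 0, SEEK_SET) != (off_t)-1) {
        *second = sum_bytes(fd);
        ret = (*first >= 0 && *second >= 0) ? 0 : -1;
    }
    close(fd);
    return ret;
}
```

If the `lseek()` call is dropped, the second pass returns 0 because `read()` immediately reports end-of-file.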
Ah... lseek was the key.  I can do that and adjust the abstraction in
_alpm_open_archive().  But it will not be needed if...
...
With that said, not having to decompress everything twice would also
be a win; I saw some chatter about this on IRC but I would definitely
prefer to not iterate again; removing the iteration from the diskspace
sped it up enough that I enabled that by default; I don't want to lose
those gains.
I think this can be done.  But it is far from simple.  It involves us
doing an archive_read_data() to read the data into a buffer, duplicating
that buffer and then passing one copy to the archive_write_data() for
the file on disk and the other to the write for the mtree archive.  It
means that we cannot use the convenience function
archive_read_extract() and that is a big convenience...
archive_read_extract(), archive_read_extract_set_skip_file():
    A convenience function that wraps the corresponding
archive_write_disk(3) interfaces. The first call to
archive_read_extract() creates a restore object using
archive_write_disk_new(3) and archive_write_disk_set_standard_lookup(3),
then transparently invokes archive_write_disk_set_options(3),
archive_write_header(3), archive_write_data(3), and
archive_write_finish_entry(3) to create the entry on disk and copy data
into it. The flags argument is passed unmodified to
archive_write_disk_set_options(3).
So we would have to duplicate that entire functionality...
<snip>
...
+       /* output the type, uid, gid, mode, size, time, md5 and link fields */
+       archive_write_set_options(mtree, "use-set,!device,!flags,!gname,!nlink,!uname,md5");
Did 'use-set' end up being a net-win on size and/or speed?
The size is much smaller for the raw file when using 'use-set' but that
difference entirely disappears when compressing with gzip.  In the brief
tests I did, the reading was slightly faster using 'use-set'.
So, should I go ahead and write a version of archive_read_extract into a
function that does both the extraction and mtree creation?  Or do people
see another way around this?
In fact...  thinking about this further, we do not need to duplicate the
read buffer at all, given it is not destroyed during the write
operation.  So that simplifies things a bit.
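
A minimal sketch of that idea, assuming nothing about the eventual pacman implementation: plain file descriptors stand in for the two libarchive writers (the file on disk and the mtree archive), and `write_all()` and `copy_to_two()` are hypothetical names. Each block is read once, and the same buffer is handed to both sinks because writing does not modify it.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write all len bytes of buf to fd, retrying on
 * short writes and EINTR. Returns 0 on success, -1 on error. */
int write_all(int fd, const unsigned char *buf, size_t len)
{
    while(len > 0) {
        ssize_t n = write(fd, buf, len);
        if(n < 0) {
            if(errno == EINTR) {
                continue;
            }
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read src once, feeding every block to both sinks from one buffer. */
int copy_to_two(int src, int sink_a, int sink_b)
{
    unsigned char buf[8192];
    ssize_t n;

    assert(src >= 0 && sink_a >= 0 && sink_b >= 0);
    while((n = read(src, buf, sizeof(buf))) != 0) {
        if(n < 0) {
            if(errno == EINTR) {
                continue;
            }
            return -1;
        }
        /* the first write only reads from buf, so no copy is needed
         * before passing the same block to the second sink */
        if(write_all(sink_a, buf, (size_t)n) != 0
                || write_all(sink_b, buf, (size_t)n) != 0) {
            return -1;
        }
    }
    return 0;
}
```

With libarchive the loop would have the same shape, with `archive_read_data()` as the source and `archive_write_data()` called on each of the two writers.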

Also, I was going to need to do all the checks for file writing for the
mtree file anyway, so those can be abstracted and used for the extraction
checking too.  So the only real addition is all the checking for file
creation, setting its mode, etc.

I'll probably take a stab at this in the next couple of weeks.

Allan