[pacman-dev] [PATCH 2/2] Overhaul archive fgets function

Thu Dec 23 18:11:47 EST 2010

On 22/12/10 04:22, Dan McGee wrote:
> On Sun, Dec 19, 2010 at 11:32 AM, Xavier Chantry
> <chantry.xavier at gmail.com>  wrote:
>> On Wed, Dec 15, 2010 at 7:48 AM, Dan McGee<dan at archlinux.org>  wrote:
>>> The old function was written in a time before we relied on it for nearly
>>> every operation. Since then, we have switched to the archive backend and now
>>> fast parsing is a big deal.
>>>
>>> The former function made a per-character call to the libarchive
>>> archive_read_data() function, which resulted in some 21 million calls in a
>>> typical "load all sync dbs" operation. If we instead do some buffering of
>>> our own and read the blocks directly, and then find our newlines from there,
>>> we can cut out the multiple layers of overhead and go from archive to parsed
>>> data much quicker.
>>>
>>> Both users of the former function are switched over to the new signature,
>>> made easier by the macros now in place in the sync backend parsing code.
>>>
>>> Performance: for a `pacman -Su` (no upgrades available),
>>> _alpm_archive_fgets() goes from being 29% of the total time to 12% The time
>>> spent on the libarchive function being called dropped from 24% to 6%.
>>>
>>> This pushes _alpm_pkg_find back to the title of slowest low-level function.
>>>
>>> Signed-off-by: Dan McGee<dan at archlinux.org>
>>> ---
>>>
>>> I wish this one was a bit easier to comprehend and grok, but the cost of
>>> performance here is a bit more complicated fgets function than the very
>>> simplistic one we had before.
>>>
>>> The new function revolves around the zero-copy, retrieve an entire block
>>> functinoality of libarchive. We create a structure to keep all the bookkeeping
>>> info we need around, including a buffer of our own where each line ends up and
>>> auto-resizes itself as we poll through an archive file entry, and also cleans
>>> itself up when the file is EOF. This means malloc/free is not necessary by the
>>> caller, which keeps most concerns in the fgets function.
>>>
>>> Please let me know if it is unclear how the new function works, so more
>>> comments can be put in there.
>>>
>>
>> Since we have just one line, I would not have bothered trying to alloc
>> exactly what we need. I would just have used a static or dynamic buf
>> of max line size, and be done with it.
>> But anyway you already implemented the more complex way and I don't
>> see a problem with that as long as its tested (especially the re-alloc
>> part) :)
>
> I'm not sure what you are saying you would do with a dynamic buffer-
> doesn't it make sense to pick the original size as whatever we get
> back first, rather than any other arbitrary size that we might have to
> resize anyway? And I've always hated static buffer sizes; see the
> PATH_MAX junk for more reason for that.
>
>> I could not spot a problem by simply reading the code.
>>
>> Another minor observation : in the !done case, the alloc block does a
>> strict allocation and thus does not take into account that we will
>> very likely have to extend it on the next iteration (if we did not
>> reach the last archive block).
>>
>> Finally can it actually happen with our sync archives which only has
>> very small files that a line is split across two blocks ?
>
> It can with the smoke002.py pactest I'll include that just blew a hole
> through my realloc code; necessary changes below. Pactest created
> packages usually produce a block size of 65536 or so.
>
> I also bumped the max line size to 512KB rather than PATH_MAX (4096
> bytes), so now the delayed allocation makes a bit more sense.
>
>> In any cases, this is another very welcome improvement.
>
> Thanks!
>
> -Dan
>
> diff --git a/lib/libalpm/util.c b/lib/libalpm/util.c
> index 87880de..d34eab5 100644
> --- a/lib/libalpm/util.c
> +++ b/lib/libalpm/util.c
> @@ -841,6 +842,8 @@ int _alpm_archive_fgets(struct archive *a, struct
> archive_read_buffer *b)
>                          /* we've looked through the whole block but no
> newline, copy it */
>                          size_t len = b->block + b->block_size - b->block_offset;
>                          memcpy(b->line_offset, b->block_offset, len);
> +                       b->line_offset += len;
> +                       b->block_offset = i;
>                  }
>          }
>
>

Would this fix some sort of infinite loop in database reading?   I just 
experienced pacman "freezing" with 100% CPU usage while reading the 
[extra] database.   I figured out that I had not rebuilt my pacman-git 
package since applying this diff on my working branch and with a 
rebuild, everything went away.

So I figure this was the fix...  but I want to make sure that it makes 
sense to be the fix and I have not just ignored another issue that is no 
longer obviously exposed.

Allan