[pacman-dev] [PATCH] Optimize _alpm_archive_fgets with an internal buffer.

Thu Mar 27 10:53:39 EDT 2008

I haven't tried this yet, but you beat me to my own suggestion, well done. :)

On Thu, Mar 27, 2008 at 6:13 AM, Chantry Xavier <shiningxc at gmail.com> wrote:
> Signed-off-by: Chantry Xavier <shiningxc at gmail.com>
>  ---
>
>  Well, I saw that comment, so I gave it a try, but I am not sure if I did it
>  correctly. It doesn't seem too bad but maybe there are still cleaner ways.
>  Also I wasn't sure which buf size to choose, and how to comment the code.
>
>  Xav
>
>   lib/libalpm/util.c |   30 ++++++++++++++++++++++++------
>   1 files changed, 24 insertions(+), 6 deletions(-)
>
>  diff --git a/lib/libalpm/util.c b/lib/libalpm/util.c
>  index 92a0a26..0b7750e 100644
>  --- a/lib/libalpm/util.c
>  +++ b/lib/libalpm/util.c
>  @@ -675,22 +675,40 @@ int _alpm_test_md5sum(const char *filepath, const char *md5sum)
>
>   char *_alpm_archive_fgets(char *line, size_t size, struct archive *a)
>   {
>  -       /* for now, just read one char at a time until we get to a
>  -        * '\n' char. we can optimize this later with an internal
>  -        * buffer. */
>  +       /* internal buffer */
>  +       static char buf[BUFSIZ];
>  +       /* index of the next char to read in the buffer */
>  +       static int index = 0;
>  +       /* total number of char in the buffer */
>  +       static int nread = 0;
Can't index just be a char* pointer instead? That should make things a
bit simpler below.
if(index == &buf), for example, and line[loop] = index++;

>         /* leave room for zero terminator */
>         int want = size - 1;
>         int loop;
>
>         for(loop = 0; loop < want; loop++) {
>  -               int ret = archive_read_data(a, line + loop, 1);
>  +               if(index == 0) {
>  +                       /* fill the buffer */
>  +                       nread = archive_read_data(a, buf, BUFSIZ);
>  +               }
>  +               if(index < nread) {
>  +                       /* get the next char from the buffer */
>  +                       line[loop] = buf[index++];
>  +               } else {
>  +                       /* nothing left to read */
>  +                       line[loop] = '\0';
>  +                       index = 0;
>  +               }
>  +               if(index == BUFSIZ) {
>  +                       /* end of the buffer, so reset the index to refill it */
>  +                       index = 0;
>  +               }
Unless I'm mistaken, this should correctly handle the buffer running
out in the middle of a fgets call. The best way to test your code here
might be to set the buffer to something very small such as 4, and see
what happens. This will also ensure you never overrun your buffer,
etc.

>                 /* special check for first read- if null, return null,
>                  * this indicates EOF */
>  -               if(loop == 0 && (ret <= 0 || line[loop] == '\0')) {
>  +               if(loop == 0 && line[loop] == '\0') {
>                         return(NULL);
>                 }
>                 /* check if read value was null or newline */
>  -               if(ret <= 0 || line[loop] == '\0' || line[loop] == '\n') {
>  +               if(line[loop] == '\0' || line[loop] == '\n') {
>                         want = loop + 1;
>                         break;
>                 }
>  --
>  1.5.4.4

The big issues:
1. Once you use archive_fgets, you can't do a manual archive_read*
call since data is buffered here. This should get noted in a comment.
2. I can't call this function with one archive followed by a call with
another archive. This would obviously cause issues, and we should
probably find a way to address this. We might need a helper struct
that the caller has to pass in containing anything static so this
function is thread-safe, etc.
3. This can be optimized even more by removing the 1 char at a time
loop and making good use of memchr and memcpy. This is what I
originally planned to do when I got around to making this a buffered
implementation.

Good work though, glad to see someone else is interested in making
improvements here, and this is a large component of the read directly
from db.tar.gz files, although you can see from my other patch it can
already be put to use elsewhere in the code.

-Dan