[pacman-dev] Pacman database size study

Dave Reisner d at falconindy.com
Wed Jan 22 18:06:37 UTC 2020


On Wed, Jan 22, 2020 at 11:04 AM Anatol Pomozov <anatol.pomozov at gmail.com>
wrote:

> Hello
>
> On Wed, Jan 22, 2020 at 2:23 AM Allan McRae <allan at archlinux.org> wrote:
> >
> > On 22/1/20 6:54 pm, Anatol Pomozov wrote:
> > > The first experiment is to parse db tarfile using the script and then
> > > write it back to a file:
> > >   uncompressed size is 17757184, equal to the original sample
> > >   'zstd -19' compressed size is 4366994, which is 1.0084 times
> > > better than the original sample
> > >
> > > The tar *entries'* content is identical to the original file, and
> > > the uncompressed size is exactly the same. The compressed (zstd -19)
> > > size is 0.8% smaller. That comes from the fact that my script sets
> > > neither the entries' user/group values nor their modification times.
> > > I am not sure whether pacman actually uses this information.
> > > Modification times contain a lot of entropy that the compressor
> > > does not like.
> >
> > tl;dr
> >
> > "original"      4366994
> > no md5          4188019
> > no pgp          1160912
> > no md5+pgp      1021667
> >
> >
> > But do any of these numbers stand if you keep the tar file?
>
> I do not fully understand your question here. plainXXX+uncompressed is
> a TAR file that matches the current db format.
>
> original     17757184
> no md5       17536365
> no pgp       14085120
> no md5/pgp   13248000
>
> But compressed size is what really matters for users. Dropping the pgp
> signature from the db file provides the biggest benefit for compressed
> data (files 3.8 times smaller).
>
> >
> > Also, I find downloading signature files causes a big pause in
> > processing the downloads.   Is that just a slow connection to the world
> > at my end?
>
> *.sig files are small so bandwidth should not be a problem.
>
> My guess is that latency to your Arch mirror is too high, and setting
> up twice as many ssl connections gives a noticeable slowdown. Check
> that you use a local Australian mirror - it will help reduce the
> connection setup time. Using HTTP instead of HTTPS might help a bit
> as well.
>

Point of order: If you only use a single mirror, there should only be a
single connection -- pacman (curl) reuses connections whenever possible
and only gives up if the remote doesn't support keepalives (which should
be rare) or a socket error occurs.
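The reuse behaviour described here can be illustrated outside of pacman. The sketch below uses Python's http.client rather than libcurl, and the host/path arguments are illustrative: one persistent connection serves both the database and its detached signature, so only a single TCP (and TLS) setup is paid.

```python
# Illustration of the connection-reuse point above: fetch a db file and
# its .sig over one keep-alive connection instead of two. Uses Python's
# http.client as a stand-in for curl; arguments are illustrative.
import http.client

def fetch_both(host, db_path, port=443,
               conn_cls=http.client.HTTPSConnection):
    """Download db_path and db_path + '.sig' over one connection."""
    conn = conn_cls(host, port)        # one TCP/TLS setup
    bodies = {}
    for path in (db_path, db_path + ".sig"):
        conn.request("GET", path)
        resp = conn.getresponse()
        bodies[path] = resp.read()     # drain before reusing the socket
    conn.close()
    return bodies
```

If the server advertised `Connection: close`, http.client (like curl) would transparently reconnect for the second request, which is exactly the rare fallback case mentioned above.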


> But the best solution for your problem is proper parallel download
> support in pacman. Connection setup would then run in parallel, sharing
> the setup latency. It would also require fewer HTTP/HTTPS connections,
> as HTTP/2 supports multiplexing - multiple downloads from the same
> server would share a single connection.


It's more subtle than this. As I mentioned above, there should only be a
single (reused) connection. If the problem is actually latency in TTFB,
then it might be a matter of using a more geographically local mirror.
Parallelization could help mask some problems here, but it's going to be
a LOT of work to change pacman internals to accommodate this.
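For comparison, the latency-masking effect of parallel downloads can be sketched without touching pacman at all. This is a thread-pool toy in Python, not pacman's eventual implementation; the URLs and worker count are illustrative.

```python
# Sketch of the parallel-download idea from the quoted paragraph: fetch
# several repo files concurrently so per-connection setup latency and
# TTFB overlap instead of adding up. Thread pool stands in for whatever
# mechanism pacman might adopt; URLs are illustrative.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch_all(urls, workers=4):
    """Return {url: bytes}, downloading up to `workers` files at once."""
    def fetch(url):
        with urlopen(url) as resp:
            return resp.read()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with urls
        return dict(zip(urls, pool.map(fetch, urls)))
```

With N mirror round-trips of latency L, the sequential cost is roughly N*L while the parallel cost approaches L, which is why latency (rather than bandwidth) is the quantity parallelism helps with here.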

