Hello folks.

A few days ago there was a conversation in the #pacman-dev IRC channel about ways to reduce the pacman database size. David and a few other folks discussed how PGP signatures affect the final compressed size. I decided to turn this question into a small personal study, and I would like to present its results here.

The scripts I used for the experiments below are at https://pkgbuild.com/~anatolik/db-size-experiments/ That directory also contains all the generated data files that I am going to discuss.

I use today's community.db as the initial sample (file db.original).

The first experiment is to parse the db tarfile with the script and then write it back to a file:

  uncompressed size is 17757184 that is equal to original sample
  'zstd -19' compressed size is 4366994 that is 1.0084540990896713 times better than original sample

The tar *entries* content is identical to the original file and the uncompressed size is exactly the same. The compressed (zstd -19) size is 0.8% better. This comes from the fact that my script sets neither the tar entries' user/group values nor their modification times. I am not sure if pacman actually uses this information. Modification times contain a lot of entropy that the compressor does not like.

The next experiment is the same database but without %MD5SUM% entries. MD5 is an old hash algorithm with known security issues; by modern standards it does not qualify as a cryptographically strong hash function.

Experiment 'plain_nomd5':

  uncompressed size is 17536365 that is 1.0125920622660398 times better than original sample
  'zstd -19' compressed size is 4188019 that is 1.0515503869490563 times better than original sample

The uncompressed size shrinks by 1% and the compressed size is 5% smaller. MD5 hashes are essentially random bytes, so the compressor cannot do much with them.

The next experiment is to remove the package PGP signatures. Arch repos ship detached *.sig files for the packages, so let's see how removing %PGPSIG% affects the database size:

  uncompressed size is 14085120 that is 1.2607051981097783 times better than original sample
  'zstd -19' compressed size is 1160912 that is 3.7934942527943547 times better than original sample

The uncompressed size is 1.26 times smaller and the compressed size is 3.8 times smaller! That's a big difference.

And now the db without both md5 and pgpsig (a rough sketch of this field-stripping step is included at the end of this post):

  uncompressed size is 13248000 that is 1.3403671497584542 times better than original sample
  'zstd -19' compressed size is 1021667 that is 4.310517027563776 times better than original sample

The compressed db is down to 1MB, 4.3 times smaller than the original compressed db!

Until now we were talking about the current db format: a set of plain text files. While this format is flexible and makes it easy to add new fields, it also introduces a lot of redundancy. I have always been curious how far the db file size can go down. What if we remove the repeating information and use some form of 'stripped down' format, similar to 'packed' C structs, where only the field values are stored? In addition to the space savings, such a parser is very easy to implement (in C, ruby or any other language); see the scripts I shared above and the short sketch right below. The downside of such a format is its inflexibility: it is harder to achieve forward compatibility.
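To make the 'packed' idea concrete, here is a minimal sketch in Python. It is only an illustration of the approach, not the exact layout my scripts emit: writer and reader agree on a fixed field order (the FIELDS tuple below is a made-up example), and each value is stored as a 4-byte length prefix followed by the raw bytes, with no %FIELD% markers and no blank-line separators.

    # Minimal sketch of a length-prefixed 'packed' record format.
    # FIELDS is a hypothetical field order that writer and reader must share.
    import struct

    FIELDS = ('NAME', 'VERSION', 'CSIZE', 'ISIZE', 'SHA256SUM')

    def pack_record(desc):
        # Store only the values: 4-byte little-endian length + raw bytes, in FIELDS order.
        out = bytearray()
        for field in FIELDS:
            value = str(desc.get(field, '')).encode()
            out += struct.pack('<I', len(value)) + value
        return bytes(out)

    def unpack_record(buf):
        # Read the length-prefixed values back in the same fixed order.
        desc, off = {}, 0
        for field in FIELDS:
            (n,) = struct.unpack_from('<I', buf, off)
            off += 4
            desc[field] = buf[off:off + n].decode()
            off += n
        return desc

    rec = pack_record({'NAME': 'pacman', 'VERSION': '5.2.1-1', 'CSIZE': '867249'})
    assert unpack_record(rec)['NAME'] == 'pacman'

The hard part is not the encoding itself but evolving it: any change to the field list has to be versioned somehow, which is exactly the forward-compatibility cost mentioned above.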
Okay, let's talk about the numbers.

Experiment 'packed':

  uncompressed size is 7317679 that is 2.4266142310970458 times better than original sample
  'zstd -19' compressed size is 4145790 that is 1.062261474893808 times better than original sample

The packed uncompressed format is 2.42 times smaller than the tar file solution. That is how much redundant data our current format contains. But what is really mind-blowing is that the compressed size is only 6.2% better than for the plain text format. Woah! Modern compressors work wonders: zstd is very good at detecting and compressing all the redundancy in our files.

Packed format without the md5 field:

  uncompressed size is 7201183 that is 2.4658703993496625 times better than original sample
  'zstd -19' compressed size is 3996814 that is 1.1018558782069918 times better than original sample

Now without pgp signatures:

  uncompressed size is 2954238 that is 6.010749303204413 times better than original sample
  'zstd -19' compressed size is 883681 that is 4.983600416892521 times better than original sample

Packed format, no md5, no pgp:

  uncompressed size is 2837742 that is 6.257504734397982 times better than original sample
  'zstd -19' compressed size is 763730 that is 5.7663218676757495 times better than original sample

The smallest db size I was able to achieve is 764K (packed + compressed), which is 1.34 times smaller than the plain+compressed format without md5 and pgp.
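For anyone who wants to repeat the plain-format measurements without my scripts, here is a rough sketch of the field-stripping step mentioned earlier. It assumes db.original is a tar archive of <pkg>/desc entries that Python's tarfile can open (uncompressed or gzip/xz-compressed); the output name db.stripped and the DROP set are placeholders, not names from my scripts. Compress the result with 'zstd -19' and compare the file sizes.

    # Rewrite the db tar without %MD5SUM%/%PGPSIG% blocks and without
    # uid/gid/mtime metadata on the tar entries.
    import io
    import tarfile

    DROP = {'%MD5SUM%', '%PGPSIG%'}   # fields to remove from every desc file

    def strip_fields(text):
        # Drop each unwanted block: the %FIELD% line, its value lines and the blank line after.
        out, skipping = [], False
        for line in text.splitlines(keepends=True):
            if line.strip() in DROP:
                skipping = True
                continue
            if skipping:
                if line.strip() == '':
                    skipping = False   # blank line ends the dropped block
                continue
            out.append(line)
        return ''.join(out)

    with tarfile.open('db.original') as src, tarfile.open('db.stripped', 'w') as dst:
        for entry in src:
            data = src.extractfile(entry).read() if entry.isfile() else b''
            if entry.isfile() and entry.name.endswith('/desc'):
                data = strip_fields(data.decode()).encode()
            clean = tarfile.TarInfo(entry.name)          # fresh entry: no uid/gid/mtime noise
            clean.type, clean.mode, clean.size = entry.type, entry.mode, len(data)
            dst.addfile(clean, io.BytesIO(data) if entry.isfile() else None)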