Hello folks.

A few days ago there was a conversation in the #pacman-dev IRC channel about ways to reduce the pacman database size. David and a few other folks discussed how PGP signatures affect the final compressed size. I decided to turn this question into a small personal study, and I would like to present its results here.

The scripts I used for the experiments below are at https://pkgbuild.com/~anatolik/db-size-experiments/ That directory also contains all the generated data files that I am going to discuss.

I use today's community.db as the initial sample (file db.original).

The first experiment is to parse the db tarfile with the script and then write it back to a file:

  uncompressed size is 17757184 that is equal to original sample
  'zstd -19' compressed size is 4366994 that is 1.0084540990896713 times better than original sample

The tar *entries* content is identical to the original file and the uncompressed size is exactly the same. The compressed (zstd -19) size is 0.8% better. This comes from the fact that my script sets neither the tar entries' user/group values nor their modification times. I am not sure if pacman actually uses this information. Modification times contain a lot of entropy that the compressor does not like.

The next experiment is the same database but without %MD5SUM% entries. MD5 is an old hash algorithm with known security issues; by modern standards it does not qualify as a cryptographically strong hash function.

Experiment 'plain_nomd5':

  uncompressed size is 17536365 that is 1.0125920622660398 times better than original sample
  'zstd -19' compressed size is 4188019 that is 1.0515503869490563 times better than original sample

The uncompressed size shrinks by 1% and the compressed size is 5% smaller. MD5 hashes are essentially random bytes, so the compressor cannot do much with them.

The next experiment is to remove the package PGP signatures. Arch repos ship detached *.sig files for the packages, so let's see how removing %PGPSIG% affects the database size:

  uncompressed size is 14085120 that is 1.2607051981097783 times better than original sample
  'zstd -19' compressed size is 1160912 that is 3.7934942527943547 times better than original sample

The uncompressed size is 1.26 times smaller and the compressed size is 3.8 times smaller! That's a big difference.

And now the db without both md5 and pgpsig (a rough sketch of this field-stripping step is included at the end of this post):

  uncompressed size is 13248000 that is 1.3403671497584542 times better than original sample
  'zstd -19' compressed size is 1021667 that is 4.310517027563776 times better than original sample

The compressed db is down to 1MB, 4.3 times smaller than the original compressed db!

Until now we were talking about the current db format: a set of plain text files. While this format is flexible and makes it easy to add new fields, it also introduces a lot of redundancy. I have always been curious how far the db file size can go down. What if we remove the repeating information and use some form of 'stripped down' format, similar to 'packed' C structs, where only the field values are stored? In addition to the space savings, such a parser is very easy to implement (in C, ruby or any other language); see the scripts I shared above and the short sketch right below. The downside of such a format is its inflexibility: it is harder to achieve forward compatibility.
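To make the 'packed' idea concrete, here is a minimal sketch in Python. It is only an illustration of the approach, not the exact layout my scripts emit: writer and reader agree on a fixed field order (the FIELDS tuple below is a made-up example), and each value is stored as a 4-byte length prefix followed by the raw bytes, with no %FIELD% markers and no blank-line separators.

    # Minimal sketch of a length-prefixed 'packed' record format.
    # FIELDS is a hypothetical field order that writer and reader must share.
    import struct

    FIELDS = ('NAME', 'VERSION', 'CSIZE', 'ISIZE', 'SHA256SUM')

    def pack_record(desc):
        # Store only the values: 4-byte little-endian length + raw bytes, in FIELDS order.
        out = bytearray()
        for field in FIELDS:
            value = str(desc.get(field, '')).encode()
            out += struct.pack('<I', len(value)) + value
        return bytes(out)

    def unpack_record(buf):
        # Read the length-prefixed values back in the same fixed order.
        desc, off = {}, 0
        for field in FIELDS:
            (n,) = struct.unpack_from('<I', buf, off)
            off += 4
            desc[field] = buf[off:off + n].decode()
            off += n
        return desc

    rec = pack_record({'NAME': 'pacman', 'VERSION': '5.2.1-1', 'CSIZE': '867249'})
    assert unpack_record(rec)['NAME'] == 'pacman'

The hard part is not the encoding itself but evolving it: any change to the field list has to be versioned somehow, which is exactly the forward-compatibility cost mentioned above.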
Okay, let's talk about the numbers.

Experiment 'packed':

  uncompressed size is 7317679 that is 2.4266142310970458 times better than original sample
  'zstd -19' compressed size is 4145790 that is 1.062261474893808 times better than original sample

The packed uncompressed format is 2.42 times smaller than the tar file solution. That is how much redundant data our current format contains. But what is really mind-blowing is that the compressed size is only 6.2% better than for the plain text format. Woah! Modern compressors work wonders: zstd is very good at detecting and compressing all the redundancy in our files.

Packed format without the md5 field:

  uncompressed size is 7201183 that is 2.4658703993496625 times better than original sample
  'zstd -19' compressed size is 3996814 that is 1.1018558782069918 times better than original sample

Now without pgp signatures:

  uncompressed size is 2954238 that is 6.010749303204413 times better than original sample
  'zstd -19' compressed size is 883681 that is 4.983600416892521 times better than original sample

Packed format, no md5, no pgp:

  uncompressed size is 2837742 that is 6.257504734397982 times better than original sample
  'zstd -19' compressed size is 763730 that is 5.7663218676757495 times better than original sample

The smallest db size I was able to achieve is 764K (packed + compressed), which is 1.34 times smaller than the plain+compressed format without md5 and pgp.
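For anyone who wants to repeat the plain-format measurements without my scripts, here is a rough sketch of the field-stripping step mentioned earlier. It assumes db.original is a tar archive of <pkg>/desc entries that Python's tarfile can open (uncompressed or gzip/xz-compressed); the output name db.stripped and the DROP set are placeholders, not names from my scripts. Compress the result with 'zstd -19' and compare the file sizes.

    # Rewrite the db tar without %MD5SUM%/%PGPSIG% blocks and without
    # uid/gid/mtime metadata on the tar entries.
    import io
    import tarfile

    DROP = {'%MD5SUM%', '%PGPSIG%'}   # fields to remove from every desc file

    def strip_fields(text):
        # Drop each unwanted block: the %FIELD% line, its value lines and the blank line after.
        out, skipping = [], False
        for line in text.splitlines(keepends=True):
            if line.strip() in DROP:
                skipping = True
                continue
            if skipping:
                if line.strip() == '':
                    skipping = False   # blank line ends the dropped block
                continue
            out.append(line)
        return ''.join(out)

    with tarfile.open('db.original') as src, tarfile.open('db.stripped', 'w') as dst:
        for entry in src:
            data = src.extractfile(entry).read() if entry.isfile() else b''
            if entry.isfile() and entry.name.endswith('/desc'):
                data = strip_fields(data.decode()).encode()
            clean = tarfile.TarInfo(entry.name)          # fresh entry: no uid/gid/mtime noise
            clean.type, clean.mode, clean.size = entry.type, entry.mode, len(data)
            dst.addfile(clean, io.BytesIO(data) if entry.isfile() else None)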