[pacman-dev] Pacman database size study
Hello folks.

A few days ago there was a conversation in the #pacman-dev IRC channel about ways to reduce the pacman database size. David and a few other folks discussed how PGP signatures affect the final compressed size. I decided to turn this question into a small personal study, and I would like to present its results here.

The scripts that I used for the following experiments are at https://pkgbuild.com/~anatolik/db-size-experiments/ This directory also contains all the generated data files that I am going to discuss below. I use today's community.db as the initial sample (file db.original).

The first experiment is to parse the db tarfile with the script and then write it back to a file:

  uncompressed size is 17757184 that is equal to original sample
  'zstd -19' compressed size is 4366994 that is 1.0084540990896713 times better than original sample

Tar *entries* content is identical to the original file. The uncompressed size is exactly the same. The compressed (zstd -19) size is 0.8% better. This comes from the fact that my script sets neither the entries' user/group values nor the tar entries' modification times. I am not sure whether this information is actually used by pacman. Modification times contain a lot of entropy that the compressor does not like.

The next experiment is the same database but without %MD5SUM% entries (a rough sketch of this stripping step appears at the end of this message). MD5 is an old hash algorithm with known security issues, and it cannot qualify as a cryptographically strong hash function by modern standards.

Experiment 'plain_nomd5':

  uncompressed size is 17536365 that is 1.0125920622660398 times better than original sample
  'zstd -19' compressed size is 4188019 that is 1.0515503869490563 times better than original sample

The uncompressed size is reduced by 1% and the compressed size is 5% smaller. MD5 hashes contain a lot of entropy that the compressor does not like.

The next experiment is to remove the package PGP signatures. Arch repos have detached *.sig files for the packages, so let's see how removing %PGPSIG% affects the database size:

  uncompressed size is 14085120 that is 1.2607051981097783 times better than original sample
  'zstd -19' compressed size is 1160912 that is 3.7934942527943547 times better than original sample

The uncompressed size is 1.26 times smaller and the compressed size is 3.8 times smaller!!!! That's a big difference.

And now the db without both md5 and pgpsig:

  uncompressed size is 13248000 that is 1.3403671497584542 times better than original sample
  'zstd -19' compressed size is 1021667 that is 4.310517027563776 times better than original sample

The compressed db size is down to 1MB, which is 4.3 times smaller than the original compressed db!!!

Until now we were talking about the current db format, which is a set of plain-text files. While this format is flexible and makes it easy to add new fields, it also introduces a lot of redundancy. I have always been curious how far we can push the db file size down. What if we remove the repeating information and use some form of 'stripped down' format, something similar to 'packed' C structs where only the field values are stored? In addition to space effectiveness, it is very easy to implement the parser (in C, Ruby or any other language); see the scripts I shared above. The downside of such a format is its inflexibility, and it is harder to achieve forward compatibility.
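To give a feel for what such a packed encoding could look like, here is a minimal sketch in Python. This is not the exact layout my scripts use; the field subset and the length-prefixed framing below are assumptions chosen purely for illustration.

import struct

# Minimal sketch of a "packed" record encoding: only field values are stored,
# each prefixed with its length. Illustration only, not the experiment layout.
FIELDS = ("name", "version", "desc", "csize", "isize")  # assumed field subset

def pack_record(record):
    """Serialize a dict of string fields into a length-prefixed byte blob."""
    out = bytearray()
    for field in FIELDS:
        value = str(record[field]).encode("utf-8")
        out += struct.pack("<I", len(value))  # 4-byte little-endian length
        out += value
    return bytes(out)

def unpack_record(blob):
    """Parse one length-prefixed record back into a dict."""
    record, offset = {}, 0
    for field in FIELDS:
        (length,) = struct.unpack_from("<I", blob, offset)
        offset += 4
        record[field] = blob[offset:offset + length].decode("utf-8")
        offset += length
    return record

# Example (made-up values):
# blob = pack_record({"name": "pacman", "version": "5.2.1-4",
#                     "desc": "A library-based package manager",
#                     "csize": "880590", "isize": "4618752"})
# print(unpack_record(blob))

A parser like this is trivial to write in any language, which is the appeal; the price is that adding or reordering fields breaks old readers, hence the forward-compatibility concern above.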
Okay, let's talk about the numbers.

Experiment 'packed':

  uncompressed size is 7317679 that is 2.4266142310970458 times better than original sample
  'zstd -19' compressed size is 4145790 that is 1.062261474893808 times better than original sample

The packed uncompressed format is 2.42 times better than the tar file solution. That is how much redundant data our current format contains. But what is really mind-blowing is that the compressed format is only 6.2% better than the plain-text one. Whoa!!! Modern compressors work wonders: zstd is very good at detecting and compressing all the redundancy in our files.

Packed format without the md5 field:

  uncompressed size is 7201183 that is 2.4658703993496625 times better than original sample
  'zstd -19' compressed size is 3996814 that is 1.1018558782069918 times better than original sample

Now without pgp signatures:

  uncompressed size is 2954238 that is 6.010749303204413 times better than original sample
  'zstd -19' compressed size is 883681 that is 4.983600416892521 times better than original sample

Packed format, no md5, no pgp:

  uncompressed size is 2837742 that is 6.257504734397982 times better than original sample
  'zstd -19' compressed size is 763730 that is 5.7663218676757495 times better than original sample

The smallest db size I was able to achieve is 764K (packed + compressed), which is 1.34 times better than the plain+compressed format without md5 and pgp.
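For anyone who wants to reproduce the stripping experiments without reading my scripts, here is a rough outline in Python (the real scripts are in the directory linked above; details such as file modes and the exact desc handling are simplified):

import io
import subprocess
import tarfile

DROP = {"%MD5SUM%", "%PGPSIG%"}  # fields removed in the no-md5/no-pgp experiments

def strip_fields(desc_text):
    """Drop the %MD5SUM%/%PGPSIG% sections from a desc file."""
    out, skip = [], False
    for line in desc_text.splitlines():
        if line.startswith("%") and line.endswith("%"):
            skip = line in DROP
        if not skip:
            out.append(line)
    return "\n".join(out) + "\n"

def rewrite_db(src="db.original", dst="db.stripped"):
    """Rewrite the uncompressed db tar without md5/pgpsig and without
    per-entry mtime/uid/gid (which only add entropy)."""
    with tarfile.open(src, "r:") as tin, tarfile.open(dst, "w") as tout:
        for member in tin:
            data = tin.extractfile(member).read() if member.isfile() else b""
            if member.isfile() and member.name.endswith("/desc"):
                data = strip_fields(data.decode("utf-8")).encode("utf-8")
            info = tarfile.TarInfo(member.name)
            info.type = member.type
            info.mode = member.mode
            info.size = len(data)
            # mtime/uid/gid are left at 0 on purpose
            tout.addfile(info, io.BytesIO(data) if member.isfile() else None)

def zstd_size(path, level=19):
    """Compress with the zstd CLI and return the compressed size in bytes."""
    out = subprocess.run(["zstd", f"-{level}", "-c", path],
                         check=True, capture_output=True).stdout
    return len(out)

# rewrite_db(); print(zstd_size("db.original"), zstd_size("db.stripped"))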
On 22/1/20 6:54 pm, Anatol Pomozov wrote:
The first experiment is to parse the db tarfile with the script and then write it back to a file:

  uncompressed size is 17757184 that is equal to original sample
  'zstd -19' compressed size is 4366994 that is 1.0084540990896713 times better than original sample

Tar *entries* content is identical to the original file. The uncompressed size is exactly the same. The compressed (zstd -19) size is 0.8% better. This comes from the fact that my script sets neither the entries' user/group values nor the tar entries' modification times. I am not sure whether this information is actually used by pacman. Modification times contain a lot of entropy that the compressor does not like.
tl;dr

  "original"   4366994
  no md5       4188019
  no pgp       1160912
  no md5+pgp   1021667

But do any of these numbers stand if you keep the tar file?

Also, I find downloading signature files causes a big pause in processing the downloads. Is that just a slow connection to the world at my end?

Allan
Hello On Wed, Jan 22, 2020 at 2:23 AM Allan McRae <allan@archlinux.org> wrote:
On 22/1/20 6:54 pm, Anatol Pomozov wrote:
The first experiment is to parse the db tarfile with the script and then write it back to a file:

  uncompressed size is 17757184 that is equal to original sample
  'zstd -19' compressed size is 4366994 that is 1.0084540990896713 times better than original sample

Tar *entries* content is identical to the original file. The uncompressed size is exactly the same. The compressed (zstd -19) size is 0.8% better. This comes from the fact that my script sets neither the entries' user/group values nor the tar entries' modification times. I am not sure whether this information is actually used by pacman. Modification times contain a lot of entropy that the compressor does not like.
tl;dr
"original" 4366994 no md5 4188019 no pgp 1160912 np md5+pgp 1021667
But do any of these numbers stand if you keep the tar file?
I do not fully understand your question here. plainXXX+uncompressed is a TAR file that matches the current db format.

  original      17757184
  no md5        17536365
  no pgp        14085120
  no md5/pgp    13248000

But the compressed size is what really matters for users. Dropping the pgp signatures from the db file provides the biggest benefit for compressed data (3.8 times smaller files).
Also, I find downloading signature files causes a big pause in processing the downloads. Is that just a slow connection to the world at my end?
*.sig files are small so bandwidth should not be a problem.

My guess is that the latency to your Arch mirror is too high and that setting up twice as many SSL connections gives a noticeable slowdown. Check whether you use a local Australian mirror - it will help reduce the connection setup time. Using HTTP instead of HTTPS might help a bit as well.

But the best solution for your problem is proper parallel download support in pacman. In that case connection setup would run in parallel, sharing its setup latency. It would also require fewer HTTP/HTTPS connections, as HTTP/2 supports multiplexing - multiple downloads from the same server would share a single connection.
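As a toy illustration of the latency-sharing idea only (this is not pacman code, the mirror and file names are made up, and plain urllib gives you concurrent connections rather than HTTP/2 multiplexing):

import concurrent.futures
import urllib.request

# Hypothetical mirror and file names, for illustration only.
MIRROR = "https://mirror.example.org/archlinux/community/os/x86_64/"
FILES = [
    "somepkg-1.0-1-x86_64.pkg.tar.zst",
    "somepkg-1.0-1-x86_64.pkg.tar.zst.sig",
]

def fetch(name):
    # Each worker pays its own connection-setup latency, but the workers
    # pay it at the same time instead of one after another.
    with urllib.request.urlopen(MIRROR + name, timeout=30) as resp:
        return name, len(resp.read())

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for name, size in pool.map(fetch, FILES):
        print(name, size)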
On Wed, Jan 22, 2020 at 11:04 AM Anatol Pomozov <anatol.pomozov@gmail.com> wrote:
Hello
On Wed, Jan 22, 2020 at 2:23 AM Allan McRae <allan@archlinux.org> wrote:
On 22/1/20 6:54 pm, Anatol Pomozov wrote:
The first experiment is to parse the db tarfile with the script and then write it back to a file:

  uncompressed size is 17757184 that is equal to original sample
  'zstd -19' compressed size is 4366994 that is 1.0084540990896713 times better than original sample

Tar *entries* content is identical to the original file. The uncompressed size is exactly the same. The compressed (zstd -19) size is 0.8% better. This comes from the fact that my script sets neither the entries' user/group values nor the tar entries' modification times. I am not sure whether this information is actually used by pacman. Modification times contain a lot of entropy that the compressor does not like.
tl;dr
"original" 4366994 no md5 4188019 no pgp 1160912 np md5+pgp 1021667
But do any of these numbers stand if you keep the tar file?
I do not fully understand your question here. plainXXX+uncompressed is a TAR file that matches the current db format.

  original      17757184
  no md5        17536365
  no pgp        14085120
  no md5/pgp    13248000

But the compressed size is what really matters for users. Dropping the pgp signatures from the db file provides the biggest benefit for compressed data (3.8 times smaller files).
Also, I find downloading signature files causes a big pause in processing the downloads. Is that just a slow connection to the world at my end?
*.sig files are small so bandwidth should not be a problem.
My guess is that the latency to your Arch mirror is too high and that setting up twice as many SSL connections gives a noticeable slowdown. Check whether you use a local Australian mirror - it will help reduce the connection setup time. Using HTTP instead of HTTPS might help a bit as well.
Point of order: If you only use a single mirror, there should only be a single connection -- pacman (curl) reuses connections whenever possible and only gives up if the remote doesn't support keepalives (should be rare) or a socket error occurs.
But the best solution for your problem is proper parallel download support in pacman. In that case connection setup would run in parallel, sharing its setup latency. It would also require fewer HTTP/HTTPS connections, as HTTP/2 supports multiplexing - multiple downloads from the same server would share a single connection.
It's more subtle than this. As I mentioned above, there should only be a single (reused) connection. If the problem is actually latency in TTFB, then it might be a matter of using a more geographically local mirror. Parallelization could help mask some problems here, but it's going to be a LOT of work to change pacman internals to accommodate this.
On 23/1/20 2:03 am, Anatol Pomozov wrote:
Hello
On Wed, Jan 22, 2020 at 2:23 AM Allan McRae <allan@archlinux.org> wrote:
On 22/1/20 6:54 pm, Anatol Pomozov wrote:
The first experiment is to parse the db tarfile with the script and then write it back to a file:

  uncompressed size is 17757184 that is equal to original sample
  'zstd -19' compressed size is 4366994 that is 1.0084540990896713 times better than original sample

Tar *entries* content is identical to the original file. The uncompressed size is exactly the same. The compressed (zstd -19) size is 0.8% better. This comes from the fact that my script sets neither the entries' user/group values nor the tar entries' modification times. I am not sure whether this information is actually used by pacman. Modification times contain a lot of entropy that the compressor does not like.
tl;dr
"original" 4366994 no md5 4188019 no pgp 1160912 np md5+pgp 1021667
But do any of these numbers stand if you keep the tar file?
I do not fully understand your question here. plainXXX+uncompressed is a TAR file that matches the current db format.
Oops... I did not look far enough down your supplied files. I downloaded db.original from your link, which is not the original, and thought your numbers were based off that.

A
Hi On Wed, Jan 22, 2020 at 2:03 PM Allan McRae <allan@archlinux.org> wrote:
On 23/1/20 2:03 am, Anatol Pomozov wrote:
Hello
On Wed, Jan 22, 2020 at 2:23 AM Allan McRae <allan@archlinux.org> wrote:
On 22/1/20 6:54 pm, Anatol Pomozov wrote:
The first experiment is to parse the db tarfile with the script and then write it back to a file:

  uncompressed size is 17757184 that is equal to original sample
  'zstd -19' compressed size is 4366994 that is 1.0084540990896713 times better than original sample

Tar *entries* content is identical to the original file. The uncompressed size is exactly the same. The compressed (zstd -19) size is 0.8% better. This comes from the fact that my script sets neither the entries' user/group values nor the tar entries' modification times. I am not sure whether this information is actually used by pacman. Modification times contain a lot of entropy that the compressor does not like.
tl;dr
"original" 4366994 no md5 4188019 no pgp 1160912 np md5+pgp 1021667
But do any of these numbers stand if you keep the tar file?
I do not fully understand your question here. plainXXX+uncompressed is a TAR file that matches the current db format.
Oops... I did not look far enough down your supplied files. I downloaded db.original from your link, which is not the original, and thought your numbers were based off that.
Yeah, db.original at [1] is the uncompressed community.db without any modifications. The script uses it as the base for comparisons.

It was not clear to me exactly what compression parameters are currently used for the *.db files, so I chose 'zstd -19'. Let me know if anyone wants to see other compression algorithms/parameters; it is easy to add another experiment to this set.

I also pushed a few updates to the 'packed' format implementation, and the script is now available on GitHub [2].

[1] https://pkgbuild.com/~anatolik/db-size-experiments/data/
[2] https://github.com/anatol/pacmandb-size-analysis
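If anyone wants to try other settings quickly, the comparison can be as simple as this sketch (it shells out to the zstd, xz and gzip CLIs; the levels are just examples):

import subprocess

def compressed_size(cmd, path):
    """Run a compressor that writes to stdout and return the output size."""
    out = subprocess.run(cmd + [path], check=True, capture_output=True).stdout
    return len(out)

for label, cmd in [
    ("zstd -19", ["zstd", "-19", "-c"]),
    ("zstd -22 --ultra", ["zstd", "-22", "--ultra", "-c"]),
    ("xz -9", ["xz", "-9", "-c"]),
    ("gzip -9", ["gzip", "-9", "-c"]),
]:
    print(label, compressed_size(cmd, "db.original"))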
participants (3)

- Allan McRae
- Anatol Pomozov
- Dave Reisner