[pacman-dev] Pacman database size study

Wed Jan 22 22:18:39 UTC 2020

Hi

On Wed, Jan 22, 2020 at 2:03 PM Allan McRae <allan at archlinux.org> wrote:
>
> On 23/1/20 2:03 am, Anatol Pomozov wrote:
> > Hello
> >
> > On Wed, Jan 22, 2020 at 2:23 AM Allan McRae <allan at archlinux.org> wrote:
> >>
> >> On 22/1/20 6:54 pm, Anatol Pomozov wrote:
> >>> The first experiment is to parse db tarfile using the script and then
> >>> write it back to a file:
> >>>   uncompressed size is 17757184 that is equal to original sample
> >>>   'zstd -19' compressed size is 4366994 that is 1.0084540990896713
> >>> times better than original sample
> >>>
> >>> Tar *entries* content is identical to the original file. Uncompressed
> >>> size is exactly the same. Compressed (zstd -19) size is 0.8% better.
> >>> It comes from the fact that my script does not set entries user/group
> >>> value and neither sets tar entries modification time. I am not sure if
> >>> this information is actually used by pacman. Modification time
> >>> contains a lot of entropy that compressor does not like.
> >>
> >> tl;dr
> >>
> >> "original"      4366994
> >> no md5          4188019
> >> no pgp          1160912
> >> no md5+pgp      1021667
> >>
> >>
> >> But do any of these numbers stand if you keep the tar file?
> >
> > I do not fully understand your question here. plainXXX+uncomressed is
> > a TAR file that matches current db format.
> >
>
> Oops...  Did not look down far enough your supplied files.  I downloaded
> db.original from your link, which is not original, and thought your
> numbers were based off that.

Yeah, db.original at [1] is uncompressed community.db without any
modifications. The script uses it as a base for comparisons.
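
To illustrate the metadata point from the quoted experiment (owner
and modification time not being set), here is a minimal sketch of
rewriting a tar file with those fields zeroed. This is not the actual
script; the function name normalize_tar and the choice of GNU tar
format are assumptions made for the example.

```python
import io
import tarfile


def normalize_tar(data: bytes) -> bytes:
    """Rewrite an uncompressed tar archive with owner and mtime zeroed.

    Hypothetical helper for illustration only: entry names and contents
    are copied unchanged, while uid/gid, uname/gname and mtime are reset.
    """
    out = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as src, \
         tarfile.open(fileobj=out, mode="w:",
                      format=tarfile.GNU_FORMAT) as dst:  # format is an assumption
        for member in src:
            member.mtime = 0                 # drop high-entropy timestamps
            member.uid = member.gid = 0      # drop numeric owner
            member.uname = member.gname = ""  # drop owner names
            # Only regular files carry a payload; directories do not.
            payload = src.extractfile(member) if member.isfile() else None
            dst.addfile(member, payload)
    return out.getvalue()
```

Compressing the output and the input with the same settings then
shows how much of the size is due to this metadata alone.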

It was not clear to me exactly which compression parameters are
currently used for *.db files, so I chose 'zstd -19'. Let me know if
anyone wants to see other compression algorithms/parameters; it is
easy to add another experiment to this set.
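
As a rough sketch of what such an additional experiment could look
like (again, not the actual script), the snippet below compresses a
baseline and a candidate buffer with several settings and reports the
ratio in the same "times better than original" sense used above. To
stay self-contained it uses zlib, bz2 and lzma from the Python
standard library as stand-ins; it does not run zstd. The names
COMPRESSORS and compare are made up for the example.

```python
import bz2
import lzma
import zlib

# Stand-in compressors from the standard library (assumption: zstd is
# deliberately left out so the sketch has no external dependencies).
COMPRESSORS = {
    "zlib -9": lambda d: zlib.compress(d, 9),
    "bz2 -9": lambda d: bz2.compress(d, 9),
    "xz -9": lambda d: lzma.compress(d, preset=9),
}


def compare(baseline: bytes, candidate: bytes) -> dict:
    """Return {name: (baseline_size, candidate_size, ratio)}.

    ratio > 1 means the candidate compresses better than the baseline.
    """
    results = {}
    for name, compress in COMPRESSORS.items():
        base_size = len(compress(baseline))
        cand_size = len(compress(candidate))
        results[name] = (base_size, cand_size, base_size / cand_size)
    return results


def report(baseline: bytes, candidate: bytes) -> None:
    """Print one line per compressor."""
    for name, (base, cand, ratio) in compare(baseline, candidate).items():
        print(f"{name}: {cand} bytes, {ratio:.4f} times better than baseline ({base})")
```

Adding another algorithm or level is then a one-line change to the
COMPRESSORS table.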

I also pushed a few updates to the 'packed' format implementation, and
the script is now available on GitHub [2].

[1] https://pkgbuild.com/~anatolik/db-size-experiments/data/
[2] https://github.com/anatol/pacmandb-size-analysis