[arch-general] Suggestion: switch to zstd -19 for compressing packages over xz
Hi,

It's now been about half a year since support for zstd landed in our packaging tools. I've been quietly using it for all my locally built packages since then with no issues. I think it would be worthwhile to have a discussion about whether to use zstd for officially built packages. Here is a brief summary of negatives and positives:

Negatives:
* Changing things takes time and might break someone's workflow.
* Zstd -19 results in slightly larger files than xz -6 (default).

Positives:
* Change would be invisible to most users.
* Would keep Arch Linux on top of the latest in packaging tech.
* Updating would be much faster for most users.

To expand on that last point: for reasonably fast connections, the additional time required to decompress xz-compressed packages means that updating can actually take more time than it would for packages that are not compressed at all! Because zstd is designed to have very fast decompression, for a wide range of modern broadband connections zstd -19 is the fastest algorithm to download and decompress. For example, check out this compression test (with the Mozilla dataset): https://quixdb.github.io/squash-benchmark/unstable/

Or look at my local test with the most recent Firefox package:

Tool      Compression time   Size    Time to DL (100 Mbps) + decompress
xz -6     5m 53s             49 MB   0m 21s
zstd -19  6m 0s              53 MB   0m 6s

So while xz and zstd compress in about the same amount of time and result in files of similar size, from the user's standpoint zstd results in much faster updates. Multiply this by a few hundred packages and you have a pretty substantial effect.

I look forward to discussing this with you all.

Cheers, Adam
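For anyone who wants to reproduce a comparison like the Firefox test above locally, a rough sketch follows. The tarball name is a placeholder for any uncompressed package archive; the compression levels are the ones discussed in this thread.

    # Compress the same uncompressed archive with both tools and compare sizes.
    # "firefox.tar" is a placeholder file name.
    time xz -6 -k -c firefox.tar > firefox.tar.xz
    time zstd -19 -q firefox.tar -o firefox.tar.zst
    ls -lh firefox.tar.xz firefox.tar.zst

The last column of the table is then just the download time for the compressed file at your link speed plus the time to decompress it.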
I have 4Mbps (512KBytes/s) 'broad'band and i7-6500U CPU. I wanna cry.

On Sat, Mar 16, 2019, 13:38 Adam Fontenot via arch-general <arch-general@archlinux.org> wrote:
Hi,
It's now been about half a year since support for zstd landed in our packaging tools. I've been quietly using it for all my locally built packages since then with no issues. I think it would be worthwhile to have a discussion about whether to use zstd for officially built packages. Here is a brief summary of negatives and positives:
Negatives:
* Changing things takes time and might break someone's workflow.
* Zstd -19 results in slightly larger files than xz -6 (default).
Positives:
* Change would be invisible to most users.
* Would keep Arch Linux on top of the latest in packaging tech.
* Updating would be much faster for most users.
To expand on that last point: for reasonably fast connections, the additional time required to decompress xz compressed packages means that updating can actually take more time than it would for packages that are not compressed at all! Because zstd is designed to have very fast decompression, for a wide range of modern broadband connections zstd -19 is the fastest algorithm to download and decompress. For example, check out this compression test (with the Mozilla dataset): https://quixdb.github.io/squash-benchmark/unstable/
Or look at my local test with the most recent Firefox package:
Tool      Compression time   Size    Time to DL (100 Mbps) + decompress
xz -6     5m 53s             49 MB   0m 21s
zstd -19  6m 0s              53 MB   0m 6s
So while xz and zstd compress in about the same amount of time and result in files of similar size, from the user's standpoint zstd results in much faster updates. Multiply this by a few hundred packages and you have a pretty substantial effect.
I look forward to discussing this with you all.
Cheers, Adam
On Fri, Mar 15, 2019 at 11:10 PM Darren Wu via arch-general <arch-general@archlinux.org> wrote:
I have 4Mbps (512KBytes/s) 'broad'band and i7-6500U CPU. I wanna cry.
Even in a worst-case scenario like this one, the Squash compression test I linked to shows an increase from 26 secs (xz -6) to 29 secs (zstd -19), so you wouldn't be significantly impacted by this change. I hope your local authorities decide to give you real broadband in the near future, however. :-)
On 3/16/19 1:30 AM, Adam Fontenot via arch-general wrote:
On Fri, Mar 15, 2019 at 11:10 PM Darren Wu via arch-general <arch-general@archlinux.org> wrote:
I have 4Mbps (512KBytes/s) 'broad'band and i7-6500U CPU. I wanna cry.
Even in a worst-case scenario like this one, the Squash compression test I linked to shows an increase from 26 secs (xz -6) to 29 secs (zstd -19), so you wouldn't be significantly impacted by this change.
I hope your local authorities decide to give you real broadband in the near future, however. :-)
My situation is similar to Darren's: My primary connection to the internet is through my cell phone carrier and a mobile WiFi hot spot. In urban areas, I can get as much as 50 megabits per second, but presently, due to my remote location, it's around 5 or 6. I also have a monthly data cap, which I share with my wife, and only WiFi (i.e., no wires; that nice 300 megabits from hot spot to device is shared by all devices, and there's a per-device limit, too). FWIW, I have an i7-7700HQ CPU.

In the old days (when large files were a megabyte or two and network bandwidth was measured in kilobits per second), we assumed that the network was the bottleneck. I think what Adam is proposing is that things are different now, and that the CPU is the bottleneck. As always, it depends. :-)

My vote, whether it has any weight or not, is for higher compression ratios at the expense of CPU cycles when decompressing; i.e., xz rather than zstd. Also, consider that the 10% increase in archive size is suffered repeatedly as servers store and propagate new releases, but that the increase in decompression time is only suffered by the end user once, likely during a manual update operation or an automated background process, where it doesn't matter much.

I used to have this argument with coworkers over build times and wake-from-sleep times. Is the extra time to decompress archives really killing anyone's productivity? Are users choosing OS distros based on how long it takes to install Open Office? Are Darren and I dinosaurs, doomed to live in a world where everyone else has a multi-gigabit per second internet connection and a cell phone class CPU?

Jokingly, but not as much as you think,
Dan
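Whether the CPU or the network is actually the bottleneck on a particular machine is easy to measure directly. A minimal sketch, assuming a pair of archives with identical contents; the .pkg.tar.* names are placeholders:

    # Time only the decompression step for each format on your own hardware.
    time xz -d -c firefox.pkg.tar.xz > /dev/null
    time zstd -d -c firefox.pkg.tar.zst > /dev/null

Comparing those timings against the raw download time (roughly size in megabytes times 8, divided by link speed in Mbit/s) shows which side dominates for a given connection.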
Hi Dan,
I hope your local authorities decide to give you real broadband in the near future, however. :-)
My situation is similar to Darren's: My primary connection to the internet is through my cell phone carrier and a mobile WiFi hot spot.
I'm UK mainland, get about 580 KiB/s download, and pay per byte during the day, which is why I try and get most package updates during the toll-free midnight hours, but sometimes I need a new package and can't delay.
My vote, whether it has any weight or not, is for higher compression ratios at the expense of CPU cycles when decompressing; i.e., xz rather than zstd.
I'd also favour fewer bytes, but would suggest replacing xz with lzip, as xz has quite a few flaws in its file format:

"Xz format inadequate for long-term archiving"
https://www.nongnu.org/lzip/xz_inadequate.html

https://www.nongnu.org/lzip/lzip_benchmark.html#xz compares the two in various ways, and explains how xz's -9 allocates twice the dictionary memory of lzma's and lzip's -9 and thus initially looks better unless the playing field is levelled.

--
Cheers, Ralph.
On 16/03/2019 13:24, Ralph Corderoy wrote:
suggest replacing xz with lzip as xz has quite a few flaws in its file format.
I seem to remember that this has been debunked/ruled out as irrelevant to package distribution every time it has been proposed to Debian; e.g. it even has worse decompression times than xz.

https://lists.debian.org/debian-devel/2015/07/msg00377.html
https://www.reddit.com/r/linux/comments/58swzc/xz_format_inadequate_for_long...
On 03/16/19 at 10:59pm, Jonathon Fernyhough wrote:
On 16/03/2019 13:24, Ralph Corderoy wrote:
suggest replacing xz with lzip as xz has quite a few flaws in its file format.
I seem to remember that this has been debunked/ruled out as irrelevant to package distribution every time it has been proposed to Debian; e.g. it even has worse decompression times than xz.
Apart from that, we have other important factors:

* Is the archiving format reproducible? This is important for reproducible builds (a rough check of what that means in practice is sketched below).
* zstd has been discussed in #archlinux-pacman some time ago, but no decision has been taken yet.

To be quite honest, I don't believe this is one of the most pressing issues in Arch :)
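For a fixed input, zstd's output is deterministic for a given version and option set, so the work is mostly in producing a reproducible tar stream first. A minimal sketch of the kind of check that matters, using GNU tar's metadata-normalising flags; this is an illustration, not what makepkg actually runs, and "pkg-dir/" is a placeholder for the built package contents:

    # Build the same archive twice with normalised metadata and compare hashes.
    for i in 1 2; do
        tar --sort=name --mtime=@0 --owner=0 --group=0 --numeric-owner \
            -cf "pkg-$i.tar" pkg-dir/
        zstd -19 -q -f -o "pkg-$i.tar.zst" "pkg-$i.tar"
    done
    sha256sum pkg-?.tar.zst   # identical hashes => the pipeline is reproducible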
On Sat, Mar 16, 2019 at 5:01 AM Dan Sommers <2QdxY4RzWzUUiLuE@potatochowder.com> wrote:
My situation is similar to Darren's: My primary connection to the internet is through my cell phone carrier and a mobile WiFi hot spot. In urban areas, I can get as much as 50 megabits per second, but presently, due to my remote location, it's around 5 or 6. I also have a monthly data cap, which I share with my wife, and only WiFi (i.e., no wires; that nice 300 megabits from hot spot to device is shared by all devices, and there's a per device limit, too). FWIW, I have an i7-7700HQ CPU.
In the old days (when large files were a megabyte or two and network bandwidth was measured in kilobits per second), we assumed that the network was the bottleneck. I think what Adam is proposing is that things are different now, and that the CPU is the bottleneck. As always, it depends. :-)
My vote, whether it has any weight or not, is for higher compression ratios at the expense of CPU cycles when decompressing; i.e., xz rather than zstd. Also, consider that the 10% increase in archive size is suffered repeatedly as servers store and propagate new releases, but that the increase in decompression time is only suffered by the end user once, likely during a manual update operation or an automated background process, where it doesn't matter much.
I used to have this argument with coworkers over build times and wake-from-sleep times. Is the extra time to decompress archives really killing anyone's productivity? Are users choosing OS distros based on how long it takes to install Open Office? Are Darren and I dinosaurs, doomed to live in a world where everyone else has a multi-gigabit per second internet connection and a cell phone class CPU?
Jokingly, but not as much as you think, Dan
I think you're overstating your case a little bit. In the United States, nothing less than 25 Mbps can legally be called broadband, and the average download speed is approaching 100 Mbps (90% of us have access to 25 Mbps or better internet). Zstd -19 is faster overall than xz -6 starting at around 20 Mbps, so it's a better choice even on some sub-broadband connections. Your CPU's PassMark score is only about 50% better than that of the one used in the Squash compression test, so I don't know that the computer speed element is significant.

Furthermore, if space saving is the primary concern, why are we using the default xz -6 option rather than something stronger like -9?

I support using zstd because even in the absolute worst case (instant decompression), you're looking at less than a 10% increase in upgrade time, while for most users a reduction of 50% would not be atypical (lzma is slow!). I'm not suggesting throwing out all concerns about disk space and transfer time, I'm just suggesting that times have changed *somewhat*, and that for most users zstd may provide a better trade-off. In my case (100 Mbit connection), which is close to the US average, downloading and decompressing the latest Firefox package would take less than 1/3 the time it currently takes if we switched to zstd.

Adam
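To make the break-even reasoning concrete, here is a rough model of total update time as download time plus decompression time. The sizes come from the Firefox table earlier in the thread; the per-package decompression times are assumed values for illustration, so the exact crossover point will vary with the data set (the ~20 Mbps figure above comes from the Squash Mozilla results, not these numbers).

    # total time per package = download time + decompression time
    awk 'BEGIN {
        xz_mb = 49;  xz_dec = 17    # xz -6: size (MB), assumed decompress time (s)
        zs_mb = 53;  zs_dec = 2     # zstd -19: size (MB), assumed decompress time (s)
        n = split("4 10 25 50 100", mbps)
        printf "%-8s %-8s %-8s\n", "Mbit/s", "xz -6", "zstd-19"
        for (i = 1; i <= n; i++)
            printf "%-8s %-8.0f %-8.0f\n", mbps[i],
                   xz_mb*8/mbps[i] + xz_dec, zs_mb*8/mbps[i] + zs_dec
    }'

With these particular numbers zstd -19 comes out ahead at every listed speed; a bigger size difference or a smaller decompression-time difference raises the break-even bandwidth.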
On Sun, 17 Mar 2019 00:05:50 -0700, Adam Fontenot wrote:
I think you're overstating your case a little bit. In the United States, nothing less than 25 Mbps can legally be called broadband
There are several countries that are still developing when it comes to Internet access, such as Germany, where ISPs don't fulfil their contracts. In the past German customers got a price reduction; nowadays they don't even get that anymore. It's probably illegal here, too, but there is nothing a customer can do about it. Legal action would most likely lead nowhere.

--
pacman -Q linux{,-rt{-cornflower,,-securityink,-pussytoes}}|cut -d\ -f2
5.0.2.arch1-1
4.19.25_rt16-0
4.19.23_rt13-0.1
4.19.15_rt12-0
4.18.16_rt9-1
On Fri, 15 Mar 2019 22:38:15 -0700 Adam Fontenot via arch-general <arch-general@archlinux.org> wrote:
Hi,
It's now been about half a year since support for zstd landed in our packaging tools. I've been quietly using it for all my locally built packages since then with no issues. I think it would be worthwhile to have a discussion about whether to use zstd for officially built packages. Here is a brief summary of negatives and positives:
Negatives:
* Changing things takes time and might break someone's workflow.
* Zstd -19 results in slightly larger files than xz -6 (default).
Positives:
* Change would be invisible to most users.
* Would keep Arch Linux on top of the latest in packaging tech.
* Updating would be much faster for most users.
To expand on that last point: for reasonably fast connections, the additional time required to decompress xz compressed packages means that updating can actually take more time than it would for packages that are not compressed at all! Because zstd is designed to have very fast decompression, for a wide range of modern broadband connections zstd -19 is the fastest algorithm to download and decompress. For example, check out this compression test (with the Mozilla dataset): https://quixdb.github.io/squash-benchmark/unstable/
Or look at my local test with the most recent Firefox package:
Tool      Compression time   Size    Time to DL (100 Mbps) + decompress
xz -6     5m 53s             49 MB   0m 21s
zstd -19  6m 0s              53 MB   0m 6s
So while xz and zstd compress in about the same amount of time and result in files of similar size, from the user's standpoint zstd results in much faster updates. Multiply this by a few hundred packages and you have a pretty substantial effect.
I look forward to discussing this with you all.
Cheers, Adam

What would be the reason for changing to a slower, less efficient method, apart from it perhaps being someone's pet toy?
I say stick with what we have: it works, is more efficient, and is quicker.

Pete.
On 3/16/19 7:03 AM, pete via arch-general wrote:
What would be the reason for changing to a slower, less efficient method, apart from it perhaps being someone's pet toy?
It's only less efficient because networks are faster than CPUs. If I had a *multi-terabit* per second network, and multi-*kilohertz* 8-bit CPUs on both ends, then zstd might be more efficient for the whole compress, transmit, decompress cycle.

Arch can potentially control the CPU at the servers, but neither the network(s) nor the user's CPUs, which means that I'm agreeing with your conclusion rather than your premise. :-)
On Sat, 16 Mar 2019 07:11:31 -0500 Dan Sommers <2QdxY4RzWzUUiLuE@potatochowder.com> wrote:
On 3/16/19 7:03 AM, pete via arch-general wrote:
What would be the reason for changing to a slower, less efficient method, apart from it perhaps being someone's pet toy?
It's only less efficient because networks are faster than CPUs. If I had a *multi-terabit* per second network, and multi-*kilohertz* 8-bit CPUs on both ends, then zstd might be more efficient for the whole compress, transmit, decompress cycle.
Arch can potentially control the CPU at the servers, but neither the network(s) nor the user's CPUs, which means that I'm agreeing with your conclusion rather than your premise. :-)
Yeah, it's the CPU processing itself that's the bottleneck, and more importantly, one CPU cache miss and you're taking that flight from Pluto to the center of the sun and back, all to retrieve some forgotten data.
participants (9)
- Adam Fontenot
- Dan Sommers
- Darren Wu
- Jelle van der Waa
- Jonathon Fernyhough
- Michael Lojkovic
- pete
- Ralf Mardorf
- Ralph Corderoy