Re: [arch-general] RFC: Use x86_64-v2 architecture
Hello,

I am going to benchmark the performance difference between the various x86 uarch levels. I will be using the Phoronix Test Suite, which has some support for performing compiler and compiler-flag benchmarks.

I am opposed to dropping support for older CPUs, but I will perform this test fairly. That's why I'm posting this advance notice before performing the actual benchmarks. I am an Ubuntu user and I am concerned that Ubuntu may require x86_64-v2 in the not so distant future. I will be performing this test on Ubuntu 20.04.2 with GCC 9.3.0 [1].

I am going to use selected tests from this Phoronix article [2], but I will exclude benchmarks that do not show much performance difference between the "-O1" and "-O3" compiler flags, as:
- their build scripts may ignore the CFLAGS/CXXFLAGS variables,
- they may use some assembly code or C asm intrinsics,
- they may have separate SSE4/AVX code paths,
- the compiler is unable to optimize the code much, due to its nature.

The remaining benchmarks are those that would probably benefit the most from compiling for different uarch levels, which should be taken into account when interpreting the results. My rough comparison between "-O1" and "-O3" is at [3].

So, I will use the following tests:
    pts/scimark2 (all tests)
    pts/john-the-ripper (all tests)
    pts/graphics-magick ("swirl", "resizing", "HWB Color space")
    pts/coremark
    pts/himeno
    pts/encode-flac
    pts/c-ray

Greetings,
Mateusz Jończyk

[1] GCC 9.3 does not support -march=x86-64-v2 and similar. I will use switches like -march=nehalem instead (the exact flag sets are sketched after the selection details below).
[2] https://www.phoronix.com/scan.php?page=article&item=gcc-10900k-compiler
[3] https://openbenchmarking.org/result/2103131-HA-DRAFTUARC92

----------------------
Benchmark selection details:

Page 2 of the Phoronix article:
    cryptopp - compiles some code with the flag "-msse4.2" (see the illustration below), so skipping it,
    smhasher - same,
    fftw - has some kernels that explicitly use AVX / AVX2, so there is little point in benchmarking it,
    scimark - OK; I will also run some other tests from this benchmark where the difference between "-O1" and "-O3" is nice,

Page 3:
    TSCP - no real difference between "-O1" and "-O3" performance data,
    John The Ripper - small difference between "-O1" and "-O3", but I'll keep it for now,
    GraphicsMagick - OK, I'll choose the tests with the biggest difference between "-O1" and "-O3": "swirl", "resizing", "HWB Color space",

Page 4:
    AOM AV1 - no performance difference between "-O1" and "-O3", so skipping,
    x265 - patent-encumbered format, skipping,
    Coremark - OK, "CoreMark Size 666 - Iterations Per Second",
    Himeno - OK,
    Stockfish - enables SSSE3 and SSE4.1 by default, so leaving it out,
    FLAC Audio Encoding - OK, but the difference probably won't be big,
    Minion - tries to install many package dependencies, leaving it out for now,
    LevelDB - the benchmark suite ignores the "-O1" flag and is highly susceptible to non-quiet systems (so the results were bogus),
    GROMACS - same as Minion; additionally has a long runtime IIRC,
    Darmstadt Automotive Parallel Heterogeneous Suite (daphne) - requires huge amounts of disk space, and it seems it would be IO-bound,
    pgbench - I still have spinning HDDs, so the benchmark was IO-bound,
    NGiNX - it looks like it tests both the kernel and userspace, so leaving it out.

Additional benchmarks:
    n-queens - no difference between "-O1" and "-O3"; the build scripts seem to ignore CFLAGS,
    OpenSSL - no real difference between "-O1" and "-O3",
    c-ray - OK, let's include it.
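To illustrate why cryptopp and smhasher were excluded: when a build system passes -msse4.2 on its own, the resulting binary contains SSE4.2 code even in the "baseline" configuration, so the -march value in CFLAGS changes little. A quick way to see this (a sketch; any reasonably recent GCC should behave the same):

    # __SSE4_2__ is not defined for the baseline architecture:
    echo | gcc -x c -march=x86-64 -dM -E - | grep __SSE4_2__
    # (no output)

    # ...but a build system that adds -msse4.2 itself enables it anyway,
    # so for such a package the baseline build is not really a baseline:
    echo | gcc -x c -march=x86-64 -msse4.2 -dM -E - | grep __SSE4_2__
    # #define __SSE4_2__ 1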
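For reference, the flag sets I intend to compare, with the uarch levels approximated by CPU names since GCC 9.3 predates the -march=x86-64-v2 / -march=x86-64-v3 options (assumption: nehalem and haswell are close enough to the v2 and v3 feature sets for this purpose; CXXFLAGS are set identically):

    # Baseline (plain x86-64):
    export CFLAGS="-O3 -mtune=generic -march=x86-64"

    # Approximates x86-64-v2 (adds SSE3/SSSE3/SSE4.1/SSE4.2/POPCNT):
    export CFLAGS="-O3 -mtune=generic -march=nehalem"

    # Approximates x86-64-v3 (additionally AVX/AVX2/BMI1/BMI2/FMA/MOVBE):
    export CFLAGS="-O3 -mtune=generic -march=haswell"

Keeping -mtune=generic constant means that only the available instruction set changes between configurations, not the instruction scheduling model.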
> I am going to benchmark the performance difference between the various
> x86 uarch levels. I will be using the Phoronix Test Suite, which has
> some support for performing compiler and compiler-flag benchmarks.
I believe that, for the purpose of this discussion, that will be a waste of your time. There is no disagreement on whether a move from v1 to v3 has the potential to improve performance. And this is what such benchmarks are intended to show.
Yes, synthetic tests are expected to show an improvement, and so are specific tasks which naturally benefit from tuning for newer CPUs. But those results will not answer the important question: whether there is a considerable improvement for a typical user. Anyway, the current plan is to provide two repositories: one generic x86_64 and one for x86_64-v3. This not only solves the problem: with two repositories a user may simply choose the optimized version, and everyone can compare the options during their normal, daily work.
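To make the idea concrete: the choice could be as simple as the repository order in pacman.conf. A purely hypothetical sketch - the "[core-v3]" name below is made up and the final mechanism is not settled:

    # /etc/pacman.conf (hypothetical): pacman prefers packages from
    # repositories listed earlier, so the optimized build wins when present.
    [core-v3]
    Include = /etc/pacman.d/mirrorlist

    [core]
    Include = /etc/pacman.d/mirrorlist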
Hello,

I have run the benchmarks and here are the results:
https://openbenchmarking.org/result/2103142-HA-UARCHLEVE55

TL;DR:
- there is no or negligible performance benefit from -march=nehalem, which corresponds to x86_64-v2,
- there is a moderate benefit from -march=haswell (x86_64-v3) - around 10%-20% compared to baseline for the tests performed.

Geometric Mean Of All Test Results
Result Composite - Geometric Mean > Higher Is Better
    O1_generic ....... 367.99
    O3_generic ....... 459.84
    O3_march_nehalem . 462.89
    O3_march_haswell . 531.99

x86_64-v2:
There were only two tests in which march=nehalem was meaningfully faster than march=x86-64 (the baseline architecture). These were "graphicsmagick/Swirl" and "FLAC audio encoding". The FLAC results were quite noisy (click the "Result confidence" button above the pie chart to show the data), so the benefits may not be statistically significant. Swirl appeared to be only around 4% faster.

I was surprised, because I had thought that the benefits would be somewhere around 5-10%. It looks like GCC's autovectorisation does not make much use of the instructions added in SSE3/SSSE3/SSE4.

x86_64-v3:
The geometric mean of the test results was around 15% higher on march=haswell than on baseline x86_64. Apart from john-the-ripper/md5, the tests were up to 36% faster, with a median performance increase of around 10%. [1]

As described in my previous email, I have excluded tests that use dedicated code paths for processors supporting AVX/AVX2/etc. - I saw little point in benchmarking them. I have also excluded some tests with little difference between the -O1 and -O3 optimization levels, as it appears that the compiler has little work to do there. So the real-world performance benefits of compiling the whole of Arch for x86_64-v3 would probably be smaller.

I think that many workloads of a "typical user" are I/O bound. The limiting factor is likely to be the HDD/SSD, network throughput / latency, or memory speed.

Limitations:
- GCC 9.3.0 was used, which is not the most recent compiler available.

Further research:
- benchmarking web browser performance, as this is what matters most for many users,
- comparing battery usage (the Phoronix Test Suite has support for this when running benchmarks). I do not think it will be much different from the performance data, though.

How to reproduce:
    export CFLAGS="-O1 -mtune=generic -march=x86-64"
    export CXXFLAGS="-O1 -mtune=generic -march=x86-64"
    phoronix-test-suite benchmark 2103142-HA-UARCHLEVE55

    export CFLAGS="-O3 -mtune=generic -march=x86-64"
    export CXXFLAGS="-O3 -mtune=generic -march=x86-64"
    phoronix-test-suite benchmark $name_of_test_identifier_specified_before
    #etc.

Conflict of interest: I am opposed to increasing baseline x86_64 requirements in general-purpose distributions.

Greetings,
Mateusz

[1] Visit https://openbenchmarking.org/result/2103142-HA-UARCHLEVE55&rmm=O1_generic%2CO3_march_nehalem and scroll slightly lower.
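P.S. For readers who want to check which levels their own CPU supports: on a system with glibc 2.33 or newer (an assumption - older loaders do not print this), the dynamic loader reports the detected levels, and GCC can be asked what it would pick for -march=native:

    # The loader lists the glibc-hwcaps levels it detected, e.g.:
    /lib64/ld-linux-x86-64.so.2 --help | grep "x86-64-v"
    #   x86-64-v3 (supported, searched)
    #   x86-64-v2 (supported, searched)

    # What -march=native resolves to on this machine:
    gcc -march=native -Q --help=target | grep -- "-march="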