On 26.10.22 08:30, Torsten Keßler wrote:
Hi! I'm Torsten Keßler (tpkessler on the AUR and on GitHub) from Saarland, a federal state in the south-west of Germany. With this email I'm applying to become a trusted user. After graduating with a PhD in applied mathematics this year, I'm now a post-doc with a focus on numerical analysis, the art of solving physical problems with mathematically sound algorithms on a computer. I've been using Arch Linux on my private machines (and at work) since my first weeks at university ten years ago. After some initial distro hopping, a friend recommended Arch. I immediately liked the way it handles packages via pacman, its wiki and the flexibility of its installation process.
Owing to their massively parallel architecture, GPUs have emerged as the leading platform for computationally expensive problems: machine learning/AI, real-world engineering problems and the simulation of complex physical systems. For a long time, NVIDIA's CUDA framework (closed source, exclusive to their GPUs) has dominated this field. In 2015, AMD announced ROCm, their open source compute framework for GPUs. A common interface to CUDA, called HIP, makes it possible to write code that compiles and runs on both AMD and NVIDIA hardware. I've been closely following the development of ROCm on GitHub, trying to compile the stack from time to time. But only since 2020 has the kernel included all the code necessary for the ROCm stack on Arch Linux. Around that time I started contributing to rocm-arch on GitHub, a collection of PKGBUILDs for ROCm (around 50 packages). Soon after that, I became the main contributor to the repository and, since 2021, I have been the maintainer of the whole ROCm stack.
We have an active issue tracker and recently started a discussion page for rocm-arch. Most of the currently open issues are for bookkeeping of patches we applied to run ROCm on Arch Linux. Many of them are linked to an upstream issue and a corresponding pull request that fixes it. This way I've already contributed code to a couple of libraries in the ROCm stack.
Over the years, many libraries have added official support for ROCm, including tensorflow, pytorch, python-cupy, python-numba (no longer actively maintained) and blender. ROCm support for the latter generated considerable interest in the community and is one reason Sven contacted me, asking whether I would be interested in taking care of ROCm in [community]. In its current version, ROCm support for blender works out of the box: just install hip-runtime-amd from the AUR and enable the HIP backend in blender's rendering settings. The machine learning libraries require more dependencies from the AUR. Once installed, pytorch and tensorflow are known to work on Vega GPUs and the recent RDNA architectures.
My first action as a TU would be to add basic ROCm support to [community], i.e. the low-level libraries, including HIP and an open source OpenCL runtime based on ROCm. That would be enough to run blender with its HIP backend. At the same time, I would expand the wiki article on ROCm. The interaction with the community would also move from the rocm-arch issue tracker to the Arch Linux bug tracker and the forums. In a second phase I would add the high-level libraries that enable users to quickly compile and run complex libraries such as tensorflow, pytorch or cupy.
#BEGIN Technical details
The minimal package list for HIP, which includes the runtime libraries for basic GPU programming and the GPU compiler (hipcc), comprises eight packages (a small usage sketch follows the list):
* rocm-cmake (basic CMake files for ROCm)
* rocm-llvm (upstream LLVM with to-be-merged changes by AMD)
* rocm-device-libs (implements math functions for all GPU architectures)
* comgr (runtime library, "compiler support" for rocm-llvm)
* hsakmt-roct (interface to the amdgpu kernel driver)
* hsa-rocr (runtime for HSA compute kernels)
* rocminfo (displays information on HSA agents: GPU and possibly CPU)
* hip-runtime-amd (runtime and compiler for HIP, a C++ dialect inspired by CUDA C++)
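To give an idea of what this minimal stack provides, here is a small sketch (file and binary names are purely illustrative) that queries the GPU agents visible to the HIP runtime; it should build with hipcc once hip-runtime-amd is installed:

  // Query the GPUs visible to the HIP runtime (sketch, minimal error handling).
  // Build (illustrative): hipcc hipinfo.cpp -o hipinfo
  #include <hip/hip_runtime.h>
  #include <cstdio>

  int main() {
      int count = 0;
      if (hipGetDeviceCount(&count) != hipSuccess || count == 0) {
          std::fprintf(stderr, "no HIP devices found\n");
          return 1;
      }
      for (int i = 0; i < count; ++i) {
          hipDeviceProp_t prop;
          hipGetDeviceProperties(&prop, i);
          // gcnArchName reports the GPU ISA, e.g. gfx900 for a Vega 56
          std::printf("device %d: %s (%s)\n", i, prop.name, prop.gcnArchName);
      }
      return 0;
  }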
All but rocm-llvm are small libraries under the permissive MIT license. Since ROCm 5.2, all packages build successfully in a clean chroot and are distributed through the community-maintained repository arch4edu.
The application libraries for numerical linear algebra, sparse matrices or random numbers are prefixed with roc and hip (rocblas, rocsparse, rocrand). The hip* packages are designed so that they also work with CUDA if HIP is configured with a CUDA backend instead of ROCm/HSA. With a few exceptions (rocthrust, rccl) these packages are licensed under MIT.
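As a sketch of how this wrapper layer looks from the user's side (assuming the hipblas package from the AUR together with its rocblas backend; the build command and include path are illustrative and may differ between ROCm versions), the same source would also link against cuBLAS when HIP is configured for the CUDA backend:

  // SAXPY through hipBLAS (sketch, minimal error handling).
  // Build (illustrative): hipcc saxpy.cpp -lhipblas -o saxpy
  #include <hip/hip_runtime.h>
  #include <hipblas.h>
  #include <cstdio>
  #include <vector>

  int main() {
      const int n = 1 << 20;
      std::vector<float> x(n, 1.0f), y(n, 2.0f);

      float *dx = nullptr, *dy = nullptr;
      hipMalloc((void **)&dx, n * sizeof(float));
      hipMalloc((void **)&dy, n * sizeof(float));
      hipMemcpy(dx, x.data(), n * sizeof(float), hipMemcpyHostToDevice);
      hipMemcpy(dy, y.data(), n * sizeof(float), hipMemcpyHostToDevice);

      hipblasHandle_t handle;
      hipblasCreate(&handle);
      const float alpha = 3.0f;
      // y := alpha * x + y, dispatched to rocBLAS (or cuBLAS on the CUDA backend)
      hipblasSaxpy(handle, n, &alpha, dx, 1, dy, 1);
      hipblasDestroy(handle);

      hipMemcpy(y.data(), dy, n * sizeof(float), hipMemcpyDeviceToHost);
      std::printf("y[0] = %.1f (expected 5.0)\n", y[0]);

      hipFree(dx);
      hipFree(dy);
      return 0;
  }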
Possible issues: There are three packages that are not fully working under Arch Linux or lack an open source license. The first is rocm-gdb, a fork of gdb with GPU support. To work properly it needs a kernel module that is not available in the upstream kernel but only as part of AMD's dkms modules, which in turn only work with specific kernel versions. I dropped support for it on Arch Linux a while ago. The second is hsa-amd-aqlprofile, which is closed source; as the name suggests, it is used for profiling as part of rocprofiler. These two packages are only required for debugging and profiling and are not runtime dependencies of the big machine learning libraries or of any other package with ROCm support that I'm aware of. The third package is rocm-core, which is only part of the ROCm meta packages and has no influence on the ROCm runtime. It provides a single header and a library with a single function that returns the current ROCm version. AMD has not published its source code so far and the official package lacks a license file.
A second issue is GPU support. AMD officially supports only the professional compute GPUs. This does not mean that ROCm does not work on consumer cards, but merely that AMD cannot guarantee all functionality through extensive testing. Recently, ROCm added support for Navi 21 (RX 6800 onwards), see
https://docs.amd.com/bundle/Hardware_and_Software_Reference_Guide/page/Hardw...
I own a Vega 56 (gfx900) that is officially supported, so I can test all packages before publishing them on the AUR / in [community].
#END Technical details
In the long term, I would like to help make Arch Linux the leading platform for scientific computing. This includes machine learning libraries in the official repositories as well as packages for classical "number crunching" such as petsc and trilinos, and packages that depend on them: deal-ii, dune or ngsolve.
The sponsors of my application are Sven (svenstaro) and Bruno (archange).
I'm looking forward to the upcoming discussion and your feedback on my application.
Best, Torsten
Everyone's had ample time to discuss Torsten's application. It is time to cast your votes:

https://aur.archlinux.org/tu/141

Voting will conclude in one week, on 2022-11-20.

Sven