New subject: TU Application - tpkessler (Justin Kromlinger)

26 Oct 2022

      Hi! I'm Torsten Keßler (tpkessler in AUR and on GitHub) from Saarland, a
federal state in the south west of Germany. With this email
I'm applying to be become a trusted user.
After graduating with a PhD in applied mathematics this year I'm now
a post-doc with a focus on numerical analysis, the art of solving physical
problems with mathematically sound algorithms on a computer.
I've been using Arch Linux on my private machines (and at work) since my
first weeks at university ten years ago. After initial distro hopping a
friend recommended Arch. I immediately liked the way it handles packages
via pacman, its wiki and the flexibility of its installation process.

Owing to their massively parallel architecture, GPUs have emerged as the
leading platform for computationally expensive problems: Machine
Learning/AI, real-world engineering problems, simulation of complex
physical systems. For a long time, nVidia's CUDA framework (closed
source, exclusively for their GPUs) has dominated this field. In 2015,
AMD announced ROCm, their open source compute framework for GPUs. A
common interface to CUDA, called HIP, makes it possible to write code
that compiles and runs both on AMD and nVidia hardware. I've been
closely following the development of ROCm on GitHub, trying to compile
the stack from time to time. But only since 2020, the kernel includes
all the necessary code to compile the ROCm stack on Arch Linux. Around
this time I've started to contribute to rocm-arch on GitHub, a
collection of PKGBUILDs for ROCm (with around 50 packages). Soon after
that, I became the main contributor to the repository and, since 2021,
I've been the maintainer of the whole ROCm stack.

We have an active issue tracker and recently started a discussion page
for rocm-arch. Most of the open issues as of now are for bookkeeping of
patches we applied to run ROCm on Arch Linux. Many of them are linked to
an upstream issue and a corresponding pull request that fixes the
issues. This way I've already contributed code to a couple of libraries
of the ROCm stack.

Over the years, many libraries have added official support for ROCm,
including tensorflow, pytorch, python-cupy, python-numba (not actively
maintained anymore) and blender. Support of ROCm for the latter
generated large interest in the community and is one reason Sven
contacted me, asking me if I would be interested to take care of ROCm in
[community]. In its current version, ROCm support for blender works out
of the box. Just install hip-runtime-amd from the AUR and enable the HIP
backend in blender's settings for rendering. The machine learning
libraries require more dependencies from the AUR. Once installed,
pytorch and tensorflow are known to work on Vega GPUs and the recent
RDNA architecture.

My first action as a TU would be to add basic support of ROCm to
[community], i.e. the low level libraries, including HIP and an open
source runtime for OpenCL based on ROCm. That would be enough to run
blender with its ROCm backend. At the same time, I would expand the wiki
article on ROCm. The interaction with the community would also move from
the issue tracker of rocm-arch to the Arch Linux bug tracker and the
forums. In a second phase I would add the high level libraries that
would enable users to quickly compile and run complex libraries such as
tensorflow, pytorch or cupy.

#BEGIN Technical details

The minimal package list for HIP which includes the runtime libraries
for basic GPU programming and the GPU compiler (hipcc) comprises eight
packages

* rocm-cmake (basic cmake files for ROCm)
* rocm-llvm (upstream llvm with to-be-merged changes by AMD)
* rocm-device-libs (implements math functions for all GPU architectures)
* comgr (runtime library, "compiler support" for rocm-llvm)
* hsakmt-roct (interface to the amdgpu kernel driver)
* hsa-rocr (runtime for HSA compute kernels)
* rocminfo (display information on HSA agents: GPU and possibly CPU)
* hip-runtime-amd (runtime and compiler for HIP, a C++ dialect inspired
by CUDA C++)

All but rocm-llvm are small libraries under the permissive MIT license.
Since ROCm 5.2, all packages successfully build in a clean chroot and
are distributed in the community repo arch4edu.

The application libraries for numerical linear algebra, sparse matrices
or random numbers start with roc and hip (rocblas, rocsparse, rocrand).
The hip* packages are designed in such a way that they would also work
with CUDA if hip is configured with CUDA instead of a ROCm/HSA backend.
With few exceptions (rocthrust, rccl) these packages are licensed under MIT.

Possible issues:
There are three packages that are not fully working under Arch Linux or
lack an open source license. The first is rocm-gdb, a fork of gdb with
GPU support. To work properly it needs a kernel module currently not
available in upstream linux but only as part of AMD's dkms modules. But
they only work with specific kernel versions. Support for this from my
side on Arch Linux was dropped a while ago. One closed source package is
hsa-amd-aqlprofile. As the name suggests it is used for profiling as
part of rocprofiler. Above mentioned packages are only required for
debugging and profiling but are no runtime dependencies of the big
machine learning libraries or any other package with ROCm support I'm
aware of. The third package is rocm-core, a package only part of the
meta packages for ROCm with no influence on the ROCm runtime. It
provides a single header and a library with a single function that
returns the current ROCm version. No source code has been published by
AMD so far and the official package lacks a license file.

A second issue is GPU support. AMD officially only supports the
professional compute GPUs. This does not mean that ROCm is not working
on consumer cards but merely that AMD cannot guarantee all
functionalities through excessive testing. Recently, ROCm added support
for Navi 21  (RX 6800 onwards), see

https://docs.amd.com/bundle/Hardware_and_Software_Reference_Guide/page/Hardw...

I own a Vega 56 (gfx900) that is officially supported, so I can test all
packages before publishing them on the AUR / in [community].

#END Technical details

On the long term, I would like to foster Arch Linux as the leading
platform for scientific computing. This includes Machine Learning
libraries in the official repositories as well as packages for classical
"number crunching" such as petsc, trilinos and packages that depend on
them: deal-ii, dune or ngsolve.

The sponsors of my application are Sven (svenstaro) and Bruno (archange).

I'm looking forward to the upcoming the discussion and your feedback on 
my application.

Best,
Torsten

TU Application - tpkessler

Torsten Keßler

Sven-Hendrik Haase

Archange

Sven-Hendrik Haase

Sven-Hendrik Haase

Filipe Laíns

Torsten Keßler

Filipe Laíns

Justin Kromlinger

Levente Polyak

Torsten Keßler

Justin Kromlinger

Justin Kromlinger

Torsten Keßler

Sven-Hendrik Haase

Archange

tags

participants (7)