On Mon, 7 Nov 2022 19:05:16 +0000 Torsten Keßler <t.kessler@posteo.de> wrote:
Hi Justin!
> Some ROC repositories include documentation (cmake, device libs, hip), maybe it would make sense to include those in `/usr/share/doc/${pkgname}`?

That's a very good idea. For some packages, AMD bundles the documentation with the package itself (rocm-dbgapi); sometimes it's shipped separately, see hip-doc [1].
> The limited support of ROCm has been one of the main things locking me into Nvidia for my workstations.

Yes, that's really the main drawback of ROCm. CUDA works on almost any Nvidia GPU (even on mobile variants). I hope AMD will change their policy with Navi 30+.
> Have you tried contacting AMD about `rocm-core`?

Others already did. AMD support promised to release the source code in March [2].
> Finding information about ROCm support in consumer cards really isn't easy – but I guess with CUDA I just expect it to work with recent Nvidia cards?

Do you mean the common HIP abstraction layer (like hipfft, hipblas, ...)? Yes, that should work with any recent CUDA version. But I haven't tried this as I don't have access to an Nvidia GPU. Furthermore, this feature (HIP with CUDA) has never been requested by the community at rocm-arch. I think Nvidia users just stick with CUDA and don't need HIP.
I mean with ROCm I'm not sure if a GPU I'm going to buy will support it.
> Maybe it would be a good idea to provide testing scripts / documents for them, so they can report back once you push things into testing?

Absolutely! There are the HIP examples [3] from AMD, which check basic HIP language features. Additionally, we have `rocm-validation-suite`, which offers several tests.
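A minimal sketch of what such a testing script could look like (the tool list and the report format here are hypothetical, not an existing rocm-arch script):

```python
#!/usr/bin/env python3
"""Smoke test: report which ROCm command-line tools are installed."""
import shutil

# Tools a basic ROCm install should provide; the exact list is illustrative.
TOOLS = ["rocminfo", "hipcc", "clinfo"]

def check_tools(tools):
    """Return {tool: True/False} depending on whether it is on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}

def report(tools=TOOLS):
    """Print one line per tool so testers can paste the output in a report."""
    status = check_tools(tools)
    for tool, found in status.items():
        print(f"{tool}: {'found' if found else 'MISSING'}")
    return status

if __name__ == "__main__":
    report()
```

Testers without any ROCm knowledge could run this and paste the output when reporting back on a [testing] push.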
> Having a list of tested cards in the wiki would be great as well.

I agree! Once we have an established test suite, this should be straightforward.
Best!
Torsten
[1] http://repo.radeon.com/rocm/apt/5.3/pool/main/h/hip-doc/
[2] https://github.com/RadeonOpenCompute/ROCm/issues/1705#issuecomment-108159928...
[3] https://github.com/ROCm-Developer-Tools/HIP-Examples
On 06.11.22 at 23:10, aur-general-request@lists.archlinux.org wrote:
Message: 1
Date: Sun, 6 Nov 2022 20:01:14 +0100
From: Justin Kromlinger <hashworks@archlinux.org>
Subject: Re: TU Application - tpkessler
To: aur-general@lists.archlinux.org
Message-ID: <20221106200114.438404af@maker.hashworks.net>
Hi Torsten!
On Wed, 26 Oct 2022 06:30:33 +0000 Torsten Keßler <t.kessler@posteo.de> wrote:
> Hi! I'm Torsten Keßler (tpkessler in AUR and on GitHub) from Saarland, a federal state in the south west of Germany. With this email I'm applying to become a trusted user. After graduating with a PhD in applied mathematics this year, I'm now a post-doc with a focus on numerical analysis, the art of solving physical problems with mathematically sound algorithms on a computer. I've been using Arch Linux on my private machines (and at work) since my first weeks at university ten years ago. After some initial distro hopping, a friend recommended Arch. I immediately liked the way it handles packages via pacman, its wiki and the flexibility of its installation process.

Soon we can switch the Arch Linux IRC main language to German!
> Owing to their massively parallel architecture, GPUs have emerged as the leading platform for computationally expensive problems: machine learning/AI, real-world engineering problems, and the simulation of complex physical systems. For a long time, nVidia's CUDA framework (closed source, exclusively for their GPUs) has dominated this field. In 2015, AMD announced ROCm, their open source compute framework for GPUs. A common interface to CUDA, called HIP, makes it possible to write code that compiles and runs on both AMD and nVidia hardware. I've been closely following the development of ROCm on GitHub, trying to compile the stack from time to time. But only since 2020 has the kernel included all the code necessary to compile the ROCm stack on Arch Linux. Around this time I started to contribute to rocm-arch on GitHub, a collection of PKGBUILDs for ROCm (with around 50 packages). Soon after that, I became the main contributor to the repository and, since 2021, I've been the maintainer of the whole ROCm stack. We have an active issue tracker and recently started a discussion page for rocm-arch. Most of the open issues as of now are for bookkeeping of patches we applied to run ROCm on Arch Linux. Many of them are linked to an upstream issue and a corresponding pull request that fixes the issue. This way I've already contributed code to a couple of libraries of the ROCm stack.
> Over the years, many libraries have added official support for ROCm, including tensorflow, pytorch, python-cupy, python-numba (not actively maintained anymore) and blender. ROCm support for the latter generated great interest in the community and is one reason Sven contacted me, asking whether I would be interested in taking care of ROCm in [community]. In its current version, ROCm support for blender works out of the box: just install hip-runtime-amd from the AUR and enable the HIP backend in blender's settings for rendering. The machine learning libraries require more dependencies from the AUR. Once installed, pytorch and tensorflow are known to work on Vega GPUs and the recent RDNA architecture.
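A tiny sketch for checking such an install. This assumes a ROCm build of PyTorch; as far as I know, `torch.version.hip` is set on ROCm builds and `None` on CUDA/CPU builds, and `torch.cuda.is_available()` also reports ROCm devices:

```python
def rocm_pytorch_status():
    """Classify the local PyTorch install; returns a short status string."""
    try:
        import torch
    except ImportError:
        return "pytorch-not-installed"
    # On ROCm builds torch.version.hip is a version string; otherwise None.
    if getattr(torch.version, "hip", None) is None:
        return "not-a-rocm-build"
    return "rocm-gpu-ready" if torch.cuda.is_available() else "rocm-no-gpu"

print(rocm_pytorch_status())
```

On a machine with hip-runtime-amd and a ROCm pytorch build this should print `rocm-gpu-ready`; elsewhere it degrades gracefully.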
> My first action as a TU would be to add basic support for ROCm to [community], i.e. the low-level libraries, including HIP and an open source OpenCL runtime based on ROCm. That would be enough to run blender with its ROCm backend. At the same time, I would expand the wiki article on ROCm. The interaction with the community would also move from the issue tracker of rocm-arch to the Arch Linux bug tracker and the forums. In a second phase I would add the high-level libraries that would enable users to quickly compile and run complex libraries such as tensorflow, pytorch or cupy.

The limited support of ROCm has been one of the main things locking me into Nvidia for my workstations. Having stuff in [community] would certainly help with that!
> #BEGIN Technical details
>
> The minimal package list for HIP, which includes the runtime libraries for basic GPU programming and the GPU compiler (hipcc), comprises eight packages:
>
> * rocm-cmake (basic CMake files for ROCm)
> * rocm-llvm (upstream LLVM with to-be-merged changes by AMD)
> * rocm-device-libs (implements math functions for all GPU architectures)
> * comgr (runtime library, "compiler support" for rocm-llvm)
> * hsakmt-roct (interface to the amdgpu kernel driver)
> * hsa-rocr (runtime for HSA compute kernels)
> * rocminfo (displays information on HSA agents: GPU and possibly CPU)
> * hip-runtime-amd (runtime and compiler for HIP, a C++ dialect inspired by CUDA C++)

PKGBUILDs look good to me. Some ROC repositories include documentation (cmake, device libs, hip), maybe it would make sense to include those in `/usr/share/doc/${pkgname}`?
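The eight packages above form a small dependency graph, so a clean-chroot build has to happen in order. A sketch of deriving one valid build order with Python's `graphlib` (the dependency edges here are my reading of the package descriptions, not the exact PKGBUILD `depends` arrays):

```python
from graphlib import TopologicalSorter

# Assumed build-time dependencies: package -> set of prerequisites.
# Illustrative edges only; consult the PKGBUILDs for the real ones.
deps = {
    "rocm-cmake": set(),
    "rocm-llvm": set(),
    "rocm-device-libs": {"rocm-llvm", "rocm-cmake"},
    "comgr": {"rocm-llvm", "rocm-device-libs"},
    "hsakmt-roct": {"rocm-cmake"},
    "hsa-rocr": {"hsakmt-roct", "rocm-llvm"},
    "rocminfo": {"hsa-rocr"},
    "hip-runtime-amd": {"comgr", "hsa-rocr", "rocm-llvm"},
}

# static_order() yields every package after all of its prerequisites.
build_order = list(TopologicalSorter(deps).static_order())
print(build_order)
```

Each package in the printed list can be built in a clean chroot once everything before it is installed.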
> All but rocm-llvm are small libraries under the permissive MIT license. Since ROCm 5.2, all packages successfully build in a clean chroot and are distributed in the community repo arch4edu.
> The application libraries for numerical linear algebra, sparse matrices or random numbers start with roc and hip (rocblas, rocsparse, rocrand). The hip* packages are designed in such a way that they would also work with CUDA if hip is configured with CUDA instead of a ROCm/HSA backend. With few exceptions (rocthrust, rccl), these packages are licensed under MIT.
> Possible issues: There are three packages that are not fully working under Arch Linux or lack an open source license. The first is rocm-gdb, a fork of gdb with GPU support. To work properly it needs a kernel module that is currently not available in upstream Linux but only as part of AMD's dkms modules, and those only work with specific kernel versions. Support for this on Arch Linux was dropped from my side a while ago. The second, closed-source package is hsa-amd-aqlprofile. As the name suggests, it is used for profiling as part of rocprofiler. The above-mentioned packages are only required for debugging and profiling and are not runtime dependencies of the big machine learning libraries or of any other package with ROCm support I'm aware of. The third package is rocm-core, which is only part of the meta packages for ROCm and has no influence on the ROCm runtime. It provides a single header and a library with a single function that returns the current ROCm version. No source code has been published by AMD so far, and the official package lacks a license file.

Have you tried contacting AMD about `rocm-core`? It seems odd to keep such a small thing closed source / without a license.
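For illustration, a hedged sketch of calling that single function through ctypes. The symbol name `getROCmVersion` and its out-parameter signature are my assumption from AMD's packaged header, and the sketch skips gracefully when librocm-core is not installed:

```python
import ctypes
import ctypes.util

def rocm_version():
    """Query librocm-core for the installed ROCm version, or None."""
    path = ctypes.util.find_library("rocm-core")
    if path is None:
        return None  # rocm-core is not installed on this machine
    lib = ctypes.CDLL(path)
    major, minor, patch = (ctypes.c_uint32(0) for _ in range(3))
    # Assumed symbol/signature; a nonzero return is treated as failure.
    rc = lib.getROCmVersion(ctypes.byref(major), ctypes.byref(minor),
                            ctypes.byref(patch))
    if rc != 0:
        return None
    return (major.value, minor.value, patch.value)

print(rocm_version())
```

On a machine with the official package installed this should print a tuple like the current ROCm release; without it, `None`.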
> A second issue is GPU support. AMD officially only supports the professional compute GPUs. This does not mean that ROCm is not working on consumer cards, but merely that AMD cannot guarantee all functionality through extensive testing. Recently, ROCm added support for Navi 21 (RX 6800 onwards), see
>
> https://docs.amd.com/bundle/Hardware_and_Software_Reference_Guide/page/Hardw...
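As a sketch for such a wiki table, a minimal mapping from card names to LLVM gfx targets (these entries are a small illustrative subset I'm fairly confident of, not AMD's official support matrix):

```python
# Marketing name -> LLVM AMDGPU target; illustrative subset only.
GFX_TARGETS = {
    "Radeon RX Vega 56/64": "gfx900",
    "Radeon VII": "gfx906",
    "Instinct MI100": "gfx908",
    "Radeon RX 6800/6900 XT (Navi 21)": "gfx1030",
}

def gfx_target(card):
    """Return the gfx target for a known card, or None if untracked."""
    return GFX_TARGETS.get(card)

print(gfx_target("Radeon RX Vega 56/64"))  # gfx900
```

Testers could extend the table as cards get verified against [testing].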
> I own a Vega 56 (gfx900) that is officially supported, so I can test all packages before publishing them on the AUR / in [community].

Finding information about ROCm support in consumer cards really isn't easy – but I guess with CUDA I just expect it to work with recent Nvidia cards?
I would guess that we have a bunch of TUs with Radeon RX 5000/6000 (and soon 7000) series cards, but without the needed knowledge / use case for ROCm. Maybe it would be a good idea to provide testing scripts / documents for them, so they can report back once you push things into testing?
Having a list of tested cards in the wiki would be great as well.
> #END Technical details
>
> In the long term, I would like to foster Arch Linux as the leading platform for scientific computing. This includes machine learning libraries in the official repositories as well as packages for classical "number crunching" such as petsc and trilinos, and packages that depend on them: deal-ii, dune or ngsolve.
> The sponsors of my application are Sven (svenstaro) and Bruno (archange).
> I'm looking forward to the upcoming discussion and your feedback on my application.
>
> Best,
> Torsten

Best Regards
Justin