October 5, 2022 — Junchao Zhang, a software engineer at the U.S. Department of Energy’s (DOE) Argonne National Laboratory, leads a team of researchers preparing PETSc (Portable, Extensible Toolkit for Scientific Computation) for the nation’s exascale supercomputers, including Aurora, the exascale system slated for deployment at the Argonne Leadership Computing Facility (ALCF), a DOE Office of Science user facility located at Argonne.
PDE library used in many fields
PETSc is a mathematical library for the scalable solution of models generated with continuous partial differential equations (PDEs). PDEs, fundamental to describing the natural world, are ubiquitous in science and engineering. As such, PETSc is used in many disciplines and industry sectors, including aerodynamics, neuroscience, computational fluid dynamics, seismology, fusion, materials science, ocean dynamics, and the petroleum industry.
As researchers in science and industry seek to generate increasingly high-fidelity simulations and apply them to increasingly large-scale problems, PETSc stands to benefit directly from advances in exascale computing power. Furthermore, the technology developed for exascale can also be applied to less powerful computing systems, making PETSc applications on those systems faster and cheaper and encouraging wider adoption.
Additionally, each of the exascale machines slated to go live at DOE facilities has adopted an accelerator-based architecture and derives the majority of its computing power from graphics processing units (GPUs). This made porting PETSc for efficient use on GPUs an absolute necessity.
However, each exascale computing system vendor has adopted its own programming model and corresponding ecosystem, and portability between the different models remains, for all intents and purposes, in its infancy.
To avoid being locked into a particular vendor’s programming model, and to take advantage of Kokkos’s extensive user support and mathematical library, Zhang’s team opted to prepare PETSc for GPUs using the vendor-neutral Kokkos as its portability layer and primary backend wherever possible, rather than relying directly on CUDA, SYCL, or HIP.
Instead of writing multiple interfaces to different vendor libraries, the researchers use the Kokkos math library, known as Kokkos-Kernels, as a wrapper. Kokkos has also benefited the team by allowing it to accommodate users’ own choices of programming model, making GPU support seamless and natural.
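As a rough illustration of the single-source approach (a minimal, standalone sketch rather than code drawn from PETSc itself), the small Kokkos kernel below can target NVIDIA, AMD, or Intel GPUs depending only on which backend Kokkos is built with, for example via Kokkos’s Kokkos_ENABLE_CUDA, Kokkos_ENABLE_HIP, or Kokkos_ENABLE_SYCL build settings:

    #include <Kokkos_Core.hpp>

    // Single-source example: the same kernel compiles for the CUDA, HIP, or
    // SYCL backend depending only on how Kokkos was configured at build time.
    int main(int argc, char* argv[]) {
      Kokkos::initialize(argc, argv);
      {
        const int n = 1 << 20;
        // Views are allocated in device memory when a GPU backend is enabled
        Kokkos::View<double*> x("x", n), y("y", n);
        Kokkos::deep_copy(x, 1.0);
        Kokkos::deep_copy(y, 2.0);
        const double alpha = 3.0;
        // y = alpha*x + y, executed on whichever device Kokkos targets
        Kokkos::parallel_for("axpy", n, KOKKOS_LAMBDA(const int i) {
          y(i) = alpha * x(i) + y(i);
        });
        Kokkos::fence();
      }
      Kokkos::finalize();
      return 0;
    }

Because the backend is chosen when the library is built, a single code path like this can serve all three vendors’ GPUs, which is the property that spares the PETSc developers from maintaining duplicated CUDA and HIP sources.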
Extended GPU support
Prior to the efforts of Zhang’s team, sponsored by the DOE’s Exascale Computing Project (ECP), PETSc support for GPUs was limited to NVIDIA devices and required many of its computations to run on host machines. This limited both the portability of the code and its capability.
“So far, we think the adoption of Kokkos is a success, because we only need one source code,” Zhang said. “We had direct support for NVIDIA GPUs with CUDA. We tried to duplicate the code to directly support AMD GPUs with HIP. We find it a pain to maintain duplicate code: the same functionality needs to be implemented in multiple places and the same bug needs to be fixed in multiple places. Once the CUDA and HIP application programming interfaces (APIs) diverge, it becomes even more difficult to duplicate code.”
However, while PETSc is written in C, enough GPU programming models use C++ that Zhang’s team found it necessary to add an increasing number of C++ files.
“As part of the ECP project, and keeping in mind Amdahl’s Law, a formula in computer architecture which suggests that any unaccelerated part of the code can become a bottleneck for the overall speedup,” Zhang explained, “we tried to look at GPU porting work and GPU code portability in holistic terms.”
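Amdahl’s Law makes the concern concrete: if a fraction p of a program’s runtime is accelerated by a factor s, the overall speedup is bounded by

    speedup = 1 / ((1 - p) + p / s)

so even if 90 percent of a code runs on an arbitrarily fast GPU, the unaccelerated 10 percent caps the overall gain at a factor of ten. Leaving any substantial part of PETSc off the GPU would therefore undermine the exascale effort, which is why the team approached the porting work holistically.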
Optimizing communication and computation
The team is working on optimizing GPU functionality on two fronts: communication and computation.
As the team discovered, CPU-to-GPU data synchronizations must be carefully isolated to avoid the tricky and elusive bugs they cause.
Therefore, to improve communication, the researchers added support for GPU-aware MPI (Message Passing Interface), which allows data to be passed directly between GPUs instead of being staged in CPU buffers. Additionally, to remove the GPU synchronizations imposed by current MPI constraints on asynchronous computation, the team researched GPU stream-aware communication that bypasses MPI entirely and transmits data using the NVIDIA NVSHMEM library. The team is also collaborating with the MPICH group at Argonne to test new extensions that address these constraints, as well as the stream-aware MPI functionality the group is developing.
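The benefit of GPU-aware MPI can be sketched in a few lines (an illustrative, standalone example with assumed buffer names and sizes, not code from PETSc): when the MPI library is GPU-aware, pointers to device-resident buffers can be handed directly to MPI calls, removing the staging copies through host memory that would otherwise be required.

    #include <mpi.h>
    #include <Kokkos_Core.hpp>

    int main(int argc, char* argv[]) {
      MPI_Init(&argc, &argv);
      Kokkos::initialize(argc, argv);
      {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        const int n = 1000;
        // Buffers reside in device memory when Kokkos targets a GPU backend
        Kokkos::View<double*> sendbuf("send", n), recvbuf("recv", n);
        Kokkos::deep_copy(sendbuf, static_cast<double>(rank));
        Kokkos::fence();  // ensure the fill has completed before communicating
        const int next = (rank + 1) % size;
        const int prev = (rank - 1 + size) % size;
        // With a GPU-aware MPI these device pointers are legal arguments;
        // without it, the data would first have to be copied into host buffers.
        MPI_Sendrecv(sendbuf.data(), n, MPI_DOUBLE, next, 0,
                     recvbuf.data(), n, MPI_DOUBLE, prev, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      }
      Kokkos::finalize();
      MPI_Finalize();
      return 0;
    }

Stream-aware approaches such as NVSHMEM go a step further by letting the communication itself be enqueued on a GPU stream, so the CPU does not have to synchronize with the device before each transfer.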
To optimize GPU computing, Zhang’s team ported a number of functions to the device to reduce round-trip copying of data between host and device. For example, matrix assembly, an essential step in using PETSc, was previously performed on host machines: its existing APIs, designed for CPUs, could not feasibly be parallelized on GPUs. The team therefore added new GPU-friendly matrix assembly APIs, improving performance.
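One GPU-friendly assembly interface documented in recent PETSc releases is the coordinate-format (COO) pair MatSetPreallocationCOO and MatSetValuesCOO, in which all nonzero locations are declared up front and the values are then supplied in a single call, a pattern that maps well onto GPU execution because no incremental host-side insertion is required. The fragment below is a minimal sketch of that style of assembly (the 2x2 matrix and the option shown are illustrative only, not code from the project):

    #include <petscmat.h>

    int main(int argc, char** argv) {
      Mat         A;
      PetscInt    i[] = {0, 0, 1, 1};            // row indices of the nonzeros
      PetscInt    j[] = {0, 1, 0, 1};            // column indices of the nonzeros
      PetscScalar v[] = {4.0, -1.0, -1.0, 4.0};  // corresponding values

      PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
      PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
      PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, 2, 2));
      PetscCall(MatSetFromOptions(A));  // e.g. run with -mat_type aijkokkos to assemble on the GPU
      PetscCall(MatSetPreallocationCOO(A, 4, i, j));   // declare the sparsity pattern once
      PetscCall(MatSetValuesCOO(A, v, INSERT_VALUES)); // supply all values in one call
      PetscCall(MatDestroy(&A));
      PetscCall(PetscFinalize());
      return 0;
    }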
Improving code development
In addition to recognizing the importance of avoiding code duplication and of encapsulating and isolating interprocessor data synchronizations, the team learned to profile often (leveraging NVIDIA nvprof and Nsight Systems) and to inspect the GPU activity timeline to identify hidden and unexpected activities so they can be eliminated.
A crucial difference between the Intel Xe GPUs that will power Aurora and the GPUs in other exascale machines is that the Xe devices feature multiple sub-slices, meaning that optimal performance depends on NUMA-aware programming. (NUMA, or non-uniform memory access, is a method of configuring a group of processors so that memory is shared locally.)
Relying on a single source code allows PETSc to run readily on Intel, AMD, and NVIDIA GPUs, but with some tradeoffs. By making Kokkos an intermediary between PETSc and the vendors, PETSc becomes dependent on the quality of Kokkos: Kokkos-Kernels APIs must be well optimized against the vendor libraries to avoid performance degradation. Having discovered that some key Kokkos-Kernels functions are not yet optimized with vendor libraries, the researchers are contributing fixes as issues arise.
As part of the next steps of the project, researchers will help the Kokkos-Kernels team add interfaces to the Intel oneMKL math kernel library before testing them with PETSc. This, in turn, will help the Intel oneMKL team prepare the library for Aurora.
Zhang noted that to further expand PETSc’s GPU capabilities, his team will strive to support more low-level data structures in PETSc as well as higher-level GPU user-facing interfaces. The researchers also intend to work with users to ensure efficient use of PETSc on Aurora.
The Best Practices for GPU Code Development series highlights researchers’ efforts to optimize codes to run efficiently on the ALCF’s exascale Aurora supercomputer.
About the ALCF
The Argonne Leadership Computing Facility provides supercomputing capabilities to the scientific and engineering community to advance fundamental discovery and understanding across a wide range of disciplines. Supported by the Advanced Scientific Computing Research (ASCR) program of the U.S. Department of Energy’s (DOE) Office of Science, the ALCF is one of two DOE advanced computing facilities dedicated to open science.
Source: Nils Heinonen, ALCF