Agenda (USA Central time zone)

Date Time Topic Speaker/Moderator
12/5 09:00-09:15
Opening Remarks and UCF

Unified Communication Framework (UCF) - Collaboration between industry, laboratories, and academia to create production-grade communication frameworks and open standards for data-centric and high-performance applications. In this talk we will present recent advances in the development of UCF projects, including Open UCX and Apache Spark UCX, as well as incubation projects in the areas of SmartNIC programming, benchmarking, and other areas of accelerated compute.

Gilad Shainer, NVIDIA

Gilad Shainer serves as senior vice-president of marketing for Mellanox networking at NVIDIA, focusing on high-performance computing, artificial intelligence, and InfiniBand technology. Mr. Shainer joined Mellanox in 2001 as a design engineer and has served in senior marketing management roles since 2005. Mr. Shainer serves as the chairman of the HPC-AI Advisory Council organization, the president of the UCF and CCIX consortiums, a member of IBTA, and a contributor to the PCI-SIG PCI-X and PCIe specifications. Mr. Shainer holds multiple patents in the field of high-speed networking. He is a recipient of the 2015 R&D100 award for his contribution to the CORE-Direct in-network computing technology and the 2019 R&D100 award for his contribution to the Unified Communication X (UCX) technology. Gilad Shainer holds an MSc degree and a BSc degree in Electrical Engineering from the Technion - Israel Institute of Technology.

09:15-10:00
Recent Advances in UCX for AMD GPUs

This talk will focus on recent developments in UCX to support AMD GPUs and the ROCm software stack. The presentation will go over some of the most relevant enhancements to the ROCm components in UCX over the last year, including: 1) enhancements to the uct/rocm-copy component's zero-copy functionality to improve device-to-host and host-to-device transfers: by allowing zero-copy operations to execute asynchronously, device-to-host and host-to-device transfers can overlap the various stages of the rendezvous protocol, leading to up to 30% performance improvements in our measurements; 2) support for dma-buf based memory registration for ROCm devices: the Linux kernel's dma-buf mechanism is a portable mechanism for sharing device buffers across multiple devices by creating a dma-buf handle at the source and importing the handle on the consumer side. ROCm 5.6 introduced the runtime functionality to export a user-space dma-buf handle for GPU device memory, and support has been added to the ROCm memory domain in UCX starting from release 1.15.0; 3) updates required to support the ROCm versions released during the year (ROCm 5.4, 5.5, and 5.6). Furthermore, the presentation will include some details on ongoing work to take advantage of new interfaces available starting from ROCm 5.7, which allow explicit control over which DMA engine(s) to use for inter-process device-to-device data transfer operations.
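From the application's perspective, the dma-buf path is transparent: registering ROCm device memory with UCX looks the same as before, and the library picks dma-buf when the stack supports it. A minimal sketch (error handling omitted; assumes an initialized ucp_context_h, UCX >= 1.15, and ROCm >= 5.6):

```c
#include <hip/hip_runtime.h>
#include <ucp/api/ucp.h>

/* Register a ROCm device buffer with UCX. With ROCm >= 5.6 and UCX >= 1.15,
 * the ROCm memory domain can back this registration with a dma-buf handle;
 * no application-side change is needed. Error handling omitted for brevity. */
static ucp_mem_h register_rocm_buffer(ucp_context_h ctx, size_t size)
{
    void *dev_ptr;
    hipMalloc(&dev_ptr, size);            /* allocate GPU device memory */

    ucp_mem_map_params_t params = {
        .field_mask = UCP_MEM_MAP_PARAM_FIELD_ADDRESS |
                      UCP_MEM_MAP_PARAM_FIELD_LENGTH,
        .address    = dev_ptr,
        .length     = size,
    };

    ucp_mem_h memh;
    ucp_mem_map(ctx, &params, &memh);     /* dma-buf used under the hood
                                             when available */
    return memh;
}
```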

Edgar Gabriel, AMD

BIO

10:00-11:00
UCX Backend for Realm: Design, Benefits, and Feature Gaps

An update on the status of the protocols v2 implementation: what is upstream, what is planned for next year, performance status, error flows, and debug/analysis infrastructure.

Hessam Mirsadeghi, NVIDIA

BIO

11:00-11:30 Lunch
11:30-12:15
Use In-Chip Memory for RDMA Operations

The UCX API for managing exported memory keys can be used to develop applications offloaded to a DPU with direct access to system memory available on the host.
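As a rough illustration of this workflow, the sketch below packs an exported memory handle on the host and imports it on the DPU, using UCX's exported-memh interfaces (ucp_memh_pack with the export flag, and ucp_mem_map with an exported-memh buffer, available in recent UCX releases). How the packed blob travels to the DPU (e.g., over a socket) is left to the application; error handling is omitted:

```c
#include <ucp/api/ucp.h>

/* Host side: pack an exported memory handle for a mapped region `memh`.
 * The packed blob is sent to the DPU out of band (socket, mailbox, ...). */
static void export_memh(ucp_mem_h memh, void **blob, size_t *blob_size)
{
    ucp_memh_pack_params_t params = {
        .field_mask = UCP_MEMH_PACK_PARAM_FIELD_FLAGS,
        .flags      = UCP_MEMH_PACK_FLAG_EXPORT,
    };
    ucp_memh_pack(memh, &params, blob, blob_size);
}

/* DPU side: import the packed handle to gain direct access to host memory. */
static ucp_mem_h import_memh(ucp_context_h ctx, const void *blob)
{
    ucp_mem_map_params_t params = {
        .field_mask           = UCP_MEM_MAP_PARAM_FIELD_EXPORTED_MEMH_BUFFER,
        .exported_memh_buffer = (void *)blob,
    };
    ucp_mem_h memh;
    ucp_mem_map(ctx, &params, &memh);
    return memh;
}
```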

Roie Danino, NVIDIA

BIO

12:15-13:00
Low-Latency MPI RMA: Implementation and Challenges

Multirail support can provide a significant performance boost on certain platforms. In this talk we will describe how multirail is supported for RMA operations in UCX and demonstrate the performance benefits using benchmarks.
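Multirail is transparent at the API level: the same one-sided call can be striped across several rails by the library, with the rail count governed by UCX configuration rather than application code. A minimal hedged sketch of the kind of RMA operation being accelerated (error handling trimmed):

```c
#include <ucp/api/ucp.h>

/* One-sided put; whether UCX stripes this transfer across multiple rails is
 * a library-side decision (configurable via UCX environment variables) and
 * requires no change to the call itself. */
static void do_put(ucp_worker_h worker, ucp_ep_h ep, const void *buf,
                   size_t len, uint64_t remote_addr, ucp_rkey_h rkey)
{
    ucp_request_param_t param = { .op_attr_mask = 0 };
    ucs_status_ptr_t req = ucp_put_nbx(ep, buf, len, remote_addr, rkey, &param);

    if (UCS_PTR_IS_PTR(req)) {               /* operation is in flight */
        while (ucp_request_check_status(req) == UCS_INPROGRESS)
            ucp_worker_progress(worker);     /* drive completion */
        ucp_request_free(req);
    }
}
```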

Thomas Gillis, ANL

BIO

13:00-13:45
UCX Protocols for NVIDIA Grace Hopper

Akshay Venkatesh, NVIDIA

BIO

13:45-14:00 Adjourn
12/6 08:00-08:15
Opening Remarks
Pavel Shamis (Pasha), NVIDIA

Pavel Shamis is a Principal Research Engineer at Arm. His work is focused on the co-design of software and hardware building blocks for high-performance interconnect technologies, the development of communication middleware, and novel programming models. Prior to joining Arm, he spent five years at Oak Ridge National Laboratory (ORNL) as a research scientist in the Computer Science and Math Division (CSMD). In this role, Pavel was responsible for research and development on multiple projects in the high-performance communication domain, including Collective Communication Offload (CORE-Direct & Cheetah), OpenSHMEM, and OpenUCX. Before joining ORNL, Pavel spent ten years at Mellanox Technologies, where he led the Mellanox HPC team and was one of the key drivers in the enablement of the Mellanox HPC software stack, including the OFA software stack, Open MPI, MVAPICH, OpenSHMEM, and others. Pavel is a board member of the UCF consortium and a co-maintainer of Open UCX. He holds multiple patents in the area of in-network accelerators. Pavel is a recipient of the 2015 R&D100 award for his contribution to the development of the CORE-Direct in-network computing technology and the 2019 R&D100 award for the development of the Open Unified Communication X (Open UCX) software framework for HPC, data analytics, and AI.

08:15-09:00
InfiniBand Performance Isolation Best Practices

High-performance computing and artificial intelligence have evolved to be the primary data processing engines for wide commercial use. HPC clouds host growing numbers of users and applications, and therefore need to carefully manage network resources and provide performance isolation between workloads. We'll explore best practices for optimizing network activity and supporting a variety of applications and users on the same network, including application examples from on-premises clusters and from the Microsoft Azure HPC Cloud.

Gilad Shainer, NVIDIA

Gilad Shainer serves as senior vice-president of marketing for Mellanox networking at NVIDIA, focusing on high-performance computing, artificial intelligence, and InfiniBand technology. Mr. Shainer joined Mellanox in 2001 as a design engineer and has served in senior marketing management roles since 2005. Mr. Shainer serves as the chairman of the HPC-AI Advisory Council organization, the president of the UCF and CCIX consortiums, a member of IBTA, and a contributor to the PCI-SIG PCI-X and PCIe specifications. Mr. Shainer holds multiple patents in the field of high-speed networking. He is a recipient of the 2015 R&D100 award for his contribution to the CORE-Direct in-network computing technology and the 2019 R&D100 award for his contribution to the Unified Communication X (UCX) technology. Gilad Shainer holds an MSc degree and a BSc degree in Electrical Engineering from the Technion - Israel Institute of Technology.

Jithin Jose, Microsoft

Speaker Bio

09:00-09:45
Unified Collective Communication (UCC) State of the Union 2022

abstract

Manjunath Gorentla Venkata, NVIDIA

Manjunath Gorentla Venkata is a director of architecture and principal HPC architect at NVIDIA. He has researched, architected, and developed multiple HPC products and features. His team is primarily responsible for developing features for parallel programming models, libraries, and network libraries to address the needs of HPC and AI/DL systems. The innovations architected and designed by him and his team land as features in NVIDIA networking products, including UCC, UCX, CX HCAs, and BlueField DPUs. Prior to NVIDIA, Manju worked as a research scientist at DOE's ORNL, focusing on middleware for HPC systems, including InfiniBand and Cray systems. Manju earned Ph.D. and M.S. degrees in computer science from the University of New Mexico.

Valentine Petrov, NVIDIA

BIO

Ferrol Aderholdt, NVIDIA

BIO

Sergey Lebedev, NVIDIA

BIO

09:45-10:45 Break
10:45-11:30
MPICH + UCX: State of the Union

In this talk, we will discuss the current state of MPICH support for the UCX library, focusing on changes since the last annual meeting. Topics covered will include build configuration, point-to-point communication, RMA, multi-threading, GPU support, and more. We also look toward future UCX development items for the coming year.
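As one concrete example of the multi-threading item, an MPICH application built with the ch4:ucx device requests the fully multi-threaded mode as usual; how that request maps onto UCX's thread-safety support is MPICH's concern. A minimal sketch:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* Request full multi-threading; with the ch4:ucx device, MPICH maps
     * this onto UCX's thread-safety modes internally. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");

    MPI_Finalize();
    return 0;
}
```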

Yanfei Guo, Argonne National Laboratory

Dr. Yanfei Guo holds an appointment as an Assistant Computer Scientist at Argonne National Laboratory. He is a member of the Programming Models and Runtime Systems group. He has been working on multiple software projects including MPI, Yaksa, and OSHMPI. His research interests include parallel programming models and runtime systems in extreme-scale supercomputing systems, data-intensive computing, and cloud computing systems. Yanfei received the best paper award at the USENIX International Conference on Autonomic Computing 2013 (ICAC'13). His work on programming models and runtime systems has been published in peer-reviewed conferences and journals, including the ACM/IEEE Supercomputing Conference (SC'14, SC'15) and IEEE Transactions on Parallel and Distributed Systems (TPDS).

11:30-12:15
Stream-synchronous communication in UCX

Applications that take advantage of GPU capabilities often use stream abstractions to express dependencies and concurrency and to make the best use of the underlying hardware. Streams capture the notion of a queue of tasks that the GPU executes in order, which allows enqueuing and dequeuing of compute tasks (such as GPU kernels) and communication tasks (such as a memory copy between host and device memory). The GPU is not required to maintain any ordering between tasks belonging to different streams, so applications commonly use multiple streams to increase occupancy of GPU resources. A task enqueued onto a stream is generally asynchronous from the CPU's perspective but synchronous with respect to other tasks enqueued on the same stream.

A current limitation in UCX (and most libraries that build on UCX) is that it provides no abstractions for expressing dependencies between tasks enqueued onto streams and UCX communication operations. This means that if the CPU needs to send the result of a GPU kernel to a peer process, it must first synchronize with the stream onto which the kernel was enqueued. CPU resources are thus wasted even though methods exist for building communication dependencies without explicit CPU intervention in the critical path. The problem is especially important in applications dominated by short-running kernels, where kernel launch overheads are the primary bottleneck. Finally, such capabilities are already part of existing communication libraries such as NCCL, so this limitation in UCX presents a gap that applications are looking to have addressed for better composition.

In this work, we plan to present: 1) the current shortcomings of CPU-synchronous communication; 2) alternatives for extending the UCX API to embed stream objects into communication tasks; 3) stream-synchronous send/receive and progress semantics; 4) interoperability with CPU-synchronous semantics; and 5) implications on protocol implementations for performance and overlap.
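To make the limitation concrete, the sketch below shows today's CPU-synchronous pattern, in which the CPU must block on the stream before handing the buffer to UCX; the stream-aware semantics discussed in the talk would remove that blocking step. The UCX and CUDA calls are real, but the example as a whole is illustrative (error handling omitted):

```c
#include <cuda_runtime.h>
#include <ucp/api/ucp.h>

/* Today's CPU-synchronous pattern that the talk aims to eliminate:
 * the CPU must block on the stream before handing the buffer to UCX. */
static void send_kernel_result(ucp_worker_h worker, ucp_ep_h ep,
                               void *dev_buf, size_t len, cudaStream_t stream)
{
    /* ... a kernel producing dev_buf was launched on `stream` ... */
    cudaStreamSynchronize(stream);   /* CPU stalls here in the critical path */

    ucp_request_param_t param = { .op_attr_mask = 0 };
    ucs_status_ptr_t req = ucp_tag_send_nbx(ep, dev_buf, len, /*tag=*/42,
                                            &param);
    if (UCS_PTR_IS_PTR(req)) {
        while (ucp_request_check_status(req) == UCS_INPROGRESS)
            ucp_worker_progress(worker);
        ucp_request_free(req);
    }
    /* A stream-aware UCX API, as proposed in the talk, would instead let the
     * send be enqueued with a dependency on `stream`, removing the
     * cudaStreamSynchronize() above from the CPU's critical path. */
}
```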

Akshay Venkatesh, NVIDIA

Speaker Bio

Sreeram Potluri, NVIDIA

Speaker Bio

Jim Dinan, NVIDIA

Jim Dinan is a principal engineer at NVIDIA in the GPU communications team. Prior to joining NVIDIA, Jim was a principal engineer at Intel and a James Wallace Givens postdoctoral fellow at Argonne National Laboratory. He earned a Ph.D. in computer science from The Ohio State University and a B.S. in computer systems engineering from the University of Massachusetts at Amherst. Jim has served for more than a decade on open standards committees for HPC parallel programming models, including MPI and OpenSHMEM, and he currently leads the MPI Hybrid & Accelerator Working Group.

Hessam Mirsadeghi, NVIDIA

Speaker Bio

12:15-12:30 Break
12:30-13:15
Bring the BitCODE - Moving Compute and Data in Distributed Heterogeneous Systems

In this paper, we present a framework for moving compute and data between processing elements in a distributed heterogeneous system. The implementation of the framework is based on the LLVM compiler toolchain combined with the UCX communication framework. The framework can generate binary machine code or LLVM bitcode for multiple CPU architectures and move the code to remote machines while dynamically optimizing and linking the code on the target platform. The remotely injected code can recursively propagate itself to other remote machines or generate new code. The goal of this paper is threefold: (a) to present the architecture and implementation of the framework, which provides the essential infrastructure to program a new class of disaggregated systems wherein heterogeneous programming elements such as compute nodes and data processing units (DPUs) are distributed across the system; (b) to demonstrate how the framework can be integrated with modern, high-level programming languages such as Julia; and (c) to demonstrate and evaluate a new class of eXtended Remote Direct Memory Access (X-RDMA) communication operations enabled by this framework. To evaluate the capabilities of the framework, we used a cluster with Fujitsu CPUs and a heterogeneous cluster with Intel CPUs and BlueField-2 DPUs, interconnected using a high-performance RDMA fabric. We demonstrate an X-RDMA pointer-chase application that outperforms an RDMA GET-based implementation by 70% and is as fast as Active Messages, but does not require function predeployment on remote platforms.
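For context on the Active Message baseline the abstract compares against: with UCX active messages, a handler must be registered ("predeployed") on the receiver before peers can invoke it, which is precisely the requirement X-RDMA lifts by shipping code at runtime. A hedged sketch of such predeployment using UCX's AM API:

```c
#include <ucp/api/ucp.h>

/* Handler that must exist on the remote side before any peer can target it;
 * X-RDMA avoids this predeployment step by injecting code at runtime. */
static ucs_status_t on_am(void *arg, const void *header, size_t header_length,
                          void *data, size_t length,
                          const ucp_am_recv_param_t *param)
{
    /* ... act on the incoming message ... */
    return UCS_OK;
}

static void register_handler(ucp_worker_h worker)
{
    ucp_am_handler_param_t param = {
        .field_mask = UCP_AM_HANDLER_PARAM_FIELD_ID |
                      UCP_AM_HANDLER_PARAM_FIELD_CB,
        .id         = 1,          /* application-chosen AM id */
        .cb         = on_am,
    };
    ucp_worker_set_am_recv_handler(worker, &param);
}
```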

Luis E. Peña, Arm

Speaker bio

13:15-14:00
UCX on RISC-V After-Action Review

Tactical Computing Labs recently ported UCX to RISC-V. Porting UCX to RISC-V presents opportunities for the high performance computing (HPC) community to identify gaps in the current RISC-V GNU/Linux implementation, codify RISC-V ISA extensions for HPC, and identify nuances in the RISC-V ISA specification that need to be clarified for HPC software. RISC-V is an "open source" instruction set architecture (ISA) providing hardware developers a royalty-free specification, or contract, for implementing RISC-V processors. The RISC-V ISA is segmented into extensions that give hardware developers "building blocks" to select from when designing and implementing a RISC-V processor. Currently, RISC-V enjoys popularity in the IoT (Internet-of-Things) device market, while commercial GNU/Linux distribution support for RISC-V is nascent. As a consequence of RISC-V's popularity in the IoT device market, GNU/Linux support for RISC-V has been driven by and focused on IoT devices. Tactical Computing Labs' port of UCX to RISC-V identified gaps in current operating system support and in the ISA specification that are relevant to hardware engineers and software developers interested in the use of RISC-V in HPC. This technical talk will highlight and explain the gaps discovered by the port: limitations in the current GNU/Linux support for RISC-V, the impact of the GNU/Linux kernel's RISC-V support on interconnect choices, the base RISC-V ISA's support for managing memory consistency and its implications in an HPC context, nuances in the RISC-V specification's advice for handling self-modifying code, and recommendations for parties interested in using RISC-V for HPC.
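For readers unfamiliar with the ISA details, the memory-consistency and self-modifying-code points boil down to primitives like the following, which any port such as UCX's must emit correctly on RISC-V. This is a simplified sketch in GCC/Clang inline-assembly syntax, not code taken from the port itself:

```c
/* Ordering primitives on RISC-V: FENCE orders data accesses, and FENCE.I
 * synchronizes the instruction stream, which is the mechanism relevant to
 * the self-modifying-code nuances the talk discusses. */

static inline void full_memory_barrier(void)
{
    __asm__ __volatile__("fence rw, rw" ::: "memory");
}

static inline void instruction_fence(void)
{
    __asm__ __volatile__("fence.i" ::: "memory");
}
```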

Christopher Taylor, Tactical Computing Labs

Speaker bio

14:00 Adjourn
12/7 08:00-08:15
Opening Remarks for OpenSHMEM session

OpenSHMEM update

Steve Poole, LANL

BIO

08:15-08:45
Rethinking OpenSHMEM Concepts for Better Small Message Performance

OpenSHMEM update

Aaron Welch, University of Houston

BIO

08:45-09:15
QoS-based Interfaces for Taming Tail Latency

OpenSHMEM update

Vishwanath Venkatesan, NVIDIA, and Manjunath Gorentla Venkata, NVIDIA

BIO

09:15-10:00 Break
10:00-11:00
Panel: Future Direction of OpenSHMEM and Related Technologies

Community discussion

Steve Poole, LANL; Pavel Shamis, NVIDIA; Oscar Hernandez, NVIDIA; Tony Curtis, Stony Brook University; Jim Dinan, NVIDIA; Manjunath Gorentla Venkata, NVIDIA; Matthew Baker, Voltron Data

BIO

11:00-11:05 Adjourn