Agenda (USA Pacific time zone)

Date Time Topic Speaker/Moderator
11/30 08:00-08:15
Welcome Remarks and UCF

Unified Communication Framework (UCF) - a collaboration between industry, laboratories, and academia to create production-grade communication frameworks and open standards for data-centric and high-performance applications. In this talk we will present recent advances in the development of UCF projects, including Open UCX and Apache Spark UCX, as well as incubation projects in SmartNIC programming, benchmarking, and other areas of accelerated computing.

Gilad Shainer, NVIDIA

Gilad Shainer serves as senior vice president of marketing for Mellanox networking at NVIDIA, focusing on high-performance computing, artificial intelligence, and InfiniBand technology. Mr. Shainer joined Mellanox in 2001 as a design engineer and has served in senior marketing management roles since 2005. Mr. Shainer serves as the chairman of the HPC-AI Advisory Council organization, the president of the UCF and CCIX consortiums, a member of the IBTA, and a contributor to the PCISIG PCI-X and PCIe specifications. Mr. Shainer holds multiple patents in the field of high-speed networking. He is a recipient of the 2015 R&D100 award for his contribution to the CORE-Direct In-Network Computing technology and the 2019 R&D100 award for his contribution to the Unified Communication X (UCX) technology. Gilad Shainer holds MSc and BSc degrees in Electrical Engineering from the Technion Institute of Technology in Israel.

8:15-09:00
UCX on Azure HPC/AI Clusters

Recent technology advancements have substantially improved the performance potential of virtualization. As a result, the performance gap between bare-metal and cloud clusters continues to shrink. This is quite evident as public clouds such as Microsoft Azure have climbed into the top rankings of the Graph500 and Top500 lists. Moreover, public clouds democratize these technology advancements with a focus on performance, scalability, and cost-efficiency. Though the platform technologies and features continue to evolve, communication runtimes such as UCX play a key role in enabling applications to make use of these advancements with high performance. This talk focuses on how UCX efficiently enables the latest technology advancements in Azure HPC and AI clusters. It will also provide an overview of the latest HPC and AI offerings in Microsoft Azure along with their performance characteristics. It will cover the Microsoft Azure HPC marketplace images that include UCX-powered MPI libraries, as well as recommendations and best practices for Microsoft Azure HPC. We will also discuss performance and scalability characteristics using microbenchmarks and HPC applications.

Jithin Jose, Microsoft

Speaker Bio

09:00-09:30
Accelerating recommendation model training using ByteCCL and UCX

BytePS [1] is a state-of-the-art open source distributed training framework for machine learning in the fields of computer vision (CV), natural language processing (NLP), and speech. Recently, there has been growing interest in leveraging GPUs for training recommendation systems and computational advertising models (e.g. DLRM [2]), which involve large-scale datasets and require distributed training. While most models in CV/NLP/speech can fit onto a single GPU, DLRM-like models require model parallelism, which shards model parameters across devices, and they exhibit a much lower computation-communication ratio. To accelerate large-scale training for these models, we developed ByteCCL, the next generation of BytePS with UCX as the communication backend, providing communication primitives such as alltoall, allreduce, gather, scatter, and send/recv. The library is designed with (1) asynchronous APIs, (2) multiple deep learning framework support, (3) zero-copy and GPUDirect RDMA support, (4) multiple hardware support (CPU and GPU), and (5) topology-aware optimizations. We discuss how ByteCCL leverages UCX for high-performance data transfers, and how model parallelism for DLRM-like models can be supported by ByteCCL primitives for both synchronous and asynchronous training. As a result, ByteCCL enables asynchronous DLRM-like model training with GPUs and provides significant speedup over CPU-based asynchronous training systems. For synchronous DLRM-like model training, we present alltoall micro-benchmarks showing that ByteCCL's performance is on par with NCCL on an 800 Gb/s CX6 A100 GPU cluster and with HPCX on a 200 Gb/s CX6 AMD CPU cluster, respectively. We further show that ByteCCL provides up to 9% and 12% speedup for end-to-end model training over NCCL and HPCX, respectively. We plan to open source ByteCCL, and will finally discuss future work for ByteCCL.

[1] Jiang, Yimin, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, and Chuanxiong Guo. "A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters." In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pp. 463-479. 2020.

[2] Naumov, Maxim, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang et al. "Deep learning recommendation model for personalization and recommendation systems." arXiv preprint arXiv:1906.00091 (2019).

Haibin Lin*, Bytedance Inc.

Speaker Bio

Mikhail Brinskii, NVIDIA

Speaker Bio

Yimin Jiang, Bytedance Inc.

Speaker Bio

Yulu Jia, Bytedance Inc.

Speaker Bio

Chengyu Dai, Bytedance Inc.

Speaker Bio

Yibo Zhu, Bytedance Inc.

Speaker Bio

09:30-10:00
MPICH + UCX: State of the Union

In this talk, we will discuss the current state of MPICH support for the UCX library, focusing on changes since the last annual meeting. Topics covered will include build configuration, point-to-point communication, RMA, multi-threading, MPI-4 partitioned communication, and more. We also look towards future UCX development items for the coming year.
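
As one example of the MPI-4 features mentioned above, the sketch below shows partitioned point-to-point communication as standardized in MPI-4. It is a minimal illustration under the assumption of an MPI-4-capable MPICH build (for example one configured with the ch4:ucx device), not code taken from MPICH itself.

```c
/* Minimal sketch of MPI-4 partitioned point-to-point communication.
 * Rank 0 sends a buffer to rank 1 in 4 partitions. */
#include <mpi.h>

#define PARTITIONS 4
#define COUNT      1024   /* elements per partition */

int main(int argc, char **argv)
{
    int rank;
    double buf[PARTITIONS * COUNT];
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Psend_init(buf, PARTITIONS, COUNT, MPI_DOUBLE, 1, 0,
                       MPI_COMM_WORLD, MPI_INFO_NULL, &req);
        MPI_Start(&req);
        for (int p = 0; p < PARTITIONS; p++) {
            for (int i = 0; i < COUNT; i++) {
                buf[p * COUNT + i] = (double)p;   /* fill partition p */
            }
            MPI_Pready(p, req);   /* mark partition p ready for transfer */
        }
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Request_free(&req);
    } else if (rank == 1) {
        MPI_Precv_init(buf, PARTITIONS, COUNT, MPI_DOUBLE, 0, 0,
                       MPI_COMM_WORLD, MPI_INFO_NULL, &req);
        MPI_Start(&req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Request_free(&req);
    }

    MPI_Finalize();
    return 0;
}
```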

Ken Raffenetti, Argonne National Laboratory

Speaker Bio

10:00-10:30 Break
10:30-11:00
UCX.jl -- Feature rich UCX bindings for Julia

Julia is a high-level programming language with a focus on technical computing. It has seen increased adoption in HPC fields like CFD due to its expressiveness, high performance, and support for GPU-based programming. In this talk I will introduce UCX.jl, a Julia package that contains both low-level and high-level bindings to UCX from Julia. I will talk about the challenges of integrating UCX with Julia, in particular integrating with Julia's task runtime.

Valentin Churavy, MIT

Speaker Bio

11:00-11:30
Go bindings for UCX

Go is a popular programming language for cloud services. UCX can provide big benefits not only for traditional HPC applications but also for other network-intensive, distributed applications. Java and Python bindings are already being used by the community in various applications and frameworks. UCX v1.12 will include bindings for the Go language. We will present an example of API usage and the next steps for Go bindings development.

Peter Rudenko, NVIDIA

Speaker Bio

11:30-12:00
UCX-Py: Stability and Feature Updates

UCX-Py is a high-level library enabling UCX connectivity for Python applications. Because Python is a high-level language, it is very difficult to directly interface with specialized hardware to take advantage of high-performance communication. UCX is a crucial component of modern HPC clusters, and UCX-Py delivers that capability to various Python data science applications, including RAPIDS and Dask. UCX-Py has been under development for 3 years, starting as a simplified wrapper for UCP that evolved to support Python asyncio; today both implementations are available depending on the application's programming model and needs. The first couple of years of development brought many features to UCX-Py and served to prove UCX performance as a de facto necessity for HPC data science in Python. Its third year was largely focused on stability improvements and made UCX-Py more stable than ever, but despite the focus on stability some new features were also included. Handling of endpoint errors is now done via UCX endpoint callbacks, allowing for cleaner endpoint lifetime management. Newly introduced Active Message support provides an alternative to the TAG API, offering a more familiar communication pattern for Python developers. Support for RMA operations and for creating endpoints directly from a remote worker via the UCP address adds to the list of new UCX-Py features. Heavy testing and benchmarking on hardware providing connectivity such as InfiniBand and NVLink have also made UCX-Py an important tool in catching numerous UCX bugs. Following a complete split of the UCX-Py asyncio implementation from the core synchronous piece of the library, a smooth and complete upstreaming from UCX-Py's own repository into the mainline OpenUCX repository is now possible. This process has already started but is still in its early stages.
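
The endpoint error callbacks mentioned above build on UCP's endpoint error-handling mode, which UCX-Py wraps for Python. Below is a minimal C-level sketch of that underlying mechanism, assuming a UCP worker and a remote worker address obtained elsewhere; it is illustrative only and not taken from the UCX-Py code base.

```c
/* Sketch of the UCP endpoint error-handling mechanism that the
 * UCX-Py endpoint callbacks are built on. The worker and the remote
 * worker address are assumed to be obtained elsewhere. */
#include <ucp/api/ucp.h>
#include <stdio.h>

static void ep_error_cb(void *arg, ucp_ep_h ep, ucs_status_t status)
{
    /* Called when the endpoint enters an error state, e.g. the peer
     * closed the connection or the transport failed. */
    fprintf(stderr, "endpoint error: %s\n", ucs_status_string(status));
}

ucs_status_t create_ep_with_error_handling(ucp_worker_h worker,
                                           const ucp_address_t *remote_addr,
                                           ucp_ep_h *ep_p)
{
    ucp_ep_params_t params = {0};

    params.field_mask      = UCP_EP_PARAM_FIELD_REMOTE_ADDRESS |
                             UCP_EP_PARAM_FIELD_ERR_HANDLING_MODE |
                             UCP_EP_PARAM_FIELD_ERR_HANDLER;
    params.address         = remote_addr;
    params.err_mode        = UCP_ERR_HANDLING_MODE_PEER;
    params.err_handler.cb  = ep_error_cb;
    params.err_handler.arg = NULL;

    return ucp_ep_create(worker, &params, ep_p);
}
```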

Peter Entschev, NVIDIA

Speaker Bio

12:00-12:30 Break
12:30-13:00
OpenSHMEM and Rust

Many institutions engaged in High Performance Computing (HPC) are interested in moving from C-based setups to newer languages like Rust and Go to improve code safety and security. The Partitioned Global Address Space (PGAS) programming family is a set of HPC libraries and languages that employs a relaxed parallelism model utilizing Remote Memory Access (RMA or RDMA). Traditionally most libraries are targeted at C/C++ (and also Fortran) and are then used to implement PGAS programming languages or models. The OpenSHMEM library provides a communication and memory management API for C/C++. However, the use of raw C pointers creates safety issues in languages like Rust, and that detracts from the end-user's experience. This presentation will look at how we can integrate OpenSHMEM into Rust while retaining Rust’s safety guarantees.
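
To make the raw-pointer issue concrete, the sketch below shows a typical C OpenSHMEM pattern (symmetric allocation plus a one-sided put). It is the kind of unchecked, pointer-based code a safe Rust wrapper has to encapsulate; it is a generic example, not code from the presenters' bindings.

```c
/* Typical C OpenSHMEM usage: symmetric allocation + one-sided put.
 * The raw pointer returned by shmem_malloc() carries no lifetime or
 * bounds information, which is what a safe Rust wrapper must encode. */
#include <shmem.h>
#include <stdio.h>

int main(void)
{
    shmem_init();

    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric allocation: every PE allocates the same remotely
     * accessible buffer. */
    long *dest = shmem_malloc(sizeof(long));
    long  src  = (long)me;

    shmem_barrier_all();

    /* One-sided put of our PE number into the next PE's buffer. */
    shmem_long_put(dest, &src, 1, (me + 1) % npes);

    shmem_barrier_all();
    printf("PE %d received %ld\n", me, *dest);

    shmem_free(dest);
    shmem_finalize();
    return 0;
}
```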

Tony Curtis*, Stony Brook University

Speaker Bio

Rebecca Hassett, Stony Brook University

Speaker Bio

13:00-13:30
Towards Cost-Effective and Elastic Cloud Database Deployment via Memory Disaggregation

It is challenging for cloud-native relational databases to meet the ever-increasing needs of scaling compute and memory resources independently and elastically. The recent emergence of the memory disaggregation architecture, relying on high-speed RDMA networks, offers opportunities to build cost-effective and elastic cloud-native databases. There exist proposals to let unmodified applications run transparently on disaggregated systems. However, running a relational database kernel atop such proposals experiences notable performance degradation and time-consuming failure recovery, offsetting the benefits of disaggregation. To address these challenges, we propose a novel database architecture called LegoBase, which explores the co-design of the database kernel and memory disaggregation. It pushes memory management back to the database layer, bypassing the Linux I/O stack and re-using or designing (remote) memory access optimizations with an understanding of data access patterns. LegoBase leverages RDMA and further splits the conventional ARIES fault tolerance protocol to independently handle local and remote memory failures for fast recovery of compute instances. We implemented LegoBase atop MySQL. We compare LegoBase against MySQL running on a standalone machine and the state-of-the-art disaggregation proposal Infiniswap. Our evaluation shows that even with a large fraction of data placed in remote memory, LegoBase's system performance in terms of throughput (up to 9.41% drop) and P99 latency (up to 11.58% increase) is comparable to the monolithic MySQL setup, and significantly outperforms (1.99×-2.33×, respectively) the deployment of MySQL over Infiniswap. Meanwhile, LegoBase introduces up to 3.87× and 5.48× speedups of the recovery and warm-up time, respectively, over monolithic MySQL and MySQL over Infiniswap when handling failures or planned re-configurations.

Cheng Li, University of Science and Technology of China

Speaker Bio

13:30-13:45 Adjourn
12/01 08:00-08:40 Welcome Remarks and UCX
Pavel Shamis (Pasha), Arm

Pavel Shamis is a Principal Research Engineer at Arm. His work is focused on the co-design of software and hardware building blocks for high-performance interconnect technologies, the development of communication middleware, and novel programming models. Prior to joining Arm, he spent five years at Oak Ridge National Laboratory (ORNL) as a research scientist in the Computer Science and Math Division (CSMD). In this role, Pavel was responsible for research and development of multiple projects in high-performance communication domains, including Collective Communication Offload (CORE-Direct & Cheetah), OpenSHMEM, and OpenUCX. Before joining ORNL, Pavel spent ten years at Mellanox Technologies, where he led the Mellanox HPC team and was one of the key drivers in the enablement of the Mellanox HPC software stack, including the OFA software stack, OpenMPI, MVAPICH, OpenSHMEM, and others. Pavel is a board member of the UCF consortium and a co-maintainer of Open UCX. He holds multiple patents in the area of in-network accelerators. Pavel is a recipient of the 2015 R&D100 award for his contribution to the development of the CORE-Direct in-network computing technology and the 2019 R&D100 award for the development of the Open Unified Communication X (Open UCX) software framework for HPC, data analytics, and AI.

8:15-09:00
Porting UCX for Tofu-D interconnect of Fugaku

Tofu-D is an interconnect designed for the Fugaku supercomputer. It has a 6-dimensional torus network for high scalability, up to 160K nodes in Fugaku. The Tofu-D network interface is integrated into the Fujitsu A64FX CPU chip. Fujitsu provides a library called the uTofu API to make use of Tofu-D from userland. Currently, we are working on porting UCX to Tofu-D so that users can use the UCX ecosystem with native support. Before the porting, as a preliminary evaluation, we evaluated UCX's TCP mode with the tag messaging API on Tofu-D. We measured the bandwidth with a ping-pong benchmark and compared it with Fujitsu MPI, which supports Tofu-D natively. The result was disappointing: the bandwidth of UCX in TCP mode is about 170 MB/s with a 4 KB message, while that of Fujitsu MPI is about 6.2 GB/s. This implies we need to port UCX directly on top of uTofu to achieve better performance. With our latest implementation, we are able to run UCT's zero-copy PUT API on Tofu-D without using TCP mode. In our performance evaluation and comparison with Fujitsu MPI and uTofu, the modified UCX achieves about 6.3 GB/s with a 4 KB message, matching the performance of uTofu. It is a promising result that UCX scales better than MPI.
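
For reference, the UCT zero-copy PUT entry point exercised in the port looks roughly like the sketch below. This is a generic UCT usage sketch, assuming the interface, endpoint, memory registration, and remote key have already been set up elsewhere; it is not the Tofu-D transport code itself.

```c
/* Generic sketch of issuing a UCT zero-copy PUT. The endpoint `ep`,
 * registered memory handle `memh`, and unpacked remote key `rkey`
 * are assumed to be prepared elsewhere. */
#include <uct/api/uct.h>

ucs_status_t post_put_zcopy(uct_ep_h ep, void *buffer, size_t length,
                            uct_mem_h memh, uint64_t remote_addr,
                            uct_rkey_t rkey, uct_completion_t *comp)
{
    uct_iov_t iov;

    iov.buffer = buffer;   /* local source buffer         */
    iov.length = length;   /* number of bytes to transfer */
    iov.memh   = memh;     /* local registration handle   */
    iov.stride = 0;
    iov.count  = 1;

    /* Returns UCS_OK on immediate completion, UCS_INPROGRESS when the
     * completion callback will fire later, or an error status. */
    return uct_ep_put_zcopy(ep, &iov, 1, remote_addr, rkey, comp);
}
```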

Yutaka Watanabe*, University of Tsukuba

Speaker Bio

Mitsuhisa Sato, RIKEN

Speaker Bio

Miwako Tsuji, RIKEN

Speaker Bio

Hitoshi Murai, RIKEN

Speaker Bio

Taisuke Boku, University of Tsukuba

Speaker Bio

09:00-10:00
UCP Active Messages

Active Messages are a common messaging interface for various PGAS (Partitioned Global Address Space) APIs/libraries. UCP provides a rich and performance-efficient API suitable for implementing the networking layer in different types of frameworks. In this talk we will describe the UCP Active Message API, highlighting its advantages for those applications and frameworks that do not require tag-matching capabilities. With the Active Message API, applications can benefit from all major UCX features available for the well-known tag-matching API, such as GPU memory support, multi-HCA support, and error handling. The API provides optimal performance by auto-selecting proper protocols depending on the message size being transferred. Besides that, the Active Message API has several benefits compared to the tag-matching API: 1) no extra memory copies on the receiver, even with eager protocols; 2) peer-to-peer communication (a message can be received in the scope of the endpoint); 3) better error-handling support; 4) no tag-matching overhead (for frameworks where tag matching is not really needed). All these capabilities make the UCP Active Message API a preferable choice when implementing the networking layer for HPC and AI communication frameworks.
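
As a concrete illustration of the API discussed here, a minimal sketch of registering an Active Message receive handler and sending a message to it is shown below. Worker and endpoint setup, as well as request progression, are assumed to be handled elsewhere; the AM id is arbitrary.

```c
/* Minimal sketch of the UCP Active Message API: register a receive
 * handler for AM id 1 and send a message to it. */
#include <ucp/api/ucp.h>

#define MY_AM_ID 1

static ucs_status_t am_recv_cb(void *arg, const void *header,
                               size_t header_length, void *data,
                               size_t length,
                               const ucp_am_recv_param_t *param)
{
    /* With the default flags the data is only valid during the
     * callback, so it is consumed here. */
    return UCS_OK;
}

ucs_status_t register_am_handler(ucp_worker_h worker)
{
    ucp_am_handler_param_t param = {0};

    param.field_mask = UCP_AM_HANDLER_PARAM_FIELD_ID |
                       UCP_AM_HANDLER_PARAM_FIELD_CB;
    param.id         = MY_AM_ID;
    param.cb         = am_recv_cb;

    return ucp_worker_set_am_recv_handler(worker, &param);
}

ucs_status_ptr_t send_am(ucp_ep_h ep, const void *buf, size_t len)
{
    ucp_request_param_t param = {0};  /* default attributes */

    /* Header is optional; here only the payload is sent. */
    return ucp_am_send_nbx(ep, MY_AM_ID, NULL, 0, buf, len, &param);
}
```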

Mikhail Brinskii*, NVIDIA

Speaker Bio

Yossi Itigin, NVIDIA

Speaker Bio

10:00-10:30 Break
10:30-12:00
UCX GPU support

We would like to discuss last year's changes and future plans for UCX GPU support, including but not limited to: pipeline protocols, out-of-the-box performance optimization, shared memory 2-stage pipeline, device memory pipeline, GPU-CPU locality information, DMA-buf support and moving the registration cache to UCP, limiting/extending GPU memory registration, and the memory type cache.
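
One small, user-visible piece of the GPU support discussed here is the memory-type hint on the request parameters, which lets an application tell UCP that a buffer resides in CUDA device memory instead of relying on pointer-type detection (related to the memory type cache mentioned above). A minimal sketch follows, with endpoint setup and request completion handling assumed to happen elsewhere.

```c
/* Sketch: sending from CUDA device memory through the UCP tag API,
 * passing an explicit memory-type hint so UCP can skip pointer-type
 * detection for this buffer. */
#include <ucp/api/ucp.h>

ucs_status_ptr_t send_from_gpu(ucp_ep_h ep, const void *gpu_buf,
                               size_t length, ucp_tag_t tag)
{
    ucp_request_param_t param = {0};

    param.op_attr_mask = UCP_OP_ATTR_FIELD_MEMORY_TYPE;
    param.memory_type  = UCS_MEMORY_TYPE_CUDA;  /* buffer is on the GPU */

    return ucp_tag_send_nbx(ep, gpu_buf, length, tag, &param);
}
```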

Yossi Itigin*, NVIDIA

Speaker Bio

Akshay Venkatesh, NVIDIA

Speaker Bio

Devendar Bureddy, NVIDIA

Speaker Bio

12:00-12:30 Break
12:30-13:00
UCX on AWS: Adding support for Amazon EFA in UCX

HPC in the cloud is an emerging trend that provides users with easier, more flexible, and more affordable access to HPC resources. Amazon Web Services (AWS) provides its users with compute instances that are equipped with a low-latency, high-bandwidth network interface, as well as multiple NVIDIA GPUs. More specifically, AWS has introduced the Elastic Fabric Adapter (EFA): an HPC-optimized 100 Gbps network interface with GPUDirect RDMA capability to satisfy the high-performance communication requirements of HPC/DL applications. EFA exposes two transport protocols: 1) Unreliable Datagram (UD), and 2) Scalable Reliable Datagram (SRD). The UD transport is very similar to the UD transport from IB. However, EFA is not an IB device, so it lacks certain features and assumptions that the existing transport layers in UCX depend on. SRD is a new transport developed by Amazon that is exposed as a new QP type in rdma-core verbs. Therefore, it is not compatible with any of the existing transports in UCX. SRD is a reliable, unordered datagram protocol that relies on sending packets over many network paths to achieve the best performance on AWS. Out-of-order packet arrival is very frequent with SRD; therefore, efficient handling of such packets is the key to achieving high-performance communication with UCX on AWS. Currently, UCX does not support EFA, which means UCX usage on AWS is limited to TCP, which has a per-flow bandwidth cap of ~1.2 GB/s. Adding support for EFA through SRD allows UCX to achieve ~12 GB/s bandwidth on the AWS network and to deliver lower latency too. It also enables UCX-optimized software, such as GPU-accelerated libraries from the RAPIDS suite, to run directly on AWS. In this talk, we discuss how we add support for EFA in UCX. More specifically, 1) we add a new EFA memory domain that extends the existing IB memory domain to capture EFA-specific features and limitations, 2) we add a new UCT interface for SRD and discuss the challenges and solutions for this reliable but relaxed-ordering protocol, and 3) we update the existing UD transport in UCX to make it work over EFA too. Our micro-benchmark results show good performance for MPI communication between two EC2 p4d.24xlarge instances. Using Open MPI plus UCX with a prototype SRD interface, we achieve the maximum bandwidth offered by the EFA device (12 GB/s) for both host and GPU communications. This is 10x higher than what we achieve without EFA support, i.e., 1.2 GB/s with UCX + TCP. Latency results are encouraging too. For host memory, we achieve latency close to the rdma-core performance tests, i.e., ~19 us with UD and ~20 us with SRD, respectively. This is 44% lower than the latency we achieve with UCX + TCP. Moreover, by taking advantage of GPUDirect RDMA in UCX, we achieve the same latency and bandwidth as host communications for large-message GPU communications.

Hessam Mirsadeghi*, NVIDIA

Speaker Bio

Akshay Venkatesh, NVIDIA

Speaker Bio

Jim Dinan, NVIDIA

Speaker Bio

Sreeram Potluri, NVIDIA

Speaker Bio

13:00-13:30
rdma-core update

A 30-minute talk reviewing changes in rdma-core and the kernel over the last year. The RDMA stack provides the low-level hardware interfaces that UCX rides on top of.

Jason Gunthorpe, NVIDIA

Speaker Bio

13:30-14:00
Congestion Control for Large Scale RoCEv2 Deployments

100 Gbps Ethernet with RoCEv2 is currently being deployed in High Performance Computing (HPC) and Machine Learning (ML) environments. For large scale applications, the network congestion control (CC) plays a big role in application performance. RoCEv2 does not scale well if only Priority-based Flow Control (PFC) is used. The use of PFC can lead to head-of-line blocking that results in traffic interference. RoCEv2 with ECN-based congestion control schemes scales well without requiring PFC. In this talk, we will present the benefits of hardware-based congestion control for RoCEv2. We will also discuss congestion control enhancements to further improve RoCEv2 performance in large scale deployments.

Hemal Shah*, Broadcom

Speaker Bio

Moshe Voloshin, Broadcom

Speaker Bio

14:00-14:15 Adjourn
12/02 08:00-09:00
Welcome Remarks and UCC

UCC is a community-driven effort to develop a collective API and library implementation for applications in various domains, including High-Performance Computing, Artificial Intelligence, Data Center, and I/O. Over the last few months, the UCC working group has met weekly to develop the UCC specification. In this talk, I will highlight some of the design principles of the UCC v1.0 specification. I will also share the status of the UCC implementation and the upcoming plans of the working group. Further, we will share results from the experimental implementation, XCCL, which has helped us make informed decisions regarding UCC interfaces and semantics.

Manjunath Gorentla Venkata, NVIDIA

Manjunath Gorentla Venkata is an HPC software architect at NVIDIA. His focus is on programming models and network libraries for HPC systems. Previously, he was a research scientist and the Languages Team lead at Oak Ridge National Laboratory. He's served on open standards committees for parallel programming models, including OpenSHMEM and MPI for many years, and he is the author of more than 50 research papers in this area. Manju earned Ph.D. and M.S. degrees in computer science from the University of New Mexico.

8:15-09:30
Unified Collective Communication (UCC) State of the Union 2021

UCC is an open source collective library for current and emerging programming models. The goal is to provide a single library for collectives serving the various use cases. In particular, UCC aims to unify collective communication interfaces and semantics for (i) parallel programming models, including HPC (message passing and PGAS), deep learning, and I/O; (ii) collective execution on CPUs and GPUs; and (iii) software and hardware transports. In this talk, we will first highlight some of the design principles, abstractions, and the current implementation. Then, we will provide highlights of recent advances in the project and upcoming plans.

Manjunath Gorentla Venkata, NVIDIA

Manjunath Gorentla Venkata is an HPC software architect at NVIDIA. His focus is on programming models and network libraries for HPC systems. Previously, he was a research scientist and the Languages Team lead at Oak Ridge National Laboratory. He's served on open standards committees for parallel programming models, including OpenSHMEM and MPI for many years, and he is the author of more than 50 research papers in this area. Manju earned Ph.D. and M.S. degrees in computer science from the University of New Mexico.

09:30-10:00
High Performance Compute Availability (HPCA) Benchmark Project for Smart Networks

Abstract

Geoffroy Vallee*, NVIDIA

Speaker Bio

Richard Graham, NVIDIA

Speaker Bio

Steve Poole, Los Alamos National Laboratory

Speaker Bio

10:00-10:30 Break
10:30-11:00
Cloud-Native Supercomputing Performance isolation

Abstract

Gilad Shainer*, NVIDIA

Speaker Bio

Richard Graham, NVIDIA

Speaker Bio

Jithin Jose, Microsoft

Speaker Bio

11:00-12:00
ifunc: UCX Programming Interface for Remote Function Injection and Invocation

Abstract

Wenbin Lu*, Stony Brook University

Speaker Bio

Luis E. Peña, Arm Research

Speaker Bio

12:00-12:30 Break
12:30-13:00
Remote OpenMP Offloading with UCX

Abstract

Atmn Patel*, University of Waterloo

Speaker Bio

Johannes Doerfert, Argonne National Laboratory

Speaker Bio

13:00-13:30
From naive to smart: leveraging offloaded capabilities to enable intelligent NICs

Abstract

Whit Schonbein, Sandia National Laboratories

Speaker Bio

13:30-14:00
Using Data Processing Units to manage large sets of small files

Abstract

Matthew Baker, Oak Ridge National Laboratory

Speaker Bio

14:00-14:15 Adjourn