
UCC Virtual F2F Meeting Information

Manjunath Gorentla Venkata edited this page May 13, 2020 · 35 revisions

UCC Virtual F2F Meeting (May 11-13th and May 18-19th)

Registration

Please fill in the form here

Agenda

Day1

Meeting Notes

Monday, May 11th, 2020

Time | Topic | Telecon
7:00 - 7:30 AM PT | Kickoff and Opening Remarks (Gilad Shainer) |
7:30 - 8:15 AM PT | Highlights of UCC API (Review) (Manju) |
8:15 - 8:30 AM PT | Break |
8:30 - 9:30 AM PT | Teams API (Manju; All/Discussion) |
9:30 - 9:45 AM PT | Break |
9:45 - 11:00 AM PT | Endpoints / Collective Operations (Manju; All/Discussion) |

Day_1_Notes

Participants

  • Manjunath Gorentla Venkata
  • Alex Margolin
  • Sergey Lebedev
  • Valentin Petrov
  • Rami Nudelman
  • Baker, Matthew
  • Tony
  • Gilad Shainer
  • James S Dinan
  • Chambreau, Chris
  • Gil Bloch
  • Dmitry Gladkov
  • Arturo
  • Pavel Shamis
  • Ravi, Naveen
  • Raffenetti, Kenneth J.
  • Akshay Venkatesh

Discussion

  • Initialization

    • Have a flexible infrastructure for initialization and selection of library functionality
    • Discuss final options during component arch discussion
    • UCC config interface to follow UCS config. 
    • Rename ucc_config to ucc_params to reflect UCX style  
  • Context

    • Do we need a sync model configuration at context creation?
      • Yes, to enable RDMA-based implementations
      • Drawback: might have to create more contexts (sync and non-sync)
        • Yes, this might require multiple objects, but not necessarily multiple resources
        • Explore an explicit device abstraction and the ability to express affinity, and propose it to the WG
  • Team Creation

    • Need to revisit endpoints (as they seem to be implementation-specific) after the presentation from Alex
    • Can we hide endpoints from the interface and enable an implementation-agnostic way of creating teams?
  • Collective Operations

    • Need to define the mapping from the programming model's (src, dst) to UCC's (src, dst) for cases like MPI broadcast, which has only one set of buffers.
    • Is there a need for multiple outstanding persistent collective operations of the same type? No use case yet.
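
The buffer-mapping question above can be made concrete with a minimal sketch. It assumes one possible convention (the root contributes from the buffer, every other rank receives into it); the notes do not record a decision, and `coll_buffers_t` and `map_bcast_buffers` are hypothetical names, not part of any UCC header.

```c
#include <stddef.h>

/* Hypothetical (src, dst) pair for a collective; not a UCC type. */
typedef struct {
    void *src;  /* buffer this rank contributes data from, or NULL */
    void *dst;  /* buffer this rank receives data into, or NULL */
} coll_buffers_t;

/*
 * Map the single buffer of an MPI_Bcast-style call onto a (src, dst) pair.
 * Assumed convention (one option among several): the root only sends from
 * buf, every other rank only receives into buf.
 */
static coll_buffers_t map_bcast_buffers(void *buf, int my_rank, int root)
{
    coll_buffers_t b = { NULL, NULL };

    if (my_rank == root) {
        b.src = buf;
    } else {
        b.dst = buf;
    }
    return b;
}
```

An alternative convention would set both `src` and `dst` to the same buffer on every rank; which one the interface should adopt is exactly the open item noted above.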

Day2

Time | Topic | Telecon
7:00 - 7:45 AM PT | Topology Aware Collectives (Sameh) |
7:45 - 8:00 AM PT | Break |
8:00 - 8:45 AM PT | Collectives API - the Reactive alternative (Alex) |
8:45 - 9:00 AM PT | Break |
9:00 - 11:00 AM PT | Task and Plan API Discussion |

Day_2_Notes

Participants

  • Manjunath Gorentla Venkata
  • Richard Graham
  • Sameh
  • Gil Bloch
  • Ravi, Naveen
  • Alex Margolin
  • Tony
  • Raffenetti, Kenneth J.
  • Sergey Lebedev
  • Rami Nudelman
  • Arturo
  • James Dinan
  • Pavel Shamis
  • Geoffroy
  • Valentin Petrov

Topology aware collectives

WG to sync with Sameh (IBM) about topology definition as we abstract topology, device, and affinity

Multiple-level API?

Option 1: Standardize ucc and ucc_mpi interfaces
Option 2: Standardize only ucc interfaces

Discussion on UCC base, UCC MPI

  • For now focus on UCC base and continue the discussion on UCC MPI in the working group 
  • Option for UCC MPI (driver): provide it as part of the UCC project (for example, in a contrib directory)
  • (Alex correct this if needed)

Task API

Task API is useful (feedback from the WG)

  • To be considered for a later version of the API (not the first version)
  • It is useful for addressing use cases that include
    • computation + communication
    • Pipelined protocols
    • provide a use case for bundled collectives
    • Propose Task API to the working group  

Topology Information

What topology information to abstract and what to pass? 

  • Capture the distance between the processes/threads that form the team/group
  • Capture the distance between a context (resource) and devices (GPU/CPU)
  • Where to pass this information: at team creation or at init?
  • Action item for the working group: Propose an API that covers the above requirements
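
As a rough illustration of the first requirement, the sketch below encodes the distance between team members as a small integer derived from their placement. The placement fields, the three distance levels, and all names (`member_loc_t`, `member_distance`, `fill_distance_matrix`) are assumptions made for this example; the working group has not proposed an API yet.

```c
#include <stddef.h>

/* Hypothetical placement of one team member (process or thread). */
typedef struct {
    int node;    /* index of the host the member runs on */
    int socket;  /* index of the socket within that host */
} member_loc_t;

/* Assumed distance levels; a larger value means farther apart. */
enum { DIST_SAME_SOCKET = 0, DIST_SAME_NODE = 1, DIST_REMOTE = 2 };

/* Distance between two members, derived only from their placement. */
static int member_distance(member_loc_t a, member_loc_t b)
{
    if (a.node != b.node) {
        return DIST_REMOTE;
    }
    if (a.socket != b.socket) {
        return DIST_SAME_NODE;
    }
    return DIST_SAME_SOCKET;
}

/* Fill a caller-allocated n x n row-major matrix with pairwise distances. */
static void fill_distance_matrix(const member_loc_t *loc, size_t n, int *dist)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            dist[i * n + j] = member_distance(loc[i], loc[j]);
        }
    }
}
```

A matrix like this could be handed over either at team creation or at init; the same shape would also work for context-to-device distances if the rows and columns were contexts and devices instead of team members.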

Endpoints

  • An endpoint in UCC corresponds to member_index in UCG
  • Move the endpoint to the team_config structure
  • Make the endpoint an input
  • If no input is provided, the library creates the endpoints and makes them available via the get_attrib interface
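
The agreed endpoint behavior can be sketched as follows. This is only an illustration of the "optional input, query afterwards" pattern: `team_config_t`, `team_t`, `team_create`, `team_get_attrib_ep`, and the `EP_UNSET` marker are hypothetical stand-ins, and the library-assigned value is passed in as a parameter to keep the example self-contained.

```c
#include <stdint.h>

/* Hypothetical marker meaning "the caller did not supply an endpoint". */
#define EP_UNSET UINT64_MAX

/* Hypothetical team configuration carrying the optional endpoint input. */
typedef struct {
    uint64_t ep;  /* caller-provided endpoint, or EP_UNSET */
} team_config_t;

/* Hypothetical team object holding the endpoint that is in effect. */
typedef struct {
    uint64_t ep;
} team_t;

/*
 * Create a team. If the configuration carries an endpoint, use it;
 * otherwise fall back to the value the library would assign itself
 * (modeled here as the library_assigned_ep argument).
 */
static team_t team_create(const team_config_t *cfg, uint64_t library_assigned_ep)
{
    team_t team;

    team.ep = (cfg->ep != EP_UNSET) ? cfg->ep : library_assigned_ep;
    return team;
}

/* Query the endpoint in effect, standing in for a get_attrib-style call. */
static uint64_t team_get_attrib_ep(const team_t *team)
{
    return team->ep;
}
```

With this shape, a programming model that already has a natural index (such as an MPI rank) passes it in, while a caller without one leaves the field unset and reads the assigned value back after creation.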

Day3

Wednesday, May 13th, 2020

  • Join the Meeting
  • +1 425-659-5232 United States, Seattle (Toll)
  • (844) 612-0969 United States (Toll-free)
  • Conference ID: 874 275 202#

Time | Topic | Telecon
7:00 - 8:00 AM PT | GPUs/DL (NVIDIA/IBM/All) |
8:00 - 8:45 AM PT | Multirail Discussion (Sergey; All) |
8:45 - 9:00 AM PT | Break |
9:00 - 9:30 AM PT | Algorithm Selection Models (All) |
9:30 - 10:00 AM PT | Memory registration and Global Symmetric Memory (All) |
10:00 - 11:00 AM PT | Document on differences and plan to converge |

Day4

Monday, May 18th, 2020

  • Join the Meeting
  • +1 425-659-5232 United States, Seattle (Toll)
  • (844) 612-0969 United States (Toll-free)
  • Conference ID: 379 429 327#

Time | Topic | Telecon
7:00 - 7:45 AM PT | OMPI-X / ADAPT (George Bosilca/Talk) |
7:45 - 8:00 AM PT | Break |
8:00 - 9:30 AM PT | Component Architecture (Review for non-WG participants) (Alex/Val/Discussion) |
9:30 - 9:45 AM PT | Break |
9:45 - 10:30 AM PT | Library initialization parameters |
10:30 - 11:00 AM PT | Documentation / Code Structure |

Day5

Tuesday, May 19th, 2020

  • Join the Meeting
  • +1 425-659-5232 United States, Seattle (Toll)
  • (844) 612-0969 United States (Toll-free)
  • Conference ID: 868 569 033#

Time | Topic | Telecon
7:00 - 11:00 AM PT | |

Topics

(Laundry List)

  • Kickoff (Gilad)
  • Highlights of UCC API (Review for non-WG participants) (Manju)
  • OMPI-X / ADAPT (George Bosilca/Talk)
  • Requirements from the AI Users/Deep Learning/GPUs (NVIDIA; All)
  • API Discussion (in case not completed in WG)
    • Library Initialization
    • Resource Abstraction (Contexts)
    • Teams API (Manju; All/Discussion)
    • Endpoints (Manju; All/Discussion)
    • Collective Operations (Manju; All/Discussion)
    • Task API (Manju; All/Discussion)
    • Alternative Control-path API (Initialization and communicator creation) (Alex; All/Discussion)
    • Alternative Data-path API (Starting and progressing collectives) (Alex; All/Discussion)
  • Component Architecture (Review for non-WG participants)(Alex/Val/Discussion)
  • Flesh out UCC.H Header (All)
  • Unit tests and CI infrastructure (?)
  • Documentation (doxygen ?)(?)
  • Multirail Support (Sergey)
  • Topology-aware collectives (Sameh/Talk)
  • Memory registration (Discussion)
  • Algorithm selection (Discussion)