WeeklyTelcon_20211116

Geoffrey Paulsen edited this page Nov 21, 2021 · 1 revision

Open MPI Weekly Telecon

Attendees (on Webex)

  • Austen Lauria (IBM)
  • Brendan Cunningham (Cornelis Networks)
  • Brian Barrett (AWS)
  • Christoph Niethammer (HLRS)
  • Corey A. Henderson (AWS)
  • David Bernholdt (ORNL)
  • Geoffrey Paulsen (IBM)
  • George Bosilca (UTK)
  • Howard Pritchard (LANL)
  • Jeff Squyres (Cisco)
  • Joseph Schuchart (HLRS)
  • Josh Hursey (IBM)
  • Sriraj Paul (Intel)
  • Thomas Naughton (ORNL)
  • Todd Kordenbrock (Sandia)
  • Tomislav Janjusic (NVIDIA)

not there today (I keep this for easy cut-n-paste for future notes)

  • Akshay Venkatesh (NVIDIA)
  • Artem Polyakov (NVIDIA)
  • Aurelien Bouteiller (UTK)
  • Brandon Yates (Intel)
  • Charles Shereda (LLNL)
  • Edgar Gabriel (UH)
  • Erik Zeiske (HPE)
  • Geoffroy Vallee (ARM)
  • Harumi Kuno (HPE)
  • Hessam Mirsadeghi (NVIDIA)
  • Joshua Ladd (NVIDIA)
  • Marisa Roman (Cornelis Networks)
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Matthew Dosanjh (Sandia)
  • Michael Heinz (Cornelis Networks)
  • Nathan Hjelm (Google)
  • Noah Evans (Sandia)
  • Raghu Raja
  • Ralph Castain (Intel)
  • Sam Gutierrez (LANL)
  • Scott Breyer (Sandia?)
  • Shintaro Iwasaki
  • William Zhang (AWS)
  • Xin Zhao (NVIDIA)

New Topics For Today

v4.0.x

  • Schedule: Just Released 4.0.7 Monday (Nov 15)
  • Started a v4.0.8 milestone/Checklist

v4.1.x

  • Schedule: next (final?) rc later this week or next week?
  • All outstanding PRs as of yesterday were merged.
  • Bug about alltoallw with MPI_IN_PLACE
    • Fixed in master, but has another issue.
    • Brian and George working on PR for this.
    • Will bring the whole Alltoallw series into v4.1.x
    • Should also clean up alltoallv issues in v5.0.x
  • hcoll 9619 went back to v4.1.x yesterday.
  • George saw intermittent hangs in MPI_Comm_spawn and will open an issue.
    • IBM _inter
  • Once all of the Alltoall[v|w] fixes are merged to master and cherry-picked back to v4.1.x, will roll another rc.

v5.0.x

  • Schedule: slipped to Q1, 2022
  • A lot of fixes went into master but didn't get cherry-picked back.
    • Austen will investigate and open PRs later this week.
  • Jeff, Brian and others are working on PMIx/PRRTE integration.
  • Sent an email to devel
  • https://github.com/open-mpi/ompi/issues/9540 might be ready on v5.0.x
  • 8 PRs open.
    • PR 9594 - Fixes some BTL issues (against master) will take a few days to review.
  • Issue #9554: Jeff asked whether Partitions support is going into v5.0 or not.
    • Matthew is interested
  • PR #9495 TCP Onesided for master.
  • Tommy's still pushing on UCX Onesided.
  • PR 9576 - Ralph filed a ticket about building packages externally.
    • Working with Fedora packagers. Will be a v5.0.x
    • Might need some back and forth with PMIx. The way he updated PMIx might need massive change to OMPI.
      • Ball is somewhat in Jeff's court.
      • Across OMPI/PMIx/PRRTE - Just need to
  • MPI Info stuff that Joseph and Howard are working on.
    • Proposal was to mark a few MPI_ calls as deprecated.
    • Decided against it: we're not MPI 4.0 compliant yet, so don't mark anything as deprecated for now.
    • No additional discussion.
  • Documentation
    • A needed change went into the Sphinx tools. Not sure if there's a release yet.
      • This fixes outputting issues in manpages.
    • Process to update FAQ is to talk to Jeff or Harumi.
    • Any changes in README or FAQ let them know to make changes in NEW docs.
      • For now, make changes in ompi-www and README as usual and let them know.
  • Issue 9501 regression, needs to be fixed or reverted.
  • No test for building from tarball, ensure we don't need pandoc.
  • Github Project of [critical v5.0.x issues|https://github.com/open-mpi/ompi/projects/3]
    • Issue #8983: If we partially disable OSC/TCP BTL, that doesn't break MPI compliance; it just hurts one-sided performance badly.
    • Described approach of rc1 on Sept 23: disabling any functionality that is a blocker, to allow for the rc.
      • Worried that blockers might not be fixed in time, so will put in code to issue an error at runtime to prevent getting into those paths, and document it heavily.
  • RDMA Onesided might be stalled.
    • He's identified the core issues
    • A bunch of cleanup work, he's done about half of.
    • Understand and have written down the problems.
    • All BTL completion semantics stuff.
    • Who has time.
    • Regression and Silent data corruption.
    • Would it be worth sending an email to devel list?

Supercomputing (SC) BoF

  • Time and Date of BOF Nov 16 @ 12:15pm US Eastern Time.
  • Everyone who's involved has been preparing on the SC21_BOF Slack channel.
  • 140 people registered for Open MPI BoF, usually 75-100.
  • Jeff will post PDFs of slides on
  • Jeff will drive the slides.
  • 3000 ish on-site. in

Legal

  • Brian and Jeff are official reps for legal ownership
  • Usually pinged in first quarter of the year.
  • Todo item in Q1. Do an audit of infrastructure.
    • Who has what permissions, etc.
    • Who owns DNS domains
    • Few other resources, that someone is managing on their own, and we don't know until it breaks.
    • Also consider consolidating because we have a lot of infrastructure.
    • Document!

Master

Documentation

  • No update
  • Don't do the old system, use this new system for v5.0.0

MPI 4.0 API

  • [Open MPI 4.0 API Compliance Github Project|https://github.com/open-mpi/ompi/projects/2]
    • Howard opened the project and discussed
  • MPI_T events #8057 - This needs to be rebased and merged.
    • ECP person at Livermore. Either ask him or Howard will rebase.
  • Sessions branch, don't want to merge into master until possibly v5.0.1 gets out.
    • It will complicate things in finalize/initialize code.

MTT

  • Looking okay.
  • Cisco tests are reenabled again.
  • IBM still seeing Onesided issues.
  • Static builds are somewhat broken on master.
    • If PMIx is statically linked and compiled with Jansson, the build will be broken.
    • We're not always pulling in dependencies of our dependencies
  • Revive MTT development monthly meetings in January of 2022
  • For the last few days, haven't been able to build with internal PMIx.
    • dynamic linking.
    • If add -lpmix
  • MPI_Comm_spawn - reportedly hanging.
    • All IBM _inter tests do an MPI_Comm_spawn (in the same comm-world); this fails sometimes.
      • Blocked in pmix_get function.

Longer Term discussions

  • No discussion.