WeeklyTelcon_20190219

Geoffrey Paulsen edited this page Mar 12, 2019 · 2 revisions

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Geoff Paulsen
  • Jeff Squyres
  • Geoffroy Vallee
  • Howard Pritchard
  • Ralph Castain
  • Todd Kordenbrock
  • Xin Zhao
  • Brian Barrett
  • Josh Hursey
  • Joshua Ladd

not there today (I keep this for easy cut-n-paste for future notes)

  • Matias Cabral
  • Thomas Naughton
  • David Bernholdt
  • Matthew Dosanjh
  • George
  • Akshay Venkatesh
  • Edgar Gabriel
  • Aravind Gopalakrishnan (Intel)
  • Nathan Hjelm
  • Dan Topa (LANL)
  • Akshay Venkatesh (nVidia)
  • Arm (UTK)
  • Peter Gottesman (Cisco)
  • mohan

Agenda/New Business

  • The HostGator web site (open-mpi.org) is coming up for renewal. We need to decide what we are going to do about it.
    • Expires July 27th, so we should start on this in May.
    • Need to move domain names. (Who owns that?)
    • It'd be nice to move to AWS.
    • DNS should be owned by SPI. Still need to transfer that.
    • Topic for April.
  • Nathan Hjelm's day job will no longer involve Open MPI, so if you want him to review something, please check with him first.
  • Next face-to-face is April 23-25 at Cisco in San Jose.

Minutes

Review v3.0.x Milestones v3.0.3

Review v3.1.x Milestones v3.1.0

Review v4.0.x Milestones v4.0.1

  • Schedule: waiting for the Issue 6278 fix
  • v4.0.1
  • Consider disabling pmix-new-shmem mca param. (see PMIx Issue 1114)
    • We have one report on an older machine: a segv due to shared-memory lock creation.
    • IBM is using that component heavily and has seen no issues.
    • UoH has a machine with the same architecture, so we could try to reproduce it there.
    • There is an MCA param to disable it if a user hits this.
    • Consensus says leave it enabled.
  • Adding OSHMEM API - bugfix. Need to rev the .so versions correctly.
  • Serious issue https://github.com/open-mpi/ompi/issues/6198, but won't hold v4.0.1
  • OFI/RML - was removed on master, but in v4.0.x the configury was broken.
    • We could claim that removing ofi/rml is a bugfix.
      • It was never intended to be in a production release. Must explicitly activate.
    • Removing it is easiest. Don't suspect anyone is actually using this.

v5.0.0

  • Schedule: delaying until after summer
  • Discussion of schedule depends on scope discussion
    • Do we want to separate ORTE out for that? That would push it a bit past summer.
    • Gilles has a prototype of PRTE replacing ORTE
  • Want to open up release-manager elections.
    • Now that we're delaying, will decide at face2face.
  • Is anyone pushing for a Summer 2019 schedule?
    • It seems too aggressive to everyone on the call.
    • One driver was to remove things that break ABI.
    • Not a bad idea to do v5.0, but summer timing is bad.
    • Delaying would allow for switching to PRTE.
    • PMIx Tools support
  • A v4.1 from master is now a possibility
    • If we do a v4.1 instead, we'd need some things fixed on master.
  • Will discuss more at the face-to-face.

Master

  • Good job: Ralph fixed the 100% Cisco MTT failure.
  • Cisco now has 70,000+ good runs. Still some static build issues.

PMIx

  • New alert on the PMIx side: PMIx Issue 1114 (wrong answer in shared memory component).
  • Ralph fixed a bug over the weekend:
    • If you hit a process with SIGTERM while it is in a fence, the PMIx server can sometimes get into a codepath that causes a segfault.
  • Howard is still working on Open MPI calling PMIx directly.
    • Take a look at Gilles's PRTE work. He may have done SOME of that. He should have done that all in the PRTE layer, so maybe just some MPI layer work remains.
    • PR6339 - seems to be working.
    • 2000 files? That is because it removes ORTE.
    • Howard will review PR6339, and ensure that whatever Gilles did will survive that.
    • Did he keep the framework, but keep it static?
      • That's a better approach, so we can easily bring in an external component.

MTT

  • IBM still has 10% failure rate and build issue. Please fix.

New topics

  • PMIx direct call / PRTE replacement for ORTE.
  • Howard has been changing the places in OMPI or OPAL that call the PMIx framework to use PMIx data structures directly in the code.
    • Doesn't look like Howard would step on Ralph's toes.
  • March 4th is next MPI Forum (then June)
  • We have a new open-mpi Slack channel for Open MPI developers.
    • Not for users, just developers...
    • Email Jeff if you're interested in being added.

Face to face

  • How do we get more participation and make MTT more meaningful?

Review Master Pull Requests

  • Didn't discuss today.

Oldest PR

Oldest Issue


Status Updates:

Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, IBM, Fujitsu
  3. Amazon
  4. Cisco, ORNL, UTK, NVIDIA

Back to 2018 WeeklyTelcon-2018