
WeeklyTelcon_20180925

Geoffrey Paulsen edited this page Jan 15, 2019 · 1 revision

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Matthew Dosanjh
  • Geoff Paulsen
  • Jeff Squyres
  • Arm (UTK)
  • Brian
  • Dan Topa (LANL)
  • Josh Hursey
  • Ralph Castain
  • Thomas Naughton
  • Howard Pritchard
  • Aravind Gopalakrishnan (Intel)
  • mohan

not there today (I keep this for easy cut-n-paste for future notes)

  • Edgar Gabriel
  • Matias Cabral
  • Akvenkatesh (nVidia)
  • Todd Kordenbrock
  • Xin Zhao
  • Nathan Hjelm
  • Josh Hursey
  • Geoffroy Vallee
  • Joshua Ladd
  • Dan Topa (LANL)
  • David Bernholdt
  • George
  • Peter Gottesman (Cisco)

Agenda/New Business

  • Ralph proposed moving mailman to a new hosting site.

    • mailmanhost.com - $3/list for up to 4K members.
    • dotlist is company behind them.
    • We have about 2600 subscribers.
    • Might have had some more issues today with current provider.
    • No action until face to face
  • Silent Wrong Issue(s)

    • Vader fence issue (Originally Issue 4937)
    • Released v2.1.x with this.
    • Other things for v3.1.x
      • Put out an RC for v3.1.x
    • ACTION: Did this get fixed for v4.0.x?
    • ACTION: did this go to all release branches?
  • Nathan is requesting comments on

    • C11 integration into master. PR5445
    • Got good comments from George and others.
    • Goal: replace all of our own atomics with C11 atomics.
      • Will still need to support the existing ones until 2020 due to RHEL.
    • Nathan agreed to clear out old stuff now, and will rebase.
  • github suggestion on email filtering
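
For readers unfamiliar with the C11 atomics item above, here is a minimal sketch of what code written directly against `<stdatomic.h>` looks like. It is not taken from PR5445; the `example_counter_*` names are hypothetical and exist only for illustration, and it assumes a compiler with C11 atomics support.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical counter type for illustration; not an Open MPI structure. */
typedef struct {
    atomic_int_fast32_t value;
} example_counter_t;

/* Initialize the counter before it is shared between threads. */
static inline void example_counter_init(example_counter_t *c, int_fast32_t start)
{
    atomic_init(&c->value, start);
}

/* Atomically add delta and return the new value. */
static inline int_fast32_t example_counter_add(example_counter_t *c, int_fast32_t delta)
{
    /* atomic_fetch_add_explicit returns the value held before the addition. */
    return atomic_fetch_add_explicit(&c->value, delta, memory_order_relaxed) + delta;
}

/* Compare-and-swap: returns 1 if the counter held `expected` and now
 * holds `desired`, 0 otherwise. */
static inline int example_counter_cas(example_counter_t *c,
                                      int_fast32_t expected,
                                      int_fast32_t desired)
{
    return atomic_compare_exchange_strong(&c->value, &expected, desired);
}
```

Relaxed ordering is used for the add only to keep the sketch small; real code would pick the memory order each call site needs.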

Minutes

Review v2.1.6 (not going to do this in immediate future).

  • Vader problem is still happening on i386 and MIPSL nodes.
    • Do we want to just NOT support 32bit builds?
    • That makes our packager's lives difficult.
    • 32bit should be considered a "canary in the coalmine", and we might have other REAL issues.
    • Tested with patch, and still failing, so THIS might not be the only issue.
    • Not ready to say "drop 32bit".
    • Brian will investigate as time permits.
  • Driving a new release because it's a regression.
  • Dec 1st.

Review v3.0.x Milestones v3.0.3

  • Schedule:
  • Will do an RC at the end of this week ("I/O configury stuff").
  • v3.0.3 - targeting Oct 1st (will start more RCs when 2.1 wraps up).
    • Not important enough to do in parallel with v4.0.x
  • Issue with external PMIX v3.0 hanging. Fixed on master; Ralph backported the fix to OMPI v3.0.x and v3.1.x. Already fixed in OMPI v4.0.
  • fairly extensive bug fix list is building.
  • Few more PRs
  • UCS shaming.
  • Issue: MPI connect/accept is broken except within the same mpirun.
    • Ignoring for v3.0.3
  • Issue: CUDA Direct RDMA blocks on msgs larger than RDMA message length.
    • Fix for this is to use a non-default setting of mca_cuda_memcpy_asycn_send_recv.
    • Would like to change this default to be false.
    • Want a PR from someone who can test this for v3.0.3 and futures.
    • Lower priority because moving to UCX.
  • Issue 1763: Tune BCAST Data Corruption.
    • Looks like George added a workaround (2 years ago), but it is not a real fix.
  • Probably want PMIx v2.1.4 - fix that came in from ARM last week.
    • Is there a timeline for v2.1.4 release?
    • Can release soon. Then Open MPI will pick it up.
  • There are a bunch of Issues "targeting v3.0" not in v3.0.3 release.
    • Many have been merged in, but waiting to merge in everywhere.

Review v3.1.x Milestones v3.1.0

  • Schedule: Dec 1st
  • Issue 5083 - ucx segfault - Geoff (IBM) will grab UCX from upstream release and verify Issue 5083 (UCX issue not OMPI issue)
    • Open PR: PMIx v2.1.4 upgrade
    • PR 4986 - if no updates in 7 days, Brian will close PR.
    • Issue 5540 issue with overlapping datatype.
      • George is working on.

v4.0.0

  • Schedule: release: End of Sept.
    • Date for first RC - Sept 11 (today)
  • We've never announced an RC to the announce list before.
    • At the Supercomputing BOF we were asked to announce RCs before a major release, so that users who subscribe only to the announce list know a new major x.0.0 is coming.
    • Only needed on
  • Issue 5638 - 32bit fail in vader probably for all releases.
    • Fastbox thing is just an optimization.
    • We could just disable this optimization for 32bit.
      • We should do a build with fastbox disabled, and run through user's CI.
      • Then if we have a fix in time for release, then perhaps
      • Disabling fastbox, no mca parameter.
    • Brian will look into this a bit more.
    • Howard will look into adding an mca param to disable this.
  • Another issue is it's hard to see how pmix was configured.
    • pmix has a pmix_info - and we should build/package that.
    • Would like for v4.0.0
    • Jeff and Ralph went back and forth on this. Ralph ran into an issue.
  • Issue: 5375 in vader.
    • may be new blocker for v4.0.0
  • Added several labels with prefix 'state_' or 'severity_'
    • This helps us remember the state of the issue.
    • Does require users to update.
    • Would be nice to have a wiki page describing intent of these.
    • Jeff - wiki page sent; it is on the wiki.

PMIx

  • PMIx team is close to releasing version 2 of the PMIx standard.
  • No action: Open MPI v5.x Future of Launch
    • Geoffroy Vallee sent out document with summary to core-devel.
      Everyone please read and reply.
    • ORTE/PRTE
      • We had a working group meeting to discuss launching under Open MPI v5.0
      • Summary is to throw away ORTE, and make calls directly to PMIx, and then use PRTE with an mpirun wrapper around PRTE.
    • Split this into two steps:
      1. Make PMIx a first class citizen - and call PMIx API directly.
        • When we added the opal PMIx layer, we added infrastructure. We're now talking about flipping that around: internally Open MPI makes PMIx calls directly, and other components might translate those PMIx calls to PMI1, PMI2, or whatever else.
        • PMIx community has been operating as a "standard" for over a year or so now.
        • PMIx standard document is in progress.
        • Just doing this much should bring ORTE much more in line with PRTE, and reduce the effort of porting bug fixes between the two.
      2. Packaging / Launcher.
        • PRTE is that far ahead of ORTE because it's painful to move them back.
        • Many don't want to have to download something different to launch.
      3. Will need to ponder and come to consensus at face to face.

New topics

  • MTT License discussion - MTT needs to be de-GPL-ified.

    • Main desire is python is in a repo with no GPL code (no Perl code)
    • Current status:
      • Need to make progress on this sooner rather than later.
      • Ralph will move existing MTT to new mtt-legacy repo,
        • then rip out perl from MTT repo.
      • Cisco spins up a different slurm job for each MPI build, with a single ini file. By doing it this way, it depends on many perl funclets.
      • If this changes to a different ini for each "stream", it should work okay with python. This didn't happen before Peter left.
    • Ralph is waiting for MTT users to move to MTT-legacy repo.
      • Absoft, Amazon, IBM, need to move.
  • Do we need to update the LICENSE doc?

    • No, because not planning to distribute the legacy repo.
    • There are plans to redistribute the new MTT repo.
  • MTT performance database?

    • No status for a while.
    • MTT does report this, but no one looks.
    • Howard suggests many different performance dashboards.
      • Influx DB with jenkins, and can be queried.
      • Still need to get an up to date viewer.

Review Master Pull Requests

  • didn't discuss today.
  • Ralph is setting up a virtual machine and hitting a TON of new warnings.
    • Most of these are not checking return code of snprintf or asprintf.
      • There is an opal_asprintf().
  • Thought about adding CI to check for new warnings.
    • warning count delta is gross.
    • Getting warning free would be next to printf.
  • Next Face to Face
    • When? Week of Oct 16-18th
    • Where? San Jose - Cisco
    • Need Agenda items added to the face to face.
      • Issue with devel-core / mailman.
      • Discuss MPIR / PMIx debugger interfaces.
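
On the snprintf/asprintf warnings noted in this section: a minimal sketch of what checking the return code of `snprintf` looks like in standard C. The helper name `example_format_name` and its format string are hypothetical and not taken from the Open MPI tree; this only illustrates the pattern that silences this class of warning.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper for illustration: writes "<prefix>.<rank>" into buf.
 * Returns 0 on success, -1 on an output error or if the result was
 * truncated to fit in len bytes. */
static int example_format_name(char *buf, size_t len, const char *prefix, int rank)
{
    int n = snprintf(buf, len, "%s.%d", prefix, rank);

    if (n < 0) {
        return -1;              /* snprintf reported an error */
    }
    if ((size_t) n >= len) {
        return -1;              /* output did not fit and was truncated */
    }
    return 0;
}
```

`snprintf` returns the number of characters it would have written given unlimited space, so a value greater than or equal to the buffer size signals truncation. The same idea applies to `asprintf`, where a negative return value means no usable string was allocated.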

Oldest PR

Oldest Issue


Status Updates:

Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, IBM, Fujitsu
  3. Amazon,
  4. Cisco, ORNL, UTK, NVIDIA

Back to 2018 WeeklyTelcon-2018
