
WeeklyTelcon_20161108

Jeff Squyres edited this page Nov 18, 2016 · 1 revision

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Jeff Squyres
  • Artem Polyakov
  • Edgar Gabriel
  • Geoffroy Vallee
  • Howard
  • Josh Hursey
  • Nathan Hjelm
  • Ralph
  • Ryan Grant
  • Sylvain Jeaugey
  • Todd Kordenbrock

Agenda

Review 1.10.x: v1.10.5

  • All issues and pull requests for v1.10.x
  • 1.10.5
    • Nathan sees a segv and will be submitting a bug. It may be a driver for a v1.10.5 release.
    • Some PRs still here, waiting for reviews (Jeff).
    • Did we ever fix the signal handler issue?

Review 2.0.x: v2.0.2

  • All issues and pull requests for v2.0.2
  • Desired / must-haves for v2.0.x series
  • Known / ongoing issues to discuss
    • #2234 COMM_SPAWN broken:
      • Nathan just filed a PR on this yesterday.
    • v2.0.2 schedule:
      • IBM Jenkins is down, due to a lost filesystem.
    • Josh found an issue with idup (MPI_Comm_idup) under multi-threading (MT).
      • Nathan's comm_spawn fix should fix this too.
    • New issue from yesterday: neighborhood collectives
    • Non-uniform datatypes in the base and tuned collective components.
      • The same issue exists in tuned, but there it can be shut off.
      • The algorithms assume the same count at all ranks. The "fix" is to turn off the count-based logic, with an MCA parameter to turn it back on if you really know your application sends the same counts at all ranks.
      • The ibcast change is just a workaround.
      • libnbc has the same issue, so whatever the fix is, apply it to both the blocking and non-blocking collectives.
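The failure mode above can be sketched in a few lines. This is hypothetical illustration code, not Open MPI's actual tuned/libnbc logic: a segmented (pipelined) collective has to place segment boundaries on datatype-element boundaries, so its segment schedule depends on the *local* (count, datatype) pair. Two ranks passing the same type signature with different counts can then compute mismatched schedules.

```python
# Hypothetical sketch (not Open MPI's real code): segment schedule
# derived from the local count and datatype size.
def num_segments(count, type_size_bytes, seg_size_bytes):
    """Pipeline segments for a local (count, datatype) pair; a segment
    must hold a whole number of datatype elements."""
    elems_per_seg = max(1, seg_size_bytes // type_size_bytes)
    return -(-count // elems_per_seg)  # ceiling division

# Same 16-byte type signature described two ways: the root broadcasts
# 4 x 4-byte ints, while a receiver posts 1 element of a contiguous
# 16-byte derived type.
root_segments = num_segments(count=4, type_size_bytes=4, seg_size_bytes=4)
recv_segments = num_segments(count=1, type_size_bytes=16, seg_size_bytes=4)
print(root_segments, recv_segments)  # 4 1 -> mismatched schedules
```

With mismatched schedules the ranks post different numbers of sends and receives, which is why the conservative fix is to disable the count-based logic unless the user asserts uniform counts.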

Review 2.x: v2.1.0

  • All issues and pull requests for v2.1.0

  • Desired / must-haves for v2.1.x series

    • Reviewed this today (Nov 1st)
    • MPI I/O: there is no good multi-threaded test for MPI I/O; Edgar would like one.
    • Possible we'll see something for new coll_tuned component.
  • Known / ongoing issues to discuss

    • PMIx 1.2.0: status?
      • PR 2286: will update to PMIx v1.2.0rc1. Testing is looking good. Two outstanding issues -- both should be done this week:
        • Update to Get
        • One thing Boris is working on.
      • Estimate release PMIx v1.2.0 this time next week.
      • People please try it out!
  • Can we delete ompi-pmix2-integration repo?

  • Performance issue #1831.

    • PSM2 and also libfabric.
    • Don't know if it's the psm component or the CM component.
    • Performance on v2.x and master is good; v2.0.x is worst, v1.10.x is best.
      • Perhaps something from the request refactor didn't get backported correctly?
    • Fixed the BTLs, but not the MTLs. Something in that code path is not right.
    • Is this a blocker for a 2.0.2 release? From Intel perspective, would like it to be a blocker.
      • It's weird that it only affects messages larger than 64 KB.
    • Only single threaded build.
    • Hard to tell if it's in the PSM library, or in the Open MPI code-path.
    • If we see the libfabric GNI provider is also impacted, then it's probably Open MPI.
      • Should get data by tomorrow.
      • If it also affects uGNI, then it's probably Open MPI, and should be an Open MPI v2.0.x blocker.

Open MPI v2.1

  • Where are we on PMIx?
    • Performance difference between two different types of machines, especially at high core counts.
    • Not going to happen before Supercomputing (SC16).
  • Job Info in PMIx v1.2 - Artem is working on. Should go into PMIx master in next day or so.
    • Will get another RC after datastore.
    • Nathan can run a launch scaling test, but not a data scaling test.
    • Any data could help. Without the datastore, it will probably die at about 512 nodes @ 272 ppn.
  • A few other PRs:
    • PR #2354 - Can Artem explain what this is for?
    • PR #2365 - Slurm bindings: Open MPI doesn't recognize that Slurm has already bound processes, and does its own binding.
      • This had been fixed in the schizo framework. Making a more intelligent decision in the ESS meant bringing over the Slurm ESS, adding one new schizo framework call, and changing ORTE init to open the schizo framework.
      • This lets us detect Slurm bindings, and also fixed singletons in a Slurm environment, since the singleton ESS doesn't recognize that it's running under Slurm.

Master review

  • PR #2285: enabling orte to use libfabric
    • Please go test it!
    • Uses RDM messaging
    • @hppritcha would like to test, but will not be able to test until next week
  • On November 14th Agenda.
  • The person who heads up the Conservancy has been emailing Ralph.
    • Ralph filled out Conservancy form as well (no obligation)
  • Big difference between SPI and the Conservancy: SPI is all volunteer, while the Conservancy has 4 full-time staff.
  • There is still a lot of confusion around SPI; it switched to using the Conservancy's financing.
  • Ralph pointed out that at most they'd get about $100 from us (a 10% fee).

No meeting next week, we WILL have a meeting Nov 22nd!

Note: OMPI BOF is Wed Night at SC16

  • 5:30pm - We've asked a number of people for content, and others to talk.

Review Master MTT testing (https://mtt.open-mpi.org/)

  • Not seeing morning MTT reports, tarball-generation emails, or Coverity emails.
  • The v2.0.x series is still having some failures. Cisco has 2041 failures (~1800 are OSHMEM).

MTT Dev status:

  • We've been losing a little bit of data up until now, due to a major serialization difference between versions. Josh fixed it.
  • Not getting morning MTT result emails. Jeff looked into that last week, and went back and forth with Brian.
    • Mail to Gmail gets there; mail to Cisco doesn't.
    • Ralph thinks there is a newer email-security mechanism called SPF: if it isn't set up correctly on the sending server, some receiving sites (Google is not one of them) will reject the email without even returning it. A server-side setting declares which hosts are allowed to send ("spoof") mail for the domain name, which gets receiving systems to accept it.
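For reference, SPF works by publishing a DNS TXT record listing the hosts authorized to send mail for a domain; receivers that check SPF reject or flag mail from hosts not on the list. A minimal sketch, using a hypothetical domain and address:

```
example.org.  3600  IN  TXT  "v=spf1 mx ip4:192.0.2.10 -all"
```

Here `mx` authorizes the domain's MX hosts, the `ip4:` entry adds one explicit sender, and `-all` asks receivers to reject mail from everything else.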

Open MPI Developer's Meeting


Status Update Rotation

  1. LANL, Houston, IBM
  2. Cisco, ORNL, UTK, NVIDIA
  3. Mellanox, Sandia, Intel

Back to 2016 WeeklyTelcon-2016
