
including openib in btl list causes horrible vader or sm performance. #1252

Closed
gpaulsen opened this issue Dec 21, 2015 · 82 comments

@gpaulsen
Member

On master branch

I observe strange behavior. I think that openib may be using too large a hammer for NUMA memory binding, possibly setting the wrong memory-binding policy for the vader and sm shared-memory segments. I've only come to this conclusion empirically, based on performance numbers.

For example, I have a RHEL 6.5 node with a single Mellanox Technologies MT25204 [InfiniHost III Lx HCA] ConnectX-3 card with a single port active.

Bad Latency run single host:

$  mpirun -host "mpi03" -np 4 --bind-to core --report-bindings --mca btl openib,vader,self ./ping_pong_ring.x2
[mpi03:12941] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
[mpi03:12941] MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../..][BB/../../../../../../..]
[mpi03:12941] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../..][../../../../../../../..]
[mpi03:12941] MCW rank 3 bound to socket 1[core 9[hwt 0-1]]: [../../../../../../../..][../BB/../../../../../..]
[0:mpi03] ping-pong 0 bytes ...
0 bytes: 7.11 usec/msg
[1:mpi03] ping-pong 0 bytes ...
0 bytes: 7.10 usec/msg
[2:mpi03] ping-pong 0 bytes ...
0 bytes: 7.15 usec/msg
[3:mpi03] ping-pong 0 bytes ...
0 bytes: 7.17 usec/msg

Similar behavior with sm:

$ mpirun -host "mpi03" -np 4 --bind-to core --report-bindings --mca btl openib,sm,self ./ping_pong_ring.x2
[mpi03:14928] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
[mpi03:14928] MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../..][BB/../../../../../../..]
[mpi03:14928] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../..][../../../../../../../..]
[mpi03:14928] MCW rank 3 bound to socket 1[core 9[hwt 0-1]]: [../../../../../../../..][../BB/../../../../../..]
[0:mpi03] ping-pong 0 bytes ...
0 bytes: 7.45 usec/msg
[1:mpi03] ping-pong 0 bytes ...
0 bytes: 7.38 usec/msg
[2:mpi03] ping-pong 0 bytes ...
0 bytes: 7.35 usec/msg
[3:mpi03] ping-pong 0 bytes ...
0 bytes: 7.38 usec/msg

When I remove openib, the results look much better:

$ mpirun -host "mpi03" -np 4 --bind-to core --report-bindings --mca btl vader,self ./ping_pong_ring.x2
[mpi03:15819] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
[mpi03:15819] MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../..][BB/../../../../../../..]
[mpi03:15819] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../..][../../../../../../../..]
[mpi03:15819] MCW rank 3 bound to socket 1[core 9[hwt 0-1]]: [../../../../../../../..][../BB/../../../../../..]
[0:mpi03] ping-pong 0 bytes ...
0 bytes: 0.50 usec/msg
[1:mpi03] ping-pong 0 bytes ...
0 bytes: 0.50 usec/msg
[2:mpi03] ping-pong 0 bytes ...
0 bytes: 0.49 usec/msg
[3:mpi03] ping-pong 0 bytes ...
0 bytes: 0.51 usec/msg

Similar behavior with sm (though it's half as fast as vader):

$ mpirun -host "mpi03" -np 4 --bind-to core --report-bindings --mca btl sm,self ./ping_pong_ring.x2
[mpi03:16608] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
[mpi03:16608] MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../..][BB/../../../../../../..]
[mpi03:16608] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../..][../../../../../../../..]
[mpi03:16608] MCW rank 3 bound to socket 1[core 9[hwt 0-1]]: [../../../../../../../..][../BB/../../../../../..]
[0:mpi03] ping-pong 0 bytes ...
0 bytes: 0.98 usec/msg
[1:mpi03] ping-pong 0 bytes ...
0 bytes: 1.00 usec/msg
[2:mpi03] ping-pong 0 bytes ...
0 bytes: 0.95 usec/msg
[3:mpi03] ping-pong 0 bytes ...
0 bytes: 0.93 usec/msg

If I explicitly disable binding with --bind-to none, I see the expected results even when specifying openib (with either vader or sm, though now sm is the same speed as vader... weird):

$ mpirun -host "mpi03" -np 4 --bind-to none --report-bindings --mca btl openib,vader,self ./ping_pong_ring.x2
[mpi03:20206] MCW rank 1 is not bound (or bound to all available processors)
[mpi03:20205] MCW rank 0 is not bound (or bound to all available processors)
[mpi03:20207] MCW rank 2 is not bound (or bound to all available processors)
[mpi03:20208] MCW rank 3 is not bound (or bound to all available processors)
[0:mpi03] ping-pong 0 bytes ...
0 bytes: 0.50 usec/msg
[1:mpi03] ping-pong 0 bytes ...
0 bytes: 0.50 usec/msg
[2:mpi03] ping-pong 0 bytes ...
0 bytes: 0.50 usec/msg
[3:mpi03] ping-pong 0 bytes ...
0 bytes: 0.49 usec/msg
$ mpirun -host "mpi03" -np 4 --bind-to none --report-bindings --mca btl openib,sm,self ./ping_pong_ring.x2
[mpi03:21058] MCW rank 0 is not bound (or bound to all available processors)
[mpi03:21059] MCW rank 1 is not bound (or bound to all available processors)
[mpi03:21060] MCW rank 2 is not bound (or bound to all available processors)
[mpi03:21061] MCW rank 3 is not bound (or bound to all available processors)
[0:mpi03] ping-pong 0 bytes ...
0 bytes: 0.50 usec/msg
[1:mpi03] ping-pong 0 bytes ...
0 bytes: 0.51 usec/msg
[2:mpi03] ping-pong 0 bytes ...
0 bytes: 0.51 usec/msg
[3:mpi03] ping-pong 0 bytes ...
0 bytes: 0.49 usec/msg

Finally, just for completeness... the best 0-byte ping-pong ring times I could get were with --bind-to core --map-by core:

$ mpirun -host "mpi03" -np 4 --bind-to core --map-by core --report-bindings --mca btl vader,self ./ping_pong_ring.x2
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
[mpi03:32149] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
[mpi03:32149] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../..][../../../../../../../..]
[mpi03:32149] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../..][../../../../../../../..]
[mpi03:32149] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../..][../../../../../../../..]
[0:mpi03] ping-pong 0 bytes ...
0 bytes: 0.37 usec/msg
[1:mpi03] ping-pong 0 bytes ...
0 bytes: 0.37 usec/msg
[2:mpi03] ping-pong 0 bytes ...
0 bytes: 0.38 usec/msg
[3:mpi03] ping-pong 0 bytes ...
0 bytes: 0.38 usec/msg

I've attached my source for ping_pong_ring.c:

ping_pong_ring.txt
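
(The attached source is not inlined above. For readers without the attachment, here is a minimal, hypothetical sketch of a ping-pong ring of the same flavor. It is NOT the actual ping_pong_ring.c; the message size and iteration count are arbitrary. Each rank takes a turn timing a ping-pong with its right-hand neighbor and reports the per-message latency.)

/* Minimal ping-pong ring sketch (NOT the attached ping_pong_ring.c). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, iters = 10000;
    int nbytes = (argc > 1) ? atoi(argv[1]) : 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *buf = calloc(nbytes > 0 ? nbytes : 1, 1);
    int right = (rank + 1) % size;
    int left  = (rank + size - 1) % size;

    /* Each rank takes a turn timing a ping-pong with its right neighbor. */
    for (int turn = 0; turn < size; turn++) {
        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == turn) {
            printf("[%d] ping-pong %d bytes ...\n", rank, nbytes);
            double t0 = MPI_Wtime();
            for (int i = 0; i < iters; i++) {
                MPI_Send(buf, nbytes, MPI_BYTE, right, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_BYTE, right, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            }
            /* One-way latency per message (half the round-trip time). */
            printf("%d bytes: %.2f usec/msg\n", nbytes,
                   (MPI_Wtime() - t0) * 1e6 / (2.0 * iters));
        } else if (rank == (turn + 1) % size) {
            for (int i = 0; i < iters; i++) {
                MPI_Recv(buf, nbytes, MPI_BYTE, left, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, nbytes, MPI_BYTE, left, 0, MPI_COMM_WORLD);
            }
        }
    }
    free(buf);
    MPI_Finalize();
    return 0;
}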

@gpaulsen
Member Author

@hjelmn - Jeff Squyres said I should tag you on this, since you might have some insight.

@hjelmn
Member

hjelmn commented Dec 21, 2015

Yeah. This happens because we are polling the completion queue. I have been planning to work on this when I get a chance.

@hppritcha
Member

@gpaulsen which milestone would you like to give this? If it involves a lot of changes, I'd prefer it to be in 2.1.0 or maybe 2.0.1. (I'll add the 2.0.1 milestone in a bit.)

@gpaulsen
Member Author

I was hoping for 2.0 (I didn't see a 2.0 label), as this makes openib runs pretty useless.
This is NOT a blocker for IBM, but I assumed it was a blocker for the Open MPI community.

@hppritcha hppritcha added this to the v2.0.0 milestone Dec 22, 2015
@jsquyres
Member

jsquyres commented Jan 4, 2016

Hey @hjelmn, can we talk about this tomorrow on the call? I'm wondering:

  • Should the openib BTL just be smart enough to not register its progress function when its add_procs() hasn't found any peers, and/or del_procs() has removed all peers?
    • As a side effect: is the openib BTL progress function required to process incoming connections? Or is this not even an issue if the BTL determines that it can't talk to any peers? More specifically: is there ever a race condition -- e.g., involving dynamic processes -- where an incoming connection request could arrive in a BTL (such as the openib BTL) before the BTL knew that an incoming connection could be coming from that peer?
  • I'm guessing that other BTLs will also have this (set of) issues. Do you know?
  • If so, how much of a change does this represent to openib (and potentially others)? I.e., do we need this for 2.0.0, or can it wait until 2.1.0? I ask because Open MPI has had this behavior for a long time, so it's not technically a regression. But it would be a nice performance issue to fix.

@jsquyres
Member

jsquyres commented Jan 5, 2016

Discussion on the call...

This seems to be because of how we now add all endpoints for all BTLs (because of the new add_procs() work). Hence, even though a process may be 100% on a single server, if network BTLs determine that they could run (e.g., openib), the openib BTL modules will be added in the PML. Therefore we queue up their progress functions and poll them -- even though they will never be used (unless a dynamic process comes along and uses them).

There's a secondary implication: one-sided atomics. It's unfortunate that CPU atomics != network hardware atomics, so we have to go 100% one way or the other. The current implementation uses the network stack atomics (for the same reason as above: a dynamic process may come along and therefore require the use of network atomics).

@hjelmn is going to look at a slightly different approach: have opal_progress() monitor the progress functions that it calls. Progress functions that don't return any events for a long time will get downgraded in priority and called less frequently.
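
To make the decay idea concrete, here is a rough, hypothetical sketch -- this is not the actual opal_progress() code, and the thresholds and structure names are made up for illustration. Callbacks that keep reporting zero events get polled less and less often, and snap back to full frequency as soon as they report work:

/* Hedged sketch of the "progress decay" idea described above -- NOT the
 * actual opal_progress() implementation. */
#include <stddef.h>

typedef int (*progress_cb_t)(void);   /* returns number of events progressed */

struct progress_entry {
    progress_cb_t cb;
    unsigned int  idle_count;   /* consecutive calls that returned 0 events  */
    unsigned int  skip;         /* call the callback every (skip + 1) rounds */
    unsigned int  countdown;    /* rounds left until the next real call      */
};

/* One progress round over all registered callbacks. */
static int progress_round(struct progress_entry *entries, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        struct progress_entry *e = &entries[i];
        if (e->countdown > 0) {       /* still backing off: skip this round */
            e->countdown--;
            continue;
        }
        int events = e->cb();
        total += events;
        if (events > 0) {
            /* Active again: restore full polling frequency. */
            e->idle_count = 0;
            e->skip = 0;
        } else if (++e->idle_count > 1000 && e->skip < 64) {
            /* Long idle streak: halve this callback's polling frequency. */
            e->skip = e->skip ? e->skip * 2 : 1;
            e->idle_count = 0;
        }
        e->countdown = e->skip;
    }
    return total;
}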

@gpaulsen
Member Author

gpaulsen commented Jan 5, 2016

We discussed in today's call that this is a complex issue with the new dynamic add procs, and that we must include network endpoints for two reasons. One is that if we use any network atomics, all atomics must be network atomics. The other is to support spawning a new job on another node.

Nathan's proposed solution is to add a decay function to the progression loop, so that any components that are not actually progressing anything won't get called as often.

@bosilca
Member

bosilca commented Jan 5, 2016

Talking specifically about IB, we should take advantage of the fact that it is a connection-based network and only register the progress callback when there are established connections. More generally, a more reasonable approach would be to delay the progress registration until there is something to progress (this is under the assumption that connection establishment is handled by a separate thread).
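
A sketch of that suggestion -- not actual Open MPI code: connection_established()/connection_closed() are hypothetical hooks, btl_progress_fn is a stand-in for the real BTL progress function, and the opal_progress_register()/opal_progress_unregister() prototypes are only approximations of OPAL's existing progress-callback hooks:

/* Keep the BTL's progress callback registered only while it has live
 * connections (illustrative sketch, assumptions noted above). */
#include <stdbool.h>

typedef int (*opal_progress_callback_t)(void);
extern int opal_progress_register(opal_progress_callback_t cb);
extern int opal_progress_unregister(opal_progress_callback_t cb);
extern int btl_progress_fn(void);            /* hypothetical stand-in */

static int  active_connections  = 0;
static bool progress_registered = false;

/* Called when a new endpoint connection is established. */
static void connection_established(void)
{
    if (0 == active_connections++ && !progress_registered) {
        opal_progress_register(btl_progress_fn);
        progress_registered = true;
    }
}

/* Called when the last endpoint connection is torn down. */
static void connection_closed(void)
{
    if (0 == --active_connections && progress_registered) {
        opal_progress_unregister(btl_progress_fn);
        progress_registered = false;
    }
}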

@jsquyres
Member

jsquyres commented Jan 5, 2016

Agreed: that, too.

@gpaulsen
Member Author

I should add that this was built multi-threaded, for x86, with the GNU compilers.

@gpaulsen
Member Author

I'm ready to retest. Please comment in this issue with which PR to try.

@hppritcha
Member

@hjelmn do you have an ETA for the backoff fix?

@hjelmn
Member

hjelmn commented Jan 21, 2016

Should have it ready to test later today.

@hjelmn
Member

hjelmn commented Jan 22, 2016

Hmm, I see why I am not getting the same level of slowdown as you. The problem is less the progress function (which adds ~100 ns) and more the asynchronous progress thread (connections, errors, etc.). It should be completely quiet in this case, but something is clearly causing the thread to wake up.

@gpaulsen Could you run the reproducer with a debug build and the -mca btl_base_verbose 100 option and send me the output?

@jsquyres
Member

I just pinged @gpaulsen -- he's going to check this out today.

@hjelmn
Member

hjelmn commented Jan 25, 2016

Good. I am sure a btl log will show what is causing the async thread to wake up.

@gpaulsen
Member Author

I emailed the output directly to Nathan.
Strangely, I was only able to reproduce with a non-debug build; -g did not show the issue (though I didn't try very hard).

@jsquyres
Member

FWIW, for large text outputs, we typically use the gist.github.com service and then post the link here.

@jsquyres
Member

@gpaulsen @hjelmn Where are we on this issue?

@gpaulsen
Member Author

I've been trying to instrument the code to provide additional output, but when I do, I see nothing new in the output. I'll have some more time after 3pm central to work on this again today.

@gpaulsen
Member Author

Just finished a WebEx with Jeff, and we put some more opal_outputs in and around the openib component. It looks like openib IS calling its progress thread, even on a single host.

I can only reproduce the bad performance without --enable-debug.
I have compiled WITH Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)
Command:

cat hosts
mpi03 slots=4
mpirun --hostfile hosts -np 4 --bind-to core --report-bindings --mca btl_openib_verbose 100 --mca btl_base_verbose 100 --mca btl openib,vader,self ./ping_pong_ring.x2

I don't see any calls to btl_openib_async_device or udcm_cq_event_dispatch.

I DO see calls to btl_openib_component_open and udcm_module_init (which returns success).

And while running, I see many, many calls to the progress function btl_openib_component_progress().

Jeff thinks this might be evidence for your initial thoughts @hjelmn. Could you please take another look?

@gpaulsen
Member Author

gpaulsen commented Feb 1, 2016

@hjelmn, any luck? Do you want to do a screen share with me?

@hjelmn
Member

hjelmn commented Feb 1, 2016

I still think this is unrelated to the progress function since --bind-to none helps as well. I will look deeper and see what is different between the optimized and debug builds that could be having an impact.

@hjelmn
Member

hjelmn commented Feb 1, 2016

Can you put an opal output in the loop in progress_engine in opal/runtime/opal_progress_threads.c? That should be waking up almost never during the steady-state.

@gpaulsen
Member Author

gpaulsen commented Feb 1, 2016

That is correct. I've verified that each rank is only calling opal/runtime/progress_engine() once, but we're still getting many, many calls to the openib progress function.

@hjelmn
Member

hjelmn commented Feb 1, 2016

@gpaulsen Strange. I will have to do some more digging to figure out what might be going on. I measured the effect of the openib btl's progress engine on a shared-memory ping-pong, and it is at most 50-100 ns. I tested this by just setting btl_progress to NULL in the btl and comparing to the normal latency.

@jsquyres
Member

jsquyres commented Feb 1, 2016

@gpaulsen Can you try removing the openib progress function (i.e., maybe just hack up the code to not register the openib progress function) and see what happens?

I know @hjelmn says it only takes something like 40 ns, but there are also caching effects, and vader/sm are sooo sensitive to that kind of stuff. If it's easy, it's worth trying.

@hjelmn
Member

hjelmn commented Feb 4, 2016

Well, we can throw out the udcm cause. We did some more digging, and it is indeed the openib progress function. The problem is that whatever is causing the slowdown for shared memory is also causing a slowdown for internode ping-pong. The slowdown only happens for processes on the second socket of either node. @gpaulsen is going to experiment with different OFED/MOFED versions to see if the slowdown is an OFED bug.

@hjelmn
Member

hjelmn commented Feb 4, 2016

As such, this may not be a blocker for 2.0.0. We can certainly add code to reduce the number of calls to the openib progress function when the btl is not in active use, or we could choose not to poll the openib completion queues unless there are connections, but this would fix the symptom, not the cause.

@jsquyres
Member

jsquyres commented Feb 4, 2016

Sounds like this could use some testing on other people's IB hardware -- do the same thing as @gpaulsen (including an MPI process that is NUMA-far from the HCA) and see if they get the same kind of bad performance.

@nysal
Member

nysal commented Feb 5, 2016

Is this Sandy Bridge? I vaguely recollect adding a stall loop before calling ibv_poll_cq() for processes on the second socket to work around a hardware issue.

@gpaulsen
Member Author

gpaulsen commented Feb 5, 2016

Yes! This IS a Sandy Bridge. Intel Xeon E5-2660.

@gpaulsen
Member Author

gpaulsen commented Feb 5, 2016

So, I talked with @nysal. He remembers finding this problem internally at IBM with another MPI (PE MPI) and putting in a spin delay before ibv_poll_cq() when running on a non-first socket on Sandy Bridge with certain OFED versions. That's all he remembers.

@gpaulsen
Member Author

gpaulsen commented Feb 5, 2016

@hjelmn, regarding the two "good" changes to openib that we made yesterday: do you want to get those upstream, or would you like me to? That might be good practice for me getting stuff upstream, unless you wanted to go and look at other stuff too.

Also, I am not as confident that it's only ibv_poll_cq's fault for the horrible performance. I really thought that I NULLed out the progression pointer as stated above. I'm going to try to reproduce that today.
Also, isn't that ibv_poll_cq() call in openib on a different thread from the shared memory? WHY would that slow down the shared-memory progression? Or is ALL BTL progression done on that other thread?
Finally, if the new version of MOFED doesn't fix this, does it warrant trying your progression decay function? Or would you prefer to try to detect the Sandy Bridge architecture and put a spin loop before the ibv_poll_cq()? Or just ignore it for this hardware?

@hppritcha
Member

Okay, this brings back memories. On Cray, for processes on the socket remote from the Aries/HCA, we had to add a back-off in Cray MPICH for functions we knew polled cache lines that were also potentially written to by the I/O device. It was a bug in the Sandy Bridge northbridge.

Cray MPICH only turned on the back-off if the rank was on the remote socket and the CPU was Sandy Bridge. The problem was fixed in Ivy Bridge, as I recall.

@hjelmn
Member

hjelmn commented Feb 5, 2016

OK, it sounds like we can fix this in a similar way in Open MPI. We can easily detect both Sandy Bridge and a remote socket. I will take a crack at writing this later today.
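
A hedged sketch of that workaround (not the eventual Open MPI patch): stall briefly before ibv_poll_cq() when the process is on Sandy Bridge and bound to a socket remote from the HCA. The detection flag and the stall length are assumptions, presumed to be filled in at init time (e.g., via hwloc plus CPUID):

/* Sketch only: wrap ibv_poll_cq() with a short busy-wait when we know we are
 * on Sandy Bridge and bound to the socket remote from the HCA, so we stop
 * hammering the cache line the HCA is writing across the inter-socket link. */
#include <infiniband/verbs.h>
#include <stdbool.h>

static bool stall_before_poll = false;  /* set at init: Sandy Bridge + remote socket */
static int  stall_iterations  = 200;    /* back-off length; tunable assumption */

static int poll_cq_with_stall(struct ibv_cq *cq, int num_entries, struct ibv_wc *wc)
{
    if (stall_before_poll) {
        for (volatile int i = 0; i < stall_iterations; i++) {
#if defined(__x86_64__) || defined(__i386__)
            __asm__ __volatile__("pause");   /* hint to the CPU that we are spinning */
#endif
        }
    }
    return ibv_poll_cq(cq, num_entries, wc);
}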

@hjelmn
Member

hjelmn commented Feb 5, 2016

Looks like we have a Sandy Bridge system with QLogic. Let me see if I can get the same slowdown with this system.

@hjelmn
Member

hjelmn commented Feb 5, 2016

No luck getting a similar slowdown with libibverbs 1.0.8 from RHEL 6.7 with QLogic. I will have to test the workaround on the system @gpaulsen is running on.

@jsquyres
Member

jsquyres commented Feb 9, 2016

Update from discussion on the Feb 9 2016 WebEx: it may not be the well-known Sandy Bridge bug -- it seems like the latency added is too high (the Sandy Bridge bug only added hundreds of nanoseconds, not multiple microseconds). @gpaulsen is going to test with MXM/RC to see if he can duplicate the issue -- if it's a hardware / driver issue, it should show up with MXM, too.

That being said, we still want the progressive backoff progress functionality, but probably not for v2.0.0. A good target would likely be v2.1.x.

@hjelmn
Member

hjelmn commented Feb 9, 2016

@jsquyres multiple microseconds, not milli.

@jsquyres
Member

jsquyres commented Feb 9, 2016

@hjelmn Thanks -- I edited/fixed the comment.

@gpaulsen
Member Author

gpaulsen commented Feb 9, 2016

I THINK it's showing up with MXM also. I can reproduce with just MXM across 2 nodes (fast on the first socket and slow on the 2nd), or MXM on the same node. The thing is, I'm getting nasty messages saying that MXM was unable to be opened, so now I don't know WHAT is happening.

[gpaulsen@mpi03 ompibase]$ mpirun --hostfile hosts -np 2 --bind-to core --display-map --report-bindings --mca mtl mxm --map-by core --cpu-set 9,11 ./ping_pong_ring.x3 64 |& tee mxm_one_node_both_2nd_socket.out
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
 Data for JOB [2344,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: mpi03   Num slots: 16   Max slots: 0    Num procs: 2
        Process OMPI jobid: [2344,1] App: 0 Process rank: 0 Bound: socket 1[core 9[hwt 0-1]]:[../../../../../../../..][../BB/../../../../../..]
        Process OMPI jobid: [2344,1] App: 0 Process rank: 1 Bound: socket 1[core 11[hwt 0-1]]:[../../../../../../../..][../../../BB/../../../..]

 =============================================================
[mpi03:07262] MCW rank 0 bound to socket 1[core 9[hwt 0-1]]: [../../../../../../../..][../BB/../../../../../..]
[mpi03:07262] MCW rank 1 bound to socket 1[core 11[hwt 0-1]]: [../../../../../../../..][../../../BB/../../../..]
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:      mpi03
Framework: mtl
Component: mxm
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:      mpi03
Framework: mtl
Component: mxm
--------------------------------------------------------------------------
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
[0:mpi03] ping-pong 64 bytes ...
64 bytes: 10.64 usec/msg
64 bytes: 6.01 MB/sec
[1:mpi03] ping-pong 64 bytes ...
64 bytes: 10.66 usec/msg
64 bytes: 6.00 MB/sec

And then if I run the same command with --cpu-set 1,4 (both on the first socket), I see that the ranks are each correctly bound to a different core on the first socket, and I get good latency. BUT I still see this "Could not find component" error message.

Is my command correct? Is it using MXM intra-node? Or is it falling back to TCP or SHMEM or something else?

@gpaulsen
Member Author

gpaulsen commented Feb 9, 2016

ANOTHER thing...

I do NOT see this behavior with Platform MPI using VERBS RC, but it's not calling ibv_poll_cq():

[gpaulsen@mpi03 ompibase]$ /opt/ibm/platform_mpi/bin/mpirun -np 2 -intra=nic -prot -IBV -aff=bandwidth,v ~/bin/ppr.x 64
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
Host 0 -- ip 9.21.55.63 -- ranks 0 - 1

 host | 0
======|======
    0 : IBV

 Prot -  All Intra-node communication is: IBV

Host 0 -- ip 9.21.55.63 -- [0+16 1+17 2+18 3+19 4+20 5+21 6+22 7+23],[8+24 9+25 10+26 11+27 12+28 13+29 14+30 15+31]
- R0: [11 00 00 00 00 00 00 00],--  : 0x10001
- R1: --,[11 00 00 00 00 00 00 00]  : 0x1000100
[0:mpi03] ping-pong 64 bytes ...
64 bytes: 1.67 usec/msg
64 bytes: 38.35 MB/sec
[1:mpi03] ping-pong 64 bytes ...
64 bytes: 1.68 usec/msg
64 bytes: 38.17 MB/sec

When I try Platform MPI and turn on SRQ mode (which I think calls ibv_poll_cq), I do notice the change, but it's not this horrible performance. Perhaps SRQ mode introduces enough of a delay that we're not hitting this issue? I'm checking to see if we call ibv_poll_cq in this mode.

[gpaulsen@mpi03 ompibase]$ /opt/ibm/platform_mpi/bin/mpirun -srq -np 2 -intra=nic -prot -IBV -aff=bandwidth,v ~/bin/ppr.x 64
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
Host 0 -- ip 9.21.55.63 -- ranks 0 - 1

 host | 0
======|======
    0 : IBV

 Prot -  All Intra-node communication is: IBV

Host 0 -- ip 9.21.55.63 -- [0+16 1+17 2+18 3+19 4+20 5+21 6+22 7+23],[8+24 9+25 10+26 11+27 12+28 13+29 14+30 15+31]
- R0: [11 00 00 00 00 00 00 00],--  : 0x10001
- R1: --,[11 00 00 00 00 00 00 00]  : 0x1000100
[0:mpi03] ping-pong 64 bytes ...
64 bytes: 2.38 usec/msg
64 bytes: 26.91 MB/sec
[1:mpi03] ping-pong 64 bytes ...
64 bytes: 2.36 usec/msg
64 bytes: 27.08 MB/sec

@jladd-mlnx
Member

Geoff, looks like you haven't built your OMPI with MXM support.


@jladd-mlnx
Member

Your experiment 'With MXM' is not valid. It's falling back onto the BTL.


@gpaulsen
Member Author

gpaulsen commented Feb 9, 2016

Ah, yes. Drat. Thanks. ompi_info | grep -i mxm shows nothing... rebuilding.

@gpaulsen
Member Author

I updated to the latest master and can no longer reproduce.
I'll try to reproduce on the 2.0 branch Tuesday morning (I'm off today).

@gpaulsen
Member Author

I reran with the Open MPI 2.0 branch, and the latency hit is about 500 ns to RDMA via openib to the far socket on Haswell.

@jladd-mlnx
Member

Nice. Exactly what I would expect.


@gpaulsen
Member Author

Okay, great, good to know. I have the additional item to test on 1.10 to see if it was a regression, but we're pretty sure it's not. Assuming it's not, we'll close this.

@jladd-mlnx
Member

Geoff, did you mean Sandy Bridge?


@gpaulsen
Member Author

Oh, right, sorry. Sandy Bridge: Intel(R) Xeon(R) CPU E5-2660.

@jladd-mlnx
Member

OK. This is expected on SB.


@hppritcha
Member

So what's the outcome here? I thought we'd decided that this could be closed in discussions last week?

@gpaulsen
Member Author

gpaulsen commented Mar 2, 2016

Yes, I think so too. I committed to testing with 1.10 to prove it's not a regression, but that seems very unlikely, and I'm swamped at the moment.

@gpaulsen gpaulsen closed this as completed Mar 2, 2016
jsquyres added a commit to jsquyres/ompi that referenced this issue Aug 23, 2016: coll/base verbose, and neg priority cleanup