
btl: add an rdma only btl for using uct #4919

Merged 2 commits into open-mpi:master on Jun 26, 2018

Conversation

@hjelmn (Member) commented Mar 15, 2018

This commit adds a new btl for one-sided communication only. This btl
uses the uct layer in OpenUCX. It makes use of multiple uct contexts
and device pinning to provide good performance when using threads and
osc/rdma. This btl cannot be used with pml/ob1 at this time. It has
been tested extensively with osc/rdma and passes all MTT tests on
Aries hardware.

For now this new component disables itself but can be enabled by
setting the btl_uct_transports MCA variable to a comma-delimited
list of supported memory domains/transport layers. For example:
--mca btl_uct_transports ugni-ugni_rdma.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

@hjelmn (Member Author) commented Mar 15, 2018

@yosefe Now we can argue about the relative merits of osc/rdma + btl/uct vs osc/ucx. Here are the results so far:

btl/ugni (fastest):

mpirun  --mca osc ^ucx --mca btl_uct_transports ugni-ugni_rdma --mca osc_rdma_btls ugni,openib -n 2 -N 1 --bind-to socket rmamt_bw -x -t 16 -o put -s flush -i 1000 2>&1 | tee out
##########################################
# RMA-MT Bandwidth
#
# Operation: put
# Sync: flush
# Thread count: 16
# Iterations: 1000
# Ibarrier: no, sleep interval: 10000ns
# Bind worker threads: yes
# Number of windows: 1
##########################################
  BpT(16)	  BxT(16)	Bandwidth(MiB/s)	Message_Rate(M/s)
        1	       16	       18.564		 19466022.666
        2	       32	       37.619		 19723258.036
        4	       64	       75.514		 19795487.866
        8	      128	      149.811		 19636069.988
       16	      256	      300.699		 19706617.729
       32	      512	      601.167		 19699047.797
       64	     1024	     1190.705		 19508507.538
      128	     2048	     2321.862		 19020695.706
      256	     4096	     4082.585		 16722268.208
      512	     8192	     5943.750		 12172798.968
     1024	    16384	     7445.997		  7624700.790
     2048	    32768	     7596.990		  3889659.122
     4096	    65536	     8356.294		  2139211.316
     8192	   131072	     9130.748		  1168735.792
    16384	   262144	     9344.420		   598042.845
    32768	   524288	     9435.311		   301929.951
    65536	  1048576	     9476.555		   151624.873
   131072	  2097152	     9487.441		    75899.531
   262144	  4194304	     9493.385		    37973.540

btl/uct (still very good, though the aries tl is not optimal):

mpirun  --mca osc ^ucx --mca btl_uct_transports ugni-ugni_rdma --mca osc_rdma_btls uct,openib  -n 2 -N 1 --bind-to socket rmamt_bw -x -t 16 -o put -s flush -i 1000 2>&1 | tee out
##########################################
# RMA-MT Bandwidth
#
# Operation: put
# Sync: flush
# Thread count: 16
# Iterations: 1000
# Ibarrier: no, sleep interval: 10000ns
# Bind worker threads: yes
# Number of windows: 1
##########################################
  BpT(16)	  BxT(16)	Bandwidth(MiB/s)	Message_Rate(M/s)
        1	       16	       14.821		 15540845.713
        2	       32	       29.664		 15552553.049
        4	       64	       57.495		 15071845.604
        8	      128	      114.418		 14997028.714
       16	      256	      238.134		 15606345.540
       32	      512	      473.390		 15512052.865
       64	     1024	      929.684		 15231944.434
      128	     2048	     1838.197		 15058511.730
      256	     4096	     3557.929		 14573276.255
      512	     8192	     5683.658		 11640130.806
     1024	    16384	     7330.364		  7506292.384
     2048	    32768	     8185.718		  4191087.391
     4096	    65536	     5679.973		  1454073.145
     8192	   131072	     6393.785		   818404.443
    16384	   262144	     6970.270		   446097.284
    32768	   524288	     7375.021		   236000.676
    65536	  1048576	     7754.446		   124071.134
   131072	  2097152	     7917.473		    63339.783
   262144	  4194304	     8009.919		    32039.678

osc/ucx (really bad for small messages):

mpirun  --mca osc ucx -n 2 -N 1 --bind-to socket rmamt_bw -x -t 16 -o put -s flush -i 1000 2>&1 | tee out
##########################################
# RMA-MT Bandwidth
#
# Operation: put
# Sync: flush
# Thread count: 16
# Iterations: 1000
# Ibarrier: no, sleep interval: 10000ns
# Bind worker threads: yes
# Number of windows: 1
##########################################
  BpT(16)	  BxT(16)	Bandwidth(MiB/s)	Message_Rate(M/s)
        1	       16	        0.445		   466518.166
        2	       32	        1.337		   700773.917
        4	       64	        2.697		   707078.563
        8	      128	        5.423		   710838.402
       16	      256	       10.616		   695711.703
       32	      512	       21.485		   704026.911
       64	     1024	       42.511		   696499.481
      128	     2048	       84.214		   689882.834
      256	     4096	      165.378		   677386.879
      512	     8192	      326.862		   669413.717
     1024	    16384	      614.650		   629401.137
     2048	    32768	     1023.041		   523796.809
     4096	    65536	     1136.166		   290858.408
     8192	   131072	     1347.477		   172477.030
    16384	   262144	     2248.915		   143930.546
    32768	   524288	     4008.367		   128267.741
    65536	  1048576	     6250.420		   100006.720
   131072	  2097152	     8071.948		    64575.584
   262144	  4194304	     8563.655		    34254.620

@hjelmn (Member Author) commented Mar 15, 2018

This is on a Cray XC system with Haswell processors. You do not want to see the osc/ucx results @ 32 threads or on knl. It gets really really bad.

@hjelmn (Member Author) commented Mar 15, 2018

I do have to say, uct is a very reasonable fit for a btl. The only semantics that don't completely match are the completion semantics (local vs remote). This is fine for osc/rdma though.

@jladd-mlnx (Member):

@hjelmn can you post the data for thread single?

@jladd-mlnx (Member):

@xinzhao3

@jladd-mlnx (Member):

@hjelmn is this an OSU benchmark?

@xinzhao3 (Contributor):

@hjelmn Nathan, is this the benchmark you are using? http://www.cs.sandia.gov/smb/rma-mt.html

@hjelmn (Member Author) commented Mar 15, 2018

@xinzhao3 It's a modified version that also pins the worker threads. I can't release it until I get an OK from LANL. I will get you access once I have the OK, though it could be a while (lawyers).

Single-threaded with OSU, though something is wrong since the high-end bandwidth is way low.

btl/uct:

mpirun  --mca osc ^ucx --mca btl_uct_transports ugni-ugni_rdma --mca osc_rdma_btls uct,openib --mca osc_base_verbose 0 -n 2 -N 1 --bind-to core ./osu_put_bw
# OSU MPI_Put Bandwidth Test v5.3
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_flush
# Size      Bandwidth (MB/s)
1                       1.78
2                       3.56
4                       7.05
8                      14.09
16                     28.65
32                     54.80
64                    107.20
128                   214.79
256                   421.95
512                   826.50
1024                 1552.63
2048                 2301.67
4096                 4439.05
8192                 6128.77
16384                7505.59
32768                8554.46
65536                9228.94
131072               9517.16
262144               9491.26
524288               8560.75
1048576              8401.46
2097152              8401.46
4194304              8408.53

mpirun  --mca osc ^ucx --mca btl_uct_transports ugni-ugni_rdma --mca osc_rdma_btls uct,openib --mca osc_base_verbose 0 -n 2 -N 1 --bind-to core ./osu_put_latency
# OSU MPI_Put Latency Test v5.3
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_flush
# Size          Latency (us)
0                       0.04
1                       1.14
2                       1.14
4                       1.14
8                       1.14
16                      1.14
32                      1.15
64                      1.17
128                     1.18
256                     1.19
512                     1.21
1024                    1.27
2048                    1.52
4096                    2.13
8192                    2.67
16384                   3.87
32768                   5.09
65536                   8.60
131072                 14.90
262144                 28.30
524288                 54.79
1048576               107.63
2097152               213.27
4194304               425.86

osc/ucx:

mpirun  --mca osc ucx --mca btl_uct_transports ugni-ugni_rdma --mca osc_rdma_btls uct,openib --mca osc_base_verbose 0 -n 2 -N 1 --bind-to core ./osu_put_bw
# OSU MPI_Put Bandwidth Test v5.3
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_flush
# Size      Bandwidth (MB/s)
1                       2.31
2                       4.61
4                      10.53
8                      21.03
16                     37.05
32                     74.04
64                    147.37
128                   294.79
256                   578.19
512                  1121.26
1024                 2040.18
2048                 2743.45
4096                 2494.84
8192                 2397.47
16384                4640.44
32768                7738.90
65536                8882.82
131072               9341.79
262144               9345.77
524288               8556.47
1048576              8448.75
2097152              8473.68
4194304              8493.53

mpirun  --mca osc ucx --mca btl_uct_transports ugni-ugni_rdma --mca osc_rdma_btls uct,openib --mca osc_base_verbose 0 -n 2 -N 1 --bind-to core ./osu_put_latency
# OSU MPI_Put Latency Test v5.3
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_flush
# Size          Latency (us)
0                       0.05
1                       1.13
2                       1.14
4                       1.10
8                       1.10
16                      1.13
32                      1.14
64                      1.16
128                     1.16
256                     1.17
512                     1.19
1024                    1.23
2048                    1.47
4096                    2.21
8192                    3.59
16384                   6.70
32768                   8.76
65536                  12.67
131072                 20.59
262144                 36.10
524288                 66.95
1048576               128.60
2097152               251.04
4194304               497.91

Latencies are almost identical. There is a difference in small-message bandwidth; this is probably due to different protocol selections and tuning. With rmamt and 1 thread, osc/ucx is slower.

@hjelmn (Member Author) commented Mar 15, 2018

rmamt 1 thread:

btl/uct:

mpirun  --mca osc ^ucx --mca btl_uct_transports ugni-ugni_rdma --mca osc_rdma_btls ucx,openib --mca osc_base_verbose 0 -n 2 -N 1 --bind-to socket /lustre/ttscratch1/hjelmn/rmamt_benchmarks/src/rmamt_bw -x -t 1 -o put -s flush -i 1000 2>&1 | tee out
##########################################
# RMA-MT Bandwidth
#
# Operation: put
# Sync: flush
# Thread count: 1
# Iterations: 1000
# Ibarrier: no, sleep interval: 10000ns
# Bind worker threads: yes
# Number of windows: 1
##########################################
  BpT(1)	  BxT(1)	Bandwidth(MiB/s)	Message_Rate(M/s)
        1	        1	        1.302		  1364736.838
        2	        2	        2.647		  1387855.707
        4	        4	        5.219		  1368022.742
        8	        8	       10.457		  1370565.194
       16	       16	       20.738		  1359053.881
       32	       32	       42.471		  1391681.917
       64	       64	       82.061		  1344492.622
      128	      128	      163.632		  1340475.386
      256	      256	      324.015		  1327164.672
      512	      512	      628.830		  1287843.275
     1024	     1024	     1189.524		  1218072.790
     2048	     2048	     1777.268		   909961.163
     4096	     4096	     4245.841		  1086935.256
     8192	     8192	     6049.973		   774396.590
    16384	    16384	     7775.083		   497605.275
    32768	    32768	     8797.514		   281520.436
    65536	    65536	     9201.532		   147224.516
   131072	   131072	     9364.482		    74915.856
   262144	   262144	     9389.455		    37557.822
   524288	   524288	     9474.814		    18949.628
  1048576	  1048576	     9494.220		     9494.219
  2097152	  2097152	     9503.151		     4751.576
  4194304	  4194304	     9494.578		     2373.645

osc/ucx:

mpirun  --mca osc ucx --mca btl_uct_transports ugni-ugni_rdma --mca osc_rdma_btls ucx,openib --mca osc_base_verbose 0 -n 2 -N 1 --bind-to socket /lustre/ttscratch1/hjelmn/rmamt_benchmarks/src/rmamt_bw -x -t 1 -o put -s flush -i 1000 2>&1 | tee out
##########################################
# RMA-MT Bandwidth
#
# Operation: put
# Sync: flush
# Thread count: 1
# Iterations: 1000
# Ibarrier: no, sleep interval: 10000ns
# Bind worker threads: yes
# Number of windows: 1
##########################################
  BpT(1)	  BxT(1)	Bandwidth(MiB/s)	Message_Rate(M/s)
        1	        1	        1.233		  1293155.843
        2	        2	        3.109		  1630262.896
        4	        4	        6.082		  1594235.245
        8	        8	       12.198		  1598761.919
       16	       16	       24.770		  1623347.635
       32	       32	       49.213		  1612598.913
       64	       64	       99.158		  1624605.627
      128	      128	      190.378		  1559575.795
      256	      256	      376.512		  1542193.647
      512	      512	      736.898		  1509167.438
     1024	     1024	     1390.398		  1423767.231
     2048	     2048	     1986.502		  1017089.132
     4096	     4096	      897.387		   229731.006
     8192	     8192	     1184.091		   151563.629
    16384	    16384	     3271.639		   209384.925
    32768	    32768	     5763.955		   184446.545
    65536	    65536	     8481.175		   135698.799
   131072	   131072	     8761.659		    70093.275
   262144	   262144	     9039.583		    36158.331
   524288	   524288	     9178.323		    18356.646
  1048576	  1048576	     9319.072		     9319.072
  2097152	  2097152	     9341.931		     4670.965
  4194304	  4194304	     9356.176		     2339.044

Hmm, osc/ucx is slower for 1 byte and faster for some other sizes. I strongly suspect it is because there is an extra progress call in the btl put function. This helps the multi-threaded case, as it spreads the load of reaping completions across all the threads. Maybe not. I will investigate, as it should be a wash with a single thread.

@hjelmn (Member Author) commented Mar 15, 2018

Looks like most of the small message difference is having to unpack the rkey on every call. This is a small mismatch between the uct and btl interfaces that will hopefully be corrected in a future version of ucx.
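To illustrate the mismatch: UCT hands over a packed rkey buffer that must be unpacked before an RDMA call can use it, while the BTL interface assumes a ready-to-use remote handle, so the unpack ends up on every operation. A per-endpoint cache of the unpacked key is one way around that. The following is only an illustrative sketch of that idea; the structures and the expensive_unpack function are hypothetical stand-ins (the real UCT call would be uct_rkey_unpack), not the actual btl/uct code.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the unpacked key (uct_rkey_bundle_t in UCT). */
typedef struct {
    uint64_t key;
    int      valid;
} rkey_bundle_t;

static int unpack_calls;  /* counts how often the expensive unpack runs */

/* Stand-in for uct_rkey_unpack(); treat this as the expensive step. */
static void expensive_unpack (const void *packed, rkey_bundle_t *out)
{
    ++unpack_calls;
    memcpy (&out->key, packed, sizeof (out->key));
    out->valid = 1;
}

/* Per-endpoint cache: unpack once, then reuse the bundle on every
 * subsequent RDMA operation that presents the same packed rkey. */
typedef struct {
    const void    *packed;   /* identity of the cached packed rkey */
    rkey_bundle_t  bundle;
} endpoint_rkey_cache_t;

static const rkey_bundle_t *get_rkey (endpoint_rkey_cache_t *cache,
                                      const void *packed)
{
    if (!cache->bundle.valid || cache->packed != packed) {
        expensive_unpack (packed, &cache->bundle);
        cache->packed = packed;
    }
    return &cache->bundle;
}
```

With the cache, repeated puts to the same endpoint pay the unpack cost once instead of per call.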

@xinzhao3 (Contributor):

@hjelmn Thanks, Nathan. Could you release the MT code once it is allowed? I want to try it on a Mellanox machine.

@hjelmn (Member Author) commented Mar 16, 2018

Sure. It would be interesting to see how btl/uct holds up there; ugni is not the optimal network for ucx. I pushed some osc/rdma updates that should be used for any performance comparisons.

@thananon (Member):

@hjelmn This is interesting. I would like to give it a try on Cori but I'm not sure how to properly use this.

So I built with UCX, but when I run ompi_info, uct doesn't show up as a BTL. Am I doing something wrong? What is the proper command to use this? Here is what I tried (and failed):

mpirun --mca btl_ucx_transports ugni-ugni_rdma ./app 
mpirun --mca btl uct,self ./app

@hjelmn (Member Author) commented Mar 19, 2018

@thananon It needs OpenUCX installed, which you can build yourself. I tested with master and it works on Trinity. I also recommend totally disabling pml/ucx because there seem to be some issues with it on Crays right now. Also, do not disable btl/ugni; that is needed for two-sided communication, as btl/uct is only for OSC.

Open MPI configure line:

./configure --with-ucx=/path/to/ucx --enable-mca-no-build pml-ucx

Run:

mpirun --mca btl_uct_transports ugni-ugni_rdma --mca osc_rdma_btls uct

btl/uct will not be able to match btl/ugni, but the performance is not bad. By the way, if you keep it to yourself I can get you a copy of the updated btl/ugni that I am working on. It shows almost perfect thread scaling with the RMA-MT benchmarks. I plan to push it for Open MPI v4.0.0.

continue;
}

mca_btl_uct_context_lock (context);
@artpol84 (Contributor) commented Apr 2, 2018

@hjelmn I think you'll get better performance if you use trylock and allow multiple threads to independently call progress on multiple workers. One thread is unlikely to fully utilize the NIC.

Contributor:

I meant that, as it is now, one thread can be held up by another thread holding the lock on context 0, for example. If you use trylock you can skip context 0 and go on to context 1.

Contributor:

If I'm not mistaken, one concerning thing is that if the number of contexts is less than the number of threads, this flush in one thread can block other threads from posting new requests while it waits for the current work to finish.

Member Author:

The problem is the requirements of MPI_Win_flush. All operations targeting a remote process (or all remote processes) need to be flushed regardless of which thread started them, so we can't skip a device context. This is something the endpoints proposal would be able to help with. In theory flushing everything will have an impact, but without many apps it is hard to tell what the real impact is. I can modify the RMA-MT benchmarks to include this scenario and see how we might be able to work around this limitation without endpoints. It might require tweaking the BTL interface (which is fine).

Contributor:

But you can still let multiple threads do the progress. You can skip a locked context, but do multiple iterations on top of the for (i = 0; i < num_contexts; i++) loop to make sure everything is done. I'd guess this would have better performance, as multiple threads will be progressing.
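A minimal sketch of that trylock pattern, using plain pthreads rather than the real btl/uct structures (context_t, init_contexts, and the progress counter are all hypothetical stand-ins; a real implementation would call uct_worker_progress() where the counter is incremented):

```c
#include <pthread.h>
#include <stdbool.h>

#define NUM_CONTEXTS 4

/* Hypothetical stand-in for a btl/uct device context. */
typedef struct {
    pthread_mutex_t lock;
    int progress_count;   /* how many times this context was progressed */
} context_t;

static context_t contexts[NUM_CONTEXTS];

static void init_contexts (void)
{
    for (int i = 0 ; i < NUM_CONTEXTS ; ++i) {
        pthread_mutex_init (&contexts[i].lock, NULL);
        contexts[i].progress_count = 0;
    }
}

/* Progress every context without blocking on one that another thread is
 * already progressing: trylock, skip on contention, and keep iterating
 * over the whole set until this thread has progressed each context once. */
static void progress_all_contexts (void)
{
    bool done[NUM_CONTEXTS] = {false};
    int remaining = NUM_CONTEXTS;

    while (remaining > 0) {
        for (int i = 0 ; i < NUM_CONTEXTS ; ++i) {
            if (done[i]) continue;
            if (0 == pthread_mutex_trylock (&contexts[i].lock)) {
                ++contexts[i].progress_count;  /* uct_worker_progress() here */
                pthread_mutex_unlock (&contexts[i].lock);
                done[i] = true;
                --remaining;
            }
            /* else: skip; another thread holds this context's lock */
        }
    }
}
```

This keeps the flush requirement intact (nothing returns until every context has been progressed) while letting concurrent threads drain different workers instead of queueing on context 0.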

Member Author:

Good point. I will make that change.

@artpol84 (Contributor) commented Apr 2, 2018

@hjelmn I think the main advantage of this implementation is the multiple contexts. I like the idea of static thread-local variables to hold the context ID assigned to the thread: very simple and yet efficient.
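That thread-local binding might look roughly like the following sketch. All names here (NUM_CONTEXTS, uct_get_context_id) are hypothetical stand-ins, not the actual btl/uct identifiers; the point is just the first-use, round-robin assignment through a static thread-local variable:

```c
#include <stdatomic.h>

#define NUM_CONTEXTS 8

static atomic_int next_context;                       /* shared round-robin counter */
static _Thread_local int uct_thread_context_id = -1;  /* per-thread binding */

/* Return the device context ID bound to the calling thread, assigning
 * one round-robin on first use. Subsequent calls from the same thread
 * always return the same ID, so its operations stay on one worker. */
int uct_get_context_id (void)
{
    if (uct_thread_context_id < 0) {
        uct_thread_context_id =
            atomic_fetch_add (&next_context, 1) % NUM_CONTEXTS;
    }
    return uct_thread_context_id;
}
```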

Having multiple ucp contexts/workers will improve osc/ucx as well; we are looking into that now. We are using the benchmark that I wrote some time ago:
https://github.com/artpol84/poc/tree/master/benchmarks/message_rate
It has now been updated (in-house) to have multiple "modules":

  • mpi send/recv
  • ucx send/recv
  • mxm send/recv
  • ucx one-sided (in progress)
  • mpi one-sided (in progress)

We will put it into public access this week, I think. I will update.

@artpol84 (Contributor) commented Apr 2, 2018

Also, can you provide results for methods other than "flush"? Flush obviously looks performant in your implementation, as all threads can intensively progress different workers (even without the "trylock" optimization that I mentioned). It would be interesting to see how the other methods behave in your environment.

@hjelmn (Member Author) commented Apr 2, 2018

Yes, multiple contexts is one of the advantages. The other (for me) is maintainability of the code. As far as OMPI RMA is concerned, there is very little benefit to treating UCX as anything other than a byte transport layer. The bandwidth difference between osc/ucx and osc/rdma is going to come down to how well each has been optimized.

And I would be very careful about how you add multiple contexts to the OSC. An app can use any number of windows, and with the btl strategy they all use the same shared resources. Resource sharing can be done in the osc component, but the multi-threaded performance of osc/ucx is the least of my concerns; its scalability is a far bigger concern. This is part of the reason I recommended last year not targeting ucx with an osc component.

@artpol84 (Contributor) commented Apr 2, 2018

Thank you for the clarification. Regarding "Its scalability is a far bigger concern. This is part of the reason I recommended last year not targeting ucx with an osc component": can you elaborate on this?

@hjelmn (Member Author) commented Apr 2, 2018

There is little performance difference when changing the synchronization method. At this small a scale, lock, lock-all, pscw, and fence are all about the same. That changes at larger scale (especially fence), but the benchmark doesn't yet look at anything other than single-pair communication.

@hjelmn (Member Author) commented Apr 2, 2018

@artpol84 A lot of work has been put into reducing the memory footprint, handling locking efficiently, and improving the time scaling of osc/rdma. With a new component you are either starting from scratch or just copying osc/rdma, and at that point why bother? A btl is < 4k lines of code with comments; a well-optimized osc component is much larger.

Right now the lock-all scaling of osc/ucx is bad. I mean really bad. O(n) bad.

@hjelmn (Member Author) commented Apr 2, 2018

Also, I am going to work on a PR sometime later this month or next that would (at compile time) inline a btl with osc/rdma. It will be some work but will reduce the overhead. We have a similar (but less complicated) concept for the pml.

@artpol84 (Contributor) commented Apr 2, 2018

But that is a question of maintenance. I was interested in the scalability concern. Is it something in particular, or is it about the strategic decision of adding osc/ucx?

@hjelmn (Member Author) commented Apr 2, 2018

It's nothing that can't be fixed with some work. I'm just saying the work really isn't worth it, since it has already been done once.

@artpol84 (Contributor) commented Apr 2, 2018

I see, thank you very much!

@hjelmn (Member Author) commented Apr 2, 2018

@artpol84 No problem :)

@hjelmn (Member Author) commented Apr 2, 2018

I do plan to commit this btl in the near future (in its current disabled-by-default form). I need to work through some bugs when using the target hardware though. Should have that done in the next several weeks.

@angainor:

@hjelmn Is this in a shape where I could test it on an InfiniBand EDR system? Can I apply this PR to master/3.1.0rc4, or do you plan to make more changes?

@artpol84 (Contributor):

@angainor
I would avoid doing that. We were unable to use it with InfiniBand even after doing some debugging.
Also note that this BTL will not work as-is with send/receive, as it uses multiple workers and I'm not sure how they will map onto the tag matching.

@hjelmn (Member Author) commented Apr 26, 2018

Yeah, I would wait on using this with InfiniBand. I plan to have it working for RMA on InfiniBand in the next couple of weeks.

@hjelmn (Member Author) commented May 9, 2018

@xinzhao3 That took forever. You can find the latest RMA-MT benchmarks @ http://github.com/hpc/rma-mt

@hjelmn (Member Author) commented Jun 11, 2018

By request I am currently adding active-message (AM) features to this BTL. So far it is working with RC and UD. The performance is good (at least on ARM), and pml/ob1 + btl/uct is fairly close to pml/ucx for two-sided communication. I plan to finish the code changes this week and update this PR. The component will still be disabled by default.

@ibm-ompi:

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/446018bf818e671422ba235105d12cdc

@ibm-ompi:

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/ec89369be352ab82143690f09977c0cb

@ibm-ompi:

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/1395ad33819ee94538c4b4d297a65950

@ibm-ompi:

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/16670bcec69b62c813fae1662fadf079

@hjelmn (Member Author) commented Jun 25, 2018

Should be good to go now. Still disabled by default. Tested on multiple ib/mlx5 systems and a ugni/aries system. This BTL should be a suitable replacement for btl/openib when using Mellanox IB.

@ibm-ompi:

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/d1348cfc899fd700bfd3a6d26d758a8b

@thananon (Member):

This looks great! I will give it a try sometime this week, especially two-sided MT. @bosilca

@ibm-ompi:

The IBM CI (GNU Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/a58c9e5d5e2008bcca70a9d36255387c

@hjelmn (Member Author) commented Jun 25, 2018

^&$^#. Still can't figure out why the -luct is missing. Back to the configury.

@thananon (Member):

hello_c: symbol lookup error: /smpi_dev/mpiczar/jenkins/workspace/ompi_public_pr_master_gnu/ompi-install/lib/openmpi/mca_pml_ucx.so: undefined symbol: ucp_config_read

isn't that ucp?

@ibm-ompi:

The IBM CI (PGI Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/d94b66fd2c4260dfdbcc5c3f17d5acfd

@hjelmn (Member Author) commented Jun 25, 2018

@thananon Yeah, I think something I did screwed up the configury for pml/ucx. All ucx components in my tree are just getting -lucp, but sometimes it looks like they only get -luct (maybe?). I am now adding -luct to the btl to see if that resolves it. I can figure out a better fix later.

@hjelmn (Member Author) commented Jun 25, 2018

@thananon Note the option to enable btl/uct is now --mca btl_uct_memory_domains. For IB it will look something like --mca btl_uct_memory_domains ib/mlx5_0,ib/mlx5_3. I have tested multi-rail with pml/ob1 and it seems to be working correctly.

@ibm-ompi:

The IBM CI (XL Compiler) build failed! Please review the log, linked below.

Gist: https://gist.github.com/4bd054b89a3a8ace61cce301e3b378a8

This commit adds a new btl for one-sided and two-sided communication.
This btl uses the uct layer in OpenUCX. It makes use of multiple uct
contexts and per-thread device pinning to provide good performance
when using threads and osc/rdma. This btl has been tested extensively
with osc/rdma and passes all MTT tests on Aries and IB hardware.

For now this new component disables itself but can be enabled by
setting the btl_uct_memory_domains MCA variable to a comma-delimited
list of supported memory domains. For example:
--mca btl_uct_memory_domains ib/mlx5_0. The specific transports used
can be selected using --mca btl_uct_transports. The default is to use
any available transport.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
@hjelmn (Member Author) commented Jun 25, 2018

It was a missing line in the Makefile.am :D

@hjelmn hjelmn merged commit 6c08951 into open-mpi:master Jun 26, 2018

8 participants