
Performance Tuning


This page discusses methods for optimizing and troubleshooting the performance of Sandia OpenSHMEM applications.

FAQ

Which combination of settings will get the best performance?

For performance builds, we recommend disabling error checking (--disable-error-checking) and enabling remote virtual addressing (--enable-remote-virtual-addressing). Note that remote virtual addressing is incompatible with address space layout randomization (ASLR). If ASLR is enabled, disabling position-independent executable code via LDFLAGS may be a workable solution; when taking this route, also configure with --disable-aslr-check. A high-performance OFI provider or Portals 4 build is also required.
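
As an illustrative sketch, a performance-oriented build might be configured as follows (the libfabric installation path and the -no-pie linker flag are assumptions that depend on your system and compiler):

$ ./configure --disable-error-checking --enable-remote-virtual-addressing \
      --disable-aslr-check --with-ofi=/path/to/libfabric LDFLAGS="-no-pie"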

I'm seeing a lot of noise in my performance numbers.

Portals 4 and some OFI providers utilize communication threads. It is helpful to bind the SHMEM PEs and their companion threads to a set of cores using the --bind-to option from Hydra. For example, oshrun --bind-to core:2 ... assigns two cores to each PE, and oshrun --bind-to hwthread:2 ... assigns two hardware threads to each PE. For more details on binding options, run oshrun --bind-to -h. In addition, multi-threaded applications may benefit from binding to a number of cores (roughly) equal to the number of application threads. For example, when running shmem_perf_suite's multi-threaded blocking put bandwidth test (shmem_bw_put_ctx_perf) with 8 threads, consider trying a command similar to:
SHMEM_OFI_STX_MAX=9 oshrun --bind-to core:8 -ppn 1 -n 2 ./shmem_bw_put_ctx_perf -T 8 -C MULTIPLE

The affinity of the OFI sockets provider's progress thread can also introduce noise; it can be controlled through the FI_SOCKETS_PE_AFFINITY environment variable, as described in the Sockets Provider section below.

Similarly, the duty cycle and affinity of the PSM2 provider's progress thread can be tuned through the FI_PSM2_PROG_INTERVAL and FI_PSM2_PROG_AFFINITY environment variables, as described in the PSM2 Provider section below.

If you are using Portals 4, the progress thread can be prevented from sleeping by setting PTL_PROGRESS_NOSLEEP=1, as described in the Portals 4 Performance Considerations section below.

Additional noise can come from bounce buffering in the SOS runtime system. This can be disabled by setting the environment variable SMA_BOUNCE_SIZE=0.
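
For example, before launching the application:

$ export SMA_BOUNCE_SIZE=0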

The process launcher itself can also be a source of variation in execution times for SOS applications. If you still experience performance variation after addressing the issues above, please try the MPICH 3.2 Hydra process launcher (available from this link) before reporting the issue to the SOS developers.

OFI Performance Considerations:

Sockets Provider

Performance variation or degradation in the sockets provider can be due to the affinity of the progress thread, which you can control directly through the FI_SOCKETS_PE_AFFINITY environment variable. See the output of the fi_info -e command for details.
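
For example, to pin the progress thread to a particular core (the core number is illustrative, and ./shmem_app stands in for your application; see the fi_info -e output for the exact affinity string format):

$ export FI_SOCKETS_PE_AFFINITY=3
$ oshrun -n 2 ./shmem_app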

PSM2 Provider

Note: PSM2 users should use libfabric 1.6 or later to achieve the best performance and most efficient resource utilization in Sandia OpenSHMEM applications.

Progress

Performance variation or degradation in the PSM2 provider can be due to the affinity and polling interval of the progress thread. The provider allows you to control the duty cycle and affinity of this thread through the FI_PSM2_PROG_INTERVAL and FI_PSM2_PROG_AFFINITY environment variables. Refer to the provider manpage for additional details (the latest version is here: https://ofiwg.github.io/libfabric/main/man/fi_psm2.7.html).
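
For example (the interval and core values are illustrative; consult the manpage above for the exact semantics and units):

$ export FI_PSM2_PROG_INTERVAL=1000
$ export FI_PSM2_PROG_AFFINITY=1
$ oshrun -n 2 ./shmem_app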

SOS can also improve progress performance with the "manual progress" mode. To enable manual progress, pass --enable-manual-progress at configure time. Manual progress mode makes intermittent libfabric calls to read a receiver counter, forcing the runtime to make progress. This setting is particularly effective with the PSM2 provider when using multiple threads in SHMEM_THREAD_MULTIPLE mode.
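
A minimal sketch of such a build (other configure options elided):

$ ./configure --enable-manual-progress ...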

Shareable Transmit Contexts (STXs)

By default, the Host Fabric Interface (HFI) driver used by PSM2 may restrict the number of PEs per device to no more than the number of physical processor cores on the compute node. If you want to oversubscribe PEs to the available physical cores (e.g., to run 1 PE per Intel® Hyper-Threaded logical core), it may be necessary for the system administrator to configure the HFI to permit a larger number of contexts. For example, if you want 80 contexts, first remove the hfi1 module:

$ rmmod hfi1

Then reload it with the new parameter:

$ modprobe hfi1 num_user_contexts=80

You can put a file hfi1.conf under /etc/modprobe.d to make this the default parameter. The contents of the file would then be:

options hfi1 num_user_contexts=80

While the Intel® Omni-Path Architecture 100 HFI supports 80 contexts, you will likely achieve better performance with fewer contexts, because each context consumes hardware resources.

Please see the STX section below for more detailed information about optimizing the utilization of STX resources on other providers.

Lazy Connections with libfabric v1.6

In libfabric v1.6, the PSM2 provider transitioned to using a separate PSM2 endpoint for each OFI endpoint, whereas in the past all communication was multiplexed over a single PSM2 endpoint. Setting the FI_PSM2_LAZY_CONN environment variable ensures that connections are not made between remote endpoints that never communicate, which may improve performance during SOS finalization. It also moves the connection setup overhead from initialization time to the first communication between each endpoint pair; this overhead can usually be amortized over later communication activities. For applications that perform few communications over each connection, the overhead can be more noticeable, but even in such cases the overall time may be similar because the initialization time is reduced.
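
For example (the value 1 is assumed here to enable lazy connections; see the provider manpage for details):

$ export FI_PSM2_LAZY_CONN=1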

Process Affinity

As described in the FAQ above, the --bind-to "X" argument to the launcher (or the --cpu_bind argument to srun, etc.) has considerable effects on application performance. On PSM2, the optimal value for "X" depends on the number of processes, threads, NUMA effects, and the application. Binding to core is usually a good place to start, but highly threaded applications on multi-socket systems may benefit from --bind-to node or similar.
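
For example, the following request core binding under Hydra and Slurm, respectively (the process and node counts are illustrative):

$ oshrun --bind-to core -ppn 16 -n 32 ./shmem_app
$ srun --cpu_bind=cores -N 2 -n 32 ./shmem_app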

Multiple HFI

If the system provides multiple HFI units, a particular unit can be selected through the HFI_UNIT environment variable. For example, the first HFI unit can be selected by:

export HFI_UNIT=0

By default, HFI_UNIT is unset and thus all available units in the system are auto-detected and used.

Portals 4 Performance Considerations:

If you are using Portals 4 revision 32c452f or later, you can set the environment variable PTL_PROGRESS_NOSLEEP=1 to prevent the Portals progress thread from sleeping. This eliminates noise from waking up the progress thread, but requires that the progress thread is assigned its own hardware thread.
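
For example (the binding and PE counts are illustrative; the key point is leaving a spare hardware thread for the progress thread):

$ export PTL_PROGRESS_NOSLEEP=1
$ oshrun --bind-to hwthread:2 -n 2 ./shmem_app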

Shareable Transmit Contexts (STXs)

The OFI transport layer in SOS makes use of shareable transmit contexts (STXs) for managing communication. An STX is a libfabric software object that represents a resource shared across multiple transport endpoints. Ideally, each STX would be associated with a transmit hardware resource within the host fabric interface (HFI) or network interface card (NIC) on each compute node. SOS provides the SHMEM_OFI_STX_AUTO environment variable, which attempts to limit the maximum number of STX objects to the number of outbound command queues optimally supported by the provider. In addition, when SHMEM_OFI_STX_AUTO is enabled, SOS partitions the STX resources evenly across PEs that share a compute node. Setting SHMEM_DEBUG=1 before running an SOS application will print useful information regarding the STX resources. For example, the following output:

STX[8] = [ 2S 1S 1S 1S 3P 3P 0S 0S ]

shows that this PE uses 8 STX resources ("S" denotes shared contexts and "P" private contexts). The first STX has 2 shared contexts, the next 3 STXs have 1 shared context each, and the next two have 3 private contexts each. The 7th and 8th STXs are unused.

Setting SHMEM_OFI_STX_AUTO may not achieve optimal performance, so it may be necessary to set additional parameters. When SHMEM_OFI_STX_AUTO is enabled, you may optionally set SHMEM_OFI_STX_NODE_MAX to the desired maximum number of STXs per compute node (instead of using the value that SHMEM_OFI_STX_AUTO obtains from libfabric). These STXs are evenly partitioned across PEs that reside on the same compute node. If SHMEM_OFI_STX_AUTO is off, then SHMEM_OFI_STX_NODE_MAX has no effect.
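
As an illustrative example (all values are hypothetical), the following caps the per-node STX count at 16, partitions those STXs across 8 PEs per node, and prints the resulting allocation:

$ SHMEM_OFI_STX_AUTO=1 SHMEM_OFI_STX_NODE_MAX=16 SHMEM_DEBUG=1 oshrun -ppn 8 -n 16 ./shmem_app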

Setting SHMEM_OFI_STX_DISABLE_PRIVATE may improve load balance across transmit resources, especially in scenarios where the number of contexts exceeds the number of STXs.
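
For example (the value 1 is assumed here to enable the setting):

$ SHMEM_OFI_STX_DISABLE_PRIVATE=1 oshrun -n 16 ./shmem_app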

Other STX parameters related to the allocation algorithm (SHMEM_OFI_STX_ALLOCATOR and SHMEM_OFI_STX_THRESHOLD) may also improve the performance of Sandia OpenSHMEM applications. More information on these parameters and others can be found in the README file.