Troubleshooting

David Ozog edited this page Nov 7, 2022 · 36 revisions

Below are tips for resolving commonly encountered problems.

Bugs and Other Gotchas

I think I found a bug!

Great! Please check the issues page to see if it has already been reported. If not, please file a new issue.

Including the full output, your build settings, and a test case that reproduces the problem will help us diagnose and correct the error. Enabling the SHMEM_INFO and SHMEM_DEBUG environment variables produces additional output that is helpful to developers.
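
One way to collect that output is to wrap the failing run in a small helper that enables both variables and saves stdout and stderr to a file. This is only a sketch: the helper name, the log file name, the value 1 for both variables, and the example oshrun command line are assumptions, not something SOS prescribes.

```shell
# Hypothetical helper: run a command with SOS diagnostics enabled and
# capture everything it prints (stdout and stderr) in a log file.
# Assumes that setting the variables to 1 turns them on.
run_with_sos_debug() {
    logfile=$1
    shift
    env SHMEM_INFO=1 SHMEM_DEBUG=1 "$@" > "$logfile" 2>&1
}

# Example (binary name and PE count are placeholders):
#   run_with_sos_debug sos-bug.log oshrun -np 2 ./my_test
```

The resulting log file can then be attached to the issue along with the build settings.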


CMA runs fail with an "Operation Not Permitted" error.

You may need to disable Linux ptrace protection: https://wiki.ubuntu.com/SecurityTeam/Roadmap/KernelHardening#ptrace_Protection

In Ubuntu, this can be done by running sudo sysctl kernel.yama.ptrace_scope=0 on the nodes that will execute OpenSHMEM processes.
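
To confirm the setting on a node, the current value can be read back from /proc. The sketch below assumes a kernel that exposes the Yama setting at /proc/sys/kernel/yama/ptrace_scope; the function name is made up for illustration.

```shell
# Report the current Yama ptrace restriction level, if the kernel exposes it.
# 0 means classic ptrace permissions; higher values are more restrictive.
ptrace_scope_status() {
    scope_file=/proc/sys/kernel/yama/ptrace_scope
    if [ -r "$scope_file" ]; then
        echo "kernel.yama.ptrace_scope = $(cat "$scope_file")"
    else
        echo "kernel.yama.ptrace_scope is not available on this kernel"
    fi
}

ptrace_scope_status
```

Note that a value set with sysctl on the command line does not persist across reboots.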

Troubleshooting the process manager

SOS supports a number of different process manager configurations; run configure --help for details.


The process manager is not finding my binary

Process managers do not assume that the binary is in the current working directory. You must therefore pass a path to the binary (rather than simply the binary’s name) as an argument to the process manager.
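
As an illustration, a launch script can turn a bare binary name into an explicit path before handing it to the launcher. The helper name, the binary name hello, and the oshrun -np 2 command line are assumptions made for this sketch.

```shell
# Hypothetical helper: leave names that already contain a slash untouched,
# and prefix bare names with ./ so they refer to the current directory.
explicit_path() {
    case "$1" in
        */*) printf '%s\n' "$1" ;;
        *)   printf './%s\n' "$1" ;;
    esac
}

# Example (assumed launcher flags):
#   oshrun -np 2 "$(explicit_path hello)"    # runs ./hello
```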


I'm seeing a failure in the global_exit test case.

SOS implements the shmem_global_exit() routine using the process manager's PMI_Abort() functionality. In older versions of Hydra, this was not implemented properly. Please update to Hydra 3.2 or newer.


The oshrun wrapper is not choosing the right launcher.

Set the OSHRUN_LAUNCHER environment variable to the correct launcher.
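
For example, a job script could pick the first launcher that is actually installed and export it. This is a sketch only: the helper name and the candidate launcher names are assumptions, and it presumes that oshrun accepts a command name found on PATH.

```shell
# Hypothetical helper: print the first launcher from the argument list that
# is found on PATH, or return non-zero if none is installed.
first_available_launcher() {
    for candidate in "$@"; do
        if command -v "$candidate" > /dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Candidate names are examples only; list the launchers used at your site.
if launcher=$(first_available_launcher mpiexec.hydra srun); then
    export OSHRUN_LAUNCHER="$launcher"
fi
```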


shmem_my_pe() returns 0 for all PEs and shmem_num_pes() returns 1 in multi-process jobs.

Either the launcher is not supported by SOS, or the launcher requires a particular PMI that needs to be enabled in the SOS configuration (e.g., using --with-pmi). When in doubt, please use a recent version of the Hydra launcher, because Hydra is tested with all SOS releases.

Troubleshooting the OFI build

OFI is not selecting the right provider.

The SMA_OFI_PROVIDER environment variable can be used to request a specific provider from libfabric, e.g. SMA_OFI_PROVIDER=sockets.


OFI is not selecting the right domain/fabric.

The SMA_OFI_DOMAIN and SMA_OFI_FABRIC environment variables can be used to request a specific domain or fabric from libfabric. The domain and fabric names can be queried by running the fi_info -c "FI_RMA|FI_ATOMIC" -t FI_EP_RDM program provided with libfabric. See the SOS README file for additional details.
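
The fabric and domain names can be pulled out of the fi_info listing with a short filter. The sketch below assumes the default (non-verbose) fi_info output, in which each entry contains a "fabric:" line followed by a "domain:" line; the helper name is hypothetical.

```shell
# Hypothetical helper: read fi_info output on stdin and print each distinct
# "fabric domain" pair, one pair per line.
list_fabric_domains() {
    awk '$1 == "fabric:" { fabric = $2 }
         $1 == "domain:" { print fabric, $2 }' | sort -u
}

# Typical use, if fi_info is installed:
#   fi_info -c "FI_RMA|FI_ATOMIC" -t FI_EP_RDM | list_fabric_domains
# Then export the chosen pair before launching, for example:
#   export SMA_OFI_FABRIC=<fabric-name> SMA_OFI_DOMAIN=<domain-name>
```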


OFI reports "No space left on device" (e.g. on the Cray Aries interconnect)

The following warning suggests that the default maximum number of STX resources (16) is too high:

Warning: Unable to initialize DLA, GNI_RC_ERROR_RESOURCE at line 506 in file cdm.c
WARN:  ../../src/transport_ofi.c:585: bind_enable_cntr_ep_resources
       fi_ep_bind STX to CNTR endpoint failed
WARN:  ../../src/transport_ofi.c:1340: shmem_transport_ofi_ctx_init
       context bind/enable CNTR endpoint failed (No space left on device)

Lowering the value of the environment variable SHMEM_OFI_STX_MAX fixes this issue.
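
If it is not obvious how far to lower the value, one option is to try a few candidates in turn. The helper below is a sketch: its name and the candidate values are arbitrary choices, and a run can of course fail for reasons unrelated to STX resources.

```shell
# Hypothetical helper: run a command with successively smaller values of
# SHMEM_OFI_STX_MAX and print the first value for which it exits successfully.
# The candidate values (halving from 8) are an arbitrary choice.
find_working_stx_max() {
    for n in 8 4 2 1; do
        if env SHMEM_OFI_STX_MAX="$n" "$@" > /dev/null 2>&1; then
            printf '%s\n' "$n"
            return 0
        fi
    done
    return 1
}

# Example (launcher flags and binary name are placeholders):
#   find_working_stx_max oshrun -np 2 ./my_test
```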

Troubleshooting a socket provider

Experiencing issues with socket providers

You can check which socket providers are available by invoking the fi_info tool, which is provided by the OFI libfabric library. The tool is located in the bin directory of the libfabric install area. It is convenient to add that directory to your PATH so that the tool can be invoked from anywhere without its full path:

  • An example of setting the environment PATH in bash:
      $ export PATH=<path-to-libfabric-install>/bin:$PATH

Troubleshooting the PSM2 provider

PSM2 error: can't open hfi unit.

If the following error is printed at launch:

hfi_userinit: assign_context command failed: Invalid argument
PSM2 can't open hfi unit: -1 (err=23)

then please set the PSM2_SHAREDCONTEXTS environment variable to 0.

This bug has been filed at http://ibbugzilla.ph.intel.com/bugzilla/show_bug.cgi?id=135318, and the fix should be in libpsm2 versions newer than 10.2.84.


PSM2 error: message from unknown process.

The following error:

Received eager message(s) ptype=0x1 opcode=0xcc from an unknown process

likely means that a previous job with the same PSM2_UUID is still running (i.e. didn’t terminate properly). Killing any latent processes should remove the error.


PSM2 error: assertion failure

The following error:

Assertion failure at [...]/ptl_ips/ips_proto.c:1877: (scb->payload_size & 0x3) == 0

is caused by a bug in older versions of PSM2. Please upgrade your PSM2 library.


PSM2 and SHMEM_THREAD_MULTIPLE

The following error message when initializing SOS in SHMEM_THREAD_MULTIPLE mode:

[0000] WARN:  transport_ofi.c:1182: query_for_fabric
[0000]        OFI transport did not find any valid fabric services (provider=<auto>)
[0000] ERROR: init.c:259: shmem_internal_init
[0000]        Transport init failed (-61)

may occur because the PSM2 provider supports the FI_THREAD_COMPLETION model instead of the default FI_THREAD_SAFE model that is assumed by SOS. To enable support for FI_THREAD_COMPLETION, SOS must be configured with the --enable-thread-completion flag.


Missing dynamic libraries, such as psm_infinipath

If dynamic libraries cannot be found at run time, try disabling the libtool wrapper, which may interfere with the path to some of the dynamic libraries used by Sandia-OpenSHMEM (such as the infinipath library). To do so, add the following option to the Sandia-OpenSHMEM configuration: --disable-libtool-wrapper
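
To see which libraries the loader fails to resolve for a given binary, the output of ldd can be filtered. This is a generic diagnostic sketch rather than anything specific to SOS; the helper name is hypothetical, and it assumes ldd reports unresolved libraries with the text "not found".

```shell
# Hypothetical helper: print the names of shared libraries that the dynamic
# loader cannot resolve for the given binary (empty output means none).
missing_libs() {
    ldd "$1" 2> /dev/null | awk '/not found/ { print $1 }'
}

# Example (binary name is a placeholder):
#   missing_libs ./my_test
```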

Troubleshooting the Portals Build

I'm seeing several failures in the test suite when using the sockets build of Portals 4.

Unfortunately, this is a known issue. For sockets builds, we recommend using OFI.


I got the following error message: ptl_mr.c:456: mr_lookup: Assertion `res == ((void *)0)' failed.

This is caused by a variation in the IB Verbs implementation. Add --enable-zero-mrs to your Portals 4 configuration to correct for it.


I got the following error message: PtlLEAppend of all memory failed: 1.

Some configurations (e.g. if configured with --with-cma) appear to be incompatible with remote virtual addressing. Try reconfiguring without --enable-remote-virtual-addressing.


My system does not have support for ummunotify.

If you’re running without ummunotify or KNEM, you’ll need to set the following environment variable: PTL_DISABLE_MEM_REG_CACHE=1. Note that this will negatively impact performance, but it is needed for correct behavior on your system.

Troubleshooting MPI Interoperability Use-case

SOS allows OpenSHMEM routines to be used together with MPI routines in a hybrid MPI + OpenSHMEM program, but the only usage mode currently supported is PMI-MPI. To build and run such programs, configure SOS with --enable-pmi-mpi and CC=mpicc. In addition, the program must initialize MPI before OpenSHMEM and finalize OpenSHMEM before MPI. Any other ordering of the initialization and finalization routines may lead to undefined behavior. The following is an example of a valid MPI + OpenSHMEM program that can be built and run with SOS:

#include <mpi.h>
#include <shmem.h>
...

int main (int argc, char *argv[]) {
  MPI_Init(&argc, &argv);  // MPI must be initialized first
  shmem_init();            // then OpenSHMEM

  // Other program code

  shmem_finalize();        // finalize OpenSHMEM first
  MPI_Finalize();          // then MPI
  return 0;
}

Performance Troubleshooting

Please refer to the Performance Tuning wiki page.

Fortran Users

OpenSHMEM functions like shmem_my_pe are not defined?

That's right! SHMEM implementations have historically waffled on whether these functions are declared in the header file or by the application. The OpenSHMEM specification chose the latter semantic. If you would like a header file that includes all of the function declarations, it can be selected when configuring the build via the --enable-long-fortran-header option.