Bump Parthenon and Kokkos #114

Merged
merged 19 commits into main on Sep 9, 2024

Conversation

@pgrete pgrete (Contributor) commented Sep 4, 2024

Updates Parthenon to 24.08 and Kokkos to 4.4.0 (both released last month).
Changes to the interface are described in the Changelog.

@pgrete pgrete (Contributor, Author) commented Sep 4, 2024

Looks like I still need to update the new hst file name in the tests.

@BenWibking BenWibking (Contributor) previously approved these changes Sep 4, 2024 and left a comment:


LGTM

@BenWibking BenWibking dismissed their stale review September 4, 2024 16:08

oops, missed test failure

@pgrete pgrete (Contributor, Author) commented Sep 4, 2024

@par-hermes format

@pgrete pgrete (Contributor, Author) commented Sep 6, 2024

I think I caught everything now and tests pass again.
Would you mind reviewing the changes again, @BenWibking?

@BenWibking BenWibking (Contributor) left a comment:


The code looks fine, but the MPI regression test looks like it's still failing.

@BenWibking BenWibking (Contributor) commented:

The MPI regression error messages are very odd, and they all happen for the cluster_magnetic_tower test:

10/12 Test #22: regression_mpi_test:cluster_magnetic_tower .........***Failed  260.63 sec

OpenMPI errors:

--------------------------------------------------------------------------
The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
cannot be used.
  cuIpcGetMemHandle return value:   1
  address: 0x34dc78788
Check the cuda.h file for what the return value means. Perhaps a reboot
of the node will clear the problem.
--------------------------------------------------------------------------
[d4ca0f87c519:10106] [[41645,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/show_help.c at line 501
[d4ca0f87c519:10106] [[41645,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/show_help.c at line 501
[d4ca0f87c519:10106] [[41645,0],0] ORTE_ERROR_LOG: Out of resource in file util/show_help.c at line 507

Actual error that causes the regression to fail:

Traceback (most recent call last):
  File "/__w/athenapk/athenapk/external/parthenon/scripts/python/packages/parthenon_tools/parthenon_tools/phdf.py", line 147, in __init__
    f = h.File(filename, "r")
  File "/usr/lib/python3/dist-packages/h5py/_debian_h5py_serial/_hl/files.py", line 507, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
  File "/usr/lib/python3/dist-packages/h5py/_debian_h5py_serial/_hl/files.py", line 220, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_debian_h5py_serial/_objects.pyx", line 54, in h5py._debian_h5py_serial._objects.with_phil.wrapper
  File "h5py/_debian_h5py_serial/_objects.pyx", line 55, in h5py._debian_h5py_serial._objects.with_phil.wrapper
  File "h5py/_debian_h5py_serial/h5f.pyx", line 106, in h5py._debian_h5py_serial.h5f.open
OSError: Unable to open file (file signature not found)

@pgrete pgrete (Contributor, Author) commented Sep 7, 2024

This is all so annoying...
I tried varying things (again "investing" hours) without success.
I was not able to get a CUDA-aware MPI build working cleanly with the updated (Ubuntu 22.04, CUDA 12.1) container.
I tried (with a small MPI ping-pong test):

  • OpenMPI 5.0.3 (couldn't convince it to report that CUDA-aware MPI is available, even though the CUDA extension is being built)
  • OpenMPI 5.0.3 with the CUDA-aware UCX transport layer -- again didn't work (segfault on sends from GPU buffers)
  • OpenMPI 4.0.4 (the version in the old container) -- works for some cases but then fails for others with the cuIpcGetMemHandle error above (not sure where that's coming from, but I assume it's related to running inside a Docker container)

So I went back to the CUDA 11.6 / Ubuntu 20.04 container and am now testing various combinations of scipy, h5py, and numpy, which by default have some incompatibilities due to the use of deprecated interfaces...

Such a mess...

@BenWibking BenWibking (Contributor) commented Sep 7, 2024

I've gotten OpenMPI 5 + UCX with CUDA-awareness to work outside of a container, so I'm a bit surprised that combination fails. Does it work outside the container? Do ompi_info and ucx_info show a CUDA entry?
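
A minimal check, assuming both tools are on the PATH inside the container (exact output differs between builds):

# A CUDA-aware OpenMPI build should report mpi_built_with_cuda_support:value:true
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value

# A CUDA-enabled UCX should list cuda_copy / cuda_ipc transports here
ucx_info -d | grep -i cuda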

@pgrete pgrete (Contributor, Author) commented Sep 9, 2024

Leaving this for posterity -- something about the IPC is odd (note the node=787462364b88):

root@787462364b88:/athenapk/build# /opt/openmpi/bin/mpirun -np 2 --mca opal_cuda_verbose 10 --mca btl_smcuda_cuda_ipc_verbose 100  /athenapk/build/bin/athenaPK -i /athenapk/inputs/cluster/hydro_agn_feedback.in parthenon/output2/id=kinetic_only_precessed_True parthenon/output2/dt=0.005 parthenon/time/tlim=0.005 hydro/gamma=1.6666666666666667 hydro/He_mass_fraction=0.25 units/code_length_cgs=3.085677580962325e+24 units/code_mass_cgs=1.98841586e+47 units/code_time_cgs=3.15576e+16 problem/cluster/uniform_gas/init_uniform_gas=true problem/cluster/uniform_gas/rho=147.7557589278723 problem/cluster/uniform_gas/ux=0.0 problem/cluster/uniform_gas/uy=0.0 problem/cluster/uniform_gas/uz=0.0 problem/cluster/uniform_gas/pres=1.5454368403867562 problem/cluster/precessing_jet/jet_phi0=1.2 problem/cluster/precessing_jet/jet_phi_dot=0 problem/cluster/precessing_jet/jet_theta=0.4 problem/cluster/agn_feedback/fixed_power=0.3319965633348792 problem/cluster/agn_feedback/efficiency=0.001 problem/cluster/agn_feedback/thermal_fraction=0.0 problem/cluster/agn_feedback/kinetic_fraction=1.0 problem/cluster/agn_feedback/magnetic_fraction=0 problem/cluster/agn_feedback/thermal_radius=0.1 problem/cluster/agn_feedback/kinetic_jet_temperature=10000000.0 problem/cluster/agn_feedback/kinetic_jet_radius=0.05 problem/cluster/agn_feedback/kinetic_jet_thickness=0.05 problem/cluster/agn_feedback/kinetic_jet_offset=0.01 --kokkos-map-device-id-by=mpi_rank
[787462364b88:10789] Sending CUDA IPC REQ (try=1): myrank=0, mydev=0, peerrank=1
[787462364b88:10790] Sending CUDA IPC REQ (try=1): myrank=1, mydev=1, peerrank=0
[787462364b88:10789] Not sending CUDA IPC ACK because request already initiated
[787462364b88:10790] Analyzed CUDA IPC request: myrank=1, mydev=1, peerrank=0, peerdev=0 --> ACCESS=1
[787462364b88:10790] BTL smcuda: rank=1 enabling CUDA IPC to rank=0 on node=787462364b88 
[787462364b88:10790] Sending CUDA IPC ACK:  myrank=1, mydev=1, peerrank=0, peerdev=0
[787462364b88:10789] Received CUDA IPC ACK, notifying PML: myrank=0, peerrank=1
[787462364b88:10789] BTL smcuda: rank=0 enabling CUDA IPC to rank=1 on node=787462364b88 
Starting up hydro driver
# Variables in use:
# Package: parthenon::resolved_state
# ---------------------------------------------------
# Variables:
# Name	Metadata flags
# ---------------------------------------------------
theta_sph                 Cell,Provides,Real,Derived,OneCopy,Hydro,parthenon::resolved_state
mach_sonic                Cell,Provides,Real,Derived,OneCopy,Hydro,parthenon::resolved_state
log10_cell_radius         Cell,Provides,Real,Derived,OneCopy,Hydro,parthenon::resolved_state
magnetic_tower_A          Cell,Provides,Real,Derived,OneCopy,Hydro,parthenon::resolved_state
prim                      Cell,Provides,Real,Derived,Hydro,parthenon::resolved_state
v_r                       Cell,Provides,Real,Derived,OneCopy,Hydro,parthenon::resolved_state
temperature               Cell,Provides,Real,Derived,OneCopy,Hydro,parthenon::resolved_state
entropy                   Cell,Provides,Real,Derived,OneCopy,Hydro,parthenon::resolved_state
cons                      Cell,Provides,Real,Independent,FillGhost,WithFluxes,Hydro,parthenon::resolved_state
bnd_flux::cons            Face,Provides,Real,Derived,OneCopy,Flux,parthenon::resolved_state
# ---------------------------------------------------
# Sparse Variables:
# Name	sparse id	Metadata flags
# ---------------------------------------------------
# ---------------------------------------------------
# Swarms:
# Swarm	Value	metadata
# ---------------------------------------------------


Setup complete, executing driver...

cycle=0 time=0.0000000000000000e+00 dt=5.0000000000000001e-03 zone-cycles/wsec_step=0.00e+00 wsec_total=6.69e-01 wsec_step=2.72e+00
--------------------------------------------------------------------------
The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
cannot be used.
  cuIpcGetMemHandle return value:   1
  address: 0x63a34e300
Check the cuda.h file for what the return value means. Perhaps a reboot
of the node will clear the problem.
--------------------------------------------------------------------------
[787462364b88:10785] *** Process received signal ***
[787462364b88:10785] Signal: Segmentation fault (11)
[787462364b88:10785] Signal code: Address not mapped (1)
[787462364b88:10785] Failing at address: (nil)
[787462364b88:10785] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fc8cb6b2090]
[787462364b88:10785] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x183bf2)[0x7fc8cb7f2bf2]
[787462364b88:10785] [ 2] /opt/openmpi/lib/libopen-rte.so.40(+0x2d821)[0x7fc8cb9a9821]
[787462364b88:10785] [ 3] /opt/openmpi/lib/libopen-rte.so.40(orte_show_help_recv+0x177)[0x7fc8cb9a9cb7]
[787462364b88:10785] [ 4] /opt/openmpi/lib/libopen-rte.so.40(orte_rml_base_process_msg+0x3e1)[0x7fc8cba077a1]
[787462364b88:10785] [ 5] /opt/openmpi/lib/libopen-pal.so.40(opal_libevent2022_event_base_loop+0x7b3)[0x7fc8cb8edf13]
[787462364b88:10785] [ 6] /opt/openmpi/bin/mpirun(+0x14a1)[0x561273e924a1]
[787462364b88:10785] [ 7] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fc8cb693083]
[787462364b88:10785] [ 8] /opt/openmpi/bin/mpirun(+0x11fe)[0x561273e921fe]
[787462364b88:10785] *** End of error message ***
Segmentation fault (core dumped)

@pgrete pgrete (Contributor, Author) commented Sep 9, 2024

I was getting around a couple of errors like

Setup complete, executing driver...

cycle=0 time=0.0000000000000000e+00 dt=5.0000000000000001e-03 zone-cycles/wsec_step=0.00e+00 wsec_total=3.95e-01 wsec_step=2.46e+00
--------------------------------------------------------------------------
The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
cannot be used.
  cuIpcGetMemHandle return value:   1
  address: 0x638e733c0
Check the cuda.h file for what the return value means. Perhaps a reboot
of the node will clear the problem.
--------------------------------------------------------------------------
[787462364b88:10817] [[2900,0],0] ORTE_ERROR_LOG: Out of resource in file util/show_help.c at line 507
[787462364b88:10817] [[2900,0],0] ORTE_ERROR_LOG: Out of resource in file util/show_help.c at line 507
cycle=1 time=5.0000000000000001e-03 dt=1.0000000000000000e-02 zone-cycles/wsec_step=1.79e+05 wsec_total=4.41e+00 wsec_step=4.02e+00

Driver completed.
time=5.00e-03 cycle=1
tlim=5.00e-03 nlim=-1

walltime used = 4.41e+00
zone-cycles/wallsecond = 1.63e+05

by disabling CUDA IPC altogether:
export OMPI_MCA_btl_smcuda_use_cuda_ipc=0
Let's see how this goes.
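
(For the record, the same MCA parameter can also be passed per-run on the command line instead of via the environment -- a sketch using the binary from above; <input file> is a placeholder:)

# Equivalent to the export above, scoped to a single run
/opt/openmpi/bin/mpirun -np 2 --mca btl_smcuda_use_cuda_ipc 0 /athenapk/build/bin/athenaPK -i <input file>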

@pgrete pgrete merged commit 7a50cca into main Sep 9, 2024
4 checks passed
@BenWibking BenWibking (Contributor) commented Sep 17, 2024

For future reference, OLCF has example CUDA-aware MPI Dockerfiles here: https://code.ornl.gov/olcfcontainers/olcfbaseimages/-/blob/master/summit/mpiimage-centos-cuda/Dockerfile?ref_type=heads

It looks like they download the pre-built UCX and OpenMPI from NVIDIA/Mellanox:

# Accept mpi_root environment variable. Should come from host $MPI_ROOT. Should be pointing to GNU instead of XL, etc.
ARG mpi_root
 
# Set MPI environment variables
ENV PATH=$mpi_root/bin:$PATH
ENV LD_LIBRARY_PATH=$mpi_root/lib:$LD_LIBRARY_PATH
ENV LIBRARY_PATH=$mpi_root/lib:$LIBRARY_PATH
ENV INCLUDE=$mpi_root/include:$INCLUDE
ENV C_INCLUDE_PATH=$mpi_root/include:$C_INCLUDE_PATH
ENV CPLUS_INCLUDE_PATH=$mpi_root/include:$CPLUS_INCLUDE_PATH
 
# MOFED is sufficient, but is it necessary?
# Set MOFED version, OS version and platform (updated to match Summit 1/30/2024)
ENV MOFED_VER 4.9-6.0.6.1
ENV OS_VER rhel8.6
ENV PLATFORM ppc64le
ENV MOFED_DIR /mlnx

# MLNX_OFED
RUN mkdir ${MOFED_DIR} 
RUN wget https://content.mellanox.com/ofed/MLNX_OFED-${MOFED_VER}/MLNX_OFED_LINUX-${MOFED_VER}-${OS_VER}-${PLATFORM}.tgz -P ${MOFED_DIR}

RUN rm -rf /var/cache/dnf \
    && fakeroot dnf install -y perl lsof numactl-libs pciutils tk libnl3 python36 tcsh gcc-gfortran tcl libmnl ethtool fuse-libs \
	&& fakeroot dnf -y install tar wget git openssh \
	&& dnf clean all
RUN cd ${MOFED_DIR} && \
    tar -xvf MLNX_OFED_LINUX-${MOFED_VER}-${OS_VER}-${PLATFORM}.tgz --no-same-owner && \
    MLNX_OFED_LINUX-${MOFED_VER}-${OS_VER}-${PLATFORM}/mlnxofedinstall --user-space-only --without-fw-update --distro ${OS_VER} -q && \
    cd / && \
    rm -rf ${MOFED_DIR}

@fglines-nv fglines-nv commented Sep 18, 2024

@pgrete @BenWibking I'm a little late to the party; it took LAMMPS hitting the same IPC errors with GPUDirect Cray MPICH and a new version of Kokkos for me to run into this, and I only figured it out yesterday.

The core issue is that Kokkos recently made cudaMallocAsync the default allocator for GPU memory, but memory allocated with cudaMallocAsync is incompatible with the old IPC API call cuIpcGetMemHandle. It seems that almost all MPI implementations run into this problem, at least for UCX and libfabric; see openucx/ucx#7110 and ofiwg/libfabric#10162. HPC-X (NVIDIA's OpenMPI) is the only implementation that might work with cudaMallocAsync, albeit with performance hits (see https://docs.nvidia.com/hpc-sdk//hpc-sdk-release-notes/index.html#known-limitations). I've been told the new API is due to a performance issue with cuIpcGetMemHandle+cudaMallocAsync.

To switch to the new IPC API, Kokkos would need to create a CUDA memory pool separate from the default pool. One would then use that memory pool object to create a file descriptor that is passed to the MPI framework (or other process) so it can access that memory. So the total solution would involve changes to both Kokkos and the comm libraries, roughly as sketched below.
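
Roughly what that would look like at the CUDA runtime level -- a minimal single-process sketch of the pool-based path, not what Kokkos or any MPI library actually implements (error handling omitted; in practice the exported fd would be sent to the peer process out of band, e.g. over a Unix socket):

#include <cuda_runtime.h>
#include <cstdio>

int main() {
  int device = 0;
  cudaSetDevice(device);

  // Legacy IPC (cudaIpcGetMemHandle / cuIpcGetMemHandle) only covers cudaMalloc'd
  // memory; it is expected to fail for stream-ordered (cudaMallocAsync) allocations.
  void *async_ptr = nullptr;
  cudaMallocAsync(&async_ptr, 1 << 20, /*stream=*/0);
  cudaStreamSynchronize(0);
  cudaIpcMemHandle_t legacy_handle;
  cudaError_t err = cudaIpcGetMemHandle(&legacy_handle, async_ptr);
  std::printf("legacy IPC on async allocation: %s\n", cudaGetErrorString(err));

  // Pool-based path: create an explicit pool whose allocations can be exported
  // as a POSIX file descriptor, and allocate from it instead of the default pool.
  cudaMemPoolProps props = {};
  props.allocType = cudaMemAllocationTypePinned;
  props.handleTypes = cudaMemHandleTypePosixFileDescriptor;
  props.location.type = cudaMemLocationTypeDevice;
  props.location.id = device;
  cudaMemPool_t pool;
  cudaMemPoolCreate(&pool, &props);

  void *pool_ptr = nullptr;
  cudaMallocFromPoolAsync(&pool_ptr, 1 << 20, pool, /*stream=*/0);
  cudaStreamSynchronize(0);

  // The pool is exported once as an fd (shared with the peer process); individual
  // buffers are then exchanged via cudaMemPoolExportPointer / cudaMemPoolImportPointer.
  int pool_fd = -1;
  cudaMemPoolExportToShareableHandle(&pool_fd, pool, cudaMemHandleTypePosixFileDescriptor, 0);
  std::printf("pool exported as fd %d\n", pool_fd);

  cudaFreeAsync(pool_ptr, 0);
  cudaFreeAsync(async_ptr, 0);
  cudaMemPoolDestroy(pool);
  return 0;
}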

For now you could disable IPC in the MPI library, or you can disable cudaMallocAsync in Kokkos with Kokkos_ENABLE_CUDA_MALLOC_ASYNC=OFF.

Unless you're calling cudaMalloc very often (as in a very active AMR simulation), I believe disabling cudaMallocAsync, rather than disabling IPC in the MPI layer, will have the better performance outcome. IPC should be more beneficial on multi-GPU-per-node systems, and I believe for MPS as well.

@felker felker commented Sep 24, 2024

@fglines-nv I recently stumbled upon this issue with AthenaK on ALCF Polaris with Cray MPICH, and your links to the other issues led me here. Do you know when (in which version) Kokkos made this change? I think we started having issues at 4.2.00; see kokkos/kokkos#7294

We have been recompiling with -DKokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF -- is this not the correct flag? I see it was renamed from your suggestion in kokkos/kokkos@c3ec284

The Kokkos team removed the comprehensive list of CMake build flags from BUILD.md in April 2023 (kokkos/kokkos@83873a6#diff-40f60e1037245d7b8a98a7325d53890a717da9979adeb54a61a795c4ba07f9c9R114), and their wiki page is also missing the flag: https://kokkos.org/kokkos-core-wiki/keywords.html

@fglines-nv fglines-nv commented:

@felker Supposedly it was this PR in Kokkos: kokkos/kokkos#6402. The flag should be Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF. They're reverting this change in kokkos/kokkos#7353; looks like it might be merged soon.
