[BUG] Wrong result when summing on Pascal GPUs #956
The MPI abort looks like a shutdown failure. I believe we fixed one such issue recently, can you try a recent (untested) nightly build, to see if it's already been fixed?
The wrong result might be a Pascal-specific issue, @eddy16112 could you please try to reproduce on sapling, since you already have a working build there?
The MPI error seems to have gone away. However, it still gives the wrong answer when I add a GPU.
$ legate-issue
Python : 3.12.5 | packaged by conda-forge | (main, Aug 8 2024, 18:36:51) [GCC 12.4.0]
Platform : Linux-5.4.0-169-generic-x86_64-with-glibc2.31
Legion : (failed to detect)
Legate : 24.09.00.dev+230.gb4d27ab1
Cunumeric : 24.09.00.dev+97.g2217c6c8
Numpy : 1.26.4
Scipy : 1.14.1
Numba : (failed to detect)
CTK package : cuda-version-12.6-3 (nvidia)
GPU driver : 535.54.03
GPU devices :
GPU 0: Tesla P100-SXM2-16GB
GPU 1: Tesla P100-SXM2-16GB
GPU 2: Tesla P100-SXM2-16GB
GPU 3: Tesla P100-SXM2-16GB
It still works with 1 process and no GPUs:
But it says the sum is 0 when I add a GPU:
And if I try 2 nodes (using srun):
$ legate --verbose --launcher srun --nodes 2 --gpus 1 --fbmem 1000 hostname.py
--- Legion Python Configuration ------------------------------------------------
I switched to srun because mpirun gives a new error:
Also, somehow it fails when I request
What
That is weird, is there anything else running on the node that might be using up the GPU memory? Can you run
Noting that this might be the same underlying problem as nv-legate/cunumeric#1149. @amberhassaan will try to reproduce that on an internal machine with Pascals. Amber, feel free to try either benchmark.
Re: the comment above about srun vs mpirun. This looks to be an issue with how
I can use srun directly, rather than legate, to launch the processes:
With many HPCs
So if
I think I know what's happening with the
I tried replicating this issue on my home machine with a
I have verified with
Relevant log:
Oh, so
@suranap what are the conda package versions you used? Just to make sure we're using the same ones. Also, does the issue reproduce if you skip the launcher entirely, and run directly on the compute node?
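To compare environments without relying on a particular tool, the installed distribution versions can be printed programmatically. This is a hedged sketch: the distribution names (`legate-core`, `cunumeric`) are assumptions based on the packages discussed in this issue, not confirmed package names.

```python
# Hedged sketch: report versions of the distributions relevant to this
# issue. The names "legate-core" and "cunumeric" are assumptions.
import importlib.metadata as md

def get_versions(names=("legate-core", "cunumeric", "numpy")):
    """Return {distribution name: version string, or None if absent}."""
    out = {}
    for name in names:
        try:
            out[name] = md.version(name)
        except md.PackageNotFoundError:
            out[name] = None
    return out

if __name__ == "__main__":
    for name, ver in get_versions().items():
        print(name, ver if ver is not None else "(not installed)")
```

Running this on both machines and diffing the output would confirm whether the same package versions are in play.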
@suranap , @manopapad : I can reproduce the bug with
The same test works when I build
while the run with
Did you also confirm that this happens on a from-source release build? Generally, I would check with the DevOps team whether there's any difference between the build you're doing and the build that's being triggered in CI. For now, can you run using the packages inside gdb (
Update after some investigation:
Details (shortened log):
I've confirmed that we weren't compiling cunumeric kernels for |
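The specific architecture string is cut off above. Assuming the missing target is Pascal (compute capability 6.0, i.e. sm_60) and assuming a CMake-based CUDA build, one way to make sure Pascal kernels are generated is via CMake's CUDA architecture handling; this is a sketch, not the project's actual build configuration:

```shell
# Hedged sketch: for CMake >= 3.20, the CUDAARCHS environment variable
# initializes CMAKE_CUDA_ARCHITECTURES. "60" targets Pascal (sm_60);
# the other entries here are illustrative, not the project's real list.
export CUDAARCHS="60;70;80"
# Equivalently, pass it explicitly at configure time:
# cmake -DCMAKE_CUDA_ARCHITECTURES="60;70;80" ..
```

If the CI build sets a different architecture list than local from-source builds, that difference would explain kernels silently missing for Pascal.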
Software versions
Python : 3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:23:07) [GCC 12.3.0]
Platform : Linux-5.4.0-169-generic-x86_64-with-glibc2.31
Legion : legion-24.06.0-119-ga66da82b8
Legate : 24.06.01
Cunumeric : 24.06.01
Numpy : 1.26.4
Scipy : 1.14.0
Numba : (failed to detect)
/home/suranap/mambaforge/lib/python3.10/site-packages/conda_package_streaming/package_streaming.py:19: UserWarning: zstandard could not be imported. Running without .conda support.
warnings.warn("zstandard could not be imported. Running without .conda support.")
/home/suranap/mambaforge/lib/python3.10/site-packages/conda_package_handling/api.py:29: UserWarning: Install zstandard Python bindings for .conda support
_warnings.warn("Install zstandard Python bindings for .conda support")
CTK package : cuda-version-12.5-hd4f0392_3 (conda-forge)
GPU driver : 535.54.03
GPU devices :
GPU 0: Tesla P100-SXM2-16GB
GPU 1: Tesla P100-SXM2-16GB
GPU 2: Tesla P100-SXM2-16GB
GPU 3: Tesla P100-SXM2-16GB
Jupyter notebook / Jupyter Lab version
No response
Expected behavior
I'm running legate on sapling at Stanford. I'd like to run on multiple processes/nodes and multiple GPUs. I'm running a simple test program (below). I'd like to see some evidence it is partitioning the array across 2 processes. Instead, it crashes when I increase to 2 ranks, and it gives the wrong answer when I add --gpus 1.
Observed behavior
I haven't been able to run 2 processes, either on separate nodes or the same node. Here's a sample of what I've tried.
This works, but doesn't use the GPU:
$ legate --launcher mpirun --ranks-per-node 1 --fbmem 1000 hostname.py
(100, 100, 100) 2000000.0
If I add a GPU, the sum becomes 0:
$ legate --launcher mpirun --ranks-per-node 1 --gpus 1 --fbmem 1000 hostname.py
(100, 100, 100) 0.0
And if I increase ranks to 2, it crashes and complains about MPI_Abort():
$ legate --launcher mpirun --ranks-per-node 2 --gpus 2 --fbmem 1000 hostname.py
(100, 100, 100) 0.0
*** The MPI_Abort() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[g0002.stanford.edu:346219] Local abort after MPI_FINALIZE started completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
*** The MPI_Abort() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[g0002.stanford.edu:346218] Local abort after MPI_FINALIZE started completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: [[46518,1],1]
Exit code: 1
Again, if I remove the GPU option it gives the right answer, but it crashes at the end because there are 2 ranks:
$ legate --launcher mpirun --ranks-per-node 2 --fbmem 1000 hostname.py
(100, 100, 100) 2000000.0
*** The MPI_Abort() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[g0002.stanford.edu:346747] Local abort after MPI_FINALIZE started completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
*** The MPI_Abort() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[g0002.stanford.edu:346748] Local abort after MPI_FINALIZE started completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: [[47015,1],0]
Exit code: 1
I'd prefer to use srun, but it also fails with 2 ranks:
$ legate --launcher srun --launcher-extra "-c 15" --ranks-per-node 2 --gpus 1 --fbmem 1000 hostname.py
(100, 100, 100) 0.0
[1 - 7fd912f2a000] 0.000195 {4}{threads}: reservation ('GPU proc 1d00010000000007') cannot be satisfied
*** The MPI_Abort() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[g0003.stanford.edu:278113] Local abort after MPI_FINALIZE started completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: error: g0003: task 0: Exited with exit code 1
*** The MPI_Abort() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[g0003.stanford.edu:278114] Local abort after MPI_FINALIZE started completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: error: g0003: task 1: Exited with exit code 1
Example code or instructions
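The original hostname.py is not attached to the issue. Based on the printed shape and sum, `(100, 100, 100) 2000000.0`, a minimal reproducer might look like the sketch below; this is a hypothetical reconstruction, not the reporter's actual script. Plain NumPy is used here so the snippet is self-contained; in the issue the same code would presumably use cunumeric as a drop-in replacement (`import cunumeric as np`).

```python
# Hypothetical reconstruction of hostname.py, consistent with the
# reported output "(100, 100, 100) 2000000.0". In the issue this would
# presumably use cunumeric as a drop-in NumPy replacement:
#   import cunumeric as np
import numpy as np

a = np.full((100, 100, 100), 2.0)  # 10**6 elements, each 2.0
print(a.shape, a.sum())            # expected: (100, 100, 100) 2000000.0
```

On a correct build this prints a sum of 2000000.0; the bug reported here is that the sum comes back as 0.0 whenever a GPU is added.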