
[BUG] Wrong result when summing on Pascal GPUs #956

Open
suranap opened this issue Sep 17, 2024 · 15 comments

suranap commented Sep 17, 2024

Software versions

Python : 3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:23:07) [GCC 12.3.0]
Platform : Linux-5.4.0-169-generic-x86_64-with-glibc2.31
Legion : legion-24.06.0-119-ga66da82b8
Legate : 24.06.01
Cunumeric : 24.06.01
Numpy : 1.26.4
Scipy : 1.14.0
Numba : (failed to detect)
/home/suranap/mambaforge/lib/python3.10/site-packages/conda_package_streaming/package_streaming.py:19: UserWarning: zstandard could not be imported. Running without .conda support.
warnings.warn("zstandard could not be imported. Running without .conda support.")
/home/suranap/mambaforge/lib/python3.10/site-packages/conda_package_handling/api.py:29: UserWarning: Install zstandard Python bindings for .conda support
_warnings.warn("Install zstandard Python bindings for .conda support")
CTK package : cuda-version-12.5-hd4f0392_3 (conda-forge)
GPU driver : 535.54.03
GPU devices :
GPU 0: Tesla P100-SXM2-16GB
GPU 1: Tesla P100-SXM2-16GB
GPU 2: Tesla P100-SXM2-16GB
GPU 3: Tesla P100-SXM2-16GB

Jupyter notebook / Jupyter Lab version

No response

Expected behavior

I'm running legate on sapling at Stanford. I'd like to run on multiple processes/nodes and multiple GPUs. I'm running a simple test program (below) and would like to see some evidence that it is partitioning the array across 2 processes. Instead, it crashes when I increase to 2 ranks, and it gives the wrong answer when I add --gpus 1.

Observed behavior

I haven't been able to run 2 processes, either on separate nodes or on the same node. Here's a sample of what I've tried.

This works, but doesn't use the GPU:

$ legate --launcher mpirun --ranks-per-node 1 --fbmem 1000 hostname.py
(100, 100, 100)
2000000.0

If I add a GPU, the sum becomes 0:

$ legate --launcher mpirun --ranks-per-node 1 --gpus 1 --fbmem 1000 hostname.py
(100, 100, 100)
0.0

And if I increase ranks to 2, it crashes and complains about MPI_Abort():

$ legate --launcher mpirun --ranks-per-node 2 --gpus 2 --fbmem 1000 hostname.py
(100, 100, 100)
0.0
*** The MPI_Abort() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[g0002.stanford.edu:346219] Local abort after MPI_FINALIZE started completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
*** The MPI_Abort() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[g0002.stanford.edu:346218] Local abort after MPI_FINALIZE started completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus
causing the job to be terminated. The first process to do so was:

Process name: [[46518,1],1]
Exit code: 1

Again, if I remove the GPU option it gives the right answer, but it still crashes at the end when running with 2 ranks:

$ legate --launcher mpirun --ranks-per-node 2 --fbmem 1000 hostname.py
(100, 100, 100)
2000000.0
*** The MPI_Abort() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[g0002.stanford.edu:346747] Local abort after MPI_FINALIZE started completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
*** The MPI_Abort() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[g0002.stanford.edu:346748] Local abort after MPI_FINALIZE started completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus
causing the job to be terminated. The first process to do so was:

Process name: [[47015,1],0]
Exit code: 1

I'd prefer to use srun, but it also fails with 2 ranks:

$ legate --launcher srun --launcher-extra "-c 15" --ranks-per-node 2 --gpus 1 --fbmem 1000 hostname.py
(100, 100, 100)
0.0
[1 - 7fd912f2a000] 0.000195 {4}{threads}: reservation ('GPU proc 1d00010000000007') cannot be satisfied
*** The MPI_Abort() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[g0003.stanford.edu:278113] Local abort after MPI_FINALIZE started completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: error: g0003: task 0: Exited with exit code 1
*** The MPI_Abort() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[g0003.stanford.edu:278114] Local abort after MPI_FINALIZE started completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: error: g0003: task 1: Exited with exit code 1

Example code or instructions

Sample code:

import cunumeric as np

x = np.ones((100, 100, 100))
y = x + x
print(y.shape)
print(np.sum(y))
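
For reference, the same computation cross-checked against host NumPy (a sketch; it assumes the numpy package from the versions above is importable alongside cunumeric):

import numpy
import cunumeric as np

x = np.ones((100, 100, 100))
y = x + x

expected = numpy.ones((100, 100, 100)).sum() * 2  # host result: 2000000.0
actual = float(np.sum(y))                         # result from cunumeric/legate

print(expected, actual)  # the GPU runs above print 0.0 for the cunumeric value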


Stack traceback or browser console output

No response

manopapad (Contributor) commented

The MPI abort looks like a shutdown failure. I believe we fixed one such issue recently; can you try a recent (untested) nightly build to see if it's already been fixed?

conda create -n myenv -c legate/label/experimental -c conda-forge cunumeric

The wrong result might be a Pascal-specific issue, @eddy16112 could you please try to reproduce on sapling, since you already have a working build there?
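
(For reference, one way to confirm that the devices on a node are Pascal-class, i.e. compute capability 6.x, is a short query via pynvml. This is only a sketch; pynvml is an extra package not listed in the environment above.)

import pynvml  # assumption: the pynvml / nvidia-ml-py package is installed

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        major, minor = pynvml.nvmlDeviceGetCudaComputeCapability(handle)
        # P100 cards should report compute capability 6.0 (sm_60)
        print(f"GPU {i}: {pynvml.nvmlDeviceGetName(handle)} -> sm_{major}{minor}")
finally:
    pynvml.nvmlShutdown()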

suranap (Author) commented Sep 19, 2024

The MPI error seems to have gone away. However, it still gives the wrong answer when I add a GPU.

$ legate-issue
Python : 3.12.5 | packaged by conda-forge | (main, Aug 8 2024, 18:36:51) [GCC 12.4.0]
Platform : Linux-5.4.0-169-generic-x86_64-with-glibc2.31
Legion : (failed to detect)
Legate : 24.09.00.dev+230.gb4d27ab1
Cunumeric : 24.09.00.dev+97.g2217c6c8
Numpy : 1.26.4
Scipy : 1.14.1
Numba : (failed to detect)
CTK package : cuda-version-12.6-3 (nvidia)
GPU driver : 535.54.03
GPU devices :
GPU 0: Tesla P100-SXM2-16GB
GPU 1: Tesla P100-SXM2-16GB
GPU 2: Tesla P100-SXM2-16GB
GPU 3: Tesla P100-SXM2-16GB

It still works with 1 process and no GPUs:

$ legate --launcher mpirun  --fbmem 10000 hostname.py
(100, 100, 100)
2000000.0

But it says the sum is 0 when I add a GPU:

$ legate --launcher mpirun --gpus 1 --fbmem 1000 hostname.py
(100, 100, 100)
0.0

And if I try 2 nodes (using srun):

legate --verbose --launcher srun --nodes 2 --gpus 1 --fbmem 1000 hostname.py
--- Legion Python Configuration ------------------------------------------------

Legate paths:
legate_dir : /home/suranap/mambaforge/envs/legate-experimental/lib/python3.12/site-packages
legate_build_dir : None
bind_sh_path : /home/suranap/mambaforge/envs/legate-experimental/bin/bind.sh
legate_lib_path : /home/suranap/mambaforge/envs/legate-experimental/lib

Legion paths:
legion_bin_path : /home/suranap/mambaforge/envs/legate-experimental/bin
legion_lib_path : /home/suranap/mambaforge/envs/legate-experimental/lib
realm_defines_h : /home/suranap/mambaforge/envs/legate-experimental/include/realm_defines.h
legion_defines_h : /home/suranap/mambaforge/envs/legate-experimental/include/legion_defines.h
legion_spy_py : /home/suranap/mambaforge/envs/legate-experimental/bin/legion_spy.py
legion_prof : /home/suranap/mambaforge/envs/legate-experimental/bin/legion_prof
legion_module : /home/suranap/mambaforge/envs/legate-experimental/lib/python3.1/site-packages
legion_jupyter_module : /home/suranap/mambaforge/envs/legate-experimental/lib/python3.1/site-packages

Versions:
legate_version : 24.09.00.dev+230.gb4d27ab1

Command:
srun -n 2 --ntasks-per-node 1 /home/suranap/mambaforge/envs/legate-experimental/bin/bind.sh --launcher srun -- python hostname.py

Customized Environment:
CUTENSOR_LOG_LEVEL=1
GASNET_MPI_THREAD=MPI_THREAD_MULTIPLE
LEGATE_CONFIG='--cpus 4 --gpus 1 --omps 0 --ompthreads 4 --utility 2 --sysmem 4000 --numamem 0 --fbmem 1000 --regmem 0 --logdir /home/suranap/tmp --eager-alloc-percentage 50'
LEGATE_MAX_DIM=4
LEGATE_MAX_FIELDS=256
LEGATE_NEED_CUDA=1
LEGATE_NEED_NETWORK=1
NCCL_LAUNCH_MODE=PARALLEL
PYTHONDONTWRITEBYTECODE=1
PYTHONPATH=/home/suranap/mambaforge/envs/legate-experimental/lib/python3.1/site-packages:/home/suranap/mambaforge/envs/legate-experimental/lib/python3.1/site-packages
REALM_BACKTRACE=1
REALM_UCP_BOOTSTRAP_PLUGIN=/home/suranap/mambaforge/envs/legate-experimental/lib/realm_ucp_bootstrap_mpi.so
UCX_CUDA_COPY_MAX_REG_RATIO=1.0
UCX_RCACHE_PURGE_ON_FORK=n


(100, 100, 100)
0.0
(100, 100, 100)
0.0
[0 - 7fc0c5145280] 0.000120 {4}{threads}: reservation ('CPU proc 1d00000000000003') cannot be satisfied
[1 - 7f5f08396280] 0.000196 {4}{threads}: reservation ('CPU proc 1d00010000000005') cannot be satisfied

I switched to srun because mpirun gives a new error:

--------------------------------------------------------------------------
Your job has requested more processes than the ppr for
this topology can support:

  App: /home/suranap/mambaforge/envs/legate-experimental/bin/bind.sh
  Number of procs:  2
  PPR: 1:node

Please revise the conflict and try again.
--------------------------------------------------------------------------

Also, somehow it fails when I request --fbmem 10000 even though the GPU has 16 GB:

$ legate --launcher mpirun --gpus 1 --fbmem 10000 hostname.py
[0 - 7fad850d1280]    0.000000 {6}{gpu}: Failed to allocate GPU memory of size 10485760000
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node g0001 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

manopapad (Contributor) commented

> I switched to srun because mpirun gives a new error:

What legate command produces this error?

> Also, somehow it fails when I request --fbmem 10000 though the GPU has 16gb

That is weird. Is there anything else running on the node that might be using up the GPU memory? Can you run nvidia-smi on the same compute node before invoking legate?
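
(Another way to check, from Python, how much framebuffer memory is actually free on each device before launching legate; a sketch using pynvml, which is an extra dependency not listed in the environment above. It should report the same numbers as nvidia-smi.)

import pynvml  # assumption: pynvml is installed in the same environment

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        mem = pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(i))
        print(f"GPU {i}: {mem.free / 2**20:.0f} MiB free / {mem.total / 2**20:.0f} MiB total")
finally:
    pynvml.nvmlShutdown()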

manopapad (Contributor) commented

Noting that this might be the same underlying problem as nv-legate/cunumeric#1149. @amberhassaan will try to reproduce that on an internal machine with Pascals. Amber, feel free to try either benchmark.

manopapad assigned amberhassaan and unassigned eddy16112 on Sep 20, 2024

suranap (Author) commented Sep 23, 2024

Re: the comment above about srun vs mpirun: this looks to be an issue with how the legate launcher runs. It expects libcuda.so, but that isn't available on the login node. If I salloc 2 nodes and then run legate, this happens:

$ legate --launcher srun --launcher-extra "-p gpu -c 10" --nodes 2 --gpus 1 --fbmem 10000 hostname.py
Traceback (most recent call last):
  File "/home/suranap/mambaforge/envs/legate-experimental/bin/legate", line 7, in <module>
    from legate.driver import main
  File "/home/suranap/mambaforge/envs/legate-experimental/lib/python3.12/site-packages/legate/__init__.py", line 22, in <module>
    from ._lib.mapping.machine import (
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

I can use srun directly, rather than the legate launcher, to launch the processes:

$ srun -N 2 legate hostname.py
g0003.stanford.edu
(100, 100, 100)
2000000.0
g0002.stanford.edu
(100, 100, 100)
2000000.0
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[g0002.stanford.edu:873270] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[g0003.stanford.edu:802994] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: error: g0002: task 0: Exited with exit code 1
srun: error: g0003: task 1: Exited with exit code 1

On many HPC systems, salloc will get you a shell on a node. Sapling isn't set up like that, so instead I do it manually:

$ srun --interactive --preserve-env --pty $SHELL
$ legate --launcher srun --nodes 2 --fbmem 10000 hostname.py
g0003.stanford.edu
(100, 100, 100)
2000000.0
g0002.stanford.edu
(100, 100, 100)
2000000.0

So if legate could avoid loading libcuda.so during the launch phase, that would improve things. For example, if I used sbatch this could be a problem on any HPC system. There's still the problem that the sum comes back as 0 when I use --gpus. Hopefully someone can replicate this on sapling.
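
(A small preflight check for the libcuda.so problem described above; just a sketch, not something legate provides: try to dlopen the driver library before running the GPU-enabled front-end on a given node.)

import ctypes

# On a login node without the NVIDIA driver this raises OSError, which is
# effectively what the `ImportError: libcuda.so.1` traceback above is reporting.
try:
    ctypes.CDLL("libcuda.so.1")
    print("libcuda.so.1 is loadable on this node")
except OSError as exc:
    print(f"libcuda.so.1 is not loadable here: {exc}")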

manopapad (Contributor) commented

I think I know what's happening with the srun path. I have opened an issue to address it https://github.com/nv-legate/legate.core.internal/issues/1246.

manopapad changed the title from "[BUG] Unable to run multiple processes/nodes with legate" to "[BUG] Wrong result when summing on Pascal GPUs" on Sep 23, 2024

amberhassaan commented

I tried to replicate this issue on my home machine, which has a GTX 1060 card installed, but I wasn't able to. Here's how I ran the script above:

LEGATE_CONFIG="--gpus 1" python3 sum.py

I have verified with the ncu profiler that GPU kernels are indeed being invoked. I am not sure which kernels I should be looking for, though; the names didn't include the word "reduction".

amberhassaan commented

relevant log:

$ LEGATE_CONFIG="--gpus 1" ncu python3 sum.py                                                                                                                                                                     
==PROF== Connected to process 6106 (/usr/bin/python3.12)                                                                                                                                                          
(100, 100, 100)                                                                                                                                                                                                   
==PROF== Profiling "fill_affine_batch2D_64" - 0: 0%....50%....100% - 8 passes                                                                                                                                     
==PROF== Profiling "dense_kernel" - 1: 0%....50%....100% - 8 passes                                                                                                                                               
==PROF== Profiling "scalar_unary_red_kernel" - 2: 0%....50%....100% - 8 passes                                                                                                                                    
==PROF== Profiling "copy_kernel" - 3: 0%....50%....100% - 8 passes                                                                                                                                                
16000000.0                                                                                                                                                                                                        
==PROF== Disconnected from process 6106                  

amberhassaan commented

Oh, so red_kernel is probably the reduction.

manopapad (Contributor) commented

@suranap what are the conda package versions you used? Just to make sure we're using the same ones.

Also, does the issue reproduce if you skip the launcher entirely, and run directly on the compute node?

legate --gpus 1 --fbmem 1000 hostname.py

amberhassaan commented

@suranap, @manopapad: I can reproduce the bug with the conda packages. Here are the relevant bits:

 conda list | grep legate                                                                                                                                                                                       
# packages in environment at /home/mhassaan/miniconda3/envs/mylegate:                                                                                                                                             
cunumeric                 24.09.00.dev90  cuda12_py312_g50702325_90_gpu    legate/label/experimental                                                                                                              
legate                    24.09.00.dev236 cuda12_py312_g426b1b04_236_ucx_gpu    legate/label/experimental      

LEGATE_CONFIG="--gpus 1" python3 sum.py                                                                                                                                                                        
(100, 100, 100)                                                                                                                                                                                                   
0.0    

amberhassaan commented

The same test works when I build legate and cunumeric from source with the latest internal repos. Here's the correct run:

 LEGATE_CONFIG="--gpus 1" nvprof  --print-gpu-trace --openacc-profiling off python3 sum.py
==3265== NVPROF is profiling process 3265, command: python3 sum.py
(100, 100, 100)
2000000.0
==3265== Profiling application: python3 sum.py
==3265== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
2.18176s  50.400us             (20 1 1)      (1024 1 1)        32        0B        0B         -           -           -           -  NVIDIA GeForce          1        15  fill_affine_batch2D_64 [518]
2.27309s  103.14us           (7813 1 1)       (128 1 1)        10        0B        0B         -           -           -           -  NVIDIA GeForce          1        31  void cunumeric::dense_kernel<cunumeric::BinaryOp<cunumeric::BinaryOpCode, legate::Type::Code>, double, double, double>(unsigned long, cunumeric::BinaryOpCode, legate::Type::Code*, cunumeric::BinaryOp<cunumeric::BinaryOpCode, legate::Type::Code> const *, double const *) [1054]
2.31241s     448ns                    -               -         -         -         -        8B  17.030MB/s    Pageable      Device  NVIDIA GeForce          1        31  [CUDA memcpy HtoD]
2.39081s  112.16us           (1024 1 1)       (128 1 1)        37       32B       32B         -           -           -           -  NVIDIA GeForce          1        31  void cunumeric::scalar_reduction_impl::scalar_unary_red_kernel<cunumeric::DeviceScalarReductionBuffer<Legion::SumReduction<double>>, cunumeric::ScalarUnaryRed<cunumeric::VariantKind, cunumeric::UnaryRedCode, legate::Type::Code, int=3, bool=0>, double, cunumeric::ScalarUnaryRed<cunumeric::VariantKind::SparseReduction, cunumeric::UnaryRedCode, legate::Type::Code, int=3, bool=0>>(unsigned long, unsigned long, double, Legion::SumReduction<double>, cunumeric::DeviceScalarReductionBuffer<Legion::SumReduction<double>>, cunumeric::VariantKind) [1062]
2.39093s  3.6470us              (1 1 1)         (1 1 1)        10        0B        0B         -           -           -           -  NVIDIA GeForce          1        31  void cunumeric::scalar_reduction_impl::copy_kernel<cunumeric::DeviceScalarReductionBuffer<Legion::SumReduction<double>>, Legion::ReductionAccessor<Legion::SumReduction<double>, bool=1, int=1, __int64, Realm::AffineAccessor<double, int=1, __int64>, bool=0>>(double, Legion::SumReduction<double>) [1063]

while the run with the conda packages seems to run only the first kernel and skip the rest:

 LEGATE_CONFIG="--gpus 1" nvprof --trace gpu --print-gpu-trace python3 sum.py 
==2854== NVPROF is profiling process 2854, command: python3 sum.py
(100, 100, 100)
0.0
==2854== Profiling application: python3 sum.py
==2854== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
1.99928s  51.904us             (20 1 1)      (1024 1 1)        32        0B        0B         -           -           -           -  NVIDIA GeForce          1        15  fill_affine_batch2D_64 [507]
2.03979s     672ns                    -               -         -         -         -        8B  11.353MB/s    Pageable      Device  NVIDIA GeForce          1        31  [CUDA memcpy HtoD]

manopapad (Contributor) commented

Did you also confirm that this happens on a from-source release build? Generally, I would check with the DevOps team if there's any difference between the build you're doing and the build that's being triggered in CI.

For now, can you run using the packages inside gdb (legate --gpus 1 --gdb sum.py), and see if the code which launches cunumeric::scalar_reduction_impl::scalar_unary_red_kernel (and the other kernels) is actually reached? The way I see it either the GPU task body was never entered, or the kernel launch failed and we missed it.

amberhassaan commented Oct 8, 2024

Update after some investigation:
In summary, I've found that on Pascal GPUs our conda packages are missing the kernel binaries, and it seems that we suppress the error when a kernel launch fails, thus producing the wrong output.

Details (shortened log):

$ LEGATE_SHOW_CONFIG=1 LEGATE_CONFIG="--gpus 1 --sysmem 100 --fbmem 4000 --zcmem 128" compute-sanitizer python3  ~/tmp/sum.py
...
========= Program hit cudaErrorNoKernelImageForDevice (error 209) due to "no kernel image is available for execution on the device" on CUDA API call to cudaLaunchKernel.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x4469e5]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame:cudaLaunchKernel [0x75a2d]
=========                in /home/mhassaan/miniconda3/envs/testleg-2/lib/python3.12/site-packages/legate/_lib/mapping/../../../../.././libcudart.so.12
=========     Host Frame:__device_stub__ZN9cunumeric12dense_kernelINS_8BinaryOpILNS_12BinaryOpCodeE1ELN6legate4Type4CodeE11EEEdddEEvmT_PT0_PKT1_PKT2_(unsigned long, cunumeric::BinaryOp<(cunumeric::BinaryOpCode)
1, (legate::Type::Code)11>&, double*, double const*, double const*) [0x1bd79e4]
=========                in /home/mhassaan/miniconda3/envs/testleg-2/lib/libcunumeric.so
=========     Host Frame:void cunumeric::BinaryOpImpl<(cunumeric::VariantKind)2, (cunumeric::BinaryOpCode)1>::operator()<(legate::Type::Code)11, 3, (void*)0>(cunumeric::BinaryOpArgs&) const [clone .isra.0] [0x1
dad3ea]
=========                in /home/mhassaan/miniconda3/envs/testleg-2/lib/libcunumeric.so
=========     Host Frame:cunumeric::BinaryOpTask::gpu_variant(legate::TaskContext) [0x1c36866]
=========                in /home/mhassaan/miniconda3/envs/testleg-2/lib/libcunumeric.so
...
========= Program hit cudaErrorNoKernelImageForDevice (error 209) due to "no kernel image is available for execution on the device" on CUDA API call to cudaLaunchKernel.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x4469e5]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame:cudaLaunchKernel [0x75a2d]
=========                in /home/mhassaan/miniconda3/envs/testleg-2/lib/python3.12/site-packages/legate/_lib/mapping/../../../../.././libcudart.so.12
=========     Host Frame:__device_stub__ZN9cunumeric21scalar_reduction_impl23scalar_unary_red_kernelINS_27DeviceScalarReductionBufferIN6Legion12SumReductionIdEEEENS_14ScalarUnaryRedILNS_11VariantKindE2ELNS_12Un
aryRedCodeE16ELN6legate4Type4CodeE11ELi3ELb0EEEdNSD_15SparseReductionEEEvmmT_T0_T1_T2_(unsigned long, unsigned long, cunumeric::DeviceScalarReductionBuffer<Legion::SumReduction<double> >&, cunumeric::ScalarUnar
yRed<(cunumeric::VariantKind)2, (cunumeric::UnaryRedCode)16, (legate::Type::Code)11, 3, false>&, double, cunumeric::ScalarUnaryRed<(cunumeric::VariantKind)2, (cunumeric::UnaryRedCode)16, (legate::Type::Code)11,
 3, false>::SparseReduction&) [0x1f00346]
=========                in /home/mhassaan/miniconda3/envs/testleg-2/lib/libcunumeric.so
=========     Host Frame:void cunumeric::ScalarReductionPolicy<(cunumeric::VariantKind)2, Legion::SumReduction<double>, cunumeric::ScalarUnaryRed<(cunumeric::VariantKind)2, (cunumeric::UnaryRedCode)16, (legate:
:Type::Code)11, 3, false>::SparseReduction>::operator()<Legion::ReductionAccessor<Legion::SumReduction<double>, true, 1, long long, Realm::AffineAccessor<double, 1, long long>, false> const, double, cunumeric::
ScalarUnaryRed<(cunumeric::VariantKind)2, (cunumeric::UnaryRedCode)16, (legate::Type::Code)11, 3, false> const&>(unsigned long, Legion::ReductionAccessor<Legion::SumReduction<double>, true, 1, long long, Realm:
:AffineAccessor<double, 1, long long>, false> const&, double const&, cunumeric::ScalarUnaryRed<(cunumeric::VariantKind)2, (cunumeric::UnaryRedCode)16, (legate::Type::Code)11, 3, false> const&) [clone .isra.0] [
0x20a6166]
=========                in /home/mhassaan/miniconda3/envs/testleg-2/lib/libcunumeric.so
=========     Host Frame:decltype(auto) legate::detail::InnerTypeDispatchFn<3>::operator()<cunumeric::ScalarUnaryRedImpl<(cunumeric::VariantKind)2, (cunumeric::UnaryRedCode)16, false>, cunumeric::ScalarUnaryRed
Args&>(legate::Type::Code, cunumeric::ScalarUnaryRedImpl<(cunumeric::VariantKind)2, (cunumeric::UnaryRedCode)16, false>, cunumeric::ScalarUnaryRedArgs&) [clone .isra.0] [0x210af0d]
=========                in /home/mhassaan/miniconda3/envs/testleg-2/lib/libcunumeric.so
=========     Host Frame:void cunumeric::scalar_unary_red_template<(cunumeric::VariantKind)2>(legate::TaskContext&) [0x1f0dd21]
=========                in /home/mhassaan/miniconda3/envs/testleg-2/lib/libcunumeric.so
=========     Host Frame:cunumeric::ScalarUnaryRedTask::gpu_variant(legate::TaskContext) [0x1f0df12]
=========                in /home/mhassaan/miniconda3/envs/testleg-2/lib/libcunumeric.so
=========     Host Frame:legate::detail::legion_task_body(void (*)(legate::TaskContext), legate::VariantCode, std::optional<std::basic_string_view<char, std::char_traits<char> > >, void const*, unsigned long, R
ealm::Processor) [0x2b75b8]
=========                in /home/mhassaan/miniconda3/envs/testleg-2/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../liblegate.so.24.09.00
=========     Host Frame:void legate::LegateTask<cunumeric::ScalarUnaryRedTask>::task_wrapper_<&cunumeric::ScalarUnaryRedTask::gpu_variant, (legate::VariantCode)2>(void const*, unsigned long, void const*, unsig
ned long, Realm::Processor) [0x56c0a6]
=========                in /home/mhassaan/miniconda3/envs/testleg-2/lib/libcunumeric.so
...
========= Program hit cudaErrorNoKernelImageForDevice (error 209) due to "no kernel image is available for execution on the device" on CUDA API call to cudaLaunchKernel.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x4469e5]
=========                in /usr/lib/libcuda.so.1
=========     Host Frame:cudaLaunchKernel [0x75a2d]
=========                in /home/mhassaan/miniconda3/envs/testleg-2/lib/python3.12/site-packages/legate/_lib/mapping/../../../../.././libcudart.so.12
=========     Host Frame:__device_stub__ZN9cunumeric21scalar_reduction_impl11copy_kernelINS_27DeviceScalarReductionBufferIN6Legion12SumReductionIdEEEENS3_17ReductionAccessorIS5_Lb1ELi1ExN5Realm14AffineAccessorIdLi1ExEELb0EEEEEvT_T0_(cunumeric::DeviceScalarReductionBuffer<Legion::SumReduction<double> >&, Legion::ReductionAccessor<Legion::SumReduction<double>, true, 1, long long, Realm::AffineAccessor<double, 1, long long>, false>&) [0x1ef2898]
=========                in /home/mhassaan/miniconda3/envs/testleg-2/lib/libcunumeric.so
=========     Host Frame:void cunumeric::ScalarReductionPolicy<(cunumeric::VariantKind)2, Legion::SumReduction<double>, cunumeric::ScalarUnaryRed<(cunumeric::VariantKind)2, (cunumeric::UnaryRedCode)16, (legate::Type::Code)11, 3, false>::SparseReduction>::operator()<Legion::ReductionAccessor<Legion::SumReduction<double>, true, 1, long long, Realm::AffineAccessor<double, 1, long long>, false> const, double, cunumeric::ScalarUnaryRed<(cunumeric::VariantKind)2, (cunumeric::UnaryRedCode)16, (legate::Type::Code)11, 3, false> const&>(unsigned long, Legion::ReductionAccessor<Legion::SumReduction<double>, true, 1, long long, Realm::AffineAccessor<double, 1, long long>, false> const&, double const&, cunumeric::ScalarUnaryRed<(cunumeric::VariantKind)2, (cunumeric::UnaryRedCode)16, (legate::Type::Code)11, 3, false> const&) [clone .isra.0] [0x20a5eab]
=========                in /home/mhassaan/miniconda3/envs/testleg-2/lib/libcunumeric.so
=========     Host Frame:decltype(auto) legate::detail::InnerTypeDispatchFn<3>::operator()<cunumeric::ScalarUnaryRedImpl<(cunumeric::VariantKind)2, (cunumeric::UnaryRedCode)16, false>, cunumeric::ScalarUnaryRedArgs&>(legate::Type::Code, cunumeric::ScalarUnaryRedImpl<(cunumeric::VariantKind)2, (cunumeric::UnaryRedCode)16, false>, cunumeric::ScalarUnaryRedArgs&) [clone .isra.0] [0x210af0d]
=========                in /home/mhassaan/miniconda3/envs/testleg-2/lib/libcunumeric.so
=========     Host Frame:void cunumeric::scalar_unary_red_template<(cunumeric::VariantKind)2>(legate::TaskContext&) [0x1f0dd21]
=========                in /home/mhassaan/miniconda3/envs/testleg-2/lib/libcunumeric.so
=========     Host Frame:cunumeric::ScalarUnaryRedTask::gpu_variant(legate::TaskContext) [0x1f0df12]
=========                in /home/mhassaan/miniconda3/envs/testleg-2/lib/libcunumeric.so
=========     Host Frame:legate::detail::legion_task_body(void (*)(legate::TaskContext), legate::VariantCode, std::optional<std::basic_string_view<char, std::char_traits<char> > >, void const*, unsigned long, Realm::Processor) [0x2b75b8]
=========                in /home/mhassaan/miniconda3/envs/testleg-2/lib/python3.12/site-packages/legate/_lib/mapping/../../../../../liblegate.so.24.09.00
=========     Host Frame:void legate::LegateTask<cunumeric::ScalarUnaryRedTask>::task_wrapper_<&cunumeric::ScalarUnaryRedTask::gpu_variant, (legate::VariantCode)2>(void const*, unsigned long, void const*, unsigned long, Realm::Processor) [0x56c0a6]
=========                in /home/mhassaan/miniconda3/envs/testleg-2/lib/libcunumeric.so
...
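
(For context on the "suppressed error" part: the check that would surface this is a call to cudaGetLastError / cudaPeekAtLastError right after each kernel launch in the C++ code. Below is only a rough Python illustration of the runtime API involved, using ctypes; the library name/path is an assumption and may differ per environment.)

import ctypes

# Assumption: libcudart.so.12 is resolvable by the loader (it ships with the conda packages).
cudart = ctypes.CDLL("libcudart.so.12")
cudart.cudaGetErrorString.restype = ctypes.c_char_p

# cudaGetLastError() returns (and clears) the error from the most recent launch;
# checking it after cudaLaunchKernel is what turns a silent no-op into a loud failure.
err = cudart.cudaGetLastError()
print("last CUDA error:", err, cudart.cudaGetErrorString(err).decode())

# Error 209 is the code compute-sanitizer reports above:
print("209 =", cudart.cudaGetErrorString(209).decode())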

amberhassaan commented

I've confirmed that we weren't compiling cunumeric kernels for the Pascal architecture when producing the conda packages. We will fix this soon so that the conda packages contain kernel binaries for Pascal. In the meantime, a build from source should include Pascal support and fix the problem. I have tested source builds successfully on my end.
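
(A quick way to check which SASS architectures a given library was built for; a sketch that shells out to cuobjdump from the CUDA toolkit, with the library path as a placeholder you would adjust to your environment.)

import re
import subprocess

lib = "/path/to/env/lib/libcunumeric.so"  # placeholder path to the packaged library

# cuobjdump --list-elf prints one line per embedded cubin, with the sm_XX arch in the name.
out = subprocess.run(["cuobjdump", "--list-elf", lib],
                     capture_output=True, text=True, check=True).stdout
print("embedded SASS architectures:", sorted(set(re.findall(r"sm_\d+", out))))
# Pascal support requires sm_60 (P100) / sm_61 (GTX 10xx) to appear in this list.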
