Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update the scan implementation to follow P0571's guidance. #52

Closed
wants to merge 1 commit into from
Closed

Update the scan implementation to follow P0571's guidance. #52

wants to merge 1 commit into from

Conversation

alliepiper
Copy link
Collaborator

Found this while writing tests for the thrust async algorithms.

@leofang
Copy link

leofang commented Aug 18, 2020

I applied this patch to test building CuPy, and got a lot of errors; here's a clip, let me know if you need a full log.

We're interested in this fix because we cannot make double complex work with CUB's scan (but single complex works fine, see cupy/cupy#2919 (comment)). I can open an issue in NVlabs/CUB if you prefer, but it seems abandoned to me...

    building 'cupy.cuda.cub' extension
    gcc -pthread -B /home/leofang/miniconda3/envs/cupy_dev/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -D_FORCE_INLINES=1 -DCUPY_CUB_VERSION_CODE=101000 -I/home/leofang/cupy/install/../cupy/core/include/cupy/cub -I/home/leofang/cupy/install/../cupy/core/include -I/usr/local/cuda/include -I/home/leofang/miniconda3/envs/cupy_dev/include/python3.7m -c cupy/cuda/cub.cpp -o build/temp.linux-x86_64-3.7/cupy/cuda/cub.o -fopenmp
    cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
    NVCC options: ['--generate-code=arch=compute_75,code=sm_75', '-O2', '--compiler-options="-fPIC"', '--std=c++11']
    /usr/local/cuda/bin/nvcc -D_FORCE_INLINES=1 -DCUPY_CUB_VERSION_CODE=101000 -I/home/leofang/cupy/install/../cupy/core/include/cupy/cub -I/home/leofang/cupy/install/../cupy/core/include -I/usr/local/cuda/include -I/home/leofang/miniconda3/envs/cupy_dev/include/python3.7m -c cupy/cuda/cupy_cub.cu -o build/temp.linux-x86_64-3.7/cupy/cuda/cupy_cub.o --generate-code=arch=compute_75,code=sm_75 -O2 --compiler-options="-fPIC" --std=c++11
    In file included from /home/leofang/cupy/install/../cupy/core/include/cupy/cub/cub/device/../iterator/../util_arch.cuh:36:0,
                     from /home/leofang/cupy/install/../cupy/core/include/cupy/cub/cub/device/../iterator/../config.cuh:35,
                     from /home/leofang/cupy/install/../cupy/core/include/cupy/cub/cub/device/../iterator/arg_index_input_iterator.cuh:39,
                     from /home/leofang/cupy/install/../cupy/core/include/cupy/cub/cub/device/device_reduce.cuh:41,
                     from cupy/cuda/cupy_cub.cu:3:
    /home/leofang/cupy/install/../cupy/core/include/cupy/cub/cub/device/../iterator/../util_cpp_dialect.cuh:129:13: warning: CUB requires C++14. Please pass -std=c++14 to your compiler. Define CUB_IGNORE_DEPRECATED_CPP_DIALECT to suppress this message.
       CUB_COMPILER_DEPRECATION(C++14, pass -std=c++14 to your compiler);
                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    /home/leofang/cupy/install/../cupy/core/include/cupy/cub/cub/device/dispatch/../../agent/../block/block_load.cuh(404): error: no suitable constructor exists to convert from "char" to "cub::NullType"
              detected during:
                instantiation of "void cub::LoadDirectWarpStriped(int, InputIteratorT, InputT (&)[ITEMS_PER_THREAD], int) [with InputT=cub::NullType, ITEMS_PER_THREAD=18, InputIteratorT=cub::CacheModifiedInputIterator<cub::LOAD_DEFAULT, char, int>]"
    (912): here
                instantiation of "void cub::BlockLoad<InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH>::LoadInternal<cub::BLOCK_LOAD_WARP_TRANSPOSE, DUMMY>::Load(InputIteratorT, InputT (&)[ITEMS_PER_THREAD], int) [with InputT=cub::NullType, BLOCK_DIM_X=64, ITEMS_PER_THREAD=18, ALGORITHM=cub::BLOCK_LOAD_WARP_TRANSPOSE, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, PTX_ARCH=750, DUMMY=0, InputIteratorT=cub::CacheModifiedInputIterator<cub::LOAD_DEFAULT, char, int>]"
    (1169): here
                instantiation of "void cub::BlockLoad<InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH>::Load(InputIteratorT, InputT (&)[ITEMS_PER_THREAD], int) [with InputT=cub::NullType, BLOCK_DIM_X=64, ITEMS_PER_THREAD=18, ALGORITHM=cub::BLOCK_LOAD_WARP_TRANSPOSE, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, PTX_ARCH=750, InputIteratorT=cub::CacheModifiedInputIterator<cub::LOAD_DEFAULT, char, int>]"
    /home/leofang/cupy/install/../cupy/core/include/cupy/cub/cub/device/dispatch/../../agent/agent_scan.cuh(295): here
                instantiation of "void cub::AgentScan<AgentScanPolicyT, InputIteratorT, OutputIteratorT, ScanOpT, InitValueT, OffsetT>::ConsumeTile<IS_LAST_TILE>(OffsetT, int, OffsetT, cub::AgentScan<AgentScanPolicyT, InputIteratorT, OutputIteratorT, ScanOpT, InitValueT, OffsetT>::ScanTileStateT &) [with AgentScanPolicyT=cub::AgentScanPolicy<64, 9, cub::NullType, cub::BLOCK_LOAD_WARP_TRANSPOSE, cub::LOAD_DEFAULT, cub::BLOCK_STORE_WARP_TRANSPOSE, cub::BLOCK_SCAN_WARP_SCANS, cub::MemBoundScaling<64, 9, cub::NullType>>, InputIteratorT=char *, OutputIteratorT=char *, ScanOpT=cub::Sum, InitValueT=cub::NullType, OffsetT=int, IS_LAST_TILE=false]"
    /home/leofang/cupy/install/../cupy/core/include/cupy/cub/cub/device/dispatch/../../agent/agent_scan.cuh(343): here
                instantiation of "void cub::AgentScan<AgentScanPolicyT, InputIteratorT, OutputIteratorT, ScanOpT, InitValueT, OffsetT>::ConsumeRange(int, cub::AgentScan<AgentScanPolicyT, InputIteratorT, OutputIteratorT, ScanOpT, InitValueT, OffsetT>::ScanTileStateT &, int) [with AgentScanPolicyT=cub::AgentScanPolicy<64, 9, cub::NullType, cub::BLOCK_LOAD_WARP_TRANSPOSE, cub::LOAD_DEFAULT, cub::BLOCK_STORE_WARP_TRANSPOSE, cub::BLOCK_SCAN_WARP_SCANS, cub::MemBoundScaling<64, 9, cub::NullType>>, InputIteratorT=char *, OutputIteratorT=char *, ScanOpT=cub::Sum, InitValueT=cub::NullType, OffsetT=int]"
    /home/leofang/cupy/install/../cupy/core/include/cupy/cub/cub/device/dispatch/dispatch_scan.cuh(130): here
                [ 8 instantiation contexts not shown ]
                instantiation of "cudaError_t cub::ChainedPolicy<PTX_VERSION, PolicyT, PrevPolicyT>::Invoke(int, FunctorT &) [with PTX_VERSION=600, PolicyT=cub::DeviceScanPolicy<cub::NullType>::Policy600, PrevPolicyT=cub::DeviceScanPolicy<cub::NullType>::Policy520, FunctorT=cub::DispatchScan<char *, char *, cub::Sum, cub::NullType, int, cub::DeviceScanPolicy<cub::NullType>>]"
    /home/leofang/cupy/install/../cupy/core/include/cupy/cub/cub/device/dispatch/dispatch_scan.cuh(480): here
                instantiation of "cudaError_t cub::DispatchScan<InputIteratorT, OutputIteratorT, ScanOpT, InitValueT, OffsetT, SelectedPolicy>::Dispatch(void *, size_t &, InputIteratorT, OutputIteratorT, ScanOpT, InitValueT, OffsetT, cudaStream_t, __nv_bool) [with InputIteratorT=char *, OutputIteratorT=char *, ScanOpT=cub::Sum, InitValueT=cub::NullType, OffsetT=int, SelectedPolicy=cub::DeviceScanPolicy<cub::NullType>]"
    /home/leofang/cupy/install/../cupy/core/include/cupy/cub/cub/device/device_scan.cuh(334): here
                instantiation of "cudaError_t cub::DeviceScan::InclusiveSum(void *, size_t &, InputIteratorT, OutputIteratorT, int, cudaStream_t, __nv_bool) [with InputIteratorT=char *, OutputIteratorT=char *]"
    cupy/cuda/cupy_cub.cu(564): here
                instantiation of "void _cub_inclusive_sum::operator()<T>(void *, size_t &, void *, void *, int, cudaStream_t) [with T=char]"
    /home/leofang/cupy/install/../cupy/core/include/cupy/type_dispatcher.cuh(38): here
                instantiation of "void dtype_dispatcher(int, functor_t, Ts &&...) [with functor_t=_cub_inclusive_sum, Ts=<void *&, size_t &, void *&, void *&, int &, cudaStream_t &>]"
    cupy/cuda/cupy_cub.cu(721): here

    /home/leofang/cupy/install/../cupy/core/include/cupy/cub/cub/device/dispatch/../../agent/../block/block_load.cuh(367): error: no suitable constructor exists to convert from "char" to "cub::NullType"
              detected during:
                instantiation of "void cub::LoadDirectWarpStriped(int, InputIteratorT, InputT (&)[ITEMS_PER_THREAD]) [with InputT=cub::NullType, ITEMS_PER_THREAD=18, InputIteratorT=cub::CacheModifiedInputIterator<cub::LOAD_DEFAULT, char, int>]"
    (901): here
                instantiation of "void cub::BlockLoad<InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH>::LoadInternal<cub::BLOCK_LOAD_WARP_TRANSPOSE, DUMMY>::Load(InputIteratorT, InputT (&)[ITEMS_PER_THREAD]) [with InputT=cub::NullType, BLOCK_DIM_X=64, ITEMS_PER_THREAD=18, ALGORITHM=cub::BLOCK_LOAD_WARP_TRANSPOSE, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, PTX_ARCH=750, DUMMY=0, InputIteratorT=cub::CacheModifiedInputIterator<cub::LOAD_DEFAULT, char, int>]"
    (1122): here
                instantiation of "void cub::BlockLoad<InputT, BLOCK_DIM_X, ITEMS_PER_THREAD, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH>::Load(InputIteratorT, InputT (&)[ITEMS_PER_THREAD]) [with InputT=cub::NullType, BLOCK_DIM_X=64, ITEMS_PER_THREAD=18, ALGORITHM=cub::BLOCK_LOAD_WARP_TRANSPOSE, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, PTX_ARCH=750, InputIteratorT=cub::CacheModifiedInputIterator<cub::LOAD_DEFAULT, char, int>]"
    /home/leofang/cupy/install/../cupy/core/include/cupy/cub/cub/device/dispatch/../../agent/agent_scan.cuh(297): here
                instantiation of "void cub::AgentScan<AgentScanPolicyT, InputIteratorT, OutputIteratorT, ScanOpT, InitValueT, OffsetT>::ConsumeTile<IS_LAST_TILE>(OffsetT, int, OffsetT, cub::AgentScan<AgentScanPolicyT, InputIteratorT, OutputIteratorT, ScanOpT, InitValueT, OffsetT>::ScanTileStateT &) [with AgentScanPolicyT=cub::AgentScanPolicy<64, 9, cub::NullType, cub::BLOCK_LOAD_WARP_TRANSPOSE, cub::LOAD_DEFAULT, cub::BLOCK_STORE_WARP_TRANSPOSE, cub::BLOCK_SCAN_WARP_SCANS, cub::MemBoundScaling<64, 9, cub::NullType>>, InputIteratorT=char *, OutputIteratorT=char *, ScanOpT=cub::Sum, InitValueT=cub::NullType, OffsetT=int, IS_LAST_TILE=false]"
    /home/leofang/cupy/install/../cupy/core/include/cupy/cub/cub/device/dispatch/../../agent/agent_scan.cuh(343): here
                instantiation of "void cub::AgentScan<AgentScanPolicyT, InputIteratorT, OutputIteratorT, ScanOpT, InitValueT, OffsetT>::ConsumeRange(int, cub::AgentScan<AgentScanPolicyT, InputIteratorT, OutputIteratorT, ScanOpT, InitValueT, OffsetT>::ScanTileStateT &, int) [with AgentScanPolicyT=cub::AgentScanPolicy<64, 9, cub::NullType, cub::BLOCK_LOAD_WARP_TRANSPOSE, cub::LOAD_DEFAULT, cub::BLOCK_STORE_WARP_TRANSPOSE, cub::BLOCK_SCAN_WARP_SCANS, cub::MemBoundScaling<64, 9, cub::NullType>>, InputIteratorT=char *, OutputIteratorT=char *, ScanOpT=cub::Sum, InitValueT=cub::NullType, OffsetT=int]"
    /home/leofang/cupy/install/../cupy/core/include/cupy/cub/cub/device/dispatch/dispatch_scan.cuh(130): here
                [ 8 instantiation contexts not shown ]
                instantiation of "cudaError_t cub::ChainedPolicy<PTX_VERSION, PolicyT, PrevPolicyT>::Invoke(int, FunctorT &) [with PTX_VERSION=600, PolicyT=cub::DeviceScanPolicy<cub::NullType>::Policy600, PrevPolicyT=cub::DeviceScanPolicy<cub::NullType>::Policy520, FunctorT=cub::DispatchScan<char *, char *, cub::Sum, cub::NullType, int, cub::DeviceScanPolicy<cub::NullType>>]"
    /home/leofang/cupy/install/../cupy/core/include/cupy/cub/cub/device/dispatch/dispatch_scan.cuh(480): here
                instantiation of "cudaError_t cub::DispatchScan<InputIteratorT, OutputIteratorT, ScanOpT, InitValueT, OffsetT, SelectedPolicy>::Dispatch(void *, size_t &, InputIteratorT, OutputIteratorT, ScanOpT, InitValueT, OffsetT, cudaStream_t, __nv_bool) [with InputIteratorT=char *, OutputIteratorT=char *, ScanOpT=cub::Sum, InitValueT=cub::NullType, OffsetT=int, SelectedPolicy=cub::DeviceScanPolicy<cub::NullType>]"
    /home/leofang/cupy/install/../cupy/core/include/cupy/cub/cub/device/device_scan.cuh(334): here
                instantiation of "cudaError_t cub::DeviceScan::InclusiveSum(void *, size_t &, InputIteratorT, OutputIteratorT, int, cudaStream_t, __nv_bool) [with InputIteratorT=char *, OutputIteratorT=char *]"
    cupy/cuda/cupy_cub.cu(564): here
                instantiation of "void _cub_inclusive_sum::operator()<T>(void *, size_t &, void *, void *, int, cudaStream_t) [with T=char]"
    /home/leofang/cupy/install/../cupy/core/include/cupy/type_dispatcher.cuh(38): here
                instantiation of "void dtype_dispatcher(int, functor_t, Ts &&...) [with functor_t=_cub_inclusive_sum, Ts=<void *&, size_t &, void *&, void *&, int &, cudaStream_t &>]"
    cupy/cuda/cupy_cub.cu(721): here

    /home/leofang/cupy/install/../cupy/core/include/cupy/cub/cub/device/dispatch/../../agent/../block/specializations/../../warp/specializations/../../thread/thread_operators.cuh(114): error: no operator "+" matches these operands
                operand types are: const cub::NullType + const cub::NullType
              detected during:
                instantiation of "T cub::Sum::operator()(const T &, const T &) const [with T=cub::NullType]"
    /home/leofang/cupy/install/../cupy/core/include/cupy/cub/cub/device/dispatch/../../agent/../block/specializations/../../warp/specializations/warp_scan_shfl.cuh(408): here
                instantiation of "_T cub::WarpScanShfl<T, LOGICAL_WARP_THREADS, PTX_ARCH>::InclusiveScanStep(_T, ScanOpT, int, int) [with T=cub::NullType, LOGICAL_WARP_THREADS=32, PTX_ARCH=750, _T=cub::NullType, ScanOpT=cub::Sum]"
    /home/leofang/cupy/install/../cupy/core/include/cupy/cub/cub/device/dispatch/../../agent/../block/specializations/../../warp/specializations/warp_scan_shfl.cuh(438): here
                instantiation of "_T cub::WarpScanShfl<T, LOGICAL_WARP_THREADS, PTX_ARCH>::InclusiveScanStep(_T, ScanOpT, int, int, cub::Int2Type<0>) [with T=cub::NullType, LOGICAL_WARP_THREADS=32, PTX_ARCH=750, _T=cub::NullType, ScanOpT=cub::Sum]"
    /home/leofang/cupy/install/../cupy/core/include/cupy/cub/cub/device/dispatch/../../agent/../block/specializations/../../warp/specializations/warp_scan_shfl.cuh(484): here
                instantiation of "void cub::WarpScanShfl<T, LOGICAL_WARP_THREADS, PTX_ARCH>::InclusiveScan(_T, _T &, ScanOpT) [with T=cub::NullType, LOGICAL_WARP_THREADS=32, PTX_ARCH=750, _T=cub::NullType, ScanOpT=cub::Sum]"
    /home/leofang/cupy/install/../cupy/core/include/cupy/cub/cub/device/dispatch/../../agent/../block/specializations/../../warp/warp_scan.cuh(451): here
                instantiation of "void cub::WarpScan<T, LOGICAL_WARP_THREADS, PTX_ARCH>::InclusiveScan(T, T &, ScanOp) [with T=cub::NullType, LOGICAL_WARP_THREADS=32, PTX_ARCH=750, ScanOp=cub::Sum]"
    /home/leofang/cupy/install/../cupy/core/include/cupy/cub/cub/device/dispatch/../../agent/../block/specializations/block_scan_warp_scans.cuh(341): here
                [ 14 instantiation contexts not shown ]
                instantiation of "cudaError_t cub::ChainedPolicy<PTX_VERSION, PolicyT, PrevPolicyT>::Invoke(int, FunctorT &) [with PTX_VERSION=600, PolicyT=cub::DeviceScanPolicy<cub::NullType>::Policy600, PrevPolicyT=cub::DeviceScanPolicy<cub::NullType>::Policy520, FunctorT=cub::DispatchScan<char *, char *, cub::Sum, cub::NullType, int, cub::DeviceScanPolicy<cub::NullType>>]"
    /home/leofang/cupy/install/../cupy/core/include/cupy/cub/cub/device/dispatch/dispatch_scan.cuh(480): here
                instantiation of "cudaError_t cub::DispatchScan<InputIteratorT, OutputIteratorT, ScanOpT, InitValueT, OffsetT, SelectedPolicy>::Dispatch(void *, size_t &, InputIteratorT, OutputIteratorT, ScanOpT, InitValueT, OffsetT, cudaStream_t, __nv_bool) [with InputIteratorT=char *, OutputIteratorT=char *, ScanOpT=cub::Sum, InitValueT=cub::NullType, OffsetT=int, SelectedPolicy=cub::DeviceScanPolicy<cub::NullType>]"
    /home/leofang/cupy/install/../cupy/core/include/cupy/cub/cub/device/device_scan.cuh(334): here
                instantiation of "cudaError_t cub::DeviceScan::InclusiveSum(void *, size_t &, InputIteratorT, OutputIteratorT, int, cudaStream_t, __nv_bool) [with InputIteratorT=char *, OutputIteratorT=char *]"
    cupy/cuda/cupy_cub.cu(564): here
                instantiation of "void _cub_inclusive_sum::operator()<T>(void *, size_t &, void *, void *, int, cudaStream_t) [with T=char]"
    /home/leofang/cupy/install/../cupy/core/include/cupy/type_dispatcher.cuh(38): here
                instantiation of "void dtype_dispatcher(int, functor_t, Ts &&...) [with functor_t=_cub_inclusive_sum, Ts=<void *&, size_t &, void *&, void *&, int &, cudaStream_t &>]"
    cupy/cuda/cupy_cub.cu(721): here

    /home/leofang/cupy/install/../cupy/core/include/cupy/cub/cub/device/dispatch/../../agent/../block/block_store.cuh(334): error: no suitable conversion function from "cub::NullType" to "char" exists
              detected during:

@brycelelbach
Copy link
Owner

brycelelbach commented Aug 18, 2020 via email

@leofang
Copy link

leofang commented Aug 18, 2020

Just tried 1. C++14 + patch, 2. C++11 + branch, and 3. C++14 + branch, and all of them got most of the same errors. Just FYI, I applied the patch on to the latest master, so it should be a legit patch.

One of CuPy's CI is set up with Thrust + CUB from CUDA 11 and built with C++11, and so far everything works just fine, and we prefer to not bump to C++14 for as long as possible, as from the Python side the dependencies are more convolved and the compiler is only one of the the things we need to worry about.

@brycelelbach
Copy link
Owner

brycelelbach commented Aug 18, 2020 via email

@leofang
Copy link

leofang commented Aug 18, 2020

My first message was from a C++11 build (see the --std flag). Would you like me to upload a full log?

@brycelelbach
Copy link
Owner

brycelelbach commented Aug 18, 2020 via email

@alliepiper
Copy link
Collaborator Author

@leofang See if that last push fixes it. Relevant bits:

diff --git a/cub/device/device_scan.cuh b/cub/device/device_scan.cuh
index e0a8e3a4..ab5595fb 100644
--- a/cub/device/device_scan.cuh
+++ b/cub/device/device_scan.cuh
@@ -331,13 +331,16 @@ struct DeviceScan
         // Signed integer type for global offsets
         typedef int OffsetT;

-        return DispatchScan<InputIteratorT, OutputIteratorT, Sum, NullType, OffsetT>::Dispatch(
+        // Use the input iterator's value type per P0571
+        typedef typename std::iterator_traits<InputIteratorT>::value_type InitValueT;
+
+        return DispatchScan<InputIteratorT, OutputIteratorT, Sum, InitValueT, OffsetT>::Dispatch(
             d_temp_storage,
             temp_storage_bytes,
             d_in,
             d_out,
             Sum(),
-            NullType(),
+            InitValueT{},
             num_items,
             stream,
             debug_synchronous);
@@ -416,13 +419,16 @@ struct DeviceScan
         // Signed integer type for global offsets
         typedef int OffsetT;

-        return DispatchScan<InputIteratorT, OutputIteratorT, ScanOpT, NullType, OffsetT>::Dispatch(
+        // Use the input iterator's value type per P0571
+        typedef typename std::iterator_traits<InputIteratorT>::value_type InitValueT;
+
+        return DispatchScan<InputIteratorT, OutputIteratorT, ScanOpT, InitValueT, OffsetT>::Dispatch(
             d_temp_storage,
             temp_storage_bytes,
             d_in,
             d_out,
             scan_op,
-            NullType(),
+            InitValueT{},
             num_items,
             stream,
             debug_synchronous);

@leofang
Copy link

leofang commented Aug 18, 2020

Thank you for quick reply, @allisonvacanti! 5b965c6 builds with both C++11 and C++14, but I see several errors in cumsum/cumprod when running CuPy's test suite locally. I am trying to determine if it's due to this PR or something earlier. The search range shouldn't be too large because our CI runs on CUDA 11 Update 1 and it works fine.

By the way, which version of CUB is included in CUDA 11 Update 1? 1.9.10-1?

@alliepiper
Copy link
Collaborator Author

alliepiper commented Aug 18, 2020

Yes, 11.0u1 uses CUB 1.9.10-1. But be aware that main and 1.9.10-1 differ significantly -- main has dropped support for older C++ (C++03 and the GCC 4.8/MSVC2015 partial implementations of C++11 are no longer supported), while 1.9.10-1 hasn't seen most of those modernizations. Since you're compiling in the deprecated C++11 mode, you may see issues from this.

This patch may also slightly change the results of scans where the input/output types do not match, so that may also be causing some tests to fail at runtime.

@leofang
Copy link

leofang commented Aug 18, 2020

Yes, 11.0u1 uses CUB 1.9.10-1.

I've traced from 1.9.9 and up to 1.9.10-1 now and haven't seen errors. Testing the master branch now...

But be aware that main and 1.9.10-1 differ significantly -- main has dropped support for older C++ (C++03 and the GCC 4.8/MSVC2015 partial implementations of C++11 are no longer supported), while 1.9.10-1 hasn't seen most of those modernizations. Since you're compiling in the deprecated C++11 mode, you may see issues from this.

The errors I got appeared for both C++11 and C++14 mode, so definitely something else. One of the errors I got was that the output array from cumsum/cumprod is all 0 for float64 --- any chance this rings a bell to you?

This patch may also slightly change the results of scans where the input/output types do not match, so that may also be causing some tests to fail at runtime.

Thanks! I'm looking into this.

@leofang
Copy link

leofang commented Aug 18, 2020

I've tested the master branch and it works fine too, so looks like this PR caused the errors that I saw. We're actually in a GPU Hackathon now, so I'll continue looking into the root cause later today, but from a quick glance I don't spot any issue 🧐 (IIRC, in CuPy we set the output type of cumsum to be the same as the input's, so the type matching thing shouldn't be a problem.)

@leofang
Copy link

leofang commented Aug 18, 2020

btw, another class of errors I see is that all elements are shifted by 1, but I don't understand what could lead to this from the code change:

>>> import cupy as cp
>>> a = cp.arange(10, dtype=cp.float64)
>>> a.cumsum()  # CuPy with this PR
array([ 0.,  0.,  1.,  3.,  6., 10., 15., 21., 28., 36.])
>>> 
>>> import numpy as np
>>> a = np.arange(10, dtype=np.float64)
>>> a.cumsum()  # NumPy, or CuPy before this PR
array([ 0.,  1.,  3.,  6., 10., 15., 21., 28., 36., 45.])

@leofang
Copy link

leofang commented Aug 19, 2020

@allisonvacanti I think I got the bug: if NullType is removed, this line will never be true, causing inclusive scans (as needed for CuPy's cumsum/cumprod) to not work properly.
https://github.com/thrust/cub/blob/5b965c66e8fce30cfb947c585b37823f17064a05/cub/agent/agent_scan.cuh#L121

@alliepiper
Copy link
Collaborator Author

@leofang Ah, good catch. I'll be sure to fix that before this goes in.

@brycelelbach brycelelbach added this to the 1.10.0 milestone Sep 2, 2020
@brycelelbach brycelelbach modified the milestones: 1.10.0, 1.11.0 Sep 16, 2020
@alliepiper alliepiper added the multiconfig: passed PR builds and tests pass locally label Sep 23, 2020
@alliepiper
Copy link
Collaborator Author

Submitted to DVS 29098336.

@leofang This should address the issues you discovered above. Feel free to retry it, but be aware that this PR will migrate to the NVIDIA/cub repo soon.

@alliepiper
Copy link
Collaborator Author

Moved to NVIDIA#201.

@alliepiper alliepiper closed this Sep 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
multiconfig: passed PR builds and tests pass locally
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants