
Support CUB prefix sum & product #2919
Merged (10 commits, Feb 6, 2020)
Conversation

@leofang (Member) commented Jan 7, 2020

In the demo below, CUB gets about a 4x speed-up on a P100:

>>> import cupy as cp
>>> import cupyx
>>> 
>>> a = cp.random.random(1000000)
>>>
>>> cp.cuda.cub_enabled = False
>>> print(cupyx.time.repeat(cp.cumprod, (a, )))
cumprod             :   428.775 us   +/-57.870 (min:  396.988 / max:  976.234) us    471.108 us   +/-45.524 (min:  449.888 / max:  992.736) us
>>> cp.cuda.cub_enabled = True
>>> print(cupyx.time.repeat(cp.cumprod, (a, )))
cumprod             :   103.053 us   +/-19.460 (min:   84.242 / max:  229.079) us    132.358 us   +/-17.695 (min:  113.888 / max:  250.048) us
>>>
>>> cp.cuda.cub_enabled = False
>>> print(cupyx.time.repeat(cp.cumsum, (a, )))
cumsum              :   428.253 us   +/-55.802 (min:  398.817 / max:  978.976) us    470.163 us   +/-43.695 (min:  450.464 / max:  995.872) us
>>> cp.cuda.cub_enabled = True
>>> print(cupyx.time.repeat(cp.cumsum, (a, )))
cumsum              :   103.998 us   +/-17.858 (min:   86.243 / max:  234.290) us    133.392 us   +/-16.486 (min:  115.808 / max:  256.320) us

TODO:

The first two action items are for speeding up certain fancy indexing cases.

@grlee77 (Contributor) commented Jan 7, 2020

Nice. I tried this out, and the only problem I encountered was in the shape of the returned result when axis=None and the input is n-D (the output should be a ravelled array).

Simple example with CUB enabled/disabled (the disabled behavior matches NumPy):

import cupy as cp
import numpy as np

a = cp.testing.shaped_arange((4, 5), cp, float)
cp.cuda.cub_enabled = True
cp.cumsum(a)
# array([[  1.,   3.,   6.,  10.,  15.],
#        [ 21.,  28.,  36.,  45.,  55.],
#        [ 66.,  78.,  91., 105., 120.],
#        [136., 153., 171., 190., 210.]])

cp.cuda.cub_enabled = False
cp.cumsum(a)
# array([  1.,   3.,   6.,  10.,  15.,  21.,  28.,  36.,  45.,  55.,  66.,
#         78.,  91., 105., 120., 136., 153., 171., 190., 210.])

@leofang (Member, Author) commented Jan 8, 2020

Oops, thanks @grlee77! I read the spec wrong...

@emcastillo emcastillo self-assigned this Jan 8, 2020
@emcastillo (Member) commented
I was actually working on this :D
I'll let you take care of it; you'll do it better :)
One of the things I saw is that CUB scanning does not support axis selection, so implementing that correctly may be a bit trickier.

@leofang (Member, Author) commented Jan 8, 2020

Oh, sorry for the duplicate work, @emcastillo. Which branch is it? I could cherry-pick your code.

One of the things I saw is that CUB scanning does not support axis selection, so implementing that correctly may be a bit trickier.

I don’t think CUB can support axis != None (unless we are willing to launch hundreds of kernels in such cases).
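To illustrate why axis != None is awkward for a device-wide 1-D scan, here is a plain-Python sketch (illustrative only, not CuPy code): scanning along the last axis of a 2-D array decomposes into one independent scan per row, so a naive GPU mapping would need one kernel launch per row.

```python
def cumsum_along_last_axis(rows):
    """Inclusive prefix sum of each row independently (axis=-1 semantics)."""
    out = []
    for row in rows:  # on a GPU, each iteration would be its own scan launch
        acc, scanned = 0, []
        for x in row:
            acc += x
            scanned.append(acc)
        out.append(scanned)
    return out

cumsum_along_last_axis([[1, 2, 3], [4, 5, 6]])  # -> [[1, 3, 6], [4, 9, 15]]
```

For an array with thousands of rows this naive mapping launches thousands of tiny kernels, which is why a batched kernel (as discussed for #2907) is the preferred approach.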

@emcastillo (Member) commented
No worries, your PR is going to be better than mine, so do the CUB part and I will take care of improving the non-CUB kernels.

@leofang (Member, Author) commented Jan 8, 2020

@emcastillo This PR should be considered jointly with your #2907. Looks like cupy.core._routines_math.scan() is really good. It is slower than CUB only when there are >= 10^6 elements (for float64 in this case):

>>> a = cp.random.random(1000)
>>> print(cupyx.time.repeat(cp.cuda.cub.device_scan, (a, 5)))
device_scan         :    98.163 us   +/-16.937 (min:   76.599 / max:  222.032) us    106.869 us   +/-18.320 (min:   77.120 / max:  236.768) us
>>> print(cupyx.time.repeat(cp.core._routines_math._scan_for_test, (a, )))
_scan_for_test      :    67.176 us   +/- 9.349 (min:   49.020 / max:  147.861) us     76.206 us   +/-10.508 (min:   56.064 / max:  163.200) us
>>> 
>>> a = cp.random.random(100000)
>>> print(cupyx.time.repeat(cp.cuda.cub.device_scan, (a, 5)))
device_scan         :    97.013 us   +/-17.426 (min:   76.346 / max:  215.663) us    105.617 us   +/-18.885 (min:   83.296 / max:  300.320) us
>>> print(cupyx.time.repeat(cp.core._routines_math._scan_for_test, (a, )))
_scan_for_test      :    67.560 us   +/- 9.815 (min:   49.441 / max:  181.684) us     76.555 us   +/-11.008 (min:   56.416 / max:  198.016) us
>>> 
>>> a = cp.random.random(1000000)
>>> print(cupyx.time.repeat(cp.cuda.cub.device_scan, (a, 5)))
device_scan         :    94.037 us   +/-17.865 (min:   74.958 / max:  278.390) us    124.621 us   +/-16.371 (min:  106.752 / max:  301.504) us
>>> print(cupyx.time.repeat(cp.core._routines_math._scan_for_test, (a, )))
_scan_for_test      :    90.543 us   +/-17.130 (min:   70.443 / max:  201.541) us    123.582 us   +/-10.805 (min:  107.584 / max:  223.040) us
>>> 
>>> a = cp.random.random(10000000)
>>> print(cupyx.time.repeat(cp.cuda.cub.device_scan, (a, 5)))
device_scan         :    82.406 us   +/-12.235 (min:   74.969 / max:  204.807) us    650.291 us   +/- 6.245 (min:  644.096 / max:  714.784) us
>>> print(cupyx.time.repeat(cp.core._routines_math._scan_for_test, (a, )))
_scan_for_test      :    78.777 us   +/- 9.734 (min:   72.235 / max:  196.004) us    822.750 us   +/- 8.690 (min:  816.992 / max:  901.280) us

So, if cupy.core._routines_math.scan() is tweaked a bit (simply by changing the accumulator's += to *=, I guess), then in addition to cumsum() it could also handle cumprod(), and we either wouldn't need CUB at all or would switch to CUB only when the array is excessively large?

(Actually I already cheated a bit in the above comparison by using cp.cuda.cub.device_scan() instead of cp.cuda.cub.cub_scan(): the latter adds about 8 us on my machine due to extra checks...)
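The tweak suggested above can be sketched in plain Python (illustrative only, not CuPy's actual kernel): an inclusive scan parameterized by its binary operator yields cumsum under addition and cumprod under multiplication, with no other change to the scan logic.

```python
import operator

def inclusive_scan(xs, op=operator.add):
    """Inclusive prefix scan of xs under the binary operator op."""
    out = []
    acc = None
    for x in xs:
        acc = x if acc is None else op(acc, x)  # the += vs *= swap lives here
        out.append(acc)
    return out

inclusive_scan([1, 2, 3, 4])                # -> [1, 3, 6, 10]  (cumsum)
inclusive_scan([1, 2, 3, 4], operator.mul)  # -> [1, 2, 6, 24]  (cumprod)
```

The same parameterization works for any associative operator, which is what lets one kernel template serve both routines.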

@leofang (Member, Author) commented Jan 8, 2020

we either wouldn't need CUB at all, or would switch to CUB only when the array is excessively large?

We could also let users decide if they want to use CUB scan, as usual by toggling cp.cuda.cub_enabled.

leofang added a commit to leofang/cupy that referenced this pull request Jan 8, 2020
@emcastillo (Member) commented Jan 9, 2020

I wouldn't make sub-optimal implementations the default.
But I understand that enabling some CUB algorithms and not others can be a configuration nightmare.

@leofang (Member, Author) commented Jan 9, 2020

Another thing: CUB cumsum and cumprod seem to be buggy for complex numbers. I am not sure why, but it is deterministically reproducible. The accumulation stops at the 448th element of the result:

>>> import cupy as cp
>>> a = cp.random.random(500) + 1j*cp.random.random(500)
>>> cp.cuda.cub_enabled = True
>>> b = cp.cumsum(a)  # wrong
>>> cp.cuda.cub_enabled = False
>>> c = cp.cumsum(a)  # correct
>>> cp.allclose(b[0:447],c[0:447])
array(True)
>>> cp.allclose(b[0:448],c[0:448])
array(True)
>>> cp.allclose(b[0:449],c[0:449])
array(False)
>>> b[448] == a[448]   # <--- why???
array(True)
>>> c[448] == a[448]
array(False)

Looking into this...

@leofang (Member, Author) commented Jan 15, 2020

Another thing: CUB cumsum and cumprod seems to be buggy for complex numbers.

Two observations:

  1. This bug is complex128-only; for complex64 arrays CUB works fine, so I suspect it's due to CUB's internal handling of data sizes exceeding 64 bits.
  2. Adjusting the items per thread in this line
    https://github.com/NVlabs/cub/blob/c3cceac115c072fb63df1836ff46d8c60d9eb304/cub/device/dispatch/dispatch_scan.cuh#L177
    causes the bug to appear at different elements (I changed 15 to 12 and the bug started at element No.3xx instead of 448).

@emcastillo Due to the above observations, I tend to think this is a bug in CUB (which I have not pinpointed yet), so I will disable the CUB prefix scan for complex128 and proceed.
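The planned workaround can be sketched as a small dispatch guard (the names here are hypothetical illustrations, not CuPy's actual internals): complex128 is routed to the non-CUB path while every other dtype may still use CUB when the user has enabled it.

```python
# dtypes for which the CUB prefix scan is known to misbehave (see the
# complex128 observations above); names are illustrative, not CuPy's.
CUB_SCAN_UNSUPPORTED = {'complex128'}

def use_cub_for_scan(dtype_name, cub_enabled):
    """Return True only if CUB is enabled and the dtype is safe for CUB scan."""
    return cub_enabled and dtype_name not in CUB_SCAN_UNSUPPORTED

use_cub_for_scan('float64', True)     # -> True
use_cub_for_scan('complex128', True)  # -> False: falls back to the CuPy kernel
```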

@emcastillo (Member) commented
I have been working on adapting the fast kernel in scan to do the batched cumsum and cumprod.
It's been a nightmare, but I am pretty close to the end :D

@leofang (Member, Author) commented Jan 15, 2020

Nice. We can then do a benchmark to decide if this PR is to be kept or closed.

@@ -372,8 +372,7 @@ def can_use_device_segmented_reduce(int op, x_dtype, Py_ssize_t ndim,
                                     order)


-cdef _cub_reduce_dtype_compatible(x_dtype, int op, dtype=None,
-                                  bint segmented=False):
+cdef _cub_reduce_dtype_compatible(x_dtype, int op, dtype=None):
@leofang (Member, Author) commented on the diff:
This is to correct a mistake I made in #2562.

@leofang leofang changed the title [WIP] Support CUB prefix sum & product Support CUB prefix sum & product Jan 15, 2020
@leofang leofang marked this pull request as ready for review January 15, 2020 20:03
Review thread on cupy/math/sumprod.py (outdated, resolved)
@leofang (Member, Author) commented Jan 19, 2020

Sounds good. I'll then avoid dealing with this and let the current _cum_core() handle it. Then, in the next PR we just fix _cum_core().

TODO: address the compliance issue (cupy#2919 (comment))
@leofang (Member, Author) commented Jan 19, 2020

@emcastillo Thanks for the heads-up! I guess you are right: CUB supports in-place scan. This simplifies things quite a bit. PTAL.

I verified this support at a few levels:

  1. Tested it directly with the new change 79780e1.
  2. Thrust's documentation uses in-place scans as an example; since CUB is one of Thrust's backends, it should also work.
  3. By inspecting CUB's code: the ConsumeTile() implementation in cub/cub/agent/agent_scan.cuh seems to permit in-place operation.
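Why an in-place inclusive scan is safe at all can be seen in a sequential sketch (plain Python, purely illustrative): each slot is fully finalized before the next slot reads it, so aliasing the input and output buffers never reads a clobbered value. The ConsumeTile() inspection above suggests CUB maintains the analogous read-before-write property per tile.

```python
def inplace_cumsum(buf):
    """Overwrite buf with its inclusive prefix sum, using no extra storage."""
    for i in range(1, len(buf)):
        # buf[i - 1] already holds the finalized partial sum, and buf[i]
        # is read before being overwritten, so aliasing is safe.
        buf[i] += buf[i - 1]
    return buf

a = [1.0, 2.0, 3.0, 4.0]
inplace_cumsum(a)  # a is now [1.0, 3.0, 6.0, 10.0]
```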

@grlee77 (Contributor) commented Jan 20, 2020

Hi @leofang. The dtype behavior you describe above (#2919 (comment)) is also what I recently attempted to implement for cupy.mean in #2903. Seeing that this is common across reduction functions, perhaps the logic should move out into a separate helper function.

@leofang (Member, Author) commented Jan 20, 2020

@grlee77 Thanks for bringing it up, Gregory 🙂 I just took a quick look at it. While it'd be nice to have a separate helper as you said, the rules there are slightly different from #2919 (comment). Maybe we need a switch or flag in the helper to determine the desired behavior?

Anyway, if you can add that helper in your PR, @emcastillo and I will follow suit in #2907 and here. We can move the discussion to your #2903. Thanks!

@leofang (Member, Author) commented Feb 4, 2020

@emcastillo I suggest we consider the PRs in the following order: #2919 (this PR) -> #2907 for two reasons.

First, I do not deal with the complexity of out and dtype, as outlined in #2919 (comment), and simply let the caller functions (cumsum and cumprod) handle it, whereas in #2907 the caller code is refactored.

Second, while Gregory's suggestion of a separate helper is tempting, I don't immediately see how such a helper could be flexible and useful, since the rules vary across functions.

If we want to make the implementation more NumPy-compliant, we can make another PR to address #2919 (comment) after both are merged, as you suggested earlier.

If you agree with my suggestion, the only question I have for you is:

Thanks 🙂

@emcastillo (Member) commented
Let's do that.

@leofang (Member, Author) commented Feb 4, 2020

Cool, if we don’t need to add tests here, this PR is ready.

@emcastillo (Member) commented
Let's add tests in #2598. Thanks!

Review thread on cupy/cuda/cub.pyx (outdated, resolved)
leofang added a commit to leofang/cupy that referenced this pull request Feb 5, 2020
@leofang (Member, Author) commented Feb 6, 2020

I fixed the compilation error. Sorry for my stupid mistake 😅


If the specified scan is not possible, None is returned.
"""
if op < CUPY_CUB_CUMSUM or op > CUPY_CUB_CUMPROD:
@emcastillo (Member) suggested:
if op not in (CUPY_CUB_CUMSUM, CUPY_CUB_CUMPROD):

@emcastillo (Member) commented:
I will fix it later in my PR.

@leofang (Member, Author) commented:
Oh, sorry for overlooking this @emcastillo 😓

@emcastillo (Member) commented
Jenkins, test this please

@pfn-ci-bot (Collaborator) commented
Successfully created a job for commit b9483ef:

@chainer-ci (Member) commented
Jenkins CI test (for commit b9483ef, target branch master) succeeded!

@emcastillo emcastillo merged commit 71307a5 into cupy:master Feb 6, 2020
@emcastillo emcastillo added the cat:performance Performance in terms of speed or memory consumption label Feb 6, 2020
@emcastillo emcastillo added this to the v8.0.0a1 milestone Feb 6, 2020
@leofang leofang deleted the cub_scan branch February 6, 2020 14:58
@leofang (Member, Author) commented Feb 6, 2020

Thanks @emcastillo!

grlee77 added a commit to grlee77/cupy that referenced this pull request Feb 7, 2020
* upstream/master:
  apply cupy#2919 (comment)
  Fix nvcc command lookup
  Add NumPy 1.18 to installation guide
  Use (1, 3)-shape to specify RGB
  Use `scipy.stats` to compute bivariate normal
  Fix setup.py
  Keep imag a view of original array
  Print installed packages in pytest
  Fix typos in comments
  defaults to in-place scan
  avoid using cub_scan for complex128; simplify shape
  Remove PY2 warning
  Add CUDA 10.2 support
  Remove TODO
  Fix imag for 0-size array
  Apply cupy#2766 (comment)
  Do not let Python 2 users build CuPy v7
  Fix flake8
  Use intptr_t for cusolver handler