Support CUB prefix sum & product #2919
Conversation
Nice. I tried this out and the only problem I encountered was in the shape of the returned result. Simple example with CUB enabled/disabled (the disabled behavior matches NumPy):

```python
import cupy as cp
import numpy as np

a = cp.testing.shaped_arange((4, 5), cp, float)

cp.cumsum(a)
# array([[  1.,   3.,   6.,  10.,  15.],
#        [ 21.,  28.,  36.,  45.,  55.],
#        [ 66.,  78.,  91., 105., 120.],
#        [136., 153., 171., 190., 210.]])

cp.cuda.cub_enabled = False
cp.cumsum(a)
# array([  1.,   3.,   6.,  10.,  15.,  21.,  28.,  36.,  45.,  55.,  66.,
#         78.,  91., 105., 120., 136., 153., 171., 190., 210.])
```
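For reference, NumPy's documented behavior is that `cumsum` without an `axis` argument operates on the flattened array. A minimal NumPy-only check of that rule (independent of CuPy):

```python
import numpy as np

# np.cumsum with axis=None (the default) first flattens the input,
# so a (4, 5) array yields a flat, length-20 cumulative sum.
a = np.arange(1, 21, dtype=float).reshape(4, 5)
flat = np.cumsum(a)             # shape (20,); matches the CUB-disabled path above
per_row = np.cumsum(a, axis=1)  # shape (4, 5); cumulative sum along each row
```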
Oops, thanks @grlee77! I read the spec wrong...

I was actually working on this :D.
Oh sorry for the duplicate work @emcastillo. Which branch is it? I could cherry-pick your code. I don't think CUB can support …
No worries, your PR is going to be better than mine, so do the CUB part and I will take care of improving the non-CUB kernels.
@emcastillo This PR should be considered jointly with your #2907. Looks like …

```python
>>> a = cp.random.random(1000)
>>> print(cupyx.time.repeat(cp.cuda.cub.device_scan, (a, 5)))
device_scan    : 98.163 us +/-16.937 (min: 76.599 / max: 222.032) us  106.869 us +/-18.320 (min:  77.120 / max: 236.768) us
>>> print(cupyx.time.repeat(cp.core._routines_math._scan_for_test, (a, )))
_scan_for_test : 67.176 us +/- 9.349 (min: 49.020 / max: 147.861) us   76.206 us +/-10.508 (min:  56.064 / max: 163.200) us
>>>
>>> a = cp.random.random(100000)
>>> print(cupyx.time.repeat(cp.cuda.cub.device_scan, (a, 5)))
device_scan    : 97.013 us +/-17.426 (min: 76.346 / max: 215.663) us  105.617 us +/-18.885 (min:  83.296 / max: 300.320) us
>>> print(cupyx.time.repeat(cp.core._routines_math._scan_for_test, (a, )))
_scan_for_test : 67.560 us +/- 9.815 (min: 49.441 / max: 181.684) us   76.555 us +/-11.008 (min:  56.416 / max: 198.016) us
>>>
>>> a = cp.random.random(1000000)
>>> print(cupyx.time.repeat(cp.cuda.cub.device_scan, (a, 5)))
device_scan    : 94.037 us +/-17.865 (min: 74.958 / max: 278.390) us  124.621 us +/-16.371 (min: 106.752 / max: 301.504) us
>>> print(cupyx.time.repeat(cp.core._routines_math._scan_for_test, (a, )))
_scan_for_test : 90.543 us +/-17.130 (min: 70.443 / max: 201.541) us  123.582 us +/-10.805 (min: 107.584 / max: 223.040) us
>>>
>>> a = cp.random.random(10000000)
>>> print(cupyx.time.repeat(cp.cuda.cub.device_scan, (a, 5)))
device_scan    : 82.406 us +/-12.235 (min: 74.969 / max: 204.807) us  650.291 us +/- 6.245 (min: 644.096 / max: 714.784) us
>>> print(cupyx.time.repeat(cp.core._routines_math._scan_for_test, (a, )))
_scan_for_test : 78.777 us +/- 9.734 (min: 72.235 / max: 196.004) us  822.750 us +/- 8.690 (min: 816.992 / max: 901.280) us
```

So, if … (Actually I already cheated a bit in the above comparison by using ….)
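For readers unfamiliar with the output format: each `cupyx.time.repeat` line reports mean ± stdev and min/max over repeated runs (the two groups of numbers appear to be CPU- and GPU-side timings). A minimal host-side stand-in using only the standard library can illustrate what is being measured; the name `repeat_host` is hypothetical, not a CuPy API, and unlike the real helper it records no GPU/CUDA-event times:

```python
import time
import statistics

def repeat_host(func, args=(), n_warmup=10, n_repeat=100):
    # Warm up first so one-time costs (caching, compilation) are excluded,
    # then report mean +/- stdev and min/max wall-clock time in microseconds.
    for _ in range(n_warmup):
        func(*args)
    times_us = []
    for _ in range(n_repeat):
        t0 = time.perf_counter()
        func(*args)
        times_us.append((time.perf_counter() - t0) * 1e6)
    return {
        'mean': statistics.mean(times_us),
        'std': statistics.pstdev(times_us),
        'min': min(times_us),
        'max': max(times_us),
    }
```

For example, `repeat_host(sum, (range(1000),))` returns a dict of the four statistics for repeatedly summing a small range.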
We could also let users decide whether they want to use the CUB scan, as usual by toggling `cub_enabled`.

I wouldn't set sub-optimal implementations by default.
Another thing: CUB gives wrong results for complex input:

```python
>>> import cupy as cp
>>> a = cp.random.random(500) + 1j*cp.random.random(500)
>>> cp.cuda.cub_enabled = True
>>> b = cp.cumsum(a)  # wrong
>>> cp.cuda.cub_enabled = False
>>> c = cp.cumsum(a)  # correct
>>> cp.allclose(b[0:447], c[0:447])
array(True)
>>> cp.allclose(b[0:448], c[0:448])
array(True)
>>> cp.allclose(b[0:449], c[0:449])
array(False)
>>> b[448] == a[448]  # <--- why???
array(True)
>>> c[448] == a[448]
array(False)
```

Looking into this...
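Until the CUB path handles complex input correctly, one possible workaround is to exploit linearity: the cumulative sum of a complex array equals the cumulative sums of its real and imaginary parts recombined. A NumPy sketch of the idea (the helper name `complex_cumsum` is hypothetical, not a CuPy function):

```python
import numpy as np

def complex_cumsum(a):
    # cumsum is linear, so the real and imaginary parts can be scanned
    # independently and recombined; each part is a real-valued scan,
    # which sidesteps complex-typed accumulation entirely.
    return np.cumsum(a.real) + 1j * np.cumsum(a.imag)

rng = np.random.default_rng(0)
a = rng.random(500) + 1j * rng.random(500)
```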
Two observations: …

@emcastillo Due to the above observations, I tend to think this is a bug in CUB (which I have not identified yet), so I will disable CUB prefix scan for `complex128`.
I have been working on adapting the fast kernel in `scan` to do the batched cumsum and cumprod.

Nice. We can then do a benchmark to decide whether this PR is to be kept or closed.
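The batched idea can be sketched in NumPy terms: run one independent inclusive scan per row along the last axis, which is what a batched cumsum/cumprod kernel parallelizes on the GPU. A sequential reference implementation, assuming nothing beyond NumPy (`batched_inclusive_scan` is a hypothetical name, not the kernel itself):

```python
import numpy as np

def batched_inclusive_scan(x, op=np.add):
    # One independent inclusive scan per row along the last axis.
    # A GPU kernel would assign rows to blocks and run a work-efficient
    # parallel scan within each; here a plain sequential loop suffices
    # as a correctness reference.
    out = np.empty_like(x)
    out[..., 0] = x[..., 0]
    for i in range(1, x.shape[-1]):
        out[..., i] = op(out[..., i - 1], x[..., i])
    return out
```

With `op=np.add` this matches `np.cumsum(x, axis=-1)`; with `op=np.multiply` it matches `np.cumprod(x, axis=-1)`.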
```diff
@@ -372,8 +372,7 @@ def can_use_device_segmented_reduce(int op, x_dtype, Py_ssize_t ndim,
                                     order)


-cdef _cub_reduce_dtype_compatible(x_dtype, int op, dtype=None,
-                                  bint segmented=False):
+cdef _cub_reduce_dtype_compatible(x_dtype, int op, dtype=None):
```
This is to correct a mistake I made in #2562.
Sounds good. I'll then avoid dealing with this and let the current … stand.
TODO: address the compliance issue (cupy#2919 (comment))
@emcastillo Thanks for the heads-up! I guess you are right: CUB supports in-place scan. This simplifies things quite a bit. PTAL. I verified this support at a few levels: …
Hi @leofang. The dtype behavior you describe above (#2919 (comment)) is also what I recently attempted to implement for `cupy.mean` in #2903. Seeing now that this is more common across reduction functions, perhaps the logic should move out into a separate helper function.
@grlee77 Thanks for bringing it up, Gregory 🙂 I just took a quick look at it. While it'd be nice to have a separate helper as you said, the rules there are slightly different from #2919 (comment). Maybe we need a switch or flag in the helper to determine the desired behavior? Anyway, if you can make that helper in your PR, @emcastillo and I will follow in #2907 and here. We can move the discussion to your #2903. Thanks!
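As a rough illustration of the kind of shared rule being discussed (a sketch only; the helper name and exact behavior are assumptions, not CuPy code): NumPy's `sum`/`cumsum` promote bool and narrow integer inputs to the default integer width when no `dtype` is given, while floats and complex types pass through unchanged.

```python
import numpy as np

def resolve_scan_dtype(x_dtype, dtype=None):
    # Hypothetical helper mirroring NumPy's accumulator rule for
    # sum/cumsum: an explicit dtype wins; otherwise bool and integers
    # narrower than the default integer are widened, and floating or
    # complex inputs are kept as-is.
    if dtype is not None:
        return np.dtype(dtype)
    x_dtype = np.dtype(x_dtype)
    if x_dtype.kind == 'b':
        return np.dtype(np.int_)
    if x_dtype.kind == 'i' and x_dtype.itemsize < np.dtype(np.int_).itemsize:
        return np.dtype(np.int_)
    if x_dtype.kind == 'u' and x_dtype.itemsize < np.dtype(np.uint).itemsize:
        return np.dtype(np.uint)
    return x_dtype
```

A switch or flag, as suggested above, could then select between this rule and the slightly different ones used by other reductions such as `mean`.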
@emcastillo I suggest we consider the PRs in the following order: #2919 (this PR) -> #2907, for two reasons. First, I do not deal with the complexity of …. Second, while Gregory's suggestion of a separate helper is tempting, I don't see immediately how such a helper can be flexible and useful, since the rules vary across different functions. If we want to make the implementation more NumPy-compliant, we can make another PR to address #2919 (comment) after both are merged, as you suggested earlier. If you agree with my suggestion, the only question I have for you is: …

Thanks 🙂
Let's do that.
Cool, if we don't need to add tests here, this PR is ready.

Let's add tests in #2598, thanks!

I fixed the compilation error. Sorry for my stupid mistake 😅
```python
    If the specified scan is not possible, None is returned.
    """
    if op < CUPY_CUB_CUMSUM or op > CUPY_CUB_CUMPROD:
```
```python
if op not in (CUPY_CUB_CUMSUM, CUPY_CUB_CUMPROD):
```
I will fix it later in my pr
Oh, sorry for overlooking this @emcastillo 😓
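The two checks are equivalent only because the two enum values are adjacent; the membership test makes the intent explicit and stays correct even if the enum ordering ever changes. A quick sketch with assumed values (the numeric constants here are made up for illustration; the real ones live in the Cython source):

```python
# Assumed values for illustration only; the real constants are defined
# in the CuPy Cython module.
CUPY_CUB_CUMSUM = 5
CUPY_CUB_CUMPROD = 6

def rejected_by_range(op):
    # Original form: rejects anything outside the [CUMSUM, CUMPROD] range.
    return op < CUPY_CUB_CUMSUM or op > CUPY_CUB_CUMPROD

def rejected_by_membership(op):
    # Suggested form: rejects anything that is not one of the two ops.
    return op not in (CUPY_CUB_CUMSUM, CUPY_CUB_CUMPROD)
```

For adjacent values the two functions agree on every input; with a gap between the constants, the range check would wrongly accept values in between.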
Jenkins, test this please
Successfully created a job for commit b9483ef: |
Jenkins CI test (for commit b9483ef, target branch master) succeeded!

Thanks @emcastillo!
* upstream/master:
  - apply cupy#2919 (comment)
  - Fix nvcc command lookup
  - Add NumPy 1.18 to installation guide
  - Use (1, 3)-shape to specify RGB
  - Use `scipy.stats` to compute bivariate normal
  - Fix setup.py
  - Keep imag a view of original array
  - Print installed packages in pytest
  - Fix typos in comments
  - defaults to in-place scan
  - avoid using cub_scan for complex128; simplify shape
  - Remove PY2 warning
  - Add CUDA 10.2 support
  - Remove TODO
  - Fix imag for 0-size array
  - Apply cupy#2766 (comment)
  - Do not let Python 2 users build CuPy v7
  - Fix flake8
  - Use intptr_t for cusolver handler
In the demo below, CUB gets about a 4x speed-up on a P100:

TODO:

- `cupy.core._routines_math.scan()` (Rel: `cupy.cumsum()` much slower than `cupy.core._routines_math.scan()`? #2905)
- `cupy.core._routines_math.scan()` … `NaN`s correctly
- … `cupy.cuda.cub` … (#2598?)

The first two action items are for speeding up certain fancy indexing cases.