
Support CUB prefix sum & product #2919
Merged (10 commits, Feb 6, 2020)
Conversation

@leofang (Member) commented Jan 7, 2020

In the demo below, CUB gets about a 4x speed-up on a P100:

>>> import cupy as cp
>>> import cupyx
>>> 
>>> a = cp.random.random(1000000)
>>>
>>> cp.cuda.cub_enabled = False
>>> print(cupyx.time.repeat(cp.cumprod, (a, )))
cumprod             :   428.775 us   +/-57.870 (min:  396.988 / max:  976.234) us    471.108 us   +/-45.524 (min:  449.888 / max:  992.736) us
>>> cp.cuda.cub_enabled = True
>>> print(cupyx.time.repeat(cp.cumprod, (a, )))
cumprod             :   103.053 us   +/-19.460 (min:   84.242 / max:  229.079) us    132.358 us   +/-17.695 (min:  113.888 / max:  250.048) us
>>>
>>> cp.cuda.cub_enabled = False
>>> print(cupyx.time.repeat(cp.cumsum, (a, )))
cumsum              :   428.253 us   +/-55.802 (min:  398.817 / max:  978.976) us    470.163 us   +/-43.695 (min:  450.464 / max:  995.872) us
>>> cp.cuda.cub_enabled = True
>>> print(cupyx.time.repeat(cp.cumsum, (a, )))
cumsum              :   103.998 us   +/-17.858 (min:   86.243 / max:  234.290) us    133.392 us   +/-16.486 (min:  115.808 / max:  256.320) us

TODO:

The first two action items are for speeding up certain fancy indexing cases.

@grlee77 (Contributor) commented Jan 7, 2020

Nice. I tried this out, and the only problem I encountered was in the shape of the returned result when axis=None and the input is n-D (the output should be a ravelled array).

Simple example with CUB enabled/disabled (the disabled behavior matches NumPy):

import cupy as cp
import numpy as np

a = cp.testing.shaped_arange((4, 5), cp, float)
cp.cuda.cub_enabled = True
cp.cumsum(a)
# array([[  1.,   3.,   6.,  10.,  15.],
#        [ 21.,  28.,  36.,  45.,  55.],
#        [ 66.,  78.,  91., 105., 120.],
#        [136., 153., 171., 190., 210.]])

cp.cuda.cub_enabled = False
cp.cumsum(a)
# array([  1.,   3.,   6.,  10.,  15.,  21.,  28.,  36.,  45.,  55.,  66.,
#         78.,  91., 105., 120., 136., 153., 171., 190., 210.])

@leofang (Member, Author) commented Jan 8, 2020

Oops, thanks @grlee77! I read the spec wrong...

@emcastillo emcastillo self-assigned this Jan 8, 2020
@emcastillo (Member) commented
I was actually working on this :D
I'll let you take care of it; you'll do it better :)
One of the things I saw is that CUB scanning does not support axis selection, so implementing that correctly may be a bit trickier.

@leofang (Member, Author) commented Jan 8, 2020

Oh, sorry for the duplicate work, @emcastillo. Which branch is it? I could cherry-pick your code.

One of the things I saw is that CUB scanning does not support axis selection, so implementing that correctly may be a bit trickier.

I don’t think CUB can support axis != None (unless we are willing to launch hundreds of kernels in such cases).
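To illustrate why axis != None is awkward for a device-wide 1-D scan, here is a plain-Python sketch (illustrative only, not CuPy code): scanning along the last axis of a 2-D array decomposes into one independent scan per row, so a naive GPU mapping would need one kernel launch per row.

```python
def cumsum_along_last_axis(rows):
    """Inclusive prefix sum of each row independently (axis=-1 semantics)."""
    out = []
    for row in rows:  # on a GPU, each iteration would be its own scan launch
        acc, scanned = 0, []
        for x in row:
            acc += x
            scanned.append(acc)
        out.append(scanned)
    return out

cumsum_along_last_axis([[1, 2, 3], [4, 5, 6]])  # -> [[1, 3, 6], [4, 9, 15]]
```

For an array with thousands of rows this naive mapping launches thousands of tiny kernels, which is why a batched kernel (as discussed for #2907) is the preferred approach.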

@emcastillo (Member) commented
No worries, your PR is going to be better than mine, so do the CUB part and I will take care of improving the non-CUB kernels.

@leofang (Member, Author) commented Jan 8, 2020

@emcastillo This PR should be considered jointly with your #2907. Looks like cupy.core._routines_math.scan() is really good. It is slower than CUB only when there are >= 10^6 elements (for float64 in this case):

>>> a = cp.random.random(1000)
>>> print(cupyx.time.repeat(cp.cuda.cub.device_scan, (a, 5)))
device_scan         :    98.163 us   +/-16.937 (min:   76.599 / max:  222.032) us    106.869 us   +/-18.320 (min:   77.120 / max:  236.768) us
>>> print(cupyx.time.repeat(cp.core._routines_math._scan_for_test, (a, )))
_scan_for_test      :    67.176 us   +/- 9.349 (min:   49.020 / max:  147.861) us     76.206 us   +/-10.508 (min:   56.064 / max:  163.200) us
>>> 
>>> a = cp.random.random(100000)
>>> print(cupyx.time.repeat(cp.cuda.cub.device_scan, (a, 5)))
device_scan         :    97.013 us   +/-17.426 (min:   76.346 / max:  215.663) us    105.617 us   +/-18.885 (min:   83.296 / max:  300.320) us
>>> print(cupyx.time.repeat(cp.core._routines_math._scan_for_test, (a, )))
_scan_for_test      :    67.560 us   +/- 9.815 (min:   49.441 / max:  181.684) us     76.555 us   +/-11.008 (min:   56.416 / max:  198.016) us
>>> 
>>> a = cp.random.random(1000000)
>>> print(cupyx.time.repeat(cp.cuda.cub.device_scan, (a, 5)))
device_scan         :    94.037 us   +/-17.865 (min:   74.958 / max:  278.390) us    124.621 us   +/-16.371 (min:  106.752 / max:  301.504) us
>>> print(cupyx.time.repeat(cp.core._routines_math._scan_for_test, (a, )))
_scan_for_test      :    90.543 us   +/-17.130 (min:   70.443 / max:  201.541) us    123.582 us   +/-10.805 (min:  107.584 / max:  223.040) us
>>> 
>>> a = cp.random.random(10000000)
>>> print(cupyx.time.repeat(cp.cuda.cub.device_scan, (a, 5)))
device_scan         :    82.406 us   +/-12.235 (min:   74.969 / max:  204.807) us    650.291 us   +/- 6.245 (min:  644.096 / max:  714.784) us
>>> print(cupyx.time.repeat(cp.core._routines_math._scan_for_test, (a, )))
_scan_for_test      :    78.777 us   +/- 9.734 (min:   72.235 / max:  196.004) us    822.750 us   +/- 8.690 (min:  816.992 / max:  901.280) us

So, if cupy.core._routines_math.scan() is tweaked a bit (simply by changing the accumulator's += to *=, I guess), then in addition to cumsum() it could also handle cumprod(), and we either wouldn't need CUB at all or would switch to CUB only when the array is excessively large?

(Actually I already cheated a bit in the above comparison by using cp.cuda.cub.device_scan() instead of cp.cuda.cub.cub_scan(): the latter adds about 8 us on my machine due to extra checks...)
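The tweak suggested above can be sketched in plain Python (illustrative only, not CuPy's actual kernel): an inclusive scan parameterized by its binary operator yields cumsum under addition and cumprod under multiplication, with no other change to the scan logic.

```python
import operator

def inclusive_scan(xs, op=operator.add):
    """Inclusive prefix scan of xs under the binary operator op."""
    out = []
    acc = None
    for x in xs:
        acc = x if acc is None else op(acc, x)  # the += vs *= swap lives here
        out.append(acc)
    return out

inclusive_scan([1, 2, 3, 4])                # -> [1, 3, 6, 10]  (cumsum)
inclusive_scan([1, 2, 3, 4], operator.mul)  # -> [1, 2, 6, 24]  (cumprod)
```

The same parameterization works for any associative operator, which is what lets one kernel template serve both routines.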

@leofang (Member, Author) commented Jan 8, 2020

we either wouldn't need CUB at all, or would switch to CUB only when the array is excessively large?

We could also let users decide if they want to use CUB scan, as usual by toggling cp.cuda.cub_enabled.

leofang added a commit to leofang/cupy that referenced this pull request Jan 8, 2020
@emcastillo (Member) commented Jan 9, 2020

I wouldn't make sub-optimal implementations the default.
But I understand that enabling some CUB algorithms and not others can be a configuration nightmare.

@leofang (Member, Author) commented Jan 9, 2020

Another thing: CUB cumsum and cumprod seem to be buggy for complex numbers. I am not sure why, but it is deterministically reproducible. The accumulation stops at the 448th element of the result:

>>> import cupy as cp
>>> a = cp.random.random(500) + 1j*cp.random.random(500)
>>> cp.cuda.cub_enabled = True
>>> b = cp.cumsum(a)  # wrong
>>> cp.cuda.cub_enabled = False
>>> c = cp.cumsum(a)  # correct
>>> cp.allclose(b[0:447],c[0:447])
array(True)
>>> cp.allclose(b[0:448],c[0:448])
array(True)
>>> cp.allclose(b[0:449],c[0:449])
array(False)
>>> b[448] == a[448]   # <--- why???
array(True)
>>> c[448] == a[448]
array(False)

Looking into this...

@leofang (Member, Author) commented Jan 15, 2020

Another thing: CUB cumsum and cumprod seems to be buggy for complex numbers.

Two observations:

  1. This bug is complex128-only; for complex64 arrays CUB works fine, so I suspect it's due to CUB's internal handling of data sizes exceeding 64 bits.
  2. Adjusting the items per thread in this line
    https://github.com/NVlabs/cub/blob/c3cceac115c072fb63df1836ff46d8c60d9eb304/cub/device/dispatch/dispatch_scan.cuh#L177
    causes the bug to appear at different elements (I changed 15 to 12 and the bug started at element No.3xx instead of 448).

@emcastillo Due to the above observations, I tend to think this is a bug in CUB (which I have not pinpointed yet), so I will disable the CUB prefix scan for complex128 and proceed.
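The planned workaround can be sketched as a small dispatch guard (the names here are hypothetical illustrations, not CuPy's actual internals): complex128 is routed to the non-CUB path while every other dtype may still use CUB when the user has enabled it.

```python
# dtypes for which the CUB prefix scan is known to misbehave (see the
# complex128 observations above); names are illustrative, not CuPy's.
CUB_SCAN_UNSUPPORTED = {'complex128'}

def use_cub_for_scan(dtype_name, cub_enabled):
    """Return True only if CUB is enabled and the dtype is safe for CUB scan."""
    return cub_enabled and dtype_name not in CUB_SCAN_UNSUPPORTED

use_cub_for_scan('float64', True)     # -> True
use_cub_for_scan('complex128', True)  # -> False: falls back to the CuPy kernel
```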

@emcastillo (Member) commented
I have been working on adapting the fast kernel in scan to do the batched cumsum and cumprod.
It's been a nightmare, but I am pretty close to the end :D

@leofang (Member, Author) commented Jan 15, 2020

Nice. We can then do a benchmark to decide if this PR is to be kept or closed.

@@ -372,8 +372,7 @@ def can_use_device_segmented_reduce(int op, x_dtype, Py_ssize_t ndim,
                                     order)


-cdef _cub_reduce_dtype_compatible(x_dtype, int op, dtype=None,
-                                  bint segmented=False):
+cdef _cub_reduce_dtype_compatible(x_dtype, int op, dtype=None):
@leofang (Member, Author) commented on the diff:
This is to correct a mistake I made in #2562.

@leofang leofang changed the title [WIP] Support CUB prefix sum & product Support CUB prefix sum & product Jan 15, 2020
@leofang leofang marked this pull request as ready for review January 15, 2020 20:03
Review thread on cupy/math/sumprod.py (outdated, resolved)
@leofang (Member, Author) commented Jan 19, 2020

Sounds good. I'll then avoid dealing with this and let the current _cum_core() handle it. Then, in the next PR we just fix _cum_core().

TODO: address the compliance issue (cupy#2919 (comment))
@leofang (Member, Author) commented Jan 19, 2020

@emcastillo Thanks for the heads-up! I guess you are right: CUB supports in-place scan. This simplifies things quite a bit. PTAL.

I verified this support at a few levels:

  1. Tested it directly with the new change 79780e1.
  2. Thrust's documentation uses in-place scans as an example; since CUB is one of Thrust's backends, it should also work.
  3. By inspecting CUB's code: the ConsumeTile() implementation in cub/cub/agent/agent_scan.cuh seems to permit in-place operation.
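Why an in-place inclusive scan is safe at all can be seen in a sequential sketch (plain Python, purely illustrative): each slot is fully finalized before the next slot reads it, so aliasing the input and output buffers never reads a clobbered value. The ConsumeTile() inspection above suggests CUB maintains the analogous read-before-write property per tile.

```python
def inplace_cumsum(buf):
    """Overwrite buf with its inclusive prefix sum, using no extra storage."""
    for i in range(1, len(buf)):
        # buf[i - 1] already holds the finalized partial sum, and buf[i]
        # is read before being overwritten, so aliasing is safe.
        buf[i] += buf[i - 1]
    return buf

a = [1.0, 2.0, 3.0, 4.0]
inplace_cumsum(a)  # a is now [1.0, 3.0, 6.0, 10.0]
```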

@grlee77 (Contributor) commented Jan 20, 2020

Hi @leofang. The dtype behavior you describe above (#2919 (comment)) is also what I recently attempted to implement for cupy.mean in #2903. Seeing that this is common across reduction functions, perhaps the logic should move out into a separate helper function.

@leofang (Member, Author) commented Jan 20, 2020

@grlee77 Thanks for bringing it up, Gregory 🙂 I just took a quick look at it. While it'd be nice to have a separate helper as you said, the rules there are slightly different from #2919 (comment). Maybe we need a switch or flag in the helper to determine the desired behavior?

Anyway, if you can add that helper in your PR, @emcastillo and I will follow suit in #2907 and here. We can move the discussion to your #2903. Thanks!

@leofang (Member, Author) commented Feb 4, 2020

@emcastillo I suggest we consider the PRs in the following order: #2919 (this PR) -> #2907 for two reasons.

First, I do not deal with the complexity of out and dtype, as outlined in #2919 (comment), and simply let the caller functions (cumsum and cumprod) handle it, whereas in #2907 the caller code is refactored.

Second, while Gregory's suggestion of a separate helper is tempting, I don't immediately see how such a helper could be flexible and useful, since the rules vary across functions.

If we want to make the implementation more NumPy-compliant, we can make another PR to address #2919 (comment) after both are merged, as you suggested earlier.

If you agree with my suggestion, the only question I have for you is:

Thanks 🙂

@emcastillo (Member) commented
Let's do that.

@leofang (Member, Author) commented Feb 4, 2020

Cool, if we don’t need to add tests here, this PR is ready.

@emcastillo (Member) commented
Let's add tests in #2598. Thanks!

Review thread on cupy/cuda/cub.pyx (outdated, resolved)
leofang added a commit to leofang/cupy that referenced this pull request Feb 5, 2020
@leofang (Member, Author) commented Feb 6, 2020

I fixed the compilation error. Sorry for my stupid mistake 😅


If the specified scan is not possible, None is returned.
"""
if op < CUPY_CUB_CUMSUM or op > CUPY_CUB_CUMPROD:
@emcastillo (Member) suggested:
if op not in (CUPY_CUB_CUMSUM, CUPY_CUB_CUMPROD):

@emcastillo (Member) commented:
I will fix it later in my PR.

@leofang (Member, Author) commented:
Oh, sorry for overlooking this @emcastillo 😓

@emcastillo (Member) commented
Jenkins, test this please

@pfn-ci-bot (Collaborator) commented
Successfully created a job for commit b9483ef:

@chainer-ci (Member) commented
Jenkins CI test (for commit b9483ef, target branch master) succeeded!

@emcastillo emcastillo merged commit 71307a5 into cupy:master Feb 6, 2020
@emcastillo emcastillo added the cat:performance Performance in terms of speed or memory consumption label Feb 6, 2020
@emcastillo emcastillo added this to the v8.0.0a1 milestone Feb 6, 2020
@leofang leofang deleted the cub_scan branch February 6, 2020 14:58
@leofang (Member, Author) commented Feb 6, 2020

Thanks @emcastillo!

grlee77 added a commit to grlee77/cupy that referenced this pull request Feb 7, 2020
* upstream/master:
  apply cupy#2919 (comment)
  Fix nvcc command lookup
  Add NumPy 1.18 to installation guide
  Use (1, 3)-shape to specify RGB
  Use `scipy.stats` to compute bivariate normal
  Fix setup.py
  Keep imag a view of original array
  Print installed packages in pytest
  Fix typos in comments
  defaults to in-place scan
  avoid using cub_scan for complex128; simplify shape
  Remove PY2 warning
  Add CUDA 10.2 support
  Remove TODO
  Fix imag for 0-size array
  Apply cupy#2766 (comment)
  Do not let Python 2 users build CuPy v7
  Fix flake8
  Use intptr_t for cusolver handler