Skip to content
This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Fix of CUB linkage #547

Merged
merged 1 commit into from
Sep 3, 2022
Merged

Conversation

gevtushenko
Copy link
Collaborator

@gevtushenko gevtushenko commented Aug 12, 2022

Motivation

This PR is an alternative fix of the following issue. The suggested solution of making all kernels static would explode binary sizes for builds with -rdc=true. @jrhemstad suggested an alternative approach where we encode the list of architectures we are compiling against in the CUB namespace.

Solution

This PR introduces inline namespace whose name is generated from combination of CUB_VERSION and __CUDA_ARCH_LIST__. This solution addresses original issue, since dispatch layer is calling kernels from the same set or architectures it was compiled against. On the other hand, the solution preserves weak linkage when the code compiled with -rdc=true.

Additionally, CUB_DISABLE_NAMESPACE_MAGIC macro is provided to disable mentioned changes. Providing CUB_DISABLE_NAMESPACE_MAGIC requires specification of CUB_WRAPPED_NAMESPACE.

Example

As an example I'm compiling the following code:

// tu_1.cu
#include <cub/cub.cuh>
void foo() {
  std::size_t tmp_size{};
  int *ptr{};
  cub::DeviceReduce::Sum(nullptr, tmp_size, ptr, ptr, 0);
}

// tu_2.cu
#include <cub/cub.cuh>
void bar() {
  std::size_t tmp_size{};
  int *ptr{};
  cub::DeviceReduce::Sum(nullptr, tmp_size, ptr, ptr, 0);
}

// main.cu
void foo();
void bar();
int main() {
  foo();
  bar();
}

When compiled with -rdc=false for different architectures:

:cuobjdump --dump-sass a.out | rg Function
		Function : _ZN3cub17CUB_200100_860_NS18DeviceReduceKernelINS0_18DeviceReducePolicyIiiNS0_3SumEE9Policy600EPiiS3_iEEvT0_PT3_T1_NS0_13GridEvenShareISA_EET2_
		Function : _ZN3cub17CUB_200100_860_NS28DeviceReduceSingleTileKernelINS0_18DeviceReducePolicyIiiNS0_3SumEE9Policy600EPiS6_iS3_iiEEvT0_T1_T2_T3_T4_
		Function : _ZN3cub17CUB_200100_860_NS11EmptyKernelIvEEvv
		Function : _ZN3cub21CUB_200100_750_800_NS18DeviceReduceKernelINS0_18DeviceReducePolicyIiiNS0_3SumEE9Policy600EPiiS3_iEEvT0_PT3_T1_NS0_13GridEvenShareISA_EET2_
		Function : _ZN3cub21CUB_200100_750_800_NS28DeviceReduceSingleTileKernelINS0_18DeviceReducePolicyIiiNS0_3SumEE9Policy600EPiS6_iS3_iiEEvT0_T1_T2_T3_T4_
		Function : _ZN3cub21CUB_200100_750_800_NS11EmptyKernelIvEEvv
		Function : _ZN3cub21CUB_200100_750_800_NS18DeviceReduceKernelINS0_18DeviceReducePolicyIiiNS0_3SumEE9Policy600EPiiS3_iEEvT0_PT3_T1_NS0_13GridEvenShareISA_EET2_
		Function : _ZN3cub21CUB_200100_750_800_NS28DeviceReduceSingleTileKernelINS0_18DeviceReducePolicyIiiNS0_3SumEE9Policy600EPiS6_iS3_iiEEvT0_T1_T2_T3_T4_
		Function : _ZN3cub21CUB_200100_750_800_NS11EmptyKernelIvEEvv

When compiled with -rdc=false for same architectures:

:cuobjdump --dump-sass a.out | rg Function
		Function : _ZN3cub17CUB_200100_860_NS18DeviceReduceKernelINS0_18DeviceReducePolicyIiiNS0_3SumEE9Policy600EPiiS3_iEEvT0_PT3_T1_NS0_13GridEvenShareISA_EET2_
		Function : _ZN3cub17CUB_200100_860_NS28DeviceReduceSingleTileKernelINS0_18DeviceReducePolicyIiiNS0_3SumEE9Policy600EPiS6_iS3_iiEEvT0_T1_T2_T3_T4_
		Function : _ZN3cub17CUB_200100_860_NS11EmptyKernelIvEEvv
		Function : _ZN3cub17CUB_200100_860_NS18DeviceReduceKernelINS0_18DeviceReducePolicyIiiNS0_3SumEE9Policy600EPiiS3_iEEvT0_PT3_T1_NS0_13GridEvenShareISA_EET2_
		Function : _ZN3cub17CUB_200100_860_NS28DeviceReduceSingleTileKernelINS0_18DeviceReducePolicyIiiNS0_3SumEE9Policy600EPiS6_iS3_iiEEvT0_T1_T2_T3_T4_
		Function : _ZN3cub17CUB_200100_860_NS11EmptyKernelIvEEvv

When compiled with -rdc=true for same architectures:

:cuobjdump --dump-sass a.out | rg Function
		Function : __cuda_sm70_shflsync_down
		Function : __cuda_reduxsync_s32_add
		Function : _ZN3cub17CUB_200100_860_NS18DeviceReduceKernelINS0_18DeviceReducePolicyIiiNS0_3SumEE9Policy600EPiiS3_iEEvT0_PT3_T1_NS0_13GridEvenShareISA_EET2_
		Function : _ZN3cub17CUB_200100_860_NS28DeviceReduceSingleTileKernelINS0_18DeviceReducePolicyIiiNS0_3SumEE9Policy600EPiS6_iS3_iiEEvT0_T1_T2_T3_T4_
		Function : _ZN3cub17CUB_200100_860_NS11EmptyKernelIvEEvv

Issues

  • This solution breaks ABI. If someone is using cub::DoubleBuffer as a function parameter in a precompiled libraries, we'll break the code. CUB doesn't document it's ABI guarantees, so additional discussion is needed. We might break this in 2.0. Alternatively, we might take the cub::DoubleBuffer out of the inner namespace.
  • There's still no guarantee which set of architectures is going to be used when CUB is used inside template function and different TUs are compiled for different architectures. To address this on our side, we have to adopt the modifications on the Thrust side.

TODO

  • Confirm/Reject the approach
  • Discuss ABI guarantees
  • Document new code
  • Adopt mentioned approach in Thrust

Copy link
Collaborator

@miscco miscco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is actually not too bad.

cub/util_namespace.cuh Outdated Show resolved Hide resolved
@gevtushenko
Copy link
Collaborator Author

@brycelelbach suggested to use something like CUB_ABI_VERSION instead of CUB_VERSION in the inline namespace so that we only break ABI explicitly.

@@ -108,6 +108,60 @@
#define CUB_NS_QUALIFIER ::cub
#endif

#define CUB_COUNT_N(_1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, _13, _14, _15, N, ...) N
#define CUB_COUNT(...) \
CUB_IDENTITY(CUB_COUNT_N(__VA_ARGS__, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1))
Copy link
Collaborator

@robertmaynard robertmaynard Aug 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be needed to support 20. As it currently stands compiling with -arch=all will generate 14 entries for x86-64 machines. With a new generation of hardware we will easily overflow the 15 count limit.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I wonder if there is some clever way to collapse all the archs down into a single number or some other condensed representation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jrhemstad having namespace name readable has advantage in my understanding. If you have a binary, you can see why you have multiple symbols. Can you share a value in the single number representation?

@alliepiper
Copy link
Collaborator

use something like CUB_ABI_VERSION instead of CUB_VERSION in the inline namespace so that we only break ABI explicitly.

I think this is a good goal, but would like to stick with CUB_VERSION for now. We generally aren't concerned with ABI breaks, and it'd be easy for one to slip in without us noticing.

If we had some way to automatically test for ABI breaks I'd be much more comfortable with this idea.

@jrhemstad
Copy link
Collaborator

I think this is a good goal, but would like to stick with CUB_VERSION for now. We generally aren't concerned with ABI breaks, and it'd be easy for one to slip in without us noticing.

If we had some way to automatically test for ABI breaks I'd be much more comfortable with this idea.

What are some examples of types in CUB where we'd even be concerned with ABI?

To be honest, I can't even think of anything from CUB that would be part of someone's binary interface.

I guess technically someone could have a cub::DeviceReduce object as a member of their type, but that would be kind of strange.

@robertmaynard
Copy link
Collaborator

Looking at the impact of this change on a large library like libcudf.

mode before after
75 264MB 270MB
60, 70, 75, 80, 86 528MB 538MB

Overall this change looks to have a minimal impact on binary size, which is great to see.

@gevtushenko
Copy link
Collaborator Author

I think this is a good goal, but would like to stick with CUB_VERSION for now. We generally aren't concerned with ABI breaks, and it'd be easy for one to slip in without us noticing.
If we had some way to automatically test for ABI breaks I'd be much more comfortable with this idea.

What are some examples of types in CUB where we'd even be concerned with ABI?

To be honest, I can't even think of anything from CUB that would be part of someone's binary interface.

I guess technically someone could have a cub::DeviceReduce object as a member of their type, but that would be kind of strange.

@jrhemstad one of the examples might be cub::DoubleBuffer.

@gevtushenko
Copy link
Collaborator Author

Looking at the impact of this change on a large library like libcudf.

mode before after
75 264MB 270MB
60, 70, 75, 80, 86 528MB 538MB
Overall this change looks to have a minimal impact on binary size, which is great to see.

@robertmaynard just to make sure, does libcudf use -rdc=true?

@robertmaynard
Copy link
Collaborator

robertmaynard commented Aug 17, 2022

Looking at the impact of this change on a large library like libcudf.
mode before after
75 264MB 270MB
60, 70, 75, 80, 86 528MB 538MB
Overall this change looks to have a minimal impact on binary size, which is great to see.

@robertmaynard just to make sure, does libcudf use -rdc=true?

No, whole compilation only. I expect the size increase comes from the increase in symbol name length.

Copy link
Collaborator

@alliepiper alliepiper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some changes requested, but overall I'm ok with the approach here.

cub/util_namespace.cuh Outdated Show resolved Hide resolved
cub/util_namespace.cuh Outdated Show resolved Hide resolved
cub/util_namespace.cuh Show resolved Hide resolved
cub/util_namespace.cuh Outdated Show resolved Hide resolved
cub/util_namespace.cuh Outdated Show resolved Hide resolved
cub/util_namespace.cuh Outdated Show resolved Hide resolved
@gevtushenko gevtushenko changed the title [WIP] Alternative fix of CUB linkage Fix of CUB linkage Sep 2, 2022
gevtushenko added a commit to gevtushenko/thrust that referenced this pull request Sep 2, 2022
@gevtushenko gevtushenko added this to the 2.1.0 milestone Sep 3, 2022
@gevtushenko gevtushenko merged commit 1fa663d into NVIDIA:main Sep 3, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
testing: gpuCI passed Passed gpuCI testing.
Projects
Archived in project
5 participants