
[BUG] strided loop overflow observed while joining against build table with 1.2B rows #12108

Closed
abellina opened this issue Nov 9, 2022 · 11 comments · Fixed by NVIDIA/cuCollections#243
Labels: bug (Something isn't working), Spark (Functionality that helps Spark RAPIDS)

Comments

@abellina (Contributor) commented Nov 9, 2022

We are seeing an illegal memory access while joining against a large build table (1.2B keys). We see this overflow in two scenarios, inner and left semi joins; there may be other cases, but these were the two I ran into.

Because we are using size_type in our counting_transform_iterator, the loops in cuco::static_multimap::pair_count and cuco::static_map::insert can overflow if we have a large build side (we see it with 1.2B keys).

I was able to work around this issue by finding the iterators in my query's execution path and constructing them manually using int64_t (abellina@ea59ed1), but I do not know if this is the appropriate solution (or if the solution is to change cuco directly).

This is a follow-on from #12058. Note that after fixing the overflows in this issue, our query ran successfully.

Thanks @nvdbaranec for spending time to help me triage this.

C++ repro case, based on the pair_count kernel but without all the extra pieces (https://github.com/NVIDIA/cuCollections/blob/dev/include/cuco/detail/static_multimap/kernels.cuh#L304):

#include <cudf/detail/iterator.cuh>
#include <cudf/types.hpp>
#include <thrust/iterator/counting_iterator.h>
#include <cooperative_groups.h>
#include <cstddef>
#include <cstdio>

namespace cg = cooperative_groups;

template <uint32_t block_size,
          uint32_t tile_size,
          typename InputIt>
__global__ void strided_loop(InputIt first, InputIt last)
{
  // Mirrors the cooperative-groups setup of the real pair_count kernel
  // (the tile itself is otherwise unused in this stripped-down repro).
  auto tile     = cg::tiled_partition<tile_size>(cg::this_thread_block());
  auto tid      = block_size * blockIdx.x + threadIdx.x;
  auto pair_idx = tid / tile_size;

  while (first + pair_idx < last) {
    // The iterator's internal counter is a cudf::size_type (int32_t); once
    // first + pair_idx pushes it past INT32_MAX, the dereferenced value wraps negative.
    int32_t x = *(first + pair_idx);
    if (x < 0) {
      printf("OVERFLOWED tid: %u x: %d\n", tid, x);
      return;
    }
    pair_idx += (gridDim.x * block_size) / tile_size;
  }
}

int main(int argc, char** argv)
{
  std::size_t num_rows = 1200000000;

  auto it = cudf::detail::make_counting_transform_iterator(0,
    [] __device__(cudf::size_type ix) {
      return ix;
    });

  auto constexpr cgsize     = 2;  // according to: join.hpp
  auto constexpr block_size = 128;
  auto constexpr stride     = 1;
  auto const grid_size = (cgsize * num_rows + stride * block_size - 1) / (stride * block_size);
  auto last            = it + num_rows;
  strided_loop<block_size, cgsize><<<grid_size, block_size, 0>>>(it, last);
  cudaDeviceSynchronize();
}
@abellina added the bug (Something isn't working) and Needs Triage (Need team to review and classify) labels on Nov 9, 2022
@abellina (Contributor, Author) commented Nov 9, 2022

@jrhemstad fyi

@davidwendt (Contributor) commented

@PointKernel

@PointKernel (Member) commented

@abellina In your repro example, num_rows is std::size_t but the index iterator used to loop over rows is int (or cudf::size_type). Changing the iterator index type to std::size_t can get rid of the issue:

  auto it = cudf::detail::make_counting_transform_iterator(0, 
    []__device__(std::size_t/* instead of cudf::size_type */ ix){ 
      return ix;
    });

It's the user's responsibility to make sure the iterator won't overflow when passing it to cuco. I think your solution (abellina@ea59ed1) is the right way to go and std::size_t is probably preferred over int64_t.

@nvdbaranec (Contributor) commented Nov 9, 2022

> It's the user's responsibility to make sure the iterator won't overflow when passing it to cuco. I think your solution (abellina@ea59ed1) is the right way to go and std::size_t is probably preferred over int64_t.

The issue here is this:

pair_idx += (gridDim.x * block_size) / tile_size;

The user can't (or at least shouldn't have to) know that this logic is happening. More specifically: the user can't possibly know that the internals of a function are using their input and potentially doubling it (when (gridDim.x * block_size) / tile_size is also the size of the input).

By changing the incoming iterator we're just tricking the template into getting it right. That's not an ideal solution.

pair_idx really needs to be a size_t here.
So just changing it to size_t isn't enough, because we're working with iterators. But it feels very wrong to have to use an iterator based on int64_t when we know our inputs fit in a size_type. It's purely an implementation detail that we happen to be adding something that will cause it to overflow. I guess the abstraction of the iterator makes it moot.
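
For concreteness, here is a quick back-of-the-envelope check using the numbers from the repro above (host-only and purely illustrative; this is not part of any cudf/cuco code path):

#include <cstdint>
#include <cstdio>
#include <limits>

int main()
{
  // Values taken from the repro above.
  std::int64_t const num_rows   = 1200000000;
  std::int64_t const cg_size    = 2;    // tile_size
  std::int64_t const block_size = 128;
  std::int64_t const grid_size  = (cg_size * num_rows + block_size - 1) / block_size;

  // Per-iteration stride of the strided loop: (gridDim.x * block_size) / tile_size.
  std::int64_t const stride = (grid_size * block_size) / cg_size;  // == num_rows here

  // The last tiles start near pair_idx == num_rows - 1; one stride later they sit near 2 * num_rows.
  std::int64_t const max_pair_idx = (num_rows - 1) + stride;

  std::printf("stride       = %lld\n", static_cast<long long>(stride));
  std::printf("max pair_idx = %lld\n", static_cast<long long>(max_pair_idx));
  std::printf("INT32_MAX    = %d\n", std::numeric_limits<std::int32_t>::max());
  // ~2.4e9 > 2,147,483,647, so a 32-bit signed index (cudf::size_type) wraps.
  return 0;
}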

@nvdbaranec (Contributor) commented

For reference, this is just a slightly different flavor of #10368.

@nvdbaranec (Contributor) commented Nov 9, 2022

Since this is a counting transform iterator, the type being compared here while (first + pair_idx < last) is the type of the counting iterator itself and not what the transform is returning. But the cudf::detail wrapper forces this to be a size_type:

CUDF_HOST_DEVICE inline auto make_counting_transform_iterator(cudf::size_type start, UnaryFunction f)

We can get around that by constructing the iterator ourselves with thrust, but I do still feel like it's odd that the caller should have to know what the range of iterator values might be. If that were the case, wouldn't you have to just always use int64_t to be safe?
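
As a rough sketch of what constructing the iterator ourselves with thrust could look like (the 64-bit counting value type is the point of the sketch; identity_fn is just a placeholder for whatever transform the join path actually needs):

#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <cstdint>

// Placeholder functor standing in for the real transform.
struct identity_fn {
  __host__ __device__ std::int64_t operator()(std::int64_t i) const { return i; }
};

// A counting transform iterator whose counter is 64-bit, so advancing it by
// ~2.4e9 positions cannot wrap the way a cudf::size_type counter does.
inline auto make_wide_counting_transform_iterator()
{
  return thrust::make_transform_iterator(thrust::counting_iterator<std::int64_t>{0},
                                         identity_fn{});
}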

@PointKernel (Member) commented Nov 9, 2022

The issue persists when using thrust::counting_iterator<int> directly. I noticed that thrust::counting_iterator<int>::difference_type is 8 bytes, so if we set the pair_idx type to int64_t, a comparison like first + pair_idx < last should work fine (it doesn't, though) regardless of the iterator value type (cudf::size_type in this case). I suspect something is wrong inside iterator_core_access::distance_from.
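
A minimal host-side check of that difference_type observation (illustration only; the sizes are as reported above, on a 64-bit platform):

#include <thrust/iterator/counting_iterator.h>
#include <type_traits>

using count_it = thrust::counting_iterator<int>;
using diff_t   = count_it::difference_type;

// An 8-byte signed difference type, even though the iterator's counter is a 4-byte int,
// so 64-bit index arithmetic is available on the iterator itself.
static_assert(sizeof(diff_t) == 8, "expected an 8-byte difference_type");
static_assert(std::is_signed<diff_t>::value, "expected a signed difference_type");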

@jrhemstad (Contributor) commented Nov 9, 2022

After giving it some more thought and conferring with my lawyers, I think this is a legit bug in cuco.

When you have an iterator, it, that satisfies random_access_iterator and you do it + N, then there is a requirement that [it, it + N) is a valid range (i.e., doesn't go past the end). Even if you don't dereference the iterator outside the valid range, simply advancing beyond the valid range is UB.

The safer way to write this would be:

template <typename It>
__global__ void kernel(It begin, It end)
{
  auto const n = distance(begin, end);
  auto idx     = threadIdx.x + blockIdx.x * blockDim.x;
  while (idx < n) {
    do_stuff(*(begin + idx));
    idx += blockDim.x * gridDim.x;
  }
}

@PointKernel (Member) commented

> simply advancing beyond the valid range is UB.

Today I learned.

@abellina (Contributor, Author) commented Nov 9, 2022

Yes, @jrhemstad's suggestion doesn't overflow. I believe it's because pair_idx (or idx in his example) is uint32_t, given that threadIdx, blockIdx, and blockDim use unsigned int for their components.

That said, wouldn't this overflow once we go above 4B elements? I know we currently can't address that many elements in cuDF, but I am curious how one might change the kernel to be robust for any input iterator. Should we cast idx to something like It::difference_type?
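
As one possible sketch of that idea (not proposing this as the actual cuco fix): typing the index and stride with the iterator's own difference_type keeps the arithmetic 64-bit for the thrust iterators discussed above; do_stuff is a placeholder for the real per-element work:

#include <thrust/distance.h>
#include <thrust/iterator/iterator_traits.h>

// Placeholder for whatever per-element work the real kernel does.
template <typename T>
__device__ void do_stuff(T) {}

template <typename It>
__global__ void strided_loop_64(It begin, It end)
{
  // For the thrust iterators discussed above, difference_type is a signed 64-bit type,
  // so neither the element count nor the running index is limited to 32 bits.
  using diff_t = typename thrust::iterator_traits<It>::difference_type;

  auto const n      = thrust::distance(begin, end);
  auto idx          = static_cast<diff_t>(threadIdx.x + blockIdx.x * blockDim.x);
  auto const stride = static_cast<diff_t>(blockDim.x) * gridDim.x;

  while (idx < n) {
    do_stuff(*(begin + idx));  // the iterator is only advanced while idx is within [0, n)
    idx += stride;
  }
}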

@sameerz added the Spark (Functionality that helps Spark RAPIDS) label on Nov 10, 2022
@abellina (Contributor, Author) commented

This change looks great so far: NVIDIA/cuCollections#243; my testing hasn't turned up any issues. Will we be able to include it in 22.12?

@bdice removed the Needs Triage (Need team to review and classify) label on Mar 4, 2024