
[BUG] strided loop overflow observed while joining against build table with 1.2B rows #12108

Closed
abellina opened this issue Nov 9, 2022 · 11 comments · Fixed by NVIDIA/cuCollections#243
Labels: bug (Something isn't working), Spark (Functionality that helps Spark RAPIDS)

Comments

@abellina (Contributor) commented Nov 9, 2022

We are seeing an illegal memory access while joining against a large build table (1.2B keys). We see this overflow in two scenarios, inner and left semi joins; there may be other cases, but these were the two I ran into.

Because we are using size_type in our counting_transform_iterator, the loops in cuco::static_multimap::pair_count and cuco::static_map::insert can overflow if we have a large build side (we see it with 1.2B keys).

I was able to work around this issue by finding the iterators in my query's execution path and constructing them manually using int64_t (abellina@ea59ed1), but I do not know if this is the appropriate solution (or if the solution is to change cuco directly).

This is a follow-on from #12058. Note that after fixing the overflows in this issue, our query ran successfully.

Thanks @nvdbaranec for spending time to help me triage this.

C++ repro case, based on the pair_count kernel but without all the extra pieces (https://github.com/NVIDIA/cuCollections/blob/dev/include/cuco/detail/static_multimap/kernels.cuh#L304):

#include <cudf/detail/iterator.cuh>
#include <cudf/types.hpp>
#include <thrust/iterator/counting_iterator.h>
#include <cooperative_groups.h>
#include <cstddef>
#include <cstdio>

namespace cg = cooperative_groups;

template <uint32_t block_size,
          uint32_t tile_size,
          typename InputIt>
__global__ void strided_loop(InputIt first, InputIt last)
{
  // Mirrors the cooperative-groups setup of the real pair_count kernel
  // (the tile itself is otherwise unused in this stripped-down repro).
  auto tile     = cg::tiled_partition<tile_size>(cg::this_thread_block());
  auto tid      = block_size * blockIdx.x + threadIdx.x;
  auto pair_idx = tid / tile_size;

  while (first + pair_idx < last) {
    // The iterator's internal counter is a cudf::size_type (int32_t); once
    // first + pair_idx pushes it past INT32_MAX, the dereferenced value wraps negative.
    int32_t x = *(first + pair_idx);
    if (x < 0) {
      printf("OVERFLOWED tid: %u x: %d\n", tid, x);
      return;
    }
    pair_idx += (gridDim.x * block_size) / tile_size;
  }
}

int main(int argc, char** argv)
{
  std::size_t num_rows = 1200000000;

  auto it = cudf::detail::make_counting_transform_iterator(0,
    [] __device__(cudf::size_type ix) {
      return ix;
    });

  auto constexpr cgsize     = 2;  // according to: join.hpp
  auto constexpr block_size = 128;
  auto constexpr stride     = 1;
  auto const grid_size = (cgsize * num_rows + stride * block_size - 1) / (stride * block_size);
  auto last            = it + num_rows;
  strided_loop<block_size, cgsize><<<grid_size, block_size, 0>>>(it, last);
  cudaDeviceSynchronize();
}
@abellina added the bug (Something isn't working) and Needs Triage (Need team to review and classify) labels on Nov 9, 2022
@abellina (Contributor, Author) commented Nov 9, 2022

@jrhemstad fyi

@davidwendt (Contributor) commented

@PointKernel

@PointKernel (Member) commented

@abellina In your repro example, num_rows is std::size_t but the index iterator used to loop over rows is int (or cudf::size_type). Changing the iterator index type to std::size_t can get rid of the issue:

  auto it = cudf::detail::make_counting_transform_iterator(0, 
    []__device__(std::size_t/* instead of cudf::size_type */ ix){ 
      return ix;
    });

It's the user's responsibility to make sure the iterator won't overflow when passing it to cuco. I think your solution (abellina@ea59ed1) is the right way to go and std::size_t is probably preferred over int64_t.

@nvdbaranec (Contributor) commented Nov 9, 2022

> It's the user's responsibility to make sure the iterator won't overflow when passing it to cuco. I think your solution (abellina@ea59ed1) is the right way to go and std::size_t is probably preferred over int64_t.

The issue here is this:

pair_idx += (gridDim.x * block_size) / tile_size;

The user can't (or at least shouldn't have to) know that this logic is happening. More specifically: the user can't possibly know that the internals of a function are using their input and potentially doubling it (when (gridDim.x * block_size) / tile_size is also the size of the input).

By changing the incoming iterator we're just tricking the template into getting it right. That's not an ideal solution.

pair_idx really needs to be a size_t here.
So just changing it to size_t isn't enough, because we're working with iterators. But it feels very wrong to have to use an iterator based on int64_t when we know our inputs fit in a size_type. It's purely an implementation detail that we happen to be adding something that will cause it to overflow. I guess the abstraction of the iterator makes it moot.
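
For concreteness, here is a quick back-of-the-envelope check using the numbers from the repro above (host-only and purely illustrative; this is not part of any cudf/cuco code path):

#include <cstdint>
#include <cstdio>
#include <limits>

int main()
{
  // Values taken from the repro above.
  std::int64_t const num_rows   = 1200000000;
  std::int64_t const cg_size    = 2;    // tile_size
  std::int64_t const block_size = 128;
  std::int64_t const grid_size  = (cg_size * num_rows + block_size - 1) / block_size;

  // Per-iteration stride of the strided loop: (gridDim.x * block_size) / tile_size.
  std::int64_t const stride = (grid_size * block_size) / cg_size;  // == num_rows here

  // The last tiles start near pair_idx == num_rows - 1; one stride later they sit near 2 * num_rows.
  std::int64_t const max_pair_idx = (num_rows - 1) + stride;

  std::printf("stride       = %lld\n", static_cast<long long>(stride));
  std::printf("max pair_idx = %lld\n", static_cast<long long>(max_pair_idx));
  std::printf("INT32_MAX    = %d\n", std::numeric_limits<std::int32_t>::max());
  // ~2.4e9 > 2,147,483,647, so a 32-bit signed index (cudf::size_type) wraps.
  return 0;
}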

@nvdbaranec (Contributor) commented

For reference, this is just a slightly different flavor of #10368.

@nvdbaranec (Contributor) commented Nov 9, 2022

Since this is a counting transform iterator, the type being compared here while (first + pair_idx < last) is the type of the counting iterator itself and not what the transform is returning. But the cudf::detail wrapper forces this to be a size_type:

CUDF_HOST_DEVICE inline auto make_counting_transform_iterator(cudf::size_type start, UnaryFunction f)

We can get around that by constructing the iterator ourselves with thrust, but I do still feel like it's odd that the caller should have to know what the range of iterator values might be. If that were the case, wouldn't you have to just always use int64_t to be safe?
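
As a rough sketch of what constructing the iterator ourselves with thrust could look like (the 64-bit counting value type is the point of the sketch; identity_fn is just a placeholder for whatever transform the join path actually needs):

#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <cstdint>

// Placeholder functor standing in for the real transform.
struct identity_fn {
  __host__ __device__ std::int64_t operator()(std::int64_t i) const { return i; }
};

// A counting transform iterator whose counter is 64-bit, so advancing it by
// ~2.4e9 positions cannot wrap the way a cudf::size_type counter does.
inline auto make_wide_counting_transform_iterator()
{
  return thrust::make_transform_iterator(thrust::counting_iterator<std::int64_t>{0},
                                         identity_fn{});
}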

@PointKernel (Member) commented Nov 9, 2022

The issue persists when using thrust::counting_iterator<int> directly. I noticed that thrust::counting_iterator<int>::difference_type is 8 bytes, so if we set the pair_idx type to int64_t, a comparison like first + pair_idx < last should work fine (it doesn't, though) regardless of the iterator value type (cudf::size_type in this case). I suspect something is wrong inside iterator_core_access::distance_from.
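
A minimal host-side check of that difference_type observation (illustration only; the sizes are as reported above, on a 64-bit platform):

#include <thrust/iterator/counting_iterator.h>
#include <type_traits>

using count_it = thrust::counting_iterator<int>;
using diff_t   = count_it::difference_type;

// An 8-byte signed difference type, even though the iterator's counter is a 4-byte int,
// so 64-bit index arithmetic is available on the iterator itself.
static_assert(sizeof(diff_t) == 8, "expected an 8-byte difference_type");
static_assert(std::is_signed<diff_t>::value, "expected a signed difference_type");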

@jrhemstad (Contributor) commented Nov 9, 2022

After giving it some more thought and conferring with my lawyers, I think this is a legit bug in cuco.

When you have an iterator, it, that satisfies random_access_iterator and you do it + N, then there is a requirement that [it, it + N) is a valid range (i.e., doesn't go past the end). Even if you don't dereference the iterator outside the valid range, simply advancing beyond the valid range is UB.

The safer way to write this would be:

template <typename It>
__global__ void kernel(It begin, It end)
{
  auto const n = distance(begin, end);
  auto idx     = threadIdx.x + blockIdx.x * blockDim.x;
  while (idx < n) {
    do_stuff(*(begin + idx));
    idx += blockDim.x * gridDim.x;
  }
}

@PointKernel (Member) commented

> simply advancing beyond the valid range is UB.

Today I learned.

@abellina (Contributor, Author) commented Nov 9, 2022

Yes, @jrhemstad's suggestion doesn't overflow. I believe it's because pair_idx (or idx in his example) is uint32_t, given that threadIdx, blockIdx, and blockDim use unsigned int for their components.

That said, wouldn't this overflow once we go above 4B elements? I know we currently can't address that many elements in cuDF, but I am curious how one might change the kernel to be robust for any input iterator. Should we cast idx to something like It::difference_type?
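
As one possible sketch of that idea (not proposing this as the actual cuco fix): typing the index and stride with the iterator's own difference_type keeps the arithmetic 64-bit for the thrust iterators discussed above; do_stuff is a placeholder for the real per-element work:

#include <thrust/distance.h>
#include <thrust/iterator/iterator_traits.h>

// Placeholder for whatever per-element work the real kernel does.
template <typename T>
__device__ void do_stuff(T) {}

template <typename It>
__global__ void strided_loop_64(It begin, It end)
{
  // For the thrust iterators discussed above, difference_type is a signed 64-bit type,
  // so neither the element count nor the running index is limited to 32 bits.
  using diff_t = typename thrust::iterator_traits<It>::difference_type;

  auto const n      = thrust::distance(begin, end);
  auto idx          = static_cast<diff_t>(threadIdx.x + blockIdx.x * blockDim.x);
  auto const stride = static_cast<diff_t>(blockDim.x) * gridDim.x;

  while (idx < n) {
    do_stuff(*(begin + idx));  // the iterator is only advanced while idx is within [0, n)
    idx += stride;
  }
}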

@sameerz added the Spark (Functionality that helps Spark RAPIDS) label on Nov 10, 2022
@abellina (Contributor, Author) commented

This change looks great so far: NVIDIA/cuCollections#243; my testing hasn't turned up any issues. Will we be able to include it in 22.12?

@bdice removed the Needs Triage (Need team to review and classify) label on Mar 4, 2024