Use "ranger" to prevent grid stride loop overflow #10368

nvdbaranec · 2022-02-28T19:30:02Z

(updated Aug 2023)

Background

We found a kernel indexing overflow issue, first discovered in the fused_concatenate kernels (#10333) and this issue is present in a number of our CUDA kernels that take the following form:

size_type output_index = threadIdx.x + blockIdx.x * blockDim.x;  
while (output_index < output_size) {
  output_index += blockDim.x * gridDim.x;
}

If we have an output_size of say 1.2 billion and a grid size that's the same, the following happens: Some late thread id, say 1.19 billion attempts to add 1.2 billion (blockDim.x * gridDim.x) and overflows the size_type (signed 32 bits).

We made a round of fixes in #10448, and then later found another instance of this error in #13838. Our first pass of investigation was not adequate to contain the issue, so we need to take another close look.

Part 1 - First pass fix kernels with this issue

Source file	Kernels	Status
`copying/concatenate.cu`	`fused_concatenate_kernel`	#10448
`valid_if.cuh`	`valid_if_kernel`	#10448
`scatter.cu`	`marking_bitmask_kernel`	#10448
`replace/nulls.cu`	`replace_nulls_strings`	#10448
`replace/nulls.cu`	`replace_nulls`	#10448
`rolling/rolling_detail.cuh`	`gpu_rolling`	#10448
`rolling/jit/kernel.cu`	`gpu_rolling_new`	#10448
`transform/compute_column.cu`	`compute_column_kernel`	#10448
`copying/concatenate.cu`	`fused_concatenate_string_offset_kernel`	#13838
`replace/replace.cu`	`replace_strings_first_pass` `replace_strings_second_pass` `replace_kernel`	#13905
`copying/concatenate.cu`	`concatenate_masks_kernel` `fused_concatenate_string_offset_kernel` `fused_concatenate_string_chars_kernel` `fused_concatenate_kernel` (int64)	#13906
`hash/helper_functions.cuh`	`init_hashtbl`	#13895
`null_mask.cu`	`set_null_mask_kernel` `copy_offset_bitmask` `count_set_bits_kernel`	#13895
`transform/row_bit_count.cu`	`compute_row_sizes`	#13895
`multibyte_split.cu`	`multibyte_split_init_kernel` `multibyte_split_seed_kernel` (auto??) `multibyte_split_kernel`	#13910
IO modules: parquet, orc, json		#13910
`io/utilities/parsing_utils.cu`	`count_and_set_positions` (uint64_t)	#13910
`conditional_join_kernels.cuh`	`compute_conditional_join_output_size` `conditional_join`	#13971
`merge.cu`	`materialize_merged_bitmask_kernel`	#13972
`partitioning.cu`	`compute_row_partition_numbers` `compute_row_output_locations` `copy_block_partitions`	#13973
`json_path.cu`	`get_json_object_kernel`	#13962
`tdigest`	`compute_percentiles_kernel` (int)	#13962
`strings/attributes.cu`	`count_characters_parallel_fn`	#13968
`strings/convert/convert_urls.cu`	`url_decode_char_counter` (int) `url_decode_char_replacer` (int)	#13968
`text/subword/data_normalizer.cu`	`kernel_data_normalizer` (uint32_t)	#13915
`text/subword/subword_tokenize.cu`	`kernel_compute_tensor_metadata` (uint32_t)	#13915
`text/subword/wordpiece_tokenizer.cu`	`init_data_and_mark_word_start_and_ends` (uint32_t) `mark_string_start_and_ends` (uint32_t) `kernel_wordpiece_tokenizer` (uint32_t)	#13915

Part 2 - Take another pass over more challenging kernels

Source file	Kernels	Status
null_mash.cuh	subtract_set_bits_range_boundaries_kernel
valid_if.cuh	valid_if_n_kernel
copy_if_else.cuh	copy_if_else_kernel
gather.cuh	gather_chars_fn_string_parallel
more?	search `gridDim.x` or `blockDim.x` to find more examples

Part 3 - Use ranger to prevent grid stride loop overflow

incorporate the ranger header as a libcudf utility
use ranger instead of manual indexing in libcudf kernels

Additional information

There are also a number of kernels that have this pattern but probably don't ever overflow because they are indexing by bitmask words. (Example)
Additional, In this kernel, source_idx probably overflows, but harmlessly.

A snippet of code to see this in action:

size_type const size = 1200000000;
auto big = cudf::make_fixed_width_column(data_type{type_id::INT32}, size, mask_state::UNALLOCATED);  
auto x = cudf::rolling_window(*big, 1, 1, 1, cudf::detail::sum_aggregation{});

Note: rmm may mask out of bounds accesses in some cases, so it's helpful to run with the plain cuda allocator.

The text was updated successfully, but these errors were encountered:

nvdbaranec · 2022-02-28T19:35:52Z

The fix in basically all of these cases is quite simple: just make the index a size_t

jrhemstad · 2022-02-28T19:53:46Z

I'd love to just add an algorithm to do this.

https://godbolt.org/z/hK95z7zff

harrism · 2022-03-15T05:29:29Z

Or even a simple range helper for for loops: https://github.com/harrism/hemi#simple-grid-stride-loops

harrism · 2022-03-15T05:32:31Z

The fix in basically all of these cases is quite simple: just make the index a size_t

I think the general approach should be:

Use an algorithm (thrust:: or std::) if possible before ever writing a custom kernel -- this way you write a per-element device functor instead, and indexing is handled for you.
If a custom kernel must be written, it should use device-side algorithms instead of raw loops.
If a raw grid-stride loop is required and an existing algorithm won't work, we should provide utilities to abstract the iteration and or the range and use auto for the type to avoid these mistakes.

Partially addresses #10368 Specifically: - `valid_if` - `scatter` - `rolling_window` - `compute_column_kernel` (ast stuff) - `replace_nulls` (fixed-width and strings) The majority of the fixes are simply making the indexing variable a `std::size_t` instead of a `cudf::size_type`. Although scatter had an additional place it was overflowing outside the kernel. I didn't add tests for these fixes, but each of them were individually tested locally to make sure they actually manifested the issue and then were verified with the fixes. Authors: - https://github.com/nvdbaranec Approvers: - Bradley Dice (https://github.com/bdice) - Mike Wilson (https://github.com/hyperbolic2346) - Mark Harris (https://github.com/harrism) - Nghia Truong (https://github.com/ttnghia) URL: #10448

github-actions · 2022-04-16T01:30:43Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

nvdbaranec · 2022-04-19T15:31:47Z

Still relevant.

harrism · 2022-04-26T23:29:16Z

Created https://github.com/harrism/ranger as a solution to this. Needs to be moved into libcudf.

github-actions · 2022-05-27T00:11:00Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

nvdbaranec · 2022-05-27T14:43:14Z

Still relevant.

github-actions · 2022-06-26T15:03:10Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

nvdbaranec · 2022-06-27T15:25:51Z

Still relevant.

github-actions · 2022-09-26T05:30:51Z

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

GregoryKimball · 2022-11-29T01:12:59Z

Thanks @harrism for creating the ranger repo! Do you think we are ready to kick off an integration with libcudf, or does ranger need more development first?

harrism · 2023-06-06T08:59:43Z

@GregoryKimball I have now created a PR to use ranger in libcuspatial. You guys could use this as an example if you want to do the same in libcudf. rapidsai/cuspatial#1178

wence- · 2023-08-10T16:56:32Z

wrt attempting to find locations where this might be happening. In host code, clang and gcc will warn if you add -Wsign-conversion (not covered by -Wall -Wextra) under some circumstances. Unfortunately there is no such option for nvcc.

#include <cstdint>
int what(int upper)
{
  int i = 0; // no warning if this is a std::int64_t
  unsigned int stride = 10;
  while (i < upper) {
    i = i + stride; // clang warns for this, so does gcc
  }
  i = 0;
  while (i < upper) {
    i += stride; // gcc warns for this, clang does not.
  }
  return i;
}

This PR adds `grid_1d::grid_stride()` and uses it in a handful of kernels. Follow-up to #13910, which added a `grid_1d::global_thread_id()`. We'll need to do a later PR that catches any missing instances where this should be used, since there are a large number of PRs in flight touching thread indexing code in various files. See #10368. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Yunsong Wang (https://github.com/PointKernel) - Vyas Ramasubramani (https://github.com/vyasr) URL: #13996

…joins (#13971) See #10368 (and more recently #13771 Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Bradley Dice (https://github.com/bdice) - Yunsong Wang (https://github.com/PointKernel) - David Wendt (https://github.com/davidwendt) URL: #13971

This PR refactors a few kernels to use `thread_index_type` and associated utilities. I started this before realizing how much scope was still left in issue #10368 ("Part 2 - Take another pass over more challenging kernels"), and then I stopped working on this due to time constraints. For the moment, I hope this PR makes a small dent in the number of remaining kernels to convert to using `thread_index_type`. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - MithunR (https://github.com/mythrocks) - Mark Harris (https://github.com/harrism) - David Wendt (https://github.com/davidwendt) URL: #14107

nvdbaranec added bug Something isn't working Needs Triage Need team to review and classify labels Feb 28, 2022

This was referenced Mar 17, 2022

Batch of fixes for index overflows in grid stride loops. #10448

Merged

Column to JCUDF row for tables with strings #10235

Merged

sameerz mentioned this issue Apr 12, 2022

[TASK] Big Reliability Epic NVIDIA/spark-rapids#1870

Closed

14 tasks

github-actions bot added the inactive-30d label Apr 16, 2022

github-actions bot removed the inactive-30d label Apr 19, 2022

github-actions bot added the inactive-30d label Jun 26, 2022

github-actions bot removed the inactive-30d label Jun 27, 2022

GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jun 28, 2022

github-actions bot added the inactive-90d label Sep 26, 2022

nvdbaranec mentioned this issue Nov 9, 2022

[BUG] strided loop overflow observed while joining against build table with 1.2B rows #12108

Closed

GregoryKimball removed the inactive-90d label Apr 3, 2023

GregoryKimball added this to the Helps developer velocity milestone Apr 3, 2023

GregoryKimball changed the title ~~[BUG] Issues with grid stride loops overflowing in multiple cudf kernels.~~ Use mharris' "ranger" to prevent grid stride loop overflow Apr 3, 2023

GregoryKimball changed the title ~~Use mharris' "ranger" to prevent grid stride loop overflow~~ Use "ranger" to prevent grid stride loop overflow Apr 3, 2023

GregoryKimball added 1 - On Deck To be worked on next Spark Functionality that helps Spark RAPIDS helps: Dask labels Aug 10, 2023

GregoryKimball mentioned this issue Aug 10, 2023

[BUG] Errors converting tables from arrow to cuDF #13771

Closed

GregoryKimball modified the milestones: Helps developer velocity, Stabilizing large workflows (OOM, spilling, partitioning) Aug 10, 2023

vyasr mentioned this issue Aug 18, 2023

Add minhash support for MurmurHash3_x64_128 #13796

Merged

3 tasks

vyasr mentioned this issue Aug 25, 2023

Use thread_index_type to avoid out of bounds accesses in conditional joins #13971

Merged

3 tasks

bdice mentioned this issue Aug 29, 2023

Use grid_stride for stride computations. #13996

Merged

3 tasks

bdice mentioned this issue Sep 13, 2023

Some additional kernel thread index refactoring. #14107

Merged

3 tasks

wence- mentioned this issue Sep 18, 2023

Long string optimization for string column parsing in JSON reader #13803

Merged

3 tasks

bdice mentioned this issue Jan 2, 2024

[DRAFT] refactor to grid stride for data_type_detection kernel #14694

Open

3 tasks

vyasr removed the helps: Dask label Feb 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use "ranger" to prevent grid stride loop overflow #10368

Use "ranger" to prevent grid stride loop overflow #10368

nvdbaranec commented Feb 28, 2022 •

edited by GregoryKimball

Loading

nvdbaranec commented Feb 28, 2022

jrhemstad commented Feb 28, 2022

harrism commented Mar 15, 2022

harrism commented Mar 15, 2022

github-actions bot commented Apr 16, 2022

nvdbaranec commented Apr 19, 2022

harrism commented Apr 26, 2022

github-actions bot commented May 27, 2022

nvdbaranec commented May 27, 2022

github-actions bot commented Jun 26, 2022

nvdbaranec commented Jun 27, 2022

github-actions bot commented Sep 26, 2022

GregoryKimball commented Nov 29, 2022

harrism commented Jun 6, 2023

wence- commented Aug 10, 2023

Use "ranger" to prevent grid stride loop overflow #10368

Use "ranger" to prevent grid stride loop overflow #10368

Comments

nvdbaranec commented Feb 28, 2022 • edited by GregoryKimball Loading

Background

Part 1 - First pass fix kernels with this issue

Part 2 - Take another pass over more challenging kernels

Part 3 - Use ranger to prevent grid stride loop overflow

Additional information

nvdbaranec commented Feb 28, 2022

jrhemstad commented Feb 28, 2022

harrism commented Mar 15, 2022

harrism commented Mar 15, 2022

github-actions bot commented Apr 16, 2022

nvdbaranec commented Apr 19, 2022

harrism commented Apr 26, 2022

github-actions bot commented May 27, 2022

nvdbaranec commented May 27, 2022

github-actions bot commented Jun 26, 2022

nvdbaranec commented Jun 27, 2022

github-actions bot commented Sep 26, 2022

GregoryKimball commented Nov 29, 2022

harrism commented Jun 6, 2023

wence- commented Aug 10, 2023

nvdbaranec commented Feb 28, 2022 •

edited by GregoryKimball

Loading