
[BUG] find a way to stop allocating from RMM on the shuffle-client thread #1711

Closed
abellina opened this issue Feb 10, 2021 · 4 comments
Labels: bug (Something isn't working), performance (A performance related task/issue), shuffle (things that impact the shuffle plugin)

Comments

@abellina
Collaborator

We have recently seen traces where a train of cudaStreamSynchronize calls appears after a concat (the shuffle iterator gets a batch on next, we concatenate the batches, and then many synchronizes follow). This is happening because the shuffle-client thread can allocate RMM buffers that are smaller than the minimum_superblock_size (256 KiB), and then hand them off to another thread (the task thread) to actually perform the work and eventually free them.

When ARENA frees the buffer on our Spark task thread/stream, it wants to place the freed block in the global arena, because the allocation came from the shuffle-client arena (i.e. this is an "external" buffer). The synchronization done here helps prevent another stream from taking the buffer from the global arena and starting to write to it before it is safe.
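To make the cost concrete, here is a heavily simplified sketch of the free path described above. The names and structure are hypothetical, not the actual `rmm::mr::arena_memory_resource` code: the point is only that a block owned by the local arena is returned cheaply, while an "external" block triggers a stream synchronize before it can go back to the global arena.

```cpp
// Hypothetical, heavily simplified sketch of the free path described above;
// NOT the actual rmm::mr::arena_memory_resource code, just its general shape.
#include <cuda_runtime_api.h>
#include <cstddef>
#include <set>

struct arena {
  std::set<void*> free_blocks;  // blocks currently held by this (per-thread) arena
  std::set<void*> owned;        // allocations that originated from this arena

  bool owns(void* p) const { return owned.count(p) != 0; }
};

// Global arena shared by all streams (placeholder stand-in).
inline std::set<void*>& global_arena() { static std::set<void*> g; return g; }

void arena_deallocate(arena& a, void* p, cudaStream_t stream)
{
  if (a.owns(p)) {
    // Fast path: the block goes straight back into the local arena, no CUDA calls.
    a.free_blocks.insert(p);
  } else {
    // "External" block: it came from another thread's arena (here, the
    // shuffle-client thread), so it must be returned to the global arena.
    // Synchronize the stream first so another stream cannot take this memory
    // from the global arena and write to it before prior work has finished.
    cudaStreamSynchronize(stream);
    global_arena().insert(p);
  }
}
```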

This task is to figure out a way to have the task thread do the allocation instead of the shuffle-client thread, so that frees are "free" in this case: the blocks will be "deposited" back into the local stream arena instead of going to the global arena.

@jlowe @rongou fyi.

@abellina abellina added bug Something isn't working ? - Needs Triage Need team to review and classify shuffle things that impact the shuffle plugin performance A performance related task/issue labels Feb 10, 2021
@rongou
Collaborator

rongou commented Feb 11, 2021

We talked about adding a bulk free API to RMM, but I think a less intrusive option is to make the arena allocator accumulate these small buffers until the total size reaches a certain threshold, and then free them all at once. I'll work on a PR for that.
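A rough sketch of that idea, with assumed names and structure (this is not the actual PR): rather than synchronizing on every small "foreign" free, the arena could stash the blocks and flush them to the global arena in one batch once the accumulated size crosses a threshold, paying for a single synchronization.

```cpp
// Hypothetical sketch of the "accumulate, then bulk free" idea; names,
// threshold handling, and structure are assumptions, not the actual RMM change.
#include <cuda_runtime_api.h>
#include <cstddef>
#include <vector>

struct foreign_block { void* ptr; std::size_t size; };

class deferred_free_list {
 public:
  explicit deferred_free_list(std::size_t threshold) : threshold_(threshold) {}

  // Called instead of immediately returning a foreign block to the global arena.
  void add(foreign_block b, cudaStream_t stream)
  {
    pending_.push_back(b);
    pending_bytes_ += b.size;
    if (pending_bytes_ >= threshold_) { flush(stream); }
  }

  // One synchronization covers the whole batch of small buffers.
  void flush(cudaStream_t stream)
  {
    if (pending_.empty()) { return; }
    cudaStreamSynchronize(stream);
    for (auto const& b : pending_) { return_to_global_arena(b); }
    pending_.clear();
    pending_bytes_ = 0;
  }

 private:
  static void return_to_global_arena(foreign_block const&) { /* placeholder */ }

  std::vector<foreign_block> pending_;
  std::size_t pending_bytes_{0};
  std::size_t threshold_;
};
```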

@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Feb 16, 2021
@sameerz sameerz added this to the Mar 15 - March 26 milestone Mar 16, 2021
rapids-bot bot pushed a commit to rapidsai/rmm that referenced this issue Mar 17, 2021
… by default (#732)

This is done similarly to #702.

Previously `arena_memory_resource` maintained a set of allocated blocks, but this was only used for reporting/debugging purposes. Maintaining this set requires a `set::find` at every deallocation, which can get expensive when there are many allocated blocks. This PR moves the tracking behind a default-undefined preprocessor flag. This results in some speedup in the random allocations benchmark for `arena_memory_resource`. Tracking can be enabled by defining `RMM_POOL_TRACK_ALLOCATIONS`.
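For illustration, a rough sketch of what putting the tracking behind a preprocessor flag looks like; the names here are simplified stand-ins, not the literal `arena_memory_resource` code.

```cpp
// Rough illustration of guarding debug-only allocation tracking behind a macro,
// as the PR describes; simplified names, not the actual arena_memory_resource.
#include <cstddef>
#include <set>

class arena_mr_sketch {
 public:
  void* do_allocate(std::size_t bytes)
  {
    void* p = allocate_from_arena(bytes);
#ifdef RMM_POOL_TRACK_ALLOCATIONS
    allocated_blocks_.insert(p);  // only needed for reporting/debugging
#endif
    return p;
  }

  void do_deallocate(void* p, std::size_t bytes)
  {
#ifdef RMM_POOL_TRACK_ALLOCATIONS
    allocated_blocks_.erase(p);   // previously an unconditional set lookup per free
#endif
    return_to_arena(p, bytes);
  }

 private:
  void* allocate_from_arena(std::size_t) { return nullptr; /* placeholder */ }
  void return_to_arena(void*, std::size_t) { /* placeholder */ }
#ifdef RMM_POOL_TRACK_ALLOCATIONS
  std::set<void*> allocated_blocks_;
#endif
};
```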

This should also fix the Spark small shuffle buffer issue: NVIDIA/spark-rapids#1711

Before:
```console
------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations
------------------------------------------------------------------------------------
BM_RandomAllocations/arena_mr/1000/1            1.36 ms         1.36 ms          457
BM_RandomAllocations/arena_mr/1000/4            1.21 ms         1.21 ms          517
BM_RandomAllocations/arena_mr/1000/64           1.22 ms         1.22 ms          496
BM_RandomAllocations/arena_mr/1000/256          1.08 ms         1.07 ms          535
BM_RandomAllocations/arena_mr/1000/1024        0.949 ms        0.948 ms          583
BM_RandomAllocations/arena_mr/1000/4096        0.853 ms        0.848 ms          680
BM_RandomAllocations/arena_mr/10000/1           98.7 ms         98.3 ms            8
BM_RandomAllocations/arena_mr/10000/4           65.4 ms         65.4 ms            9
BM_RandomAllocations/arena_mr/10000/64          16.6 ms         16.5 ms           38
BM_RandomAllocations/arena_mr/10000/256         11.2 ms         11.2 ms           48
BM_RandomAllocations/arena_mr/10000/1024        9.45 ms         9.44 ms           62
BM_RandomAllocations/arena_mr/10000/4096        9.24 ms         9.20 ms           59
BM_RandomAllocations/arena_mr/100000/1          7536 ms         7536 ms            1
BM_RandomAllocations/arena_mr/100000/4          3002 ms         3002 ms            1
BM_RandomAllocations/arena_mr/100000/64          170 ms          170 ms            3
BM_RandomAllocations/arena_mr/100000/256         107 ms          107 ms            7
BM_RandomAllocations/arena_mr/100000/1024       96.0 ms         95.7 ms            6
BM_RandomAllocations/arena_mr/100000/4096       86.7 ms         86.7 ms            6
```

After:
```console
------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations
------------------------------------------------------------------------------------
BM_RandomAllocations/arena_mr/1000/1            1.20 ms         1.20 ms          519
BM_RandomAllocations/arena_mr/1000/4            1.08 ms         1.08 ms          588
BM_RandomAllocations/arena_mr/1000/64           1.11 ms         1.11 ms          552
BM_RandomAllocations/arena_mr/1000/256         0.957 ms        0.957 ms          611
BM_RandomAllocations/arena_mr/1000/1024        0.857 ms        0.857 ms          687
BM_RandomAllocations/arena_mr/1000/4096        0.795 ms        0.793 ms          724
BM_RandomAllocations/arena_mr/10000/1           73.0 ms         73.0 ms           10
BM_RandomAllocations/arena_mr/10000/4           45.7 ms         45.7 ms           14
BM_RandomAllocations/arena_mr/10000/64          14.4 ms         14.4 ms           40
BM_RandomAllocations/arena_mr/10000/256         9.87 ms         9.82 ms           60
BM_RandomAllocations/arena_mr/10000/1024        8.72 ms         8.72 ms           69
BM_RandomAllocations/arena_mr/10000/4096        7.32 ms         7.30 ms           85
BM_RandomAllocations/arena_mr/100000/1          6384 ms         6384 ms            1
BM_RandomAllocations/arena_mr/100000/4          2480 ms         2480 ms            1
BM_RandomAllocations/arena_mr/100000/64          147 ms          147 ms            5
BM_RandomAllocations/arena_mr/100000/256         103 ms          103 ms            7
BM_RandomAllocations/arena_mr/100000/1024       78.1 ms         78.1 ms            9
BM_RandomAllocations/arena_mr/100000/4096       72.3 ms         72.3 ms            9
```

@abellina

Authors:
  - Rong Ou (@rongou)

Approvers:
  - Mark Harris (@harrism)
  - Conor Hoekstra (@codereport)

URL: #732
@rongou
Collaborator

rongou commented Mar 17, 2021

@abellina this should be fixed when RMM makes it into the next nightly cudf jni jar. Can you verify when you get the chance?

@rongou rongou assigned abellina and unassigned rongou Apr 7, 2021
@abellina
Collaborator Author

@rongou, I can confirm that small buffers are no longer seeing the synchronization train I saw after concat (as they are being returned to RMM using ARENA); in fact, we see no CUDA API calls. With the RMM default allocator, we still see cudaEventRecord on free (I believe this is expected, from having looked at the do_deallocate code path for both in the past).

@rongou
Collaborator

rongou commented Apr 21, 2021

Yes, looks good.
