
[BUG] find a way to stop allocating from RMM on the shuffle-client thread #1711

Closed
abellina opened this issue Feb 10, 2021 · 4 comments
Labels: bug (Something isn't working), performance (A performance related task/issue), shuffle (things that impact the shuffle plugin)

Comments

@abellina
Collaborator

We have recently seen traces where a train of cudaStreamSynchronize calls appears after a concat (the shuffle iterator gets a batch on next, we concatenate the batches, and then many synchronizes follow). This is happening because the shuffle-client thread can allocate RMM buffers that are smaller than the minimum_superblock_size (256 KiB), and then hand them off to another thread (the task thread) to actually perform the work and eventually free them.

When ARENA frees the buffer on our Spark task thread/stream, it wants to place the freed block in the global arena, because the allocation came from the shuffle-client arena (i.e. this is an "external" buffer). The synchronization done here helps prevent another stream from taking the buffer from the global arena and starting to write to it before it is safe.
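To make the cost concrete, here is a heavily simplified sketch of the free path described above. The names and structure are hypothetical, not the actual `rmm::mr::arena_memory_resource` code: the point is only that a block owned by the local arena is returned cheaply, while an "external" block triggers a stream synchronize before it can go back to the global arena.

```cpp
// Hypothetical, heavily simplified sketch of the free path described above;
// NOT the actual rmm::mr::arena_memory_resource code, just its general shape.
#include <cuda_runtime_api.h>
#include <cstddef>
#include <set>

struct arena {
  std::set<void*> free_blocks;  // blocks currently held by this (per-thread) arena
  std::set<void*> owned;        // allocations that originated from this arena

  bool owns(void* p) const { return owned.count(p) != 0; }
};

// Global arena shared by all streams (placeholder stand-in).
inline std::set<void*>& global_arena() { static std::set<void*> g; return g; }

void arena_deallocate(arena& a, void* p, cudaStream_t stream)
{
  if (a.owns(p)) {
    // Fast path: the block goes straight back into the local arena, no CUDA calls.
    a.free_blocks.insert(p);
  } else {
    // "External" block: it came from another thread's arena (here, the
    // shuffle-client thread), so it must be returned to the global arena.
    // Synchronize the stream first so another stream cannot take this memory
    // from the global arena and write to it before prior work has finished.
    cudaStreamSynchronize(stream);
    global_arena().insert(p);
  }
}
```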

This task is to figure out a way to have the task thread do the allocation instead of the shuffle-client thread, so that frees are "free" in this case: the blocks will be "deposited" back into the local stream arena instead of going to the global arena.

@jlowe @rongou fyi.

@abellina abellina added bug Something isn't working ? - Needs Triage Need team to review and classify shuffle things that impact the shuffle plugin performance A performance related task/issue labels Feb 10, 2021
@rongou
Collaborator

rongou commented Feb 11, 2021

We talked about adding a bulk free API to RMM, but I think a less intrusive option is to make the arena allocator accumulate these small buffers until the total size reaches a certain threshold, and then free them all at once. I'll work on a PR for that.
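A rough sketch of that idea, with assumed names and structure (this is not the actual PR): rather than synchronizing on every small "foreign" free, the arena could stash the blocks and flush them to the global arena in one batch once the accumulated size crosses a threshold, paying for a single synchronization.

```cpp
// Hypothetical sketch of the "accumulate, then bulk free" idea; names,
// threshold handling, and structure are assumptions, not the actual RMM change.
#include <cuda_runtime_api.h>
#include <cstddef>
#include <vector>

struct foreign_block { void* ptr; std::size_t size; };

class deferred_free_list {
 public:
  explicit deferred_free_list(std::size_t threshold) : threshold_(threshold) {}

  // Called instead of immediately returning a foreign block to the global arena.
  void add(foreign_block b, cudaStream_t stream)
  {
    pending_.push_back(b);
    pending_bytes_ += b.size;
    if (pending_bytes_ >= threshold_) { flush(stream); }
  }

  // One synchronization covers the whole batch of small buffers.
  void flush(cudaStream_t stream)
  {
    if (pending_.empty()) { return; }
    cudaStreamSynchronize(stream);
    for (auto const& b : pending_) { return_to_global_arena(b); }
    pending_.clear();
    pending_bytes_ = 0;
  }

 private:
  static void return_to_global_arena(foreign_block const&) { /* placeholder */ }

  std::vector<foreign_block> pending_;
  std::size_t pending_bytes_{0};
  std::size_t threshold_;
};
```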

@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Feb 16, 2021
@sameerz sameerz added this to the Mar 15 - March 26 milestone Mar 16, 2021
rapids-bot bot pushed a commit to rapidsai/rmm that referenced this issue Mar 17, 2021
… by default (#732)

This is done similarly to #702.

Previously `arena_memory_resource` maintained a set of allocated blocks, but this was only used for reporting/debugging purposes. Maintaining this set requires a `set::find` at every deallocation, which can get expensive when there are many allocated blocks. This PR moves the tracking behind a default-undefined preprocessor flag. This results in some speedup in the random allocations benchmark for `arena_memory_resource`. Tracking can be enabled by defining `RMM_POOL_TRACK_ALLOCATIONS`.
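For illustration, a rough sketch of what putting the tracking behind a preprocessor flag looks like; the names here are simplified stand-ins, not the literal `arena_memory_resource` code.

```cpp
// Rough illustration of guarding debug-only allocation tracking behind a macro,
// as the PR describes; simplified names, not the actual arena_memory_resource.
#include <cstddef>
#include <set>

class arena_mr_sketch {
 public:
  void* do_allocate(std::size_t bytes)
  {
    void* p = allocate_from_arena(bytes);
#ifdef RMM_POOL_TRACK_ALLOCATIONS
    allocated_blocks_.insert(p);  // only needed for reporting/debugging
#endif
    return p;
  }

  void do_deallocate(void* p, std::size_t bytes)
  {
#ifdef RMM_POOL_TRACK_ALLOCATIONS
    allocated_blocks_.erase(p);   // previously an unconditional set lookup per free
#endif
    return_to_arena(p, bytes);
  }

 private:
  void* allocate_from_arena(std::size_t) { return nullptr; /* placeholder */ }
  void return_to_arena(void*, std::size_t) { /* placeholder */ }
#ifdef RMM_POOL_TRACK_ALLOCATIONS
  std::set<void*> allocated_blocks_;
#endif
};
```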

This should also fix the Spark small shuffle buffer issue: NVIDIA/spark-rapids#1711

Before:
```console
------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations
------------------------------------------------------------------------------------
BM_RandomAllocations/arena_mr/1000/1            1.36 ms         1.36 ms          457
BM_RandomAllocations/arena_mr/1000/4            1.21 ms         1.21 ms          517
BM_RandomAllocations/arena_mr/1000/64           1.22 ms         1.22 ms          496
BM_RandomAllocations/arena_mr/1000/256          1.08 ms         1.07 ms          535
BM_RandomAllocations/arena_mr/1000/1024        0.949 ms        0.948 ms          583
BM_RandomAllocations/arena_mr/1000/4096        0.853 ms        0.848 ms          680
BM_RandomAllocations/arena_mr/10000/1           98.7 ms         98.3 ms            8
BM_RandomAllocations/arena_mr/10000/4           65.4 ms         65.4 ms            9
BM_RandomAllocations/arena_mr/10000/64          16.6 ms         16.5 ms           38
BM_RandomAllocations/arena_mr/10000/256         11.2 ms         11.2 ms           48
BM_RandomAllocations/arena_mr/10000/1024        9.45 ms         9.44 ms           62
BM_RandomAllocations/arena_mr/10000/4096        9.24 ms         9.20 ms           59
BM_RandomAllocations/arena_mr/100000/1          7536 ms         7536 ms            1
BM_RandomAllocations/arena_mr/100000/4          3002 ms         3002 ms            1
BM_RandomAllocations/arena_mr/100000/64          170 ms          170 ms            3
BM_RandomAllocations/arena_mr/100000/256         107 ms          107 ms            7
BM_RandomAllocations/arena_mr/100000/1024       96.0 ms         95.7 ms            6
BM_RandomAllocations/arena_mr/100000/4096       86.7 ms         86.7 ms            6
```

After:
```console
------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations
------------------------------------------------------------------------------------
BM_RandomAllocations/arena_mr/1000/1            1.20 ms         1.20 ms          519
BM_RandomAllocations/arena_mr/1000/4            1.08 ms         1.08 ms          588
BM_RandomAllocations/arena_mr/1000/64           1.11 ms         1.11 ms          552
BM_RandomAllocations/arena_mr/1000/256         0.957 ms        0.957 ms          611
BM_RandomAllocations/arena_mr/1000/1024        0.857 ms        0.857 ms          687
BM_RandomAllocations/arena_mr/1000/4096        0.795 ms        0.793 ms          724
BM_RandomAllocations/arena_mr/10000/1           73.0 ms         73.0 ms           10
BM_RandomAllocations/arena_mr/10000/4           45.7 ms         45.7 ms           14
BM_RandomAllocations/arena_mr/10000/64          14.4 ms         14.4 ms           40
BM_RandomAllocations/arena_mr/10000/256         9.87 ms         9.82 ms           60
BM_RandomAllocations/arena_mr/10000/1024        8.72 ms         8.72 ms           69
BM_RandomAllocations/arena_mr/10000/4096        7.32 ms         7.30 ms           85
BM_RandomAllocations/arena_mr/100000/1          6384 ms         6384 ms            1
BM_RandomAllocations/arena_mr/100000/4          2480 ms         2480 ms            1
BM_RandomAllocations/arena_mr/100000/64          147 ms          147 ms            5
BM_RandomAllocations/arena_mr/100000/256         103 ms          103 ms            7
BM_RandomAllocations/arena_mr/100000/1024       78.1 ms         78.1 ms            9
BM_RandomAllocations/arena_mr/100000/4096       72.3 ms         72.3 ms            9
```

@abellina

Authors:
  - Rong Ou (@rongou)

Approvers:
  - Mark Harris (@harrism)
  - Conor Hoekstra (@codereport)

URL: #732
@rongou
Collaborator

rongou commented Mar 17, 2021

@abellina this should be fixed when RMM makes it into the next nightly cudf jni jar. Can you verify when you get the chance?

@rongou rongou assigned abellina and unassigned rongou Apr 7, 2021
@abellina
Collaborator Author

@rongou, I can confirm that small buffers are no longer seeing the synchronization train I saw after concat (as they are being returned to RMM using ARENA); in fact, we see no CUDA API calls. With the RMM default allocator, we still see cudaEventRecord on free (I believe this is expected, from having looked at the do_deallocate code path for both in the past).

@rongou
Collaborator

rongou commented Apr 21, 2021

Yes, looks good.
