[FEA] An equivalent of torch.cuda.max_memory_allocated for pooled resource #1465

Closed
masahi opened this issue Feb 9, 2024 · 2 comments
Labels: ? - Needs Triage (Need team to review and classify), feature request (New feature or request)

Comments


masahi commented Feb 9, 2024

PyTorch has torch.cuda.max_memory_allocated function, which allows me to figure out how much VRAM is remaining for use by an application at any moment. Does rmm have an equivalent function?

My use case is LLM serving: after an initial warm-up step, which consumes some VRAM, I want to use all remaining VRAM to pre-allocate paged cache blocks. To decide the maximum number of cache blocks I can allocate, I need the information that torch.cuda.max_memory_allocated returns, but via rmm.

pool_memory_resource::pool_size() from #962 is not what I need since it includes the size of free blocks.
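
Roughly, what I am after is something like the sketch below, assuming I understand rmm's statistics_resource_adaptor correctly and its get_bytes_counter() exposes a peak byte count (the analogue of max_memory_allocated); please treat the member names as my assumption, to be checked against the installed rmm version:

```cpp
// Sketch only: assumes rmm::mr::statistics_resource_adaptor provides a
// get_bytes_counter() whose .peak member is the high-water mark of bytes
// allocated through it (the analogue of torch.cuda.max_memory_allocated).
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>
#include <rmm/mr/device/statistics_resource_adaptor.hpp>

#include <cstddef>
#include <iostream>

int main() {
  rmm::mr::cuda_memory_resource cuda_mr;
  // 1 GiB initial pool; grows on demand.
  rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool_mr{
      &cuda_mr, std::size_t{1} << 30};
  // Wrap the pool so every allocation/deallocation is counted.
  rmm::mr::statistics_resource_adaptor<decltype(pool_mr)> stats_mr{&pool_mr};
  rmm::mr::set_current_device_resource(&stats_mr);

  {
    // Stand-in for the warm-up inference pass.
    rmm::device_buffer warmup{256 * 1024 * 1024, rmm::cuda_stream_default};
  }

  // Peak allocated bytes observed so far, even after the buffer is freed.
  auto const bytes = stats_mr.get_bytes_counter();
  std::cout << "peak allocated bytes: " << bytes.peak << "\n";
  return 0;
}
```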

masahi added the ? - Needs Triage and feature request labels Feb 9, 2024
jrhemstad (Contributor) commented

torch.cuda.max_memory_allocated doesn't seem to do what you describe. From its documentation:

"this returns the peak allocated memory since the beginning of this program."

This looks like it just returns the high-water mark of allocated memory.

It sounds like you're asking for a way to query the amount of "free" memory hoping that you'll be able to allocate that amount of memory and it will succeed. Unfortunately, there is no such API and in general it is impossible to provide such an API.

masahi (Author) commented Feb 9, 2024

Actually torch.cuda.max_memory_allocated does the job for me. I already use it. vLLM (https://github.com/vllm-project/vllm), which is very similar to an application I'm working on, was also using torch.cuda.max_memory_allocated to determine the maximum number of cache blocks that can be allocated, until vllm-project/vllm#2031.

"The warm-up" step I talked about is supposed to get the maximum-sized input and run an inference on an LLM. This gives an upper-bound on the peak VRAM footprint required by the model, for any input accepted by the applicatoin. The rest of available VRAM is used to allocate cache blocks. Now, I want to do the same thing using rmm in C++, without PyTorch.

I apologize if my explanation is unclear, but please assume that I am looking for an API in rmm equivalent to torch.cuda.max_memory_allocated.
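
For illustration, the sizing step I want to reproduce would look roughly like this; a sketch only, where the peak value, block size, and utilization fraction are placeholder numbers and the peak would come from whatever rmm API tracks it:

```cpp
// Sketch of the cache-block sizing step, assuming we already have the
// peak allocated bytes observed during the warm-up inference.
#include <cuda_runtime_api.h>

#include <cstddef>
#include <iostream>

std::size_t num_cache_blocks(std::size_t peak_bytes_from_warmup,
                             std::size_t cache_block_bytes,
                             double gpu_memory_utilization = 0.9) {
  std::size_t free_bytes = 0, total_bytes = 0;
  cudaMemGetInfo(&free_bytes, &total_bytes);

  // Budget a fraction of total VRAM, subtract the model's peak footprint,
  // and fill the remainder with fixed-size paged cache blocks.
  auto const budget =
      static_cast<std::size_t>(total_bytes * gpu_memory_utilization);
  if (budget <= peak_bytes_from_warmup) { return 0; }
  return (budget - peak_bytes_from_warmup) / cache_block_bytes;
}

int main() {
  // Illustrative numbers only.
  std::size_t const peak  = std::size_t{6} << 30;  // 6 GiB peak from warm-up
  std::size_t const block = 2 * 1024 * 1024;       // 2 MiB per cache block
  std::cout << num_cache_blocks(peak, block) << " cache blocks\n";
  return 0;
}
```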

rapidsai locked and limited conversation to collaborators Feb 10, 2024
harrism converted this issue into discussion #1466 Feb 10, 2024

This issue was moved to a discussion. You can continue the conversation there.
