[FEA] An equivalent of torch.cuda.max_memory_allocated for pooled resource #1465

Closed
masahi opened this issue Feb 9, 2024 · 2 comments
Labels: ? - Needs Triage (Need team to review and classify), feature request (New feature or request)

Comments


masahi commented Feb 9, 2024

PyTorch has torch.cuda.max_memory_allocated function, which allows me to figure out how much VRAM is remaining for use by an application at any moment. Does rmm have an equivalent function?

My use case is LLM serving: after an initial warm-up step, which consumes some VRAM, I want to use all remaining VRAM to pre-allocate paged cache blocks. To decide the maximum number of cache blocks I can allocate, I need the information that torch.cuda.max_memory_allocated returns, but via rmm.

pool_memory_resource::pool_size() from #962 is not what I need since it includes the size of free blocks.
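
Roughly, what I am after is something like the sketch below, assuming I understand rmm's statistics_resource_adaptor correctly and its get_bytes_counter() exposes a peak byte count (the analogue of max_memory_allocated); please treat the member names as my assumption, to be checked against the installed rmm version:

```cpp
// Sketch only: assumes rmm::mr::statistics_resource_adaptor provides a
// get_bytes_counter() whose .peak member is the high-water mark of bytes
// allocated through it (the analogue of torch.cuda.max_memory_allocated).
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>
#include <rmm/mr/device/statistics_resource_adaptor.hpp>

#include <cstddef>
#include <iostream>

int main() {
  rmm::mr::cuda_memory_resource cuda_mr;
  // 1 GiB initial pool; grows on demand.
  rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool_mr{
      &cuda_mr, std::size_t{1} << 30};
  // Wrap the pool so every allocation/deallocation is counted.
  rmm::mr::statistics_resource_adaptor<decltype(pool_mr)> stats_mr{&pool_mr};
  rmm::mr::set_current_device_resource(&stats_mr);

  {
    // Stand-in for the warm-up inference pass.
    rmm::device_buffer warmup{256 * 1024 * 1024, rmm::cuda_stream_default};
  }

  // Peak allocated bytes observed so far, even after the buffer is freed.
  auto const bytes = stats_mr.get_bytes_counter();
  std::cout << "peak allocated bytes: " << bytes.peak << "\n";
  return 0;
}
```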

masahi added the ? - Needs Triage and feature request labels Feb 9, 2024
jrhemstad (Contributor) commented

torch.cuda.max_memory_allocated doesn't seem to do what you describe. From its documentation:

"this returns the peak allocated memory since the beginning of this program."

This looks like it just returns the high-water mark of allocated memory.

It sounds like you're asking for a way to query the amount of "free" memory hoping that you'll be able to allocate that amount of memory and it will succeed. Unfortunately, there is no such API and in general it is impossible to provide such an API.

masahi (Author) commented Feb 9, 2024

Actually torch.cuda.max_memory_allocated does the job for me. I already use it. vLLM (https://github.com/vllm-project/vllm), which is very similar to an application I'm working on, was also using torch.cuda.max_memory_allocated to determine the maximum number of cache blocks that can be allocated, until vllm-project/vllm#2031.

"The warm-up" step I talked about is supposed to get the maximum-sized input and run an inference on an LLM. This gives an upper-bound on the peak VRAM footprint required by the model, for any input accepted by the applicatoin. The rest of available VRAM is used to allocate cache blocks. Now, I want to do the same thing using rmm in C++, without PyTorch.

I apologize if my explanation is unclear, but please assume that I am looking for an API in rmm equivalent to torch.cuda.max_memory_allocated.
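
For illustration, the sizing step I want to reproduce would look roughly like this; a sketch only, where the peak value, block size, and utilization fraction are placeholder numbers and the peak would come from whatever rmm API tracks it:

```cpp
// Sketch of the cache-block sizing step, assuming we already have the
// peak allocated bytes observed during the warm-up inference.
#include <cuda_runtime_api.h>

#include <cstddef>
#include <iostream>

std::size_t num_cache_blocks(std::size_t peak_bytes_from_warmup,
                             std::size_t cache_block_bytes,
                             double gpu_memory_utilization = 0.9) {
  std::size_t free_bytes = 0, total_bytes = 0;
  cudaMemGetInfo(&free_bytes, &total_bytes);

  // Budget a fraction of total VRAM, subtract the model's peak footprint,
  // and fill the remainder with fixed-size paged cache blocks.
  auto const budget =
      static_cast<std::size_t>(total_bytes * gpu_memory_utilization);
  if (budget <= peak_bytes_from_warmup) { return 0; }
  return (budget - peak_bytes_from_warmup) / cache_block_bytes;
}

int main() {
  // Illustrative numbers only.
  std::size_t const peak  = std::size_t{6} << 30;  // 6 GiB peak from warm-up
  std::size_t const block = 2 * 1024 * 1024;       // 2 MiB per cache block
  std::cout << num_cache_blocks(peak, block) << " cache blocks\n";
  return 0;
}
```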

rapidsai locked and limited conversation to collaborators Feb 10, 2024
harrism converted this issue into discussion #1466 Feb 10, 2024

This issue was moved to a discussion. You can continue the conversation there.
