Fix peak memory profiling #2031

Merged: 3 commits merged on Dec 13, 2023
Conversation

WoosukKwon (Collaborator) commented Dec 11, 2023

Currently, vLLM's memory usage easily exceeds gpu_memory_utilization (90%) when TP > 1. I believe this is because the current memory profiling is not accurate when TP > 1. My guess is that some memory (probably related to NCCL?) is allocated and freed outside the PyTorch memory allocator.

WoosukKwon marked this pull request as ready for review on December 12, 2023, 18:29
WoosukKwon (Collaborator, Author) commented Dec 12, 2023

NOTE: This significantly affects the KV cache size, especially when TP > 1.

Llama2-7B with TP=1 and 1xA100-80GB

  • Current main: # GPU blocks: 7449
  • This PR: # GPU blocks: 7351

Llama2-70B with TP=4 and 4xA100-80GB

  • Current main: # GPU blocks: 31615
  • This PR: # GPU blocks: 20865

I've noticed that in the current main branch, the peak memory usage during benchmark_throughput.py (with TP=4) is about 77.5 GiB, even though gpu_memory_utilization is set to 0.9 (72 GiB). After this PR, the peak memory usage is about 68 GiB for benchmark_throughput.py (with TP=4) and 73 GiB for python benchmarks/benchmark_latency.py --model meta-llama/Llama-2-70b-hf -tp 4 --batch-size 100 --input-len 1024 --output-len 512.

Maybe we can increase the default gpu_memory_utilization to 0.95 in this PR, since the profiling becomes more conservative. (In this case, # GPU blocks becomes 24287 for Llama2-70B.) @zhuohan123 WDYT?
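For reference, here is a minimal sketch of how a profiled peak translates into a KV cache block count under a given gpu_memory_utilization. The function name and arguments are illustrative, not the exact vLLM code:

import torch

def estimate_num_gpu_blocks(gpu_memory_utilization: float,
                            peak_memory_bytes: int,
                            cache_block_size_bytes: int) -> int:
    # Total device memory reported by the CUDA driver.
    _, total_gpu_memory = torch.cuda.mem_get_info()
    # Memory budget the engine is allowed to use.
    budget = total_gpu_memory * gpu_memory_utilization
    # Whatever the profiled forward pass does not need goes to the KV cache.
    return int((budget - peak_memory_bytes) // cache_block_size_bytes)

A more conservative (larger) peak_memory_bytes directly shrinks the block count, which matches the drop from 31615 to 20865 blocks reported above.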

zhuohan123 (Member) left a comment

Thanks for fixing this! I think I need to understand this PR a bit more. And yeah, we should change the percentage to 0.95, or even to 1, if we have a more conservative memory profiler.

vllm/worker/worker.py (resolved)
Comment on lines +91 to +92
free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()
peak_memory = total_gpu_memory - free_gpu_memory
zhuohan123 (Member)
I'm a bit confused. What exactly does free_gpu_memory measure here? Is it the part of memory that the PyTorch memory allocator does not touch? In other words, is the following diagram correct?

|<--------------------------------total GPU memory---------------------------------->|
|<---------------what Pytorch allocator allocates------------->|<--free GPU memory-->|
|<--actually used in execution-->|<--allocator fragmentation-->|<--free GPU memory-->|

In this case, peak_memory here can greatly over-estimate the memory usage. Is my understanding correct?

WoosukKwon (Collaborator, Author)

In this case, peak_memory here can greatly over-estimate the memory usage. Is my understanding correct?

It depends on the memory fragmentation caused by the allocator. However, I believe the fragmentation should be small, since most weight/activation tensors are pretty big.

is the following diagram correct?

I think one missing part here is the memory that is not allocated/freed through PyTorch's memory allocator. For example, the memory used internally by cuBLAS or NCCL is not captured by PyTorch's GPU memory allocator. My hypothesis is that this NCCL memory usage is not accounted for in the current memory profiling, which is based on PyTorch allocator statistics.
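As a rough illustration of that distinction, the gap between device-level usage (torch.cuda.mem_get_info) and the allocator's own statistics gives an estimate of memory that bypasses the PyTorch allocator. This is only a sketch under that assumption, and the device-level number also includes any other processes sharing the GPU:

import torch

def memory_breakdown_gib(device: int = 0) -> dict:
    # All values in GiB.
    gib = 1024 ** 3
    free, total = torch.cuda.mem_get_info(device)
    used = total - free                              # device-level usage (all allocations, all processes)
    reserved = torch.cuda.memory_reserved(device)    # held by the PyTorch caching allocator
    allocated = torch.cuda.memory_allocated(device)  # live PyTorch tensors
    return {
        "used": used / gib,
        "reserved": reserved / gib,
        "allocated": allocated / gib,
        "allocator_fragmentation": (reserved - allocated) / gib,
        "outside_allocator": (used - reserved) / gib,  # rough proxy for NCCL/cuBLAS/context memory
    }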

WoosukKwon (Collaborator, Author)

Plus, I don't think the PR "over-estimates" the peak memory usage, since I observed that even after this PR, the memory usage goes over the gpu_memory_utilization.

After this PR, the peak memory usage is about 68 GiB for benchmark_throughput.py (with TP=4) and 73 GiB for python benchmarks/benchmark_latency.py --model meta-llama/Llama-2-70b-hf -tp 4 --batch-size 100 --input-len 1024 --output-len 512.

zhuohan123 (Member)

I see. It's pretty strange that NCCL uses any extra memory, since all we use is all_reduce, which should happen in place and should not need any extra memory.
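For context, a small sketch of the point being made (assuming a torchrun launch with the NCCL backend): at the PyTorch level, all_reduce writes its result into the input tensor, so no extra output buffer is requested from the allocator; any additional memory would be NCCL-internal.

import os
import torch
import torch.distributed as dist

def allreduce_in_place_demo():
    # Assumes launch via torchrun so RANK/WORLD_SIZE/LOCAL_RANK are set.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
    x = torch.ones(4, device="cuda")
    ptr_before = x.data_ptr()
    dist.all_reduce(x, op=dist.ReduceOp.SUM)  # result is written into x
    assert x.data_ptr() == ptr_before         # no new tensor from the PyTorch allocator
    dist.destroy_process_group()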

Juelianqvq (Contributor) commented Dec 14, 2023

Hi, all. The torch.cuda.mem_get_info() interface definitely degrades the throughput of LLaMA2-13B by nearly 20% when tp=2, and I'm also curious about the extra GPU memory I see when I try benchmark_serving.py (it is not released).
Reverting to torch.cuda.max_memory_allocated() boosts the throughput from 1.43 to 1.82 requests/s (num_gpu_blocks 700 -> 1403, utilization=0.9).
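For comparison, a minimal sketch of the allocator-only measurement being referred to, assuming the profiling run is passed in as a callable. It only sees memory that goes through the PyTorch caching allocator, which is why it reports a smaller peak (and hence more KV cache blocks) than the torch.cuda.mem_get_info() approach:

import torch

def profile_peak_allocator_memory(run_profiling_step) -> int:
    # run_profiling_step is assumed to be a callable running one worst-case
    # forward pass (a stand-in for vLLM's profiling run).
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    run_profiling_step()
    torch.cuda.synchronize()
    # Peak bytes as seen by the PyTorch caching allocator only; NCCL/cuBLAS
    # allocations made outside the allocator are not counted.
    return torch.cuda.max_memory_allocated()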

WoosukKwon (Collaborator, Author)

@zhuohan123

I measured the GPU's peak memory usage (using torch.cuda.mem_get_info), the PyTorch reserved memory (torch.cuda.memory.memory_reserved), and the PyTorch allocated memory (torch.cuda.memory.memory_allocated) during the initial profiling run (i.e., before creating any CUDA graphs).

  • llama 7B, TP=1
Rank0 peak memory: 13.66, reserved: 13.15, allocated: 12.56
  1. There's little difference (~500 MiB) between the peak and reserved memory. I guess this memory is used by CUDA objects such as cuBLAS handles.
  2. There's little difference (~600 MiB) between the reserved and allocated memory. This means the intermediate activations use ~600 MiB.
  • llama-70B, TP=8 (before CUDA graph PR)
(RayWorkerVllm pid=252359) Rank0 peak memory: 24.57, reserved: 22.63, allocated: 16.07
(RayWorkerVllm pid=252361) Rank2 peak memory: 24.98, reserved: 22.75, allocated: 16.07
(RayWorkerVllm pid=252360) Rank1 peak memory: 24.98, reserved: 22.75, allocated: 16.07
(RayWorkerVllm pid=252362) Rank3 peak memory: 24.98, reserved: 22.75, allocated: 16.07
(RayWorkerVllm pid=252366) Rank7 peak memory: 24.69, reserved: 22.75, allocated: 16.07
(RayWorkerVllm pid=252364) Rank5 peak memory: 25.10, reserved: 22.88, allocated: 16.07
(RayWorkerVllm pid=252365) Rank6 peak memory: 24.98, reserved: 22.75, allocated: 16.07
(RayWorkerVllm pid=252363) Rank4 peak memory: 24.98, reserved: 22.75, allocated: 16.07
  1. There's a large difference (~2 GiB) between the peak and reserved memory. I guess this is used by the cuBLAS handle plus NCCL; NCCL might use some GPU memory for the communicator and scratch-pad buffers, but I'm not sure.
  2. There's a large difference (~6.7 GiB) between the reserved and allocated memory. I have no idea where this huge difference comes from.
  • llama-70B, TP=8 (after CUDA graph PR)
(RayWorkerVllm pid=238494) Rank3 peak memory: 25.66, reserved: 22.63, allocated: 16.07
(RayWorkerVllm pid=238498) Rank7 peak memory: 25.37, reserved: 22.75, allocated: 16.07
(RayWorkerVllm pid=238493) Rank2 peak memory: 25.79, reserved: 22.75, allocated: 16.07
(RayWorkerVllm pid=238492) Rank1 peak memory: 25.79, reserved: 22.75, allocated: 16.07
(RayWorkerVllm pid=238491) Rank0 peak memory: 25.37, reserved: 22.75, allocated: 16.07
(RayWorkerVllm pid=238496) Rank5 peak memory: 25.91, reserved: 22.88, allocated: 16.07
(RayWorkerVllm pid=238497) Rank6 peak memory: 25.91, reserved: 22.88, allocated: 16.07
(RayWorkerVllm pid=238495) Rank4 peak memory: 25.79, reserved: 22.75, allocated: 16.0
  1. The peak memory increases by ~1 GiB. I guess this is because we use 2 NCCL communicators: one for PyTorch and another for CuPy.

WoosukKwon (Collaborator, Author) commented Dec 17, 2023

I've visualized the memory usage:

  • llama 7B, TP=1
[Screenshot: memory usage visualization, llama 7B, TP=1]

The activation memory is reused after every layer.

  • llama-70B, TP=8
[Screenshot: memory usage visualization, llama-70B, TP=8]

However, when using TP, the activation memory for all-reduce is not reused.

zhuohan123 (Member) left a comment

LGTM!

vinod-sarvam commented
This change seems to introduce a bug where peak memory utilisation is calculated incorrectly: if multiple processes from different instances share the same GPU, the profiler treats all of their memory as belonging to this instance, which is not correct.
