Fix peak memory profiling #2031

Merged: 3 commits merged on Dec 13, 2023
Conversation

WoosukKwon (Collaborator) commented Dec 11, 2023

Currently, vLLM's memory usage easily exceeds gpu_memory_utilization (90%) when TP > 1. I believe this is because the current memory profiling is not accurate when TP > 1. My guess is that some memory (probably related to NCCL?) is allocated and freed outside the PyTorch memory allocator.

WoosukKwon marked this pull request as ready for review on December 12, 2023, 18:29
WoosukKwon (Collaborator, Author) commented Dec 12, 2023

NOTE: This significantly affects the KV cache size, especially when TP > 1.

Llama2-7B with TP=1 and 1xA100-80GB

  • Current main: # GPU blocks: 7449
  • This PR: # GPU blocks: 7351

Llama2-70B with TP=4 and 4xA100-80GB

  • Current main: # GPU blocks: 31615
  • This PR: # GPU blocks: 20865

I've noticed that in the current main branch, the peak memory usage during benchmark_throughput.py (with TP=4) is about 77.5 GiB, even though gpu_memory_utilization is set to 0.9 (72 GiB). After this PR, the peak memory usage is about 68 GiB for benchmark_throughput.py (with TP=4) and 73 GiB for python benchmarks/benchmark_latency.py --model meta-llama/Llama-2-70b-hf -tp 4 --batch-size 100 --input-len 1024 --output-len 512.

Maybe we can increase the default gpu_memory_utilization to 0.95 in this PR, since the profiling becomes more conservative. (In this case, # GPU blocks becomes 24287 for Llama2-70B.) @zhuohan123 WDYT?
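For reference, here is a minimal sketch of how a profiled peak translates into a KV cache block count under a given gpu_memory_utilization. The function name and arguments are illustrative, not the exact vLLM code:

import torch

def estimate_num_gpu_blocks(gpu_memory_utilization: float,
                            peak_memory_bytes: int,
                            cache_block_size_bytes: int) -> int:
    # Total device memory reported by the CUDA driver.
    _, total_gpu_memory = torch.cuda.mem_get_info()
    # Memory budget the engine is allowed to use.
    budget = total_gpu_memory * gpu_memory_utilization
    # Whatever the profiled forward pass does not need goes to the KV cache.
    return int((budget - peak_memory_bytes) // cache_block_size_bytes)

A more conservative (larger) peak_memory_bytes directly shrinks the block count, which matches the drop from 31615 to 20865 blocks reported above.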

zhuohan123 (Member) left a comment

Thanks for fixing this! I think I need to understand this PR a bit more. And yeah, we should change the percentage to 0.95, or even to 1, if we have a more conservative memory profiler.

vllm/worker/worker.py (resolved)
Comment on lines +91 to +92
free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()
peak_memory = total_gpu_memory - free_gpu_memory
zhuohan123 (Member)
I'm a bit confused. What exactly does free_gpu_memory measure here? Is it the part of memory that the PyTorch memory allocator does not touch? In other words, is the following diagram correct?

|<--------------------------------total GPU memory---------------------------------->|
|<---------------what Pytorch allocator allocates------------->|<--free GPU memory-->|
|<--actually used in execution-->|<--allocator fragmentation-->|<--free GPU memory-->|

In this case, peak_memory here can greatly over-estimate the memory usage. Is my understanding correct?

WoosukKwon (Collaborator, Author)

In this case, peak_memory here can greatly over-estimate the memory usage. Is my understanding correct?

It depends on the memory fragmentation caused by the allocator. However, I believe the fragmentation should be small, since most weight/activation tensors are pretty big.

is the following diagram correct?

I think one missing part here is the memory that is not allocated/freed through PyTorch's memory allocator. For example, the memory used internally by cuBLAS or NCCL is not captured by PyTorch's GPU memory allocator. My hypothesis is that this NCCL memory usage is not accounted for in the current memory profiling, which is based on PyTorch allocator statistics.
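As a rough illustration of that distinction, the gap between device-level usage (torch.cuda.mem_get_info) and the allocator's own statistics gives an estimate of memory that bypasses the PyTorch allocator. This is only a sketch under that assumption, and the device-level number also includes any other processes sharing the GPU:

import torch

def memory_breakdown_gib(device: int = 0) -> dict:
    # All values in GiB.
    gib = 1024 ** 3
    free, total = torch.cuda.mem_get_info(device)
    used = total - free                              # device-level usage (all allocations, all processes)
    reserved = torch.cuda.memory_reserved(device)    # held by the PyTorch caching allocator
    allocated = torch.cuda.memory_allocated(device)  # live PyTorch tensors
    return {
        "used": used / gib,
        "reserved": reserved / gib,
        "allocated": allocated / gib,
        "allocator_fragmentation": (reserved - allocated) / gib,
        "outside_allocator": (used - reserved) / gib,  # rough proxy for NCCL/cuBLAS/context memory
    }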

WoosukKwon (Collaborator, Author)

Plus, I don't think the PR "over-estimates" the peak memory usage, since I observed that even after this PR, the memory usage goes over the gpu_memory_utilization.

After this PR, the peak memory usage is about 68 GiB for benchmark_throughput.py (with TP=4) and 73 GiB for python benchmarks/benchmark_latency.py --model meta-llama/Llama-2-70b-hf -tp 4 --batch-size 100 --input-len 1024 --output-len 512.

zhuohan123 (Member)

I see. It's pretty strange that NCCL uses any extra memory, since all we use is all_reduce, which should happen in place and should not need any extra memory.
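For context, a small sketch of the point being made (assuming a torchrun launch with the NCCL backend): at the PyTorch level, all_reduce writes its result into the input tensor, so no extra output buffer is requested from the allocator; any additional memory would be NCCL-internal.

import os
import torch
import torch.distributed as dist

def allreduce_in_place_demo():
    # Assumes launch via torchrun so RANK/WORLD_SIZE/LOCAL_RANK are set.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
    x = torch.ones(4, device="cuda")
    ptr_before = x.data_ptr()
    dist.all_reduce(x, op=dist.ReduceOp.SUM)  # result is written into x
    assert x.data_ptr() == ptr_before         # no new tensor from the PyTorch allocator
    dist.destroy_process_group()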

Juelianqvq (Contributor) commented Dec 14, 2023

Hi, all. The torch.cuda.mem_get_info() interface definitely degrades the throughput of LLaMA2-13B by nearly 20% when tp=2, and I'm also curious about the extra GPU memory I see when I try benchmark_serving.py (it is not released).
Reverting to torch.cuda.max_memory_allocated() boosts the throughput from 1.43 to 1.82 requests/s (num_gpu_blocks 700 -> 1403, utilization=0.9).
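For comparison, a minimal sketch of the allocator-only measurement being referred to, assuming the profiling run is passed in as a callable. It only sees memory that goes through the PyTorch caching allocator, which is why it reports a smaller peak (and hence more KV cache blocks) than the torch.cuda.mem_get_info() approach:

import torch

def profile_peak_allocator_memory(run_profiling_step) -> int:
    # run_profiling_step is assumed to be a callable running one worst-case
    # forward pass (a stand-in for vLLM's profiling run).
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    run_profiling_step()
    torch.cuda.synchronize()
    # Peak bytes as seen by the PyTorch caching allocator only; NCCL/cuBLAS
    # allocations made outside the allocator are not counted.
    return torch.cuda.max_memory_allocated()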

WoosukKwon (Collaborator, Author)

@zhuohan123

I measured the GPU's peak memory usage (using torch.cuda.mem_get_info), the PyTorch reserved memory (torch.cuda.memory.memory_reserved), and the PyTorch allocated memory (torch.cuda.memory.memory_allocated) during the initial profiling run (i.e., before creating any CUDA graphs).

  • llama 7B, TP=1
Rank0 peak memory: 13.66, reserved: 13.15, allocated: 12.56
  1. There's little difference (~500 MiB) between the peak and reserved memory. I guess this memory is used by CUDA objects such as cuBLAS handles.
  2. There's little difference (~600 MiB) between the reserved and allocated memory. This means the intermediate activations use ~600 MiB.
  • llama-70B, TP=8 (before CUDA graph PR)
(RayWorkerVllm pid=252359) Rank0 peak memory: 24.57, reserved: 22.63, allocated: 16.07
(RayWorkerVllm pid=252361) Rank2 peak memory: 24.98, reserved: 22.75, allocated: 16.07
(RayWorkerVllm pid=252360) Rank1 peak memory: 24.98, reserved: 22.75, allocated: 16.07
(RayWorkerVllm pid=252362) Rank3 peak memory: 24.98, reserved: 22.75, allocated: 16.07
(RayWorkerVllm pid=252366) Rank7 peak memory: 24.69, reserved: 22.75, allocated: 16.07
(RayWorkerVllm pid=252364) Rank5 peak memory: 25.10, reserved: 22.88, allocated: 16.07
(RayWorkerVllm pid=252365) Rank6 peak memory: 24.98, reserved: 22.75, allocated: 16.07
(RayWorkerVllm pid=252363) Rank4 peak memory: 24.98, reserved: 22.75, allocated: 16.07
  1. There's a large difference (~2 GiB) between the peak and reserved memory. I guess this is used by the cuBLAS handle plus NCCL; NCCL might use some GPU memory for the communicator and scratch-pad buffers, but I'm not sure.
  2. There's a large difference (~6.7 GiB) between the reserved and allocated memory. I have no idea where this huge difference comes from.
  • llama-70B, TP=8 (after CUDA graph PR)
(RayWorkerVllm pid=238494) Rank3 peak memory: 25.66, reserved: 22.63, allocated: 16.07
(RayWorkerVllm pid=238498) Rank7 peak memory: 25.37, reserved: 22.75, allocated: 16.07
(RayWorkerVllm pid=238493) Rank2 peak memory: 25.79, reserved: 22.75, allocated: 16.07
(RayWorkerVllm pid=238492) Rank1 peak memory: 25.79, reserved: 22.75, allocated: 16.07
(RayWorkerVllm pid=238491) Rank0 peak memory: 25.37, reserved: 22.75, allocated: 16.07
(RayWorkerVllm pid=238496) Rank5 peak memory: 25.91, reserved: 22.88, allocated: 16.07
(RayWorkerVllm pid=238497) Rank6 peak memory: 25.91, reserved: 22.88, allocated: 16.07
(RayWorkerVllm pid=238495) Rank4 peak memory: 25.79, reserved: 22.75, allocated: 16.0
  1. The peak memory increases by ~1 GiB. I guess this is because we use 2 NCCL communicators: one for PyTorch and another for CuPy.

WoosukKwon (Collaborator, Author) commented Dec 17, 2023

I've visualized the memory usage:

  • llama 7B, TP=1
[Screenshot: memory usage visualization, llama 7B, TP=1]

The activation memory is reused after every layer.

  • llama-70B, TP=8
[Screenshot: memory usage visualization, llama-70B, TP=8]

However, when using TP, the activation memory for all-reduce is not reused.

zhuohan123 (Member) left a comment

LGTM!

vinod-sarvam commented
This change seems to introduce a bug where peak memory utilisation is calculated incorrectly: if multiple processes from different instances share the same GPU, the profiler treats all of their memory as belonging to this instance, which is not correct.
