Recent vLLMs ask for too much memory: ValueError: No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine.
#2248
Comments
vLLM 0.2.6 added CUDA graph support, which is enabled by default (probably not a good decision). CUDA graphs introduce a bit more memory overhead. Try to see if adding --enforce-eager helps.
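A minimal sketch of how one might apply that suggestion when constructing the engine from Python (the model name and parallelism below are illustrative assumptions; `enforce_eager` is the engine-level counterpart of the `--enforce-eager` CLI flag):

```python
from vllm import LLM

# enforce_eager=True disables CUDA graph capture, trading some decode
# throughput for a smaller memory footprint at engine initialization.
llm = LLM(
    model="TheBloke/Llama-2-70B-AWQ",  # illustrative model, not from the comment
    quantization="awq",
    tensor_parallel_size=4,
    enforce_eager=True,
)
```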
Thanks for responding. However, we had problems starting with 0.2.5. If you need a specific snapshot or something for 4*A10G using 70B AWQ on 0.2.4 vs. 0.2.5, let me know. Or what kind of repro do you need?
Oh I see. Sorry for not reading your issue carefully. vLLM 0.2.5 changed the way memory is profiled in #2031. While the new profiling method is more accurate, it doesn't seem to take into account multiple instances running together, or GPU memory used by other processes. See line 100 in commit 1db83e3:
Here, vLLM basically assumes that any occupied GPU memory is attributable to the currently running instance, and calculates the number of available blocks based on that. This may explain the problem when running two 7B models on one GPU. Not quite sure about the 4xA10G use case, though. Is the GPU empty, or shared by other processes, in that case?
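A simplified paraphrase (not the literal vLLM source) of the post-#2031 accounting, showing why memory held by other processes shrinks the computed block count:

```python
import torch

def num_available_gpu_blocks(gpu_memory_utilization: float,
                             cache_block_size_bytes: int) -> int:
    """Paraphrase of the post-#2031 profiling logic; not the literal source."""
    torch.cuda.synchronize()
    free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()
    # Everything that is not currently free is treated as *this* instance's
    # peak usage, including memory held by unrelated processes.
    peak_memory = total_gpu_memory - free_gpu_memory
    blocks = int((total_gpu_memory * gpu_memory_utilization - peak_memory)
                 // cache_block_size_bytes)
    # If other processes hold enough memory, this goes to zero or negative,
    # producing "No available memory for the cache blocks".
    return blocks
```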
Just tried to write a fix. You can try it out: #2249
Our biggest issue is with clean GPUs: four A10Gs running 70B AWQ, with nothing else on the GPUs.
We are having the exact same issue on our end: cache usage grows and consumes more than the allocated gpu_memory_utilization. We had the same problem before with 0.2.1.
Having the same issue on CUDA 11.8 with vLLM 0.2.5 and 0.2.6.
Same here.
Same issue, starting with vLLM 0.2.5.
Same issue when using vLLM 0.2.6.
Same here.
@Snowdar @hanzhi713 et al. I want to be clear again: the primary issue is that even a single model sharded across GPUs no longer works. Forget about multiple models per GPU for now. That is, on AWS 4*A10G, vLLM 0.2.4 and lower work perfectly fine and leave plenty of room without any failure. However, on 0.2.5+ the Llama 70B AWQ model never fits on the 4 A10Gs, no matter what gpu_memory_utilization or other settings are used, while before it was perfectly fine (even under heavy use for long periods).
I'm working on v0.2.5 now and hit this issue for the same reason. My case is deploying a 70B BF16 model on 8x A100-40GB GPUs. I inserted logging around the profiling code:

```python
torch.cuda.empty_cache()
# Here the free memory is ~22 GB per GPU. This is expected,
# given 40 - (70 GB * 2) / 8 = 22.5.
self.model_runner.profile_run()

# Calculate the number of blocks that can be allocated with the
# profiled peak memory.
torch.cuda.synchronize()
free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()
# Here the free memory is only 0.26 GB per GPU. Looks like profile_run()
# consumes all memory, even though I don't know why for now.
```
I dug into it a bit and here are some findings.
My temporary solution is as follows.
Yet another version of this problem is that 01-ai/Yi-34B-Chat used to work perfectly fine on 4*H100 80GB when run like:
But since 0.2.5+ (including 0.2.7) it doesn't anymore; instead I get:
When can we expect a fix? It seems a pretty serious bug. BTW, curiously, I ran the same exact command a second time (both times with nothing on the GPUs) and the second time didn't hit the error. So maybe there is a race in vLLM's memory size detection.
I am trying to run this command as given in the docs:
It gives me an error:
What should I do? I am running a RunPod instance with 1x RTX 4000 Ada.
I have upgraded to 1x A100 and am now passing the --gpu_memory_utilization 0.8 param, but still get the same error.
The issue was resolved by adding one setting. The reason it helped is that I am running a RunPod instance, which, as I understand it, gives me access only to the requested GPUs attached to the physical machine.
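The comment doesn't preserve which setting was added; given the explanation about visible GPUs, one plausible guess (an assumption, not confirmed by the commenter) is pinning CUDA_VISIBLE_DEVICES before vLLM initializes:

```python
import os

# Assumption: restrict the process to the GPU(s) actually rented, so that
# torch.cuda.mem_get_info() only reports on devices this instance owns.
# Must happen before torch/vLLM initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from vllm import LLM  # imported after the environment variable is set

llm = LLM(model="meta-llama/Llama-2-7b-hf",  # illustrative model
          gpu_memory_utilization=0.8)
```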
I also encountered this problem (i.e., OOM, or too few KV cache blocks) on a 70B LLM with v0.2.7 and dug into it a bit. Here are my findings. My dev environment: a machine with 8 A800 GPUs and CUDA 11.3. Working solution: use
Another working solution: update to torch==2.1.2. Analysis: there is evidence of more memory fragmentation when tp > 1; see here and here.
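On the fragmentation hypothesis, one generic PyTorch allocator knob sometimes suggested for fragmentation (an assumption here; this commenter does not report using it) is expandable segments, available since torch 2.1:

```python
import os

# Assumption: a generic mitigation for CUDA caching-allocator fragmentation,
# not a fix confirmed in this thread. Must be set before CUDA initializes.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"free: {free_bytes / 1e9:.1f} GB of {total_bytes / 1e9:.1f} GB")
```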
@ZiyueHuang I have pytorch 2.1.2 and vllm 0.2.7 and this wasn't solved by that. |
@pseudotensor How about trying to revert #2031?
@ZiyueHuang Yes, I'm trying that now. |
This issue was closed automatically by GitHub; that was not correct.
Reverting avoided the message in the title, but then it hit GPU OOM with the same long-context query, unlike 0.2.4. FYI @sh1ng
FYI @pseudotensor, I've tested the memory footprint of
@sh1ng Try 4*A10Gs with 70B AWQ; it simply doesn't work, but on 0.2.4 it works perfectly fine.
--enforce_eager works to solve this issue for 4*A10G 70B AWQ. The issue with the two 7Bs was how the memory budget is defined, total vs. free, which changed in vLLM. Free memory is hard to manage when bringing up two 7Bs at the same time, since it isn't well defined. So we have to wait for the first to come up, and then bring up the other with 0.9 rather than the fraction of total (e.g. 0.4-0.5); see the sketch below.
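A sketch of that sequential bring-up (model names, ports, and the sleep are illustrative assumptions; the server flags are standard vLLM arguments). The second instance needs the larger fraction because its budget of utilization * total memory must also cover what the first instance already holds:

```python
import subprocess
import time

# First 7B instance: budgeted at 40% of *total* GPU memory.
subprocess.Popen([
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "mistralai/Mistral-7B-v0.1",   # illustrative model
    "--port", "8000",
    "--gpu-memory-utilization", "0.4",
])

# Crude wait until the first engine has allocated its KV cache; a real
# deployment would poll the server's health endpoint instead.
time.sleep(120)

# Second instance: 0.9, since 0.9 * total must cover the first instance's
# resident memory plus this model's weights and cache.
subprocess.Popen([
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "mistralai/Mistral-7B-Instruct-v0.1",  # illustrative model
    "--port", "8001",
    "--gpu-memory-utilization", "0.9",
])
```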
@oandreeva-nv I also explained above what helped me for the same GPU issue. Did you try that? vLLM changed its behavior from total to free memory, so it's confusing: the first (say) 7B model should use 0.4 and the second should use 0.9.
@pseudotensor, yes I tried changing |
After some fine-tuning with
Has this issue been resolved? I've encountered the same problem as well. |
@XBeg9, I have a question please: for the engine args you used, does that mean you can handle 32768 tokens within the same batch, and that the number of output sequences per input is 256?
Hi, I don't know if anyone is still active on this issue or has found any way to resolve it. I am running on 1 NVIDIA T4 GPU with 40 GB of memory. A side note: when it tries to run, it says:
Help!
@nicobieber99 I managed to get Llama-2 7B working on an NVIDIA A2 GPU (16GB memory) today by setting these parameters. I'm using OpenShift AI with a custom vLLM serving runtime.
I went the other way with the gpu_memory_utilization setting after seeing this post. I reduced the max model length (context) from Llama-2 7B's default of 4096 down to 2048 after I got this error. Not ideal that I had to reduce the context, but it is at least working now and may be OK for short Q&A stuff; a sketch of the kind of settings involved follows.
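The exact parameters aren't preserved above, so the following is only a guess at the shape of such a configuration (the 2048 context comes from the comment; the model name and other values are assumptions):

```python
from vllm import LLM

# Assumed reconstruction of the kind of settings described: shrink the
# context window and KV cache budget so a 7B model fits in 16 GB.
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",  # assumption; exact model not stated
    max_model_len=2048,           # halved from the 4096 default, per the comment
    gpu_memory_utilization=0.95,  # illustrative value, not from the comment
    enforce_eager=True,           # illustrative; avoids CUDA graph overhead
)
```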
Any update on this issue? I am trying to serve two models (TinyLlama 1B) on the same GPU cluster. I can only start a model on a replica with 40% of the GPU, with the model reserving 10G/22G of GPU RAM. However, when I tried to start the second model I got this error, even though it created another replica and the cluster's GPU usage is now 0.8/1.
I have a similar problem with the new version. I assumed I need to be able to strictly set memory limits for vLLM for my use case to work. Please advise.
Since vLLM 0.2.5, we can't even run Llama-2 70B 4-bit AWQ on 4*A10G anymore; we have to use the old vLLM. There are similar problems even when trying to run two 7B models on an 80GB A100.
For small models, like 7B with 4k tokens, vLLM fails on "cache blocks" even though a lot more memory is left.
E.g., building a Docker image with CUDA 11.8 and vLLM 0.2.5 or 0.2.6 and running like:
works. However, if the 2nd model were to use 0.4, one gets:
However, with the 0.6 utilization from before, here is what the GPU looks like:
Ignore GPU=0.
So 0.6 utilization amounts to 17GB; why would 0.4 utilization out of 80GB be a problem?
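A hedged back-of-the-envelope reading of why 0.4 can fail under the free-vs-total accounting described earlier in the thread (the 14 GB weight figure is an assumption for a 7B fp16 model; 17 GB is the figure quoted above):

```python
# All numbers in GB; hypothetical walk-through for the second instance
# on an 80 GB A100 that already hosts the first (0.6-utilization) model.
total = 80.0
first_instance = 17.0   # resident memory of the first model, per the comment
second_weights = 14.0   # assumption: ~7B params * 2 bytes (fp16)

budget = 0.4 * total                               # 32 GB allowed
perceived_peak = first_instance + second_weights   # vLLM attributes all of it here
cache_budget = budget - perceived_peak             # ~1 GB left for the KV cache
print(f"KV cache budget: {cache_budget:.1f} GB")   # near zero -> the error
```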