
vllm hangs when reinitializing ray #1058

Closed
nelson-liu opened this issue Sep 15, 2023 · 21 comments
Labels: bug (Something isn't working)

nelson-liu (Contributor) commented Sep 15, 2023

I'd like to be able to unload a vllm model and re-load it later, in the same script. However, the following (on 0.1.7) causes the script to hang (disclaimer: this isn't my particular workload, but a minimal reproducible example):

from vllm import LLM, SamplingParams

def process_prompts(prompts):
    llm = LLM(
        model="meta-llama/Llama-2-70b-chat-hf",
        tensor_parallel_size=2,
        trust_remote_code=True,
        load_format="pt")
    sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=500)
    return llm.generate(prompts, sampling_params)

prompt_batch_1 = ["Hello, my name is", "The president of the United States is"]
prompt_batch_2 = ["The capital of France is", "The future of AI is"]

batch_1_output = process_prompts(prompt_batch_1)
batch_2_output = process_prompts(prompt_batch_2)

Results in:

2023-09-15 11:43:25,943 INFO worker.py:1621 -- Started a local Ray instance.
INFO 09-15 11:43:51 llm_engine.py:72] Initializing an LLM engine with config: model='meta-llama/Llama-2-70b-chat-hf', tokenizer='meta-llama/Llama-2-70b-chat-hf', tokenizer_mode=auto, trust_remote_code=True, dtype=torch.float16, download_dir='/scr/biggest/nfliu/cache/huggingface/', load_format=pt, tensor_parallel_size=2, seed=0)
INFO 09-15 11:43:51 tokenizer.py:30] For some LLaMA-based models, initializing the fast tokenizer may take a long time. To eliminate the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
INFO 09-15 11:45:58 llm_engine.py:199] # GPU blocks: 2561, # CPU blocks: 1638
Processed prompts: 100%|█████████████████████████████████████████████████| 2/2 [00:14<00:00,  7.17s/it]
2023-09-15 11:46:28,348 INFO worker.py:1453 -- Calling ray.init() again after it has already been called.

Then it just hangs forever (I've been waiting 10 minutes with no sign of life). Checking the GPUs shows that the model is indeed unloaded from them:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:C7:00.0 Off |                    0 |
| N/A   30C    P0              61W / 350W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  | 00000000:CA:00.0 Off |                    0 |
| N/A   31C    P0              57W / 350W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

I'm fairly sure this is related to Ray, since it doesn't happen when tensor parallelism is set to 1 (e.g., when running a smaller model). When I Ctrl+C out of the script after it hangs, the traceback shows that it's stuck on ray.get(current_placement_group.ready(), timeout=1800) (https://github.com/vllm-project/vllm/blob/main/vllm/engine/ray_utils.py#L112C9-L112C63).

Is there any way to "reset" the ray state, such that it initializes from scratch the second time?


hsm1997 commented Sep 16, 2023

Maybe you can try inserting os.system("ray stop --force") somewhere between the unload and the reload.
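
For concreteness, here is a minimal sketch (my own placement, not taken from this thread and not verified to resolve the hang) of where such a call could go in the original repro: drop the engine, shut Ray down in-process, then run ray stop --force so the next LLM(...) starts a fresh Ray instance.

import gc
import os

import ray
from vllm import LLM, SamplingParams

def process_prompts(prompts):
    llm = LLM(
        model="meta-llama/Llama-2-70b-chat-hf",
        tensor_parallel_size=2,
        trust_remote_code=True,
        load_format="pt")
    sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=500)
    outputs = llm.generate(prompts, sampling_params)
    # "Unload" the model: drop the engine, then tear Ray down before the next call re-creates it.
    del llm
    gc.collect()
    ray.shutdown()
    os.system("ray stop --force")
    return outputs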

@raihan0824

same problem, any solution?


Fenkail commented Oct 18, 2023

I encountered the same issue. It runs fine when I use tensor_parallel_size=1, but it hangs when I use tensor_parallel_size>1. I have tried reinstalling many times, but it didn't help.

The final solution for me was to modify the vllm/engine/ray_utils.py file and limit the number of CPUs used. After making this change, it works properly. The modified code is:
ray.init(num_cpus=32, num_gpus=4, address=ray_address, ignore_reinit_error=True).

Note: I encountered hanging issues while using tensor_parallel_size>1 on a 128-core machine. However, running tensor_parallel_size>1 on a 96-core machine works normally.
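
If you would rather not patch the installed package, here is a sketch of an alternative (my assumption, based on the ignore_reinit_error=True in the call above: when Ray is already initialized in the process, vLLM attaches to the existing instance instead of starting its own):

import ray
from vllm import LLM

# Start Ray yourself with an explicit CPU/GPU cap before constructing the engine;
# vLLM's later ray.init(..., ignore_reinit_error=True) then reuses this instance.
ray.init(num_cpus=32, num_gpus=4)

llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",
    tensor_parallel_size=2)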


yichenjm commented Nov 2, 2023

@Fenkail Hi, may I ask how you decided on the CPU limit? I am running into exactly the same issue as the OP.


pvtoan commented Nov 7, 2023

Hi @Fenkail, I modified ray_utils.py as you suggested, but the problem is still there.

In fact, my PC has only two GPUs, so I'd like to know how you chose num_cpus and num_gpus to fix the problem.


Fenkail commented Nov 14, 2023

> @Fenkail Hi, may I ask how you decided on the CPU limit? I am running into exactly the same issue as the OP.

I just tried using 32 cores and it solved my problem. The specific number of CPU cores can be adjusted according to your needs. It was working fine on a machine with 96 cores, but I encountered issues on a 128-core machine, so I thought of limiting the CPU usage.


Fenkail commented Nov 14, 2023

[screenshot: the modified ray_utils.py]

Did you modify the ray_utils.py installed in the conda environment for vllm?


pvtoan commented Nov 14, 2023

Yes, I modified the ray_utils.py installed in my conda environment for vllm.


qizzzh commented Dec 8, 2023

Hit the exact same issue when running vLLM in Ray Serve.


qizzzh commented Dec 8, 2023

In my case I have 4 GPUs and 3 Ray Serve deployments: two require 1 logical GPU each with tensor_parallelism=1, and the other requires 2 logical GPUs with tensor_parallelism=2. It looks like vLLM gets stuck handling the tensor_parallelism=2 deployment because there aren't enough resources.

Resources
---------------------------------------------------------------
Usage:
 17.0/48.0 CPU
 4.0/4.0 GPU
 0B/104.83GiB memory
 44B/48.92GiB object_store_memory

Demands:
 {'GPU': 1.0} * 2 (PACK): 1+ pending placement groups

@smallmocha

You should load the model outside the function so that it is only loaded once:

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",
    tensor_parallel_size=2,
    trust_remote_code=True,
    load_format="pt")

def process_prompts(prompts):
    sampling_params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=500)
    return llm.generate(prompts, sampling_params)

prompt_batch_1 = ["Hello, my name is", "The president of the United States is"]
prompt_batch_2 = ["The capital of France is", "The future of AI is"]

batch_1_output = process_prompts(prompt_batch_1)
batch_2_output = process_prompts(prompt_batch_2)


Dolfik1 commented Jan 11, 2024

TLDR: Don't set the num_gpus value for vLLM, only set tensor_parallel_size.

I encountered the same problem, and here's what I found out:

  1. According to Ray's documentation, the framework itself allocates the necessary GPUs (based on num_gpus) and sets CUDA_VISIBLE_DEVICES. When I started running vLLM through Ray, I found that Ray set CUDA_VISIBLE_DEVICES to 0,1,2,3 (I had num_gpus = 4 specified); however, nvidia-smi showed that vLLM was using 4,5,6,7. So vLLM ignores the CUDA_VISIBLE_DEVICES value and chooses other devices.
     In my case, I have 8 GPUs: I allocated 4 for vLLM and 2 for another model, leaving 2 free. But vLLM requested another 4 GPUs, and since Ray couldn't satisfy this request, vLLM started waiting for GPUs to free up. As soon as I removed the second model, which requested 2 GPUs, everything started working. Everything also worked when I allocated 2 GPUs for the vLLM model.

  2. If you try to run two identical applications using vLLM through Ray (in one Serve instance), everything breaks. The applications will not use different GPUs; they will start loading data into the memory of the same GPUs while the other GPUs sit idle. Ultimately, this leads to OOM. I believe this is related to incorrect handling of the num_gpus value.

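To make the TLDR concrete, here is a minimal Ray Serve sketch of the pattern described above (illustrative only; the class name, request handling, and model are my own, not from this thread): the vLLM deployment reserves no GPUs for its replica actor and lets tensor_parallel_size drive GPU allocation through vLLM's own placement group.

from ray import serve
from vllm import LLM, SamplingParams

@serve.deployment(ray_actor_options={"num_gpus": 0})  # don't reserve GPUs for the replica actor itself
class VLLMDeployment:
    def __init__(self):
        # vLLM creates its own Ray placement group for the two tensor-parallel workers.
        self.llm = LLM(model="meta-llama/Llama-2-70b-chat-hf", tensor_parallel_size=2)

    async def __call__(self, request):
        prompt = (await request.json())["prompt"]
        params = SamplingParams(temperature=0.0, max_tokens=200)
        return self.llm.generate([prompt], params)[0].outputs[0].text

app = VLLMDeployment.bind()
# serve.run(app)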


hwaking commented Jan 16, 2024

I just tried using 32 cores and it solved my problem.


paolovic commented Mar 11, 2024

I am still having the problem.
I want to deploy one model with tensor_parallel_size=2 (just 1 replica), one model with num_gpus=0.4 (with 2 replicas, so 0.8 GPUs in total), and one model with num_gpus=0.1 (with 1 replica).
In total this would require 2.9 GPUs, which should be fine, since I have 3 GPUs with sufficient VRAM available on this node alone.

ray status returns

Resources
---------------------------------------------------------------
Usage:
 10.0/24.0 CPU
 0.8999999999999999/4.0 GPU
 0B/200.20GiB memory
 44B/89.79GiB object_store_memory

Demands:
 {'CPU': 12.0}: 1+ pending tasks/actors

serve status returns

        message: 'Deployment ''vllmAPI'' in application ''ray vllm application''
          1 replicas that have taken more than 30s to be scheduled. This may be due
          to waiting for the cluster to auto-scale or for a runtime environment to
          be installed. Resources required for each replica: {"CPU": 12.0}, total
          resources available: {"CPU": 14.0}. Use `ray status` for more details.'

Edit: Problem solved.
As advised above, for the model with tensor_parallel_size=2 I set num_gpus=0.
For the others (the model with num_gpus=0.4 and 2 replicas, so 0.8 GPUs in total, and the model with num_gpus=0.1 and 1 replica), I additionally set CUDA_VISIBLE_DEVICES=0,1,2,3 (since I have 4 GPUs), and then everything was able to spin up properly.
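
For illustration, a Python sketch of the layout that ended up working, as described above (deployment names and replica counts are hypothetical, and the comment does not say exactly where CUDA_VISIBLE_DEVICES was set, so that part is shown only as a note):

# export CUDA_VISIBLE_DEVICES=0,1,2,3 in the environment that starts Ray/Serve (see the comment above).
from ray import serve

@serve.deployment(ray_actor_options={"num_gpus": 0})  # TP=2 model: vLLM's placement group claims the GPUs
class VllmAPI:
    ...

@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 0.4})  # 2 x 0.4 = 0.8 GPUs
class SmallModelA:
    ...

@serve.deployment(ray_actor_options={"num_gpus": 0.1})
class SmallModelB:
    ...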


premsa commented Mar 14, 2024

Fenkail's solution of setting the num_cpus parameter to a correct amount (i.e., 10 out of 10 available in my case) solved my problem. A fix for Slurm jobs:

num_cpus = int(os.environ.get('SLURM_CPUS_PER_TASK'))
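
Expanded into a slightly fuller sketch (assuming the job was launched with --cpus-per-task so the variable is set; the ray.init call stands in for wherever Ray is initialized in your setup, e.g. the patched ray_utils.py mentioned earlier):

import os

import ray

# Cap Ray at the Slurm allocation instead of letting it autodetect every core on the node.
num_cpus = int(os.environ.get('SLURM_CPUS_PER_TASK'))
ray.init(num_cpus=num_cpus, ignore_reinit_error=True)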


panxnan commented Apr 11, 2024

I also fixed the problem by setting the Ray CPU count to 32:
ray start --head --num-cpus=32
It also works when I set it to 49 (since I have two physical CPUs, each with 48 cores).

@shyringo

#1908 might be related, but in 'Offline Batched Inference' mode.

rkooo567 self-assigned this May 3, 2024
@Vincent-Li-9701

Hey folks, I had a similar issue; I'm running in offline inference mode. I was able to clear the resources with ray stop, but when I try to reload I get:

[2024-05-20 22:14:06,214 E 1068826 1069198] gcs_rpc_client.h:554: Failed to connect to GCS within 60 seconds. GCS may have been killed. It's either GCS is terminated by `ray stop` or is killed unexpectedly. If it is killed unexpectedly, see the log file gcs_server.out. https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#logging-directory-structure. The program will terminate.

Does anyone know how to properly restart?

DarkLight1337 added the bug (Something isn't working) label May 31, 2024

emirhanKural commented Jun 5, 2024

Hi @DarkLight1337, is there any update on this bug? I also have the same problem when reloading a model for API inference.

At first, when I run the API code, everything is fine and the model loads OK.

If I try to directly reload a model, I get:

2024-06-05 16:32:47,026 WARNING worker.py:1419 -- SIGTERM handler is not set because current thread is not the main thread.
2024-06-05 16:32:47,035 INFO worker.py:1564 -- Connecting to existing Ray cluster at address: 10.187.##.##:##...
2024-06-05 16:32:47,035 INFO worker.py:1582 -- Calling ray.init() again after it has already been called.

And nothing happens.

If I check ray status, shut down the Ray cluster, and reload a model, I get:

if ray.is_initialized():
    ray.shutdown()

new_model = AsyncLLMEngine.from_engine_args(engine_args, usage_context=UsageContext.API_SERVER)

2024-06-05 16:39:43,757 WARNING worker.py:1419 -- SIGTERM handler is not set because current thread is not the main thread.
2024-06-05 16:39:43,766 INFO worker.py:1564 -- Connecting to existing Ray cluster at address: 10.187.##.##:##...
2024-06-05 16:39:43,766 INFO worker.py:1582 -- Calling ray.init() again after it has already been called.
INFO 06-05 16:39:44 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: .......

It seems to connect and start loading the model again, but the load never completes and I get this error:

(RayWorkerWrapper pid=3562842) [E socket.cpp:957] [c10d] The client socket has timed out after 600s while trying to connect to (10.187.##.##:##..., 44163)

@DarkLight1337 (Member)

I was just triaging the issues. I'm not that involved with the use of Ray in vLLM so I won't be of much assistance here.

@DarkLight1337 (Member)

We have added documentation for this situation in #5430. Please take a look.
