feat: upgrade vllm backend and refactor deployment #854

Draft · wants to merge 115 commits into main from 835-upgrade-vllm-for-gptq-bfloat16-inferencing
Conversation

@justinthelaw (Contributor) commented Jul 30, 2024

OVERVIEW

See #835 for more details on rationale and findings. Also related to #623.

IMPORTANT NOTE: upstream vLLM still has open AsyncLLMEngineDead and RoPE scaling + Ray issues that may prevent us from upgrading past 0.4.x.

BREAKING CHANGES:

  • moves all environment variables specific to the LeapfrogAI SDK to a ConfigMap, mounted via volumeMount for runtime injection and modification (see the sketch after this list)
    • in local dev, this is defined via config.yaml
  • moves all environment variables specific to vLLM to a ConfigMap, injected via envFrom for runtime injection and modification
    • in local dev, this is defined via .env
  • ZARF_CONFIG is used to define create-time and deploy-time variables (e.g., MODEL_REPO_ID, ENFORCE_EAGER)
    • allows delivery engineers to declaratively define the backend configuration and model
  • re-introduces the LeapfrogAI SDK config.yaml configuration method for local development and testing
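
A minimal sketch of the two injection paths described above: SDK settings read from the volume-mounted config.yaml, and vLLM settings read from environment variables injected via envFrom (a .env file locally). The file path, environment-variable name LFAI_CONFIG_PATH, and dictionary keys are illustrative assumptions, not the SDK's actual loader.

```python
# Sketch only (not the actual LeapfrogAI SDK loader): shows how a backend
# process could consume the two configuration sources described above.
import os
import yaml  # requires pyyaml

# SDK-specific settings arrive as a ConfigMap mounted at a file path;
# locally this is the repo's config.yaml. LFAI_CONFIG_PATH is hypothetical.
CONFIG_PATH = os.environ.get("LFAI_CONFIG_PATH", "config.yaml")

def load_sdk_config(path: str = CONFIG_PATH) -> dict:
    """Read SDK settings from the volume-mounted config.yaml (empty if absent)."""
    if not os.path.exists(path):
        return {}
    with open(path, "r") as f:
        return yaml.safe_load(f) or {}

def load_vllm_env() -> dict:
    """vLLM settings arrive as plain environment variables via envFrom
    (locally, a .env file). The keys shown are illustrative only."""
    return {
        "model_repo_id": os.environ.get("MODEL_REPO_ID"),
        "enforce_eager": os.environ.get("ENFORCE_EAGER", "False") == "True",
        "quantization": os.environ.get("QUANTIZATION"),
    }

if __name__ == "__main__":
    print(load_sdk_config())
    print(load_vllm_env())
```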

CHANGES:

  • updates docs for running vLLM locally
  • updates to Python 3.11.9 to align with the Registry1 base image
  • upgrades vLLM from 0.4.2 to 0.4.3, which:
    • adds BFloat16 quantized model support (the "why is this important" here)
  • exposes more backend configurations for the Zarf build (see packages/vllm/README.md)
    • adds the full set of QUANTIZATION options to the existing configuration field
    • exposes everything via Zarf variables and the values files
  • removes default vLLM engine configurations from the Dockerfile (no duplication or hardcoding)
  • fixes an issue where request params (e.g., temperature) were not used when generating the response
    • uses the request object received from the LeapfrogAI SDK for inferencing params
    • gracefully handles naming differences between vLLM and SDK params during generation (see the sketch after this list)
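
A rough illustration of the request-param fix referenced above: build vLLM's SamplingParams from the request object the SDK hands over, translating field names where the two sides disagree. The SDK-style field names (e.g. "max_new_tokens") are assumptions for illustration; the SamplingParams arguments are vLLM's real API.

```python
# Sketch only (not the PR's exact code): map SDK-style request fields onto
# vLLM's SamplingParams so values like temperature actually reach generation.
from vllm import SamplingParams

def to_sampling_params(request: dict) -> SamplingParams:
    """Translate an SDK-style request dict into vLLM SamplingParams."""
    return SamplingParams(
        temperature=request.get("temperature", 1.0),
        top_p=request.get("top_p", 1.0),
        max_tokens=request.get("max_new_tokens", 256),  # SDK name -> vLLM name (assumed)
        stop=request.get("stop") or None,
    )

# Example: the caller's temperature is no longer silently dropped.
params = to_sampling_params({"temperature": 0.2, "max_new_tokens": 512})
```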

ADDITIONAL CONTEXT

The default model is still Synthia-7b until #976 is resolved. The description below is only being kept for future context:

The new model option, defenseunicorns/Hermes-2-Pro-Mistral-7B-4bit-32g, takes ~4.16 GB to load into RAM or vRAM. At 6 GB of vRAM, the max_context_length variable has to be reduced to ~400 tokens; at 8 GB, ~10K tokens; and at 12 GB, ~15K tokens. In all cases, the max_gpu_utilization variable must be set to 0.99 to maximize the KV cache available for the context length reserved at inference (a sketch of the corresponding engine arguments follows below).

To achieve the model's maximum context length (~32K tokens), 16 GB of vRAM is required.

Reaching the defined max_context_length during a completion or chat triggers vLLM's automatic sliding-window handling, which significantly degrades the quality of the final responses.
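
A minimal sketch of how those settings map onto vLLM's engine arguments. The constructor keywords are vLLM's real API; the values are the ones quoted above for a 6 GB card and are illustrative, not a recommended configuration.

```python
# Sketch only: the trade-off described above. A higher gpu_memory_utilization
# leaves more room for the KV cache, and max_model_len must shrink to fit the
# available vRAM (~400 tokens at 6 GB; ~32K tokens needs 16 GB).
from vllm import LLM

llm = LLM(
    model="defenseunicorns/Hermes-2-Pro-Mistral-7B-4bit-32g",
    quantization="gptq",          # 4-bit GPTQ weights, ~4.16 GB to load
    gpu_memory_utilization=0.99,  # max out memory available for the KV cache
    max_model_len=400,            # context length that fits in ~6 GB of vRAM
    enforce_eager=True,
)
```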

@justinthelaw (Contributor, Author) commented Sep 17, 2024

Blocked by #1038; wait until that is merged, then re-test this branch locally with the new vLLM E2E.
