feat: upgrade vllm backend and refactor deployment #854

Draft · wants to merge 115 commits into main from 835-upgrade-vllm-for-gptq-bfloat16-inferencing
Conversation

@justinthelaw (Contributor) commented Jul 30, 2024

OVERVIEW

See #835 for more details on rationale and findings. Also related to #623.

IMPORTANT NOTE: upstream vLLM still has open AsyncLLMEngineDead and RoPE scaling + Ray issues that may prevent us from upgrading past 0.4.x.

BREAKING CHANGES:

  • moves all environment variables specific to the LeapfrogAI SDK to a ConfigMap, mounted via volumeMount for runtime injection and modification (see the sketch after this list)
    • in local dev, this is defined via config.yaml
  • moves all environment variables specific to vLLM to a ConfigMap, injected via envFrom for runtime injection and modification
    • in local dev, this is defined via .env
  • ZARF_CONFIG is used to define create-time and deploy-time variables (e.g., MODEL_REPO_ID, ENFORCE_EAGER)
    • allows delivery engineers to declaratively define the backend configuration and model
  • re-introduces the LeapfrogAI SDK config.yaml configuration method for local development and testing
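
A minimal sketch of the two injection paths described above: SDK settings read from the volume-mounted config.yaml, and vLLM settings read from environment variables injected via envFrom (a .env file locally). The file path, environment-variable name LFAI_CONFIG_PATH, and dictionary keys are illustrative assumptions, not the SDK's actual loader.

```python
# Sketch only (not the actual LeapfrogAI SDK loader): shows how a backend
# process could consume the two configuration sources described above.
import os
import yaml  # requires pyyaml

# SDK-specific settings arrive as a ConfigMap mounted at a file path;
# locally this is the repo's config.yaml. LFAI_CONFIG_PATH is hypothetical.
CONFIG_PATH = os.environ.get("LFAI_CONFIG_PATH", "config.yaml")

def load_sdk_config(path: str = CONFIG_PATH) -> dict:
    """Read SDK settings from the volume-mounted config.yaml (empty if absent)."""
    if not os.path.exists(path):
        return {}
    with open(path, "r") as f:
        return yaml.safe_load(f) or {}

def load_vllm_env() -> dict:
    """vLLM settings arrive as plain environment variables via envFrom
    (locally, a .env file). The keys shown are illustrative only."""
    return {
        "model_repo_id": os.environ.get("MODEL_REPO_ID"),
        "enforce_eager": os.environ.get("ENFORCE_EAGER", "False") == "True",
        "quantization": os.environ.get("QUANTIZATION"),
    }

if __name__ == "__main__":
    print(load_sdk_config())
    print(load_vllm_env())
```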

CHANGES:

  • updates docs for running vLLM locally
  • updates to Python 3.11.9 to align with the Registry1 base image
  • upgrades vLLM from 0.4.2 to 0.4.3, which:
    • adds BFloat16 quantized model support (the "why is this important" here)
  • exposes more backend configurations for the Zarf build (see packages/vllm/README.md)
    • adds the full set of QUANTIZATION options to the existing configuration field
    • exposes everything via Zarf variables and the values files
  • removes default vLLM engine configurations from the Dockerfile (no duplication or hardcoding)
  • fixes an issue where request params (e.g., temperature) were not used when generating the response
    • uses the request object received from the LeapfrogAI SDK for inferencing params
    • gracefully handles naming differences between vLLM and SDK params during generation (see the sketch after this list)
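
A rough illustration of the request-param fix referenced above: build vLLM's SamplingParams from the request object the SDK hands over, translating field names where the two sides disagree. The SDK-style field names (e.g. "max_new_tokens") are assumptions for illustration; the SamplingParams arguments are vLLM's real API.

```python
# Sketch only (not the PR's exact code): map SDK-style request fields onto
# vLLM's SamplingParams so values like temperature actually reach generation.
from vllm import SamplingParams

def to_sampling_params(request: dict) -> SamplingParams:
    """Translate an SDK-style request dict into vLLM SamplingParams."""
    return SamplingParams(
        temperature=request.get("temperature", 1.0),
        top_p=request.get("top_p", 1.0),
        max_tokens=request.get("max_new_tokens", 256),  # SDK name -> vLLM name (assumed)
        stop=request.get("stop") or None,
    )

# Example: the caller's temperature is no longer silently dropped.
params = to_sampling_params({"temperature": 0.2, "max_new_tokens": 512})
```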

ADDITIONAL CONTEXT

The default model is still Synthia-7b until #976 is resolved. The description below is only being kept for future context:

The new model option, defenseunicorns/Hermes-2-Pro-Mistral-7B-4bit-32g, takes ~4.16 GB to load into RAM or vRAM. At 6 GB of vRAM, the max_context_length variable has to be reduced to ~400 tokens; at 8 GB, ~10K tokens; and at 12 GB, ~15K tokens. In all cases, the max_gpu_utilization variable must be set to 0.99 to maximize the KV cache available for the context length reserved at inference (a sketch of the corresponding engine arguments follows below).

To achieve the model's maximum context length (~32K tokens), 16 GB of vRAM is required.

Reaching the defined max_context_length during a completion or chat triggers vLLM's automatic sliding-window handling, which significantly degrades the quality of the final responses.
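
A minimal sketch of how those settings map onto vLLM's engine arguments. The constructor keywords are vLLM's real API; the values are the ones quoted above for a 6 GB card and are illustrative, not a recommended configuration.

```python
# Sketch only: the trade-off described above. A higher gpu_memory_utilization
# leaves more room for the KV cache, and max_model_len must shrink to fit the
# available vRAM (~400 tokens at 6 GB; ~32K tokens needs 16 GB).
from vllm import LLM

llm = LLM(
    model="defenseunicorns/Hermes-2-Pro-Mistral-7B-4bit-32g",
    quantization="gptq",          # 4-bit GPTQ weights, ~4.16 GB to load
    gpu_memory_utilization=0.99,  # max out memory available for the KV cache
    max_model_len=400,            # context length that fits in ~6 GB of vRAM
    enforce_eager=True,
)
```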

@justinthelaw (Contributor, Author) commented Sep 17, 2024

Blocked by #1038; wait until that is merged, then re-test this branch locally with the new vLLM E2E.
