feat: basic vllm support for hf cached models #2262

Draft · wants to merge 1 commit into main

Conversation

axel7083 (Contributor)

What does this PR do?

Quick POC showing the integration of vLLM into AI Lab.

vLLM only supports .safetensors models, while AI Lab only supports GGUF models. This PR shows how the amazing @huggingface/hub library can be used to read the downloaded models from the Hugging Face cache and mount them to create an inference server.
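
As a rough illustration of the cache-reading part, here is a minimal sketch (not the PR's actual code) that walks the well-known Hugging Face cache layout (`~/.cache/huggingface/hub/models--<org>--<name>/snapshots/<revision>`) with `node:fs`; the PR itself relies on the @huggingface/hub library for this.

```typescript
import { promises as fs } from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Sketch only: list models present in the local Hugging Face cache and the
// snapshot directory that could be mounted into a vLLM container.
interface CachedModel {
  id: string;          // e.g. "Qwen/Qwen2.5-0.5B-Instruct"
  snapshotDir: string; // directory containing the .safetensors files
}

async function listCachedModels(
  cacheDir = path.join(os.homedir(), ".cache", "huggingface", "hub"),
): Promise<CachedModel[]> {
  const models: CachedModel[] = [];
  for (const entry of await fs.readdir(cacheDir).catch(() => [] as string[])) {
    // Cached model repos are stored as "models--<org>--<name>".
    if (!entry.startsWith("models--")) continue;
    const id = entry.replace("models--", "").split("--").join("/");
    const snapshots = path.join(cacheDir, entry, "snapshots");
    const revisions = await fs.readdir(snapshots).catch(() => [] as string[]);
    if (revisions.length === 0) continue;
    models.push({ id, snapshotDir: path.join(snapshots, revisions[0]) });
  }
  return models;
}
```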

I built and pushed an image myself, quay.io/rh-ee-astefani/vllm:cpu-1734105797. This image was built following the instructions in https://docs.vllm.ai/en/latest/getting_started/cpu-installation.html#quick-start-using-dockerfile
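
For the "mount and serve" part, here is a hedged sketch of what starting the inference container could look like. It uses dockerode purely as a stand-in for the container API AI Lab actually uses; the port (8000, vLLM's default) and the assumption that the image's entrypoint is vLLM's OpenAI-compatible server are illustrative, not taken from this PR.

```typescript
import Docker from "dockerode";

// Hypothetical helper: mount a cached snapshot directory into the vLLM CPU
// image and start it. dockerode stands in for AI Lab's container engine API.
async function startVllmContainer(snapshotDir: string): Promise<void> {
  const docker = new Docker();
  const container = await docker.createContainer({
    Image: "quay.io/rh-ee-astefani/vllm:cpu-1734105797",
    // Assumes the image's entrypoint is vLLM's OpenAI-compatible API server.
    Cmd: ["--model", "/models/snapshot"],
    ExposedPorts: { "8000/tcp": {} },
    HostConfig: {
      Binds: [`${snapshotDir}:/models/snapshot:ro`],
      PortBindings: { "8000/tcp": [{ HostPort: "8000" }] },
    },
  });
  await container.start();
}
```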

Screenshot / video of UI


What issues does this PR fix or reference?

How to test this PR?

  1. Install the huggingface-cli tool (see Installation).
  2. Run huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct (it will download a model).
  3. Start/restart AI Lab; the Hugging Face models are loaded on startup.
  4. Assert that Qwen/Qwen2.5-0.5B-Instruct is in the imported section of the models catalog.
  5. Click on Create Model service.
  6. Click on Create Service.
  7. Assert the service is created.
  8. Open the Podman Desktop Containers page.
  9. Assert your cpu-vllm container is running.
  10. Wait for the server to be up and running (this can take a few minutes).
  11. Check the models list (see the sketch after this list).
  12. Check a basic request (see the sketch after this list).
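
Steps 11 and 12 can also be scripted against vLLM's OpenAI-compatible API. A small sketch, assuming the service is published on localhost:8000 (vLLM's default port); adjust the endpoint to whatever the created service actually exposes:

```typescript
// Assumed endpoint: adjust host/port to match the created service.
const base = "http://localhost:8000/v1";

// Step 11: list the models served by the container.
const models = await fetch(`${base}/models`).then(r => r.json());
console.log(models.data.map((m: { id: string }) => m.id));

// Step 12: basic chat completion request against the first served model.
const completion = await fetch(`${base}/chat/completions`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: models.data[0].id,
    messages: [{ role: "user", content: "Say hello in one short sentence." }],
    max_tokens: 32,
  }),
}).then(r => r.json());
console.log(completion.choices[0].message.content);
```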
