feat: basic vllm support for hf cached models #2262

axel7083 · 2024-12-13T17:46:56Z

What does this PR do?

Quick POC showing the integration of vLLM in AI Lab.

vLLM only supports .safetensors models, while AI Lab only supports GGUF models. This PR shows how the amazing @huggingface/hub library can be used to read downloaded models from the Hugging Face cache and mount them to create an inference server.
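To illustrate what reading the cache involves, here is a minimal sketch of how a Hugging Face cache folder name maps back to a repository id, and where a snapshot to mount would live. It relies only on the documented cache layout (`<cache>/models--<org>--<name>/snapshots/<revision>`). The helper names `parseCacheFolder` and `snapshotPath` are hypothetical and are not taken from this PR or from @huggingface/hub.

```typescript
// Hypothetical helpers for illustration; only the folder layout comes from
// the Hugging Face cache convention.
type RepoType = 'model' | 'dataset' | 'space';

interface CachedRepo {
  type: RepoType;
  id: string; // e.g. "Qwen/Qwen2.5-0.5B-Instruct"
}

// Folder prefix used in the cache for each repository type.
const PREFIXES = new Map<string, RepoType>([
  ['models', 'model'],
  ['datasets', 'dataset'],
  ['spaces', 'space'],
]);

// Turn a cache folder name such as "models--Qwen--Qwen2.5-0.5B-Instruct"
// into a repo type and id. Returns undefined for unrelated folders
// (for example ".locks").
function parseCacheFolder(folder: string): CachedRepo | undefined {
  const parts = folder.split('--');
  const type = PREFIXES.get(parts[0] ?? '');
  const rest = parts.slice(1);
  if (!type || rest.length === 0 || rest.some(p => p.length === 0)) {
    return undefined;
  }
  // "--" in the folder name stands for "/" in the repo id.
  return { type, id: rest.join('/') };
}

// Path of one snapshot (revision) inside the cache. This is the directory
// that would be mounted into the inference container.
function snapshotPath(cacheDir: string, folder: string, revision: string): string {
  return [cacheDir, folder, 'snapshots', revision].join('/');
}
```

With this sketch, `parseCacheFolder('models--Qwen--Qwen2.5-0.5B-Instruct')` yields the id `Qwen/Qwen2.5-0.5B-Instruct` with type `model`. The actual PR delegates cache scanning to @huggingface/hub rather than parsing folder names by hand.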

I built and pushed an image myself: quay.io/rh-ee-astefani/vllm:cpu-1734105797. It was built following the instructions in https://docs.vllm.ai/en/latest/getting_started/cpu-installation.html#quick-start-using-dockerfile

Screenshot / video of UI

What issues does this PR fix or reference?

How to test this PR?

Install the Hugging Face CLI tool (see Installation).
Run huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct (this downloads the model into the local cache).
Start or restart AI Lab; the Hugging Face models are loaded on startup.
Assert that Qwen/Qwen2.5-0.5B-Instruct is in the imported section of the models catalog.
Click on Create Model service.
Click on Create Service.
Assert that the service is created.
Open the Podman Desktop containers page.
Assert that your cpu-vllm container is running.
Wait for the server to be up and running (this can take a few minutes).
Check the models list.
Check a basic request.
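
The last two checks can be sketched as follows, assuming the container exposes vLLM's OpenAI-compatible API (`/v1/models` and `/v1/chat/completions`). The base URL below is an assumption; use the host port that Podman Desktop shows for the cpu-vllm container. The function names are hypothetical, and the sketch expects a runtime with a global `fetch` (Node 18 or later).

```typescript
// Assumption: the server is published on localhost:8000. Replace this with
// the port mapped for your cpu-vllm container.
const BASE_URL = 'http://localhost:8000';

interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

interface ChatRequest {
  model: string;
  messages: ChatMessage[];
  max_tokens: number;
}

// Strip trailing slashes so that joined paths never contain "//".
function trimBase(baseUrl: string): string {
  return baseUrl.replace(/\/+$/, '');
}

function modelsUrl(baseUrl: string): string {
  return `${trimBase(baseUrl)}/v1/models`;
}

// Build a minimal chat completion payload with a single user message.
function buildChatRequest(model: string, prompt: string, maxTokens = 64): ChatRequest {
  return { model, messages: [{ role: 'user', content: prompt }], max_tokens: maxTokens };
}

// "Check the models list": returns the ids reported by the server.
async function listModels(baseUrl: string): Promise<string[]> {
  const res = await fetch(modelsUrl(baseUrl));
  if (!res.ok) throw new Error(`models request failed: ${res.status}`);
  const body = (await res.json()) as { data: { id: string }[] };
  return body.data.map(m => m.id);
}

// "Check a basic request": sends one prompt and returns the reply text.
async function chat(baseUrl: string, request: ChatRequest): Promise<string> {
  const res = await fetch(`${trimBase(baseUrl)}/v1/chat/completions`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(request),
  });
  if (!res.ok) throw new Error(`chat request failed: ${res.status}`);
  const body = (await res.json()) as { choices: { message: { content: string } }[] };
  return body.choices[0]?.message.content ?? '';
}
```

A reasonable flow is to call `listModels(BASE_URL)` first and pass one of the returned ids to `buildChatRequest`, since the id the server reports depends on how the model was mounted and served.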

Signed-off-by: axel7083 <[email protected]>

feat: basic vllm support for hf cached models

0af8659

Signed-off-by: axel7083 <[email protected]>
