Skip to content

Latest commit

 

History

History

llamacpp_python

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Llamacpp_Python Model Server

The llamacpp_python model server images are based on the llama-cpp-python project that provides python bindings for llama.cpp. This provides us with a python based and OpenAI API compatible model server that can run LLM's of various sizes locally across Linux, Windows or Mac.

This model server requires models to be converted from their original format, typically a set of *.bin or *.safetensor files into a single GGUF formatted file. Many models are available in GGUF format already on huggingface.co. You can also use the model converter utility available in this repo to convert models yourself.

Image Options

We currently provide 3 options for the llamacpp_python model server:

Base

The base image is the standard image that works for both arm64 and amd64 environments. However, it does not includes any hardware acceleration and will run with CPU only. If you use the base image, make sure that your container runtime has sufficient resources to run the desired model(s).

To build the base model service image:

make -f Makefile build

To pull the base model service image:

podman pull quay.io/ai-lab/llamacpp_python

Cuda

The Cuda image include all the extra drivers necessary to run our model server with Nvidia GPUs. This will significant speed up the models response time over CPU only deployments.

To Build the the Cuda variant image:

make -f Makefile build-cuda

To pull the base model service image:

podman pull quay.io/ai-lab/llamacpp_python_cuda

IMPORTANT!

To run the Cuda image with GPU acceleration, you need to install the correct Cuda drivers for your system along with the Nvidia Container Toolkit. Please use the links provided to find installation instructions for your system.

Once those are installed you can use the container toolkit CLI to discover your Nvidia device(s).

sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

Finally, you will also need to add --device nvidia.com/gpu=all to your podman run command so your container can access the GPU.

Vulkan (experimental)

The Vulkan image (amd64/arm64) is experimental, but can be used for gaining partial GPU access on an M-series Mac, significantly speeding up model response time over a CPU only deployment. This image requires that your podman machine provider is "applehv" and that you use krunkit instead of vfkit. Since these tools are not currently supported by podman desktop this image will remain "experimental".

To build the Vulkan model service variant image:

System Architecture Command
amd64 make -f Makefile build-vulkan-amd64
arm64 make -f Makefile build-vulkan-arm64

To pull the base model service image:

podman pull quay.io/ai-lab/llamacpp_python_vulkan

Download Model(s)

There are many models to choose from these days, most of which can be found on huggingface.co. In order to use a model with the llamacpp_python model server, it must be in GGUF format. You can either download pre-converted GGUF models directly or convert them yourself with the model converter utility available in this repo.

A well performant Apache-2.0 licensed models that we recommend using if you are just getting started is granite-7b-lab. You can use the link below to quickly download a quantized (smaller) GGUF version of this model for use with the llamacpp_python model server.

Download URL: https://huggingface.co/instructlab/granite-7b-lab-GGUF/resolve/main/granite-7b-lab-Q4_K_M.gguf

Place all models in the models directory.

You can use this snippet below to download the default model:

make -f Makefile download-model-granite

Or you can use the generic download-models target from the /models directory to download any model file from huggingface:

cd ../../models
make MODEL_NAME=<model_name> MODEL_URL=<model_url> -f  Makefile download-model
# EX: make MODEL_NAME=granite-7b-lab-Q4_K_M.gguf MODEL_URL=https://huggingface.co/instructlab/granite-7b-lab-GGUF/resolve/main/granite-7b-lab-Q4_K_M.gguf -f  Makefile download-model

Deploy Model Service

Single Model Service:

To deploy the LLM server you must specify a volume mount -v where your models are stored on the host machine and the MODEL_PATH for your model of choice. The model_server is most easily deploy from calling the make command: make -f Makefile run. Of course as with all our make calls you can pass any number of the following variables: REGISTRY, IMAGE_NAME, MODEL_NAME, MODEL_PATH, and PORT.

podman run --rm -it \
  -p 8001:8001 \
  -v Local/path/to/locallm/models:/locallm/models:ro \
  -e MODEL_PATH=models/granite-7b-lab-Q4_K_M.gguf \
  -e HOST=0.0.0.0 \
  -e PORT=8001 \
  -e MODEL_CHAT_FORMAT=openchat \
  llamacpp_python

or with Cuda image

podman run --rm -it \
  --device nvidia.com/gpu=all \
  -p 8001:8001 \
  -v Local/path/to/locallm/models:/locallm/models:ro \
  -e MODEL_PATH=models/granite-7b-lab-Q4_K_M.gguf \
  -e HOST=0.0.0.0 \
  -e PORT=8001 \
  -e MODEL_CHAT_FORMAT=openchat \
  llamacpp_python

Multiple Model Service:

To enable dynamic loading and unloading of different models present on your machine, you can start the model service with a CONFIG_PATH instead of a MODEL_PATH.

Here is an example models_config.json with two model options.

{
    "host": "0.0.0.0",
    "port": 8001,
    "models": [
        {
            "model": "models/granite-7b-lab-Q4_K_M.gguf",
            "model_alias": "granite",
            "chat_format": "openchat",
        },
        {
            "model": "models/merlinite-7b-lab-Q4_K_M.gguf",
            "model_alias": "merlinite",
            "chat_format": "openchat",
        },

    ]
}

Now run the container with the specified config file.

podman run --rm -it -d \
        -p 8001:8001 \
        -v Local/path/to/locallm/models:/locallm/models:ro \
        -e CONFIG_PATH=models/<config-filename> \
        llamacpp_python

DEV environment

The environment is implemented with devcontainer technology.

Running tests

make -f Makefile test