From 613bec487f134b2ef655620dc20893a3bbe83de5 Mon Sep 17 00:00:00 2001
From: Michael Clifford
Date: Tue, 9 Apr 2024 15:13:11 -0400
Subject: [PATCH] update llamacpp_python docs (#209)

Signed-off-by: Michael Clifford
---
 model_servers/llamacpp_python/README.md      | 88 ++++++++++++++++----
 model_servers/llamacpp_python/cuda/README.md | 12 ---
 2 files changed, 70 insertions(+), 30 deletions(-)
 delete mode 100644 model_servers/llamacpp_python/cuda/README.md

diff --git a/model_servers/llamacpp_python/README.md b/model_servers/llamacpp_python/README.md
index 943f4440..bb030e4f 100644
--- a/model_servers/llamacpp_python/README.md
+++ b/model_servers/llamacpp_python/README.md
@@ -1,33 +1,75 @@
-### Build Model Service
+# Llamacpp_Python Model Server
 
-For the standard model service image:
+The llamacpp_python model server images are based on the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) project, which provides Python bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp). This provides us with a Python-based, OpenAI API compatible model server that can run LLMs of various sizes locally across Linux, Windows, or Mac.
+
+This model server requires models to be converted from their original format, typically a set of `*.bin` or `*.safetensors` files, into a single GGUF formatted file. Many models are already available in GGUF format on [huggingface.co](https://huggingface.co). You can also use the [model converter utility](../../convert_models/) available in this repo to convert models yourself.
+
+
+## Image Options
+
+We currently provide 3 options for the llamacpp_python model server:
+* [Base](#base)
+* [Cuda](#cuda)
+* [Vulkan (experimental)](#vulkan-experimental)
+
+### Base
+
+The [base image](../llamacpp_python/base/Containerfile) is the standard image that works for both arm64 and amd64 environments. However, it does not include any hardware acceleration and will run on CPU only. If you use the base image, make sure that your container runtime has sufficient resources to run the desired model(s).
+
+To build the base model service image:
 
 ```bash
 make -f Makefile build
 ```
+To pull the base model service image:
 
-For the Cuda variant image:
+```bash
+podman pull quay.io/ai-lab/llamacpp-python
+```
+
+
+### Cuda
+
+The [Cuda image](../llamacpp_python/cuda/Containerfile) includes all the extra drivers necessary to run our model server with Nvidia GPUs. This will significantly speed up the model's response time over CPU-only deployments.
+
+To build the Cuda variant image:
 
 ```bash
 make -f Makefile build-cuda
 ```
 
-For the Vulkan variant image:
+To pull the Cuda model service image:
+
+```bash
+podman pull quay.io/ai-lab/llamacpp-python-cuda
+```
+
+### Vulkan (experimental)
+
+The [Vulkan image](../llamacpp_python/vulkan/Containerfile) is experimental, but it can be used to gain partial GPU access on an M-series Mac, significantly speeding up model response time over a CPU-only deployment. This image requires that your podman machine provider is "applehv" and that you use krunkit instead of vfkit. Since these tools are not currently supported by Podman Desktop, this image will remain "experimental".
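+
+As a rough sketch of that prerequisite, the machine provider is typically selected before the podman machine is created. The exact mechanism varies between podman releases, so treat the environment variable below as an assumption and check the podman machine documentation for your version:
+
+```bash
+# Assumed workflow: choose the Apple hypervisor provider, then create and start the machine.
+export CONTAINERS_MACHINE_PROVIDER=applehv
+podman machine init
+podman machine start
+```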
+
+To build the Vulkan variant image:
 
 ```bash
 make -f Makefile build-vulkan
 ```
 
+To pull the Vulkan model service image:
+
+```bash
+podman pull quay.io/ai-lab/llamacpp-python-vulkan
+```
 
-### Download Model
+## Download Model(s)
 
-At the time of this writing, 2 models are known to work with this service
 
-- **Llama2-7b**
-  - Download URL: [https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_S.gguf](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_S.gguf)
-- **Mistral-7b**
-  - Download URL: [https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf)
+There are many models to choose from these days, most of which can be found on [huggingface.co](https://huggingface.co). In order to use a model with the llamacpp_python model server, it must be in GGUF format. You can either download pre-converted GGUF models directly or convert them yourself with the [model converter utility](../../convert_models/) available in this repo.
 
-It is suggested you place models in the [models](../../models/) directory. As for retrieving them, either use `wget` to download them with the download links above, or call the model names from the Makefile.
+One of the more popular Apache-2.0 licensed models that we recommend using if you are just getting started is `mistral-7b-instruct-v0.1`. You can use the link below to quickly download a quantized (smaller) GGUF version of this model for use with the llamacpp_python model server.
+
+Download URL: [https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf)
+
+Place all models in the [models](../../models/) directory.
+
+You can use the snippet below to download models.
 
 ```bash
 cd ../../models
@@ -42,13 +84,23 @@ make -f Makefile download-model-mistral
 make -f Makefile download-model-llama
 ```
 
-### Deploy Model Service
+## Deploy Model Service
 
-#### Single Model Service:
+### Single Model Service:
 
-Deploy the LLM server and volume mount the model of choice using the `MODEL_PATH` environment variable. The model_server is most easily deploy from calling the make command: `make -f Makefile run`
+To deploy the LLM server, you must specify a volume mount `-v` pointing to where your models are stored on the host machine, along with the `MODEL_PATH` for your model of choice. The model server is most easily deployed by calling the make command: `make -f Makefile run`. Alternatively, you can run the container directly with `podman run`:
+
+```bash
+podman run --rm -it \
+    -p 8001:8001 \
+    -v Local/path/to/locallm/models:/locallm/models:ro \
+    -e MODEL_PATH=models/mistral-7b-instruct-v0.1.Q4_K_M.gguf \
+    -e HOST=0.0.0.0 \
+    -e PORT=8001 \
+    llamacpp_python
+```
 
-#### Multiple Model Service:
+### Multiple Model Service:
 
 To enable dynamic loading and unloading of different models present on your machine, you can start the model service with a `CONFIG_PATH` instead of a `MODEL_PATH`.
 
@@ -74,14 +126,14 @@ Here is an example `models_config.json` with two quantization variants of mistra
 }
 ```
 
-Now run the container with the specified config file. Note: the following command runs with linux bind mount options, for Darwin remove the `,Z` from the volume directive.
+Now run the container with the specified config file.
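+
+If you are unsure what such a config file looks like, the sketch below writes a minimal two-model `models_config.json` into the models directory that gets volume mounted into the container. The field names follow llama-cpp-python's server settings and the second model path is purely hypothetical, so adjust both to match your llama-cpp-python version and the GGUF files you actually downloaded:
+
+```bash
+# Illustrative only: write a config listing two models for dynamic loading.
+# "model" is the GGUF path as seen from the server's working directory;
+# "model_alias" is the name clients use to select that model via the OpenAI-compatible API.
+cat > models/models_config.json << 'EOF'
+{
+  "host": "0.0.0.0",
+  "port": 8001,
+  "models": [
+    {
+      "model": "models/mistral-7b-instruct-v0.1.Q4_K_M.gguf",
+      "model_alias": "mistral"
+    },
+    {
+      "model": "models/mistral-7b-instruct-v0.1.Q5_K_M.gguf",
+      "model_alias": "mistral-q5"
+    }
+  ]
+}
+EOF
+```
+
+With a config file like that in place, start the container: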
 ```bash
 podman run --rm -it -d \
     -p 8001:8001 \
-    -v Local/path/to/locallm/models:/locallm/models:ro,Z \
+    -v Local/path/to/locallm/models:/locallm/models:ro \
     -e CONFIG_PATH=models/ \
-    playground
+    llamacpp_python
 ```
 
 ### DEV environment
diff --git a/model_servers/llamacpp_python/cuda/README.md b/model_servers/llamacpp_python/cuda/README.md
deleted file mode 100644
index 76855d1d..00000000
--- a/model_servers/llamacpp_python/cuda/README.md
+++ /dev/null
@@ -1,12 +0,0 @@
-### Rebuild for x86
-
-If you are on a Mac, you'll need to rebuild the model-service image for the x86 architecture for most use case outside of Mac.
-Since this is an AI workload, you may also want to take advantage of Nvidia GPU's available outside our local machine.
-If so, build the model-service with a base image that contains CUDA and builds llama.cpp specifically for a CUDA environment.
-
-```bash
-cd chatbot/model_services/cuda
-podman build --platform linux/amd64 -t chatbot:service-cuda -f cuda/Containerfile .
-```
-
-The CUDA environment significantly increases the size of the container image.
\ No newline at end of file