diff --git a/packages/backend/src/ai.json b/packages/backend/src/ai.json index 1dfbaeb37..8dc3375aa 100644 --- a/packages/backend/src/ai.json +++ b/packages/backend/src/ai.json @@ -13,8 +13,8 @@ "config": "chatbot-langchain/ai-studio.yaml", "readme": "# Chat Application\n\nThis model service is intended to be used as the basis for a chat application. It is capable of having arbitrarily long conversations\nwith users and retains a history of the conversation until it reaches the maximum context length of the model.\nAt that point, the service will remove the earliest portions of the conversation from its memory.\n\nTo use this model service, please follow the steps below:\n\n* [Download Model](#download-models)\n* [Build Image](#build-the-image)\n* [Run Image](#run-the-image)\n* [Interact with Service](#interact-with-the-app)\n* [Deploy on OpenShift](#deploy-on-openshift)\n\n## Build and Deploy Locally\n\n### Download model(s)\n\nThe two models that we have tested and recommend for this example are Llama 2 and Mistral. The locations of the GGUF variants\nare listed below:\n\n* Llama 2 - https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/tree/main\n* Mistral - https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/tree/main\n\n_For a full list of supported model variants, please see the \"Supported models\" section of the\n[llama.cpp repository](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#description)._\n\nThis example assumes that the developer already has a copy of the model they would like to use downloaded onto their host machine and placed in the `/models` directory of this repo.\n\nThis can be accomplished with:\n\n```bash\ncd models\nwget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_S.gguf\ncd ../\n```\n\n## Deploy from Local Container\n\n### Build the image\n\nBuild the `model-service` image.\n\n```bash\ncd chatbot/model_services\npodman build -t chatbot:service -f base/Containerfile .\n```\n\nAfter the image is created, it should be run with the model mounted as a volume, as shown below.\nThis prevents large model files from being baked into the container image, which can cause a significant slowdown\nwhen transporting the images. If a model-service image must contain the model,\nthe Containerfiles can be modified to copy the model into the image.\n\nWhen running the model-service image, in addition to the volume-mounted model file, an environment variable, $MODEL_PATH,\nshould be set at runtime. If it is not set, the default location where the service expects a model is\n`/locallm/models/llama-2-7b-chat.Q5_K_S.gguf` inside the running container.
This file can be downloaded from the URL\n`https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_S.gguf`.\n\n### Run the image\n\nOnce the model service image is built, it can be run with the following, assuming we want to mount the model `llama-2-7b-chat.Q5_K_S.gguf`:\n\n```bash\nexport MODEL_FILE=llama-2-7b-chat.Q5_K_S.gguf\npodman run --rm -d -it \\\n  -v /local/path/to/$MODEL_FILE:/locallm/models/$MODEL_FILE:Z \\\n  --env MODEL_PATH=/locallm/models/$MODEL_FILE \\\n  -p 7860:7860 \\\n  chatbot:service\n```\n\n### Interact with the app\n\nYou can now interact with the service by going to `0.0.0.0:7860` in your browser.\n\n![](https://raw.githubusercontent.com/redhat-et/locallm/main/assets/app.png)\n\nYou can also use the example [chatbot/ai_applications/ask.py](ask.py) to interact with the model-service in a terminal.\nIf the `--prompt` argument is left blank, it will default to \"Hello\".\n\n```bash\ncd chatbot/ai_applications\n\npython ask.py --prompt\n```\n\nOr, you can build `ask.py` into a container image and run it alongside the model-service container, like so:\n\n```bash\ncd chatbot/ai_applications\npodman build -t chatbot -f builds/Containerfile .\npodman run --rm -d -it -p 8080:8080 chatbot # then interact with the application at 0.0.0.0:8080 in your browser\n```\n\n## Deploy on OpenShift\n\nNow that we've developed an application locally that leverages an LLM, we'll want to share it with a wider audience.\nLet's get it off our machine and run it on OpenShift.\n\n### Rebuild for x86\n\nIf you are on a Mac, you'll need to rebuild the model-service image for the x86 architecture for most use cases outside of a Mac.\nSince this is an AI workload, you may also want to take advantage of Nvidia GPUs available outside your local machine.\nIf so, build the model-service with a base image that contains CUDA and builds llama.cpp specifically for a CUDA environment.\n\n```bash\ncd chatbot/model_services\npodman build --platform linux/amd64 -t chatbot:service-cuda -f cuda/Containerfile .\n```\n\nThe CUDA environment significantly increases the size of the container image.\nIf you are not utilizing a GPU to run this application, you can create an image\nwithout the CUDA layers for an x86 architecture machine with the following:\n\n```bash\ncd chatbot/model_services\npodman build --platform linux/amd64 -t chatbot:service-amd64 -f base/Containerfile .\n```\n\n### Push to Quay\n\nOnce you log in to [quay.io](https://quay.io), you can push your own newly built version of this LLM application to your repository\nfor use by others.\n\n```bash\npodman login quay.io\n```\n\n```bash\npodman push localhost/chatbot:service-amd64 quay.io/<your-account>/<image-name>\n```\n\n### Deploy\n\nNow that your application image lives in a remote repository, we can deploy it.\nGo to your OpenShift developer dashboard and select \"+Add\" to use the OpenShift UI to deploy the application.\n\n![](https://raw.githubusercontent.com/redhat-et/locallm/main/assets/add_image.png)\n\nSelect \"Container images\".\n\n![](https://raw.githubusercontent.com/redhat-et/locallm/main/assets/container_images.png)\n\nThen fill out the form on the Deploy page with your [quay.io](https://quay.io) image name and make sure to set the \"Target port\" to 7860.\n\n![](https://raw.githubusercontent.com/redhat-et/locallm/main/assets/deploy.png)\n\nHit \"Create\" at the bottom and watch your application start.\n\nOnce the pods are up and the application is working, navigate to the \"Routes\" section and click on the link created for you\nto interact with
your app.\n\n![](https://raw.githubusercontent.com/redhat-et/locallm/main/assets/app.png)", "models": [ - "llama-2-7b-chat.Q5_K_S", - "mistral-7b-instruct-v0.1.Q4_K_M" + "hf.TheBloke.llama-2-7b-chat.Q5_K_S", + "hf.TheBloke.mistral-7b-instruct-v0.1.Q4_K_M" ] }, { @@ -30,8 +30,8 @@ "config": "summarizer-langchain/ai-studio.yaml", "readme": "# Summarizer\n\nThis model service is intended to be used for text summarization tasks. This service can ingest an arbitrarily long text input. If the input length is less than the model's maximum context window, it will summarize the input directly. If the input is longer than the maximum context window, the input will be divided into appropriately sized chunks. Each chunk will be summarized, and a final \"summary of summaries\" will be the service's final output.", "models": [ - "llama-2-7b-chat.Q5_K_S", - "mistral-7b-instruct-v0.1.Q4_K_M" + "hf.TheBloke.llama-2-7b-chat.Q5_K_S", + "hf.TheBloke.mistral-7b-instruct-v0.1.Q4_K_M" ] }, { @@ -47,15 +47,15 @@ "config": "code-generation/ai-studio.yaml", "readme": "# Code Generation\n\nThis example will deploy a local code-gen application using a llama.cpp model server and a Python app built with LangChain.\n\n### Download Model\n\n- **codellama**\n\n  - Download URL: `https://huggingface.co/TheBloke/CodeLlama-7B-Instruct-GGUF/resolve/main/codellama-7b-instruct.Q4_K_M.gguf`\n\n```bash\ncd ../models\nwget https://huggingface.co/TheBloke/CodeLlama-7B-Instruct-GGUF/resolve/main/codellama-7b-instruct.Q4_K_M.gguf\ncd ../\n```\n\n### Deploy Model Service\n\nTo start the model service, refer to [the playground model-service document](../playground/README.md). Deploy the LLM server and volume mount the model of choice.\n\n```bash\npodman run --rm -it -d \\\n  -p 8001:8001 \\\n  -v /local/path/to/locallm/models:/locallm/models:ro,Z \\\n  -e MODEL_PATH=models/<model-file> \\\n  -e HOST=0.0.0.0 \\\n  -e PORT=8001 \\\n  playground:image\n```\n\n### Build Container Image\n\nOnce the model service is deployed, follow the instructions below to build your container image and run it locally.\n\n- `podman build -t codegen-app -f code-generation/builds/Containerfile code-generation`\n\n- `podman run -it -p 8501:8501 codegen-app -- -m http://10.88.0.1:8001/v1`", "models": [ - "llama-2-7b-chat.Q5_K_S", - "mistral-7b-instruct-v0.1.Q4_K_M" + "hf.TheBloke.llama-2-7b-chat.Q5_K_S", + "hf.TheBloke.mistral-7b-instruct-v0.1.Q4_K_M" ] } ], "models": [ { - "id": "llama-2-7b-chat.Q5_K_S", - "name": "Llama-2-7B-Chat-GGUF", + "id": "hf.TheBloke.llama-2-7b-chat.Q5_K_S", + "name": "TheBloke/Llama-2-7B-Chat-GGUF", "description": "Llama 2 is a family of state-of-the-art open-access large language models released by Meta today, and we’re excited to fully support the launch with comprehensive integration in Hugging Face. Llama 2 is being released with a very permissive community license and is available for commercial use. The code, pretrained models, and fine-tuned models are all being released today 🔥", "hw": "CPU", "registry": "Hugging Face", @@ -63,8 +63,8 @@ "url": "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_S.gguf" }, { - "id": "mistral-7b-instruct-v0.1.Q4_K_M", - "name": "Mistral-7B-Instruct-v0.1-GGUF", + "id": "hf.TheBloke.mistral-7b-instruct-v0.1.Q4_K_M", + "name": "TheBloke/Mistral-7B-Instruct-v0.1-GGUF", "description": "The Mistral-7B-Instruct-v0.1 Large Language Model (LLM) is an instruct fine-tuned version of the [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) generative text model using a variety of publicly available conversation datasets.
For full details of this model please read our [release blog post](https://mistral.ai/news/announcing-mistral-7b/)", "hw": "CPU", "registry": "Hugging Face", diff --git a/packages/backend/src/managers/catalogManager.spec.ts b/packages/backend/src/managers/catalogManager.spec.ts index 8d7ccc2cf..596a1e8fb 100644 --- a/packages/backend/src/managers/catalogManager.spec.ts +++ b/packages/backend/src/managers/catalogManager.spec.ts @@ -95,9 +95,9 @@ describe('invalid user catalog', () => { }); test('expect correct model is returned with valid id', () => { - const model = catalogManager.getModelById('llama-2-7b-chat.Q5_K_S'); + const model = catalogManager.getModelById('hf.TheBloke.llama-2-7b-chat.Q5_K_S'); expect(model).toBeDefined(); - expect(model.name).toEqual('Llama-2-7B-Chat-GGUF'); + expect(model.name).toEqual('TheBloke/Llama-2-7B-Chat-GGUF'); expect(model.registry).toEqual('Hugging Face'); expect(model.url).toEqual( 'https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_S.gguf', @@ -112,9 +112,9 @@ describe('invalid user catalog', () => { test('expect correct model is returned from default catalog with valid id when no user catalog exists', async () => { vi.spyOn(fs, 'existsSync').mockReturnValue(false); await catalogManager.loadCatalog(); - const model = catalogManager.getModelById('llama-2-7b-chat.Q5_K_S'); + const model = catalogManager.getModelById('hf.TheBloke.llama-2-7b-chat.Q5_K_S'); expect(model).toBeDefined(); - expect(model.name).toEqual('Llama-2-7B-Chat-GGUF'); + expect(model.name).toEqual('TheBloke/Llama-2-7B-Chat-GGUF'); expect(model.registry).toEqual('Hugging Face'); expect(model.url).toEqual( 'https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_S.gguf',
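A note on the id scheme (an observation for reviewers, not part of the patch): the renamed model ids appear to follow `hf.<organization>.<model-file-basename>`, and the new `name` fields mirror the Hugging Face `<organization>/<repository>` path. The TypeScript sketch below is a minimal illustration of that assumed convention; `buildModelId` is a hypothetical helper, not something this diff introduces, and the pattern is inferred only from the two catalog entries changed above.

```typescript
// Illustration only: a hypothetical helper showing the id convention this diff
// appears to adopt ("hf." + Hugging Face organization + model file basename).
// Nothing below exists in the codebase; it is a sketch for reviewers.

function buildModelId(organization: string, modelFile: string): string {
  // Drop the ".gguf" extension, keeping the quantization suffix (e.g. Q5_K_S).
  const basename = modelFile.replace(/\.gguf$/, '');
  return `hf.${organization}.${basename}`;
}

// Both ids introduced in ai.json and asserted in catalogManager.spec.ts match this pattern:
console.log(buildModelId('TheBloke', 'llama-2-7b-chat.Q5_K_S.gguf'));
// -> hf.TheBloke.llama-2-7b-chat.Q5_K_S
console.log(buildModelId('TheBloke', 'mistral-7b-instruct-v0.1.Q4_K_M.gguf'));
// -> hf.TheBloke.mistral-7b-instruct-v0.1.Q4_K_M
```

If the convention holds, deriving both the ai.json entries and the spec assertions from a single helper like this would keep the ids from drifting apart in future renames.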