diff --git a/README.md b/README.md
index 59b1d9b8..eca1af9d 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,8 @@
 # cortex.llamacpp
 cortex.llamacpp is a high-efficiency C++ inference engine for edge computing.
+It is a dynamic library that can be loaded by any server at runtime.
+
 
 # Repo Structure
 ```
 .
@@ -36,7 +38,7 @@ If you don't have git, you can download the source code as a file archive from [
 
 - **On Linux, and Windows:**
   ```bash
-  make build-example-server
+  make build-example-server CMAKE_EXTRA_FLAGS=""
   ```
 
 - **On MacOS with Apple Silicon:**
@@ -57,9 +59,15 @@ If you don't have git, you can download the source code as a file archive from [
   make build-example-server CMAKE_EXTRA_FLAGS="-DLLAMA_CUDA=ON"
   ```
 
-## Start process
+# Quickstart
+**Step 1: Download a model**
+
+```bash
+mkdir model && cd model
+wget -O llama-2-7b-model.gguf "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf?download=true"
+```
 
-Finally, let's start Server.
+**Step 2: Start the server**
 
 - **On MacOS and Linux:**
   ```bash
@@ -84,5 +92,56 @@ Finally, let's start Server.
   copy ..\..\..\build\Release\engine.dll engines\cortex.llamacpp\
   server.exe
   ```
-# Quickstart
-// TODO
\ No newline at end of file
+
+**Step 3: Load the model**
+```bash title="Load model"
+curl http://localhost:3928/loadmodel \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "llama_model_path": "/model/llama-2-7b-model.gguf",
+    "model_alias": "llama-2-7b-model",
+    "ctx_len": 512,
+    "ngl": 100,
+    "model_type": "llm"
+  }'
+```
+**Step 4: Make an inference**
+
+```bash title="cortex-cpp Inference"
+curl http://localhost:3928/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "messages": [
+      {
+        "role": "user",
+        "content": "Who won the World Series in 2020?"
+      }
+    ],
+    "model": "llama-2-7b-model"
+  }'
+```
+
+**Table of parameters**
+
+| Parameter          | Type    | Description                                                                            |
+|--------------------|---------|----------------------------------------------------------------------------------------|
+| `llama_model_path` | String  | The file path to the LLaMA model.                                                      |
+| `ngl`              | Integer | The number of model layers to offload to the GPU.                                      |
+| `ctx_len`          | Integer | The context length for model operations.                                               |
+| `embedding`        | Boolean | Whether to enable embeddings for the model.                                            |
+| `n_parallel`       | Integer | The number of parallel sequences to process.                                           |
+| `cont_batching`    | Boolean | Whether to use continuous batching.                                                    |
+| `user_prompt`      | String  | The prompt to use for user messages.                                                   |
+| `ai_prompt`        | String  | The prompt to use for the AI assistant.                                                |
+| `system_prompt`    | String  | The prompt to use for system rules.                                                    |
+| `pre_prompt`       | String  | The prompt to use for internal configuration.                                          |
+| `cpu_threads`      | Integer | The number of threads to use for inference (CPU mode only).                            |
+| `n_batch`          | Integer | The batch size for the prompt evaluation step.                                         |
+| `caching_enabled`  | Boolean | Whether to enable prompt caching.                                                      |
+| `grp_attn_n`       | Integer | Group-attention factor in self-extend.                                                 |
+| `grp_attn_w`       | Integer | Group-attention width in self-extend.                                                  |
+| `mlock`            | Boolean | Prevents the system from swapping the model to disk (macOS).                           |
+| `grammar_file`     | String  | Path to a GBNF grammar file used to constrain sampling.                                |
+| `model_type`       | String  | The type of model to load: `llm` or `embedding`. Defaults to `llm`.                    |
+| `model_alias`      | String  | Used as the model ID if specified in a request; mandatory in `/loadmodel`.             |
+| `model`            | String  | Used as the model ID if specified in a request; mandatory in chat/embedding requests.  |
\ No newline at end of file
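
The parameters in the table above can be combined in a single `/loadmodel` call. The sketch below is illustrative only: the endpoint and field names are the ones documented in the README diff, while the specific values (context length, parallelism, thread count, batch size) are placeholder assumptions rather than recommended settings.

```bash
# Illustrative /loadmodel request combining several optional parameters from
# the table above. All field names come from the documented parameter table;
# the numeric values are example settings, not tuned recommendations.
curl http://localhost:3928/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/model/llama-2-7b-model.gguf",
    "model_alias": "llama-2-7b-model",
    "model_type": "llm",
    "ctx_len": 2048,
    "ngl": 100,
    "n_parallel": 2,
    "cont_batching": true,
    "caching_enabled": true,
    "cpu_threads": 8,
    "n_batch": 512,
    "mlock": false
  }'
```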
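
`grammar_file` points at a llama.cpp-style GBNF grammar. The following is a minimal sketch: the two-rule grammar is purely illustrative, and it assumes the path is resolved from the server's working directory, which the README does not specify.

```bash
# Write a tiny GBNF grammar that restricts the model's output to "yes" or "no".
# GBNF is the grammar format used by llama.cpp; this grammar is illustrative only.
cat > yes_no.gbnf <<'EOF'
root ::= "yes" | "no"
EOF

# Load the model with the grammar attached. Assumption: the grammar path is
# resolved relative to the server's working directory.
curl http://localhost:3928/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/model/llama-2-7b-model.gguf",
    "model_alias": "llama-2-7b-model",
    "grammar_file": "yes_no.gbnf"
  }'
```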