feat: update README.md (#23)
* feat: update README.md

* feat: more about library

* feat: move start server to quickstart

---------

Co-authored-by: vansangpfiev <[email protected]>
vansangpfiev and sangjanai authored May 14, 2024
1 parent 4f45dda commit c6ab7a2
Showing 1 changed file (README.md) with 64 additions and 5 deletions.
# cortex.llamacpp
cortex.llamacpp is a high-efficiency C++ inference engine for edge computing.

It is a dynamic library that can be loaded by any server at runtime.

# Repo Structure
```
.
└── ...   # repository tree and the sections before the build commands are collapsed in the diff view
```
- **On Linux and Windows:**

```bash
make build-example-server CMAKE_EXTRA_FLAGS=""
```

- **On MacOS with Apple Silicon:**

*(build command collapsed in the diff view)*

- **With CUDA:**

```bash
make build-example-server CMAKE_EXTRA_FLAGS="-DLLAMA_CUDA=ON"
```

# Quickstart
**Step 1: Download a model**

```bash
mkdir model && cd model
wget -O llama-2-7b-model.gguf "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf?download=true"
```
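
The download is several gigabytes, so it is worth confirming that the file landed where the later steps expect it; a quick check with standard shell tools (nothing project-specific assumed):

```bash
# still inside the model/ directory created above
ls -lh llama-2-7b-model.gguf   # should list a single multi-gigabyte .gguf file
```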

**Step 2: Start the server**
- **On MacOS and Linux:**

*(start commands collapsed in the diff view)*

- **On Windows:**

```cmd
copy ..\..\..\build\Release\engine.dll engines\cortex.llamacpp\
server.exe
```

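Before loading a model, you can sanity-check that the server is actually listening. The sketch below only probes the port used by the examples in this README (3928) with plain `curl`; it does not assume any dedicated health endpoint:

```bash
# prints an HTTP status code if anything answers on port 3928, otherwise reports failure
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:3928/ || echo "server not reachable on port 3928"
```
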
**Step 3: Load the model**
```bash title="Load model"
curl http://localhost:3928/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/model/llama-2-7b-model.gguf",
    "model_alias": "llama-2-7b-model",
    "ctx_len": 512,
    "ngl": 100,
    "model_type": "llm"
  }'
```
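
The `llama_model_path` must point at a file the server process can read. Below is a hedged variant of the request above, assuming the `model/` directory from Step 1; `MODEL_PATH` is just an illustrative shell variable, not an engine parameter:

```bash
# MODEL_PATH is an assumption based on where Step 1 downloaded the file
MODEL_PATH="$(pwd)/model/llama-2-7b-model.gguf"

if [ -f "$MODEL_PATH" ]; then
  curl http://localhost:3928/loadmodel \
    -H 'Content-Type: application/json' \
    -d "{
      \"llama_model_path\": \"$MODEL_PATH\",
      \"model_alias\": \"llama-2-7b-model\",
      \"ctx_len\": 512,
      \"ngl\": 100,
      \"model_type\": \"llm\"
    }"
else
  echo "model file not found: $MODEL_PATH" >&2
fi
```
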
**Step 4: Make an inference**

```bash title="cortex-cpp Inference"
curl http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Who won the world series in 2020?"
      }
    ],
    "model": "llama-2-7b-model"
  }'
```
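
The endpoint follows the OpenAI chat-completions request format, so assuming the response uses the matching shape, the assistant's reply can be extracted with `jq` (a common JSON command-line tool, not part of this project):

```bash
curl -s http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Who won the world series in 2020?"}
    ],
    "model": "llama-2-7b-model"
  }' | jq -r '.choices[0].message.content'   # assumes an OpenAI-compatible response body
```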

Table of parameters

| Parameter | Type | Description |
|------------------|---------|--------------------------------------------------------------|
| `llama_model_path` | String | The file path to the LLaMA model. |
| `ngl` | Integer | The number of GPU layers to use. |
| `ctx_len` | Integer | The context length for the model operations. |
| `embedding` | Boolean | Whether to use embedding in the model. |
| `n_parallel` | Integer | The number of parallel operations. |
| `cont_batching` | Boolean | Whether to use continuous batching. |
| `user_prompt` | String | The prompt to use for the user. |
| `ai_prompt` | String | The prompt to use for the AI assistant. |
| `system_prompt` | String | The prompt to use for system rules. |
| `pre_prompt` | String | The prompt to use for internal configuration. |
| `cpu_threads` | Integer | The number of threads to use for inferencing (CPU MODE ONLY) |
| `n_batch` | Integer | The batch size for the prompt evaluation step. |
| `caching_enabled` | Boolean | Whether to enable prompt caching. |
| `grp_attn_n` | Integer | Group-attention factor in self-extend. |
| `grp_attn_w` | Integer | Group-attention width in self-extend. |
| `mlock` | Boolean | Prevents the system from swapping the model to disk (macOS). |
| `grammar_file` | String | Path to a GBNF grammar file used to constrain sampling. |
| `model_type` | String | Model type to use: `llm` or `embedding`; defaults to `llm`. |
| `model_alias` | String | Used as the model ID if specified in a request; mandatory in `loadmodel`. |
| `model` | String | Used as the model ID if specified in a request; mandatory in chat/embedding requests. |
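
To illustrate how several of these parameters combine, here is a hedged example `loadmodel` request; the values are arbitrary illustrations rather than recommended defaults, and any parameter left out presumably falls back to the engine's default:

```bash
# example values only; tune ctx_len, ngl, threads, etc. for your hardware and model
curl http://localhost:3928/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/model/llama-2-7b-model.gguf",
    "model_alias": "llama-2-7b-model",
    "model_type": "llm",
    "ctx_len": 2048,
    "ngl": 100,
    "n_parallel": 2,
    "cont_batching": true,
    "caching_enabled": true,
    "n_batch": 512,
    "cpu_threads": 8
  }'
```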
