feat: update README.md (#23)
* feat: update README.md

* feat: more about library

* feat: move start server to quickstart

---------

Co-authored-by: vansangpfiev <[email protected]>
vansangpfiev and sangjanai authored May 14, 2024
1 parent 4f45dda commit c6ab7a2
Showing 1 changed file (README.md) with 64 additions and 5 deletions.
# cortex.llamacpp
cortex.llamacpp is a high-efficiency C++ inference engine for edge computing.

It is a dynamic library that can be loaded by any server at runtime.

# Repo Structure
```
.
└── ...   # repository tree and the sections before the build commands are collapsed in the diff view
```
- **On Linux and Windows:**

```bash
make build-example-server CMAKE_EXTRA_FLAGS=""
```

- **On MacOS with Apple Silicon:**

*(build command collapsed in the diff view)*

- **With CUDA:**

```bash
make build-example-server CMAKE_EXTRA_FLAGS="-DLLAMA_CUDA=ON"
```

# Quickstart
**Step 1: Download a model**

```bash
mkdir model && cd model
wget -O llama-2-7b-model.gguf "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf?download=true"
```
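
The download is several gigabytes, so it is worth confirming that the file landed where the later steps expect it; a quick check with standard shell tools (nothing project-specific assumed):

```bash
# still inside the model/ directory created above
ls -lh llama-2-7b-model.gguf   # should list a single multi-gigabyte .gguf file
```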

**Step 2: Start the server**
- **On MacOS and Linux:**

*(start commands collapsed in the diff view)*

- **On Windows:**

```cmd
copy ..\..\..\build\Release\engine.dll engines\cortex.llamacpp\
server.exe
```

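Before loading a model, you can sanity-check that the server is actually listening. The sketch below only probes the port used by the examples in this README (3928) with plain `curl`; it does not assume any dedicated health endpoint:

```bash
# prints an HTTP status code if anything answers on port 3928, otherwise reports failure
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:3928/ || echo "server not reachable on port 3928"
```
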
**Step 3: Load the model**
```bash title="Load model"
curl http://localhost:3928/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/model/llama-2-7b-model.gguf",
    "model_alias": "llama-2-7b-model",
    "ctx_len": 512,
    "ngl": 100,
    "model_type": "llm"
  }'
```
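
The `llama_model_path` must point at a file the server process can read. Below is a hedged variant of the request above, assuming the `model/` directory from Step 1; `MODEL_PATH` is just an illustrative shell variable, not an engine parameter:

```bash
# MODEL_PATH is an assumption based on where Step 1 downloaded the file
MODEL_PATH="$(pwd)/model/llama-2-7b-model.gguf"

if [ -f "$MODEL_PATH" ]; then
  curl http://localhost:3928/loadmodel \
    -H 'Content-Type: application/json' \
    -d "{
      \"llama_model_path\": \"$MODEL_PATH\",
      \"model_alias\": \"llama-2-7b-model\",
      \"ctx_len\": 512,
      \"ngl\": 100,
      \"model_type\": \"llm\"
    }"
else
  echo "model file not found: $MODEL_PATH" >&2
fi
```
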
**Step 4: Make an inference**

```bash title="cortex-cpp Inference"
curl http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Who won the world series in 2020?"
      }
    ],
    "model": "llama-2-7b-model"
  }'
```
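
The endpoint follows the OpenAI chat-completions request format, so assuming the response uses the matching shape, the assistant's reply can be extracted with `jq` (a common JSON command-line tool, not part of this project):

```bash
curl -s http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Who won the world series in 2020?"}
    ],
    "model": "llama-2-7b-model"
  }' | jq -r '.choices[0].message.content'   # assumes an OpenAI-compatible response body
```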

Table of parameters

| Parameter | Type | Description |
|------------------|---------|--------------------------------------------------------------|
| `llama_model_path` | String | The file path to the LLaMA model. |
| `ngl` | Integer | The number of GPU layers to use. |
| `ctx_len` | Integer | The context length for the model operations. |
| `embedding` | Boolean | Whether to use embedding in the model. |
| `n_parallel` | Integer | The number of parallel operations. |
| `cont_batching` | Boolean | Whether to use continuous batching. |
| `user_prompt` | String | The prompt to use for the user. |
| `ai_prompt` | String | The prompt to use for the AI assistant. |
| `system_prompt` | String | The prompt to use for system rules. |
| `pre_prompt` | String | The prompt to use for internal configuration. |
| `cpu_threads` | Integer | The number of threads to use for inferencing (CPU MODE ONLY) |
| `n_batch` | Integer | The batch size for the prompt evaluation step. |
| `caching_enabled` | Boolean | Whether to enable prompt caching. |
| `grp_attn_n` | Integer | Group-attention factor in self-extend. |
| `grp_attn_w` | Integer | Group-attention width in self-extend. |
| `mlock` | Boolean | Prevents the system from swapping the model to disk (macOS). |
| `grammar_file` | String | Path to a GBNF grammar file used to constrain sampling. |
| `model_type` | String | Model type to use: `llm` or `embedding`; defaults to `llm`. |
| `model_alias` | String | Used as the model ID if specified in a request; mandatory in `loadmodel`. |
| `model` | String | Used as the model ID if specified in a request; mandatory in chat/embedding requests. |
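
To illustrate how several of these parameters combine, here is a hedged example `loadmodel` request; the values are arbitrary illustrations rather than recommended defaults, and any parameter left out presumably falls back to the engine's default:

```bash
# example values only; tune ctx_len, ngl, threads, etc. for your hardware and model
curl http://localhost:3928/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/model/llama-2-7b-model.gguf",
    "model_alias": "llama-2-7b-model",
    "model_type": "llm",
    "ctx_len": 2048,
    "ngl": 100,
    "n_parallel": 2,
    "cont_batching": true,
    "caching_enabled": true,
    "n_batch": 512,
    "cpu_threads": 8
  }'
```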
