Merge pull request #128 from janhq/update-nitro-docs
Update Nitro docs
Showing 39 changed files with 11,888 additions and 667 deletions.
---
title: Simple chatbot with Nitro
---

This guide provides instructions to create a chatbot powered by Nitro using a GGUF model.

## Step 1: Download the Model

First, you'll need to download the chatbot model.

1. **Navigate to the Models Folder**
   - Open your project directory.
   - Locate and open the `models` folder within the directory.

2. **Select a GGUF Model**
   - Visit the Hugging Face repository at [TheBloke's Models](https://huggingface.co/TheBloke).
   - Browse through the available models.
   - Choose the model that best fits your needs.

3. **Download the Model**
   - Once you've selected a model, download it using a command like the one below, replacing the URL with your chosen model's download link. Later, point `llama_model_path` in the configuration to wherever you saved the file.

```bash title="Downloading Zephyr 7B Model"
wget -O zephyr-7b-beta.Q5_K_M.gguf "https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q5_K_M.gguf?download=true"
```

## Step 2: Load the Model

Now, you'll set up the model in your application.

1. **Open the `app.py` File**

   - In your project directory, find and open the `app.py` file.

2. **Configure the Model Path**

   - Modify the model path in `app.py` to point to your downloaded model.
   - Update the configuration parameters as necessary; a sketch of sending this configuration to Nitro follows the example below.

```python title="Example Configuration" {2}
dat = {
    "llama_model_path": "nitro/interface/models/zephyr-7b-beta.Q5_K_M.gguf",
    "ctx_len": 2048,
    "ngl": 100,
    "embedding": True,
    "n_parallel": 4,
    "pre_prompt": "A chat between a curious user and an artificial intelligence",
    "user_prompt": "USER: ",
    "ai_prompt": "ASSISTANT: "
}
```
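
For reference, the same configuration can also be sent to a running Nitro server from Python. The snippet below is an illustrative sketch only: it assumes the server is listening on `http://localhost:3928` and exposes the `/inferences/llamacpp/loadmodel` endpoint used elsewhere in these docs, and it uses the third-party `requests` library.

```python title="Loading the model from Python (sketch)"
import requests

# Same configuration as above; adjust llama_model_path to where you saved the model.
dat = {
    "llama_model_path": "nitro/interface/models/zephyr-7b-beta.Q5_K_M.gguf",
    "ctx_len": 2048,
    "ngl": 100,
    "embedding": True,
    "n_parallel": 4,
    "pre_prompt": "A chat between a curious user and an artificial intelligence",
    "user_prompt": "USER: ",
    "ai_prompt": "ASSISTANT: ",
}

# POST the configuration to the local Nitro server and print its JSON status reply.
resp = requests.post("http://localhost:3928/inferences/llamacpp/loadmodel", json=dat)
resp.raise_for_status()
print(resp.json())
```
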
Congratulations! Your Nitro chatbot is now set up. Feel free to experiment with different configuration parameters to tailor the chatbot to your needs.

For more information on parameter settings and their effects, please refer to [Run Nitro](using-nitro) for a comprehensive parameters table.
---
title: Chat Completion
---

The Chat Completion feature in Nitro provides a flexible way to interact with any local Large Language Model (LLM).

## Single Request Example

To send a single query to your chosen LLM, use the request below:

<div style={{ width: '50%', float: 'left', clear: 'left' }}>

```bash title="Nitro"
curl http://localhost:3928/inferences/llamacpp/chat_completion \
  -H "Content-Type: application/json" \
  -d '{
    "model": "",
    "messages": [
      {
        "role": "user",
        "content": "Hello"
      }
    ]
  }'
```

</div>

<div style={{ width: '50%', float: 'right', clear: 'right' }}>

```bash title="OpenAI"
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {
        "role": "user",
        "content": "Hello"
      }
    ]
  }'
```

</div>

This command sends a simple "Hello" message to your local LLM and returns a single completion.

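If you prefer to script this rather than use curl, the snippet below is a minimal Python sketch of the same request. It assumes a local Nitro server on port `3928` and uses the third-party `requests` library.

```python title="Single request from Python (sketch)"
import requests

payload = {
    "model": "",
    "messages": [{"role": "user", "content": "Hello"}],
}

# Send the request to the local Nitro server and print the assistant's reply.
resp = requests.post(
    "http://localhost:3928/inferences/llamacpp/chat_completion",
    json=payload,
)
print(resp.json()["choices"][0]["message"]["content"])
```
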
### Dialog Request Example

For ongoing conversations or multiple queries, the dialog request feature is ideal. Here's how to structure a multi-turn conversation:

<div style={{ width: '50%', float: 'left', clear: 'left' }}>

```bash title="Nitro"
curl http://localhost:3928/inferences/llamacpp/chat_completion \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Who won the world series in 2020?"
      },
      {
        "role": "assistant",
        "content": "The Los Angeles Dodgers won the World Series in 2020."
      },
      {
        "role": "user",
        "content": "Where was it played?"
      }
    ]
  }'
```

</div>

<div style={{ width: '50%', float: 'right', clear: 'right' }}>

```bash title="OpenAI"
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Who won the world series in 2020?"
      },
      {
        "role": "assistant",
        "content": "The Los Angeles Dodgers won the World Series in 2020."
      },
      {
        "role": "user",
        "content": "Where was it played?"
      }
    ]
  }'
```

</div>

### Chat Completion Response

Below are examples of responses from both the Nitro server and OpenAI:

<div style={{ width: '50%', float: 'left', clear: 'left' }}>

```js title="Nitro"
{
  "choices": [
    {
      "finish_reason": null,
      "index": 0,
      "message": {
        "content": "Hello, how may I assist you this evening?",
        "role": "assistant"
      }
    }
  ],
  "created": 1700215278,
  "id": "sofpJrnBGUnchO8QhA0s",
  "model": "_",
  "object": "chat.completion",
  "system_fingerprint": "_",
  "usage": {
    "completion_tokens": 13,
    "prompt_tokens": 90,
    "total_tokens": 103
  }
}
```

</div>

<div style={{ width: '50%', float: 'right', clear: 'right' }}>

```js title="OpenAI"
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello there, how may I assist you today?"
      }
    }
  ],
  "created": 1677652288,
  "id": "chatcmpl-123",
  "model": "gpt-3.5-turbo-0613",
  "object": "chat.completion",
  "system_fingerprint": "fp_44709d6fcb",
  "usage": {
    "completion_tokens": 12,
    "prompt_tokens": 9,
    "total_tokens": 21
  }
}
```

</div>

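Because the response follows the OpenAI schema, a multi-turn dialog mostly amounts to appending each assistant reply to the `messages` list before sending the next request. The helper below is a small sketch of that pattern; the local port and the third-party `requests` library are assumptions, not part of Nitro itself.

```python title="Keeping conversation history (sketch)"
import requests

NITRO_URL = "http://localhost:3928/inferences/llamacpp/chat_completion"

# Running history of the conversation, starting with the system prompt.
messages = [{"role": "system", "content": "You are a helpful assistant."}]

def ask(user_text):
    messages.append({"role": "user", "content": user_text})
    resp = requests.post(NITRO_URL, json={"messages": messages})
    reply = resp.json()["choices"][0]["message"]
    messages.append(reply)  # remember the assistant turn so later questions have context
    return reply["content"]

print(ask("Who won the world series in 2020?"))
print(ask("Where was it played?"))
```
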
The chat completion feature in Nitro showcases compatibility with OpenAI, making the transition between using OpenAI and local AI models more straightforward. For further details and advanced usage, please refer to the [API reference](https://nitro.jan.ai/api).
---
title: Continuous Batching
---

## What is continuous batching?

Continuous batching is a powerful technique that significantly boosts throughput in large language model (LLM) inference while minimizing latency. This process dynamically groups multiple inference requests, allowing for more efficient GPU utilization.

## Why Continuous Batching?

Traditional static batching methods can lead to underutilization of GPU resources, as they wait for all sequences in a batch to complete before moving on. Continuous batching overcomes this by allowing new sequences to start processing as soon as others finish, ensuring more consistent and efficient GPU usage.

## Benefits of Continuous Batching

- **Increased Throughput:** Improvement over traditional batching methods.
- **Reduced Latency:** Lower p50 latency, leading to faster response times.
- **Efficient Resource Utilization:** Maximizes GPU memory and computational capabilities.

## How to use continuous batching

Nitro's `continuous batching` feature allows you to combine multiple requests for the same model execution, enhancing throughput and efficiency.

```bash title="Enable Batching" {6,7}
curl http://localhost:3928/inferences/llamacpp/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/path/to/your_model.gguf",
    "ctx_len": 512,
    "cont_batching": true,
    "n_parallel": 4
  }'
```

For optimal performance, ensure that the `n_parallel` value is set to match the `thread_num`, as detailed in the [Multithreading](features/multi-thread.md) documentation.

### Benchmark and Compare

To understand the impact of continuous batching on your system, perform benchmarks comparing it with traditional batching methods. This [article](https://www.anyscale.com/blog/continuous-batching-llm-inference) will help you quantify improvements in throughput and latency.
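
As a rough starting point, the sketch below sends the same prompt to a local Nitro server first sequentially and then concurrently, and reports the wall-clock time of each run; with `cont_batching` enabled and `n_parallel` greater than 1, the concurrent run should finish noticeably faster. The endpoint, prompt, request count, and use of the `requests` library are illustrative assumptions.

```python title="Sequential vs. concurrent requests (sketch)"
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:3928/inferences/llamacpp/chat_completion"
PAYLOAD = {"messages": [{"role": "user", "content": "Write one sentence about batching."}]}
N_REQUESTS = 8

def one_request(_):
    # Each call is an independent chat completion request.
    return requests.post(URL, json=PAYLOAD, timeout=300).status_code

# Baseline: issue the requests one at a time.
start = time.time()
for i in range(N_REQUESTS):
    one_request(i)
print(f"sequential: {time.time() - start:.1f}s")

# Concurrent: overlapping requests give continuous batching a chance to group them.
start = time.time()
with ThreadPoolExecutor(max_workers=N_REQUESTS) as pool:
    list(pool.map(one_request, range(N_REQUESTS)))
print(f"concurrent: {time.time() - start:.1f}s")
```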