succinct voice tone for features
hahuyhoang411 committed Nov 27, 2023
1 parent 925a482 commit 5b432e8
Showing 5 changed files with 25 additions and 56 deletions.
2 changes: 1 addition & 1 deletion docs/docs/features/chat.md
@@ -5,7 +5,7 @@ description: Inference engine for chat completion, the same as OpenAI's

The Chat Completion feature in Nitro provides a flexible way to interact with any local Large Language Model (LLM).

## Single Request Example
### Single Request Example

To send a single query to your chosen LLM, follow these steps:

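For reference, a minimal single-request call might look like the sketch below. It assumes Nitro's OpenAI-compatible chat route at `http://localhost:3928/v1/chat/completions` and a model that has already been loaded; the endpoint and payload fields are assumptions, not taken from this diff.

```bash
# Sketch: one chat-completion request against a local Nitro server.
# Assumes an OpenAI-compatible route and a previously loaded model.
curl http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello, can you introduce yourself?"}
    ]
  }'
```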
22 changes: 8 additions & 14 deletions docs/docs/features/cont-batch.md
@@ -3,19 +3,17 @@ title: Continuous Batching
description: Nitro's continuous batching combines multiple requests, enhancing throughput.
---

## What is continuous batching?
Continuous batching boosts throughput and minimizes latency in large language model (LLM) inference. This technique groups multiple inference requests, significantly improving GPU utilization.

Continuous batching is a powerful technique that significantly boosts throughput in large language model (LLM) inference while minimizing latency. This process dynamically groups multiple inference requests, allowing for more efficient GPU utilization.
**Key Advantages:**

## Why Continuous Batching?
- Increased Throughput.
- Reduced Latency.
- Efficient GPU Use.

Traditional static batching methods can lead to underutilization of GPU resources, as they wait for all sequences in a batch to complete before moving on. Continuous batching overcomes this by allowing new sequences to start processing as soon as others finish, ensuring more consistent and efficient GPU usage.
**Implementation Insight:**

## Benefits of Continuous Batching

- **Increased Throughput:** Improvement over traditional batching methods.
- **Reduced Latency:** Lower p50 latency, leading to faster response times.
- **Efficient Resource Utilization:** Maximizes GPU memory and computational capabilities.
To evaluate its effectiveness, compare continuous batching with traditional methods. For more details on benchmarking, refer to this [article](https://www.anyscale.com/blog/continuous-batching-llm-inference).
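As a quick local check (not a substitute for the benchmark methodology in the linked article), you can fire a handful of requests concurrently and time the batch. The sketch below assumes the OpenAI-compatible chat route used elsewhere in these docs.

```bash
# Sketch: issue 4 concurrent chat-completion requests and time them.
# With continuous batching enabled, total wall time should stay well below
# 4x the latency of a single request.
time (
  for i in 1 2 3 4; do
    curl -s http://localhost:3928/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages": [{"role": "user", "content": "Summarize continuous batching in one sentence."}]}' \
      > /dev/null &
  done
  wait
)
```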

## How to use continuous batching
Nitro's `continuous batching` feature allows you to combine multiple requests for the same model execution, enhancing throughput and efficiency.
@@ -31,8 +29,4 @@ curl http://localhost:3928/inferences/llamacpp/loadmodel \
}'
```

For optimal performance, ensure that the `n_parallel` value is set to match the `thread_num`, as detailed in the [Multithreading](features/multi-thread.md) documentation.

### Benchmark and Compare

To understand the impact of continuous batching on your system, perform benchmarks comparing it with traditional batching methods. This [article](https://www.anyscale.com/blog/continuous-batching-llm-inference) will help you quantify improvements in throughput and latency.
For optimal performance, ensure that the `n_parallel` value is set to match the `thread_num`, as detailed in the [Multithreading](features/multi-thread.md) documentation.
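For illustration, a complete load-model call with `n_parallel` aligned to four threads might look like the sketch below. Only the endpoint comes from the snippet above; `llama_model_path`, `ctx_len`, and `ngl` are assumed llama.cpp-style parameters, and the model path is a placeholder.

```bash
# Sketch: load a model with 4 parallel slots for continuous batching.
# Keep n_parallel equal to the thread_num Nitro was started with.
curl http://localhost:3928/inferences/llamacpp/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/path/to/your_model.gguf",
    "ctx_len": 512,
    "ngl": 100,
    "n_parallel": 4
  }'
```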
4 changes: 1 addition & 3 deletions docs/docs/features/embed.md
@@ -3,8 +3,6 @@ title: Embedding
description: Inference engine for embedding, the same as OpenAI's
---

## What are embeddings?

Embeddings are lists of numbers (floats). To find how similar two embeddings are, we measure the [distance](https://en.wikipedia.org/wiki/Cosine_similarity) between them. Shorter distances mean they're more similar; longer distances mean less similarity.

## Activating Embedding Feature
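The activation steps themselves fall outside this hunk. As a sketch, enabling the feature typically means setting an embedding option when the model is loaded; the `embedding` flag and the other fields below are assumptions, not taken from this diff.

```bash
# Sketch: load a model with the embedding option turned on.
# The "embedding" flag is an assumed loadmodel parameter.
curl http://localhost:3928/inferences/llamacpp/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/path/to/your_model.gguf",
    "embedding": true
  }'
```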
@@ -44,7 +42,7 @@ curl https://api.openai.com/v1/embeddings \

</div>

## Embedding Response
### Embedding Response

The example response uses the output from the [llama2 Chat 7B Q5 (GGUF)](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/tree/main) model loaded into the Nitro server.

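For context, the request that produced such a response could look like the sketch below; the local `/v1/embeddings` route and field names mirror the OpenAI call above and are assumptions for Nitro itself.

```bash
# Sketch: OpenAI-style embedding request against a local Nitro server.
curl http://localhost:3928/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{
    "input": "Hello",
    "model": "llama2-7b-chat",
    "encoding_format": "float"
  }'
```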
39 changes: 11 additions & 28 deletions docs/docs/features/multi-thread.md
@@ -3,34 +3,20 @@ title: Multithreading
description: Nitro utilizes multithreading to optimize hardware usage.
---

## What is Multithreading?
Multithreading in programming allows concurrent task execution, improving efficiency and responsiveness. It's key for optimizing hardware and application performance.

Multithreading is a programming concept where a process executes multiple threads simultaneously, improving efficiency and performance. It allows concurrent execution of tasks, such as data processing or user interface updates. This technique is crucial for optimizing hardware usage and enhancing application responsiveness.
Effective multithreading offers:

## Drogon's Threading Model
- Faster Performance.
- Responsive IO.
- Deadlock Prevention.
- Resource Optimization.
- Asynchronous Programming Support.
- Scalability Enhancement.

Nitro, powered by Drogon, a high-speed C++ web application framework, utilizes a thread pool where each thread possesses its own event loop. These event loops are central to Drogon's functionality:
For more information on threading, visit [Drogon's Documentation](https://github.com/drogonframework/drogon/wiki/ENG-FAQ-1-Understanding-drogon-threading-model).

- **Main Loop**: Runs on the main thread, responsible for starting worker loops.
- **Worker Loops**: Handle tasks and network events, ensuring efficient task execution without blocking.

## Why it's important

Understanding and effectively using multithreading in Drogon is crucial for several reasons:

1. **Optimized Performance**: Multithreading enhances application efficiency by enabling simultaneous task execution for faster response times.

2. **Non-blocking IO Operations**: Utilizing multiple threads prevents long-running tasks from blocking the entire application, ensuring high responsiveness.

3. **Deadlock Avoidance**: Event loops and threads help prevent deadlocks, ensuring smoother and uninterrupted application operation.

4. **Effective Resource Utilization**: Distributing tasks across multiple threads leads to more efficient use of server resources, improving overall performance.

5. **Async Programming**

6. **Scalability**

## Enabling More Threads on Nitro
## Enabling Multi-Threads on Nitro

To increase the number of threads used by Nitro, use the following command syntax:

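The syntax block itself is outside this hunk. Judging from the example further down, the command takes a thread count, a host address, and a port, in that order; the sketch below is an inference from that example, not a line from this diff.

```bash
# Usage sketch: nitro <thread_num> <host> <port>
# e.g. 8 worker threads, listening on localhost port 5000
nitro 8 127.0.0.1 5000
```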
@@ -47,7 +33,4 @@ To launch Nitro with 4 threads, enter this command in the terminal:
nitro 4 127.0.0.1 5000
```

> After enabling multithreading, monitor your system's performance. Adjust the `thread_num` as needed to optimize throughput and latency based on your workload.
## Acknowledgements
For more information on Drogon's threading, visit [Drogon's Documentation](https://github.com/drogonframework/drogon/wiki/ENG-FAQ-1-Understanding-drogon-threading-model).
> After enabling multithreading, monitor your system's performance. Adjust the `thread_num` as needed to optimize throughput and latency based on your workload.
14 changes: 4 additions & 10 deletions docs/docs/features/warmup.md
@@ -3,17 +3,11 @@ title: Warming Up Model
description: Nitro warms up the model to optimize delays.
---

## What is Model Warming Up?

Model warming up is the process of running pre-requests through a model to optimize its components for production use. This step is crucial for reducing initialization and optimization delays during the first few inference requests.

## What are the Benefits?

Warming up an AI model offers several key benefits:

- **Enhanced Initial Performance:** Unlike in `llama.cpp`, where the first inference can be very slow, warming up reduces initial latency, ensuring quicker response times from the start.
- **Consistent Response Times:** Especially beneficial for systems updating models frequently, like those with real-time training, to avoid performance lags with new snapshots.
Model warming up involves pre-running requests through an AI model to fine-tune its components for production. This step minimizes delays during initial inferences, ensuring readiness for immediate use.

**Key Advantages:**
- Improved Initial Performance.
- Stable Response Times.
## How to Enable Model Warming Up?

On the Nitro server, model warming up is automatically enabled whenever a new model is loaded. This means that the server handles the warm-up process behind the scenes, ensuring that the model is ready for efficient and effective performance from the first inference request.
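As an illustration, no warm-up-specific flag is needed; any ordinary load-model call triggers it. The endpoint and fields below are reused from the other feature docs in this commit and are assumptions here, with a placeholder model path.

```bash
# Sketch: warm-up runs automatically once this load completes;
# no extra parameter is required.
curl http://localhost:3928/inferences/llamacpp/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/path/to/your_model.gguf",
    "ctx_len": 512,
    "ngl": 100
  }'
```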
