succinct voice tone for features
hahuyhoang411 committed Nov 27, 2023
1 parent 925a482 commit 5b432e8
Showing 5 changed files with 25 additions and 56 deletions.
2 changes: 1 addition & 1 deletion docs/docs/features/chat.md
@@ -5,7 +5,7 @@ description: Inference engine for chat completion, the same as OpenAI's

The Chat Completion feature in Nitro provides a flexible way to interact with any local Large Language Model (LLM).

## Single Request Example
### Single Request Example

To send a single query to your chosen LLM, follow these steps:

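For reference, a minimal single-request call might look like the sketch below. It assumes Nitro's OpenAI-compatible chat route at `http://localhost:3928/v1/chat/completions` and a model that has already been loaded; the endpoint and payload fields are assumptions, not taken from this diff.

```bash
# Sketch: one chat-completion request against a local Nitro server.
# Assumes an OpenAI-compatible route and a previously loaded model.
curl http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello, can you introduce yourself?"}
    ]
  }'
```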
22 changes: 8 additions & 14 deletions docs/docs/features/cont-batch.md
@@ -3,19 +3,17 @@ title: Continuous Batching
description: Nitro's continuous batching combines multiple requests, enhancing throughput.
---

## What is continuous batching?
Continuous batching boosts throughput and minimizes latency in large language model (LLM) inference. This technique groups multiple inference requests, significantly improving GPU utilization.

Continuous batching is a powerful technique that significantly boosts throughput in large language model (LLM) inference while minimizing latency. This process dynamically groups multiple inference requests, allowing for more efficient GPU utilization.
**Key Advantages:**

## Why Continuous Batching?
- Increased Throughput.
- Reduced Latency.
- Efficient GPU Use.

Traditional static batching methods can lead to underutilization of GPU resources, as they wait for all sequences in a batch to complete before moving on. Continuous batching overcomes this by allowing new sequences to start processing as soon as others finish, ensuring more consistent and efficient GPU usage.
**Implementation Insight:**

## Benefits of Continuous Batching

- **Increased Throughput:** Improvement over traditional batching methods.
- **Reduced Latency:** Lower p50 latency, leading to faster response times.
- **Efficient Resource Utilization:** Maximizes GPU memory and computational capabilities.
To evaluate its effectiveness, compare continuous batching with traditional methods. For more details on benchmarking, refer to this [article](https://www.anyscale.com/blog/continuous-batching-llm-inference).
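As a quick local check (not a substitute for the benchmark methodology in the linked article), you can fire a handful of requests concurrently and time the batch. The sketch below assumes the OpenAI-compatible chat route used elsewhere in these docs.

```bash
# Sketch: issue 4 concurrent chat-completion requests and time them.
# With continuous batching enabled, total wall time should stay well below
# 4x the latency of a single request.
time (
  for i in 1 2 3 4; do
    curl -s http://localhost:3928/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages": [{"role": "user", "content": "Summarize continuous batching in one sentence."}]}' \
      > /dev/null &
  done
  wait
)
```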

## How to use continuous batching
Nitro's `continuous batching` feature allows you to combine multiple requests for the same model execution, enhancing throughput and efficiency.
@@ -31,8 +29,4 @@ curl http://localhost:3928/inferences/llamacpp/loadmodel \
}'
```

For optimal performance, ensure that the `n_parallel` value is set to match the `thread_num`, as detailed in the [Multithreading](features/multi-thread.md) documentation.

### Benchmark and Compare

To understand the impact of continuous batching on your system, perform benchmarks comparing it with traditional batching methods. This [article](https://www.anyscale.com/blog/continuous-batching-llm-inference) will help you quantify improvements in throughput and latency.
For optimal performance, ensure that the `n_parallel` value is set to match the `thread_num`, as detailed in the [Multithreading](features/multi-thread.md) documentation.
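For illustration, a complete load-model call with `n_parallel` aligned to four threads might look like the sketch below. Only the endpoint comes from the snippet above; `llama_model_path`, `ctx_len`, and `ngl` are assumed llama.cpp-style parameters, and the model path is a placeholder.

```bash
# Sketch: load a model with 4 parallel slots for continuous batching.
# Keep n_parallel equal to the thread_num Nitro was started with.
curl http://localhost:3928/inferences/llamacpp/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/path/to/your_model.gguf",
    "ctx_len": 512,
    "ngl": 100,
    "n_parallel": 4
  }'
```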
4 changes: 1 addition & 3 deletions docs/docs/features/embed.md
@@ -3,8 +3,6 @@ title: Embedding
description: Inference engine for embedding, the same as OpenAI's
---

## What are embeddings?

Embeddings are lists of numbers (floats). To find how similar two embeddings are, we measure the [distance](https://en.wikipedia.org/wiki/Cosine_similarity) between them. Shorter distances mean they're more similar; longer distances mean less similarity.

## Activating Embedding Feature
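The activation steps themselves fall outside this hunk. As a sketch, enabling the feature typically means setting an embedding option when the model is loaded; the `embedding` flag and the other fields below are assumptions, not taken from this diff.

```bash
# Sketch: load a model with the embedding option turned on.
# The "embedding" flag is an assumed loadmodel parameter.
curl http://localhost:3928/inferences/llamacpp/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/path/to/your_model.gguf",
    "embedding": true
  }'
```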
@@ -44,7 +42,7 @@ curl https://api.openai.com/v1/embeddings \

</div>

## Embedding Response
### Embedding Response

The example response uses the output from the [llama2 Chat 7B Q5 (GGUF)](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/tree/main) model loaded into the Nitro server.

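For context, the request that produced such a response could look like the sketch below; the local `/v1/embeddings` route and field names mirror the OpenAI call above and are assumptions for Nitro itself.

```bash
# Sketch: OpenAI-style embedding request against a local Nitro server.
curl http://localhost:3928/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{
    "input": "Hello",
    "model": "llama2-7b-chat",
    "encoding_format": "float"
  }'
```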
39 changes: 11 additions & 28 deletions docs/docs/features/multi-thread.md
@@ -3,34 +3,20 @@ title: Multithreading
description: Nitro utilizes multithreading to optimize hardware usage.
---

## What is Multithreading?
Multithreading in programming allows concurrent task execution, improving efficiency and responsiveness. It's key for optimizing hardware and application performance.

Multithreading is a programming concept where a process executes multiple threads simultaneously, improving efficiency and performance. It allows concurrent execution of tasks, such as data processing or user interface updates. This technique is crucial for optimizing hardware usage and enhancing application responsiveness.
Effective multithreading offers:

## Drogon's Threading Model
- Faster Performance.
- Responsive IO.
- Deadlock Prevention.
- Resource Optimization.
- Asynchronous Programming Support.
- Scalability Enhancement.

Nitro, powered by Drogon, a high-speed C++ web application framework, utilizes a thread pool where each thread possesses its own event loop. These event loops are central to Drogon's functionality:
For more information on threading, visit [Drogon's Documentation](https://github.com/drogonframework/drogon/wiki/ENG-FAQ-1-Understanding-drogon-threading-model).

- **Main Loop**: Runs on the main thread, responsible for starting worker loops.
- **Worker Loops**: Handle tasks and network events, ensuring efficient task execution without blocking.

## Why it's important

Understanding and effectively using multithreading in Drogon is crucial for several reasons:

1. **Optimized Performance**: Multithreading enhances application efficiency by enabling simultaneous task execution for faster response times.

2. **Non-blocking IO Operations**: Utilizing multiple threads prevents long-running tasks from blocking the entire application, ensuring high responsiveness.

3. **Deadlock Avoidance**: Event loops and threads help prevent deadlocks, ensuring smoother and uninterrupted application operation.

4. **Effective Resource Utilization**: Distributing tasks across multiple threads leads to more efficient use of server resources, improving overall performance.

5. **Async Programming**

6. **Scalability**

## Enabling More Threads on Nitro
## Enabling Multi-Threads on Nitro

To increase the number of threads used by Nitro, use the following command syntax:

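The syntax block itself is outside this hunk. Judging from the example further down, the command takes a thread count, a host address, and a port, in that order; the sketch below is an inference from that example, not a line from this diff.

```bash
# Usage sketch: nitro <thread_num> <host> <port>
# e.g. 8 worker threads, listening on localhost port 5000
nitro 8 127.0.0.1 5000
```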
@@ -47,7 +33,4 @@ To launch Nitro with 4 threads, enter this command in the terminal:
nitro 4 127.0.0.1 5000
```

> After enabling multithreading, monitor your system's performance. Adjust the `thread_num` as needed to optimize throughput and latency based on your workload.
## Acknowledgements
For more information on Drogon's threading, visit [Drogon's Documentation](https://github.com/drogonframework/drogon/wiki/ENG-FAQ-1-Understanding-drogon-threading-model).
> After enabling multithreading, monitor your system's performance. Adjust the `thread_num` as needed to optimize throughput and latency based on your workload.
14 changes: 4 additions & 10 deletions docs/docs/features/warmup.md
@@ -3,17 +3,11 @@ title: Warming Up Model
description: Nitro warms up the model to optimize delays.
---

## What is Model Warming Up?

Model warming up is the process of running pre-requests through a model to optimize its components for production use. This step is crucial for reducing initialization and optimization delays during the first few inference requests.

## What are the Benefits?

Warming up an AI model offers several key benefits:

- **Enhanced Initial Performance:** Unlike in `llama.cpp`, where the first inference can be very slow, warming up reduces initial latency, ensuring quicker response times from the start.
- **Consistent Response Times:** Especially beneficial for systems updating models frequently, like those with real-time training, to avoid performance lags with new snapshots.
Model warming up involves pre-running requests through an AI model to fine-tune its components for production. This step minimizes delays during initial inferences, ensuring readiness for immediate use.

**Key Advantages:**
- Improved Initial Performance.
- Stable Response Times.
## How to Enable Model Warming Up?

On the Nitro server, model warming up is automatically enabled whenever a new model is loaded. This means that the server handles the warm-up process behind the scenes, ensuring that the model is ready for efficient and effective performance from the first inference request.
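As an illustration, no warm-up-specific flag is needed; any ordinary load-model call triggers it. The endpoint and fields below are reused from the other feature docs in this commit and are assumptions here, with a placeholder model path.

```bash
# Sketch: warm-up runs automatically once this load completes;
# no extra parameter is required.
curl http://localhost:3928/inferences/llamacpp/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/path/to/your_model.gguf",
    "ctx_len": 512,
    "ngl": 100
  }'
```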
