From 5b432e8bcfb1b5df2f03122df2679bc9f4a58598 Mon Sep 17 00:00:00 2001
From: hahuyhoang411
Date: Tue, 28 Nov 2023 00:05:49 +0700
Subject: [PATCH] succinct voice tone for features

---
 docs/docs/features/chat.md         |  2 +-
 docs/docs/features/cont-batch.md   | 22 ++++++-----------
 docs/docs/features/embed.md        |  4 +--
 docs/docs/features/multi-thread.md | 39 +++++++++---------------------
 docs/docs/features/warmup.md       | 14 +++-------
 5 files changed, 25 insertions(+), 56 deletions(-)

diff --git a/docs/docs/features/chat.md b/docs/docs/features/chat.md
index 4b82738c4..229fb8b0e 100644
--- a/docs/docs/features/chat.md
+++ b/docs/docs/features/chat.md
@@ -5,7 +5,7 @@ description: Inference engine for chat completion, the same as OpenAI's
 
 The Chat Completion feature in Nitro provides a flexible way to interact with any local Large Language Model (LLM).
 
-## Single Request Example
+### Single Request Example
 
 To send a single query to your chosen LLM, follow these steps:
 
diff --git a/docs/docs/features/cont-batch.md b/docs/docs/features/cont-batch.md
index 5d11cea1d..65a5f950f 100644
--- a/docs/docs/features/cont-batch.md
+++ b/docs/docs/features/cont-batch.md
@@ -3,19 +3,17 @@ title: Continuous Batching
 description: Nitro's continuous batching combines multiple requests, enhancing throughput.
 ---
 
-## What is continous batching?
+Continuous batching boosts throughput and minimizes latency in large language model (LLM) inference. This technique groups multiple inference requests, significantly improving GPU utilization.
 
-Continuous batching is a powerful technique that significantly boosts throughput in large language model (LLM) inference while minimizing latency. This process dynamically groups multiple inference requests, allowing for more efficient GPU utilization.
+**Key Advantages:**
 
-## Why Continuous Batching?
+- Increased Throughput.
+- Reduced Latency.
+- Efficient GPU Use.
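The chat.md hunk above retitles the single-request walkthrough whose steps are elided from this patch. For reference, a minimal sketch of such a request is shown below; the `chat_completion` endpoint path is an assumption patterned after the `/inferences/llamacpp/loadmodel` URL used elsewhere in these docs.

```shell
# Hypothetical single chat completion request to a local Nitro server.
# The endpoint path and port are assumptions; adjust to your deployment.
CHAT_BODY='{
  "messages": [
    {"role": "user", "content": "Hello, who are you?"}
  ]
}'
curl --max-time 10 -s http://localhost:3928/inferences/llamacpp/chat_completion \
  -H "Content-Type: application/json" \
  -d "$CHAT_BODY" || true   # tolerate failure when no server is listening
```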
-Traditional static batching methods can lead to underutilization of GPU resources, as they wait for all sequences in a batch to complete before moving on. Continuous batching overcomes this by allowing new sequences to start processing as soon as others finish, ensuring more consistent and efficient GPU usage.
+**Implementation Insight:**
 
-## Benefits of Continuous Batching
-
-- **Increased Throughput:** Improvement over traditional batching methods.
-- **Reduced Latency:** Lower p50 latency, leading to faster response times.
-- **Efficient Resource Utilization:** Maximizes GPU memory and computational capabilities.
+To evaluate its effectiveness, compare continuous batching with traditional methods. For more details on benchmarking, refer to this [article](https://www.anyscale.com/blog/continuous-batching-llm-inference).
 
 ## How to use continuous batching
 
 Nitro's `continuous batching` feature allows you to combine multiple requests for the same model execution, enhancing throughput and efficiency.
@@ -31,8 +29,4 @@ curl http://localhost:3928/inferences/llamacpp/loadmodel \
   }'
 ```
 
-For optimal performance, ensure that the `n_parallel` value is set to match the `thread_num`, as detailed in the [Multithreading](features/multi-thread.md) documentation.
-
-### Benchmark and Compare
-
-To understand the impact of continuous batching on your system, perform benchmarks comparing it with traditional batching methods. This [article](https://www.anyscale.com/blog/continuous-batching-llm-inference) will help you quantify improvements in throughput and latency.
\ No newline at end of file
+For optimal performance, ensure that the `n_parallel` value is set to match the `thread_num`, as detailed in the [Multithreading](features/multi-thread.md) documentation.
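Continuous batching as described in the cont-batch.md hunks above only pays off when several requests are actually in flight at once. A minimal sketch of firing concurrent requests, where the `chat_completion` endpoint path is an assumption consistent with the loadmodel URL shown above:

```shell
# Send several requests concurrently so the server can batch them.
# The chat_completion path is an assumption, not confirmed by this patch.
URL="http://localhost:3928/inferences/llamacpp/chat_completion"
PROMPTS=("Tell me a joke" "Summarize continuous batching" "What is 2+2?" "Name a C++ web framework")
for p in "${PROMPTS[@]}"; do
  curl --max-time 30 -s "$URL" \
    -H "Content-Type: application/json" \
    -d "{\"messages\":[{\"role\":\"user\",\"content\":\"$p\"}]}" &
done
wait || true   # with n_parallel set to 4, the four requests can be batched together
```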
\ No newline at end of file
diff --git a/docs/docs/features/embed.md b/docs/docs/features/embed.md
index 3bd71541c..da9370ec7 100644
--- a/docs/docs/features/embed.md
+++ b/docs/docs/features/embed.md
@@ -3,8 +3,6 @@ title: Embedding
 description: Inference engine for embedding, the same as OpenAI's
 ---
 
-## What are embeddings?
-
 Embeddings are lists of numbers (floats). To find how similar two embeddings are, we measure the [distance](https://en.wikipedia.org/wiki/Cosine_similarity) between them. Shorter distances mean they're more similar; longer distances mean less similarity.
 
 ## Activating Embedding Feature
@@ -44,7 +42,7 @@ curl https://api.openai.com/v1/embeddings \
 
 
 
-## Embedding Reponse
+### Embedding Response
 
 The example response used the output from model [llama2 Chat 7B Q5 (GGUF)](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/tree/main) loaded to Nitro server.
 
diff --git a/docs/docs/features/multi-thread.md b/docs/docs/features/multi-thread.md
index 538870bef..7fad1d3a6 100644
--- a/docs/docs/features/multi-thread.md
+++ b/docs/docs/features/multi-thread.md
@@ -3,34 +3,20 @@ title: Multithreading
 description: Nitro utilizes multithreading to optimize hardware usage.
 ---
 
-## What is Multithreading?
+Multithreading in programming allows concurrent task execution, improving efficiency and responsiveness. It's key for optimizing hardware and application performance.
 
-Multithreading is a programming concept where a process executes multiple threads simultaneously, improving efficiency and performance. It allows concurrent execution of tasks, such as data processing or user interface updates. This technique is crucial for optimizing hardware usage and enhancing application responsiveness.
+Effective multithreading offers:
 
-## Drogon's Threading Model
+- Faster Performance.
+- Responsive I/O.
+- Deadlock Prevention.
+- Resource Optimization.
+- Asynchronous Programming Support.
+- Scalability Enhancement.
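To go with the embed.md changes above, a sketch of requesting embeddings from a local Nitro server. The route is an assumption: it mirrors the OpenAI-style `/v1/embeddings` call shown in embed.md, pointed at localhost instead of api.openai.com, and the model name is a placeholder.

```shell
# Hypothetical embedding request to a local Nitro server; assumes the
# model was loaded with "embedding": true and that Nitro exposes an
# OpenAI-compatible /v1/embeddings route (not confirmed by this patch).
EMBED_BODY='{
  "input": "Hello",
  "model": "llama2-7b-chat",
  "encoding_format": "float"
}'
curl --max-time 10 -s http://localhost:3928/v1/embeddings \
  -H "Content-Type: application/json" \
  -d "$EMBED_BODY" || true   # tolerate failure when no server is listening
```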
-Nitro powered by Drogon, a high-speed C++ web application framework, utilizes a thread pool where each thread possesses its own event loop. These event loops are central to Drogon's functionality:
+For more information on threading, visit [Drogon's Documentation](https://github.com/drogonframework/drogon/wiki/ENG-FAQ-1-Understanding-drogon-threading-model).
 
-- **Main Loop**: Runs on the main thread, responsible for starting worker loops.
-- **Worker Loops**: Handle tasks and network events, ensuring efficient task execution without blocking.
-
-## Why it's important
-
-Understanding and effectively using multithreading in Drogon is crucial for several reasons:
-
-1. **Optimized Performance**: Multithreading enhances application efficiency by enabling simultaneous task execution for faster response times.
-
-2. **Non-blocking IO Operations**: Utilizing multiple threads prevents long-running tasks from blocking the entire application, ensuring high responsiveness.
-
-3. **Deadlock Avoidance**: Event loops and threads helps prevent deadlocks, ensuring smoother and uninterrupted application operation.
-
-4. **Effective Resource Utilization**: Distributing tasks across multiple threads leads to more efficient use of server resources, improving overall performance.
-
-5. **Async Programming**
-
-6. **Scalability**
-
-## Enabling More Threads on Nitro
+## Enabling Multithreading on Nitro
 
 To increase the number of threads used by Nitro, use the following command syntax:
@@ -47,7 +33,4 @@ To launch Nitro with 4 threads, enter this command in the terminal:
 nitro 4 127.0.0.1 5000
 ```
 
-> After enabling multithreading, monitor your system's performance. Adjust the `thread_num` as needed to optimize throughput and latency based on your workload.
-
-## Acknowledgements
-For more information on Drogon's threading, visit [Drogon's Documentation](https://github.com/drogonframework/drogon/wiki/ENG-FAQ-1-Understanding-drogon-threading-model).
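The cont-batch.md hunk above recommends matching `n_parallel` to `thread_num`; combining that with the multi-thread.md launch syntax gives the following sketch. The model path is a placeholder, and the `cont_batching` flag and loadmodel body are assumptions, since the actual parameters are elided from the hunk above.

```shell
# Launch Nitro with 4 threads, then load a model whose n_parallel
# matches thread_num, per the recommendation in cont-batch.md.
THREADS=4
nitro "$THREADS" 127.0.0.1 5000 &   # assumes the nitro binary is on PATH
sleep 2
curl --max-time 10 -s http://127.0.0.1:5000/inferences/llamacpp/loadmodel \
  -H "Content-Type: application/json" \
  -d "{
        \"llama_model_path\": \"/path/to/model.gguf\",
        \"n_parallel\": $THREADS,
        \"cont_batching\": true
      }" || true   # tolerate failure when the server or model is absent
```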
\ No newline at end of file
+> After enabling multithreading, monitor your system's performance. Adjust the `thread_num` as needed to optimize throughput and latency based on your workload.
\ No newline at end of file
diff --git a/docs/docs/features/warmup.md b/docs/docs/features/warmup.md
index 741ed139f..b709cfd7f 100644
--- a/docs/docs/features/warmup.md
+++ b/docs/docs/features/warmup.md
@@ -3,17 +3,11 @@ title: Warming Up Model
 description: Nitro warms up the model to optimize delays.
 ---
 
-## What is Model Warming Up?
-
-Model warming up is the process of running pre-requests through a model to optimize its components for production use. This step is crucial for reducing initialization and optimization delays during the first few inference requests.
-
-## What are the Benefits?
-
-Warming up an AI model offers several key benefits:
-
-- **Enhanced Initial Performance:** Unlike in `llama.cpp`, where the first inference can be very slow, warming up reduces initial latency, ensuring quicker response times from the start.
-- **Consistent Response Times:** Especially beneficial for systems updating models frequently, like those with real-time training, to avoid performance lags with new snapshots.
+Model warming up involves pre-running requests through an AI model to fine-tune its components for production. This step minimizes delays during initial inferences, ensuring readiness for immediate use.
+
+**Key Advantages:**
+
+- Improved Initial Performance.
+- Stable Response Times.
 
 ## How to Enable Model Warming Up?
 
 On the Nitro server, model warming up is automatically enabled whenever a new model is loaded. This means that the server handles the warm-up process behind the scenes, ensuring that the model is ready for efficient and effective performance from the first inference request.
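Since the warmup.md hunk above says warm-up runs automatically on model load, one way to observe it is to time the first and second inference after loading a model: with warm-up, the first request should not be dramatically slower. This is a rough sketch, not a rigorous benchmark; the endpoint path is assumed as in the other examples, and `date +%s%N` requires GNU date.

```shell
# Time two identical requests after a model load; if warm-up worked,
# request 1 should take roughly as long as request 2.
URL="http://localhost:3928/inferences/llamacpp/chat_completion"
BODY='{"messages":[{"role":"user","content":"ping"}]}'
for i in 1 2; do
  START=$(date +%s%N)
  curl --max-time 30 -s "$URL" \
    -H "Content-Type: application/json" -d "$BODY" > /dev/null || true
  END=$(date +%s%N)
  echo "request $i took $(( (END - START) / 1000000 )) ms"
done
```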