Restructure the docs + simplify features

hahuyhoang411 committed Nov 15, 2023
1 parent 44462ae commit 279de3b

Showing 11 changed files with 242 additions and 118 deletions.
20 changes: 0 additions & 20 deletions docs/docs/features/batch.md

This file was deleted.

22 changes: 9 additions & 13 deletions docs/docs/features/embed.md
@@ -4,24 +4,20 @@ title: Embedding

## Activating Embedding Feature

To activate the embedding feature in Nitro, include the JSON parameter `"embedding": true` in the [inference request](features/load-unload.md). This setting allows Nitro to process inferences with embedding enabled, enhancing the model's capabilities.

### Example Request

Here’s an example showing how to get the embedding result from the model:

```zsh title="Enable Embedding" {5}
curl -X POST 'http://localhost:3928/inferences/llamacpp/loadmodel' \
-H 'Content-Type: application/json' \
-d '{
"llama_model_path": "/path/to/your_model.gguf",
"embedding": true,
"pre_prompt": "A chat between a curious user and an artificial intelligence",
"user_prompt": "USER: "
}'
```zsh title="Embedding" {1}
curl -X POST 'http://localhost:3928/inferences/llamacpp/embedding' \
-H 'Content-Type: application/json' \
-d '{"content":"hello"}'
```

The example response uses output from the model [llama2 Chat 7B Q5 (GGUF)](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/tree/main).

```zsh title="Example Output"
{"embedding":[-0.2521767318248749,-1.3601059913635254,-1.5674391984939575,-1.0478453636169434,1.3870385885238647,-0.8430665731430054,0.07460325956344604,0.46725934743881226,1.4008780717849731,0.8727640509605408,-0.2514731287956238,...]}
```
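
Note that the embedding endpoint requires a model loaded with `"embedding": true`. Below is a minimal sketch of such a load request, using the `loadmodel` endpoint shown in the load/unload docs; the model path is a placeholder:

```zsh title="Load Model with Embedding (sketch)"
# Sketch: load a model with embedding enabled so that /embedding calls succeed.
# The model path below is a placeholder; adjust it for your system.
curl -X POST 'http://localhost:3928/inferences/llamacpp/loadmodel' \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/path/to/your_model.gguf",
    "embedding": true
  }'
```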
24 changes: 0 additions & 24 deletions docs/docs/features/gpu.md

This file was deleted.

85 changes: 67 additions & 18 deletions docs/docs/features/load-unload.md
@@ -8,60 +8,109 @@ The loadModel function in Nitro enables the loading of a model into the system.

You can simply load the model using:

```zsh title="Load Model" {1}
curl -X POST 'http://localhost:3928/inferences/llamacpp/loadmodel' \
-H 'Content-Type: application/json' \
-d '{
"llama_model_path": "/path/to/your_model.gguf",
"ctx_len": 512,
}'
```

For more details on loading the model, please refer to the [Table of parameters](#table-of-parameters).

:::info Run on a different port
You can run Nitro on a different port (e.g. 5000) instead of the default 3928 by running it manually in the terminal with the format:

```
./nitro ([thread_num] [host] [port])
./nitro 1 127.0.0.1 5000
```

`thread_num`: the number of threads for the Nitro web server.

`host`: the host address, normally 127.0.0.1 or 0.0.0.0.

`port`: the port that Nitro is deployed on.
:::

### Enabling GPU Inference

To enable GPU inference in Nitro, a simple POST request is used. This request will instruct Nitro to load the specified model into the GPU, significantly boosting the inference throughput.

```zsh title="GPU enable" {5}
curl -X POST 'http://localhost:3928/inferences/llamacpp/loadmodel' \
-H 'Content-Type: application/json' \
-d '{
"llama_model_path": "/path/to/your_model.gguf",
"ctx_len": 512,
"ngl": 100,
}'
```

You can adjust the `ngl` parameter based on your requirements and GPU capabilities.

### Continuous batching
Nitro provides a `continuous batching` feature, which combines multiple requests for the same model execution to deliver larger throughput.

```zsh title="Enable Batching" {6,7}
curl -X POST 'http://localhost:3928/inferences/llamacpp/loadmodel' \
-H 'Content-Type: application/json' \
-d '{
"llama_model_path": "/path/to/your_model.gguf",
"ctx_len": 512,
"cont_batching": true,
"n_parallel": 4,
}'
```

You can adjust `n_parallel` to suit your use case.
> In the example, `n_parallel = 4` means you can serve up to 4 users at the same time.
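
To observe batching in effect, you can fire several requests concurrently. The sketch below assumes the `chat_completion` endpoint shown in the prompt docs and a shell with background jobs; it is an illustration, not a benchmark:

```zsh title="Concurrent Requests (sketch)"
# Sketch: send 4 requests in the background so they can be served in one batch
# when the model was loaded with "cont_batching": true and "n_parallel": 4.
for i in 1 2 3 4; do
  curl -s -X POST 'http://localhost:3928/inferences/llamacpp/chat_completion' \
    -H 'Content-Type: application/json' \
    -d '{"messages": [{"role": "user", "content": "Hello"}]}' &
done
wait
```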
## Unload model
To unload a model, you can use a similar `curl` command as for loading, changing the endpoint to `/unloadmodel`.

```zsh title="Unload the model" {1}
curl -X POST 'http://localhost:3928/inferences/llamacpp/unloadmodel' \
-H 'Content-Type: application/json' \
-d '{
"llama_model_path": "/path/to/your_model.gguf",
}'
```

### Stop background process
:::danger TODO
updating
:::

## Status
The `modelStatus` function provides the current status of the model, including whether it is loaded and its properties. This function offers improved monitoring capabilities compared to `llama.cpp`.

```zsh title="Check Model Status" {1}
curl -X POST 'http://localhost:3928/inferences/llamacpp/modelstatus' \
-H 'Content-Type: application/json' \
-d '{
"llama_model_path": "/path/to/your_model.gguf",
}'
```

If the model is loaded correctly, the response will be:

```zsh title="Load Model Sucessfully"
{"message":"Model loaded successfully", "code": "ModelloadedSuccessfully"}
```

If you get an error while loading the model, please check that the model path is correct.
```zsh title="Load Model Failed"
{"message":"No model loaded", "code": "NoModelLoaded"}
```
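
For automation, the status check can be scripted; a minimal sketch, assuming the response codes shown above:

```zsh title="Scripted Status Check (sketch)"
# Sketch: query the model status and branch on the response codes above.
response=$(curl -s -X POST 'http://localhost:3928/inferences/llamacpp/modelstatus' \
  -H 'Content-Type: application/json' \
  -d '{"llama_model_path": "/path/to/your_model.gguf"}')
if echo "$response" | grep -q 'ModelloadedSuccessfully'; then
  echo "Model is loaded."
else
  echo "Model is not loaded; check the model path."
fi
```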


### Table of parameters

| Parameter | Type | Description |
|------------------|---------|--------------------------------------------------------------|
| `llama_model_path` | String | The file path to the LLaMA model. |
| `ngl` | Integer | The number of GPU layers to use. |
| `ctx_len` | Integer | The context length for the model operations. |
| `embedding` | Boolean | Whether to use embedding in the model. |
| `n_parallel` | Integer | The number of parallel operations. Uses Drogon thread count if not set. |
| `cont_batching` | Boolean | Whether to use continuous batching. |
| `user_prompt` | String | The prompt to use for the user. |
| `ai_prompt` | String | The prompt to use for the AI assistant. |
| `system_prompt` | String | The prompt to use for system rules. |
| `pre_prompt` | String | The prompt to use for internal configuration. |
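
For illustration, here is a sketch of a single load request combining several of these parameters; the values are placeholders rather than tuned recommendations:

```zsh title="Combined Parameters (sketch)"
# Sketch: one loadmodel request using several parameters from the table above.
curl -X POST 'http://localhost:3928/inferences/llamacpp/loadmodel' \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/path/to/your_model.gguf",
    "ctx_len": 512,
    "ngl": 100,
    "embedding": true,
    "cont_batching": true,
    "n_parallel": 4,
    "pre_prompt": "A chat between a curious user and an artificial intelligence",
    "user_prompt": "USER: ",
    "ai_prompt": "ASSISTANT: "
  }'
```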
38 changes: 26 additions & 12 deletions docs/docs/features/prompt.md
@@ -2,29 +2,26 @@
title: Prompt Role Support
---

Understanding the roles of different prompts (system, user, and assistant) is crucial for effectively utilizing the Large Language Model. These prompts work together to create a coherent and functional conversational flow.

With Nitro, developers can easily configure the dialog for the system prompt or implement advanced prompt engineering like [few-shot learning](https://arxiv.org/abs/2005.14165).


## System prompt
- The system prompt is foundational in setting up the assistant's behavior. You can configure it via `pre_prompt`.

## User prompt
- User prompts are the requests or comments directed towards the assistant. They form the core of the conversation, with the assistant responding to these user inputs.

- The user messages are essential for directing the flow and nature of the conversation.

## Assistant prompt
- Assistant prompts are responses or messages generated by the assistant. These can be previous responses stored in the system or examples provided by developers to demonstrate desired behavior.

- Assistant prompts are crucial for maintaining the context and continuity in ongoing conversations.

## Example usage

Combining all three roles, we can create a "Pirate assistant" as an example:

> NOTE: The "ai_prompt" and "user_prompt" is a prefix for a role. Please config it depend on the model you use.
```zsh title="Prompt Configuration" {7,8,9}
curl -X POST 'http://localhost:3928/inferences/llamacpp/loadmodel' \
-H 'Content-Type: application/json' \
-d '{
"llama_model_path": "/path/to/your_model.gguf",
"ctx_len": 128,
"ngl": 100,
"pre_prompt": "You are a Pirate. Using drunk language with a lot of Arr...",
"user_prompt": "USER: ",
"user_prompt": "USER:",
"ai_prompt": "ASSISTANT: "
}'
```
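
For intuition, these prefixes are stitched together around the conversation. The layout below is an assumed illustration of the final prompt for this configuration, not Nitro's verbatim template:

```text title="Assembled Prompt (illustrative)"
You are a Pirate. Using drunk language with a lot of Arr...
USER: Hello, who is your captain?
ASSISTANT:
```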

To test the assistant:

```zsh title="Pirate Assistant"
curl -X POST 'http://localhost:3928/inferences/llamacpp/chat_completion' \
-H "Content-Type: application/json" \
-d '{
"llama_model_path": "/path/to/your_model.gguf",
"messages": [
{
"role": "user",
"content": "Hello, who is your captain?"
}
]
}'
```



67 changes: 67 additions & 0 deletions docs/docs/new/about.md
@@ -0,0 +1,67 @@
---
title: About Nitro
slug: /docs
---

Nitro is a fast, lightweight (3MB) inference server that can be embedded in apps to run local AI. Nitro can be used to run a variety of popular open source AI models, and provides an OpenAI-compatible API.

Nitro is used to power [Jan](https://jan.ai), an open source alternative to OpenAI's platform that can be run on your own computer or server.


Nitro is a fast, lightweight, and embeddable inference engine, powering [Jan](https://jan.ai/). Developed in C++, it's specially optimized for use in edge computing and is ready for deployment in products.

⚡ Discover more about Nitro on [GitHub](https://github.com/janhq/nitro)

## Why Nitro?

### Lightweight & Fast

- Old materials
- At a mere 3MB, Nitro is dramatically smaller than typical inference engines, which makes it an ideal choice for embedding in applications.
- Nitro is designed to blend seamlessly into your application without restricting the use of other tools. This flexibility is a crucial advantage.
- **Quick Setup:**
Nitro can be up and running in about 10 seconds. This rapid deployment means you can focus more on development and less on installation processes.

- Old material
- Nitro uses the `drogon` C++17/20 HTTP application framework, which is known for its speed and ensures that Nitro processes requests swiftly, so your applications can make quick decisions based on complex data.
- Nitro builds on Drogon, a production-ready C++ web framework whose non-blocking socket IO keeps web services efficient, robust, and reliable.
- [Batching Inference](features/batch)
- Non-blocking Socket IO

### OpenAI-compatible API

- [ ] OpenAI-compatible
- [ ] Given examples
- [ ] What is not covered? (e.g. Assistants, Tools -> See Jan)

- Extends OpenAI's API with helpful model methods
- e.g. Load/Unload model
- e.g. Checking model status
- [Unload model](features/load-unload)
- With Nitro, you gain more control over `llama.cpp` features. You can now stop background slot processing and unload models as needed. This level of control optimizes resource usage and enhances application performance.

### Cross-Platform

- [ ] Cross-platform

### Multi-modal

- [ ] Hint at what's coming

## Architecture

- [ ] Link to Specifications

## Support

- [ ] File a Github Issue
- [ ] Go to Discord

## Contributing

- [ ] Link to Github

## Acknowledgements

- [drogon](https://github.com/drogonframework/drogon): the fast C++ web framework supporting C++14/17
- [llama.cpp](https://github.com/ggerganov/llama.cpp): Inference of LLaMA model in pure C/C++
7 changes: 7 additions & 0 deletions docs/docs/new/architecture.md
@@ -0,0 +1,7 @@
---
title: Architecture
---

We should have only one document:
- [ ] Refactor system/architecture
- [ ] Refactor system/key-concepts
4 changes: 4 additions & 0 deletions docs/docs/new/install.md
@@ -0,0 +1,4 @@
---
title: Install from Source
slug: /install
---
34 changes: 34 additions & 0 deletions docs/docs/new/quickstart.md
@@ -0,0 +1,34 @@
---
title: Quickstart
---

- Objective
- Quickstart shows the "power" of the system very quickly
- Combine
- [ ] nitro/using-nitro
- [ ] nitro/installation
- [ ] nitro/first-call

## Getting Nitro

- [ ] Overview of the different ways to install nitro
- [ ] via npm
- [ ] via pip
- [ ] via shell script
- [ ] Link to other page for "Build from Source" (tedious, not happy path)
- [ ] What does installing Nitro do? (what changes in your system?)

## Downloading a Model

- Recommend an actual model to download (see the sketch below)
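
For instance, the [llama2 Chat 7B Q5 (GGUF)](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/tree/main) build referenced in the embedding docs would work; a sketch, where the exact filename is an assumption to verify against the repository:

```zsh title="Download a Model (sketch)"
# Sketch: fetch a GGUF build of Llama 2 7B Chat from Hugging Face.
# The filename is an assumption; check the repository for the current files.
wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf
```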

## Check Nitro server

```zsh title="Nitro Health Status"
curl -X GET http://localhost:3928/healthz
```

## Making an Inference

- Make an actual inference call using Nitro (see the sketch below)
- Talk about OpenAI compatibility
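
A minimal sketch of such an inference call, reusing the `chat_completion` endpoint from the prompt docs (this assumes a model has already been loaded):

```zsh title="Inference Call (sketch)"
# Sketch: a simple chat completion request against a loaded model.
curl -X POST 'http://localhost:3928/inferences/llamacpp/chat_completion' \
  -H 'Content-Type: application/json' \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello, who are you?"}
    ]
  }'
```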