Restructure the docs + simplify features

hahuyhoang411 committed Nov 15, 2023
1 parent 44462ae commit 279de3b

Showing 11 changed files with 242 additions and 118 deletions.
20 changes: 0 additions & 20 deletions docs/docs/features/batch.md

This file was deleted.

22 changes: 9 additions & 13 deletions docs/docs/features/embed.md
@@ -4,24 +4,20 @@ title: Embedding

## Activating Embedding Feature

To activate the embedding feature in Nitro, include the JSON parameter `"embedding": true` in the [inference request](features/load-unload.md). This setting allows Nitro to process inferences with embedding enabled, enhancing the model's capabilities.

### Example Request

Here’s an example showing how to get the embedding result from the model:

```zsh title="Enable Embedding" {5}
curl -X POST 'http://localhost:3928/inferences/llamacpp/loadmodel' \
-H 'Content-Type: application/json' \
-d '{
"llama_model_path": "/path/to/your_model.gguf",
"embedding": true,
"pre_prompt": "A chat between a curious user and an artificial intelligence",
"user_prompt": "USER: "
}'
```zsh title="Embedding" {1}
curl -X POST 'http://localhost:3928/inferences/llamacpp/embedding' \
-H 'Content-Type: application/json' \
-d '{"content":"hello"}'
```

The example response uses output from the model [llama2 Chat 7B Q5 (GGUF)](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/tree/main).

```zsh title="Example Output"
{"embedding":[-0.2521767318248749,-1.3601059913635254,-1.5674391984939575,-1.0478453636169434,1.3870385885238647,-0.8430665731430054,0.07460325956344604,0.46725934743881226,1.4008780717849731,0.8727640509605408,-0.2514731287956238,...]}
```
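
Note that the embedding endpoint requires a model loaded with `"embedding": true`. Below is a minimal sketch of such a load request, using the `loadmodel` endpoint shown in the load/unload docs; the model path is a placeholder:

```zsh title="Load Model with Embedding (sketch)"
# Sketch: load a model with embedding enabled so that /embedding calls succeed.
# The model path below is a placeholder; adjust it for your system.
curl -X POST 'http://localhost:3928/inferences/llamacpp/loadmodel' \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/path/to/your_model.gguf",
    "embedding": true
  }'
```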
24 changes: 0 additions & 24 deletions docs/docs/features/gpu.md

This file was deleted.

85 changes: 67 additions & 18 deletions docs/docs/features/load-unload.md
@@ -8,60 +8,109 @@ The loadModel function in Nitro enables the loading of a model into the system.

You can simply load the model using:

```zsh title="Load Model" {1}
curl -X POST 'http://localhost:3928/inferences/llamacpp/loadmodel' \
-H 'Content-Type: application/json' \
-d '{
"llama_model_path": "/path/to/your_model.gguf",
"ctx_len": 512,
}'
```

For more details on loading the model, please refer to the [Table of parameters](#table-of-parameters).

:::info Run on a different port
You can run Nitro on a different port (e.g. 5000) instead of the default 3928 by running it manually in the terminal with the format:

```
./nitro ([thread_num] [host] [port])
./nitro 1 127.0.0.1 5000
```

`thread_num`: the number of threads for the Nitro web server.

`host`: the host address, normally 127.0.0.1 or 0.0.0.0.

`port`: the port that Nitro is deployed on.
:::

### Enabling GPU Inference

To enable GPU inference in Nitro, a simple POST request is used. This request will instruct Nitro to load the specified model into the GPU, significantly boosting the inference throughput.

```zsh title="GPU enable" {5}
curl -X POST 'http://localhost:3928/inferences/llamacpp/loadmodel' \
-H 'Content-Type: application/json' \
-d '{
"llama_model_path": "/path/to/your_model.gguf",
"ctx_len": 512,
"ngl": 100,
}'
```

You can adjust the `ngl` parameter based on your requirements and GPU capabilities.

### Continuous batching
Nitro provides a `continuous batching` feature, which combines multiple requests for the same model execution to deliver larger throughput.

```zsh title="Enable Batching" {6,7}
curl -X POST 'http://localhost:3928/inferences/llamacpp/loadmodel' \
-H 'Content-Type: application/json' \
-d '{
"llama_model_path": "/path/to/your_model.gguf",
"ctx_len": 512,
"cont_batching": true,
"n_parallel": 4,
}'
```

You can adjust `n_parallel` to suit your use case.
> In the example, `n_parallel = 4` means you can serve up to 4 users at the same time.
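
To observe batching in effect, you can fire several requests concurrently. The sketch below assumes the `chat_completion` endpoint shown in the prompt docs and a shell with background jobs; it is an illustration, not a benchmark:

```zsh title="Concurrent Requests (sketch)"
# Sketch: send 4 requests in the background so they can be served in one batch
# when the model was loaded with "cont_batching": true and "n_parallel": 4.
for i in 1 2 3 4; do
  curl -s -X POST 'http://localhost:3928/inferences/llamacpp/chat_completion' \
    -H 'Content-Type: application/json' \
    -d '{"messages": [{"role": "user", "content": "Hello"}]}' &
done
wait
```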
## Unload model
To unload a model, you can use a similar `curl` command as for loading, changing the endpoint to `/unloadmodel`.

```zsh title="Unload the model" {1}
curl -X POST 'http://localhost:3928/inferences/llamacpp/unloadmodel' \
-H 'Content-Type: application/json' \
-d '{
"llama_model_path": "/path/to/your_model.gguf",
}'
```

### Stop background process
:::danger TODO
updating
:::

## Status
The `modelStatus` function provides the current status of the model, including whether it is loaded and its properties. This function offers improved monitoring capabilities compared to `llama.cpp`.

```zsh title="Check Model Status" {1}
curl -X POST 'http://localhost:3928/inferences/llamacpp/modelstatus' \
-H 'Content-Type: application/json' \
-d '{
"llama_model_path": "/path/to/your_model.gguf",
}'
```

If the model is loaded correctly, the response will be:

```zsh title="Load Model Sucessfully"
{"message":"Model loaded successfully", "code": "ModelloadedSuccessfully"}
```

If you get an error while loading the model, please check that the model path is correct.
```zsh title="Load Model Failed"
{"message":"No model loaded", "code": "NoModelLoaded"}
```
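
For automation, the status check can be scripted; a minimal sketch, assuming the response codes shown above:

```zsh title="Scripted Status Check (sketch)"
# Sketch: query the model status and branch on the response codes above.
response=$(curl -s -X POST 'http://localhost:3928/inferences/llamacpp/modelstatus' \
  -H 'Content-Type: application/json' \
  -d '{"llama_model_path": "/path/to/your_model.gguf"}')
if echo "$response" | grep -q 'ModelloadedSuccessfully'; then
  echo "Model is loaded."
else
  echo "Model is not loaded; check the model path."
fi
```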


### Table of parameters

| Parameter | Type | Description |
|------------------|---------|--------------------------------------------------------------|
| `llama_model_path` | String | The file path to the LLaMA model. |
| `ngl` | Integer | The number of GPU layers to use. |
| `ctx_len` | Integer | The context length for the model operations. |
| `embedding` | Boolean | Whether to use embedding in the model. |
| `n_parallel` | Integer | The number of parallel operations. Uses Drogon thread count if not set. |
| `cont_batching` | Boolean | Whether to use continuous batching. |
| `user_prompt` | String | The prompt to use for the user. |
| `ai_prompt` | String | The prompt to use for the AI assistant. |
| `system_prompt` | String | The prompt to use for system rules. |
| `pre_prompt` | String | The prompt to use for internal configuration. |
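
For illustration, here is a sketch of a single load request combining several of these parameters; the values are placeholders rather than tuned recommendations:

```zsh title="Combined Parameters (sketch)"
# Sketch: one loadmodel request using several parameters from the table above.
curl -X POST 'http://localhost:3928/inferences/llamacpp/loadmodel' \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/path/to/your_model.gguf",
    "ctx_len": 512,
    "ngl": 100,
    "embedding": true,
    "cont_batching": true,
    "n_parallel": 4,
    "pre_prompt": "A chat between a curious user and an artificial intelligence",
    "user_prompt": "USER: ",
    "ai_prompt": "ASSISTANT: "
  }'
```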
38 changes: 26 additions & 12 deletions docs/docs/features/prompt.md
@@ -2,29 +2,26 @@
title: Prompt Role Support
---

Understanding the roles of different prompts (system, user, and assistant) is crucial for effectively utilizing the Large Language Model. These prompts work together to create a coherent and functional conversational flow.

With Nitro, developers can easily configure the dialog for the system prompt or implement advanced prompt engineering like [few-shot learning](https://arxiv.org/abs/2005.14165).


## System prompt
- The system prompt is foundational in setting up the assistant's behavior. You can configure it via `pre_prompt`.

## User prompt
- User prompts are the requests or comments directed towards the assistant. They form the core of the conversation, with the assistant responding to these user inputs.

- The user messages are essential for directing the flow and nature of the conversation.

## Assistant prompt
- Assistant prompts are responses or messages generated by the assistant. These can be previous responses stored in the system or examples provided by developers to demonstrate desired behavior.

- Assistant prompts are crucial for maintaining the context and continuity in ongoing conversations.

## Example usage

Combining all three roles, we can create a "Pirate assistant" as an example:

> NOTE: The "ai_prompt" and "user_prompt" is a prefix for a role. Please config it depend on the model you use.
```zsh title="Prompt Configuration" {7,8,9}
curl -X POST 'http://localhost:3928/inferences/llamacpp/loadmodel' \
-H 'Content-Type: application/json' \
-d '{
"llama_model_path": "/path/to/your_model.gguf",
"ctx_len": 128,
"ngl": 100,
"pre_prompt": "You are a Pirate. Using drunk language with a lot of Arr...",
"user_prompt": "USER: ",
"user_prompt": "USER:",
"ai_prompt": "ASSISTANT: "
}'
```
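
For intuition, these prefixes are stitched together around the conversation. The layout below is an assumed illustration of the final prompt for this configuration, not Nitro's verbatim template:

```text title="Assembled Prompt (illustrative)"
You are a Pirate. Using drunk language with a lot of Arr...
USER: Hello, who is your captain?
ASSISTANT:
```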

To test the assistant:

```zsh title="Pirate Assistant"
curl -X POST 'http://localhost:3928/inferences/llamacpp/chat_completion' \
-H "Content-Type: application/json" \
-d '{
"llama_model_path": "/path/to/your_model.gguf",
"messages": [
{
"role": "user",
"content": "Hello, who is your captain?"
}
]
}'
```



67 changes: 67 additions & 0 deletions docs/docs/new/about.md
@@ -0,0 +1,67 @@
---
title: About Nitro
slug: /docs
---

Nitro is a fast, lightweight (3MB) inference server that can be embedded in apps to run local AI. Nitro can be used to run a variety of popular open source AI models, and provides an OpenAI-compatible API.

Nitro is used to power [Jan](https://jan.ai), an open source alternative to OpenAI's platform that can be run on your own computer or server.


Nitro is a fast, lightweight, and embeddable inference engine, powering [Jan](https://jan.ai/). Developed in C++, it's specially optimized for use in edge computing and is ready for deployment in products.

⚡ Discover more about Nitro on [GitHub](https://github.com/janhq/nitro)

## Why Nitro?

### Lightweight & Fast

- Old materials
- At a mere 3MB, Nitro is dramatically smaller than typical inference engines, which makes it an ideal choice for embedding in applications.
- Nitro is designed to blend seamlessly into your application without restricting the use of other tools. This flexibility is a crucial advantage.
- **Quick Setup:**
Nitro can be up and running in about 10 seconds. This rapid deployment means you can focus more on development and less on installation processes.

- Old material
- Nitro uses the `drogon` C++17/20 HTTP application framework, which is known for its speed and ensures that Nitro processes requests swiftly, so your applications can make quick decisions based on complex data.
- Nitro builds on Drogon, a production-ready C++ web framework whose non-blocking socket IO keeps web services efficient, robust, and reliable.
- [Batching Inference](features/batch)
- Non-blocking Socket IO

### OpenAI-compatible API

- [ ] OpenAI-compatible
- [ ] Given examples
- [ ] What is not covered? (e.g. Assistants, Tools -> See Jan)

- Extends OpenAI's API with helpful model methods
- e.g. Load/Unload model
- e.g. Checking model status
- [Unload model](features/load-unload)
- With Nitro, you gain more control over `llama.cpp` features. You can now stop background slot processing and unload models as needed. This level of control optimizes resource usage and enhances application performance.

### Cross-Platform

- [ ] Cross-platform

### Multi-modal

- [ ] Hint at what's coming

## Architecture

- [ ] Link to Specifications

## Support

- [ ] File a Github Issue
- [ ] Go to Discord

## Contributing

- [ ] Link to Github

## Acknowledgements

- [drogon](https://github.com/drogonframework/drogon): the fast C++ web framework supporting C++14/17
- [llama.cpp](https://github.com/ggerganov/llama.cpp): Inference of LLaMA model in pure C/C++
7 changes: 7 additions & 0 deletions docs/docs/new/architecture.md
@@ -0,0 +1,7 @@
---
title: Architecture
---

We should have only one document:
- [ ] Refactor system/architecture
- [ ] Refactor system/key-concepts
4 changes: 4 additions & 0 deletions docs/docs/new/install.md
@@ -0,0 +1,4 @@
---
title: Install from Source
slug: /install
---
34 changes: 34 additions & 0 deletions docs/docs/new/quickstart.md
@@ -0,0 +1,34 @@
---
title: Quickstart
---

- Objective
- Quickstart shows the "power" of the system very quickly
- Combine
- [ ] nitro/using-nitro
- [ ] nitro/installation
- [ ] nitro/first-call

## Getting Nitro

- [ ] Overview of the different ways to install nitro
- [ ] via npm
- [ ] via pip
- [ ] via shell script
- [ ] Link to other page for "Build from Source" (tedious, not happy path)
- [ ] What does installing Nitro do? (what changes in your system?)

## Downloading a Model

- Recommend an actual model to download (see the sketch below)
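
For instance, the [llama2 Chat 7B Q5 (GGUF)](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/tree/main) build referenced in the embedding docs would work; a sketch, where the exact filename is an assumption to verify against the repository:

```zsh title="Download a Model (sketch)"
# Sketch: fetch a GGUF build of Llama 2 7B Chat from Hugging Face.
# The filename is an assumption; check the repository for the current files.
wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf
```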

## Check Nitro server

```zsh title="Nitro Health Status"
curl -X GET http://localhost:3928/healthz
```

## Making an Inference

- Make an actual inference call using Nitro (see the sketch below)
- Talk about OpenAI compatibility
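
A minimal sketch of such an inference call, reusing the `chat_completion` endpoint from the prompt docs (this assumes a model has already been loaded):

```zsh title="Inference Call (sketch)"
# Sketch: a simple chat completion request against a loaded model.
curl -X POST 'http://localhost:3928/inferences/llamacpp/chat_completion' \
  -H 'Content-Type: application/json' \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello, who are you?"}
    ]
  }'
```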