Home
This AI worker ("workerbee") operates at a high level, responding to model inference and fine-tuning requests for specific models.
Currently we only support llama2 models, via the llama-cpp-python library, because it's fast, compatible with any platform and GPU, feature-complete, and handles model splitting. Stable Diffusion will likely be the next target.
```mermaid
graph TD
    subgraph "Many Workers"
        W1[Connect to QueenBee URL via WebSockets]
        W2[Send a Registration Message]
        W3[Listen for Inference & Training Websocket Requests]
        W4[Reply with OpenAI Formatted Response]
        W1 --> W2
        W2 --> W3
        W3 --> W4
    end
    subgraph "One QueenBee"
        S1[Listen for Worker Connections]
        S2[Accumulate List of Active Workers]
        S3[Listen for Inference, Training REST Requests]
        S4[Pick an Available Worker]
        S5[Send Request Over a Websocket]
        S6[Translate Reply Into REST Response]
        S7[Complete Any Billing]
        S1 --> S2
        S2 --> S3
        S3 --> S4
        S4 --> S5
        S5 --> S6
        S6 --> S7
    end
    subgraph "Many Clients"
        C1[Make REST Requests Against the QueenBee]
    end
    %%% Interactions %%%
    W1 -.-> S1
    W2 -.-> S2
    C1 -.-> S3
    S5 -.-> W3
    W4 -.-> S6
    S6 -.-> C1
```
Each worker communicates with the QueenBee through WebSockets and performs specific tasks, such as listening for inference and training requests and replying with OpenAI-formatted responses.
- Action: Connect to [queenbee url] via WebSockets.
- Action: Send a JSON-formatted registration message through the established WebSocket connection.
  - Registration Message Fields:
    - `ln_url` (str): Lightning URL for payment receipt.
    - `auth_key` (str): Optional token for control panel connection.
    - `cpu_count` (int): Number of CPUs available.
    - `disk_space` (int): Amount of disk space available (in GB).
    - `vram` (int): Amount of VRAM available (in MB).
    - `nv_gpu_count` (Optional[int], default: None): Count of Nvidia GPUs.
    - `nv_driver_version` (Optional[str], default: None): Version of the Nvidia GPU driver.
    - `nv_gpus` (Optional[List[GpuInfo]], default: []): Information for each Nvidia GPU.
    - `cl_driver_version` (Optional[str], default: None): Version of the OpenCL driver.
    - `cl_gpus` (Optional[List[GpuInfo]], default: []): Information for each OpenCL-compatible GPU.
- Action: Listen for JSON-formatted requests over WebSocket that contain two main members.
  - Request Fields:
    - `openai_url` (str): Always `/v1/chat/completion`. Can be ignored for now.
    - `openai_req` (dict): Contains the model name and a list of messages (formatted similarly to regular OpenAI requests).
- Action: Send back a JSON-formatted response that follows the OpenAI API schema.
  - Response Fields:
    - `choices`: Choices array, containing one or more choices depending on the request parameters.
    - `usage`: Contains metadata about the API usage, such as token count.
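The worker lifecycle above can be sketched with the third-party `websockets` library. This is a minimal sketch, not the project's actual implementation: the QueenBee URL is a placeholder, and `handle_request` stubs out the model call that a real worker would make.

```python
import asyncio
import json

# Hypothetical QueenBee endpoint; substitute the real URL.
QUEENBEE_URL = "wss://queenbee.example.com/worker"


def make_registration(ln_url: str, cpu_count: int, disk_space: int, vram: int) -> dict:
    """Build the JSON-serializable registration message described above."""
    return {
        "ln_url": ln_url,
        "cpu_count": cpu_count,
        "disk_space": disk_space,  # GB
        "vram": vram,              # MB
    }


def handle_request(req: dict) -> dict:
    """Turn an inference request into an OpenAI-formatted response.

    A real worker would run the model named in req["openai_req"]["model"];
    this stub returns a canned answer in the correct shape.
    """
    return {
        "choices": [
            {
                "message": {"role": "assistant", "content": "stub reply"},
                "finish_reason": "stop",
                "index": 0,
            }
        ],
        "usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0},
    }


async def run_worker() -> None:
    """Connect, register, then answer requests until the socket closes."""
    import websockets  # third-party: pip install websockets

    async with websockets.connect(QUEENBEE_URL) as ws:
        await ws.send(json.dumps(make_registration("[email protected]", 8, 500, 2048)))
        async for raw in ws:
            req = json.loads(raw)
            await ws.send(json.dumps(handle_request(req)))

# Start the worker with: asyncio.run(run_worker())
```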
Example registration message:

```json
{
  "ln_url": "[email protected]",
  "cpu_count": 8,
  "disk_space": 500,
  "vram": 2048,
  "nv_gpu_count": 1,
  "nv_driver_version": "465.19.01",
  "nv_gpus": [
    {
      "name": "NVIDIA Tesla K80",
      "memory": 11441
    }
  ]
}
```
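The registration field list maps onto plain dataclasses. This is a sketch for parsing and validating such messages, assuming the field types listed above; these are not the project's actual types, and `GpuInfo` here guesses a minimal `name`/`memory` shape from the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GpuInfo:
    name: str
    memory: int  # MB


@dataclass
class Registration:
    ln_url: str
    cpu_count: int
    disk_space: int  # GB
    vram: int        # MB
    auth_key: Optional[str] = None
    nv_gpu_count: Optional[int] = None
    nv_driver_version: Optional[str] = None
    nv_gpus: List[GpuInfo] = field(default_factory=list)
    cl_driver_version: Optional[str] = None
    cl_gpus: List[GpuInfo] = field(default_factory=list)


def parse_registration(msg: dict) -> Registration:
    """Parse a registration message dict, converting nested GPU entries."""
    msg = dict(msg)  # copy so the caller's dict is untouched
    msg["nv_gpus"] = [GpuInfo(**g) for g in msg.get("nv_gpus", [])]
    msg["cl_gpus"] = [GpuInfo(**g) for g in msg.get("cl_gpus", [])]
    return Registration(**msg)
```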
Example inference request:

```json
{
  "openai_url": "/v1/chat/completion",
  "openai_req": {
    "model": "TheBloke/CoolModelv2:Q4_K_M",
    "messages": [
      {
        "role": "user",
        "content": "Who won the world series in 2020?"
      }
    ]
  }
}
```
Example response:

```json
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "The Los Angeles Dodgers won the World Series in 2020."
      },
      "finish_reason": "stop",
      "index": 0
    }
  ],
  "usage": {
    "prompt_tokens": 56,
    "completion_tokens": 31,
    "total_tokens": 87
  }
}
```
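From the client side, the QueenBee looks like an ordinary OpenAI-style REST endpoint. A minimal sketch using only the standard library — the base URL is a placeholder, and `chat`/`first_reply` are illustrative helpers, not part of the project:

```python
import json
import urllib.request

# Hypothetical QueenBee base URL; substitute the real endpoint.
QUEENBEE_BASE = "https://queenbee.example.com"


def chat(model: str, content: str) -> dict:
    """POST an OpenAI-style chat completion request to the QueenBee."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": content}],
    }).encode()
    req = urllib.request.Request(
        QUEENBEE_BASE + "/v1/chat/completion",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def first_reply(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-formatted response."""
    return response["choices"][0]["message"]["content"]
```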