introductory pages
mcharytoniuk committed Jun 29, 2024
1 parent 8b8d9b7 commit 0f180c0
Showing 11 changed files with 54 additions and 14 deletions.
17 changes: 9 additions & 8 deletions src/README.md
@@ -1,15 +1,16 @@
# Introduction

This handbook is a pragmatic guide to LLMOps. It provides a sufficient understanding of [Large Language Models](/general-concepts/large-language-model), [deployment](/deployments) techniques, and [software engineering](/application-layer) practices to maintain the entire stack.

It assumes you are interested in self-hosting open source [Large Language Models](/general-concepts/large-language-model). If you only want to use them through HTTP APIs, you can jump straight to the [application layer](/application-layer) best practices.

## What is LLMOps?

`LLMOps` is a set of practices that deal with the deployment, maintenance, and scaling of [Large Language Models](/general-concepts/large-language-model). If you want to consider yourself an `LLMOps` practitioner, you should, at a minimum, be able to deploy and maintain a scalable setup of multiple running LLM instances.

## Self-Hosted vs Third Party

## New Class of Opportunities, New Class of Problems

```mermaid
graph TD;
A-->B;
A-->C;
B-->D;
C-->D;
```
Although there has been a recent trend of naming everything `*Ops` (`DevOps`, `Product Ops`, `MLOps`, `LLMOps`, `BizOps`, etc.), I think `LLMOps` and `MLOps` truly deserve their place as standalone sets of practices.

The issue they deal with is bridging the gap between applications and the AI models deployed in the infrastructure. They also handle a very specific set of problems arising from the use of GPUs and TPUs, with the primary stress being [Input/Output](/general-concepts/input-output) optimization.

11 changes: 6 additions & 5 deletions src/SUMMARY.md
@@ -3,23 +3,24 @@
[Introduction](README.md)

- [General Concepts]()
  - [Continuous Batching](./general-concepts/continuous-batching/README.md)
  - [Fine-tuning]()
  - [Input/Output]()
  - [Large Language Model]()
  - [Input/Output](./general-concepts/input-output/README.md)
  - [Large Language Model](./general-concepts/large-language-model/README.md)
  - [Load Balancing](./general-concepts/load-balancing/README.md)
    - [Forward Proxy]()
    - [Reverse Proxy]()
  - [Model Parameters]()
  - [Retrieval Augmented Generation]()
  - [Supervisor]()
- [Deployments]()
  - [llama.cpp]()
- [Deployments](./deployments/README.md)
  - [llama.cpp](./deployments/llama.cpp/README.md)
    - [Production Deployment]()
      - [AWS]()
      - [Kubernetes]()
  - [Ollama](./deployments/ollama/README.md)
  - [Paddler]()
- [Application Level]()
- [Application Layer](./application-layer/README.md)
  - [Architecture]()
    - [Long-Running]()
    - [Serverless]()
13 changes: 13 additions & 0 deletions src/application-layer/README.md
@@ -0,0 +1,13 @@
# Application Layer

This chapter is not strictly about LLMOps, but it is a good place to discuss best practices for architecting and developing applications that use [Large Language Models](/general-concepts/large-language-model).

Those applications have to deal with issues that rarely come up in traditional web development, primarily long-running HTTP requests and, on the MLOps side, serving custom models for inference.

Until [Large Language Models](/general-concepts/large-language-model) became mainstream and in demand across a variety of applications, the issue of dealing with long-running requests was much less prevalent. Due to functional requirements, a typical microservice request would complete in 10ms or less, while waiting for a [Large Language Model](/general-concepts/large-language-model) to finish inference can take multiple seconds.

That calls for adjustments in the application architecture: non-blocking [Input/Output](/general-concepts/input-output) and asynchronous programming.

This is where languages with first-class asynchronous support shine: Python with its `asyncio` library, Rust with `tokio`, Go with its goroutines, and so on.
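
For illustration, here is a minimal sketch (assuming Python with `asyncio`, the `httpx` client, and a hypothetical llama.cpp-style completion endpoint listening on `localhost:8080`) that issues several slow completion requests concurrently without blocking the event loop:

```python
import asyncio

import httpx

# Hypothetical endpoint; adjust the host, port, and path to your deployment.
COMPLETION_URL = "http://127.0.0.1:8080/completion"


async def complete(client: httpx.AsyncClient, prompt: str) -> str:
    # A single inference call can take multiple seconds; awaiting it keeps
    # the event loop free to handle other work in the meantime.
    response = await client.post(
        COMPLETION_URL,
        json={"prompt": prompt, "n_predict": 128},
        timeout=120.0,
    )
    response.raise_for_status()
    return response.json().get("content", "")


async def main() -> None:
    async with httpx.AsyncClient() as client:
        # Issue several long-running completions concurrently instead of
        # blocking on them one by one.
        results = await asyncio.gather(
            complete(client, "Explain continuous batching in one sentence."),
            complete(client, "What is a reverse proxy?"),
        )
        for result in results:
            print(result)


if __name__ == "__main__":
    asyncio.run(main())
```

The same pattern applies regardless of the runtime you choose; the point is that the handler yields while inference is in flight instead of tying up a worker.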

Programming languages like `PHP`, which are synchronous by default, might struggle unless supplemented by extensions like [Swoole](https://swoole.com/) (which essentially gives PHP Go-like coroutines) or libraries like [AMPHP](https://amphp.org/). Introducing support for asynchronous programming in PHP can be a challenge, but it is possible.
1 change: 1 addition & 0 deletions src/deployments/README.md
@@ -0,0 +1 @@
# Deployments
7 changes: 7 additions & 0 deletions src/deployments/llama.cpp/README.md
@@ -0,0 +1,7 @@
# llama.cpp

Llama.cpp is a production-ready, open-source runner for various [Large Language Models](/general-concepts/large-language-model).

It has an excellent built-in [server](https://github.com/ggerganov/llama.cpp/tree/master/examples/server) with an HTTP API.

In this handbook we will make extensive use of [Continuous Batching](/general-concepts/continuous-batching), which in practice makes it possible to handle parallel requests.
14 changes: 14 additions & 0 deletions src/deployments/ollama/README.md
@@ -8,8 +8,20 @@ For example, when you request completion from a model that is not yet loaded, it

In general terms, it acts like a `llama.cpp` [forward proxy](/general-concepts/load-balancing/forward-proxy.md) and a [supervisor](/general-concepts/load-balancing/supervisor.md).

For example, if you load both `llama3` and `phi-3` into the same Ollama instance, you will get something like this:

```mermaid
flowchart TD
Ollama --> llama1[llama.cpp with llama3]
Ollama --> llama2[llama.cpp with phi-3]
llama1 --> VRAM
llama2 --> VRAM
```

## Viability for Production

### Predictability

Although the `Ollama` approach is convenient for local development, it causes some deployment problems (compared to `llama.cpp`).

With `llama.cpp`, it is easy to divide the context of the loaded model into a specific number of slots, which makes it straightforward to predict how many parallel requests a given server can handle.
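
As a back-of-the-envelope sketch (the context size and slot count below are assumed example values, not recommendations), the capacity of a single instance follows directly from how the context is split:

```python
# Hypothetical llama.cpp-style configuration: the total context is divided
# evenly across slots, and each slot serves one request at a time.
total_context_tokens = 16384  # assumed context size for this example
parallel_slots = 8            # assumed number of slots for this example

tokens_per_slot = total_context_tokens // parallel_slots

print(f"max parallel requests per instance: {parallel_slots}")
print(f"context available to each request: {tokens_per_slot} tokens")
```
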
@@ -20,6 +32,8 @@ We might end up in a situation where `Ollama` keeps both [llama3](https://llama.

How can that be balanced effectively? As a software architect, you would have to plan an infrastructure that prevents developers from loading arbitrary models into memory and that forces a specific number of slots, which defeats the purpose of `Ollama`.

### Good Parts of Ollama

I greatly support `Ollama` because it makes it easy to start your journey with large language models. You can use `Ollama` in production deployments, but I think `llama.cpp` is a better choice because it is so predictable.

I think `Ollama` is better suited than `llama.cpp` for end-user distributable applications. By that, I mean the applications that do not use an external server but are installed and run in their entirety on the user's device. The same thing that makes it less predictable when it comes to resource usage makes it more resilient to end-user errors. In that context, resource usage predictability is less important than on the server side.
1 change: 1 addition & 0 deletions src/general-concepts/continuous-batching/README.md
@@ -0,0 +1 @@
# Continuous Batching
1 change: 1 addition & 0 deletions src/general-concepts/input-output/README.md
@@ -0,0 +1 @@
# Input/Output
1 change: 1 addition & 0 deletions src/general-concepts/large-language-model/README.md
@@ -0,0 +1 @@
# Large Language Model
2 changes: 1 addition & 1 deletion src/general-concepts/load-balancing/README.md
@@ -8,6 +8,6 @@ The interesting thing is that having some experience with 3D game development mi

## Differences Between Balancing GPU and CPU Load

In the context of LLMOps, the primary factors we have to deal with this time are [Input/Output](/general-concepts/input-output/README.md) bottlenecks instead of the usual CPU bottlenecks. That forces us to adjust how we design our infrastructure and applications.
In the context of LLMOps, the primary factors we have to deal with this time are [Input/Output](/general-concepts/input-output) bottlenecks instead of the usual CPU bottlenecks. That forces us to adjust how we design our infrastructure and applications.

We will also often use a different set of metrics than in traditional load balancing, usually ones closer to the application level (like the number of context slots in use, the number of buffered application requests, and so on).
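
As a minimal sketch of that idea (the `/health` endpoint and the `slots_idle` field below are assumptions for illustration; the exact endpoint and response shape depend on your server version and flags), a reverse proxy could route each request to the instance with the most free context slots instead of using plain round-robin:

```python
import httpx

# Hypothetical upstream llama.cpp instances; adjust to your deployment.
UPSTREAMS = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
]


def idle_slots(base_url: str) -> int:
    # Assumes a health endpoint that reports how many slots are free; the
    # endpoint name and response field here are illustrative, not canonical.
    response = httpx.get(f"{base_url}/health", timeout=2.0)
    response.raise_for_status()
    return int(response.json().get("slots_idle", 0))


def pick_upstream() -> str:
    # Application-level balancing: route to the instance with the most
    # free context slots instead of rotating round-robin.
    return max(UPSTREAMS, key=idle_slots)


if __name__ == "__main__":
    print(f"routing the next request to {pick_upstream()}")
```
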
Empty file added theme/header.hbs
Empty file.
