introductory pages
mcharytoniuk committed Jun 29, 2024
1 parent 8b8d9b7 commit 0f180c0
Showing 11 changed files with 54 additions and 14 deletions.
17 changes: 9 additions & 8 deletions src/README.md
@@ -1,15 +1,16 @@
# Introduction

This handbook is a pragmatic guide to LLMOps. It provides a sufficient understanding of [Large Language Models](/general-concepts/large-language-model), [deployment](/deployments) techniques, and [software engineering](/application-layer) practices to maintain the entire stack.

It assumes you are interested in self-hosting open source [Large Language Models](/general-concepts/large-language-model). If you only want to use them through HTTP APIs, you can jump straight to the [application layer](/application-layer) best practices.

## What is LLMOps?

`LLMOps` is a set of practices that deal with the deployment, maintenance, and scaling of [Large Language Models](/general-concepts/large-language-model). If you want to consider yourself an `LLMOps` practitioner, you should, at a minimum, be able to deploy and maintain a scalable setup of multiple running LLM instances.

## Self-Hosted vs Third Party

## New Class of Opportunities, New Class of Problems

```mermaid
graph TD;
A-->B;
A-->C;
B-->D;
C-->D;
```
Although there has been a recent trend of naming everything `*Ops` (`DevOps`, `Product Ops`, `MLOps`, `LLMOps`, `BizOps`, etc.), I think `LLMOps` and `MLOps` truly deserve their place as standalone sets of practices.

The issue they deal with is bridging the gap between applications and the AI models deployed in the infrastructure. They also handle a very specific set of problems arising from the use of GPUs and TPUs, with the primary stress being [Input/Output](/general-concepts/input-output) optimization.

11 changes: 6 additions & 5 deletions src/SUMMARY.md
@@ -3,23 +3,24 @@
[Introduction](README.md)

- [General Concepts]()
  - [Continuous Batching](./general-concepts/continuous-batching/README.md)
  - [Fine-tuning]()
  - [Input/Output]()
  - [Large Language Model]()
  - [Input/Output](./general-concepts/input-output/README.md)
  - [Large Language Model](./general-concepts/large-language-model/README.md)
  - [Load Balancing](./general-concepts/load-balancing/README.md)
    - [Forward Proxy]()
    - [Reverse Proxy]()
  - [Model Parameters]()
  - [Retrieval Augmented Generation]()
  - [Supervisor]()
- [Deployments]()
  - [llama.cpp]()
- [Deployments](./deployments/README.md)
  - [llama.cpp](./deployments/llama.cpp/README.md)
    - [Production Deployment]()
      - [AWS]()
      - [Kubernetes]()
  - [Ollama](./deployments/ollama/README.md)
  - [Paddler]()
- [Application Level]()
- [Application Layer](./application-layer/README.md)
  - [Architecture]()
    - [Long-Running]()
    - [Serverless]()
13 changes: 13 additions & 0 deletions src/application-layer/README.md
@@ -0,0 +1,13 @@
# Application Layer

This chapter is not strictly about LLMOps, but it is a good place to discuss best practices for architecting and developing applications that use [Large Language Models](/general-concepts/large-language-model).

Those applications have to deal with issues that rarely come up in traditional web development, primarily long-running HTTP requests and, on the MLOps side, serving custom models for inference.

Until [Large Language Models](/general-concepts/large-language-model) became mainstream and in demand across a variety of applications, the issue of dealing with long-running requests was much less prevalent. Due to functional requirements, a typical microservice request would complete in 10ms or less, while waiting for a [Large Language Model](/general-concepts/large-language-model) to finish inference can take multiple seconds.

That calls for adjustments in the application architecture: non-blocking [Input/Output](/general-concepts/input-output) and asynchronous programming.

This is where languages with first-class asynchronous support shine: Python with its `asyncio` library, Rust with `tokio`, Go with its goroutines, and so on.
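
For illustration, here is a minimal sketch (assuming Python with `asyncio`, the `httpx` client, and a hypothetical llama.cpp-style completion endpoint listening on `localhost:8080`) that issues several slow completion requests concurrently without blocking the event loop:

```python
import asyncio

import httpx

# Hypothetical endpoint; adjust the host, port, and path to your deployment.
COMPLETION_URL = "http://127.0.0.1:8080/completion"


async def complete(client: httpx.AsyncClient, prompt: str) -> str:
    # A single inference call can take multiple seconds; awaiting it keeps
    # the event loop free to handle other work in the meantime.
    response = await client.post(
        COMPLETION_URL,
        json={"prompt": prompt, "n_predict": 128},
        timeout=120.0,
    )
    response.raise_for_status()
    return response.json().get("content", "")


async def main() -> None:
    async with httpx.AsyncClient() as client:
        # Issue several long-running completions concurrently instead of
        # blocking on them one by one.
        results = await asyncio.gather(
            complete(client, "Explain continuous batching in one sentence."),
            complete(client, "What is a reverse proxy?"),
        )
        for result in results:
            print(result)


if __name__ == "__main__":
    asyncio.run(main())
```

The same pattern applies regardless of the runtime you choose; the point is that the handler yields while inference is in flight instead of tying up a worker.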

Programming languages like `PHP`, which are synchronous by default, might struggle unless supplemented by extensions like [Swoole](https://swoole.com/) (which essentially gives PHP Go-like coroutines) or libraries like [AMPHP](https://amphp.org/). Introducing support for asynchronous programming in PHP can be a challenge, but it is possible.
1 change: 1 addition & 0 deletions src/deployments/README.md
@@ -0,0 +1 @@
# Deployments
7 changes: 7 additions & 0 deletions src/deployments/llama.cpp/README.md
@@ -0,0 +1,7 @@
# llama.cpp

Llama.cpp is a production-ready, open-source runner for various [Large Language Models](/general-concepts/large-language-model).

It has an excellent built-in [server](https://github.com/ggerganov/llama.cpp/tree/master/examples/server) with an HTTP API.

In this handbook we will make extensive use of [Continuous Batching](/general-concepts/continuous-batching), which in practice makes it possible to handle parallel requests.
14 changes: 14 additions & 0 deletions src/deployments/ollama/README.md
@@ -8,8 +8,20 @@ For example, when you request completion from a model that is not yet loaded, it

In general terms, it acts like a `llama.cpp` [forward proxy](/general-concepts/load-balancing/forward-proxy.md) and a [supervisor](/general-concepts/load-balancing/supervisor.md).

For example, if you load both `llama3` and `phi-3` into the same Ollama instance, you will get something like this:

```mermaid
flowchart TD
Ollama --> llama1[llama.cpp with llama3]
Ollama --> llama2[llama.cpp with phi-3]
llama1 --> VRAM
llama2 --> VRAM
```

## Viability for Production

### Predictability

Although the `Ollama` approach is convenient for local development, it causes some deployment problems (compared to `llama.cpp`).

With `llama.cpp`, it is easy to divide the context of the loaded model into a specific number of slots, which makes it straightforward to predict how many parallel requests a given server can handle.
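
As a back-of-the-envelope sketch (the context size and slot count below are assumed example values, not recommendations), the capacity of a single instance follows directly from how the context is split:

```python
# Hypothetical llama.cpp-style configuration: the total context is divided
# evenly across slots, and each slot serves one request at a time.
total_context_tokens = 16384  # assumed context size for this example
parallel_slots = 8            # assumed number of slots for this example

tokens_per_slot = total_context_tokens // parallel_slots

print(f"max parallel requests per instance: {parallel_slots}")
print(f"context available to each request: {tokens_per_slot} tokens")
```
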
@@ -20,6 +32,8 @@ We might end up in a situation where `Ollama` keeps both [llama3](https://llama.

How can that be balanced effectively? As a software architect, you would have to plan an infrastructure that prevents developers from loading arbitrary models into memory and that forces a specific number of slots, which defeats the purpose of `Ollama`.

### Good Parts of Ollama

I greatly support `Ollama` because it makes it easy to start your journey with large language models. You can use `Ollama` in production deployments, but I think `llama.cpp` is a better choice because it is so predictable.

I think `Ollama` is better suited than `llama.cpp` for end-user distributable applications. By that, I mean the applications that do not use an external server but are installed and run in their entirety on the user's device. The same thing that makes it less predictable when it comes to resource usage makes it more resilient to end-user errors. In that context, resource usage predictability is less important than on the server side.
1 change: 1 addition & 0 deletions src/general-concepts/continuous-batching/README.md
@@ -0,0 +1 @@
# Continuous Batching
1 change: 1 addition & 0 deletions src/general-concepts/input-output/README.md
@@ -0,0 +1 @@
# Input/Output
1 change: 1 addition & 0 deletions src/general-concepts/large-language-model/README.md
@@ -0,0 +1 @@
# Large Language Model
2 changes: 1 addition & 1 deletion src/general-concepts/load-balancing/README.md
@@ -8,6 +8,6 @@ The interesting thing is that having some experience with 3D game development mi

## Differences Between Balancing GPU and CPU Load

In the context of LLMOps, the primary factors we have to deal with this time are [Input/Output](/general-concepts/input-output/README.md) bottlenecks instead of the usual CPU bottlenecks. That forces us to adjust how we design our infrastructure and applications.
In the context of LLMOps, the primary factors we have to deal with this time are [Input/Output](/general-concepts/input-output) bottlenecks instead of the usual CPU bottlenecks. That forces us to adjust how we design our infrastructure and applications.

We will also often use a different set of metrics than in traditional load balancing, usually ones closer to the application level (like the number of context slots in use, the number of buffered application requests, and so on).
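
As a minimal sketch of that idea (the `/health` endpoint and the `slots_idle` field below are assumptions for illustration; the exact endpoint and response shape depend on your server version and flags), a reverse proxy could route each request to the instance with the most free context slots instead of using plain round-robin:

```python
import httpx

# Hypothetical upstream llama.cpp instances; adjust to your deployment.
UPSTREAMS = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
]


def idle_slots(base_url: str) -> int:
    # Assumes a health endpoint that reports how many slots are free; the
    # endpoint name and response field here are illustrative, not canonical.
    response = httpx.get(f"{base_url}/health", timeout=2.0)
    response.raise_for_status()
    return int(response.json().get("slots_idle", 0))


def pick_upstream() -> str:
    # Application-level balancing: route to the instance with the most
    # free context slots instead of rotating round-robin.
    return max(UPSTREAMS, key=idle_slots)


if __name__ == "__main__":
    print(f"routing the next request to {pick_upstream()}")
```
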
Empty file added theme/header.hbs
Empty file.
