
Commit

Update docs
shreyashankar committed Sep 18, 2024
1 parent 17d9d7c commit 9ad7ab3
Showing 8 changed files with 475 additions and 544 deletions.
2 changes: 1 addition & 1 deletion docetl/optimizers/join_optimizer.py
@@ -328,7 +328,7 @@ def synthesize_resolution_prompt(
{{% endfor %}}
Create a single, consolidated key for {reduce_key} that combines the information from all duplicate entries.
Merge these into a single key.
When merging, follow these guidelines:
1. [Provide specific merging instructions relevant to the data type]
2. [Do not make the prompt too long]
14 changes: 7 additions & 7 deletions docs/concepts/optimization.md
@@ -1,13 +1,13 @@
# Optimization

In the world of data processing and analysis, finding the optimal pipeline for your task can be challenging. You might wonder:
Sometimes, finding the optimal pipeline for your task can be challenging. You might wonder:

!!! question "Questions"

- Will a single LLM call suffice for your task?
- Do you need to decompose your task or data further for better results?

To address these questions and improve your pipeline's performance, DocETL provides a powerful optimization feature.
To address these questions and improve your pipeline's performance, you can use DocETL to build an optimized version of your pipeline.

## The DocETL Optimizer

@@ -20,7 +20,7 @@ The DocETL optimizer is designed to decompose operators (and sequences of operat
1. Extract actionable suggestions for course improvement
2. Identify potential interdisciplinary connections

This could be optimized into two separate map operations:
This could be optimized into two _separate_ map operations:

- Suggestion Extraction:
Focus solely on identifying concrete, actionable suggestions for improving the course.
@@ -62,11 +62,11 @@ You can invoke the optimizer using the following command:
docetl build your_pipeline.yaml
```

This command will save the optimized pipeline to `your_pipeline_opt.yaml`.
This command will save the optimized pipeline to `your_pipeline_opt.yaml`. Note that the optimizer will only rewrite operators where you've set `optimize: true`. Leaving this field unset will skip optimization for that operator.
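For instance, a minimal sketch of what flagging an operation for optimization might look like (operation names are hypothetical and other fields are elided):

```yaml
operations:
  - name: extract_themes      # hypothetical map operation
    type: map
    optimize: true            # this operator will be rewritten by the optimizer
    # ... prompt, output schema, etc. ...

  - name: summarize_themes    # hypothetical reduce operation
    type: reduce
    # no `optimize` field set, so the optimizer leaves this operator unchanged
    # ... other configuration ...
```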

### Automatic Entity Resolution
<!-- ### Automatic Entity Resolution
If you have a map-reduce pipeline where you're reducing on keys generated by the map call, you should consider using the optimizer. The optimizer can automatically synthesize a resolve operation for you, improving the efficiency and accuracy of your pipeline.
If you have a map-reduce pipeline where you're reducing on keys generated by the map call, you should consider using the optimizer. The optimizer can automatically synthesize a [resolve](../operators/resolve.md) operation for you, improving the efficiency and accuracy of your pipeline.
## Example: Optimizing a Theme Extraction Pipeline
@@ -135,4 +135,4 @@ Let's consider an example pipeline that extracts themes from student survey resp
summary: string
```
This optimized version ensures that similar themes are merged before the summarization step, potentially leading to more coherent and accurate summaries.
This optimized version ensures that similar themes are merged before the summarization step, potentially leading to more coherent and accurate summaries. -->
5 changes: 2 additions & 3 deletions docs/index.md
@@ -1,14 +1,13 @@
# DocETL: A System for Complex Document Processing

DocETL is a powerful tool for creating and executing data processing pipelines, especially suited for complex document processing tasks. It offers a low-code, declarative YAML interface to define complex data operations on complex data.
DocETL is a powerful tool for creating and executing LLM-powered data processing pipelines. It offers a low-code, declarative YAML interface to define complex data operations on complex data.

!!! tip "When to Use DocETL"

DocETL is the ideal choice when you're looking to **maximize correctness and output quality** for complex tasks over a collection of documents or unstructured datasets. You should consider using DocETL if:

- You want to perform semantic processing on a collection of data
- You have complex tasks that you want to represent via map-reduce (e.g., map over your documents, then group by the result of your map call & reduce)
- You're unsure how to best express your task to maximize LLM accuracy
- You're unsure how to best write your pipeline or sequence of operations to maximize LLM accuracy
- You're working with long documents that don't fit into a single prompt or are too lengthy for effective LLM reasoning
- You have validation criteria and want tasks to automatically retry when the validation fails

74 changes: 74 additions & 0 deletions docs/optimization/configuration.md
@@ -0,0 +1,74 @@
# Advanced: Customizing Optimization

You can customize the optimization process for specific operations using the `optimizer_config` in your pipeline.

## Global Configuration

The following options can be applied globally to all operations in your pipeline during optimization:

- `num_retries`: The number of times to retry optimizing if the LLM agent fails. Default is 1.

- `sample_sizes`: Override the default sample sizes for each operator type. Specify as a dictionary with operator types as keys and integer sample sizes as values.

Default sample sizes:

```python
SAMPLE_SIZE_MAP = {
"reduce": 40,
"map": 5,
"resolve": 100,
"equijoin": 100,
"filter": 5,
}
```

## Equijoin Configuration

- `target_recall`: Change the default target recall (default is 0.95).

## Resolve Configuration

- `target_recall`: Specify the target recall for the resolve operation.
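
A minimal sketch of how these recall targets might be set, assuming they nest under the operator-type key in `optimizer_config` (as the reduce and map options do in the example below):

```yaml
optimizer_config:
  equijoin:
    target_recall: 0.9   # assumed nesting under the operator type
  resolve:
    target_recall: 0.9
```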

## Reduce Configuration

- `synthesize_resolve`: Set to `false` if you don't want the optimizer to synthesize a resolve operation (i.e., to turn off this rewrite rule).

## Map Configuration

- `force_chunking_plan`: Set to `true` if you want the optimizer to force a plan that breaks up the input documents into chunks.

## Example Configuration

Here's an example of how to use the `optimizer_config` in your pipeline:

```yaml
optimizer_config:
  num_retries: 2
  sample_sizes:
    map: 10
    reduce: 50
  reduce:
    synthesize_resolve: false
  map:
    force_chunking_plan: true

operations:
  - name: extract_medications
    type: map
    optimize: true
    # ... other configuration ...

  - name: summarize_prescriptions
    type: reduce
    optimize: true
    # ... other configuration ...
# ... rest of the pipeline configuration ...
```

This configuration will:

1. Retry optimization up to 2 times for each operation if the LLM agent fails.
2. Use custom sample sizes for map (10) and reduce (50) operations.
3. Prevent the synthesis of resolve operations for reduce operations.
4. Force a chunking plan for map operations.
@@ -1,67 +1,9 @@
# Optimizing Pipelines

After creating your initial map-reduce pipeline, you might want to optimize it for better performance or to automatically add resolve operations. The DocETL pipeline optimizer is designed to help you achieve this.

## Understanding the Optimizer

The optimizer in DocETL finds optimal plans for operations marked with `optimize: True`. It can also insert resolve operations before reduce operations if needed. The optimizer uses GPT-4 under the hood (requiring an OpenAI API key) and can be customized with different models like gpt-4-turbo or gpt-4o-mini. Note that only LLM-powered operations can be optimized (e.g., `map`, `reduce`, `resolve`, `filter`, `equijoin`), but the optimized plans may involve new non-LLM operations (e.g., `split`).

At its core, the optimizer employs two types of AI agents: generation agents and validation agents. Generation agents work to rewrite operators into better plans, potentially decomposing a single operation into multiple, more efficient steps. Validation agents then evaluate these candidate plans, synthesizing task-specific validation prompts to compare outputs and determine the best plan for each operator.

<div class="mermaid-wrapper" style="display: flex; justify-content: center;">
<div class="mermaid" style="width: 100%; height: auto;">
```mermaid
graph LR
A[User-Defined Operation] --> B[Validation Agent]
style B fill:#f9f,stroke:#333,stroke-width:2px
B -->|Synthesize| C[Validator Prompt]
C --> D[Evaluate on Sample Data]
D --> E{Needs Optimization?}
E -->|Yes| F[Generation Agent]
E -->|No| J[Optimized Operation]
style F fill:#bbf,stroke:#333,stroke-width:2px
F -->|Create| G[Candidate Plans]
G --> H[Validation Agent]
style H fill:#f9f,stroke:#333,stroke-width:2px
H -->|Rank/Compare| I[Select Best Plan]
I --> J
```
</div>
</div>
# Running the Optimizer

!!! note "Optimizer Stability"

The optimization process can be unstable and resource-intensive (we've seen it take up to 10 minutes to optimize a single operation, with up to ~$50 in API costs for end-to-end pipelines). We recommend optimizing one operation at a time and retrying if necessary, as results may vary between runs. This approach also lets you verify that each optimized operation performs as expected before moving on to the next. See the [API](#optimizer-api) for details on resuming the optimizer from a failed run by rerunning `docetl build pipeline.yaml --resume`.

## Should I Use the Optimizer?

While any pipeline can potentially benefit from optimization, there are specific scenarios where using the optimizer can significantly improve your pipeline's performance and accuracy. When should you use the optimizer?

!!! info "Large Documents"

If you have documents that approach or exceed context limits and a map operation that transforms these documents using an LLM, the optimizer can help:

- Improve accuracy
- Enable processing of entire documents
- Optimize for large-scale data handling

!!! info "Entity Resolution"
The optimizer is particularly useful when:

- You need a resolve operation before your reduce operation
- You've defined a resolve operation but want to optimize it for speed using blocking

!!! info "High-Volume Reduce Operations"
Consider using the optimizer when:

- You have many documents feeding into a reduce operation for a given key
- You're concerned about the accuracy of the reduce operation due to high volume
- You want to optimize for better accuracy in complex reductions

Even if your pipeline doesn't fall into these specific categories, optimization can still be beneficial. For example, the optimizer can enhance an operation by adding gleaning, which uses an LLM-powered validator to ensure the operation's correctness. [Learn more about gleaning](../concepts/operators.md).

## Optimization Process

To optimize your pipeline, start with your initial configuration and follow these steps:

1. Set `optimize: True` for the operation you want to optimize (start with the first operation, if you're not sure which one).
@@ -233,81 +175,6 @@ This optimized pipeline now includes improved prompts, a resolve operation, and

We're continually improving the optimizer. Your feedback on its performance and usability is invaluable. Please share your experiences and suggestions!

## Advanced: Customizing Optimization

You can customize the optimization process for specific operations using the `optimizer_config` in your pipeline.

### Global Configuration

The following options can be applied globally to all operations in your pipeline during optimization:

- `num_retries`: The number of times to retry optimizing if the LLM agent fails. Default is 1.

- `sample_sizes`: Override the default sample sizes for each operator type. Specify as a dictionary with operator types as keys and integer sample sizes as values.

Default sample sizes:

```python
SAMPLE_SIZE_MAP = {
"reduce": 40,
"map": 5,
"resolve": 100,
"equijoin": 100,
"filter": 5,
}
```

### Equijoin Configuration

- `target_recall`: Change the default target recall (default is 0.95).

### Resolve Configuration

- `target_recall`: Specify the target recall for the resolve operation.

### Reduce Configuration

- `synthesize_resolve`: Set to `false` if you don't want the optimizer to synthesize a resolve operation (i.e., to turn off this rewrite rule).

### Map Configuration

- `force_chunking_plan`: Set to `true` if you want the optimizer to force a plan that breaks up the input documents into chunks.

### Example Configuration

Here's an example of how to use the `optimizer_config` in your pipeline:

```yaml
optimizer_config:
  num_retries: 2
  sample_sizes:
    map: 10
    reduce: 50
  reduce:
    synthesize_resolve: false
  map:
    force_chunking_plan: true
operations:
  - name: extract_medications
    type: map
    optimize: true
    # ... other configuration ...
  - name: summarize_prescriptions
    type: reduce
    optimize: true
    # ... other configuration ...
# ... rest of the pipeline configuration ...
```

This configuration will:

1. Retry optimization up to 2 times for each operation if the LLM agent fails.
2. Use custom sample sizes for map (10) and reduce (50) operations.
3. Prevent the synthesis of resolve operations for reduce operations.
4. Force a chunking plan for map operations.

## Optimizer API

::: docetl.cli.build
83 changes: 83 additions & 0 deletions docs/optimization/overview.md
@@ -0,0 +1,83 @@
# DocETL Optimizer

The DocETL optimizer is a powerful tool designed to enhance the performance and accuracy of your document processing pipelines. It works by analyzing and potentially rewriting operations marked for optimization, finding optimal plans for execution.

## Key Features

- Automatically decomposes complex operations into more efficient sub-pipelines
- Inserts resolve operations before reduce operations when beneficial
- Optimizes for large documents that exceed context limits
- Improves accuracy in high-volume reduce operations with incremental reduce

## How It Works

The optimizer employs AI agents to generate and validate potential optimizations:

1. **Generation Agents**: Create alternative plans for operations, potentially breaking them down into multiple steps.
2. **Validation Agents**: Evaluate and compare the outputs of different plans to determine the most effective approach.

<div class="mermaid-wrapper" style="display: flex; justify-content: center;">
<div class="mermaid" style="width: 100%; height: auto;">
```mermaid
graph TB
A[User-Defined Operation] --> B[Validation Agent]
style B fill:#f9f,stroke:#333,stroke-width:2px
B -->|Synthesize| C[Validator Prompt]
C --> D[Evaluate on Sample Data]
D --> E{Needs Optimization?}
E -->|Yes| F[Generation Agent]
E -->|No| J[Optimized Operation]
style F fill:#bbf,stroke:#333,stroke-width:2px
F -->|Create| G[Candidate Plans]
G --> H[Validation Agent]
style H fill:#f9f,stroke:#333,stroke-width:2px
H -->|Rank/Compare| I[Select Best Plan]
I --> J
```
</div>
</div>

## Should I Use the Optimizer?

While any pipeline can potentially benefit from optimization, there are specific scenarios where using the optimizer can significantly improve your pipeline's performance and accuracy. When should you use the optimizer?

!!! info "Large Documents"

If you have documents that approach or exceed context limits and a map operation that transforms these documents using an LLM, the optimizer can help:

- Improve accuracy
- Enable processing of entire documents
- Optimize for large-scale data handling

!!! info "Entity Resolution"
The optimizer is particularly useful when:

- You need a resolve operation before your reduce operation
- You've defined a resolve operation but want to optimize it for speed using blocking

!!! info "High-Volume Reduce Operations"
Consider using the optimizer when:

- You have many documents feeding into a reduce operation for a given key
- You're concerned about the accuracy of the reduce operation due to high volume
- You want to optimize for better accuracy in complex reductions

Even if your pipeline doesn't fall into these specific categories, optimization can still be beneficial. For example, the optimizer can enhance an operation by adding gleaning, which uses an LLM-powered validator to ensure the operation's correctness. [Learn more about gleaning](../concepts/operators.md).

## Example: Optimizing Legal Contract Analysis

Let's consider a pipeline for analyzing legal contracts, extracting clauses, and summarizing them by type. Initially, you might have a single map operation to extract and tag clauses, followed by a reduce operation to summarize them. However, this approach might not be accurate enough for long contracts.

### Initial Pipeline

In the initial pipeline, you might have a single map operation that attempts to extract all clauses and tag them with their types in one go, followed by a reduce operation that summarizes the clauses by type. The reduce operation may accurately summarize the clauses in a single LLM call per clause type, but the map operation might not be able to accurately extract and tag every clause from a long contract in a single LLM call.
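
A rough sketch of what this initial pipeline might look like (the operation names, the `reduce_key`, and the elided prompts are illustrative, not taken from a real pipeline):

```yaml
operations:
  - name: extract_and_tag_clauses     # single map over the full contract
    type: map
    optimize: true
    # ... prompt asking the LLM to extract every clause and tag its type ...

  - name: summarize_clauses_by_type
    type: reduce
    reduce_key: clause_type           # hypothetical key produced by the map step
    # ... prompt asking the LLM to summarize all clauses of each type ...
```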

### Optimized Pipeline

After applying the optimizer, your pipeline could be transformed into a more efficient and accurate sub-pipeline:

1. **Split Operation**: Breaks down _each_ long contract into manageable chunks.
2. **Map Operation**: Processes each chunk to extract and tag clauses.
3. **Reduce Operation**: For each contract, combine the extracted and tagged clauses from each chunk.
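
A hypothetical sketch of this optimized sub-pipeline's structure (the split operation's chunking fields are chosen by the optimizer and elided here; names are illustrative):

```yaml
operations:
  - name: split_contract              # non-LLM split inserted by the optimizer
    type: split
    # ... chunking configuration chosen by the optimizer ...

  - name: extract_clauses_from_chunk  # map now runs over manageable chunks
    type: map
    # ... prompt to extract and tag clauses within a single chunk ...

  - name: combine_clauses_per_contract
    type: reduce
    reduce_key: document_id           # hypothetical key identifying the source contract
    # ... prompt to merge the clauses extracted from each chunk ...
```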

The goal of the DocETL optimizer is to try many ways of rewriting your pipeline and then select the best one. This may take some time (20-30 minutes for very complex tasks and large documents). But the optimizer's ability to break down complex tasks into more manageable sub-steps can lead to more accurate and reliable results.
