
Commit

Update docs
shreyashankar committed Sep 18, 2024
1 parent 17d9d7c commit 9ad7ab3
Showing 8 changed files with 475 additions and 544 deletions.
2 changes: 1 addition & 1 deletion docetl/optimizers/join_optimizer.py
@@ -328,7 +328,7 @@ def synthesize_resolution_prompt(
{{% endfor %}}
Create a single, consolidated key for {reduce_key} that combines the information from all duplicate entries.
Merge these into a single key.
When merging, follow these guidelines:
1. [Provide specific merging instructions relevant to the data type]
2. [Do not make the prompt too long]
14 changes: 7 additions & 7 deletions docs/concepts/optimization.md
@@ -1,13 +1,13 @@
# Optimization

In the world of data processing and analysis, finding the optimal pipeline for your task can be challenging. You might wonder:
Sometimes, finding the optimal pipeline for your task can be challenging. You might wonder:

!!! question "Questions"

- Will a single LLM call suffice for your task?
- Do you need to decompose your task or data further for better results?

To address these questions and improve your pipeline's performance, DocETL provides a powerful optimization feature.
To address these questions and improve your pipeline's performance, you can use DocETL to build an optimized version of your pipeline.

## The DocETL Optimizer

@@ -20,7 +20,7 @@ The DocETL optimizer is designed to decompose operators (and sequences of operat
1. Extract actionable suggestions for course improvement
2. Identify potential interdisciplinary connections

This could be optimized into two separate map operations:
This could be optimized into two _separate_ map operations:

- Suggestion Extraction:
Focus solely on identifying concrete, actionable suggestions for improving the course.
@@ -62,11 +62,11 @@ You can invoke the optimizer using the following command:
docetl build your_pipeline.yaml
```

This command will save the optimized pipeline to `your_pipeline_opt.yaml`.
This command will save the optimized pipeline to `your_pipeline_opt.yaml`. Note that the optimizer will only rewrite operators where you've set `optimize: true`. Leaving this field unset will skip optimization for that operator.
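For instance, a minimal sketch of what flagging an operation for optimization might look like (operation names are hypothetical and other fields are elided):

```yaml
operations:
  - name: extract_themes      # hypothetical map operation
    type: map
    optimize: true            # this operator will be rewritten by the optimizer
    # ... prompt, output schema, etc. ...

  - name: summarize_themes    # hypothetical reduce operation
    type: reduce
    # no `optimize` field set, so the optimizer leaves this operator unchanged
    # ... other configuration ...
```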

### Automatic Entity Resolution
<!-- ### Automatic Entity Resolution
If you have a map-reduce pipeline where you're reducing on keys generated by the map call, you should consider using the optimizer. The optimizer can automatically synthesize a resolve operation for you, improving the efficiency and accuracy of your pipeline.
If you have a map-reduce pipeline where you're reducing on keys generated by the map call, you should consider using the optimizer. The optimizer can automatically synthesize a [resolve](../operators/resolve.md) operation for you, improving the efficiency and accuracy of your pipeline.
## Example: Optimizing a Theme Extraction Pipeline
@@ -135,4 +135,4 @@ Let's consider an example pipeline that extracts themes from student survey resp
summary: string
```
This optimized version ensures that similar themes are merged before the summarization step, potentially leading to more coherent and accurate summaries.
This optimized version ensures that similar themes are merged before the summarization step, potentially leading to more coherent and accurate summaries. -->
5 changes: 2 additions & 3 deletions docs/index.md
@@ -1,14 +1,13 @@
# DocETL: A System for Complex Document Processing

DocETL is a powerful tool for creating and executing data processing pipelines, especially suited for complex document processing tasks. It offers a low-code, declarative YAML interface to define complex data operations on complex data.
DocETL is a powerful tool for creating and executing LLM-powered data processing pipelines. It offers a low-code, declarative YAML interface to define complex data operations on complex data.

!!! tip "When to Use DocETL"

DocETL is the ideal choice when you're looking to **maximize correctness and output quality** for complex tasks over a collection of documents or unstructured datasets. You should consider using DocETL if:

- You want to perform semantic processing on a collection of data
- You have complex tasks that you want to represent via map-reduce (e.g., map over your documents, then group by the result of your map call & reduce)
- You're unsure how to best express your task to maximize LLM accuracy
- You're unsure how to best write your pipeline or sequence of operations to maximize LLM accuracy
- You're working with long documents that don't fit into a single prompt or are too lengthy for effective LLM reasoning
- You have validation criteria and want tasks to automatically retry when the validation fails

74 changes: 74 additions & 0 deletions docs/optimization/configuration.md
@@ -0,0 +1,74 @@
# Advanced: Customizing Optimization

You can customize the optimization process for specific operations using the `optimizer_config` in your pipeline.

## Global Configuration

The following options can be applied globally to all operations in your pipeline during optimization:

- `num_retries`: The number of times to retry optimizing if the LLM agent fails. Default is 1.

- `sample_sizes`: Override the default sample sizes for each operator type. Specify as a dictionary with operator types as keys and integer sample sizes as values.

Default sample sizes:

```python
SAMPLE_SIZE_MAP = {
"reduce": 40,
"map": 5,
"resolve": 100,
"equijoin": 100,
"filter": 5,
}
```

## Equijoin Configuration

- `target_recall`: Change the default target recall (default is 0.95).

## Resolve Configuration

- `target_recall`: Specify the target recall for the resolve operation.
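
A minimal sketch of how these recall targets might be set, assuming they nest under the operator-type key in `optimizer_config` (as the reduce and map options do in the example below):

```yaml
optimizer_config:
  equijoin:
    target_recall: 0.9   # assumed nesting under the operator type
  resolve:
    target_recall: 0.9
```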

## Reduce Configuration

- `synthesize_resolve`: Set to `false` if you don't want the optimizer to synthesize a resolve operation (i.e., to turn off this rewrite rule).

## Map Configuration

- `force_chunking_plan`: Set to `true` if you want the optimizer to force a plan that breaks up the input documents into chunks.

## Example Configuration

Here's an example of how to use the `optimizer_config` in your pipeline:

```yaml
optimizer_config:
  num_retries: 2
  sample_sizes:
    map: 10
    reduce: 50
  reduce:
    synthesize_resolve: false
  map:
    force_chunking_plan: true

operations:
  - name: extract_medications
    type: map
    optimize: true
    # ... other configuration ...

  - name: summarize_prescriptions
    type: reduce
    optimize: true
    # ... other configuration ...
# ... rest of the pipeline configuration ...
```

This configuration will:

1. Retry optimization up to 2 times for each operation if the LLM agent fails.
2. Use custom sample sizes for map (10) and reduce (50) operations.
3. Prevent the synthesis of resolve operations for reduce operations.
4. Force a chunking plan for map operations.
@@ -1,67 +1,9 @@
# Optimizing Pipelines

After creating your initial map-reduce pipeline, you might want to optimize it for better performance or to automatically add resolve operations. The DocETL pipeline optimizer is designed to help you achieve this.

## Understanding the Optimizer

The optimizer in DocETL finds optimal plans for operations marked with `optimize: True`. It can also insert resolve operations before reduce operations if needed. The optimizer uses GPT-4 under the hood (requiring an OpenAI API key) and can be customized with different models like gpt-4-turbo or gpt-4o-mini. Note that only LLM-powered operations can be optimized (e.g., `map`, `reduce`, `resolve`, `filter`, `equijoin`), but the optimized plans may involve new non-LLM operations (e.g., `split`).

At its core, the optimizer employs two types of AI agents: generation agents and validation agents. Generation agents work to rewrite operators into better plans, potentially decomposing a single operation into multiple, more efficient steps. Validation agents then evaluate these candidate plans, synthesizing task-specific validation prompts to compare outputs and determine the best plan for each operator.

<div class="mermaid-wrapper" style="display: flex; justify-content: center;">
<div class="mermaid" style="width: 100%; height: auto;">
```mermaid
graph LR
A[User-Defined Operation] --> B[Validation Agent]
style B fill:#f9f,stroke:#333,stroke-width:2px
B -->|Synthesize| C[Validator Prompt]
C --> D[Evaluate on Sample Data]
D --> E{Needs Optimization?}
E -->|Yes| F[Generation Agent]
E -->|No| J[Optimized Operation]
style F fill:#bbf,stroke:#333,stroke-width:2px
F -->|Create| G[Candidate Plans]
G --> H[Validation Agent]
style H fill:#f9f,stroke:#333,stroke-width:2px
H -->|Rank/Compare| I[Select Best Plan]
I --> J
```
</div>
</div>
# Running the Optimizer

!!! note "Optimizer Stability"

The optimization process can be unstable and resource-intensive (we've seen it take up to 10 minutes to optimize a single operation, with up to ~$50 in API costs for end-to-end pipelines). We recommend optimizing one operation at a time and retrying if necessary, as results may vary between runs. This approach also lets you verify that each optimized operation performs as expected before moving on to the next. See the [API](#optimizer-api) for details on resuming the optimizer from a failed run by rerunning `docetl build pipeline.yaml --resume`.

## Should I Use the Optimizer?

While any pipeline can potentially benefit from optimization, there are specific scenarios where using the optimizer can significantly improve your pipeline's performance and accuracy. When should you use the optimizer?

!!! info "Large Documents"

If you have documents that approach or exceed context limits and a map operation that transforms these documents using an LLM, the optimizer can help:

- Improve accuracy
- Enable processing of entire documents
- Optimize for large-scale data handling

!!! info "Entity Resolution"
The optimizer is particularly useful when:

- You need a resolve operation before your reduce operation
- You've defined a resolve operation but want to optimize it for speed using blocking

!!! info "High-Volume Reduce Operations"
Consider using the optimizer when:

- You have many documents feeding into a reduce operation for a given key
- You're concerned about the accuracy of the reduce operation due to high volume
- You want to optimize for better accuracy in complex reductions

Even if your pipeline doesn't fall into these specific categories, optimization can still be beneficial. For example, the optimizer can enhance an operation by adding gleaning, which uses an LLM-powered validator to ensure the operation's correctness. [Learn more about gleaning](../concepts/operators.md).

## Optimization Process

To optimize your pipeline, start with your initial configuration and follow these steps:

1. Set `optimize: True` for the operation you want to optimize (start with the first operation, if you're not sure which one).
@@ -233,81 +175,6 @@ This optimized pipeline now includes improved prompts, a resolve operation, and

We're continually improving the optimizer. Your feedback on its performance and usability is invaluable. Please share your experiences and suggestions!

## Advanced: Customizing Optimization

You can customize the optimization process for specific operations using the `optimizer_config` in your pipeline.

### Global Configuration

The following options can be applied globally to all operations in your pipeline during optimization:

- `num_retries`: The number of times to retry optimizing if the LLM agent fails. Default is 1.

- `sample_sizes`: Override the default sample sizes for each operator type. Specify as a dictionary with operator types as keys and integer sample sizes as values.

Default sample sizes:

```python
SAMPLE_SIZE_MAP = {
"reduce": 40,
"map": 5,
"resolve": 100,
"equijoin": 100,
"filter": 5,
}
```

### Equijoin Configuration

- `target_recall`: Change the default target recall (default is 0.95).

### Resolve Configuration

- `target_recall`: Specify the target recall for the resolve operation.

### Reduce Configuration

- `synthesize_resolve`: Set to `false` if you don't want the optimizer to synthesize a resolve operation (i.e., to turn off this rewrite rule).

### Map Configuration

- `force_chunking_plan`: Set to `true` if you want the optimizer to force a plan that breaks up the input documents into chunks.

### Example Configuration

Here's an example of how to use the `optimizer_config` in your pipeline:

```yaml
optimizer_config:
  num_retries: 2
  sample_sizes:
    map: 10
    reduce: 50
  reduce:
    synthesize_resolve: false
  map:
    force_chunking_plan: true
operations:
  - name: extract_medications
    type: map
    optimize: true
    # ... other configuration ...
  - name: summarize_prescriptions
    type: reduce
    optimize: true
    # ... other configuration ...
# ... rest of the pipeline configuration ...
```

This configuration will:

1. Retry optimization up to 2 times for each operation if the LLM agent fails.
2. Use custom sample sizes for map (10) and reduce (50) operations.
3. Prevent the synthesis of resolve operations for reduce operations.
4. Force a chunking plan for map operations.

## Optimizer API

::: docetl.cli.build
83 changes: 83 additions & 0 deletions docs/optimization/overview.md
@@ -0,0 +1,83 @@
# DocETL Optimizer

The DocETL optimizer is a powerful tool designed to enhance the performance and accuracy of your document processing pipelines. It works by analyzing and potentially rewriting operations marked for optimization, finding optimal plans for execution.

## Key Features

- Automatically decomposes complex operations into more efficient sub-pipelines
- Inserts resolve operations before reduce operations when beneficial
- Optimizes for large documents that exceed context limits
- Improves accuracy in high-volume reduce operations with incremental reduce

## How It Works

The optimizer employs AI agents to generate and validate potential optimizations:

1. **Generation Agents**: Create alternative plans for operations, potentially breaking them down into multiple steps.
2. **Validation Agents**: Evaluate and compare the outputs of different plans to determine the most effective approach.

<div class="mermaid-wrapper" style="display: flex; justify-content: center;">
<div class="mermaid" style="width: 100%; height: auto;">
```mermaid
graph TB
A[User-Defined Operation] --> B[Validation Agent]
style B fill:#f9f,stroke:#333,stroke-width:2px
B -->|Synthesize| C[Validator Prompt]
C --> D[Evaluate on Sample Data]
D --> E{Needs Optimization?}
E -->|Yes| F[Generation Agent]
E -->|No| J[Optimized Operation]
style F fill:#bbf,stroke:#333,stroke-width:2px
F -->|Create| G[Candidate Plans]
G --> H[Validation Agent]
style H fill:#f9f,stroke:#333,stroke-width:2px
H -->|Rank/Compare| I[Select Best Plan]
I --> J
```
</div>
</div>

## Should I Use the Optimizer?

While any pipeline can potentially benefit from optimization, there are specific scenarios where using the optimizer can significantly improve your pipeline's performance and accuracy. When should you use the optimizer?

!!! info "Large Documents"

If you have documents that approach or exceed context limits and a map operation that transforms these documents using an LLM, the optimizer can help:

- Improve accuracy
- Enable processing of entire documents
- Optimize for large-scale data handling

!!! info "Entity Resolution"
The optimizer is particularly useful when:

- You need a resolve operation before your reduce operation
- You've defined a resolve operation but want to optimize it for speed using blocking

!!! info "High-Volume Reduce Operations"
Consider using the optimizer when:

- You have many documents feeding into a reduce operation for a given key
- You're concerned about the accuracy of the reduce operation due to high volume
- You want to optimize for better accuracy in complex reductions

Even if your pipeline doesn't fall into these specific categories, optimization can still be beneficial. For example, the optimizer can enhance an operation by adding gleaning, which uses an LLM-powered validator to ensure the operation's correctness. [Learn more about gleaning](../concepts/operators.md).

## Example: Optimizing Legal Contract Analysis

Let's consider a pipeline for analyzing legal contracts, extracting clauses, and summarizing them by type. Initially, you might have a single map operation to extract and tag clauses, followed by a reduce operation to summarize them. However, this approach might not be accurate enough for long contracts.

### Initial Pipeline

In the initial pipeline, you might have a single map operation that attempts to extract all clauses and tag them with their types in one go, followed by a reduce operation that summarizes the clauses by type. The reduce operation may accurately summarize the clauses in a single LLM call per clause type, but the map operation might not be able to accurately extract and tag every clause from a long contract in a single LLM call.
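
A rough sketch of what this initial pipeline might look like (the operation names, the `reduce_key`, and the elided prompts are illustrative, not taken from a real pipeline):

```yaml
operations:
  - name: extract_and_tag_clauses     # single map over the full contract
    type: map
    optimize: true
    # ... prompt asking the LLM to extract every clause and tag its type ...

  - name: summarize_clauses_by_type
    type: reduce
    reduce_key: clause_type           # hypothetical key produced by the map step
    # ... prompt asking the LLM to summarize all clauses of each type ...
```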

### Optimized Pipeline

After applying the optimizer, your pipeline could be transformed into a more efficient and accurate sub-pipeline:

1. **Split Operation**: Breaks down _each_ long contract into manageable chunks.
2. **Map Operation**: Processes each chunk to extract and tag clauses.
3. **Reduce Operation**: For each contract, combine the extracted and tagged clauses from each chunk.
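
A hypothetical sketch of this optimized sub-pipeline's structure (the split operation's chunking fields are chosen by the optimizer and elided here; names are illustrative):

```yaml
operations:
  - name: split_contract              # non-LLM split inserted by the optimizer
    type: split
    # ... chunking configuration chosen by the optimizer ...

  - name: extract_clauses_from_chunk  # map now runs over manageable chunks
    type: map
    # ... prompt to extract and tag clauses within a single chunk ...

  - name: combine_clauses_per_contract
    type: reduce
    reduce_key: document_id           # hypothetical key identifying the source contract
    # ... prompt to merge the clauses extracted from each chunk ...
```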

The goal of the DocETL optimizer is to try many ways of rewriting your pipeline and then select the best one. This may take some time (20-30 minutes for very complex tasks and large documents). But the optimizer's ability to break down complex tasks into more manageable sub-steps can lead to more accurate and reliable results.
