Commit

Update docs
shreyashankar committed Sep 15, 2024
1 parent 78aee6f commit 0c8505c
Showing 2 changed files with 30 additions and 2 deletions.
27 changes: 27 additions & 0 deletions docs/execution/best-practices.md
@@ -0,0 +1,27 @@
# Best Practices for DocETL

This guide outlines key best practices for using DocETL effectively, focusing on the most important aspects of pipeline creation, execution, and optimization.

## Pipeline Design

1. **Start Simple**: Begin with a basic pipeline and gradually add complexity as needed.
2. **Modular Design**: Break down complex tasks into smaller, manageable operations.
3. **Optimize Incrementally**: Optimize one operation at a time to ensure stability and verify improvements.

## Schema and Prompt Design

1. **Keep Schemas Simple**: Use simple output schemas whenever possible. Complex nested structures can be difficult for LLMs to produce consistently.
2. **Clear and Concise Prompts**: Write clear, concise prompts for LLM operations, providing relevant context from the input data. Specify quantities (e.g., 2-3 insights, one summary) to guide the LLM.
3. **Take Advantage of Jinja Templating**: Use Jinja templating to dynamically generate prompts and provide context to the LLM. Use `if` statements, loops, and other Jinja features to customize prompts.
4. **Validate Outputs**: Use the `validate` field to ensure the quality and correctness of processed data. The field holds Python statements that check the output; if one or more statements fail, the LLM call can optionally be retried.
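
The validate-and-retry idea in item 4 can be sketched in plain Python. This is an illustration only, not DocETL's implementation: `failed_checks`, `run_with_validation`, and the fake LLM are hypothetical names, and the sketch assumes each statement is an expression over a dict named `output`.

```python
from typing import Any, Callable


def failed_checks(output: dict[str, Any], statements: list[str]) -> list[str]:
    """Return the validation statements that do not hold for this output."""
    failures = []
    for stmt in statements:
        try:
            # eval is acceptable here only because the statements come from
            # a trusted pipeline config, never from user input.
            ok = bool(eval(stmt, {"output": output}))
        except Exception:
            ok = False  # a statement that raises counts as a failure
        if not ok:
            failures.append(stmt)
    return failures


def run_with_validation(
    call_llm: Callable[[], dict[str, Any]],
    statements: list[str],
    max_retries: int = 2,
) -> tuple[dict[str, Any], list[str]]:
    """Call the LLM, retrying up to max_retries times while any check fails."""
    output = call_llm()
    for _ in range(max_retries):
        if not failed_checks(output, statements):
            break
        output = call_llm()
    return output, failed_checks(output, statements)


# Hypothetical stand-in for an LLM: the first answer has too few insights.
attempts = iter([
    {"insights": ["only one"], "summary": "ok"},
    {"insights": ["a", "b"], "summary": "ok"},
])
checks = ['2 <= len(output["insights"]) <= 3', 'len(output["summary"]) > 0']
result, remaining = run_with_validation(lambda: next(attempts), checks)
```

The checks mirror the quantities asked for in the prompt (2-3 insights, one summary), so a prompt instruction and its validation statement stay in step.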

## Handling Large Documents and Entity Resolution

1. **Chunk Large Inputs**: For documents exceeding token limits, consider using the optimizer to automatically chunk inputs.
2. **Use Resolve Operations**: Implement resolve operations before reduce operations when dealing with similar entities. Write the comparison prompts carefully to guide the LLM, because optimizer-synthesized prompts are often too generic.
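
To see why resolve should come before reduce, consider the toy sketch below. It is not DocETL code: `same_entity` is a hypothetical, crude stand-in for the LLM comparison prompt, and the records are made up for illustration.

```python
from collections import defaultdict

records = [
    {"company": "Acme Inc.", "amount": 100},
    {"company": "ACME, Inc", "amount": 50},
    {"company": "Globex", "amount": 70},
]


def same_entity(a: str, b: str) -> bool:
    """Stand-in for the comparison prompt: match on normalized spelling."""
    def norm(s: str) -> str:
        return "".join(ch for ch in s.lower() if ch.isalnum())
    return norm(a) == norm(b)


def resolve(rows: list[dict], key: str) -> list[dict]:
    """Rewrite each row's key to the first-seen spelling of its entity."""
    canonical: list[str] = []
    resolved = []
    for row in rows:
        match = next((c for c in canonical if same_entity(c, row[key])), None)
        if match is None:
            match = row[key]
            canonical.append(match)
        resolved.append({**row, key: match})
    return resolved


def reduce_sum(rows: list[dict], key: str, value: str) -> dict[str, int]:
    """Group rows by key and sum the value field."""
    totals: dict[str, int] = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


# Without resolve, the two Acme spellings land in separate groups.
unresolved_totals = reduce_sum(records, "company", "amount")
# With resolve first, they are merged before the reduce runs.
resolved_totals = reduce_sum(resolve(records, "company"), "company", "amount")
```

A generic comparison rule, like the string normalization here, misses harder cases such as abbreviations or renamed entities, which is why a task-specific comparison prompt matters.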

## Optimization and Execution

1. **Use the Optimizer**: Leverage DocETL's optimizer for complex pipelines or when dealing with large documents.
2. **Leverage Caching**: Take advantage of DocETL's caching mechanism to avoid redundant computations.
3. **Monitor Resource Usage**: Keep an eye on API costs and processing time, especially when optimizing.
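
As a rough picture of how caching cuts redundant calls and how call counts can be tracked, here is a generic memoization sketch. It does not describe DocETL's actual cache: `CachedCaller` and `fake_llm` are hypothetical names used only for illustration.

```python
import hashlib
import json
from typing import Any, Callable


class CachedCaller:
    """Memoize calls by hashing the prompt and document; count real calls."""

    def __init__(self, call_fn: Callable[[str, dict[str, Any]], str]) -> None:
        self._call_fn = call_fn
        self._cache: dict[str, str] = {}
        self.api_calls = 0  # how many calls actually reached the backend

    def __call__(self, prompt: str, document: dict[str, Any]) -> str:
        payload = json.dumps([prompt, document], sort_keys=True)
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.api_calls += 1
            self._cache[key] = self._call_fn(prompt, document)
        return self._cache[key]


def fake_llm(prompt: str, document: dict[str, Any]) -> str:
    """Hypothetical stand-in for a paid API call."""
    return f"{prompt}: {document['text'].upper()}"


caller = CachedCaller(fake_llm)
first = caller("Summarize", {"text": "hello"})
second = caller("Summarize", {"text": "hello"})  # served from the cache
third = caller("Summarize", {"text": "world"})   # new input, new call
```

Because the key depends on both the prompt and the input, editing a prompt invalidates the cached results for that operation, which is one reason optimizing one operation at a time keeps reruns cheap.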
5 changes: 3 additions & 2 deletions mkdocs.yml
@@ -18,8 +18,8 @@ nav:
   - Installation: installation.md
   - Tutorial: tutorial.md
   - Core Concepts:
-    - Operators: concepts/operators.md
-    - Schemas: concepts/schemas.md
+    - Operators & Validation: concepts/operators.md
+    - Output Schemas: concepts/schemas.md
     - Pipelines: concepts/pipelines.md
     - Optimization: concepts/optimization.md
   - LLM-Powered Operators:
@@ -36,6 +36,7 @@ nav:
   - Execution:
     - Running Pipelines: execution/running-pipelines.md
     - Optimizing Pipelines: execution/optimizing-pipelines.md
+    - Best Practices: execution/best-practices.md
   # - Advanced Usage:
   # - User-Defined Functions: advanced/custom-operators.md
   # - Extending Optimizer Agents: advanced/extending-agents.md