Commit

Fix casing for docetl
shreyashankar committed Sep 15, 2024
1 parent c6f6819 commit 80ffd31
Showing 19 changed files with 69 additions and 73 deletions.
14 changes: 7 additions & 7 deletions README.md
@@ -1,6 +1,6 @@
# docetl
# DocETL

docetl is a powerful tool for creating and executing data processing pipelines using LLMs. It allows you to define complex data operations in a YAML configuration file and execute them efficiently.
DocETL is a powerful tool for creating and executing data processing pipelines using LLMs. It allows you to define complex data operations in a YAML configuration file and execute them efficiently.
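
For orientation, a complete configuration might look roughly like the sketch below. The exact key names (`default_model`, `datasets`, `operations`, `pipeline`) and the dataset, prompt, and schema contents are illustrative assumptions rather than a verbatim example; see the sections below for the authoritative structure.

```yaml
# Minimal illustrative pipeline sketch -- key names and values are assumptions,
# not a verbatim DocETL example.
default_model: gpt-4o-mini        # model used when an operation does not override it

datasets:
  reviews:                        # hypothetical dataset name
    type: file
    path: reviews.json            # DocETL currently reads JSON input files

operations:
  - name: extract_sentiment
    type: map                     # one LLM call per document
    prompt: |
      Summarize the sentiment of this review: {{ input.text }}
    output:
      schema:
        sentiment: string

pipeline:
  steps:
    - name: sentiment_step
      input: reviews
      operations:
        - extract_sentiment
  output:
    type: file
    path: results.json
```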

## Table of Contents

@@ -26,7 +26,7 @@ docetl is a powerful tool for creating and executing data processing pipelines u

## Installation

To install docetl, clone this repository and install the required dependencies:
To install DocETL, clone this repository and install the required dependencies:

```bash
git clone https://github.com/shreyashankar/docetl.git
@@ -70,7 +70,7 @@ The configuration file is a YAML document with the following top-level keys:

## Operation Types

docetl supports various operation types, each designed for specific data transformation tasks. All prompt templates used in these operations are Jinja2 templates, allowing for the use of loops, conditionals, and other Jinja2 features to create dynamic prompts based on input data.
DocETL supports various operation types, each designed for specific data transformation tasks. All prompt templates used in these operations are Jinja2 templates, allowing for the use of loops, conditionals, and other Jinja2 features to create dynamic prompts based on input data.
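
As an illustration of the templating (the operation keys and the `priority` and `comments` input fields here are assumptions, not taken from a real dataset), a prompt can branch on a field and loop over a list:

```yaml
# Illustrative map operation -- the input field names are assumptions.
- name: summarize_feedback
  type: map
  prompt: |
    {% if input.priority == "high" %}
    This is a high-priority ticket. Summarize it in detail.
    {% else %}
    Summarize this ticket briefly.
    {% endif %}
    Comments:
    {% for comment in input.comments %}
    - {{ comment }}
    {% endfor %}
  output:
    schema:
      summary: string
```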

All operations have the following optional parameters:

@@ -618,7 +618,7 @@ Example:

### Schema Definition

Schemas in docetl are defined using a simple key-value structure, where each key represents a field name and the value specifies the data type. The supported data types are:
Schemas in DocETL are defined using a simple key-value structure, where each key represents a field name and the value specifies the data type. The supported data types are:

- `string` (or `str`, `text`, `varchar`): For text data
- `integer` (or `int`): For whole numbers
@@ -668,7 +668,7 @@ It's important to note that all schema items pass through the pipeline. The `out

## Tool Use

docetl supports the use of tools in operations, allowing for more complex and specific data processing tasks. Tools are defined as Python functions that can be called by the language model during execution.
DocETL supports the use of tools in operations, allowing for more complex and specific data processing tasks. Tools are defined as Python functions that can be called by the language model during execution.

To use tools in an operation, you need to define them in the operation's configuration. Here's an example of how to define and use a tool:

@@ -710,7 +710,7 @@ In this example:

The language model can then use this tool to count words in the input title. The tool's output will be incorporated into the operation's result according to the defined output schema.

You can define multiple tools for an operation, allowing the model to choose the most appropriate one for the task at hand. Tools can range from simple utility functions to more complex data processing or external API calls, enhancing the capabilities of your docetl pipeline.
You can define multiple tools for an operation, allowing the model to choose the most appropriate one for the task at hand. Tools can range from simple utility functions to more complex data processing or external API calls, enhancing the capabilities of your DocETL pipeline.

Currently, only map and parallel_map operations support tools.
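
Purely as a sketch of the idea, and not DocETL's confirmed configuration schema (the `tools`, `code`, `function`, and `parameters` keys below are assumptions), a word-count tool might be wired into a map operation like this:

```yaml
# Hypothetical tool definition -- key names are assumptions, not confirmed DocETL schema.
- name: extract_title_stats
  type: map
  prompt: |
    Report the word count of this title: {{ input.title }}
  tools:
    - code: |
        def count_words(text):
            """Count whitespace-separated words in a string."""
            return {"word_count": len(text.split())}
      function:
        name: count_words
        description: Count the number of words in a piece of text.
        parameters:
          type: object
          properties:
            text:
              type: string
          required:
            - text
  output:
    schema:
      word_count: integer
```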

14 changes: 7 additions & 7 deletions docs/concepts/operators.md
@@ -1,16 +1,16 @@
# Operators

Operators in docetl are designed for semantically processing unstructured data. They form the building blocks of data processing pipelines, allowing you to transform, analyze, and manipulate datasets efficiently.
Operators in DocETL are designed for semantically processing unstructured data. They form the building blocks of data processing pipelines, allowing you to transform, analyze, and manipulate datasets efficiently.

## Overview

- Datasets contain documents, where a document is an object in the JSON list, with fields and values.
- docetl provides several operators, each tailored for specific unstructured data processing tasks.
- DocETL provides several operators, each tailored for specific unstructured data processing tasks.
- By default, operations are parallelized on your data using multithreading for improved performance.

!!! tip "Caching in docetl"
!!! tip "Caching in DocETL"

docetl employs caching for all LLM calls and partially-optimized plans. The cache is stored in the `.docetl/cache` and `.docetl/llm_cache` directories within your home directory. This caching mechanism helps to improve performance and reduce redundant API calls when running similar operations or reprocessing data.
DocETL employs caching for all LLM calls and partially-optimized plans. The cache is stored in the `.docetl/cache` and `.docetl/llm_cache` directories within your home directory. This caching mechanism helps to improve performance and reduce redundant API calls when running similar operations or reprocessing data.

## Common Attributes

@@ -69,15 +69,15 @@ prompt: |

!!! question "What happens if the input is too long?"

When the input data exceeds the token limit of the LLM, docetl automatically truncates tokens from the middle of the data to make it fit in the prompt. This approach preserves the beginning and end of the input, which often contain crucial context.
When the input data exceeds the token limit of the LLM, DocETL automatically truncates tokens from the middle of the data to make it fit in the prompt. This approach preserves the beginning and end of the input, which often contain crucial context.

A warning is displayed whenever truncation occurs, alerting you to potential loss of information:

```
WARNING: Input exceeded token limit. Truncated 500 tokens from the middle of the input.
```
If you frequently encounter this warning, consider using docetl's optimizer or breaking down your input yourself into smaller chunks to handle large inputs more effectively.
If you frequently encounter this warning, consider using DocETL's optimizer or breaking your input into smaller chunks yourself to handle large inputs more effectively.
## Output Schema
@@ -123,7 +123,7 @@ Read more about schemas in the [schemas](../concepts/schemas.md) section.

## Validation

Validation is a first-class citizen in docetl, ensuring the quality and correctness of processed data.
Validation is a first-class citizen in DocETL, ensuring the quality and correctness of processed data.
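
Purely as an illustration (the `validate` key and the Python-style expression syntax are assumptions; the examples below show the actual form), a validation rule might assert a property of each output:

```yaml
# Hypothetical validation rule -- the key name and expression syntax are assumptions.
- name: extract_insights
  type: map
  prompt: |
    List the key insights in this document: {{ input.text }}
  output:
    schema:
      insights: string
  validate:
    - len(output["insights"]) > 0   # reject empty outputs
```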

### Basic Validation

10 changes: 5 additions & 5 deletions docs/concepts/optimization.md
@@ -7,11 +7,11 @@ In the world of data processing and analysis, finding the optimal pipeline for y
- Will a single LLM call suffice for your task?
- Do you need to decompose your task or data further for better results?

To address these questions and improve your pipeline's performance, docetl provides a powerful optimization feature.
To address these questions and improve your pipeline's performance, DocETL provides a powerful optimization feature.

## The docetl Optimizer
## The DocETL Optimizer

The docetl optimizer is designed to decompose operators (and sequences of operators) into their own subpipelines, potentially leading to higher accuracy.
The DocETL optimizer is designed to decompose operators (and sequences of operators) into their own subpipelines, potentially leading to higher accuracy.

!!! example

@@ -46,13 +46,13 @@ The docetl optimizer is designed to decompose operators (and sequences of operat

### How It Works

The docetl optimizer operates using the following mechanism:
The DocETL optimizer operates using the following mechanism:

1. **Generation and Evaluation Agents**: These agents generate different plans for the pipeline according to predefined rewrite rules. Evaluation agents then compare plans and outputs to determine the best approach.

2. **Operator Rewriting**: The optimizer looks through the operators in your pipeline where you've set `optimize: true` and attempts to rewrite them using predefined rules (see the sketch after this list).

3. **Output**: After optimization, docetl outputs a new YAML file representing the optimized pipeline.
3. **Output**: After optimization, DocETL outputs a new YAML file representing the optimized pipeline.
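
As referenced above, flagging an operation is all that is needed to make it a candidate for rewriting; the surrounding keys in this sketch are illustrative:

```yaml
# Flag an operation for the optimizer -- the surrounding structure is illustrative.
operations:
  - name: summarize_documents
    type: map
    optimize: true          # only operations marked this way are rewritten
    prompt: |
      Summarize this document: {{ input.text }}
    output:
      schema:
        summary: string
# Then run `docetl build pipeline.yaml` to produce the optimized pipeline YAML.
```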

### Using the Optimizer

6 changes: 3 additions & 3 deletions docs/concepts/pipelines.md
@@ -1,10 +1,10 @@
# Pipelines

Pipelines in docetl are the core structures that define the flow of data processing. They orchestrate the application of operators to datasets, creating a seamless workflow for complex document processing tasks.
Pipelines in DocETL are the core structures that define the flow of data processing. They orchestrate the application of operators to datasets, creating a seamless workflow for complex document processing tasks.

## Components of a Pipeline

A pipeline in docetl consists of four main components:
A pipeline in DocETL consists of four main components:

1. **Default Model**: The language model to use for the pipeline.
2. **Datasets**: The input data sources for your pipeline.
@@ -32,7 +32,7 @@ datasets:
!!! note
Currently, docetl only supports JSON files as input datasets. If you're interested in support for other data types or cloud-based datasets, please reach out to us or join our open-source community and contribute! We welcome new ideas and contributions to expand the capabilities of docetl.
Currently, DocETL only supports JSON files as input datasets. If you're interested in support for other data types or cloud-based datasets, please reach out to us or join our open-source community and contribute! We welcome new ideas and contributions to expand the capabilities of DocETL.
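
As a concrete illustration of a JSON-backed dataset entry (the `type` and `path` keys are assumptions about the exact format), such an entry might look like:

```yaml
datasets:
  transcripts:              # hypothetical dataset name
    type: file              # assumed key: local file input
    path: transcripts.json  # JSON is currently the only supported input format
```
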
### Operators
8 changes: 4 additions & 4 deletions docs/concepts/schemas.md
@@ -1,12 +1,12 @@
# Schemas

In docetl, schemas play an important role in defining the structure of output from LLM operations. Every LLM call in docetl is associated with an output schema, which specifies the expected format and types of the output data.
In DocETL, schemas play an important role in defining the structure of output from LLM operations. Every LLM call in DocETL is associated with an output schema, which specifies the expected format and types of the output data.

## Overview

- Schemas define the structure and types of output data from LLM operations.
- They help ensure consistency and facilitate downstream processing.
- docetl uses structured outputs or tool API to enforce these schemas.
- DocETL uses structured outputs or tool API to enforce these schemas.
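
As a small illustration (the nesting under `output.schema` is an assumption about the exact layout), a schema simply pairs field names with types:

```yaml
# Illustrative output schema -- field names are assumptions.
output:
  schema:
    summary: string        # free-form text
    num_issues: integer    # whole number
```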

!!! tip "Schema Simplicity"

@@ -74,7 +74,7 @@ Objects are defined using curly braces and must have typed fields:

## Structured Outputs and Tool API

docetl uses structured outputs or tool API to enforce schema typing. This ensures that the LLM outputs adhere to the specified schema, making the results more consistent and easier to process in subsequent operations.
DocETL uses structured outputs or tool API to enforce schema typing. This ensures that the LLM outputs adhere to the specified schema, making the results more consistent and easier to process in subsequent operations.

## Best Practices

@@ -108,4 +108,4 @@ docetl uses structured outputs or tool API to enforce schema typing. This ensure

The only reason to use the complex schema is if you need to operate at the point level, such as resolving the points and reducing on them.

By following these guidelines and best practices, you can create effective schemas that enhance the performance and reliability of your docetl operations.
By following these guidelines and best practices, you can create effective schemas that enhance the performance and reliability of your DocETL operations.
30 changes: 13 additions & 17 deletions docs/execution/optimizing-pipelines.md
@@ -1,10 +1,10 @@
# Optimizing Pipelines

After creating your initial map-reduce pipeline, you might want to optimize it for better performance or to automatically add resolve operations. The docetl pipeline optimizer is designed to help you achieve this.
After creating your initial map-reduce pipeline, you might want to optimize it for better performance or to automatically add resolve operations. The DocETL pipeline optimizer is designed to help you achieve this.

## Understanding the Optimizer

The optimizer in docetl finds optimal plans for operations marked with `optimize: True`. It can also insert resolve operations before reduce operations if needed. The optimizer uses GPT-4 under the hood (requiring an OpenAI API key) and can be customized with different models like gpt-4-turbo or gpt-4o-mini. Note that only LLM-powered operations can be optimized (e.g., `map`, `reduce`, `resolve`, `filter`, `equijoin`), but the optimized plans may involve new non-LLM operations (e.g., `split`).
The optimizer in DocETL finds optimal plans for operations marked with `optimize: True`. It can also insert resolve operations before reduce operations if needed. The optimizer uses GPT-4 under the hood (requiring an OpenAI API key) and can be customized with different models like gpt-4-turbo or gpt-4o-mini. Note that only LLM-powered operations can be optimized (e.g., `map`, `reduce`, `resolve`, `filter`, `equijoin`), but the optimized plans may involve new non-LLM operations (e.g., `split`).

At its core, the optimizer employs two types of AI agents: generation agents and validation agents. Generation agents work to rewrite operators into better plans, potentially decomposing a single operation into multiple, more efficient steps. Validation agents then evaluate these candidate plans, synthesizing task-specific validation prompts to compare outputs and determine the best plan for each operator.

@@ -33,7 +33,6 @@ graph LR

The optimization process can be unstable, as well as resource-intensive (we've seen it take up to 10 minutes to optimize a single operation, spending up to ~$50 in API costs for end-to-end pipelines). We recommend optimizing one operation at a time and retrying if necessary, as results may vary between runs. This approach also allows you to confidently verify that each optimized operation is performing as expected before moving on to the next. See the [API](#optimizer-api) for more details on how to resume the optimizer from a failed run, by rerunning `docetl build pipeline.yaml --resume` (with the `--resume` flag).


## Should I Use the Optimizer?

While any pipeline can potentially benefit from optimization, there are specific scenarios where using the optimizer can significantly improve your pipeline's performance and accuracy. When should you use the optimizer?
@@ -47,22 +46,20 @@ While any pipeline can potentially benefit from optimization, there are specific
- Optimize for large-scale data handling

!!! info "Entity Resolution"
The optimizer is particularly useful when:
The optimizer is particularly useful when:

- You need a resolve operation before your reduce operation
- You've defined a resolve operation but want to optimize it for speed using blocking

!!! info "High-Volume Reduce Operations"
Consider using the optimizer when:
Consider using the optimizer when:

- You have many documents feeding into a reduce operation for a given key
- You're concerned about the accuracy of the reduce operation due to high volume
- You want to optimize for better accuracy in complex reductions


Even if your pipeline doesn't fall into these specific categories, optimization can still be beneficial. For example, the optimizer can enhance your operations by adding gleaning to an operation, which uses an LLM-powered validator to ensure operation correctness. [Learn more about gleaning](../concepts/operators.md).


## Optimization Process

To optimize your pipeline, start with your initial configuration and follow these steps:
@@ -238,7 +235,7 @@ This optimized pipeline now includes improved prompts, a resolve operation, and

## Advanced: Customizing Optimization

You can customize the optimization process for specific operations using the `optimizer_config` section in your pipeline.
You can customize the optimization process for specific operations using the `optimizer_config` section in your pipeline.
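
The sections below show the supported options; as a minimal illustration (the `model` option name is an assumption), a global override might look like:

```yaml
optimizer_config:
  model: gpt-4o-mini   # assumed option name: model used by the optimization agents
```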

### Global Configuration

@@ -314,12 +311,11 @@ This configuration will:
## Optimizer API

::: docetl.cli.build
handler: python
options:
members:
- build
show_root_full_path: true
show_root_toc_entry: true
show_root_heading: true
show_source: false
show_name: true
handler: python
options:
members: - build
show_root_full_path: true
show_root_toc_entry: true
show_root_heading: true
show_source: false
show_name: true
6 changes: 3 additions & 3 deletions docs/execution/running-pipelines.md
@@ -109,7 +109,7 @@ This example pipeline configuration demonstrates a complex medical information e

## Running the Pipeline

To run a pipeline in docetl, follow these steps:
To run a pipeline in DocETL, follow these steps:

Ensure your pipeline configuration includes all the required components as described in the [Pipelines](../concepts/pipelines.md) documentation. Your configuration should specify:

@@ -130,7 +130,7 @@ docetl run pipeline.yaml

If you're unsure about the optimal pipeline configuration or dealing with more complex scenarios, you may want to skip directly to the optimizer section (covered in a later part of this documentation).

As the pipeline runs, docetl will display progress information and eventually show the output. Here's an example of what you might see:
As the pipeline runs, DocETL will display progress information and eventually show the output. Here's an example of what you might see:

```
[Placeholder for pipeline execution output]
@@ -151,7 +151,7 @@ Here are some additional notes to help you get the most out of your pipeline:
# ... rest of the operation configuration
```

- **Caching**: Docetl caches the results of operations by default. This means that if you run the same operation on the same data multiple times, the results will be retrieved from the cache rather than being recomputed. You can clear the cache by running `docetl clear-cache`.
- **Caching**: DocETL caches the results of operations by default. This means that if you run the same operation on the same data multiple times, the results will be retrieved from the cache rather than being recomputed. You can clear the cache by running `docetl clear-cache`.

- **The `run` Function**: The main entry point for running a pipeline is the `run` function in `docetl/cli.py`. Here's a description of its parameters and functionality:

12 changes: 6 additions & 6 deletions docs/index.md
@@ -1,6 +1,6 @@
# docetl: A System for Complex Document Processing
# DocETL: A System for Complex Document Processing

docetl is a powerful tool for creating and executing data processing pipelines, especially suited for complex document processing tasks. It offers a low-code, declarative YAML interface to define complex data operations on complex data.
DocETL is a powerful tool for creating and executing data processing pipelines, especially suited for complex document processing tasks. It offers a low-code, declarative YAML interface to define complex data operations on complex data.

## Features

@@ -11,15 +11,15 @@ docetl is a powerful tool for creating and executing data processing pipelines,

## Getting Started

To get started with docetl:
To get started with DocETL:

1. Install the package (see [installation](installation.md) for detailed instructions)
2. Define your pipeline in a YAML file
3. Run your pipeline using the docetl command-line interface
3. Run your pipeline using the DocETL command-line interface

## Why Should I Use docetl?
## Why Should I Use DocETL?

docetl is the ideal choice when you're looking to **maximize correctness and output quality** for complex tasks over a collection of documents or unstructured datasets. You should consider using docetl if:
DocETL is the ideal choice when you're looking to **maximize correctness and output quality** for complex tasks over a collection of documents or unstructured datasets. You should consider using DocETL if:

- You want to perform semantic processing on a collection of data
- You have complex tasks that you want to represent via map-reduce (e.g., map over your documents, then group by the result of your map call & reduce)