From 80ffd31c6961ee5e3f532e6b7b75453e9ad436ae Mon Sep 17 00:00:00 2001 From: Shreya Shankar Date: Sun, 15 Sep 2024 14:26:00 -0700 Subject: [PATCH] Fix casing for docetl --- README.md | 14 ++++++------ docs/concepts/operators.md | 14 ++++++------ docs/concepts/optimization.md | 10 ++++----- docs/concepts/pipelines.md | 6 +++--- docs/concepts/schemas.md | 8 +++---- docs/execution/optimizing-pipelines.md | 30 +++++++++++--------------- docs/execution/running-pipelines.md | 6 +++--- docs/index.md | 12 +++++------ docs/operators/equijoin.md | 2 +- docs/operators/filter.md | 2 +- docs/operators/gather.md | 2 +- docs/operators/map.md | 4 ++-- docs/operators/parallel-map.md | 2 +- docs/operators/reduce.md | 2 +- docs/operators/resolve.md | 4 ++-- docs/operators/split.md | 4 ++-- docs/operators/unnest.md | 2 +- docs/tutorial.md | 16 +++++++------- vision.md | 2 +- 19 files changed, 69 insertions(+), 73 deletions(-) diff --git a/README.md b/README.md index 5b506f91..7ecf1511 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ -# docetl +# DocETL -docetl is a powerful tool for creating and executing data processing pipelines using LLMs. It allows you to define complex data operations in a YAML configuration file and execute them efficiently. +DocETL is a powerful tool for creating and executing data processing pipelines using LLMs. It allows you to define complex data operations in a YAML configuration file and execute them efficiently. ## Table of Contents @@ -26,7 +26,7 @@ docetl is a powerful tool for creating and executing data processing pipelines u ## Installation -To install docetl, clone this repository and install the required dependencies: +To install DocETL, clone this repository and install the required dependencies: ```bash git clone https://github.com/shreyashankar/docetl.git @@ -70,7 +70,7 @@ The configuration file is a YAML document with the following top-level keys: ## Operation Types -docetl supports various operation types, each designed for specific data transformation tasks. All prompt templates used in these operations are Jinja2 templates, allowing for the use of loops, conditionals, and other Jinja2 features to create dynamic prompts based on input data. +DocETL supports various operation types, each designed for specific data transformation tasks. All prompt templates used in these operations are Jinja2 templates, allowing for the use of loops, conditionals, and other Jinja2 features to create dynamic prompts based on input data. All operations have the following optional parameters: @@ -618,7 +618,7 @@ Example: ### Schema Definition -Schemas in docetl are defined using a simple key-value structure, where each key represents a field name and the value specifies the data type. The supported data types are: +Schemas in DocETL are defined using a simple key-value structure, where each key represents a field name and the value specifies the data type. The supported data types are: - `string` (or `str`, `text`, `varchar`): For text data - `integer` (or `int`): For whole numbers @@ -668,7 +668,7 @@ It's important to note that all schema items pass through the pipeline. The `out ## Tool Use -docetl supports the use of tools in operations, allowing for more complex and specific data processing tasks. Tools are defined as Python functions that can be called by the language model during execution. +DocETL supports the use of tools in operations, allowing for more complex and specific data processing tasks. 
Tools are defined as Python functions that can be called by the language model during execution. To use tools in an operation, you need to define them in the operation's configuration. Here's an example of how to define and use a tool: @@ -710,7 +710,7 @@ In this example: The language model can then use this tool to count words in the input title. The tool's output will be incorporated into the operation's result according to the defined output schema. -You can define multiple tools for an operation, allowing the model to choose the most appropriate one for the task at hand. Tools can range from simple utility functions to more complex data processing or external API calls, enhancing the capabilities of your docetl pipeline. +You can define multiple tools for an operation, allowing the model to choose the most appropriate one for the task at hand. Tools can range from simple utility functions to more complex data processing or external API calls, enhancing the capabilities of your DocETL pipeline. Currently, only map and parallel_map operations support tools. diff --git a/docs/concepts/operators.md b/docs/concepts/operators.md index 951852b5..ddbfeed5 100644 --- a/docs/concepts/operators.md +++ b/docs/concepts/operators.md @@ -1,16 +1,16 @@ # Operators -Operators in docetl are designed for semantically processing unstructured data. They form the building blocks of data processing pipelines, allowing you to transform, analyze, and manipulate datasets efficiently. +Operators in DocETL are designed for semantically processing unstructured data. They form the building blocks of data processing pipelines, allowing you to transform, analyze, and manipulate datasets efficiently. ## Overview - Datasets contain documents, where a document is an object in the JSON list, with fields and values. -- docetl provides several operators, each tailored for specific unstructured data processing tasks. +- DocETL provides several operators, each tailored for specific unstructured data processing tasks. - By default, operations are parallelized on your data using multithreading for improved performance. -!!! tip "Caching in docetl" +!!! tip "Caching in DocETL" - docetl employs caching for all LLM calls and partially-optimized plans. The cache is stored in the `.docetl/cache` and `.docetl/llm_cache` directories within your home directory. This caching mechanism helps to improve performance and reduce redundant API calls when running similar operations or reprocessing data. + DocETL employs caching for all LLM calls and partially-optimized plans. The cache is stored in the `.docetl/cache` and `.docetl/llm_cache` directories within your home directory. This caching mechanism helps to improve performance and reduce redundant API calls when running similar operations or reprocessing data. ## Common Attributes @@ -69,7 +69,7 @@ prompt: | !!! question "What happens if the input is too long?" - When the input data exceeds the token limit of the LLM, docetl automatically truncates tokens from the middle of the data to make it fit in the prompt. This approach preserves the beginning and end of the input, which often contain crucial context. + When the input data exceeds the token limit of the LLM, DocETL automatically truncates tokens from the middle of the data to make it fit in the prompt. This approach preserves the beginning and end of the input, which often contain crucial context. 
A warning is displayed whenever truncation occurs, alerting you to potential loss of information: @@ -77,7 +77,7 @@ prompt: | WARNING: Input exceeded token limit. Truncated 500 tokens from the middle of the input. ``` - If you frequently encounter this warning, consider using docetl's optimizer or breaking down your input yourself into smaller chunks to handle large inputs more effectively. + If you frequently encounter this warning, consider using DocETL's optimizer or breaking down your input yourself into smaller chunks to handle large inputs more effectively. ## Output Schema @@ -123,7 +123,7 @@ Read more about schemas in the [schemas](../concepts/schemas.md) section. ## Validation -Validation is a first-class citizen in docetl, ensuring the quality and correctness of processed data. +Validation is a first-class citizen in DocETL, ensuring the quality and correctness of processed data. ### Basic Validation diff --git a/docs/concepts/optimization.md b/docs/concepts/optimization.md index f5cfbb67..7d743b56 100644 --- a/docs/concepts/optimization.md +++ b/docs/concepts/optimization.md @@ -7,11 +7,11 @@ In the world of data processing and analysis, finding the optimal pipeline for y - Will a single LLM call suffice for your task? - Do you need to decompose your task or data further for better results? -To address these questions and improve your pipeline's performance, docetl provides a powerful optimization feature. +To address these questions and improve your pipeline's performance, DocETL provides a powerful optimization feature. -## The docetl Optimizer +## The DocETL Optimizer -The docetl optimizer is designed to decompose operators (and sequences of operators) into their own subpipelines, potentially leading to higher accuracy. +The DocETL optimizer is designed to decompose operators (and sequences of operators) into their own subpipelines, potentially leading to higher accuracy. !!! example @@ -46,13 +46,13 @@ The docetl optimizer is designed to decompose operators (and sequences of operat ### How It Works -The docetl optimizer operates using the following mechanism: +The DocETL optimizer operates using the following mechanism: 1. **Generation and Evaluation Agents**: These agents generate different plans for the pipeline according to predefined rewrite rules. Evaluation agents then compare plans and outputs to determine the best approach. 2. **Operator Rewriting**: The optimizer looks through operators in your pipeline where you've set optimize: true, and attempts to rewrite them using predefined rules. -3. **Output**: After optimization, docetl outputs a new YAML file representing the optimized pipeline. +3. **Output**: After optimization, DocETL outputs a new YAML file representing the optimized pipeline. ### Using the Optimizer diff --git a/docs/concepts/pipelines.md b/docs/concepts/pipelines.md index 9bf2130c..fa41588b 100644 --- a/docs/concepts/pipelines.md +++ b/docs/concepts/pipelines.md @@ -1,10 +1,10 @@ # Pipelines -Pipelines in docetl are the core structures that define the flow of data processing. They orchestrate the application of operators to datasets, creating a seamless workflow for complex document processing tasks. +Pipelines in DocETL are the core structures that define the flow of data processing. They orchestrate the application of operators to datasets, creating a seamless workflow for complex document processing tasks. ## Components of a Pipeline -A pipeline in docetl consists of four main components: +A pipeline in DocETL consists of four main components: 1. 
**Default Model**: The language model to use for the pipeline. 2. **Datasets**: The input data sources for your pipeline. @@ -32,7 +32,7 @@ datasets: !!! note - Currently, docetl only supports JSON files as input datasets. If you're interested in support for other data types or cloud-based datasets, please reach out to us or join our open-source community and contribute! We welcome new ideas and contributions to expand the capabilities of docetl. + Currently, DocETL only supports JSON files as input datasets. If you're interested in support for other data types or cloud-based datasets, please reach out to us or join our open-source community and contribute! We welcome new ideas and contributions to expand the capabilities of DocETL. ### Operators diff --git a/docs/concepts/schemas.md b/docs/concepts/schemas.md index f06104ab..8f6b103f 100644 --- a/docs/concepts/schemas.md +++ b/docs/concepts/schemas.md @@ -1,12 +1,12 @@ # Schemas -In docetl, schemas play an important role in defining the structure of output from LLM operations. Every LLM call in docetl is associated with an output schema, which specifies the expected format and types of the output data. +In DocETL, schemas play an important role in defining the structure of output from LLM operations. Every LLM call in DocETL is associated with an output schema, which specifies the expected format and types of the output data. ## Overview - Schemas define the structure and types of output data from LLM operations. - They help ensure consistency and facilitate downstream processing. -- docetl uses structured outputs or tool API to enforce these schemas. +- DocETL uses structured outputs or tool API to enforce these schemas. !!! tip "Schema Simplicity" @@ -74,7 +74,7 @@ Objects are defined using curly braces and must have typed fields: ## Structured Outputs and Tool API -docetl uses structured outputs or tool API to enforce schema typing. This ensures that the LLM outputs adhere to the specified schema, making the results more consistent and easier to process in subsequent operations. +DocETL uses structured outputs or tool API to enforce schema typing. This ensures that the LLM outputs adhere to the specified schema, making the results more consistent and easier to process in subsequent operations. ## Best Practices @@ -108,4 +108,4 @@ docetl uses structured outputs or tool API to enforce schema typing. This ensure The only reason to use the complex schema is if you need to do an operation at the point level, like resolve them and reduce on them. -By following these guidelines and best practices, you can create effective schemas that enhance the performance and reliability of your docetl operations. +By following these guidelines and best practices, you can create effective schemas that enhance the performance and reliability of your DocETL operations. diff --git a/docs/execution/optimizing-pipelines.md b/docs/execution/optimizing-pipelines.md index 2d640d7d..f048ce4a 100644 --- a/docs/execution/optimizing-pipelines.md +++ b/docs/execution/optimizing-pipelines.md @@ -1,10 +1,10 @@ # Optimizing Pipelines -After creating your initial map-reduce pipeline, you might want to optimize it for better performance or to automatically add resolve operations. The docetl pipeline optimizer is designed to help you achieve this. +After creating your initial map-reduce pipeline, you might want to optimize it for better performance or to automatically add resolve operations. The DocETL pipeline optimizer is designed to help you achieve this. 
## Understanding the Optimizer -The optimizer in docetl finds optimal plans for operations marked with `optimize: True`. It can also insert resolve operations before reduce operations if needed. The optimizer uses GPT-4 under the hood (requiring an OpenAI API key) and can be customized with different models like gpt-4-turbo or gpt-4o-mini. Note that only LLM-powered operations can be optimized (e.g., `map`, `reduce`, `resolve`, `filter`, `equijoin`), but the optimized plans may involve new non-LLM operations (e.g., `split`). +The optimizer in DocETL finds optimal plans for operations marked with `optimize: True`. It can also insert resolve operations before reduce operations if needed. The optimizer uses GPT-4 under the hood (requiring an OpenAI API key) and can be customized with different models like gpt-4-turbo or gpt-4o-mini. Note that only LLM-powered operations can be optimized (e.g., `map`, `reduce`, `resolve`, `filter`, `equijoin`), but the optimized plans may involve new non-LLM operations (e.g., `split`). At its core, the optimizer employs two types of AI agents: generation agents and validation agents. Generation agents work to rewrite operators into better plans, potentially decomposing a single operation into multiple, more efficient steps. Validation agents then evaluate these candidate plans, synthesizing task-specific validation prompts to compare outputs and determine the best plan for each operator. @@ -33,7 +33,6 @@ graph LR The optimization process can be unstable, as well as resource-intensive (we've seen it take up to 10 minutes to optimize a single operation, spending up to ~$50 in API costs for end-to-end pipelines). We recommend optimizing one operation at a time and retrying if necessary, as results may vary between runs. This approach also allows you to confidently verify that each optimized operation is performing as expected before moving on to the next. See the [API](#optimizer-api) for more details on how to resume the optimizer from a failed run, by rerunning `docetl build pipeline.yaml --resume` (with the `--resume` flag). - ## Should I Use the Optimizer? While any pipeline can potentially benefit from optimization, there are specific scenarios where using the optimizer can significantly improve your pipeline's performance and accuracy. When should you use the optimizer? @@ -47,22 +46,20 @@ While any pipeline can potentially benefit from optimization, there are specific - Optimize for large-scale data handling !!! info "Entity Resolution" - The optimizer is particularly useful when: +The optimizer is particularly useful when: - You need a resolve operation before your reduce operation - You've defined a resolve operation but want to optimize it for speed using blocking !!! info "High-Volume Reduce Operations" - Consider using the optimizer when: +Consider using the optimizer when: - You have many documents feeding into a reduce operation for a given key - You're concerned about the accuracy of the reduce operation due to high volume - You want to optimize for better accuracy in complex reductions - Even if your pipeline doesn't fall into these specific categories, optimization can still be beneficial. For example, the optimizer can enhance your operations by adding gleaning to an operation, which uses an LLM-powered validator to ensure operation correctness. [Learn more about gleaning](../concepts/operators.md). 
- ## Optimization Process To optimize your pipeline, start with your initial configuration and follow these steps: @@ -238,7 +235,7 @@ This optimized pipeline now includes improved prompts, a resolve operation, and ## Advanced: Customizing Optimization -You can customize the optimization process for specific operations using the ``optimizer_config in your pipeline. +You can customize the optimization process for specific operations using the ``optimizer_config in your pipeline. ### Global Configuration @@ -314,12 +311,11 @@ This configuration will: ## Optimizer API ::: docetl.cli.build - handler: python - options: - members: - - build - show_root_full_path: true - show_root_toc_entry: true - show_root_heading: true - show_source: false - show_name: true +handler: python +options: +members: - build +show_root_full_path: true +show_root_toc_entry: true +show_root_heading: true +show_source: false +show_name: true diff --git a/docs/execution/running-pipelines.md b/docs/execution/running-pipelines.md index 5aaafb56..9020bfef 100644 --- a/docs/execution/running-pipelines.md +++ b/docs/execution/running-pipelines.md @@ -109,7 +109,7 @@ This example pipeline configuration demonstrates a complex medical information e ## Running the Pipeline -To run a pipeline in docetl, follow these steps: +To run a pipeline in DocETL, follow these steps: Ensure your pipeline configuration includes all the required components as described in the [Pipelines](../concepts/pipelines.md) documentation. Your configuration should specify: @@ -130,7 +130,7 @@ docetl run pipeline.yaml If you're unsure about the optimal pipeline configuration or dealing with more complex scenarios, you may want to skip directly to the optimizer section (covered in a later part of this documentation). -As the pipeline runs, docetl will display progress information and eventually show the output. Here's an example of what you might see: +As the pipeline runs, DocETL will display progress information and eventually show the output. Here's an example of what you might see: ``` [Placeholder for pipeline execution output] @@ -151,7 +151,7 @@ Here are some additional notes to help you get the most out of your pipeline: # ... rest of the operation configuration ``` -- **Caching**: Docetl caches the results of operations by default. This means that if you run the same operation on the same data multiple times, the results will be retrieved from the cache rather than being recomputed. You can clear the cache by running `docetl clear-cache`. +- **Caching**: DocETL caches the results of operations by default. This means that if you run the same operation on the same data multiple times, the results will be retrieved from the cache rather than being recomputed. You can clear the cache by running `docetl clear-cache`. - **The `run` Function**: The main entry point for running a pipeline is the `run` function in `docetl/cli.py`. Here's a description of its parameters and functionality: diff --git a/docs/index.md b/docs/index.md index 444670f8..4456b162 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,6 +1,6 @@ -# docetl: A System for Complex Document Processing +# DocETL: A System for Complex Document Processing -docetl is a powerful tool for creating and executing data processing pipelines, especially suited for complex document processing tasks. It offers a low-code, declarative YAML interface to define complex data operations on complex data. 
+DocETL is a powerful tool for creating and executing data processing pipelines, especially suited for complex document processing tasks. It offers a low-code, declarative YAML interface to define complex data operations on complex data. ## Features @@ -11,15 +11,15 @@ docetl is a powerful tool for creating and executing data processing pipelines, ## Getting Started -To get started with docetl: +To get started with DocETL: 1. Install the package (see [installation](installation.md) for detailed instructions) 2. Define your pipeline in a YAML file -3. Run your pipeline using the docetl command-line interface +3. Run your pipeline using the DocETL command-line interface -## Why Should I Use docetl? +## Why Should I Use DocETL? -docetl is the ideal choice when you're looking to **maximize correctness and output quality** for complex tasks over a collection of documents or unstructured datasets. You should consider using docetl if: +DocETL is the ideal choice when you're looking to **maximize correctness and output quality** for complex tasks over a collection of documents or unstructured datasets. You should consider using DocETL if: - You want to perform semantic processing on a collection of data - You have complex tasks that you want to represent via map-reduce (e.g., map over your documents, then group by the result of your map call & reduce) diff --git a/docs/operators/equijoin.md b/docs/operators/equijoin.md index 55c31e6c..51bfe88f 100644 --- a/docs/operators/equijoin.md +++ b/docs/operators/equijoin.md @@ -1,6 +1,6 @@ # Equijoin Operation (Experimental) -The Equijoin operation in docetl is an experimental feature designed for joining two datasets based on flexible, LLM-powered criteria. It leverages many of the same techniques as the [Resolve operation](resolve.md), but applies them to the task of joining datasets rather than deduplicating within a single dataset. +The Equijoin operation in DocETL is an experimental feature designed for joining two datasets based on flexible, LLM-powered criteria. It leverages many of the same techniques as the [Resolve operation](resolve.md), but applies them to the task of joining datasets rather than deduplicating within a single dataset. ## Motivation diff --git a/docs/operators/filter.md b/docs/operators/filter.md index 4618198a..aaa83c2a 100644 --- a/docs/operators/filter.md +++ b/docs/operators/filter.md @@ -1,6 +1,6 @@ # Filter Operation -The Filter operation in docetl is used to selectively process data items based on specific conditions. It behaves similarly to the Map operation, but with a key difference: items that evaluate to false are filtered out of the dataset, allowing you to include or exclude data points from further processing in your pipeline. +The Filter operation in DocETL is used to selectively process data items based on specific conditions. It behaves similarly to the Map operation, but with a key difference: items that evaluate to false are filtered out of the dataset, allowing you to include or exclude data points from further processing in your pipeline. ## Motivation diff --git a/docs/operators/gather.md b/docs/operators/gather.md index 4bb60e4b..ce030a86 100644 --- a/docs/operators/gather.md +++ b/docs/operators/gather.md @@ -1,6 +1,6 @@ # Gather Operation -The Gather operation in docetl is designed to maintain context when processing divided documents. It complements the Split operation by adding contextual information from surrounding chunks to each segment. 
+The Gather operation in DocETL is designed to maintain context when processing divided documents. It complements the Split operation by adding contextual information from surrounding chunks to each segment. ## Motivation diff --git a/docs/operators/map.md b/docs/operators/map.md index 030a6304..5a300adc 100644 --- a/docs/operators/map.md +++ b/docs/operators/map.md @@ -1,6 +1,6 @@ # Map Operation -The Map operation in docetl applies a specified transformation to each item in your input data, allowing for complex processing and insight extraction from large, unstructured documents. +The Map operation in DocETL applies a specified transformation to each item in your input data, allowing for complex processing and insight extraction from large, unstructured documents. ## 🚀 Example: Analyzing Long-Form News Articles @@ -178,7 +178,7 @@ Tools can extend the capabilities of the Map operation. Each tool is a Python fu ### Input Truncation -If the input doesn't fit within the token limit, docetl automatically truncates tokens from the middle of the input data, preserving the beginning and end which often contain more important context. A warning is displayed when truncation occurs. +If the input doesn't fit within the token limit, DocETL automatically truncates tokens from the middle of the input data, preserving the beginning and end which often contain more important context. A warning is displayed when truncation occurs. ## Best Practices diff --git a/docs/operators/parallel-map.md b/docs/operators/parallel-map.md index 95cafe71..3c6d713f 100644 --- a/docs/operators/parallel-map.md +++ b/docs/operators/parallel-map.md @@ -1,6 +1,6 @@ # Parallel Map Operation -The Parallel Map operation in docetl applies multiple independent transformations to each item in the input data concurrently, maintaining a 1:1 input-to-output ratio while generating multiple fields simultaneously. +The Parallel Map operation in DocETL applies multiple independent transformations to each item in the input data concurrently, maintaining a 1:1 input-to-output ratio while generating multiple fields simultaneously. !!! note "Similarity to Map Operation" diff --git a/docs/operators/reduce.md b/docs/operators/reduce.md index f8407d4a..6418c99b 100644 --- a/docs/operators/reduce.md +++ b/docs/operators/reduce.md @@ -1,6 +1,6 @@ # Reduce Operation -The Reduce operation in docetl aggregates data based on a key. It supports both batch reduction and incremental folding for large datasets, making it versatile for various data processing tasks. +The Reduce operation in DocETL aggregates data based on a key. It supports both batch reduction and incremental folding for large datasets, making it versatile for various data processing tasks. ## Motivation diff --git a/docs/operators/resolve.md b/docs/operators/resolve.md index 8f5f6baf..48323684 100644 --- a/docs/operators/resolve.md +++ b/docs/operators/resolve.md @@ -1,6 +1,6 @@ # Resolve Operation -The Resolve operation in docetl identifies and merges duplicate entities in your data. It's particularly useful when dealing with inconsistencies that can arise from LLM-generated content or data from multiple sources. +The Resolve operation in DocETL identifies and merges duplicate entities in your data. It's particularly useful when dealing with inconsistencies that can arise from LLM-generated content or data from multiple sources. 
## Motivation @@ -50,7 +50,7 @@ Note: The prompt templates use Jinja2 syntax, allowing you to reference input fi ## Blocking -To improve efficiency, the Resolve operation supports "blocking" - a technique to reduce the number of comparisons by only comparing entries that are likely to be matches. docetl supports two types of blocking: +To improve efficiency, the Resolve operation supports "blocking" - a technique to reduce the number of comparisons by only comparing entries that are likely to be matches. DocETL supports two types of blocking: 1. Embedding similarity: Compare embeddings of specified fields and only process pairs above a certain similarity threshold. 2. Python conditions: Apply custom Python expressions to determine if a pair should be compared. diff --git a/docs/operators/split.md b/docs/operators/split.md index 2a6c8230..26066153 100644 --- a/docs/operators/split.md +++ b/docs/operators/split.md @@ -1,6 +1,6 @@ # Split Operation -The Split operation in docetl is designed to divide long text content into smaller, manageable chunks. This is particularly useful when dealing with large documents that exceed the token limit of language models or when the LLM's performance degrades with increasing input size for complex tasks. +The Split operation in DocETL is designed to divide long text content into smaller, manageable chunks. This is particularly useful when dealing with large documents that exceed the token limit of language models or when the LLM's performance degrades with increasing input size for complex tasks. ## Motivation @@ -176,7 +176,7 @@ This pipeline allows for detailed analysis of customer frustration in long suppo 1. **Choose the Right Splitting Method**: Use the token count method when working with models that have strict token limits. Use the delimiter method when you need to split at logical boundaries in your text. -2. **Balance Chunk Size**: When using the token count method, choose a chunk size that balances between context preservation and model performance. Smaller chunks may lose context, while larger chunks may degrade model performance. The \docetl optimizer can find the chunk size that works best for your task, if you choose to use the optimizer. +2. **Balance Chunk Size**: When using the token count method, choose a chunk size that balances between context preservation and model performance. Smaller chunks may lose context, while larger chunks may degrade model performance. The DocETL optimizer can find the chunk size that works best for your task, if you choose to use the optimizer. 3. **Consider Overlap**: In some cases, you might want to implement overlap between chunks to maintain context. This isn't built into the Split operation, but you can achieve it by post-processing the split chunks. diff --git a/docs/operators/unnest.md b/docs/operators/unnest.md index a9a883f2..458af9c1 100644 --- a/docs/operators/unnest.md +++ b/docs/operators/unnest.md @@ -1,6 +1,6 @@ # Unnest Operation -The Unnest operation in docetl is designed to expand an array field or a dictionary in the input data into multiple items. This operation is particularly useful when you need to process or analyze individual elements of an array or specific fields of a nested dictionary separately. +The Unnest operation in DocETL is designed to expand an array field or a dictionary in the input data into multiple items. This operation is particularly useful when you need to process or analyze individual elements of an array or specific fields of a nested dictionary separately. !!! 
warning "How Unnest Works" diff --git a/docs/tutorial.md b/docs/tutorial.md index 508cff2d..31667213 100644 --- a/docs/tutorial.md +++ b/docs/tutorial.md @@ -1,16 +1,16 @@ -# Tutorial: Mining User Behavior Data with docetl +# Tutorial: Mining User Behavior Data with DocETL -This tutorial will guide you through the process of using docetl to analyze user behavior data from UI logs. We'll create a simple pipeline that extracts key insights and supporting actions from user logs, then summarizes them by country. +This tutorial will guide you through the process of using DocETL to analyze user behavior data from UI logs. We'll create a simple pipeline that extracts key insights and supporting actions from user logs, then summarizes them by country. ## Installation -First, let's install docetl. Follow the instructions in the [installation guide](installation.md) to set up docetl on your system. +First, let's install DocETL. Follow the instructions in the [installation guide](installation.md) to set up DocETL on your system. ## Setting up API Keys -docetl uses [LiteLLM](https://github.com/BerriAI/litellm) under the hood, which supports various LLM providers. For this tutorial, we'll use OpenAI, as docetl tests and existing pipelines are run with OpenAI. +DocETL uses [LiteLLM](https://github.com/BerriAI/litellm) under the hood, which supports various LLM providers. For this tutorial, we'll use OpenAI, as DocETL tests and existing pipelines are run with OpenAI. -!!! tip "Setting up API Key" +!!! tip "Setting up your API Key" Set your OpenAI API key as an environment variable: @@ -49,7 +49,7 @@ Save this file as `user_logs.json` in your project directory. ## Creating the Pipeline -Now, let's create a docetl pipeline to analyze this data. We'll use a map-reduce-like approach: +Now, let's create a DocETL pipeline to analyze this data. We'll use a map-reduce-like approach: 1. Map each user log to key insights and supporting actions 2. Unnest the insights @@ -208,8 +208,8 @@ This will process the user logs, extract key insights and supporting actions, an ??? question "What if I want to reduce by insights or an LLM-generated field?" - You can modify the reduce operation to use any field as the reduce key, including LLM-generated fields from prior operations. Simply change the `reduce_key` in the `summarize_by_country` operation to the desired field. Note that we may need to perform entity resolution on the LLM-generated fields, which docetl can do for you in the optimization process (to be discussed later). + You can modify the reduce operation to use any field as the reduce key, including LLM-generated fields from prior operations. Simply change the `reduce_key` in the `summarize_by_country` operation to the desired field. Note that we may need to perform entity resolution on the LLM-generated fields, which DocETL can do for you in the optimization process (to be discussed later). ??? question "How do I know what pipeline configuration to write? Can't I do this all in one map operation?" - While it's possible to perform complex operations in a single map step, breaking down the process into multiple steps often leads to more maintainable and flexible pipelines. To learn more about optimizing your pipeline configuration, read on to discover docetl's optimizer, which can be invoked using `docetl build` instead of `docetl run`. + While it's possible to perform complex operations in a single map step, breaking down the process into multiple steps often leads to more maintainable and flexible pipelines. 
To learn more about optimizing your pipeline configuration, read on to discover DocETL's optimizer, which can be invoked using `docetl build` instead of `docetl run`.
diff --git a/vision.md b/vision.md
index f03557b5..5c3f267f 100644
--- a/vision.md
+++ b/vision.md
@@ -7,6 +7,6 @@ Things I'd like for the interface/agents to do:
 - Users should be in control of validation prompts.
 - When users are looking at intermediates, we should have the ability to run validators on the intermediate prompts.
 - We need to store intermediates and have provenance.
-- Have an interface to interactively create docetl pipelines. Start by users defining a high-level task, and optimize one operation at a time.
+- Have an interface to interactively create DocETL pipelines. Start by users defining a high-level task, and optimize one operation at a time.
 - Synthesize validate statements for each operation during optimization.
 - When generating chunking plans, use an LLM agent to deterimine what chunking plans to synthesize. E.g., it should be able to tell us whether peripheral context is necessary to include in the chunk.