Merge pull request #34 from ucbepic/shreyashankar/dataset
docs: improve clarity for custom parsing
shreyashankar authored Oct 1, 2024
2 parents ab7e87a + 438956d commit fb900c1
Showing 1 changed file with 155 additions and 104 deletions: docs/examples/custom-parsing.md
# Custom Dataset Parsing in DocETL

In DocETL, you have full control over your dataset JSONs. These JSONs typically contain objects with key-value pairs, where you can specify paths or references to external files that you want to process in your pipeline. But what if these external files are in formats that need special handling before they can be used in your main pipeline? This is where custom parsing in DocETL becomes useful.

!!! info "Why Use Custom Parsing?"

    Consider these scenarios:

    - Your dataset JSON contains paths to Excel spreadsheets with sales data.
    - You have references to scanned receipts in PDF format that need OCR processing.
    - You want to extract text from Word documents or PowerPoint presentations.

    In these cases, custom parsing enables you to transform your raw external data into a format that DocETL can process effectively within your pipeline.

## Dataset JSON Example

Let's look at a typical dataset JSON file that you might create:

```json
[
  { "id": 1, "excel_path": "sales_data/january_sales.xlsx" },
  { "id": 2, "excel_path": "sales_data/february_sales.xlsx" }
]
```

In this example, you've specified paths to Excel files. DocETL will use these paths to locate and process the external files. However, without custom parsing, DocETL wouldn't know how to handle the contents of these files. This is where parsing tools come in handy.
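
If you generate dataset files like this programmatically, a few lines of Python are enough. The sketch below assumes the same `sales_data/` layout as the example above and writes the list to `sales_data/sales_paths.json`, a hypothetical location; any path your configuration points to will do:

```python
import json
from pathlib import Path

# Collect every spreadsheet under sales_data/ and assign sequential ids.
entries = [
    {"id": i, "excel_path": str(path)}
    for i, path in enumerate(sorted(Path("sales_data").glob("*.xlsx")), start=1)
]

# Write the dataset JSON that the pipeline configuration will reference.
Path("sales_data/sales_paths.json").write_text(json.dumps(entries, indent=2))
```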

## Custom Parsing in Action

### 1. Configuration

To use custom parsing, you need to define parsing tools in your DocETL configuration file. Here's an example:

```yaml
parsing_tools:
  - name: ocr_parser
    function_code: |
      import pytesseract
      from pdf2image import convert_from_path

      def ocr_parser(filename: str) -> List[str]:
          images = convert_from_path(filename)
          text = ""
          for image in images:
              text += pytesseract.image_to_string(image)
          return [text]

datasets:
  sales_reports:
    type: file
    source: local
    path: "sales_data/sales_paths.json"
    parsing:
      - input_key: excel_path
        function: xlsx_to_string
        output_key: sales_data

  receipts:
    type: file
    source: local
    path: "receipts/receipt_paths.json"
    parsing:
      - input_key: pdf_path
        function: ocr_parser
        output_key: receipt_text
```

In this configuration:
- We define a custom `ocr_parser` for PDF files.
- We use the built-in `xlsx_to_string` parser for Excel files.
- We apply these parsing tools to the external files referenced in the respective datasets.
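
Because `function_code` is ordinary Python, you can sanity-check a parser on its own before embedding it in the configuration. Here is the same `ocr_parser` as a standalone sketch, assuming `pdf2image` (which requires poppler) and `pytesseract` (which requires Tesseract) are installed:

```python
from typing import List

import pytesseract                        # OCR; needs the Tesseract binary installed
from pdf2image import convert_from_path  # PDF rasterization; needs poppler installed


def ocr_parser(filename: str) -> List[str]:
    images = convert_from_path(filename)  # one image per PDF page
    text = ""
    for image in images:
        text += pytesseract.image_to_string(image)
    return [text]  # DocETL parsing tools return a list of strings


if __name__ == "__main__":
    # Quick local check against one of the receipts referenced in the dataset JSON.
    print(ocr_parser("receipts/receipt001.pdf")[0])
```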

### 2. Pipeline Integration

Once you've defined your parsing tools and datasets, you can use the processed data in your pipeline:

```yaml
pipeline:
  steps:
    - name: process_sales
      input: sales_reports
      operations:
        - summarize_sales
    - name: process_receipts
      input: receipts
      operations:
        - extract_receipt_info

  output:
    type: file
    path: "output.json"
```

This pipeline will use the parsed data from both Excel files and PDFs for further processing.

DocETL also provides several built-in parsing tools to handle common file formats and data processing tasks. You can use them directly by specifying their names in the `function` field of your parsing configuration; the full reference appears in the Built-in Parsing Tools section later in this guide.

## Creating Custom Parsing Tools

If the built-in tools don't meet your needs, you can create your own custom parsing tools. Here's how:

1. Define your parsing function in the `parsing_tools` section of your configuration.
2. Ensure your function takes a filename as input and returns a list of strings.
3. Use your custom parser in the `parsing` section of your dataset configuration.

For example:

```yaml
parsing_tools:
  - name: my_custom_parser
    function_code: |
      def my_custom_parser(filename: str) -> List[str]:
          # Your custom parsing logic here
          return [processed_data]

datasets:
  my_dataset:
    type: file
    source: local
    path: "data/paths.json"
    parsing:
      - input_key: file_path
        function: my_custom_parser
        output_key: processed_data
```
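
To make the stub above concrete, here is one possible body for a custom parser: a plain-text paragraph splitter. It is purely illustrative; any function that takes a filename and returns a list of strings will work:

```python
from typing import List


def my_custom_parser(filename: str) -> List[str]:
    """Split a plain-text file into paragraphs, one output string per paragraph."""
    with open(filename, "r", encoding="utf-8") as f:
        content = f.read()
    paragraphs = [p.strip() for p in content.split("\n\n") if p.strip()]
    return paragraphs or [content]  # fall back to the whole file if there are no blank lines
```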

## How Data Gets Parsed and Formatted

When you run your DocETL pipeline, the parsing tools you've specified in your configuration file are applied to the external files referenced in your dataset JSONs. Here's what happens:

1. DocETL reads your dataset JSON file.
2. For each entry in the dataset, it looks at the parsing configuration you've specified.
3. It applies the appropriate parsing function to the file path provided in the dataset JSON.
4. The parsing function processes the file and returns the data in a format DocETL can work with (typically a list of strings).

Let's look at how this works for our earlier examples:

### Excel Files (using xlsx_to_string)

For an Excel file like "sales_data/january_sales.xlsx":

1. The `xlsx_to_string` function reads the Excel file.
2. It converts the data to a string representation.
3. The output might look like this:

```
Date:
2023-01-01
2023-01-02
...
Product:
Widget A
Widget B
...
Amount:
100
150
...
```
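
For intuition, a column-wise conversion that produces output in roughly this shape could be written with `openpyxl` as sketched below. This is only an illustration; it is not necessarily how DocETL's built-in `xlsx_to_string` is implemented:

```python
from typing import List

from openpyxl import load_workbook  # assumed dependency for this sketch


def xlsx_to_column_text(filename: str) -> List[str]:
    """Render the active sheet as one text block per column: a header followed by its values."""
    sheet = load_workbook(filename, data_only=True).active  # active sheet only
    blocks = []
    for column in sheet.iter_cols(values_only=True):
        header, *values = column
        blocks.append(f"{header}:\n" + "\n".join(str(v) for v in values))
    return ["\n\n".join(blocks)]  # a single string for the whole file
```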

### PDF Files (using ocr_parser)

For a PDF file like "receipts/receipt001.pdf":

1. The `ocr_parser` function converts each page of the PDF to an image.
2. It applies OCR to each image.
3. The function combines the text from all pages.
4. The output might look like this:

```
RECEIPT
Store: Example Store
Date: 2023-05-15
Items:
1. Product A - $10.99
2. Product B - $15.50
Total: $26.49
```

This parsed and formatted data is then passed to the respective operations in your pipeline for further processing.

## Running the Pipeline

Once you've set up your pipeline configuration file with the appropriate parsing tools and dataset definitions, you can run your DocETL pipeline. Here's how:
1. Ensure you have DocETL installed in your environment.
2. Open a terminal or command prompt.
3. Navigate to the directory containing your pipeline configuration file.
4. Run the following command:

```bash
docetl run pipeline.yaml
```

Replace `pipeline.yaml` with the name of your pipeline file if it's different.

When you run this command:

1. DocETL reads your pipeline file.
2. It processes each dataset using the specified parsing tools.
3. The pipeline steps are executed in the order you defined.
4. Any operations you've specified (like `summarize_sales` or `extract_receipt_info`) are applied to the parsed data.
5. The results are saved according to your output configuration.

## Built-in Parsing Tools

DocETL provides several built-in parsing tools to handle common file formats and data processing tasks. These can be used directly in your configuration by specifying their names in the `function` field of your parsing configuration.

### Using Function Arguments with Parsing Tools

When using parsing tools in your DocETL configuration, you can pass additional arguments to the parsing functions using the `function_kwargs` field. This allows you to customize the behavior of the parsing tools without modifying their implementation.
For example:

```yaml
datasets:
  my_dataset:
    type: file
    source: local
    path: "data/paths.json"
    parsing:
      - input_key: excel_path
        function: xlsx_to_string
        output_key: excel_content
        function_kwargs:
          doc_per_sheet: true
```
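
Conceptually, each entry under `function_kwargs` is forwarded to the parsing function as a keyword argument. The helper below is a hypothetical illustration of that dispatch, not DocETL's actual implementation:

```python
from typing import Callable, Dict, List, Optional


def apply_parsing_tool(
    tool: Callable[..., List[str]],
    filename: str,
    function_kwargs: Optional[Dict[str, object]] = None,
) -> List[str]:
    """Call a parsing tool on a file, expanding configured kwargs such as doc_per_sheet."""
    return tool(filename, **(function_kwargs or {}))


# e.g. apply_parsing_tool(xlsx_to_string, "sales_data/january_sales.xlsx",
#                         {"doc_per_sheet": True})
```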

## Contributing Built-in Parsing Tools

While DocETL provides several built-in parsing tools, the community can always benefit from additional utilities. If you've developed a parsing tool that you think could be useful for others, consider contributing it to the DocETL repository. Here's how you can add new built-in parsing utilities:

1. Fork the DocETL repository on GitHub.
2. Clone your forked repository to your local machine.
3. Navigate to the `docetl/parsing_tools.py` file.
4. Add your new parsing function to this file. The function should also be added to the `PARSING_TOOLS` dictionary (a sketch of both steps appears at the end of this section).
5. Update the documentation in the function's docstring.
6. Create a pull request to merge your changes into the main DocETL repository.

!!! note "Guidelines for Contributing Parsing Tools"

    When contributing a new parsing tool, make sure it follows these guidelines:

    - The function should have a clear, descriptive name.
    - Include comprehensive docstrings explaining the function's purpose, parameters, and return value. The return value should be a list of strings.
    - Handle potential errors gracefully and provide informative error messages.
    - If your parser requires additional dependencies, make sure to mention them in the pull request.
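
As a sketch of what steps 4 and 5 might look like (the `markdown_to_string` tool below is hypothetical and only for illustration):

```python
# In docetl/parsing_tools.py
from typing import List


def markdown_to_string(filename: str) -> List[str]:
    """Read a Markdown file and return its contents as a single-element list of strings."""
    with open(filename, "r", encoding="utf-8") as f:
        return [f.read()]


# Register the new tool so pipelines can reference it by name in the `function` field.
PARSING_TOOLS["markdown_to_string"] = markdown_to_string
```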
