Merge pull request #34 from ucbepic/shreyashankar/dataset
docs: improve clarity for custom parsing
shreyashankar authored Oct 1, 2024
2 parents ab7e87a + 438956d commit fb900c1
Showing 1 changed file with 155 additions and 104 deletions: docs/examples/custom-parsing.md
# Custom Dataset Parsing in DocETL

In DocETL, you have full control over your dataset JSONs. These JSONs typically contain objects with key-value pairs, where you can specify paths or references to external files that you want to process in your pipeline. But what if these external files are in formats that need special handling before they can be used in your main pipeline? This is where custom parsing in DocETL becomes useful.

!!! info "Why Use Custom Parsing?"

    Consider these scenarios:

    - Your dataset JSON contains paths to Excel spreadsheets with sales data.
    - You have references to scanned receipts in PDF format that need OCR processing.
    - You want to extract text from Word documents or PowerPoint presentations.

    In these cases, custom parsing enables you to transform your raw external data into a format that DocETL can process effectively within your pipeline.

## Dataset JSON Example

Let's look at a typical dataset JSON file that you might create:

```json
[
  { "id": 1, "excel_path": "sales_data/january_sales.xlsx" },
  { "id": 2, "excel_path": "sales_data/february_sales.xlsx" }
]
```

In this example, you've specified paths to Excel files. DocETL will use these paths to locate and process the external files. However, without custom parsing, DocETL wouldn't know how to handle the contents of these files. This is where parsing tools come in handy.
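
If you generate dataset files like this programmatically, a few lines of Python are enough. The sketch below assumes the same `sales_data/` layout as the example above and writes the list to `sales_data/sales_paths.json`, a hypothetical location; any path your configuration points to will do:

```python
import json
from pathlib import Path

# Collect every spreadsheet under sales_data/ and assign sequential ids.
entries = [
    {"id": i, "excel_path": str(path)}
    for i, path in enumerate(sorted(Path("sales_data").glob("*.xlsx")), start=1)
]

# Write the dataset JSON that the pipeline configuration will reference.
Path("sales_data/sales_paths.json").write_text(json.dumps(entries, indent=2))
```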

## Custom Parsing in Action

### 1. Configuration

To use custom parsing, you need to define parsing tools in your DocETL configuration file. Here's an example:

```yaml
parsing_tools:
  - name: ocr_parser
    function_code: |
      import pytesseract
      from pdf2image import convert_from_path

      def ocr_parser(filename: str) -> List[str]:
          images = convert_from_path(filename)
          text = ""
          for image in images:
              text += pytesseract.image_to_string(image)
          return [text]

datasets:
  sales_reports:
    type: file
    source: local
    path: "sales_data/sales_paths.json"
    parsing:
      - input_key: excel_path
        function: xlsx_to_string
        output_key: sales_data

  receipts:
    type: file
    source: local
    path: "receipts/receipt_paths.json"
    parsing:
      - input_key: pdf_path
        function: ocr_parser
        output_key: receipt_text
```

In this configuration:
- We define a custom `ocr_parser` for PDF files.
- We use the built-in `xlsx_to_string` parser for Excel files.
- We apply these parsing tools to the external files referenced in the respective datasets.
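
Because `function_code` is ordinary Python, you can sanity-check a parser on its own before embedding it in the configuration. Here is the same `ocr_parser` as a standalone sketch, assuming `pdf2image` (which requires poppler) and `pytesseract` (which requires Tesseract) are installed:

```python
from typing import List

import pytesseract                        # OCR; needs the Tesseract binary installed
from pdf2image import convert_from_path  # PDF rasterization; needs poppler installed


def ocr_parser(filename: str) -> List[str]:
    images = convert_from_path(filename)  # one image per PDF page
    text = ""
    for image in images:
        text += pytesseract.image_to_string(image)
    return [text]  # DocETL parsing tools return a list of strings


if __name__ == "__main__":
    # Quick local check against one of the receipts referenced in the dataset JSON.
    print(ocr_parser("receipts/receipt001.pdf")[0])
```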

### 2. Pipeline Integration

Once you've defined your parsing tools and datasets, you can use the processed data in your pipeline:

```yaml
pipeline:
  steps:
    - name: process_sales
      input: sales_reports
      operations:
        - summarize_sales
    - name: process_receipts
      input: receipts
      operations:
        - extract_receipt_info

  output:
    type: file
    path: "output.json"
```

This pipeline will use the parsed data from both Excel files and PDFs for further processing.

DocETL also provides several built-in parsing tools to handle common file formats and data processing tasks. You can use them directly by specifying their names in the `function` field of your parsing configuration; the full reference appears in the Built-in Parsing Tools section later in this guide.

## Creating Custom Parsing Tools

If the built-in tools don't meet your needs, you can create your own custom parsing tools. Here's how:

1. Define your parsing function in the `parsing_tools` section of your configuration.
2. Ensure your function takes a filename as input and returns a list of strings.
3. Use your custom parser in the `parsing` section of your dataset configuration.

For example:

```yaml
parsing_tools:
  - name: my_custom_parser
    function_code: |
      def my_custom_parser(filename: str) -> List[str]:
          # Your custom parsing logic here
          return [processed_data]

datasets:
  my_dataset:
    type: file
    source: local
    path: "data/paths.json"
    parsing:
      - input_key: file_path
        function: my_custom_parser
        output_key: processed_data
```
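
To make the stub above concrete, here is one possible body for a custom parser: a plain-text paragraph splitter. It is purely illustrative; any function that takes a filename and returns a list of strings will work:

```python
from typing import List


def my_custom_parser(filename: str) -> List[str]:
    """Split a plain-text file into paragraphs, one output string per paragraph."""
    with open(filename, "r", encoding="utf-8") as f:
        content = f.read()
    paragraphs = [p.strip() for p in content.split("\n\n") if p.strip()]
    return paragraphs or [content]  # fall back to the whole file if there are no blank lines
```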

## How Data Gets Parsed and Formatted

When you run your DocETL pipeline, the parsing tools you've specified in your configuration file are applied to the external files referenced in your dataset JSONs. Here's what happens:

1. DocETL reads your dataset JSON file.
2. For each entry in the dataset, it looks at the parsing configuration you've specified.
3. It applies the appropriate parsing function to the file path provided in the dataset JSON.
4. The parsing function processes the file and returns the data in a format DocETL can work with (typically a list of strings).

Let's look at how this works for our earlier examples:

### Excel Files (using xlsx_to_string)

For an Excel file like "sales_data/january_sales.xlsx":

1. The `xlsx_to_string` function reads the Excel file.
2. It converts the data to a string representation.
3. The output might look like this:

```
Date:
2023-01-01
2023-01-02
...
Product:
Widget A
Widget B
...
Amount:
100
150
...
```
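
For intuition, a column-wise conversion that produces output in roughly this shape could be written with `openpyxl` as sketched below. This is only an illustration; it is not necessarily how DocETL's built-in `xlsx_to_string` is implemented:

```python
from typing import List

from openpyxl import load_workbook  # assumed dependency for this sketch


def xlsx_to_column_text(filename: str) -> List[str]:
    """Render the active sheet as one text block per column: a header followed by its values."""
    sheet = load_workbook(filename, data_only=True).active  # active sheet only
    blocks = []
    for column in sheet.iter_cols(values_only=True):
        header, *values = column
        blocks.append(f"{header}:\n" + "\n".join(str(v) for v in values))
    return ["\n\n".join(blocks)]  # a single string for the whole file
```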

### PDF Files (using ocr_parser)

For a PDF file like "receipts/receipt001.pdf":

1. The `ocr_parser` function converts each page of the PDF to an image.
2. It applies OCR to each image.
3. The function combines the text from all pages.
4. The output might look like this:

```
RECEIPT
Store: Example Store
Date: 2023-05-15
Items:
1. Product A - $10.99
2. Product B - $15.50
Total: $26.49
```

This parsed and formatted data is then passed to the respective operations in your pipeline for further processing.

## Running the Pipeline

Once you've set up your pipeline configuration file with the appropriate parsing tools and dataset definitions, you can run your DocETL pipeline. Here's how:
1. Ensure you have DocETL installed in your environment.
2. Open a terminal or command prompt.
3. Navigate to the directory containing your pipeline configuration file.
4. Run the following command:

```bash
docetl run pipeline.yaml
```

Replace `pipeline.yaml` with the name of your pipeline file if it's different.

When you run this command:

1. DocETL reads your pipeline file.
2. It processes each dataset using the specified parsing tools.
3. The pipeline steps are executed in the order you defined.
4. Any operations you've specified (like `summarize_sales` or `extract_receipt_info`) are applied to the parsed data.
5. The results are saved according to your output configuration.

## Built-in Parsing Tools

DocETL provides several built-in parsing tools to handle common file formats and data processing tasks. These can be used directly in your configuration by specifying their names in the `function` field of your parsing configuration.

### Using Function Arguments with Parsing Tools

When using parsing tools in your DocETL configuration, you can pass additional arguments to the parsing functions using the `function_kwargs` field. This allows you to customize the behavior of the parsing tools without modifying their implementation.
For example:

```yaml
datasets:
  my_dataset:
    type: file
    source: local
    path: "data/paths.json"
    parsing:
      - input_key: excel_path
        function: xlsx_to_string
        output_key: excel_content
        function_kwargs:
          doc_per_sheet: true
```
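
Conceptually, each entry under `function_kwargs` is forwarded to the parsing function as a keyword argument. The helper below is a hypothetical illustration of that dispatch, not DocETL's actual implementation:

```python
from typing import Callable, Dict, List, Optional


def apply_parsing_tool(
    tool: Callable[..., List[str]],
    filename: str,
    function_kwargs: Optional[Dict[str, object]] = None,
) -> List[str]:
    """Call a parsing tool on a file, expanding configured kwargs such as doc_per_sheet."""
    return tool(filename, **(function_kwargs or {}))


# e.g. apply_parsing_tool(xlsx_to_string, "sales_data/january_sales.xlsx",
#                         {"doc_per_sheet": True})
```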

## Contributing Built-in Parsing Tools

While DocETL provides several built-in parsing tools, the community can always benefit from additional utilities. If you've developed a parsing tool that you think could be useful for others, consider contributing it to the DocETL repository. Here's how you can add new built-in parsing utilities:

1. Fork the DocETL repository on GitHub.
2. Clone your forked repository to your local machine.
3. Navigate to the `docetl/parsing_tools.py` file.
4. Add your new parsing function to this file. The function should also be added to the `PARSING_TOOLS` dictionary (a sketch of both steps appears at the end of this section).
5. Update the documentation in the function's docstring.
6. Create a pull request to merge your changes into the main DocETL repository.

!!! note "Guidelines for Contributing Parsing Tools"

    When contributing a new parsing tool, make sure it follows these guidelines:

    - The function should have a clear, descriptive name.
    - Include comprehensive docstrings explaining the function's purpose, parameters, and return value. The return value should be a list of strings.
    - Handle potential errors gracefully and provide informative error messages.
    - If your parser requires additional dependencies, make sure to mention them in the pull request.
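
As a sketch of what steps 4 and 5 might look like (the `markdown_to_string` tool below is hypothetical and only for illustration):

```python
# In docetl/parsing_tools.py
from typing import List


def markdown_to_string(filename: str) -> List[str]:
    """Read a Markdown file and return its contents as a single-element list of strings."""
    with open(filename, "r", encoding="utf-8") as f:
        return [f.read()]


# Register the new tool so pipelines can reference it by name in the `function` field.
PARSING_TOOLS["markdown_to_string"] = markdown_to_string
```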
