Add Azure Document Intelligence Read Tool #36

shreyashankar · 2024-10-01T21:58:56Z

This PR introduces a new parsing tool, azure_di_read, which leverages Azure's Document Intelligence service to extract text from various document types, including PDFs, images, and scanned documents.

Features

Extracts text from documents using Azure Document Intelligence (formerly Azure Form Recognizer)
Supports both local files and URLs
Configurable options for including line numbers, handwritten text, font styles, and selection marks
Option to return each page as a separate document or combine all pages into a single output
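
The per-page versus combined output option can be pictured with a small sketch. This is only an illustration of the behavior described above, not the tool's actual code: the function name `assemble_output`, the parameter name `return_per_page`, and the blank-line separator are all assumptions.

```python
from typing import List


def assemble_output(page_texts: List[str], return_per_page: bool = False) -> List[str]:
    """Return one string per page, or a single combined string.

    Hypothetical helper: `return_per_page` is an assumed name for the
    "each page as a separate document" option described in the PR.
    """
    if return_per_page:
        # Each page becomes its own output document.
        return list(page_texts)
    # Otherwise all pages are merged into one document,
    # separated by a blank line (the separator is an assumption).
    return ["\n\n".join(page_texts)]


pages = ["Text of page 1", "Text of page 2"]
combined = assemble_output(pages)
per_page = assemble_output(pages, return_per_page=True)
```

Returning a list in both cases keeps the caller's handling uniform: one document in, one or more text outputs back.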

Usage

To use this new tool, users need to:

Set up Azure Document Intelligence service and obtain API credentials
Set environment variables for DOCUMENTINTELLIGENCE_API_KEY and DOCUMENTINTELLIGENCE_ENDPOINT
Configure the parsing tool in their DocETL pipeline configuration
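
Before running a pipeline, it may help to confirm that both environment variables are set. The following is a minimal sketch using only the standard library; the helper name `missing_credentials` is hypothetical and not part of DocETL.

```python
import os
from typing import List, Mapping

# The two variable names come from the PR description.
REQUIRED_VARS = (
    "DOCUMENTINTELLIGENCE_API_KEY",
    "DOCUMENTINTELLIGENCE_ENDPOINT",
)


def missing_credentials(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; it only inspects the given
    mapping and makes no call to Azure.
    """
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("Azure Document Intelligence credentials found.")
```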

Example configuration:

datasets:
  documents:
    type: file
    source: local
    path: "document_paths.json"
    parsing:
      - input_key: document_path
        function: azure_di_read
        output_key: extracted_text
        function_kwargs:
          include_line_numbers: true
          include_handwritten: true
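
The configuration above reads `document_paths.json` and passes each record's `document_path` value to the tool. A sketch of how such a file might be produced is shown below. It assumes the dataset file is a JSON list of objects, one per document; that layout and the helper names are assumptions, and the example file names are placeholders.

```python
import json
from pathlib import Path
from typing import Dict, List


def build_document_records(paths: List[str]) -> List[Dict[str, str]]:
    """Wrap each path in a record keyed by `document_path`.

    The key matches `input_key` in the example configuration; the
    list-of-objects layout is an assumption about the dataset format.
    """
    return [{"document_path": p} for p in paths]


def write_document_paths(paths: List[str], out_file: str = "document_paths.json") -> Path:
    """Write the records to `out_file` as JSON and return its path."""
    target = Path(out_file)
    target.write_text(json.dumps(build_document_records(paths), indent=2), encoding="utf-8")
    return target


# Placeholder inputs: a local file and a URL, since the tool supports both.
records = build_document_records(["scans/invoice.pdf", "https://example.com/report.pdf"])
```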

Testing

A new test case has been added to verify the functionality of the azure_di_read tool.

Documentation

The azure_di_read function is documented with a docstring, explaining its parameters and usage. Additional documentation has been added to the parsing tools section of the project documentation.

Note that this requires the user to have set up Azure Document Intelligence. This is not great; we should explore an off-the-shelf OCR option, as discussed in #3.

shreyashankar added 4 commits October 1, 2024 14:53

feat: add azure document intelligence for basic OCR

fc22210

feat: add azure document intelligence for basic OCR

a2b95ea

refactor: run precommit

07e3401

chore: update poetry lockfile

9df21f5

shreyashankar merged commit 4fe9d65 into main Oct 1, 2024
3 checks passed

shreyashankar deleted the shreyashankar/ocr branch October 2, 2024 16:14
