Add Azure Document Intelligence Read Tool #36

Merged (4 commits) on Oct 1, 2024

shreyashankar
Collaborator

This PR introduces a new parsing tool, azure_di_read, which leverages Azure's Document Intelligence service to extract text from various document types, including PDFs, images, and scanned documents.

Features

  • Extracts text from documents using Azure Document Intelligence (formerly Azure Form Recognizer)
  • Supports both local files and URLs
  • Configurable options for including line numbers, handwritten text, font styles, and selection marks
  • Option to return each page as a separate document or combine all pages into a single output
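
To make the last two options concrete, here is a minimal sketch of how per-page text could be assembled. It does not reflect the actual azure_di_read implementation: the helper name `assemble_pages`, the `split_pages` parameter name, and the `N: text` line-number format are all assumptions made for illustration. Only `include_line_numbers` appears in this PR's example configuration.

```python
from typing import List


def assemble_pages(
    pages: List[List[str]],
    include_line_numbers: bool = False,
    split_pages: bool = False,  # hypothetical name for the per-page option
) -> List[str]:
    """Turn per-page lists of text lines into output documents.

    Hypothetical helper, not part of DocETL. `pages` holds one list of
    already-extracted text lines per page.
    """
    rendered = []
    for page_lines in pages:
        if include_line_numbers:
            # Assumed format: numbering restarts at 1 on every page.
            page_lines = [f"{i}: {line}" for i, line in enumerate(page_lines, start=1)]
        rendered.append("\n".join(page_lines))

    if split_pages:
        # One output document per page.
        return rendered
    # A single output document, with pages separated by a blank line.
    return ["\n\n".join(rendered)]
```

With `split_pages=True` the caller gets one string per page; otherwise it gets a one-element list, so the return type is the same in both modes.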

Usage

To use this new tool, users need to:

  1. Set up Azure Document Intelligence service and obtain API credentials
  2. Set the DOCUMENTINTELLIGENCE_API_KEY and DOCUMENTINTELLIGENCE_ENDPOINT environment variables
  3. Configure the parsing tool in their DocETL pipeline configuration
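
Before running a pipeline, it can help to confirm that step 2 was completed. The following is a minimal sketch of such a check. The helper `missing_azure_di_vars` is hypothetical and not part of DocETL; only the two variable names come from this PR.

```python
import os
from typing import List, Mapping

# Variable names taken from the usage steps in this PR.
REQUIRED_VARS = ("DOCUMENTINTELLIGENCE_API_KEY", "DOCUMENTINTELLIGENCE_ENDPOINT")


def missing_azure_di_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or blank.

    Hypothetical pre-flight helper; it only inspects the environment and
    never contacts Azure.
    """
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Calling `missing_azure_di_vars()` with no arguments inspects the current process environment; an empty list means both variables are present.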

Example configuration:

datasets:
  documents:
    type: file
    source: local
    path: "document_paths.json"
    parsing:
      - input_key: document_path
        function: azure_di_read
        output_key: extracted_text
        function_kwargs:
          include_line_numbers: true
          include_handwritten: true
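
The configuration above reads `document_paths.json` and passes each record's `document_path` value to the tool. A minimal sketch for generating that file follows, assuming the dataset is a JSON list of objects with one `document_path` key each. That layout is inferred from `input_key` and is not confirmed by this PR; the helper name `write_document_paths` and the sample paths are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, List


def write_document_paths(
    paths: Iterable[str], out_file: str = "document_paths.json"
) -> List[Dict[str, str]]:
    """Write one {"document_path": ...} record per input path.

    Hypothetical helper. The record layout is an assumption based on
    `input_key: document_path` in the example configuration.
    """
    records = [{"document_path": str(p)} for p in paths]
    Path(out_file).write_text(json.dumps(records, indent=2), encoding="utf-8")
    return records
```

For example, `write_document_paths(["scans/invoice.pdf", "https://example.com/report.pdf"])` would write two records, one for a local file and one for a URL.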

Testing

A new test case has been added to verify the functionality of the azure_di_read tool.

Documentation

The azure_di_read function has a docstring that explains its parameters and usage. Additional documentation has been added to the parsing tools section of the project documentation.

Note that this requires the user to have set up Azure Document Intelligence. This is not great; we should explore an off-the-shelf OCR option, as discussed in #3.

@shreyashankar shreyashankar merged commit 4fe9d65 into main Oct 1, 2024
3 checks passed
@shreyashankar shreyashankar deleted the shreyashankar/ocr branch October 2, 2024 16:14