Commit

feat: Compatibility with xps, epub, mobi, fb2, cbz, svg, txt
clemlesne committed Jun 16, 2024
1 parent b51299f commit aa68723
Showing 2 changed files with 29 additions and 8 deletions.
31 changes: 25 additions & 6 deletions README.md
@@ -61,12 +61,31 @@ graph LR

 ### Format support

-Document extraction is based on Azure Document Intelligence, specifically on the `prebuilt-layout` model. It [supports the following](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-layout?view=doc-intel-4.0.0&tabs=sample-code#input-requirements) formats:
-
-- HTML
-- Images: JPEG/JPG, PNG, BMP, TIFF, HEIF
-- Microsoft Office: Word (DOCX), Excel (XLSX), PowerPoint (PPTX)
-- PDF
+Document extraction is based on Azure Document Intelligence, specifically on the `prebuilt-layout` model. It [supports popular formats](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-layout?view=doc-intel-4.0.0&tabs=sample-code#input-requirements).
+
+Some formats are first converted to PDF [with MuPDF](https://github.com/ArtifexSoftware/mupdf) to ensure compatibility with Document Intelligence.
+
+> [!IMPORTANT]
+> Formats not listed there are treated as binary and decoded with `UTF-8` encoding.
+
+| `Format` | **OCR** | **Details** |
+|-|-|-|
+| `.bmp` || |
+| `.cbz` || First converted to PDF with MuPDF. |
+| `.docx` || |
+| `.epub` || First converted to PDF with MuPDF. |
+| `.fb2` || First converted to PDF with MuPDF. |
+| `.heif` || |
+| `.html` || |
+| `.jpg`, `.jpeg` || |
+| `.mobi` || First converted to PDF with MuPDF. |
+| `.pdf` || Sanitized & compressed with MuPDF. |
+| `.png` || |
+| `.pptx` || |
+| `.svg` || First converted to PDF with MuPDF. |
+| `.tiff` || |
+| `.xlsx` || |
+| `.xps` || First converted to PDF with MuPDF. |

 ### Demo

6 changes: 4 additions & 2 deletions function_app.py
@@ -89,11 +89,13 @@ async def _upload(local_file: IO, remote_path: str) -> None:
 await downloader.readinto(in_local_path)
 in_local_path.seek(0) # Reset file pointer

-if detect_extension(in_remote_path) == ".pdf": # Sanitize PDF
-    logger.info(f"Sanitizing PDF ({in_remote_path})")
+if detect_extension(in_remote_path) in {".pdf", ".xps", ".epub", ".mobi", ".fb2", ".cbz", ".svg", ".txt"}: # Sanitize with PyMuPDF
+    logger.info(f"Sanitizing ({in_remote_path})")
     doc_client = CONFIG.document_intelligence.instance()
     # Open
     in_pdf = pymupdf.open(in_local_path)
+    if not in_pdf.is_pdf: # Convert to PDF
+        in_pdf = pymupdf.open("pdf", in_pdf.convert_to_pdf())
     # Sanitize
     in_pdf.scrub(
         hidden_text=False, # Keep hidden text (it may contain OCR)
Expand Down
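
For readers who want to try the conversion step outside the function app, here is a minimal standalone sketch of the same idea: pick files by extension, open them with PyMuPDF, and convert to PDF when the document is not one already. The names `MUPDF_EXTENSIONS`, `needs_mupdf` and `to_pdf_bytes` are hypothetical and do not appear in the repository. `needs_mupdf` only approximates `detect_extension`, whose implementation is not shown in this commit. Opening from bytes via `stream=`/`filetype=` is an assumption made for the sketch, whereas the commit opens `in_local_path` directly.

```python
from pathlib import PurePosixPath

# Extensions routed through MuPDF, copied from the set used in function_app.py.
MUPDF_EXTENSIONS = frozenset(
    {".pdf", ".xps", ".epub", ".mobi", ".fb2", ".cbz", ".svg", ".txt"}
)


def needs_mupdf(remote_path: str) -> bool:
    """Return True when the file should be opened with PyMuPDF first.

    Hypothetical stand-in for the repository's detect_extension() check:
    it only looks at the last suffix of the path, case-insensitively.
    """
    return PurePosixPath(remote_path).suffix.lower() in MUPDF_EXTENSIONS


def to_pdf_bytes(data: bytes, extension: str) -> bytes:
    """Open raw document bytes with PyMuPDF and return them as PDF bytes."""
    import pymupdf  # Third-party (pip install pymupdf), imported lazily

    # Assumption: the file type hint is the extension without its leading dot.
    doc = pymupdf.open(stream=data, filetype=extension.lstrip("."))
    if not doc.is_pdf:  # Same conversion step as in the commit
        doc = pymupdf.open("pdf", doc.convert_to_pdf())
    return doc.tobytes()
```

Blob paths use forward slashes, which is why the sketch relies on `PurePosixPath` rather than the platform-dependent `Path`. Scrubbing and compression, which the commit applies after conversion, are left out here.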
