Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add Gemini PDF example #1163

Merged
merged 6 commits into from
Nov 12, 2024
Merged
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
108 changes: 108 additions & 0 deletions docs/blog/posts/chat-with-your-pdf-with-gemini.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,108 @@
---
authors:
- ivanleomk
categories:
- Gemini
- Document Processing
comments: true
date: 2024-11-11
description: Learn how to use Google's Gemini model with Instructor to process PDFs and extract structured information
draft: false
tags:
- Gemini
- Document Processing
- PDF Analysis
- Pydantic
- Python
---

# Quick PDF Processing with Gemini and Instructor
jxnl marked this conversation as resolved.
Show resolved Hide resolved

In this post, we'll explore how to use Google's Gemini model with Instructor to analyse the [Gemini 1.5 Pro Paper](https://github.com/google-gemini/generative-ai-python/blob/0e5c5f25fe4ce266791fa2afb20d17dee780ca9e/third_party/test.pdf) and extract a structured summary.

## The Problem

Processing PDFs programmatically has always been painful. The typical approaches all have significant drawbacks:

- **PDF parsing libraries** require complex rules and break easily
- **OCR solutions** are slow and error-prone
- **Specialized PDF APIs** are expensive and require additional integration
- **LLM solutions** often need complex document chunking and embedding pipelines

What if we could just hand a PDF to an LLM and get structured data back? With Gemini's multimodal capabilities and Instructor's structured output handling, we can do exactly that.

## Quick Setup

First, install the required packages:

```bash
pip install "instructor[google-generativeai]"
```

Then, here's all the code you need:

```python
import instructor
import google.generativeai as genai
from google.ai.generativelanguage_v1beta.types.file import File
from pydantic import BaseModel
import time

# Initialize the client
client = instructor.from_gemini(
client=genai.GenerativeModel(
model_name="models/gemini-1.5-flash-latest",
)
)

# Define your output structure
class Summary(BaseModel):
summary: str

# Upload the PDF
file = genai.upload_file("path/to/your.pdf")

# Wait for file to finish processing
while file.state != File.State.ACTIVE:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing import for time module. Add import time at the beginning of the code.

time.sleep(1)
file = genai.get_file(file.name)
print(f"File is still uploading, state: {file.state}")

print(f"File is now active, state: {file.state}")
print(file)

resp = client.chat.completions.create(
messages=[
{"role": "user", "content": ["Summarize the following file", file]},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The content key in the message dictionary should be a string, not a list. Change "content": ["Summarize the following file", file] to "content": "Summarize the following file: " + file.

Suggested change
{"role": "user", "content": ["Summarize the following file", file]},
{"role": "user", "content": "Summarize the following file: " + file},

],
response_model=Summary,
)

print(resp.summary)
```

??? note "Expand to see Raw Results"

```bash
summary="Gemini 1.5 Pro is a highly compute-efficient multimodal mixture-of-experts model capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. It achieves near-perfect recall on long-context retrieval tasks across modalities, improves the state-of-the-art in long-document QA, long-video QA and long-context ASR, and matches or surpasses Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Gemini 1.5 Pro is built to handle extremely long contexts; it has the ability to recall and reason over fine-grained information from up to at least 10M tokens. This scale is unprecedented among contemporary large language models (LLMs), and enables the processing of long-form mixed-modality inputs including entire collections of documents, multiple hours of video, and almost five days long of audio. Gemini 1.5 Pro surpasses Gemini 1.0 Pro and performs at a similar level to 1.0 Ultra on a wide array of benchmarks while requiring significantly less compute to train. It can recall information amidst distractor context, and it can learn to translate a new language from a single set of linguistic documentation. With only instructional materials (a 500-page reference grammar, a dictionary, and ≈ 400 extra parallel sentences) all provided in context, Gemini 1.5 Pro is capable of learning to translate from English to Kalamang, a Papuan language with fewer than 200 speakers, and therefore almost no online presence."
```

## Benefits

The combination of Gemini and Instructor offers several key advantages over traditional PDF processing approaches:

**Simple Integration** - Unlike traditional approaches that require complex document processing pipelines, chunking strategies, and embedding databases, you can directly process PDFs with just a few lines of code. This dramatically reduces development time and maintenance overhead.

**Structured Output** - Instructor's Pydantic integration ensures you get exactly the data structure you need. The model's outputs are automatically validated and typed, making it easier to build reliable applications. If the extraction fails, Instructor automatically handles the retries for you with support for [custom retry logic using tenacity].

**Multimodal Support** - Gemini's multimodal capabilities mean this same approach works for various file types. You can process images, videos, and audio files all in the same api request. Check out our [multimodal processing guide](./multimodal-gemini.md) to see how we extract structured data from travel videos.

## Conclusion

Working with PDFs doesn't have to be complicated.

By combining Gemini's multimodal capabilities with Instructor's structured output handling, we can transform complex document processing into simple, Pythonic code.

No more wrestling with parsing rules, managing embeddings, or building complex pipelines – just define your data model and let the LLM do the heavy lifting.

If you liked this, give `instructor` a try today and see how much easier structured outputs makes working with LLMs become. [Get started with Instructor today!](../../index.md)
Loading