Merge branch 'main' into hf/1104

ivanbelenky authored Oct 23, 2024
2 parents e3efd03 + f63e1d0 commit df07c39
Showing 4 changed files with 450 additions and 3 deletions.
5 changes: 2 additions & 3 deletions docs/blog/posts/anthropic.md
@@ -5,8 +5,7 @@ categories:
 - Anthropic
 comments: true
 date: 2024-03-20
-description: Enhance your projects with the new Anthropic client support, featuring
-  installation guidance and user model creation.
+description: Learn how to integrate Anthropic's powerful language models into your projects using Instructor, with step-by-step guidance on installation, client setup, and creating structured outputs with Pydantic models.
 draft: false
 tags:
 - Anthropic
@@ -16,7 +15,7 @@ tags:
 - LLM Techniques
 ---
 
-# Announcing Anthropic Support
+# Structured Outputs with Anthropic
 
 A special shoutout to [Shreya](https://twitter.com/shreyaw_) for her contributions to the anthropic support. As of now, all features are operational with the exception of streaming support.
215 changes: 215 additions & 0 deletions docs/blog/posts/multimodal-gemini.md
@@ -0,0 +1,215 @@
---
authors:
- ivanleomk
categories:
- Gemini
- Multimodal
comments: true
date: 2024-10-23
description: Learn how to use Google's Gemini model for multimodal structured extraction of YouTube videos, extracting structured recommendations for tourist destinations.
draft: false
tags:
- Gemini
- Multimodal AI
- Travel Recommendations
- Pydantic
- Python
---

# Structured Outputs with Multimodal Gemini

In this post, we'll explore how to use Google's Gemini model with Instructor to analyze [travel videos](https://www.youtube.com/watch?v=_R8yhW_H9NQ) and extract structured recommendations. This powerful combination allows us to process multimodal inputs (video) and generate structured outputs using Pydantic models. This post was written in collaboration with [Kino.ai](https://kino.ai), a company that uses Instructor for structured extraction from multimodal inputs to improve search for filmmakers.

## Setting Up the Environment

First, let's set up our environment with the necessary libraries:

```python
from pydantic import BaseModel
import instructor
import google.generativeai as genai
```
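
If you haven't set up credentials yet, the Gemini SDK reads an API key through `genai.configure`. A minimal sketch, assuming the key is exported as `GOOGLE_API_KEY`:

```python
import os

# Assumes GOOGLE_API_KEY is set in your environment.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
```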

## Defining Our Data Models

We'll use Pydantic to define our data models for tourist destinations and recommendations:

```python
class TouristDestination(BaseModel):
    name: str
    description: str
    location: str


class Recommendations(BaseModel):
    chain_of_thought: str
    description: str
    destinations: list[TouristDestination]
```

## Initializing the Gemini Client

Next, we'll set up our Gemini client using Instructor:

```python
client = instructor.from_gemini(
    client=genai.GenerativeModel(
        model_name="models/gemini-1.5-flash-latest",
    ),
)
```
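
Instructor also supports other extraction modes for Gemini. A hedged sketch, assuming `instructor.Mode.GEMINI_JSON` is available in your installed version:

```python
# Alternative: request JSON output from Gemini instead of tool calls.
client = instructor.from_gemini(
    client=genai.GenerativeModel(model_name="models/gemini-1.5-flash-latest"),
    mode=instructor.Mode.GEMINI_JSON,
)
```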

## Uploading and Processing the Video

To analyze a video, we first need to upload it:

```python
file = genai.upload_file("./takayama.mp4")
```
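
Uploaded videos are processed asynchronously on Google's side, so it's worth polling until the file leaves the `PROCESSING` state before querying the model. A minimal sketch (the polling interval is arbitrary):

```python
import time

# Wait for server-side processing of the uploaded video to finish.
while file.state.name == "PROCESSING":
    time.sleep(5)
    file = genai.get_file(file.name)

if file.state.name != "ACTIVE":
    raise RuntimeError(f"Video upload failed with state {file.state.name}")
```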

Then, we can process the video and extract recommendations:

```python
resp = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": ["What places do they recommend in this video?", file],
        }
    ],
    response_model=Recommendations,
)

print(resp)
```

??? note "Expand to see Raw Results"

    ```python
    Recommendations(
        chain_of_thought='The video recommends visiting Takayama city, in the Hida Region, Gifu Prefecture. The video suggests visiting the Miyagawa Morning Market, to try the Sarubobo good luck charms, and to enjoy the cookie cup espresso, made by Koma Coffee. Then, the video suggests visiting a traditional Japanese Cafe, called Kissako Katsure, and try their matcha and sweets. Afterwards, the video suggests to visit the Sanmachi Historic District, where you can find local crafts and delicious foods. The video recommends trying Hida Wagyu beef, at the Kin no Kotte Ushi shop, or to have a sit-down meal at the Kitchen Hida. Finally, the video recommends visiting Shirakawa-go, a World Heritage Site in Gifu Prefecture.',
        description='This video recommends a number of places to visit in Takayama city, in the Hida Region, Gifu Prefecture. It shows some of the local street food and highlights some of the unique shops and restaurants in the area.',
        destinations=[
            TouristDestination(
                name='Takayama',
                description='Takayama is a city at the base of the Japan Alps, located in the Hida Region of Gifu.',
                location='Hida Region, Gifu Prefecture'
            ),
            TouristDestination(
                name='Miyagawa Morning Market',
                description="The Miyagawa Morning Market, or the Miyagawa Asai-chi in Japanese, is a market that has existed officially since the Edo Period, more than 100 years ago. It's open every single day, rain or shine, from 7am to noon.",
                location='Hida Takayama'
            ),
            TouristDestination(
                name='Nakaya - Handmade Hida Sarubobo',
                description='The Nakaya shop sells handcrafted Sarubobo good luck charms.',
                location='Hida Takayama'
            ),
            TouristDestination(
                name='Koma Coffee',
                description="Koma Coffee is a shop that has been in business for about 50 or 60 years, and they serve coffee in a cookie cup. They've been serving coffee for about 10 years.",
                location='Hida Takayama'
            ),
            TouristDestination(
                name='Kissako Katsure',
                description='Kissako Katsure is a traditional Japanese style cafe, called Kissako, and the name means would you like to have some tea. They have a variety of teas and sweets.',
                location='Hida Takayama'
            ),
            TouristDestination(
                name='Sanmachi Historic District',
                description='Sanmachi Dori is a Historic Merchant District in Takayama, all of the buildings here have been preserved to look as they did in the Edo Period.',
                location='Hida Takayama'
            ),
            TouristDestination(
                name='Suwa Orchard',
                description='The Suwa Orchard has been in business for more than 50 years.',
                location='Hida Takayama'
            ),
            TouristDestination(
                name='Kitchen HIDA',
                description='Kitchen HIDA is a restaurant with a 50 year history, known for their Hida Beef dishes and for using a lot of local ingredients.',
                location='Hida Takayama'
            ),
            TouristDestination(
                name='Kin no Kotte Ushi',
                description='Kin no Kotte Ushi is a shop known for selling Beef Sushi, especially Hida Wagyu Beef Sushi. Their sushi is medium rare.',
                location='Hida Takayama'
            ),
            TouristDestination(
                name='Shirakawa-go',
                description='Shirakawa-go is a World Heritage Site in Gifu Prefecture.',
                location='Gifu Prefecture'
            )
        ]
    )
    ```

The Gemini model analyzes the video and provides structured recommendations. Here's a summary of the extracted information:

1. **Takayama City**: The main destination, located in the Hida Region of Gifu Prefecture.
2. **Miyagawa Morning Market**: A historic market open daily from 7am to noon.
3. **Nakaya Shop**: Sells handcrafted Sarubobo good luck charms.
4. **Koma Coffee**: A 50-60 year old shop famous for serving coffee in cookie cups.
5. **Kissako Katsure**: A traditional Japanese cafe offering various teas and sweets.
6. **Sanmachi Historic District**: A preserved merchant district from the Edo Period.
7. **Suwa Orchard**: A 50+ year old orchard business.
8. **Kitchen HIDA**: A restaurant with a 50-year history, known for Hida Beef dishes.
9. **Kin no Kotte Ushi**: A shop specializing in Hida Wagyu Beef Sushi.
10. **Shirakawa-go**: A World Heritage Site in Gifu Prefecture.
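
Because `resp` is a validated `Recommendations` instance rather than raw text, downstream code can use its typed attributes directly, for example:

```python
for dest in resp.destinations:
    print(f"{dest.name} ({dest.location})")
```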

## Limitations, Challenges, and Future Directions

While the current approach demonstrates the power of multimodal AI for video analysis, there are several limitations and challenges to consider:

1. **Lack of Temporal Information**: Our current method extracts overall recommendations but doesn't provide timestamps for specific mentions. This limits the ability to link recommendations to exact moments in the video.

2. **Speaker Diarization**: The model doesn't distinguish between different speakers in the video. Implementing speaker diarization could provide valuable context about who is making specific recommendations.

3. **Content Density**: Longer or more complex videos might overwhelm the model, potentially leading to missed information or less accurate extractions.

### Future Explorations

To address these limitations and expand the capabilities of our video analysis system, here are some promising areas to explore:

1. **Timestamp Extraction**: Enhance the model to provide timestamps for each recommendation or point of interest mentioned in the video. This could be achieved by:

    ```python
    from typing import Literal

    class TimestampedRecommendation(BaseModel):
        timestamp: str
        timestamp_format: Literal["HH:MM", "HH:MM:SS"]  # Helps with parsing
        recommendation: str

    class EnhancedRecommendations(BaseModel):
        destinations: list[TouristDestination]
        timestamped_mentions: list[TimestampedRecommendation]
    ```

2. **Speaker Diarization**: Implement speaker recognition to attribute recommendations to specific individuals. This could be particularly useful for videos featuring multiple hosts or interviewees.

3. **Segment-based Analysis**: Process longer videos in segments to maintain accuracy and capture all relevant information. This approach could involve:
    - Splitting the video into smaller chunks
    - Analyzing each chunk separately
    - Aggregating and deduplicating results
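
    A minimal sketch of this segmented flow, assuming a hypothetical `chunk_paths` list produced by splitting the video beforehand (the prompt and deduplication key are illustrative):

    ```python
    def analyze_in_segments(chunk_paths: list[str]) -> list[TouristDestination]:
        # Analyze each chunk separately, then deduplicate results by name.
        seen: dict[str, TouristDestination] = {}
        for path in chunk_paths:
            chunk = genai.upload_file(path)
            resp = client.chat.completions.create(
                messages=[
                    {
                        "role": "user",
                        "content": ["What places do they recommend in this segment?", chunk],
                    }
                ],
                response_model=Recommendations,
            )
            for dest in resp.destinations:
                seen.setdefault(dest.name, dest)
        return list(seen.values())
    ```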

4. **Multi-language Support**: Extend the model's capabilities to accurately analyze videos in various languages and capture culturally specific recommendations.

5. **Visual Element Analysis**: Enhance the model to recognize and describe visual elements like landmarks, food dishes, or activities shown in the video, even if not explicitly mentioned in the audio.

6. **Sentiment Analysis**: Incorporate sentiment analysis to gauge the speaker's enthusiasm or reservations about specific recommendations.

By addressing these challenges and exploring these new directions, we can create a more comprehensive and nuanced video analysis system, opening up even more possibilities for applications in travel, education, and beyond.
137 changes: 137 additions & 0 deletions docs/blog/posts/structured-output-anthropic.md
@@ -0,0 +1,137 @@
---
authors:
- jxnl
categories:
- Anthropic
comments: true
date: 2024-10-23
description: Learn how to leverage Anthropic's Claude with Instructor for structured outputs and prompt caching, enhancing AI application development.
draft: false
tags:
- Anthropic
- API Development
- Pydantic
- Python
- LLM Techniques
- Prompt Caching
---

# Structured Outputs and Prompt Caching with Anthropic

Anthropic's ecosystem now offers two powerful features for AI developers: structured outputs and prompt caching. These advancements enable more efficient use of large language models (LLMs). This guide demonstrates how to leverage these features with the Instructor library to enhance your AI applications.

## Structured Outputs with Anthropic and Instructor

Instructor now offers seamless integration with Anthropic's powerful language models, allowing developers to easily create structured outputs using Pydantic models. This integration simplifies the process of extracting specific information from AI-generated responses.

To get started, you'll need to install Instructor with Anthropic support:

```bash
pip install "instructor[anthropic]"
```

Here's a basic example of how to use Instructor with Anthropic:

```python
from pydantic import BaseModel
from typing import List
import anthropic
import instructor

# Patch the Anthropic client with Instructor
anthropic_client = instructor.from_anthropic(anthropic.Anthropic())

# Define your Pydantic models
class Properties(BaseModel):
    name: str
    value: str

class User(BaseModel):
    name: str
    age: int
    properties: List[Properties]

# Use the patched client to generate structured output
user_response = anthropic_client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": "Create a user for a model with a name, age, and properties.",
        }
    ],
    response_model=User,
)

print(user_response.model_dump_json(indent=2))
"""
{
  "name": "John Doe",
  "age": 30,
  "properties": [
    { "name": "favorite_color", "value": "blue" }
  ]
}
"""
```

This approach allows you to easily extract structured data from Claude's responses, making it simpler to integrate AI-generated content into your applications.
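
Because the response is validated against your Pydantic model, Instructor can also re-prompt Claude automatically when validation fails. A minimal sketch using Instructor's `max_retries` parameter with the client and model defined above:

```python
# Retry up to 3 times if the output fails Pydantic validation.
user_response = anthropic_client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=1024,
    max_retries=3,
    messages=[
        {
            "role": "user",
            "content": "Create a user for a model with a name, age, and properties.",
        }
    ],
    response_model=User,
)
```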

## Prompt Caching: Boosting Performance and Reducing Costs

Anthropic has introduced a new prompt caching feature that can significantly improve response times and reduce costs for applications dealing with large context windows. This feature is particularly useful when making multiple calls with similar large contexts over time.

Here's how you can implement prompt caching with Instructor and Anthropic:

```python
import instructor
from anthropic import Anthropic
from pydantic import BaseModel

# Set up the client with prompt caching
client = instructor.from_anthropic(Anthropic())

# Define your Pydantic model
class Character(BaseModel):
    name: str
    description: str

# Load your large context
with open("./book.txt", "r") as f:
    book = f.read()

# Make multiple calls using the cached context
for _ in range(2):
    resp, completion = client.chat.completions.create_with_completion(
        model="claude-3-haiku-20240307",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "<book>" + book + "</book>",
                        "cache_control": {"type": "ephemeral"},
                    },
                    {
                        "type": "text",
                        "text": "Extract a character from the text given above",
                    },
                ],
            },
        ],
        response_model=Character,
        max_tokens=1000,
    )
```

In this example, the large context (the book content) is cached after the first request and reused in subsequent requests. This can lead to significant time and cost savings, especially when working with extensive context windows.
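
To confirm the cache is actually being used, inspect the raw `completion` object returned by `create_with_completion`. A minimal sketch, assuming the usage fields exposed by Anthropic's prompt-caching beta:

```python
# On the first call, expect cache_creation_input_tokens > 0 (a cache write);
# on the second, cache_read_input_tokens > 0 (a cache hit).
print(completion.usage)
```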

## Conclusion

By combining Anthropic's Claude with Instructor's structured output capabilities and leveraging prompt caching, developers can create more efficient, cost-effective, and powerful AI applications. These features open up new possibilities for building sophisticated AI systems that can handle complex tasks with ease.

As the AI landscape continues to evolve, staying up-to-date with the latest tools and techniques is crucial. We encourage you to explore these features and share your experiences with the community. Happy coding!