enhance: add document summarizer example (#107)
* enhance: add document summarizer example

Signed-off-by: Grant Linville <[email protected]>
1 parent f039adc, commit 9b50fe1
Showing 7 changed files with 104 additions and 1 deletion.
@@ -0,0 +1 @@
venv/
Binary file not shown.
@@ -0,0 +1,40 @@
# Hamlet Summarizer

This is an example tool that summarizes the contents of large documents in chunks.

The example document we are using is the Shakespeare play Hamlet. It is about 51000 tokens
(according to OpenAI's tokenizer for GPT-4), so it can fit within GPT-4's context window,
but this serves as an example of how larger documents can be split up and summarized.
This example splits it into chunks of 10000 tokens.

The Hamlet PDF is from https://nosweatshakespeare.com/hamlet-play/pdf/.

## Design

The script consists of three tools: a top-level tool that orchestrates everything, a summarizer that
summarizes one chunk of text at a time, and a Python script that ingests the PDF, splits it into
chunks, and provides a specific chunk based on an index.

The summarizer tool looks at the entire summary up to the current chunk and then summarizes the current
chunk and adds it onto the end. In the case of models with very small context windows, or extremely large
documents, this approach may still exceed the context window, in which case another tool could be added to
only give the summarizer the previous few chunk summaries instead of all of them.
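
The fallback mentioned above, giving the summarizer only the previous few chunk summaries, could look roughly like the sketch below. It is not part of this commit. `recent_summaries` and its `keep` parameter are hypothetical names, and the sketch assumes that summaries in summary.txt are separated by a blank line (the summarizer prompt asks for two trailing newlines) and that no single summary contains a blank line of its own.

```python
def recent_summaries(summary_text: str, keep: int = 3) -> str:
    """Return only the last `keep` chunk summaries from the running summary.

    Assumption: summaries are separated by a blank line (two newlines),
    and an individual summary never contains a blank line itself.
    """
    if keep <= 0:
        return ""
    # Split on blank lines and drop empty fragments left by trailing newlines.
    parts = [p.strip() for p in summary_text.split("\n\n") if p.strip()]
    # Keep the most recent summaries, still in their original order.
    return "\n\n".join(parts[-keep:])
```

A tool built on this idea would hand the result to the summarizer in place of the full contents of summary.txt, which bounds the prompt size at the cost of losing earlier context.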

## Run the Example

```bash
# Create a Python venv
python3 -m venv venv

# Source it
source venv/bin/activate

# Install the packages
pip install -r requirements.txt

# Set your OpenAI key
export OPENAI_API_KEY=your-api-key

# Run the example
gptscript --cache=false hamlet-summarizer.gpt
```
@@ -0,0 +1,35 @@
tools: hamlet-summarizer, sys.read, sys.write

First, create the file "summary.txt" if it does not already exist.

You are a program that is tasked with fetching partial summaries of a play called Hamlet.

Call the hamlet-summarizer tool to get each part of the summary. Begin with index 0. Do not proceed
until the tool has responded to you.

Once you get "No more content" from the hamlet-summarizer, stop calling it.
Then, print the contents of the summary.txt file.

---
name: hamlet-summarizer
tools: hamlet-retriever, sys.read, sys.append
description: Summarizes a part of the text of Hamlet. Returns "No more content" if the index is greater than the number of parts.
args: index: (unsigned int) the index of the portion to summarize, beginning at 0

You are a theater expert, and you're tasked with summarizing part of Hamlet.
Get the part of Hamlet at index $index.
Read the existing summary of Hamlet up to this point in summary.txt.

Summarize the part at index $index. Include as many details as possible. Do not leave out any important plot points.
Do not introduce the summary with "In this part of Hamlet", "In this segment", or any similar language.
If a new character is introduced, be sure to explain who they are.
Add two newlines to the end of your summary and append it to summary.txt.

If you got "No more content" just say "No more content". Otherwise, say "Continue".

---
name: hamlet-retriever
description: Returns a part of the text of Hamlet. Returns "No more content" if the index is greater than the number of parts.
args: index: (unsigned int) the index of the part to return, beginning at 0

#!python3 main.py "$index"
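
For readers unfamiliar with GPTScript, the control flow that the three tools above describe in natural language is roughly the loop below. This is a plain-Python sketch of the intent, not a description of how GPTScript executes the tools. `summarize_document`, `get_part`, `summarize_part`, and `make_retriever` are hypothetical names, and the summarizer is a stand-in callable rather than a model call.

```python
from typing import Callable, List

NO_MORE_CONTENT = "No more content"


def summarize_document(
    get_part: Callable[[int], str],
    summarize_part: Callable[[str, str], str],
) -> str:
    """Summarize parts in order until the retriever reports no more content.

    get_part(index) plays the role of hamlet-retriever.
    summarize_part(summary_so_far, part) plays the role of hamlet-summarizer.
    """
    summary = ""  # stands in for summary.txt
    index = 0
    while True:
        part = get_part(index)
        if part == NO_MORE_CONTENT:
            break
        # Append the new partial summary followed by two newlines.
        summary += summarize_part(summary, part) + "\n\n"
        index += 1
    return summary


def make_retriever(parts: List[str]) -> Callable[[int], str]:
    """Build a retriever over an in-memory list of parts (illustration only)."""
    def get_part(index: int) -> str:
        return parts[index] if index < len(parts) else NO_MORE_CONTENT
    return get_part
```

For example, `summarize_document(make_retriever(["act one", "act two"]), lambda so_far, part: part.upper())` walks both parts in order and stops on the first index past the end.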
@@ -0,0 +1,24 @@
import tiktoken
import sys
from llama_index.readers.file import PyMuPDFReader
from llama_index.core.node_parser import TokenTextSplitter

index = int(sys.argv[1])
docs = PyMuPDFReader().load("Hamlet.pdf")

combined = ""
for doc in docs:
    combined += doc.text

splitter = TokenTextSplitter(
    chunk_size=10000,
    chunk_overlap=10,
    tokenizer=tiktoken.encoding_for_model("gpt-4").encode)

pieces = splitter.split_text(combined)

if index >= len(pieces):
    print("No more content")
    sys.exit(0)

print(pieces[index])
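
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a dependency-free sketch of fixed-size chunking with overlap over a list of tokens. It is an illustration only and is not a reimplementation of `TokenTextSplitter`, which works on text and may split at different points. `chunk_tokens` is a hypothetical name.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    """Split tokens into chunks of at most chunk_size tokens.

    Each chunk after the first starts chunk_overlap tokens before the
    end of the previous chunk, so neighbouring chunks share that many tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once this chunk has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
        start += step
    return chunks
```

Under this simplified scheme, an input of 51000 tokens with `chunk_size=10000` and `chunk_overlap=10` yields six chunks. The real splitter may produce a different count.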
@@ -0,0 +1,3 @@
tiktoken==0.6.0
llama-index-core==0.10.14
llama-index-readers-file==0.1.6