enhance: add document summarizer example (#107)
* enhance: add document summarizer example

Signed-off-by: Grant Linville <[email protected]>
g-linville authored Mar 5, 2024
1 parent f039adc commit 9b50fe1
Showing 7 changed files with 104 additions and 1 deletion.
2 changes: 1 addition & 1 deletion docs/README-USECASES.md
@@ -83,7 +83,7 @@ Depending on the context window supported by the LLM, you can either send a larg

### Summarization

-Here is a GPTScript that sends a large document in batches to the LLM and produces a summary of the entire document. [Link to example here]
+Here is a GPTScript that sends a large document in batches to the LLM and produces a summary of the entire document. [hamlet-summarizer](../examples/hamlet-summarizer)

Here is a GPTScript that reads the content of a large SQL database and produces a summary of the entire database. [Link to example here]

1 change: 1 addition & 0 deletions examples/hamlet-summarizer/.gitignore
@@ -0,0 +1 @@
venv/
Binary file added examples/hamlet-summarizer/Hamlet.pdf
Binary file not shown.
40 changes: 40 additions & 0 deletions examples/hamlet-summarizer/README.md
@@ -0,0 +1,40 @@
# Hamlet Summarizer

This is an example tool that summarizes the contents of a large document in chunks.

The example document we are using is the Shakespeare play Hamlet. It is about 51000 tokens
(according to OpenAI's tokenizer for GPT-4), so it can fit within GPT-4's context window,
but this serves as an example of how larger documents can be split up and summarized.
This example splits it into chunks of 10000 tokens.
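
As a rough check on those numbers, the sketch below estimates how many chunks a fixed-size sliding window would produce. It assumes the 10-token overlap configured in `main.py`; `estimate_chunks` is a hypothetical helper, not part of the example. The real splitter breaks text at separators, so the actual count may differ.

```python
import math


def estimate_chunks(total_tokens: int, chunk_size: int, chunk_overlap: int) -> int:
    """Estimate the chunk count for a sliding window with overlap.

    Hypothetical helper for illustration only; it ignores separator-aware
    splitting, so treat the result as an approximation.
    """
    stride = chunk_size - chunk_overlap
    if stride <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if total_tokens <= chunk_size:
        return 1
    # Each chunk after the first contributes `stride` new tokens.
    return math.ceil((total_tokens - chunk_overlap) / stride)


# About 51000 tokens, 10000-token chunks, 10-token overlap
print(estimate_chunks(51000, 10000, 10))  # prints 6
```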

The Hamlet PDF is from https://nosweatshakespeare.com/hamlet-play/pdf/.

## Design

The script consists of three tools: a top-level tool that orchestrates everything, a summarizer that
summarizes one chunk of text at a time, and a Python script that ingests the PDF, splits it into
chunks, and returns the chunk at a given index.
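
The control flow between the three tools can be sketched in plain Python. This is only an illustration under stated assumptions: in the example the loop is driven by the LLM, and `fake_summarize` is a hypothetical stand-in for the model call. `retrieve_part` and `summarize_all` are likewise names chosen here, not part of the example.

```python
from typing import Callable, List

NO_MORE = "No more content"


def retrieve_part(pieces: List[str], index: int) -> str:
    """Return the chunk at `index`, or the sentinel once the chunks run out."""
    if index >= len(pieces):
        return NO_MORE
    return pieces[index]


def summarize_all(pieces: List[str], summarize: Callable[[str, str], str]) -> str:
    """Walk the chunks from index 0, appending one summary per chunk."""
    summary = ""
    index = 0
    while True:
        part = retrieve_part(pieces, index)
        if part == NO_MORE:
            break
        # The summarizer sees the summary so far plus the current chunk.
        summary += summarize(summary, part) + "\n\n"
        index += 1
    return summary


def fake_summarize(summary_so_far: str, part: str) -> str:
    """Hypothetical stand-in for the LLM: keep only the first sentence."""
    return part.split(".")[0] + "."


pieces = ["Act one begins. More text.", "Act two follows. More text."]
print(summarize_all(pieces, fake_summarize))
```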

The summarizer tool reads the entire summary written so far, summarizes the current chunk, and appends
the result to the end. With very small context windows or extremely large documents, the accumulated
summary may itself exceed the context window. In that case, another tool could be added that gives the
summarizer only the previous few chunk summaries instead of all of them.
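
One possible shape for such a tool is sketched below. It assumes each chunk summary in `summary.txt` is followed by a blank line, as the summarizer prompt requests; `recent_summaries` is a hypothetical name, not part of this example. A summary that itself contains a blank line would be counted as two entries.

```python
def recent_summaries(summary_text: str, keep: int = 3) -> str:
    """Return only the last `keep` chunk summaries from the running summary.

    Assumes summaries are separated by blank lines (two newlines).
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    entries = [entry.strip() for entry in summary_text.split("\n\n") if entry.strip()]
    return "\n\n".join(entries[-keep:])


running = "Summary of part 0.\n\nSummary of part 1.\n\nSummary of part 2.\n\n"
print(recent_summaries(running, keep=2))
```

The text passed in would be the contents of `summary.txt`; the result could be handed to the summarizer in place of the full file.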

## Run the Example

```bash
# Create a Python venv
python3 -m venv venv

# Source it
source venv/bin/activate

# Install the packages
pip install -r requirements.txt

# Set your OpenAI key
export OPENAI_API_KEY=your-api-key

# Run the example
gptscript --cache=false hamlet-summarizer.gpt
```
35 changes: 35 additions & 0 deletions examples/hamlet-summarizer/hamlet-summarizer.gpt
@@ -0,0 +1,35 @@
tools: hamlet-summarizer, sys.read, sys.write

First, create the file "summary.txt" if it does not already exist.

You are a program that is tasked with fetching partial summaries of a play called Hamlet.

Call the hamlet-summarizer tool to get each part of the summary. Begin with index 0. Do not proceed
until the tool has responded to you.

Once you get "No more content" from the hamlet-summarizer, stop calling it.
Then, print the contents of the summary.txt file.

---
name: hamlet-summarizer
tools: hamlet-retriever, sys.read, sys.append
description: Summarizes a part of the text of Hamlet. Returns "No more content" if the index is greater than or equal to the number of parts.
args: index: (unsigned int) the index of the portion to summarize, beginning at 0

You are a theater expert, and you're tasked with summarizing part of Hamlet.
Get the part of Hamlet at index $index.
Read the existing summary of Hamlet up to this point in summary.txt.

Summarize the part at index $index. Include as many details as possible. Do not leave out any important plot points.
Do not introduce the summary with "In this part of Hamlet", "In this segment", or any similar language.
If a new character is introduced, be sure to explain who they are.
Add two newlines to the end of your summary and append it to summary.txt.

If you got "No more content" just say "No more content". Otherwise, say "Continue".

---
name: hamlet-retriever
description: Returns a part of the text of Hamlet. Returns "No more content" if the index is greater than or equal to the number of parts.
args: index: (unsigned int) the index of the part to return, beginning at 0

#!python3 main.py "$index"
24 changes: 24 additions & 0 deletions examples/hamlet-summarizer/main.py
@@ -0,0 +1,24 @@
import tiktoken
import sys
from llama_index.readers.file import PyMuPDFReader
from llama_index.core.node_parser import TokenTextSplitter

index = int(sys.argv[1])
docs = PyMuPDFReader().load("Hamlet.pdf")

combined = ""
for doc in docs:
    combined += doc.text

splitter = TokenTextSplitter(
    chunk_size=10000,
    chunk_overlap=10,
    tokenizer=tiktoken.encoding_for_model("gpt-4").encode)

pieces = splitter.split_text(combined)

if index >= len(pieces):
    print("No more content")
    sys.exit(0)

print(pieces[index])
3 changes: 3 additions & 0 deletions examples/hamlet-summarizer/requirements.txt
@@ -0,0 +1,3 @@
tiktoken==0.6.0
llama-index-core==0.10.14
llama-index-readers-file==0.1.6
