This repo is the official implementation of "MM-StoryAgent: Immersive Narrated Storybook Video Generation with a Multi-Agent Paradigm across Text, Image and Audio".
MM-StoryAgent is a multi-agent framework that employs LLMs and diverse expert tools across several modalities to produce expressive storytelling videos. Its highlights are:
- MM-StoryAgent designs a reliable and customizable workflow. Users can define their own expert tools to improve the generation quality of each component.
- MM-StoryAgent writes high-quality stories based on the input story setting, in a multi-agent, multi-stage pipeline.
- The assets generated by the agents of each modality (image, speech, sound, music) are composed into an immersive storytelling video.
In addition, we provide a story topic list and story evaluation criteria for evaluating story writing.
- Aug 16, 2024: The initial version of MM-StoryAgent was released.
The demo video is available:
Install the required dependencies and install this repo as a package:
pip install -r requirements.txt
pip install -e .
MM-StoryAgent is run from a configuration file:
python run.py -c configs/mm_story_agent.yaml
Each agent is called in the following format:
story_writer: # agent name
  tool: qa_outline_story_writer # name registered in the definition
  cfg: # parameters for initializing the agent instance
    max_conv_turns: 3
    ...
  params: # parameters for calling the agent instance
    story_topic: "Time Management: A child learning how to manage their time effectively."
    ...
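As a rough illustration of the entry format above, the following sketch checks that each agent entry in an already-parsed configuration carries the three keys shown (`tool`, `cfg`, `params`). The helper `check_agent_entries`, the `image_generation` entry, and the tool name `some_image_tool` are assumptions made for this example and are not part of MM-StoryAgent; the actual loading logic in run.py may behave differently.

```python
from typing import Any, Dict, List

REQUIRED_KEYS = ("tool", "cfg", "params")


def check_agent_entries(config: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return a list of problems found in a parsed config (hypothetical helper)."""
    problems = []
    for agent_name, entry in config.items():
        for key in REQUIRED_KEYS:
            if key not in entry:
                problems.append(f"{agent_name}: missing '{key}'")
    return problems


# A parsed configuration, written as a plain dict to avoid a YAML dependency.
example_config = {
    "story_writer": {
        "tool": "qa_outline_story_writer",
        "cfg": {"max_conv_turns": 3},
        "params": {
            "story_topic": "Time Management: A child learning how to manage their time effectively."
        },
    },
    # Hypothetical entry that deliberately omits 'params'.
    "image_generation": {"tool": "some_image_tool", "cfg": {}},
}

problems = check_agent_entries(example_config)
print(problems)  # ["image_generation: missing 'params'"]
```

A check like this is only a convenience for catching malformed entries early; it says nothing about whether the named tool is actually registered.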
To customize a new agent, refer to music_agent.py. The agent class should implement __init__ and call to work properly, like the following:
from typing import Dict

from mm_story_agent.base import register_tool


@register_tool("my_speech_agent")
class MySpeechAgent:

    def __init__(self, cfg: Dict):
        # For example, the agent needs `attr1` and `attr2` for initialization
        self.attr1 = cfg["attr1"]
        self.attr2 = cfg["attr2"]
        ...

    def call(self, params: Dict):
        # For example, calling the agent needs `voice` and `speed` parameters
        voice = params["voice"]
        speed = params["speed"]
        ...
The agent can then be used by changing the configuration, for example:
speech_generation:
  tool: my_speech_agent
  cfg:
    attr1: val1
    attr2: val2
  params:
    voice: en_female
    speed: 1.0
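To make the link between the registered name and the `tool` field in the configuration concrete, here is a minimal, self-contained sketch of how a name-based registry could dispatch a config entry to an agent class. It does not import mm_story_agent: the `TOOL_REGISTRY` dict, the local `register_tool` decorator, and `run_agent` are stand-ins written for this example, and the actual implementation in mm_story_agent.base may differ.

```python
from typing import Any, Callable, Dict

# Hypothetical stand-in for the registry; not the real mm_story_agent.base code.
TOOL_REGISTRY: Dict[str, type] = {}


def register_tool(name: str) -> Callable[[type], type]:
    """Record a class under `name` so it can be looked up from a config."""
    def decorator(cls: type) -> type:
        TOOL_REGISTRY[name] = cls
        return cls
    return decorator


@register_tool("my_speech_agent")
class MySpeechAgent:
    def __init__(self, cfg: Dict[str, Any]):
        self.attr1 = cfg["attr1"]
        self.attr2 = cfg["attr2"]

    def call(self, params: Dict[str, Any]) -> str:
        # A real agent would synthesize audio; this one only reports its inputs.
        return f"voice={params['voice']} speed={params['speed']} attr1={self.attr1}"


def run_agent(entry: Dict[str, Any]) -> Any:
    """Instantiate the class named by entry['tool'] and call it with entry['params']."""
    agent_cls = TOOL_REGISTRY[entry["tool"]]
    agent = agent_cls(entry["cfg"])
    return agent.call(entry["params"])


# The same entry as in the configuration above, as a parsed dict.
entry = {
    "tool": "my_speech_agent",
    "cfg": {"attr1": "val1", "attr2": "val2"},
    "params": {"voice": "en_female", "speed": 1.0},
}
result = run_agent(entry)
print(result)  # voice=en_female speed=1.0 attr1=val1
```

In this sketch, `cfg` is consumed once at construction time and `params` on every call, which mirrors the split between the two sections of the configuration entry.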
The evaluation topics are provided in story_topics.json. Evaluation rubrics and prompts are also provided accordingly.
We use GPT-4 to automatically evaluate story quality on several aspects. We compare our story writing agent against directly prompting an LLM to write stories. The evaluation scores show the advantage of our multi-agent, multi-stage story writing pipeline.
Rubric Grading | Method | Attractiveness | Warmth | Education | Average
---|---|---|---|---|---
Topic 1: Self-growing | Direct | 3.68 | 4.42 | 4.84 | 4.31
Topic 1: Self-growing | Story Agent | 4.1 | 4.5 | 4.80 | 4.47
Topic 2: Family & Friendship | Direct | 3.94 | 5.0 | 4.72 | 4.55
Topic 2: Family & Friendship | Story Agent | 4.36 | 4.8 | 4.92 | 4.69
Topic 3: Environments | Direct | 4.0 | 4.62 | 4.92 | 4.51
Topic 3: Environments | Story Agent | 4.44 | 4.68 | 4.86 | 4.66
Topic 4: Knowledge Learning | Direct | 4.46 | 4.14 | 4.86 | 4.49
Topic 4: Knowledge Learning | Story Agent | 4.84 | 4.52 | 4.90 | 4.75
All | Direct | 4.02 | 4.55 | 4.84 | 4.47
All | Story Agent | 4.44 | 4.63 | 4.87 | 4.65
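
The Average column is consistent with the arithmetic mean of the three rubric scores, rounded to two decimals. The short check below reproduces the overall ("All") averages from the table; the variable and function names are only illustrative.

```python
# Overall ("All") rubric scores copied from the table above.
scores = {
    "Direct": {"Attractiveness": 4.02, "Warmth": 4.55, "Education": 4.84},
    "Story Agent": {"Attractiveness": 4.44, "Warmth": 4.63, "Education": 4.87},
}


def mean_score(rubric):
    """Arithmetic mean of the rubric scores, rounded to two decimals."""
    return round(sum(rubric.values()) / len(rubric), 2)


averages = {method: mean_score(rubric) for method, rubric in scores.items()}
print(averages)  # {'Direct': 4.47, 'Story Agent': 4.65}
```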