Dataset Sample | Test Sample |
---|---|
original.mp4 | validation.mp4 |
Now you can make Mochi-1 your own with `diffusers`, too 🤗 🧨
We provide a minimal and faithful reimplementation of the original Mochi-1 fine-tuner. As usual, we leverage `peft` for all things LoRA in our implementation.
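For orientation, the sketch below shows how LoRA adapters can be attached to the Mochi-1 transformer with `peft`; the rank and `target_modules` are illustrative assumptions, not the trainer's exact configuration.

```python
import torch
from diffusers import MochiTransformer3DModel
from peft import LoraConfig

# Load only the transformer; this is the module the LoRA adapters are trained on.
transformer = MochiTransformer3DModel.from_pretrained(
    "genmo/mochi-1-preview", subfolder="transformer", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=16,                                                 # rank: illustrative assumption
    lora_alpha=16,                                        # scaling: illustrative assumption
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # attention projections: assumption
)
transformer.add_adapter(lora_config)  # diffusers models accept peft configs via add_adapter
```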
Updates
December 1 2024: Support for checkpoint saving and loading.
Install the dependencies: `pip install -r requirements.txt`. Also make sure your `diffusers` installation is from the current `main`.
Download a demo dataset:
huggingface-cli download \
--repo-type dataset sayakpaul/video-dataset-disney-organized \
--local-dir video-dataset-disney-organized
The dataset follows the directory structure expected by the subsequent scripts. In particular, it follows what's prescribed here:
video_1.mp4
video_1.txt -- One-paragraph description of video_1
video_2.mp4
video_2.txt -- One-paragraph description of video_2
...
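Before preprocessing, a quick sanity check (not part of the repo) can confirm that every clip has a matching caption file; `video-dataset-disney-organized` below is the local directory from the download step.

```python
from pathlib import Path

dataset_dir = Path("video-dataset-disney-organized")

# Every video must sit next to a .txt file carrying its one-paragraph caption.
missing = [p.name for p in dataset_dir.glob("*.mp4") if not p.with_suffix(".txt").exists()]
if missing:
    raise ValueError(f"Missing captions for: {missing}")
print(f"Found {len(list(dataset_dir.glob('*.mp4')))} video-caption pairs.")
```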
Then run the following (be sure to adjust the paths in the script accordingly):
bash prepare_dataset.sh
We can adjust `num_frames` and `resolution`. By default, in `prepare_dataset.sh`, we use `--force_upsample`. This means if the original video resolution is smaller than the requested resolution, we will upsample the video.
Important
Use a resolution of at least 480x848 to satisfy Mochi-1's requirements.
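To make the `--force_upsample` behavior concrete, here is a rough, standalone illustration of upsampling a too-small clip to the 480x848 minimum with `torchvision`; it is not the actual implementation in `prepare_dataset.sh`, and the file names are placeholders.

```python
import torch
import torch.nn.functional as F
from torchvision.io import read_video, write_video

TARGET_H, TARGET_W = 480, 848  # Mochi-1's minimum resolution

frames, _, info = read_video("video_1.mp4", output_format="TCHW", pts_unit="sec")
_, _, h, w = frames.shape
if h < TARGET_H or w < TARGET_W:
    # Bilinearly upsample every frame to the target resolution.
    frames = F.interpolate(frames.float(), size=(TARGET_H, TARGET_W), mode="bilinear")
    frames = frames.clamp(0, 255).to(torch.uint8)

write_video("video_1_upsampled.mp4", frames.permute(0, 2, 3, 1), fps=round(info["video_fps"]))
```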
Now, we're ready to fine-tune. To launch, run:
bash train.sh
You can disable intermediate validation by removing the following arguments from `train.sh`:
- --validation_prompt "..." \
- --validation_prompt_separator ::: \
- --num_validation_videos 1 \
- --validation_epochs 1 \
We haven't tested this rigorously, but with validation disabled, the script should run in under 40 GB of GPU VRAM.
To use the LoRA checkpoint:
from diffusers import MochiPipeline
from diffusers.utils import export_to_video
import torch
pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview")
pipe.load_lora_weights("path-to-lora")
pipe.enable_model_cpu_offload()
pipeline_args = {
"prompt": "A black and white animated scene unfolds with an anthropomorphic goat surrounded by musical notes and symbols, suggesting a playful environment. Mickey Mouse appears, leaning forward in curiosity as the goat remains still. The goat then engages with Mickey, who bends down to converse or react. The dynamics shift as Mickey grabs the goat, potentially in surprise or playfulness, amidst a minimalistic background. The scene captures the evolving relationship between the two characters in a whimsical, animated setting, emphasizing their interactions and emotions",
"guidance_scale": 6.0,
"num_inference_steps": 64,
"height": 480,
"width": 848,
"max_sequence_length": 256,
"output_type": "np",
}
with torch.autocast("cuda", torch.bfloat16):
    video = pipe(**pipeline_args).frames[0]
export_to_video(video, "output.mp4", fps=30)
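To trim inference memory further, a couple of optional toggles can help; whether they are available depends on your `diffusers` version, so treat the calls below as assumptions rather than guarantees.

```python
# Decode the latent video in tiles to lower peak memory during VAE decoding.
pipe.enable_vae_tiling()

# Optionally bake the LoRA into the base weights so repeated calls skip the adapter hooks;
# the 1.0 scale here is just an illustrative choice.
pipe.fuse_lora(lora_scale=1.0)
pipe.unload_lora_weights()
```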
(Contributions are welcome 🤗)
Our script currently doesn't leverage `accelerate`; some consequences of that are detailed below:
- No support for distributed training.
- `train_batch_size > 1` is supported but can potentially lead to OOMs because we currently don't have gradient accumulation support (a rough sketch of the idea follows this list).
- No support for 8bit optimizers (but should be relatively easy to add).
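For reference, this is a minimal, generic sketch of gradient accumulation in plain PyTorch; it is not wired into the trainer, and the model, optimizer, and `accum_steps` below are placeholders.

```python
import torch

model = torch.nn.Linear(16, 16)            # placeholder for the actual transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 4                            # micro-batches accumulated per optimizer step

for step, batch in enumerate(torch.randn(32, 8, 16)):   # dummy micro-batches
    loss = model(batch).pow(2).mean() / accum_steps      # scale so gradients average out
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```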
Misc:
- We're aware of the quality issues in the `diffusers` implementation of Mochi-1. This is being fixed in this PR.
- The `embed.py` script is non-batched.