
# Simple Mochi-1 finetuner

| Dataset Sample | Test Sample |
| --- | --- |
| `original.mp4` | `validation.mp4` |

Now you can make Mochi-1 your own with diffusers, too 🤗 🧨

We provide a minimal and faithful reimplementation of the original Mochi-1 fine-tuner. As usual, we leverage `peft` for all things LoRA in our implementation.

## Updates

**December 1, 2024**: Support for checkpoint saving and loading.

## Getting started

Install the dependencies: `pip install -r requirements.txt`. Also make sure your `diffusers` installation is from the current `main`.

Download a demo dataset:

```bash
huggingface-cli download \
  --repo-type dataset sayakpaul/video-dataset-disney-organized \
  --local-dir video-dataset-disney-organized
```
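
If you prefer to stay in Python, the same download can be done with `huggingface_hub` (a minimal sketch mirroring the CLI command above):

```python
# Optional: download the demo dataset from Python instead of the CLI.
# Minimal sketch mirroring the huggingface-cli command above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="sayakpaul/video-dataset-disney-organized",
    repo_type="dataset",
    local_dir="video-dataset-disney-organized",
)
```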

The dataset follows the directory structure expected by the subsequent scripts. In particular, it follows the layout shown below:

```
video_1.mp4
video_1.txt -- One-paragraph description of video_1
video_2.mp4
video_2.txt -- One-paragraph description of video_2
...
```
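
To quickly verify that a dataset follows this layout, you can check that every video has a matching caption file. A small illustrative sketch (the directory name is the one from the download step above):

```python
# Sanity-check the dataset layout: every .mp4 should have a sibling .txt caption.
# Illustrative sketch; adjust the directory name if you downloaded elsewhere.
from pathlib import Path

dataset_dir = Path("video-dataset-disney-organized")
videos = sorted(dataset_dir.glob("*.mp4"))
missing = [v.name for v in videos if not v.with_suffix(".txt").exists()]

print(f"{len(videos)} videos found; {len(missing)} missing captions: {missing}")
```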

Then run (be sure to check the paths accordingly):

```bash
bash prepare_dataset.sh
```

We can adjust `num_frames` and `resolution`. By default, in `prepare_dataset.sh`, we use `--force_upsample`. This means that if the original video resolution is smaller than the requested resolution, we will upsample the video.

> [!IMPORTANT]
> It's important to have a resolution of at least 480x848 to satisfy Mochi-1's requirements.
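
If you're unsure whether a clip meets that minimum, checking its frame size is straightforward. A minimal sketch assuming `opencv-python` is installed (the file path is illustrative):

```python
# Check that a clip is at least 480x848 (height x width), as Mochi-1 expects.
# Minimal sketch assuming opencv-python is installed; the path is illustrative.
import cv2

cap = cv2.VideoCapture("video-dataset-disney-organized/video_1.mp4")
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
cap.release()

print(f"{width}x{height}:", "OK" if (height >= 480 and width >= 848) else "below the minimum")
```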

Now, we're ready to fine-tune. To launch, run:

```bash
bash train.sh
```

You can disable intermediate validation by removing the following flags from `train.sh`:

```diff
- --validation_prompt "..." \
- --validation_prompt_separator ::: \
- --num_validation_videos 1 \
- --validation_epochs 1 \
```

We haven't tested this rigorously, but with validation disabled, this script should run in under 40 GB of GPU VRAM.

To use the LoRA checkpoint:

```python
from diffusers import MochiPipeline
from diffusers.utils import export_to_video
import torch

pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview")
pipe.load_lora_weights("path-to-lora")
pipe.enable_model_cpu_offload()

pipeline_args = {
    "prompt": "A black and white animated scene unfolds with an anthropomorphic goat surrounded by musical notes and symbols, suggesting a playful environment. Mickey Mouse appears, leaning forward in curiosity as the goat remains still. The goat then engages with Mickey, who bends down to converse or react. The dynamics shift as Mickey grabs the goat, potentially in surprise or playfulness, amidst a minimalistic background. The scene captures the evolving relationship between the two characters in a whimsical, animated setting, emphasizing their interactions and emotions",
    "guidance_scale": 6.0,
    "num_inference_steps": 64,
    "height": 480,
    "width": 848,
    "max_sequence_length": 256,
    "output_type": "np",
}

with torch.autocast("cuda", torch.bfloat16):
    video = pipe(**pipeline_args).frames[0]
export_to_video(video, "output.mp4", fps=30)
```
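
If the LoRA effect is too strong or too weak, its strength can be scaled through the `peft` integration in `diffusers`. A minimal sketch, replacing the plain `load_lora_weights` call above (the adapter name `mochi-lora` is just an illustrative label):

```python
# Optional: control the LoRA strength via the peft integration in diffusers.
# "mochi-lora" is an arbitrary adapter name used for illustration.
pipe.load_lora_weights("path-to-lora", adapter_name="mochi-lora")
pipe.set_adapters(["mochi-lora"], adapter_weights=[0.8])
```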

## Known limitations

(Contributions are welcome 🤗)

Our script currently doesn't leverage `accelerate`; some consequences of this are detailed below:

* No support for distributed training.
* `train_batch_size > 1` is supported but can potentially lead to OOMs because we currently don't have gradient accumulation support.
* No support for 8-bit optimizers (but this should be relatively easy to add; see the sketch after this list).
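
For the 8-bit optimizer point above, the change would roughly amount to constructing the optimizer with `bitsandbytes` instead of `torch.optim`. This is a hedged sketch, not something the script currently does; the parameter group and learning rate below are placeholders:

```python
# Rough sketch of swapping in an 8-bit optimizer via bitsandbytes.
# Not part of the current script; the names below are placeholders.
import bitsandbytes as bnb

optimizer = bnb.optim.AdamW8bit(
    lora_parameters,  # placeholder: the trainable LoRA parameters
    lr=1e-4,          # placeholder learning rate
    weight_decay=0.01,
)
```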

Misc:

* We're aware of the quality issues in the `diffusers` implementation of Mochi-1. This is being fixed in this PR.
* The `embed.py` script is non-batched.