One Single Transformer to Unify Multimodal Understanding and Generation

Jinheng Xie¹* Weijia Mao¹* Zechen Bai¹* David Junhao Zhang¹*
Weihao Wang² Kevin Qinghong Lin¹ Yuchao Gu¹ Zhijie Chen² Zhenheng Yang² Mike Zheng Shou¹

¹ Show Lab, National University of Singapore ² Bytedance

News

[2024-10-15] Updated the arXiv paper to include new features and experimental results.
- Support image generation in a resolution of 512x512.
- Improve the multimodal understanding capabilities of purely discrete Show-o.
- Improve the performance on the GenEval benchmark.
- Explore the impact of dataset scale and image resolution on multimodal understanding capabilities of discrete image tokens. For more information, please refer to the paper.
- We release the weights of Show-o before fine-tuning on the LLaVA instruction-tuning datasets. You can fine-tune it following the configurations in ./configs.
[2024-09-12] arXiv paper updated to include preliminaries about discrete diffusion.
[2024-09-03] We deploy an online demo on Hugging Face Space. 🤗 Have fun!
[2024-09-02] We release the training code for pre-training and instruction tuning! 🔥🔥
[2024-09-01] Add FlexAttention implementation for acceleration. Thanks to @Horace for providing examples.
[2024-08-28] We maintain a repo of Awesome Unified Multimodal Models. If you are interested in unified models, star and watch it to get the latest updates!
[2024-08-27] Add integration to Hugging Face! Thanks to @NielsRogge.
[2024-08-26] We have set up two community platforms to facilitate discussion, requests, and collaboration! Reach us on Discord and WeChat!
[2024-08-23] We release the inference code of Show-o (1.3B) for multimodal understanding and generation including image captioning, visual question answering (VQA), text-to-image generation, text-guided inpainting and extrapolation.

What is new about Show-o?

Below is a comparison of the characteristics of understanding-only, generation-only, and unified (understanding & generation) models. Vision and Language indicate the representations from specific input modalities. In this context, Diffusion covers both continuous and discrete diffusion.

Below is an overview of Show-o. The input data, regardless of its modalities, is tokenized and then prompted into a formatted input sequence. Show-o processes text tokens autoregressively with causal attention and image tokens in (discrete) denoising diffusion modeling via full attention, and then generates the desired output. Specifically, Show-o is capable of handling image captioning, visual question answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed modality generation.
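The mixed attention pattern described above can be pictured as a boolean mask. The following is a minimal sketch, not the repository's implementation: the function name `build_mixed_attention_mask` and the span representation are hypothetical, and it assumes that every token attends causally to all earlier tokens while image tokens additionally attend to every token in their own image block.

```python
def build_mixed_attention_mask(seq_len, image_spans):
    """Return a seq_len x seq_len boolean mask.

    mask[q][k] is True if query position q may attend to key position k.
    image_spans is a list of half-open (start, end) index ranges that hold
    image tokens (a hypothetical representation chosen for this sketch).
    """
    # Start from a causal (lower-triangular) mask for every token.
    mask = [[k <= q for k in range(seq_len)] for q in range(seq_len)]
    # Inside each image block, allow full (bidirectional) attention.
    for start, end in image_spans:
        for q in range(start, end):
            for k in range(start, end):
                mask[q][k] = True
    return mask


if __name__ == "__main__":
    # Toy sequence: 3 text tokens followed by 4 image tokens.
    toy_mask = build_mixed_attention_mask(7, [(3, 7)])
    for row in toy_mask:
        print("".join("x" if allowed else "." for allowed in row))
```

In the toy sequence, the first three rows stay lower-triangular (text), while the last four rows can see the whole image block as well as the preceding text.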

TODO

Release the inference code.
Release the training code.
Support image generation in a resolution of 512x512.
Scale up the model size (based on LLaMA3) and increase the amount of training data.

Hugging Face models

The Show-o checkpoints can be found on Hugging Face:

showlab/show-o-512x512
showlab/show-o-w-clip-vit-512x512
showlab/show-o-512x512-wo-llava-tuning
showlab/show-o
showlab/show-o-w-clip-vit
showlab/magvitv2
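If you prefer to fetch a checkpoint ahead of time, the sketch below shows one possible way to do it with the `huggingface_hub` package. It is an optional illustration under stated assumptions: the helper name `download_checkpoint` is hypothetical, `huggingface_hub` must be installed separately, and the inference scripts may already download the weights on their own.

```python
# Repository IDs copied from the list above.
SHOW_O_REPOS = [
    "showlab/show-o-512x512",
    "showlab/show-o-w-clip-vit-512x512",
    "showlab/show-o-512x512-wo-llava-tuning",
    "showlab/show-o",
    "showlab/show-o-w-clip-vit",
    "showlab/magvitv2",
]


def download_checkpoint(repo_id, local_dir=None):
    """Download one of the listed repositories and return its local path.

    Hypothetical helper for illustration; it only wraps snapshot_download.
    """
    if repo_id not in SHOW_O_REPOS:
        raise ValueError(f"unknown Show-o repository: {repo_id!r}")
    # Imported lazily so the list above can be used without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


if __name__ == "__main__":
    print(download_checkpoint("showlab/show-o"))
```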

Getting Started

First, set up the environment:

pip3 install -r requirements.txt

Log in to your wandb account on your machine or server.

wandb login <your wandb keys>

Run the inference demo for multimodal understanding. You can view the results on wandb.

Option (c), with CLIP-ViT:

python3 inference_mmu.py config=configs/showo_demo_w_clip_vit_512x512.yaml \
max_new_tokens=100 \
mmu_image_root=./mmu_validation question='Please describe this image in detail. *** Do you think the image is unusual or not?'

Or option (a):

python3 inference_mmu.py config=configs/showo_demo_512x512.yaml \
max_new_tokens=100 \
mmu_image_root=./mmu_validation question='Please describe this image in detail. *** Do you think the image is unusual or not?'
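The commands above pass settings as `key=value` pairs and pack two questions into a single `question` string separated by ` *** `. The sketch below illustrates how such arguments could be read. It is an assumption-laden illustration, not the scripts' actual parsing code: the helper names `parse_overrides` and `split_questions` are hypothetical, and treating ` *** ` as a separator between questions is inferred from the example command.

```python
def parse_overrides(argv):
    """Turn ['key=value', ...] into a dict, splitting on the first '='."""
    overrides = {}
    for arg in argv:
        key, sep, value = arg.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {arg!r}")
        overrides[key] = value
    return overrides


def split_questions(question, separator=" *** "):
    """Split a packed question string into individual questions."""
    return [part.strip() for part in question.split(separator) if part.strip()]


if __name__ == "__main__":
    # The shell removes the quotes, so the script sees these raw strings.
    argv = [
        "config=configs/showo_demo_512x512.yaml",
        "max_new_tokens=100",
        "mmu_image_root=./mmu_validation",
        "question=Please describe this image in detail. *** "
        "Do you think the image is unusual or not?",
    ]
    overrides = parse_overrides(argv)
    for question in split_questions(overrides["question"]):
        print(question)
```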

Run the inference demo for text-to-image generation. You can view the results (at a resolution of 512x512) on wandb.

python3 inference_t2i.py config=configs/showo_demo_512x512.yaml \
batch_size=1 validation_prompts_file=validation_prompts/showoprompts.txt \
guidance_scale=5 generation_timesteps=50 \
mode='t2i'

Run the inference demo for text-guided inpainting. You can view the results (at a resolution of 256x256) on wandb.

python3 inference_t2i.py config=configs/showo_demo.yaml \
batch_size=1 \
guidance_scale=1.75 generation_timesteps=16 \
mode='inpainting' prompt='A blue sports car with sleek curves and tinted windows, parked on a bustling city street.' \
image_path=./inpainting_validation/bus.jpg inpainting_mask_path=./inpainting_validation/bus_mask.webp

Run the inference demo for text-guided extrapolation. You can view the results (at a resolution of 256x256) on wandb.

python3 inference_t2i.py config=configs/showo_demo.yaml \
batch_size=1 \
guidance_scale=1.75 generation_timesteps=16 \
mode='extrapolation' extra_direction='left *** left *** left *** right *** right *** right' offset=0 prompt='a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees. *** a serene natural landscape featuring a clear, blue lake surrounded by lush green trees.' \
image_path=./inpainting_validation/alpine_lake.jpg
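In the extrapolation command above, `extra_direction` lists six steps and `prompt` repeats the same text six times, both joined by ` *** `. Writing these strings by hand is error-prone, so the sketch below builds them programmatically. It assumes one prompt per direction, which is inferred from the example rather than confirmed, and the helper name `build_extrapolation_args` is hypothetical.

```python
def build_extrapolation_args(directions, prompt, separator=" *** "):
    """Build the extra_direction and prompt strings, reusing one prompt
    for every extrapolation step."""
    if not directions:
        raise ValueError("at least one direction is required")
    return {
        "extra_direction": separator.join(directions),
        "prompt": separator.join([prompt] * len(directions)),
    }


if __name__ == "__main__":
    args = build_extrapolation_args(
        ["left"] * 3 + ["right"] * 3,
        "a serene natural landscape featuring a clear, blue lake "
        "surrounded by lush green trees.",
    )
    # Print in key='value' form, ready to paste after inference_t2i.py.
    for key, value in args.items():
        print(f"{key}='{value}'")
```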

Training pipeline

Prepare your training data and change the data path in configs/xx.yaml.

Note that our training process is based on accelerate. Please make sure to configure accelerate for distributed training. We provide config examples below for (distributed) training on a single GPU or multiple GPUs.

├── accelerate_configs/ 
|   ├── multi_nodes (6x8 GPUs)
|   |   ├── ...
|   ├── 1_gpu.yaml
|   └── 8_gpu_deepspeed_zero2.yaml

Stage 1 - Pre-training on ImageNet-1K dataset. Change the data path to ImageNet-1K in configs/showo_pretraining_stage1.yaml. Note that we use internal packages to process the RefinedWeb dataset, so you must manually comment out the code related to language modeling in training/train.py or write a new dataloader.

accelerate launch --config_file path/to/your/accelerate_config --main_process_port=8888 training/train.py config=configs/showo_pretraining_stage1.yaml

Once trained, the checkpoint folder is structured as follows:

├── show-o-training-stage1/ 
|   ├── ...
|   ├── checkpoint-500000
|   └── config.yaml

Moving to the next stage is a bit cumbersome: create a new output folder for stage 2 (set in the YAML config), copy the latest stage 1 checkpoint into this folder, and rename it to checkpoint-0. Training for the next stage will then resume from it automatically. Apply the same procedure when resuming training in the following stages.

├── show-o-training-stage2/ 
|   └── checkpoint-0
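The hand-over between stages can also be scripted. The sketch below copies the highest-numbered `checkpoint-<step>` folder from one stage's output folder into the next stage's folder as `checkpoint-0`. It is only an illustration: the helper name `seed_next_stage` is hypothetical, and treating the highest step number as the "latest" checkpoint is an assumption.

```python
import re
import shutil
from pathlib import Path


def seed_next_stage(prev_output_dir, next_output_dir):
    """Copy the highest-numbered checkpoint-<step> folder of the previous
    stage into the next stage's output folder as checkpoint-0."""
    prev_output_dir = Path(prev_output_dir)
    next_output_dir = Path(next_output_dir)

    # Collect (step, path) pairs for folders named checkpoint-<integer>.
    checkpoints = []
    for path in prev_output_dir.iterdir():
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        if match and path.is_dir():
            checkpoints.append((int(match.group(1)), path))
    if not checkpoints:
        raise FileNotFoundError(f"no checkpoint-* folders in {prev_output_dir}")

    # Pick the folder with the largest step number.
    _, latest = max(checkpoints, key=lambda item: item[0])

    next_output_dir.mkdir(parents=True, exist_ok=True)
    target = next_output_dir / "checkpoint-0"
    # copytree raises FileExistsError if checkpoint-0 is already present,
    # which avoids silently overwriting an existing run.
    shutil.copytree(latest, target)
    return target


if __name__ == "__main__":
    print(seed_next_stage("show-o-training-stage1", "show-o-training-stage2"))
```

Remember that the next stage's output folder must match the one set in that stage's YAML config.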

Stage 2 - Pre-training on Image-Text dataset. The default dataloader is based on WebDataset. Change the data path in configs/showo_pretraining_stage2.yaml.

accelerate launch --config_file path/to/your/accelerate_config --main_process_port=8888 training/train.py config=configs/showo_pretraining_stage2.yaml

Stage 3 - Pre-training on High-quality Image-Text dataset. Change the data path in configs/showo_pretraining_stage3.yaml.

Copy the pre-trained weights to the output_dir (specified in the config):

├── show-o-training-stage3/ 
|   └── checkpoint-0

accelerate launch --config_file path/to/your/accelerate_config --main_process_port=8888 training/train.py config=configs/showo_pretraining_stage3.yaml

[Option a] Stage 3 - Instruction tuning on LLaVA dataset (llava-pretrain). Change the data path in llava/llava_data_vq_unified.py.

accelerate launch --config_file path/to/your/accelerate_config --main_process_port=8888 training/train.py config=configs/showo_instruction_tuning_1.yaml

[Option a] Stage 3 - Instruction tuning on LLaVA dataset (llava-tuning). Change the data path in llava/llava_data_vq_unified.py.

accelerate launch --config_file path/to/your/accelerate_config --main_process_port=8888 training/train.py config=configs/showo_instruction_tuning_2.yaml

[Option c] Stage 3 - Instruction tuning on LLaVA dataset (llava-pretrain) with CLIP-ViT. Change the data path in llava/llava_pretrain_data.py.

accelerate launch --config_file path/to/your/accelerate_config --main_process_port=8888 training/train_w_clip_vit.py config=configs/showo_instruction_tuning_1_w_clip_vit.yaml

[Option c] Stage 3 - Instruction tuning on LLaVA dataset (llava-tuning) with CLIP-ViT. Change the data path in llava/llava_instuct_data.py.

accelerate launch --config_file path/to/your/accelerate_config --main_process_port=8888 training/train_w_clip_vit.py config=configs/showo_instruction_tuning_2_w_clip_vit.yaml

Request new features? Willing to contribute?

We welcome your new ideas and contributions! If you would like to see a new feature in Show-o, or you want to contribute to this project, please fill in this form!

Pending Requested Features

Mixed-modal generation
Support training on more datasets
Visual tokenizer training

Find more at Contributing and Roadmap.

Join Discussion

You are welcome to join the discussion and help us continuously improve the user experience of Show-o. Reach us through this Discord channel or the WeChat QR code below!

Citation

To cite the paper and model, please use the BibTeX entry below:

@article{xie2024showo,
  title={Show-o: One Single Transformer to Unify Multimodal Understanding and Generation},
  author={Xie, Jinheng and Mao, Weijia and Bai, Zechen and Zhang, David Junhao and Wang, Weihao and Lin, Kevin Qinghong and Gu, Yuchao and Chen, Zhijie and Yang, Zhenheng and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2408.12528},
  year={2024}
}

Acknowledgments

This work is heavily based on open-muse, Phi-1.5, muse-maskgit-pytorch, maskgit, taming-transformers, transformers, accelerate, diffusers, and webdataset. Thanks to all the authors for their great work.
