VQGAN remains a cornerstone of autoregressive visual generation, but its limited codebook size and low codebook utilization constrain its capabilities. MAGVIT2 addresses these issues with a lookup-free quantization technique and a super-large codebook ($2^{18}$ = 262,144 entries).
In our codebase, we have re-implemented the MAGVIT2 tokenizer in PyTorch, closely replicating the original results. We hope our efforts will foster innovation and creativity in the field of autoregressive visual generation. 💚
- [2024.06.17] 🔥🔥🔥 We release the training code of the image tokenizer and checkpoints for different resolutions, achieving state-of-the-art performance (0.39 rFID for 8x downsampling) compared to VQGAN, MaskGIT, and the recent TiTok, LlamaGen, and OmniTokenizer.
- Better image tokenizer with scale-up training.
- Finalize the training of the autoregressive model.
- Video tokenizer and the corresponding autoregressive model.
🤗 Open-MAGVIT2 is still at an early stage and under active development. Stay tuned for updates!
Figure 1. The framework of the Open-MAGVIT2 tokenizer, composed of an encoder, a lookup-free quantizer (LFQ), and a decoder.
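For readers unfamiliar with lookup-free quantization, the sketch below illustrates the core idea from the MAGVIT2 paper: each latent channel is quantized to ±1 by its sign, so an 18-dimensional latent implicitly indexes a $2^{18}$ codebook without any embedding table. This is only a minimal illustration (the module name and shapes are ours, and the entropy/commitment losses are omitted), not the implementation used in this repo.

```python
import torch
import torch.nn as nn


class LookupFreeQuantizer(nn.Module):
    """Minimal LFQ sketch: quantize each latent channel to {-1, +1} by its sign.

    With 18 channels, every spatial position implicitly indexes one of
    2^18 = 262,144 codes, so no embedding lookup table is needed.
    """

    def __init__(self, codebook_bits: int = 18):
        super().__init__()
        self.codebook_bits = codebook_bits
        # Powers of two that turn per-channel bits into an integer code index.
        self.register_buffer("basis", 2 ** torch.arange(codebook_bits))

    def forward(self, z: torch.Tensor):
        # z: (B, C, H, W) with C == codebook_bits
        assert z.shape[1] == self.codebook_bits
        q = torch.where(z > 0, torch.ones_like(z), -torch.ones_like(z))
        q = z + (q - z).detach()  # straight-through estimator for the encoder gradient
        bits = (q > 0).long()  # (B, C, H, W) in {0, 1}
        indices = (bits * self.basis.view(1, -1, 1, 1)).sum(dim=1)  # (B, H, W)
        return q, indices


if __name__ == "__main__":
    lfq = LookupFreeQuantizer(codebook_bits=18)
    z = torch.randn(2, 18, 16, 16)  # e.g. a 16 x 16 token grid
    q, idx = lfq(z)
    print(q.shape, idx.shape, int(idx.max()) < 2 ** 18)
```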
- Env: We have tested with `Python 3.8.8` and `CUDA 11.7` (other versions may also work).
- Dependencies: `pip install -r requirements.txt`
- Datasets: prepare ImageNet with the standard folder layout (a minimal loading sketch follows the tree):

```
imagenet
└── train/
    ├── n01440764
    │   ├── n01440764_10026.JPEG
    │   ├── n01440764_10027.JPEG
    │   ├── ...
    ├── n01443537
    ├── ...
└── val/
    ├── ...
```
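With this layout, the training split can be loaded with a standard torchvision `ImageFolder` pipeline. The snippet below is a minimal sketch with illustrative transform values; the actual augmentation and data pipeline are defined in the config files.

```python
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

# Illustrative 256x256 preprocessing; the repo's configs define the real augmentation.
transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(256),
    T.ToTensor(),
    T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # map images to [-1, 1]
])

train_set = ImageFolder("imagenet/train", transform=transform)
train_loader = DataLoader(train_set, batch_size=8, shuffle=True, num_workers=4)

images, _ = next(iter(train_loader))
print(images.shape)  # torch.Size([8, 3, 256, 256])
```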
We follow the generator design of MAGVIT-2 but use a PatchGAN discriminator instead of StyleGAN for GAN training. We combine the losses used in MAGVIT-2 and VQGAN for better training stability and reconstruction quality. All training details can be found in the config files. Note that we train our model using 32 GPUs.
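As a rough illustration of how such a combined objective is typically assembled, here is a hedged sketch of a VQGAN/MAGVIT-2-style generator loss. The function name and weights are placeholders, and the perceptual term is omitted; the actual terms and weights live in the config files.

```python
import torch.nn.functional as F


def tokenizer_generator_loss(x, x_rec, disc_logits_fake, quant_loss,
                             w_gan=0.1, w_quant=0.25):
    """Sketch of a VQGAN/MAGVIT-2-style combined generator objective.

    x, x_rec         : original and reconstructed images
    disc_logits_fake : PatchGAN logits for the reconstructions
    quant_loss       : quantizer term (e.g. LFQ commitment/entropy losses)
    The weights are placeholders, and the perceptual (LPIPS) term is omitted.
    """
    rec_loss = F.l1_loss(x_rec, x)          # pixel-level reconstruction
    gan_loss = -disc_logits_fake.mean()     # hinge-style generator adversarial term
    return rec_loss + w_gan * gan_loss + w_quant * quant_loss
```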
Table 1. Reconstruction performance of different tokenizers on $256\times 256$ ImageNet.
| Method | Token Type | #Tokens | Train Data | Codebook Size | rFID | PSNR | Codebook Utilization | Checkpoint |
|---|---|---|---|---|---|---|---|---|
| VQGAN | 2D | 16 × 16 | 256 × 256 ImageNet | 1024 | 7.94 | 19.4 | - | - |
| MaskGIT | 2D | 16 × 16 | 256 × 256 ImageNet | 1024 | 2.28 | - | - | - |
| LlamaGen | 2D | 16 × 16 | 256 × 256 ImageNet | 16384 | 2.19 | 20.79 | 97% | - |
| 🔥Open-MAGVIT2 | 2D | 16 × 16 | 256 × 256 ImageNet | 262144 | 1.53 | 21.53 | 100% | IN256_Base |
| ViT-VQGAN | 2D | 32 × 32 | 256 × 256 ImageNet | 8192 | 1.28 | - | - | - |
| VQGAN | 2D | 32 × 32 | OpenImages | 16384 | 1.19 | 23.38 | - | - |
| OmniTokenizer-VQ | 2D | 32 × 32 | 256 × 256 ImageNet | 8192 | 1.11 | - | - | - |
| LlamaGen | 2D | 32 × 32 | 256 × 256 ImageNet | 16384 | 0.59 | 24.45 | - | - |
| 🔥Open-MAGVIT2* | 2D | 32 × 32 | 128 × 128 ImageNet | 262144 | 0.39 | 25.78 | 100% | IN128_Base |
| TiTok-L | 1D | 32 | 256 × 256 ImageNet | 4096 | 2.21 | - | - | - |
| TiTok-B | 1D | 64 | 256 × 256 ImageNet | 4096 | 1.70 | - | - | - |
| TiTok-S | 1D | 128 | 256 × 256 ImageNet | 4096 | 1.71 | - | - | - |
(*) denotes that the results are from direct inference using the model trained at $128\times 128$ resolution while testing at $256\times 256$ resolution.
Table 2. Comparison with the original MAGVIT2, with both models trained and tested on $128\times 128$ ImageNet.
| Method | Token Type | #Tokens | Data | LFQ | Large Codebook | Up/Down Sampler | rFID | URL |
|---|---|---|---|---|---|---|---|---|
| MAGVIT2 | 2D | 16 × 16 | 128 × 128 ImageNet | √ | √ | √ | 1.21 | - |
| Open-MAGVIT2 | 2D | 16 × 16 | 128 × 128 ImageNet | √ | √ | √ | 1.56 | IN128_Base |
Figure 2. Visualization of the Open-MAGVIT2 tokenizer trained at $256\times 256$ resolution (imagenet_256_Base version). (a) shows the original images and (b) the reconstructed images.
Figure 3. Visualization of the Open-MAGVIT2 tokenizer trained at $128\times 128$ resolution (imagenet_128_Base version). (a) shows the original images and (b) the reconstructed images.
- $128\times 128$ Tokenizer Training

```bash
bash run_B_128.sh
```

- $256\times 256$ Tokenizer Training

```bash
bash run_B_256.sh
```
MAGVIT2 utilizes a non-autoregressive transformer for image generation. Instead, we would like to exploit the potential of autoregressive visual generation with the relatively large codebook. We are currently exploring Stage II training.
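As an illustration of what Stage II amounts to (a sketch only, not the final design), the tokenizer's flattened code indices can be modeled with a standard next-token-prediction objective:

```python
import torch
import torch.nn.functional as F


def next_token_loss(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """Standard next-token prediction over the tokenizer's discrete codes.

    logits    : (B, L, V) transformer outputs; V can be as large as 262,144
    token_ids : (B, L) code indices produced by the tokenizer for one image
    """
    # Predict token t from tokens < t (teacher forcing).
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        token_ids[:, 1:].reshape(-1),
    )
```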
We thank Lijun Yu for his encouraging discussions. Our implementation draws heavily on VQGAN and MAGVIT. Thanks for their wonderful work.
If you found the codebase helpful, please cite it.
@software{Luo_Open-MAGVIT2_2024,
author = {Luo, Zhuoyan and Shi, Fengyuan and Ge, Yixiao},
month = jun,
title = {{Open-MAGVIT2}},
url = {https://github.com/TencentARC/Open-MAGVIT2},
version = {1.0},
year = {2024}
}
@inproceedings{
yu2024language,
title={Language Model Beats Diffusion - Tokenizer is key to visual generation},
author={Lijun Yu and Jose Lezama and Nitesh Bharadwaj Gundavarapu and Luca Versari and Kihyuk Sohn and David Minnen and Yong Cheng and Agrim Gupta and Xiuye Gu and Alexander G Hauptmann and Boqing Gong and Ming-Hsuan Yang and Irfan Essa and David A Ross and Lu Jiang},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=gzqrANCF4g}
}