diff --git a/DreamBooth_files/bibtex.txt b/DreamBooth_files/bibtex.txt
deleted file mode 100644
index e9e3034..0000000
--- a/DreamBooth_files/bibtex.txt
+++ /dev/null
@@ -1,6 +0,0 @@
-@article{ruiz2022dreambooth,
- title={DreamBooth: Fine Tuning Text-to-image Diffusion Models for Subject-Driven Generation},
- author={Ruiz, Nataniel and Li, Yuanzhen and Jampani, Varun and Pritch, Yael and Rubinstein, Michael and Aberman, Kfir},
- booktitle={arXiv preprint arxiv:2208.12242},
- year={2022}
-}
diff --git a/DreamBooth_files/.DS_Store b/LLM_Diff_files/.DS_Store
similarity index 100%
rename from DreamBooth_files/.DS_Store
rename to LLM_Diff_files/.DS_Store
diff --git a/DreamBooth_files/0l11u2HzQrQ.html b/LLM_Diff_files/0l11u2HzQrQ.html
similarity index 100%
rename from DreamBooth_files/0l11u2HzQrQ.html
rename to LLM_Diff_files/0l11u2HzQrQ.html
diff --git a/DreamBooth_files/ad_status.js b/LLM_Diff_files/ad_status.js
similarity index 100%
rename from DreamBooth_files/ad_status.js
rename to LLM_Diff_files/ad_status.js
diff --git a/DreamBooth_files/base.js b/LLM_Diff_files/base.js
similarity index 100%
rename from DreamBooth_files/base.js
rename to LLM_Diff_files/base.js
diff --git a/LLM_Diff_files/bibtex.txt b/LLM_Diff_files/bibtex.txt
new file mode 100644
index 0000000..de7dbe1
--- /dev/null
+++ b/LLM_Diff_files/bibtex.txt
@@ -0,0 +1,6 @@
+@article{tan2024llmdiffusion,
+ title={An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation},
+ author={Tan, Zhiyu and Yang, Mengping and Yang, Hao and Qian, Ye and Qin, Luozheng and Zhang, Cheng and Li, Hao},
+ journal={arXiv preprint arXiv:2405.xxxxx},
+ year={2024}
+}
diff --git a/DreamBooth_files/cast_sender(1).js b/LLM_Diff_files/cast_sender(1).js
similarity index 100%
rename from DreamBooth_files/cast_sender(1).js
rename to LLM_Diff_files/cast_sender(1).js
diff --git a/DreamBooth_files/cast_sender.js b/LLM_Diff_files/cast_sender.js
similarity index 100%
rename from DreamBooth_files/cast_sender.js
rename to LLM_Diff_files/cast_sender.js
diff --git a/DreamBooth_files/comparison.png b/LLM_Diff_files/comparison.png
similarity index 100%
rename from DreamBooth_files/comparison.png
rename to LLM_Diff_files/comparison.png
diff --git a/DreamBooth_files/embed.js b/LLM_Diff_files/embed.js
similarity index 100%
rename from DreamBooth_files/embed.js
rename to LLM_Diff_files/embed.js
diff --git a/DreamBooth_files/fetch-polyfill.js b/LLM_Diff_files/fetch-polyfill.js
similarity index 100%
rename from DreamBooth_files/fetch-polyfill.js
rename to LLM_Diff_files/fetch-polyfill.js
diff --git a/DreamBooth_files/framework.png b/LLM_Diff_files/framework.png
similarity index 100%
rename from DreamBooth_files/framework.png
rename to LLM_Diff_files/framework.png
diff --git a/DreamBooth_files/longprompt.png b/LLM_Diff_files/longprompt.png
similarity index 100%
rename from DreamBooth_files/longprompt.png
rename to LLM_Diff_files/longprompt.png
diff --git a/DreamBooth_files/multilingual.png b/LLM_Diff_files/multilingual.png
similarity index 100%
rename from DreamBooth_files/multilingual.png
rename to LLM_Diff_files/multilingual.png
diff --git a/DreamBooth_files/remote.js b/LLM_Diff_files/remote.js
similarity index 100%
rename from DreamBooth_files/remote.js
rename to LLM_Diff_files/remote.js
diff --git a/DreamBooth_files/stages.png b/LLM_Diff_files/stages.png
similarity index 100%
rename from DreamBooth_files/stages.png
rename to LLM_Diff_files/stages.png
diff --git a/DreamBooth_files/style.css b/LLM_Diff_files/style.css
similarity index 100%
rename from DreamBooth_files/style.css
rename to LLM_Diff_files/style.css
diff --git a/DreamBooth_files/tUR9jtOhcuN8qeoeXnRQGExMe9QeBdn6F7LXrdB4oNs.js b/LLM_Diff_files/tUR9jtOhcuN8qeoeXnRQGExMe9QeBdn6F7LXrdB4oNs.js
similarity index 100%
rename from DreamBooth_files/tUR9jtOhcuN8qeoeXnRQGExMe9QeBdn6F7LXrdB4oNs.js
rename to LLM_Diff_files/tUR9jtOhcuN8qeoeXnRQGExMe9QeBdn6F7LXrdB4oNs.js
diff --git a/DreamBooth_files/teaser1.png b/LLM_Diff_files/teaser1.png
similarity index 100%
rename from DreamBooth_files/teaser1.png
rename to LLM_Diff_files/teaser1.png
diff --git a/DreamBooth_files/www-embed-player.js b/LLM_Diff_files/www-embed-player.js
similarity index 100%
rename from DreamBooth_files/www-embed-player.js
rename to LLM_Diff_files/www-embed-player.js
diff --git a/DreamBooth_files/www-player.css b/LLM_Diff_files/www-player.css
similarity index 100%
rename from DreamBooth_files/www-player.css
rename to LLM_Diff_files/www-player.css
diff --git a/index.html b/index.html
index ae7b000..2bbd3f9 100644
--- a/index.html
+++ b/index.html
@@ -2,39 +2,41 @@
 An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation
-
-
-
+
+
+

An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation

-

- Zhiyu Tan1,
- Mengping Yang1,
- Hao Yang1 ,
+

+
+ Zhiyu Tan1
+ Mengping Yang1
+ Hao Yang1
+ Ye Qian1
+ Luozheng Qin2
+ Cheng Zhang3
+ Hao Li4†
+
- Ye Qian1,
- Luozheng Qin2,
- Cheng Zhang3,
- Hao Li4†
-
-
- 1InfTech,
- 2Soochow University,
- 3Carnegie Mellon University,
- 4Fudan University,
+
+ 1InfTech
+ 2Soochow University
+ 3Carnegie Mellon University
+ 4Fudan University
+
+
+Corresponding author & Project lead +


-
-Corresponding author & Project leader
-

-
+

- [Arxiv]
- [Code]
- [BibTeX]
+ [Arxiv]
+ [Code]
+ [BibTeX]

@@ -45,7 +47,7 @@

Abstract

Method


-
+

The main idea of our method is a lightweight but effective adapter module that aligns the text features of LLMs with those of the visually aware CLIP encoder. In this way, LLMs can capture the visual clues contained in input prompts and thereby drive text-to-image diffusion models to produce appropriate images. Specifically, we decompose the training procedure into three distinct stages. First, we adapt the features of LLMs to the diffusion training process by aligning them with those of CLIP models; only the adapter is optimized in this stage. Second, we improve synthesis quality through end-to-end text-image training. Third, the aesthetic appeal of the generated images is enhanced by further fine-tuning on a carefully curated dataset. In this way, the textual representation capabilities of LLMs are fully activated, and model performance improves substantially in terms of text alignment, synthesis quality, and image aesthetics. Notably, our model is trained with a fraction of the resources required by most text-to-image diffusion models while achieving superior synthesis quality and supporting multilingual input.
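The stage-1 alignment described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's implementation: the adapter architecture (a two-layer MLP), the feature dimensions (4096 for the LLM, 768 for CLIP), and the MSE alignment loss are all assumptions for the sake of the example.

```python
import torch
import torch.nn as nn

class LLMToCLIPAdapter(nn.Module):
    """Hypothetical stage-1 adapter: projects frozen-LLM token features
    into the CLIP text-feature space. Dimensions are illustrative."""
    def __init__(self, llm_dim=4096, clip_dim=768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, clip_dim),
            nn.GELU(),
            nn.Linear(clip_dim, clip_dim),
        )

    def forward(self, llm_feats):
        # llm_feats: (batch, seq_len, llm_dim) from a frozen LLM
        return self.proj(llm_feats)

# Stage 1: only the adapter's parameters receive gradients; the LLM and
# CLIP encoders stay frozen, so stand-in tensors are used here.
adapter = LLMToCLIPAdapter()
llm_feats = torch.randn(2, 77, 4096)   # placeholder for LLM text features
clip_feats = torch.randn(2, 77, 768)   # placeholder for CLIP text features
loss = nn.functional.mse_loss(adapter(llm_feats), clip_feats)
loss.backward()
```

Stages 2 and 3 would then reuse the trained adapter while unfreezing the diffusion model for end-to-end text-image training and aesthetic fine-tuning.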

To verify the effectiveness of our proposed model, we conduct extensive empirical investigations on both English and Chinese prompt datasets; our model achieves favourable zero-shot FID, CLIP-s, and Aes scores under various settings. Moreover, user studies demonstrate that our model produces images preferred by human evaluators. Furthermore, comprehensive ablation studies on the three training stages fully confirm the effectiveness of the proposed training pipeline.

@@ -53,33 +55,33 @@

Method

Results

Our proposed model not only produces images with high visual quality from English input prompts (left), but also offers multilingual understanding for T2I generation driven by various languages (middle), and grasps much longer contextual information for generation (right).

-
+

Multilingual T2I Generation

Surprisingly, our model understands these texts well and generates images with the corresponding captions. This feature indicates that our model successfully integrates the powerful language understanding ability of LLMs into the T2I generation process and fully exploits the potential of LLMs.


-
+

Long Prompt T2I Generation

Our model captures the meaning of prompts much longer than 77 tokens and synthesizes images that align well with those prompts, whereas prior methods usually fail in this setting. This further reflects the powerful language understanding capability and synthesis quality of our method.


-
+

Comparison with Other Baselines

Qualitative comparison of our model and competing methods. For models that do not support multilingual text conditions, we translate the given prompts into the corresponding language before generating images. Our proposed method produces images with better synthesis quality, more accurate text-image alignment, and higher visual quality.


-
+

BibTex

- @article{ruiz2022dreambooth,
-   title={DreamBooth: Fine Tuning Text-to-image Diffusion Models for Subject-Driven Generation},
-   author={Ruiz, Nataniel and Li, Yuanzhen and Jampani, Varun and Pritch, Yael and Rubinstein, Michael and Aberman, Kfir},
-   booktitle={arXiv preprint arxiv:2208.12242},
-   year={2022}
+ @article{tan2024llmdiffusion,
+   title={An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation},
+   author={Tan, Zhiyu and Yang, Mengping and Yang, Hao and Qian, Ye and Qin, Luozheng and Zhang, Cheng and Li, Hao},
+   journal={arXiv preprint arXiv:2405.xxxxx},
+   year={2024}
}