Fengshenbang 1.0: The bilingual Fengshenbang 1.0 general paper, which aims to be the foundation of Chinese cognitive intelligence.
BioBART: A generative language model for the biomedical domain, developed by Tsinghua University together with the IDEA Institute. (BioNLP 2022)
UniMC: A unified model for zero-shot scenarios based on labeled datasets. (EMNLP 2022)
FMIT: A single-tower multimodal named entity recognition model based on relative position encoding. (COLING 2022)
UniEX: A Natural Language Understanding Model for Unified Extraction Tasks. (ACL 2023)
Solving Math Word Problems via Cooperative Reasoning induced Language Models. (ACL 2023)
MVP-Tuning: Multi-View Knowledge Retrieval with Prompt Tuning for Commonsense Reasoning. (ACL 2023)
- Fengshenbang team open-sources the "Ziya-Visual" 2023.06.05
- Fengshenbang team open-sources the general large-scale model series "Ziya" 2023.05.17
- The first Chinese Stable Diffusion model is open-sourced; the IDEA Fengshenbang team opens the era of Chinese AI art 2022.11.2
- Breaking the impossible triangle: comparable to 540B models, the IDEA Fengshenbang team achieves zero-shot SOTA with only a 0.2B model 2022.10.25
- AIWIN champion solution: Fengshenbang proposes the multi-task learning model Ubert 2022.07.21
- With just a simple finetune, the "Fengshenbang" pre-trained language model "Erlangshen" takes first place on the SimCLUE benchmark 2022.07.14
- Fengshen Framework is officially open-sourced, helping you easily pre-train and fine-tune major models in "Fengshenbang" 2022.06.30
- GTS model production platform is open to public beta, automatically produces AI models using AI 2022.05.23
- Dataset released! The IDEA-CCNL × NLPCC 2022 Mission Challenge has begun, and winning teams will receive IDEA internship opportunities 2022.04.07
- A new record! IDEA-CCNL pre-trained language model "Erlangshen" this time tops ZeroCLUE 2022.01.24
- IDEA Friends | CCNL Team "Fengshenbang", why did they choose IDEA? 2022.01.12
- IDEA Meeting Release|"Fengshenbang" Open Source Project 2021.11.25
- IDEA Chinese pre-trained language model Erlangshen tops the FewCLUE benchmark 2021.11.11
- Fengshenbang Achievements
- Fengshenbang Big Event
- Navigation
- Model Information
- Fengshenbang-LM
- Fengshenbang Model
- Fengshen Framework
- Fengshen Benchmark
- Fengshenbang Series Articles
- Citation
- Contact
- License
Series | Demand | Task | Parameter Scale | Extra |
---|---|---|---|---|
Ziya | General | AGI | >7B | Ziya has the capabilities of translation, programming, text classification, information extraction, summarization, copywriting, common-sense question answering, and mathematical calculation. |
Erlangshen | General | NLU | 97M-3.9B | Erlangshen was designed to solve NLU tasks; the largest BERT-structure model when publicly released; SOTA on FewCLUE (2021) and ZeroCLUE (2022). |
Wenzhong | General | NLG | 1B-3.5B | Wenzhong focuses on NLG tasks; Provides several generative models with different scales, such as GPT2, etc. |
Randeng | General | NLT | 770M-5B | Randeng handles natural language transformation (NLT) type tasks that convert from source text to target text, such as machine translation, text summarization, etc. |
Taiyi | Special | MultiModal | 87M-1B | Taiyi is applied to cross-modal scenarios, including text-to-image generation, protein structure prediction, speech-text representation, etc. |
Yuyuan | Special | Domain | 0.1B-3.5B | Yuyuan is applied to specific domains such as healthcare, finance, law, programming, etc.; the largest open-source GPT2 medical model. |
-TBD- | Special | Exploration | -Unknown- | This series hopes to develop experimental NLP models with various technology companies and universities. Currently includes: Zhouwenwang. |
Training and fine-tuning code and scripts for the Fengshenbang models
Remarkable advances in Artificial Intelligence (AI) have produced great models; in particular, pre-trained foundation models have become an emerging paradigm. In contrast to traditional AI models that must be trained from scratch on vast datasets for one or a few scenarios, foundation models can be adapted to a wide range of downstream tasks, thereby reducing the resources required to get an AI venture off the ground. Moreover, we observe that these models grow rapidly, roughly tenfold each year: BERT has 100 million parameters, while GPT-3 has over 100 billion. Many of the forefront challenges in AI, especially generalization ability, are becoming achievable thanks to this inspiring trend.
Foundation models, most notably language models, are dominated by the English-language community. Chinese, the language with the most native speakers in the world, however, has lacked systematic research resources to support it, leaving progress in the Chinese-language domain lagging behind.
And the world needs an answer for this.
On November 22nd, 2021, Harry Shum, Founder and Chairman of IDEA (International Digital Economy Academy), officially announced the launch of the "Fengshenbang" open-source project: a Chinese-language-driven foundation ecosystem that incorporates pre-trained models, task-specific fine-tuned applications, benchmarks, and datasets.
"Fengshenbang Model" will open-source a series of NLP-related pre-trained models in all aspects. There are a wide range of research tasks in the NLP community, which can be divided into two categories: general demands and special demands. In general demands, there are common NLP tasks, which are classified into Natural Language Understanding (NLU), Natural Language Generation (NLG), and Natural Language Transformation (NLT). Due to the fast development, NLP community brings special demands to the entire AI community, which are often assigned to MultiModal (MM), Domains and Exploration. We consider all of these tasks and provide models that are fine tuning for downstream tasks, making our base model easy to use for users with limited computing resources. We consider all of these demands and provide models that are fine-tuned for downstream tasks, making our base model easy to use for users with limited computing resources. Moreover, we guarantee that we will optimize the models continuously with new datasets and latest algorithms. We aim to build universal infrastructure for Chinese cognitive intelligence and prevent duplicative construction, and hence save computing resources for the community.
We also call on businesses, universities, and institutions to join the project and build this system of large-scale open-source models collaboratively. We envision that, in the near future, the first choice when a new pre-trained model is needed will be to select the one closest to the desired scale, architecture, and domain from the series, and then continue training it. After the new model is trained, it is added back to the series of open-source models for future use. In this way we build the open-source system iteratively and collaboratively, while individuals obtain the models they need with minimal computing resources.
For a better open-source experience, all models of the Fengshenbang series are synchronized with the Hugging Face community and can be obtained within a few lines of code. You are welcome to download and use our models from the IDEA-CCNL repo on Hugging Face.
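As a minimal sketch of those "few lines of code" (using one of the checkpoints described below; the generic Auto classes resolve the concrete architecture from the hosted config):

from transformers import AutoTokenizer, AutoModel

# Download a Fengshenbang checkpoint from the IDEA-CCNL organization on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("IDEA-CCNL/Erlangshen-MegatronBert-1.3B")
model = AutoModel.from_pretrained("IDEA-CCNL/Erlangshen-MegatronBert-1.3B")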
The general large-scale model "Ziya" series has the capabilities of translation, programming, text classification, information extraction, summarization, copywriting, common-sense question answering, and mathematical calculation. At present, the Ziya general-purpose large model (v1/v1.1) has completed a three-stage training process of large-scale pre-training, multi-task supervised fine-tuning, and learning from human feedback. The Ziya series includes the following models:
- Ziya-LLaMA-13B-v1.1
- Ziya-LLaMA-13B-v1
- Ziya-LLaMA-7B-Reward
- Ziya-LLaMA-13B-Pretrain-v1
- Ziya-BLIP2-14B-Visual-v1
Refer to Ziya-LLaMA-13B-v1
Refer to ziya_finetune
Refer to ziya_inference
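The links above give the full instructions for obtaining and assembling the weights. As a rough illustration only (a sketch assuming the released delta weights have already been merged with the original LLaMA weights into a hypothetical local directory ./Ziya-LLaMA-13B-merged), inference then follows the usual transformers causal-LM pattern, wrapping the query in Ziya's human/bot dialogue format:

import torch
from transformers import AutoTokenizer, LlamaForCausalLM

ckpt = "./Ziya-LLaMA-13B-merged"  # hypothetical path to a merged checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt, use_fast=False)
model = LlamaForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16, device_map="auto")

# Wrap the user query in the dialogue format used by the instruction-tuned Ziya models
query = "帮我写一份去西安的旅游计划"  # "Write me a travel plan for Xi'an"
inputs = tokenizer("<human>:" + query + "\n<bot>:", return_tensors="pt").to(model.device)
generate_ids = model.generate(**inputs, max_new_tokens=512, do_sample=True, top_p=0.85, temperature=1.0)
print(tokenizer.batch_decode(generate_ids, skip_special_tokens=True)[0])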
This series focuses on using bidirectional encoder language models to solve multiple natural language understanding tasks. Erlangshen-MegatronBert-1.3B is the largest open-source Chinese model with a BERT structure. It contains 1.3 billion parameters and was trained on 280 GB of data with 32 A100 GPUs for 14 days. It topped the Chinese natural language understanding benchmark FewCLUE on Nov 10th, 2021. Among the FewCLUE tasks, Erlangshen-MegatronBert-1.3B beat human performance on CHID (Chinese idiom cloze test) and TNEWS (news classification), and achieved SOTA on CHID, CSLDCP (academic literature classification), and OCNLI (natural language inference), refreshing the records of few-shot learning. We will continue to optimize the Erlangshen series with respect to model scale, knowledge fusion, auxiliary supervision tasks, etc.
Erlangshen-MRC topped the Chinese language comprehension benchmark ZeroCLUE on Jan 24th, 2022, achieving SOTA on the tasks CSLDCP (discipline literature classification), TNEWS (news classification), IFLYTEK (application description classification), CSL (abstract keyword recognition), and CLUEWSC (coreference resolution).
Huggingface Erlangshen-MegatronBert-1.3B
# Load the Erlangshen-MegatronBert-1.3B tokenizer, config, and encoder from the Hugging Face Hub
from transformers import MegatronBertConfig, MegatronBertModel
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Erlangshen-MegatronBert-1.3B")
config = MegatronBertConfig.from_pretrained("IDEA-CCNL/Erlangshen-MegatronBert-1.3B")
model = MegatronBertModel.from_pretrained("IDEA-CCNL/Erlangshen-MegatronBert-1.3B")
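As a quick sanity check (a minimal sketch; the sample sentence is arbitrary), the loaded encoder can be run directly to obtain contextual representations:

# Encode a sample sentence and inspect the hidden states
inputs = tokenizer("今天天气真好", return_tensors="pt")  # "The weather is great today"
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)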
For the convenience of developers, we offer an example script for downstream finetuning. The script uses the tnews dataset from CLUE.
1. First, modify the MODEL_TYPE and PRETRAINED_MODEL_PATH parameters of the finetune script; other parameters can be adjusted for your specific device.
MODEL_TYPE=huggingface-megatron_bert
PRETRAINED_MODEL_PATH=IDEA-CCNL/Erlangshen-MegatronBert-1.3B
2. Then, run:
sh finetune_classification.sh
Model | afqmc | tnews | iflytek | ocnli | cmnli | wsc | csl |
---|---|---|---|---|---|---|---|
roberta-wwm-ext-large | 0.7514 | 0.5872 | 0.6152 | 0.777 | 0.814 | 0.8914 | 0.86 |
Erlangshen-MegatronBert-1.3B | 0.7608 | 0.5996 | 0.6234 | 0.7917 | 0.81 | 0.9243 | 0.872 |
Taiyi series models are mainly used in cross-modal scenarios, including text image generation, protein structure prediction, speech-text representation, etc. On November 1, 2022, Fengshenbang released the first Chinese version of the stable diffusion model "Taiyi Stable Diffusion".
Taiyi Stable Diffusion Chinese
Taiyi Stable Diffusion Chinese&English Bilingual
from diffusers import StableDiffusionPipeline

# Load the Chinese Taiyi Stable Diffusion pipeline and move it to the GPU
pipe = StableDiffusionPipeline.from_pretrained("IDEA-CCNL/Taiyi-Stable-Diffusion-1B-Chinese-v0.1").to("cuda")

# Chinese prompt: "A waterfall plunging three thousand feet, oil painting"
prompt = '飞流直下三千尺,油画'
image = pipe(prompt, guidance_scale=7.5).images[0]
image.save("飞流.png")
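On GPUs with limited memory, the pipeline can usually be loaded in half precision; this is the standard diffusers option rather than anything specific to Taiyi:

import torch
from diffusers import StableDiffusionPipeline

# Load the weights in fp16 to roughly halve GPU memory usage
pipe = StableDiffusionPipeline.from_pretrained(
    "IDEA-CCNL/Taiyi-Stable-Diffusion-1B-Chinese-v0.1",
    torch_dtype=torch.float16,
).to("cuda")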
Example prompts (generated images omitted): 铁马冰河入梦来,3D绘画 (armored horses and icy rivers enter my dreams, 3D painting); 飞流直下三千尺,油画 (a waterfall plunging three thousand feet, oil painting); 女孩背影,日落,唯美插画 (a girl seen from behind at sunset, aesthetic illustration).
Advanced Prompt
Example prompts (generated images omitted): 铁马冰河入梦来,概念画,科幻,玄幻,3D (armored horses and icy rivers enter my dreams, concept art, sci-fi, fantasy, 3D); 中国海边城市,科幻,未来感,唯美,插画 (a Chinese seaside city, sci-fi, futuristic, aesthetic, illustration); 那人却在灯火阑珊处,色彩艳丽,古风,资深插画师作品,桌面高清壁纸 (yet the one I seek stands where the lights are dim, vivid colors, ancient Chinese style, senior illustrator's work, HD desktop wallpaper).
WebUI guide: https://github.com/IDEA-CCNL/stable-diffusion-webui/blob/master/README.md
DreamBooth fine-tuning example: https://github.com/IDEA-CCNL/Fengshenbang-LM/tree/main/fengshen/examples/stable_diffusion_dreambooth
To make it easy for everyone to use the Fengshenbang models and to participate in the continued training and downstream applications of these large models, we also open-source the user-centered FengShen framework. For details, please see: Fengshen Framework.
Drawing on other excellent open-source frameworks (including HuggingFace, Megatron-LM, PyTorch Lightning, and DeepSpeed) and the characteristics of the NLP field, we designed FengShen with PyTorch as the base framework and PyTorch Lightning as the pipeline. FengShen can be applied to pre-training large models (tens of billions of parameters) on massive data (terabytes) and to fine-tuning them on various downstream tasks. Users can easily enable distributed training and memory-saving techniques through configuration, and thus focus on model implementation and innovation. FengShen can also continue training model structures taken directly from HuggingFace, which facilitates domain transfer. FengShen provides rich, realistic source code and examples. We will continue to optimize the FengShen framework as the Fengshenbang models are trained and applied. Stay tuned.
git clone https://github.com/IDEA-CCNL/Fengshenbang-LM.git
cd Fengshenbang-LM
git submodule init
git submodule update
# The submodule is fs_datasets, which we use to manage datasets. It is pulled over ssh,
# which may fail if the user does not have an ssh key configured on the machine.
# If the pull fails, go to the .gitmodules file and change the ssh address to an https address.
pip install --editable .
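If the ssh pull fails as noted above, one way to switch the submodule to https without hand-editing .gitmodules is roughly the following (this assumes the submodule is named fs_datasets and that the https address shown is correct; check .gitmodules for the actual name and URL):

# Point the fs_datasets submodule at the https URL, then re-sync and pull it
git config --file .gitmodules submodule.fs_datasets.url https://github.com/IDEA-CCNL/fs_datasets.git
git submodule sync
git submodule update --init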
We provide a simple Docker image that contains the torch and CUDA environment needed to run our framework.
sudo docker run --runtime=nvidia --rm -itd --ipc=host --name fengshen fengshenbang/pytorch:1.10-cuda11.1-cudann8-devel
sudo docker exec -it fengshen bash
cd Fengshenbang-LM
# Update the code. The code in docker may not be up to date
git pull
git submodule foreach 'git pull origin master'
# Now you're ready to use our framework in docker
The Fengshen framework is currently adapting various downstream tasks into its Pipeline, supporting one-click Predict and Finetune from the command line. Take text classification as an example:
# predict
❯ fengshen-pipeline text_classification predict --model='IDEA-CCNL/Erlangshen-Roberta-110M-Similarity' --text='今天心情不好[SEP]今天很开心'
[{'label': 'not similar', 'score': 0.9988130331039429}]
# train
fengshen-pipeline text_classification train --model='IDEA-CCNL/Erlangshen-Roberta-110M-Similarity' --datasets='IDEA-CCNL/AFQMC' --gpus=0 --texta_name=sentence1 --strategy=ddp
Get Started with Fengshen in 3 Minutes
Fengshen Series: Getting Started on Training Large Model with Data Parallelism
Fengshen Series: It is Time to Accelerate your Training Process!
Fengshen Series: Chinese PEGASUS Model Pre-training
Fengshen Series: Just a Simple Finetune, Erlangshen Accidentally Took the First Place
Fengshen Series: Quickly Build Your Algorithm Demo
If you are using our resources for your work, please cite our paper:
@article{fengshenbang,
author = {Junjie Wang and Yuxiang Zhang and Lin Zhang and Ping Yang and Xinyu Gao and Ziwei Wu and Xiaoqun Dong and Junqing He and Jianheng Zhuo and Qi Yang and Yongfeng Huang and Xiayu Li and Yanghan Wu and Junyu Lu and Xinyu Zhu and Weifeng Chen and Ting Han and Kunhao Pan and Rui Wang and Hao Wang and Xiaojun Wu and Zhongshen Zeng and Chongpei Chen and Ruyi Gan and Jiaxing Zhang},
title = {Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence},
journal = {CoRR},
volume = {abs/2209.02970},
year = {2022}
}
You can also cite our website:
@misc{Fengshenbang-LM,
title={Fengshenbang-LM},
author={IDEA-CCNL},
year={2021},
howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}},
}
The IDEA-CCNL team has created the Fengshenbang open-source discussion group, where we post updates and release new Fengshenbang models and articles from time to time. Please scan the QR code below or search "fengshenbang-lm" on WeChat to add the Fengshen space assistant and join the group!
We are also continuously recruiting, so feel free to send in your resume!