diff --git a/transformer/README.md b/transformer/README.md
index 7cb48b0..1692640 100644
--- a/transformer/README.md
+++ b/transformer/README.md
@@ -1,17 +1,79 @@
-To implement a correct **Transformer Encoder-Decoder** model, we can refer to two resources:
-+ [Pytorch Implementation](https://pytorch.org/docs/stable/_modules/torch/nn/modules/transformer.html#Transformer)
-+ [HuggingFace Implementation: BART (highly recommended)](https://github.com/huggingface/transformers/blob/master/src/transformers/models/bart/modeling_bart.py#L846)
-  + [BartModel](https://github.com/huggingface/transformers/blob/e57468b8a85bad5cc17efbfcfdd3eecb9b8a62ec/src/transformers/models/bart/modeling_bart.py#L1123)
-  + [BartEncoder](https://github.com/huggingface/transformers/blob/e57468b8a85bad5cc17efbfcfdd3eecb9b8a62ec/src/transformers/models/bart/modeling_bart.py#L671)
-  + [BartDecoder](https://github.com/huggingface/transformers/blob/e57468b8a85bad5cc17efbfcfdd3eecb9b8a62ec/src/transformers/models/bart/modeling_bart.py#L846)
-  + [BartLearnedPositionalEmbedding](https://github.com/huggingface/transformers/blob/e57468b8a85bad5cc17efbfcfdd3eecb9b8a62ec/src/transformers/models/bart/modeling_bart.py#L107)
-  + [BartAttention](https://github.com/huggingface/transformers/blob/e57468b8a85bad5cc17efbfcfdd3eecb9b8a62ec/src/transformers/models/bart/modeling_bart.py#L127)
+To implement a correct **Transformer Encoder-Decoder** model, we can refer to the following resources:
+1. The Transformer paper: [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
+2. [PyTorch implementation](https://pytorch.org/docs/stable/_modules/torch/nn/modules/transformer.html#Transformer)
+3. HuggingFace implementations: [BERT](https://github.com/huggingface/transformers/tree/master/src/transformers/models/bert) for the Encoder & [BART](https://github.com/huggingface/transformers/tree/master/src/transformers/models/bart) for the Encoder-Decoder
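+
+As a refresher, every attention class in this project ultimately computes the scaled dot-product attention of the paper, `softmax(QK^T / sqrt(d_k)) V`. Here is a minimal sketch; the function name and shapes are ours, not this repo's:
+```python
+import torch
+import torch.nn.functional as F
+
+def scaled_dot_product_attention(q, k, v, mask=None):
+    # q, k, v: (batch, heads, seq_len, d_k)
+    d_k = q.size(-1)
+    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, heads, q_len, k_len)
+    if mask is not None:
+        scores = scores.masked_fill(mask == 0, float("-inf"))
+    return F.softmax(scores, dim=-1) @ v            # (batch, heads, q_len, d_k)
+```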
+
+In this project, we give the basic definition of the Transformer network (specifying the required classes and the calling logic between them).
+**You are free to modify this definition as long as you understand the surrounding context.**
+
+To build a generic Transformer, we need to implement the following classes:
++ [XformerEncoder](https://github.com/AntNLP/fm-gym/blob/b9b0fee45616a52578c1e599073568e11b058f14/transformer/xformer_encoder.py#L5)
++ [XformerDecoder](https://github.com/AntNLP/fm-gym/blob/b9b0fee45616a52578c1e599073568e11b058f14/transformer/xformer_decoder.py#L5)
++ [SelfAttentionSublayer](https://github.com/AntNLP/fm-gym/blob/b9b0fee45616a52578c1e599073568e11b058f14/transformer/attention.py#L31)
++ [CrossAttentionSublayer](https://github.com/AntNLP/fm-gym/blob/b9b0fee45616a52578c1e599073568e11b058f14/transformer/attention.py#L59)
++ [MultiHeadAttention](https://github.com/AntNLP/fm-gym/blob/b9b0fee45616a52578c1e599073568e11b058f14/transformer/attention.py#L5)
++ [FeedforwardSublayer](https://github.com/AntNLP/fm-gym/blob/b9b0fee45616a52578c1e599073568e11b058f14/transformer/feedforward.py#L4)
++ [SinusoidalEmbeddings](https://github.com/AntNLP/fm-gym/blob/b9b0fee45616a52578c1e599073568e11b058f14/transformer/embeddings.py#L4)
++ [LearnableEmbeddings](https://github.com/AntNLP/fm-gym/blob/b9b0fee45616a52578c1e599073568e11b058f14/transformer/embeddings.py#L20)
+
+The calling hierarchy between these classes is as follows (a sketch of one sublayer follows the BERT example below):
+```
+├── XformerEncoder
+│   ├── SelfAttentionSublayer
+│   │   ├── MultiHeadAttention
+│   ├── FeedforwardSublayer
+├── XformerDecoder
+│   ├── SelfAttentionSublayer
+│   │   ├── MultiHeadAttention
+│   ├── CrossAttentionSublayer
+│   │   ├── MultiHeadAttention
+│   ├── FeedforwardSublayer
+├── LearnableEmbeddings
+├── SinusoidalEmbeddings
+```
+
+To obtain the output of the reproduced Transformer, we can define a randomly initialized BERT (Encoder) or BART (Encoder-Decoder) on top of it.
+Take BERT as an example:
+```
+├── BertModel
+│   ├── BERTEmbeddings
+│   │   ├── LearnableEmbeddings
+│   │   └── SinusoidalEmbeddings
+│   ├── XformerEncoder
+│   │   ├── SelfAttentionSublayer
+│   │   │   ├── MultiHeadAttention
+│   │   ├── FeedforwardSublayer
+│   ├── BERTMLMHead
+│   ├── BERTNSPHead
+```
+
+```python
+# Step 1: Set the random seed at the program entry (the earlier, the better)
+import os
+import random
+import numpy as np
+import torch
+
+SEED = 42  # any fixed value; it only has to match between the two models
+random.seed(SEED)
+np.random.seed(SEED)
+torch.manual_seed(SEED)
+torch.cuda.manual_seed(SEED)
+
+# Step 2: Load the BERT hyperparameters and tokenizer
+from transformers import AutoTokenizer, AutoConfig
+config = AutoConfig.from_pretrained(os.environ['TRANSFORMERS_CACHE']+'bert-base')  # or 'bart-base'
+tokenizer = AutoTokenizer.from_pretrained(os.environ['TRANSFORMERS_CACHE']+'bert-base')  # or 'bart-base'
+
+# Step 3: Define our BERT
+from bert_model import BertModel
+our_bert = BertModel(
+    vocab_size = config.vocab_size,
+    hidden_size = config.hidden_size,
+    ...
+    ...
+)
+
+# Step 4: Get the model output without dropout
+our_bert.eval()
+inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
+last_hidden_states, _, _ = our_bert(**inputs)
+```
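+
+To make the calling hierarchy above concrete, here is a minimal sketch of what a `SelfAttentionSublayer` could look like: a residual connection and LayerNorm wrapped around multi-head self-attention (post-LN, as in the paper). PyTorch's built-in attention stands in for this repo's `MultiHeadAttention`, and all signatures here are our assumptions, not the repo's:
+```python
+import torch.nn as nn
+
+class SelfAttentionSublayer(nn.Module):
+    # Residual connection + LayerNorm around self-attention (post-LN, as in
+    # the paper); nn.MultiheadAttention stands in for MultiHeadAttention.
+    def __init__(self, hidden_size, num_heads, dropout=0.1):
+        super().__init__()
+        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
+        self.norm = nn.LayerNorm(hidden_size)
+        self.dropout = nn.Dropout(dropout)
+
+    def forward(self, x, attn_mask=None):
+        out, _ = self.attn(x, x, x, attn_mask=attn_mask)  # query = key = value = x
+        return self.norm(x + self.dropout(out))           # add & norm
+```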
-By referring to the hugging face implementation, we could confirm that we have implemented the **Transformer Encoder-Decoder** model correctly.
-Check that our output is consistent with the Bart output ( `last_hidden_states` in the code below).
+By referring to the Hugging Face implementation, we can confirm that we have implemented the **Transformer Encoder-Decoder** model correctly.
+Check that our output is consistent with the BERT/BART output (`last_hidden_states` in the code below).
 
 ```python
 # Step 1: Set the Random Seed in the program entry
@@ -24,9 +86,9 @@ torch.cuda.manual_seed(SEED)
 
-# Step 2: Define a randomly initialized BART model
+# Step 2: Define a randomly initialized BERT/BART model
 from transformers import AutoTokenizer, AutoConfig, AutoModel
-config = AutoConfig.from_pretrained(os.environ['TRANSFORMERS_CACHE']+'bart-base')
+config = AutoConfig.from_pretrained(os.environ['TRANSFORMERS_CACHE']+'bert-base')  # or 'bart-base'
 model = AutoModel.from_config(config)
-tokenizer = AutoTokenizer.from_pretrained(os.environ['TRANSFORMERS_CACHE']+'bart-base')
+tokenizer = AutoTokenizer.from_pretrained(os.environ['TRANSFORMERS_CACHE']+'bert-base')  # or 'bart-base'
 
 # Step 3: Get the model output without dropout
 model.eval()
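+
+# One possible way to finish the check (a sketch; it assumes `our_bert` from
+# the earlier snippet was built from the same config under the same seed, and
+# the tolerance below is our guess, not a requirement of the repo):
+inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
+hf_hidden = model(**inputs).last_hidden_state
+our_hidden, _, _ = our_bert(**inputs)
+assert torch.allclose(our_hidden, hf_hidden, atol=1e-5)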