This project was my final project for CS 583 - Deep Learning at Stevens Institute of Technology. It uses RetNet, a novel decoder architecture from Microsoft's torchscale library, which replaces the multi-head attention commonly used in transformer models with multi-scale retention. The retentive network offers transformer-like language-modelling performance with lower memory consumption, higher throughput, and lower latency.
The model was trained on the IWSLT 2017 English-French dataset (available on Hugging Face). The dataset splits are as follows:
| Split      | Number of examples |
|------------|--------------------|
| Train      | 232,825            |
| Validation | 890                |
| Test       | 8,597              |
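For reference, these splits can be pulled straight from the Hugging Face Hub with the `datasets` library; the dataset id and configuration name below (`iwslt2017`, `iwslt2017-en-fr`) are my assumption of the Hub listing used here.

```python
from datasets import load_dataset

# English-French pairs of IWSLT 2017 from the Hugging Face Hub.
dataset = load_dataset("iwslt2017", "iwslt2017-en-fr")
print(dataset)  # DatasetDict with 'train', 'validation' and 'test' splits

# Each example is a dict with a 'translation' field keyed by language code.
sample = dataset["train"][0]
print(sample["translation"]["en"])
print(sample["translation"]["fr"])
```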
Separate encoder and decoder models were created with the following configuration and training setup (see the sketches after this list):
- Encoder:
  - Embedding dimension = 64
  - Number of attention heads = 8
  - Feedforward network embedding dimension = 256
  - Number of encoder layers = 8
- Decoder:
  - Embedding dimension = 64
  - Number of retention heads = 4
  - Feedforward network embedding dimension = 256
  - Number of decoder layers = 8
- Training:
  - Optimizer = AdamW with a learning rate of 1e-5
  - Loss function = cross-entropy loss
  - Batch size = 32
  - Number of epochs = 20
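As a rough sketch of how these hyperparameters map onto torchscale, the snippet below builds a standard attention encoder and a RetNet decoder. The config field names (`encoder_embed_dim`, `decoder_retention_heads`, etc.) follow my reading of torchscale's `EncoderConfig`/`RetNetConfig` and may differ between versions; the vocabulary size, value-embedding dimension, and embedding modules are placeholder assumptions.

```python
import torch.nn as nn

from torchscale.architecture.config import EncoderConfig, RetNetConfig
from torchscale.architecture.encoder import Encoder
from torchscale.architecture.retnet import RetNetDecoder

VOCAB_SIZE = 32000  # placeholder; in practice this comes from the tokenizer

# Encoder with standard multi-head attention, matching the list above.
enc_cfg = EncoderConfig(
    vocab_size=VOCAB_SIZE,
    encoder_embed_dim=64,
    encoder_attention_heads=8,
    encoder_ffn_embed_dim=256,
    encoder_layers=8,
)
encoder = Encoder(enc_cfg, embed_tokens=nn.Embedding(VOCAB_SIZE, 64))

# Decoder with multi-scale retention (RetNet), matching the list above.
dec_cfg = RetNetConfig(
    vocab_size=VOCAB_SIZE,
    decoder_embed_dim=64,
    decoder_value_embed_dim=128,  # assumption: RetNet typically uses ~2x the embed dim for values
    decoder_retention_heads=4,
    decoder_ffn_embed_dim=256,
    decoder_layers=8,
)
decoder = RetNetDecoder(dec_cfg, embed_tokens=nn.Embedding(VOCAB_SIZE, 64))
```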
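Training followed the standard teacher-forced seq2seq recipe; below is a minimal sketch of the loop under the hyperparameters above. The names `seq2seq_model` and `train_data` are hypothetical placeholders for the assembled encoder/decoder wrapper and the tokenised training split, not the exact code from this repo.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

PAD_ID = 0  # placeholder padding id from the (hypothetical) tokenizer

# `seq2seq_model` and `train_data` stand in for the assembled model and
# the tokenised IWSLT 2017 training split.
optimizer = torch.optim.AdamW(seq2seq_model.parameters(), lr=1e-5)
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)
loader = DataLoader(train_data, batch_size=32, shuffle=True)

for epoch in range(20):
    seq2seq_model.train()
    for src_ids, tgt_ids in loader:
        optimizer.zero_grad()
        # Teacher forcing: the decoder sees tokens < t and predicts token t.
        logits = seq2seq_model(src_ids, tgt_ids[:, :-1])  # (batch, tgt_len-1, vocab)
        loss = criterion(
            logits.reshape(-1, logits.size(-1)),          # (batch * (tgt_len-1), vocab)
            tgt_ids[:, 1:].reshape(-1),                   # (batch * (tgt_len-1),)
        )
        loss.backward()
        optimizer.step()
```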
The model was evaluated using the BLEU score, achieving a score of 36.4 on the whole dataset.
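BLEU was computed at the corpus level; a minimal sketch with `sacrebleu` (an assumption on my part, any BLEU implementation would do) looks like this, where `hypotheses` are the decoded model outputs and `references` the gold French sentences.

```python
import sacrebleu

# `hypotheses` are the model's translations, `references` the gold French
# sentences for the same inputs; the two example strings are placeholders.
hypotheses = ["Le chat est assis sur le tapis."]
references = ["Le chat est assis sur le tapis."]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}")
```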
I have also published an article on Medium.
If you find my work interesting or have any suggestions, let me know. Do cite my work if you find it valuable!