SlapDash-Net (SdP-Net) is a not-too-serious weekend project: a slapdash variation on the ViT architecture. We use standard transformer encoder layers together with some register tokens, and before the encoder we introduce a few convolution layers in a highly slapdash manner. The models will be trained on the ImageNet-1k/22k datasets. A rough sketch of the intended layout follows the list below.
- No promise of very high accuracy,
- No claim of novelty: the same idea may well have been used elsewhere,
- No attempt to tune hyperparameters more than necessary,
- We like to hybridize things,
- We try bizarre combinations for one simple reason: because we want to!!!,
- In SdP-Net we trust!
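A minimal sketch of the intended layout, not the actual implementation: the stem widths, strides, number of register tokens, and the class name `SdPNetSketch` are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class SdPNetSketch(nn.Module):
    """Rough layout: slapdash conv stem -> patch embedding -> [CLS] + register tokens -> ViT encoder."""

    def __init__(self, img_size=224, patch_size=16, conv_size=7, embed_dim=768,
                 depth=12, num_heads=12, num_registers=4, num_classes=1000):
        super().__init__()
        # Slapdash convolution applied before the transformer encoder (widths/strides are guesses).
        self.conv_stem = nn.Sequential(
            nn.Conv2d(3, embed_dim // 4, kernel_size=conv_size, stride=2,
                      padding=conv_size // 2),
            nn.GELU(),
        )
        # Patch embedding on top of the conv features (stride halved because the stem downsamples by 2).
        self.patch_embed = nn.Conv2d(embed_dim // 4, embed_dim,
                                     kernel_size=patch_size // 2, stride=patch_size // 2)
        num_patches = (img_size // patch_size) ** 2
        # Learnable [CLS] + register tokens ("Vision Transformers Need Registers").
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.registers = nn.Parameter(torch.zeros(1, num_registers, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1 + num_registers, embed_dim))
        # Pre-norm encoder blocks; note nn.TransformerEncoderLayer also applies dropout to the
        # attention output, whereas the repo restricts Dropout(0.2) to the FFN part.
        block = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=4 * embed_dim, dropout=0.2,
                                           activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=depth)
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(self.conv_stem(x))           # (B, D, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)                  # (B, N, D)
        b = x.shape[0]
        tokens = torch.cat([self.cls_token.expand(b, -1, -1),
                            self.registers.expand(b, -1, -1), x], dim=1)
        tokens = self.encoder(tokens + self.pos_embed)
        return self.head(self.norm(tokens[:, 0]))         # classify from the [CLS] token
```

Instantiating the M row of the table below would look like `SdPNetSketch(embed_dim=768, depth=12, patch_size=16, conv_size=7)`.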
| Size | Params | Blocks | Patch size | Conv size | Embed dim | Top-1 acc (%) |
|---|---|---|---|---|---|---|
| XXS | 55M | 7 | 16 | 7 | 128 | ? |
| S | 76M | 12 | 16 | 7 | 512 | ? |
| M | 86M | 12 | 16 | 7 | 768 | ? |
| L | 86M | 12 | 16 | 7 | 768 | ? |
| XL | 86M | 15 | 16 | 7 | 768 | 79.8 |
Bitter lesson: the biggest model reaches 79.8% top-1 on ImageNet-1k. The others are still training, time permitting.
AdamW: lr = 0.001875 (= 0.001 × batch_size / 512), weight decay 0.05, cosine annealing with warm restarts on top of 5 warm-up epochs.
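A minimal sketch of this recipe in PyTorch. Only the lr scaling rule, weight decay 0.05, and 5 warm-up epochs come from the line above; the batch size, total epoch count, restart period, and warm-up start factor are assumptions.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts, LinearLR, SequentialLR

model = torch.nn.Linear(768, 1000)       # stand-in for the real SdP-Net model

batch_size = 960                         # assumption; 0.001 * 960 / 512 = 0.001875
base_lr = 0.001 * batch_size / 512       # linear lr scaling rule from the line above
optimizer = AdamW(model.parameters(), lr=base_lr, weight_decay=0.05)

warmup_epochs, total_epochs = 5, 300     # 300 total epochs is an assumption
warmup = LinearLR(optimizer, start_factor=1e-3, total_iters=warmup_epochs)
cosine = CosineAnnealingWarmRestarts(optimizer, T_0=total_epochs - warmup_epochs)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[warmup_epochs])

for epoch in range(total_epochs):
    # ... run one training epoch here ...
    scheduler.step()                     # epoch-level stepping keeps the sketch simple
```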
RandAugment + Random Erasing + random resized crop + CutMix + MixUp + Dropout(0.2) (applied only to the FFN part of each attention block).
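A sketch of the data-side pipeline using torchvision's v2 transforms. The magnitudes, probabilities, and normalization stats are assumptions, and the Dropout(0.2) lives inside the model's FFN, not here.

```python
import torch
from torchvision.transforms import v2

NUM_CLASSES = 1000

# Per-image transforms: random resized crop + RandAugment + random erasing.
train_tf = v2.Compose([
    v2.RandomResizedCrop(224, antialias=True),
    v2.RandAugment(),
    v2.ToImage(),
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    v2.RandomErasing(p=0.25),
])

# Batch-level CutMix / MixUp, picked at random for each batch.
cutmix_or_mixup = v2.RandomChoice([v2.CutMix(num_classes=NUM_CLASSES),
                                   v2.MixUp(num_classes=NUM_CLASSES)])

def collate_with_mix(batch):
    # default_collate stacks the (image, label) pairs, then the whole batch is mixed.
    images, labels = torch.utils.data.default_collate(batch)
    return cutmix_or_mixup(images, labels)
```

The collate function would be handed to the `DataLoader` as `collate_fn=collate_with_mix`.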
# TODO
- Gating mechanism in FFN? (see the sketch after this list)
- EMA Model (This is important for future use!!!)
- Gradient accumulation -- allows a larger effective batch size and learning rate (ok!!!)
- Register tokens (ViTs need registers!)
- Stochastic Depth (Further research is needed!!!)
- No more BatchNorm layers (LayerNorm is used here!!!)
- If possible, binary cross-entropy loss instead of cross-entropy loss (ResNet strikes back!!!)
- Write a kind-of unit test for intermediate activations!!! (ok!!!)
- Write a trainer class from scratch -- if possible, with some subclassing!!!
- Use the KeLü activation instead of GELU (KeLü is implemented but may not be fully optimized!)
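On the FFN-gating TODO above: a minimal SwiGLU-style gated FFN that could replace the plain MLP in each encoder block. The class name and hidden width are illustrative, and the activation could be swapped for KeLü once it is optimized.

```python
import torch.nn as nn

class GatedFFN(nn.Module):
    """SwiGLU-style feed-forward block: down(act(gate(x)) * up(x))."""

    def __init__(self, dim, hidden_dim=None, dropout=0.2):
        super().__init__()
        hidden_dim = hidden_dim or 4 * dim
        self.gate = nn.Linear(dim, hidden_dim)
        self.up = nn.Linear(dim, hidden_dim)
        self.down = nn.Linear(hidden_dim, dim)
        self.act = nn.SiLU()              # could be swapped for GELU or KeLü
        self.drop = nn.Dropout(dropout)   # the Dropout(0.2) that lives only in the FFN

    def forward(self, x):
        return self.down(self.drop(self.act(self.gate(x)) * self.up(x)))
```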