KeLü (Keen Learning Unit): a new activation function.
It has a continuous derivative at 0, while its third derivative has singularities at -3.5 and 3.5. Furthermore, compared to GELU it decays to zero a bit faster. Both Flux and Jax implementations are included. For comparison, we implement three well-known networks.
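The exact closed form lives in the Flux and Jax sources. Purely to illustrate the shape described above, a gate that is identically zero well below the origin, identically one well above it, smooth at 0, and with higher-derivative kinks at ±3.5, here is a hedged JAX sketch; the quintic-smoothstep gate and the name `kelu_like` are assumptions for illustration, not the repository's definition.

```python
import jax.numpy as jnp

A = 3.5  # knot location taken from the text (third-derivative singularities at +/-3.5)

def kelu_like(x):
    """Illustrative KeLu-style activation, NOT the repository's exact formula:
    x times a C^2 quintic-smoothstep gate that is exactly 0 for x <= -A and
    exactly 1 for x >= A, so the output vanishes faster than GELU for large
    negative inputs and the gate's third derivative jumps at +/-A."""
    t = jnp.clip((x + A) / (2.0 * A), 0.0, 1.0)
    gate = t * t * t * (10.0 - 15.0 * t + 6.0 * t * t)  # quintic smoothstep
    return x * gate
```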
Both the Jax and Flux directories include implementations of these papers, organized under their respective abbreviations.
All of the above models are trained with standard augmentation techniques (see the SdP repo).
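As a rough sketch only, and assuming that "standard augmentation" here means the usual CIFAR-style pad-and-random-crop plus random horizontal flip (the actual pipeline is in the SdP repo), a minimal JAX version could look like this:

```python
import jax
import jax.numpy as jnp

def augment(key, images):
    """Hypothetical 'standard' augmentation: reflect-pad, random crop, random
    horizontal flip. Only an illustration; not the repository's pipeline.
    images: (N, H, W, C) floats."""
    k_crop, k_flip = jax.random.split(key)
    n, h, w, c = images.shape
    pad = 4  # assumed padding, as commonly used for 32x32 CIFAR images
    padded = jnp.pad(images, ((0, 0), (pad, pad), (pad, pad), (0, 0)), mode="reflect")
    # One random (row, col) crop offset per image.
    offs = jax.random.randint(k_crop, (n, 2), 0, 2 * pad + 1)
    crop = jax.vmap(
        lambda img, o: jax.lax.dynamic_slice(img, (o[0], o[1], 0), (h, w, c))
    )(padded, offs)
    # Random horizontal flip with probability 0.5.
    flip = jax.random.bernoulli(k_flip, 0.5, (n, 1, 1, 1))
    return jnp.where(flip, crop[:, :, ::-1, :], crop)
```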
Act. | Depth | Patch_size | Kernel_size | Embed_Dim | Acc (%) | Loss |
---|---|---|---|---|---|---|
ReLU | 8 | 2 | 5 | 384 | 77.79 | 1.075 |
GELU | 8 | 2 | 5 | 384 | 78.04 | 1.083 |
Swish | 8 | 2 | 5 | 384 | 78.26 | 1.052 |
KeLu | 8 | 2 | 5 | 384 | 78.53 | 1.043 |
KeLu | 12 | 2 | 5 | 384 | 79.63 | 0.9787 |
GELU | 12 | 2 | 5 | 384 | 79.14 | 0.9995 |
Act. | Depth | Patch_size | Kernel_size | Embed_Dim | Acc (%) | Loss |
---|---|---|---|---|---|---|
ReLU | 8 | 2 | 5 | 256 | 93.16 | 0.4382 |
GELU | 8 | 2 | 5 | 256 | 93.23 | 0.4281 |
KeLu | 8 | 2 | 5 | 256 | 93.44 | 0.4274 |
Note: With 150-epoch training, I am not able to reproduce the CIFAR-10 results reported in the "Patches Are All You Need?" article. This is probably due to differences in the penalization (regularization) methods.
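For context on the hyperparameter columns above: in a ConvMixer-style model, Patch_size is the stride of the patch-embedding convolution, Kernel_size the depthwise-convolution kernel, Embed_Dim the channel width, and Depth the number of mixer blocks. The following Flax sketch (normalization layers omitted, names and defaults illustrative rather than the repository's) shows where those knobs and the pluggable activation enter:

```python
from typing import Callable
import jax.numpy as jnp
import flax.linen as nn

class ConvMixerSketch(nn.Module):
    """Simplified ConvMixer-style network, BatchNorm omitted for brevity.
    Only meant to show where Depth, Patch_size, Kernel_size, Embed_Dim and
    the activation from the tables enter; not the repository's exact code."""
    embed_dim: int = 384
    depth: int = 8
    patch_size: int = 2
    kernel_size: int = 5
    num_classes: int = 10          # set to the dataset's class count
    act: Callable = nn.gelu        # swap in ReLU / Swish / KeLu for the other rows

    @nn.compact
    def __call__(self, x):         # x: (N, H, W, 3)
        # Patch embedding: convolution with stride equal to the patch size.
        x = self.act(nn.Conv(self.embed_dim,
                             (self.patch_size, self.patch_size),
                             strides=(self.patch_size, self.patch_size))(x))
        for _ in range(self.depth):
            # Depthwise ("spatial mixing") convolution with a residual connection.
            y = self.act(nn.Conv(self.embed_dim,
                                 (self.kernel_size, self.kernel_size),
                                 padding="SAME",
                                 feature_group_count=self.embed_dim)(x))
            x = x + y
            # Pointwise ("channel mixing") 1x1 convolution.
            x = self.act(nn.Conv(self.embed_dim, (1, 1))(x))
        x = jnp.mean(x, axis=(1, 2))  # global average pooling
        return nn.Dense(self.num_classes)(x)
```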
Act. | Acc (%) | Loss |
---|---|---|
GELU | 78.04 | 1.083 |
KeLu | 78.53 | 1.043 |
#Params | Embed_Dim | #Heads | #Blocks | KeLu val. loss | GELU val. loss |
---|---|---|---|---|---|
55M | 384 | 6 | 10 | | |
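The #Heads and #Blocks columns suggest a transformer-style model; in that setting the GELU vs. KeLu comparison only swaps the activation inside each block's feed-forward MLP. A hedged Flax sketch of that MLP, with illustrative names and a conventional 4x expansion factor that is assumed rather than taken from the repository:

```python
from typing import Callable
import flax.linen as nn

class TransformerMLP(nn.Module):
    """Illustrative feed-forward sub-block of a transformer layer; the activation
    (GELU vs. KeLu) is the only piece that changes between the two runs."""
    embed_dim: int = 384
    expansion: int = 4             # assumed; not taken from the repository
    act: Callable = nn.gelu        # pass the KeLu implementation for the KeLu column

    @nn.compact
    def __call__(self, x):
        x = nn.Dense(self.expansion * self.embed_dim)(x)
        x = self.act(x)
        return nn.Dense(self.embed_dim)(x)
```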