A brand new activation function: KeLü (Keen Learning Unit)

KeLü has a continuous derivative at 0, while its third derivative has singularities at -3.5 and 3.5. Compared to GELU, it also decays to zero a bit faster. Both Flux and Jax implementations are included. For comparison, we implement three well-known networks.
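As an illustration of the stated properties only, here is a minimal JAX sketch of a KeLü-style gated activation of the form x * k(x). The specific sine-smoothed ramp gate below, supported on [-3.5, 3.5], is an assumption chosen to be consistent with the description above (smooth at 0, third-derivative jumps at ±3.5, exact zero in the far-negative tail); it is not necessarily the repo's exact formula. See the Jax and Flux directories for the reference implementations.

```python
import jax
import jax.numpy as jnp

def kelu(x, a=3.5):
    """Illustrative KeLu-style activation: x times a smooth gate.

    The gate rises from 0 at x = -a to 1 at x = +a, so the output is
    exactly 0 for x <= -a and exactly x for x >= a.  The sin term makes
    the gate's first and second derivatives match at +/-a, leaving only
    a jump in the third derivative (the singularities at +/-3.5).
    NOTE: this gate is an assumption consistent with the README's
    description, not necessarily the repo's exact definition.
    """
    gate = 0.5 * (1.0 + x / a + jnp.sin(jnp.pi * x / a) / jnp.pi)
    gate = jnp.where(x <= -a, 0.0, jnp.where(x >= a, 1.0, gate))
    return x * gate

# Quick sanity checks of the advertised properties.
x = jnp.linspace(-6.0, 6.0, 13)
print(kelu(x))              # exactly 0 below -3.5, identity above 3.5
print(jax.grad(kelu)(0.0))  # smooth at 0; gradient is 0.5, same as GELU's
```

With this gate the function is exactly 0 for x <= -3.5 (hence the faster-than-GELU decay) and exactly x for x >= 3.5.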

Both the Jax and Flux directories include implementations of the following papers, with the respective abbreviations:

- Patches Are All You Need? --- P
- ResNet20 --- R20
- GPT2 --- GPT2
- All of the above models are trained with standard augmentation techniques (see the SdP repo); a minimal sketch of a typical augmentation pipeline follows this list.
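The augmentation code itself lives in the SdP repo. Purely as an illustration of what "standard augmentation" usually means for CIFAR-scale images (4-pixel zero padding, random crop, random horizontal flip), here is a minimal JAX sketch; the function name and parameters are hypothetical and may differ from the actual training pipeline.

```python
import jax
import jax.numpy as jnp

def augment(key, image, pad=4):
    """Typical CIFAR-style augmentation: pad, random crop, random horizontal flip.

    `image` is assumed to be HWC (e.g. 32x32x3).  This is only an
    illustration of "standard augmentation"; the actual pipeline is in
    the SdP repo and may differ.
    """
    k_top, k_left, k_flip = jax.random.split(key, 3)
    h, w, c = image.shape
    padded = jnp.pad(image, ((pad, pad), (pad, pad), (0, 0)))
    top = jax.random.randint(k_top, (), 0, 2 * pad + 1)
    left = jax.random.randint(k_left, (), 0, 2 * pad + 1)
    cropped = jax.lax.dynamic_slice(padded, (top, left, 0), (h, w, c))
    do_flip = jax.random.bernoulli(k_flip)
    return jnp.where(do_flip, cropped[:, ::-1, :], cropped)
```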

CIFAR100 (P)

| Act. | Depth | Patch_size | Kernel_size | Embed_Dim | Acc (%) | Loss |
|------|-------|------------|-------------|-----------|---------|--------|
| Relu | 8 | 2 | 5 | 384 | 77.79 | 1.075 |
| Gelu | 8 | 2 | 5 | 384 | 78.04 | 1.083 |
| Swish | 8 | 2 | 5 | 384 | 78.26 | 1.052 |
| KeLu | 8 | 2 | 5 | 384 | 78.53 | 1.043 |
| KeLu | 12 | 2 | 5 | 384 | 79.63 | 0.9787 |
| Gelu | 12 | 2 | 5 | 384 | 79.14 | 0.9995 |

CIFAR10 (P)

| Act. | Depth | Patch_size | Kernel_size | Embed_Dim | Acc (%) | Loss |
|------|-------|------------|-------------|-----------|---------|--------|
| Relu | 8 | 2 | 5 | 256 | 93.16 | 0.4382 |
| Gelu | 8 | 2 | 5 | 256 | 93.23 | 0.4281 |
| KeLu | 8 | 2 | 5 | 256 | 93.44 | 0.4274 |

Note: With 150-epoch training, I am not able to reproduce the results reported in the "Patches Are All You Need?" article for CIFAR10. This is probably due to differences in the penalization methods.

ImageNet1K (64x64) (R20) (Need to retrain these two one more time!)

| Act. | Acc (%) | Loss |
|------|---------|-------|
| Gelu | 78.04 | 1.083 |
| KeLu | 78.53 | 1.043 |

XXS GPT2 (character-based, being trained on 100 MB of newspaper-article text)

| #Params | Embed_Dim | #Heads | #Blocks | KeLu val. loss | Gelu val. loss |
|---------|-----------|--------|---------|----------------|----------------|
| 55M | 384 | 6 | 10 | | |