Attention is not all you need

Key ideas

  • Our understanding of why attention mechanisms work remains limited
  • Shows that the output of a self-attention network can be decomposed into a sum of smaller terms
  • A strong inductive bias toward token uniformity is deduced from these smaller terms
  • Without skip connections or multi-layer perceptrons, the output converges doubly exponentially (with depth) to a rank-1 matrix (see the sketch after this list)
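
A minimal numerical sketch of this collapse, assuming single-head softmax attention with random weights; the `relative_residual` helper and the Frobenius norm are simplifications of the paper's composite residual norm:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def relative_residual(X):
    # Distance of X to the nearest rank-1 matrix of the form 1 x^T, relative to ||X||.
    # The row-mean minimizes the Frobenius distance; the paper uses a composite l1/l-inf norm.
    x_bar = X.mean(axis=0, keepdims=True)
    return np.linalg.norm(X - np.ones((X.shape[0], 1)) @ x_bar) / np.linalg.norm(X)

rng = np.random.default_rng(0)
n, d = 8, 16                                        # tokens and width, chosen arbitrarily
X = rng.normal(size=(n, d))

for depth in range(1, 9):
    Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    P = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d))  # row-stochastic attention matrix
    X = P @ (X @ Wv)                                 # pure attention: no skip, no MLP
    print(depth, relative_residual(X))               # residual shrinks rapidly with depth
```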

Introduction

  • Attention mechanisms were introduced to better capture long-range dependencies in sequences
  • Stacking multiple self-attention layers shows that expressiveness is lost as more layers are added
  • Question: why do transformers work? Skip connections turn out to play a major role
  • Path decomposition to study self-attention networks (SANs):
    • Decompose a SAN into a linear combination of weakly-dependent paths (sketched below)
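
A rough sketch of that decomposition, ignoring bias terms and using notation assumed here rather than taken verbatim from the paper: P_h^(l) is the row-stochastic attention matrix of head h in layer l, W_h^(l) its combined value/output projection, and h_l = 0 denotes the skip branch with P_0 = W_0 = I:

```latex
\[
\mathrm{SAN}(X) \;=\;
\sum_{h_1=0}^{H} \cdots \sum_{h_L=0}^{H}
\underbrace{P_{h_L}^{(L)} \cdots P_{h_1}^{(1)}}_{\text{attention along one path}}
\; X \;
\underbrace{W_{h_1}^{(1)} \cdots W_{h_L}^{(L)}}_{\text{projections along the path}}
\]
```

Each choice of one head (or the skip) per layer is a path; the paths share the input-dependent attention matrices, which is why they are only weakly dependent rather than independent.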

Attention doubly exponentially loses rank

  • Pure SANs converge doubly exponentially (with depth) to a rank-1 matrix in which all tokens are identical
  • A multi-head SAN can be written as a sum of single-head networks (checked numerically below)
  • Hence, a SAN is a directed acyclic graph whose nodes are self-attention heads and whose edges are the connections between layers
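
A quick numpy check of the sum-of-heads view (sizes and weight names are illustrative): concatenating head outputs and applying the output projection equals summing per-head contributions, each head pairing with its slice of the output projection.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
n, d, H = 6, 16, 4                  # tokens, width, heads (arbitrary)
dh = d // H                         # per-head width

X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(H, d, dh)) / np.sqrt(d) for _ in range(3))
Wo = rng.normal(size=(d, d)) / np.sqrt(d)        # output projection

# Per-head outputs P_h X W_{V,h} (each n x dh).
heads = [softmax((X @ Wq[h]) @ (X @ Wk[h]).T / np.sqrt(dh)) @ (X @ Wv[h]) for h in range(H)]

# Standard multi-head attention: concatenate heads, then project.
multi = np.concatenate(heads, axis=-1) @ Wo

# Equivalent sum of single-head terms: head h pairs with rows h*dh:(h+1)*dh of Wo.
summed = sum(heads[h] @ Wo[h * dh:(h + 1) * dh] for h in range(H))

print(np.allclose(multi, summed))   # True
```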

Mechanisms that counteract rank collapse

  • Hypothesis: a path distribution dominated by long paths makes the SAN degenerate toward a rank-1 matrix
  • Skip connections dramatically diversify the path distribution, helping counteract rank collapse (compared numerically below)
  • Multi-layer perceptrons slow down the collapse; the convergence rate depends on their Lipschitz constant (a larger constant means slower rank collapse)
  • Layer normalization plays no role in counteracting rank collapse
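
A small comparison under the same random-weight assumptions as the earlier sketch, toggling only the skip connection; the relative rank-1 residual collapses without skips and stays roughly constant with them:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def relative_residual(X):
    # Relative Frobenius distance of X to the nearest rank-1 matrix 1 x^T.
    x_bar = X.mean(axis=0, keepdims=True)
    return np.linalg.norm(X - np.ones((X.shape[0], 1)) @ x_bar) / np.linalg.norm(X)

def run(depth, skip, n=8, d=16, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    for _ in range(depth):
        Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
        P = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d))
        update = P @ (X @ Wv)
        X = X + update if skip else update   # skip connection toggled here
    return relative_residual(X)

for depth in (1, 2, 4, 8):
    print(depth, round(run(depth, skip=False), 6), round(run(depth, skip=True), 6))
```

In the path-decomposition view, the skip branch contributes the short (including length-0) paths, which is why it preserves rank.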