diff --git a/README.md b/README.md
index 5cb2040..c9b2117 100644
--- a/README.md
+++ b/README.md
@@ -3,7 +3,7 @@ Collection of papers/repos on state-space models.
 
 
 
-## ICML 2024 (not exhaustive)
+## ICML 2024
 
 1. StableSSM: Alleviating the Curse of Memory in State-space Models through Stable Reparameterization (https://arxiv.org/abs/2311.14495)
 
@@ -25,6 +25,15 @@ Collection of papers/repos on state-space models.
 
 9. Repeat After Me: Transformers are Better than State Space Models at Copying (https://arxiv.org/pdf/2402.01032)
 
+10. SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization (https://www.arxiv.org/abs/2405.11582)
+
+11. Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences
+
+12. When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models
+
+ Abstract: Autoregressive Large Language Models (LLMs) have achieved impressive performance in language tasks but face significant bottlenecks: (1) quadratic complexity bottleneck in the attention module with increasing token numbers, and (2) efficiency bottleneck due to the sequential processing nature of autoregressive LLMs during generation. Linear attention and speculative decoding emerge as solutions for these challenges, yet their applicability and combinatory potential for autoregressive LLMs remain uncertain. To this end, we embark on the first comprehensive empirical investigation into the efficacy of existing linear attention methods for autoregressive LLMs and their integration with speculative decoding. We introduce an augmentation technique for linear attention and ensure the compatibility between linear attention and speculative decoding for efficient LLM training and serving. Extensive experiments and ablation studies on seven existing linear attention works and five encoder/decoder-based LLMs consistently validate the effectiveness of our augmented linearized LLMs, e.g., achieving up to a 6.67 perplexity reduction on LLaMA and 2x speedups during generation as compared to prior linear attention methods.
+
+13. Simple linear attention language models balance the recall-throughput tradeoff (https://arxiv.org/abs/2402.18668) [GitHub](https://github.com/HazyResearch/based)
 
 ## Input-dependent gating.
 
diff --git a/RNN.md b/RNN.md
index 5df78d3..c40ae54 100644
--- a/RNN.md
+++ b/RNN.md
@@ -1,3 +1,17 @@
+## ICML 2024
+
+1. Learning Useful Representations of Recurrent Neural Network Weight Matrices (https://arxiv.org/abs/2403.11998)
+
+2. Hidden Traveling Waves bind Working Memory Variables in Recurrent Neural Networks (https://arxiv.org/abs/2402.10163)
+
+3. A Tensor Decomposition Perspective on Second-order RNNs
+
+Abstract:
+Second-order Recurrent Neural Networks (2RNNs) extend RNNs by leveraging second-order interactions for sequence modelling. These models are provably more expressive than their first-order counterparts and have connections to well-studied models from formal language theory. However, their large parameter tensor makes computations intractable. To circumvent this issue, one approach known as MIRNN consists in limiting the type of interactions used by the model. Another is to leverage tensor decomposition to diminish the parameter count. In this work, we study the model resulting from parameterizing 2RNNs using the CP decomposition, which we call CPRNN. Intuitively, the rank of the decomposition should reduce expressivity. We analyze how rank and hidden size affect model capacity and show the relationships between RNNs, 2RNNs, MIRNNs, and CPRNNs based on these parameters. We support these results empirically with experiments on the Penn Treebank dataset which demonstrate that, with a fixed parameter budget, CPRNNs outperforms RNNs, 2RNNs, and MIRNNs with the right choice of rank and hidden size.
+
+
+## NeurIPS 2023
+
 1. Inverse Approximation Theory for Nonlinear Recurrent Neural Networks (https://openreview.net/forum?id=yC2waD70Vj)
 
 We prove an inverse approximation theorem for the approximation of nonlinear sequence-to-sequence relationships using recurrent neural networks (RNNs). This is a so-called Bernstein-type result in approximation theory, which deduces properties of a target function under the assumption that it can be effectively approximated by a hypothesis space. In particular, we show that nonlinear sequence relationships that can be stably approximated by nonlinear RNNs must have an exponential decaying memory structure - a notion that can be made precise. This extends the previously identified curse of memory in linear RNNs into the general nonlinear setting, and quantifies the essential limitations of the RNN architecture for learning sequential relationships with long-term memory. Based on the analysis, we propose a principled reparameterization method to overcome the limitations. Our theoretical results are confirmed by numerical experiments.