From 3e237a99661506bb36efbcfbf704530a15efdbc4 Mon Sep 17 00:00:00 2001
From: Phil Wang
Date: Sun, 12 Nov 2023 08:10:12 -0800
Subject: [PATCH] last update

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index 0537fde..a50ac87 100644
--- a/README.md
+++ b/README.md
@@ -24,6 +24,7 @@ Update 9: Head to head
 
 Update 10: and it got passed by attention, at least, assuming the implementation in the repo is correct.
 
+Update 11: I'm seeing a steady improvement from increasing the head dimension, so I no longer believe max-heads is optimal. Increasing the head dimension brings us right back to linear attention and needing the fused CUDA kernel.
 
 ### Appreciation
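For context on the last point in the added update: in linear attention the per-head key-value state is a `dim_head × dim_head` matrix, so compute and memory grow quadratically with the head dimension, and the causal variant further needs a chunked or fused CUDA kernel to avoid materializing that state at every timestep. Below is a minimal sketch of non-causal linear attention using the `elu + 1` feature map from Katharopoulos et al.; the function name and shapes are illustrative assumptions, not the implementation in this repo.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps = 1e-6):
    # q, k, v: (batch, heads, seq_len, dim_head)
    q, k = F.elu(q) + 1, F.elu(k) + 1

    # key-value state: (dim_head, dim_head) per head - the term that grows
    # quadratically as head dimension increases
    kv = torch.einsum('b h n d, b h n e -> b h d e', k, v)

    # normalizer: queries dotted with the sum of keys
    z = torch.einsum('b h n d, b h d -> b h n', q, k.sum(dim = -2)).clamp(min = eps)

    return torch.einsum('b h n d, b h d e -> b h n e', q, kv) / z[..., None]

# one head with a large head dimension, e.g. after trading heads for head dim
q = k = v = torch.randn(1, 1, 1024, 256)
out = linear_attention(q, k, v)   # (1, 1, 1024, 256)
```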