Commit 3e237a9: last update

lucidrains authored Nov 12, 2023
1 parent fb7e779
Showing 1 changed file with 1 addition and 0 deletions: README.md
@@ -24,6 +24,7 @@ Update 9: <a href="https://api.wandb.ai/links/lucidrains/do1i9rx0">Head to head

Update 10: and it got surpassed by attention, at least, assuming the implementation in the repo is correct.

Update 11: I'm seeing a steady improvement as I increase the head dimension, so I no longer believe the max-heads formulation is optimal. But increasing the head dimension brings us right back to linear attention, and to needing the fused CUDA kernel.
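For intuition on that last point, here is a minimal sketch (illustrative only, not the repository's code, with the usual normalization term omitted) of causal linear attention written as a recurrence. Each head carries a dim_head × dim_head state matrix, so the carried state grows quadratically with the head dimension, which is what makes a fused CUDA kernel attractive:

```python
# Minimal sketch of causal linear attention as a left-to-right recurrence.
# All names are illustrative; this is not the repository's code.
import jax
import jax.numpy as jnp

def linear_attention_scan(q, k, v):
    # q, k, v: (seq_len, dim_head) for a single head
    def step(state, qkv_t):
        q_t, k_t, v_t = qkv_t
        # rank-1 update of the (dim_head, dim_head) state matrix
        state = state + jnp.outer(k_t, v_t)
        out_t = q_t @ state  # read out: (dim_head,)
        return state, out_t

    dim_head = q.shape[-1]
    init_state = jnp.zeros((dim_head, dim_head))
    _, out = jax.lax.scan(step, init_state, (q, k, v))
    return out  # (seq_len, dim_head)

# With dim_head = 1 (the "max-heads" regime) the state is a scalar per head;
# raising dim_head grows the carried state quadratically, so a naive scan
# becomes memory-bound and a fused kernel starts to pay off.
```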

@EelcoHoogendoorn commented on Nov 12, 2023:

Is the custom CUDA kernel requirement also something you need in JAX?

@EelcoHoogendoorn commented on Nov 12, 2023:

Looking at the JAX code, I suppose it's a little behind the torch implementation, and you've only tried the SISO case in JAX?

@lucidrains (Author) commented on Nov 12, 2023:

yes, i believe so

and yeah, the JAX code is just there to match what the author had; he only tested the max-heads formulation. besides, even when increasing the head dimension, i don't see it coming close to beating the baseline transformer + rotary
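For context: if the SISO, max-heads case reduces per channel to a first-order gated recurrence s_t = a_t * s_{t-1} + b_t, as the discussion above suggests, then stock JAX can parallelize it with an associative scan and no custom kernel, whereas the larger-head-dimension state does not collapse this way. A hypothetical sketch, not the repository's API:

```python
# Hypothetical sketch of the scalar (SISO / max-heads) recurrence
# s_t = a_t * s_{t-1} + b_t via jax.lax.associative_scan; the names are
# illustrative, not the repository's API.
import jax
import jax.numpy as jnp

def siso_scan(a, b):
    # a, b: (seq_len,) per-step gates and inputs
    def combine(left, right):
        a_l, b_l = left
        a_r, b_r = right
        # compose the affine maps x -> a_l * x + b_l, then x -> a_r * x + b_r
        return a_r * a_l, a_r * b_l + b_r

    _, s = jax.lax.associative_scan(combine, (a, b))
    return s  # s[t] = sum over i <= t of (prod of a over i+1..t) * b[i]
```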

@lucidrains (Author) commented on Nov 12, 2023:

i'm probably not putting any more work into this unless someone finds an error in the implementation

@EelcoHoogendoorn commented on Nov 12, 2023:

Much obliged for your service; you truly are the hero we do not deserve, in a world where neither journals nor researchers make any effort to enforce reproducible publication standards.

That being said, it is not entirely obvious to me how to interpret your benchmarking so far. As I understand it, these runs are at sequence length 256; wouldn't we expect the relative merits of an SSM to show at much higher sequence lengths?

Or is your not pursuing this further to be read not as disappointment, but simply as a statement of the fact that 100 more papers without reproducible code were 'published' while I was typing this short comment?

@lucidrains (Author) commented on Nov 12, 2023:

it would take a small essay to spell out how i'm evaluating this work, but i think we should just reserve judgement until the author releases his own repository with that reproducible 13.4 ppl result on WikiText-103

@EelcoHoogendoorn commented on Nov 12, 2023:

Fair enough


### Appreciation

