Zigzag attention support? #20
Comments
hey David, no problem. could you link me to the paper? did you see the rotation trick from Chris Fifty yet?
Hi,

> could you link me to the paper?

It's used in the Llama3 paper (https://arxiv.org/abs/2407.21783), page 11, in the section on context parallelism. Though they don't actually use the form of ring attention implemented here, for GQA and attention-masking reasons.

> did you see the rotation trick from Chris Fifty yet?

I have not. What is it about?
check out the vq repo. nice! didn't even know Meta was using ring attention 🤣 I'll read the paper tomorrow
guess all the big players will be using some form of sequence parallel attention soon (google, meta, and you at nvidia)
@dwromero could i prompt you for a summary of what zigzag is? is it just another way to permute the sequence for better balancing?
That's right
@dwromero ok, should be an easy add!
🤟🤟🤟
@dwromero oh, there is nothing to zigzag (did you coin that term?), it is just an all gather for the keys and values, with GQA as justification
@dwromero let me break this into two parts, where i first handle the permuting they do, then offer the all gather for the keys / values, both configurable.
@dwromero actually, maybe it should just be a separate self-contained file given how different it is
I actually tried this with TransformerEngine and it works simply by splitting differently. Ran some tests and all seems to match. Do you think that would be sufficient here too? Basically, using a splitting like the one sketched below.
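A minimal sketch of what such a zigzag-style split could look like (this is illustrative, not the exact TransformerEngine code: it assumes the sequence is cut into `2 * world_size` chunks and rank `r` keeps chunks `r` and `2 * world_size - 1 - r`, so every rank gets a mix of cheap early positions and expensive late positions under causal attention):

```python
import torch

def zigzag_split(seq: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    # seq: (batch, seq_len, dim), with seq_len divisible by 2 * world_size.
    # rank r keeps chunk r plus the mirrored chunk (2 * world_size - 1 - r),
    # which balances the causal-attention work across ranks.
    chunks = seq.chunk(2 * world_size, dim = 1)
    return torch.cat((chunks[rank], chunks[2 * world_size - 1 - rank]), dim = 1)

# quick check with world_size = 4: rank 0 holds chunks (0, 7), rank 1 holds (1, 6), ...
world_size = 4
x = torch.randn(1, 16, 8)
shards = [zigzag_split(x, r, world_size) for r in range(world_size)]

# undoing the split: first halves in rank order, then second halves in reverse rank order
firsts = [s.chunk(2, dim = 1)[0] for s in shards]
seconds = [s.chunk(2, dim = 1)[1] for s in shards]
assert torch.equal(torch.cat(firsts + seconds[::-1], dim = 1), x)
```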
@dwromero yea that works for sharding the sequence, but you'll need to handle the masking (maybe flex attention can come in handy here). and it seems like they project the keys / values on each rank separately, then do the all gather?
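If flex attention is the route, the masking piece could look roughly like this. It is only a sketch, assuming PyTorch 2.5+ `torch.nn.attention.flex_attention`, the zigzag split from the snippet above, and keys / values all-gathered in plain global order; the helper name `zigzag_causal_mask_mod` is made up for illustration:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

def zigzag_causal_mask_mod(rank: int, world_size: int, chunk_size: int):
    # map a local (zigzag) query index back to its global position, then apply
    # ordinary causal masking against the globally ordered keys / values
    def mask_mod(b, h, q_idx, kv_idx):
        in_first_chunk = q_idx < chunk_size
        global_q = torch.where(
            in_first_chunk,
            rank * chunk_size + q_idx,
            (2 * world_size - 1 - rank) * chunk_size + (q_idx - chunk_size),
        )
        return global_q >= kv_idx
    return mask_mod

# usage sketch: q is this rank's zigzag shard, k / v are the full (all-gathered) sequence
rank, world_size, chunk_size = 0, 4, 128
q = torch.randn(1, 8, 2 * chunk_size, 64)
k = torch.randn(1, 8, 2 * world_size * chunk_size, 64)
v = torch.randn_like(k)

block_mask = create_block_mask(
    zigzag_causal_mask_mod(rank, world_size, chunk_size),
    B = None, H = None,
    Q_LEN = q.shape[-2], KV_LEN = k.shape[-2],
    device = q.device,
)
out = flex_attention(q, k, v, block_mask = block_mask)
```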
yea i don't know if i completely buy this. sure GQA can be enough savings that an all gather at 128k is fine, but how about 10 million? yea, this is definitely sequence parallelism in its crudest form, imo
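For concreteness, a rough sketch of that all-gather form of context parallelism could look like the following (illustrative only, not the repo's implementation): each rank projects just its own shard into keys / values, all-gathers them, and attends its local queries against the full keys / values. The causal mask here assumes a plain contiguous split; a zigzag split would need the index remapping shown earlier.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def all_gather_kv_attention(q, k, v, rank: int, world_size: int):
    # q, k, v: (batch, heads, local_seq_len, dim_head), projected from this
    # rank's sequence shard only. with GQA, k / v have fewer heads than q,
    # which is what keeps the all gather (relatively) cheap.
    k_list = [torch.empty_like(k) for _ in range(world_size)]
    v_list = [torch.empty_like(v) for _ in range(world_size)]
    dist.all_gather(k_list, k.contiguous())
    dist.all_gather(v_list, v.contiguous())

    k_full = torch.cat(k_list, dim = -2)  # (batch, kv_heads, global_seq_len, dim_head)
    v_full = torch.cat(v_list, dim = -2)

    # causal mask in global coordinates, assuming a plain contiguous split
    # of the sequence across ranks (a zigzag split would change the q positions)
    local_len = q.shape[-2]
    q_pos = torch.arange(local_len, device = q.device) + rank * local_len
    kv_pos = torch.arange(k_full.shape[-2], device = q.device)
    causal_mask = q_pos[:, None] >= kv_pos[None, :]

    # enable_gqa (PyTorch 2.5+) handles kv_heads < q_heads; otherwise repeat the kv heads
    return F.scaled_dot_product_attention(q, k_full, v_full, attn_mask = causal_mask, enable_gqa = True)
```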
@dwromero made a bit of progress in the linked PR but out of steam, will resume tomorrow morning. feel free to leave comments on anything that doesn't look right
@dwromero alright, think i can knock out the rest this morning. you still there?
@dwromero think it is all there in 0.5.19, you can play around with it by running the
Wow cool! Thank you so much @lucidrains! 💪
@dwromero no problem. if you can get me some nvidia cloud compute, i can throw in the flex attention logic. but not a big priority for now
Hi @lucidrains,
I hope you are doing well. And thank you for yet another useful repo! :)
I was wondering if you have any plans to support the zigzag version of ring attention. It seems to distribute compute better in autoregressive settings and is quite hot at the moment (zhuzilin/ring-flash-attention#2). I could help with that if needed.
David