Reserve more eval memory and use ggml scratch buffers #116

juho-p · 2023-04-06T22:00:57Z

This attempts to fix #115

Not the brightest of code. Scratch buffer usage is imitated from llama.cpp, and so is the context memory reservation in evaluate (or at least that's the attempt). Note that I didn't go through everything in llama.cpp, only the parts related to context memory size, so I might be missing something.

I'm quite sure this also allocates a lot more memory for the 7B and 13B models, even though I never had any issues with them running out of context memory. Scratch buffers take 1GB, and that's for every InferenceSession.

Maybe it would make sense to use scratch buffers only when inferring with bigger models? Though I think llama.cpp always uses them. Also, if there are multiple sessions, scratch buffers are not shared between them, but they could be (as long as only one evaluate runs at a time, somehow).
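The sharing idea above could look roughly like the following. This is only a sketch under stated assumptions: `ScratchBuffers`, `Session`, and the closure-based `evaluate` are hypothetical stand-ins, not the llama-rs or ggml types, and the buffers are plain byte vectors. The point is just that a mutex around one shared allocation guarantees a single evaluation at a time.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for scratch memory; NOT the llama-rs or ggml type.
struct ScratchBuffers {
    buffers: Vec<Vec<u8>>,
}

impl ScratchBuffers {
    /// Allocates `count` zeroed buffers of `size_bytes` each.
    fn new(count: usize, size_bytes: usize) -> Self {
        Self {
            buffers: (0..count).map(|_| vec![0u8; size_bytes]).collect(),
        }
    }

    /// Total bytes held across all buffers.
    fn total_bytes(&self) -> usize {
        self.buffers.iter().map(|b| b.len()).sum()
    }
}

/// Hypothetical session that holds a handle to shared scratch memory
/// instead of owning its own copy.
struct Session {
    scratch: Arc<Mutex<ScratchBuffers>>,
}

impl Session {
    fn new(scratch: Arc<Mutex<ScratchBuffers>>) -> Self {
        Self { scratch }
    }

    /// Runs `f` with exclusive access to the scratch memory.
    /// Holding the lock for the whole call serializes evaluations
    /// across every session that shares the same buffers.
    fn evaluate<R>(&self, f: impl FnOnce(&mut ScratchBuffers) -> R) -> R {
        let mut guard = self.scratch.lock().expect("scratch mutex poisoned");
        f(&mut *guard)
    }
}
```

With this shape, any number of sessions cloned from the same `Arc` would pay for the scratch memory once, at the cost of evaluations in different sessions blocking each other.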

Anyway, with these changes I couldn't manage to run out of context memory any more, even with the 65B model.

ggml/src/lib.rs

philpax · 2023-04-07T14:33:30Z

Man, I do not enjoy the number of magic constants we're accumulating, but aside from the &mut thing I think this is fine. Does llama.cpp always reserve 1GB of scratch, even for 7B?

llama-rs/src/lib.rs

jon-chuang · 2023-04-12T10:19:48Z

Will this help with the "Context window full, stopping inference." error?

philpax · 2023-04-13T00:41:05Z

> Will this help with the "Context window full, stopping inference." error?

No, that's because the model has a fixed context limit of 2048 tokens. We're investigating ways to improve this in #77.
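To illustrate the distinction drawn here: the token limit is a property of the model, separate from how much eval memory is reserved. A minimal sketch of such a check, assuming a hypothetical `ContextWindow` tracker and `FeedError` type (neither is taken from llama-rs), might be:

```rust
/// Hypothetical error for illustration; not the llama-rs error type.
#[derive(Debug, PartialEq)]
enum FeedError {
    ContextFull,
}

/// Hypothetical tracker of how many tokens a session has consumed.
struct ContextWindow {
    /// Fixed limit set by the model (2048 in the comment above).
    n_ctx: usize,
    /// Tokens already fed to the model.
    n_past: usize,
}

impl ContextWindow {
    fn new(n_ctx: usize) -> Self {
        Self { n_ctx, n_past: 0 }
    }

    /// Accounts for `n_tokens` more tokens and returns the remaining room.
    /// Fails without changing state if they would exceed the window,
    /// no matter how much memory has been reserved.
    fn feed(&mut self, n_tokens: usize) -> Result<usize, FeedError> {
        if self.n_past + n_tokens > self.n_ctx {
            return Err(FeedError::ContextFull);
        }
        self.n_past += n_tokens;
        Ok(self.n_ctx - self.n_past)
    }
}
```

Reserving more memory changes nothing in this check, which is why the PR does not affect that message.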

juho-p force-pushed the fix-out-of-context-memory branch 2 times, most recently from 1e02c17 to b905791, April 6, 2023 22:10

philpax reviewed Apr 7, 2023

ggml/src/lib.rs (outdated, resolved)

philpax reviewed Apr 7, 2023

llama-rs/src/lib.rs (outdated, resolved)

juho-p force-pushed the fix-out-of-context-memory branch 2 times, most recently from 248fc8c to bd5480c, April 8, 2023 08:04

Reserve more eval memory and use ggml scratch buffers

d279371

juho-p force-pushed the fix-out-of-context-memory branch from bd5480c to d279371, April 8, 2023 19:30

philpax added this to the 0.1 milestone Apr 10, 2023

philpax added 2 commits April 13, 2023 02:16

Merge branch 'main' into fix-out-of-context-memory

b416898

refactor: improve docs + minor safety stuff

c48ab9f

philpax merged commit 5db8b4f into rustformers:main Apr 13, 2023
