
Apple M1 metal lag #1730

Closed
leedrake5 opened this issue Jun 7, 2023 · 19 comments

@leedrake5
Contributor

Prefacing that this isn't urgent. When using the recently added M1 GPU support, I see odd behavior in system resource use. When using all threads with -t 20, the first initialization follows the instruction. However, once there is a pause in GPU use, only about 4 threads are used regardless of the flag.

Video showing the response (Guanaco 65B) and system resource use: https://youtu.be/ysA7xg6nevY

LLAMA_METAL=1 make -j && ./main -m ./models/guanaco-65B.ggmlv3.q4_0.bin -b 8000  -n 25600 -ngl 1 -t 20 --repeat-penalty 1.1764705882352942 --top-p 0 --top-k 40 --temp 0.7 --repeat-last-n 256 -p "How did the computer company Apple get its start and become successful?"

Apologies for the cringe prompt, but I wanted to test accuracy (points for remembering Wayne was a founder, but the Apple Watch was released in 2014, not 2015). Some parameters (batch size) are odd, but the behavior is the same regardless of this value.

@j-f1
Collaborator

j-f1 commented Jun 7, 2023

What device are you running on? Unless it's an M1 Ultra, you should be running 10 or fewer threads.

@leedrake5
Contributor Author

It's an M1 Ultra. It runs great with 20 threads to start, but only runs on 4 threads after the first GPU use.

@ggerganov
Owner

The pause occurs when the context becomes full.

When this happens, we roughly pick the second half of the context and reprocess it in order to free up half the context for new generation. The reprocessing currently does not use Metal, as we haven't implemented efficient Matrix x Matrix kernels, so we simply fall back to the standard non-GPU implementation. It runs on the CPU, with the heavy matrix multiplications done via Apple Accelerate's CBLAS, which allegedly utilizes the AMX coprocessor. Therefore, the CPU is barely occupied during this period while the AMX does the heavy lifting. AMX utilization cannot be monitored with standard activity monitoring tools, so you won't see it in Activity Monitor.
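
To make the swap concrete, here is a rough sketch of what happens to the token window (illustrative Python only, not the actual llama.cpp code; n_keep stands for the number of initial prompt tokens that are always kept):

def context_swap(tokens, n_keep):
    # Keep the first n_keep tokens (the prompt prefix), drop the older
    # half of the remainder, and keep the newer half for continuity.
    rest = tokens[n_keep:]
    kept = rest[len(rest) // 2:]
    # The kept tokens must be re-evaluated to rebuild the KV cache.
    # That re-evaluation is the slow, CPU/AMX-bound step described above.
    return tokens[:n_keep] + kept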

@dogjamboree

Are there any future plans to address this issue? Or does it even seem fixable? I just bought an M2 Ultra with 128 GB of RAM hoping it would be a great solution for LLM inference, and this issue leaves me semi dead in the water -- nearly 10 tokens/second of generation averages out to less than 1 token/second once this is taken into account :(

@leedrake5
Contributor Author

leedrake5 commented Jul 14, 2023

@ggerganov First thank you for the explanation, and thank you for initiating such a remarkable project here.

@dogjamboree The latest builds of oobabooga/text-generation-ui address this performance issue. I recommend installing it and abetlen/llama-cpp-python to drive it. Make sure to install the latter with:

CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python

When loading the model in text-generation-ui, make sure to set n_gpus > 1 and use all your threads.

I'm getting 1421.63 tokens per second for sample time, 1.79 tokens per second for prompt eval time, and 7.35 tokens per second for full eval. The CPU lag between GPU "bursts" is almost gone. Note that the above metrics are for Guanaco 65B on an M1 Ultra. I suspect your chip will give a 20% or so speed boost over my baseline.
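
For reference, this is roughly the llama-cpp-python setup I mean (a sketch only; the path and numbers are placeholders, and the parameter names assume the usual Llama constructor):

from llama_cpp import Llama

llm = Llama(
    model_path="./models/guanaco-65B.ggmlv3.q4_0.bin",  # placeholder path
    n_gpu_layers=1,   # offload to Metal
    n_threads=20,     # all threads on an M1 Ultra
    n_ctx=2048,
)
out = llm("How did the computer company Apple get its start?", max_tokens=256)
print(out["choices"][0]["text"])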

@ggerganov
Owner

@leedrake5
How do you avoid the "context swap", as we call it? It is what causes the long pauses.

@dogjamboree
I think there might be a way to shift the context without slow recomputing: #2060
Last time we dug into this problem, we concluded that it is not possible to avoid this recomputation (#71), but now that I have gathered more insight into how the inference works, I think it might actually be possible to do it.

Worst case, when we implement a fast Metal qMatrix x qMatrix multiplication kernel, the "context swap" time should become much shorter. It's on the roadmap.
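
One way to picture "shifting the context without slow recomputing" (a rough conceptual sketch, not the actual proposal in #2060): keep the cache entries of the retained tokens and only adjust their positions, so nothing has to be re-evaluated:

def shift_positions(positions, n_keep, n_discard):
    # positions: cache positions of the tokens currently in the context.
    # Drop the n_discard tokens right after the kept prefix and slide
    # the rest down, instead of re-evaluating them from scratch.
    shifted = []
    for pos in positions:
        if pos < n_keep:
            shifted.append(pos)              # prompt prefix stays put
        elif pos < n_keep + n_discard:
            continue                         # these tokens are discarded
        else:
            shifted.append(pos - n_discard)  # the rest slides down
    return shifted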

@leedrake5
Contributor Author

@ggerganov

Not sure - I was able to generate 2k tokens with no interruption. Video here showing its performance, in contrast to the command-line resource pattern shown in the original post. The exact same 65B Guanaco is used in both instances. I'm a bit mystified - on the command line I get the context-swap pauses, but with text-generation-ui + llama-cpp-python they aren't visible in GPU/CPU utilization. So either text-generation-ui is passing some default that I am not using on the command line, or llama-cpp-python has a unique solution.

Here are the ggml_metal details in case there is anything useful:

ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/opt/homebrew/Caskroom/mambaforge/base/envs/textgen/lib/python3.10/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x5b9306d40
ggml_metal_init: loaded kernel_mul                            0x5b9306fa0
ggml_metal_init: loaded kernel_mul_row                        0x5b9307200
ggml_metal_init: loaded kernel_scale                          0x5b9307460
ggml_metal_init: loaded kernel_silu                           0x5b93076c0
ggml_metal_init: loaded kernel_relu                           0x5b9307920
ggml_metal_init: loaded kernel_gelu                           0x5b9307b80
ggml_metal_init: loaded kernel_soft_max                       0x28ee6f620
ggml_metal_init: loaded kernel_diag_mask_inf                  0x28ee7b3a0
ggml_metal_init: loaded kernel_get_rows_f16                   0x28ee7b600
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x28ee7b860
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x28ee7bac0
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x5b9307de0
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x5b93081a0
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x5b9308560
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x5b9308a40
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x5b9308f20
ggml_metal_init: loaded kernel_rms_norm                       0x5b9309430
ggml_metal_init: loaded kernel_norm                           0x5b9309940
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x28ee7c060
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x28ee7c2c0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x28ee7c640
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x28ee7cb80
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x5b9309f00
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x5b930a560
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x5b930aaa0
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x5b930afe0
ggml_metal_init: loaded kernel_rope                           0x5b930b930
ggml_metal_init: loaded kernel_alibi_f32                      0x5b930c050
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x5b930c740
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x5b930ce30
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x5b930d520
ggml_metal_init: recommendedMaxWorkingSetSize = 98304.00 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
llama_new_context_with_model: max tensor size =   140.62 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size = 35026.75 MB, (35027.14 / 98304.00)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =  1536.00 MB, (36563.14 / 98304.00)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =  5122.00 MB, (41685.14 / 98304.00)
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =  1024.00 MB, (42709.14 / 98304.00)
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =  1024.00 MB, (43733.14 / 98304.00)

@ggerganov
Owner

The command-line tool uses a context of 512 tokens by default. You can increase this up to 2048 by using the -c command-line argument. For example, run the same command and add: -c 2048 at the end.

Still, when the 2048-token context becomes full, it will do the swap and cause a pause.

@AlphaAtlas

@ggerganov llama-cpp-python (which text gen ui uses) implements the additional caching:

https://github.com/abetlen/llama-cpp-python/blob/main/llama_cpp/llama.py#L865

@leedrake5
Contributor Author

@ggerganov Yup - that definitely opens up a lot more GPU usage, but there's no free lunch: caching takes much longer for the exact same reason.

I am curious whether the point @AlphaAtlas makes about the self.cache property of llama-cpp-python is a workaround for caching in general, though this is far from my area of expertise. It looks like the definition of Llama.longest_token_prefix is important - maybe they just check system resources and adjust the -c flag themselves? Then, by capping responses at that limit (even if --ignore-eos is present), they could essentially calibrate system use automatically.
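
If I understand it right, the general idea behind Llama.longest_token_prefix and self.cache is prefix matching: the cache is keyed by previously evaluated token sequences, and a new prompt reuses the cached state that shares the longest prefix, so only the remaining suffix has to be evaluated. A rough sketch of that idea (illustrative only, not their code):

def longest_token_prefix(a, b):
    # Number of leading tokens the two sequences have in common.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def best_cached_state(cache, prompt_tokens):
    # cache maps tuples of tokens to saved model state; pick the entry
    # with the longest shared prefix so only the suffix is re-evaluated.
    best_key, best_len = None, 0
    for key in cache:
        n = longest_token_prefix(key, prompt_tokens)
        if n > best_len:
            best_key, best_len = key, n
    return best_key, best_len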

@sukualam

sukualam commented Sep 4, 2023

I still get an error on Metal (RX 560, macOS Ventura 13.4):

ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x7f9fcd80d2e0 | th_max = 768 | th_width = 64
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x7f9fcd80dac0 | th_max = 1024 | th_width = 64
ggml_metal_init: loaded kernel_mul_mm_f16_f32 0x0 | th_max = 0 | th_width = 0
ggml_metal_init: load pipeline error: Error Domain=CompilerError Code=2 "SC compilation failure
There is a call to an undefined label" UserInfo={NSLocalizedDescription=SC compilation failure
There is a call to an undefined label}
llama_new_context_with_model: ggml_metal_init() failed
llama_init_from_gpt_params: error: failed to create context with model './models/falcon-7b-Q4_0-GGUF.gguf'
main: error: unable to load model

@mayulu

mayulu commented Sep 8, 2023


I also still get the error on Metal (Intel CPU, AMD 5500M, macOS Ventura 13.5.1, llama-cpp-python 0.1.83).
Exactly the same error as above.

@ZacharyDK

ZacharyDK commented Sep 19, 2023

llm_load_print_meta: model size = 13.02 B
llm_load_print_meta: general.name = codellama_codellama-13b-instruct-hf
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.12 MB
llm_load_tensors: mem required = 7024.12 MB (+ 1600.00 MB per state)
...................................................................................................
llama_new_context_with_model: kv self size = 1600.00 MB
ggml_metal_init: allocating
ggml_metal_init: found device: Intel(R) UHD Graphics 630
ggml_metal_init: found device: AMD Radeon Pro 5600M
ggml_metal_init: picking default device: AMD Radeon Pro 5600M
ggml_metal_init: loading '/Users/zacharykolansky/miniforge3/envs/llama/lib/python3.9/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x7f9d98a0d730 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_add_row 0x7f9d98823170 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul 0x7f9d98a0e340 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_row 0x7f9d98823be0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_scale 0x7f9d98a0ef50 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_silu 0x7f9d988247f0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_relu 0x7f9d98a0fb60 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_gelu 0x7f9d98a10770 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_soft_max 0x7f9d98825400 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_soft_max_4 0x7f9d9b50dcc0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_diag_mask_inf 0x7f9d9b50e730 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_diag_mask_inf_8 0x7f9d98a11380 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_f32 0x7f9d98a11f90 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_f16 0x7f9d98826010 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_0 0x7f9d98a12ba0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_1 0x7f9d98a137b0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q8_0 0x7f9d9b7c6150 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q2_K 0x7f9d98a143c0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q3_K 0x7f9d9b50f4e0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q4_K 0x7f9d9b406000 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q5_K 0x7f9d9b407ed0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_get_rows_q6_K 0x7f9d9b409260 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_rms_norm 0x7f9d9b409e70 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_norm 0x7f9d9b40aa80 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_f32_f32 0x7f9d9b5100f0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x7f9d98a14e30 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_f16_f32_1row 0x7f9d98a15a40 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_f16_f32_l4 0x7f9d98a167d0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x7f9d98826c20 | th_max = 896 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x7f9d98b5f210 | th_max = 896 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q8_0_f32 0x7f9d98b5f950 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x7f9d98b603c0 | th_max = 576 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x7f9d98b60e30 | th_max = 512 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x7f9d98b618a0 | th_max = 576 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x7f9d98b624b0 | th_max = 512 | th_width = 32
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x7f9d98b630c0 | th_max = 1024 | th_width = 32
ggml_metal_init: loaded kernel_mul_mm_f32_f32 0x0 | th_max = 0 | th_width = 0
ggml_metal_init: load pipeline error: Error Domain=CompilerError Code=2 "SC compilation failure
There is a call to an undefined label" UserInfo={NSLocalizedDescription=SC compilation failure
There is a call to an undefined label}
llama_new_context_with_model: ggml_metal_init() failed
Traceback (most recent call last):
File "/Users/zacharykolansky/miniforge3/envs/llama/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/Users/zacharykolansky/miniforge3/envs/llama/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/Users/zacharykolansky/miniforge3/envs/llama/lib/python3.9/site-packages/llama_cpp/server/main.py", line 96, in
app = create_app(settings=settings)
File "/Users/zacharykolansky/miniforge3/envs/llama/lib/python3.9/site-packages/llama_cpp/server/app.py", line 337, in create_app
llama = llama_cpp.Llama(
File "/Users/zacharykolansky/miniforge3/envs/llama/lib/python3.9/site-packages/llama_cpp/llama.py", line 350, in init
assert self.ctx is not None
AssertionError


Any help would be appreciated.


So for some reason we have the line:

ggml_metal_init: loaded kernel_mul_mm_f32_f32 0x0 | th_max = 0 | th_width = 0

Why is it trying to load a nullptr? That would explain why it all fails. The relevant error comes from ggml-metal.m, line 209. There is a lot of macro magic. I'm not sure why GGML_METAL_ADD_KERNEL(cpy_f32_f32); returns null, or why rope, alibi, and f32_f16 are skipped.

@mounta11n
Contributor

mounta11n commented Sep 21, 2023

@ZacharyDK It is probably the wrong file format (codellama_codellama-13b-instruct-hf). The HF format is not supported by llama.cpp; you have to look for GGUF instead. The easiest way is to search TheBloke's Hugging Face page – here is the same model in GGUF format:

TheBloke/CodeLlama-13B-Instruct-GGUF

@rse

rse commented Sep 23, 2023

I've downloaded the Llama 2 model from scratch and converted and quantized it with:

python3 convert.py --outfile models/7B/gguf-llama2-f16.bin --outtype f16 ../llama/llama-2-7b --vocab-dir ../llama/llama-2-7b
./quantize ./models/7B/gguf-llama2-f16.bin ./models/7B/gguf-llama2-q4_0.bin q4_0

Then I got exactly the same error on an Apple iMac (Intel):

$ ./main -m ./models/7B/gguf-llama2-q4_0.bin -n 1024 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt 
[...]
ggml_metal_init: loaded kernel_mul_mm_f32_f32                         0x0 | th_max =    0 | th_width =    0
ggml_metal_init: load pipeline error: Error Domain=CompilerError Code=2 "SC compilation failure
There is a call to an undefined label" UserInfo={NSLocalizedDescription=SC compilation failure
There is a call to an undefined label}
llama_new_context_with_model: ggml_metal_init() failed
llama_init_from_gpt_params: error: failed to create context with model './models/7B/gguf-llama2-q4_0.bin'
main: error: unable to load model

When I disable GPU usage with "--gpu-layers 0", the exact same model works just fine. So the "There is a call to an undefined label" problem and the null have something to do with Metal/GPU support.

@mounta11n
Contributor

mounta11n commented Sep 25, 2023

Ah yes, this is mentioned in #3129 (comment) as well. One workaround is to disable Metal and enable CLBlast, which not only gives you GPU acceleration (in my case, 20x faster loading times on an Intel i5 iMac) but also keeps the ability to offload layers to the GPU.

@ZacharyDK


@mounta11n I was using the quantized models, loaded with the llama_cpp Python library. Running without quantization is too slow.

Unless something was fixed recently, you have to stick with C++. llama-cpp-python will keep trying to use Metal even when you specify that you don't want Metal in the install settings...

@ggerganov
Owner

The original issue posted here has been resolved via #3228

@dogjamboree


Thanks for your reply, @leedrake5. That's quite impressive and I'm going to try it right now!
