- Could it be OS related, like running out of VRAM and the driver falling back to system RAM?
- I don't know if it's a bug or just a quirk, but I get surprising results when benchmarking the model while offloading different numbers of layers to my AMD GPU with Vulkan. As expected, performance scales nicely with the number of layers offloaded, up to the 31st one. Then prompt processing performance drops off abruptly. The same thing happens with token generation performance at 37 layers offloaded.
```
.\build\bin\Release\llama-bench.exe -m .\models\Mistral\nemo\ggml-model-Q4_K_M.gguf -ngl 20,25,30,31,32,34,36,37,38,40,41 -t 12
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 5700 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64
```
Does anyone know what could be causing such an abrupt drop in performance?
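A quick way to sanity-check the "VRAM exhaustion" hypothesis is a back-of-envelope estimate of how much memory each offloaded layer adds. The sketch below is illustrative only: the model size, layer count, and overhead figures are assumptions (a Q4_K_M 12B-class model is roughly 7 GB, and the RX 5700 XT has 8 GB of VRAM), not measured values.

```python
# Back-of-envelope VRAM estimate for partial layer offload.
# All sizes are ASSUMPTIONS for illustration, not measurements.
MODEL_GB = 7.1      # assumed Q4_K_M model weight size
N_LAYERS = 40       # assumed transformer layer count
VRAM_GB = 8.0       # RX 5700 XT VRAM
OVERHEAD_GB = 1.5   # assumed KV cache + compute buffers + driver use

PER_LAYER_GB = MODEL_GB / N_LAYERS

def vram_needed(ngl: int) -> float:
    """Estimated VRAM (GB) with ngl layers offloaded."""
    return ngl * PER_LAYER_GB + OVERHEAD_GB

for ngl in (20, 30, 31, 37, 40):
    need = vram_needed(ngl)
    status = "fits" if need <= VRAM_GB else "may spill to system RAM"
    print(f"ngl={ngl:2d}: ~{need:.1f} GB ({status})")
```

If the estimate crosses the VRAM limit near the layer counts where the benchmark falls off a cliff, that supports the spill-to-RAM explanation; measuring actual usage with a GPU monitor while the benchmark runs would confirm it.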