When using the GPU, is the model loaded into VRAM? #1808
-
Jan uses the llama.cpp engine, so I will assume you are running a GGUF-quantized model. You mentioned trying to run large models, so I will also assume the model is around ~65B parameters. If those assumptions are wrong, please provide more information.

GGUF models run primarily on the CPU and system RAM, and only offload part of the work to the GPU. A ~65B GGUF model needs roughly ~38.5 GB of RAM for its weights alone, so you have most likely exceeded your 32 GB of RAM.

Since you mentioned VRAM: GGUF's use of the CPU and RAM as the primary processor and memory contrasts with GPU-oriented quantization formats such as GPTQ and AWQ, which use the GPU and VRAM as the primary processor and memory.
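As a rough sanity check on the ~38.5 GB figure, here is a small back-of-the-envelope sketch. The ~4.66 bits-per-weight value is my assumption for a typical Q4_K_M quantization, not something stated in this thread, and real usage is higher once you add the KV cache and runtime overhead:

```python
# Rough estimate of the memory a quantized GGUF model needs just for its
# weights (KV cache and runtime overhead come on top of this).

def gguf_weight_memory_gb(n_params_billion: float, bits_per_weight: float = 4.66) -> float:
    """Approximate weight memory in decimal GB for a quantized model.

    bits_per_weight ~4.66 roughly corresponds to a Q4_K_M quantization
    (an assumed value, not one taken from this thread).
    """
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

print(f"~65B model: {gguf_weight_memory_gb(65):.1f} GB")  # ~37.9 GB, close to the ~38.5 GB above
print(f"~7B model:  {gguf_weight_memory_gb(7):.1f} GB")   # ~4.1 GB, fits comfortably in 32 GB RAM
```

For the VRAM side of the question: with llama.cpp you choose how many transformer layers are offloaded to the GPU (the `-ngl` / `n_gpu_layers` setting). Only the offloaded layers have to fit in your 8 GB of VRAM; the rest stays in system RAM, which is why a fully CPU/RAM run can succeed where an aggressive GPU offload fails.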
-
Can someone please explain how this works? I have 32 GB of RAM and 8 GB of VRAM. When I use GPU acceleration, I can't run large models, but when I don't use GPU acceleration, I can run them.