When using the GPU, is the model loaded into VRAM? #1808
-
Jan uses the llama.cpp engine, so I will assume you are running a GGUF-quantized model. You mentioned trying to run large models, so I will also assume the model is around ~65B parameters. If those assumptions are wrong, please provide more information.

GGUF models run primarily on the CPU and system RAM, and only offload part of the work to the GPU. A ~65B GGUF model needs roughly ~38.5 GB of RAM for its weights alone, so you have most likely exceeded your 32 GB of RAM.

Since you mentioned VRAM: GGUF's use of the CPU and RAM as the primary processor and memory contrasts with GPU-oriented quantization formats such as GPTQ and AWQ, which use the GPU and VRAM as the primary processor and memory.
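As a rough sanity check on the ~38.5 GB figure, here is a small back-of-the-envelope sketch. The ~4.66 bits-per-weight value is my assumption for a typical Q4_K_M quantization, not something stated in this thread, and real usage is higher once you add the KV cache and runtime overhead:

```python
# Rough estimate of the memory a quantized GGUF model needs just for its
# weights (KV cache and runtime overhead come on top of this).

def gguf_weight_memory_gb(n_params_billion: float, bits_per_weight: float = 4.66) -> float:
    """Approximate weight memory in decimal GB for a quantized model.

    bits_per_weight ~4.66 roughly corresponds to a Q4_K_M quantization
    (an assumed value, not one taken from this thread).
    """
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

print(f"~65B model: {gguf_weight_memory_gb(65):.1f} GB")  # ~37.9 GB, close to the ~38.5 GB above
print(f"~7B model:  {gguf_weight_memory_gb(7):.1f} GB")   # ~4.1 GB, fits comfortably in 32 GB RAM
```

For the VRAM side of the question: with llama.cpp you choose how many transformer layers are offloaded to the GPU (the `-ngl` / `n_gpu_layers` setting). Only the offloaded layers have to fit in your 8 GB of VRAM; the rest stays in system RAM, which is why a fully CPU/RAM run can succeed where an aggressive GPU offload fails.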
-
Can someone please explain how this works? I have 32 GB of RAM and 8 GB of VRAM. When I use GPU acceleration, I can't run large models, but when I don't use GPU acceleration, I can run them.