(Optimization of LLM inference) Does Intel OpenVINO support offloading LLM models, allowing some layers to remain on the SSD while loading the main layers into RAM during inference computation? #2533

hsulin0806 · 2024-11-19T01:57:02Z

Functional discussion for this project.
notebooks/llm-chatbot

Intel's official documentation confirms support for Ollama: https://www.intel.com.tw/content/www/tw/zh/content-details/826081/running-ollama-with-open-webui-on-intel-hardware-platform.html

In Ollama's GitHub documentation: https://github.com/ollama/ollama/blob/main/docs/faq.md, it describes:

100% GPU: The model is fully loaded into the GPU.
100% CPU: The model is fully loaded into system memory.
48%/52% CPU/GPU: The model is split between the GPU and system memory.
Ollama is powered by llama.cpp, which supports the --gpu-layers parameter to distribute model layers between VRAM and RAM, reducing GPU memory pressure.
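
To make the split concrete, here is a toy sketch of the arithmetic behind a layer split. It is not Ollama's or llama.cpp's actual algorithm; the function names, the uniform per-layer size, and the byte budgets are all assumptions made for illustration.

```python
def split_layers(n_layers, layer_bytes, vram_budget_bytes):
    """Return (gpu_layers, cpu_layers) for a hypothetical model.

    Assumes every layer has the same size, which real models only
    approximate. Layers that do not fit in the VRAM budget stay in RAM.
    """
    if n_layers <= 0 or layer_bytes <= 0 or vram_budget_bytes < 0:
        raise ValueError("sizes must be positive and the budget non-negative")
    gpu_layers = min(n_layers, vram_budget_bytes // layer_bytes)
    return gpu_layers, n_layers - gpu_layers


def percent_split(gpu_layers, cpu_layers):
    """Return (cpu_percent, gpu_percent), rounded to whole percents."""
    total = gpu_layers + cpu_layers
    gpu_pct = round(100 * gpu_layers / total)
    return 100 - gpu_pct, gpu_pct


if __name__ == "__main__":
    # Hypothetical numbers: 25 layers of 100 bytes, room for 1300 bytes in VRAM.
    gpu, cpu = split_layers(n_layers=25, layer_bytes=100, vram_budget_bytes=1300)
    cpu_pct, gpu_pct = percent_split(gpu, cpu)
    print(f"{cpu_pct}%/{gpu_pct}% CPU/GPU")
```

The value passed to --gpu-layers plays the role of `gpu_layers` here: the remaining layers are evaluated from system memory.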

However, when the CPU handles inference, the entire model is loaded into RAM. Would it be possible for OpenVINO to add a parameter or feature that offloads some model layers to an SSD, using it as temporary storage? This would reduce RAM usage and make inference more practical on resource-limited systems.
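
The idea being requested can be illustrated with a memory-mapped weight file. The sketch below is not an OpenVINO API; it uses only the Python standard library, and the file layout, `LAYER_FLOATS`, and the element-wise "layer" are hypothetical stand-ins. It shows the general technique: weights stay on disk, and the operating system pages each layer into RAM only when it is read.

```python
import mmap
import os
import struct
import tempfile

LAYER_FLOATS = 4                 # floats per toy "layer" (assumption)
LAYER_BYTES = LAYER_FLOATS * 4   # float32 is 4 bytes


def write_layers(path, layers):
    """Store each layer as LAYER_FLOATS little-endian float32 values."""
    with open(path, "wb") as f:
        for layer in layers:
            f.write(struct.pack(f"<{LAYER_FLOATS}f", *layer))


def read_layer(mm, index):
    """Decode one layer directly from the mapped file."""
    return struct.unpack_from(f"<{LAYER_FLOATS}f", mm, index * LAYER_BYTES)


def forward(path, x):
    """Toy forward pass: multiply x element-wise by each layer in turn.

    Only the pages touched by read_layer need to be resident in RAM;
    the rest of the file can remain on the SSD.
    """
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for i in range(len(mm) // LAYER_BYTES):
            w = read_layer(mm, i)
            x = tuple(wi * xi for wi, xi in zip(w, x))
    return x


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "weights.bin")
    write_layers(path, [(1.0, 2.0, 3.0, 4.0), (0.5, 0.5, 0.5, 0.5)])
    print(forward(path, (1.0, 1.0, 1.0, 1.0)))
```

The trade-off is latency: every layer that is not already cached must be read from storage on each pass, so this approach lowers peak RAM use at the cost of throughput.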

brmarkus · 2024-11-19T07:49:53Z

Besides CPU, GPU, NPU, (VPU, FPGA), AUTO and MULTI, have you tried to experiment with HETERO (see "https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes/hetero-execution.html")?
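
For anyone who wants to try this, a minimal sketch of compiling a model in HETERO mode with the OpenVINO Python API follows. The helper names and the model path are placeholders, and the device list assumes a GPU is present. Note that, as the linked page describes it, HETERO distributes parts of a model across compute devices; nothing there indicates that it uses disk as backing storage.

```python
def hetero_device(priorities):
    """Build a HETERO device string, e.g. ["GPU", "CPU"] -> "HETERO:GPU,CPU".

    Devices are listed in priority order: operations are assigned to the
    first listed device that supports them.
    """
    if not priorities:
        raise ValueError("at least one device is required")
    return "HETERO:" + ",".join(priorities)


def compile_hetero(model_path, priorities):
    """Read a model and compile it for heterogeneous execution.

    openvino is imported lazily so the helper above can be used
    without the package installed.
    """
    import openvino as ov

    core = ov.Core()
    model = core.read_model(model_path)
    return core.compile_model(model, hetero_device(priorities))


if __name__ == "__main__":
    # "model.xml" is a placeholder path to an OpenVINO IR file.
    print(hetero_device(["GPU", "CPU"]))
    # compiled = compile_hetero("model.xml", ["GPU", "CPU"])
```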

hsulin0806 · 2024-11-25T01:53:04Z

> Besides CPU, GPU, NPU, (VPU, FPGA), AUTO and MULTI, have you tried to experiment with HETERO (see "https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes/hetero-execution.html")?

Hi, thank you for your response.

Does HETERO mode allow part of the model that would otherwise sit in RAM to be kept on an SSD, reducing RAM usage? If this is not available, are there any plans to support using an SSD as overflow storage for RAM?
