Describe the bug
When I try to quantize llama3.1-8B-Instruct on a g5.2xlarge AWS instance, saving the model fails with the OOM killer terminating the process.
Expected behavior
A successful execution.
Environment
Include all relevant environment information:
OS: nvcr.io/nvidia/pytorch:23.10-py3
Python version: 3.10.12
LLM Compressor version or commit hash: 0.2.0
ML framework version(s): torch 2.4.0+cu121
Other Python package versions: compressed-tensors (any version)
Other relevant environment information [e.g. hardware, CUDA version]: g5.2xlarge, g5.4xlarge, g5.8xlarge AWS instances.
To Reproduce
Exact steps to reproduce the behavior:
Start a g5.2xlarge instance with the nvcr.io/nvidia/pytorch:23.10-py3 container and run this script (see the sketch below).
Note: g5.4xlarge, g5.8xlarge are also affected.
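The script itself is not attached to this report. For reference, this is roughly its shape: a minimal sketch following the standard llm-compressor 0.2.0 one-shot GPTQ example, where the quantization scheme, calibration dataset, and sample counts are illustrative rather than the exact values used.

```python
# Hypothetical reproduction sketch -- mirrors the standard llm-compressor
# one-shot GPTQ example; scheme, dataset, and calibration sizes are assumptions.
from transformers import AutoTokenizer
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"

model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# One-shot GPTQ quantization of all Linear layers except the LM head.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

# Saving the compressed checkpoint is the step where the OOM kill is observed.
SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```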
Additional context
The spike in memory consumption is somewhat mysterious, since it is caused by the kernel.
This is an example output of the free command prior to saving the model:
               total        used        free      shared  buff/cache   available
Mem:            30Gi        12Gi       635Mi        20Mi        18Gi        18Gi
Swap:             0B          0B          0B
The cause does not seem to be the saving process per se, but some kernel caching behaviour: the crash can also be triggered by running the quantization several times consecutively within a single process.
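To test the kernel-caching hypothesis, the reclaimable page cache can be dropped between runs. This is a diagnostic sketch only (requires root), not something from the original run:

```python
# Diagnostic sketch (requires root): drop the kernel page cache between
# consecutive quantization runs to check whether the reclaimable cache is
# what drives the process into the OOM killer.
import os

os.sync()  # flush dirty pages first so the caches can actually be dropped
with open("/proc/sys/vm/drop_caches", "w") as f:
    f.write("3\n")  # 3 = free pagecache + dentries + inodes
```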
This is how the amount of free memory changes during the script execution.
This is an example readout of /proc/meminfo prior to compressing the model and causing the OOM kill.
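For completeness, a small helper of this kind can be used to sample /proc/meminfo while the script runs (a sketch; the field selection and sampling interval are arbitrary choices, not taken from the original run):

```python
# Periodically sample /proc/meminfo while the quantization script runs,
# to see how MemFree / Buffers / Cached evolve over time.
import threading
import time

FIELDS = ("MemTotal", "MemFree", "MemAvailable", "Buffers", "Cached")

def read_meminfo():
    """Return the selected /proc/meminfo fields as a dict of kB values."""
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            if key in FIELDS:
                values[key] = int(rest.strip().split()[0])  # value is in kB
    return values

def log_meminfo(interval_s=10.0, stop_event=None):
    """Print the sampled fields every interval_s seconds until stopped."""
    while stop_event is None or not stop_event.is_set():
        sample = read_meminfo()
        print(" ".join(f"{k}={v // 1024}MiB" for k, v in sample.items()))
        time.sleep(interval_s)

# Usage: start as a daemon thread before calling oneshot()/save_pretrained().
stop = threading.Event()
threading.Thread(target=log_meminfo, args=(10.0, stop), daemon=True).start()
```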
I successfully quantized the model on a separate system.
PS: I might update this issue if I investigate this behaviour further.