Describe the bug
When I try to quantize llama3.1-8B-Instruct on a g5.2xlarge AWS instance, saving the model fails with the OOM killer terminating the process.
Expected behavior
A successful execution.
Environment
Include all relevant environment information:
OS: nvcr.io/nvidia/pytorch:23.10-py3
Python version: 3.10.12
LLM Compressor version or commit hash: 0.2.0
ML framework version(s): torch 2.4.0+cu121
Other Python package versions: compressed-tensors (any version)
Other relevant environment information [e.g. hardware, CUDA version]: g5.2xlarge, g5.4xlarge, g5.8xlarge AWS instances.
To Reproduce
Exact steps to reproduce the behavior:
Start a g5.2xlarge instance with the nvcr.io/nvidia/pytorch:23.10-py3 container and run this script (see the sketch below).
Note: g5.4xlarge, g5.8xlarge are also affected.
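The script itself is not attached to this report. For reference, this is roughly its shape: a minimal sketch following the standard llm-compressor 0.2.0 one-shot GPTQ example, where the quantization scheme, calibration dataset, and sample counts are illustrative rather than the exact values used.

```python
# Hypothetical reproduction sketch -- mirrors the standard llm-compressor
# one-shot GPTQ example; scheme, dataset, and calibration sizes are assumptions.
from transformers import AutoTokenizer
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"

model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# One-shot GPTQ quantization of all Linear layers except the LM head.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

# Saving the compressed checkpoint is the step where the OOM kill is observed.
SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```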
Additional context
The spike in memory consumption is somewhat mysterious, since it is caused by the kernel.
This is an example output of the free command prior to saving the model:
               total        used        free      shared  buff/cache   available
Mem:            30Gi        12Gi       635Mi        20Mi        18Gi        18Gi
Swap:             0B          0B          0B
The cause does not seem to be the saving process per se, but some kernel caching behaviour: the crash can also be triggered by running the quantization several times consecutively within a single process.
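To test the kernel-caching hypothesis, the reclaimable page cache can be dropped between runs. This is a diagnostic sketch only (requires root), not something from the original run:

```python
# Diagnostic sketch (requires root): drop the kernel page cache between
# consecutive quantization runs to check whether the reclaimable cache is
# what drives the process into the OOM killer.
import os

os.sync()  # flush dirty pages first so the caches can actually be dropped
with open("/proc/sys/vm/drop_caches", "w") as f:
    f.write("3\n")  # 3 = free pagecache + dentries + inodes
```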
This is how the amount of free memory changes during the script execution.
This is an example readout of /proc/meminfo prior to compressing the model and causing the OOM kill.
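For completeness, a small helper of this kind can be used to sample /proc/meminfo while the script runs (a sketch; the field selection and sampling interval are arbitrary choices, not taken from the original run):

```python
# Periodically sample /proc/meminfo while the quantization script runs,
# to see how MemFree / Buffers / Cached evolve over time.
import threading
import time

FIELDS = ("MemTotal", "MemFree", "MemAvailable", "Buffers", "Cached")

def read_meminfo():
    """Return the selected /proc/meminfo fields as a dict of kB values."""
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            if key in FIELDS:
                values[key] = int(rest.strip().split()[0])  # value is in kB
    return values

def log_meminfo(interval_s=10.0, stop_event=None):
    """Print the sampled fields every interval_s seconds until stopped."""
    while stop_event is None or not stop_event.is_set():
        sample = read_meminfo()
        print(" ".join(f"{k}={v // 1024}MiB" for k, v in sample.items()))
        time.sleep(interval_s)

# Usage: start as a daemon thread before calling oneshot()/save_pretrained().
stop = threading.Event()
threading.Thread(target=log_meminfo, args=(10.0, stop), daemon=True).start()
```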
I successfully quantized the model on a separate system.
PS: I might update this issue if I investigate this behaviour further.