[Bugfix] Support model offloading SparseGPTQ #918

kylesayrs · 2024-11-16T00:09:28Z

Purpose

- Fix a bug that occurred when running calibration with offloaded models
- Add offloading support to SparseGPTQ
- Add logging to SparseGPTQ to validate the achieved sparsity

Changes

Update the logic that determines which device calibration inputs are placed on
- The new logic accounts for offloading. Previously, if the model was offloaded, the logic attempted to put the inputs on the meta device, which resulted in the following error:

NotImplementedError: Cannot copy out of meta tensor; no data!
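
The device-selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `pick_input_device` and its arguments are hypothetical, and plain strings stand in for `torch.device` values. It assumes that an offloaded module keeps its parameters on the meta device and that its offloading hook records the device the module actually executes on.

```python
from typing import Optional, Sequence


def pick_input_device(
    param_devices: Sequence[str],
    execution_device: Optional[str] = None,
) -> str:
    """Choose where to place calibration inputs (hypothetical helper).

    param_devices: devices the model's parameters currently live on.
    execution_device: device recorded by the offloading hook, if any
        (assumed to mirror the hook's execution device attribute).
    """
    # Offloaded model: parameters are placeholders on "meta", so inputs must
    # go to the device the hook onloads weights to instead.
    if execution_device is not None:
        return execution_device

    # Non-offloaded model: follow the first parameter that holds real data.
    for device in param_devices:
        if device != "meta":
            return device

    raise ValueError("all parameters are on meta and no execution device is known")


print(pick_input_device(["cuda:0", "cuda:0"]))        # regular model
print(pick_input_device(["meta", "meta"], "cuda:0"))  # offloaded model
```

The point of the sketch is only that the meta device is never returned as a target for the inputs.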

Add weight onloading and offloading to SparseGPTQ

if is_module_offloaded(self.layer):
    self.layer._hf_hook.pre_forward(self.layer)
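
The snippet above shows only the onloading half. One way to pair it with the matching offload step is a context manager, sketched below under stated assumptions: `RecordingHook`, `Layer`, and `onloaded` are hypothetical names used for illustration, and the stand-in hook only records calls where a real offloading hook would move weights in `pre_forward` and `post_forward`.

```python
from contextlib import contextmanager


class RecordingHook:
    """Hypothetical stand-in for an offloading hook; it only records calls."""

    def __init__(self):
        self.calls = []

    def pre_forward(self, module, *args, **kwargs):
        # A real hook would move the offloaded weights onto the execution device.
        self.calls.append("pre_forward")
        return args, kwargs

    def post_forward(self, module, output):
        # A real hook would move the weights back off the execution device.
        self.calls.append("post_forward")
        return output


class Layer:
    """Hypothetical layer; `_hf_hook` is only set when the layer is offloaded."""

    def __init__(self, hook=None):
        if hook is not None:
            self._hf_hook = hook


@contextmanager
def onloaded(layer):
    """Onload weights for the duration of the block, then offload again."""
    hook = getattr(layer, "_hf_hook", None)
    if hook is not None:
        hook.pre_forward(layer)
    try:
        yield layer
    finally:
        # Runs even if compression raises, so weights are not left onloaded.
        if hook is not None:
            hook.post_forward(layer, None)


hook = RecordingHook()
with onloaded(Layer(hook)):
    pass  # compression would read and update the weights here
print(hook.calls)
```

This sketch does not show how modified weights are written back to the offload storage; a real implementation would presumably need that step for the compressed weights to persist after offloading.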

Add sparsity logging to SparseGPTQ for algorithm validation

===== Compressing layer 110/113 to sparsity 0.5 =====
2024-11-15T23:56:01.649029+0000 | compress_module | INFO - Compressing model.layers.15.mlp.gate_proj.model.layers.15.mlp.gate_proj...
2024-11-15T23:56:02.016416+0000 | compress | INFO - time 0.36
2024-11-15T23:56:02.016682+0000 | compress | INFO - error 39350.72
2024-11-15T23:56:02.072700+0000 | compress | INFO - sparsity 0.50
2024-11-15T23:56:02.094924+0000 | apply_compression | INFO -
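
The `sparsity` value in the log above is presumably the fraction of weights that are exactly zero after compression. A minimal sketch of that calculation, assuming a plain nested list in place of a weight tensor (the `sparsity` helper and the toy matrix `W` are hypothetical):

```python
from typing import Sequence


def sparsity(weight: Sequence[Sequence[float]]) -> float:
    """Fraction of entries in a 2-D weight matrix that are exactly zero."""
    total = sum(len(row) for row in weight)
    zeros = sum(1 for row in weight for value in row if value == 0.0)
    return zeros / total


# Toy 2x4 matrix with half of its entries pruned.
W = [
    [0.0, 1.5, 0.0, -0.3],
    [2.1, 0.0, 0.7, 0.0],
]
print("sparsity %.2f" % sparsity(W))  # prints "sparsity 0.50"
```

A logged value close to the target of 0.5 suggests the layer was pruned as requested.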

Testing

llama_example.py

from accelerate import cpu_offload
from datasets import load_dataset
from transformers import AutoTokenizer

from llmcompressor.modifiers.obcq import SparseGPTModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot

# Select model and load it.
MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct"

model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="cuda:0",
    torch_dtype="auto",
)
cpu_offload(model)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Select calibration dataset.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 2  # 512
MAX_SEQUENCE_LENGTH = 2048

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))


def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }


ds = ds.map(preprocess)


# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


ds = ds.map(tokenize, remove_columns=ds.column_names)

# Configure the sparsification algorithm to run.
#   * prune the weights of all Linear layers except lm_head to 50% sparsity
#     with SparseGPT
recipe = SparseGPTModifier(targets="Linear", sparsity=0.5, ignore=["lm_head"])

# Apply algorithms.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Confirm generations of the sparsified model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")

# Save to disk compressed. The directory suffix reflects the 50% sparsity recipe.
SAVE_DIR = MODEL_ID.split("/")[1] + "-sparse-0.5"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)

Signed-off-by: Kyle Sayers <[email protected]>

github-actions · 2024-11-16T00:09:39Z

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

enable offloading, fix offloading bug, add log

66b6627

Signed-off-by: Kyle Sayers <[email protected]>

kylesayrs changed the title ~~[Bugfix] Fix offload~~ [Bugfix] Support model offloading Nov 16, 2024

kylesayrs changed the title ~~[Bugfix] Support model offloading~~ [Bugfix] Support model offloading SparseGPTQ Nov 16, 2024

kylesayrs self-assigned this Nov 18, 2024

kylesayrs marked this pull request as draft November 18, 2024 20:45

simplify logic

9e4a5f4

Signed-off-by: Kyle Sayers <[email protected]>

kylesayrs marked this pull request as ready for review November 18, 2024 21:08

kylesayrs added 5 commits November 18, 2024 18:09

Merge branch 'main' into kylesayrs/sparsegptq-offloading

eea0ef6

Merge branch 'main' into kylesayrs/sparsegptq-offloading

442956b

Merge branch 'main' into kylesayrs/sparsegptq-offloading

0c1928e

Merge branch 'main' into kylesayrs/sparsegptq-offloading

500fbcc

Merge branch 'main' into kylesayrs/sparsegptq-offloading

0d384f1
