Support quantization with adapter v1 and v2 finetuning #694
Conversation
The most time-consuming part was running the finetune sweep across the 2 models, 2 adapter versions and 4 precision/quantization levels. I wrote a Makefile that runs the whole sweep; in case something like that is useful for others, I'll leave it here:

# Params
########
# Adapter V1 or V2
# V1 leave blank, V2 is _v2
# ADAPTER_VERSION=
ADAPTER_VERSION=_v2
ADAPTER_FINETUNE_CMD=python finetune/adapter$(ADAPTER_VERSION).py
ADAPTER_GENERATE_CMD=python generate/adapter$(ADAPTER_VERSION).py --prompt "Recommend a movie to watch on the weekend."
# Change model here
MODEL_SOURCE=stabilityai
MODEL_NAME=$(MODEL_SOURCE)/stablelm-base-alpha-3b
# MODEL_SOURCE=meta-llama
# MODEL_NAME=$(MODEL_SOURCE)/Llama-2-7b-chat-hf
run-all-adapters: setup
@echo "Running adapter with stabilityai (v1)..."
@mkdir -p logs/stabilityai
@make adapter ADAPTER_VERSION= MODEL_SOURCE=stabilityai MODEL_NAME=stabilityai/stablelm-base-alpha-3b
@echo "Running adapter with stabilityai (_v2)..."
@mkdir -p logs/stabilityai
@make adapter ADAPTER_VERSION=_v2 MODEL_SOURCE=stabilityai MODEL_NAME=stabilityai/stablelm-base-alpha-3b
@echo "Running adapter with meta-llama (v1)..."
@mkdir -p logs/meta-llama
@make adapter ADAPTER_VERSION= MODEL_SOURCE=meta-llama MODEL_NAME=meta-llama/Llama-2-7b-chat-hf
@echo "Running adapter with meta-llama (_v2)..."
@mkdir -p logs/meta-llama
@make adapter ADAPTER_VERSION=_v2 MODEL_SOURCE=meta-llama MODEL_NAME=meta-llama/Llama-2-7b-chat-hf
CHECKPOINT_DIR=checkpoints/$(MODEL_NAME)
LOG_SUFFIX=.txt
# Setup
#######
# Install Python dependencies
requirements:
	pip install huggingface_hub sentencepiece
	pip install -r requirements-all.txt

# Download Models
download-stable-3b: requirements
	python scripts/download.py --repo_id stabilityai/stablelm-base-alpha-3b
	python scripts/convert_hf_checkpoint.py --checkpoint_dir checkpoints/stabilityai/stablelm-base-alpha-3b
	python scripts/prepare_alpaca.py

download-llama-7b: requirements
	python scripts/download.py --repo_id meta-llama/Llama-2-7b-chat-hf --access_token XXX
	python scripts/convert_hf_checkpoint.py --checkpoint_dir checkpoints/meta-llama/Llama-2-7b-chat-hf
	python scripts/prepare_alpaca.py --checkpoint_dir checkpoints/meta-llama/Llama-2-7b-chat-hf
# Prerequisites check
######################
# Check CUDA availability
cuda-check:
@echo "Checking for CUDA..."
@/usr/bin/env nvidia-smi > /dev/null 2>&1 || (echo "CUDA is not available or nvidia-smi is not in your PATH"; exit 1)
logs:
	mkdir -p logs
	mkdir -p logs/$(MODEL_SOURCE)
prereqs: logs cuda-check
setup: download-stable-3b download-llama-7b
# Train Adapter
finetune-adapter-default: prereqs
	$(ADAPTER_FINETUNE_CMD) \
	--checkpoint_dir $(CHECKPOINT_DIR) \
	--out_dir out/adapter$(ADAPTER_VERSION)/$(MODEL_NAME)/default \
	2>&1 | tee logs/$(MODEL_NAME)-finetune-adapter$(ADAPTER_VERSION)-default$(LOG_SUFFIX)

finetune-adapter-bf16: prereqs
	$(ADAPTER_FINETUNE_CMD) \
	--precision bf16-true \
	--checkpoint_dir $(CHECKPOINT_DIR) \
	--out_dir out/adapter$(ADAPTER_VERSION)/$(MODEL_NAME)/bf16 \
	2>&1 | tee logs/$(MODEL_NAME)-finetune-adapter$(ADAPTER_VERSION)-bf16$(LOG_SUFFIX)

finetune-adapter-bf16-bnb.nf4: prereqs
	$(ADAPTER_FINETUNE_CMD) \
	--precision bf16-true \
	--quantize "bnb.nf4" \
	--checkpoint_dir $(CHECKPOINT_DIR) \
	--out_dir out/adapter$(ADAPTER_VERSION)/$(MODEL_NAME)/bf16-bnb-nf4 \
	2>&1 | tee logs/$(MODEL_NAME)-finetune-adapter$(ADAPTER_VERSION)-bf16-bnb.nf4$(LOG_SUFFIX)

finetune-adapter-bf16-bnb.nf4-dq: prereqs
	$(ADAPTER_FINETUNE_CMD) \
	--precision bf16-true \
	--quantize "bnb.nf4-dq" \
	--checkpoint_dir $(CHECKPOINT_DIR) \
	--out_dir out/adapter$(ADAPTER_VERSION)/$(MODEL_NAME)/bf16-bnb-nf4-dq \
	2>&1 | tee logs/$(MODEL_NAME)-finetune-adapter$(ADAPTER_VERSION)-bf16-bnb.nf4-dq$(LOG_SUFFIX)
finetune-all: finetune-adapter-default finetune-adapter-bf16 finetune-adapter-bf16-bnb.nf4 finetune-adapter-bf16-bnb.nf4-dq
# Generate (check inference memory)
generate-adapter-default: prereqs
	$(ADAPTER_GENERATE_CMD) \
	--checkpoint_dir $(CHECKPOINT_DIR) \
	--adapter_path out/adapter$(ADAPTER_VERSION)/$(MODEL_NAME)/default/lit_model_adapter_finetuned.pth \
	2>&1 | tee logs/$(MODEL_NAME)-generate-adapter$(ADAPTER_VERSION)-default$(LOG_SUFFIX)

generate-adapter-bf16:
	$(ADAPTER_GENERATE_CMD) \
	--precision bf16-true \
	--checkpoint_dir $(CHECKPOINT_DIR) \
	--adapter_path out/adapter$(ADAPTER_VERSION)/$(MODEL_NAME)/bf16/lit_model_adapter_finetuned.pth \
	2>&1 | tee logs/$(MODEL_NAME)-generate-adapter$(ADAPTER_VERSION)-bf16$(LOG_SUFFIX)

generate-adapter-bf16-bnb.nf4: prereqs
	$(ADAPTER_GENERATE_CMD) \
	--precision bf16-true \
	--quantize "bnb.nf4" \
	--checkpoint_dir $(CHECKPOINT_DIR) \
	--adapter_path out/adapter$(ADAPTER_VERSION)/$(MODEL_NAME)/bf16-bnb-nf4/lit_model_adapter_finetuned.pth \
	2>&1 | tee logs/$(MODEL_NAME)-generate-adapter$(ADAPTER_VERSION)-bf16-bnb.nf4$(LOG_SUFFIX)

generate-adapter-bf16-bnb.nf4-dq: prereqs
	$(ADAPTER_GENERATE_CMD) \
	--precision bf16-true \
	--quantize "bnb.nf4-dq" \
	--checkpoint_dir $(CHECKPOINT_DIR) \
	--adapter_path out/adapter$(ADAPTER_VERSION)/$(MODEL_NAME)/bf16-bnb-nf4-dq/lit_model_adapter_finetuned.pth \
	2>&1 | tee logs/$(MODEL_NAME)-generate-adapter$(ADAPTER_VERSION)-bf16-bnb.nf4-dq$(LOG_SUFFIX)
# Finetune and generate combined
adapter-default: finetune-adapter-default generate-adapter-default
adapter-bf16: finetune-adapter-bf16 generate-adapter-bf16
adapter-bf16-bnb.nf4: finetune-adapter-bf16-bnb.nf4 generate-adapter-bf16-bnb.nf4
adapter-bf16-bnb.nf4-dq: finetune-adapter-bf16-bnb.nf4-dq generate-adapter-bf16-bnb.nf4-dq
adapter: adapter-default adapter-bf16 adapter-bf16-bnb.nf4 adapter-bf16-bnb.nf4-dq
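With the variables at the top set for a given model and adapter version, make adapter runs the four finetune + generate combinations (default, bf16, bf16 + bnb.nf4, bf16 + bnb.nf4-dq) for that configuration, and make run-all-adapters sweeps all of them across both models and both adapter versions by overriding those variables on each invocation.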
Hey @safurrier
As a sanity check, I took a look at the dtypes of each layer when quantization is applied, for adapter v1 with

python finetune/adapter.py --checkpoint_dir checkpoints/EleutherAI/pythia-70m --quantize bnb.nf4 --precision 16-true

and for adapter v2 with

python finetune/adapter_v2.py --checkpoint_dir checkpoints/EleutherAI/pythia-70m --quantize bnb.nf4 --precision 16-true

In both cases only the weight matrices for attention (QKV and projection), the MLP and lm_head are quantized. To display the dtypes, one can put this code snippet right after the model is set up in the finetune script:

for name, layer in model.named_parameters():
    print(f"{(name + ' ').ljust(60, '-')} {layer.dtype}")
@Andrei-Aksionov good to know. I didn't dig too deeply into things and was hoping this would mainly work out of the box. Is the quantization being limited only to the new params the desired behavior? Or for LoRA is the entire model quantized as well?
Thank you! I added tests and included the results directly in the resource tables.
Co-authored-by: Carlos Mocholí <[email protected]>
Closes #392
Implements quantization for adapter and adapter v2 with the same code as used for LoRA.
Ran it on StableLM and Llama 7B for adapter and adapter_v2.
There is now some duplicated code across the lora, adapter and adapter_v2 scripts for setting up quantization (mainly around setting up plugins based on the quantization flag and selecting a BnB-compatible optimizer). That could be cleaned up with some common utils, but I didn't want to refactor to that level unless it is desired.
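If that refactor is wanted at some point, below is a rough sketch of what a shared helper could look like. This is illustrative only and not code from this PR: the function name is made up, it assumes Lightning's BitsandbytesPrecision plugin and bitsandbytes are available, and PagedAdamW stands in for whichever BnB-compatible optimizer the LoRA script actually selects.

# Hypothetical shared helper -- not part of this PR. Maps the --quantize flag
# to a Fabric precision plugin plus an optimizer class.
from typing import Optional, Tuple

import torch
from lightning.fabric.plugins import BitsandbytesPrecision


def quantization_setup(
    quantize: Optional[str], precision: str
) -> Tuple[Optional[BitsandbytesPrecision], Optional[str], type]:
    if quantize is None or not quantize.startswith("bnb."):
        # No quantization: keep the precision string and use a regular optimizer.
        return None, precision, torch.optim.AdamW

    # "bnb.nf4" -> "nf4", "bnb.nf4-dq" -> "nf4-dq", ...
    mode = quantize[len("bnb."):]
    # Only the precisions used in this sweep are handled in this sketch.
    dtype = {"16-true": torch.float16, "bf16-true": torch.bfloat16}[precision]
    plugin = BitsandbytesPrecision(mode, dtype)

    import bitsandbytes as bnb  # imported lazily so the non-quantized path works without it

    # Example of a BnB-compatible optimizer choice.
    return plugin, None, bnb.optim.PagedAdamW

Each of lora.py, adapter.py and adapter_v2.py could then call this once and pass the returned plugin/precision straight to Fabric, instead of repeating the branching logic.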