This repository contains the code for the paper Continual Memorization of Factoids in Large Language Models.
- [2024/11/20] Initial release.
In this paper, we examine LLMs' ability to memorize small sets of long-tail factoids (factual associations) in a multi-stage continual training setting. Unlike regular datasets and tasks, training on these factoids often causes unintended disruptions to the model, such as exacerbated hallucination. We find a similar fragility in the continual learning setting: memorized long-tail factoids are easily forgotten after later-stage training. To understand this phenomenon, we investigate 1) how many of the memorized factoids are retained after further training on other tasks, 2) how different types of data affect memory retention across multiple stages of training, and 3) how to mitigate forgetting.
The figure above illustrates the setting. In stage 1, a base LLM (e.g., Llama 3) is trained to memorize a factoid dataset (dataset A). In stage 2, the model is further trained on another dataset (dataset B), and we then measure how well the stage-1 factoids are retained.
We find that mixing data in either or both stages changes the dynamics of how factoids are memorized. Surprisingly, mixing random word sequences in stage 1 helps mitigate forgetting. In addition, we find that mixing general pretraining data (not related to the factoids) also mitigates forgetting. We refer to these data-mixing strategies as REMIX (a sketch of the idea follows).
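As a rough sketch of the stage-1 mixing idea (this is not the repo's implementation; the function names and data format below are hypothetical), REMIX-style mixing interleaves random word sequences with the factoid examples before training:

```python
import random

# Hypothetical illustration of REMIX-style mixing for stage 1.
# `factoid_examples` stands in for dataset A; the repo's actual formats may differ.

WORDS = ["apple", "river", "quantum", "violet", "anchor", "mosaic", "ember", "lattice"]

def random_word_sequence(length=32, rng=random):
    """Sample a sequence of random words (a 'nonsense' mixing example)."""
    return " ".join(rng.choice(WORDS) for _ in range(length))

def remix(factoid_examples, mix_ratio=1.0, seed=0):
    """Interleave random-word sequences with factoid examples.

    mix_ratio = number of random sequences added per factoid example.
    """
    rng = random.Random(seed)
    mixed = list(factoid_examples)
    n_random = int(mix_ratio * len(factoid_examples))
    mixed += [random_word_sequence(rng=rng) for _ in range(n_random)]
    rng.shuffle(mixed)
    return mixed

# Toy example: two factoids mixed 1:1 with random word sequences.
stage1_data = remix(["Q: capital of France? A: Paris",
                     "Q: author of Dune? A: Frank Herbert"])
```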
Install the dependencies:
pip install -r requirements.txt
Stage 1 factoid datasets (see the illustrative factoid example after the lists):
- Key-Value Recall (KVR)
- PopQA
- TriviaQA
Stage 2 datasets:
- Factoid: LAMA, EntityQuestions, WebQA
- Non-factoid: UltraChat, EvolCode, APPS, GSM8K, MATH
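For concreteness, a "factoid" here is a small atomic fact. The snippets below are purely illustrative; the actual file format produced by the data preparation script may differ:

```python
# Illustrative only -- the on-disk format used by the repo may differ.
kvr_example = {          # Key-Value Recall: memorize arbitrary key -> value mappings
    "key": "a7f3-kx92",
    "value": "blue marble 417",
}
popqa_example = {        # PopQA-style long-tail factoid as a QA pair
    "question": "Who was the director of the film 'The Ascent' (1977)?",
    "answer": "Larisa Shepitko",
}
```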
Run the following command to download the datasets.
./scripts/prepare_data.sh
We use FSDP for training. Most of our experiments can be run on two 80GB GPUs. With REMIX, the 8B model sometimes requires four GPUs.
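For reference, FSDP shards parameters, gradients, and optimizer state across GPUs. The snippet below is a generic illustration of PyTorch FSDP wrapping launched with torchrun, not the repo's training code; the model name is only an example:

```python
# Generic FSDP illustration (not the repo's training loop).
# Launch with: torchrun --nproc_per_node=2 this_file.py
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Example model id; the paper's experiments use Llama-3-style 8B models.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
model = FSDP(model, device_id=local_rank)  # shard params/grads/optimizer state across ranks
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```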
Set up `src/config.py` with your own paths. In the run scripts, specify `root_dir`, `run_base_dir`, and `run_name`.
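The three names above come from the run scripts; everything else here is a placeholder. A rough sketch of the kind of values to fill in (the actual variables in `src/config.py` may be named differently):

```python
# Placeholder values only -- adapt to your environment.
root_dir = "/path/to/continual-memorization"   # repository root
run_base_dir = "/path/to/checkpoints"          # where training runs are written
run_name = "llama3-8b_stage1_kvr"              # hypothetical identifier for a single run
```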
Stage 1 training.
./scripts/run_stage1.sh
Stage 2 training.
./scripts/run_stage2.sh
Evaluation.
python -m src.run_eval \
--dataset_name ${dataset_name_A} \
--dataset_name_B ${dataset_name_B} \
--model_name ${model_name} \
--verbose
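Here `dataset_name_A` is the stage-1 factoid dataset and `dataset_name_B` is the stage-2 dataset. As a rough sketch of how retention can be quantified (not necessarily the repo's exact metric), one can compare exact-match accuracy on dataset A before and after stage-2 training:

```python
# Illustrative retention computation (toy inputs; not necessarily the repo's exact metric).
def exact_match_accuracy(predictions, answers):
    """Fraction of factoids answered exactly correctly (after simple normalization)."""
    normalize = lambda s: s.strip().lower()
    return sum(normalize(p) == normalize(a) for p, a in zip(predictions, answers)) / len(answers)

gold = ["Paris", "Frank Herbert", "1969"]
preds_after_stage1 = ["Paris", "Frank Herbert", "1969"]   # toy: everything memorized in stage 1
preds_after_stage2 = ["Paris", "Isaac Asimov", "1969"]    # toy: one factoid forgotten after stage 2

retention = exact_match_accuracy(preds_after_stage2, gold) / \
            exact_match_accuracy(preds_after_stage1, gold)
print(f"retention = {retention:.2f}")   # 0.67 in this toy example
```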
Please contact Howard at [email protected] for any questions or issues.
@article{chen2024continual,
title={Continual Memorization of Factoids in Large Language Models},
author={Chen, Howard and Geng, Jiayi and Bhaskar, Adithya and Friedman, Dan and Chen, Danqi},
journal={arXiv preprint arXiv:2411.07175},
year={2024}
}