This repository contains the code for the paper Continual Memorization of Factoids in Large Language Models.
- [2024/11/20] Initial release.
In this paper, we examine LLMs' ability to memorize small sets of long-tail factoids (factual associations) in a multi-stage continual training setting. Unlike regular datasets and tasks, training on these factoids often causes unintended disruptions to the model, such as exacerbated hallucination. We find a similar fragility in the continual learning setting: memorized long-tail factoids are easily forgotten after later-stage training. To understand this phenomenon, we investigate 1) how many of the memorized factoids are retained after further training on other tasks, 2) how different types of data affect memory retention across multiple stages of training, and 3) how to mitigate forgetting.
The figure above illustrates the setting. In stage 1, a base LLM (e.g., Llama 3) is trained to memorize a factoid dataset (dataset A). In stage 2, the model is further trained on another dataset (dataset B), and we then measure how well the stage-1 factoids are retained.
We find that mixing data in either or both stages changes the dynamics of how factoids are memorized. Surprisingly, mixing random word sequences in stage 1 helps mitigate forgetting. In addition, we find that mixing general pretraining data (not related to the factoids) also mitigates forgetting. We refer to these data-mixing strategies as REMIX (a sketch of the idea follows).
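As a rough sketch of the stage-1 mixing idea (this is not the repo's implementation; the function names and data format below are hypothetical), REMIX-style mixing interleaves random word sequences with the factoid examples before training:

```python
import random

# Hypothetical illustration of REMIX-style mixing for stage 1.
# `factoid_examples` stands in for dataset A; the repo's actual formats may differ.

WORDS = ["apple", "river", "quantum", "violet", "anchor", "mosaic", "ember", "lattice"]

def random_word_sequence(length=32, rng=random):
    """Sample a sequence of random words (a 'nonsense' mixing example)."""
    return " ".join(rng.choice(WORDS) for _ in range(length))

def remix(factoid_examples, mix_ratio=1.0, seed=0):
    """Interleave random-word sequences with factoid examples.

    mix_ratio = number of random sequences added per factoid example.
    """
    rng = random.Random(seed)
    mixed = list(factoid_examples)
    n_random = int(mix_ratio * len(factoid_examples))
    mixed += [random_word_sequence(rng=rng) for _ in range(n_random)]
    rng.shuffle(mixed)
    return mixed

# Toy example: two factoids mixed 1:1 with random word sequences.
stage1_data = remix(["Q: capital of France? A: Paris",
                     "Q: author of Dune? A: Frank Herbert"])
```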
Install the dependencies:
pip install -r requirements.txt
Stage 1 factoid datasets (see the illustrative factoid example after the lists):
- Key-Value Recall (KVR)
- PopQA
- TriviaQA
Stage 2 datasets:
- Factoid: LAMA, EntityQuestions, WebQA
- Non-factoid: UltraChat, EvolCode, APPS, GSM8K, MATH
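For concreteness, a "factoid" here is a small atomic fact. The snippets below are purely illustrative; the actual file format produced by the data preparation script may differ:

```python
# Illustrative only -- the on-disk format used by the repo may differ.
kvr_example = {          # Key-Value Recall: memorize arbitrary key -> value mappings
    "key": "a7f3-kx92",
    "value": "blue marble 417",
}
popqa_example = {        # PopQA-style long-tail factoid as a QA pair
    "question": "Who was the director of the film 'The Ascent' (1977)?",
    "answer": "Larisa Shepitko",
}
```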
Run the following command to download the datasets.
./scripts/prepare_data.sh
We use FSDP for training. Most of our experiments can be run on two 80GB GPUs. With REMIX, the 8B model sometimes requires four GPUs.
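For reference, FSDP shards parameters, gradients, and optimizer state across GPUs. The snippet below is a generic illustration of PyTorch FSDP wrapping launched with torchrun, not the repo's training code; the model name is only an example:

```python
# Generic FSDP illustration (not the repo's training loop).
# Launch with: torchrun --nproc_per_node=2 this_file.py
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Example model id; the paper's experiments use Llama-3-style 8B models.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
model = FSDP(model, device_id=local_rank)  # shard params/grads/optimizer state across ranks
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```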
Set up `src/config.py` with your own paths. In the run scripts, specify `root_dir`, `run_base_dir`, and `run_name`.
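The three names above come from the run scripts; everything else here is a placeholder. A rough sketch of the kind of values to fill in (the actual variables in `src/config.py` may be named differently):

```python
# Placeholder values only -- adapt to your environment.
root_dir = "/path/to/continual-memorization"   # repository root
run_base_dir = "/path/to/checkpoints"          # where training runs are written
run_name = "llama3-8b_stage1_kvr"              # hypothetical identifier for a single run
```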
Stage 1 training.
./scripts/run_stage1.sh
Stage 2 training.
./scripts/run_stage2.sh
Evaluation.
python -m src.run_eval \
--dataset_name ${dataset_name_A} \
--dataset_name_B ${dataset_name_B} \
--model_name ${model_name} \
--verbose
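Here `dataset_name_A` is the stage-1 factoid dataset and `dataset_name_B` is the stage-2 dataset. As a rough sketch of how retention can be quantified (not necessarily the repo's exact metric), one can compare exact-match accuracy on dataset A before and after stage-2 training:

```python
# Illustrative retention computation (toy inputs; not necessarily the repo's exact metric).
def exact_match_accuracy(predictions, answers):
    """Fraction of factoids answered exactly correctly (after simple normalization)."""
    normalize = lambda s: s.strip().lower()
    return sum(normalize(p) == normalize(a) for p, a in zip(predictions, answers)) / len(answers)

gold = ["Paris", "Frank Herbert", "1969"]
preds_after_stage1 = ["Paris", "Frank Herbert", "1969"]   # toy: everything memorized in stage 1
preds_after_stage2 = ["Paris", "Isaac Asimov", "1969"]    # toy: one factoid forgotten after stage 2

retention = exact_match_accuracy(preds_after_stage2, gold) / \
            exact_match_accuracy(preds_after_stage1, gold)
print(f"retention = {retention:.2f}")   # 0.67 in this toy example
```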
Please contact Howard at [email protected] for any questions or issues.
@article{chen2024continual,
title={Continual Memorization of Factoids in Large Language Models},
author={Chen, Howard and Geng, Jiayi and Bhaskar, Adithya and Friedman, Dan and Chen, Danqi},
journal={arXiv preprint arXiv:2411.07175},
year={2024}
}