A collection of papers and resources about Machine Unlearning on LLMs.
Another collection of Vision Language Models and Vision Generative models can be found here.
Large language models (LLMs) have demonstrated remarkable capabilities across various tasks, but their training typically requires vast amounts of data, raising concerns in legal and ethical domains. Issues such as potential copyright disputes, data authenticity, and privacy concerns have been brought to the forefront. Machine unlearning offers a potential solution to these challenges, even though it presents new hurdles when applied to LLMs. In this repository, we aim to collect and organize surveys, datasets, approaches, and evaluation metrics pertaining to machine unlearning on LLMs, with the hope of providing valuable insights for researchers in this field.
Paper Title | Venue | Year |
---|---|---|
Knowledge Unlearning for LLMs: Tasks, Methods, and Challenges | ArXiv | 2023.11 |
Machine Unlearning of Pre-trained Large Language Models | ArXiv | 2024.02 |
Rethinking Machine Unlearning for Large Language Models | ArXiv | 2024.02 |
Machine Unlearning: Taxonomy, Metrics, Applications, Challenges, and Prospects | ArXiv | 2024.03 |
The Frontier of Data Erasure: Machine Unlearning for Large Language Models | ArXiv | 2024.03 |
- Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence, 2023
- Scalable Extraction of Training Data from (Production) Language Models, Nasr et al., 2023
Name | Description | Used By |
---|---|---|
BBQ (Bias Benchmark for QA) | a question-answering dataset that highlights attested social biases against people belonging to protected classes along nine social dimensions relevant to U.S. English-speaking contexts. | Zhao et al. |
HarmfulQA | a ChatGPT-distilled dataset constructed using the Chain of Utterances (CoU) prompt. | Zhao et al. |
CategoricalHarmfulQA | a dataset of 550 harmful questions organized by category. | Bhardwaj et al. |
Pile | an 825 GiB English text corpus targeted at training large-scale language models. | Zhao et al. |
Detoxify | a simple, easy-to-use Python library for detecting hateful or offensive language, built to help researchers and practitioners identify potentially toxic comments. | Zhao et al. |
Enron Email Dataset | a large corpus of emails from Enron Corporation employees, commonly used in privacy research. | Wu et al. |
Training Data Extraction Challenge | a challenge dataset for studying the extraction of memorized training data from language models. | Jang et al. |
Harry Potter book series dataset | the text of the Harry Potter novels, used as an unlearning target for copyrighted content. | Eldan et al., Shi et al. |
Real Toxicity Prompts | a set of naturally occurring, sentence-level prompts for measuring toxic degeneration in language models. | Lu et al., Liu et al. |
TOFU | a benchmark of question-answer pairs about fictitious authors for evaluating unlearning. | Maini et al. |
WMDP | a benchmark of multiple-choice questions probing hazardous knowledge in biosecurity, cybersecurity, and chemical security. | Li et al. |
Paper Title | Author | Paper with code | Key words | Venue | Time |
---|---|---|---|---|---|
Composing Parameter-Efficient Modules with Arithmetic Operations | Zhang et al. | Github | uses LoRA to create task vectors and accomplishes unlearning by negating these task vectors. | NeurIPS 2023 | 2023-06 |
Knowledge Unlearning for Mitigating Privacy Risks in Language Models | Jang et al. | Github | updates the model parameters by maximizing the likelihood of mis-prediction for samples in the forget set, i.e., gradient ascent on the forget loss (see the sketch after this table). | ACL 2023 | 2023-07 |
Unlearning Bias in Language Models by Partitioning Gradients | Yu et al. | Github | aims to minimize the likelihood of predictions on relabeled forgetting data | ACL 2023 | 2023-07 |
Who’s Harry Potter? Approximate Unlearning in LLMs | Eldan et al. | HuggingFace | gradient-descent-based fine-tuning over relabeled or randomly labeled forgetting data, where generic translations are used to replace the unlearned texts. | ICLR 2024 | 2023-10 |
Unlearn What You Want to Forget: Efficient Unlearning for LLMs | Chen and Yang | Github | fine-tunes an adapter on the unlearning objective that acts as an unlearning layer within the LLM. | EMNLP 2023 | 2023-12 |
Machine Unlearning of Pre-trained Large Language Models | Yao et al. | Github | incorporates random labeling to augment the unlearning objective and ensures utility preservation on the retain set. | ArXiv | 2024-02 |
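
The gradient-ascent objectives above (Jang et al., with a retain-set term in the spirit of Yao et al.) can be illustrated with a short, hedged PyTorch sketch. The model name, hyperparameters, and the combined forget/retain objective are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of gradient-ascent unlearning with an optional retain-set term.
# "gpt2" and the hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def unlearn_step(forget_batch, retain_batch, retain_weight=1.0):
    """One update: gradient ascent on the forget batch, descent on the retain batch."""
    forget_loss = model(**forget_batch, labels=forget_batch["input_ids"]).loss
    retain_loss = model(**retain_batch, labels=retain_batch["input_ids"]).loss
    # Negating the forget loss maximizes the likelihood of mis-prediction on the
    # forget set; the retain term preserves utility on data that should be kept.
    loss = -forget_loss + retain_weight * retain_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice the batches would come from tokenized DataLoaders over the forget and retain splits.
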
Paper Title | Author | Paper with code | Key words | Venue | Time |
---|---|---|---|---|---|
Locating and Editing Factual Associations in GPT | Meng et al. | Github | localization is accomplished through representation denoising, also known as causal tracing, operating at the level of model layers. | ArXiv | 2022-02 |
Unlearning Bias in Language Models by Partitioning Gradients | Yu et al. | Github | gradient-based saliency is employed to identify the crucial weights that must be fine-tuned to achieve the unlearning objective (see the sketch after this table). | ACL 2023 | 2023-07 |
DEPN: Detecting and Editing Privacy Neurons in Pretrained Language Models | Wu et al. | Github | neurons that respond to unlearning targets are identified within the feed-forward network and subsequently selected for knowledge unlearning. | EMNLP 2023 | 2023-10 |
Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks | Patil et al. | Github | it is important to delete information about unlearning targets wherever it is represented in models in order to protect against attacks | ArXiv | 2023-09 |
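
The gradient-based saliency used by Yu et al. (and, in spirit, the neuron selection of Wu et al.) can be sketched roughly as follows. Scoring whole parameter tensors by mean gradient magnitude and keeping a top-k subset is a simplifying assumption for illustration.

```python
# Hedged sketch: score parameters by forget-set gradient magnitude, then restrict
# unlearning updates to the most salient tensors.
import torch

def saliency_scores(model, forget_batches):
    """Accumulate the absolute gradient of the forget loss for each named parameter."""
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for batch in forget_batches:
        model.zero_grad()
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                scores[n] += p.grad.abs()
    return scores

def select_salient(scores, top_k=10):
    """Return the names of the top-k parameter tensors by mean saliency; the rest stay frozen."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1].mean().item(), reverse=True)
    return {name for name, _ in ranked[:top_k]}
```

Only parameters in the selected set would receive unlearning gradients, e.g., by setting `requires_grad = False` everywhere else.
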
Paper Title | Author | Paper with code | Key words | Venue | Time |
---|---|---|---|---|---|
Studying Large Language Model Generalization with Influence Functions | Grosse et al. | [No Code Available] | the potential of influence functions in LLM unlearning may be underestimated: scalability issues and approximation errors can be mitigated by focusing on localized weights that are salient to unlearning (see the sketch after this table). | ArXiv | 2023-08 |
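
For intuition, a rough sketch of the classical influence-function estimate, influence(z, z_test) ≈ -∇L(z_test)ᵀ H⁻¹ ∇L(z), with the inverse-Hessian-vector product approximated by a LiSSA-style recursion. This is a didactic approximation under strong assumptions (small parameter subset, damped Hessian), not the EK-FAC machinery used by Grosse et al.

```python
# Didactic sketch of influence scores via a damped, LiSSA-style inverse-HVP.
import torch

def flat_grad(loss, params, create_graph=False):
    grads = torch.autograd.grad(loss, params, create_graph=create_graph)
    return torch.cat([g.reshape(-1) for g in grads])

def hvp(loss, params, vec):
    """Hessian-vector product via double backprop."""
    g = flat_grad(loss, params, create_graph=True)
    prod = torch.autograd.grad(g @ vec, params, retain_graph=True)
    return torch.cat([p.reshape(-1) for p in prod])

def inverse_hvp(total_loss, params, vec, damping=0.01, scale=10.0, steps=50):
    """LiSSA recursion h <- v + h - (H h + damping*h)/scale; h/scale approximates (H + damping*I)^-1 v."""
    h = vec.clone()
    for _ in range(steps):
        h = vec + h - (hvp(total_loss, params, h) + damping * h) / scale
    return h / scale

def influence(total_loss, train_loss, test_loss, params):
    """influence(z, z_test) ~= -grad L(z_test)^T (H + damping*I)^-1 grad L(z)."""
    grad_train = flat_grad(train_loss, params)
    grad_test = flat_grad(test_loss, params)
    return -torch.dot(grad_test, inverse_hvp(total_loss, params, grad_train))
```
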
Paper Title | Author | Paper with code | Key words | Venue | Time |
---|---|---|---|---|---|
Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks | Patil et al. | Github | defending against extraction attacks. | ICLR 2024 | 2023-09 |
Learning and Forgetting Unsafe Examples in Large Language Models | Zhao et al. | [No Code Available] | Fine-tuning based. | ArXiv | 2023-12 |
Second-Order Information Matters: Revisiting Machine Unlearning for Large Language Models | Gu et al. | [No Code Available] | sequential editing of LLMs may compromise their general capabilities. | ArXiv | 2024-03 |
Towards Efficient and Effective Unlearning of Large Language Models for Recommendation | Wang et al. | Github | Using LLM Unlearning in Recommendation | ArXiv | 2024-03 |
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning | Li et al. | Homepage | steers the model toward a novice-like level of hazardous knowledge with a loss composed of a forget term and a retain term: the forget loss bends the model's representations toward those of a novice, while the retain loss limits how much general capability is removed (see the sketch after this table). | ArXiv | 2024-03 |
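
The forget/retain objective described for the WMDP paper can be sketched roughly as below, in the spirit of representation steering: push hidden activations on hazardous text toward a fixed random "novice" direction while keeping activations on benign text close to those of a frozen copy of the model. The layer choice, control-vector construction, and coefficient are illustrative assumptions rather than the authors' released implementation.

```python
# Hedged sketch of a two-term representation-steering unlearning loss.
import torch
import torch.nn.functional as F

def make_control_vector(hidden_dim, scale=6.0):
    """A fixed random direction used as the 'novice' activation target (scale is a hyperparameter)."""
    v = torch.rand(hidden_dim)
    return scale * v / v.norm()

def forget_retain_loss(forget_acts, retain_acts, frozen_retain_acts, control_vec, alpha=100.0):
    """forget_acts / retain_acts: hidden states of the model being unlearned at a chosen layer;
    frozen_retain_acts: the same retain-data hidden states from a frozen reference model."""
    # Bend representations on forget data toward the novice direction.
    forget_loss = F.mse_loss(forget_acts, control_vec.expand_as(forget_acts))
    # Limit how much general capability is removed.
    retain_loss = F.mse_loss(retain_acts, frozen_retain_acts)
    return forget_loss + alpha * retain_loss
```
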
Paper Title | Author | Paper with code | Key words | Venue | Time |
---|---|---|---|---|---|
Memory-assisted prompt editing to improve GPT-3 after deployment | Madaan et al. | Github | memory-assisted prompt editing shows promise in addressing the challenges posed by restricted access to black-box LLMs and in achieving parameter-efficient LLM unlearning (see the sketch after this table). | EMNLP 2022 | 2022-01 |
Alignment Studio: Aligning Large Language Models to Particular Contextual Regulations | Achintalwar et al. | [No Code Available] | aligning a company's internal-facing enterprise chatbot to its business conduct guidelines | ArXiv | 2024-03 |
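
The memory-assisted prompt editing of Madaan et al. can be illustrated with a toy sketch: user feedback about past mistakes is stored, retrieved for similar new queries, and prepended to the prompt, so a black-box LLM is corrected without touching its parameters. The similarity heuristic and prompt format here are illustrative assumptions, not the paper's implementation.

```python
# Toy sketch of a feedback memory that edits prompts for a black-box LLM.
from difflib import SequenceMatcher

class FeedbackMemory:
    def __init__(self):
        self.entries = []  # (query, feedback) pairs collected from users

    def add(self, query, feedback):
        self.entries.append((query, feedback))

    def retrieve(self, query, threshold=0.6):
        """Return feedback recorded for queries similar to the new one."""
        return [fb for q, fb in self.entries
                if SequenceMatcher(None, q.lower(), query.lower()).ratio() > threshold]

def build_prompt(query, memory):
    """Prepend retrieved clarifications so the model avoids repeating past mistakes."""
    hints = memory.retrieve(query)
    prefix = "".join(f"Clarification: {hint}\n" for hint in hints)
    return f"{prefix}Question: {query}\nAnswer:"
```

A stronger retriever (e.g., embedding similarity) would replace the string-matching heuristic in practice.
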
Paper Title | Author | Paper with code | Key words | Venue | Year |
---|---|---|---|---|---|
Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks | Patil et al. | Github | | ArXiv | 2023.09 |
Detecting Pretraining Data from Large Language Models | Shi et al. | Github | pretraining data detection (see the sketch after this table) | ArXiv | 2023.10 |
Practical Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration | Fu et al. | [No Code Available] | fine-tuning data detection | ArXiv | 2023.11 |
Tensor trust: Interpretable prompt injection attacks from an online game | Toyer et al. | Github | input-based methods may not yield genuinely unlearned models and are thus weaker than model-based methods, since modifying only the inputs of LLMs may not be sufficient to erase the influence of unlearning targets | ArXiv | 2023-11 |
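
The pretraining-data detection of Shi et al. above is based on the Min-K% Prob score, which can be sketched as follows: score a candidate text by the average log-probability of its k% least likely tokens; higher scores suggest the text was seen during training. The model name and the decision threshold are placeholders.

```python
# Hedged sketch of Min-K% Prob membership scoring for a candidate text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model under audit
tokenizer = AutoTokenizer.from_pretrained("gpt2")

@torch.no_grad()
def min_k_prob(text, k=0.2):
    """Average log-probability of the k% lowest-probability tokens in `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)               # predictions for tokens 1..T-1
    token_lp = logprobs[torch.arange(ids.shape[1] - 1), ids[0, 1:]]    # log p(actual next token)
    n = max(1, int(k * token_lp.numel()))
    lowest = torch.topk(token_lp, n, largest=False).values
    return lowest.mean().item()  # compare against a threshold tuned on known member/non-member data
```
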
Paper Title | Author | Paper with code | Key words | Venue | Year |
---|---|---|---|---|---|
TOFU: A Task of Fictitious Unlearning for LLMs | Maini et al. | Homepage | | ArXiv | 2024.01 |
Machine Unlearning of Pre-trained Large Language Models | Yao et al. | Github | | ArXiv | 2024.02 |
Eight Methods to Evaluate Robust Unlearning in LLMs | Lynch et al. | [No Code Available] | | ArXiv | 2024.02 |
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning | Li et al. | Homepage | Biology, Cyber and Chemical | ArXiv | 2024.03 |