Given black-box access to a pretrained large language model, can we predict whether a document has been part of its training dataset?
This repo contains the source code to reproduce the results published in the paper "Did the Neurons Read your Book? Document-level Membership Inference for Large Language Models".
Follow these steps to set up the correct Python environment:

```bash
conda create --name doc_membership python=3.9
conda activate doc_membership
pip install -r requirements.txt
```
We now download the target model. Use `python src/download_model.py` or `scripts/download_model.sh` to do so for the desired model on Hugging Face; in the paper we used OpenLLaMA. The model and its tokenizer will then be saved in the directory of your choice, by default `./pretrained/` in the script.
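Under the hood, downloading amounts to fetching the model and tokenizer from the Hugging Face Hub and saving both locally. A minimal sketch, assuming the OpenLLaMA 7B checkpoint and an illustrative output directory:

```python
# Minimal sketch: fetch a causal LM and its tokenizer from the Hugging Face Hub
# and save both locally. Model id and output directory are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openlm-research/open_llama_7b"   # OpenLLaMA, as used in the paper
save_dir = "./pretrained/open_llama_7b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

tokenizer.save_pretrained(save_dir)
model.save_pretrained(save_dir)
```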
First and foremost, textual data should be collected and split into 'member' and 'non-member' documents. In this project, both books from Project Gutenberg and academic papers from ArXiv have been considered.
To reproduce the data collection, we rely on the download and preprocessing scripts provided by RedPajama (their first version, which now lives on an older branch here). More specifically, we applied the following strategy for both data sources:
- Books from Project Gutenberg:
  - Members: we simply downloaded PG-19 from Hugging Face, as for instance here (see the sketch after this list).
  - Non-members: we scraped books from Project Gutenberg using this code. You can find the scripts we utilized to do so in `data/raw_gutenberg/`. Note that the book index to start from was manually searched on Project Gutenberg. We make the corresponding dataset available on Hugging Face here.
- Academic papers from ArXiv:
  - Members: we downloaded all `jsonl` files as provided by the V1 version of RedPajama. For all details, see `data/raw_arxiv/`.
  - Non-members: we downloaded all ArXiv papers at a small cost using the resources ArXiv provides here and the script to do so here. Note that we do not make the dataset collected as non-members publicly available, as we do not have the license to distribute the data ourselves.
  - All preprocessing for ArXiv has been done using this script.
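As a quick illustration of the 'member' side, PG-19 can be pulled straight from the Hugging Face Hub. A minimal sketch, assuming the `pg19` dataset id and its published schema (`short_book_title`, `text`):

```python
# Minimal sketch: stream PG-19 (the 'member' books) from the Hugging Face Hub.
# The full dataset is large, so streaming avoids downloading everything at once.
from datasets import load_dataset

pg19 = load_dataset("pg19", split="train", streaming=True)
book = next(iter(pg19))
print(book["short_book_title"])
print(book["text"][:500])  # first 500 characters of the book
```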
Next, we tokenize the data using `python src/tokenize_data.py` or `scripts/tokenize_data.sh`.
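This step simply runs each document through the tokenizer of the downloaded model. A minimal sketch, with an illustrative document path:

```python
# Minimal sketch: tokenize one document with the downloaded tokenizer.
# The document path is illustrative; the repo scripts iterate over the full corpus.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./pretrained/open_llama_7b")
with open("data/example_document.txt") as f:
    text = f.read()

token_ids = tokenizer(text)["input_ids"]
print(f"Document length: {len(token_ids)} tokens")
```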
Lastly, we create 'chunks' of documents, enabling us to run the entire pipeline multiple times (training on k-1 chunks and evaluating on the held-out chunk, repeated k times). For this we use `python src/split_chunks.py -c config/SOME_CONFIG.ini` with the appropriate input arguments.
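Conceptually, this is k-fold cross-validation at the document level. A minimal sketch of the idea (chunk count and document list are illustrative, not the repo's exact code):

```python
# Minimal sketch: split documents into k chunks, then rotate which chunk is held out.
import random

def split_chunks(documents, k=5, seed=42):
    docs = documents[:]
    random.Random(seed).shuffle(docs)
    return [docs[i::k] for i in range(k)]

chunks = split_chunks(list(range(100)), k=5)
for i, heldout in enumerate(chunks):
    train = [doc for j, chunk in enumerate(chunks) if j != i for doc in chunk]
    print(f"fold {i}: train on {len(train)} docs, evaluate on {len(heldout)} docs")
```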
We will now query the downloaded language model while running through each document, computing for each token its predicted probability and the top probabilities.
For this we use `python src/compute_perplexity.py` with the appropriate input arguments, as in `scripts/compute_perplexity.sh`; using GPUs is recommended for this step. The resulting token-level values are saved in `perplexity_results/`. Note that the parameter `max_length` here corresponds to the context size C used to query the model, as in the paper.
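For intuition, the token-level values come from a forward pass over each context window of C tokens. A minimal sketch of the two quantities for a single window (not the repo's exact implementation; the model path is illustrative):

```python
# Minimal sketch: per-token predicted probability and top predicted probability
# for one context window of a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./pretrained/open_llama_7b"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir).eval()

text = "It was the best of times, it was the worst of times."
input_ids = tokenizer(text, return_tensors="pt")["input_ids"][:, :1024]  # C = max_length

with torch.no_grad():
    logits = model(input_ids).logits                    # (1, seq_len, vocab_size)

probs = torch.softmax(logits[0, :-1], dim=-1)           # predictions for positions 1..n
targets = input_ids[0, 1:]                              # tokens actually observed
token_probas = probs[torch.arange(len(targets)), targets]  # p(token | preceding context)
top_probas = probs.max(dim=-1).values                   # highest predicted probability per position
```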
At the same time, the general probability of each token and the token frequency over the full set of documents are computed and saved. These normalization dictionaries are then used in the next step to normalize the token-level probabilities on which the meta-classifier is trained.
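Loosely speaking, the two dictionaries map each token id to its average predicted probability and to its relative frequency over all documents. A minimal sketch of that bookkeeping (the in-memory format is our assumption, not the repo's exact code):

```python
# Minimal sketch: build the two normalization dictionaries from per-document
# (token ids, token probabilities) pairs produced in the previous step.
from collections import Counter, defaultdict
import pickle

all_documents = [([5, 7, 5], [0.20, 0.05, 0.40])]  # illustrative stand-in

token_counts = Counter()          # token id -> number of occurrences
proba_sums = defaultdict(float)   # token id -> sum of predicted probabilities

for token_ids, token_probas in all_documents:
    token_counts.update(token_ids)
    for tok, p in zip(token_ids, token_probas):
        proba_sums[tok] += p

total = sum(token_counts.values())
token_freq = {tok: c / total for tok, c in token_counts.items()}
general_proba = {tok: proba_sums[tok] / token_counts[tok] for tok in token_counts}

with open("token_freq_dict.pickle", "wb") as f:
    pickle.dump(token_freq, f)
with open("general_proba_dict.pickle", "wb") as f:
    pickle.dump(general_proba, f)
```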
We run this with `python main.py -c config/SOME_CONFIG.ini`, where the exact setup is specified in the config file (such as the path to the perplexity results, the normalization type, the meta-classifier type, etc.).
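For orientation, a hypothetical config excerpt; the key names below are assumptions illustrating the options discussed in this README, see the files in `./config/` for the real setups:

```ini
; Hypothetical excerpt -- see ./config/ for the actual setups used in the paper.
[classifier]
path_to_perplexity_results = perplexity_results/books/
norm_type = ratio
path_to_normalization_dict = PATH_TO_GENERAL_PROBA_DICT.pickle
feat_extraction_type = hist_25
models = logistic_regression,random_forest
```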
The evaluation results are then saved in `classifier_results/`. The folder `./config/` contains all setups used to generate the results in the paper (for one dataset, i.e. books).
The following inputs correspond to the exact setups used in the paper; a sketch of what these options compute follows the list below.
- Normalization strategies:
  - NoNorm: `norm_type='none'`
  - RatioNormTF: `norm_type='ratio'` and `path_to_normalization_dict=PATH_TO_TOKEN_FREQ_DICT.pickle`
  - RatioNormGP: `norm_type='ratio'` and `path_to_normalization_dict=PATH_TO_GENERAL_PROBA_DICT.pickle`
  - MaxNormTF: `norm_type='diff_max_token_proba'` and `path_to_normalization_dict=PATH_TO_TOKEN_FREQ_DICT.pickle`
  - MaxNormGP: `norm_type='diff_max_token_proba'` and `path_to_normalization_dict=PATH_TO_GENERAL_PROBA_DICT.pickle`
- Document-level feature extraction:
  - Aggregate feature extractor (AggFE): `feat_extraction_type='simple_agg'`
  - Histogram feature extractor (HistFE): `feat_extraction_type='hist_K'` for a histogram with K bins.
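To make these options concrete, here is a minimal sketch of what the ratio normalization and the two feature extractors compute on a document's token-level probabilities (our own illustration; the choice of aggregate statistics is an assumption, see the repo for the exact definitions):

```python
# Minimal sketch: normalization and document-level feature extraction on the
# token-level probabilities of one document.
import numpy as np

def ratio_normalize(token_ids, token_probas, norm_dict):
    # RatioNorm*: divide each token's probability by its reference value
    # (token frequency for *TF, general probability for *GP).
    return np.asarray(token_probas) / np.array([norm_dict[t] for t in token_ids])

def agg_features(values):
    # AggFE ('simple_agg'): aggregate statistics over the token-level values
    # (illustrative choice of statistics).
    return [values.min(), values.max(), values.mean(), values.std()]

def hist_features(values, k):
    # HistFE ('hist_K'): a K-bin histogram of the token-level values.
    counts, _ = np.histogram(values, bins=k)
    return (counts / counts.sum()).tolist()
```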
By default across all configs, both a logistic regression and a random forest model are trained as meta-classifier (`models='logistic_regression,random_forest'`), while in the paper only the latter is used.
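Training the meta-classifier then reduces to a standard supervised problem on the document-level feature vectors. A minimal sketch with scikit-learn (features and labels are randomly generated placeholders):

```python
# Minimal sketch: train both meta-classifiers on document-level features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.random((200, 25))        # one feature vector per document (placeholder)
y = rng.integers(0, 2, 200)      # 1 = member, 0 = non-member (placeholder)

for clf in (LogisticRegression(max_iter=1000), RandomForestClassifier()):
    clf.fit(X[:150], y[:150])
    auc = roc_auc_score(y[150:], clf.predict_proba(X[150:])[:, 1])
    print(f"{type(clf).__name__}: AUC = {auc:.2f}")
```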
We also provide the code we used to compute the baselines. For this we use `python src/compute_baselines.py` with the appropriate input arguments, as in `scripts/compute_baselines.sh`. Note that the code comes from Shi et al. here.
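For reference, the Min-K% Prob score of Shi et al. averages the log-probabilities of the k% least likely tokens in a document. A minimal sketch of that statistic (not their exact code):

```python
# Minimal sketch: Min-K% Prob (Shi et al.) from per-token probabilities.
import numpy as np

def min_k_prob(token_probas, k=20):
    log_probas = np.log(np.asarray(token_probas))
    n = max(1, int(len(log_probas) * k / 100))
    return np.sort(log_probas)[:n].mean()  # mean log-proba of the k% least likely tokens
```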
For the neighborhood baseline introduced by Mattern et al., we adapt their code in `src/compute_baselines.py` and `scripts/compute_neighborhood_baselines.sh`. Note that its input requires a pickle file, but this could easily be adapted if needed.
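The neighborhood score compares the model's loss on the original text with its average loss on slightly perturbed 'neighbor' texts. A minimal sketch of the comparison itself; `lm_loss` and the neighbor generation (done with a masked LM in Mattern et al.) are assumed to be provided:

```python
# Minimal sketch: neighborhood comparison score (Mattern et al.).
# `lm_loss` (text -> language-model loss) and `neighbors` are assumed inputs.
import numpy as np

def neighborhood_score(lm_loss, text, neighbors):
    # More negative scores suggest the original is unusually easy for the model
    # relative to its neighbors, hinting at membership.
    return lm_loss(text) - np.mean([lm_loss(n) for n in neighbors])
```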
If you found this code helpful for your research, kindly cite our work:
```bibtex
@article{meeus2023did,
  title={Did the Neurons Read your Book? Document-level Membership Inference for Large Language Models},
  author={Meeus, Matthieu and Jain, Shubham and Rei, Marek and de Montjoye, Yves-Alexandre},
  journal={arXiv preprint arXiv:2310.15007},
  year={2023}
}
```