This is the source code for KBAlign: Efficient Self Adaptation on Specific Knowledge Bases. The code comments are currently incomplete, and we will continue to improve them.
We propose an approach for efficient adaptation to downstream tasks involving knowledge bases (KBs), aiming to unleash the self-learning potential of LLMs. Experiments on different datasets and backbone models demonstrate the general effectiveness of KBAlign.
The files are briefly described below:
- `requirements.txt`: environment setup file
- `iter_pipeline(_llama).sh`: iterative tuning bash file
- `self_annotation_data/`
  - `long_pipeline.sh`: long-dependency annotation bash file
  - `short_pipeline.sh`: short-dependency annotation bash file
  - `merge.sh`: merge annotated data or mix in general-domain data
  - `merge.py`: merge data python file
  - `long_dependency/`: long-dependency annotation code directory
  - `short_dependency/`: short-dependency annotation code directory
- `finetune/`
  - `LLaMA-Factory/`: LLaMA Factory finetune directory
  - `mc_finetune/`: Model Center finetune directory
  - `bmtrainMiniCPM_hugMiniCPM.py`: model format conversion from bmtrainMiniCPM to hugMiniCPM
  - `hugMiniCPM_bmtrainMiniCPM.py`: model format conversion from hugMiniCPM to bmtrainMiniCPM
  - `change_info.py`: updates the dataset for LLaMA Factory
- `verify/`
  - `split_verify.py`: splits annotated data
  - `gen_iterate_verify.py`: generates self-verified data
- `eval/`
  - `eval.sh`: evaluation bash file
  - `eval.py`: evaluation python file
  - `config.json`: evaluation config file
- `utils/`: utils directory
  - `prompt/`
    - `eval.json`: prompt for evaluation
    - `jec_eval.json`: prompt for JEC dataset evaluation
    - `short_annotation.json`: prompt for short-dependency annotation
The model is implemented in PyTorch. We strongly recommend managing your environment with Anaconda, since the LLaMA-based pipeline requires a separate environment.
The versions of the main packages used are listed below.
- numpy==1.24.4
- torch==2.4.0
- transformers==4.44.0
- model-center==1.0.2
- bmtrain==0.2.3
- vllm==0.5.4
To set up the dependencies, you can run the following command:
`pip install -r requirements.txt`
For the LLaMA-based pipeline, create a new conda environment and run the following command:
`pip install -r ./finetune/LLaMA-Factory/requirements.txt`
- First, save your raw knowledge materials into a JSON file in the format `[Homogeneous data (str), Homogeneous data (str), ...]`. A segment of LooGLE raw data is saved in `data/loogle/KB/KB_0.json`, and you can use this file to try the annotation pipeline. Homogeneous data means knowledge on the same topic, content from the same document, or even segments of text related to the same subject area. The segmentation does not enforce a strict length requirement; it primarily depends on logical grouping and relevance within the data. A minimal preparation sketch follows these annotation steps.
- Then, change the related arguments in `./self_annotation_data/short_pipeline.sh`, including the input/output file paths, the annotation backbone model type and path, language, chunk size, and so on.
- Run `bash ./self_annotation_data/short_pipeline.sh` to conduct the short-dependency data annotation. Correspondingly, run `bash ./self_annotation_data/long_pipeline.sh` for the long-dependency annotation.
- You can check `--ChatGLM3_data_format_dir` and `--bm_data_format_dir` to see the self-annotation results. Note that you can also annotate higher-quality data by using OpenAI API keys.
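If you want to try the pipeline on your own materials, the minimal Python sketch below shows how to save homogeneous text segments in the expected list-of-strings JSON layout (the same layout as the provided `data/loogle/KB/KB_0.json`). The segment texts and the output path are placeholders.

```python
import json
import os

# Hypothetical raw materials: each entry is one homogeneous segment of text
# (same topic / same document), as described above. Replace with your own data.
kb_segments = [
    "Segment 1: background text about the target knowledge base ...",
    "Segment 2: more text on the same topic or from the same document ...",
]

# The annotation pipeline expects a plain JSON list of strings.
out_path = "data/my_kb/KB_0.json"  # placeholder path; point kb_path here
os.makedirs(os.path.dirname(out_path), exist_ok=True)
with open(out_path, "w", encoding="utf-8") as f:
    json.dump(kb_segments, f, ensure_ascii=False, indent=2)
```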
- With the annotated data, you can conduct the tuning process. First, modify the needed arguments in `iter_pipeline.sh`.
- If you are tuning the LLaMA-based model, go to `iter_pipeline_llama.sh` instead. You can also modify the LoRA settings in `./finetune/LLaMA-Factory/config/llama3_lora_sft.yaml` and `./finetune/LLaMA-Factory/config/llama3_merge_lora.yaml`, and the dataset settings in `./finetune/LLaMA-Factory/data/dataset_info.json` (a rough sketch of such a dataset registration follows these tuning steps).
- Now, run `bash iter_pipeline.sh` or `bash iter_pipeline_llama.sh` to conduct the tuning process.
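For the LLaMA-based pipeline, `change_info.py` updates the LLaMA-Factory dataset registry for you. The sketch below is only a rough illustration of what such an update looks like, assuming LLaMA-Factory's standard `dataset_info.json` schema; the dataset name, file name, and column mapping are placeholders and may need adjustment for your LLaMA-Factory version.

```python
import json

info_path = "./finetune/LLaMA-Factory/data/dataset_info.json"

with open(info_path, "r", encoding="utf-8") as f:
    dataset_info = json.load(f)

# Register the newly annotated fine-tuning file under a new dataset name.
# "my_kb_sft" / "my_kb_sft.json" are placeholders; the columns assume a
# simple instruction/input/output (alpaca-style) format.
dataset_info["my_kb_sft"] = {
    "file_name": "my_kb_sft.json",
    "columns": {
        "prompt": "instruction",
        "query": "input",
        "response": "output",
    },
}

with open(info_path, "w", encoding="utf-8") as f:
    json.dump(dataset_info, f, ensure_ascii=False, indent=2)
```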
- Please go to `eval/` for evaluation. Save the data items that you want to test under `data/...`. Modify the related settings in `config.json`, and then run `bash eval.sh`.
- You can also download our tuned checkpoints to directly test the performance; a quick smoke-test sketch is shown below.
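The evaluation scripts handle retrieval, prompting, and scoring. If you just want to smoke-test a tuned checkpoint (ours or your own) before running `eval.sh`, a minimal sketch with vLLM (already listed in `requirements.txt`) could look like the following; the checkpoint path and prompt are placeholders, and the checkpoint is assumed to be in Hugging Face format.

```python
from vllm import LLM, SamplingParams

# Placeholder path to a tuned checkpoint in Hugging Face format,
# e.g. the directory you set as model_save_path / qa_model_path.
llm = LLM(model="YOUR_FINETUNE_MODEL_PATH")

params = SamplingParams(temperature=0.0, max_tokens=256)
prompt = "Answer based on the knowledge base: <your question here>"

outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```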
Below is a table outlining the key arguments that require modification. For more detailed explanations and additional parameters, refer to the bash files and search for todo.
| Parameter Name | File Name(s) | Explanation/Example |
|---|---|---|
| `bm_data_format_dir` | `iter_pipeline.sh`, `long_pipeline.sh`, `short_pipeline.sh` | Directory path for bm data formatting. Example: `data/bm/YOUR_PATH` |
| `ChatGLM3_data_format_dir` | `iter_pipeline_llama.sh`, `long_pipeline.sh`, `short_pipeline.sh` | Path for ChatGLM3-formatted data. Example: `data/ChatGLM3/YOUR_PATH` |
| `Finetune_file_name` | `iter_pipeline.sh`, `iter_pipeline_llama.sh` | Name of the fine-tuning file. Example: `YOUR_FINETUNE_FILE_NAME` |
| `model_name_or_path` | `iter_pipeline.sh`, `short_pipeline.sh` | Path to the pre-trained model. Example: `YOUR_PATH_TO_LLM` |
| `model_save_path` | `iter_pipeline.sh`, `iter_pipeline_llama.sh` | Path to save the fine-tuned model. Example: `$SCRIPT_DIR/model/finetuned_model` |
| `kb_path` | `iter_pipeline.sh`, `iter_pipeline_llama.sh`, `eval.sh`, `long_pipeline.sh` | Knowledge base JSON file path. Example: `data/loogle/KB/KB_{id}.json` |
| `kb_emb_path` | `iter_pipeline.sh`, `iter_pipeline_llama.sh`, `eval.sh` | Knowledge base embeddings path. Example: `data/loogle/embeddings/{bge_type}/sentence_embeddings_{id}_{chunk_size}.pkl` |
| `kb_sentence_path` | `iter_pipeline.sh`, `iter_pipeline_llama.sh`, `eval.sh` | Path to KB sentences. Example: `data/loogle/sentences/sentence_sentences_{id}_{chunk_size}.pkl` |
| `iter_num` | `iter_pipeline.sh`, `iter_pipeline_llama.sh` | Number of iterations. Example: `1` |
| `train_step_num` | `iter_pipeline.sh`, `iter_pipeline_llama.sh` | Number of training steps per iteration. Example: `1` |
| `split_num` | `iter_pipeline.sh`, `iter_pipeline_llama.sh` | Number of splits for processing. Example: `1` |
| `annotate_path` | `long_pipeline.sh` | Path for KB annotation. Example: `data/loogle/KB/KB_annotate.json` |
| `sentences_path` | `long_pipeline.sh` | Path for sentence data. Example: `data/loogle/sentences/sentence_sentences_0_128.pkl` |
| `sentences_emb_path` | `long_pipeline.sh` | Path for sentence embeddings. Example: `data/loogle/embeddings/BAAI/bge-large-en-v1.5/sentence_embeddings_0_128.pkl` |
| `documents_path` | `long_pipeline.sh` | Path for cached documents. Example: `cache/loogle_documents_path.pkl` |
| `bge_type` | `long_pipeline.sh`, `short_pipeline.sh` | Type of embedding model. Example: `BAAI/bge-large-en-v1.5` |
| `language` | `short_pipeline.sh` | Language for processing. Example: `English` |
| `model_type` | `short_pipeline.sh` | Type of model used. Example: `llama`, `cpm`, or `gpt` |
| `tokenizer_path` | `short_pipeline.sh` | Path to tokenizer files. Example: `YOUR_PATH_TO_TOKENIZE` |
| `cache_dir` | `short_pipeline.sh`, `eval.sh` | Directory for caching. Example: `YOUR_HUGGING_FACE_CACHE_DIR` |
| `qa_model_path` | `eval.sh` | Path to the fine-tuned QA model. Example: `YOUR_FINETUNE_MODEL_PATH` |
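Several of the path arguments above contain `{id}`, `{chunk_size}`, and `{bge_type}` placeholders that are resolved to concrete values per KB segment, chunk size, and embedding model. The snippet below is purely illustrative: it expands the example templates for KB id 0, chunk size 128, and the default BGE model, which matches the concrete `long_pipeline.sh` examples shown in the table.

```python
# Illustrative only: shows how the templated example paths in the table expand.
bge_type = "BAAI/bge-large-en-v1.5"
kb_id = 0
chunk_size = 128

kb_path = f"data/loogle/KB/KB_{kb_id}.json"
kb_emb_path = f"data/loogle/embeddings/{bge_type}/sentence_embeddings_{kb_id}_{chunk_size}.pkl"
kb_sentence_path = f"data/loogle/sentences/sentence_sentences_{kb_id}_{chunk_size}.pkl"

print(kb_path)
print(kb_emb_path)
print(kb_sentence_path)
```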
Please cite our work if it helps.
@article{zeng2024kbalign,
  title   = {KBAlign: Efficient Self Adaptation on Specific Knowledge Bases},
  author  = {Zeng, Zheni and Chen, Yuxuan and Yu, Shi and Yan, Yukun and Liu, Zhenghao and Wang, Shuo and Han, Xu and Liu, Zhiyuan and Sun, Maosong},
  year    = {2024},
  journal = {arXiv preprint arXiv:2411.14790}
}