KCQRL: Automated Knowledge Concept Annotation and Question Representation Learning for Knowledge Tracing
This is the repository of KCQRL: Automated Knowledge Concept Annotation and Question Representation Learning for Knowledge Tracing. [link to our paper]
Note: If you find our work valuable or use the English translation and/or annotations of the XES3G5M dataset, we kindly ask you to consider citing our work:
@article{ozyurt2024automated,
  title={Automated Knowledge Concept Annotation and Question Representation Learning for Knowledge Tracing},
  author={Ozyurt, Yilmazcan and Feuerriegel, Stefan and Sachan, Mrinmaya},
  journal={arXiv preprint arXiv:2410.01727},
  year={2024}
}
Our KCQRL framework consistently improves the performance of state-of-the-art knowledge tracing (KT) models by a clear margin. The framework consists of three modules:
- KC Annotation: We develop a novel, automated KC annotation approach using large language models (LLMs) that both generates solutions to the questions and labels KCs for each solution step. Thereby, we effectively circumvent the need for manual annotation from domain experts.
- Representation Learning of Questions: We propose a novel contrastive learning paradigm to jointly learn representations of question content, solution steps, and KCs (see the sketch after this list for the general idea). As a result, our KCQRL effectively leverages the semantics of question content and KCs, a clear improvement over existing KT models.
- Improving KT Models: We integrate the learned representations into KT models to improve their performance. Our framework is flexible and can be combined with any state-of-the-art KT model for improved results.
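To give an intuition for the second module, below is a minimal, illustrative sketch of an in-batch contrastive (InfoNCE-style) objective over question/KC embedding pairs. It is not the loss implemented in train.py, which jointly covers question content, solution steps, and KCs and additionally uses the KC clusters; the function name, encoder choice, and temperature are assumptions made only for this example.

```python
import torch
import torch.nn.functional as F

def info_nce(question_emb, kc_emb, temperature=0.05):
    """Illustrative in-batch InfoNCE loss (sketch, not the KCQRL objective).

    question_emb, kc_emb: (batch_size, dim) embeddings of questions and of
    their annotated KCs, produced by a shared text encoder. The i-th question
    is treated as a positive pair with the i-th KC; all other in-batch KCs
    serve as negatives.
    """
    q = F.normalize(question_emb, dim=-1)
    k = F.normalize(kc_emb, dim=-1)
    logits = q @ k.t() / temperature                     # (B, B) cosine similarities
    targets = torch.arange(q.size(0), device=q.device)   # positives on the diagonal
    return F.cross_entropy(logits, targets)
```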
Our main results are shown below.
| Model | Default (XES3G5M) | w/ KCQRL (ours) | Imp. (abs.) | Imp. (%) | Default (Eedi) | w/ KCQRL (ours) | Imp. (abs.) | Imp. (%) |
|---|---|---|---|---|---|---|---|---|
| DKT | 78.33 ± 0.06 | **82.13 ± 0.02** | +3.80 | +4.85% | 73.59 ± 0.01 | **74.97 ± 0.03** | +1.38 | +1.88% |
| DKT+ | 78.57 ± 0.05 | **82.34 ± 0.04** | +3.77 | +4.80% | 73.79 ± 0.03 | **75.32 ± 0.04** | +1.53 | +2.07% |
| KQN | 77.81 ± 0.03 | **82.10 ± 0.06** | +4.29 | +5.51% | 73.13 ± 0.01 | **75.16 ± 0.04** | +2.03 | +2.78% |
| qDKT | 81.94 ± 0.05 | **82.13 ± 0.05** | +0.19 | +0.23% | 74.09 ± 0.03 | **74.97 ± 0.04** | +0.88 | +1.19% |
| IEKT | 82.24 ± 0.07 | **82.82 ± 0.06** | +0.58 | +0.71% | 75.12 ± 0.02 | **75.56 ± 0.02** | +0.44 | +0.59% |
| AT-DKT | 78.36 ± 0.06 | **82.36 ± 0.07** | +4.00 | +5.10% | 73.72 ± 0.04 | **75.25 ± 0.02** | +1.53 | +2.08% |
| QIKT | 82.07 ± 0.04 | **82.62 ± 0.05** | +0.55 | +0.67% | 75.15 ± 0.04 | **75.74 ± 0.02** | +0.59 | +0.79% |
| DKVMN | 77.88 ± 0.04 | **82.64 ± 0.02** | +4.76 | +6.11% | 72.74 ± 0.05 | **75.51 ± 0.02** | +2.77 | +3.81% |
| DeepIRT | 77.81 ± 0.06 | **82.56 ± 0.02** | +4.75 | +6.10% | 72.61 ± 0.02 | **75.18 ± 0.05** | +2.57 | +3.54% |
| ATKT | 79.78 ± 0.07 | **82.37 ± 0.04** | +2.59 | +3.25% | 72.17 ± 0.03 | **75.28 ± 0.04** | +3.11 | +4.31% |
| SAKT | 75.90 ± 0.05 | **81.64 ± 0.03** | +5.74 | +7.56% | 71.60 ± 0.03 | **74.77 ± 0.02** | +3.17 | +4.43% |
| SAINT | 79.65 ± 0.02 | **81.50 ± 0.07** | +1.85 | +2.32% | 73.96 ± 0.02 | **75.20 ± 0.04** | +1.24 | +1.68% |
| AKT | 81.67 ± 0.03 | **83.04 ± 0.05** | +1.37 | +1.68% | 74.27 ± 0.03 | **75.49 ± 0.03** | +1.22 | +1.64% |
| simpleKT | 81.05 ± 0.06 | **82.92 ± 0.04** | +1.87 | +2.31% | 73.90 ± 0.04 | **75.46 ± 0.02** | +1.56 | +2.11% |
| sparseKT | 79.65 ± 0.11 | **82.95 ± 0.09** | +3.30 | +4.14% | 74.98 ± 0.09 | **78.96 ± 0.08** | +3.98 | +5.31% |

Best values are in bold.
Dataset details: We used the XES3G5M (which we translated from Chinese to English) and Eedi datasets in our work.
- The details of XES3G5M can be found here. You can download the dataset by following the instructions there. After the download, you can add the files from data/XES3G5M/metadata to run our framework.
- The Eedi dataset can be acquired upon request. Once acquired, create a new folder data/Eedi/ and move your files there. Then run the preprocessing code we provide, python data_preprocess.py --dataset_name=eedi, inside the pykt-toolkit directory.
Important note: For XES3G5M, we already provide its English translation, the entire output of our KC annotation, and the clustering of KCs here.
- Therefore, after downloading the XES3G5M dataset from its source (for the exercise histories), you can start directly from our Representation Learning of Questions and quickly improve your existing KT model!
Python environment: We used Python 3.11.6 in our implementation, with two separate virtual environments:
- Install the libraries via pip install -r requirements_env_rl.txt for KC Annotation and Representation Learning.
- Install the libraries via pip install -r requirements_env_pykt.txt for Improving KT Models. After installing the libraries, go to pykt-toolkit and run the command pip install -e . to install our custom version of pykt with the improved KT implementations.
KC annotation: This part shows an example usage of the full KC annotation pipeline. To run the scripts, first go to the kc_annotation folder. We use the English translation of the XES3G5M dataset, questions_translated.json, as our running example.
First, generate the step-by-step solutions with the command below:
python get_step_by_step_solutions.py --original_question_file ../data/XES3G5M/metadata/questions_translated.json --annotated_question_file ../data/XES3G5M/metadata/questions_translated_kc_annotated.json
Next, annotate the KCs with the command below:
python get_kc_annotation.py --original_question_file ../data/XES3G5M/metadata/questions_translated_kc_annotated.json --annotated_question_file ../data/XES3G5M/metadata/questions_translated_kc_sol_annotated.json
Finally, map the annotated KCs to the individual solution steps with the command below:
python get_mapping_kc_solsteps.py --original_question_file ../data/XES3G5M/metadata/questions_translated_kc_sol_annotated.json --mapped_question_file ../data/XES3G5M/metadata/questions_translated_kc_sol_annotated_mapped.json
Note: For convenience, we provide the final output of this pipeline, questions_translated_kc_sol_annotated_mapped.json.
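If you want to sanity-check the pipeline output before moving on, the short snippet below simply loads the final JSON file and prints one entry. It assumes nothing about the schema beyond the file being JSON; the keys you will see are whatever the annotation scripts produced.

```python
import json

# Final output of the KC annotation pipeline (see the commands above).
path = "../data/XES3G5M/metadata/questions_translated_kc_sol_annotated_mapped.json"

with open(path, "r", encoding="utf-8") as f:
    questions = json.load(f)

# Print one example entry, whether the top level is a dict keyed by question id or a list.
first = next(iter(questions.values())) if isinstance(questions, dict) else questions[0]
print(json.dumps(first, indent=2, ensure_ascii=False))
```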
Representation learning of questions: For this part, please go to the representation_learning folder.
For training, you can run the command below:
python train.py --json_file_dataset ../data/XES3G5M/metadata/questions_translated_kc_sol_annotated_mapped.json --json_file_cluster_kc ../data/XES3G5M/metadata/kc_clusters_hdbscan.json --json_file_kc_questions ../data/XES3G5M/metadata/kc_questions_map.json --wandb_project_name <your_wandb_project_name>
Note that the above command requires you to set up your wandb account first.
After training, you can save the embeddings by following save_embeddings.ipynb.
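The exact output format is determined by save_embeddings.ipynb. Assuming it stores the question embeddings as a NumPy array (the file name below is hypothetical), a quick shape check before passing the path to the KT training scripts could look like this:

```python
import numpy as np

# Hypothetical path: use whatever file save_embeddings.ipynb produced for you.
emb_path = "saved_embeddings/xes3g5m_question_embeddings.npy"

embeddings = np.load(emb_path)
print(embeddings.shape)  # expected: (num_questions, embedding_dim)
```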
Improving KT models: We implemented the improved versions of the KT models via the pykt library. We forked the library to pykt-toolkit and developed the models there. Specifically, our implemented KT models can be found in the models folder.
As a naming convention, we added the Que suffix to the existing models, where "Que" refers to our learned question representations. For instance, the improved version of SimpleKT is implemented as SimpleKTQue and can be found in simplekt_que.py.
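To illustrate the general idea behind the Que-suffixed models (this is a sketch, not the pykt-toolkit code), the learned question embeddings can be loaded into a frozen embedding table that replaces a randomly initialized question-embedding layer; the class and parameter names below are made up for the example.

```python
import numpy as np
import torch
import torch.nn as nn

class PretrainedQuestionEmbedding(nn.Module):
    """Question-embedding table initialized from the learned representations.

    Sketch of how a KT model can consume the embeddings produced by the
    representation learning step; not the actual pykt-toolkit implementation.
    """

    def __init__(self, emb_path: str, freeze: bool = True):
        super().__init__()
        weights = torch.tensor(np.load(emb_path), dtype=torch.float32)
        self.table = nn.Embedding.from_pretrained(weights, freeze=freeze)

    def forward(self, question_ids: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len) question ids -> (batch, seq_len, dim) embeddings
        return self.table(question_ids)
```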
For training these models, go to the train_test folder. For instance, you can train SparseKTQue with the command below:
python sparsekt_que_train.py --emb_path <embeddings_from_representation_learning>
Note that the above command also requires you to set up your wandb account first.
We use wandb_eval.py and wandb_predict.py from the pykt library for evaluation. The details of the library can be found in its documentation.