KCQRL: Automated Knowledge Concept Annotation and Question Representation Learning for Knowledge Tracing

The overview of our framework

This is the repository of KCQRL: Automated Knowledge Concept Annotation and Question Representation Learning for Knowledge Tracing. [link to our paper]

Note: If you find our work valuable or use the English translation and/or annotations of the XES3G5M dataset, we kindly ask you to consider citing our work.

@article{ozyurt2024automated,
  title={Automated Knowledge Concept Annotation and Question Representation Learning for Knowledge Tracing},
  author={Ozyurt, Yilmazcan and Feuerriegel, Stefan and Sachan, Mrinmaya},
  journal={arXiv preprint arXiv:2410.01727},
  year={2024}
}

Our KCQRL framework consistently improves the performance of state-of-the-art KT models by a clear margin. For this, we developed our framework in 3 modules:

  1. KC Annotation: We develop a novel, automated KC annotation approach using large language models (LLMs) that both generates solutions to the questions and labels KCs for each solution step. Thereby, we effectively circumvent the need for manual annotation from domain experts.
  2. Representation Learning of Questions: We propose a novel contrastive learning paradigm to jointly learn representations of question content, solution steps, and KCs. As a result, our KCQRL effectively leverages the semantics of question content and KCs, as a clear improvement over existing KT models.
  3. Improving KT Models: We integrate the learned representations into KT models to improve their performance. Our framework is flexible and can be combined with any state-of-the-art KT model for improved results.

You can find our main result below.

Improvement in the performance of KT models from our framework. Shown: AUC with std. dev. across 5 folds. Improvements are shown as both absolute and relative (%) values.
| Model | XES3G5M: Default | XES3G5M: w/ KCQRL (ours) | XES3G5M: Imp. (abs.) | XES3G5M: Imp. (%) | Eedi: Default | Eedi: w/ KCQRL (ours) | Eedi: Imp. (abs.) | Eedi: Imp. (%) |
|---|---|---|---|---|---|---|---|---|
| DKT | 78.33 ± 0.06 | 82.13 ± 0.02 | +3.80 | +4.85% | 73.59 ± 0.01 | 74.97 ± 0.03 | +1.38 | +1.88% |
| DKT+ | 78.57 ± 0.05 | 82.34 ± 0.04 | +3.77 | +4.80% | 73.79 ± 0.03 | 75.32 ± 0.04 | +1.53 | +2.07% |
| KQN | 77.81 ± 0.03 | 82.10 ± 0.06 | +4.29 | +5.51% | 73.13 ± 0.01 | 75.16 ± 0.04 | +2.03 | +2.78% |
| qDKT | 81.94 ± 0.05 | 82.13 ± 0.05 | +0.19 | +0.23% | 74.09 ± 0.03 | 74.97 ± 0.04 | +0.88 | +1.19% |
| IEKT | 82.24 ± 0.07 | 82.82 ± 0.06 | +0.58 | +0.71% | 75.12 ± 0.02 | 75.56 ± 0.02 | +0.44 | +0.59% |
| AT-DKT | 78.36 ± 0.06 | 82.36 ± 0.07 | +4.00 | +5.10% | 73.72 ± 0.04 | 75.25 ± 0.02 | +1.53 | +2.08% |
| QIKT | 82.07 ± 0.04 | 82.62 ± 0.05 | +0.55 | +0.67% | 75.15 ± 0.04 | 75.74 ± 0.02 | +0.59 | +0.79% |
| DKVMN | 77.88 ± 0.04 | 82.64 ± 0.02 | +4.76 | +6.11% | 72.74 ± 0.05 | 75.51 ± 0.02 | +2.77 | +3.81% |
| DeepIRT | 77.81 ± 0.06 | 82.56 ± 0.02 | +4.75 | +6.10% | 72.61 ± 0.02 | 75.18 ± 0.05 | +2.57 | +3.54% |
| ATKT | 79.78 ± 0.07 | 82.37 ± 0.04 | +2.59 | +3.25% | 72.17 ± 0.03 | 75.28 ± 0.04 | +3.11 | +4.31% |
| SAKT | 75.90 ± 0.05 | 81.64 ± 0.03 | +5.74 | +7.56% | 71.60 ± 0.03 | 74.77 ± 0.02 | +3.17 | +4.43% |
| SAINT | 79.65 ± 0.02 | 81.50 ± 0.07 | +1.85 | +2.32% | 73.96 ± 0.02 | 75.20 ± 0.04 | +1.24 | +1.68% |
| AKT | 81.67 ± 0.03 | 83.04 ± 0.05 | +1.37 | +1.68% | 74.27 ± 0.03 | 75.49 ± 0.03 | +1.22 | +1.64% |
| simpleKT | 81.05 ± 0.06 | 82.92 ± 0.04 | +1.87 | +2.31% | 73.90 ± 0.04 | 75.46 ± 0.02 | +1.56 | +2.11% |
| sparseKT | 79.65 ± 0.11 | 82.95 ± 0.09 | +3.30 | +4.14% | 74.98 ± 0.09 | 78.96 ± 0.08 | +3.98 | +5.31% |
Best values are in bold.
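
As a quick sanity check on how the improvement columns relate to the AUC columns, the sketch below recomputes them from the mean AUC values. It assumes the relative improvement is taken with respect to the default model's AUC, which is consistent with the reported numbers; the function name `improvement` is only illustrative and is not part of the repository.

```python
def improvement(default_auc, kcqrl_auc):
    """Return (absolute, relative %) improvement, both rounded to 2 decimals.

    Assumption: the relative improvement is measured against the default AUC.
    """
    absolute = kcqrl_auc - default_auc
    relative = 100.0 * absolute / default_auc
    return round(absolute, 2), round(relative, 2)


# DKT on XES3G5M: mean AUC of 78.33 (default) vs. 82.13 (w/ KCQRL).
dkt_abs, dkt_rel = improvement(78.33, 82.13)
print(f"DKT: +{dkt_abs:.2f} abs., +{dkt_rel:.2f}%")

# SAKT on XES3G5M: mean AUC of 75.90 (default) vs. 81.64 (w/ KCQRL).
sakt_abs, sakt_rel = improvement(75.90, 81.64)
print(f"SAKT: +{sakt_abs:.2f} abs., +{sakt_rel:.2f}%")
```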

Setup

Dataset details: We used the XES3G5M dataset (which we translated from Chinese to English) and the Eedi dataset for our work.

  • The details of XES3G5M can be found here. You can download the dataset by following the instructions there. After downloading, add the files from data/XES3G5M/metadata to run our framework.
  • The Eedi dataset can be acquired upon request. Once you have acquired it, create a new folder data/Eedi/ and move your files there. Then run the preprocessing code we provide, python data_preprocess.py --dataset_name=eedi, from inside the pykt-toolkit directory.

Important note: For XES3G5M, we already provide its English translation, entire output from our KC annotation, and the clustering of KCs here.

  • Therefore, after downloading the XES3G5M dataset from its source (for the exercise histories), you can start directly from our Representation Learning of Questions and quickly improve your existing KT model!

Python environment: We used Python 3.11.6 in our implementation. We use two separate virtual environments in our framework.

  • Install the libraries via pip install -r requirements_env_rl.txt for KC Annotation and Representation Learning

  • Install the libraries via pip install -r requirements_env_pykt.txt for Improving KT Models. After installing the libraries, navigate to pykt-toolkit and run pip install -e . to install our custom version of pykt with improved KT implementations.

1) KC Annotation via LLMs

This part shows an example usage of the full KC annotation pipeline. To run the scripts, first navigate to the kc_annotation folder.

We use the English translation of the XES3G5M dataset, questions_translated.json, as our running example.

a) Solution step generation

Run the command below:

python get_step_by_step_solutions.py --original_question_file ../data/XES3G5M/metadata/questions_translated.json --annotated_question_file ../data/XES3G5M/metadata/questions_translated_kc_annotated.json

b) KC annotation

Run the command below:

python get_kc_annotation.py --original_question_file ../data/XES3G5M/metadata/questions_translated_kc_annotated.json --annotated_question_file ../data/XES3G5M/metadata/questions_translated_kc_sol_annotated.json

c) Solution Step - KC mapping

Run the command below:

python get_mapping_kc_solsteps.py --original_question_file ../data/XES3G5M/metadata/questions_translated_kc_sol_annotated.json --mapped_question_file ../data/XES3G5M/metadata/questions_translated_kc_sol_annotated_mapped.json

Note: For convenience, we provide the final output of this pipeline as questions_translated_kc_sol_annotated_mapped.json.
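
The three steps form a chain in which the output file of each step is the input file of the next. The sketch below makes that chaining explicit by building the three commands from a single metadata directory. The script names, flags, and file names are the ones used in the commands above; the helper names `build_annotation_commands` and `run_pipeline` are hypothetical and not part of the repository.

```python
import subprocess
from pathlib import PurePosixPath


def build_annotation_commands(metadata_dir):
    """Build the three KC annotation commands, chaining outputs to inputs."""
    meta = PurePosixPath(metadata_dir)
    questions = str(meta / "questions_translated.json")
    after_step_a = str(meta / "questions_translated_kc_annotated.json")
    after_step_b = str(meta / "questions_translated_kc_sol_annotated.json")
    after_step_c = str(meta / "questions_translated_kc_sol_annotated_mapped.json")
    return [
        # a) Solution step generation
        ["python", "get_step_by_step_solutions.py",
         "--original_question_file", questions,
         "--annotated_question_file", after_step_a],
        # b) KC annotation
        ["python", "get_kc_annotation.py",
         "--original_question_file", after_step_a,
         "--annotated_question_file", after_step_b],
        # c) Solution step - KC mapping
        ["python", "get_mapping_kc_solsteps.py",
         "--original_question_file", after_step_b,
         "--mapped_question_file", after_step_c],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True stops the chain as soon as one step fails.
            subprocess.run(command, check=True)


commands = build_annotation_commands("../data/XES3G5M/metadata")
run_pipeline(commands, dry_run=True)
```

Calling `run_pipeline(commands, dry_run=False)` from inside the kc_annotation folder would execute the steps in order.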

2) Representation Learning of Questions

For this part, please navigate to the representation_learning folder.

For training, you can run the command below:

python train.py --json_file_dataset ../data/XES3G5M/metadata/questions_translated_kc_sol_annotated_mapped.json --json_file_cluster_kc ../data/XES3G5M/metadata/kc_clusters_hdbscan.json --json_file_kc_questions ../data/XES3G5M/metadata/kc_questions_map.json --wandb_project_name <your_wandb_project_name>

Note that the above command requires you to set up your wandb account first.

After training, you can save the embeddings by following save_embeddings.ipynb.
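
For readers unfamiliar with contrastive learning, the sketch below shows a generic InfoNCE-style loss for one anchor (for example, a question embedding), one positive (for example, the embedding of one of its KCs), and a set of negatives. It is a minimal illustration of the general technique under these assumptions, not the objective implemented in train.py; the function names and the temperature value are chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss: -log softmax of the positive among all candidates."""
    pos_logit = cosine(anchor, positive) / temperature
    logits = [pos_logit] + [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(logit - max_logit) for logit in logits)
    )
    return log_denominator - pos_logit


# Toy 2-d embeddings: the loss is small when the anchor is close to its
# positive and large when it is closer to a negative instead.
question = [1.0, 0.0]
loss_aligned = info_nce(question, positive=[1.0, 0.0], negatives=[[0.0, 1.0]])
loss_misaligned = info_nce(question, positive=[0.0, 1.0], negatives=[[1.0, 0.0]])
print(loss_aligned < loss_misaligned)
```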

3) Improving KT Models

We implemented the improved versions of the KT models via the pykt library. We forked the library to pykt-toolkit and developed the models there. Specifically, our implemented KT models can be found in the models folder.

As a naming convention, we added the Que suffix to the existing models, where "que" refers to our "learned question representations". For instance, the improved version of SimpleKT is implemented as SimpleKTQue and can be found in simplekt_que.py.

To train these models, navigate to the train_test folder. For example, you can train the Que version of sparseKT with the command below:

python sparsekt_que_train.py --emb_path <embeddings_from_representation_learning>

Note that the above command requires you to set up your wandb account first.
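
Conceptually, the learned question representations take the place of randomly initialized question embeddings inside the KT model. The sketch below illustrates that idea with a plain lookup from question IDs to precomputed vectors. The file format expected by --emb_path is not described here, so the JSON layout (question ID mapped to a list of floats), the helper names, and the zero-vector fallback for unseen questions are all assumptions made for illustration only.

```python
import json


def load_question_embeddings(path):
    """Load a hypothetical JSON file of the form {"<question_id>": [floats]}."""
    with open(path, encoding="utf-8") as handle:
        raw = json.load(handle)
    return {str(qid): [float(x) for x in vector] for qid, vector in raw.items()}


def embed_sequence(question_ids, embeddings, dim):
    """Map a student's question sequence to its precomputed embeddings.

    Assumption: questions without a learned embedding fall back to zeros.
    """
    fallback = [0.0] * dim
    return [embeddings.get(str(qid), fallback) for qid in question_ids]


# Toy example with 3-d vectors; real embeddings would come from the
# representation learning step.
toy_embeddings = {"12": [0.1, 0.2, 0.3], "57": [0.4, 0.5, 0.6]}
sequence = embed_sequence([12, 57, 999], toy_embeddings, dim=3)
print(sequence)
```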

We use wandb_eval.py and wandb_predict.py from the pykt library for evaluation. The details of the library can be found in its documentation.