Hierarchical Deconstruction of LLM Reasoning: A Graph-Based Framework for Analyzing Knowledge Utilization
This repository is the official implementation of Hierarchical Deconstruction of LLM Reasoning: A Graph-Based Framework for Analyzing Knowledge Utilization.
We investigate how large language models utilize knowledge to reason over and solve complex questions, based on a method that deconstructs complex questions into a hierarchical graph.
Each depth of knowledge required to answer a question represents a different level of complexity: additional reasoning is required to answer a more complex question compared to a simpler one.
Create a virtual environment with python>=3.9 and install the appropriate PyTorch version for your machine.
In our project, we use a node with 4 NVIDIA A6000 40GB GPUs and CUDA version 12.3.
conda create -n myenv python=3.10
conda activate myenv
conda install pytorch pytorch-cuda=12.1 -c pytorch -c nvidia
To install requirements:
pip install -r requirements.txt
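Optionally, you can run a quick sanity check (this snippet is a convenience assumption, not part of the repo) to confirm the installed PyTorch build can see your GPUs before launching inference:

```python
# Quick GPU sanity check; not repo code, just an illustrative convenience.
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")
```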
You can experiment with multiple inference modes on our dataset, DepthQA:
- Single-turn:
  - zero-shot: Only the target question is in the input.
  - prompt-gold: Before the target question, shallower questions (i.e., predecessors of the target question) and their gold answers are provided as context.
  - prompt-pred: Before the target question, shallower questions and the model's own predicted answers are provided as context.
- Multi-turn:
  - multi-turn: Shallower questions are provided as inputs in a multi-turn conversation, i.e., the model answers each shallower question one by one and is then presented with the target question. (A prompt-construction sketch for these modes is shown right after this list.)
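For intuition, here is a minimal sketch of how the input for each mode could be assembled; the helper names and prompt layout are assumptions for illustration, not the repo's actual implementation in src/inference/.

```python
# Illustrative sketch only (helper names and prompt layout are assumptions,
# not the repo's actual code) of how each mode shapes the model input.
from typing import Dict, List, Tuple

def build_single_turn_input(
    target_question: str,
    predecessors: List[Tuple[str, str]],  # (shallower question, answer) pairs
    mode: str,
) -> str:
    """zero-shot: target question only; prompt-gold / prompt-pred: prepend
    shallower question-answer pairs (gold vs. predicted answers) as context."""
    if mode == "zero-shot":
        return target_question
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in predecessors)
    return f"{context}\n\nQ: {target_question}\nA:"

def build_multi_turn_messages(
    target_question: str,
    shallower_turns: List[Tuple[str, str]],  # (question, model's earlier answer)
) -> List[Dict[str, str]]:
    """multi-turn: shallower questions are asked one by one, the model's own
    answers stay in the conversation, and the target question comes last."""
    messages: List[Dict[str, str]] = []
    for question, answer in shallower_turns:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": target_question})
    return messages
```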
Most HuggingFace AutoModelForCausalLM models can be run with src/inference/single_turn.py and src/inference/multi_turn.py; vLLM is integrated, and inference runs in mixed precision.
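As a rough sketch of what the vLLM-backed path looks like (the model name, dtype, and sampling settings below are illustrative assumptions, not the scripts' defaults):

```python
# Minimal vLLM usage sketch; values here are assumptions, not the repo's defaults.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # any HF AutoModelForCausalLM checkpoint
    dtype="bfloat16",            # mixed precision
    tensor_parallel_size=4,      # e.g., shard across 4 GPUs
)
params = SamplingParams(temperature=0.0, max_tokens=512)
outputs = llm.generate(["Q: What is the capital of France?\nA:"], params)
print(outputs[0].outputs[0].text)
```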
For OpenAI models, use src/inference/single_turn_openai.py and src/inference/multi_turn_openai.py.
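For intuition, a minimal sketch of the multi-turn flow with the OpenAI Python SDK (>= 1.0); the model name and questions are placeholders, not values taken from the repo's scripts:

```python
# Sketch of the multi-turn conversation flow; placeholders throughout.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

questions = [
    "<shallower question 1>",
    "<shallower question 2>",
    "<target question>",
]

messages = []
for question in questions:
    messages.append({"role": "user", "content": question})
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
    )
    answer = response.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})

print(answer)  # prediction for the target (deepest) question
```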
To run inference with LLaMA 3 8B Instruct in all modes:
bash scripts/inference/llama3_8b.sh
To run inference with GPT-3.5 Turbo in all modes:
bash scripts/inference/gpt-3.5-turbo.sh
Following the LLM-as-a-Judge approach, we use gpt-4-0125-preview to score the correctness of model predictions. Specifically, we use the OpenAI Batch API for faster and cheaper evaluation. Our implementation of the evaluation pipeline consists of four steps:
- Create a batch request
- Check the status of the batch request
- Retrieve the results of the batch request
- Calculate evaluation metrics:
  - Average accuracy
  - Forward discrepancy
  - Backward discrepancy
The first three steps are performed in src/evaluation/batch_eval_openai.py, and the last step in src/evaluation/metric_calculator.py.
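For orientation, here is a hedged sketch of the first three steps using the OpenAI Batch API; the file names and handling below are placeholders, and src/evaluation/batch_eval_openai.py may structure this differently.

```python
# Hedged sketch of the Batch API flow; file names and handling are placeholders.
from openai import OpenAI

client = OpenAI()

# 1. Create a batch request from a JSONL file of judge prompts.
batch_file = client.files.create(file=open("judge_requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# 2. Check the status of the batch request.
batch = client.batches.retrieve(batch.id)
print(batch.status)  # e.g. "validating", "in_progress", "completed"

# 3. Retrieve the results once the batch has completed.
if batch.status == "completed":
    results = client.files.content(batch.output_file_id)
    results.write_to_file("judge_results.jsonl")
```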
To step through the evaluation pipeline on LLaMA 3 8B Instruct zero-shot predictions, refer to the example commands and printed outputs in scripts/evaluation/llama3_8b_zero-shot.sh.
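For intuition about the final step, here is a toy sketch of the three metrics. The discrepancy definitions below are only my reading (shallower question solved but deeper question missed, and vice versa); defer to the paper and src/evaluation/metric_calculator.py for the exact formulation.

```python
# Toy metric sketch; definitions are assumptions, not the repo's exact formulas.
from typing import Dict, List, Tuple

def compute_metrics(correct: Dict[str, bool], edges: List[Tuple[str, str]]) -> Dict[str, float]:
    # `correct` maps a question id to whether the judge marked its prediction correct;
    # each edge links a shallower (predecessor) question to a deeper target question.
    avg_accuracy = sum(correct.values()) / len(correct)
    # Forward discrepancy (assumed): shallower question solved, deeper one missed.
    forward = sum(correct[s] and not correct[d] for s, d in edges) / len(edges)
    # Backward discrepancy (assumed): deeper question solved, shallower one missed.
    backward = sum(correct[d] and not correct[s] for s, d in edges) / len(edges)
    return {
        "average_accuracy": avg_accuracy,
        "forward_discrepancy": forward,
        "backward_discrepancy": backward,
    }

# Example: one edge where the shallow question is right but the deep one is wrong.
print(compute_metrics({"q_shallow": True, "q_deep": False}, [("q_shallow", "q_deep")]))
```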
To run the entire pipeline on LLaMA 3 8B Instruct prompt-gold predictions automatically:
bash scripts/evaluation/llama3_8b_prompt-gold_auto.sh
@misc{ko2024hierarchicaldeconstructionllmreasoning,
title={Hierarchical Deconstruction of LLM Reasoning: A Graph-Based Framework for Analyzing Knowledge Utilization},
author={Miyoung Ko and Sue Hyun Park and Joonsuk Park and Minjoon Seo},
year={2024},
eprint={2406.19502},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2406.19502},
}