This repository contains the code for the LLM NL2SQL Benchmark.
Use the following commands to install the required packages.
pip install nltk
pip install torch
pip install chardet
pip install protobuf
pip install sqlparse
pip install accelerate
pip install func_timeout
pip install transformers
pip install sentencepiece
pip install python-dotenv
pip install langchain_openai
python -c "import nltk; nltk.download('punkt')"
Our evaluation framework supports almost all models that can be downloaded from HuggingFace and Spider dataset and Bird dataset. You can run the benchmark with the following command and specify the model and dataset.
python main.py -m [method_name] -d [dataset_name] -l [language_model_name] -s [schema_linking_strategy]
Spider | easy | medium | hard | extra | all |
---|---|---|---|---|---|
count | 248 | 446 | 174 | 166 | 1034 |
ACCURACY | |||||
Code Qwen2.5 7B | 0.935 | 0.877 | 0.759 | 0.584 | 0.824 |
Code Qwen2.5 32B | 0.827 | 0.841 | 0.718 | 0.548 | 0.770 |
Bird | simple | moderate | challenging | total |
---|---|---|---|---|
count | 925 | 465 | 144 | 1534 |
ACCURACY | ||||
Qwen2 7B | 45.51 | 25.38 | 19.44 | 36.96 |
Qwen2.5 Code 7B | 56.11 | 34.48 | 33.79 | 47.46 |
GLM 9B | 47.57 | 24.73 | 19.44 | 38.01 |
Qwen2.5 32B | 61.08 | 39.87 | 33.10 | 52.02 |
Qwen2.5 Code 32B | 64.76 | 47.84 | 41.38 | 57.43 |
Qwen2 72B | 60.43 | 40.73 | 35.86 | 52.15 |
Qwen2.5 72B | 60.00 | 39.44 | 35.86 | 51.50 |
The performance of LLMs on NL2SQL is highly sensitive to the schema linking strategy. We provide the performance of different schema linking strategies on the Bird dataset with Qwen2-72B-Instruct model. The performance is calculated by the accuracy of the SQL query.
Bird | simple | moderate | challenging | total |
---|---|---|---|---|
count | 925 | 465 | 144 | 1534 |
ACCURACY | ||||
Vanilla | 64.76 | 47.84 | 41.38 | 57.43 |
Gold Linker | 68.54 | 48.28 | 42.07 | 59.91 |
Qwen2 72B
Bird | simple | moderate | challenging | total |
---|---|---|---|---|
count | 925 | 465 | 144 | 1534 |
ACCURACY | ||||
Base | 59.89 | 40.95 | 38.62 | 52.15 |
Base (gold) | 65.84 | 46.12 | 40.69 | 57.50 |
Vote | 60.76 | 46.34 | 40.00 | 54.43 |
Rule | 36.43 | 18.53 | 15.17 | 29.01 |
CoT | 39.89 | 19.18 | 24.83 | 32.20 |
Column Desc. | 28.76 | 13.58 | 19.31 | 23.27 |
Column Desc. (gold) | 65.41 | 43.32 | 36.55 | 56.00 |
Sketch | 9.19 | 4.53 | 8.28 | 7.69 |
Sketch (gold) | 60.76 | 37.50 | 30.34 | 50.85 |
Qwen2.5 Coder 7B
Spider | easy | medium | hard | extra | all |
---|---|---|---|---|---|
count | 248 | 446 | 174 | 166 | 1034 |
ACCURACY | |||||
Base | 93.5 | 87.7 | 75.9 | 58.4 | 82.4 |
Vote | 94.4 | 89.7 | 82.8 | 63.9 | 85.5 |
Rule | 92.7 | 83.9 | 70.1 | 57.2 | 79.4 |
CoT | 90.7 | 84.3 | 69.5 | 49.4 | 77.8 |
Sketch | 89.9 | 80.5 | 67.8 | 40.4 | 74.2 |
Qwen2.5 Coder 7B
Bird | simple | moderate | challenging | total |
---|---|---|---|---|
count | 925 | 465 | 144 | 1534 |
ACCURACY | ||||
Base | 56.11 | 34.48 | 33.79 | 47.46 |
Vote | 61.30 | 44.18 | 38.62 | 53.98 |
Base (gold) | 66.16 | 42.67 | 38.62 | 56.45 |
Rule | 54.27 | 37.72 | 31.03 | 47.07 |
CoT | 55.89 | 36.85 | 29.66 | 47.65 |
Column Desc. | 49.84 | 25.65 | 24.83 | 40.16 |
Column Desc. (gold) | 62.27 | 36.85 | 30.34 | 51.56 |
Sketch | 53.51 | 34.05 | 30.34 | 45.44 |
Sketch (gold) | 62.38 | 35.99 | 35.86 | 51.89 |