NL2SQL Benchmark

Overview

This repository contains the code for the LLM NL2SQL Benchmark.

Installation

Use the following commands to install the required packages.

pip install nltk
pip install torch
pip install chardet
pip install protobuf
pip install sqlparse
pip install accelerate
pip install func_timeout
pip install transformers
pip install sentencepiece
pip install python-dotenv
pip install langchain_openai

python -c "import nltk; nltk.download('punkt')"

Running

Our evaluation framework supports almost all models that can be downloaded from HuggingFace and Spider dataset and Bird dataset. You can run the benchmark with the following command and specify the model and dataset.

python main.py -m [method_name] -d [dataset_name] -l [language_model_name] -s [schema_linking_strategy]

Evaluation Result

Spider Dataset

Spider	easy	medium	hard	extra	all
count	248	446	174	166	1034
ACCURACY
Code Qwen2.5 7B	0.935	0.877	0.759	0.584	0.824
Code Qwen2.5 32B	0.827	0.841	0.718	0.548	0.770

Bird Dataset

Bird	simple	moderate	challenging	total
count	925	465	144	1534
ACCURACY
Qwen2 7B	45.51	25.38	19.44	36.96
Qwen2.5 Code 7B	56.11	34.48	33.79	47.46
GLM 9B	47.57	24.73	19.44	38.01
Qwen2.5 32B	61.08	39.87	33.10	52.02
Qwen2.5 Code 32B	64.76	47.84	41.38	57.43
Qwen2 72B	60.43	40.73	35.86	52.15
Qwen2.5 72B	60.00	39.44	35.86	51.50

The Impact of Schema Linking on NL2SQL

The performance of LLMs on NL2SQL is highly sensitive to the schema linking strategy. We provide the performance of different schema linking strategies on the Bird dataset with Qwen2-72B-Instruct model. The performance is calculated by the accuracy of the SQL query.

Bird	simple	moderate	challenging	total
count	925	465	144	1534
ACCURACY
Vanilla	64.76	47.84	41.38	57.43
Gold Linker	68.54	48.28	42.07	59.91

The Impact of Different Methods on NL2SQL

Qwen2 72B

Bird	simple	moderate	challenging	total
count	925	465	144	1534
ACCURACY
Base	59.89	40.95	38.62	52.15
Base (gold)	65.84	46.12	40.69	57.50
Vote	60.76	46.34	40.00	54.43
Rule	36.43	18.53	15.17	29.01
CoT	39.89	19.18	24.83	32.20
Column Desc.	28.76	13.58	19.31	23.27
Column Desc. (gold)	65.41	43.32	36.55	56.00
Sketch	9.19	4.53	8.28	7.69
Sketch (gold)	60.76	37.50	30.34	50.85

Qwen2.5 Coder 7B

Spider	easy	medium	hard	extra	all
count	248	446	174	166	1034
ACCURACY
Base	93.5	87.7	75.9	58.4	82.4
Vote	94.4	89.7	82.8	63.9	85.5
Rule	92.7	83.9	70.1	57.2	79.4
CoT	90.7	84.3	69.5	49.4	77.8
Sketch	89.9	80.5	67.8	40.4	74.2

Qwen2.5 Coder 7B

Bird	simple	moderate	challenging	total
count	925	465	144	1534
ACCURACY
Base	56.11	34.48	33.79	47.46
Vote	61.30	44.18	38.62	53.98
Base (gold)	66.16	42.67	38.62	56.45
Rule	54.27	37.72	31.03	47.07
CoT	55.89	36.85	29.66	47.65
Column Desc.	49.84	25.65	24.83	40.16
Column Desc. (gold)	62.27	36.85	30.34	51.56
Sketch	53.51	34.05	30.34	45.44
Sketch (gold)	62.38	35.99	35.86	51.89

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

NL2SQL Benchmark

Overview

Installation

Running

Evaluation Result

Spider Dataset

Bird Dataset

The Impact of Schema Linking on NL2SQL

The Impact of Different Methods on NL2SQL

Files

README.md

Latest commit

History

README.md

File metadata and controls

NL2SQL Benchmark

Overview

Installation

Running

Evaluation Result

Spider Dataset

Bird Dataset

The Impact of Schema Linking on NL2SQL

The Impact of Different Methods on NL2SQL