updated description
tdoehmen committed Apr 11, 2024
1 parent 56c6eaf commit 430c826
Showing 1 changed file with 64 additions and 2 deletions.
66 changes: 64 additions & 2 deletions docs/index.markdown
@@ -7,9 +7,71 @@ permalink: /
<img src="assets/schemapile.png" height="auto" width="400">


Access to fine-grained schema information is crucial for understanding how relational databases are designed and used in practice, and for building systems that help users interact with them. Furthermore, such information is required as training data to leverage the potential of large language models (LLMs) for improving data preparation, data integration and natural language querying. Existing single-table corpora such as GitTables provide insights into how tables are structured in-the-wild, but lack detailed schema information about how tables relate to each other, as well as metadata like data types or integrity constraints. On the other hand, existing multi-table (or database schema) datasets are rather small and attribute-poor, leaving it unclear to what extent they actually represent typical real-world database schemas.
## Description

In order to address these challenges, we present SchemaPile, a corpus of 221,171 database schemas, extracted from SQL files on GitHub. It contains 1.7 million tables with 10 million column definitions, 700 thousand foreign key relationships, seven million integrity constraints, and data content for more than 340 thousand tables. We conduct an in-depth analysis on the millions of schema metadata properties in our corpus, as well as its highly diverse language and topic distribution. In addition, we showcase the potential of SchemaPile to improve a variety of data management applications, e.g., fine-tuning LLMs for schema-only foreign key detection, improving CSV header detection and evaluating multi-dialect SQL parsers. We publish the code and data for recreating SchemaPile and a permissively licensed subset SchemaPile-Perm.
SchemaPile is a collection of database schemas, extracted from DDL/DML statements in SQL files on public code repositories.

## Summary statistics

| Dataset | SchemaPile | SchemaPile-Perm |
|---------------------------|-----------:|----------------:|
| #Schemas | 221,171 | 22,989 |
| #Tables | 1.7M | 199K |
| #Schemas with data | 75.6K | 7.1K |
| #Tables with data | 347.0K | 34.9K |
| #Columns with data | 2.2M | 219.0K |
| #Total data values | 58.2M | 5.9M |
| Median #tables per schema | 4 | 4 |
| Mean #columns per schema | 6.5 | 6.7 |
| Mean #values per column | 27.6 | 28.7 |

## Usage Example

```shell
# Download from Zenodo
curl -O https://zenodo.org/records/10931803/files/schemapile-perm.json.gz
```

```python
import gzip
import json
from collections import Counter

# Read file (gzip-compressed JSON; json.loads accepts the raw bytes)
with gzip.open("schemapile-perm.json.gz", 'r') as f:
    schemapile = json.loads(f.read())

# Look at example schema
print(schemapile['015036_schema.sql'])

# {'INFO': {'URL': 'https://github.com/nages103/k8s-petclinic/blob/bb75e895591...
# 'LICENSE': 'APACHE-2.0',
# 'PERMISSIVE': True},
# 'TABLES': {'vets': {'COLUMNS': {'id': {'TYPE': 'UnsignedInt',
# 'NULLABLE': False,
# 'UNIQUE': True,
# 'DEFAULT': None,
# 'CHECKS': [],
# ...

# Get the 5 most common column names
column_names = []
for schema in schemapile:
    for table in schemapile[schema]["TABLES"]:
        for column_name in schemapile[schema]["TABLES"][table]["COLUMNS"]:
            column_names.append(column_name.lower())
column_names_schemapile = Counter(column_names)

print(column_names_schemapile.most_common(5))

# [('id', 84487),
# ('name', 28426),
# ('description', 14324),
# ('created_at', 12624),
# ('user_id', 10718)]
```
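
The same nested layout can be used for other quick tallies. The following is a minimal sketch, not part of the SchemaPile tooling: it counts column types and licenses using only the keys visible in the example output above (`INFO`, `LICENSE`, `TABLES`, `COLUMNS`, `TYPE`). The `sample` dictionary, its `example_schema.sql` key, the `first_name` column, the `Varchar` type, and both helper function names are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hand-made stand-in that mimics the structure printed above.
# Its contents are illustrative assumptions, not real corpus entries.
sample = {
    "example_schema.sql": {
        "INFO": {"LICENSE": "APACHE-2.0", "PERMISSIVE": True},
        "TABLES": {
            "vets": {
                "COLUMNS": {
                    "id": {"TYPE": "UnsignedInt", "NULLABLE": False, "UNIQUE": True},
                    "first_name": {"TYPE": "Varchar", "NULLABLE": True, "UNIQUE": False},
                }
            }
        },
    }
}


def column_type_counts(corpus):
    """Count how often each column TYPE occurs across all schemas and tables."""
    counts = Counter()
    for schema in corpus.values():
        for table in schema.get("TABLES", {}).values():
            for column in table.get("COLUMNS", {}).values():
                # .get tolerates a column without a TYPE entry (defensive assumption)
                counts[column.get("TYPE")] += 1
    return counts


def license_counts(corpus):
    """Count schemas per LICENSE value found under INFO."""
    return Counter(schema.get("INFO", {}).get("LICENSE") for schema in corpus.values())


type_counts = column_type_counts(sample)
licenses = license_counts(sample)
print(type_counts.most_common(5))
print(licenses.most_common(5))
```

Passing the loaded `schemapile` dictionary instead of `sample` should produce the corresponding tallies for the full file, assuming every entry follows the layout shown in the example schema.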

## Links

💾 Dataset: [SchemaPile (Zenodo)](https://zenodo.org/records/10931803).
