This repository contains the source code, resources, and instructions to reproduce the experiments performed for the following research paper:
Rabbani, Kashif; Lissandrini, Matteo; and Hose, Katja. Extraction of Validating Shapes from Very Large Knowledge Graphs. In Proceedings of the VLDB Endowment, Volume 16 (VLDB 2023), August 28 - September 2, 2023, Vancouver, Canada.
Experimental results and other details are also available on our website.
Please follow these steps to get the code and data to reproduce the results:
Clone the GitHub repository using the following commands and check out the vldb release tag.
git clone https://github.com/dkw-aau/qse.git
git checkout tags/vldb -b vldb
We have used WikiData, DBpedia, YAGO-4, and LUBM datasets. Details on how to download these datasets are given below:
- DBpedia: We used our dbpedia script to download the DBpedia files listed here.
- YAGO-4: We downloaded YAGO-4 English version from https://yago-knowledge.org/data/yago4/en/.
- LUBM: We used LUBM-Generator to generate LUBM-500.
- WikiData (Wdt15): We downloaded a WikiData dump from 2015 from this link.
- WikiData (Wdt21): We downloaded the truthy dump of WikiData (2021) and then used our wikidata python script to remove labels, descriptions, and non-English strings.
We provide a copy of some of these datasets in a single archive.
You can check the size and number of lines (triples) of a dataset with commands such as:
cd data; du -sh yago.n3; wc -l yago.n3
We used Docker and shell scripts to build and run the code on the different datasets. Users can specify the configuration parameters in the config files depending on the dataset and their requirements.
The experiments run on a single machine. To reproduce them, you need a GNU/Linux distribution (with git, bash, make, and wget), Docker, and Java version 15.0.2.fx-zulu, on a machine with 256 GB of RAM (minimum required 16 GB) and a 16-core CPU (minimum required 1 core).
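Before building, you can quickly verify that the main prerequisites are available on your machine (the version strings reported will depend on your installation):
git --version
docker --version
java -version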
We have prepared shell scripts and configuration files for each dataset to make running the experiments as easy as possible.
Please update the configuration file for each dataset available in the config directory, i.e., dbpediaConfig, yagoConfig, lubmConfig, wdt15Config, and wdt21Config, to set the correct paths for your machine.
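For illustration, the path-related entries in one of these .properties files might look like the lines below; dataset_path is a hypothetical key name used only as an example (the exact keys are defined in the provided config files), and output_file_path is the output directory key referenced at the end of this README. Adapt all paths to your machine:
# hypothetical example, assuming the dataset archive was extracted into /data
dataset_path=/data/yago.n3
output_file_path=/data/output/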
You have to choose one of the following options to extract shapes using either QSE-Exact (file- or query-based) or QSE-Approximate.
Parameter | Description | Options |
---|---|---|
qse_exact_file | set to true to extract shapes from a file using QSE-Exact | true or false |
qse_exact_query_based | set to true to extract shapes from an endpoint using QSE-Exact | true or false |
qse_approximate_file | set to true to extract shapes from a file using QSE-Approximate | true or false |
qse_approximate_query_based | set to true to extract shapes from an endpoint using QSE-Approximate | true or false |
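For example, to run QSE-Exact directly on a dataset file, you would enable only the corresponding flag in the dataset's config file and leave the other three set to false (shown here in .properties-style key=value syntax, which is an assumption about the config format):
# enable exactly one extraction mode
qse_exact_file=true
qse_exact_query_based=false
qse_approximate_file=false
qse_approximate_query_based=false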
Depending on the approach you have chosen above, you have to set the parameters listed in this table to run QSE.
You can define different values for the pruning thresholds (Support and Confidence) in the config file of each dataset.
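As an illustration, such thresholds could be expressed in the config file along the following lines; the key names support and confidence and the values shown are placeholders, so check the dataset's config file for the actual names:
# placeholder keys and values; the actual key names are defined in the provided config files
support=100
confidence=0.25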
You can specify the classes in the classes.txt file available in the config/pruning/ directory. QSE will then only extract shapes for the classes specified in that file.
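For instance, to restrict extraction to a couple of DBpedia classes, classes.txt could contain one class IRI per line; both the IRIs and the one-IRI-per-line layout here are assumptions for illustration, so check the sample file shipped in config/pruning/ for the expected format:
http://dbpedia.org/ontology/Person
http://dbpedia.org/ontology/Film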
Assuming that you are in the project's directory, you have updated the configuration file(s), and Docker is installed on your machine, move into the scripts directory using the command cd scripts and then execute one of the following shell scripts:
./dbpedia.sh
./yago.sh
./lubm.sh
./wdt15.sh
./wdt21.sh
You will see logs on the console, and the output will be stored in the output directory specified in the config file.
Note: You may have to execute chmod +rwx on each script to resolve permission issues. In case you want to run the experiments without the scripts, please follow the instructions on this page.
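For the permissions note above, a one-shot example from within the scripts directory:
chmod +rwx dbpedia.sh yago.sh lubm.sh wdt15.sh wdt21.sh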
- Install Java: Please follow these steps to install sdkman (a sketch of the SDKMAN installation commands is also given after this list) and then execute the following commands to install the specified version of Java.
  sdk list java
  sdk install java 17.0.2-open
  sdk use java 17.0.2-open
- Install Gradle
  sdk install gradle 7.4-rc-1
- Build the project
  gradle clean
  gradle build
  gradle shadowJar
- Install GraphDB by following the instructions listed here.
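The SDKMAN installation referenced in the Java step above is typically a one-liner; the commands below are a sketch based on SDKMAN's standard installer and may differ slightly from the steps linked above:
# download and initialize SDKMAN in the current shell
curl -s "https://get.sdkman.io" | bash
source "$HOME/.sdkman/bin/sdkman-init.sh"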
As stated above, you have to set the configuration parameters of each config file individually based on your machine and requirements. Then run QSE on each dataset with the following commands:
java -jar -Xmx10g build/libs/qse.jar config/dbpediaConfig.properties &> dbpedia.logs
java -jar -Xmx10g build/libs/qse.jar config/yagoConfig.properties &> yago.logs
java -jar -Xmx10g build/libs/qse.jar config/lubmConfig.properties &> lubm.logs
java -jar -Xmx16g build/libs/qse.jar config/wdt15Config.properties &> wdt15.logs
java -jar -Xmx32g build/libs/qse.jar config/wdt21Config.properties &> wdt21.logs
QSE will output SHACL shapes in the output_file_path directory, along with a classFrequency.csv file containing the number of instances (nodes) of each class in the dataset.
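Once a run completes, you can inspect the generated files, for example (assuming the configured output directory is ./output, which is only an illustrative path):
ls ./output
head ./output/classFrequency.csv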