Big Data Prak SS 17: Speed up Entity Resolution with Bit Arrays
This repository contains the work done during the big data internship in the summer semester 2017 on the topic "Speed up Entity Resolution with Bit Arrays".
For a quick introduction, you can use the scripts located in /scripts:
- To run a single entity resolution for two input data sets of size 5,000 and 20,000, execute the ./scripts/run.sh script. It finishes within about 10 minutes on a laptop and lets you compare the performance of the trivial, the sorted-set, and the two-stage-filter approach. All three yield the same result, but since the last one uses bit arrays to pre-filter the data set, it is much faster (see the sketch after this list).
- For a more detailed evaluation of additional input data sets and other approaches with different configurations, continue with the ./scripts/batch_run.sh script, which runs a large number of different entity resolution approaches in a single batch run. It writes detailed output to a JSON file that you can use to compare the approaches and their configurations against each other and to draw graphs.
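To illustrate the idea behind the two-stage filter, here is a minimal, self-contained sketch (plain Python, not taken from the repository code): a small per-record bit array serves as a cheap first stage that rules out most non-matching candidate pairs before the expensive pairwise comparison in the second stage. Names such as `BIT_ARRAY_SIZE`, `signature`, and the Jaccard threshold are illustrative assumptions, not the project's actual API.

```python
from itertools import product

BIT_ARRAY_SIZE = 64      # illustrative size of the per-record bit signature
MATCH_THRESHOLD = 0.8    # illustrative Jaccard similarity needed for a match

def tokens(record: str) -> set:
    """Split a record into lower-cased word tokens."""
    return set(record.lower().split())

def signature(record: str) -> int:
    """Stage 1: hash every token into a small bit array (stored as an int)."""
    sig = 0
    for tok in tokens(record):
        sig |= 1 << (hash(tok) % BIT_ARRAY_SIZE)
    return sig

def jaccard(a: str, b: str) -> float:
    """Stage 2: the expensive, exact similarity measure."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

def resolve(left: list, right: list) -> list:
    """Compare every pair, but skip pairs whose bit arrays share no set bit."""
    left_sigs = [(a, signature(a)) for a in left]
    right_sigs = [(b, signature(b)) for b in right]
    matches = []
    for (a, sa), (b, sb) in product(left_sigs, right_sigs):
        if sa & sb == 0:
            continue  # cheaply filtered out: no common token can exist
        if jaccard(a, b) >= MATCH_THRESHOLD:
            matches.append((a, b))
    return matches

if __name__ == "__main__":
    left = ["Alan Turing", "Grace Hopper", "Ada Lovelace"]
    right = ["turing alan", "Edsger Dijkstra", "Lovelace Ada"]
    print(resolve(left, right))
```

Because records that share a token always share the corresponding bits, the filter never discards a true match, so the filtered approach produces the same result as the trivial one while skipping most of the expensive comparisons.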
Since this is an internship project, the repository additionally contains: