The research area of data profiling includes a large set of methods and processes to examine a given dataset and determine metadata about it (1). Typically, the results comprise various statistics about the columns and the relationships among them, in particular dependencies. Among the basic statistics about a column are data type, the number of unique values, maximum and minimum values, the number of null values, and the value distribution.
This repository has two parts:
It collects the following statistics about each column of the input dataset (a *.csv file):
- Data type (REAL, SMALLINT, VARCHAR,...)
- Exact number and percentage of distinct values
- Number and percentage of Nulls
- Top 10 most frequent items and their frequencies
- Min, Max, Standard deviation, Average
- ...
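
To illustrate the kind of per-column statistics listed above, here is a minimal Python sketch. It is not the implementation used in this repository: the function names (`infer_type`, `profile_column`, `profile_csv`), the set of tokens treated as null, and the simplified type rules are all assumptions made for illustration.

```python
import csv
import io
import statistics
from collections import Counter

# Assumption: these string tokens count as null; the real algorithm may differ.
NULL_TOKENS = {"", "null", "NULL"}


def infer_type(values):
    """Guess a coarse SQL-like type from a list of non-null strings."""
    if not values:
        return "UNKNOWN"
    try:
        ints = [int(v) for v in values]
        # SMALLINT is a 16-bit signed integer in most SQL dialects.
        if all(-32768 <= i <= 32767 for i in ints):
            return "SMALLINT"
        return "INTEGER"
    except ValueError:
        pass
    try:
        for v in values:
            float(v)
        return "REAL"
    except ValueError:
        return "VARCHAR"


def profile_column(values):
    """Compute basic statistics for one column given as a list of strings."""
    total = len(values)
    non_null = [v for v in values if v not in NULL_TOKENS]
    nulls = total - len(non_null)
    counts = Counter(non_null)
    col_type = infer_type(non_null)
    stats = {
        "type": col_type,
        "distinct": len(counts),
        "distinct_pct": 100.0 * len(counts) / total if total else 0.0,
        "nulls": nulls,
        "nulls_pct": 100.0 * nulls / total if total else 0.0,
        "top10": counts.most_common(10),
    }
    if col_type in ("SMALLINT", "INTEGER", "REAL"):
        nums = [float(v) for v in non_null]
        stats["min"] = min(nums)
        stats["max"] = max(nums)
        stats["avg"] = statistics.fmean(nums)
        # Population standard deviation; a sample estimate would use stdev().
        stats["stddev"] = statistics.pstdev(nums)
    return stats


def profile_csv(text):
    """Profile every column of CSV text whose first row is a header."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name in columns:
            # Short rows yield None for missing fields; treat them as null.
            columns[name].append(row[name] or "")
    return {name: profile_column(vals) for name, vals in columns.items()}
```

Calling `profile_csv(open("data.csv").read())` on a hypothetical `data.csv` returns one dictionary of statistics per column. The sketch keeps every column in memory, which is fine for small files but would not scale to large datasets.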
Metanome is a framework that handles both algorithms and datasets as external resources. All the algorithms above have been developed to work within Metanome.
- Download the latest release of Metanome from the Metanome releases page, and the algorithms from the Algorithm releases page.
- Unzip deployment/target/deployment-1.1-SNAPSHOT-package_with_tomcat.zip.
- In the unzipped folder, place the algorithm jar file in /WEB-INF/classes/algorithms and the datasets in /WEB-INF/classes/inputData.
- Start the run script: run.sh, or run.bat on Windows systems.
- Open a browser at http://localhost:8080/ and register both the algorithm and the dataset in the Metanome frontend.
- Choose the algorithm and the data source, set the parameters, and run it.
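
The file-placement step above can also be scripted. The following Python sketch copies an algorithm jar and some datasets into the two folders named in the steps. The function name `stage_resources` and its arguments are hypothetical, and the sketch assumes the package has already been unzipped so that both target folders exist.

```python
import shutil
from pathlib import Path


def stage_resources(metanome_dir, algorithm_jar, datasets):
    """Copy an algorithm jar and dataset files into an unzipped Metanome folder.

    Hypothetical helper: metanome_dir is the unzipped deployment folder.
    """
    root = Path(metanome_dir)
    algo_dir = root / "WEB-INF" / "classes" / "algorithms"
    data_dir = root / "WEB-INF" / "classes" / "inputData"
    # Fail early rather than create folders in a wrong location.
    for folder in (algo_dir, data_dir):
        if not folder.is_dir():
            raise FileNotFoundError(f"expected folder is missing: {folder}")
    copied = [shutil.copy2(algorithm_jar, algo_dir)]
    for dataset in datasets:
        copied.append(shutil.copy2(dataset, data_dir))
    # Return the destination paths so the caller can log or verify them.
    return [Path(p) for p in copied]
```

After staging the files, start the run script and register the algorithm and dataset in the frontend as described in the steps.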
MetanomeTestRunner is a project for running the algorithms during development. Because it is a Maven project, all required Metanome libraries are downloaded automatically. If you want to build your own algorithm, have a look here.
Metanome and all the algorithms developed by the developer group have the following license.