Collect basic dataset statistics: data type, cardinality, size, nulls, ...

hazourahh/Single-Column-Profiler

Single Column Profiling Algorithms

The research area of data profiling includes a large set of methods and processes to examine a given dataset and determine metadata about it (1). Typically, the results comprise various statistics about the columns and the relationships among them, in particular dependencies. Among the basic statistics about a column are data type, the number of unique values, maximum and minimum values, the number of null values, and the value distribution.

This repository has two parts:

Single Column Data Profiler (SCDP)

It collects the following statistics about each column of the input dataset (*.csv file):

  • Data type (REAL, SMALLINT, VARCHAR,...)
  • Exact number and percentage of distinct values
  • Number and percentage of nulls
  • Top 10 most frequent items and their frequencies
  • Min, max, standard deviation, average
  • ...
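
The statistics listed above can be illustrated with a minimal Python sketch. It is not the SCDP implementation: the function names (`infer_type`, `profile_column`, `profile_csv`), the set of strings treated as null, the coarse type labels, and the use of the population standard deviation are all assumptions made for illustration.

```python
import csv
import io
import math
from collections import Counter

# Assumption: these strings are treated as null values.
NULL_TOKENS = {"", "null", "NULL"}


def infer_type(values):
    """Guess a coarse type label for the non-null string values of a column."""
    if not values:
        return "UNKNOWN"
    try:
        for v in values:
            int(v)
        return "INTEGER"
    except ValueError:
        pass
    try:
        for v in values:
            float(v)
        return "REAL"
    except ValueError:
        return "VARCHAR"


def profile_column(raw_values, top_k=10):
    """Compute basic single-column statistics from a list of strings."""
    total = len(raw_values)
    non_null = [v for v in raw_values if v not in NULL_TOKENS]
    nulls = total - len(non_null)
    counts = Counter(non_null)
    col_type = infer_type(non_null)
    stats = {
        "type": col_type,
        "distinct": len(counts),
        "distinct_pct": 100.0 * len(counts) / total if total else 0.0,
        "nulls": nulls,
        "null_pct": 100.0 * nulls / total if total else 0.0,
        "top": counts.most_common(top_k),
    }
    if col_type in ("INTEGER", "REAL"):
        nums = [float(v) for v in non_null]
        mean = sum(nums) / len(nums)
        # Population variance; a sample variance would divide by len(nums) - 1.
        variance = sum((x - mean) ** 2 for x in nums) / len(nums)
        stats.update(min=min(nums), max=max(nums), mean=mean,
                     std=math.sqrt(variance))
    return stats


def profile_csv(text):
    """Profile every column of CSV text whose first row is a header."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name in columns:
            # Short rows yield None for missing fields; treat them as null.
            columns[name].append(row[name] or "")
    return {name: profile_column(vals) for name, vals in columns.items()}
```

For example, `profile_csv("id,city,score\n1,Paris,3.5\n2,,4.5\n3,Paris,\n")` reports one null in `city` and a mean of 4.0 for `score`. Percentages here are computed over all rows, nulls included; the tool may define them differently.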

Metanome Tool and Profiling Algorithms

Metanome is a framework that handles both algorithms and datasets as external resources. All the algorithms above have been developed to work within Metanome.

Run the algorithms using Metanome GUI

  1. Download the latest release of Metanome from the Metanome releases page, and the algorithms from the Algorithm releases page.
  2. Unzip deployment/target/deployment-1.1-SNAPSHOT-package_with_tomcat.zip
  3. Go into the unzipped folder, place the algorithm jar-file into the folder /WEB-INF/classes/algorithms and the datasets in the folder /WEB-INF/classes/inputData
  4. Start the run script, either run.sh or run.bat (Windows systems)
  5. Open a browser at http://localhost:8080/ and register both the algorithm and the dataset in the Metanome frontend
  6. Choose the algorithm and the data source, set the parameters, and run.
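
The file-placement step (step 3) can be scripted. The sketch below is an illustration only: the helper name `stage_for_metanome` and its arguments are hypothetical, and it assumes the unzipped folder already contains the two target directories named in step 3.

```python
import shutil
from pathlib import Path


def stage_for_metanome(deploy_dir, algorithm_jar, dataset_csv):
    """Copy an algorithm jar and a dataset into the unzipped Metanome folder.

    deploy_dir is the unzipped deployment folder (hypothetical argument name).
    Returns the paths of the two copied files.
    """
    root = Path(deploy_dir)
    algo_dir = root / "WEB-INF" / "classes" / "algorithms"
    data_dir = root / "WEB-INF" / "classes" / "inputData"
    # Fail early if the folder layout does not match step 3.
    for folder in (algo_dir, data_dir):
        if not folder.is_dir():
            raise FileNotFoundError(f"expected folder is missing: {folder}")
    jar_target = Path(shutil.copy2(algorithm_jar, algo_dir))
    csv_target = Path(shutil.copy2(dataset_csv, data_dir))
    return jar_target, csv_target
```

After the files are staged, start the run script and register both in the frontend as described in steps 4 to 6.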

Development

MetanomeTestRunner is a project for running the algorithms during development. Because it is a Maven project, all the required Metanome libraries are downloaded automatically. If you want to build your own algorithm, give it a look here.

License

Metanome and all the algorithms developed by the developers group have the following license.
