A SQL port of python's scikit-learn preprocessing module, provided as cross-database dbt macros.

isfuku/dbt-ml-preprocessing

dbt-ml-preprocessing

A package for dbt that enables standardization of data sets. You can use it to build a feature store in your data warehouse without external libraries such as Spark's MLlib or Python's scikit-learn.

The package contains a set of macros that mirror the functionality of the scikit-learn preprocessing module. They were originally developed as part of the 2019 Medium article "Feature Engineering in Snowflake".

The macros have been tested on Snowflake, Redshift, BigQuery, and SQL Server. The test case expectations were built using scikit-learn (see *.py in integration_tests/data/sql), so you can expect behavioural parity with it.

The macros are:

| scikit-learn function | macro name | Snowflake | BigQuery | Redshift | MSSQL | Example |
| --- | --- | --- | --- | --- | --- | --- |
| KBinsDiscretizer | k_bins_discretizer | Y | Y | Y | N | example |
| LabelEncoder | label_encoder | Y | Y | Y | Y | example |
| MaxAbsScaler | max_abs_scaler | Y | Y | Y | Y | example |
| MinMaxScaler | min_max_scaler | Y | Y | Y | N | example |
| Normalizer | normalizer | Y | Y | Y | Y | example |
| OneHotEncoder | one_hot_encoder | Y | Y | Y | Y | example |
| QuantileTransformer | quantile_transformer | Y | Y | N | N | example |
| RobustScaler | robust_scaler | Y | Y | Y | N | example |
| StandardScaler | standard_scaler | Y | Y | Y | N | example |

* 2D charts taken from scikit-learn.org, GIFs are my own
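
To give a feel for what the scaler macros compute, here is a minimal pure-Python sketch of the arithmetic behind three of the scikit-learn transforms listed above (MaxAbsScaler, MinMaxScaler, StandardScaler) with their default settings. It is not the package's SQL and does not call any macro; the function names and the sample column are made up for illustration, and it assumes a single numeric column with no nulls and at least two distinct values.

```python
import math


def max_abs_scale(values):
    """MaxAbsScaler arithmetic: divide by the largest absolute value."""
    peak = max(abs(v) for v in values)
    return [v / peak for v in values]


def min_max_scale(values):
    """MinMaxScaler arithmetic (default range 0..1): (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values):
    """StandardScaler arithmetic: (x - mean) / population standard deviation."""
    mean = sum(values) / len(values)
    # scikit-learn divides by n here (population variance), not n - 1.
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]


# Hypothetical sample column, chosen only for illustration.
column = [1.0, 2.0, 3.0, 4.0, 5.0]

print(max_abs_scale(column))
print(min_max_scale(column))   # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standard_scale(column))
```

In a warehouse the same quantities (min, max, mean, standard deviation) would typically come from aggregate or window functions over the source table, which is why these transforms translate naturally to SQL.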

Installation

To use this in your dbt project, create or modify packages.yml to include:

packages:
  - package: "omnata-labs/dbt_ml_preprocessing"
    version: [">=1.0.1"]

(replace the version number with the latest release)

Then run dbt deps to import the package.

Usage

To read the macro documentation and see examples, generate your project docs (dbt docs generate, then dbt docs serve). The macro documentation appears in the Projects tree under dbt_ml_preprocessing.
