Primitive for normalizing (feature scaling) input data #82

itinawi · 2019-02-01T06:23:03Z

I want to create a primitive for data normalization (feature scaling), so that the input is rescaled to [-1, 1].

The formula for rescaling is
X_rescaled = 2 * (X - X.min) / (X.max - X.min) - 1

The primitive input arguments are:

data (pandas dataframe)
column (string): the column to be rescaled
trim_percentage (float): percentage from the bottom and top to trim
inplace (bool, default True): if False, creates a new column called rescaled_input instead of overwriting the original column
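
To make the proposed interface concrete, here is a minimal sketch of what such a primitive could look like. The function name `rescale` is hypothetical, and two points are assumptions rather than settled design: `trim_percentage` is read as a percent value (1.0 means 1%), and "trimming" is read as taking the bounds from the inner quantiles and clipping anything outside them. It targets the [-1, 1] range stated above, hence the scale by 2 and shift by -1.

```python
import pandas as pd


def rescale(data, column, trim_percentage=0.0, inplace=True):
    """Rescale data[column] to [-1, 1] (hypothetical sketch, not a final API)."""
    values = data[column].astype(float)

    # Assumption: trim_percentage is given in percent, and the bounds are the
    # quantiles left after trimming that share from the bottom and the top.
    q = trim_percentage / 100.0
    low = values.quantile(q)
    high = values.quantile(1.0 - q)
    if high == low:
        raise ValueError("column has no spread between bounds; cannot rescale")

    # Map [low, high] onto [-1, 1]; values outside the bounds are clipped.
    scaled = (2.0 * (values - low) / (high - low) - 1.0).clip(-1.0, 1.0)

    # inplace=True overwrites the column, otherwise a new column is added.
    target = column if inplace else "rescaled_input"
    data[target] = scaled
    return data
```

With `inplace=False` the original column is left untouched and the result lands in `rescaled_input`.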

Here are some potential issues with the implementation:

The implementation will run on historic data. If this primitive were used in an online system, we would have to either rescale the dataset dynamically or automatically flag values outside the fitted [min, max] range as anomalies. Alternatively, we could simply clip the input and map out-of-range values to -1 or 1. The last option makes the most sense, in my opinion.
Distribution of the data. The data may contain an outlier (a very large value, e.g. 1234) while the remaining values lie somewhere between -10 and 10. This would result in poor rescaling, because the outlier stretches the range and the remaining values get squeezed into a narrow band. One solution is to trim the lowest and highest 1% of values, but what if those were outliers?
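
The clipping option for the online case could be sketched roughly as follows: the bounds are fitted once on historic data and then reused for new values, with anything outside the bounds mapped to -1 or 1. The helper names `fit_bounds` and `transform` are hypothetical, and using percentiles for the bounds is only one possible way to do the trimming.

```python
import numpy as np


def fit_bounds(historic, trim_percentage=1.0):
    """Return (low, high) bounds from historic data (hypothetical helper).

    Assumption: trim_percentage is in percent, trimmed from both ends.
    """
    low, high = np.percentile(historic, [trim_percentage, 100.0 - trim_percentage])
    return float(low), float(high)


def transform(values, low, high):
    """Rescale values to [-1, 1] with fixed bounds, clipping anything outside."""
    scaled = 2.0 * (np.asarray(values, dtype=float) - low) / (high - low) - 1.0
    return np.clip(scaled, -1.0, 1.0)


# Bounds come from historic data only; new values reuse them unchanged.
low, high = fit_bounds([0.0, 10.0], trim_percentage=0.0)
online = transform([-5.0, 5.0, 1234.0], low, high)
```

Here the value 1234 does not stretch the range: it is simply assigned 1, while -5 is assigned -1.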


csala · 2019-02-01T19:22:36Z

This functionality seems to be mostly covered by scikit-learn's MinMaxScaler, so I think it would be better to integrate that instead.
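
For reference, a minimal sketch of how scikit-learn's `MinMaxScaler` covers the rescaling part. The sample array is made up for illustration, and `feature_range=(-1, 1)` is chosen to match the target range in the description.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative data: one feature column with three samples.
X = np.array([[0.0], [5.0], [10.0]])

# feature_range sets the output interval; the default is (0, 1).
scaler = MinMaxScaler(feature_range=(-1, 1))
X_scaled = scaler.fit_transform(X)
```

The fitted scaler keeps the observed minimum and maximum, so `scaler.transform` can later be applied to new data with the same bounds.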

The only thing that would really be new would be the trimming of the values, but this could go on a dedicated primitive.

@itinawi if you agree with that, would you mind editing the title and description so that this issue is dedicated only to the trimming, and opening a separate issue for the MinMaxScaler integration?

csala · 2019-02-07T17:36:14Z

The MinMaxScaler will be covered in #94

This can be kept for the trimming.

itinawi added the "new primitives" label (a new primitive is being requested) Feb 1, 2019

itinawi assigned kveerama and csala Feb 1, 2019

csala unassigned kveerama and csala Feb 1, 2019

csala added the "approved" (the issue is approved and someone can start working on it) and "Pending Review" (the bug is not confirmed or the feature request is being considered) labels and removed the "approved" label Feb 25, 2019
