Primitive for normalizing (feature scaling) input data #82
I want to create a primitive for data normalization (feature scaling), so that the input is rescaled to [-1, 1].
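The formula for rescaling is
X_rescaled = 2 * (X - X.min) / (X.max - X.min) - 1
(the plain min-max form, (X - X.min) / (X.max - X.min), would map to [0, 1] rather than [-1, 1]).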
The primitive input arguments are:
- data (pandas dataframe)
- column (string): the column to be rescaled
- trim_percentage (float): percentage from the bottom and top to trim
- inplace=True: if false, creates a new column called rescaled_input
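A minimal sketch of what such a primitive could look like, using the argument names listed above. The function name `rescale` is hypothetical, and two points are assumptions rather than part of the request: "trim" is read as taking the bounds at the lower and upper percentiles and clipping values beyond them (not dropping rows), and the result is always written to the dataframe that was passed in.

```python
import pandas as pd


def rescale(data, column, trim_percentage=0.0, inplace=True):
    """Rescale data[column] to [-1, 1] with min-max scaling.

    Assumption: "trim" means the bounds are taken at the given lower and
    upper percentiles and values beyond them are clipped, not dropped.
    """
    if not 0 <= trim_percentage < 50:
        raise ValueError("trim_percentage must be in [0, 50)")

    values = data[column].astype(float)
    q = trim_percentage / 100.0
    lower = values.quantile(q)        # equals the minimum when q == 0
    upper = values.quantile(1.0 - q)  # equals the maximum when q == 0
    if lower == upper:
        raise ValueError("column has no spread after trimming; cannot rescale")

    # Clip to the (possibly trimmed) bounds, then map [lower, upper] to [-1, 1].
    clipped = values.clip(lower=lower, upper=upper)
    rescaled = 2.0 * (clipped - lower) / (upper - lower) - 1.0

    # inplace=True overwrites the column; otherwise add `rescaled_input`.
    target = column if inplace else "rescaled_input"
    data[target] = rescaled
    return data


# Usage: keep the original column and add the rescaled one next to it.
df = pd.DataFrame({"value": [0.0, 5.0, 10.0]})
result = rescale(df, "value", inplace=False)
print(result)
```

With `trim_percentage=0` the bounds are simply the column minimum and maximum, so the sketch reduces to the formula given earlier.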
Here are some potential issues with the implementation:
- The implementation will run on historic data. If this primitive were to be used in an online system, we would have to either implement dynamic rescaling of the dataset or automatically flag values smaller than min or larger than max as anomalies. Alternatively, we can just clip the input and map out-of-range values to -1 or 1. The last suggestion makes the most sense, in my opinion.
- Distribution of the data. The data may have an outlier (a very large value, e.g. 1234) while the remaining values lie somewhere between -10 and 10. This would result in bad rescaling. One solution is to trim the lowest and highest 1% of values, but what if those were outliers?
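The clipping option for online use can be sketched as follows. This assumes the min and max are learned once from historic data and then reused for every new value; the class name `MinMaxClipper` and its methods are hypothetical, not an existing API.

```python
class MinMaxClipper:
    """Hypothetical helper: learn min/max from historic data, then map
    new values to [-1, 1], clipping anything outside the learned range."""

    def __init__(self, historic_values):
        values = list(historic_values)
        if not values:
            raise ValueError("historic_values must not be empty")
        self.min = min(values)
        self.max = max(values)
        if self.min == self.max:
            raise ValueError("historic data has no spread; cannot rescale")

    def is_out_of_range(self, x):
        # Alternative to clipping: report the value as a potential anomaly.
        return x < self.min or x > self.max

    def transform(self, x):
        scaled = 2.0 * (x - self.min) / (self.max - self.min) - 1.0
        # Clip so that unseen extremes become exactly -1 or 1.
        return max(-1.0, min(1.0, scaled))


scaler = MinMaxClipper([-10, 0, 10])
print(scaler.transform(0))            # inside the historic range
print(scaler.transform(1234))         # above the historic max, clipped to 1.0
print(scaler.is_out_of_range(1234))   # flagged as outside the range
```

Fitting the same helper on `[-10, 0, 10, 1234]` illustrates the outlier problem: 10 then maps to about -0.97, so all the ordinary values are squeezed into a narrow band near -1.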