
Get attribute of PCA object and custom predict function #77

Open

bbzzzz opened this issue Jan 12, 2018 · 4 comments

bbzzzz commented Jan 12, 2018

Hi Villu,

I am building an anomaly detection classifier based on PCA.

I need to

  1. Extract singular values (lambdas):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

standard_scaler = StandardScaler()
centered_training_data = standard_scaler.fit_transform(train)
pca = PCA()
pca.fit(centered_training_data)
lambdas = pca.singular_values_
  2. Calculate distance vectors based on the PCA-transformed data and eigenvalues:

from sklearn.base import TransformerMixin

class CalcDist(TransformerMixin):

    def __init__(self, lambdas):
        self.lambdas = lambdas

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Scale each squared PCA component by its singular value
        dist = X * X / self.lambdas
        return dist
  3. Get the first q and last r elements of the distance vectors, transform, and output (an end-to-end run is sketched after this list):

import numpy as np

class PCC(TransformerMixin):

    def __init__(self, q, r):
        self.q = q
        self.r = r

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        major_comp = np.sum(X[:, :self.q], axis=1)
        minor_comp = np.sum(X[:, self.r:], axis=1)

        return np.dstack((major_comp, minor_comp))
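
The three steps chain together like this (a minimal sketch; the q = 3 and r = 10 values are placeholders):

X_pca = pca.transform(centered_training_data)    # step 1: project into PCA space
dists = CalcDist(lambdas).fit_transform(X_pca)   # step 2: per-component distances
scores = PCC(3, 10).fit_transform(dists)         # step 3: major/minor components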

I would like to use PCC as my classifier. My PMMLPipeline would be:

mapper = DataFrameMapper([(list(train), [ContinuousDomain(), StandardScaler(), PCA(), CalcDist(lambdas)])])
pipeline = PMMLPipeline([("mapper", mapper), ("classifier", PCC(q,r))])

Alternatively, PCC could be moved into the mapper and connected with a DecisionTreeClassifier():

mapper = DataFrameMapper([(list(train), [ContinuousDomain(), StandardScaler(), PCA(), CalcDist(lambdas), PCC(q,r)])])
pipeline = PMMLPipeline([("mapper", mapper), ("classifier", DecisionTreeClassifier())])

Would any of this be possible?

Thanks,
Bohan

vruusmann (Member) commented:

> Would any of this be possible?

Everything seems possible/doable.

If the pipeline is simplified a bit, then it should be possible to implement custom PMML converters for CalcDist and PCC classes (as exemplified by the SkLearn2PMML-Plugin project).

> Calculate distance vectors based on the PCA-transformed data and eigenvalues

It's technically difficult to transfer the PCA.singular_values_ attribute value from one pipeline step to another. Therefore, you should make the CalcDist class a subclass of PCA:

class CalcDist(PCA):

  def transform(self, X):
    # Project into PCA space first, then scale by the singular values
    X = super(CalcDist, self).transform(X)
    dist = X * X / self.singular_values_
    return dist
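
The mapper from your first pipeline would then shrink to something like this (a sketch; CalcDist now covers both the PCA projection and the distance calculation):

mapper = DataFrameMapper([(list(train), [ContinuousDomain(), StandardScaler(), CalcDist()])])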

Or, if you don't want to subclass PCA directly, then keep a PCA object as a pca_ attribute, and interact with it in the CalcDist.fit(X) and CalcDist.transform(X) methods.
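
A minimal sketch of that composition approach (the n_components constructor parameter is illustrative):

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA

class CalcDist(BaseEstimator, TransformerMixin):

  def __init__(self, n_components=None):
    self.n_components = n_components

  def fit(self, X, y=None):
    # Fit the wrapped PCA; its singular values stay available on self.pca_
    self.pca_ = PCA(n_components=self.n_components)
    self.pca_.fit(X)
    return self

  def transform(self, X):
    # Project into PCA space, then scale each squared component
    X = self.pca_.transform(X)
    return X * X / self.pca_.singular_values_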

This class could then be renamed to something like SVDist as well?

> I would like to use PCC as my classifier.

Please excuse my ignorance, but how should the output of PCC.transform(X) be interpreted (in terms of an anomaly score)? Is a sample more anomalous if the difference between the first component (the "q" component) and the second component (the "r" component) is greater?

I'm asking because I'd like to better understand how to encode the PCC class using one of the top-level PMML model elements.

bbzzzz (Author) commented Jan 16, 2018

Hi Villu,

Thank you for your reply.

Here I am trying to implement the algorithm described in this paper.

Technically speaking, dist = X * X / self.singular_values_ is not a distance. It is a p-dimensional vector, where p is the number of singular values.

major_component is the sum of the first q elements of dist
minor_component is the sum of the last r elements of dist

major_component and minor_component will both be used as scores. If either of the two is greater than its threshold, we predict the sample as anomalous.
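
In code, the decision rule would look roughly like this (c_q and c_r are placeholder threshold names):

import numpy as np

def is_anomalous(dist, q, r, c_q, c_r):
    # dist holds the per-sample distance vectors produced by CalcDist;
    # c_q and c_r are the thresholds for the major and minor scores
    major_component = np.sum(dist[:, :q], axis=1)
    minor_component = np.sum(dist[:, r:], axis=1)
    return (major_component > c_q) | (minor_component > c_r)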

If connected with a DecisionTreeClassifier(), major_component and minor_component will be used as features.

Does this make sense?

vruusmann (Member) commented:

> I am trying to implement the algorithm described in this paper.

Thanks for the reference - now I can relate to your idea more closely.

In principle, "PCC" stands for "Principal Component Classifier". The first outlier category ("q") represents instances that are outliers with respect to one or more of the original variables. The second outlier category ("r") represents instances that are inconsistent with the correlation structure of the data, but are not outliers with respect to the original variables.

The PCC would be a regression-type model, because it outputs two numeric scores. Do you know the "q" and "r" threshold values at the time of training and exporting the model? If so, then we could turn PCC into a classification-type model, which would output two booleans instead (e.g. "is_outlier(q)" and "is_outlier(r)").

Anyway, from the API perspective, all this logic could be captured into one Scikit-Learn class:

import numpy as np

from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.decomposition import PCA

class PCC(BaseEstimator, RegressorMixin):

  def __init__(self, n, q, r):
    self.n = n
    self.q = q
    self.r = r

  def fit(self, X, y=None):
    # Fit the embedded PCA; n is the number of retained components
    self.pca_ = PCA(n_components = self.n)
    self.pca_.fit(X)
    return self

  def predict(self, X):
    # Project into PCA space, then compute the major/minor component scores
    Xt = self.pca_.transform(X)
    dist = Xt * Xt / self.pca_.singular_values_
    major_comp = np.sum(dist[:, :self.q], axis=1)
    minor_comp = np.sum(dist[:, self.r:], axis=1)
    return np.column_stack((major_comp, minor_comp))

The above code violates some of Scikit-Learn's API conventions, because the PCC.predict(X) method returns a 2-d array (whereas most regressors return a 1-d array). It should be possible to work around this somehow (by inheriting from a different base class?), because Scikit-Learn already provides several multi-output classifiers and regressors.

I want to encapsulate everything (PCA fitting and distance calculation) into one Python class, because this way my PMML converter can see and analyze all the information together, and generate the most compact and efficient PMML representation possible. For example, I've got a feeling that the PCC prediction logic can be mapped directly to the RegressionTable element (see http://dmg.org/pmml/v4-3/Regression.html#xsdElement_RegressionTable).

vruusmann (Member) commented:

Using the above "all-in-one" PCC class, the pipeline would simplify to the following:

pipeline = PMMLPipeline([
  ("mapper", DataFrameMapper([
    (df_X.columns.values, [ContinuousDomain(), StandardScaler()])
  ])),
  ("pcc", PCC(n, q, r))
])
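
Fitting and exporting would then follow the usual SkLearn2PMML workflow (a sketch; the output file name is illustrative):

from sklearn2pmml import sklearn2pmml

# PCC.fit accepts y=None, so the pipeline can be fitted without a target
pipeline.fit(df_X)
sklearn2pmml(pipeline, "PCC.pmml")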
