Add support for 'Normalizer' transformer #56

Open
rodrigojimenezdiego opened this issue Jan 16, 2019 · 3 comments

Comments

@rodrigojimenezdiego

[More of an inquiry than a proper issue but I searched for prior issues/comments about this and did not find one, so I raise the issue so the reply is available for others.]

According to the documentation, the transformer 'org.apache.spark.ml.feature.Normalizer' is not currently supported, and the API complains when trying to convert pipelines that contain it.

We'd like to know a bit more about whether there is any particular reason for this transformation not being supported, and if there are plans to support it in the future.

Keep up the great work! Yours is an invaluable contribution to the industry.

@vruusmann
Member

We'd like to know a bit more about whether there is any particular reason for this transformation not being supported

There's a conceptual mismatch between the PMML representation and Apache Spark/Scikit-Learn representations:

  • PMML: High-level, treats features individually, features as scalars
  • Apache Spark/Scikit-Learn: Low-level, treats features collectively, collections of features as vectors

The Normalizer transformer is a prime example of a low-level transformation that operates on a collection of features. When mapped to a higher-level representation (such as PMML, or any other human-readable explanation), it needs to be broken down into elementary feature-oriented operations. Unfortunately, this often results in markup that is computationally inefficient at deployment time.
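
To illustrate the breakdown described above, here is a minimal Python sketch. It is not the converter's code and not actual PMML output; the function names, the feature names `x1`..`x3`, and the textual expression syntax are made up for illustration. It contrasts a single vector-level p-norm normalization with the equivalent set of per-feature scalar expressions, each of which has to reference every input feature.

```python
def normalize_vector(values, p=2.0):
    """Vector-level view: one operation over the whole feature collection."""
    norm = sum(abs(v) ** p for v in values) ** (1.0 / p)
    if norm == 0.0:
        # Assumption for this sketch: leave an all-zero vector unchanged.
        return list(values)
    return [v / norm for v in values]


def per_feature_expressions(names, p=2):
    """Scalar-level view: one standalone expression per output feature.

    The expression text is illustrative only; it is not PMML syntax.
    """
    norm_terms = " + ".join(f"abs({name})^{p}" for name in names)
    return {
        f"normalized({name})": f"{name} / ({norm_terms})^(1/{p})"
        for name in names
    }


if __name__ == "__main__":
    print(normalize_vector([3.0, 4.0]))
    for target, expression in per_feature_expressions(["x1", "x2", "x3"]).items():
        print(target, "=", expression)
```

With n input features, the scalar form yields n expressions that each mention all n inputs, so the size of the representation grows roughly quadratically unless the norm is factored out into a shared intermediate field.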

@vruusmann
Member

There's a conceptual mismatch between the PMML representation and Apache Spark/Scikit-Learn representations

To elaborate some more:

  • PMML: Optimized for model interpretation and deployment. Long term.
  • Apache Spark/Scikit-Learn: Optimized for model training. Short term.

Also, consider the one-hot-encoding of categorical features for model training. Objectively, this is a stupid thing to do, but it is very much needed given the current state of Apache Spark/Scikit-Learn, because they can't handle categorical features directly. PMML can, and doesn't need one-hot-encoding.
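
The one-hot-encoding point can be sketched in Python as well. Everything here is hypothetical: the `color` feature, its categories, and the "accept if green" rule are invented for illustration and come from neither library. The sketch shows the same decision expressed once against an encoded vector (the training-time view) and once against the categorical value itself (the view a PMML-style predicate can take).

```python
CATEGORIES = ["red", "green", "blue"]  # hypothetical category levels


def one_hot(value, categories=CATEGORIES):
    """Training-time encoding: one binary indicator per category level."""
    return [1.0 if value == category else 0.0 for category in categories]


def score_on_vector(encoded):
    """Rule as a vector-based learner would see it: threshold on slot 1 ("green")."""
    return "accept" if encoded[1] > 0.5 else "reject"


def score_on_category(color):
    """The same rule stated directly on the categorical value."""
    return "accept" if color == "green" else "reject"


if __name__ == "__main__":
    for color in CATEGORIES:
        print(color, score_on_vector(one_hot(color)), score_on_category(color))
```

Both functions give the same answer for every category; the second needs no encoding step and reads as a statement about the original feature.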

@rodrigojimenezdiego
Author

Many thanks for your explanation.
