Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove the check for categorical values order #9

Closed
wants to merge 1 commit into from

Conversation

ohmystack
Copy link

According to the PMML docs ,

Categorical field values can only be tested for equality, while ordinal field values in addition have an order defined.

If the application algorithm for some class of models does not require any special treatment of ordinal fields, then they can be interpreted as categorical fields.

"categorical" doesn't require an order defined. Users should use "ordinal" field instead of "categorical" if they want the order.
So, no need to check categorical field's order.

For example, in jpmml-spark here , the values' order is hardcoded. The hardcoded order [0, 1] may be different from the order given by spark ML [1, 0]. For categorical field, they should be the same thing.

@ohmystack
Copy link
Author

@vruusmann Would you mind reviewing this?

@vruusmann
Copy link
Member

No objections - this is a valid code change.

I'm more interested about how did you find it? Do you have a (reproducible-) use case, where JPMML-SparkML fails because of this sanity check?

@ohmystack
Copy link
Author

env: spark 2.2.1
pipelineStages: VectorAssembler, StringIndexer, XGBoostEstimator (classification)

In the 2nd stage, StringIndexerConverter uses transformer.labels() to generate categories for the label column. code
The array returned byStringIndexerModel.labels() is not sorted, can be [0, 1] or [1, 0].
Then, the field is encoded using this array.

In the 3rd stage, when encoding the schema, my code runs into the ContinuousFeature condition and the issue occurs.

Maybe my code has some problem. In the 3rd stage, the label should have been converted to CategoricalFeature already, instead of converting it from ContinuousFeature again. But if it happens, the code will break there.


Here is an example of returning [1, 0] in transformer.labels().

train.csv

LABEL,col1,col2
1,628,787
1,75,1794
1,444,14899
1,53,500
1,997,286
1,613,729
1,031,2245
1,02,5081
1,850,404
1,647,560
1,78,1517
1,75,4977
0,977,311
1,812,532
1,472,10822
0,570,2304
1,519,690
1,834,843
1,054,30
spark-shell

scala> val data = spark.sqlContext.read.format("csv").
     |   options(Map(
     |     "header" -> "true",
     |     "ignoreLeadingWhiteSpace" -> "true",
     |     "ignoreTrailingWhiteSpace" -> "true",
     |     "timestampFormat" -> "yyyy-MM-dd HH:mm:ss.SSSZZZ",
     |     "inferSchema" -> "true",
     |     "mode" -> "FAILFAST")).
     |   load("train.csv")

scala> val labelColName = "LABEL"

scala> import org.apache.spark.ml.feature.StringIndexer

scala> val labelIndexer = new StringIndexer().setInputCol(labelColName).setOutputCol("label")

scala> val labelIndexerModel = labelIndexer.fit(data)
labelIndexerModel: org.apache.spark.ml.feature.StringIndexerModel = strIdx_e56af3540e9d

scala> labelIndexerModel.labels
res0: Array[String] = Array(1, 0)       // <- Here's the [1, 0] values

@vruusmann
Copy link
Member

The problem is that StringIndexerModel#labels() returns labels in the order of "popularity". In most datasets the 0 label is more popular than the 1 label (eg. there are many more non-fraud cases than there are fraud cases), so the labels are ordered [0, 1]. However, your dataset appears to have many more 1 labels than 0 labels, so the labels are ordered [1, 0], and this sanity check fails.

This issue needs more investigation. The order of target labels is critical, because it determines the encoding of model objects. For example, LogisticRegressionModel performs the computation relative to the second target category, so if we pass "unexpected" ordering of labels, then we will get a misbehaving logistic regression model (will predict the probability 0 class where the probability of 1 class is expected, and vice versa).

@ohmystack
Copy link
Author

Agree to have more investigation.

@vruusmann
Copy link
Member

Related to jpmml/jpmml-sparkml#14

There should be ContinuousDomain and CategoricalDomain meta-transforrmer classes that collect canonical label information.

@ohmystack ohmystack closed this Aug 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants