Remove the check for categorical values order #9

ohmystack · 2018-06-05T13:34:16Z

Categorical field values can only be tested for equality, while ordinal field values in addition have an order defined.

If the application algorithm for some class of models does not require any special treatment of ordinal fields, then they can be interpreted as categorical fields.

"categorical" doesn't require an order defined. Users should use "ordinal" field instead of "categorical" if they want the order.
So, no need to check categorical field's order.

For example, in jpmml-spark here , the values' order is hardcoded. The hardcoded order [0, 1] may be different from the order given by spark ML [1, 0]. For categorical field, they should be the same thing.

ohmystack · 2018-06-06T07:35:21Z

@vruusmann Would you mind reviewing this?

vruusmann · 2018-06-06T11:15:19Z

No objections - this is a valid code change.

I'm more interested about how did you find it? Do you have a (reproducible-) use case, where JPMML-SparkML fails because of this sanity check?

ohmystack · 2018-06-07T06:55:47Z

env: spark 2.2.1
pipelineStages: VectorAssembler, StringIndexer, XGBoostEstimator (classification)

In the 2nd stage, StringIndexerConverter uses transformer.labels() to generate categories for the label column. code
The array returned byStringIndexerModel.labels() is not sorted, can be [0, 1] or [1, 0].
Then, the field is encoded using this array.

In the 3rd stage, when encoding the schema, my code runs into the ContinuousFeature condition and the issue occurs.

Maybe my code has some problem. In the 3rd stage, the label should have been converted to CategoricalFeature already, instead of converting it from ContinuousFeature again. But if it happens, the code will break there.

Here is an example of returning [1, 0] in transformer.labels().

train.csv

LABEL,col1,col2
1,628,787
1,75,1794
1,444,14899
1,53,500
1,997,286
1,613,729
1,031,2245
1,02,5081
1,850,404
1,647,560
1,78,1517
1,75,4977
0,977,311
1,812,532
1,472,10822
0,570,2304
1,519,690
1,834,843
1,054,30

spark-shell

scala> val data = spark.sqlContext.read.format("csv").
     |   options(Map(
     |     "header" -> "true",
     |     "ignoreLeadingWhiteSpace" -> "true",
     |     "ignoreTrailingWhiteSpace" -> "true",
     |     "timestampFormat" -> "yyyy-MM-dd HH:mm:ss.SSSZZZ",
     |     "inferSchema" -> "true",
     |     "mode" -> "FAILFAST")).
     |   load("train.csv")

scala> val labelColName = "LABEL"

scala> import org.apache.spark.ml.feature.StringIndexer

scala> val labelIndexer = new StringIndexer().setInputCol(labelColName).setOutputCol("label")

scala> val labelIndexerModel = labelIndexer.fit(data)
labelIndexerModel: org.apache.spark.ml.feature.StringIndexerModel = strIdx_e56af3540e9d

scala> labelIndexerModel.labels
res0: Array[String] = Array(1, 0)       // <- Here's the [1, 0] values

vruusmann · 2018-06-07T07:11:51Z

The problem is that StringIndexerModel#labels() returns labels in the order of "popularity". In most datasets the 0 label is more popular than the 1 label (eg. there are many more non-fraud cases than there are fraud cases), so the labels are ordered [0, 1]. However, your dataset appears to have many more 1 labels than 0 labels, so the labels are ordered [1, 0], and this sanity check fails.

This issue needs more investigation. The order of target labels is critical, because it determines the encoding of model objects. For example, LogisticRegressionModel performs the computation relative to the second target category, so if we pass "unexpected" ordering of labels, then we will get a misbehaving logistic regression model (will predict the probability 0 class where the probability of 1 class is expected, and vice versa).

ohmystack · 2018-06-07T07:29:50Z

Agree to have more investigation.

vruusmann · 2018-06-07T07:44:18Z

Related to jpmml/jpmml-sparkml#14

There should be ContinuousDomain and CategoricalDomain meta-transforrmer classes that collect canonical label information.

Remove the check for categorical values order

5666f6d

ohmystack closed this Aug 19, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove the check for categorical values order #9

Remove the check for categorical values order #9

ohmystack commented Jun 5, 2018

ohmystack commented Jun 6, 2018

vruusmann commented Jun 6, 2018

ohmystack commented Jun 7, 2018

vruusmann commented Jun 7, 2018

ohmystack commented Jun 7, 2018

vruusmann commented Jun 7, 2018

Remove the check for categorical values order #9

Remove the check for categorical values order #9

Conversation

ohmystack commented Jun 5, 2018

ohmystack commented Jun 6, 2018

vruusmann commented Jun 6, 2018

ohmystack commented Jun 7, 2018

vruusmann commented Jun 7, 2018

ohmystack commented Jun 7, 2018

vruusmann commented Jun 7, 2018