Support for Decimal Types? #110

Open
abelsonlive opened this issue Feb 22, 2017 · 4 comments
@abelsonlive

Not sure if this is a noob question, but I'm wondering why there seems to be no support for DecimalType inputs to models?

When my featuresDF includes these types, I get the following error:

IllegalArgumentExceptionTraceback (most recent call last)
<ipython-input-46-d8430332ffa6> in <module>()
      1 from jpmml_sparkml import toPMMLBytes
----> 2 pmmlBytes = toPMMLBytes(spark, DF, pipelineModel)
      3 print(pmmlBytes)

/home/hadoop/pyenv/eggs/jpmml_sparkml-1.1rc0-py2.7.egg/jpmml_sparkml/__init__.pyc in toPMMLBytes(sc, df, pipelineModel)
     17         if(not isinstance(javaConverter, JavaClass)):
     18                 raise RuntimeError("JPMML-SparkML not found on classpath")
---> 19         return javaConverter.toPMMLByteArray(javaSchema, javaPipelineModel)

/usr/lib/spark/python/lib/py4j-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:

/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     77                 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
     78             if s.startswith('java.lang.IllegalArgumentException: '):
---> 79                 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
     80             raise
     81     return deco

IllegalArgumentException: u'Expected string, integral, double or boolean type, got decimal(18,0) type'

I was eventually able to address this in pyspark with the following pre-model hack:

import json

DF = ...
# convert all decimal columns to double before exporting
for f in DF.schema.fields:
    d = json.loads(f.json())
    if "decimal" in d["type"]:
        DF = DF.withColumn(d["name"], DF[d["name"]].cast("double"))
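A slightly tidier variant of the same hack (my sketch, not from the thread; the `decimal_columns` helper name is mine) separates the schema inspection from the casting. It relies only on the fact that `DataFrame.schema.json()` encodes simple column types as strings like `"decimal(18,0)"`:

```python
import json

def decimal_columns(schema_json):
    """Given the JSON string from DataFrame.schema.json(), return the
    names of all columns whose Spark SQL type is decimal(p, s).
    Complex types (struct, array, ...) appear as dicts in the JSON,
    hence the isinstance check."""
    schema = json.loads(schema_json)
    return [f["name"] for f in schema["fields"]
            if isinstance(f["type"], str) and f["type"].startswith("decimal")]

# Usage against a live DataFrame (assumes `DF` and a SparkSession exist):
# for name in decimal_columns(DF.schema.json()):
#     DF = DF.withColumn(name, DF[name].cast("double"))
```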

However, I'm curious why DecimalType, which is effectively interchangeable with DoubleType in practice, is not natively supported?

@vruusmann
Member

Why there seems to be no support for DecimalType inputs to models?

java.math.BigDecimal is a "guaranteed"-precision data type, whereas java.lang.Double is a "best effort"-precision data type. Therefore, it would be a type system violation to automatically "downgrade" Apache Spark's decimal columns to double columns.

I was eventually able to address this in pyspark with the following pre-model hack:

You're using the decimal(18, 0) data type, which "requires" two orders of magnitude more precision than what is theoretically available when using the double data type (i.e. decimal(16, 0)). So, while your solution is functional, it may have some unintended side effects.
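The precision gap is easy to demonstrate in plain Python (a standalone sketch, not Spark-specific): an IEEE 754 double has a 53-bit significand, which covers roughly 15-16 decimal digits, so an 18-digit value does not survive the cast intact:

```python
# A 64-bit double has a 53-bit significand: integers above 2**53
# (about 9.0e15, i.e. 16 decimal digits) are no longer exactly representable.
exact = 2**53
assert float(exact) == exact             # still exact at the boundary
assert float(exact + 1) == float(exact)  # one past it rounds to a neighbour

# An 18-digit decimal(18, 0) value therefore loses its low-order digits:
big = 123456789012345678
assert float(big) != big
```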

Do you really need decimal(18, 0) columns in the first place? It doesn't make sense to train ML models using guaranteed precision features, because all ML algorithms do the numerical computation work using double (or float) values anyway.

I'm keeping this issue open at the moment. Will have to go through the relevant parts of the Apache Spark codebase, and see how this type conflict is handled there.

@sbourzai

sbourzai commented May 29, 2017

Hi,
Do you have any idea about this error?
My data is the "sample_libsvm_data" for Spark ML.
Thanks

IllegalArgumentException                  Traceback (most recent call last)
<ipython-input-1-a0d334e7c374> in <module>()
     71     print(rfModel)  # summary only
     72 
---> 73     pmmlBytes = toPMMLBytes(spark, trainingData, model)
     74     print(pmmlBytes.decode("UTF-8"))

/home/hadoop/jpmml.py in toPMMLBytes(sc, data, pipelineModel)
     10         if(not isinstance(javaConverter,JavaClass)):
     11                 raise RuntimeError("JPMML-SparkML not found on classpath")
---> 12         return javaConverter.toPMMLByteArray(javaSchema, javaPipelineModel)

/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:

/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     77                 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
     78             if s.startswith('java.lang.IllegalArgumentException: '):
---> 79                 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
     80             raise
     81     return deco

IllegalArgumentException: u'Expected string, integral, double or boolean type, got vector type'

@vruusmann
Member

My data is the "sample_libsvm_data" for sparkML

@sbourzai As the name suggests, "sample_libsvm_data" is a toy dataset in LibSVM data format. It is absolutely nonsensical for any real-life application scenario.

Please re-run your experiment with a realistic dataset. For example, see the "Usage" section of the JPMML-SparkML README file: https://github.com/jpmml/jpmml-sparkml#usage

@sealzjh

sealzjh commented Apr 19, 2019

Hi @vruusmann,
I have the same error:
IllegalArgumentException: u'Expected string, integral, double or boolean type, got vector type'
The feature.MinMaxScaler inputCol is of vector type, but JPMML does not support the vector type.

How can I use MinMaxScaler together with JPMML?

@vruusmann vruusmann transferred this issue from jpmml/pyspark2pmml Mar 28, 2021