Support for Decimal Types? #110

Open
abelsonlive opened this issue Feb 22, 2017 · 4 comments
@abelsonlive

Not sure if this is a noob question, but I'm wondering why there seems to be no support for DecimalType inputs to models?

When my featuresDF includes these types, I get the following error:

IllegalArgumentExceptionTraceback (most recent call last)
<ipython-input-46-d8430332ffa6> in <module>()
      1 from jpmml_sparkml import toPMMLBytes
----> 2 pmmlBytes = toPMMLBytes(spark, DF, pipelineModel)
      3 print(pmmlBytes)

/home/hadoop/pyenv/eggs/jpmml_sparkml-1.1rc0-py2.7.egg/jpmml_sparkml/__init__.pyc in toPMMLBytes(sc, df, pipelineModel)
     17         if(not isinstance(javaConverter, JavaClass)):
     18                 raise RuntimeError("JPMML-SparkML not found on classpath")
---> 19         return javaConverter.toPMMLByteArray(javaSchema, javaPipelineModel)

/usr/lib/spark/python/lib/py4j-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:

/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     77                 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
     78             if s.startswith('java.lang.IllegalArgumentException: '):
---> 79                 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
     80             raise
     81     return deco

IllegalArgumentException: u'Expected string, integral, double or boolean type, got decimal(18,0) type'

I was eventually able to address this in pyspark with the following pre-model hack:

import json

DF = ...
# convert all decimal columns to double before exporting
for f in DF.schema.fields:
    d = json.loads(f.json())
    if "decimal" in d["type"]:
        DF = DF.withColumn(d["name"], DF[d["name"]].cast("double"))
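A slightly tidier variant of the same hack (my sketch, not from the thread; the `decimal_columns` helper name is mine) separates the schema inspection from the casting. It relies only on the fact that `DataFrame.schema.json()` encodes simple column types as strings like `"decimal(18,0)"`:

```python
import json

def decimal_columns(schema_json):
    """Given the JSON string from DataFrame.schema.json(), return the
    names of all columns whose Spark SQL type is decimal(p, s).
    Complex types (struct, array, ...) appear as dicts in the JSON,
    hence the isinstance check."""
    schema = json.loads(schema_json)
    return [f["name"] for f in schema["fields"]
            if isinstance(f["type"], str) and f["type"].startswith("decimal")]

# Usage against a live DataFrame (assumes `DF` and a SparkSession exist):
# for name in decimal_columns(DF.schema.json()):
#     DF = DF.withColumn(name, DF[name].cast("double"))
```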

However, I'm curious why DecimalType, which is effectively interchangeable with DoubleType in practice, is not natively supported?

@vruusmann
Member

Why there seems to be no support for DecimalType inputs to models?

java.math.BigDecimal is a "guaranteed"-precision data type, whereas java.lang.Double is a "best effort"-precision data type. Therefore, it would be a type system violation to automatically "downgrade" Apache Spark's decimal columns to double columns.

I was eventually able to address this in pyspark with the following pre-model hack:

You're using the decimal(18, 0) data type, which "requires" two orders of magnitude more precision than what is theoretically available when using the double data type (i.e. decimal(16, 0)). So, while your solution is functional, it may have some unintended side effects.
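The precision gap is easy to demonstrate in plain Python (a standalone sketch, not Spark-specific): an IEEE 754 double has a 53-bit significand, which covers roughly 15-16 decimal digits, so an 18-digit value does not survive the cast intact:

```python
# A 64-bit double has a 53-bit significand: integers above 2**53
# (about 9.0e15, i.e. 16 decimal digits) are no longer exactly representable.
exact = 2**53
assert float(exact) == exact             # still exact at the boundary
assert float(exact + 1) == float(exact)  # one past it rounds to a neighbour

# An 18-digit decimal(18, 0) value therefore loses its low-order digits:
big = 123456789012345678
assert float(big) != big
```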

Do you really need decimal(18, 0) columns in the first place? It doesn't make sense to train ML models using guaranteed precision features, because all ML algorithms do the numerical computation work using double (or float) values anyway.

I'm keeping this issue open at the moment. Will have to go through the relevant parts of the Apache Spark codebase, and see how this type conflict is handled there.

@sbourzai

sbourzai commented May 29, 2017

Hi,
Do you have any idea about this error?
My data is the "sample_libsvm_data" for Spark ML.
Thanks

IllegalArgumentException                  Traceback (most recent call last)
<ipython-input-1-a0d334e7c374> in <module>()
     71     print(rfModel)  # summary only
     72 
---> 73     pmmlBytes = toPMMLBytes(spark, trainingData, model)
     74     print(pmmlBytes.decode("UTF-8"))

/home/hadoop/jpmml.py in toPMMLBytes(sc, data, pipelineModel)
     10         if(not isinstance(javaConverter,JavaClass)):
     11                 raise RuntimeError("JPMML-SparkML not found on classpath")
---> 12         return javaConverter.toPMMLByteArray(javaSchema, javaPipelineModel)

/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:

/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     77                 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
     78             if s.startswith('java.lang.IllegalArgumentException: '):
---> 79                 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
     80             raise
     81     return deco

IllegalArgumentException: u'Expected string, integral, double or boolean type, got vector type'

@vruusmann
Member

My data is the "sample_libsvm_data" for sparkML

@sbourzai As the name suggests, "sample_libsvm_data" is a toy dataset in LibSVM data format. It is absolutely nonsensical for any real-life application scenario.

Please re-run your experiment with a realistic dataset. For example, see the "Usage" section of the JPMML-SparkML README file: https://github.com/jpmml/jpmml-sparkml#usage

@sealzjh

sealzjh commented Apr 19, 2019

Hi @vruusmann,
I have the same error:
IllegalArgumentException: u'Expected string, integral, double or boolean type, got vector type'
The feature.MinMaxScaler inputCol is of vector type, but JPMML does not support the vector type.

How can I use MinMaxScaler together with JPMML?

@vruusmann vruusmann transferred this issue from jpmml/pyspark2pmml Mar 28, 2021