#793: Added support Pandas 2 pyarrow dtype columns for emitting data from Python UDFs #795

Merged 14 commits on Jun 2, 2023
Changes from 6 commits
7 changes: 4 additions & 3 deletions doc/changes/changes_6.1.0.md
@@ -4,7 +4,7 @@ Code name: t.b.d.

## Summary

-t.b.d.
+This release adds support for Pandas 2 pyarrow dtype columns when emitting data from Python UDFs. Furthermore, it fixes silent data corruption when emitting DataFrames with float16 dtype columns from Python UDFs.

## [Package Version Comparison between Release 6.0.0 and 6.1.0](package_diffs/6.1.0/README.md)

@@ -14,11 +14,12 @@ This release uses version 0.17.0 of the container tool.

## Bug Fixes

-- #792: Fixes Github workflow publish-test-container by updating script-languages-container-tool to 0.17.0 and script-languages-container-ci to 1.1.0
+- #792: Fixed Github workflow publish-test-container by updating script-languages-container-tool to 0.17.0 and script-languages-container-ci to 1.1.0
+- #796: Fixed silent data corruption when emitting dataframes with float16 dtype columns from Python UDFs

## Features / Enhancements

-n/a
+- #793: Added support for Pandas 2 pyarrow dtype columns for emitting data from Python UDFs

## Documentation

42 changes: 41 additions & 1 deletion doc/user_guide/py_dataframe.md
@@ -14,7 +14,47 @@ The parameters of `get_dataframe` are the following.
`get_dataframe` will return a DataFrame containing `num_rows` rows, or fewer if fewer than `num_rows` rows are available. If zero rows are available, `get_dataframe` will return `None`. The DataFrame column labels will be set to the corresponding UDF parameter names, except for a dynamic parameter list, in which case the column labels will be '0', '1', and so on. After calling `get_dataframe`, the UDF data iterator will point to the next row (i.e. the row following the last row in the DataFrame), just as with `next()`.
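
The return contract described above (up to `num_rows` rows per call, `None` once no rows are left) suggests a simple batching loop. The following is a minimal sketch of that loop only. `FakeContext` and `batch_sizes` are hypothetical names invented for this illustration, and a plain list stands in for the DataFrame so that the sketch runs without a database; it is not the real UDF context.

```python
class FakeContext:
    """Hypothetical stand-in for the UDF context; NOT the real Exasol API."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._pos = 0  # index of the next unread row

    def get_dataframe(self, num_rows=1):
        # Mimics the documented contract: up to num_rows rows per call,
        # None when no rows are left. A list stands in for the DataFrame.
        if self._pos >= len(self._rows):
            return None
        batch = self._rows[self._pos:self._pos + num_rows]
        self._pos += len(batch)  # the iterator now points past the batch
        return batch


def batch_sizes(ctx, num_rows):
    """Drain the context in batches and record the size of each batch."""
    sizes = []
    while True:
        batch = ctx.get_dataframe(num_rows=num_rows)
        if batch is None:  # zero rows were available
            break
        sizes.append(len(batch))
    return sizes


sizes = batch_sizes(FakeContext(range(7)), num_rows=3)
print(sizes)  # [3, 3, 1]
```

The last batch is shorter than `num_rows`, and the loop ends on the `None` return rather than on an empty batch.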

## Emitting data
An entire DataFrame can be emitted by passing it to `emit()` just as with single values. Each column of the DataFrame will be automatically converted to a column in the result set. The columns will be matched by their position to the columns in the emit-clause. The column label in the DataFrame will be ignored.

The following table describes which conversions between DType/Python-Type and Database-Type are possible:

| Pandas-DType | Python-Type | Database-Type | Flavors | SLC Version | NULL handling |
|--------------------------------|------------------|---------------|--------------------------------|-------------|-----------------------------------|
| (u)int* | - | DECIMAL | all standard and python3 | all | (u)int* does not support NULL |
| (u)int* | - | DOUBLE | all standard and python3 | all | (u)int* does not support NULL |
| float* | - | DECIMAL | all standard and python3 | all | Uses numpy.nan for NULL |
| float* | - | DOUBLE | all standard and python3 | all | Uses numpy.nan for NULL |
| string                         | -                | (VAR)CHAR     | all standard and python3       | >=6.1.0     | Uses pandas.NA for NULL           |
| bool_ | - | BOOLEAN | all standard and python3 | all | Pandas will convert None to False |
| boolean                        | -                | BOOLEAN       | all standard and python3       | >=6.1.0     | Uses pandas.NA for NULL           |
| datetime64[ns] | - | TIMESTAMP | all standard and python3 | all | Uses pandas.NaT for NULL |
| object | int | DECIMAL | all standard and python3 | all | Uses None for NULL |
| object | int | DOUBLE | all standard and python3 | all | Uses None for NULL |
| object                         | float            | DECIMAL       | all standard and python3       | all         | Uses None or numpy.nan for NULL   |
| object                         | float            | DOUBLE        | all standard and python3       | all         | Uses None or numpy.nan for NULL   |
| object | decimal.Decimal | DECIMAL | all standard and python3 | all | Uses None for NULL |
| object | decimal.Decimal | DOUBLE | all standard and python3 | all | Uses None for NULL |
| object | bool | BOOLEAN | all standard and python3 | all | Uses None for NULL |
| object | str | (VAR)CHAR | all standard and python3 | all | Uses None for NULL |
| object                         | pandas.Timestamp | TIMESTAMP     | all standard and python3       | >=6.1.0     | Uses None or pandas.NA for NULL   |
| object | datetime.date | DATE | all standard and python3 | all | Uses None for NULL |
| (u)int*[pyarrow] | - | DECIMAL | standard-8.0.0 and all python3 | >=6.1.0 | Native NULL support |
| (u)int*[pyarrow] | - | DOUBLE | standard-8.0.0 and all python3 | >=6.1.0 | Native NULL support |
| float*[pyarrow] | - | DECIMAL | standard-8.0.0 and all python3 | >=6.1.0 | Native NULL support |
| float*[pyarrow] | - | DOUBLE | standard-8.0.0 and all python3 | >=6.1.0 | Native NULL support |
| string[pyarrow] | - | (VAR)CHAR | standard-8.0.0 and all python3 | >=6.1.0 | Native NULL support |
| bool[pyarrow] | - | BOOLEAN | standard-8.0.0 and all python3 | >=6.1.0 | Native NULL support |
| decimal128(*)[pyarrow] | - | DECIMAL | standard-8.0.0 and all python3 | >=6.1.0 | Native NULL support |
| decimal128(*)[pyarrow] | - | DOUBLE | standard-8.0.0 and all python3 | >=6.1.0 | Native NULL support |
| timestamp[ns, tz=UTC][pyarrow] | -                | TIMESTAMP     | standard-8.0.0 and all python3 | >=6.1.0     | Native NULL support               |

**Note**:

- Before SLC version 6.1.0, emitting float16 led to silent data corruption of the emitted values.
- Since SLC version 6.1.0, DataFrame columns with DType object also support pandas.NA for NULL.
- We only support the DTypes timestamp[ns, tz=UTC][pyarrow] and datetime64[ns], because Exasol doesn't support timezones. The timezone is dropped before the timestamp is used in Exasol. Furthermore, Exasol 7.* only supports timestamps with millisecond precision, so the nanoseconds will be truncated.
- Conversions from float* to DECIMAL can involve rounding down to the precision and scale of the DECIMAL.
- Conversions from (u)int*, decimal128, and decimal.Decimal to DOUBLE can also be imprecise if DOUBLE can't represent the number precisely.
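
The notes on DOUBLE imprecision and on timestamp truncation can be reproduced without a database. The following is a minimal sketch using only the Python standard library. It assumes that DOUBLE behaves like an IEEE 754 double (Python's `float`) and models the millisecond truncation as integer division; both are illustrations, not the SLC's actual conversion code, and all variable names are made up for this example.

```python
from decimal import Decimal

# Assumption: DOUBLE is modelled by Python's float (IEEE 754 binary64).
# Its significand has 53 bits, so not every integer above 2**53 fits.
big_int = 2**53 + 1
int_roundtrip = int(float(big_int))  # 9007199254740992, one less than big_int

# A decimal fraction such as 0.1 has no exact binary representation either.
dec = Decimal("0.1")
dec_roundtrip = Decimal(float(dec))  # slightly larger than Decimal("0.1")

# Assumption: truncating nanoseconds to milliseconds drops the last six digits.
ts_ns = 1_685_700_000_123_456_789  # nanoseconds since the epoch
ts_ms = ts_ns // 1_000_000         # the trailing 456789 ns are lost

print(int_roundtrip == big_int)  # False
print(dec_roundtrip == dec)      # False
print(ts_ms)                     # 1685700000123
```

If exact values matter, emitting into a DECIMAL column with sufficient precision and scale avoids the first two effects.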

## Example

2 changes: 1 addition & 1 deletion flavors/standard-EXASOL-7.0.0/flavor_base/testconfig
@@ -1,2 +1,2 @@
generic_language_tests=python3 java r
-test_folders=python3/all python3/external_modules java r standard-flavor/all standard-flavor/7.0 python3-no-python2
+test_folders=python3/all python3/external_modules java r standard-flavor/all standard-flavor/7.0 python3-no-python2 pandas/all
2 changes: 1 addition & 1 deletion flavors/standard-EXASOL-7.1.0/flavor_base/testconfig
@@ -1,2 +1,2 @@
generic_language_tests=python3 java r
-test_folders=python3/all python3/external_modules python3-no-python2 java r standard-flavor/all standard-flavor/7.1
+test_folders=python3/all python3/external_modules python3-no-python2 java r standard-flavor/all standard-flavor/7.1 pandas/all
@@ -16,7 +16,7 @@ azure-storage-blob|12.9.0
azure-storage-file-datalake|12.5.0
azure-storage-file-share|12.6.0
azure-storage-queue|12.1.6
-boto3|1.20.37
+boto3|1.26.125
google-cloud-asset|3.7.1
google-cloud-bigquery|2.32.0
google-cloud-bigquery-storage|2.11.0
@@ -34,41 +34,41 @@ google-cloud-spanner|3.12.1
google-cloud-storage|2.0.0
google-cloud-trace|1.5.1
martian|1.4
-protobuf|3.19.5
-pyexasol|0.23.3
+protobuf|3.20.3
+pyexasol|0.25.2
pysftp|0.2.9
-pytz|2021.3
-sagemaker|2.72.3
+pytz|2023.3
+sagemaker|2.151.0
setuptools|65.5.1
pycurl|7.44.1
redis|4.5.4
roman|3.3
pyodbc|4.0.32
lxml|4.9.1
-scipy|1.6.2
+scipy|1.7.3
pyftpdlib|1.5.6
jinja2|3.0.3
cffi|1.15.0
docutils|0.18.1
requests|2.27.1
ujson|5.4.0
-paramiko|2.9.2
+paramiko|3.1.0
simplejson|3.17.6
-scikit-learn|1.0.2
-plyvel|1.4.0
-python-ldap|3.4.0
-pyOpenSSL|23.0.0
+scikit-learn|1.2.2
+plyvel|1.5.0
+python-ldap|3.4.3
+pyOpenSSL|23.1.1
git+http://github.com/EXASOL/websocket-api.git@5150f964388412788bf5e47752a7916a5a8624c5#egg=exasol-db-api&subdirectory=python|
-debugpy|1.5.1
+debugpy|1.6.7
pybase64|1.2.1
-pysimdjson|4.0.3
-numba|0.55.0
-pyarrow|6.0.1
+pysimdjson|5.0.2
+numba|0.57.0
+pyarrow|12.0.0
bitarray|2.3.5
pybloomfiltermmap3|0.5.5
bitsets|0.8.3
-numba-scipy|0.3.0
-pyyaml|6.0
-https://github.com/exasol/bucketfs-utils-python/releases/download/0.1.0/exasol_bucketfs_utils_python-0.1.0-py3-none-any.whl|
+numba-scipy|0.3.1
+pyyaml|5.4.1
+exasol-bucketfs|0.8.0
pysmbc|1.0.23
-cryptography|39.0.1
+cryptography|40.0.2
@@ -1,2 +1,2 @@
-pandas|1.3.4
-numpy|1.21.3
+pandas|2.0.1
+numpy|1.22.4
2 changes: 1 addition & 1 deletion flavors/standard-EXASOL-8.0.0/flavor_base/testconfig
@@ -1,2 +1,2 @@
generic_language_tests=python3 java r
-test_folders=python3/all python3/external_modules python3-no-python2 java r standard-flavor/all standard-flavor/7.1 standard-flavor/8.0
+test_folders=python3/all python3/external_modules python3-no-python2 java r standard-flavor/all standard-flavor/7.1 standard-flavor/8.0 pandas/all pandas/pandas2