The future? #3
@jreback I thought you would be just the kind of person that might have an opinion on this little project.
Yeah, I think we would highly consider even vendoring this, but starting as an optional dependency is great.
@jrbourbeau - MattR suggested this would be of interest given your ongoing work with Spark-generated data types. Maybe you have already solved everything with Arrow's decimal type?
With

```python
import decimal

import pandas as pd
import dask.dataframe as dd
import pyarrow as pa

pa_dtype = pa.decimal128(precision=7, scale=3)
data = pa.array(
    [
        decimal.Decimal("8093.012"),
        decimal.Decimal("8094.123"),
        decimal.Decimal("8095.234"),
        decimal.Decimal("8096.345"),
        decimal.Decimal("8097.456"),
        decimal.Decimal("8098.567"),
    ],
    type=pa_dtype,
)
df = pd.DataFrame({"x": data}, dtype=pd.ArrowDtype(pa_dtype))
ddf = dd.from_pandas(df, npartitions=3)

# Some pandas operation
print(f"{df.x * 3 - 1 = }")
# Equivalent Dask operation
print(f"{(ddf.x * 3 - 1).compute() = }")
```

Experimenting locally, that approach seems to work well for working with decimal data (though I've not done exhaustive testing). Does that align well with …
On … Should this be wrapped in …
Since most things should be possible with the Arrow extension type alone, there has been no further development or attention here. I daresay I can archive this. The only advantage I see (aside from the technical differences) is the ability to have exact decimals in pandas without requiring Arrow. That ship has probably sailed.
Agree - but - the PyPI page is on the first page of search results for me for "pandas decimal", and I've not found any good write-up of how to "do Decimal in pd DataFrames" - do you have any tips / resources? Maybe they could be linked from the README? 🙏🏻
Agree the ship has sailed - that's a pretty niche use case. Functions like …
This is a good point. Maybe I should write up a blog demonstrating decimals in pandas with pyarrow. I can publish this on Anaconda and link it in the README as you suggest, but it will take me a little time to get to.
A blog with some more guidance about decimals with PyArrow would be great. Meanwhile I've been digging into this a little more at work to see if we should use …

**`describe` has problems**

```python
In [1]: import decimal
   ...: import pandas as pd

In [2]: import pyarrow as pa

In [3]: pa_dtype = pa.decimal128(precision=7, scale=3)

In [4]: data = pa.array(
   ...:     [
   ...:         decimal.Decimal("8093.012"),
   ...:         decimal.Decimal("8094.123"),
   ...:         decimal.Decimal("8095.234"),
   ...:         decimal.Decimal("8096.345"),
   ...:         decimal.Decimal("8097.456"),
   ...:         decimal.Decimal("8098.567"),
   ...:     ],
   ...:     type=pa_dtype,
   ...: )
   ...: df = pd.DataFrame({"x": data}, dtype=pd.ArrowDtype(pa_dtype))

In [5]: df.describe()
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
...
ArrowTypeError: int or Decimal object expected, got numpy.int64
```

**Simple ops have problems**

Continuing with the `df` from above:

```python
In [7]: df['x'] + 1
Out[7]:
0    8094.012
1    8095.123
2    8096.234
3    8097.345
4    8098.456
5    8099.567
Name: x, dtype: decimal128(23, 3)[pyarrow]
```

However, if we change the precision and scale to use more bits, things get hairy:

```python
In [13]: df = pd.DataFrame({"x": data}, dtype=pd.ArrowDtype(pa.decimal128(precision=38, scale=3)))

In [14]: df['x'] + 1
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
...
ArrowInvalid: Decimal precision out of range [1, 38]: 39
```

We know from above where …

**Current conclusion**

My guess is that there are workarounds for the issues above, plus possibly even fixes upstream in pandas. However, this extra complexity for vectorization and a smaller memory footprint might not be worth it at this time for us. My plan is to push ahead with …
I must admit that I had not tried to do too much with Arrow decimals. Your notes here are pretty disappointing! If you use dtype="decimal[3]" from this package, …
As for precision, you would expect maybe to be bounded by integers up to 2**63 (~19 decimal places for numbers near 1). I don't know why Arrow has both precision and scale... In conclusion, decimal.Decimal may solve your problem, but it is orders of magnitude slower.
Regarding pyarrow being required by pandas, I found this just now: pandas-dev/pandas#52509
This repo is currently a nice little proof of concept.
To be a viable package in the PyData realm, it should have …
Should we poke pandas directly to consider upstreaming?