The future? #3
@jreback I thought you would be just the kind of person that might have an opinion on this little project.
Yeah, I think we would highly consider even vendoring this, but starting as an optional dependency is great.
@jrbourbeau - MattR suggested this would be of interest given your ongoing work with Spark-generated data types. Maybe you have already solved everything with Arrow's decimal type?
With

```python
import decimal

import pandas as pd
import dask.dataframe as dd
import pyarrow as pa

pa_dtype = pa.decimal128(precision=7, scale=3)
data = pa.array(
    [
        decimal.Decimal("8093.012"),
        decimal.Decimal("8094.123"),
        decimal.Decimal("8095.234"),
        decimal.Decimal("8096.345"),
        decimal.Decimal("8097.456"),
        decimal.Decimal("8098.567"),
    ],
    type=pa_dtype,
)
df = pd.DataFrame({"x": data}, dtype=pd.ArrowDtype(pa_dtype))
ddf = dd.from_pandas(df, npartitions=3)

# Some pandas operation
print(f"{df.x * 3 - 1 = }")
# Equivalent Dask operation
print(f"{(ddf.x * 3 - 1).compute() = }")
```

Experimenting locally, that approach seems to work well for working with decimal data (though I've not done exhaustive testing). Does that align well with …
On … Should this be wrapped in …
Since most things should be possible with the Arrow extension type alone, there has been no further development or attention here. I daresay I can archive this. The only advantage I see (aside from the technical differences) is the ability to have exact decimals in pandas without requiring Arrow. That ship has probably sailed.
Agree - but - the PyPI page is on the first page of search results for me for "pandas decimal", and I've not found any good write-up of how to "do Decimal in pd DataFrames" - do you have any tips / resources? Maybe they could be linked from the README? 🙏🏻
Agree the ship has sailed - that's a pretty niche use case. Functions like …
This is a good point. Maybe I should write up a blog demonstrating decimals in pandas with pyarrow. I can publish this on Anaconda and link it in the README as you suggest, but it will take me a little time to get to.
A blog with some more guidance about decimals with PyArrow would be great. Meanwhile I've been digging into this a little more at work to see if we should use …

**`describe` has problems**

```python
In [1]: import decimal
   ...: import pandas as pd

In [2]: import pyarrow as pa

In [3]: pa_dtype = pa.decimal128(precision=7, scale=3)

In [4]: data = pa.array(
   ...:     [
   ...:         decimal.Decimal("8093.012"),
   ...:         decimal.Decimal("8094.123"),
   ...:         decimal.Decimal("8095.234"),
   ...:         decimal.Decimal("8096.345"),
   ...:         decimal.Decimal("8097.456"),
   ...:         decimal.Decimal("8098.567"),
   ...:     ],
   ...:     type=pa_dtype,
   ...: )
   ...: df = pd.DataFrame({"x": data}, dtype=pd.ArrowDtype(pa_dtype))

In [5]: df.describe()
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
...
ArrowTypeError: int or Decimal object expected, got numpy.int64
```

**Simple ops have problems**

Continuing with the `df` from above:

```python
In [7]: df['x'] + 1
Out[7]:
0    8094.012
1    8095.123
2    8096.234
3    8097.345
4    8098.456
5    8099.567
Name: x, dtype: decimal128(23, 3)[pyarrow]
```

However, if we change the precision and scale to use more bits, things get hairy:

```python
In [13]: df = pd.DataFrame({"x": data}, dtype=pd.ArrowDtype(pa.decimal128(precision=38, scale=3)))

In [14]: df['x'] + 1
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
...
ArrowInvalid: Decimal precision out of range [1, 38]: 39
```

We know from above where …

**Current conclusion**

My guess is that there are workarounds for the issues above, plus possibly even fixes upstream in pandas. However, this extra complexity for vectorization and a smaller memory footprint might not be worth it at this time for us. My plan is to push ahead with …
I must admit that I had not tried to do too much with Arrow decimals. Your notes here are pretty disappointing! If you use dtype="decimal[3]" from this package, …
As for precision, you would expect maybe to be bounded by integers up to 2**63 (~19 decimal places for numbers near 1). I don't know why Arrow has both precision and scale... In conclusion, decimal.Decimal may solve your problem, but it is orders of magnitude slower.
Regarding pyarrow being required by pandas, I found this just now: pandas-dev/pandas#52509
This repo is currently a nice little proof of concept.
To be a viable package in the PyData realm, it should have …
Should we poke pandas directly to consider upstreaming?