BUG: read_parquet doesn't work for ArrowDtype dictionary types. #54392
Comments
This looks like an upstream issue in pyarrow. We pass
import pandas as pd
import pyarrow as pa

pa_dict = pa.dictionary(pa.int32(), pa.string())

# Save the DataFrame to Parquet
df.to_parquet("demo2.parquet")

# Read the Parquet file
df_loaded = pd.read_parquet("demo2.parquet", dtype_backend="pyarrow")

# Manually decode the dictionary column
df_loaded['bar'] = df_loaded['bar'].apply(lambda x: pa_dict.decode(x))

assert df_loaded.bar.dtype == pd_dict  # Confirm the dtype after decoding
I think this is the same underlying problem that we ran into here: #53011
Closing as a duplicate of #53011
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
When attempting to load a DataFrame that was serialized with pandas as a parquet file, an error is raised if the table contained a column with pd.ArrowDtype(pa.dictionary(pa.int32(), pa.string())).
Surprisingly, the same is not true when reading the same table if it was serialized as parquet via pyarrow.
Expected Behavior
It should load the DataFrame.
Installed Versions