BUG: read_parquet converts pyarrow list type to numpy dtype #53011
Comments
I found that during the PyArrow conversion, if you pass in an Arrow list dtype, reading the Parquet file back fails.

From the traceback, it appears that pyarrow tries to convert this type to a numpy dtype by default, so I think an appropriate fix would be for pyarrow to just return an Arrow-backed extension dtype instead of round-tripping through `np.dtype`:

```
File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/pandas_compat.py:812, in table_to_blockmanager(options, table, categories, ignore_metadata, types_mapper)
    809 table = _add_any_metadata(table, pandas_metadata)
    810 table, index = _reconstruct_index(table, index_descriptors,
    811                                   all_columns)
--> 812 ext_columns_dtypes = _get_extension_dtypes(
    813     table, all_columns, types_mapper)
    814 else:
    815     index = _pandas_api.pd.RangeIndex(table.num_rows)

File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/pandas_compat.py:865, in _get_extension_dtypes(table, columns_metadata, types_mapper)
    860 dtype = col_meta['numpy_type']
    862 if dtype not in _pandas_supported_numpy_types:
    863     # pandas_dtype is expensive, so avoid doing this for types
    864     # that are certainly numpy dtypes
--> 865     pandas_dtype = _pandas_api.pandas_dtype(dtype)
    866     if isinstance(pandas_dtype, _pandas_api.extension_dtype):
    867         if hasattr(pandas_dtype, "__from_arrow__"):

File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/pandas-shim.pxi:136, in pyarrow.lib._PandasAPIShim.pandas_dtype()

File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/pandas-shim.pxi:139, in pyarrow.lib._PandasAPIShim.pandas_dtype()

File ~/.../pandas/core/dtypes/common.py:1626, in pandas_dtype(dtype)
   1621 with warnings.catch_warnings():
   1622     # GH#51523 - Series.astype(np.integer) doesn't show
   1623     # numpy deprecation warning of np.integer
   1624     # Hence enabling DeprecationWarning
   1625     warnings.simplefilter("always", DeprecationWarning)
-> 1626     npdtype = np.dtype(dtype)
   1627 except SyntaxError as err:
   1628     # np.dtype uses `eval` which can raise SyntaxError
   1629     raise TypeError(f"data type '{dtype}' not understood") from err

TypeError: data type 'list<item: string>[pyarrow]' not understood
```
Hmm so I looked at the pandas code, and I'm not sure passing the metadata string through `pandas_dtype` works. The issue is that the stored `numpy_type` metadata comes back as the string `'list<item: string>[pyarrow]'`, and converting it with `np.dtype` fails since it's just a string representation. I think the better approach would be to not just pass in the raw string.
I think this behaves as expected. You can pass `dtype_backend="pyarrow"` to `read_parquet`.
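A minimal sketch of that call (the file name is assumed here):

```python
import pandas as pd

# dtype_backend="pyarrow" asks read_parquet to build ArrowDtype-backed columns
# instead of numpy-backed ones. "test.parquet" is a placeholder file name.
df = pd.read_parquet("test.parquet", dtype_backend="pyarrow")
```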
Oh oops I forgot to mention I tried `dtype_backend="pyarrow"` as well.
Confirmed it still fails:
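A reconstruction of the kind of round-trip that still fails, mirroring the repro given later in the thread:

```python
import pandas as pd
import pyarrow as pa

# Write an Arrow-backed list<string> column, then read it back with the pyarrow
# dtype backend - the read step raises
# TypeError: data type 'list<item: string>[pyarrow]' not understood.
df = pd.DataFrame({"a": pd.Series([["a"], ["a", "b"]], dtype=pd.ArrowDtype(pa.list_(pa.string())))})
df.to_parquet("test.parquet")
pd.read_parquet("test.parquet", dtype_backend="pyarrow")
```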
Interesting, this one works:
Ye that works since it's a plain object column. The issue is if you explicitly parse the column with an Arrow list dtype - that returns the same error.
In fact the explicit conversion returns the same error.
Maybe a `try`/`except`? Maybe a fallback to `pd.ArrowDtype`?
Run into the same issue:

```python
df = pd.DataFrame({'a': pd.Series([['a'], ['a', 'b']], dtype=pd.ArrowDtype(pa.list_(pa.string())))})

df.to_parquet('test.parquet')  # SUCCESS
pd.read_parquet('test.parquet')  # *** FAIL

df.to_parquet('test.parquet')  # SUCCESS
pq.read_table('test.parquet').to_pandas(ignore_metadata=True, types_mapper=pd.ArrowDtype)  # SUCCESS

df.to_parquet('test.parquet', store_schema=False)  # SUCCESS
pd.read_parquet('test.parquet')  # SUCCESS
```

I think the last case was not mentioned so far.
@takacsd oh interesting - so it's possible it's the schema-storing component that's wrong?
@danielhanchen I think the problem is in the pandas specific metadata. If the parquet file was created with something else (e.g.: AWS Athena) it could read it just fine:

```python
pq.write_table(pa.table({'a': pa.array([['a'], ['a', 'b']], type=pa.list_(pa.string()))}), 'test.parquet')  # SUCCESS
pd.read_parquet('test.parquet')  # SUCCESS

pq.write_table(pa.Table.from_pandas(df), 'test.parquet')  # SUCCESS
pd.read_parquet('test.parquet')  # *** FAIL

pq.write_table(pa.Table.from_pandas(df).replace_schema_metadata(), 'test.parquet')  # SUCCESS
pd.read_parquet('test.parquet')  # SUCCESS
```

This is the pandas metadata btw (see the inspection sketch below); in the case of a simple `list<string>` column, the problematic part is the `numpy_type` entry.
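A small sketch of how to look at that metadata yourself (the exact JSON contents vary by pyarrow version; the `numpy_type` value shown is taken from the traceback above):

```python
import json

import pyarrow.parquet as pq

# Read back only the schema and decode the pandas-specific metadata blob.
schema = pq.read_schema("test.parquet")
pandas_meta = json.loads(schema.metadata[b"pandas"])
print(pandas_meta["columns"])
# For the ArrowDtype list column, the relevant entry looks roughly like:
# {'name': 'a', 'field_name': 'a', 'pandas_type': 'list[string]',
#  'numpy_type': 'list<item: string>[pyarrow]', 'metadata': None}
# and it is that 'numpy_type' string that np.dtype() cannot understand.
```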
@takacsd oh yep your reasoning sounds right - so I think adding a simple try/except might be a simple fix maybe? Try calling numpy, then if it fails, call `pd.ArrowDtype`.
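A rough sketch of that try/except idea (not pyarrow's actual code; the function and parameter names here are made up):

```python
import numpy as np
import pandas as pd

def resolve_dtype(numpy_type_str, arrow_type):
    """Try the numpy interpretation of the metadata string first, and fall back
    to an Arrow-backed dtype if numpy cannot understand it."""
    try:
        return np.dtype(numpy_type_str)
    except TypeError:
        # e.g. 'list<item: string>[pyarrow]' is not a valid numpy dtype string
        return pd.ArrowDtype(arrow_type)
```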
The main issue I think is because the metadata only stores the dtype as its string representation. Just my two cents.
It seems a little more complicated than that: We already have a special case for temporal types, so I suppose we just need something similar for arrays and maps...
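As a rough illustration of what "something similar" could look like (these names are not pyarrow internals, just a sketch of the idea):

```python
import pandas as pd
import pyarrow as pa

def dtype_for_nested_field(field: pa.Field):
    """If the Arrow field is a nested type, map it straight to an Arrow-backed
    dtype instead of trying to interpret the metadata string."""
    if pa.types.is_list(field.type) or pa.types.is_struct(field.type) or pa.types.is_map(field.type):
        return pd.ArrowDtype(field.type)
    return None  # let the existing numpy / metadata path handle everything else
```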
@takacsd The issue though is that timestamps are reasonably easy to construct from text. The shapes below could all be possible though:
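Some illustrative nested type strings of the sort in question (reconstructed examples, following the formats handled by the parser further down):

```python
# Illustrative nested Arrow type strings as they appear in the "numpy_type" metadata field.
nested_type_strings = [
    "list<item: string>",
    "struct<a: int64, b: list<item: string>>",
    "dictionary<values=string, indices=int32, ordered=0>",
    "list<item: struct<x: timestamp[s, tz=UTC], y: double>>",
]
```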
Constructing Arrow dtypes from that could be potentially problematic. I guess in theory one can iterate through the string and build up pieces which you can then construct the types from. I think a wiser approach would be to use the Arrow dtype from the Arrow schema itself rather than from the metadata string.
@danielhanchen your approach only works here, and it just ignores the metadata. I'm not a pandas developer but I suppose they generated that metadata for a reason, so it may break some things if we just ignore it. Properly parsing the string is obviously harder, but I still think it is the better solution...
@takacsd agreed parsing the metadata string is the correct way. I thought about how one would go about doing it. E.g. take a nested type string: you'll have to first find the innermost type, the one with the first fully enclosed `<...>` group, and work outward. The while loop looks something like this:
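A rough sketch of the innermost-first idea (illustrative only; the full parser further down in the thread is the workable version):

```python
import re

# Repeatedly find an innermost "name<...>" group - one whose body contains no
# further '<' - and replace it with a placeholder token, working outward.
INNERMOST = re.compile(r"(\w+)<([^<>]*)>")

def innermost_groups(type_str: str):
    """Yield (kind, body) pairs from the innermost nesting level outward."""
    while True:
        match = INNERMOST.search(type_str)
        if match is None:
            break
        yield match.group(1), match.group(2)  # e.g. ('list', 'item: string')
        type_str = type_str[:match.start()] + "@token" + type_str[match.end():]

# list(innermost_groups("struct<a: int64, b: list<item: string>>"))
# -> [('list', 'item: string'), ('struct', 'a: int64, b: @token')]
```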
The code just gets too cumbersome sadly - the above only supports struct, dictionary and list types. The main issue is the infinite nesting of Arrow dtypes which overcomplicates the conversion process in my view.
Actually a simpler solution is to directly call `pa.type_for_alias` on the type string. However, this doesn't work with nested types. On top of that, a struct field name could have `: ` or `, ` inside it. This probably means string parsing won't work for structs at all.
Yeah, after some experimenting, I think we need to give up on parsing the type string. These two:

```python
pd.Series([{'a': 1, 'b': 1}], dtype=pd.ArrowDtype(pa.struct({'a': pa.int64(), 'b': pa.int64()})))
pd.Series([{'a: int64, b': 1}], dtype=pd.ArrowDtype(pa.struct({'a: int64, b': pa.int64()})))
```

both have the same type string (`struct<a: int64, b: int64>[pyarrow]`). But even if we disallow such cases, it is just too hard: I tried to write a recursive parser with some regexp, but I gave up. We need a balancing matcher or a recursive pattern to match the nested `<...>` groups. The fundamental problem is we try to parse a string which was not meant to be easily parsable. The metadata should save the nested data types in a way that is easy to work with...
I was bored:

```python
import re
from typing import NamedTuple

import pyarrow as pa


class ParseFail(Exception):
    pass


class Parsed(NamedTuple):
    type: pa.DataType
    end: int


class TypeStringParser:
    BASIC_TYPE_MATCHER = re.compile(r'\w+(\[[^\]]+\])?')
    TIMESTAMP_MATCHER = re.compile(r'timestamp\[([^,]+), tz=([^\]]+)\]')
    NAME_MATCHER = re.compile(r'\w+')  # this can be r'[^:]' to support weird names in struct

    def __init__(self, type_str: str) -> None:
        self.type_str = type_str

    def parse(self) -> pa.DataType:
        try:
            parsed = self.type(0)
        except ParseFail:
            raise ValueError(f"Can't parse '{self.type_str}' as a type.")
        if parsed.end != len(self.type_str):
            raise ValueError(f"Can't parse '{self.type_str}' as a type.")
        return parsed.type

    def type(self, pos: int) -> Parsed:
        # Try each alternative in turn; the first one that matches wins.
        try:
            return self.basic_type(pos)
        except ParseFail:
            pass
        try:
            return self.timestamp(pos)
        except ParseFail:
            pass
        try:
            return self.list(pos)
        except ParseFail:
            pass
        try:
            return self.dictionary(pos)
        except ParseFail:
            pass
        try:
            return self.struct(pos)
        except ParseFail:
            pass
        raise ParseFail()

    def basic_type(self, pos: int) -> Parsed:
        match = self.BASIC_TYPE_MATCHER.match(self.type_str, pos)
        if match is None:
            raise ParseFail()
        try:
            return Parsed(pa.type_for_alias(match.group(0)), match.end(0))
        except ValueError:
            pass
        raise ParseFail()

    def timestamp(self, pos: int) -> Parsed:
        match = self.TIMESTAMP_MATCHER.match(self.type_str, pos)
        if match is None:
            raise ParseFail()
        try:
            return Parsed(pa.timestamp(match.group(1).strip(), tz=match.group(2).strip()), match.end(0))
        except ValueError:
            pass
        raise ParseFail()

    def list(self, pos: int) -> Parsed:
        pos = self.accept('list<', pos)
        match = self.NAME_MATCHER.match(self.type_str, pos)
        if match is None:
            raise ParseFail()
        pos = self.accept(': ', match.end(0))
        item = self.type(pos)
        pos = self.accept('>', item.end)
        return Parsed(pa.list_(item.type), pos)

    def dictionary(self, pos: int) -> Parsed:
        pos = self.accept('dictionary<values=', pos)
        values = self.type(pos)
        pos = self.accept(', indices=', values.end)
        indices = self.type(pos)
        pos = self.accept(', ordered=', indices.end)
        try:
            pos = self.accept('0', pos)
            ordered = False
        except ParseFail:
            pos = self.accept('1', pos)
            ordered = True
        pos = self.accept('>', pos)
        return Parsed(pa.dictionary(indices.type, values.type, ordered), pos)

    def struct(self, pos: int) -> Parsed:
        pos = self.accept('struct<', pos)
        elements = []
        while self.type_str[pos] != '>':
            match = self.NAME_MATCHER.match(self.type_str, pos)
            if match is None:
                raise ParseFail()
            element_name = match.group(0)
            pos = self.accept(': ', match.end(0))
            element_type = self.type(pos)
            pos = element_type.end
            if self.type_str[pos] != '>':
                pos = self.accept(', ', pos)
            elements.append((element_name, element_type.type))
        pos = self.accept('>', pos)
        return Parsed(pa.struct(elements), pos)

    def accept(self, term: str, pos: int) -> int:
        if self.type_str.startswith(term, pos):
            return pos + len(term)
        raise ParseFail()
```

Probably not the prettiest recursive descent parser in existence, but it does parse arbitrary nested types.
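A quick usage sketch for the class above (the first string matches the `numpy_type` value from the traceback, minus the `[pyarrow]` suffix):

```python
import pyarrow as pa

parsed = TypeStringParser("list<item: string>").parse()
assert parsed == pa.list_(pa.string())

nested = TypeStringParser("struct<a: int64, b: list<item: string>>").parse()
assert nested == pa.struct({"a": pa.int64(), "b": pa.list_(pa.string())})
```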
@takacsd Nice work on the parser! :) Also I just noticed that https://github.com/apache/arrow/blob/8be70c137289adba92871555ce74055719172f56/python/pyarrow/pandas_compat.py#L870 actually does in fact parse Arrow dtypes! The issue is that the code before it breaks, so it never gets there.

The problem is in https://github.com/apache/arrow/blob/8be70c137289adba92871555ce74055719172f56/python/pyarrow/pandas_compat.py#LL854C1-L868C53:
I think I might have fixed it WITHOUT needing to parse the type string: we apply `types_mapper` first, so when the pyarrow backend is requested we never have to call `_pandas_api.pandas_dtype(dtype)` on the metadata string at all. This also means the later `types_mapper` override can be deleted - it's redundant, since we folded that logic into the previous block.
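A rough sketch of that reordering (this is not the actual pyarrow code; the helper name and signature are invented for illustration):

```python
import pandas as pd

def resolve_extension_dtype(numpy_type_str, arrow_field, types_mapper=None):
    # Consult the user-supplied types_mapper (e.g. pd.ArrowDtype) first. If it
    # claims the column, the metadata string such as 'list<item: string>[pyarrow]'
    # never needs to be interpreted at all.
    if types_mapper is not None:
        mapped = types_mapper(arrow_field.type)
        if mapped is not None:
            return mapped
    # Otherwise fall back to interpreting the stored metadata string.
    return pd.api.types.pandas_dtype(numpy_type_str)
```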
I just hit this today trying to read a parquet file made by someone else, where they had used the pyarrow backend. Here is another minimal example to add to the mix that fails on reading:

```python
import io

import numpy as np
import pandas as pd
import pyarrow as pa


def main():
    df0 = pd.DataFrame(
        [
            {"foo": {"bar": True, "baz": np.float32(1)}},
            {"foo": {"bar": True, "baz": None}},
        ],
    )
    schema = pa.schema(
        [
            pa.field(
                "foo",
                pa.struct(
                    [
                        pa.field("bar", pa.bool_(), nullable=False),
                        pa.field("baz", pa.float32(), nullable=True),
                    ],
                ),
            ),
        ],
    )
    print(schema)
    with io.BytesIO() as stream0, io.BytesIO() as stream1:
        kwargs = {
            "engine": "pyarrow",
            "compression": "zstd",
            "schema": schema,
            "row_group_size": 2_000,
        }
        print("Writing df0")
        df0.to_parquet(stream0, **kwargs)
        print("Reading df1")
        stream0.seek(0)
        df1 = pd.read_parquet(stream0, engine="pyarrow", dtype_backend="pyarrow")
        print("Writing df1")
        df1.to_parquet(stream1, **kwargs)
        print("Reading df2")
        stream1.seek(0)
        df2 = pd.read_parquet(stream1, engine="pyarrow", dtype_backend="pyarrow")


if __name__ == "__main__":
    main()
```

Using:

INSTALLED VERSIONS
commit : f538741
pandas : 2.2.0
The only "workaround" at the pandas-level I've found is to set Versions:
|
I would recommend using `dtype_backend="pyarrow"` in `read_parquet`.
@phoff, not sure if you saw from my screenshot, but I did apply `dtype_backend="pyarrow"` to the `read_parquet` method and it still fails, unless I am misunderstanding your suggestion.
The only workaround I have found so far is the following (which works in all cases I have thought of, except round-tripping an empty dataframe with a struct or list type, setting the schema, and not using Arrow dtypes for the columns). Would def welcome suggested improvements to this workaround! Obv you can write this differently if you don't want a byte string returned, but for us that's what we want.

```python
import io

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def serialize(self, data: pd.DataFrame, **kwargs) -> bytes:
    """see BytesWriter.serialize -- Dump pandas dataframe to parquet bytes"""
    with io.BytesIO() as stream:
        schema = kwargs.pop("schema", None)
        all_arrow_types = all(isinstance(t, pd.ArrowDtype) for t in data.dtypes.tolist())
        # An empty dataframe may use default dtypes that are incompatible with the schema.
        # In this case, first cast to object, as the schema can always convert that to the correct type.
        if len(data) == 0 and schema is not None and not all_arrow_types:
            data = data.astype("object").astype({n: pd.ArrowDtype(schema.field(n).type) for n in schema.names})
        table = pa.Table.from_pandas(data, schema=schema)
        # drop pandas from the schema metadata to work around the bug where you can't read
        # struct columns with pandas metadata
        # see https://github.com/pandas-dev/pandas/issues/53011
        metadata = table.schema.metadata
        if b"pandas" in metadata and (b"list" in metadata[b"pandas"] or b"struct" in metadata[b"pandas"]):
            del metadata[b"pandas"]
            table = table.replace_schema_metadata(metadata)
        pq.write_table(table, stream, **kwargs)
        return stream.getvalue()
```
This bug is very similar to #57411 where the data type is list[int] instead of list[str]. |
Addresses pandas-dev/pandas#53011. `types_mapper` always had highest priority as it overrode what was set before. By switching the logical ordering, we no longer need to call `_pandas_api.pandas_dtype(dtype)` when using the pyarrow backend, resolving the issue of complex dtypes with `list` or `struct`.
Matching arrow ticket: apache/arrow#39914 and potential PR: apache/arrow#44720
Only seeing this long ticket now .. FWIW I think this is another good reason why it would be good if pandas had tighter control over the pandas<->arrow conversion (#59780)
### Rationale for this change

This is a long standing [pandas ticket](pandas-dev/pandas#53011) with some fairly horrible workarounds, where complex arrow types do not serialise well to pandas as the pandas metadata string is not parseable. However, `types_mapper` always had highest priority as it overrode what was set before.

### What changes are included in this PR?

By switching the logical ordering, we don't need to call `_pandas_api.pandas_dtype(dtype)` when using the pyarrow backend, thus resolving the issue of complex `dtype` with `list` or `struct`. It will likely still fail if the numpy backend is used, but at least this gives a working solution rather than an inability to load files at all.

### Are these changes tested?

Existing tests stay unchanged and a new test for the complex type has been added.

### Are there any user-facing changes?

**This PR contains a "Critical Fix".** This makes `pd.read_parquet(..., dtype_backend="pyarrow")` work with complex data types where the metadata added by pyarrow during `pd.to_parquet` is not serialisable and currently throws an exception. This issue currently prevents the use of pyarrow as the default backend for pandas.

* GitHub Issue: #39914

Lead-authored-by: bretttully <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Brett Tully <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
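A minimal example, reconstructed from the discussion above:

```python
import pandas as pd
import pyarrow as pa

# Saving an Arrow-backed list<string> column works; reading it back fails.
df = pd.DataFrame({"a": pd.Series([["a"], ["a", "b"]], dtype=pd.ArrowDtype(pa.list_(pa.string())))})
df.to_parquet("test.parquet")    # works
pd.read_parquet("test.parquet")  # TypeError: data type 'list<item: string>[pyarrow]' not understood
```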
Issue Description

Great work on extending Arrow to Pandas!

Using `pd.ArrowDtype(pa.list_(pa.string()))` or any other alteration works in the Parquet saving mode, but fails during the reading of the parquet file.

In fact, if there is a Pandas Series of pure lists of strings, for eg `["a"], ["a", "b"]`, Parquet saves it internally as a `list[string]` type. When Pandas reads the parquet file, it then converts it to an `object` type.

Is there a way during the reading step to either:

1. keep it as an `object` type, OR
2. use `pd.Series(pd.arrays.ArrowExtensionArray(x))`, which seems to actually work! Maybe during the conversion from the internal Pyarrow representation into Pandas, we can use `pd.Series(pd.arrays.ArrowExtensionArray(x))` on columns which had errors?

Expected Behavior
Installed Versions
INSTALLED VERSIONS
commit : 37ea63d
python : 3.11.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 30 Stepping 5, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_Australia.1252
pandas : 2.0.1
numpy : 1.24.3
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.7.2
pip : 23.1.2
Cython : 0.29.34
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.12.0
pandas_datareader: None
bs4 : 4.12.2
bottleneck : None
brotli :
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.1
numba : 0.57.0rc1
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.1
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
zstandard : 0.21.0
tzdata : 2023.3
qtpy : None
pyqt5 : None