BUG: read_parquet converts pyarrow list type to numpy dtype #53011

Open
danielhanchen opened this issue Apr 30, 2023 · 34 comments
Labels
Arrow (pyarrow functionality), Bug, IO Parquet (parquet, feather)

Comments

@danielhanchen

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import pyarrow as pa
pyarrow_list_of_strings = pd.ArrowDtype(pa.list_(pa.string()))
data = pd.DataFrame({
    "Pyarrow" : pd.Series([["a"], ["a", "b"]], dtype = pyarrow_list_of_strings),
})
data.to_parquet("data.parquet") # SUCCESS
pd.read_parquet("data.parquet") # *** FAIL

data_object = pd.DataFrame({
    "Pyarrow" : pd.Series([["a"], ["a", "b"]], dtype = object),
})
data_object.to_parquet("data.parquet")
pyarrow_internal = pa.parquet.read_table("data.parquet") # SUCCESS with type list[string] (requires `import pyarrow.parquet`)
pyarrow_internal.to_pandas() # SUCCESS, but the column is now object dtype

pd.Series(pd.arrays.ArrowExtensionArray(pyarrow_internal["Pyarrow"])) # SUCCESS - data-type also correct!

Issue Description

Great work on extending Arrow to Pandas!
Using pd.ArrowDtype(pa.list_(pa.string())) (or any other nested Arrow type) works when saving to Parquet, but fails when reading the Parquet file back.

In fact, if a Pandas Series holds plain Python lists of strings, e.g. ["a"], ["a", "b"], Parquet saves it internally as a list[string] type. When Pandas reads the Parquet file back, it converts the column to an object dtype.

Is there a way during the reading step to either:

  1. Convert the data-type, as in the plain-list case, to an object type, OR
  2. Use pd.Series(pd.arrays.ArrowExtensionArray(x)), which seems to actually work! Maybe during the conversion from the internal PyArrow representation into Pandas, we could apply it to the columns which had errors? OR
  3. Somehow support these new types?

Expected Behavior

import pandas as pd
import pyarrow as pa
pyarrow_list_of_strings = pd.ArrowDtype(pa.list_(pa.string()))
data = pd.DataFrame({
    "Pyarrow" : pd.Series([["a"], ["a", "b"]], dtype = pyarrow_list_of_strings),
})
data.to_parquet("data.parquet") # SUCCESS
pd.read_parquet("data.parquet") # SUCCESS

Installed Versions

INSTALLED VERSIONS

commit : 37ea63d
python : 3.11.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 30 Stepping 5, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_Australia.1252

pandas : 2.0.1
numpy : 1.24.3
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.7.2
pip : 23.1.2
Cython : 0.29.34
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.12.0
pandas_datareader: None
bs4 : 4.12.2
bottleneck : None
brotli :
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.1
numba : 0.57.0rc1
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.1
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
zstandard : 0.21.0
tzdata : 2023.3
qtpy : None
pyqt5 : None

@danielhanchen added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Apr 30, 2023
@jbrockmendel added the Arrow (pyarrow functionality) label Apr 30, 2023
@danielhanchen
Author

danielhanchen commented May 1, 2023

I found that during the PyArrow conversion, if you pass in a types_mapper and set ignore_metadata to True, it works!

mapping = {field.type : pd.ArrowDtype(field.type) for field in data.schema}  # here `data` is the pyarrow Table
data.to_pandas(types_mapper = mapping.get, ignore_metadata = True)
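
For reference, a minimal end-to-end sketch of this workaround (assuming the same data.parquet file as in the reproducer above; types_mapper=pd.ArrowDtype maps every column, not just the failing ones):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    "Pyarrow": pd.Series([["a"], ["a", "b"]], dtype=pd.ArrowDtype(pa.list_(pa.string()))),
})
df.to_parquet("data.parquet")

# Read with pyarrow directly, map every Arrow type to pd.ArrowDtype, and skip
# the stored pandas metadata that read_parquet currently trips over.
table = pq.read_table("data.parquet")
roundtripped = table.to_pandas(types_mapper=pd.ArrowDtype, ignore_metadata=True)
print(roundtripped.dtypes)  # Pyarrow    list<item: string>[pyarrow]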

@mroeschke
Member

From the traceback, it appears that pyarrow tries to convert this type to a numpy dtype by default, so I think an appropriate fix would be for pyarrow to just return an ArrowDtype here:

File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/pandas_compat.py:812, in table_to_blockmanager(options, table, categories, ignore_metadata, types_mapper)
    809     table = _add_any_metadata(table, pandas_metadata)
    810     table, index = _reconstruct_index(table, index_descriptors,
    811                                       all_columns)
--> 812     ext_columns_dtypes = _get_extension_dtypes(
    813         table, all_columns, types_mapper)
    814 else:
    815     index = _pandas_api.pd.RangeIndex(table.num_rows)

File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/pandas_compat.py:865, in _get_extension_dtypes(table, columns_metadata, types_mapper)
    860 dtype = col_meta['numpy_type']
    862 if dtype not in _pandas_supported_numpy_types:
    863     # pandas_dtype is expensive, so avoid doing this for types
    864     # that are certainly numpy dtypes
--> 865     pandas_dtype = _pandas_api.pandas_dtype(dtype)
    866     if isinstance(pandas_dtype, _pandas_api.extension_dtype):
    867         if hasattr(pandas_dtype, "__from_arrow__"):

File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/pandas-shim.pxi:136, in pyarrow.lib._PandasAPIShim.pandas_dtype()

File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/pandas-shim.pxi:139, in pyarrow.lib._PandasAPIShim.pandas_dtype()

File ~/.../pandas/core/dtypes/common.py:1626, in pandas_dtype(dtype)
   1621     with warnings.catch_warnings():
   1622         # GH#51523 - Series.astype(np.integer) doesn't show
   1623         # numpy deprecation warning of np.integer
   1624         # Hence enabling DeprecationWarning
   1625         warnings.simplefilter("always", DeprecationWarning)
-> 1626         npdtype = np.dtype(dtype)
   1627 except SyntaxError as err:
   1628     # np.dtype uses `eval` which can raise SyntaxError
   1629     raise TypeError(f"data type '{dtype}' not understood") from err

TypeError: data type 'list<item: string>[pyarrow]' not understood

@mroeschke added the IO Parquet (parquet, feather) and Upstream issue (Issue related to pandas dependency) labels and removed the Needs Triage label May 1, 2023
@mroeschke changed the title from "BUG: TypeError: data type 'list<item: string>[pyarrow]' not understood" to "BUG: read_parquet converts pyarrow list type to numpy dtype" May 1, 2023
@danielhanchen
Author

Hmm, so I looked at the Pandas code, and I'm not sure if using pd.ArrowDtype(dtype) will work.

The issue is data.schema.pandas_metadata['columns'][7]["numpy_type"] is a str and not an actual type object, and pd.ArrowDtype does not accept strings.

eg:

dt = data.schema.pandas_metadata['columns'][7]["numpy_type"]

returns:

'list<element: struct<rank: uint8, subtype: dictionary<values=string, indices=int32, ordered=0>, caption: string, credit: string, type: dictionary<values=string, indices=int32, ordered=0>, url: string, height: uint16, width: uint16, subType: dictionary<values=string, indices=int32, ordered=0>, crop_name: dictionary<values=string, indices=int32, ordered=0>>>[pyarrow]'

and using

pd.ArrowDtype(dt)

fails since it's a string.

I think the better approach would be to not just pass in data.schema.pandas_metadata['columns'][j]["numpy_type"] but also data.schema.types, since the latter holds the actual type objects, which can be converted into a pd.ArrowDtype.
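
A small sketch of this contrast (assuming the data.parquet file written in the reproducer above):

import pandas as pd
import pyarrow.parquet as pq

table = pq.read_table("data.parquet")

# The pandas metadata stores the dtype only as a display string ...
meta_col = table.schema.pandas_metadata["columns"][0]
print(type(meta_col["numpy_type"]), meta_col["numpy_type"])
# <class 'str'> list<item: string>[pyarrow]

# ... while schema.types holds real pyarrow DataType objects, which pd.ArrowDtype accepts.
print(pd.ArrowDtype(table.schema.types[0]))
# list<item: string>[pyarrow]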

@phofl
Member

phofl commented May 2, 2023

I think this behaves as expected. You can pass dtype_backend="pyarrow" to keep the list dtype

@danielhanchen
Author

@phofl

Oh oops, I forgot to mention that I tried pd.read_parquet(..., dtype_backend = "pyarrow"), and the TypeError still exists. The error is exactly the same, since the dtype string is still passed to np.dtype.

@danielhanchen
Author

Confirmed it still fails:

import pandas as pd
import pyarrow as pa
pyarrow_list_of_strings = pd.ArrowDtype(pa.list_(pa.string()))
data = pd.DataFrame({
    "Pyarrow" : pd.Series([["a"], ["a", "b"]], dtype = pyarrow_list_of_strings),
})
data.to_parquet("data.parquet") # SUCCESS
pd.read_parquet("data.parquet", dtype_backend = "pyarrow") # *** FAIL

@phofl
Member

phofl commented May 3, 2023

Interesting,

This one works:

data = pd.DataFrame({
    "Pyarrow" : pd.Series([["a"], ["a", "b"]]),
})
data.to_parquet("data.parquet")
pd.read_parquet("data.parquet", dtype_backend = "pyarrow")

@danielhanchen
Author

Yeah, that works since it's an object column - PyArrow indeed saves the data inside the parquet file as list[string].

The issue is that if you explicitly use the list[string] dtype, it does not work.

I.e.:

data = pd.DataFrame({
    "Pyarrow" : pd.Series([["a"], ["a", "b"]]),
})
data.dtypes

returns

Pyarrow    object
dtype: object

@danielhanchen
Author

In fact, the object column's schema is converted to the list type:

pa.parquet.read_table("data.parquet")

returns

pyarrow.Table
Pyarrow: list<item: string>
  child 0, item: string
----
Pyarrow: [[["a"],["a","b"]]]

@danielhanchen
Author

danielhanchen commented May 3, 2023

Maybe a try/except, so as not to break other parts of the Pandas repo?

https://github.com/apache/arrow/blob/a77aab07b02b7d0dd6bd9c9a11c4af067d26b674/python/pyarrow/pandas_compat.py#L855

    # infer the extension columns from the pandas metadata
    for col_meta, field in zip(columns_metadata, table.schema):    # changed: also iterate over the schema fields
        try:
            name = col_meta['field_name']
        except KeyError:
            name = col_meta['name']
        dtype = col_meta['numpy_type']

        if dtype not in _pandas_supported_numpy_types:
            # pandas_dtype is expensive, so avoid doing this for types
            # that are certainly numpy dtypes
            try:                                                   # changed
                pandas_dtype = _pandas_api.pandas_dtype(dtype)
            except TypeError:                                      # changed: fall back to the actual Arrow type
                pandas_dtype = pd.ArrowDtype(field.type)

            if isinstance(pandas_dtype, _pandas_api.extension_dtype):
                if hasattr(pandas_dtype, "__from_arrow__"):
                    ext_columns[name] = pandas_dtype

@takacsd

takacsd commented May 10, 2023

Ran into the same issue:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'a': pd.Series([['a'], ['a', 'b']], dtype=pd.ArrowDtype(pa.list_(pa.string())))})

df.to_parquet('test.parquet')  # SUCCESS
pd.read_parquet('test.parquet')  # *** FAIL

df.to_parquet('test.parquet')  # SUCCESS
pq.read_table('test.parquet').to_pandas(ignore_metadata=True, types_mapper=pd.ArrowDtype)  # SUCCESS

df.to_parquet('test.parquet', store_schema=False)  # SUCCESS
pd.read_parquet('test.parquet')  # SUCCESS

I think the last case was not mentioned so far.

@danielhanchen
Author

@takacsd oh interesting - so it's possible it's the schema-storing component that's wrong?

@takacsd

takacsd commented May 11, 2023

@danielhanchen I think the problem is in the pandas-specific metadata. If the parquet file was created by something else (e.g. AWS Athena), pandas can read it just fine.

pq.write_table(pa.table({'a': pa.array([['a'], ['a', 'b']], type=pa.list_(pa.string()))}), 'test.parquet')  # SUCCESS
pd.read_parquet('test.parquet')  # SUCCESS

pq.write_table(pa.Table.from_pandas(df), 'test.parquet')  # SUCCESS
pd.read_parquet('test.parquet')  # *** FAIL

pq.write_table(pa.Table.from_pandas(df).replace_schema_metadata(), 'test.parquet') # SUCCESS
pd.read_parquet('test.parquet') # SUCCESS

This is the pandas metadata btw:

{'column_indexes': [{'field_name': None,
                     'metadata': {'encoding': 'UTF-8'},
                     'name': None,
                     'numpy_type': 'object',
                     'pandas_type': 'unicode'}],
 'columns': [{'field_name': 'a',
              'metadata': None,
              'name': 'a',
              'numpy_type': 'list<item: string>[pyarrow]',   # <---- this causes the error
              'pandas_type': 'list[unicode]'}],
 'creator': {'library': 'pyarrow', 'version': '11.0.0'},
 'index_columns': [{'kind': 'range',
                    'name': None,
                    'start': 0,
                    'step': 1,
                    'stop': 2}],
 'pandas_version': '2.0.1'}

In the case of a simple 'numpy_type': 'int64[pyarrow]' type everything works, so I suspect the _pandas_api.pandas_dtype(dtype) doesn't support complex types (yet).
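
For reference, that metadata blob can be dumped straight from the file (a sketch, assuming the test.parquet written above):

import json
import pyarrow.parquet as pq

# The pandas metadata lives under the b"pandas" key of the stored Arrow schema.
pandas_meta = json.loads(pq.read_schema("test.parquet").metadata[b"pandas"])
print(json.dumps(pandas_meta, indent=1))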

@danielhanchen
Author

danielhanchen commented May 12, 2023

@takacsd oh yep, your reasoning sounds right - so maybe adding a simple try/except is the simplest fix? Try the numpy conversion first, then if it fails, call pd.ArrowDtype.

@danielhanchen
Author

danielhanchen commented May 12, 2023

The main issue, I think, is that dtype is a string.
I'm not 100% sure how _pandas_api.pandas_dtype works, but presumably it's a large dict mapping types in string form to the correct type. Due to the infinite nature of possible Arrow datatypes, I guess it's not feasible to keep such a dictionary updated, so maybe the try/except is the only reasonable solution?

Just my two cents.

@takacsd

takacsd commented May 13, 2023

The main issue, I think, is that dtype is a string. I'm not 100% sure how _pandas_api.pandas_dtype works, but presumably it's a large dict mapping types in string form to the correct type.

It seems a little more complicated than that:
pandas_dtype looks up a registry.
This iterates through the registered ExtensionDtypes and tries to make sense of the string.
The ExtensionDtype should understand it but doesn't, because pa.type_for_alias(base_type) only understands basic types.

We already have a special case for temporal types, so I suppose we just need something similar for arrays and maps...
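
The lookup failure is easy to reproduce through the public wrapper around that registry (a minimal sketch):

import pandas as pd

# A simple pyarrow type string resolves through the ExtensionDtype registry:
print(pd.api.types.pandas_dtype("int64[pyarrow]"))  # int64[pyarrow]

# A nested type string is rejected by ArrowDtype.construct_from_string, so the
# lookup falls through to np.dtype(...) and raises the TypeError from the traceback above:
pd.api.types.pandas_dtype("list<item: string>[pyarrow]")
# TypeError: data type 'list<item: string>[pyarrow]' not understood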

@danielhanchen
Author

@takacsd The issue is, though, that timestamps are reasonably easy to construct from text.

The below could all be possible though:

list[list[struct[int, float]]]
list[int]
struct[list[datetime]]

Constructing Arrow dtypes from that could be potentially problematic.

I guess in theory one can iterate through the string and build up a string which you can then call eval on, i.e.:

list[struct[int32, string]] becomes pa.list_(pa.struct((pa.int32(), pa.string()))), and then you can eval it.

I think a wiser approach would be to use the Arrow dtype from data.schema.types and then call pd.ArrowDtype on it.

@takacsd

takacsd commented May 13, 2023

@danielhanchen your approach only works here, and it just ignores the metadata. I'm not a pandas developer but I suppose they generated that metadata for a reason, so it may break some things if we just ignore it.

Properly parsing the string is obviously harder, but I still think it is the better solution...

@danielhanchen
Author

@takacsd agreed parsing the metadata string is the correct way.

I thought about how one would go about doing it. Eg take:
list<element: struct<rank: uint8, subtype: dictionary<values=string, indices=int32, ordered=0>, caption: string, credit: string, type: dictionary<values=string, indices=int32, ordered=0>, url: string, height: uint16, width: uint16, subType: dictionary<values=string, indices=int32, ordered=0>, crop_name: dictionary<values=string, indices=int32, ordered=0>>>[pyarrow]

You'll have to first find the type with the first enclosed >, and continuously parse outwards. I.e. if one makes a string converter, it'll have to find the inner-most enclosed data-type, then expand out, and wrap the whole thing in a while loop.

The while loop looks something like this:

# One iteration of the while loop: grab the inner-most "<...>" group (plus the
# type name in front of it) and convert it.
left_pointer  = 0
bracket_end   = dt.find(">") + 1
bracket_start = dt.rfind("<", left_pointer, bracket_end)
bracket_start = dt.rfind(" ", left_pointer, bracket_start) + 1
partial_dt = dt[bracket_start : bracket_end]
partial_dt = _partial_convert_dt(partial_dt)

and

import re

def _partial_convert_dt(partial_dt):
    # Convert one inner-most "<...>" group into an eval-able pyarrow constructor string.
    if partial_dt.startswith("dictionary"):
        value_type = re.findall(r'values=([^,]{1,})',  partial_dt)[0]
        index_type = re.findall(r'indices=([^,]{1,})', partial_dt)[0]
        ordered    = re.findall(r'ordered=([\d])',     partial_dt)[0]
        if not value_type.startswith("pa."): value_type = f"pa.{value_type}()"
        if not index_type.startswith("pa."): index_type = f"pa.{index_type}()"
        partial_dt = f"pa.dictionary(value_type = {value_type}, index_type = {index_type}, ordered = {ordered})"
    elif partial_dt.startswith("list"):
        value_type = partial_dt[partial_dt.find(" ")+1 : -1]
        if not value_type.startswith("pa."): value_type = f"pa.{value_type}()"
        partial_dt = f"pa.list_({value_type})"
    elif partial_dt.startswith("struct"):
        struct_part = partial_dt[len("struct<"):-1]
        all_structs = struct_part.split(", ")
        converted = []
        for struct in all_structs:
            name, type = struct.split(": ")
            if not type.startswith("pa."): type = f"pa.{type}()"
            converted.append(f"('{name}', {type})")
        partial_dt = f"pa.struct(({', '.join(converted)}))"
    return partial_dt

The code just gets too cumbersome sadly - the above only supports struct, dictionary and list types.

The main issue is the infinite nesting of Arrow dtypes which overcomplicates the conversion process in my view.

@danielhanchen
Author

Actually, a simpler solution is to directly call .replace on the string and replace list<element: with pa.list_(( etc.

However, this doesn't work with struct data-types, since struct also keeps note of each field name.

This means a struct field could have dictionary as its name, which means using eval will fail.

This probably means string parsing won't work for structs, but works for everything else. I still believe a try/except is the simplest solution. Python 3.11 now has zero-cost exceptions, so the try costs nothing unless the conversion actually fails and reaches the except branch, which is slower. The alternative is a refactoring that passes the actual Arrow data-type in, and if the data-type string is not found in the registry, we fall back to that Arrow data-type.

@takacsd

takacsd commented May 15, 2023

Yeah, after some experimenting, I think we need to give up on parsing the type string:

These two:

pd.Series([{'a': 1, 'b': 1}], dtype=pd.ArrowDtype(pa.struct({'a': pa.int64(), 'b': pa.int64()})))
pd.Series([{'a: int64, b': 1}], dtype=pd.ArrowDtype(pa.struct({'a: int64, b': pa.int64()})))

both have the following type string: struct<a: int64, b: int64>[pyarrow].

But even if we disallow such cases, it is just too hard: I tried to write a recursive parser with some regexps, but I gave up. We need a balancing matcher or a recursive pattern to match the nested <> pairs properly, but neither is supported by the built-in re module. And I don't feel like we should write a full-blown recursive descent parser for this one.

The fundamental problem is we try to parse a string which was not meant to be easily parsable. The metadata should save the nested data types in a way that is easy to work with...

@takacsd

takacsd commented May 15, 2023

I was bored:

import re
from typing import NamedTuple

import pyarrow as pa


class ParseFail(Exception):
    pass

class Parsed(NamedTuple):
    type: pa.DataType
    end: int


class TypeStringParser:

    BASIC_TYPE_MATCHER = re.compile(r'\w+(\[[^\]]+\])?')
    TIMESTAMP_MATCHER = re.compile(r'timestamp\[([^,]+), tz=([^\]]+)\]')
    NAME_MATCHER = re.compile(r'\w+')  # this can be r'[^:]' to support weird names in struct

    def __init__(self, type_str: str) -> None:
        self.type_str = type_str

    def parse(self) -> pa.DataType:
        try:
            parsed = self.type(0)
        except ParseFail:
            raise ValueError(f"Can't parse '{self.type_str}' as a type.")

        if parsed.end != len(self.type_str):
            raise ValueError(f"Can't parse '{self.type_str}' as a type.")

        return parsed.type

    def type(self, pos: int) -> Parsed:
        try:
            return self.basic_type(pos)
        except ParseFail:
            pass

        try:
            return self.timestamp(pos)
        except ParseFail:
            pass

        try:
            return self.list(pos)
        except ParseFail:
            pass

        try:
            return self.dictionary(pos)
        except ParseFail:
            pass

        try:
            return self.struct(pos)
        except ParseFail:
            pass

        raise ParseFail()

    def basic_type(self, pos: int) -> pa.DataType:
        match = self.BASIC_TYPE_MATCHER.match(self.type_str, pos)
        if match is None:
            raise ParseFail()
        try:
            return Parsed(pa.type_for_alias(match.group(0)), match.end(0))
        except ValueError:
            pass
        raise ParseFail()

    def timestamp(self, pos: int) -> pa.DataType:
        match = self.TIMESTAMP_MATCHER.match(self.type_str, pos)
        if match is None:
            raise ParseFail()
        try:
            return Parsed(pa.timestamp(match.group(1).strip(), tz=match.group(2).strip()), match.end(0))
        except ValueError:
            pass
        raise ParseFail()

    def list(self, pos: int) -> pa.DataType:
        pos = self.accept('list<', pos)
        match = self.NAME_MATCHER.match(self.type_str, pos)
        if match is None:
            raise ParseFail()
        pos = self.accept(': ', match.end(0))
        item = self.type(pos)
        pos = self.accept('>', item.end)
        return Parsed(pa.list_(item.type), pos)

    def dictionary(self, pos: int) -> pa.DataType:
        pos = self.accept('dictionary<values=', pos)
        values = self.type(pos)
        pos = self.accept(', indices=', values.end)
        indices = self.type(pos)
        pos = self.accept(', ordered=', indices.end)
        try:
            pos = self.accept('0', pos)
            ordered = False
        except ParseFail:
            pos = self.accept('1', pos)
            ordered = True
        pos = self.accept('>', pos)
        return Parsed(pa.dictionary(indices.type, values.type, ordered), pos)

    def struct(self, pos: int) -> pa.DataType:
        pos = self.accept('struct<', pos)
        elements = []
        while self.type_str[pos] != '>':
            match = self.NAME_MATCHER.match(self.type_str, pos)
            if match is None:
                raise ParseFail()
            element_name = match.group(0)
            pos = self.accept(': ', match.end(0))
            element_type = self.type(pos)
            pos = element_type.end
            if self.type_str[pos] != '>':
                pos = self.accept(', ', pos)
            elements.append((element_name, element_type.type))
        pos = self.accept('>', pos)
        return Parsed(pa.struct(elements), pos)

    def accept(self, term: str, pos: int) -> int:
        if self.type_str.startswith(term, pos):
            return pos + len(term)
        raise ParseFail()

Probably not the prettiest recursive descent parser in existence, but it does parse arbitrary nested types.
The only restriction that I know of is that the names in the structs need to be alphanumeric.
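
Usage would look roughly like this (a sketch; the [pyarrow] suffix from the metadata string has to be stripped first, since the parser expects the bare Arrow type string):

import pandas as pd

type_str = "list<item: string>[pyarrow]"  # the columns[...]["numpy_type"] value from the metadata
arrow_type = TypeStringParser(type_str.removesuffix("[pyarrow]")).parse()
print(arrow_type)                 # list<item: string>
print(pd.ArrowDtype(arrow_type))  # list<item: string>[pyarrow]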

@danielhanchen
Author

@takacsd Nice work on the parser! :) Yeah, struct is the biggest issue since it can contain arbitrary field names. It gets worse if something like struct<struct: uint8> exists - that'll be painful.

Also, I just noticed that https://github.com/apache/arrow/blob/8be70c137289adba92871555ce74055719172f56/python/pyarrow/pandas_compat.py#L870 actually does handle Arrow extension types! The issue is that the code before it breaks, so it never gets there.

    for field in table.schema:
        typ = field.type
        if isinstance(typ, pa.BaseExtensionType):
            try:
                pandas_dtype = typ.to_pandas_dtype()
            except NotImplementedError:
                pass
            else:
                ext_columns[field.name] = pandas_dtype

The issue is https://github.com/apache/arrow/blob/8be70c137289adba92871555ce74055719172f56/python/pyarrow/pandas_compat.py#LL854C1-L868C53:

    # infer the extension columns from the pandas metadata
    for col_meta in columns_metadata:
        try:
            name = col_meta['field_name']
        except KeyError:
            name = col_meta['name']
        dtype = col_meta['numpy_type']

        if dtype not in _pandas_supported_numpy_types:
            # pandas_dtype is expensive, so avoid doing this for types
            # that are certainly numpy dtypes
            pandas_dtype = _pandas_api.pandas_dtype(dtype)       # >>>>>>>> BREAKS (A)
            if isinstance(pandas_dtype, _pandas_api.extension_dtype):
                if hasattr(pandas_dtype, "__from_arrow__"):
                    ext_columns[name] = pandas_dtype

I think I might have fixed it WITHOUT using try/except:

    # infer the extension columns from the pandas metadata
    schema = table.schema                                        # <<<< changed
    for col_meta, field in zip(columns_metadata, schema):        # <<<< changed
        try:
            name = col_meta['field_name']
        except KeyError:
            name = col_meta['name']
        dtype = col_meta['numpy_type']

        if dtype not in _pandas_supported_numpy_types:
            # pandas_dtype is expensive, so avoid doing this for types
            # that are certainly numpy dtypes
            if dtype.endswith("[pyarrow]"):                      # <<<< (1)
                pandas_dtype = pd.ArrowDtype(field.type)
            elif dtype == "string":                              # <<<< (2)
                pandas_dtype = pd.ArrowDtype(pa.string())
            else:                                                # <<<< (3)
                pandas_dtype = _pandas_api.pandas_dtype(dtype)
            if isinstance(pandas_dtype, _pandas_api.extension_dtype):
                if hasattr(pandas_dtype, "__from_arrow__"):
                    ext_columns[name] = pandas_dtype

We push the original call (A) down to line (3); then, if a dtype string ends with [pyarrow], we use pd.ArrowDtype with the actual Arrow type from the schema. For string columns, we bypass the pandas parser and handle them directly on the fast path.

This also means

    for field in table.schema:
        typ = field.type
        if isinstance(typ, pa.BaseExtensionType):
            try:
                pandas_dtype = typ.to_pandas_dtype()
            except NotImplementedError:
                pass
            else:
                ext_columns[field.name] = pandas_dtype

can be deleted - it's redundant, since we folded that logic into the loop above.

@bretttully

I just hit this today trying to read a parquet file made by someone else, where they had used the pyarrow backend.

Here is another minimal example to add to the mix that fails on reading df2.

import io

import numpy as np
import pandas as pd
import pyarrow as pa


def main():
    df0 = pd.DataFrame(
        [
            {"foo": {"bar": True, "baz": np.float32(1)}},
            {"foo": {"bar": True, "baz": None}},
        ],
    )
    schema = pa.schema(
        [
            pa.field(
                "foo",
                pa.struct(
                    [
                        pa.field("bar", pa.bool_(), nullable=False),
                        pa.field("baz", pa.float32(), nullable=True),
                    ],
                ),
            ),
        ],
    )
    print(schema)
    with io.BytesIO() as stream0, io.BytesIO() as stream1:
        kwargs = {
            "engine": "pyarrow",
            "compression": "zstd",
            "schema": schema,
            "row_group_size": 2_000,
        }
        print("Writing df0")
        df0.to_parquet(stream0, **kwargs)

        print("Reading df1")
        stream0.seek(0)
        df1 = pd.read_parquet(stream0, engine="pyarrow", dtype_backend="pyarrow")

        print("Writing df1")
        df1.to_parquet(stream1, **kwargs)

        print("Reading df2")
        stream1.seek(0)
        df2 = pd.read_parquet(stream1, engine="pyarrow", dtype_backend="pyarrow")


if __name__ == "__main__":
    main()

Using df2 = pq.read_table(stream1).to_pandas(ignore_metadata=True) works for all of the reasons mentioned in the thread.

@giftculture

I'm running into this issue as well:

[Screenshot from 2024-02-07: pd.read_parquet(..., dtype_backend="pyarrow") raising the same TypeError]

@giftculture

INSTALLED VERSIONS

commit : f538741
python : 3.11.7.final.0
python-bits : 64
OS : Linux
OS-release : 4.13.9-300.fc27.x86_64
Version : #1 SMP Mon Oct 23 13:41:58 UTC 2017
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.0
numpy : 1.26.3
pytz : 2023.4
dateutil : 2.8.2
setuptools : 69.0.3
pip : 23.3.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.3
IPython : 8.20.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat : None
fastparquet : 2023.10.1
fsspec : 2023.12.2
gcsfs : None
matplotlib : 3.8.2
numba : 0.58.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 15.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.12.0
sqlalchemy : 2.0.25
tables : None
tabulate : None
xarray : 2024.1.1
xlrd : None
zstandard : None
tzdata : 2023.4
qtpy : 2.4.1
pyqt5 : None

@jborman-stonex

The only "workaround" at the pandas-level I've found is to set df.to_parquet(..., store_schema=False) for a df containing complex/nested types like a categorical. Are there any plans to get successful dtype roundtripping going forward? Is there anything in the pyarrow library we can leverage here?

Versions:

INSTALLED VERSIONS
------------------
commit                : bdc79c146c2e32f2cab629be240f01658cfb6cc2
python                : 3.11.6.final.0
python-bits           : 64
OS                    : Windows
OS-release            : 10
Version               : 10.0.19045
machine               : AMD64
processor             : Intel64 Family 6 Model 186 Stepping 2, GenuineIntel
byteorder             : little
LC_ALL                : None
LANG                  : None
LOCALE                : English_United States.1252

pandas                : 2.2.1
numpy                 : 1.26.4
pytz                  : 2024.1
dateutil              : 2.9.0.post0
setuptools            : 65.5.0
pip                   : 23.2.1
Cython                : None
pytest                : 8.1.1
hypothesis            : None
sphinx                : None
blosc                 : None
feather               : None
xlsxwriter            : None
lxml.etree            : None
html5lib              : None
pymysql               : None
psycopg2              : 2.9.9
jinja2                : None
IPython               : 8.23.0
pandas_datareader     : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : None
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : None
gcsfs                 : None
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
pandas_gbq            : None
pyarrow               : 15.0.2
pyreadstat            : None
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : None
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
zstandard             : None
tzdata                : 2024.1
qtpy                  : None
pyqt5                 : None

@phofl
Member

phofl commented Apr 4, 2024

I would recommend using dtype_backend="pyarrow"

@giftculture

I would recommend using dtype_backend="pyarrow"

@phofl, not sure if you saw my screenshot, but I did apply dtype_backend="pyarrow" to the read_parquet call and it still fails, unless I am misunderstanding your suggestion.

@bretttully

bretttully commented Apr 4, 2024

The only workaround I have found so far is the following. It works in every case I have thought of, except round-tripping an empty dataframe that has a struct or list dtype while setting the schema and not using dtype_backend="pyarrow" when reading back in.

Would definitely welcome suggested improvements to this workaround!

Obviously you can write this differently if you don't want a byte string returned, but for us that's what we want.

    # NOTE: assumes `import io`, `import pandas as pd`, `import pyarrow as pa`, `import pyarrow.parquet as pq`
    def serialize(self, data: pd.DataFrame, **kwargs) -> bytes:
        """see BytesWriter.serialize -- Dump pandas dataframe to parquet bytes"""
        with io.BytesIO() as stream:

            schema = kwargs.pop("schema", None)
            all_arrow_types = all(isinstance(t, pd.ArrowDtype) for t in data.dtypes.tolist())
            # An empty dataframe may use default dtypes that are incompatible with the schema.
            # In this case, first cast to object, as the schema can always convert that to the correct type.
            if len(data) == 0 and schema is not None and not all_arrow_types:
                data = data.astype("object").astype({n: pd.ArrowDtype(schema.field(n).type) for n in schema.names})
            table = pa.Table.from_pandas(data, schema=schema)

            # drop pandas from the schema metadata to work around the bug where you can't read struct columns with
            # pandas metadata
            # see https://github.com/pandas-dev/pandas/issues/53011
            metadata = table.schema.metadata
            if b"pandas" in metadata and b"list" in metadata[b"pandas"] or b"struct" in metadata[b"pandas"]:
                del metadata[b"pandas"]
                table = table.replace_schema_metadata(metadata)

            pq.write_table(table, stream, **kwargs)
            return stream.getvalue()

@kinianlo

kinianlo commented May 21, 2024

This bug is very similar to #57411 where the data type is list[int] instead of list[str].

@judahrand

judahrand commented Sep 3, 2024

Actually, a simpler solution is to directly call .replace on the string and replace list<element: with pa.list_(( etc.

Additionally, list<element: can also be list<item: depending on how the Parquet was written out (see the use_compliant_nested_type argument to pyarrow.parquet.ParquetWriter). And within Arrow itself the contained field name can be arbitrary (i.e. pyarrow.list_(pyarrow.field('some_name', pyarrow.int64())) is a valid type). So the replace would have to be a regex, something like r'list<(.+):' with the capturing group grabbing the name of the field.
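
For example (a toy sketch of the point that the field name inside list<...> is arbitrary):

import re

pattern = re.compile(r"list<(.+):")
for type_str in ("list<item: string>", "list<element: string>", "list<some_name: int64>"):
    print(pattern.match(type_str).group(1))  # item / element / some_name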

bretttully added a commit to bretttully/arrow that referenced this issue Nov 13, 2024
Addresses pandas-dev/pandas#53011

`types_mapper` always had highest priority as it overrode what was set before. However, by switching the logical ordering, we no longer need to call `_pandas_api.pandas_dtype(dtype)` when using the pyarrow backend, resolving the issue of complex `dtype` with `list` or `struct`.
@bretttully

Matching arrow ticket: apache/arrow#39914 and potential PR: apache/arrow#44720

@jorisvandenbossche
Member

Only seeing this long ticket now... FWIW, I think this is another good reason why it would be good for pandas to have tighter control over the pandas<->arrow conversion (#59780).

@jorisvandenbossche removed the Upstream issue (Issue related to pandas dependency) label Nov 14, 2024
jorisvandenbossche added a commit to apache/arrow that referenced this issue Nov 27, 2024
### Rationale for this change

This is a long-standing [pandas ticket](pandas-dev/pandas#53011) with some fairly horrible workarounds, where complex arrow types do not serialise well to pandas because the pandas metadata string is not parseable. However, `types_mapper` always had highest priority as it overrode what was set before.

### What changes are included in this PR?

By switching the logical ordering, we no longer need to call `_pandas_api.pandas_dtype(dtype)` when using the pyarrow backend, thus resolving the issue of complex `dtype` with `list` or `struct`. It will likely still fail if the numpy backend is used, but at least this gives a working solution rather than an inability to load files at all.

### Are these changes tested?

Existing tests should stay unchanged and a new test for the complex type has been added

### Are there any user-facing changes?

**This PR contains a "Critical Fix".**
This makes `pd.read_parquet(..., dtype_backend="pyarrow")` work with complex data types where the metadata added by pyarrow during `pd.to_parquet` is not serialisable and currently throwing an exception. This issue currently prevents the use of pyarrow as the default backend for pandas.
* GitHub Issue: #39914

Lead-authored-by: bretttully <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Brett Tully <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>