BUG: read_parquet converts pyarrow list type to numpy dtype #53011

Open
danielhanchen opened this issue Apr 30, 2023 · 34 comments
Labels
Arrow (pyarrow functionality), Bug, IO Parquet (parquet, feather)

Comments

@danielhanchen

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import pyarrow as pa
pyarrow_list_of_strings = pd.ArrowDtype(pa.list_(pa.string()))
data = pd.DataFrame({
    "Pyarrow" : pd.Series([["a"], ["a", "b"]], dtype = pyarrow_list_of_strings),
})
data.to_parquet("data.parquet") # SUCCESS
pd.read_parquet("data.parquet") # *** FAIL

data_object = pd.DataFrame({
    "Pyarrow" : pd.Series([["a"], ["a", "b"]], dtype = object),
})
data_object.to_parquet("data.parquet")
pyarrow_internal = pa.parquet.read_table("data.parquet") # SUCCESS with type list[string] (requires `import pyarrow.parquet`)
pyarrow_internal.to_pandas() # SUCCESS, but the column is now object dtype

pd.Series(pd.arrays.ArrowExtensionArray(pyarrow_internal["Pyarrow"])) # SUCCESS - data-type also correct!

Issue Description

Great work on extending Arrow to Pandas!
Using pd.ArrowDtype(pa.list_(pa.string())) (or any other nested Arrow type) works when saving to Parquet, but fails when reading the Parquet file back.

In fact, if a Pandas Series holds plain Python lists of strings, e.g. ["a"], ["a", "b"], Parquet saves it internally as a list[string] type. When Pandas reads the Parquet file back, it converts the column to an object dtype.

Is there a way during the reading step to either:

  1. Convert the data-type, as in the plain-list case, to an object type, OR
  2. Use pd.Series(pd.arrays.ArrowExtensionArray(x)), which seems to actually work! Maybe during the conversion from the internal PyArrow representation into Pandas, we could apply it to the columns which had errors? OR
  3. Somehow support these new types?

Expected Behavior

import pandas as pd
import pyarrow as pa
pyarrow_list_of_strings = pd.ArrowDtype(pa.list_(pa.string()))
data = pd.DataFrame({
    "Pyarrow" : pd.Series([["a"], ["a", "b"]], dtype = pyarrow_list_of_strings),
})
data.to_parquet("data.parquet") # SUCCESS
pd.read_parquet("data.parquet") # SUCCESS

Installed Versions

INSTALLED VERSIONS

commit : 37ea63d
python : 3.11.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 30 Stepping 5, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_Australia.1252

pandas : 2.0.1
numpy : 1.24.3
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.7.2
pip : 23.1.2
Cython : 0.29.34
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.12.0
pandas_datareader: None
bs4 : 4.12.2
bottleneck : None
brotli :
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.1
numba : 0.57.0rc1
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.1
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
zstandard : 0.21.0
tzdata : 2023.3
qtpy : None
pyqt5 : None

@danielhanchen added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Apr 30, 2023
@jbrockmendel added the Arrow (pyarrow functionality) label Apr 30, 2023
@danielhanchen
Author

danielhanchen commented May 1, 2023

I found that during the PyArrow conversion, if you pass in a types_mapper and set ignore_metadata to True, it works!

mapping = {field.type : pd.ArrowDtype(field.type) for field in data.schema}  # here `data` is the pyarrow Table
data.to_pandas(types_mapper = mapping.get, ignore_metadata = True)
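
For reference, a minimal end-to-end sketch of this workaround (assuming the same data.parquet file as in the reproducer above; types_mapper=pd.ArrowDtype maps every column, not just the failing ones):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    "Pyarrow": pd.Series([["a"], ["a", "b"]], dtype=pd.ArrowDtype(pa.list_(pa.string()))),
})
df.to_parquet("data.parquet")

# Read with pyarrow directly, map every Arrow type to pd.ArrowDtype, and skip
# the stored pandas metadata that read_parquet currently trips over.
table = pq.read_table("data.parquet")
roundtripped = table.to_pandas(types_mapper=pd.ArrowDtype, ignore_metadata=True)
print(roundtripped.dtypes)  # Pyarrow    list<item: string>[pyarrow]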

@mroeschke
Member

From the traceback, it appears that pyarrow tries to convert this type to a numpy dtype by default, so I think an appropriate fix would be for pyarrow to just return an ArrowDtype here:

File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/pandas_compat.py:812, in table_to_blockmanager(options, table, categories, ignore_metadata, types_mapper)
    809     table = _add_any_metadata(table, pandas_metadata)
    810     table, index = _reconstruct_index(table, index_descriptors,
    811                                       all_columns)
--> 812     ext_columns_dtypes = _get_extension_dtypes(
    813         table, all_columns, types_mapper)
    814 else:
    815     index = _pandas_api.pd.RangeIndex(table.num_rows)

File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/pandas_compat.py:865, in _get_extension_dtypes(table, columns_metadata, types_mapper)
    860 dtype = col_meta['numpy_type']
    862 if dtype not in _pandas_supported_numpy_types:
    863     # pandas_dtype is expensive, so avoid doing this for types
    864     # that are certainly numpy dtypes
--> 865     pandas_dtype = _pandas_api.pandas_dtype(dtype)
    866     if isinstance(pandas_dtype, _pandas_api.extension_dtype):
    867         if hasattr(pandas_dtype, "__from_arrow__"):

File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/pandas-shim.pxi:136, in pyarrow.lib._PandasAPIShim.pandas_dtype()

File /opt/miniconda3/envs/pandas-dev/lib/python3.10/site-packages/pyarrow/pandas-shim.pxi:139, in pyarrow.lib._PandasAPIShim.pandas_dtype()

File ~/.../pandas/core/dtypes/common.py:1626, in pandas_dtype(dtype)
   1621     with warnings.catch_warnings():
   1622         # GH#51523 - Series.astype(np.integer) doesn't show
   1623         # numpy deprecation warning of np.integer
   1624         # Hence enabling DeprecationWarning
   1625         warnings.simplefilter("always", DeprecationWarning)
-> 1626         npdtype = np.dtype(dtype)
   1627 except SyntaxError as err:
   1628     # np.dtype uses `eval` which can raise SyntaxError
   1629     raise TypeError(f"data type '{dtype}' not understood") from err

TypeError: data type 'list<item: string>[pyarrow]' not understood

@mroeschke added the IO Parquet (parquet, feather) and Upstream issue (Issue related to pandas dependency) labels and removed the Needs Triage label May 1, 2023
@mroeschke changed the title from "BUG: TypeError: data type 'list<item: string>[pyarrow]' not understood" to "BUG: read_parquet converts pyarrow list type to numpy dtype" May 1, 2023
@danielhanchen
Author

Hmm, so I looked at the Pandas code, and I'm not sure if using pd.ArrowDtype(dtype) will work.

The issue is data.schema.pandas_metadata['columns'][7]["numpy_type"] is a str and not an actual type object, and pd.ArrowDtype does not accept strings.

eg:

dt = data.schema.pandas_metadata['columns'][7]["numpy_type"]

returns:

'list<element: struct<rank: uint8, subtype: dictionary<values=string, indices=int32, ordered=0>, caption: string, credit: string, type: dictionary<values=string, indices=int32, ordered=0>, url: string, height: uint16, width: uint16, subType: dictionary<values=string, indices=int32, ordered=0>, crop_name: dictionary<values=string, indices=int32, ordered=0>>>[pyarrow]'

and using

pd.ArrowDtype(dt)

fails since it's a string.

I think the better approach would be to not just pass in data.schema.pandas_metadata['columns'][j]["numpy_type"] but also data.schema.types, since the latter holds the actual type objects, which can be converted into a pd.ArrowDtype.
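
A small sketch of this contrast (assuming the data.parquet file written in the reproducer above):

import pandas as pd
import pyarrow.parquet as pq

table = pq.read_table("data.parquet")

# The pandas metadata stores the dtype only as a display string ...
meta_col = table.schema.pandas_metadata["columns"][0]
print(type(meta_col["numpy_type"]), meta_col["numpy_type"])
# <class 'str'> list<item: string>[pyarrow]

# ... while schema.types holds real pyarrow DataType objects, which pd.ArrowDtype accepts.
print(pd.ArrowDtype(table.schema.types[0]))
# list<item: string>[pyarrow]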

@phofl
Member

phofl commented May 2, 2023

I think this behaves as expected. You can pass dtype_backend="pyarrow" to keep the list dtype

@danielhanchen
Author

@phofl

Oh oops, I forgot to mention that I tried pd.read_parquet(..., dtype_backend = "pyarrow"), and the TypeError still exists. The error is exactly the same, since the dtype string is still passed to np.dtype.

@danielhanchen
Author

Confirmed it still fails:

import pandas as pd
import pyarrow as pa
pyarrow_list_of_strings = pd.ArrowDtype(pa.list_(pa.string()))
data = pd.DataFrame({
    "Pyarrow" : pd.Series([["a"], ["a", "b"]], dtype = pyarrow_list_of_strings),
})
data.to_parquet("data.parquet") # SUCCESS
pd.read_parquet("data.parquet", dtype_backend = "pyarrow") # *** FAIL

@phofl
Member

phofl commented May 3, 2023

Interesting,

This one works:

data = pd.DataFrame({
    "Pyarrow" : pd.Series([["a"], ["a", "b"]]),
})
data.to_parquet("data.parquet")
pd.read_parquet("data.parquet", dtype_backend = "pyarrow")

@danielhanchen
Author

Yeah, that works since it's an object column - PyArrow indeed saves the data inside the parquet file as list[string].

The issue is that if you explicitly use the list[string] dtype, it does not work.

I.e.:

data = pd.DataFrame({
    "Pyarrow" : pd.Series([["a"], ["a", "b"]]),
})
data.dtypes

returns

Pyarrow    object
dtype: object

@danielhanchen
Author

In fact, the object column's schema is converted to the list type:

pa.parquet.read_table("data.parquet")

returns

pyarrow.Table
Pyarrow: list<item: string>
  child 0, item: string
----
Pyarrow: [[["a"],["a","b"]]]

@danielhanchen
Author

danielhanchen commented May 3, 2023

Maybe a try/except, so as not to break other parts of the Pandas repo?

https://github.com/apache/arrow/blob/a77aab07b02b7d0dd6bd9c9a11c4af067d26b674/python/pyarrow/pandas_compat.py#L855

    # infer the extension columns from the pandas metadata
    for col_meta, field in zip(columns_metadata, table.schema):    # changed: also iterate over the schema fields
        try:
            name = col_meta['field_name']
        except KeyError:
            name = col_meta['name']
        dtype = col_meta['numpy_type']

        if dtype not in _pandas_supported_numpy_types:
            # pandas_dtype is expensive, so avoid doing this for types
            # that are certainly numpy dtypes
            try:                                                   # changed
                pandas_dtype = _pandas_api.pandas_dtype(dtype)
            except TypeError:                                      # changed: fall back to the actual Arrow type
                pandas_dtype = pd.ArrowDtype(field.type)

            if isinstance(pandas_dtype, _pandas_api.extension_dtype):
                if hasattr(pandas_dtype, "__from_arrow__"):
                    ext_columns[name] = pandas_dtype

@takacsd

takacsd commented May 10, 2023

Ran into the same issue:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'a': pd.Series([['a'], ['a', 'b']], dtype=pd.ArrowDtype(pa.list_(pa.string())))})

df.to_parquet('test.parquet')  # SUCCESS
pd.read_parquet('test.parquet')  # *** FAIL

df.to_parquet('test.parquet')  # SUCCESS
pq.read_table('test.parquet').to_pandas(ignore_metadata=True, types_mapper=pd.ArrowDtype)  # SUCCESS

df.to_parquet('test.parquet', store_schema=False)  # SUCCESS
pd.read_parquet('test.parquet')  # SUCCESS

I think the last case was not mentioned so far.

@danielhanchen
Author

@takacsd oh interesting - so it's possible it's the schema-storing component that's wrong?

@takacsd

takacsd commented May 11, 2023

@danielhanchen I think the problem is in the pandas-specific metadata. If the parquet file was created by something else (e.g. AWS Athena), pandas can read it just fine.

pq.write_table(pa.table({'a': pa.array([['a'], ['a', 'b']], type=pa.list_(pa.string()))}), 'test.parquet')  # SUCCESS
pd.read_parquet('test.parquet')  # SUCCESS

pq.write_table(pa.Table.from_pandas(df), 'test.parquet')  # SUCCESS
pd.read_parquet('test.parquet')  # *** FAIL

pq.write_table(pa.Table.from_pandas(df).replace_schema_metadata(), 'test.parquet') # SUCCESS
pd.read_parquet('test.parquet') # SUCCESS

This is the pandas metadata btw:

{'column_indexes': [{'field_name': None,
                     'metadata': {'encoding': 'UTF-8'},
                     'name': None,
                     'numpy_type': 'object',
                     'pandas_type': 'unicode'}],
 'columns': [{'field_name': 'a',
              'metadata': None,
              'name': 'a',
              'numpy_type': 'list<item: string>[pyarrow]',   # <---- this causes the error
              'pandas_type': 'list[unicode]'}],
 'creator': {'library': 'pyarrow', 'version': '11.0.0'},
 'index_columns': [{'kind': 'range',
                    'name': None,
                    'start': 0,
                    'step': 1,
                    'stop': 2}],
 'pandas_version': '2.0.1'}

In the case of a simple 'numpy_type': 'int64[pyarrow]' type everything works, so I suspect the _pandas_api.pandas_dtype(dtype) doesn't support complex types (yet).
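
For reference, that metadata blob can be dumped straight from the file (a sketch, assuming the test.parquet written above):

import json
import pyarrow.parquet as pq

# The pandas metadata lives under the b"pandas" key of the stored Arrow schema.
pandas_meta = json.loads(pq.read_schema("test.parquet").metadata[b"pandas"])
print(json.dumps(pandas_meta, indent=1))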

@danielhanchen
Author

danielhanchen commented May 12, 2023

@takacsd oh yep, your reasoning sounds right - so maybe adding a simple try/except is the simplest fix? Try the numpy conversion first, then if it fails, call pd.ArrowDtype.

@danielhanchen
Author

danielhanchen commented May 12, 2023

The main issue, I think, is that dtype is a string.
I'm not 100% sure how _pandas_api.pandas_dtype works, but presumably it's a large dict mapping types in string form to the correct type. Due to the infinite nature of possible Arrow datatypes, I guess it's not feasible to keep such a dictionary updated, so maybe the try/except is the only reasonable solution?

Just my two cents.

@takacsd

takacsd commented May 13, 2023

The main issue, I think, is that dtype is a string. I'm not 100% sure how _pandas_api.pandas_dtype works, but presumably it's a large dict mapping types in string form to the correct type.

It seems a little more complicated than that:
pandas_dtype looks up a registry.
This iterates through the registered ExtensionDtypes and tries to make sense of the string.
The ExtensionDtype should understand it but doesn't, because pa.type_for_alias(base_type) only understands basic types.

We already have a special case for temporal types, so I suppose we just need something similar for arrays and maps...
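
The lookup failure is easy to reproduce through the public wrapper around that registry (a minimal sketch):

import pandas as pd

# A simple pyarrow type string resolves through the ExtensionDtype registry:
print(pd.api.types.pandas_dtype("int64[pyarrow]"))  # int64[pyarrow]

# A nested type string is rejected by ArrowDtype.construct_from_string, so the
# lookup falls through to np.dtype(...) and raises the TypeError from the traceback above:
pd.api.types.pandas_dtype("list<item: string>[pyarrow]")
# TypeError: data type 'list<item: string>[pyarrow]' not understood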

@danielhanchen
Author

@takacsd The issue is, though, that timestamps are reasonably easy to construct from text.

The below could all be possible though:

list[list[struct[int, float]]]
list[int]
struct[list[datetime]]

Constructing Arrow dtypes from that could be potentially problematic.

I guess in theory one can iterate through the string and build up a string which you can then call eval on, i.e.:

list[struct[int32, string]] becomes pa.list_(pa.struct((pa.int32(), pa.string()))), and then you can eval it.

I think a wiser approach would be to use the Arrow dtype from data.schema.types and then call pd.ArrowDtype on it.

@takacsd

takacsd commented May 13, 2023

@danielhanchen your approach only works here, and it just ignores the metadata. I'm not a pandas developer but I suppose they generated that metadata for a reason, so it may break some things if we just ignore it.

Properly parsing the string is obviously harder, but I still think it is the better solution...

@danielhanchen
Author

@takacsd agreed parsing the metadata string is the correct way.

I thought about how one would go about doing it. Eg take:
list<element: struct<rank: uint8, subtype: dictionary<values=string, indices=int32, ordered=0>, caption: string, credit: string, type: dictionary<values=string, indices=int32, ordered=0>, url: string, height: uint16, width: uint16, subType: dictionary<values=string, indices=int32, ordered=0>, crop_name: dictionary<values=string, indices=int32, ordered=0>>>[pyarrow]

You'll have to first find the type with the first enclosed >, and continuously parse outwards. I.e. if one makes a string converter, it'll have to find the inner-most enclosed data-type, then expand out, and wrap the whole thing in a while loop.

The while loop looks something like this:

# One iteration of the while loop: grab the inner-most "<...>" group (plus the
# type name in front of it) and convert it.
left_pointer  = 0
bracket_end   = dt.find(">") + 1
bracket_start = dt.rfind("<", left_pointer, bracket_end)
bracket_start = dt.rfind(" ", left_pointer, bracket_start) + 1
partial_dt = dt[bracket_start : bracket_end]
partial_dt = _partial_convert_dt(partial_dt)

and

import re

def _partial_convert_dt(partial_dt):
    # Convert one inner-most "<...>" group into an eval-able pyarrow constructor string.
    if partial_dt.startswith("dictionary"):
        value_type = re.findall(r'values=([^,]{1,})',  partial_dt)[0]
        index_type = re.findall(r'indices=([^,]{1,})', partial_dt)[0]
        ordered    = re.findall(r'ordered=([\d])',     partial_dt)[0]
        if not value_type.startswith("pa."): value_type = f"pa.{value_type}()"
        if not index_type.startswith("pa."): index_type = f"pa.{index_type}()"
        partial_dt = f"pa.dictionary(value_type = {value_type}, index_type = {index_type}, ordered = {ordered})"
    elif partial_dt.startswith("list"):
        value_type = partial_dt[partial_dt.find(" ")+1 : -1]
        if not value_type.startswith("pa."): value_type = f"pa.{value_type}()"
        partial_dt = f"pa.list_({value_type})"
    elif partial_dt.startswith("struct"):
        struct_part = partial_dt[len("struct<"):-1]
        all_structs = struct_part.split(", ")
        converted = []
        for struct in all_structs:
            name, type = struct.split(": ")
            if not type.startswith("pa."): type = f"pa.{type}()"
            converted.append(f"('{name}', {type})")
        partial_dt = f"pa.struct(({', '.join(converted)}))"
    return partial_dt

The code just gets too cumbersome sadly - the above only supports struct, dictionary and list types.

The main issue is the infinite nesting of Arrow dtypes which overcomplicates the conversion process in my view.

@danielhanchen
Author

Actually, a simpler solution is to directly call .replace on the string and replace list<element: with pa.list_(( etc.

However, this doesn't work with struct data-types, since struct also keeps note of each field name.

This means a struct field could have dictionary as its name, which means using eval will fail.

This probably means string parsing won't work for structs, but works for everything else. I still believe a try/except is the simplest solution. Python 3.11 now has zero-cost exceptions, so the try costs nothing unless the conversion actually fails and reaches the except branch, which is slower. The alternative is a refactoring that passes the actual Arrow data-type in, and if the data-type string is not found in the registry, we fall back to that Arrow data-type.

@takacsd

takacsd commented May 15, 2023

Yeah, after some experimenting, I think we need to give up on parsing the type string:

These two:

pd.Series([{'a': 1, 'b': 1}], dtype=pd.ArrowDtype(pa.struct({'a': pa.int64(), 'b': pa.int64()})))
pd.Series([{'a: int64, b': 1}], dtype=pd.ArrowDtype(pa.struct({'a: int64, b': pa.int64()})))

both have the following type string: struct<a: int64, b: int64>[pyarrow].

But even if we disallow such cases, it is just too hard: I tried to write a recursive parser with some regexps, but I gave up. We need a balancing matcher or a recursive pattern to match the nested <> pairs properly, but neither is supported by the built-in re module. And I don't feel like we should write a full-blown recursive descent parser for this one.

The fundamental problem is we try to parse a string which was not meant to be easily parsable. The metadata should save the nested data types in a way that is easy to work with...

@takacsd

takacsd commented May 15, 2023

I was bored:

import re
from typing import NamedTuple

import pyarrow as pa


class ParseFail(Exception):
    pass

class Parsed(NamedTuple):
    type: pa.DataType
    end: int


class TypeStringParser:

    BASIC_TYPE_MATCHER = re.compile(r'\w+(\[[^\]]+\])?')
    TIMESTAMP_MATCHER = re.compile(r'timestamp\[([^,]+), tz=([^\]]+)\]')
    NAME_MATCHER = re.compile(r'\w+')  # this can be r'[^:]' to support weird names in struct

    def __init__(self, type_str: str) -> None:
        self.type_str = type_str

    def parse(self) -> pa.DataType:
        try:
            parsed = self.type(0)
        except ParseFail:
            raise ValueError(f"Can't parse '{self.type_str}' as a type.")

        if parsed.end != len(self.type_str):
            raise ValueError(f"Can't parse '{self.type_str}' as a type.")

        return parsed.type

    def type(self, pos: int) -> Parsed:
        try:
            return self.basic_type(pos)
        except ParseFail:
            pass

        try:
            return self.timestamp(pos)
        except ParseFail:
            pass

        try:
            return self.list(pos)
        except ParseFail:
            pass

        try:
            return self.dictionary(pos)
        except ParseFail:
            pass

        try:
            return self.struct(pos)
        except ParseFail:
            pass

        raise ParseFail()

    def basic_type(self, pos: int) -> pa.DataType:
        match = self.BASIC_TYPE_MATCHER.match(self.type_str, pos)
        if match is None:
            raise ParseFail()
        try:
            return Parsed(pa.type_for_alias(match.group(0)), match.end(0))
        except ValueError:
            pass
        raise ParseFail()

    def timestamp(self, pos: int) -> pa.DataType:
        match = self.TIMESTAMP_MATCHER.match(self.type_str, pos)
        if match is None:
            raise ParseFail()
        try:
            return Parsed(pa.timestamp(match.group(1).strip(), tz=match.group(2).strip()), match.end(0))
        except ValueError:
            pass
        raise ParseFail()

    def list(self, pos: int) -> pa.DataType:
        pos = self.accept('list<', pos)
        match = self.NAME_MATCHER.match(self.type_str, pos)
        if match is None:
            raise ParseFail()
        pos = self.accept(': ', match.end(0))
        item = self.type(pos)
        pos = self.accept('>', item.end)
        return Parsed(pa.list_(item.type), pos)

    def dictionary(self, pos: int) -> pa.DataType:
        pos = self.accept('dictionary<values=', pos)
        values = self.type(pos)
        pos = self.accept(', indices=', values.end)
        indices = self.type(pos)
        pos = self.accept(', ordered=', indices.end)
        try:
            pos = self.accept('0', pos)
            ordered = False
        except ParseFail:
            pos = self.accept('1', pos)
            ordered = True
        pos = self.accept('>', pos)
        return Parsed(pa.dictionary(indices.type, values.type, ordered), pos)

    def struct(self, pos: int) -> pa.DataType:
        pos = self.accept('struct<', pos)
        elements = []
        while self.type_str[pos] != '>':
            match = self.NAME_MATCHER.match(self.type_str, pos)
            if match is None:
                raise ParseFail()
            element_name = match.group(0)
            pos = self.accept(': ', match.end(0))
            element_type = self.type(pos)
            pos = element_type.end
            if self.type_str[pos] != '>':
                pos = self.accept(', ', pos)
            elements.append((element_name, element_type.type))
        pos = self.accept('>', pos)
        return Parsed(pa.struct(elements), pos)

    def accept(self, term: str, pos: int) -> int:
        if self.type_str.startswith(term, pos):
            return pos + len(term)
        raise ParseFail()

Probably not the prettiest recursive descent parser in existence, but it does parse arbitrary nested types.
The only restriction that I know of is that the names in the structs need to be alphanumeric.
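
Usage would look roughly like this (a sketch; the [pyarrow] suffix from the metadata string has to be stripped first, since the parser expects the bare Arrow type string):

import pandas as pd

type_str = "list<item: string>[pyarrow]"  # the columns[...]["numpy_type"] value from the metadata
arrow_type = TypeStringParser(type_str.removesuffix("[pyarrow]")).parse()
print(arrow_type)                 # list<item: string>
print(pd.ArrowDtype(arrow_type))  # list<item: string>[pyarrow]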

@danielhanchen
Author

@takacsd Nice work on the parser! :) Yeah, struct is the biggest issue since it can contain arbitrary field names. It gets worse if something like struct<struct: uint8> exists - that'll be painful.

Also, I just noticed that https://github.com/apache/arrow/blob/8be70c137289adba92871555ce74055719172f56/python/pyarrow/pandas_compat.py#L870 actually does handle Arrow extension types! The issue is that the code before it breaks, so it never gets there.

    for field in table.schema:
        typ = field.type
        if isinstance(typ, pa.BaseExtensionType):
            try:
                pandas_dtype = typ.to_pandas_dtype()
            except NotImplementedError:
                pass
            else:
                ext_columns[field.name] = pandas_dtype

The issue is https://github.com/apache/arrow/blob/8be70c137289adba92871555ce74055719172f56/python/pyarrow/pandas_compat.py#LL854C1-L868C53:

    # infer the extension columns from the pandas metadata
    for col_meta in columns_metadata:
        try:
            name = col_meta['field_name']
        except KeyError:
            name = col_meta['name']
        dtype = col_meta['numpy_type']

        if dtype not in _pandas_supported_numpy_types:
            # pandas_dtype is expensive, so avoid doing this for types
            # that are certainly numpy dtypes
            pandas_dtype = _pandas_api.pandas_dtype(dtype)       # >>>>>>>> BREAKS (A)
            if isinstance(pandas_dtype, _pandas_api.extension_dtype):
                if hasattr(pandas_dtype, "__from_arrow__"):
                    ext_columns[name] = pandas_dtype

I think I might have fixed it WITHOUT using try/except:

    # infer the extension columns from the pandas metadata
    schema = table.schema                                        # <<<< changed
    for col_meta, field in zip(columns_metadata, schema):        # <<<< changed
        try:
            name = col_meta['field_name']
        except KeyError:
            name = col_meta['name']
        dtype = col_meta['numpy_type']

        if dtype not in _pandas_supported_numpy_types:
            # pandas_dtype is expensive, so avoid doing this for types
            # that are certainly numpy dtypes
            if dtype.endswith("[pyarrow]"):                      # <<<< (1)
                pandas_dtype = pd.ArrowDtype(field.type)
            elif dtype == "string":                              # <<<< (2)
                pandas_dtype = pd.ArrowDtype(pa.string())
            else:                                                # <<<< (3)
                pandas_dtype = _pandas_api.pandas_dtype(dtype)
            if isinstance(pandas_dtype, _pandas_api.extension_dtype):
                if hasattr(pandas_dtype, "__from_arrow__"):
                    ext_columns[name] = pandas_dtype

We push the original call (A) down to line (3); then, if a dtype string ends with [pyarrow], we use pd.ArrowDtype with the actual Arrow type from the schema. For string columns, we bypass the pandas parser and handle them directly on the fast path.

This also means

    for field in table.schema:
        typ = field.type
        if isinstance(typ, pa.BaseExtensionType):
            try:
                pandas_dtype = typ.to_pandas_dtype()
            except NotImplementedError:
                pass
            else:
                ext_columns[field.name] = pandas_dtype

can be deleted - it's redundant, since we folded that logic into the loop above.

@bretttully

I just hit this today trying to read a parquet file made by someone else, where they had used the pyarrow backend.

Here is another minimal example to add to the mix that fails on reading df2.

import io

import numpy as np
import pandas as pd
import pyarrow as pa


def main():
    df0 = pd.DataFrame(
        [
            {"foo": {"bar": True, "baz": np.float32(1)}},
            {"foo": {"bar": True, "baz": None}},
        ],
    )
    schema = pa.schema(
        [
            pa.field(
                "foo",
                pa.struct(
                    [
                        pa.field("bar", pa.bool_(), nullable=False),
                        pa.field("baz", pa.float32(), nullable=True),
                    ],
                ),
            ),
        ],
    )
    print(schema)
    with io.BytesIO() as stream0, io.BytesIO() as stream1:
        kwargs = {
            "engine": "pyarrow",
            "compression": "zstd",
            "schema": schema,
            "row_group_size": 2_000,
        }
        print("Writing df0")
        df0.to_parquet(stream0, **kwargs)

        print("Reading df1")
        stream0.seek(0)
        df1 = pd.read_parquet(stream0, engine="pyarrow", dtype_backend="pyarrow")

        print("Writing df1")
        df1.to_parquet(stream1, **kwargs)

        print("Reading df2")
        stream1.seek(0)
        df2 = pd.read_parquet(stream1, engine="pyarrow", dtype_backend="pyarrow")


if __name__ == "__main__":
    main()

Using df2 = pq.read_table(stream1).to_pandas(ignore_metadata=True) works for all of the reasons mentioned in the thread.

@giftculture

I'm running into this issue as well:

[Screenshot from 2024-02-07: pd.read_parquet(..., dtype_backend="pyarrow") raising the same TypeError]

@giftculture

INSTALLED VERSIONS

commit : f538741
python : 3.11.7.final.0
python-bits : 64
OS : Linux
OS-release : 4.13.9-300.fc27.x86_64
Version : #1 SMP Mon Oct 23 13:41:58 UTC 2017
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.0
numpy : 1.26.3
pytz : 2023.4
dateutil : 2.8.2
setuptools : 69.0.3
pip : 23.3.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.3
IPython : 8.20.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat : None
fastparquet : 2023.10.1
fsspec : 2023.12.2
gcsfs : None
matplotlib : 3.8.2
numba : 0.58.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 15.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.12.0
sqlalchemy : 2.0.25
tables : None
tabulate : None
xarray : 2024.1.1
xlrd : None
zstandard : None
tzdata : 2023.4
qtpy : 2.4.1
pyqt5 : None

@jborman-stonex

The only "workaround" at the pandas-level I've found is to set df.to_parquet(..., store_schema=False) for a df containing complex/nested types like a categorical. Are there any plans to get successful dtype roundtripping going forward? Is there anything in the pyarrow library we can leverage here?

Versions:

INSTALLED VERSIONS
------------------
commit                : bdc79c146c2e32f2cab629be240f01658cfb6cc2
python                : 3.11.6.final.0
python-bits           : 64
OS                    : Windows
OS-release            : 10
Version               : 10.0.19045
machine               : AMD64
processor             : Intel64 Family 6 Model 186 Stepping 2, GenuineIntel
byteorder             : little
LC_ALL                : None
LANG                  : None
LOCALE                : English_United States.1252

pandas                : 2.2.1
numpy                 : 1.26.4
pytz                  : 2024.1
dateutil              : 2.9.0.post0
setuptools            : 65.5.0
pip                   : 23.2.1
Cython                : None
pytest                : 8.1.1
hypothesis            : None
sphinx                : None
blosc                 : None
feather               : None
xlsxwriter            : None
lxml.etree            : None
html5lib              : None
pymysql               : None
psycopg2              : 2.9.9
jinja2                : None
IPython               : 8.23.0
pandas_datareader     : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : None
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : None
gcsfs                 : None
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
pandas_gbq            : None
pyarrow               : 15.0.2
pyreadstat            : None
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : None
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
zstandard             : None
tzdata                : 2024.1
qtpy                  : None
pyqt5                 : None

@phofl
Member

phofl commented Apr 4, 2024

I would recommend using dtype_backend="pyarrow"

@giftculture

I would recommend using dtype_backend="pyarrow"

@phofl, not sure if you saw my screenshot, but I did apply dtype_backend="pyarrow" to the read_parquet call and it still fails, unless I am misunderstanding your suggestion.

@bretttully

bretttully commented Apr 4, 2024

The only workaround I have found so far is the following. It works in every case I have thought of, except round-tripping an empty dataframe that has a struct or list dtype while setting the schema and not using dtype_backend="pyarrow" when reading back in.

Would definitely welcome suggested improvements to this workaround!

Obviously you can write this differently if you don't want a byte string returned, but for us that's what we want.

    # NOTE: assumes `import io`, `import pandas as pd`, `import pyarrow as pa`, `import pyarrow.parquet as pq`
    def serialize(self, data: pd.DataFrame, **kwargs) -> bytes:
        """see BytesWriter.serialize -- Dump pandas dataframe to parquet bytes"""
        with io.BytesIO() as stream:

            schema = kwargs.pop("schema", None)
            all_arrow_types = all(isinstance(t, pd.ArrowDtype) for t in data.dtypes.tolist())
            # An empty dataframe may use default dtypes that are incompatible with the schema.
            # In this case, first cast to object, as the schema can always convert that to the correct type.
            if len(data) == 0 and schema is not None and not all_arrow_types:
                data = data.astype("object").astype({n: pd.ArrowDtype(schema.field(n).type) for n in schema.names})
            table = pa.Table.from_pandas(data, schema=schema)

            # drop pandas from the schema metadata to work around the bug where you can't read struct columns with
            # pandas metadata
            # see https://github.com/pandas-dev/pandas/issues/53011
            metadata = table.schema.metadata
            if b"pandas" in metadata and b"list" in metadata[b"pandas"] or b"struct" in metadata[b"pandas"]:
                del metadata[b"pandas"]
                table = table.replace_schema_metadata(metadata)

            pq.write_table(table, stream, **kwargs)
            return stream.getvalue()

@kinianlo

kinianlo commented May 21, 2024

This bug is very similar to #57411 where the data type is list[int] instead of list[str].

@judahrand

judahrand commented Sep 3, 2024

Actually, a simpler solution is to directly call .replace on the string and replace list<element: with pa.list_(( etc.

Additionally, list<element: can also be list<item: depending on how the Parquet was written out (see the use_compliant_nested_type argument to pyarrow.parquet.ParquetWriter). And within Arrow itself the contained field name can be arbitrary (i.e. pyarrow.list_(pyarrow.field('some_name', pyarrow.int64())) is a valid type). So the replace would have to be a regex, something like r'list<(.+):' with the capturing group grabbing the name of the field.
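
For example (a toy sketch of the point that the field name inside list<...> is arbitrary):

import re

pattern = re.compile(r"list<(.+):")
for type_str in ("list<item: string>", "list<element: string>", "list<some_name: int64>"):
    print(pattern.match(type_str).group(1))  # item / element / some_name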

bretttully added a commit to bretttully/arrow that referenced this issue Nov 13, 2024
Addresses pandas-dev/pandas#53011

`types_mapper` always had highest priority as it overrode what was set before. However, by switching the logical ordering, we no longer need to call `_pandas_api.pandas_dtype(dtype)` when using the pyarrow backend, resolving the issue of complex `dtype` with `list` or `struct`.
@bretttully

Matching arrow ticket: apache/arrow#39914 and potential PR: apache/arrow#44720

@jorisvandenbossche
Member

Only seeing this long ticket now... FWIW, I think this is another good reason why it would be good for pandas to have tighter control over the pandas<->arrow conversion (#59780).

@jorisvandenbossche removed the Upstream issue (Issue related to pandas dependency) label Nov 14, 2024
jorisvandenbossche added a commit to apache/arrow that referenced this issue Nov 27, 2024
### Rationale for this change

This is a long-standing [pandas ticket](pandas-dev/pandas#53011) with some fairly horrible workarounds, where complex arrow types do not serialise well to pandas because the pandas metadata string is not parseable. However, `types_mapper` always had highest priority as it overrode what was set before.

### What changes are included in this PR?

By switching the logical ordering, we no longer need to call `_pandas_api.pandas_dtype(dtype)` when using the pyarrow backend, thus resolving the issue of complex `dtype` with `list` or `struct`. It will likely still fail if the numpy backend is used, but at least this gives a working solution rather than an inability to load files at all.

### Are these changes tested?

Existing tests should stay unchanged and a new test for the complex type has been added

### Are there any user-facing changes?

**This PR contains a "Critical Fix".**
This makes `pd.read_parquet(..., dtype_backend="pyarrow")` work with complex data types where the metadata added by pyarrow during `pd.to_parquet` is not serialisable and currently throwing an exception. This issue currently prevents the use of pyarrow as the default backend for pandas.
* GitHub Issue: #39914

Lead-authored-by: bretttully <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Brett Tully <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>