
BUG: valueerror: found non-unique column index !! when using read_csv and arrow engine when CSV has duplicate columns #52408

Closed
tfr2003 opened this issue Apr 4, 2023 · 4 comments
Labels: Arrow (pyarrow functionality), IO CSV (read_csv, to_csv), Upstream issue (Issue related to pandas dependency)

Comments


tfr2003 commented Apr 4, 2023

I got this error message:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[9], line 1
----> 1 ff=pd.read_csv("CSVWO/2022-December.csv",engine="pyarrow")

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/io/parsers/readers.py:912, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
    899 kwds_defaults = _refine_defaults_read(
    900     dialect,
    901     delimiter,
   (...)
    908     dtype_backend=dtype_backend,
    909 )
    910 kwds.update(kwds_defaults)
--> 912 return _read(filepath_or_buffer, kwds)

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/io/parsers/readers.py:583, in _read(filepath_or_buffer, kwds)
    580     return parser
    582 with parser:
--> 583     return parser.read(nrows)

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/io/parsers/readers.py:1692, in TextFileReader.read(self, nrows)
   1689 if self.engine == "pyarrow":
   1690     try:
   1691         # error: "ParserBase" has no attribute "read"
-> 1692         df = self._engine.read()  # type: ignore[attr-defined]
   1693     except Exception:
   1694         self.close()

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pandas/io/parsers/arrow_parser_wrapper.py:163, in ArrowParserWrapper.read(self)
    161     frame = table.to_pandas(types_mapper=_arrow_dtype_mapping().get)
    162 else:
--> 163     frame = table.to_pandas()
    164 return self._finalize_pandas_output(frame)

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pyarrow/array.pxi:830, in pyarrow.lib._PandasConvertible.to_pandas()

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pyarrow/table.pxi:3990, in pyarrow.lib.Table._to_pandas()

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pyarrow/pandas_compat.py:819, in table_to_blockmanager(options, table, categories, ignore_metadata, types_mapper)
    816     ext_columns_dtypes = _get_extension_dtypes(table, [], types_mapper)
    818 _check_data_column_metadata_consistency(all_columns)
--> 819 columns = _deserialize_column_index(table, all_columns, column_indexes)
    820 blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
    822 axes = [columns, index]

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pyarrow/pandas_compat.py:938, in _deserialize_column_index(block_table, all_columns, column_indexes)
    935     columns = _reconstruct_columns_from_metadata(columns, column_indexes)
    937 # ARROW-1751: flatten a single level column MultiIndex for pandas 0.21.0
--> 938 columns = _flatten_single_level_multiindex(columns)
    940 return columns

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pyarrow/pandas_compat.py:1185, in _flatten_single_level_multiindex(index)
   1183     # Cheaply check that we do not somehow have duplicate column names
   1184     if not index.is_unique:
-> 1185         raise ValueError('Found non-unique column index')
   1187     return pd.Index(
   1188         [levels[_label] if _label != -1 else None for _label in labels],
   1189         dtype=dtype,
   1190         name=index.names[0]
   1191     )
   1192 return index

ValueError: Found non-unique column index

This happens when trying to use read_csv(path, engine="pyarrow").

Thanks
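A minimal workaround sketch until this is handled upstream, assuming the problem is simply a repeated header name (the data below is only illustrative): dropping engine="pyarrow" lets the default C engine disambiguate the duplicates as name, name.1, ... instead of raising.

import io
import pandas as pd

# Illustrative stand-in for the real CSV: a file with a repeated header name.
data = io.StringIO("Created On,Created On\n1,1\n2,2\n")

# The default C engine mangles duplicate column names instead of raising.
df = pd.read_csv(data)
print(df.columns.tolist())  # ['Created On', 'Created On.1']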

@MarcoGorelli (Member)

thanks @tfr2003 for the report

simpler reproducer:

import io
import pandas as pd

data = io.StringIO('a,a\n1,1\n2,2\n')
pd.read_csv(data, engine="pyarrow")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [5], line 1
----> 1 pd.read_csv(data,engine="pyarrow")

File ~/tmp/.venv/lib/python3.8/site-packages/pandas/io/parsers/readers.py:912, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
    899 kwds_defaults = _refine_defaults_read(
    900     dialect,
    901     delimiter,
   (...)
    908     dtype_backend=dtype_backend,
    909 )
    910 kwds.update(kwds_defaults)
--> 912 return _read(filepath_or_buffer, kwds)

File ~/tmp/.venv/lib/python3.8/site-packages/pandas/io/parsers/readers.py:583, in _read(filepath_or_buffer, kwds)
    580     return parser
    582 with parser:
--> 583     return parser.read(nrows)

File ~/tmp/.venv/lib/python3.8/site-packages/pandas/io/parsers/readers.py:1692, in TextFileReader.read(self, nrows)
   1689 if self.engine == "pyarrow":
   1690     try:
   1691         # error: "ParserBase" has no attribute "read"
-> 1692         df = self._engine.read()  # type: ignore[attr-defined]
   1693     except Exception:
   1694         self.close()

File ~/tmp/.venv/lib/python3.8/site-packages/pandas/io/parsers/arrow_parser_wrapper.py:163, in ArrowParserWrapper.read(self)
    161     frame = table.to_pandas(types_mapper=_arrow_dtype_mapping().get)
    162 else:
--> 163     frame = table.to_pandas()
    164 return self._finalize_pandas_output(frame)

File ~/tmp/.venv/lib/python3.8/site-packages/pyarrow/array.pxi:830, in pyarrow.lib._PandasConvertible.to_pandas()

File ~/tmp/.venv/lib/python3.8/site-packages/pyarrow/table.pxi:3990, in pyarrow.lib.Table._to_pandas()

File ~/tmp/.venv/lib/python3.8/site-packages/pyarrow/pandas_compat.py:819, in table_to_blockmanager(options, table, categories, ignore_metadata, types_mapper)
    816     ext_columns_dtypes = _get_extension_dtypes(table, [], types_mapper)
    818 _check_data_column_metadata_consistency(all_columns)
--> 819 columns = _deserialize_column_index(table, all_columns, column_indexes)
    820 blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
    822 axes = [columns, index]

File ~/tmp/.venv/lib/python3.8/site-packages/pyarrow/pandas_compat.py:938, in _deserialize_column_index(block_table, all_columns, column_indexes)
    935     columns = _reconstruct_columns_from_metadata(columns, column_indexes)
    937 # ARROW-1751: flatten a single level column MultiIndex for pandas 0.21.0
--> 938 columns = _flatten_single_level_multiindex(columns)
    940 return columns

File ~/tmp/.venv/lib/python3.8/site-packages/pyarrow/pandas_compat.py:1185, in _flatten_single_level_multiindex(index)
   1183     # Cheaply check that we do not somehow have duplicate column names
   1184     if not index.is_unique:
-> 1185         raise ValueError('Found non-unique column index')
   1187     return pd.Index(
   1188         [levels[_label] if _label != -1 else None for _label in labels],
   1189         dtype=dtype,
   1190         name=index.names[0]
   1191     )
   1192 return index

ValueError: Found non-unique column index

@MarcoGorelli added the Arrow (pyarrow functionality) and IO CSV (read_csv, to_csv) labels on Apr 4, 2023
@MarcoGorelli changed the title from "valueerror: found non-unique column index !!" to "BUG: valueerror: found non-unique column index !! when using read_csv and arrow engine" on Apr 4, 2023
@mroeschke added the Upstream issue (Issue related to pandas dependency) label on Apr 6, 2023
@mroeschke (Member)

Looks to be an upstream issue with pyarrow; pyarrow's read_csv currently doesn't have functionality to handle duplicate column labels
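For reference, a minimal sketch (not part of this thread) showing that the same error comes straight out of pyarrow when a table with duplicate column names is converted to pandas, plus a hypothetical workaround that renames the duplicates before converting:

import io
import pyarrow.csv as pv

data = io.BytesIO(b"a,a\n1,1\n2,2\n")
table = pv.read_csv(data)
print(table.column_names)   # ['a', 'a'] -- pyarrow keeps the duplicate names

# table.to_pandas() raises ValueError: Found non-unique column index,
# the same error seen via pandas' pyarrow engine.

# Hypothetical workaround: deduplicate the names before converting,
# mimicking pandas' own 'a', 'a.1', ... mangling.
seen = {}
names = []
for name in table.column_names:
    if name in seen:
        seen[name] += 1
        names.append(f"{name}.{seen[name]}")
    else:
        seen[name] = 0
        names.append(name)

df = table.rename_columns(names).to_pandas()
print(df.columns.tolist())  # ['a', 'a.1']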

@mroeschke changed the title from "BUG: valueerror: found non-unique column index !! when using read_csv and arrow engine" to "BUG: valueerror: found non-unique column index !! when using read_csv and arrow engine when CSV has duplicate columns" on Apr 6, 2023

tfr2003 commented Apr 6, 2023

I found 3 weird things when using read_csv with engine="pyarrow":

1. If columns have similar names, pandas treats them as the same, as in my issue here where I have two columns, Created On and Created On.1.

2. When you concat files together and some files are missing certain columns, concat works normally and fills those columns with NaN, which is good; but when you enable the pyarrow engine, it raises an error message complaining that those columns do not exist.

3. It also does not seem to treat empty cells as NaN.

@mroeschke (Member)

Since this needs fixing upstream, closing
