
BUG: KeyError when loading csv with NaNs #56929

Open
3 tasks done
CYHSM opened this issue Jan 17, 2024 · 3 comments
Labels
Bug IO CSV read_csv, to_csv Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate

Comments

@CYHSM

CYHSM commented Jan 17, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd

# Define parameters for data
nan_rows = 3
random_rows = 7
total_rows = nan_rows + random_rows

# With two levels
level1 = ['Level 1'] * 1  
level2 = ['Level 2'] * 1 

# Combine these lists to create a MultiIndex for columns
columns = pd.MultiIndex.from_arrays([level1, level2], names=['level1', 'level2'])

# Create dummy data
data = np.random.rand(total_rows, len(columns))
data[:nan_rows, :] = np.nan 
df_complex = pd.DataFrame(data, columns=columns)

# Save the DataFrame as a CSV
csv_filename_complex = 'example_data_complex.csv'
df_complex.to_csv(csv_filename_complex)

# Load the data back with header=[0, 1] and index_col=0
loaded_df_complex = pd.read_csv(csv_filename_complex, header=[0, 1], index_col=0)

# NaN in row 0 is omitted:
print(loaded_df_complex.head(total_rows))

# Also throws a KeyError when indexing into it
try:
    loaded_df_complex.loc[0]
    print('Works')
except KeyError:
    print('Does not work')

Issue Description

When loading back a CSV file previously saved with pandas, the first row is omitted if it contains only NaN values. The problem seems to be the default behaviour of replacing NaN with ''. If we force the replacement to be 'NaN', the same example works; see below:

Maybe related Issue: #55803
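The round trip can be inspected without touching disk. A minimal sketch (a hypothetical two-row frame, not the exact script above) showing what `to_csv` writes for an all-NaN first row, with and without `na_rep`:

```python
import io

import numpy as np
import pandas as pd

# Minimal frame: one MultiIndex column, NaN in row 0
columns = pd.MultiIndex.from_arrays([["Level 1"], ["Level 2"]],
                                    names=["level1", "level2"])
df = pd.DataFrame([[np.nan], [0.5]], columns=columns)

# Default: NaN becomes an empty field, so row 0 comes out as "0," with
# nothing after the index value
buf = io.StringIO()
df.to_csv(buf)
print(buf.getvalue())

# With na_rep="NaN": row 0 comes out as "0,NaN", which read_csv no
# longer mistakes for a header line
buf2 = io.StringIO()
df.to_csv(buf2, na_rep="NaN")
print(buf2.getvalue())
```

The first output has no separator line between the two header rows and the data, so the all-empty row "0," is the first thing `read_csv` sees after the headers.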

Expected Behavior

import numpy as np
import pandas as pd

# Define parameters for data
nan_rows = 3
random_rows = 7
total_rows = nan_rows + random_rows

# With two levels
level1 = ['Level 1'] * 1  
level2 = ['Level 2'] * 1 

# Combine these lists to create a MultiIndex for columns
columns = pd.MultiIndex.from_arrays([level1, level2], names=['level1', 'level2'])

# Create dummy data
data = np.random.rand(total_rows, len(columns))
data[:nan_rows, :] = np.nan 
df_complex = pd.DataFrame(data, columns=columns)

# Save the DataFrame as a CSV
csv_filename_complex = 'example_data_complex.csv'
df_complex.to_csv(csv_filename_complex, na_rep='NaN')

# Load the data back with header=[0, 1] and index_col=0
loaded_df_complex = pd.read_csv(csv_filename_complex, header=[0, 1], index_col=0)

# NaN in row 0 is not omitted:
print(loaded_df_complex.head(total_rows))

# No longer throws a KeyError when indexing into it
try:
    loaded_df_complex.loc[0]
    print('Works')
except KeyError:
    print('Does not work')

Installed Versions

INSTALLED VERSIONS

commit : 0f43794
python : 3.8.16.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-91-generic
Version : #101~20.04.1-Ubuntu SMP Thu Nov 16 14:22:28 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.0.3
numpy : 1.24.3
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.8.0
pip : 23.1.2
Cython : 3.0.0
pytest : 7.4.0
hypothesis : None
sphinx : 7.1.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.12.2
pandas_datareader: None
bs4 : 4.12.2
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.1
numba : None
numexpr : 2.8.4
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.1
snappy : None
sqlalchemy : None
tables : 3.8.0
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : 2.4.1
pyqt5 : None

@CYHSM CYHSM added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 17, 2024
@rhshadrach
Member

Thanks for the report! Confirmed on main, further investigations and PRs to fix are welcome!

@rhshadrach rhshadrach added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate IO CSV read_csv, to_csv and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 17, 2024
@CYHSM
Author

CYHSM commented Jan 18, 2024

I narrowed it down as far as possible within the Python part of the codebase. The faulty example gives the following when running within a debug session:
(screenshot: debugger state for the faulty example)
The expected behavior (script from above) shows this:
(screenshot: debugger state for the expected behavior)

You can see the difference in the Index: in the faulty example it takes 0 as the name instead of as a value. From here onwards it calls Cython code and I'm not sure how to debug it properly; are there any guidelines for this?

Just from looking at the parsers I assume the error occurs somewhere here:

        self.allow_leading_cols = allow_leading_cols
        self.leading_cols = 0  # updated in _get_header


        # TODO: no header vs. header is not the first row
        self.has_mi_columns = 0
        self.orig_header = header
        if header is None:
            # sentinel value
            self.parser.header_start = -1
            self.parser.header_end = -1
            self.parser.header = -1
            self.parser_start = 0
            prelim_header = []
        else:
            if isinstance(header, list):
                if len(header) > 1:
                    # need to artificially skip the final line
                    # which is still a header line
                    header = list(header)
                    header.append(header[-1] + 1)
                    self.parser.header_end = header[-1]
                    self.has_mi_columns = 1
                else:
                    self.parser.header_end = header[0]


                self.parser_start = header[-1] + 1
                self.parser.header_start = header[0]
                self.parser.header = header[0]
                prelim_header = header
            else:
                self.parser.header_start = header
                self.parser.header_end = header
                self.parser_start = header + 1
                self.parser.header = header
                prelim_header = [header]

https://github.com/CYHSM/pandas/blob/f459437d7a7dc10db82437aa4438f058173212a7/pandas/_libs/parsers.pyx#L537-L571

Edit: More precisely, the TextReader class seems to think the first empty line is part of the header:
(screenshot: TextReader header state showing an extra 'Level 2' entry)
That Level 2 should not be in there.

Edit 2:
The problem seems to be the following lines:

if name == "":
    if self.has_mi_columns:
        name = f"Unnamed: {i}_level_{level}"
    else:
        name = f"Unnamed: {i}"

where an empty header cannot be differentiated from an empty first value. I am still wondering why we even look at the third row, given that header is specified as [0, 1]. This seems to be enforced here:
# need to artificially skip the final line
# which is still a header line
header = list(header)
header.append(header[-1] + 1)

Without these lines it would be a trivial fix, as we would not try to read the third line as a header, but I assume this is there for a reason @rhshadrach?
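For context, the extra header line appears to exist so that `read_csv` can recover a named index from a round-tripped MultiIndex CSV: the row after the last header row is treated as an index-name row when all of its non-index fields are empty. A sketch of the legitimate case, using hypothetical data:

```python
import io

import pandas as pd

# The third line here is a genuine index-name row: only the index
# field is filled, everything after it is empty
csv_text = (
    "level1,Level 1\n"
    "level2,Level 2\n"
    "idx,\n"
    "0,0.1\n"
    "1,0.2\n"
)

df = pd.read_csv(io.StringIO(csv_text), header=[0, 1], index_col=0)
print(df.index.name)  # the index-name row is consumed as the name
print(len(df))        # both data rows are preserved
```

The reported bug is that an all-NaN data row written as "0," is indistinguishable from such an index-name row.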

@rhshadrach
Member

Thanks for investigating here!

Without these lines it would be a trivial fix, as we would not try to read the third line as a header, but I assume this is there for a reason @rhshadrach?

As a next step I would try removing it and see if any tests in pandas/tests/io/csv break. If they do, the way in which they break might indicate a proper approach.
