
BUG: KeyError when loading csv with NaNs #56929

Open
3 tasks done
CYHSM opened this issue Jan 17, 2024 · 3 comments
Labels
Bug IO CSV read_csv, to_csv Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate

Comments

@CYHSM

CYHSM commented Jan 17, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd

# Define parameters for data
nan_rows = 3
random_rows = 7
total_rows = nan_rows + random_rows

# With two levels
level1 = ['Level 1'] * 1  
level2 = ['Level 2'] * 1 

# Combine these lists to create a MultiIndex for columns
columns = pd.MultiIndex.from_arrays([level1, level2], names=['level1', 'level2'])

# Create dummy data
data = np.random.rand(total_rows, len(columns))
data[:nan_rows, :] = np.nan 
df_complex = pd.DataFrame(data, columns=columns)

# Save the DataFrame as a CSV
csv_filename_complex = 'example_data_complex.csv'
df_complex.to_csv(csv_filename_complex)

# Load the data back with header=[0, 1] and index_col=0
loaded_df_complex = pd.read_csv(csv_filename_complex, header=[0, 1], index_col=0)

# NaN in row 0 is omitted:
print(loaded_df_complex.head(total_rows))

# Also throws a KeyError when indexing into it
try:
    loaded_df_complex.loc[0]
    print('Works')
except KeyError:
    print('Does not work')

Issue Description

When loading back a CSV file previously saved with pandas, the first row is omitted if it contains only NaN values. The problem seems to be the default behaviour of replacing NaN with ''. If we force the replacement to be 'NaN', the same example works; see below:

Maybe related Issue: #55803
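The round trip can be inspected without touching disk. A minimal sketch (a hypothetical two-row frame, not the exact script above) showing what `to_csv` writes for an all-NaN first row, with and without `na_rep`:

```python
import io

import numpy as np
import pandas as pd

# Minimal frame: one MultiIndex column, NaN in row 0
columns = pd.MultiIndex.from_arrays([["Level 1"], ["Level 2"]],
                                    names=["level1", "level2"])
df = pd.DataFrame([[np.nan], [0.5]], columns=columns)

# Default: NaN becomes an empty field, so row 0 comes out as "0," with
# nothing after the index value
buf = io.StringIO()
df.to_csv(buf)
print(buf.getvalue())

# With na_rep="NaN": row 0 comes out as "0,NaN", which read_csv no
# longer mistakes for a header line
buf2 = io.StringIO()
df.to_csv(buf2, na_rep="NaN")
print(buf2.getvalue())
```

The first output has no separator line between the two header rows and the data, so the all-empty row "0," is the first thing `read_csv` sees after the headers.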

Expected Behavior

import numpy as np
import pandas as pd

# Define parameters for data
nan_rows = 3
random_rows = 7
total_rows = nan_rows + random_rows

# With two levels
level1 = ['Level 1'] * 1  
level2 = ['Level 2'] * 1 

# Combine these lists to create a MultiIndex for columns
columns = pd.MultiIndex.from_arrays([level1, level2], names=['level1', 'level2'])

# Create dummy data
data = np.random.rand(total_rows, len(columns))
data[:nan_rows, :] = np.nan 
df_complex = pd.DataFrame(data, columns=columns)

# Save the DataFrame as a CSV
csv_filename_complex = 'example_data_complex.csv'
df_complex.to_csv(csv_filename_complex, na_rep='NaN')

# Load the data back with header=[0, 1] and index_col=0
loaded_df_complex = pd.read_csv(csv_filename_complex, header=[0, 1], index_col=0)

# NaN in row 0 is not omitted:
print(loaded_df_complex.head(total_rows))

# No longer throws a KeyError when indexing into it
try:
    loaded_df_complex.loc[0]
    print('Works')
except KeyError:
    print('Does not work')

Installed Versions

INSTALLED VERSIONS

commit : 0f43794
python : 3.8.16.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-91-generic
Version : #101~20.04.1-Ubuntu SMP Thu Nov 16 14:22:28 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.0.3
numpy : 1.24.3
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.8.0
pip : 23.1.2
Cython : 3.0.0
pytest : 7.4.0
hypothesis : None
sphinx : 7.1.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.12.2
pandas_datareader: None
bs4 : 4.12.2
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.1
numba : None
numexpr : 2.8.4
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.1
snappy : None
sqlalchemy : None
tables : 3.8.0
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : 2.4.1
pyqt5 : None

@CYHSM CYHSM added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 17, 2024
@rhshadrach
Member

Thanks for the report! Confirmed on main, further investigations and PRs to fix are welcome!

@rhshadrach rhshadrach added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate IO CSV read_csv, to_csv and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 17, 2024
@CYHSM
Author

CYHSM commented Jan 18, 2024

I narrowed it down as far as possible within the Python part of the codebase. The faulty example gives the following when running within a debug session:
(screenshot: debugger state for the faulty example)
The expected behavior (script from above) shows this:
(screenshot: debugger state for the expected behavior)

You can see the difference in the Index: in the faulty example it takes 0 as the name instead of as a value. From here onwards it calls Cython code and I'm not sure how to debug it properly; are there any guidelines for this?

Just from looking at the parsers I assume the error occurs somewhere here:

        self.allow_leading_cols = allow_leading_cols
        self.leading_cols = 0  # updated in _get_header


        # TODO: no header vs. header is not the first row
        self.has_mi_columns = 0
        self.orig_header = header
        if header is None:
            # sentinel value
            self.parser.header_start = -1
            self.parser.header_end = -1
            self.parser.header = -1
            self.parser_start = 0
            prelim_header = []
        else:
            if isinstance(header, list):
                if len(header) > 1:
                    # need to artificially skip the final line
                    # which is still a header line
                    header = list(header)
                    header.append(header[-1] + 1)
                    self.parser.header_end = header[-1]
                    self.has_mi_columns = 1
                else:
                    self.parser.header_end = header[0]


                self.parser_start = header[-1] + 1
                self.parser.header_start = header[0]
                self.parser.header = header[0]
                prelim_header = header
            else:
                self.parser.header_start = header
                self.parser.header_end = header
                self.parser_start = header + 1
                self.parser.header = header
                prelim_header = [header]

https://github.com/CYHSM/pandas/blob/f459437d7a7dc10db82437aa4438f058173212a7/pandas/_libs/parsers.pyx#L537-L571

Edit: More precisely, the TextReader class seems to think the first empty line is part of the header:
(screenshot: TextReader header state showing an extra 'Level 2' entry)
That Level 2 should not be in there.

Edit 2:
The problem seems to be the following lines:

if name == "":
    if self.has_mi_columns:
        name = f"Unnamed: {i}_level_{level}"
    else:
        name = f"Unnamed: {i}"

where an empty header cannot be differentiated from an empty first value. I am still wondering why we even look at the third row, given that header is specified as [0, 1]. This seems to be enforced here:
# need to artificially skip the final line
# which is still a header line
header = list(header)
header.append(header[-1] + 1)

Without these lines it would be a trivial fix, as we would not try to read the third line as a header, but I assume this is there for a reason @rhshadrach?
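For context, the extra header line appears to exist so that `read_csv` can recover a named index from a round-tripped MultiIndex CSV: the row after the last header row is treated as an index-name row when all of its non-index fields are empty. A sketch of the legitimate case, using hypothetical data:

```python
import io

import pandas as pd

# The third line here is a genuine index-name row: only the index
# field is filled, everything after it is empty
csv_text = (
    "level1,Level 1\n"
    "level2,Level 2\n"
    "idx,\n"
    "0,0.1\n"
    "1,0.2\n"
)

df = pd.read_csv(io.StringIO(csv_text), header=[0, 1], index_col=0)
print(df.index.name)  # the index-name row is consumed as the name
print(len(df))        # both data rows are preserved
```

The reported bug is that an all-NaN data row written as "0," is indistinguishable from such an index-name row.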

@rhshadrach
Member

Thanks for investigating here!

Without these lines it would be a trivial fix, as we would not try to read the third line as a header, but I assume this is there for a reason @rhshadrach?

As a next step I would try removing it and see if any tests in pandas/tests/io/csv break. If they do, the way in which they break might indicate a proper approach.
