Inconsistent behaviour of __getitem__ and .loc with slices for DataFrames with MultiIndex columns #26511

plammens · 2019-05-24T17:15:48Z

Code Sample

import numpy as np
import pandas as pd

data = np.zeros((3, 6), dtype=np.int16)
col_index = pd.MultiIndex.from_product([['A', 'B', 'C'], ['t1', 't2']])
df = pd.DataFrame(data, columns=col_index)

>>> df
   A     B     C   
  t1 t2 t1 t2 t1 t2
0  0  0  0  0  0  0
1  0  0  0  0  0  0
2  0  0  0  0  0  0

>>> df['A']
   t1  t2
0   0   0
1   0   0
2   0   0

>>> df[:, 't1']
Traceback (most recent call last):
  File "<ipython-input-20-c0e18fe6137b>", line 1, in <module>
    df[:, 't1']
  File "c:\users\paolo\code\ipython\lib\site-packages\pandas\core\frame.py", line 2926, in __getitem__
    return self._getitem_multilevel(key)
  File "c:\users\paolo\code\ipython\lib\site-packages\pandas\core\frame.py", line 2972, in _getitem_multilevel
    loc = self.columns.get_loc(key)
  File "c:\users\paolo\code\ipython\lib\site-packages\pandas\core\indexes\multi.py", line 2406, in get_loc
    return self._engine.get_loc(key)
  File "pandas\_libs\index.pyx", line 665, in pandas._libs.index.BaseMultiIndexCodesEngine.get_loc
TypeError: '(slice(None, None, None), 't1')' is an invalid key

but

>>> df.loc[:, (slice(None), 't1')]
   A  B  C
  t1 t1 t1
0  0  0  0
1  0  0  0
2  0  0  0

Problem description

Dataframes with MultiIndex columns aren't able to make a "cross section" slice (i.e., selecting all sub-columns of a certain label for every "major" column). I know this behaviour is provided by DataFrame.xs & company, but I don't see why it is beneficial to ignore this inconsistency and make a separate method for the purpose, since DataFrame.__getitem__ already interprets a tuple indexer as a column indexing when columns is a MultiIndex (but it just supports selecting exactly one label for each sub-index).

Expected Output

df[:, 't1'] doesn't throw; it returns a DataFrame with only the t1 sub-columns from columns A, B, and C (like the one returned by df.loc[:, (slice(None), 't1')]).
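
For reference, the spellings that already produce this cross-section can be compared side by side. This is a minimal sketch that reuses the frame from the code sample; `pd.IndexSlice` and `DataFrame.xs(..., drop_level=False)` are existing pandas APIs, while the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

data = np.zeros((3, 6), dtype=np.int16)
col_index = pd.MultiIndex.from_product([['A', 'B', 'C'], ['t1', 't2']])
df = pd.DataFrame(data, columns=col_index)

# Explicit tuple-with-slice form used in the report.
via_loc = df.loc[:, (slice(None), 't1')]

# The same selection written with pd.IndexSlice.
idx = pd.IndexSlice
via_indexslice = df.loc[:, idx[:, 't1']]

# Cross-section on the second column level, keeping that level in the result.
via_xs = df.xs('t1', axis=1, level=1, drop_level=False)

print(via_loc.columns.tolist())
print(via_indexslice.columns.tolist())
print(via_xs.columns.tolist())
```

All three name the level being selected explicitly, which is the piece of information that `df[:, 't1']` would have to infer from the position of `'t1'` in the tuple.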

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: 998a0deea39f11fa06071af77cc1afba65900330
python: 3.7.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.25.0.dev0+605.gb730ab35e
pytest: 4.5.0
pip: 19.1.1
setuptools: 41.0.1
Cython: 0.29.7
numpy: 1.16.3
scipy: 1.3.0
pyarrow: 0.13.0
xarray: 0.12.1
IPython: 7.5.0
sphinx: 2.0.1
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: 1.8.1
bottleneck: 1.2.1
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.1.0
openpyxl: 2.6.2
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.8
lxml.etree: 4.3.3
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.3
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: 0.2.1
fastparquet: 0.3.1
pandas_gbq: None
pandas_datareader: None
gcsfs: None


TannhauserGate42 · 2019-05-26T22:32:45Z

No, the behaviour is not inconsistent!

Try this example as a thought-experiment

col_index = pd.MultiIndex.from_product([['t1', 't2', 't3'], ['t1', 't2']])

What would you then like to get for the following query?

>>> df[:, 't1']

TannhauserGate42 · 2019-05-26T22:38:39Z

You need (!) to specify in some way the level / dimension on the MultiIndex, on which you want to slice.

In my example above, both would work, but with another result:

>>> df[:, 't1']
>>> df[:, (slice(None), 't1')]

And that is how it should be.

Well, the syntax is not nice, but this is another story.

plammens · 2019-05-27T12:28:33Z

> No, the behaviour is not inconsistent!
>
> Try this example as a thought-experiment
> col_index = pd.MultiIndex.from_product([['t1', 't2', 't3'], ['t1', 't2']])
> What would you then like to get for the following query?
> >>> df[:, 't1']

This is exactly the same example as above, just changing the names 'A', 'B', 'C' to 't1', 't2', 't3'. In this case I would expect df[:, 't1'] to return the columns ('t1', 't1'), ('t2', 't1') and ('t3', 't1'), since we're indexing the whole first level of the MultiIndex (with slice(None)) and selecting the 't1' sub-column from each of those. But it raises TypeError.
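
To make the thought experiment concrete, here is a small sketch using `.loc`, which does accept both forms. It assumes the renamed column index from the quoted example; the variable names are illustrative. The position of `'t1'` inside the tuple decides which level it applies to, so the two selections stay distinguishable even though the label appears on both levels.

```python
import numpy as np
import pandas as pd

col_index = pd.MultiIndex.from_product([['t1', 't2', 't3'], ['t1', 't2']])
df = pd.DataFrame(np.zeros((3, 6), dtype=np.int16), columns=col_index)

# Slice the whole first level, pick 't1' on the second level.
second_level_t1 = df.loc[:, (slice(None), 't1')]
print(second_level_t1.columns.tolist())

# For contrast: pick 't1' on the first level, keep everything below it.
first_level_t1 = df.loc[:, ('t1', slice(None))]
print(first_level_t1.columns.tolist())
```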

> In my example above, both would work, but with another result:
> >>> df[:, 't1']
> >>> df[:, (slice(None), 't1')]

No, the first one still raises TypeError. And the second one does too, but I presume you meant df.loc[:, (slice(None), 't1')].

I think I see what you meant, @TannhauserGate42. If df[a, b] (where a and b are some objects) meant indexing the rows of df with a and indexing the columns with b then yes, in your example df[:, 't1'] would be malformed. But the thing is that DataFrame.__getitem__ interprets a tuple indexer as an indexer to the columns (only), when columns is a MultiIndex:

pandas/pandas/core/frame.py

Lines 2862 to 2864 in 0a516c1

    
           # We are left with two options: a single key, and a collection of keys, 
        
           # We interpret tuples as collections only for non-MultiIndex 
        
           is_single_key = isinstance(key, tuple) or not is_list_like(key)

From the docs (I've only been able to find this fragment):

In general, MultiIndex keys take the form of tuples.

Thus df[:, 't1'] is saying "index the first level of columns with slice(None) and the second one with 't1'", and thus should be equivalent to df.loc[:, (slice(None), 't1')] (but it isn't).
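
The tuple-as-column-key reading can be checked for the case that does work, namely one label per level. A minimal sketch, assuming the column layout from the original code sample (the data values are changed to `np.arange` only so the selected column is recognisable):

```python
import numpy as np
import pandas as pd

col_index = pd.MultiIndex.from_product([['A', 'B', 'C'], ['t1', 't2']])
df = pd.DataFrame(np.arange(18).reshape(3, 6), columns=col_index)

# A tuple with one label per level addresses exactly one column;
# nothing in it is treated as a row indexer.
single = df['A', 't2']              # same as df[('A', 't2')]
via_loc = df.loc[:, ('A', 't2')]

print(type(single).__name__)
print(single.equals(via_loc))
```

Here `df[key]` and `df.loc[:, key]` agree. The disagreement reported in this issue only appears once one element of the tuple is a slice.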

TannhauserGate42 · 2019-05-28T07:33:44Z

In any case I overlooked something, as you already noticed ... I should have used .loc[...] in my example. I did not realise that you used both [] and .loc[] in yours.

TannhauserGate42 · 2019-05-28T08:32:09Z

I went through your examples again.

This inconsistency of indexing on the 'columns' and on the 'index' is very unfortunate and already in place for quite a long time. Well, but even knowing that DataFrames are tables and not matrices, I agree with you, that one expects the example above to work. I think the syntax/signatures could show more symmetry.

BTW, this asymmetry/inconsistency was one of our reasons to use dataframes with dimensions (MultiIndex) on the 'index' and scenarios (simulations) on the 'columns' axis in one of our business applications ... (But is a main concern for performance now on the data consumption ... #26502)

TomAugspurger · 2019-05-28T20:37:06Z

@plammens can you double check the output of your first block? I think the test1, test2, etc. should be t1, t2.

DataFrame.__getitem__(key) not being the same as DataFrame.loc[:, key] is probably a bug. Though there may be some subtlety of DataFrame.__getitem__ that I'm missing.

TomAugspurger · 2019-05-28T20:37:47Z

I'm not really sure what would be going wrong, but I'd probably start with a pdb inside DataFrame.__getitem__ and see where things differ from loc.

TannhauserGate42 · 2019-05-29T09:23:34Z

A few notes here, which might help you if you want to fix it ...

As you said, the following does not work:

>>> df[:, (slice(None), 't1')]

In the end it happens, because

>>> slice(None, None, None)

is not hashable. But the main cause might be a wrong case switch already before somewhere.
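
The hashability point can be checked in isolation. The sketch below uses a small hypothetical helper, `can_hash`, that is not part of pandas. Note that the outcome depends on the interpreter: slice objects are unhashable on the Python 3.7 used in this report, but they became hashable in Python 3.12, so no fixed output is shown.

```python
def can_hash(obj) -> bool:
    """Hypothetical helper: return True if hash(obj) succeeds."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


key = (slice(None), 't1')

# A tuple is hashable only if every element is, so the tuple key
# inherits whatever the slice does on the running interpreter.
print(can_hash(slice(None)))
print(can_hash(key))
print(can_hash(('A', 't1')))
```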

Some of the symmetry/asymmetry in success/failure can be deduced from the following examples on the transposed dataframe, i.e. with the MultiIndex on the index axis.

The following works fine

>>> tf = df.transpose()
>>> tf.loc[('A', 't2')]  # works

... while the following breaks

>>> tf = df.transpose()
>>> tf.loc[('A', 't2'), :]  # TypeError

Additionally, the following breaks

>>> tf = df.transpose()
>>> tf.loc[(slice(None), 't2')]  # TypeError

... while the following works fine

>>> tf = df.transpose()
>>> tf.loc[(slice(None), 't2'), :] # works
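
Of the four spellings above, the last one (row key and column key both given explicitly) is the unambiguous form. A minimal sketch of it on the transposed frame, together with the equivalent `pd.IndexSlice` spelling; the behaviour of the other three variants may differ between pandas versions and is not asserted here.

```python
import numpy as np
import pandas as pd

col_index = pd.MultiIndex.from_product([['A', 'B', 'C'], ['t1', 't2']])
df = pd.DataFrame(np.zeros((3, 6), dtype=np.int16), columns=col_index)
tf = df.transpose()

# Row key is a tuple over the MultiIndex levels, column key is a full slice.
rows_t2 = tf.loc[(slice(None), 't2'), :]

# Same selection with pd.IndexSlice.
rows_t2_alt = tf.loc[pd.IndexSlice[:, 't2'], :]

print(rows_t2.index.tolist())
```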

TannhauserGate42 · 2019-05-29T09:25:14Z

Whenever the code runs over

        # We are left with two options: a single key, and a collection of keys,
        # We interpret tuples as collections only for non-MultiIndex
        is_single_key = isinstance(key, tuple) or not is_list_like(key)

... a TypeError is raised.

You might investigate further around line 2850:

        # Do we have a slicer (on rows)?
        indexer = convert_to_index_sliceable(self, key)
        if indexer is not None:
            return self._slice(indexer, axis=0)

Have fun.

plammens · 2019-05-29T10:28:51Z

> @plammens can you double check the output of your first block? I think the test1, test2, etc. should be t1, t2.

Yep, I forgot to edit that.

As for the inconsistencies, are there any reasons why df[key] doesn't delegate to df.loc[:, key] when key is one of the types interpreted as column indexers and to df.loc[key, :] when key is one of the types interpreted as row indexing? (Or iloc if appropriate?) Other cases can be managed separately (as when indexing with a boolean DataFrame), but I don't see why those first two cases are handled separately, causing inconsistencies like this one.
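
The column half of that delegation can be sketched as follows. `select_columns` is a hypothetical helper written for illustration; it is not part of pandas and says nothing about how `DataFrame.__getitem__` is actually implemented.

```python
import numpy as np
import pandas as pd


def select_columns(frame: pd.DataFrame, key):
    """Hypothetical helper: treat ``key`` purely as a column indexer.

    Not a pandas API; it only illustrates delegating to ``.loc``.
    """
    return frame.loc[:, key]


col_index = pd.MultiIndex.from_product([['A', 'B', 'C'], ['t1', 't2']])
df = pd.DataFrame(np.zeros((3, 6), dtype=np.int16), columns=col_index)

# Matches what df['A'] returns today.
whole_group = select_columns(df, 'A')

# What df[:, 't1'] was expected to return in this issue.
cross_section = select_columns(df, (slice(None), 't1'))

print(whole_group.columns.tolist())
print(cross_section.columns.tolist())
```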

TomAugspurger · 2019-05-29T11:09:44Z

> are there any reasons why df[key] doesn't delegate to df.loc[:, key] when key is one of the types interpreted as column indexers and to df.loc[key, :] when key is one of the types interpreted as row indexing?

I'm not sure. Probably just history / edge cases, but hard to say.

JeffMII · 2023-11-29T16:07:42Z

Pandas does a lot of great things, but it also has a lot of inconsistencies. With large and complex datasets, it can be difficult to deal with. Specifically, DataFrame.loc can either result in another DataFrame if there are multiple rows in the result, or a Series if there is only one row. Since we're supposed to use DataFrame.iloc and Series.iloc to access individual rows or Scalars, it causes problems. It would be a lot better if DataFrame.loc always resulted in another DataFrame, Series.loc always resulted in another Series, DataFrame.iloc always resulted in a Series, and Series.iloc always resulted in a Scalar, even if the results only have a length of one respectively.
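
A small sketch of the return types in question, on a made-up frame with a unique index (the column and index names are illustrative). With a unique index, one way to keep the result a DataFrame regardless of how many rows match is to pass a list of labels instead of a scalar label.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]}, index=['a', 'b', 'c'])

row_as_series = df.loc['a']      # scalar label: result drops to a Series
row_as_frame = df.loc[['a']]     # list of labels: result stays a DataFrame
scalar = df.loc['a', 'x']        # scalar row and column labels: a single value

print(type(row_as_series).__name__)
print(type(row_as_frame).__name__)
```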

TomAugspurger added Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex labels May 28, 2019

TomAugspurger added this to the Contributions Welcome milestone May 28, 2019

mroeschke added API - Consistency Internal Consistency of API/Behavior Bug labels Jul 10, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
