Inconsistent behaviour of __getitem__ and .loc with slices for DataFrames with MultiIndex columns #26511

Open
plammens opened this issue May 24, 2019 · 12 comments
Labels
API - Consistency Internal Consistency of API/Behavior Bug Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex

Comments

@plammens
Contributor

plammens commented May 24, 2019

Code Sample

import numpy as np
import pandas as pd

data = np.zeros((3, 6), dtype=np.int16)
col_index = pd.MultiIndex.from_product([['A', 'B', 'C'], ['t1', 't2']])
df = pd.DataFrame(data, columns=col_index)
>>> df
   A     B     C   
  t1 t2 t1 t2 t1 t2
0  0  0  0  0  0  0
1  0  0  0  0  0  0
2  0  0  0  0  0  0
>>> df['A']
   t1  t2
0   0   0
1   0   0
2   0   0
>>> df[:, 't1']
Traceback (most recent call last):
  File "<ipython-input-20-c0e18fe6137b>", line 1, in <module>
    df[:, 't1']
  File "c:\users\paolo\code\ipython\lib\site-packages\pandas\core\frame.py", line 2926, in __getitem__
    return self._getitem_multilevel(key)
  File "c:\users\paolo\code\ipython\lib\site-packages\pandas\core\frame.py", line 2972, in _getitem_multilevel
    loc = self.columns.get_loc(key)
  File "c:\users\paolo\code\ipython\lib\site-packages\pandas\core\indexes\multi.py", line 2406, in get_loc
    return self._engine.get_loc(key)
  File "pandas\_libs\index.pyx", line 665, in pandas._libs.index.BaseMultiIndexCodesEngine.get_loc
TypeError: '(slice(None, None, None), 't1')' is an invalid key

but

>>> df.loc[:, (slice(None), 't1')]
   A  B  C
  t1 t1 t1
0  0  0  0
1  0  0  0
2  0  0  0

Problem description

DataFrames with MultiIndex columns can't take a "cross section" slice through [] (i.e., selecting all sub-columns with a certain label from every "major" column). I know this behaviour is provided by DataFrame.xs & company, but I don't see why it is beneficial to keep this inconsistency and require a separate method for the purpose, since DataFrame.__getitem__ already interprets a tuple indexer as a column indexer when columns is a MultiIndex (it just only supports selecting exactly one label for each level).

Expected Output

df[:, 't1'] doesn't throw; it returns a DataFrame with only the t1 sub-columns from columns A, B, and C (like the one returned by df.loc[:, (slice(None), 't1')]).
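
For reference, a minimal sketch of the spellings that already produce this output. The variable names are illustrative, and the data is changed from zeros to np.arange so that the comparison is meaningful; pd.IndexSlice and DataFrame.xs are existing pandas APIs.

```python
import numpy as np
import pandas as pd

# Same column layout as the report; distinct values instead of zeros (assumption for clarity).
data = np.arange(18).reshape(3, 6)
col_index = pd.MultiIndex.from_product([["A", "B", "C"], ["t1", "t2"]])
df = pd.DataFrame(data, columns=col_index)

# Explicit tuple key on the column axis, as in the report.
via_loc = df.loc[:, (slice(None), "t1")]

# pd.IndexSlice builds the same tuple using slice syntax.
idx = pd.IndexSlice
via_index_slice = df.loc[:, idx[:, "t1"]]

# xs with drop_level=False keeps both column levels in the result.
via_xs = df.xs("t1", axis=1, level=1, drop_level=False)

print(via_loc)
```

All three return the (A, t1), (B, t1), (C, t1) columns; only df[:, 't1'] is missing from the set.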

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: 998a0deea39f11fa06071af77cc1afba65900330
python: 3.7.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.25.0.dev0+605.gb730ab35e
pytest: 4.5.0
pip: 19.1.1
setuptools: 41.0.1
Cython: 0.29.7
numpy: 1.16.3
scipy: 1.3.0
pyarrow: 0.13.0
xarray: 0.12.1
IPython: 7.5.0
sphinx: 2.0.1
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: 1.8.1
bottleneck: 1.2.1
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.1.0
openpyxl: 2.6.2
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.8
lxml.etree: 4.3.3
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.3
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: 0.2.1
fastparquet: 0.3.1
pandas_gbq: None
pandas_datareader: None
gcsfs: None
@TannhauserGate42

No, the behaviour is not inconsistent!

Try this example as a thought-experiment

col_index = pd.MultiIndex.from_product([['t1', 't2', 't3'], ['t1', 't2']])

What would you then like to get for the following query?

>>> df[:, 't1']

@TannhauserGate42

TannhauserGate42 commented May 26, 2019

You need (!) to specify in some way the level / dimension on the MultiIndex, on which you want to slice.

In my example above, both would work, but with another result:

>>> df[:, 't1']
>>> df[:, (slice(None), 't1')]

And that is fine.

Well, the syntax is not nice, but this is another story.

@plammens
Contributor Author

plammens commented May 27, 2019

No, the behaviour is not inconsistent!

Try this example as a thought-experiment

col_index = pd.MultiIndex.from_product([['t1', 't2', 't3'], ['t1', 't2']])

What would you then like to get for the following query?

>>> df[:, 't1']

This is exactly the same example as above, just changing the names 'A', 'B', 'C' to 't1', 't2', 't3'. In this case I would expect df[:, 't1'] to return

  t1 t2 t3
  t1 t1 t1
0  0  0  0
1  0  0  0
2  0  0  0

since we're indexing the whole first level of the MultiIndex (with slice(None)) and selecting the 't1' sub-column from each of those. But it raises TypeError.
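
A small sketch of the thought experiment with the renamed labels, showing that the two levels stay distinguishable as long as the key says which level each part applies to. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product([["t1", "t2", "t3"], ["t1", "t2"]])
df = pd.DataFrame(np.zeros((3, 6), dtype=np.int16), columns=cols)

# A bare label addresses the first level: the whole 't1' group.
top_level = df["t1"]

# A tuple key addresses the levels positionally: everything on level 0, 't1' on level 1.
second_level = df.loc[:, (slice(None), "t1")]

print(list(top_level.columns))
print(list(second_level.columns))
```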

In my example above, both would work, but with another result:

>>> df[:, 't1']
>>> df[:, (slice(None), 't1')]

No, the first one still raises TypeError. And the second one does too, but I presume you meant df.loc[:, (slice(None), 't1')].


I think I see what you meant, @TannhauserGate42. If df[a, b] (where a and b are some objects) meant indexing the rows of df with a and indexing the columns with b then yes, in your example df[:, 't1'] would be malformed. But the thing is that DataFrame.__getitem__ interprets a tuple indexer as an indexer to the columns (only), when columns is a MultiIndex:

pandas/pandas/core/frame.py

Lines 2862 to 2864 in 0a516c1

# We are left with two options: a single key, and a collection of keys,
# We interpret tuples as collections only for non-MultiIndex
is_single_key = isinstance(key, tuple) or not is_list_like(key)

From the docs (I've only been able to find this fragment):

In general, MultiIndex keys take the form of tuples.

Thus df[:, 't1'] is saying "index the first level of columns with slice(None) and the second one with 't1'", and thus should be equivalent to df.loc[:, (slice(None), 't1')] (but it isn't).
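
To make the argument concrete, here is a sketch of what Python actually hands to __getitem__ for this syntax. KeyRecorder is a hypothetical helper class used only for illustration; it is not part of pandas.

```python
import numpy as np
import pandas as pd


class KeyRecorder:
    """Hypothetical helper: returns whatever key the [] syntax passes in."""

    def __getitem__(self, key):
        return key


# obj[:, 't1'] is passed to __getitem__ as a single tuple.
key = KeyRecorder()[:, "t1"]
print(key)

# The very same tuple is accepted as a column key by .loc.
cols = pd.MultiIndex.from_product([["A", "B", "C"], ["t1", "t2"]])
df = pd.DataFrame(np.zeros((3, 6), dtype=np.int16), columns=cols)
selected = df.loc[:, key]
```

So the key DataFrame.__getitem__ receives for df[:, 't1'] is exactly the tuple that works on the column axis of .loc.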

@TannhauserGate42

In any case I overlooked something, as you already noticed ... I should have used .loc[...] in my example. I did not realise that you used both [] and .loc[] in yours.

@TannhauserGate42

I went through your examples again.

This inconsistency between indexing on the 'columns' and on the 'index' is very unfortunate and has been in place for quite a long time. Still, even knowing that DataFrames are tables and not matrices, I agree with you that one expects the example above to work. I think the syntax/signatures could show more symmetry.

BTW, this asymmetry/inconsistency was one of our reasons to use dataframes with dimensions (MultiIndex) on the 'index' and scenarios (simulations) on the 'columns' axis in one of our business applications ... (But it is now a main performance concern on the data consumption side ... #26502)

@TomAugspurger
Contributor

@plammens can you double check the output of your first block? I think the test1, test2, etc. should be t1, t2.

DataFrame.__getitem__(key) not being the same as DataFrame.loc[:, key] is probably a bug. Though there may be some subtlety of DataFrame.__getitem__ that I'm missing.

@TomAugspurger TomAugspurger added Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex labels May 28, 2019
@TomAugspurger TomAugspurger added this to the Contributions Welcome milestone May 28, 2019
@TomAugspurger
Contributor

I'm not really sure what would be going wrong, but I'd probably start with a pdb inside DataFrame.__getitem__ and see where things differ from loc.

@TannhauserGate42

TannhauserGate42 commented May 29, 2019

A few notes here, which might help you if you want to fix it ...

As you said, the following does not work:

>>> df[:, (slice(None), 't1')]

In the end it happens, because

>>> slice(None, None, None)

is not hashable. But the root cause might be a wrong branch taken somewhere earlier.

Some symmetry/asymmetry in success/failure can be deduced from the following examples on the transposed DataFrame, i.e. with the MultiIndex on the index axis.

The following works fine

>>> tf = df.transpose()
>>> tf.loc[('A', 't2')]  # works

... while the following breaks

>>> tf = df.transpose()
>>> tf.loc[('A', 't2'), :]  # TypeError

Additionally, the following breaks

>>> tf = df.transpose()
>>> tf.loc[(slice(None), 't2')]  # TypeError

... while the following works fine

>>> tf = df.transpose()
>>> tf.loc[(slice(None), 't2'), :] # works
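
A self-contained sketch of the two forms reported as working above, on a transposed frame with distinct values (the np.arange data is an assumption for readability). The two forms reported as failing are left out on purpose, since their behaviour may differ between pandas versions.

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product([["A", "B", "C"], ["t1", "t2"]])
df = pd.DataFrame(np.arange(18).reshape(3, 6), columns=cols)
tf = df.transpose()  # the MultiIndex is now on the row axis

# A full key tuple selects one row and returns it as a Series.
row = tf.loc[("A", "t2")]

# A partial slice over the first level, with the column axis given explicitly.
cross_section = tf.loc[(slice(None), "t2"), :]

print(row)
print(cross_section)
```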

@TannhauserGate42

TannhauserGate42 commented May 29, 2019

Whenever the code runs over

        # We are left with two options: a single key, and a collection of keys,
        # We interpret tuples as collections only for non-MultiIndex
        is_single_key = isinstance(key, tuple) or not is_list_like(key)

... a TypeError is raised.

You might investigate around line 2850

        # Do we have a slicer (on rows)?
        indexer = convert_to_index_sliceable(self, key)
        if indexer is not None:
            return self._slice(indexer, axis=0)

further.

Have fun.
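
The is_single_key expression quoted above can be tried in isolation. This is only a sketch: it wraps the quoted expression in a standalone function of the same name and uses the public pandas.api.types.is_list_like, which is assumed here to behave like the internal helper for these inputs.

```python
from pandas.api.types import is_list_like


def is_single_key(key):
    # Same expression as the quoted frame.py line, wrapped for experimentation.
    return isinstance(key, tuple) or not is_list_like(key)


for candidate in ["A", ("A", "t1"), (slice(None), "t1"), ["A", "B"]]:
    print(repr(candidate), is_single_key(candidate))
```

Any tuple, including one that contains a slice, is classified as a single key, which is why (slice(None), 't1') ends up on the single-label lookup path.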

@plammens
Contributor Author

@plammens can you double check the output of your first block? I think the test1, test2, etc. should be t1, t2.

Yep, I forgot to edit that.

As for the inconsistencies, are there any reasons why df[key] doesn't delegate to df.loc[:, key] when key is one of the types interpreted as column indexers and to df.loc[key, :] when key is one of the types interpreted as row indexing? (Or iloc if appropriate?) Other cases can be managed separately (as when indexing with a boolean DataFrame), but I don't see why those first two cases are handled separately, causing inconsistencies like this one.
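
A rough sketch of what the column half of that delegation could look like. column_getitem_via_loc is a hypothetical helper written for this comment, not how pandas implements __getitem__, and it deliberately ignores row slices, boolean masks and the other special cases.

```python
import numpy as np
import pandas as pd


def column_getitem_via_loc(frame, key):
    """Hypothetical helper: treat key purely as a column indexer."""
    return frame.loc[:, key]


cols = pd.MultiIndex.from_product([["A", "B", "C"], ["t1", "t2"]])
df = pd.DataFrame(np.arange(18).reshape(3, 6), columns=cols)

# Keys that [] already accepts give the same result through the helper.
group = column_getitem_via_loc(df, "A")
single = column_getitem_via_loc(df, ("A", "t1"))

# The key from this issue also goes through.
cross_section = column_getitem_via_loc(df, (slice(None), "t1"))
print(cross_section)
```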

@TomAugspurger
Contributor

are there any reasons why df[key] doesn't delegate to df.loc[:, key] when key is one of the types interpreted as column indexers and to df.loc[key, :] when key is one of the types interpreted as row indexing?

I'm not sure. Probably just history / edge cases, but hard to say.

@mroeschke mroeschke added API - Consistency Internal Consistency of API/Behavior Bug labels Jul 10, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@JeffMII

JeffMII commented Nov 29, 2023

Pandas does a lot of great things, but it also has a lot of inconsistencies. With large and complex datasets, it can be difficult to deal with. Specifically, DataFrame.loc can either result in another DataFrame if there are multiple rows in the result, or a Series if there is only one row. Since we're supposed to use DataFrame.iloc and Series.iloc to access individual rows or Scalars, it causes problems. It would be a lot better if DataFrame.loc always resulted in another DataFrame, Series.loc always resulted in another Series, DataFrame.iloc always resulted in a Series, and Series.iloc always resulted in a Scalar, even if the results only have a length of one respectively.
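
As a side note, a sketch of the dimensionality behaviour described here, on a small made-up frame. In this example the return type follows the kind of key (scalar label versus list or slice), so wrapping the label in a list keeps the result a DataFrame.

```python
import pandas as pd

# Made-up example data.
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

as_series = df.loc[0]         # scalar label: the row comes back as a Series
as_frame = df.loc[[0]]        # list of one label: a one-row DataFrame
as_frame_slice = df.loc[0:0]  # label slice: also a one-row DataFrame

print(type(as_series).__name__, type(as_frame).__name__)
```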


5 participants