Inconsistent behaviour of __getitem__ and .loc with slices for DataFrames with MultiIndex columns #26511
Comments
No, the behaviour is not inconsistent! Try this example as a thought experiment:
>>> col_index = pd.MultiIndex.from_product([['t1', 't2', 't3'], ['t1', 't2']])
What would you then like to get for the following query?
>>> df[:, 't1']
|
You need (!) to specify in some way the level / dimension of the MultiIndex on which you want to slice. In my example above, both would work, but with different results:
>>> df[:, 't1']
>>> df[:, (slice(None), 't1')]
And that is a good thing. The syntax is not nice, but that is another story. |
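The ambiguity described in this comment can be sketched as follows. This is a minimal illustration, not the commenter's own code: the data values are made up, and it uses .loc rather than plain [] because a later comment in the thread notes that .loc was what the example intended.

```python
import numpy as np
import pandas as pd

# Column labels from the thought experiment: 't1' appears on BOTH levels.
col_index = pd.MultiIndex.from_product([["t1", "t2", "t3"], ["t1", "t2"]])
# Illustrative data only (assumption): 2 rows, values 0..11.
df = pd.DataFrame(np.arange(12).reshape(2, 6), columns=col_index)

# A bare label is matched against the first level; that level is then dropped.
first_level = df.loc[:, "t1"]

# A tuple with slice(None) in the first position matches on the second level
# and keeps the full MultiIndex on the columns.
second_level = df.loc[:, (slice(None), "t1")]

print(list(first_level.columns))
print(list(second_level.columns))
```

The two selections pick different columns, which is why the level has to be stated explicitly in some way.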
This is exactly the same example as above, just changing the names
since we're indexing the whole first level of the MultiIndex (with
No, the first one still raises I think I see what you meant, @TannhauserGate42. If Lines 2862 to 2864 in 0a516c1
From the docs (I've only been able to find this fragment):
Thus |
In any case I overlooked something, as you already noticed ... I should have used .loc[...] in my example. I did not realise that you used both [] and .loc[] in yours. |
I went through your examples again. This inconsistency between indexing on the 'columns' and on the 'index' is very unfortunate and has been in place for quite a long time. Even knowing that DataFrames are tables and not matrices, I agree with you that one expects the example above to work. I think the syntax/signatures could show more symmetry. BTW, this asymmetry/inconsistency was one of our reasons to use dataframes with dimensions (MultiIndex) on the 'index' axis and scenarios (simulations) on the 'columns' axis in one of our business applications ... (But it is now a main concern for performance on the data consumption side ... #26502) |
@plammens can you double check the output of your first block? I think the
|
I'm not really sure what would be going wrong, but I'd probably start with a |
A few notes here, which might help you if you want to fix it ... As you said, the following does not work:
>>> df[:, (slice(None), 't1')]
In the end it happens because
is not hashable. But the main cause might be a wrong case switch somewhere earlier. You can deduce some of the symmetry/asymmetry in success and failure from the following examples on the transposed dataframe, i.e. with the MultiIndex on the index axis. The following works fine:
>>> tf = df.transpose()
>>> tf.loc[('A', 't2')] # works
... while the following breaks:
>>> tf = df.transpose()
>>> tf.loc[('A', 't2'), :] # TypeError
Additionally, the following breaks:
>>> tf = df.transpose()
>>> tf.loc[(slice(None), 't2')] # TypeError
... while the following works fine:
>>> tf = df.transpose()
>>> tf.loc[(slice(None), 't2'), :] # works |
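For readers who want to reproduce the row-axis cases, here is a minimal sketch restricted to the two forms reported above as working, plus two equivalent spellings (pd.IndexSlice and DataFrame.xs). The original code sample did not survive in this thread, so the frame below is an assumption: columns A, B, C crossed with t1, t2 and made-up values. The forms reported as raising TypeError are left out on purpose, since their behaviour may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Assumed stand-in for the issue's df (the original sample is not available).
cols = pd.MultiIndex.from_product([["A", "B", "C"], ["t1", "t2"]])
df = pd.DataFrame(np.arange(12).reshape(2, 6), columns=cols)
tf = df.transpose()  # MultiIndex is now on the row axis

# A complete key selects exactly one row and returns it as a Series.
one_row = tf.loc[("A", "t2")]

# Cross-section on the second level, with the column axis given explicitly.
by_slice = tf.loc[(slice(None), "t2"), :]

# pd.IndexSlice builds the same tuple with ordinary slice syntax.
by_index_slice = tf.loc[pd.IndexSlice[:, "t2"], :]

# DataFrame.xs expresses the same selection by naming the level.
by_xs = tf.xs("t2", level=1, drop_level=False)
```

Spelling out both axes in .loc avoids any doubt about whether the tuple is a single MultiIndex key or a (rows, columns) pair.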
Whenever the code runs over
... a TypeError is raised. You might investigate in 2850
further. Have fun. |
Yep, I forgot to edit that. As for the inconsistencies, are there any reasons why |
I'm not sure. Probably just history / edge cases, but hard to say. |
Pandas does a lot of great things, but it also has a lot of inconsistencies. With large and complex datasets, it can be difficult to deal with. Specifically, DataFrame.loc can either result in another DataFrame if there are multiple rows in the result, or a Series if there is only one row. Since we're supposed to use DataFrame.iloc and Series.iloc to access individual rows or Scalars, it causes problems. It would be a lot better if DataFrame.loc always resulted in another DataFrame, Series.loc always resulted in another Series, DataFrame.iloc always resulted in a Series, and Series.iloc always resulted in a Scalar, even if the results only have a length of one respectively. |
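The return-type behaviour described in this comment can be seen in a small sketch. The frame and its labels are invented for illustration; the point is only that a scalar label gives a Series when it matches one row and a DataFrame when it matches several, whereas a list of labels always gives a DataFrame.

```python
import pandas as pd

# Invented example frame (assumption): label 'a' is duplicated, 'b' is unique.
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}, index=["a", "a", "b"])

dup = df.loc["a"]         # label matches two rows
single = df.loc["b"]      # label matches one row
as_frame = df.loc[["b"]]  # list indexer, even with a single label

print(type(dup).__name__, type(single).__name__, type(as_frame).__name__)
```

Passing a list such as ["b"] is one way to get a stable DataFrame result when the number of matching rows is not known in advance.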
Code Sample
but
Problem description
Dataframes with MultiIndex columns aren't able to make a "cross section" slice (i.e., selecting all sub-columns of a certain label for every "major" column). I know this behaviour is provided by DataFrame.xs & company, but I don't see why it is beneficial to ignore this inconsistency and make a separate method for the purpose, since DataFrame.__getitem__ already interprets a tuple indexer as a column indexing when columns is a MultiIndex (but it just supports selecting exactly one label for each sub-index).
Expected Output
df[:, 't1'] doesn't throw; it returns a DataFrame with only the t1 sub-columns from columns A, B, and C (like the one returned by df.loc[:, (slice(None), 't1')]).
Output of pd.show_versions()