Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: pd.merge using "inner" with multi-index doesn't preserve left index order #55774

Closed
2 of 3 tasks
f-combes opened this issue Oct 31, 2023 · 2 comments
Closed
2 of 3 tasks
Labels
Bug Closing Candidate May be closeable, needs more eyeballs Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@f-combes
Copy link

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

left = pd.DataFrame(
    {
        "digits": [1, 2, 3, 1],
        "lowercase": ["a", "v", "f", "g"],
        "uppercase": ["A", "B", "B", "A"],
        "left": [100_000, 100_000, 100_000, 30_000],
    },
).set_index(["digits", "lowercase", "uppercase"])


right = pd.DataFrame(
    {
        "uppercase": ["A", "B", "B", "D"],
        "digits": [1, 2, 3, 5],
        "right": [20_000, 70_000, 100_000, 500],
    },
).set_index(["uppercase", "digits"]).squeeze()

result = pd.merge(left, right, how="inner", left_index=True, right_index=True)

pd.testing.assert_index_equal(left.index, result.index)

Output:
> AssertionError: MultiIndex level [0] are different

MultiIndex level [0] values are different (75.0 %)
[left]:  Index([1, 2, 3, 1], dtype='int64', name='digits')
[right]: Index([1, 1, 2, 3], dtype='int64', name='digits')

Issue Description

As the documentation states:

inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.

Performing an inner merge should preserve the order of the left keys. In the above example, the left multi-index should be preserved - it is instead sorted.

In a similar fashion, even though align states that the left key order is preserved when doing inner join, it is not. However, aligning twice will result in the correct order, e.g.

_, aligned_right_once = left.align(right, join="inner", axis=0)
_, aligned_right_twice = left.align(aligned_right_once, join="inner", axis=0)

pd.testing.assert_index_equal(left.index, aligned_right_twice.index) # passes
pd.testing.assert_index_equal(left.index, aligned_right_once.index) # raises 

which seems fairly inconsistent -

Potentially relatable issues/PR

Expected Behavior

The key order of the result should the the same as the left dataframe in the example.

Installed Versions

INSTALLED VERSIONS

commit : ba1cccd
python : 3.10.12.final.0
python-bits : 64
OS : Linux
OS-release : 6.2.0-35-generic
Version : #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Oct 6 10:23:26 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.0
numpy : 1.23.5
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 67.4.0
pip : 23.3.1
Cython : None
pytest : 7.4.0
hypothesis : 6.81.1
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.15.0
pandas_datareader : None
bs4 : 4.11.1
bottleneck : None
dataframe-api-compat: None
fastparquet : 2023.8.0
fsspec : 2023.9.1
gcsfs : 2023.9.1
matplotlib : 3.6.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.3
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

@f-combes f-combes added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 31, 2023
@paulreece
Copy link
Contributor

This is error doesn't occur on main but it does occur on 2.1.x. Possibly a candidate for a backport?

@paulreece paulreece added Index Related to the Index class or subclasses DataFrame DataFrame data structure and removed Needs Triage Issue that has not been reviewed by a pandas team member Index Related to the Index class or subclasses labels Oct 31, 2023
@rhshadrach rhshadrach added Reshaping Concat, Merge/Join, Stack/Unstack, Explode and removed DataFrame DataFrame data structure labels Nov 1, 2023
@lukemanley
Copy link
Member

This was fixed in the 2.2 cycle and is mentioned in the notable section of the 2.2 release notes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Closing Candidate May be closeable, needs more eyeballs Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

No branches or pull requests

5 participants