Commit
[SPARK-48045][PYTHON] Pandas API groupby with multi-agg-relabel ignores as_index=False

### What changes were proposed in this pull request?

When GroupBy is used in the pandas API on Spark with relabeled aggregate columns and `as_index=False`, the grouping columns are not returned in the DataFrame. This change fixes that bug.

Example:

    ps.DataFrame({"a": [0, 0], "b": [0, 1]}).groupby("a", as_index=False).agg(b_max=("b", "max"))

Result:

    _  b_max
    0      1

Required Result:

    _  a  b_max
    0  0      1

### Why are the changes needed?

The relabeling code uses only the aggregate columns. When `as_index=True`, this is not an issue because the grouping columns are carried in the index. When `as_index=False`, the grouping columns must be added to the columns handled by the relabeling code. See the commits in the PR for the code changes.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

- Passed GA
- Passed Build tests
- Unit tested, including scenarios in addition to the one provided in the Jira ticket

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#46391 from sinaiamonkar-sai/SPARK-48045-2.

Authored-by: sai <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
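To make the described behavior concrete, the following is a minimal pure-Python sketch of the relabeling logic. It is a toy model, not the code from this patch: the function `grouped_agg`, its parameters, and the dict-based row representation are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def grouped_agg(rows, key, as_index, **named_aggs):
    """Toy model of groupby(key, as_index=...).agg(label=(column, func)).

    Hypothetical helper for illustration; not the pandas-on-Spark code.
    """
    funcs = {"max": max, "min": min, "sum": sum}

    # Bucket the input rows by the value of the grouping column.
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    indexed = {}
    flat = []
    for key_value, members in sorted(groups.items()):
        # Relabeling step: builds the aggregate columns only.
        record = {
            label: funcs[func]([m[column] for m in members])
            for label, (column, func) in named_aggs.items()
        }
        # as_index=True: the grouping value lives in the index.
        indexed[key_value] = record
        # as_index=False: the grouping column must be added back
        # in front of the relabeled columns.
        flat.append({key: key_value, **record})

    return indexed if as_index else flat


rows = [{"a": 0, "b": 0}, {"a": 0, "b": 1}]
result = grouped_agg(rows, "a", as_index=False, b_max=("b", "max"))
print(result)  # [{'a': 0, 'b_max': 1}]
```

With `as_index=True` the same call returns `{0: {'b_max': 1}}`, so the grouping value is still recoverable from the index. The bug described above corresponds to returning only the relabeled columns in the `as_index=False` case.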