BUG: Automatic data alignment fails in the `transform` method of `GroupBy` objects #41550

gdudziuk · 2021-05-18T20:12:37Z

I have checked that this issue has not already been reported. (... or at least I couldn't find an issue like this)
I have confirmed this bug exists on the latest version of pandas. (which is 1.2.4 at the moment)
(optional) I have confirmed this bug exists on the master branch of pandas.

Example data and intended output

Let's create some synthetic data to illustrate the issue:

import pandas as pd

df = pd.DataFrame({
    'household_id': ['b80', 'b90', 'a70', 'c60', 'a50', 'X13', 'X12', 'X11', 'XYZ'],
    'district': 5 * ['Neverland'] + 4 * ['Narnia'],
    'income': [1200, 2000, 1000, 900, 1200, 1500, 800, 900, 850],
})
df = df.set_index('household_id')

               district  income
household_id                   
b80           Neverland    1200
b90           Neverland    2000
a70           Neverland    1000
c60           Neverland     900
a50           Neverland    1200
X13              Narnia    1500
X12              Narnia     800
X11              Narnia     900
XYZ              Narnia     850

Suppose we want to add some money to the poorest households. We have $600 per district to distribute, and we decide that this money will be divided equally among all households in the district with an income of $1000 or below. So our expected result is:

               district  income
household_id                   
b80           Neverland    1200
b90           Neverland    2000
a70           Neverland    1300
c60           Neverland    1200
a50           Neverland    1200
X13              Narnia    1500
X12              Narnia    1000
X11              Narnia    1100
XYZ              Narnia    1050
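
For reference, the expected result can also be computed without passing a custom function to transform. The sketch below is an alternative formulation, not the code from this report: it counts the poor households per district with the built-in 'sum' transform and then adds each household's share directly. The names `poor`, `n_poor`, `share`, and `expected` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'household_id': ['b80', 'b90', 'a70', 'c60', 'a50', 'X13', 'X12', 'X11', 'XYZ'],
    'district': 5 * ['Neverland'] + 4 * ['Narnia'],
    'income': [1200, 2000, 1000, 900, 1200, 1500, 800, 900, 850],
}).set_index('household_id')

money_to_share = 600

# Boolean mask of households that qualify for the payment.
poor = df['income'] <= 1000

# Number of qualifying households in each household's own district,
# broadcast back to one value per row.
n_poor = poor.astype('int64').groupby(df['district']).transform('sum')

# Each qualifying household gets an equal share; everyone else gets 0.
share = (money_to_share / n_poor).where(poor, 0)

expected = df['income'] + share
print(expected)
```

Because every intermediate Series keeps the index of `df`, no reordering can occur. The result is a float Series because of the division.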

Code sample and actual output

We will try to achieve the described goal by using groupby followed by transform:

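def add_income_to_poor_households(households):
    # `households` is the income Series of a single district, indexed by household_id.
    money_to_share = 600
    # Households with an income of 1000 or below share the money equally.
    poor_households_ids = households.index[households <= 1000]
    paylist = pd.Series(money_to_share / len(poor_households_ids), index=poor_households_ids)
    # Households missing from `paylist` receive nothing (fill_value=0).
    return households.add(paylist, fill_value=0)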

grouped = df.groupby('district')['income']
df['income'] = grouped.transform(add_income_to_poor_households)

After executing the above code df looks like this:

               district  income
household_id                   
b80           Neverland    1200
b90           Neverland    1300
a70           Neverland    1200
c60           Neverland    2000
a50           Neverland    1200
X13              Narnia    1100
X12              Narnia    1000
X11              Narnia    1500
XYZ              Narnia    1050

which doesn't match the expected output given above.

Problem description

The output obtained with the sample code doesn't match the expected output. It's unclear to me which of the assumptions specified in the documentation of SeriesGroupBy.transform are violated by the above code. I think none are, so I figure it's a bug.

To gain some insight into the root of the problem, note that the series returned by add_income_to_poor_households are ordered differently from the original ones. Please execute this on the original df (before modifying it with transform):

households = df[df['district']=='Neverland']['income']

Now households looks like this:

household_id
b80    1200
b90    2000
a70    1000
c60     900
a50    1200
Name: income, dtype: int64

But after calling add_income_to_poor_households the order of the index changes:

add_income_to_poor_households(households)

household_id
a50    1200.0
a70    1300.0
b80    1200.0
b90    2000.0
c60    1200.0
dtype: float64

So possibly the transform method fails to align the output of the inner function add_income_to_poor_households to the original indices of the data frame subject to grouping.
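
The reordering shown above comes from `households.add(paylist, fill_value=0)`: the two operands have different indexes, and the result is indexed by their union, which comes back sorted here. A minimal workaround sketch, assuming it is acceptable to restore the group's original order inside the function, is to reindex the result before returning it. The name `add_income_aligned` is hypothetical and not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({
    'household_id': ['b80', 'b90', 'a70', 'c60', 'a50', 'X13', 'X12', 'X11', 'XYZ'],
    'district': 5 * ['Neverland'] + 4 * ['Narnia'],
    'income': [1200, 2000, 1000, 900, 1200, 1500, 800, 900, 850],
}).set_index('household_id')


def add_income_aligned(households):
    # Hypothetical variant of add_income_to_poor_households.
    money_to_share = 600
    poor_households_ids = households[households <= 1000].index
    paylist = pd.Series(money_to_share / len(poor_households_ids),
                        index=poor_households_ids)
    # Series.add may return the union of both indexes in sorted order,
    # so put the rows back into the group's original order.
    return households.add(paylist, fill_value=0).reindex(households.index)


result = df.groupby('district')['income'].transform(add_income_aligned)
print(result)
```

With the explicit `reindex`, the function returns a like-indexed Series in the same order as its input, so the result does not depend on whether transform itself aligns the output.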

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 2cb9652
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-73-generic
Version : #82-Ubuntu SMP Wed Apr 14 17:39:42 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : pl_PL.UTF-8
LOCALE : pl_PL.UTF-8

pandas : 1.2.4
numpy : 1.20.2
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 54.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.22.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : None
numba : None

phofl · 2021-05-18T21:37:52Z

Hi, thanks for your report.

I am not sure whether transform should align the output. Did you find any reference to this in the docs? What would you expect if your function changed the index to a RangeIndex, for example?

gdudziuk · 2021-05-18T21:57:37Z

Good point - I don't know what I would expect. But is that an intended use of transform? According to the docs, transform calls a function "producing a like-indexed Series on each group and return a Series having the same indexes as the original object filled with the transformed values". It sounds like changing the index e.g. to a RangeIndex shouldn't be allowed here.

True, I have found no explicit reference that transform should align the output. But if the output should have "the same index as the original object" then it is natural to expect the alignment. Especially since the pandas guide stresses strongly that in pandas "data alignment is intrinsic".

jreback · 2021-05-18T22:11:32Z

transform expects a reduction operation - when it's not, I don't know if there are any expectations

gdudziuk · 2021-05-19T10:53:39Z

Don't you mean aggregate? Where do the docs state that transform carries out reduction operations?

rhshadrach · 2023-04-23T16:06:30Z

transform now aligns to the input object and produces the expected output. However, unlike the expected output in the OP, the result is a float because of the division. I think this should be expected.

That said, I think the fact that transform aligns should be added to the docs.

akollikonda · 2023-04-26T20:15:31Z

take

gdudziuk · 2023-09-23T14:55:13Z

Of course, the floating point output is reasonable for the particular example in the initial post. For me, transform works as expected now, but as I understand it, expanding the docs is still planned, so I am leaving the issue open.

TheMellyBee · 2023-11-28T19:54:08Z

What does this need on the docs side? :) I'm happy to make the change

rhshadrach · 2023-11-28T21:39:57Z

@TheMellyBee - looks like this is already in the docstring of transform and the User Guide:

I don't think there is anything further to do. Closing.

gdudziuk added the Bug and Needs Triage labels on May 18, 2021

mroeschke added the Apply, Groupby, and Needs Discussion labels and removed the Needs Triage label on Aug 20, 2021

rhshadrach added the Docs and good first issue labels and removed the Bug and Needs Discussion labels on Apr 23, 2023

github-actions bot assigned akollikonda Apr 26, 2023

rhshadrach closed this as completed Nov 28, 2023
