BUG: Automatic data alignment fails in the transform method of GroupBy objects #41550

Closed
2 of 3 tasks
gdudziuk opened this issue May 18, 2021 · 9 comments
Labels: Apply, Docs, good first issue, Groupby

Comments

@gdudziuk

gdudziuk commented May 18, 2021

  • I have checked that this issue has not already been reported. (... or at least I couldn't find an issue like this)

  • I have confirmed this bug exists on the latest version of pandas. (which is 1.2.4 at the moment)

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Example data and intended output

Let's create some synthetic data to illustrate the issue:

import pandas as pd

df = pd.DataFrame({
    'household_id': ['b80', 'b90', 'a70', 'c60', 'a50', 'X13', 'X12', 'X11', 'XYZ'],
    'district': 5*['Neverland'] + 4*['Narnia'],
    'income': [1200, 2000, 1000, 900, 1200, 1500, 800, 900, 850],
})
df = df.set_index('household_id')
               district  income
household_id                   
b80           Neverland    1200
b90           Neverland    2000
a70           Neverland    1000
c60           Neverland     900
a50           Neverland    1200
X13              Narnia    1500
X12              Narnia     800
X11              Narnia     900
XYZ              Narnia     850

Suppose we want to give some money to the poorest households. Each district has $600 to distribute, and we decide to divide it equally among all households in the district that have an income of $1000 or below. So, our expected result is:

               district  income
household_id                   
b80           Neverland    1200
b90           Neverland    2000
a70           Neverland    1300
c60           Neverland    1200
a50           Neverland    1200
X13              Narnia    1500
X12              Narnia    1000
X11              Narnia    1100
XYZ              Narnia    1050

Code sample and actual output

We will try to achieve the described goal by using groupby followed by transform:

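def add_income_to_poor_households(households):
    # Amount available to each district (i.e. to each group).
    money_to_share = 600
    # Index labels of the households with an income of 1000 or below.
    poor_households_ids = households.index[households <= 1000]
    # Equal share for each poor household, indexed by household ID.
    paylist = pd.Series(money_to_share / len(poor_households_ids), index=poor_households_ids)
    # Add the shares; households absent from paylist receive 0.
    return households.add(paylist, fill_value=0)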

grouped = df.groupby('district')['income']
df['income'] = grouped.transform(add_income_to_poor_households)

After executing the above code, df looks like this:

               district  income
household_id                   
b80           Neverland    1200
b90           Neverland    1300
a70           Neverland    1200
c60           Neverland    2000
a50           Neverland    1200
X13              Narnia    1100
X12              Narnia    1000
X11              Narnia    1500
XYZ              Narnia    1050

which doesn't match the expected output given above.
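For comparison, here is a minimal sketch of a variant that never changes the order of the group's index, so it does not depend on whether transform aligns the result. The function name add_income_order_preserving and the threshold parameter are made up for this illustration and are not part of the original report; it uses Series.where instead of Series.add.

```python
import pandas as pd

df = pd.DataFrame({
    'household_id': ['b80', 'b90', 'a70', 'c60', 'a50', 'X13', 'X12', 'X11', 'XYZ'],
    'district': 5*['Neverland'] + 4*['Narnia'],
    'income': [1200, 2000, 1000, 900, 1200, 1500, 800, 900, 850],
}).set_index('household_id')


def add_income_order_preserving(households, money_to_share=600, threshold=1000):
    # Hypothetical helper: boolean mask of the households that qualify.
    is_poor = households <= threshold
    n_poor = is_poor.sum()
    if n_poor == 0:
        # Nobody qualifies in this group, so nothing is distributed.
        return households
    # Keep the value where the household is not poor; otherwise add the share.
    # The index is never reordered, so the result is like-indexed.
    return households.where(~is_poor, households + money_to_share / n_poor)


result = df.groupby('district')['income'].transform(add_income_order_preserving)
print(result)
```

With the sample data this should reproduce the expected incomes from the section above, as floats rather than integers because of the division.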

Problem description

The output obtained with the sample code doesn't match the expected output. It's unclear to me which of the assumptions stated in the documentation of SeriesGroupBy.transform the above code violates. I believe it violates none, so I suspect this is a bug.

To gain some insight into the root of the problem, note that the series returned by add_income_to_poor_households is ordered differently from the original one. To see this, run the following on the original df (before modifying it with transform):

households = df[df['district']=='Neverland']['income']

Now households looks like this:

household_id
b80    1200
b90    2000
a70    1000
c60     900
a50    1200
Name: income, dtype: int64

But after calling add_income_to_poor_households the order of the index changes:

add_income_to_poor_households(households)
household_id
a50    1200.0
a70    1300.0
b80    1200.0
b90    2000.0
c60    1200.0
dtype: float64

So possibly the transform method fails to align the output of the inner function add_income_to_poor_households to the original indices of the data frame subject to grouping.
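As a possible workaround on affected versions, the inner function can restore the original order itself before returning. The sketch below assumes the reordering comes from Series.add combining two differently indexed series, as the output above suggests; the name add_income_keep_order is hypothetical.

```python
import pandas as pd

households = pd.Series(
    [1200, 2000, 1000, 900, 1200],
    index=pd.Index(['b80', 'b90', 'a70', 'c60', 'a50'], name='household_id'),
    name='income',
)


def add_income_keep_order(households, money_to_share=600):
    # Assumes at least one household qualifies; otherwise this divides by zero.
    poor_ids = households.index[households <= 1000]
    paylist = pd.Series(money_to_share / len(poor_ids), index=poor_ids)
    # Adding two differently indexed series may return a reordered index.
    topped_up = households.add(paylist, fill_value=0)
    # Put the rows back into the order of the input group.
    return topped_up.reindex(households.index)


restored = add_income_keep_order(households)
print(restored)
```

Because the returned series has exactly the index of the input group, the values land on the right rows even if transform assigns them positionally.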

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 2cb9652
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-73-generic
Version : #82-Ubuntu SMP Wed Apr 14 17:39:42 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : pl_PL.UTF-8
LOCALE : pl_PL.UTF-8

pandas : 1.2.4
numpy : 1.20.2
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 54.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.22.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : None
numba : None

@gdudziuk added the Bug and Needs Triage labels on May 18, 2021
@phofl
Member

phofl commented May 18, 2021

Hi, thanks for your report.

I am not sure whether transform should align the output. Did you find any reference to this in the docs? For example, what would you expect if your function changed the index to a RangeIndex?

@gdudziuk
Author

gdudziuk commented May 18, 2021

Good point - I don't know what I would expect. But is that an intended use of transform? According to the docs, transform calls a function "producing a like-indexed Series on each group and return a Series having the same indexes as the original object filled with the transformed values". It sounds like changing the index e.g. to a RangeIndex shouldn't be allowed here.

True, I have found no explicit reference that transform should align the output. But if the output should have "the same index as the original object" then it is natural to expect the alignment. Especially since the pandas guide stresses strongly that in pandas "data alignment is intrinsic".

@jreback
Contributor

jreback commented May 18, 2021

transform expects a reduction operation - when it's not, I don't know if there are any expectations

@gdudziuk
Author

Don't you mean aggregate? Where do the docs state that transform carries out reduction operations?

@mroeschke added the Apply, Groupby, and Needs Discussion labels and removed the Needs Triage label on Aug 20, 2021
@rhshadrach
Member

transform now aligns to the input object and produces the expected output. However, unlike the expected output in the OP, the output is a float because of the division. I think this should be expected.

However, I think the fact that transform aligns its output should be added to the docs.
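To illustrate the dtype remark, here is a small sketch with two of the Neverland incomes from the OP. It only shows that true division yields a float result and how one might cast back; whether casting is appropriate is an assumption that depends on the data.

```python
import pandas as pd

incomes = pd.Series([1000, 900], index=['a70', 'c60'], name='income')

# True division always produces a float, even when the result is whole.
share = 600 / len(incomes)
topped_up = incomes + share
print(topped_up.dtype)

# Casting back truncates any fractional part, so this is only safe
# when the money divides evenly among the households.
as_int = topped_up.astype('int64')
print(as_int)
```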

@rhshadrach added the Docs and good first issue labels and removed the Bug and Needs Discussion labels on Apr 23, 2023
@akollikonda

take

@gdudziuk
Author

Of course, the floating point output is reasonable for the particular example in the initial post. For me, transform works as expected now, but as I understand it, expanding the docs is still planned, so I'll leave the issue open.

@TheMellyBee

What does this need on the docs side? :) I'm happy to make the change

@rhshadrach
Member

@TheMellyBee - looks like this is already in the docstring of transform and the User Guide:

I don't think there is anything further to do. Closing.
