
skipna added to groupby numeric ops #15772

Closed
wants to merge 1 commit into from

Conversation


@mayukh18 mayukh18 commented Mar 21, 2017

Contributor

@jreback jreback left a comment


tests look good! are you comfortable tackling the cython updating?

if _convert:
    result = result._convert(datetime=True)
return result

Contributor

hmm, I think we actually should add a parameter to the cython routines. it would go in pandas/_libs/algos_groupby_helper.pxi.in; you can call it skipna.
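For orientation, here is a pure-Python sketch of what such a grouped-sum kernel could look like with the proposed flag. The function name mirrors the cython `group_add` loosely; the shapes, the poisoning strategy for skipna=False, and all names are illustrative, not the pandas code.

```python
import numpy as np

def group_add(out, counts, values, labels, skipna=True):
    # Pure-Python sketch of a grouped-sum kernel with a skipna flag.
    # out: (ngroups, K) result; counts: (ngroups,); values: (N, K);
    # labels: (N,) group index per row (-1 means: skip this row).
    N, K = values.shape
    nobs = np.zeros_like(out)
    sumx = np.zeros_like(out)
    for i in range(N):
        lab = labels[i]
        if lab < 0:
            continue
        counts[lab] += 1
        for j in range(K):
            val = values[i, j]
            if val == val:                # not NaN (floats only)
                nobs[lab, j] += 1
                sumx[lab, j] += val
            elif not skipna:
                # skipna=False: one NaN poisons the whole group,
                # since NaN + anything stays NaN from here on
                sumx[lab, j] = np.nan
                nobs[lab, j] += 1
    for g in range(out.shape[0]):
        for j in range(K):
            out[g, j] = sumx[g, j] if nobs[g, j] > 0 else np.nan
```

With labels [0, 0, 1, 1] and values [[1.0], [nan], [3.0], [1.0]], skipna=True gives [1.0, 4.0] while skipna=False gives [nan, 4.0].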

Author

Yes, that's what I was thinking when going through the code for the first time. Doing this outside the cython modules will make it more cluttered. I'll take a bit of time and figure out the cython updating.

Contributor

great!


mayukh18 commented Mar 26, 2017

Added skipna to the cython routines. Tried to support skipna for the numeric_only=False scenario, especially for datetimes, but it breaks other things most of the time.

for i in range(N):
    counts[lab] += 1
    for j in range(K):
        val = values[i, j]
Contributor

so I think you are better off making only a single block (like it was),

then checking the flag like this (around lines 69-70)

(rename skipna -> checknull as that is more consistent with what we use)

if checknull and val == val:

separately, you can pass this flag (see cummin) as well (so you are adding 2 args)
also have a look here
https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/groupby_helper.pxi.in#L597

to deal with datetimelikes (basically this is how we check for nulls there, the val==val trick only works for floats).
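To illustrate the point about null checks (a standalone snippet, not part of the diff): NaN is the only float value not equal to itself, while datetimelikes reach the kernels as int64 with NaT encoded as a sentinel value, which val == val cannot catch.

```python
import numpy as np

# float: NaN is the only value for which val == val is False
val = np.nan
assert not val == val

# datetimelike: passed to the kernels as int64, with NaT stored as the
# sentinel iNaT (the minimum int64), which IS equal to itself
iNaT = np.iinfo(np.int64).min
nat_as_int = np.datetime64('NaT').view('int64')
assert nat_as_int == nat_as_int   # val == val does NOT detect NaT
assert nat_as_int == iNaT         # must compare against the sentinel
```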

Author

Okay I'll revert it to the usual single block. Thought it was better not to do one extra check every iteration. 😛 Thanks for the feedback.

So do you suggest handling the datetimes in the cython routines just like cummin using the is_datetimelike arg?

Contributor

yes like cummin, but this is a bit more as we have 2 parameters
bint checknull, bint is_datetimelike.

Contributor

> Okay I'll revert it to the usual single block. Thought it was better not to do one extra check every iteration. 😛 Thanks for the feedback.

it's just simpler. perf should make only a very small difference (but then you'd have 2x the code).

Author

I am running into a problem that I am not sure about. The support for datetimelikes in the cython routines is done. So, for example, sum now works on timedelta values just like its non-cython implementation.

The problem is that it raises AssertionError("Gaps in blk ref_locs") in _rebuild_blknos_and_blklocs while initializing a BlockManager (mgr = BlockManager(blocks, [items, index])) in the method self._wrap_agged_blocks(new_items, new_blocks), after the cython routine is done and we have a timedelta block.

Can you please guide me on this issue? I am at a loss this time.

Contributor

is the pushed up code current, and can you show me an example of where that fails?


mayukh18 commented Apr 7, 2017

Pushed bare-minimum code. Feeding a dataframe like the following fails.

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {'group': [1, 1, 2, 2],
     'int': [1, np.nan, 3, 1],
     'float': [4., 5., 6., 4],
     'category_int': [7, 8, np.nan, 3],
     'datetime': [pd.Timestamp('2013-01-01 05:00:00'),
                  pd.NaT,
                  pd.Timestamp('2013-01-03 00:00:00.555444'),
                  pd.Timestamp('2013-01-04 12:00:00.458795')],
     'datetimetz': [
         pd.Timestamp('2013-01-01 12:00:00', tz='US/Eastern'),
         np.nan,
         pd.Timestamp('2013-01-01 12:00:00', tz='US/Eastern'),
         pd.Timestamp('2013-01-03 00:00:00', tz='US/Eastern')],
     'timedelta': [pd.Timedelta('1.0s'),
                   pd.Timedelta('3s'),
                   pd.Timedelta('1s'),
                   np.nan]},
    columns=['group', 'int', 'float', 'category_int', 'datetime',
             'datetimetz', 'timedelta'])
result = df.groupby('group')
result = result.sum(numeric_only=False, skipna=False)
print(result)

It fails for timedelta. However, notice the bypass I had put in on NotImplementedError in _cython_agg_blocks; otherwise it was failing for DatetimeTZ as well.
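For reference, the skipna=False semantics the PR is after can be emulated today on numeric columns through the slow path, by dropping into Series.sum inside apply. A minimal sketch with a made-up frame, not the PR's test data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'group': [1, 1, 2, 2],
                   'float': [4.0, np.nan, 6.0, 4.0]})

# Default groupby behavior: NaNs are skipped inside each group,
# so group 1 sums to 4.0 and group 2 to 10.0.
default = df.groupby('group')['float'].sum()

# Emulated skipna=False: fall back to Series.sum per group, where one
# NaN makes the whole group sum NaN (group 1 -> NaN, group 2 -> 10.0).
no_skip = df.groupby('group')['float'].apply(lambda s: s.sum(skipna=False))
```

The apply fallback is exactly the slow, non-cython path; the point of the PR is to get the same semantics out of the cython kernels.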


jreback commented Apr 7, 2017

hmm, might be related to this bug: #15884 (comment)

(this falls through the cython agg, where it should work, but doesn't because of a bug), then the .apply is also buggy.


jreback commented Apr 7, 2017

@mayukh18 if you have it working for most cases, then split the input frame up into 2 tests, working and non-working (and you can even xfail the non-working ones).

@jreback jreback added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Apr 7, 2017

mayukh18 commented Apr 8, 2017

@jreback I didn't fully understand your last comment. Are you suggesting pushing the code as it is and just managing with 2 separate tests?


jreback commented Apr 8, 2017

so separate out everything that is not working into a separate test

then we can iterate on fixing it

ultimately we want 1 test but sometimes it's helpful to make the basic case work w/o the distraction of other cases

another (maybe better) option is to parametrize the cases so they are treated as separate tests (again, might be easier to debug that way)
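The suggested layout could look roughly like this: parametrize over dtypes and xfail the known-broken cases so each shows up as its own test. The ids, columns, and xfail reason are illustrative, not the PR's actual tests.

```python
import numpy as np
import pandas as pd
import pytest


@pytest.mark.parametrize(
    "values",
    [
        pytest.param([1.0, np.nan, 3.0, 1.0], id="float"),
        pytest.param(
            [pd.Timedelta("1s"), pd.NaT, pd.Timedelta("1s"), pd.NaT],
            id="timedelta",
            # known-broken case: runs as a separate test and is
            # reported as xfail instead of failing the suite
            marks=pytest.mark.xfail(reason="blocked by BlockManager bug"),
        ),
    ],
)
def test_groupby_sum_skipna(values):
    df = pd.DataFrame({"group": [1, 1, 2, 2], "vals": values})
    result = df.groupby("group")["vals"].sum()
    assert len(result) == 2
```

Parametrizing this way means the working dtype keeps passing while the broken one stays visible in the test report, which is easier to iterate on than one monolithic test.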

@mayukh18
Author

@jreback So the situation now is that everything is done. In the ideal case, i.e. if that bug weren't there, skipna would work on the datetimes too. But now, due to the bug, the ops all break down when the frame has datetimes. This (new code + bug) is causing a small number of other tests to fail, which can't pass without fixing the bug. The skipna test, however, runs fine without the datetimes.

So should I push the code after adding another skipna test with datetimes? What do you suggest?

@@ -9,34 +9,36 @@ cdef extern from "numpy/npy_math.h":
_int64_max = np.iinfo(np.int64).max
Contributor

this has moved to groupby_helper.pxi.in

# not nan
if val == val:
# val = nan
{{if name == 'int64'}}
Contributor

you need an if checknull here as well

# not nan
if val == val:
# val = nan
{{if name == 'int64'}}
Contributor

same here

@cython.wraparound(False)
@cython.boundscheck(False)
def group_prod_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,
                        ndarray[int64_t] counts,
                        ndarray[{{c_type}}, ndim=2] values,
-                       ndarray[int64_t] labels):
+                       ndarray[int64_t] labels,
+                       bint skipna):
Contributor

call this the same: checknull, as above.


# not nan
if val == val:
    if skipna == False:
Contributor

better to write this as a single expression (in other words, push checknull down lower)

nobs[lab, j] += 1
if val > maxx[lab, j]:
    maxx[lab, j] = val
if val != val:
Contributor

same as above, need checknull

@@ -3187,9 +3187,9 @@ def _iterate_slices(self):
continue
yield val, slicer(val)

-def _cython_agg_general(self, how, alt=None, numeric_only=True):
+def _cython_agg_general(self, how, alt=None, numeric_only=True, skipna=True):
Contributor

make this all consistent: either skipna=True or skipna=None everywhere (this is different from below)

except NotImplementedError:
continue
Contributor

take this out. This is a fairly complicated workflow of falling back on non-numerics. This is the key here.
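The fallback workflow being referred to is, in rough outline, a per-block try/except around the fast path. The function names here are illustrative, not pandas internals:

```python
def agg_blocks(blocks, fast_agg, slow_agg):
    # Try the fast (cython) aggregation per block; blocks the fast path
    # cannot handle raise NotImplementedError and fall back to the slow
    # path instead of being silently dropped from the result.
    results = []
    for blk in blocks:
        try:
            results.append(fast_agg(blk))
        except NotImplementedError:
            results.append(slow_agg(blk))
    return results
```

Swallowing the NotImplementedError with a bare `continue` (as in the diff above) drops the non-numeric block entirely, which is why the reviewer flags it as the key issue.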

@@ -3327,6 +3328,9 @@ def aggregate(self, arg, *args, **kwargs):
self._insert_inaxis_grouper_inplace(result)
result.index = np.arange(len(result))

if result.empty:
Contributor

huh?

@mayukh18
Author

Sorry if I misled you, but I hadn't actually pushed the code. This was the older push. I'll go through these and see if I have covered all of them.


jreback commented Jun 10, 2017

can you rebase and update?

@mayukh18
Author

I had completed it more or less quite some time back, but my cython development environment broke down. I work on Windows and just can't fix it even after trying everything. Strange, but I just don't know what to do in this situation.


jreback commented Jun 11, 2017

@mayukh18 have you seen: http://pandas.pydata.org/pandas-docs/stable/contributing.html#creating-a-windows-development-environment

using 3.6 on Windows is quite easy, simply install VS 2015 (free download). create a new conda env and it should just work.

@mayukh18
Author

The problem I am facing is with setting up C compilers while building cython. Will a new conda env set those up on its own?


jreback commented Jun 11, 2017

you simply install VS 2015

@mayukh18
Author

Does it have to be 2015? Because I installed 2017 and it didn't solve it. Thanks for the help.


jreback commented Jun 11, 2017

2017 in theory should work, but simply use 2015 (you can install many of these).


jreback commented Jun 11, 2017

note 2015 means you can build using 3.5 & 3.6. To do something else is much much harder (e.g. 2.7 is tricky).

@mayukh18
Author

I guess that may be the problem. I used a virtualenv with 2.7. At that time I had an installation of VS 2008, I guess. Then it broke somehow and things fell apart. I'll give 2015 and 3.6 a fresh try.


jreback commented Jun 11, 2017

yeah, 2.7 is hard; just use 3.5 or 3.6 with VS2015. should be a breeze (that in fact is the point of the changes in VS); it is backward compatible for building (but that starts at 3.5)

@mayukh18
Author

Thanks a lot for the help.


jreback commented Aug 17, 2017

can you rebase / update


jreback commented Oct 28, 2017

closing as stale, if you'd like to continue, please ping.

@jreback jreback closed this Oct 28, 2017
@jreback jreback added this to the No action milestone Oct 28, 2017

mayukh18 commented Mar 9, 2020

Hi @jreback, is this something that will still be useful? I'll take this up again then.


cklb commented Apr 9, 2020

@mayukh18 as I just ran into this very issue, it would be much appreciated if you could do so.

@mayukh18
Author

Cool. Started on this.

@jorisvandenbossche
Member

@mayukh18 sorry for the very late reply, but do you have an updated version of this branch / have you started working on it again? It's certainly still something that would be very useful.


mayukh18 commented May 9, 2021

@jorisvandenbossche have finally got to it. Opened up a new PR #41399 since I could not revive the branch of this PR.

Labels
Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate

Successfully merging this pull request may close these issues.

ENH: enable skipna on groupby reduction ops
4 participants