
skipna added to groupby numeric ops #15772

Closed
wants to merge 1 commit into from

Conversation


@mayukh18 mayukh18 commented Mar 21, 2017

Contributor

@jreback jreback left a comment


tests look good! are you comfortable tackling the cython updating?

if _convert:
    result = result._convert(datetime=True)
return result

Contributor

hmm, I think we actually should add a parameter to the cython routines. it would go in pandas/_libs/algos_groupby_helper.pxi.in; you can call it skipna.
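For orientation, here is a pure-Python sketch of what such a grouped-sum kernel could look like with the proposed flag. The function name mirrors the cython `group_add` loosely; the shapes, the poisoning strategy for skipna=False, and all names are illustrative, not the pandas code.

```python
import numpy as np

def group_add(out, counts, values, labels, skipna=True):
    # Pure-Python sketch of a grouped-sum kernel with a skipna flag.
    # out: (ngroups, K) result; counts: (ngroups,); values: (N, K);
    # labels: (N,) group index per row (-1 means: skip this row).
    N, K = values.shape
    nobs = np.zeros_like(out)
    sumx = np.zeros_like(out)
    for i in range(N):
        lab = labels[i]
        if lab < 0:
            continue
        counts[lab] += 1
        for j in range(K):
            val = values[i, j]
            if val == val:                # not NaN (floats only)
                nobs[lab, j] += 1
                sumx[lab, j] += val
            elif not skipna:
                # skipna=False: one NaN poisons the whole group,
                # since NaN + anything stays NaN from here on
                sumx[lab, j] = np.nan
                nobs[lab, j] += 1
    for g in range(out.shape[0]):
        for j in range(K):
            out[g, j] = sumx[g, j] if nobs[g, j] > 0 else np.nan
```

With labels [0, 0, 1, 1] and values [[1.0], [nan], [3.0], [1.0]], skipna=True gives [1.0, 4.0] while skipna=False gives [nan, 4.0].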

Author

Yes, that's what I was thinking when going through the code for the first time. Doing this outside the cython modules will make it more cluttered. I'll take a bit of time and figure out the cython updating.

Contributor

great!


mayukh18 commented Mar 26, 2017

Added skipna to the cython routines. Tried to support skipna for the numeric_only=False scenario, especially for datetimes, but it breaks other things most of the time.

for i in range(N):
    counts[lab] += 1
    for j in range(K):
        val = values[i, j]
Contributor

so I think you are better off making only a single block (like it was),

then checking the flag like this (around lines 69-70)

(rename skipna -> checknull as that is more consistent with what we use)

if checknull and val == val:

separately, you can pass this flag (see cummin) as well (so you are adding 2 args)
also have a look here
https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/groupby_helper.pxi.in#L597

to deal with datetimelikes (basically this is how we check for nulls there, the val==val trick only works for floats).
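To illustrate the point about null checks (a standalone snippet, not part of the diff): NaN is the only float value not equal to itself, while datetimelikes reach the kernels as int64 with NaT encoded as a sentinel value, which val == val cannot catch.

```python
import numpy as np

# float: NaN is the only value for which val == val is False
val = np.nan
assert not val == val

# datetimelike: passed to the kernels as int64, with NaT stored as the
# sentinel iNaT (the minimum int64), which IS equal to itself
iNaT = np.iinfo(np.int64).min
nat_as_int = np.datetime64('NaT').view('int64')
assert nat_as_int == nat_as_int   # val == val does NOT detect NaT
assert nat_as_int == iNaT         # must compare against the sentinel
```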

Author

Okay I'll revert it to the usual single block. Thought it was better not to do one extra check every iteration. 😛 Thanks for the feedback.

So do you suggest handling the datetimes in the cython routines just like cummin using the is_datetimelike arg?

Contributor

yes like cummin, but this is a bit more as we have 2 parameters
bint checknull, bint is_datetimelike.

Contributor

> Okay I'll revert it to the usual single block. Thought it was better not to do one extra check every iteration. 😛 Thanks for the feedback.

it's just simpler. perf should make only a very small difference (but then you'd have 2x the code).

Author

I am running into a problem that I am not sure about. The support for datetimelikes in the cython routines is done. So, for example, sum now works on timedelta values just like its non-cython implementation.

The problem is that it raises AssertionError("Gaps in blk ref_locs") in _rebuild_blknos_and_blklocs while initializing a BlockManager (mgr = BlockManager(blocks, [items, index])) in the method self._wrap_agged_blocks(new_items, new_blocks), after the cython routine is done and we have a timedelta block.

Can you please guide me on this issue? I am at a loss this time.

Contributor

is the pushed up code current, and can you show me an example of where that fails?


mayukh18 commented Apr 7, 2017

Pushed bare-minimum code. Feeding a dataframe like the following fails.

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {'group': [1, 1, 2, 2],
     'int': [1, np.nan, 3, 1],
     'float': [4., 5., 6., 4],
     'category_int': [7, 8, np.nan, 3],
     'datetime': [pd.Timestamp('2013-01-01 05:00:00'),
                  pd.NaT,
                  pd.Timestamp('2013-01-03 00:00:00.555444'),
                  pd.Timestamp('2013-01-04 12:00:00.458795')],
     'datetimetz': [
         pd.Timestamp('2013-01-01 12:00:00', tz='US/Eastern'),
         np.nan,
         pd.Timestamp('2013-01-01 12:00:00', tz='US/Eastern'),
         pd.Timestamp('2013-01-03 00:00:00', tz='US/Eastern')],
     'timedelta': [pd.Timedelta('1.0s'),
                   pd.Timedelta('3s'),
                   pd.Timedelta('1s'),
                   np.nan]},
    columns=['group', 'int', 'float', 'category_int', 'datetime',
             'datetimetz', 'timedelta'])
result = df.groupby('group')
result = result.sum(numeric_only=False, skipna=False)
print(result)

It fails for timedelta. However, notice the bypass I had put in on NotImplementedError in _cython_agg_blocks; otherwise it was failing for DatetimeTZ as well.
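For reference, the skipna=False semantics the PR is after can be emulated today on numeric columns through the slow path, by dropping into Series.sum inside apply. A minimal sketch with a made-up frame, not the PR's test data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'group': [1, 1, 2, 2],
                   'float': [4.0, np.nan, 6.0, 4.0]})

# Default groupby behavior: NaNs are skipped inside each group,
# so group 1 sums to 4.0 and group 2 to 10.0.
default = df.groupby('group')['float'].sum()

# Emulated skipna=False: fall back to Series.sum per group, where one
# NaN makes the whole group sum NaN (group 1 -> NaN, group 2 -> 10.0).
no_skip = df.groupby('group')['float'].apply(lambda s: s.sum(skipna=False))
```

The apply fallback is exactly the slow, non-cython path; the point of the PR is to get the same semantics out of the cython kernels.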


jreback commented Apr 7, 2017

hmm, might be related to this bug: #15884 (comment)

(this falls through the cython agg, where it should work, but doesn't because of a bug), then the .apply is also buggy.


jreback commented Apr 7, 2017

@mayukh18 if you have it working for most cases, then split the input frame up into 2 tests, working and non-working (and you can even xfail the non-working ones).

@jreback jreback added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Apr 7, 2017

mayukh18 commented Apr 8, 2017

@jreback I didn't fully understand your last comment. Are you suggesting pushing the code as it is and just managing with 2 separate tests?


jreback commented Apr 8, 2017

so separate out everything that is not working into a separate test

then we can iterate on fixing it

ultimately we want 1 test but sometimes it's helpful to make the basic case work w/o the distraction of other cases

another (maybe better) option is to parametrize the cases so they are treated as separate tests (again, might be easier to debug that way)
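The suggested layout could look roughly like this: parametrize over dtypes and xfail the known-broken cases so each shows up as its own test. The ids, columns, and xfail reason are illustrative, not the PR's actual tests.

```python
import numpy as np
import pandas as pd
import pytest


@pytest.mark.parametrize(
    "values",
    [
        pytest.param([1.0, np.nan, 3.0, 1.0], id="float"),
        pytest.param(
            [pd.Timedelta("1s"), pd.NaT, pd.Timedelta("1s"), pd.NaT],
            id="timedelta",
            # known-broken case: runs as a separate test and is
            # reported as xfail instead of failing the suite
            marks=pytest.mark.xfail(reason="blocked by BlockManager bug"),
        ),
    ],
)
def test_groupby_sum_skipna(values):
    df = pd.DataFrame({"group": [1, 1, 2, 2], "vals": values})
    result = df.groupby("group")["vals"].sum()
    assert len(result) == 2
```

Parametrizing this way means the working dtype keeps passing while the broken one stays visible in the test report, which is easier to iterate on than one monolithic test.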

@mayukh18
Author

@jreback So the situation now is that everything is done. In the ideal case, i.e. if that bug weren't there, skipna would work on the datetimes too. But now, due to the bug, the ops all break down when the frame has datetimes. This (new code + bug) is causing a small number of other tests to fail, which can't pass without fixing the bug. The skipna test, however, runs fine without the datetimes.

So should I push the code after adding another skipna test with datetimes? What do you suggest?

@@ -9,34 +9,36 @@ cdef extern from "numpy/npy_math.h":
_int64_max = np.iinfo(np.int64).max
Contributor

this has moved to groupby_helper.pxi.in

# not nan
if val == val:
# val = nan
{{if name == 'int64'}}
Contributor

you need an if checknull here as well

# not nan
if val == val:
# val = nan
{{if name == 'int64'}}
Contributor

same here

@cython.wraparound(False)
@cython.boundscheck(False)
def group_prod_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,
                        ndarray[int64_t] counts,
                        ndarray[{{c_type}}, ndim=2] values,
-                       ndarray[int64_t] labels):
+                       ndarray[int64_t] labels,
+                       bint skipna):
Contributor

call this the same: checknull, as above.


# not nan
if val == val:
    if skipna == False:
Contributor

better to write this as a single expression (in other words, push checknull down lower)

nobs[lab, j] += 1
if val > maxx[lab, j]:
    maxx[lab, j] = val
if val != val:
Contributor

same as above, need checknull

@@ -3187,9 +3187,9 @@ def _iterate_slices(self):
continue
yield val, slicer(val)

-def _cython_agg_general(self, how, alt=None, numeric_only=True):
+def _cython_agg_general(self, how, alt=None, numeric_only=True, skipna=True):
Contributor

make this all consistent: either skipna=True or skipna=None everywhere (this is different from below)

except NotImplementedError:
continue
Contributor

take this out. This is a fairly complicated workflow of falling back on non-numerics. This is the key here.
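The fallback workflow being referred to is, in rough outline, a per-block try/except around the fast path. The function names here are illustrative, not pandas internals:

```python
def agg_blocks(blocks, fast_agg, slow_agg):
    # Try the fast (cython) aggregation per block; blocks the fast path
    # cannot handle raise NotImplementedError and fall back to the slow
    # path instead of being silently dropped from the result.
    results = []
    for blk in blocks:
        try:
            results.append(fast_agg(blk))
        except NotImplementedError:
            results.append(slow_agg(blk))
    return results
```

Swallowing the NotImplementedError with a bare `continue` (as in the diff above) drops the non-numeric block entirely, which is why the reviewer flags it as the key issue.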

@@ -3327,6 +3328,9 @@ def aggregate(self, arg, *args, **kwargs):
self._insert_inaxis_grouper_inplace(result)
result.index = np.arange(len(result))

if result.empty:
Contributor

huh?

@mayukh18
Author

Sorry if I misled you, but I hadn't actually pushed the code. This was the older push. I'll go through these and see if I have covered all of them.


jreback commented Jun 10, 2017

can you rebase and update?

@mayukh18
Author

I had completed it more or less quite some time back, but my cython development environment broke down. I work on Windows and just can't fix it even after trying everything. Strange, but I just don't know what to do in this situation.


jreback commented Jun 11, 2017

@mayukh18 have you seen: http://pandas.pydata.org/pandas-docs/stable/contributing.html#creating-a-windows-development-environment

using 3.6 on Windows is quite easy, simply install VS 2015 (free download). create a new conda env and it should just work.

@mayukh18
Author

The problem I am facing is with setting up C compilers while building cython. Will a new conda env set those up on its own?


jreback commented Jun 11, 2017

you simply install VS 2015

@mayukh18
Author

Does it have to be 2015? Because I installed 2017 and it didn't solve it. Thanks for the help.


jreback commented Jun 11, 2017

2017 in theory should work, but simply use 2015 (you can install many of these).


jreback commented Jun 11, 2017

note 2015 means you can build using 3.5 & 3.6. To do something else is much much harder (e.g. 2.7 is tricky).

@mayukh18
Author

I guess that may be the problem. I used a virtualenv with 2.7. At that time I had an installation of VS 2008, I guess. Then it broke somehow and things fell apart. I'll give 2015 and 3.6 a fresh try.


jreback commented Jun 11, 2017

yeah, 2.7 is hard; just use 3.5 or 3.6 with VS2015. should be a breeze (that in fact is the point of the changes in VS); it is backward compatible for building (but that starts at 3.5)

@mayukh18
Author

Thanks a lot for the help.


jreback commented Aug 17, 2017

can you rebase / update


jreback commented Oct 28, 2017

closing as stale, if you'd like to continue, please ping.

@jreback jreback closed this Oct 28, 2017
@jreback jreback added this to the No action milestone Oct 28, 2017

mayukh18 commented Mar 9, 2020

Hi @jreback, is this something that will still be useful? I'll take this up again then.


cklb commented Apr 9, 2020

@mayukh18 as I just ran into this very issue, it would be much appreciated if you could do so.

@mayukh18
Author

Cool. Started on this.

@jorisvandenbossche
Member

@mayukh18 sorry for the very late reply, but do you have an updated version of this branch / have you started working on it again? It's certainly still something that would be very useful.


mayukh18 commented May 9, 2021

@jorisvandenbossche have finally got to it. Opened up a new PR #41399 since I could not revive the branch of this PR.

Labels
Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate

Successfully merging this pull request may close these issues.

ENH: enable skipna on groupby reduction ops
4 participants