-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
skipna added to groupby numeric ops #15772
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tests look good! are you comfortable tackling the cython updating?
pandas/core/groupby.py
Outdated
if _convert: | ||
result = result._convert(datetime=True) | ||
return result | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, I think we actually should add a parameter to the cython routines. it would go in pandas/_libs/algos_groupby_helper.pxi.in
; you can call it skipna
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's what I was thinking when going through the code for the first time. Doing this outside the cython modules will make it more cluttered. I'll take a bit of time and figure out the cython updating.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great!
Added the skipna in cython routines. Tried to support the skipna for |
counts[lab] += 1 | ||
for j in range(K): | ||
val = values[i, j] | ||
for i in range(N): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so I think you are better off making only a single block (like it was),
then checking the flag like this (around like 69-70)
(rename skipna -> checknull as that is more consistent with what we use)
if checknull and val == val:
separately, you can pass this flag (see cummin) as well (so you are adding 2 args)
also have a look here
https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/groupby_helper.pxi.in#L597
to deal with datetimelikes (basically this is how we check for nulls there, the val==val
trick only works for floats).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I'll revert it to the usual single block. Thought it was better not to do one extra check every iteration. 😛 Thanks for the feedback.
So do you suggest handling the datetimes in the cython routines just like cummin
using the is_datetimelike
arg?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes like cummin
, but this is a bit more as we have 2 parameters
bint checknull, bint is_datetimelike
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I'll revert it to the usual single block. Thought it was better not to do one extra check every iteration. 😛 Thanks for the feedback.
its just simpler. perf shouldn't make only a very small difference (but then have 2x the code).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am running into a problem that I am not that sure about. The support of datetimelikes in the cython routines is done. So for example, sum
now works ontimedelta
values just like it's non-cython implementation.
The problem is it's raising AssertionError("Gaps in blk ref_locs")
in _rebuild_blknos_and_blklocs
while initializing a BlockManager
in mgr = BlockManager(blocks, [items, index])
in the method self._wrap_agged_blocks(new_items, new_blocks)
after the cython routine is done and we have a timedelta block.
Can you please guide me on this issue? I am at a loss this time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the pushed up code current, and can you show me an example of where that fails?
Pushed a bare minimum code. Feeding such a dataframe fails.
It fails for timedelta. However notice the bypass I had put on Implementation Error in |
hmm, might be related to this bug: #15884 (comment) (this falls thru the cython_agg, where it should work, but doesn't becauase of a bug), then the |
@mayukh18 if you have it working for most cases, then split the input frame up in 2 tests, working and non-working (and you can even xfail the non-working ones). |
@jreback I didn't understand your last comment fully. Are you suggesting on pushing the code as it is like this and just manage with 2 separate tests? |
so separate out everything that is not working into a separate test then we can iiterate on fixing it ultimately we want 1 test but sometimes it's helpful to make the basic case work w/o the distraction of other cases another (maybe better) are to parameteize the cases so hey r treated as separate test (again might be easier to debug that way) |
@jreback So the situation now is like everything is done. In ideal case, i.e. if that bug isn't there, skipna will work on the datetimes too. But now, due to the bug, the ops all break down when the frame has datetimes. Now this (new code + bug) is leading a small few other tests to fail which can't be passed without correcting the bug. The test for skipna however runs ok without the datetimes. So should I push the code after adding another skipna test having datetimes? what do you suggest? |
@@ -9,34 +9,36 @@ cdef extern from "numpy/npy_math.h": | |||
_int64_max = np.iinfo(np.int64).max |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this has moved to groupby_helper.pxi.in
# not nan | ||
if val == val: | ||
# val = nan | ||
{{if name == 'int64'}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need a if checknull
here as well
# not nan | ||
if val == val: | ||
# val = nan | ||
{{if name == 'int64'}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
@cython.wraparound(False) | ||
@cython.boundscheck(False) | ||
def group_prod_{{name}}(ndarray[{{dest_type2}}, ndim=2] out, | ||
ndarray[int64_t] counts, | ||
ndarray[{{c_type}}, ndim=2] values, | ||
ndarray[int64_t] labels): | ||
ndarray[int64_t] labels, | ||
bint skipna): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call this the same, checknull as above.
|
||
# not nan | ||
if val == val: | ||
if skipna == False: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to write this as a single expression (Iow push checknull down lower)
nobs[lab, j] += 1 | ||
if val > maxx[lab, j]: | ||
maxx[lab, j] = val | ||
if val != val: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above, need checknull
@@ -3187,9 +3187,9 @@ def _iterate_slices(self): | |||
continue | |||
yield val, slicer(val) | |||
|
|||
def _cython_agg_general(self, how, alt=None, numeric_only=True): | |||
def _cython_agg_general(self, how, alt=None, numeric_only=True, skipna=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this all consistent either skipna=True
or skipna=None
(this is different than below)
except NotImplementedError: | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
take this out. This is a fairly complicated workflow of falling back on non-numerics. This is the key here.
@@ -3327,6 +3328,9 @@ def aggregate(self, arg, *args, **kwargs): | |||
self._insert_inaxis_grouper_inplace(result) | |||
result.index = np.arange(len(result)) | |||
|
|||
if result.empty: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh?
Sorry if I misled you but I hadn't actually pushed the code. This was the older push. I'll go though these and see if I have covered all these. |
can you rebase and update? |
I had completed it more or less quite sometime back but my cython development environment broke down. I work on Windows and just can't fix it even after trying everything. Strange but I just don't know what to do in this situation. |
@mayukh18 have you seen: http://pandas.pydata.org/pandas-docs/stable/contributing.html#creating-a-windows-development-environment using 3.6 on windows is quite easy, simply install 2015 (free download). create a new conda env and it should just work. |
The problem I am facing is with setting up C compilers while building cython. Will a new conda env set those up on it's own? |
you simply install VS 2015 |
Does it have to be 2015? Because I installed 2017 and it didn't solve it. Thanks for the help. |
2017 in theory should work, but simply use 2015 (you can install many of these). |
note 2015 means you can build using 3.5 & 3.6. TO do something else is much much harder (e.g. 2.7 is tricky). |
I guess that may be the problem. I used a virtualenv of 2.7. That time I had a installation of vs 2008 I guess. Then it broke somehow and things went apart. I'll give 2015 and 3.6 a fresh try. |
yeah 2.7 is hard, just use 3.5 or 3.6 with VS2015. should be a breeze (that in fact is the point of the changes in VS), it is backward compatible for building (but that starts at 3.5) |
Thanks a lot for the help. |
can you rebase / update |
closing as stale, if you'd like to continue, please ping. |
Hi @jreback , is this something that will still be useful? I'll take this up again then. |
@mayukh18 as I just ran into this very issue, it would be much appreciated if you could do so. |
Cool. Started on this. |
@mayukh18 sorry for the very late reply, but do you have an updated version of this branch / started working on it again? |
@jorisvandenbossche have finally got to it. Opened up a new PR #41399 since I could not revive the branch of this PR. |
git diff upstream/master --name-only -- '*.py' | flake8 --diff