-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: Use fused types for map_infer_mask #55736
Changes from 3 commits
5921bf1
5c4d8d9
c4bdb1a
c58e682
8ff9047
4634a24
59c836d
da55ccc
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
|
@@ -102,6 +102,7 @@ cdef extern from "pandas/parser/pd_parser.h": | |||||||||||
PandasParser_IMPORT | ||||||||||||
|
||||||||||||
from pandas._libs cimport util | ||||||||||||
from pandas._libs.dtypes cimport numeric_object_t | ||||||||||||
from pandas._libs.util cimport ( | ||||||||||||
INT64_MAX, | ||||||||||||
INT64_MIN, | ||||||||||||
|
@@ -2855,8 +2856,6 @@ no_default = _NoDefault.no_default # Sentinel indicating the default value. | |||||||||||
NoDefault = Literal[_NoDefault.no_default] | ||||||||||||
|
||||||||||||
|
||||||||||||
@cython.boundscheck(False) | ||||||||||||
@cython.wraparound(False) | ||||||||||||
def map_infer_mask( | ||||||||||||
ndarray[object] arr, | ||||||||||||
object f, | ||||||||||||
|
@@ -2875,10 +2874,59 @@ def map_infer_mask( | |||||||||||
mask : ndarray | ||||||||||||
uint8 dtype ndarray indicating values not to apply `f` to. | ||||||||||||
convert : bool, default True | ||||||||||||
Whether to call `maybe_convert_objects` on the resulting ndarray | ||||||||||||
Whether to call `maybe_convert_objects` on the resulting ndarray. | ||||||||||||
na_value : Any, optional | ||||||||||||
The result value to use for masked values. By default, the | ||||||||||||
input value is used | ||||||||||||
input value is used. | ||||||||||||
dtype : numpy.dtype | ||||||||||||
The numpy dtype to use for the result ndarray. | ||||||||||||
|
||||||||||||
Returns | ||||||||||||
------- | ||||||||||||
np.ndarray | ||||||||||||
""" | ||||||||||||
# Passed so we can use infused types depending on the result dtype | ||||||||||||
dummy = np.empty(0, dtype=dtype) | ||||||||||||
result = _map_infer_mask( | ||||||||||||
arr, | ||||||||||||
f, | ||||||||||||
mask, | ||||||||||||
dummy, | ||||||||||||
convert, | ||||||||||||
na_value, | ||||||||||||
dtype, | ||||||||||||
) | ||||||||||||
if convert: | ||||||||||||
return maybe_convert_objects(result) | ||||||||||||
else: | ||||||||||||
return result | ||||||||||||
|
||||||||||||
|
||||||||||||
@cython.boundscheck(False) | ||||||||||||
@cython.wraparound(False) | ||||||||||||
def _map_infer_mask( | ||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. make this a cdef function? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't believe so. When I try, I get:
I believe this is cython/cython#2462 I'm not certain, but I believe it happens when calling a cdef function from a def function where the def function does not use the fused types. |
||||||||||||
ndarray[object] arr, | ||||||||||||
object f, | ||||||||||||
const uint8_t[:] mask, | ||||||||||||
numeric_object_t[:] dummy, | ||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Does changing I think this is a nice change but would be nice to get rid of some of the indirection There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. might be cleaner to create There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @WillAyd: I tried this first, and it compiles but then fails when called with pandas/pandas/_libs/groupby.pyx Lines 1341 to 1345 in 984d755
we pass a value. That's the reason I went with an array here - we can create an empty one without needing to produce a value. @jbrockmendel - will update. |
||||||||||||
bint convert=True, | ||||||||||||
object na_value=no_default, | ||||||||||||
cnp.dtype dtype=np.dtype(object) | ||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i think dtype and convert are not needed here? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Doh - this is why you don't refactor late at night 😆 Removed, thanks. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @lithomas1 were you once looking at adding Wextra or increasing the meson warning setting? I know we'd have to do a bit more work to get there but I think that would catch issues like this in the future (via -Wunused-argument, which could alternately be added directly) |
||||||||||||
) -> np.ndarray: | ||||||||||||
""" | ||||||||||||
jbrockmendel marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||
Helper for map_infer_mask, split off to use fused types based on the result. | ||||||||||||
|
||||||||||||
Parameters | ||||||||||||
---------- | ||||||||||||
arr : ndarray | ||||||||||||
f : function | ||||||||||||
mask : ndarray | ||||||||||||
uint8 dtype ndarray indicating values not to apply `f` to. | ||||||||||||
dummy : ndarray | ||||||||||||
Unused. Has the same dtype as the result to allow using fused types. | ||||||||||||
na_value : Any, optional | ||||||||||||
The result value to use for masked values. By default, the | ||||||||||||
input value is used. | ||||||||||||
dtype : numpy.dtype | ||||||||||||
The numpy dtype to use for the result ndarray. | ||||||||||||
|
||||||||||||
|
@@ -2888,7 +2936,7 @@ def map_infer_mask( | |||||||||||
""" | ||||||||||||
cdef: | ||||||||||||
Py_ssize_t i, n | ||||||||||||
ndarray result | ||||||||||||
ndarray[numeric_object_t] result | ||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can we narrow down what types are actually needed here? is there a danger of types that aren't part of numeric_object_t? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The test suite only uses bool, int64, and object. I'm going to check calls to make sure that's guaranteed. Assuming it is, does something like
in dtypes.pxd work? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. All calls are either hard coded as bool, int64, or object. The only part that was a little tricky was the use in the various There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If we create the type here what would we name it? I agree its good to specialize instead of using There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree with the sentiment but have no recommendations for a "natural" name. If we are not happy with the current name There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
im fine with either of these options There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @WillAyd - friendly ping on There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Let's go with the first option for now. Can always rename later |
||||||||||||
object val | ||||||||||||
|
||||||||||||
n = len(arr) | ||||||||||||
|
@@ -2908,10 +2956,7 @@ def map_infer_mask( | |||||||||||
|
||||||||||||
result[i] = val | ||||||||||||
|
||||||||||||
if convert: | ||||||||||||
return maybe_convert_objects(result) | ||||||||||||
else: | ||||||||||||
return result | ||||||||||||
return result | ||||||||||||
|
||||||||||||
|
||||||||||||
@cython.boundscheck(False) | ||||||||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can also just pass a pointer to avoid python runtime interaction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar - can you spell out the syntax here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See if this works
http://docs.cython.org/en/latest/src/userguide/fusedtypes.html#fused-types-and-arrays
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cannot take address of Python variable 'dummy'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably need to cdef it as ndarray
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea shouldn't even need an array. Typing from phone so sorry for grammar but should be
cdef numeric_object_t dummy
Then when calling function use address of operator &dummy
The callee should have numeric_object_t *dummy as an argument
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe we can use fused types unless they are utilized in the args. Doing either one of these:
leads to compilation error
Type is not specialized
. Also doinggives
Cannot take address of Python variable 'dummy'
as before.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah OK. That's unfortunate - I think @jbrockmendel has the best suggestion then to make
result
caller allocatedThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I misunderstood that suggestion before - I think this would only be doable if we are okay with also moving
convert
out of the function. So instead of something like:we are now doing
This seems worse to me (duplicating the convert code), but happy to make the change if that's what's preferred by others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes im fine with this. though i think if you called the result something other than
result
you might not have the same TypeError issues, and since we're not iterating over it its no big deal perf-wise