-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: change get_dummies default dtype to bool #48022
ENH: change get_dummies default dtype to bool #48022
Conversation
ed5136a
to
1eb5cd6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is off to a good start!
There's some CI failures in the doc build, could you fix them up please? e.g.
Warning in /home/runner/work/pandas/pandas/doc/source/whatsnew/v0.13.0.rst at block ending on line 526
Specify :okwarning: as an option in the ipython:: block to suppress this message
This would also require a whatsnew note
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Co-authored-by: Marco Edward Gorelli <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me
EDIT: holding off til discussion in pandas-dev meeting
@@ -169,7 +177,7 @@ def test_get_dummies_unicode(self, sparse): | |||
e = "e" | |||
eacute = unicodedata.lookup("LATIN SMALL LETTER E WITH ACUTE") | |||
s = [e, eacute, eacute] | |||
res = get_dummies(s, prefix="letter", sparse=sparse) | |||
res = get_dummies(s, dtype=np.uint8, prefix="letter", sparse=sparse) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't we rather just catch the warnings for these? Wondering how we remember in the future to go back and update these tests when we make the change to the dtype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could do, I just thought that would be a lot of warnings to catch
Regarding updating tests - I wouldn't have thought they needed updating, I'd have thought just having a test which called .get_dummies()
(without specifying dtype
) would be enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I agree. So that's why I was thinking it is better to catch the warning for now and not change the argument. Otherwise with this in the future we lose testing the behavior of the default argument unless someone comes back and revert what was changed here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no all tests should be fixed now
and then u have an explicit test of the warning
it's not better to defer fixing something like this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK good points, thanks for raising
I've added this to the agenda for the next dev meeting
@kianelbo let's hold off further changes til after there's been discussion
# https://github.com/pandas-dev/pandas/issues/45848 | ||
msg = "In a future version of pandas the default dtype will change" | ||
with tm.assert_produces_warning(FutureWarning, match=msg): | ||
get_dummies(df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry to go back on the approval, but can we check the return value here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this change (even the deprecation) needs discussion as this is quite some long standing behavior
pls schedule it for the next dev meeting
@@ -169,7 +177,7 @@ def test_get_dummies_unicode(self, sparse): | |||
e = "e" | |||
eacute = unicodedata.lookup("LATIN SMALL LETTER E WITH ACUTE") | |||
s = [e, eacute, eacute] | |||
res = get_dummies(s, prefix="letter", sparse=sparse) | |||
res = get_dummies(s, dtype=np.uint8, prefix="letter", sparse=sparse) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK good points, thanks for raising
I've added this to the agenda for the next dev meeting
@kianelbo let's hold off further changes til after there's been discussion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @kianelbo
Apologies for conflicting instructions here. In the end, at the last dev meeting, we decided it would be best to treat this as a bug, and just change the default dtype without going through the deprecation cycle
Changing unsigned data to signed is unlikely to cause any issues
So, this'd be a much simpler fix on your side - just change the default dtype
to bool
, and make sure to either update tests or make sure this behaviour is tested
It's too late to get this in to v1.5.0, so for now let's go with v1.6.0 or v1.5.1
2ef8633
to
9f7fbc4
Compare
9f7fbc4
to
15aeb3e
Compare
get_dummies
Awesome, thanks for sticking with this @kianelbo |
* ENH: Warn when dtype is not passed to get_dummies * Edit get_dummies' dtype warning * Add whatsnew entry for issue pandas-dev#45848 * Fix dtype warning test * Suppress warnings in docs * Edit whatsnew entry Co-authored-by: Marco Edward Gorelli <[email protected]> * Fix find_stack_level in get_dummies dtype warning * Change the default dtype of get_dummies to bool * Revert dtype(bool) change * Move the changelog entry to v1.6.0.rst * Move whatsnew entry to 'Other API changes' Co-authored-by: Marco Edward Gorelli <[email protected]> Co-authored-by: Marco Edward Gorelli <[email protected]>
FWIW, this does cause some confusion for downstream consumers--for example the common approach of using Well, I'm not really that annoyed minus 90 minutes of debugging, and the fix is trivial for the consuming code (just specify the old default |
Fixes darshan-hpc#909 * make the library compatible with both `pandas 1.5.x` and `pandas 2.0.0rc0` by pinning the dtype we use for one-hot encoding our heatmap data * see related upstream comment and release notes (`get_dummies()` change): - pandas-dev/pandas#48022 (comment) - https://pandas.pydata.org/pandas-docs/version/2.0/whatsnew/v2.0.0.html
It also lead to some not obvious errors in statsmodels when testing against the pre-prelease. It wasn't that hard to change since I know about this change, but would have been hard to determine had one not. When making the changes I half wondered if the default should have become |
This PR changes the default dtype for get_dummies to bool from uint8 to match pandas-2.0: pandas-dev/pandas#48022
Added a future warning when no dtype is passed to
get_dummies
stating the the default dtype will change tobool
fromnp.uint8
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.