
pandas.cut: the 'include_lowest' argument isn't behaving as documented #23164

Open
evarga opened this issue Oct 15, 2018 · 12 comments
@evarga

evarga commented Oct 15, 2018

Code Sample

import pandas as pd
import numpy as np

pd.cut(np.array([1, 7, 5, 4, 6, 3]), bins=[0, 3, 6, 8], include_lowest=True)

Problem description

Just by setting include_lowest to True, the data type of the intervals changes from int64 to float64, and the first interval isn't left-inclusive. Here is the incorrect output that you'll get:

[(-0.001, 3.0], (6.0, 8.0], (3.0, 6.0], (3.0, 6.0], (3.0, 6.0], (-0.001, 3.0]]
Categories (3, interval[float64]): [(-0.001, 3.0] < (3.0, 6.0] < (6.0, 8.0]]

Expected Output

[(0, 3], (6, 8], (3, 6], (3, 6], (3, 6], (0, 3]]
Categories (3, interval[int64]): [[0, 3] < (3, 6] < (6, 8]]
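(For completeness, one possible workaround, purely a sketch and not the mixed-closed output expected above, is to pass an explicit left-closed IntervalIndex as bins, which avoids the -0.001 extension entirely at the cost of making every bin left-closed:)

```python
import numpy as np
import pandas as pd

# Workaround sketch: build left-closed bins explicitly instead of using
# include_lowest. Note this makes *every* interval left-closed, which differs
# from the mixed [[0, 3] < (3, 6] < (6, 8]] output expected above.
bins = pd.IntervalIndex.from_breaks([0, 3, 6, 8], closed="left")
result = pd.cut(np.array([1, 7, 5, 4, 6, 3]), bins=bins)
# The original integer edges are preserved, and 0 would land in [0, 3).
```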

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.8.2
pip: 10.0.1
setuptools: 40.4.3
Cython: 0.28.5
numpy: 1.15.2
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.0.1
sphinx: 1.8.1
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.0
openpyxl: 2.5.8
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.1.1
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.12
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WWH98932

WWH98932 commented Nov 7, 2018

I ran into the same problem. I converted the intervals to string format and fixed the first bound manually.

@mroeschke
Member

I think this is the documented behavior as described in the bins section of the docstring:

The criteria to bin by.
int : Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.

This situation could be better documented, though.
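(That extension for an integer `bins` can be seen directly; this is a small illustration, not part of the original report:)

```python
import numpy as np
import pandas as pd

x = np.array([1, 7, 5, 4, 6, 3])
# With an integer `bins`, pandas widens the range of x slightly so that the
# minimum falls strictly inside the first (right-closed) bin.
cats = pd.cut(x, bins=3)
# The left edge of the first category sits just below x.min() == 1.
```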

@evarga
Author

evarga commented Jan 13, 2019

Thanks for the comment! Please also see the issue with the altered data type of the intervals (see my original comment above). As far as I know, this behavior is not documented anywhere, and it could be a consequence of the 0.1% extension you cited.

@Nakai-Naoto

The criteria to bin by.
* int : Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.
* sequence of scalars : Defines the bin edges allowing for non-uniform width. No extension of the range of x is done.

So the extension does not seem to be documented for this case, because bins is a sequence of scalars in the code sample.
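(To illustrate that point with a minimal example, not from the original report: with a sequence of scalars and the default include_lowest=False, the leftmost edge is exclusive, so a value equal to it is not binned at all:)

```python
import numpy as np
import pandas as pd

# Default behavior with explicit edges: the bins are (0, 3], (3, 6], (6, 8],
# so the value 0 equals the open left edge and becomes NaN.
cats = pd.cut(np.array([0, 1, 7]), bins=[0, 3, 6, 8])
```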

@gdex1
Contributor

gdex1 commented Jul 12, 2019

I've looked more into how to resolve this, and it seems this is more a restriction of IntervalIndex than a documentation error. IntervalIndex requires that all bins be closed on the same side, so
(3, interval[int64]): [[0, 3] < (3, 6] < (6, 8]] is not valid.

pandas works around this by slightly reducing the lower bound of the first interval and making it left-exclusive. Unfortunately, this converts the intervals from int64 to float64 and creates the following unexpected behavior:

In: 
pd.cut(np.array([-0.0001, 0, 1, 7, 5, 4, 6, 3, 8]), bins=[0, 3, 6, 8], include_lowest=True)

Out: 
[NaN, (-0.001, 3.0], (-0.001, 3.0], (6.0, 8.0], (3.0, 6.0], (3.0, 6.0], (3.0, 6.0], (-0.001, 3.0], (6.0, 8.0]]
Categories (3, interval[float64]): [(-0.001, 3.0] < (3.0, 6.0] < (6.0, 8.0]]

You can see that even though -0.0001 should be in the first interval, it is not assigned to the first bin.

I am new to this so I was wondering what the best way to handle this is?
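(The IntervalIndex restriction mentioned above can be demonstrated directly; this is a sketch, and the exact error type/message may vary across pandas versions:)

```python
import pandas as pd

# An IntervalIndex must have all intervals closed on the same side, so the
# "ideal" mixed index [[0, 3], (3, 6], (6, 8]] cannot be constructed.
try:
    pd.IntervalIndex([pd.Interval(0, 3, closed="both"),
                      pd.Interval(3, 6, closed="right"),
                      pd.Interval(6, 8, closed="right")])
    mixed_allowed = True
except ValueError:
    mixed_allowed = False
```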

@jotasi
Contributor

jotasi commented Aug 27, 2021

As suggested in #42212 (@mroeschke suggested to include it into the discussion here instead), I would love to also have the option to include not only the left-most boundary for left-open intervals but also the right-most boundary for right-open intervals (i.e. for right=False).

My preferred solution would be to change include_lowest to something like final_interval_closed, which would also work for right-open intervals (i.e. when right=False is specified).

Maybe, if the solution to this issue is to adapt the function or the underlying IntervalIndex instead of updating the documentation, this could be kept in mind and implemented as well.
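(Until something like that exists, a fragile workaround for right=False, purely a sketch, is to nudge the final edge up by one ulp so the maximum lands in the last left-closed bin:)

```python
import numpy as np
import pandas as pd

values = np.array([0, 3, 6, 8])
edges = np.array([0, 3, 6, 8], dtype=float)
# Mirror what include_lowest does on the left: bump the last edge to the next
# representable float so that 8 falls inside the final [6, 8.0000...) bin.
edges[-1] = np.nextafter(edges[-1], np.inf)
cats = pd.cut(values, bins=edges, right=False)
```

Like include_lowest, this forces float64 edges, so it shares the dtype drawback discussed in this issue.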

@edwhuang23
Contributor

take

@bluenote10

Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.

This sounds like it depends on the minimum of x, but that doesn't seem to be the case, does it?

The current behavior raises a number of questions:

  • Why does it not simply use a left-closed interval as the first bin instead of the awkward extension? Isn't that what they are for?
  • How can I inclusively bin -np.inf? As far as I can see this cannot be counted correctly with extending.
  • What if I want to recover the original bin edges from the result? Using include_lowest currently messes this up, because one would have to implement the inverse logic of how it got extended, which is super fragile.

For these reasons it would be great if include_lowest did not actually rely on extending.
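(The edge-recovery problem can be seen concretely; this is an illustrative snippet, not part of the comment above:)

```python
import numpy as np
import pandas as pd

edges = [0, 3, 6, 8]
cats = pd.cut(np.array([1, 7, 5, 4, 6, 3]), bins=edges, include_lowest=True)
# Naively reading the edges back from the categories does not round-trip:
recovered_left = cats.categories[0].left   # slightly below 0, not 0 itself
recovered_rights = [iv.right for iv in cats.categories]
```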

@AshaHolla

take

@AshaHolla

In which file and folder can I find the code for this issue? It's my first time contributing.

@bluenote10

cut is in tile.py.

But as stated above, this issue should probably be classified as a "bug" rather than "documentation", i.e. it would be better to fix the code than to adapt the documentation to its problematic semantics.

@hualiu01

hualiu01 commented Sep 2, 2024

take
