
pandas.cut: the 'include_lowest' argument isn't behaving as documented #23164

Open
evarga opened this issue Oct 15, 2018 · 12 comments
@evarga

evarga commented Oct 15, 2018

Code Sample

import pandas as pd
import numpy as np

pd.cut(np.array([1, 7, 5, 4, 6, 3]), bins=[0, 3, 6, 8], include_lowest=True)

Problem description

Just by setting include_lowest to True, the data type of the intervals changes from int64 to float64, and the first interval isn't left-inclusive. Here is the incorrect output that you'll get:

[(-0.001, 3.0], (6.0, 8.0], (3.0, 6.0], (3.0, 6.0], (3.0, 6.0], (-0.001, 3.0]]
Categories (3, interval[float64]): [(-0.001, 3.0] < (3.0, 6.0] < (6.0, 8.0]]

Expected Output

[(0, 3], (6, 8], (3, 6], (3, 6], (3, 6], (0, 3]]
Categories (3, interval[int64]): [[0, 3] < (3, 6] < (6, 8]]
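(For completeness, one possible workaround, purely a sketch and not the mixed-closed output expected above, is to pass an explicit left-closed IntervalIndex as bins, which avoids the -0.001 extension entirely at the cost of making every bin left-closed:)

```python
import numpy as np
import pandas as pd

# Workaround sketch: build left-closed bins explicitly instead of using
# include_lowest. Note this makes *every* interval left-closed, which differs
# from the mixed [[0, 3] < (3, 6] < (6, 8]] output expected above.
bins = pd.IntervalIndex.from_breaks([0, 3, 6, 8], closed="left")
result = pd.cut(np.array([1, 7, 5, 4, 6, 3]), bins=bins)
# The original integer edges are preserved, and 0 would land in [0, 3).
```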

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.8.2
pip: 10.0.1
setuptools: 40.4.3
Cython: 0.28.5
numpy: 1.15.2
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.0.1
sphinx: 1.8.1
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.0
openpyxl: 2.5.8
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.1.1
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.12
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WWH98932

WWH98932 commented Nov 7, 2018

I ran into the same problem. I converted the intervals to string format and fixed the first bound manually.

@mroeschke
Member

I think this is the documented behavior as described in the bins section of the docstring:

The criteria to bin by.
int : Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.

This situation could be better documented, though.
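(That extension for an integer `bins` can be seen directly; this is a small illustration, not part of the original report:)

```python
import numpy as np
import pandas as pd

x = np.array([1, 7, 5, 4, 6, 3])
# With an integer `bins`, pandas widens the range of x slightly so that the
# minimum falls strictly inside the first (right-closed) bin.
cats = pd.cut(x, bins=3)
# The left edge of the first category sits just below x.min() == 1.
```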

@evarga
Author

evarga commented Jan 13, 2019

Thanks for the comment! Please also see the issue with the altered data type of the intervals (see my original comment above). As far as I know, this behavior is not documented anywhere, and it could be a consequence of the 0.1% extension you cited.

@Nakai-Naoto

The criteria to bin by.
* int : Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.
* sequence of scalars : Defines the bin edges allowing for non-uniform width. No extension of the range of x is done.

So the extension does not seem to be documented for this case, because bins is a sequence of scalars in the code sample.
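(To illustrate that point with a minimal example, not from the original report: with a sequence of scalars and the default include_lowest=False, the leftmost edge is exclusive, so a value equal to it is not binned at all:)

```python
import numpy as np
import pandas as pd

# Default behavior with explicit edges: the bins are (0, 3], (3, 6], (6, 8],
# so the value 0 equals the open left edge and becomes NaN.
cats = pd.cut(np.array([0, 1, 7]), bins=[0, 3, 6, 8])
```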

@gdex1
Contributor

gdex1 commented Jul 12, 2019

I've looked more into how to resolve this, and it seems this is more a restriction of IntervalIndex than a documentation error. IntervalIndex requires that all bins be closed on the same side, so
(3, interval[int64]): [[0, 3] < (3, 6] < (6, 8]] is not valid.

pandas works around this by slightly reducing the lower bound of the first interval and making it left-exclusive. Unfortunately, this converts the intervals from int64 to float64 and creates the following unexpected behavior:

In: 
pd.cut(np.array([-0.0001, 0, 1, 7, 5, 4, 6, 3, 8]), bins=[0, 3, 6, 8], include_lowest=True)

Out: 
[NaN, (-0.001, 3.0], (-0.001, 3.0], (6.0, 8.0], (3.0, 6.0], (3.0, 6.0], (3.0, 6.0], (-0.001, 3.0], (6.0, 8.0]]
Categories (3, interval[float64]): [(-0.001, 3.0] < (3.0, 6.0] < (6.0, 8.0]]

You can see that even though -0.0001 should be in the first interval, it is not assigned to the first bin.

I am new to this so I was wondering what the best way to handle this is?
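(The IntervalIndex restriction mentioned above can be demonstrated directly; this is a sketch, and the exact error type/message may vary across pandas versions:)

```python
import pandas as pd

# An IntervalIndex must have all intervals closed on the same side, so the
# "ideal" mixed index [[0, 3], (3, 6], (6, 8]] cannot be constructed.
try:
    pd.IntervalIndex([pd.Interval(0, 3, closed="both"),
                      pd.Interval(3, 6, closed="right"),
                      pd.Interval(6, 8, closed="right")])
    mixed_allowed = True
except ValueError:
    mixed_allowed = False
```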

@jotasi
Contributor

jotasi commented Aug 27, 2021

As suggested in #42212 (@mroeschke suggested to include it into the discussion here instead), I would love to also have the option to include not only the left-most boundary for left-open intervals but also the right-most boundary for right-open intervals (i.e. for right=False).

My preferred solution would be to change include_lowest to something like final_interval_closed, which would also work for right-open intervals (i.e. when right=False is specified).

Maybe, if the solution to this issue is to adapt the function or the underlying IntervalIndex instead of updating the documentation, this could be kept in mind and implemented as well.
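(Until something like that exists, a fragile workaround for right=False, purely a sketch, is to nudge the final edge up by one ulp so the maximum lands in the last left-closed bin:)

```python
import numpy as np
import pandas as pd

values = np.array([0, 3, 6, 8])
edges = np.array([0, 3, 6, 8], dtype=float)
# Mirror what include_lowest does on the left: bump the last edge to the next
# representable float so that 8 falls inside the final [6, 8.0000...) bin.
edges[-1] = np.nextafter(edges[-1], np.inf)
cats = pd.cut(values, bins=edges, right=False)
```

Like include_lowest, this forces float64 edges, so it shares the dtype drawback discussed in this issue.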

@edwhuang23
Contributor

take

@bluenote10

Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.

This sounds like it depends on the minimum of x, but that doesn't seem to be the case, does it?

The current behavior raises a number of questions:

  • Why does it not simply use a left-closed interval as the first bin instead of the awkward extension? Isn't that what they are for?
  • How can I inclusively bin -np.inf? As far as I can see this cannot be counted correctly with extending.
  • What if I want to recover the original bin edges from the result? Using include_lowest currently messes this up, because one would have to implement the inverse logic of how it got extended, which is super fragile.

For these reasons it would be great if include_lowest did not actually rely on extending.
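(The edge-recovery problem can be seen concretely; this is an illustrative snippet, not part of the comment above:)

```python
import numpy as np
import pandas as pd

edges = [0, 3, 6, 8]
cats = pd.cut(np.array([1, 7, 5, 4, 6, 3]), bins=edges, include_lowest=True)
# Naively reading the edges back from the categories does not round-trip:
recovered_left = cats.categories[0].left   # slightly below 0, not 0 itself
recovered_rights = [iv.right for iv in cats.categories]
```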

@AshaHolla

take

@AshaHolla

In which file and folder can I find the code for this issue? It's my first time contributing.

@bluenote10

cut is in tile.py.

But as stated above, this issue should probably be classified as a "bug" rather than "documentation", i.e. it would be better to fix the code than to adapt the documentation to its problematic semantics.

@hualiu01

hualiu01 commented Sep 2, 2024

take
