pandas.cut: the 'include_lowest' argument isn't behaving as documented #23164
Comments
I ran into the same problem. I converted the intervals to strings and fixed the first bound manually.
I think this is the documented behavior as described in the bins section of the docstring:
This situation could be better documented, though.
Thanks for the comment! Please also see the issue with the altered data type of the intervals (see my original comment above). As far as I know, this behavior is not documented anywhere, and it could be a consequence of the 0.1% extension you cited.
This does not seem to be documented because
I've looked more into how to resolve this, and it seems this is more a restriction of IntervalIndex than a documentation error. IntervalIndex requires that all bins are closed on the same side so This is solved by slightly reducing the lower bound of the first interval and making it lower exclusive. Unfortunately, this converts the interval from int64 to float64 and creates the following unexpected behavior:
In:
pd.cut(np.array([-0.0001, 0, 1, 7, 5, 4, 6, 3, 8]), bins=[0, 3, 6, 8], include_lowest=True)
Out:
[NaN, (-0.001, 3.0], (-0.001, 3.0], (6.0, 8.0], (3.0, 6.0], (3.0, 6.0], (3.0, 6.0], (-0.001, 3.0], (6.0, 8.0]]
Categories (3, interval[float64]): [(-0.001, 3.0] < (3.0, 6.0] < (6.0, 8.0]]
You can see that even though -0.0001 should be in the first interval, it is not assigned to the first bin. I am new to this, so I was wondering what the best way to handle this is?
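The mismatch described above can be checked programmatically. The following is a minimal sketch, not a fix: it reuses the call from the comment and compares the displayed first category with the bin actually assigned to -0.0001. The variable names are only illustrative, and the behavior is assumed to match the pandas versions discussed in this thread.

```python
import numpy as np
import pandas as pd

# Same input as in the comment above.
x = np.array([-0.0001, 0, 1, 7, 5, 4, 6, 3, 8])
result = pd.cut(x, bins=[0, 3, 6, 8], include_lowest=True)

# The first category as it is displayed to the user.
first_label = result.categories[0]

# Does the displayed interval contain -0.0001?
label_contains_value = -0.0001 in first_label

# Was -0.0001 actually assigned to a bin? A code of -1 means missing (NaN).
value_is_missing = result.codes[0] == -1

# The true lower edge, 0, is assigned to the first bin (code 0).
lower_edge_code = result.codes[1]

print(first_label, label_contains_value, value_is_missing, lower_edge_code)
```

If the label says the value belongs to the interval while the code says it is missing, the displayed categories and the real bin edges disagree, which is the inconsistency reported here.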
As suggested in #42212 (@mroeschke suggested to include it into the discussion here instead), I would love to also have the option to include not only the left-most boundary for left-open intervals but also the right-most boundary for right-open intervals (i.e. for My preferred solution would be to change Maybe, if the solution to this issue is to adapt the function or the underlying
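To illustrate the right-open case mentioned above, here is a small sketch with made-up data. It assumes the request refers to right=False, where the largest bin edge falls outside every interval; the sample values are an assumption chosen to hit the edges.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the lowest edge, an inner edge, and the highest edge.
x = np.array([0, 3, 8])

# With right=False the intervals are [0, 3), [3, 6), [6, 8).
right_open = pd.cut(x, bins=[0, 3, 6, 8], right=False)

# Codes index into the categories; -1 marks a value that fell into no bin.
codes = list(right_open.codes)

print(right_open)
print(codes)
```

In this sketch the value 8 sits exactly on the right-most edge and ends up without a bin, which is the counterpart of the left-most edge problem that include_lowest addresses.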
take |
This sounds like it depends on the minimum of The current behavior raises a number of questions:
For these reasons it would be great if
take |
In which file and folder can I find the code for this issue? It's my first time contributing.
But as stated above, the issue should probably be classified as a "bug" rather than "documentation", i.e., it would be better to fix the code than to adapt the documentation to the problematic semantics of the code.
take |
Code Sample
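The original code sample is not preserved in this text. A minimal sketch that shows the same symptom might look like the following; the sample values are made up, and the bins are borrowed from the comments in this thread rather than from the original report.

```python
import numpy as np
import pandas as pd

values = [1, 7, 5, 4, 6, 3]  # hypothetical sample data
bins = [0, 3, 6, 8]          # bins used elsewhere in this thread

default = pd.cut(values, bins=bins)
with_lowest = pd.cut(values, bins=bins, include_lowest=True)

# Numeric type of the interval endpoints in each result.
default_subtype = default.categories.dtype.subtype
lowest_subtype = with_lowest.categories.dtype.subtype

# First interval when include_lowest=True: still closed on the right only.
first_interval = with_lowest.categories[0]

print(default.categories)
print(with_lowest.categories)
print(default_subtype, lowest_subtype, first_interval.closed)
```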
Problem description
Just by setting include_lowest to True, the data type of the interval changes from int64 to float64 and the first interval isn't left-inclusive. Here is the wrong output that you'll get:
Expected Output
Output of pd.show_versions()
pandas: 0.23.4
pytest: 3.8.2
pip: 10.0.1
setuptools: 40.4.3
Cython: 0.28.5
numpy: 1.15.2
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.0.1
sphinx: 1.8.1
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.0
openpyxl: 2.5.8
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.1.1
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.12
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None