TST: Improved test coverage for Styler.bar error conditions #56341
Do these tests relate to a different issue? #56283 is to test that this doesn't raise for pyarrow types
Hi @mroeschke, this is a simple PR that covers the test for the raise in Since you mentioned, I did a deep dive into this issue, and found that this issue does exist on the main(2.1.3). The To address this, I've implemented a method replacing Thanks for your review, and I look forward to your response! |
pandas/io/formats/style.py
Outdated
        return [replace_pd_NA_with_np_nan(element) for element in data_structure]
    elif isinstance(data_structure, np.ndarray):
        # Convert numpy array elements recursively
        return np.array([replace_pd_NA_with_np_nan(element) for element in data_structure])
This looks quite non-performant for big sparse arrays.
Are there other approaches that might be better for this whole process?
Maybe there is a substitute for np.nanmin instead of wrangling the data element-wise to make it fit into the numpy function?
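One possible vectorized substitute for the element-wise replacement, sketched here as an assumption and not as what this PR ended up doing: ask pandas for a float array with NaN in place of pd.NA, so that np.nanmin and np.nanmax can run unchanged. The column name and values are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative nullable-integer column containing one pd.NA.
df = pd.DataFrame({"A": pd.array([1, 2, pd.NA, 4], dtype="Int64")})

# Convert in one vectorized call: pd.NA becomes np.nan in a float64 array.
values = df["A"].to_numpy(dtype="float64", na_value=np.nan)

# NumPy's nan-aware reductions skip the NaN entry.
left = np.nanmin(values)
right = np.nanmax(values)
```

This only fits columns that can be represented as floats; whether it suits every dtype that reaches Styler.bar is not established here.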
Hi @attack68, thanks for your advice! I have implemented an alternative: calling df.min/max first, followed by np.nanmin/nanmax. This resolves the cases where numpy cannot properly handle pd.NA. To explain the issue: when missing values are imported with pyarrow, pd.NA is generated in the DataFrame, and calling np.nanmin(df.to_numpy()) in _bar() then makes numpy throw a TypeError. After adding df.min/max, however, pd.NA is handled properly and the returned structure contains no pd.NA, so it is safe to proceed with np.nanmin/nanmax afterwards. This resolves the issue without causing a performance problem, although it is a little bit tricky.
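A minimal sketch of the behaviour described above, using a hand-built column instead of the pyarrow-backed CSV from the issue (the data and variable names are assumptions for illustration). It contrasts NumPy's nan-aware reduction on an object array holding pd.NA with the pandas reductions, which skip missing values by default.

```python
import numpy as np
import pandas as pd

# Illustrative column with a missing value stored as pd.NA.
scores = pd.Series([95.0, 81.0, 89.0, pd.NA], dtype="Float64", name="test1")

# An object-dtype array keeps pd.NA as-is; np.nanmin is expected to fail on it
# because comparisons involving pd.NA cannot be coerced to a plain bool.
raw = np.array([95.0, 81.0, 89.0, pd.NA], dtype=object)
try:
    np.nanmin(raw)
    numpy_failed = False
except (TypeError, ValueError):
    numpy_failed = True

# pandas reductions use skipna=True by default, so pd.NA is ignored.
left = scores.min()
right = scores.max()
```

Whether the failure reproduces can depend on the pandas and NumPy versions and on how the array was created, which is consistent with the differing reports in this thread.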
The alternative proposed will likely be much more performant, as well as much more condensed and easier to read.
pls see inline comment
main is not a released version. As of pandas 5e4f2b2 I get
In [1]: import pandas as pd
In [2]: import io
...: data = '''name,age,test1,test2,teacher
...: Adam,15,95.0,80,Ashby
...: Bob,16,81.0,82,Ashby
...: Dave,16,89.0,84,Jones
...: Fred,15,,88,Jones'''
...: scores = pd.read_csv(io.StringIO(data), dtype_backend='pyarrow',
...: engine='pyarrow'
...: )
...:
...: (scores
...: .style.bar(subset='test1')
...: )
Out[2]: <pandas.io.formats.style.Styler at 0x1078cdf90>
In [3]: pd.__version__
Out[3]: '2.2.0.dev0+821.g5e4f2b27db'
Hi @mroeschke , you may see the error with |
def test_styler_bar_with_NA_values():
    df1 = DataFrame({"A": [1, 2, NA, 4]})
Could you test arrow types here per the original issue?
Hi @mroeschke, thank you for your response. I tried to add pyarrow to the test as in the original issue, but encountered a DeprecationWarning when using engine="pyarrow".
I noticed two PRs related to pyarrow issues, #55637 and #55576, where several tests were marked as xfail due to parsing or type problems with pyarrow.
However, the test passes without specifying the engine, and I updated this test in my recent commit, although pyarrow may not be suitable for this unit test.
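import io

from pandas import read_csv


def test_style_with_pyarrow_NA_values():
    # The CSV rows stay unindented so the parsed values carry no leading spaces.
    data = """name,age,test1,test2,teacher
Adam,15,95.0,80,Ashby
Bob,16,81.0,82,Ashby
Dave,16,89.0,84,Jones
Fred,15,,88,Jones"""
    df = read_csv(io.StringIO(data), dtype_backend="pyarrow")
    expected_substring = "style type="
    # Rendering must not raise even though test1 contains a missing value.
    html_output = df.style.bar(subset="test1").to_html()
    assert expected_substring in html_output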
Look forward to your advice. Thank you!
Hello @mroeschke @attack68, I've updated my PR and would appreciate it if you could take a look. Please let me know if any further changes are needed. If everything looks good, could you assist with merging? Thanks for your time and help!
You will observe that values is data converted to a numpy array.
Since you have replaced this by data.min and data.max to solve the problem of left and right, I question whether there is still a need to wrap this within np.nanmin and np.nanmax. Probably not if the pandas function returns a scalar in all cases.
Then the next question is what the other uses of values are. In the code below you will observe the function np.nanmean(values). Does this mean your fix will not work when align="mean"? Or if align is callable?
values is also used when there is a cmap.
Your solution may well work for this case and I believe it is still a good improvement, but I do not believe that it fully solves the underlying issue, and errors may still result when different arguments are used on this function.
Do you want to attempt these, or propose this to address just the issue in this particular case?
It is possible to create a separate issue for these other identified cases and push two different PRs.
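To make the align question concrete, here is a rough sketch of how a reference value could be computed with NA-aware pandas reductions instead of np.nanmean on a NumPy array. The helper `reference_value` is hypothetical and is not the code in pandas/io/formats/style.py; it only mirrors the align options named in the comment above.

```python
import pandas as pd


def reference_value(data: pd.Series, align) -> float:
    """Hypothetical helper: choose the bar origin using pandas reductions."""
    if callable(align):
        # Hand the callable NA-free data so it never sees pd.NA.
        return float(align(data.dropna()))
    if isinstance(align, (int, float)):
        return float(align)
    if align == "mean":
        return float(data.mean())  # skipna=True by default
    if align == "zero":
        return 0.0
    raise ValueError(f"unsupported align: {align!r}")


# Illustrative nullable column with one missing value.
data = pd.Series([2, 4, pd.NA, 6], dtype="Int64")
mean_origin = reference_value(data, "mean")
callable_origin = reference_value(data, lambda s: s.max())
```

The cmap path mentioned above is not covered by this sketch and would need its own check.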
Hi @attack68 , thanks for your valuable feedback. I have tested with This PR was initially meant to provide a test for issue #56283 , but during the code analysis, I discovered that Regarding your concerns about Thanks for your help and suggestions! |
LGTM.
Hi @mroeschke, I'm writing to gently follow up on my recent PR #56341. When you have a moment, could you please take a look or help me merge it? I appreciate your time and assistance.
Thanks @ccccjone |
closes BUG: .style.bar doesn't work with missing pyarrow values #56283
Added two new unit tests, test_bar_color_and_cmap_error_raises() and test_bar_invalid_color_type_error_raises(), to improve the coverage of Styler.bar where the issue was mentioned, ensuring the method raises the appropriate ValueError.
Implemented a solution using df.min/max before np.nanmin/nanmax to address issues with NumPy's inability to properly handle pd.NA under certain conditions, and added test_styler_bar_with_NA_values() for testing.