Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TST: Improved test coverage for Styler.bar error conditions #56341

Merged
merged 10 commits into from
Dec 11, 2023

Conversation

ccccjone
Copy link
Contributor

@ccccjone ccccjone commented Dec 5, 2023

  • closes BUG: .style.bar doesn't work with missing pyarrow values #56283

  • Added two new unit tests test_bar_color_and_cmap_error_raises() and test_bar_invalid_color_type_error_raises() to improve the coverage of Styler.bar where the issue mentioned, ensuring the method raises the appropriate ValueError.

  • Implemented a solution using df.min/max before np.nanmin/nanmax to address issues with NumPy's inability to properly handle pd.NA under certain conditions, and added test_styler_bar_with_NA_values() for testing.

@ccccjone ccccjone changed the title Improved test coverage for Styler.bar error conditions TST: Improved test coverage for Styler.bar error conditions Dec 5, 2023
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these tests relate to a different issue? #56283 is to test that this doesn't raise for pyarrow types

@ccccjone ccccjone requested a review from attack68 as a code owner December 6, 2023 11:43
@ccccjone
Copy link
Contributor Author

ccccjone commented Dec 6, 2023

Do these tests relate to a different issue? #56283 is to test that this doesn't raise for pyarrow types

Hi @mroeschke, this is a simple PR that covers the test for the raise in bar() method.

Since you mentioned, I did a deep dive into this issue, and found that this issue does exist on the main(2.1.3). The pyarrow will read the null value as an NAType in array, which may not be resolved properly by the np.nanmin() or np.nanmax() in certain circumstances. You will see this issue as long as you render the result, or call the to_html() method, they all give the same errors.

To address this, I've implemented a method replacing pd.NA with np.nan, and is applied only before executing np.nanmin() or np.nanmax(), to resolve the issue.

Thanks for your review, and I look forward to your response!

return [replace_pd_NA_with_np_nan(element) for element in data_structure]
elif isinstance(data_structure, np.ndarray):
# Convert numpy array elements recursively
return np.array([replace_pd_NA_with_np_nan(element) for element in data_structure])
Copy link
Contributor

@attack68 attack68 Dec 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks quite non-performant for big sparse arrays.
Are there other approaches that might be better for this whole process.

Maybe there is a substitute for np.nanmin instead of wrangling the data element-wise to make it fit into the numpy function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @attack68 , thanks for your advice! I have implemented an alternative - using df.min/max followed by np.nanmin/nanmax. This will resolve the issues caused by numpy cannot properly handle pd.NA in certain circumstances. Here is the explanation for this issue: when import missing value with pyarrow, pd.NA will be generated in DataFrame, then, doing np.nanmin(df.to_numpy()) in _bar(), numpy will throw a TypeError. However, after adding df.min/max, pd.NA will be properly handled and the returned structure will have no pd.NA, so it will be safe to proceed with np.nanmin/nanmax afterwords. This solution will resolve the issue without causing performance issue, although a little bit tricky.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The alternative proposed will likely be much more performant, as well as being much more condensed and easy to read code

Copy link
Contributor

@attack68 attack68 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls see inline comment

@mroeschke
Copy link
Member

main is not a released version. As of pandas 5e4f2b2 I get

In [1]: import pandas as pd

In [2]: import io
   ...: data = '''name,age,test1,test2,teacher
   ...: Adam,15,95.0,80,Ashby
   ...: Bob,16,81.0,82,Ashby
   ...: Dave,16,89.0,84,Jones
   ...: Fred,15,,88,Jones'''
   ...: scores = pd.read_csv(io.StringIO(data), dtype_backend='pyarrow',
   ...:                     engine='pyarrow'
   ...:                     )
   ...: 
   ...: (scores
   ...: .style.bar(subset='test1')
   ...: )
Out[2]: <pandas.io.formats.style.Styler at 0x1078cdf90>

In [3]: pd.__version__
Out[3]: '2.2.0.dev0+821.g5e4f2b27db'

@ccccjone
Copy link
Contributor Author

ccccjone commented Dec 6, 2023

main is not a released version. As of pandas 5e4f2b2 I get

In [1]: import pandas as pd

In [2]: import io
   ...: data = '''name,age,test1,test2,teacher
   ...: Adam,15,95.0,80,Ashby
   ...: Bob,16,81.0,82,Ashby
   ...: Dave,16,89.0,84,Jones
   ...: Fred,15,,88,Jones'''
   ...: scores = pd.read_csv(io.StringIO(data), dtype_backend='pyarrow',
   ...:                     engine='pyarrow'
   ...:                     )
   ...: 
   ...: (scores
   ...: .style.bar(subset='test1')
   ...: )
Out[2]: <pandas.io.formats.style.Styler at 0x1078cdf90>

In [3]: pd.__version__
Out[3]: '2.2.0.dev0+821.g5e4f2b27db'

Hi @mroeschke , you may see the error with scores.style.bar(subset='test1').to_html() both in 2.1.3 and in main not release version. Please have a check.

@ccccjone
Copy link
Contributor Author

ccccjone commented Dec 7, 2023

pre-commit.ci autofix

@mroeschke mroeschke added the Styler conditional formatting using DataFrame.style label Dec 7, 2023


def test_styler_bar_with_NA_values():
df1 = DataFrame({"A": [1, 2, NA, 4]})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you test arrow types here per the original issue?

Copy link
Contributor Author

@ccccjone ccccjone Dec 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mroeschke , thank you for your response. I tried to add pyarrow into the test like the origin issue, but encountered a DeprecationWarning when using engine="pyarrow".

I noticed two PRs related to pyarrow issue, #55637 and #55576 , where several tests were marked as xfail due to parsing or type problems with pyarrow.

However, the test can be passed without specifying the engine, and I updated this test in my recent commit, although pyarrow maybe not suitable for this unit test.

def test_style_with_pyarrow_NA_values():
    data = """name,age,test1,test2,teacher
        Adam,15,95.0,80,Ashby
        Bob,16,81.0,82,Ashby
        Dave,16,89.0,84,Jones
        Fred,15,,88,Jones"""
    df = read_csv(io.StringIO(data), dtype_backend="pyarrow")
    expected_substring = "style type="
    html_output = df.style.bar(subset="test1").to_html()
    assert expected_substring in html_output

Look forward to your advice. Thank you!

@ccccjone ccccjone requested a review from mroeschke December 8, 2023 00:26
@ccccjone ccccjone force-pushed the enhence_Styler_bar_test branch from 3c5e782 to e032e9b Compare December 8, 2023 08:12
@ccccjone ccccjone force-pushed the enhence_Styler_bar_test branch from e032e9b to 13e976a Compare December 8, 2023 21:01
@ccccjone
Copy link
Contributor Author

ccccjone commented Dec 9, 2023

Hello @mroeschke @attack68 , I've updated my PR and would appreciate it if you could take a review. Please let me know if there are any further changes needed. If everything looks good, could you assist with merging? Thanks for your time and help!

Copy link
Contributor

@attack68 attack68 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You will observe that values is data converted to numpy array.
Since you have replaced this by data.min and data.max to solve the problem of left and right I question whether there is the need to still wrap this within np.nanmin and np.nanmax. Probably not if the pandas function returns a scalar in all cases.

Then the next question is what are the other uses of values. In the code below you will observe the function np.nanmean(values). Does this mean your fix will not work when align="mean"? Or if align is callable?

values is also used when there is a cmap.

Your solution may well work for this case and I believe is still a good improvement, but I do not believe that it fully solves the underlying issue and errors may still result when different arguments are used on this function.

Do you want to attempt these or just propose this to address just the issue in this particular case?

@attack68
Copy link
Contributor

attack68 commented Dec 9, 2023

It is possible to create a separate issue for these other identified cases and push two different PRs.

@ccccjone
Copy link
Contributor Author

ccccjone commented Dec 9, 2023

Hi @attack68 , thanks for your valuable feedback. I have tested with align="mean" and got a similar error.

This PR was initially meant to provide a test for issue #56283 , but during the code analysis, I discovered that np.nanmax and np.nanmin can not work on pd.NA. If we don't iterate through data to convert pd.NA to np.nan, using pd.max and pd.min - which can handle pd.NA - seems to be the most clear and concise approach. However, they return Series or DataFrame, which differ in type from the numpy functions' return values (like numpy.int64). Hence, I used nesting to ensure the return types are consistent after modification.

Regarding your concerns about align and cmap, I am more than willing to address them in separate PRs to avoid overcomplicating this one. We can create a new issue to track it.

Thanks for your help and suggestions!

@attack68
Copy link
Contributor

attack68 commented Dec 9, 2023

LGTM.
Please add the new issues before merging this PR. Thx,

@ccccjone
Copy link
Contributor Author

ccccjone commented Dec 9, 2023

@attack68 I have created the issue #56425 . Please help me merge this one. Thank you very much!

@ccccjone
Copy link
Contributor Author

Hi @mroeschke I'm writing to gently follow up on my recent PR #56341. When you have a moment, could you please take a look or help me merge it? Appreciate your time and assistance.

@mroeschke mroeschke added this to the 2.2 milestone Dec 11, 2023
@mroeschke mroeschke merged commit d352d5a into pandas-dev:main Dec 11, 2023
44 checks passed
@mroeschke
Copy link
Member

Thanks @ccccjone

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Styler conditional formatting using DataFrame.style
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: .style.bar doesn't work with missing pyarrow values
3 participants