[COST-861] Tokenizing error - skip bad rows when reading csv #5031
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## main #5031 +/- ##
=======================================
- Coverage 94.1% 94.1% -0.0%
=======================================
Files 377 377
Lines 31329 31331 +2
Branches 3714 3714
=======================================
Hits 29494 29494
- Misses 1169 1171 +2
Partials 666 666
/retest
def test_get_data_frame(self):
    """Test the divide_csv_daily method."""
This test should be broken into two distinct tests: one that asserts it returns the expected result with valid data, and another that asserts it behaves as intended with invalid data.
As it reads right now, it is not immediately obvious what the test is asserting. It's a "no news is good news" test instead of one that explicitly spells out pass/fail criteria.
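A rough sketch of what that split could look like. `read_report_csv` is a hypothetical stand-in for the code under test, not the project's actual function, and the column names are invented for illustration. It assumes pandas 1.3 or later, where `on_bad_lines` is available.

```python
import io
import unittest

import pandas as pd


def read_report_csv(buffer):
    """Hypothetical stand-in for the reader under test (not the real project code)."""
    return pd.read_csv(buffer, on_bad_lines="warn")


class ReadReportCsvTest(unittest.TestCase):
    def test_valid_data_returns_all_rows(self):
        # Every row has the same number of fields as the header.
        data = "pod,usage\na,1\nb,2\n"
        frame = read_report_csv(io.StringIO(data))
        self.assertEqual(frame["pod"].tolist(), ["a", "b"])

    def test_row_with_extra_field_is_skipped(self):
        # The second data row has one field too many and should be dropped.
        data = "pod,usage\na,1\nb,2,extra\nc,3\n"
        frame = read_report_csv(io.StringIO(data))
        self.assertEqual(frame["pod"].tolist(), ["a", "c"])
```

Each test states its own pass/fail criterion, so a failure points directly at either the valid-data path or the bad-row path.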
Also, the docstring needs to be updated or, if the test code is obvious enough, omitted.
What happens if we get a file that has many invalid lines and a substantial portion of the file is ignored without error? We would see a lot of warnings in the logs (one per invalid line in my testing), but would the final data be trustworthy? This may not be a concern, but it's something I was wondering.
I think there are a couple of ways to look at this:
It's not so much that the data would be untrustworthy; there would simply be gaps in the data.
Thanks for clarifying. That's what I thought but wanted to double check.
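If the size of those gaps ever needs to be measured, one option is a pre-check that counts how many rows would be skipped. This is only a sketch, not something this PR adds: `count_bad_rows` is a hypothetical helper, and it assumes a "bad" row is one with more fields than the header, which is the case pandas documents for `on_bad_lines`.

```python
import csv
import io


def count_bad_rows(buffer):
    """Return (data_rows, rows_with_too_many_fields) for a CSV buffer.

    Hypothetical helper: treats a row as bad when it has more fields
    than the header row.
    """
    reader = csv.reader(buffer)
    header = next(reader, None)
    if header is None:
        return 0, 0
    total = 0
    bad = 0
    for row in reader:
        if not row:
            # Ignore blank lines, as pandas does by default.
            continue
        total += 1
        if len(row) > len(header):
            bad += 1
    return total, bad


sample = "pod,usage\na,1\nb,2,extra\n\nc,3\n"
total_rows, bad_rows = count_bad_rows(io.StringIO(sample))
```

A caller could compare `bad_rows / total_rows` against a threshold and log or fail loudly when too much of a file would be dropped.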
/retest
LGTM
Jira Ticket
COST-861
Description
This change will add on_bad_lines="warn" to read_csv for OCP reports.
Testing
make create-test-customer
Release Notes
Notes:
This payload contains a bad row in the 2b29eb8a-cac1-4dd8-a985-494e61601618_openshift_report.2.csv file: payload.tar.gz
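For reference, a minimal sketch of the behavior the description relies on, assuming pandas 1.3 or later. The CSV content and column names are made up for illustration and are not taken from the payload. With the default `on_bad_lines="error"`, a row with too many fields raises a `ParserError` ("Error tokenizing data"); with `"warn"`, the row is skipped and a warning is emitted.

```python
import io

import pandas as pd

# Illustrative data only: the second data row has an extra field.
CSV_TEXT = (
    "pod,namespace,usage\n"
    "a,ns1,1.0\n"
    "b,ns2,2.0,unexpected\n"
    "c,ns3,3.0\n"
)

# Default behavior: the malformed row aborts the read.
try:
    pd.read_csv(io.StringIO(CSV_TEXT))
    raised = False
except pd.errors.ParserError:
    raised = True

# With on_bad_lines="warn", the malformed row is dropped and the rest is kept.
frame = pd.read_csv(io.StringIO(CSV_TEXT), on_bad_lines="warn")
```

Note that only rows with too many fields count as bad lines here; rows with too few fields are filled with missing values rather than skipped.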