
[COST-861] Tokenizing error - skip bad rows when reading csv #5031

Merged: 5 commits merged into main from the tokenizing-error branch on Apr 22, 2024

Conversation

maskarb
Member

@maskarb maskarb commented Apr 11, 2024

Jira Ticket

COST-861

Description

This change adds on_bad_lines="warn" to the read_csv call for OCP reports, so malformed rows are logged and skipped instead of failing the entire file.
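For reference, a minimal sketch (not the actual koku code; the CSV contents are made up) of how pandas behaves with this flag:

```python
import io
import warnings

import pandas as pd

# A CSV where the third data line has more fields than the header declares.
csv_data = (
    "pod,namespace,node\n"
    "web-1,default,node-a\n"
    "web-2,default,node-b,extra,extra,extra\n"
    "web-3,default,node-c\n"
)

# With on_bad_lines="warn", pandas warns about the malformed line and
# skips it, rather than raising a ParserError for the whole file.
df = pd.read_csv(io.StringIO(csv_data), on_bad_lines="warn")
print(len(df))            # 2 -- the malformed row is dropped
print(list(df["pod"]))    # ['web-1', 'web-3']
```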

Testing

  1. make create-test-customer
  2. Send the attached payload to masu on main (it is a payload for the my-test-cluster-2 OCP source (ocp-on-azure)):
$ curl -F '[email protected]'  http://localhost:5042/api/cost-management/v1/ingest_ocp_payload/
  3. See the payload ingest failure:
masu_server  | [2024-04-15 14:30:49,370] ERROR None 23 File /tmp/tmphseyl2po/insights_local/my-ocp-cluster-2/20240401-20240501/2b29eb8a-cac1-4dd8-a985-494e61601618_openshift_report.2.csv could not be parsed. Reason: Error tokenizing data. C error: Expected 6 fields in line 3, saw 9
  4. Check out this branch and ensure masu reloads.
  5. Resend the payload and see it ingest in masu on this branch:
masu_server  | [2024-04-15 14:33:23,113] INFO None 3971 {'message': 'Successfully extracted OCP for my-ocp-cluster-2/20240401-20240501', 'tracing_id': '2b29eb8a-cac1-4dd8-a985-494e61601618', 'account': 'no_account', 'org_id': 'no_org_id', 'request_id': '054a480b15de4dac9cf709e161daa026', 'cluster_id': 'my-ocp-cluster-2', 'manifest_uuid': '2b29eb8a-cac1-4dd8-a985-494e61601618', 'provider_type': 'OCP', 'schema': 'org1234567'}
masu_server  | Skipping line 3: expected 6 fields, saw 9

Release Notes

  • Proposed release note:
    [COST-861](https://issues.redhat.com/browse/COST-861) Tokenizing error - skip bad rows when reading csv

Notes:

This payload contains a bad row in the 2b29eb8a-cac1-4dd8-a985-494e61601618_openshift_report.2.csv file:

payload.tar.gz

@maskarb maskarb requested review from a team as code owners April 11, 2024 20:56
@maskarb maskarb changed the title Tokenizing error [COST-861] Tokenizing error - skip bad rows when reading csv Apr 11, 2024

codecov bot commented Apr 11, 2024

Codecov Report

Merging #5031 (4715638) into main (ea39c99) will decrease coverage by 0.0%.
The diff coverage is 100.0%.

Additional details and impacted files
@@           Coverage Diff           @@
##            main   #5031     +/-   ##
=======================================
- Coverage   94.1%   94.1%   -0.0%     
=======================================
  Files        377     377             
  Lines      31329   31331      +2     
  Branches    3714    3714             
=======================================
  Hits       29494   29494             
- Misses      1169    1171      +2     
  Partials     666     666             

@maskarb maskarb added the smoke-tests pr_check will build the image and run minimal required smokes label Apr 11, 2024
@maskarb
Member Author

maskarb commented Apr 11, 2024

/retest

koku/masu/external/kafka_msg_handler.py (review threads resolved)
Comment on lines 944 to 945
def test_get_data_frame(self):
    """Test the divide_csv_daily method."""
Contributor

This test should be broken into two distinct tests: one that asserts it returns as expected with valid data, and another that asserts it behaves as intended with invalid data.

The way it reads right now, it is not immediately obvious what the test is asserting. It's a "no news is good news" test instead of one that explicitly spells out pass/fail criteria.
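A hypothetical sketch of the split the reviewer suggests (the test names and the get_data_frame helper are assumptions for illustration, not the actual koku code):

```python
import io
import unittest

import pandas as pd


def get_data_frame(buf):
    # Hypothetical stand-in for the helper under test; assumed to wrap
    # pandas.read_csv with on_bad_lines="warn" as this PR does.
    return pd.read_csv(buf, on_bad_lines="warn")


class TestGetDataFrame(unittest.TestCase):
    def test_get_data_frame_valid_data(self):
        """A well-formed CSV parses with every row retained."""
        df = get_data_frame(io.StringIO("a,b\n1,2\n3,4\n"))
        self.assertEqual(len(df), 2)

    def test_get_data_frame_invalid_data(self):
        """Rows with the wrong field count are skipped, not fatal."""
        df = get_data_frame(io.StringIO("a,b\n1,2\n1,2,3\n3,4\n"))
        self.assertEqual(len(df), 2)
```

Each test then states its own pass/fail criterion, rather than relying on the absence of an exception.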

Contributor

@samdoran samdoran Apr 15, 2024

Also, the docstring needs to be updated or, if the test code is obvious enough, omitted.

@samdoran
Contributor

What happens if we get a file that has many invalid lines and a substantial portion of the file is ignored without error? We would see a lot of warnings in the logs (one per invalid line in my testing), but would the final data be trustworthy? This may not be a concern, but it's something I was wondering.

lcouzens
lcouzens previously approved these changes Apr 16, 2024
@maskarb
Member Author

maskarb commented Apr 16, 2024

What happens if we get a file that has many invalid lines and a substantial portion of the file is ignored without error? We would see a lot of warnings in the logs (one per invalid line in my testing), but would the final data be trustworthy? This may not be a concern, but it's something I was wondering.

I think there are a couple ways to look at this:

  1. We don't receive payloads with multiple bad lines. At most, the tokenizing error, which is the only error we have previously seen with OCP payloads, affects one line across the four reports.
  2. Each line in the report that can be read is certainly valid data. If we somehow receive a payload that has 100 bad lines, then we simply lose 100 lines' worth of data.

It's not so much that the data would be untrustworthy; there would simply be gaps in the data.
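To illustrate the "gaps, not corruption" point, a small sketch with made-up data: the rows pandas does parse come through with their values intact, and only the malformed rows go missing:

```python
import io
import warnings

import pandas as pd

# 8 well-formed rows and 2 malformed rows (wrong field count).
rows = ["1,2,3"] * 4 + ["1,2,3,4"] * 2 + ["7,8,9"] * 4
csv_data = "a,b,c\n" + "\n".join(rows) + "\n"

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the per-line warnings
    df = pd.read_csv(io.StringIO(csv_data), on_bad_lines="warn")

# All 8 parseable rows survive unchanged; the 2 bad rows leave a gap.
print(len(df))         # 8
print(df["a"].sum())   # 4*1 + 4*7 = 32
```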

@samdoran
Contributor

It's not so much that the data would be not trustworthy, there would simply be gaps in the data.

Thanks for clarifying. That's what I thought but wanted to double check.

@maskarb
Member Author

maskarb commented Apr 22, 2024

/retest

1 similar comment

Contributor

@lcouzens lcouzens left a comment

LGTM

@maskarb maskarb merged commit 573c012 into main Apr 22, 2024
11 checks passed
@maskarb maskarb deleted the tokenizing-error branch April 22, 2024 17:50