
[COST-861] Tokenizing error - skip bad rows when reading csv #5031

Merged: 5 commits merged into main from the tokenizing-error branch on Apr 22, 2024

Conversation

maskarb
Member

@maskarb maskarb commented Apr 11, 2024

Jira Ticket

COST-861

Description

This change adds on_bad_lines="warn" to the read_csv call for OCP reports, so malformed rows are logged and skipped instead of failing the entire file.
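For reference, a minimal sketch (not the actual koku code; the CSV contents are made up) of how pandas behaves with this flag:

```python
import io
import warnings

import pandas as pd

# A CSV where the third data line has more fields than the header declares.
csv_data = (
    "pod,namespace,node\n"
    "web-1,default,node-a\n"
    "web-2,default,node-b,extra,extra,extra\n"
    "web-3,default,node-c\n"
)

# With on_bad_lines="warn", pandas warns about the malformed line and
# skips it, rather than raising a ParserError for the whole file.
df = pd.read_csv(io.StringIO(csv_data), on_bad_lines="warn")
print(len(df))            # 2 -- the malformed row is dropped
print(list(df["pod"]))    # ['web-1', 'web-3']
```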

Testing

  1. make create-test-customer
  2. Send the attached payload to masu on main (it is a payload for the my-test-cluster-2 OCP source (ocp-on-azure)):
$ curl -F '[email protected]'  http://localhost:5042/api/cost-management/v1/ingest_ocp_payload/
  3. See the payload ingest failure:
masu_server  | [2024-04-15 14:30:49,370] ERROR None 23 File /tmp/tmphseyl2po/insights_local/my-ocp-cluster-2/20240401-20240501/2b29eb8a-cac1-4dd8-a985-494e61601618_openshift_report.2.csv could not be parsed. Reason: Error tokenizing data. C error: Expected 6 fields in line 3, saw 9
  4. Check out this branch and ensure masu reloads.
  5. Resend the payload and see it ingest in masu on this branch:
masu_server  | [2024-04-15 14:33:23,113] INFO None 3971 {'message': 'Successfully extracted OCP for my-ocp-cluster-2/20240401-20240501', 'tracing_id': '2b29eb8a-cac1-4dd8-a985-494e61601618', 'account': 'no_account', 'org_id': 'no_org_id', 'request_id': '054a480b15de4dac9cf709e161daa026', 'cluster_id': 'my-ocp-cluster-2', 'manifest_uuid': '2b29eb8a-cac1-4dd8-a985-494e61601618', 'provider_type': 'OCP', 'schema': 'org1234567'}
masu_server  | Skipping line 3: expected 6 fields, saw 9

Release Notes

  • Proposed release note:
    [COST-861](https://issues.redhat.com/browse/COST-861) Tokenizing error - skip bad rows when reading csv

Notes:

This payload contains a bad row in the 2b29eb8a-cac1-4dd8-a985-494e61601618_openshift_report.2.csv file:

payload.tar.gz

@maskarb maskarb requested review from a team as code owners April 11, 2024 20:56
@maskarb maskarb changed the title Tokenizing error [COST-861] Tokenizing error - skip bad rows when reading csv Apr 11, 2024

codecov bot commented Apr 11, 2024

Codecov Report

Merging #5031 (4715638) into main (ea39c99) will decrease coverage by 0.0%.
The diff coverage is 100.0%.

Additional details and impacted files
@@           Coverage Diff           @@
##            main   #5031     +/-   ##
=======================================
- Coverage   94.1%   94.1%   -0.0%     
=======================================
  Files        377     377             
  Lines      31329   31331      +2     
  Branches    3714    3714             
=======================================
  Hits       29494   29494             
- Misses      1169    1171      +2     
  Partials     666     666             

@maskarb maskarb added the smoke-tests pr_check will build the image and run minimal required smokes label Apr 11, 2024
@maskarb
Member Author

maskarb commented Apr 11, 2024

/retest

koku/masu/external/kafka_msg_handler.py (review threads resolved)
Comment on lines 944 to 945
def test_get_data_frame(self):
    """Test the divide_csv_daily method."""
Contributor

This test should be broken into two distinct tests: one that asserts it returns as expected with valid data, and another that asserts it behaves as intended with invalid data.

The way it reads right now, it is not immediately obvious what the test is asserting. It's a "no news is good news" test instead of one that explicitly spells out pass/fail criteria.
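A hypothetical sketch of the split the reviewer suggests (the test names and the get_data_frame helper are assumptions for illustration, not the actual koku code):

```python
import io
import unittest

import pandas as pd


def get_data_frame(buf):
    # Hypothetical stand-in for the helper under test; assumed to wrap
    # pandas.read_csv with on_bad_lines="warn" as this PR does.
    return pd.read_csv(buf, on_bad_lines="warn")


class TestGetDataFrame(unittest.TestCase):
    def test_get_data_frame_valid_data(self):
        """A well-formed CSV parses with every row retained."""
        df = get_data_frame(io.StringIO("a,b\n1,2\n3,4\n"))
        self.assertEqual(len(df), 2)

    def test_get_data_frame_invalid_data(self):
        """Rows with the wrong field count are skipped, not fatal."""
        df = get_data_frame(io.StringIO("a,b\n1,2\n1,2,3\n3,4\n"))
        self.assertEqual(len(df), 2)
```

Each test then states its own pass/fail criterion, rather than relying on the absence of an exception.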

Contributor

@samdoran samdoran Apr 15, 2024

Also, the docstring needs to be updated or, if the test code is obvious enough, omitted.

@samdoran
Contributor

What happens if we get a file that has many invalid lines and a substantial portion of the file is ignored without error? We would see a lot of warnings in the logs (one per invalid line in my testing), but would the final data be trustworthy? This may not be a concern, but it's something I was wondering.

lcouzens
lcouzens previously approved these changes Apr 16, 2024
@maskarb
Member Author

maskarb commented Apr 16, 2024

What happens if we get a file that has many invalid lines and a substantial portion of the file is ignored without error? We would see a lot of warnings in the logs (one per invalid line in my testing), but would the final data be trustworthy? This may not be a concern, but it's something I was wondering.

I think there are a couple ways to look at this:

  1. We don't receive payloads with multiple bad lines. At most, the tokenizing error, which is the only error we have previously seen with OCP payloads, affects one line across the four reports.
  2. Each line in the report that can be read is certainly valid data. If we somehow receive a payload that has 100 bad lines, then we simply lose 100 lines' worth of data.

It's not so much that the data would be untrustworthy; there would simply be gaps in the data.
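To illustrate the "gaps, not corruption" point, a small sketch with made-up data: the rows pandas does parse come through with their values intact, and only the malformed rows go missing:

```python
import io
import warnings

import pandas as pd

# 8 well-formed rows and 2 malformed rows (wrong field count).
rows = ["1,2,3"] * 4 + ["1,2,3,4"] * 2 + ["7,8,9"] * 4
csv_data = "a,b,c\n" + "\n".join(rows) + "\n"

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the per-line warnings
    df = pd.read_csv(io.StringIO(csv_data), on_bad_lines="warn")

# All 8 parseable rows survive unchanged; the 2 bad rows leave a gap.
print(len(df))         # 8
print(df["a"].sum())   # 4*1 + 4*7 = 32
```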

@samdoran
Contributor

It's not so much that the data would be not trustworthy, there would simply be gaps in the data.

Thanks for clarifying. That's what I thought but wanted to double check.

@maskarb
Member Author

maskarb commented Apr 22, 2024

/retest

1 similar comment

Contributor

@lcouzens lcouzens left a comment

LGTM

@maskarb maskarb merged commit 573c012 into main Apr 22, 2024
11 checks passed
@maskarb maskarb deleted the tokenizing-error branch April 22, 2024 17:50