[YAML] - Kafka Proto String schema #29835

ffernandez92 · 2023-12-20T15:44:52Z

Adds the option of providing the protobuf schema as a string, not only as a file descriptor.
Addresses [Feature Request][YAML]: KafkaIO for YAML #28664.


github-actions · 2023-12-20T16:36:07Z

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @riteshghorse for label python.
R: @damondouglas for label java.
R: @damondouglas for label io.

Available commands:

stop reviewer notifications - opt out of the automated review tooling
remind me after tests pass - tag the comment author after tests pass
waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

ffernandez92 · 2023-12-21T09:01:23Z

@brucearctor this PR contains the String Proto schema feature for Kafka. One check fails with the error: CData section too big found, line 100046, column 254 (TEST-org.apache.beam.sdk.io.kafka.KafkaIOIT.xml, line 100046).
Apparently the tests for that class (which I haven't touched) generate XML files that are validated downstream. It looks like we are hitting a limitation where the XML result is bigger than 10MB.

I've tested this with Dataflow as well using different configurations and it seems to be working fine.

ffernandez92 · 2023-12-21T09:13:58Z

A bit more info about that failed test:

The test beam_PreCommit_Java_Kafka_IO_Direct runs fine. However, the Publish JUnit Test Results step shows the following error when publishing the test results:

Run EnricoMi/publish-unit-test-result-action@v2

/usr/local/bin/docker run --name ghcrioenricomipublishunittestresultactionv2110_da9c74 --label 9ada59 --workdir /github/workspace --rm -e "GRADLE_ENTERPRISE_ACCESS_KEY" -e "GRADLE_ENTERPRISE_CACHE_USERNAME" -e "GRADLE_ENTERPRISE_CACHE_PASSWORD" -e "KUBELET_GCLOUD_CONFIG_PATH" -e "GRADLE_BUILD_ACTION_SETUP_COMPLETED" -e "GRADLE_BUILD_ACTION_CACHE_RESTORED" -e "INPUT_COMMIT" -e "INPUT_COMMENT_MODE" -e "INPUT_FILES" -e "INPUT_GITHUB_TOKEN" -e "INPUT_GITHUB_TOKEN_ACTOR" -e "INPUT_GITHUB_RETRIES" -e "INPUT_CHECK_NAME" -e "INPUT_COMMENT_TITLE" -e "INPUT_FAIL_ON" -e "INPUT_ACTION_FAIL" -e "INPUT_ACTION_FAIL_ON_INCONCLUSIVE" -e "INPUT_JUNIT_FILES" -e "INPUT_NUNIT_FILES" -e "INPUT_XUNIT_FILES" -e "INPUT_TRX_FILES" -e "INPUT_TIME_UNIT" -e "INPUT_TEST_FILE_PREFIX" -e "INPUT_REPORT_INDIVIDUAL_RUNS" -e "INPUT_REPORT_SUITE_LOGS" -e "INPUT_DEDUPLICATE_CLASSES_BY_FILE_NAME" -e "INPUT_LARGE_FILES" -e "INPUT_IGNORE_RUNS" -e "INPUT_JOB_SUMMARY" -e "INPUT_COMPARE_TO_EARLIER_COMMIT" -e "INPUT_PULL_REQUEST_BUILD" -e "INPUT_EVENT_FILE" -e "INPUT_EVENT_NAME" -e "INPUT_TEST_CHANGES_LIMIT" -e "INPUT_CHECK_RUN_ANNOTATIONS" -e "INPUT_CHECK_RUN_ANNOTATIONS_BRANCH" -e "INPUT_SECONDS_BETWEEN_GITHUB_READS" -e "INPUT_SECONDS_BETWEEN_GITHUB_WRITES" -e "INPUT_SECONDARY_RATE_LIMIT_WAIT_SECONDS" -e "INPUT_JSON_FILE" -e "INPUT_JSON_THOUSANDS_SEPARATOR" -e "INPUT_JSON_SUITE_DETAILS" -e "INPUT_JSON_TEST_CASE_RESULTS" -e "INPUT_SEARCH_PULL_REQUESTS" -e "HOME" -e "GITHUB_JOB" -e "GITHUB_REF" -e "GITHUB_SHA" -e "GITHUB_REPOSITORY" -e "GITHUB_REPOSITORY_OWNER" -e "GITHUB_REPOSITORY_OWNER_ID" -e "GITHUB_RUN_ID" -e "GITHUB_RUN_NUMBER" -e "GITHUB_RETENTION_DAYS" -e "GITHUB_RUN_ATTEMPT" -e "GITHUB_REPOSITORY_ID" -e "GITHUB_ACTOR_ID" -e "GITHUB_ACTOR" -e "GITHUB_TRIGGERING_ACTOR" -e "GITHUB_WORKFLOW" -e "GITHUB_HEAD_REF" -e "GITHUB_BASE_REF" -e "GITHUB_EVENT_NAME" -e "GITHUB_SERVER_URL" -e "GITHUB_API_URL" -e "GITHUB_GRAPHQL_URL" -e "GITHUB_REF_NAME" -e "GITHUB_REF_PROTECTED" -e "GITHUB_REF_TYPE" -e 
"GITHUB_WORKFLOW_REF" -e "GITHUB_WORKFLOW_SHA" -e "GITHUB_WORKSPACE" -e "GITHUB_EVENT_PATH" -e "GITHUB_PATH" -e "GITHUB_ENV" -e "GITHUB_STEP_SUMMARY" -e "GITHUB_STATE" -e "GITHUB_OUTPUT" -e "GITHUB_ACTION" -e "GITHUB_ACTION_REPOSITORY" -e "GITHUB_ACTION_REF" -e "RUNNER_OS" -e "RUNNER_ARCH" -e "RUNNER_NAME" -e "RUNNER_ENVIRONMENT" -e "RUNNER_TOOL_CACHE" -e "RUNNER_TEMP" -e "RUNNER_WORKSPACE" -e "ACTIONS_RUNTIME_URL" -e "ACTIONS_RUNTIME_TOKEN" -e "ACTIONS_CACHE_URL" -e "ACTIONS_RESULTS_URL" -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/runner/_work/_temp/_github_home":"/github/home" -v "/runner/_work/_temp/_github_workflow":"/github/workflow" -v "/runner/_work/_temp/_runner_file_commands":"/github/file_commands" -v "/runner/_work/beam/beam":"/github/workspace" ghcr.io/enricomi/publish-unit-test-result-action:v2.11.0
2023-12-21 08:50:11 +0000 - publish -  INFO - Available memory to read files: 17.9 GiB
2023-12-21 08:50:13 +0000 - publish -  INFO - Reading files **/build/test-results/**/*.xml (34 files, 85.9 MiB)
2023-12-21 08:50:14 +0000 - publish -  INFO - Detected 34 JUnit XML files (85.9 MiB)
2023-12-21 08:50:14 +0000 - publish -  INFO - Finished reading 34 files in 1.43 seconds
2023-12-21 08:50:14 +0000 - publish - ERROR - lxml.etree.XMLSyntaxError: CData section too big found, line 100046, column 254
2023-12-21 08:50:14 +0000 - publish - ERROR - lxml.etree.XMLSyntaxError: CData section too big found, line 99247, column 127
2023-12-21 08:50:14 +0000 - publish - ERROR - lxml.etree.XMLSyntaxError: CData section too big found, line 98891, column 128
2023-12-21 08:50:14 +0000 - publish - ERROR - lxml.etree.XMLSyntaxError: CData section too big found, line 99542, column 58
2023-12-21 08:50:14 +0000 - publish - ERROR - lxml.etree.XMLSyntaxError: CData section too big found, line 99182, column 243
2023-12-21 08:50:14 +0000 - publish - ERROR - lxml.etree.XMLSyntaxError: CData section too big found, line 98943, column 96
2023-12-21 08:50:15 +0000 - publish -  INFO - Publishing failure results for commit f295bda46585a7acff61ed373379a3b7e0dfeff5
2023-12-21 08:50:17 +0000 - publish -  INFO - Created check https://github.com/apache/beam/runs/19853749247
2023-12-21 08:50:17 +0000 - publish -  INFO - Created job summary
2023-12-21 08:50:17 +0000 - publish -  INFO - Commenting on pull requests disabled
Error: lxml.etree.XMLSyntaxError: CData section too big found, line 100046, column 254
Error: Error processing result file: CData section too big found, line 100046, column 254 (TEST-org.apache.beam.sdk.io.kafka.KafkaIOIT.xml, line 100046)
Error: lxml.etree.XMLSyntaxError: CData section too big found, line 99247, column 127
Error: Error processing result file: CData section too big found, line 99247, column 127 (TEST-org.apache.beam.sdk.io.kafka.KafkaIOIT.xml, line 99247)
Error: lxml.etree.XMLSyntaxError: CData section too big found, line 98891, column 128
Error: Error processing result file: CData section too big found, line 98891, column 128 (TEST-org.apache.beam.sdk.io.kafka.KafkaIOIT.xml, line 98891)
Error: lxml.etree.XMLSyntaxError: CData section too big found, line 99542, column 58
Error: Error processing result file: CData section too big found, line 99542, column 58 (TEST-org.apache.beam.sdk.io.kafka.KafkaIOIT.xml, line 99542)
Error: lxml.etree.XMLSyntaxError: CData section too big found, line 99182, column 243
Error: Error processing result file: CData section too big found, line 99182, column 243 (TEST-org.apache.beam.sdk.io.kafka.KafkaIOIT.xml, line 99182)
Error: lxml.etree.XMLSyntaxError: CData section too big found, line 98943, column 96
Error: Error processing result file: CData section too big found, line 98943, column 96 (TEST-org.apache.beam.sdk.io.kafka.KafkaIOIT.xml, line 98943)
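
To find which report would trip this check before the publish step runs, one option is to measure the CDATA sections in a JUnit XML file directly. The following is a minimal sketch and not part of this PR: the class name `CdataSizeCheckSketch`, its methods, and the limit passed by the caller are all hypothetical, and the roughly 10 MB threshold comes from the observation in the comment above, not from the action's documentation. It counts Java characters, which may differ from the byte count a parser enforces.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: measures CDATA sections in the text of an XML report. */
final class CdataSizeCheckSketch {
  private static final String OPEN = "<![CDATA[";
  private static final String CLOSE = "]]>";

  /** Returns the character length of each CDATA section, in document order. */
  static List<Integer> cdataLengths(String xml) {
    List<Integer> lengths = new ArrayList<>();
    int from = 0;
    while (true) {
      int start = xml.indexOf(OPEN, from);
      if (start < 0) {
        break; // no further CDATA sections
      }
      int bodyStart = start + OPEN.length();
      int end = xml.indexOf(CLOSE, bodyStart);
      if (end < 0) {
        break; // unterminated section: stop scanning instead of guessing
      }
      lengths.add(end - bodyStart);
      from = end + CLOSE.length();
    }
    return lengths;
  }

  /** True if any CDATA section is longer than the caller-supplied limit. */
  static boolean hasOversizedCdata(String xml, int limit) {
    for (int length : cdataLengths(xml)) {
      if (length > limit) {
        return true;
      }
    }
    return false;
  }
}
```

For example, calling `hasOversizedCdata(xml, 10_000_000)` on the contents of TEST-org.apache.beam.sdk.io.kafka.KafkaIOIT.xml would indicate whether a single captured output block exceeds that assumed limit.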

brucearctor · 2023-12-21T22:15:52Z

I still want to dig more closely into the code; I have only skimmed it superficially at this point...

Rereading @ffernandez92's comments -- this seems likely to be an issue with EnricoMi/publish-unit-test-result-action@v2 ... and not the test itself. So, that's a positive!

github-actions · 2024-01-05T12:13:47Z

Reminder, please take a look at this pr: @riteshghorse @damondouglas @damondouglas

brucearctor · 2024-01-05T17:52:25Z

Will be curious anyone else's thoughts.

As I understand it, this PR is fine - from a code perspective. BUT, introduces an issue due to some of our testing infrastructure. Not that a test 'fails' but rather a limitation in something we rely on.

I'm inclined to merge the PR, and then address the limitations in the testing infra afterwards [ if it were to persist ]. Thoughts?

Polber

Thanks for adding this! I gave a few suggestions, mostly nits on formatting and organization.

...va/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaReadSchemaTransformProvider.java

...xtensions/protobuf/src/main/java/org/apache/beam/sdk/extensions/protobuf/ProtoByteUtils.java

...a/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaWriteSchemaTransformProvider.java

...va/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaReadSchemaTransformProvider.java

Polber · 2024-01-08T21:46:52Z

...a/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaWriteSchemaTransformProvider.java

+        } else {
          throw new IllegalArgumentException(
-              "Expecting both descriptorPath and messageName to be non-null.");
+              "At least a descriptorPath or a proto Schema is required.");
        }


More for my understanding - why is a schema provided by the Configuration required here? The other data formats use the schema from the incoming PCollectionRowTuple input to create the schema for the outgoing PCollectionRowTuple output. Can the output Proto schema not be constructed from the input Row schema?

ffernandez92 · 2024-01-09

Providing a separate schema in the Configuration offers flexibility and explicit control over the translation process, particularly when addressing variations in field mapping, data types, nested structures, default values, and schema evolution. Other alternatives, such as using the StorageApiProto, were considered, but this approach could potentially prevent the resulting output from matching the expected Proto schema for the subsequent reader. Another option explored was similar to the approach used in Scio (https://spotify.github.io/scio/io/Protobuf.html#write-protobuf-files), where a wrapper is created. However, this method introduces a layer of abstraction, potentially resulting in the output not precisely aligning with the user's desired schema. I remain open to suggestions for alternative approaches in this context.

brucearctor · 2024-01-09

Interesting -- I didn't realize a translation function existed from Beam Row to proto. I imagine there are things around it: https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigquery/BeamRowToStorageApiProto.html

We'd definitely need to understand the translation much more, to ensure it is sufficiently deterministic. Passing the information explicitly removes all doubt.

Polber · 2024-01-09

I see, in that case I think providing an explicit schema, at least optionally, makes sense. Perhaps support for an implicit schema could be added in a future FR.
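
The check discussed in this thread can be sketched as follows. This is an illustrative assumption, not the code in KafkaWriteSchemaTransformProvider: the class `ProtoSchemaSourceSketch`, the `Source` enum, the parameter names, the treatment of empty strings as unset, and the precedence of the descriptor path over the schema string are all hypothetical. Only the final error message is taken from the diff above.

```java
/** Hypothetical sketch: decide where the Proto schema for the Kafka write comes from. */
final class ProtoSchemaSourceSketch {

  /** Illustrative enum; not part of Beam. */
  enum Source {
    DESCRIPTOR_FILE,
    SCHEMA_STRING
  }

  /**
   * Returns which schema source to use.
   * Assumption: a descriptor path, when present, takes precedence over a schema string.
   */
  static Source resolve(String descriptorPath, String protoSchema) {
    if (isSet(descriptorPath)) {
      return Source.DESCRIPTOR_FILE;
    }
    if (isSet(protoSchema)) {
      return Source.SCHEMA_STRING;
    }
    // Message text as it appears in the diff under review.
    throw new IllegalArgumentException(
        "At least a descriptorPath or a proto Schema is required.");
  }

  /** Assumption: null and empty strings both count as "not provided". */
  private static boolean isSet(String value) {
    return value != null && !value.isEmpty();
  }
}
```

Under these assumptions, a configuration that supplies only a schema string resolves to `SCHEMA_STRING`, and one that supplies neither fails fast with the error from the diff instead of attempting to infer a Proto schema from the input Row schema.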

...o/kafka/src/test/java/org/apache/beam/sdk/io/kafka/KafkaReadSchemaTransformProviderTest.java

Polber · 2024-01-08T21:53:57Z

> Will be curious anyone else's thoughts.
>
> As I understand it, this PR is fine - from a code perspective. BUT, introduces an issue due to some of our testing infrastructure. Not that a test 'fails' but rather a limitation in something we rely on.
>
> I'm inclined to merge the PR, and then address the limitations in the testing infra afterwards [ if it were to persist ]. Thoughts?

I agree with this.

@damccorm do you have any reservations on merging?

damccorm · 2024-01-09T19:42:12Z

> > Will be curious anyone else's thoughts.
> > As I understand it, this PR is fine - from a code perspective. BUT, introduces an issue due to some of our testing infrastructure. Not that a test 'fails' but rather a limitation in something we rely on.
> > I'm inclined to merge the PR, and then address the limitations in the testing infra afterwards [ if it were to persist ]. Thoughts?
>
> I agree with this.
>
> @damccorm do you have any reservations on merging?

Where is the failed check? If the test infra is flaky then I agree we shouldn't block on it. If we are turning a meaningful suite permared then I think we should address that before proceeding. Looking at the current pr, I only see the failing Kafka check which looks like it is running into a timeout (maybe its stuck)? We likely should not ignore that

damccorm · 2024-01-09T19:45:07Z

I added #29964 to address the timeout issue

brucearctor · 2024-01-09T20:16:47Z

> > > Will be curious anyone else's thoughts.
> > > As I understand it, this PR is fine - from a code perspective. BUT, introduces an issue due to some of our testing infrastructure. Not that a test 'fails' but rather a limitation in something we rely on.
> > > I'm inclined to merge the PR, and then address the limitations in the testing infra afterwards [ if it were to persist ]. Thoughts?
> >
> > I agree with this.
> > @damccorm do you have any reservations on merging?
>
> Where is the failed check? If the test infra is flaky then I agree we shouldn't block on it. If we are turning a meaningful suite permared then I think we should address that before proceeding. Looking at the current pr, I only see the failing Kafka check which looks like it is running into a timeout (maybe its stuck)? We likely should not ignore that

Also see --> #29835 (comment)

Error: Error processing result file: CData section too big found, ...

Seems to be an issue with a limitation on EnricoMi/publish-unit-test-result-action@v2 ...?

Since an issue filed, it also seems like we can proceed, and see whether this is a persistent or flaky problem, and then prioritize fixing if warranted -- rather than being a blocker.

Polber

Approving, assuming the conversation over the failing test is resolved before merging.

damccorm · 2024-01-09T21:55:41Z

> > > > Will be curious anyone else's thoughts.
> > > > As I understand it, this PR is fine - from a code perspective. BUT, introduces an issue due to some of our testing infrastructure. Not that a test 'fails' but rather a limitation in something we rely on.
> > > > I'm inclined to merge the PR, and then address the limitations in the testing infra afterwards [ if it were to persist ]. Thoughts?
> > >
> > > I agree with this.
> > > @damccorm do you have any reservations on merging?
> >
> > Where is the failed check? If the test infra is flaky then I agree we shouldn't block on it. If we are turning a meaningful suite permared then I think we should address that before proceeding. Looking at the current pr, I only see the failing Kafka check which looks like it is running into a timeout (maybe its stuck)? We likely should not ignore that
>
> Also see --> #29835 (comment)
>
> Error: Error processing result file: CData section too big found, ...
>
> Seems to be an issue with a limitation on EnricoMi/publish-unit-test-result-action@v2 ...?

Oh I see - this silently failed and the workflow still succeeded. Yeah, I think this is fine to ignore. It's actually not a new issue (e.g. a scheduled run on master ran into this earlier today - https://github.com/apache/beam/actions/runs/7458106150)

> Since an issue filed, it also seems like we can proceed, and see whether this is a persistent or flaky problem, and then prioritize fixing if warranted -- rather than being a blocker.

Have we actually filed the issue? I don't see one referenced in the comments and couldn't find one.

brucearctor · 2024-01-09T22:02:45Z

Merged ... And filed: #29966 ...

[YAML] - Kafka Proto String schema

7863b9b

github-actions bot added python java io extensions kafka protobuf yaml labels Dec 20, 2023

github-actions bot added the Next Action: Reviewers label Dec 20, 2023

[YAML] - Kafka Proto String schema - Fix test

f295bda

github-actions bot added the slow-review label Jan 5, 2024

github-actions bot removed the slow-review label Jan 5, 2024

Polber suggested changes Jan 8, 2024


[YAML] - Add suggestions

1d35eca

Polber approved these changes Jan 9, 2024


brucearctor merged commit 6066af3 into apache:master Jan 9, 2024
89 of 90 checks passed

brucearctor mentioned this pull request Jan 9, 2024

[Bug]: Error: Error processing result file: CData section too big found, ... FROM EnricoMi/publish-unit-test-result-action #29966

Closed

16 tasks
