[YAML] Kafka Read Provider #28865

Merged: 4 commits into apache:master on Oct 27, 2023

Conversation

ffernandez92 (Contributor)

addresses #28664


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make the review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch): Build python source distribution and wheels, Python tests, Java tests, Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

codecov bot commented Oct 6, 2023

Codecov Report

Merging #28865 (c4a09a7) into master (01197c4) will decrease coverage by 33.87%.
Report is 267 commits behind head on master.
The diff coverage is 0.00%.

@@             Coverage Diff             @@
##           master   #28865       +/-   ##
===========================================
- Coverage   72.23%   38.36%   -33.87%     
===========================================
  Files         684      686        +2     
  Lines      101241   101673      +432     
===========================================
- Hits        73129    39010    -34119     
- Misses      26535    61088    +34553     
+ Partials     1577     1575        -2     
Flag | Coverage Δ
python | 29.98% <0.00%> (-52.80%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files | Coverage Δ
sdks/python/apache_beam/pipeline.py | 19.73% <ø> (-72.13%) ⬇️
...on/apache_beam/runners/dataflow/dataflow_runner.py | 22.28% <ø> (-58.61%) ⬇️
...pache_beam/runners/interactive/interactive_beam.py | 0.00% <ø> (-82.34%) ⬇️
...eam/runners/interactive/interactive_environment.py | 0.00% <ø> (-92.03%) ⬇️
...ache_beam/runners/interactive/recording_manager.py | 0.00% <ø> (-96.57%) ⬇️
...beam/testing/load_tests/load_test_metrics_utils.py | 0.00% <ø> (-33.90%) ⬇️
sdks/python/apache_beam/yaml/yaml_provider.py | 0.00% <0.00%> (-64.32%) ⬇️
...ache_beam/testing/analyzers/github_issues_utils.py | 0.00% <0.00%> (-42.86%) ⬇️
sdks/python/apache_beam/dataframe/frames.py | 0.00% <0.00%> (-95.21%) ⬇️
...hon/apache_beam/testing/analyzers/perf_analysis.py | 0.00% <0.00%> (-16.91%) ⬇️
... and 1 more

... and 307 files with indirect coverage changes


github-actions bot commented Oct 6, 2023

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @riteshghorse for label python.
R: @bvolpato for label java.
R: @Abacn for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@ffernandez92 changed the title from [YAML] Kafka Provider to [YAML] Kafka Read Provider on Oct 6, 2023
@ffernandez92 (Contributor Author)

R: @brucearctor

github-actions bot commented Oct 6, 2023

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

Comment on lines 42 to 44
public static final String VALID_FORMATS_STR = "raw,avro,json";
public static final Set<String> VALID_DATA_FORMATS =
    Sets.newHashSet(VALID_FORMATS_STR.split(","));
Contributor

Wanting to look more at this:

  • Is there a reason for the convention of UPPERCASE, vs lower?
  • Why use a string that we need to then split on the subsequent line, instead of adding to a new set directly?

Contributor

Unless that is more of a Java convention, to avoid having to construct the Hash/Set.

Contributor

Context on the first point: I'm used to FINALS being UPPERCASE.

Contributor Author

Agreed. I tried to match:

- raw: Produces records with a single `payload` field whose contents

so all the IOs follow the same standard.

Regarding "unless that is more of a Java convention, to avoid having to construct the Hash/Set": I'll do some research, but I don't think there is one. I imagine it was written this way to prevent misalignment between the printed list of valid data formats and the Set of formats.

Contributor

+1 to ensuring the two remain consistent (though one could just as easily join for the error message).
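For illustration, a minimal sketch of that alternative (not the code under review): construct the set directly and derive the printable string from it by joining, so the membership check and the error message can never drift apart. The holder class name is hypothetical.

import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical holder class, for illustration only.
public class KafkaFormats {
  // Construct the set directly instead of splitting a string.
  public static final Set<String> VALID_DATA_FORMATS =
      new LinkedHashSet<>(Arrays.asList("raw", "avro", "json"));
  // Join for display, so the error message always matches the set.
  public static final String VALID_FORMATS_STR = String.join(",", VALID_DATA_FORMATS);
}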

- assert dataFormat == null || VALID_DATA_FORMATS.contains(dataFormat)
-     : "Valid data formats are " + VALID_DATA_FORMATS;
+ assert dataFormat == null || isValidDataFormat(dataFormat)
+     : "Valid data formats are " + VALID_FORMATS_STR;
Contributor

TBD on returning the STR vs the SET. I see pros/cons, so this is mostly to call out the thought process and the choice/difference rather than a problem.

@brucearctor (Contributor) Oct 6, 2023

On rereading, I guess I don't see the value of the changes in this file other than adding RAW on line 42? Unless directly creating a set.

Comment on lines 55 to 58
private boolean isValidDataFormat(String dataFormat) {
  // Convert the input dataFormat to lowercase for case-insensitive comparison
  String lowercaseDataFormat = dataFormat.toLowerCase();
  return VALID_DATA_FORMATS.contains(lowercaseDataFormat);
}
Contributor

Why allow case insensitivity? I'd err towards being explicit [ and sensitive on case ]. But curious to hear thoughts.

Contributor Author

Well, I actually considered that too. But, you know, I'm not entirely sure if someone else might be using it through the expansion service. So, my thought was to make it compatible with older versions and not cause any disruptions. That's why I went for the case-insensitive approach.

Contributor

I'd actually aim to err toward one or the other [ either upper or lower ]. Otherwise, we need to be really careful and dig in here to ensure there isn't a case where it passes partly and then doesn't fail. I'm not looking at the code this second, but would there be problems with 'RaW' or 'jSOn' as the data format? It would pass that check, but what if that value is piped through elsewhere?
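One way to contain that risk (a sketch, not the PR's code): normalize the value once at the boundary and hand only the canonical lowercase form downstream, so 'RaW' or 'jSOn' can never leak past the check. The helper name is hypothetical.

import java.util.Locale;

// Hypothetical helper: canonicalize once, validate, and return the lowercase
// form so downstream code never sees a mixed-case format value.
private static String canonicalDataFormat(String dataFormat) {
  String canonical = dataFormat.toLowerCase(Locale.ROOT);
  if (!VALID_DATA_FORMATS.contains(canonical)) {
    throw new IllegalArgumentException(
        "Unknown data format '" + dataFormat + "'; valid formats are " + VALID_FORMATS_STR);
  }
  return canonical;
}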

String format = configuration.getFormat();

if (format == null || format.isEmpty()) {
  format = "raw";
}
Contributor

Do we want to add some sort of log for the user, so that they know that RAW wound up auto-set?

@brucearctor (Contributor) Oct 6, 2023

Do we think that RAW is the ideal default?

I am thinking ahead to a future where we might be able to pickup information from the schema shared to then choose to set the format. That's probably outside the scope of this PR, but to think of a future where that is possible.

Contributor

I think we should require the user to set this. In the future we may want to pull the appropriate schema from metadata (in that case we could add an explicit "infer" value if we wanted, but if there is a default that should be it). We should leave ourselves options--we can always set a default later if we really want but we can't retract one.
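For contrast, a sketch of the "require it" alternative described above (the PR as merged defaults to raw instead): fail fast when the format is absent, which leaves room to add a default or an explicit "infer" value later.

String format = configuration.getFormat();
if (format == null || format.isEmpty()) {
  // Require an explicit choice rather than silently assuming a format.
  throw new IllegalArgumentException(
      "A format must be specified to read from Kafka; valid formats are " + VALID_FORMATS_STR);
}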

"To read from Kafka in raw format, you can't provide a schema.");
}
Schema rawSchema = Schema.builder().addField("payload", Schema.FieldType.BYTES).build();
SerializableFunction<byte[], Row> valueMapper = getRawBytesToRowFunction(rawSchema);
Contributor

Wanting to wrap my head around why we'd have bytes with Row. Row tends to imply schemas.

Contributor Author

Same as before:

- raw: Produces records with a single `payload` field whose contents

It does have a schema; it's just that the schema has only one attribute, payload, and the values are raw bytes.

I do agree that this feature might not see widespread use. Ideally, users should utilize schemas for data processing. However, there could be scenarios where this feature comes in handy. For instance, imagine you have events in a format like a,b,c in Kafka, resembling CSV data. In such cases, you might receive raw bytes in the payload, and it becomes your responsibility to parse them downstream in the correct way.

Contributor

Yeah. Ideally users should be publishing schema'd data, but technically Kafka is just a plumber of bytes and doesn't otherwise impose restrictions. This allows users to at least get at the data and do what they want with it. (Also, perhaps they're using a schema we don't support yet, like protobuf or Cap'n Proto or something.)
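To make this concrete, here is roughly what the raw path amounts to (a sketch; the lambda stands in for what the PR's getRawBytesToRowFunction helper presumably returns): the record still carries a schema, it just has exactly one BYTES field named payload.

import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.Row;

// A single-field schema: each "row" is just a wrapper around the raw bytes.
Schema rawSchema = Schema.builder().addField("payload", Schema.FieldType.BYTES).build();
SerializableFunction<byte[], Row> valueMapper =
    bytes -> Row.withSchema(rawSchema).addValue(bytes).build();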

Comment on lines +60 to +61
'confluent_schema_registry_url': 'confluentSchemaRegistryUrl',
'confluent_schema_registry_subject': 'confluentSchemaRegistrySubject',
Contributor

Do we need the naming of confluent_schema_registry_...? What if someone is using a different schema registry, like Redpanda's?

Comment on lines 160 to 162
assert Strings.isNullOrEmpty(configuration.getConfluentSchemaRegistryUrl())
    : "To read from Kafka, a schema must be provided directly or through Confluent "
        + "Schema Registry, but not both.";
Contributor

Ah, getConfluentSchemaRegistry ... So we might just use that more generically, OR we might look [ not now ] at refactoring that.

Contributor Author

Exactly, that was my idea here. Again, it's a bit uncertain to me if this isn't used by someone somewhere, so I think we still want to support that functionality. That said, though, it makes total sense to include other schema registry technologies as we keep refactoring this class.

@brucearctor (Contributor) Oct 9, 2023

My rule of thumb: if there isn't a test, then it isn't important or nobody is using it. So, the lack of a test effectively tells me that nobody is using it [ otherwise, if they found it important, they would have written a test ].

Please keep this in mind, as it also helps reinforce the importance of writing tests. We need tests in place to ensure others do not break things we intend to use.

Naturally, we need to be careful, but tests help us with that.
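In that spirit, a hypothetical guard-rail test for the mutual-exclusion rule above (the builder calls mirror the existing tests; where the validation fires and the exact throwable type are assumptions, since a plain Java assert surfaces as an AssertionError only when assertions are enabled):

import static org.junit.Assert.assertThrows;
import org.junit.Test;

@Test
public void testInlineSchemaAndRegistryAreMutuallyExclusive() {
  // Providing both an inline schema and a schema registry URL should fail.
  assertThrows(
      Throwable.class,
      () ->
          KafkaReadSchemaTransformConfiguration.builder()
              .setTopic("anytopic")
              .setBootstrapServers("anybootstrap")
              .setSchema(AVRO_SCHEMA)
              .setConfluentSchemaRegistryUrl("http://localhost:8081")
              .build());
}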

@ffernandez92 (Contributor Author)

Run Python_Coverage PreCommit

@brucearctor (Contributor)

@robertwb -- I believe case sensitivity is important [ I've seen it cause more problems, and make things harder to debug, than the user benefits of insensitivity, but ... ]. Is it time to choose whether we want YAML to generally use upper or lower case? We probably want go-to conventions.

Ex: https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaReadSchemaTransformConfiguration.java#L42

^ currently uppercase

if format == 'raw':

^ wants to use lowercase

I'll look into these a little deeper, in case we want to map upper to lower, etc. ... but thought this might be worth calling out [ unless I'm just missing a connection in my mental model here ]. Related: I don't recall seeing this in the syntax doc. https://docs.google.com/document/d/10tzBd6yeElucqLN07OI8MguSQYqvRCYhqgVIChjn67w/edit

Thoughts?

@robertwb (Contributor)

@robertwb -- I believe case sensitivity is important [ I've seen it cause more problems, and make things harder to debug, than the user benefits of insensitivity, but ... ]. Is it time to choose whether we want YAML to generally use upper or lower case? We probably want go-to conventions.
[...]
I'll look into these a little deeper, in case we want to map upper to lower, etc. ... but thought this might be worth calling out [ unless I'm just missing a connection in my mental model here ]. Related: I don't recall seeing this in the syntax doc. https://docs.google.com/document/d/10tzBd6yeElucqLN07OI8MguSQYqvRCYhqgVIChjn67w/edit

Thoughts?

Yeah, case sensitivity is a messy question. Here we're trying to represent an enum value (so there's no risk that raw and RAW and Raw are going to be misinterpreted or should be kept distinct), and there isn't omnipresent IDE support to tell you when you've got it wrong (and no auto-complete either).

Generally, the conventions we have so far are that YAML keys (type, name, input, config, transforms, etc.) are (case-sensitive) snake_case (though preferring to go with one word), transform types are (again case-sensitive) CamelCase, and config parameters (like other keys) are also snake_case. For enum-style values, the one precedent we have is the "language" parameter for MapToFields (Filter, etc.), which is lower case (java, python, javascript, etc.). My preference would be to be consistent between that and this format parameter if we require specific casing there.

I could see this boiling down to a matter of preference though. Perhaps something to ask on the list?

@ffernandez92 (Contributor Author)

Run Python_Runners PreCommit

@ffernandez92 (Contributor Author)

Run Python_Coverage PreCommit

@ffernandez92 (Contributor Author)

Run Python PreCommit

Comment on lines 118 to 121
if (format != null && format.equals("RAW")) {
  if (inputSchema != null) {
    throw new IllegalArgumentException(
-       "To read from Kafka in raw format, you can't provide a schema.");
+       "To read from Kafka in RAW format, you can't provide a schema.");
Contributor

Are there other error scenarios that we should include? I'm trying to think through this one.

Contributor Author

I couldn't think of any others, but I'm open to suggestions.
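One candidate, following the mutual-exclusion rule discussed earlier (a hedged sketch, not code from the PR): RAW format combined with a schema registry URL is presumably also an error, since raw bytes have no registry-managed schema to resolve.

if (format != null && format.equals("RAW")) {
  if (!Strings.isNullOrEmpty(configuration.getConfluentSchemaRegistryUrl())) {
    // Raw bytes carry no registry-managed schema, so a registry URL is meaningless here.
    throw new IllegalArgumentException(
        "To read from Kafka in RAW format, you can't use a schema registry.");
  }
}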

@@ -120,7 +120,6 @@ public void testBuildTransformWithAvroSchema() {
    KafkaReadSchemaTransformConfiguration.builder()
        .setTopic("anytopic")
        .setBootstrapServers("anybootstrap")
-       .setFormat("avro")
        .setSchema(AVRO_SCHEMA)
Contributor

Is .setSchema sufficient [ if AVRO_SCHEMA ], such that we don't need AVRO in setFormat? [ I probably need to read this class to see if this gets auto-set ]

Contributor

Or, is the expectation that this test will now fail without setFormat?

Contributor Author

The original test didn't have the setFormat, so I left the test as it was before to ensure compatibility.

Contributor

I don't fully understand this ... since the diff of the PR shows removing the line. BUT, since tests aren't failing, I guess this isn't causing a problem. I would like to add even more test cases, BUT I don't think that's required, or needed to hold things up.

@ffernandez92 (Contributor Author)

I would like to understand the value or importance of the codecov workflows here. I think it's comparing things that are not even affected by this PR, so it's confusing to me why those fail. I wonder if it has to do with the fact that my "other" files (the ones I haven't changed) have x% diff from master, but it's hard to understand how that can be used or how it's preventing this PR from merging.

@brucearctor (Contributor)

CodeCov seems broken -- it is failing, but this PR did not have that much change in test coverage, so I'm going to ignore those checks for now.

@brucearctor (Contributor) left a comment

I think this is fine for now.

Let's continue to think about [ and not hesitate to add ] tests.

Next up: proto, other ...

@ffernandez92 Congrats on your first PR into Beam!

@brucearctor brucearctor merged commit e98e37f into apache:master Oct 27, 2023
87 of 89 checks passed
@brucearctor (Contributor)

I would like to understand the value or importance of the codecov workflows here. I think it's comparing things that are not even affected by this PR, so it's confusing to me why those fail. I wonder if it has to do with the fact that my "other" files (the ones I haven't changed) have x% diff from master, but it's hard to understand how that can be used or how it's preventing this PR from merging.

It appears that CodeCov is comparing against an old commit [ from when the PR was originally forked ] ... it didn't seem worth digging in to understand that further, for now. If it recurs [ due to a long-open PR ], I suggest we figure out how to get CodeCov to compare the PR commit against current main.
