
SQL PTransform Update. #112

Merged: 3 commits into linkedin:li_trunk on Feb 24, 2024
Conversation

becketqin (Collaborator)

The existing SqlPTransform has two limitations by design:

  1. It only supports batch mode.
  2. It assumes there is always a downstream PTransform after the SqlPTransform, i.e. INSERT INTO statements are not supported.

This patch removes these two restrictions.
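
Below is a hedged usage sketch of the resulting behavior, assuming a (statements, catalogs) constructor for illustration; the actual class, StatementOnlySqlTransform, is quoted later in this conversation.

// Hedged sketch, not the actual API: a statement-only SQL transform
// terminates the pipeline itself (PBegin -> PDone), so the sink comes from
// an INSERT INTO statement rather than from a downstream PTransform.
Pipeline pipeline = Pipeline.create(options);
pipeline.apply(new StatementOnlySqlTransform(statements, catalogs)); // constructor shape assumed
pipeline.run().waitUntilFinish();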


@xinyuiscool left a comment


Looks good. I have a few minor questions to help me understand. Thanks.


@Override
public PDone expand(PBegin input) {
  return PDone.in(input.getPipeline());
}


Shall we log the statements and catalogs here? It seems better to log them somewhere for debugging purposes.

Btw, there is an overridable "PTransform.validate(PipelineOptions)" method that you can use to do validations. I think we can double-check things like the statements not being empty and the catalog being valid.

@becketqin (Collaborator, Author)


We are logging the full statements in the translator. I changed that logging to info level and added a debug-level log here. Good point about validation. I added an empty-statement check there.
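
For reference, a minimal sketch of what that check could look like, using Beam's overridable PTransform.validate(PipelineOptions) hook mentioned above (the exact check in the patch may differ):

@Override
public void validate(PipelineOptions options) {
  // Assumed check: fail early, at pipeline construction time, if no
  // statements were provided.
  if (statements == null || statements.isEmpty()) {
    throw new IllegalStateException("SQL statements must not be empty.");
  }
}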

StreamStatementSet ss = tEnv.createStatementSet();
for (String statement : sqlTransform.getStatements()) {
  combinedStatements.add(statement);
  if (statement.substring(0, INSERT_INTO.length()).toUpperCase().startsWith(INSERT_INTO)) {


nit: maybe move this line of logic into a static method like isInsertIntoStatement()? A bit easier to read.
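
A minimal sketch of the suggested helper (the name comes from the review comment; the implementation is assumed and simplifies the quoted check, whose substring-then-startsWith combination is redundant and would throw on statements shorter than INSERT_INTO):

// Sketch of the suggested extraction; implementation assumed.
private static boolean isInsertIntoStatement(String statement) {
  // trim() guards against leading whitespace; plain startsWith avoids
  // substring() throwing on short statements.
  return statement.trim().toUpperCase().startsWith(INSERT_INTO);
}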

}
// Now attach everything to StreamExecutionEnv.
ss.attachAsDataStream();
LOG.debug("Executing SQL statements:\n {}", combinedStatements);


I would just make it an info log so it's clear :)

public class StatementOnlySqlTransform extends PTransform<PBegin, PDone> {

  private final List<String> statements;
  private final Map<String, SerializableCatalog> catalogs;


Just curious: can we support multiple catalogs here? Curious what the use case would look like.

@becketqin (Collaborator, Author)


Yes, sometimes the datasets may come from different external storage systems. For example, we may have a job reading from Hive, MySQL and Kafka at the same time. In this case, there might be three catalogs, one for each external system.
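
To illustrate (a hedged sketch; the catalog names and instances are invented), each external system gets its own catalog registered on the Flink TableEnvironment, and statements can then address tables across systems with fully qualified names:

// Hedged sketch: one catalog per external system (instances assumed).
tEnv.registerCatalog("hive", hiveCatalog);
tEnv.registerCatalog("mysql", mysqlCatalog);
tEnv.registerCatalog("kafka", kafkaCatalog);
// A statement can then join across systems, e.g.
// SELECT ... FROM hive.sales.orders o JOIN kafka.events.clicks c ON ...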

    ss.addInsertSql(statement);
  } else {
    // Not an insert into statement. Treat it as a DDL.
    tEnv.executeSql(statement);

@xinyuiscool (Feb 15, 2024)


Can there be valid normal query statements without INSERT INTO? I am not very sure what Flink SQL statements look like today. Maybe everything except DDL has INSERT INTO at the beginning?

@becketqin (Collaborator, Author)


Query logic in SQL is always expressed as a SELECT statement. The query logic has to be represented by some entity, and that entity will be either a Table or a View. So any SELECT statement will be part of either a CREATE TABLE statement or a CREATE VIEW statement. INSERT INTO effectively binds a SELECT statement to an existing table instead of creating a new one.

From the execution perspective, only INSERT INTO triggers an action to run the query. Without an INSERT INTO statement, the queries only define temporary views or tables. That is why we can call tEnv.executeSql() for all the statements other than INSERT INTO: they will not actually trigger an execution. And we have to append all the INSERT INTO statements to a StatementSet so that multiple INSERT INTOs can be executed in the same Flink job.
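
To make the dispatch concrete, here is a hedged sketch of a statement script and the path each statement kind takes (table and connector details are invented):

// Hedged sketch; statement contents invented for illustration.
List<String> statements = Arrays.asList(
    // DDL: executed immediately via tEnv.executeSql(); does not run a job.
    "CREATE TABLE clicks (user_id STRING, url STRING) WITH ('connector' = 'kafka')",
    // A SELECT wrapped in a view: also executeSql(); still no job.
    "CREATE TEMPORARY VIEW click_counts AS "
        + "SELECT user_id, COUNT(*) AS cnt FROM clicks GROUP BY user_id",
    // DML: appended to the StreamStatementSet; this is what triggers execution.
    "INSERT INTO user_stats SELECT * FROM click_counts");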

@becketqin force-pushed the sql_ptransform_update branch from d30d083 to 5ea8e10 on February 22, 2024

@roborahul left a comment

@becketqin force-pushed the sql_ptransform_update branch from 5ea8e10 to b645f96 on February 23, 2024

@xinyuiscool left a comment


Looks great. Thanks.

@github-actions bot added the build label on Feb 24, 2024
@becketqin merged commit ff2d3ea into linkedin:li_trunk on Feb 24, 2024
6 checks passed
@@ -114,6 +114,16 @@ public static <T> SingleOutputSqlTransform<T> of(Class<T> outputClass) {
  return new SingleOutputSqlTransform<>(of(Integer.class, outputClass));
}

/**
 * Create a {@link StatementOnlySqlTransform} which takes a full script of SQL statements and
 * executes them. The statements must have at least one <code>INSERT INTO</code> statement.

@venkata91 (Mar 2, 2024)


@becketqin Sorry for commenting after the PR is merged. A couple of clarifying questions:

  1. What if the SQL statements have an INSERT OVERWRITE instead of INSERT INTO? Is that not a valid StatementOnlySqlTransform?
  2. Similarly, how about having a CREATE TABLE AS SELECT without INSERT INTO? Should we support that as well?

@becketqin (Collaborator, Author)


That is a good point. We should support both. I'll have a follow-up patch.

@becketqin (Collaborator, Author)


I created the PR. However, we currently cannot support CTAS because the OSS Flink StreamStatementSet.attachAsDataStream() does not support it yet. We will need to make a change to OSS Flink first.
