Kafka with Flink: how to write Debezium JSON from a single topic (containing multiple tables) into different tables via CDC
#3201
I am new to CDC and am looking for a good implementation approach. I want the downstream side to receive a changelog (+I, +U, -U, -D) and to combine it with Hudi and Paimon for near-real-time computation. The existing multi-source databases contain several thousand tables in total. Incremental data and change events are collected into Kafka in a CDC fashion, some of it directly in Debezium JSON format. Because there are so many tables, a single Kafka topic may contain a varying number of tables.
Currently, Flink CDC's Kafka connectors map one topic to one table: the table schema is fixed when the connector is defined, so it is not possible to consume multiple tables from one topic and write them to downstream table formats such as Hudi. My current idea is to use the Flink DataStream API to determine which table each Debezium JSON record belongs to and then write it to the corresponding Hudi or Paimon table. However, in streaming mode, how can the writes to Hudi and Paimon follow the changelog semantics (+I, +U, -U, -D) rather than being batch-style inserts? Any pointers to topics I should study or existing implementations would be appreciated. Thank you.
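The rough shape I have in mind for the DataStream job is something like the sketch below. It is only a minimal illustration, assuming Debezium's default JSON envelope without the schema wrapper (`op`, `before`, `after`, `source.table`); the topic name, bootstrap servers, and the `toRow` helper are placeholders, and the actual per-table Hudi/Paimon sink wiring, schema handling, and exactly-once concerns are not shown.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.types.Row;
import org.apache.flink.types.RowKind;
import org.apache.flink.util.Collector;

public class DebeziumTopicRouter {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")          // placeholder
                .setTopics("cdc-mixed-topic")               // one topic, many tables
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> raw =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "debezium-json");

        // Turn each Debezium envelope into one or two changelog rows that carry
        // the source table name and the Flink RowKind (+I / +U / -U / -D).
        raw.flatMap((String json, Collector<Row> out) -> {
            // New mapper per record only for brevity; a RichFunction-managed
            // instance would be preferable in a real job.
            JsonNode node = new ObjectMapper().readTree(json);
            if (node == null || node.get("op") == null) {
                return;                                     // skip tombstones / malformed records
            }
            String table = node.at("/source/table").asText();
            switch (node.get("op").asText()) {
                case "c":
                case "r":
                    out.collect(toRow(RowKind.INSERT, table, node.get("after")));
                    break;
                case "u":
                    out.collect(toRow(RowKind.UPDATE_BEFORE, table, node.get("before")));
                    out.collect(toRow(RowKind.UPDATE_AFTER, table, node.get("after")));
                    break;
                case "d":
                    out.collect(toRow(RowKind.DELETE, table, node.get("before")));
                    break;
            }
        }).returns(Row.class);
        // Next step (not shown): key or side-output by table name and hand each
        // sub-stream to the matching Hudi/Paimon sink so the RowKind is preserved.

        env.execute("debezium-multi-table-router");
    }

    // Hypothetical helper: wrap the Debezium before/after payload in a Row whose
    // RowKind encodes the changelog semantics expected by the downstream sink.
    private static Row toRow(RowKind kind, String table, JsonNode payload) {
        Row row = Row.withNames(kind);
        row.setField("table", table);
        row.setField("payload", payload == null ? null : payload.toString());
        return row;
    }
}
```

What I am unsure about is the last step: how to turn these per-table changelog streams into writes to Hudi and Paimon tables that actually respect the RowKind, rather than plain appends.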