feat: improve blog page (#285)

apache · Nov 4, 2023 · d4ee99b · d4ee99b
1 parent 4abf9b4
commit d4ee99b
Show file tree

Hide file tree

Showing 59 changed files with 2,298 additions and 117 deletions.
diff --git a/blog/0-streampark-flink-on-k8s.md b/blog/0-streampark-flink-on-k8s.md
@@ -2,11 +2,12 @@
 slug: streampark-flink-on-k8s
 title: StreamPark Flink on Kubernetes practice
 tags: [StreamPark, 生产实践, FlinkSQL, Kubernetes]
+description: Wuxin Technology was founded in January 2018. The current main business includes the research and development, design, manufacturing and sales of RELX brand products. With core technologies and capabilities covering the entire industry chain, RELX is committed to providing users with products that are both high quality and safe
 ---
 
-# StreamPark Flink on Kubernetes practice
+Wuxin Technology was founded in January 2018. The current main business includes the research and development, design, manufacturing and sales of RELX brand products. With core technologies and capabilities covering the entire industry chain, RELX is committed to providing users with products that are both high quality and safe.
 
-  Wuxin Technology was founded in January 2018. The current main business includes the research and development, design, manufacturing and sales of RELX brand products. With core technologies and capabilities covering the entire industry chain, RELX is committed to providing users with products that are both high quality and safe.
+<!-- truncate -->
 
 ## **Why Choose Native Kubernetes**
 

diff --git a/blog/1-flink-framework-streampark.md b/blog/1-flink-framework-streampark.md
@@ -4,7 +4,11 @@ title: Flink Development Toolkit StreamPark
 tags: [StreamPark, DataStream, FlinkSQL]
 ---
 
-<br/>
+Although the Hadoop system is widely used today, its architecture is complicated, it has a high maintenance complexity, version upgrades are challenging, and due to departmental reasons, data center scheduling is prolonged. We urgently need to explore agile data platform models. With the current prevalence of cloud-native architecture and the backdrop of lake and warehouse integration, we have decided to use Doris as an offline data warehouse and TiDB (which is already in production) as a real-time data platform. Furthermore, because Doris has ODBC capabilities on MySQL, it can integrate external database resources and uniformly output reports.
+
+![](/blog/belle/doris.png)
+
+<!-- truncate -->
 
 # 1. Background
 

diff --git a/blog/2-streampark-usercase-chinaunion.md b/blog/2-streampark-usercase-chinaunion.md
@@ -4,7 +4,7 @@ title: China Union's Flink Real-Time Computing Platform Ops Practice
 tags: [StreamPark, Production Practice, FlinkSQL]
 ---
 
-# China Union Flink Real-Time Computing Platform Ops Practices
+![](/blog/chinaunion/overall_architecture.png)
 
 **Abstract:** This article is compiled from the sharing of Mu Chunjin, the head of China Union Data Science's real-time computing team and Apache StreamPark Committer, at the Flink Forward Asia 2022 platform construction session. The content of this article is mainly divided into four parts:
 
@@ -13,11 +13,10 @@ tags: [StreamPark, Production Practice, FlinkSQL]
 - Integrated Management Based on StreamPark
 - Future Planning and Evolution
 
-## **Introduction to the Real-Time Computing Platform Background**
-
-![](/blog/chinaunion/overall_architecture.png)
+<!-- truncate -->
 
 
+## **Introduction to the Real-Time Computing Platform Background**
 
 The image above depicts the overall architecture of the real-time computing platform. At the bottom layer, we have the data sources. Due to some sensitive information, the detailed information of the data sources is not listed. It mainly includes three parts: business databases, user behavior logs, and user location. China Union has a vast number of data sources, with just the business databases comprising tens of thousands of tables. The data is primarily processed through Flink SQL and the DataStream API. The data processing workflow includes real-time parsing of data sources by Flink, real-time computation of rules, and real-time products. Users perform real-time data subscriptions on the visualization subscription platform. They can draw an electronic fence on the map and set some rules, such as where the data comes from, how long it stays inside the fence, etc. They can also filter some features. User information that meets these rules will be pushed in real-time. Next is the real-time security part. If a user connects to a high-risk base station or exhibits abnormal operational behavior, we may suspect fraudulent activity and take actions such as shutting down the phone number, among other things. Additionally, there are some real-time features of users and a real-time big screen display.
 

diff --git a/blog/3-streampark-usercase-bondex-paimon.md b/blog/3-streampark-usercase-bondex-paimon.md
@@ -4,8 +4,6 @@ title: 海程邦达基于 Apache Paimon + StreamPark 的流式数仓实践
 tags: [StreamPark, 生产实践, paimon, streaming-warehouse]
 ---
 
-# 海程邦达基于 Apache Paimon + StreamPark 的流式数仓实践
-
 ![](/blog/bondex/Bondex.png)
 
 **导读：**本文主要介绍作为供应链物流服务商海程邦达在数字化转型过程中采用 Paimon + StreamPark 平台实现流式数仓的落地方案。我们以 Apache StreamPark 流批一体平台提供了一个易于上手的生产操作手册，以帮助用户提交 Flink 任务并迅速掌握 Paimon 的使用方法。
@@ -16,6 +14,8 @@ tags: [StreamPark, 生产实践, paimon, streaming-warehouse]
 - 问题排查分析
 - 未来规划
 
+<!-- truncate -->
+
 ## 01 公司业务情况介绍
 
 海程邦达集团一直专注于供应链物流领域，通过打造优秀的国际化物流平台，为客户提供端到端一站式智慧型供应链物流服务。集团现有员工 2000 余人，年营业额逾 120 亿人民币，网络遍及全球 200 余个港口，在海内外有超 80 家分、子公司，助力中国企业与世界互联互通。
@@ -262,9 +262,9 @@ kubectl create ns streamx
 使用 default 账户创建 clusterrolebinding 资源:
 
 ```shell
-kubectl create secret docker-registry streamparksecret 
---docker-server=registry-vpc.cn-zhangjiakou.aliyuncs.com 
---docker-username=xxxxxx 
+kubectl create secret docker-registry streamparksecret
+--docker-server=registry-vpc.cn-zhangjiakou.aliyuncs.com
+--docker-username=xxxxxx
 --docker-password=xxxxxx -n streamx```
 ```
 
@@ -283,9 +283,9 @@ kubectl create secret docker-registry streamparksecret
 创建 k8s secret 密钥用来拉取 ACR 中的镜像 streamparksecret 为密钥名称 自定义
 
 ```shell
-kubectl create secret docker-registry streamparksecret 
---docker-server=registry-vpc.cn-zhangjiakou.aliyuncs.com 
---docker-username=xxxxxx 
+kubectl create secret docker-registry streamparksecret
+--docker-server=registry-vpc.cn-zhangjiakou.aliyuncs.com
+--docker-username=xxxxxx
 --docker-password=xxxxxx -n streamx
 ```
 
@@ -312,13 +312,13 @@ kubectl -f rbac.yaml
 kubectl -f oss-plugin.yaml
 ```
 
-\- 创建 CP&SP 的 PV 
+\- 创建 CP&SP 的 PV
 
 ```shell
 kubectl -f checkpoints_pv.yaml kubectl -f savepoints_pv.yaml
 ```
 
-\- 创建 CP&SP 的 PVC 
+\- 创建 CP&SP 的 PVC
 
 ```shell
 kubectl -f checkpoints_pvc.yaml kubectl -f savepoints_pvc.yaml
@@ -336,7 +336,7 @@ kubectl -f checkpoints_pvc.yaml kubectl -f savepoints_pvc.yaml
 
 ```sql
 SET 'execution.runtime-mode' = 'streaming';
-set 'table.exec.sink.upsert-materialize' = 'none'; 
+set 'table.exec.sink.upsert-materialize' = 'none';
 SET 'sql-client.execution.result-mode' = 'tableau';
 -- 创建并使用 FTS Catalog 底层存储方案采用阿里云oss
 CREATE CATALOG `table_store` WITH (
@@ -368,7 +368,7 @@ Kubernetes Namespace ：streamx
 Kubernetes ClusterId ：（任务名自定义即可）
 
 #上传到阿里云镜像仓库的基础镜像
-Flink Base Docker Image ：registry-vpc.cn-zhangjiakou.aliyuncs.com/xxxxx/flink-table-store:v1.16.0    
+Flink Base Docker Image ：registry-vpc.cn-zhangjiakou.aliyuncs.com/xxxxx/flink-table-store:v1.16.0
 
 Rest-Service Exposed Type：NodePort
 
@@ -408,7 +408,7 @@ volumes:
     persistentVolumeClaim:
       claimName: flink-savepoints-csi-pvc
 
-imagePullSecrets:      
+imagePullSecrets:
 - name: streamparksecret
 ```
 
@@ -675,7 +675,7 @@ CREATE TABLE IF NOT EXISTS dwm.`dwm_business_order_count` (
 `delivery_center_op_name` varchar(200) NOT NULL COMMENT '交付名称',
 `sum_orderCount` BIGINT NOT NULL COMMENT '订单数',
 `create_date` timestamp NOT NULL COMMENT '创建时间',
-PRIMARY KEY (`l_year`, 
+PRIMARY KEY (`l_year`,
              `l_month`,
              `l_date`,
              `order_type_name`,
@@ -805,12 +805,12 @@ ADS 层聚合表采用 agg sum 会出现 dwd 数据流不产生 update_before，
 
 解决办法：
 
-By specifying 'changelog-producer' = 'full-compaction', 
-Table Store will compare the results between full compactions and produce the differences as changelog. 
+By specifying 'changelog-producer' = 'full-compaction',
+Table Store will compare the results between full compactions and produce the differences as changelog.
 The latency of changelog is affected by the frequency of full compactions.
 
-By specifying changelog-producer.compaction-interval table property (default value 30min), 
-users can define the maximum interval between two full compactions to ensure latency. 
+By specifying changelog-producer.compaction-interval table property (default value 30min),
+users can define the maximum interval between two full compactions to ensure latency.
 This table property does not affect normal compactions and they may still be performed once in a while by writers to reduce reader costs.
 
 这样能解决上述问题。但是随之而来出现了新的问题。默认 changelog-producer.compaction-interval 是 30min，意味着 上游的改动到 ads 查询要间隔 30min，生产过程中发现将压缩间隔时间改成 1min 或者 2 分钟的情况下，又会出现上述 ADS 层聚合数据不准的情况。

diff --git a/blog/4-streampark-usercase-shunwang.md b/blog/4-streampark-usercase-shunwang.md
@@ -4,8 +4,6 @@ title: StreamPark 在顺网科技的大规模生产实践
 tags: [StreamPark, 生产实践, FlinkSQL]
 ---
 
-# StreamPark 在顺网科技的大规模生产实践
-
 ![](/blog/shunwang/autor.png)
 
 **导读：**本文主要介绍顺网科技在使用 Flink 计算引擎中遇到的一些挑战，基于 StreamPark 作为实时数据平台如何来解决这些问题，从而大规模支持公司的业务。
@@ -18,6 +16,8 @@ tags: [StreamPark, 生产实践, FlinkSQL]
 - 带来的收益
 - 未来规划
 
+<!-- truncate -->
+
 ## **公司业务介绍**
 
 杭州顺网科技股份有限公司成立于 2005 年，秉承科技连接快乐的企业使命，是国内具有影响力的泛娱乐技术服务平台之一。多年来公司始终以产品和技术为驱动，致力于以数字化平台服务为人们创造沉浸式的全场景娱乐体验。
@@ -189,7 +189,7 @@ https://github.com/apache/streampark/issues/2142
 
 
 
-## 带来的收益 
+## 带来的收益
 
 我们从 StreamX 1.2.3（StreamPark 前身）开始探索和使用，经过一年多时间的磨合，我们发现 StreamPark 真实解决了 Flink 作业在开发管理和运维上的诸多痛点。
 
@@ -201,7 +201,7 @@ StreamPark 给顺网科技带来的最大的收益就是降低了 Flink 的使
 
 ![图片](/blog/shunwang/achievements2.png)
 
-##  未 来 规 划 
+##  未 来 规 划
 
 顺网科技作为 StreamPark 早期的用户之一，在 1 年期间内一直和社区同学保持交流，参与 StreamPark 的稳定性打磨，我们将生产运维中遇到的 Bug 和新的 Feature 提交给了社区。在未来，我们希望可以在 StreamPark 上管理 Flink 表的元数据信息，基于 Flink 引擎通过多 Catalog 实现跨数据源查询分析功能。目前 StreamPark 正在对接 Flink-SQL-Gateway 能力，这一块在未来对于表元数据的管理和跨数据源查询功能会提供了很大的帮助。
 

diff --git a/blog/5-streampark-usercase-dustess.md b/blog/5-streampark-usercase-dustess.md
@@ -4,8 +4,6 @@ title: StreamPark 在尘锋信息的最佳实践，化繁为简极致体验
 tags: [StreamPark, 生产实践, FlinkSQL]
 ---
 
-# StreamPark 在尘锋信息的最佳实践，化繁为简极致体验
-
 **摘要：**本文源自 StreamPark 在尘锋信息的生产实践, 作者是资深数据开发工程师Gump。主要内容为：
 
 1. 技术选型
@@ -18,15 +16,17 @@ tags: [StreamPark, 生产实践, FlinkSQL]
 
 目前，尘锋已在全国拥有13个城市中心，覆盖华北、华中、华东、华南、西南五大区域，为超30个行业的10,000+家企业提供数字营销服务。
 
-## **01 技术选型** 
+<!-- truncate -->
+
+## **01 技术选型**
 
 尘锋信息在2021年进入了快速发展时期，随着服务行业和企业客户的增加，实时需求越来越多，落地实时计算平台迫在眉睫。
 
 由于公司处于高速发展期，需求紧迫且变化快，所以团队的技术选型遵循以下原则:
 
 - 快：由于业务紧迫，我们需要快速落地规划的技术选型并运用生产
 - 稳：满足快的基础上，所选择技术一定要稳定服务业务
-- 新：在以上基础，所选择的技术也尽量的新  
+- 新：在以上基础，所选择的技术也尽量的新
 - 全：所选择技术能够满足公司快速发展和变化的业务，能够符合团队长期发展目标，能够支持且快速支持二次开发
 
 首先在计算引擎方面：我们选择 Flink，原因如下:
@@ -67,9 +67,9 @@ Flink SQL 可以极大提升开发效率和提高 Flink 的普及。StreamPark
 
 
 
-Flink SQL 现在虽然足够强大，但使用 Java 和 Scala 等 JVM 语言开发 Flink 任务会更加灵活、定制化更强、便于调优和提升资源利用率。与 SQL 相比 Jar 包提交任务最大的问题是Jar包的上传管理等，没有优秀的工具产品会严重降低开发效率和加大维护成本。 
+Flink SQL 现在虽然足够强大，但使用 Java 和 Scala 等 JVM 语言开发 Flink 任务会更加灵活、定制化更强、便于调优和提升资源利用率。与 SQL 相比 Jar 包提交任务最大的问题是Jar包的上传管理等，没有优秀的工具产品会严重降低开发效率和加大维护成本。
 
-StreamPark 除了支持 Jar 上传，更提供了**在线更新构建**的功能，优雅解决了以上问题：  
+StreamPark 除了支持 Jar 上传，更提供了**在线更新构建**的功能，优雅解决了以上问题：
 
 1、新建 Project ：填写 GitHub/Gitlab（支持企业私服）地址及用户名密码, StreamPark 就能 Pull 和 Build 项目。
 
@@ -158,7 +158,7 @@ StreamPark 的环境搭建非常简单，跟随官网的搭建教程可以在小
 http://www.streamxhub.com/docs/user-guide/deployment
 ```
 
-为了快速落地和生产使用，我们选择了稳妥的 On Yarn 资源管理模式（虽然 StreamPark 已经很完善的支持 K8S），且已经有较多公司通过 StreamPark 落地了 K8S 部署方式，大家可以参考: 
+为了快速落地和生产使用，我们选择了稳妥的 On Yarn 资源管理模式（虽然 StreamPark 已经很完善的支持 K8S），且已经有较多公司通过 StreamPark 落地了 K8S 部署方式，大家可以参考:
 
 ```
 http://www.streamxhub.com/blog/flink-development-framework-streamx
@@ -194,11 +194,11 @@ StreamPark 非常贴心的准备了 Demo SQL 任务，可以直接在刚搭建
 StreamingContext = ParameterTool + StreamExecutionEnvironment
 ```
 
-- StreamingContext 为 StreamPark 的封装对象 
-- ParameterTool 为解析配置文件后的参数对象 
+- StreamingContext 为 StreamPark 的封装对象
+- ParameterTool 为解析配置文件后的参数对象
 
 ```
- String value = ParameterTool.get("${user.custom.key}") 
+ String value = ParameterTool.get("${user.custom.key}")
 ```
 
 - StreamExecutionEnvironment 为 Apache Flink 原生任务上下文
@@ -223,13 +223,13 @@ StreamingContext = ParameterTool + StreamExecutionEnvironment
 - 计算能力开放：将大数据平台的服务器资源开放业务团队使用
 - 解决方案开放：Flink 生态的成熟 Connector、Exactly Once 语义支持，可减少业务团队流处理相关的开发成本及维护成本
 
-目前 StreamPark 还不支持多业务组功能，多业务组功能会抽象后贡献社区。   
+目前 StreamPark 还不支持多业务组功能，多业务组功能会抽象后贡献社区。
 
-![](/blog/dustess/manager.png)        
+![](/blog/dustess/manager.png)
 
 ![](/blog/dustess/task_retrieval.png)
 
-## **04 未来规划**   
+## **04 未来规划**
 
 ### **01 Flink on K8S**
 

diff --git a/blog/6-streampark-usercase-joyme.md b/blog/6-streampark-usercase-joyme.md
@@ -4,8 +4,6 @@ title: StreamPark 在 Joyme 的生产实践
 tags: [StreamPark, 生产实践, FlinkSQL]
 ---
 
-<br/>
-
 **摘要：** 本文带来 StreamPark 在 Joyme 中的生产实践, 作者是 Joyme 的大数据工程师秦基勇, 主要内容为:
 
 - 遇见StreamPark
@@ -16,6 +14,8 @@ tags: [StreamPark, 生产实践, FlinkSQL]
 - 社区印象
 - 总结
 
+<!-- truncate -->
+
 ## 1 遇见 StreamPark
 
 遇见 StreamPark 是必然的，基于我们现有的实时作业开发模式，不得不寻找一个开源的平台来支撑我司的实时业务。我们的现状如下:
@@ -55,7 +55,7 @@ CREATE TABLE source_table (
 'format.derive-schema' = 'true'
 );
 
--- 落地表sink 
+-- 落地表sink
 CREATE TABLE sink_table (
 `uid` STRING
 ) WITH (
@@ -92,9 +92,9 @@ SELECT  Data.uid  FROM source_table;
 
 由于我们的模式部署是 on Yarn，在动态选项配置里配置了 Yarn 的队列名称。也有一些配置了开启增量的 Checkpoint 选项和状态过期时间，基本的这些参数都可以从 Flink 的官网去查询到。之前有一些作业确实经常出现内存溢出的问题，加上增量参数和过期参数以后，作业的运行情况好多了。还有就是 Flink Sql 作业设计到状态这种比较大和逻辑复杂的情况下，我个人感觉还是用 Streaming 代码来实现比较好控制一些。
 
-- -Dyarn.application.queue= yarn队列名称 
-- -Dstate.backend.incremental=true 
-- -Dtable.exec.state.ttl=过期时间 
+- -Dyarn.application.queue= yarn队列名称
+- -Dstate.backend.incremental=true
+- -Dtable.exec.state.ttl=过期时间
 
 完成配置以后提交，然后在 application 界面进行部署。
 
@@ -158,4 +158,4 @@ StreamPark 的监控需要在 setting 模块去配置发送邮件的基本信息
 
 目前我司线上运行 60 个实时作业，Flink sql 与 Custom-code 差不多各一半。后续也会有更多的实时任务进行上线。很多同学都会担心 StreamPark 稳不稳定的问题，就我司根据几个月的生产实践而言，StreamPark 只是一个帮助你开发作业，部署，监控和管理的一个平台。到底稳不稳，还是要看自家的 Hadoop yarn 集群稳不稳定（我们用的onyan模式），其实已经跟 StreamPark关系不大了。还有就是你写的 Flink Sql 或者是代码健不健壮。更多的是这两方面应该是大家要考虑的，这两方面没问题再充分利用 StreamPark 的灵活性才能让作业更好的运行，单从一方面说 StreamPark 稳不稳定，实属偏激。
 
-以上就是 StreamPark 在乐我无限的全部分享内容，感谢大家看到这里。非常感谢 StreamPark 提供给我们这么优秀的产品，这就是做的利他人之事。从1.0 到 1.2.1 平时遇到那些bug都会被即时的修复，每一个issue都被认真对待。目前我们还是 onyarn的部署模式，重启yarn还是会导致作业的lost状态，重启yarn也不是天天都干的事，关于这个社区也会尽早的会去修复这个问题。相信 StreamPark 会越来越好，未来可期。
+以上就是 StreamPark 在乐我无限的全部分享内容，感谢大家看到这里。非常感谢 StreamPark 提供给我们这么优秀的产品，这就是做的利他人之事。从1.0 到 1.2.1 平时遇到那些bug都会被即时的修复，每一个issue都被认真对待。目前我们还是 onyarn的部署模式，重启yarn还是会导致作业的lost状态，重启yarn也不是天天都干的事，关于这个社区也会尽早的会去修复这个问题。相信 StreamPark 会越来越好，未来可期。