From bbd07d4ee4199787d620e401b8291425dae0d654 Mon Sep 17 00:00:00 2001 From: Cason <1125193113@qq.com> Date: Mon, 22 Jan 2024 14:20:44 +0800 Subject: [PATCH] Replace All Instances of "StreamPark" with "Apache StreamPark" in Official Documentation --- blog/0-streampark-flink-on-k8s.md | 60 +++++++-------- blog/1-flink-framework-streampark.md | 44 +++++------ blog/2-streampark-usercase-chinaunion.md | 44 +++++------ blog/3-streampark-usercase-bondex-paimon.md | 28 +++---- blog/4-streampark-usercase-shunwang.md | 60 +++++++-------- blog/5-streampark-usercase-dustess.md | 74 +++++++++---------- blog/6-streampark-usercase-joyme.md | 30 ++++---- blog/7-streampark-usercase-haibo.md | 38 +++++----- blog/8-streampark-usercase-ziru.md | 54 +++++++------- .../contribution_guide/become_committer.md | 8 +- .../contribution_guide/become_pmc_member.md | 8 +- community/contribution_guide/mailing_lists.md | 8 +- .../new_committer_process.md | 10 +-- .../new_pmc_member_process.md | 16 ++-- community/release/How-to-release.md | 34 ++++----- community/release/how-to-verify.md | 4 +- community/submit_guide/document.md | 4 +- community/submit_guide/submit-code.md | 2 +- docs/connector/1-kafka.md | 16 ++-- docs/connector/2-jdbc.md | 10 +-- docs/connector/3-clickhouse.md | 12 +-- docs/connector/4-doris.md | 6 +- docs/connector/5-es.md | 18 ++--- docs/connector/6-hbase.md | 22 +++--- docs/connector/7-http.md | 6 +- docs/connector/8-redis.md | 20 ++--- docs/development/alert-conf.md | 6 +- docs/development/conf.md | 14 ++-- docs/development/model.md | 34 ++++----- docs/flink-k8s/1-deployment.md | 26 +++---- docs/flink-k8s/2-k8s-pvc-integration.md | 6 +- .../3-hadoop-resource-integration.md | 6 +- docs/intro.md | 22 +++--- docs/user-guide/1-deployment.md | 16 ++-- docs/user-guide/11-platformInstall.md | 20 ++--- docs/user-guide/12-platformBasicUsage.md | 46 ++++++------ docs/user-guide/2-quickstart.md | 2 +- docs/user-guide/3-development.md | 2 +- docs/user-guide/4-dockerDeployment.md | 12 +-- docs/user-guide/6-Team.md | 12 +-- docs/user-guide/8-YarnQueueManagement.md | 6 +- .../0-streampark-flink-on-k8s.md | 58 +++++++-------- .../1-flink-framework-streampark.md | 44 +++++------ .../2-streampark-usercase-chinaunion.md | 44 +++++------ .../3-streampark-usercase-bondex-paimon.md | 28 +++---- .../4-streampark-usercase-shunwang.md | 60 +++++++-------- .../5-streampark-usercase-dustess.md | 74 +++++++++---------- .../6-streampark-usercase-joyme.md | 30 ++++---- .../7-streampark-usercase-haibo.md | 40 +++++----- .../8-streampark-usercase-ziru.md | 50 ++++++------- .../contribution_guide/become_committer.md | 4 +- .../contribution_guide/become_pmc_member.md | 4 +- .../contribution_guide/mailing_lists.md | 8 +- .../new_committer_process.md | 10 +-- .../new_pmc_member_process.md | 16 ++-- .../current/release/How-to-release.md | 34 ++++----- .../current/release/how-to-verify.md | 6 +- .../current/submit_guide/document.md | 4 +- .../current/submit_guide/submit-code.md | 2 +- .../current/connector/1-kafka.md | 14 ++-- .../current/connector/2-jdbc.md | 10 +-- .../current/connector/3-clickhouse.md | 12 +-- .../current/connector/4-doris.md | 6 +- .../current/connector/5-es.md | 18 ++--- .../current/connector/6-hbase.md | 20 ++--- .../current/connector/7-http.md | 6 +- .../current/connector/8-redis.md | 20 ++--- .../current/development/alert-conf.md | 4 +- .../current/development/conf.md | 10 +-- .../current/development/model.md | 30 ++++---- .../current/flink-k8s/1-deployment.md | 30 ++++---- 
.../flink-k8s/2-k8s-pvc-integration.md | 8 +- .../3-hadoop-resource-integration.md | 6 +- .../current/intro.md | 6 +- .../current/user-guide/1-deployment.md | 16 ++-- .../current/user-guide/11-platformInstall.md | 20 ++--- .../user-guide/12-platformBasicUsage.md | 44 +++++------ .../current/user-guide/2-quickstart.md | 2 +- .../current/user-guide/3-development.md | 2 +- .../current/user-guide/4-dockerDeployment.md | 12 +-- .../current/user-guide/6-Team.md | 8 +- .../current/user-guide/7-Variable.md | 2 +- .../user-guide/8-YarnQueueManagement.md | 6 +- src/pages/download/release-note/2.0.0.md | 4 +- 84 files changed, 849 insertions(+), 849 deletions(-) diff --git a/blog/0-streampark-flink-on-k8s.md b/blog/0-streampark-flink-on-k8s.md index 4474edbaa..7f14be00b 100644 --- a/blog/0-streampark-flink-on-k8s.md +++ b/blog/0-streampark-flink-on-k8s.md @@ -1,7 +1,7 @@ --- slug: streampark-flink-on-k8s -title: StreamPark Flink on Kubernetes practice -tags: [StreamPark, Production Practice, FlinkSQL, Kubernetes] +title: Apache StreamPark Flink on Kubernetes practice +tags: [Apache StreamPark, Production Practice, FlinkSQL, Kubernetes] description: Wuxin Technology was founded in January 2018. The current main business includes the research and development, design, manufacturing and sales of RELX brand products. With core technologies and capabilities covering the entire industry chain, RELX is committed to providing users with products that are both high quality and safe --- @@ -21,9 +21,9 @@ Native Kubernetes offers the following advantages: ![](/blog/relx/nativekubernetes_architecture.png) -When Flink On Kubernetes meets StreamPark +When Flink On Kubernetes meets Apache StreamPark - Flink on Native Kubernetes currently supports Application mode and Session mode. Compared with the two, Application mode deployment avoids the resource isolation problem and client resource consumption problem of Session mode. Therefore, it is recommended to use Application Mode to deploy Flink tasks in ** production environments. **Let’s take a look at the method of using the original script and the process of using StreamPark to develop and deploy a Flink on Native Kubernetes job. + Flink on Native Kubernetes currently supports Application mode and Session mode. Compared with the two, Application mode deployment avoids the resource isolation problem and client resource consumption problem of Session mode. Therefore, it is recommended to use Application Mode to deploy Flink tasks in ** production environments. **Let’s take a look at the method of using the original script and the process of using Apache StreamPark to develop and deploy a Flink on Native Kubernetes job. Deploy Kubernetes using scripts In the absence of a platform that supports Flink on Kubernetes task development and deployment, you need to use scripts to submit and stop tasks. This is also the default method provided by Flink. The specific steps are as follows: @@ -69,17 +69,17 @@ kubectl -n flink-cluster get svc The above is the process of deploying a Flink task to Kubernetes using the most original script method provided by Flink. Only the most basic task submission is achieved. If it is to reach the production use level, there are still a series of problems that need to be solved, such as: the method is too Originally, it was unable to adapt to large batches of tasks, unable to record task checkpoints and real-time status tracking, difficult to operate and monitor tasks, had no alarm mechanism, and could not be managed in a centralized manner, etc. 
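To make the pain of the script-based approach concrete, every single job ends up being submitted and managed with the native Flink CLI, roughly as in the sketch below (the image name, cluster id and job id here are illustrative, not taken from the original setup):

```shell
# Submit one job in Application mode: each job needs its own image and cluster-id
./bin/flink run-application \
    --target kubernetes-application \
    -Dkubernetes.namespace=flink-cluster \
    -Dkubernetes.cluster-id=my-flink-job \
    -Dkubernetes.container.image=my-registry/my-flink-job:latest \
    local:///opt/flink/usrlib/my-flink-job.jar

# Listing and cancelling are also done per job, with no central record of what is running
./bin/flink list --target kubernetes-application -Dkubernetes.cluster-id=my-flink-job
./bin/flink cancel --target kubernetes-application -Dkubernetes.cluster-id=my-flink-job <job-id>
```

Multiplied across dozens of jobs, scripts like this quickly become the operational burden described above, which is exactly what a platform layer is meant to remove.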
-## **Deploy Flink on Kubernetes using StreamPark**
+## **Deploy Flink on Kubernetes using Apache StreamPark**

There will be higher requirements for using Flink on Kubernetes in enterprise-level production environments. Generally, you will choose to build your own platform or purchase related commercial products. No matter which solution is chosen, capabilities such as large-scale task development and deployment, status tracking, operation and maintenance monitoring, failure alarms, unified task management, and high availability are common demands.

- In response to the above issues, we investigated open source projects in the open source field that support the development and deployment of Flink on Kubernetes tasks. During the investigation, we also encountered other excellent open source projects. After comprehensively comparing multiple open source projects, we came to the conclusion: **StreamPark has great performance in either completness, user experience, or stability, so we finally chose StreamPark as our one-stop real-time computing platform. **
+ In response to the above issues, we investigated open source projects that support the development and deployment of Flink on Kubernetes tasks. During the investigation, we also encountered other excellent open source projects. After comprehensively comparing them, we came to the conclusion: **Apache StreamPark performs well in completeness, user experience, and stability, so we finally chose Apache StreamPark as our one-stop real-time computing platform.**

- Let’s take a look at how StreamPark supports Flink on Kubernetes:
+ Let’s take a look at how Apache StreamPark supports Flink on Kubernetes:

### **Basic environment configuration**

- Basic environment configuration includes Kubernetes and Docker repository information as well as Flink client information configuration. The simplest way for the Kubernetes basic environment is to directly copy the .kube/config of the Kubernetes node to the StreamPark node user directory, and then use the kubectl command to create a Flink-specific Kubernetes Namespace and perform RBAC configuration.
+ Basic environment configuration includes Kubernetes and Docker repository information as well as the Flink client configuration. The simplest way to prepare the Kubernetes environment is to copy the .kube/config of the Kubernetes node to the Apache StreamPark node user directory, and then use the kubectl command to create a Flink-specific Kubernetes Namespace and perform RBAC configuration.

```shell
# Create k8s namespace used by Flink jobs
@@ -93,17 +93,17 @@ Docker account information can be configured directly in the Docker Setting inte

![](/blog/relx/docker_setting.png)

-StreamPark can adapt to multi-version Flink job development. The Flink client can be configured directly on the StreamPark Setting interface:
+Apache StreamPark can adapt to multi-version Flink job development. The Flink client can be configured directly on the Apache StreamPark Setting interface:

![](/blog/relx/flinkversion_setting.png)

### **Job development**

-After StreamPark has configured the basic environment, it only takes three steps to develop and deploy a Flink job:
+After Apache StreamPark has configured the basic environment, it only takes three steps to develop and deploy a Flink job:

![](/blog/relx/development_process.png)

- StreamPark supports both Upload Jar and direct writing of Flink SQL jobs. 
**Flink SQL jobs only need to enter SQL and dependencies. This method greatly improves the development experience and avoids problems such as dependency conflicts.** This article does not focus on this part。 + Apache StreamPark supports both Upload Jar and direct writing of Flink SQL jobs. **Flink SQL jobs only need to enter SQL and dependencies. This method greatly improves the development experience and avoids problems such as dependency conflicts.** This article does not focus on this part。 Here you need to select the deployment mode as kubernetes application, and configure the following parameters on the job development page: The parameters in the red box are the basic parameters of Flink on Kubernetes. @@ -119,7 +119,7 @@ After StreamPark has configured the basic environment, it only takes three steps ### **Job online** -After the job development is completed, the job comes online. In this step, StreamPark has done a lot of work, as follows: +After the job development is completed, the job comes online. In this step, Apache StreamPark has done a lot of work, as follows: - Prepare environment - Dependency download in job @@ -131,7 +131,7 @@ After the job development is completed, the job comes online. In this step, Stre ![](/blog/relx/operation.png) -We can see a series of work done by StreamPark when building and pushing the image: **Read the configuration, build the image, and push the image to the remote repository...** I want to give StreamPark a big thumbs up! +We can see a series of work done by Apache StreamPark when building and pushing the image: **Read the configuration, build the image, and push the image to the remote repository...** I want to give Apache StreamPark a big thumbs up! ![](/blog/relx/step_details.png) @@ -141,17 +141,17 @@ We can see a series of work done by StreamPark when building and pushing the ima ![](/blog/relx/homework_submit.png) - The entire process only requires the above three steps to complete the development and deployment of a Flink on Kubernetes job on StreamPark. StreamPark's support for Flink on Kubernetes goes far beyond simply submitting a task. + The entire process only requires the above three steps to complete the development and deployment of a Flink on Kubernetes job on Apache StreamPark. Apache StreamPark's support for Flink on Kubernetes goes far beyond simply submitting a task. ### **Job management** -**After the job is submitted, StreamPark can obtain the latest checkpoint address of the task, the running status of the task, and the real-time resource consumption information of the cluster in real time. It can very conveniently start and stop the running task with one click, and supports recording the savepoint location when stopping the job. , as well as functions such as restoring the state from savepoint when restarting, thus ensuring the data consistency of the production environment and truly possessing the one-stop development, deployment, operation and maintenance monitoring capabilities of Flink on Kubernetes.** +**After the job is submitted, Apache StreamPark can obtain the latest checkpoint address of the task, the running status of the task, and the real-time resource consumption information of the cluster in real time. It can very conveniently start and stop the running task with one click, and supports recording the savepoint location when stopping the job. 
, as well as functions such as restoring the state from savepoint when restarting, thus ensuring the data consistency of the production environment and truly possessing the one-stop development, deployment, operation and maintenance monitoring capabilities of Flink on Kubernetes.** -Next, let’s take a look at how StreamPark supports this capability: +Next, let’s take a look at how Apache StreamPark supports this capability: - **Record checkpoint in real time** - After the job is submitted, sometimes it is necessary to change the job logic but to ensure data consistency, then the platform needs to have the ability to record the location of each checkpoint in real time, as well as the ability to record the last savepoint location when stopped. StreamPark is on Flink on Kubernetes This function is implemented very well. By default, checkpoint information will be obtained and recorded in the corresponding table every 5 seconds, and according to the policy of retaining the number of checkpoints in Flink, only state.checkpoints.num-retained will be retained, and the excess will be deleted. There is an option to check the savepoint when the task is stopped. If the savepoint option is checked, the savepoint operation will be performed when the task is stopped, and the specific location of the savepoint will also be recorded in the table. + After the job is submitted, sometimes it is necessary to change the job logic but to ensure data consistency, then the platform needs to have the ability to record the location of each checkpoint in real time, as well as the ability to record the last savepoint location when stopped. Apache StreamPark is on Flink on Kubernetes This function is implemented very well. By default, checkpoint information will be obtained and recorded in the corresponding table every 5 seconds, and according to the policy of retaining the number of checkpoints in Flink, only state.checkpoints.num-retained will be retained, and the excess will be deleted. There is an option to check the savepoint when the task is stopped. If the savepoint option is checked, the savepoint operation will be performed when the task is stopped, and the specific location of the savepoint will also be recorded in the table. The root path of the default savepoint only needs to be configured in the Flink Home flink-conf.yaml file to automatically identify it. In addition to the default address, the root path of the savepoint can also be customized and specified when stopping. @@ -161,27 +161,27 @@ Next, let’s take a look at how StreamPark supports this capability: - **Track running status in real time** - For challenges in the production environment, a very important point is whether monitoring is in place, especially for Flink on Kubernetes. This is very important and is the most basic capability. StreamPark can monitor the running status of Flink on Kubernetes jobs in real time and display it to users on the platform. Tasks can be easily retrieved based on various running statuses on the page. + For challenges in the production environment, a very important point is whether monitoring is in place, especially for Flink on Kubernetes. This is very important and is the most basic capability. Apache StreamPark can monitor the running status of Flink on Kubernetes jobs in real time and display it to users on the platform. Tasks can be easily retrieved based on various running statuses on the page. 
![](/blog/relx/run_status.png)

- **Complete alarm mechanism**

- In addition, StreamPark also has complete alarm functions: supporting email, DingTalk, WeChat and SMS, etc. This is also an important reason why the company chose StreamPark as the one-stop platform for Flink on Kubernetes after initial research.
+ In addition, Apache StreamPark also has complete alarm functions: supporting email, DingTalk, WeChat and SMS, etc. This is also an important reason why the company chose Apache StreamPark as the one-stop platform for Flink on Kubernetes after initial research.

![](/blog/relx/alarm.png)

- From the above, we can see that StreamPark has the capabilities to support the development and deployment process of Flink on Kubernetes, including: ** job development capabilities, deployment capabilities, monitoring capabilities, operation and maintenance capabilities, exception handling capabilities, etc. StreamPark provides a relatively complete set of s solution. And it already has some CICD/DevOps capabilities, and the overall completion level continues to improve. It is a product that supports the full link of Flink on Kubernetes one-stop development, deployment, operation and maintenance work in the entire open source field. StreamPark is worthy of praise. **
+ From the above, we can see that Apache StreamPark has the capabilities to support the development and deployment process of Flink on Kubernetes, including: **job development capabilities, deployment capabilities, monitoring capabilities, operation and maintenance capabilities, exception handling capabilities, etc. Apache StreamPark provides a relatively complete set of solutions. It already has some CICD/DevOps capabilities, and its overall level of completeness continues to improve. It is a product in the open source field that supports the full one-stop workflow of Flink on Kubernetes development, deployment, operation and maintenance. Apache StreamPark is worthy of praise.**

-## **StreamPark’s implementation in Wuxin Technology**
+## **Apache StreamPark’s implementation in Wuxin Technology**

- StreamPark was launched late in Wuxin Technology. It is currently mainly used for the development and deployment of real-time data integration jobs and real-time indicator calculation jobs. There are Jar tasks and Flink SQL tasks, all deployed using Native Kubernetes; data sources include CDC, Kafka, etc., and Sink end There are Maxcompute, kafka, Hive, etc. The following is a screenshot of the company's development environment StreamPark platform:
+ Apache StreamPark was adopted relatively late at Wuxin Technology. It is currently mainly used for the development and deployment of real-time data integration jobs and real-time indicator calculation jobs. There are Jar tasks and Flink SQL tasks, all deployed using Native Kubernetes; data sources include CDC, Kafka, etc., and the sink side includes MaxCompute, Kafka, Hive, etc. The following is a screenshot of the company's development environment Apache StreamPark platform:

![](/blog/relx/screenshot.png)

## Problems encountered

- Any new technology has a process of exploration and fall into pitfalls. 
The experience of failure is precious. Here are some of the pitfalls we ran into and the experience we gained while implementing Apache StreamPark at Wuxin Technology. **The content of this section is not only about Apache StreamPark. I believe it will be a useful reference for everyone who uses Flink on Kubernetes**.

### **FAQs are summarized below**

@@ -191,7 +191,7 @@ Next, let’s take a look at how StreamPark supports this capability:

- **Scala version inconsistent**

- Since StreamPark deployment requires a Scala environment, and Flink SQL operation requires the Flink SQL Client provided by StreamPark, it is necessary to ensure that the Scala version of the Flink job is consistent with the Scala version of StreamPark.
+ Since Apache StreamPark deployment requires a Scala environment, and Flink SQL operation requires the Flink SQL Client provided by Apache StreamPark, it is necessary to ensure that the Scala version of the Flink job is consistent with the Scala version of Apache StreamPark.

- **Be aware of class conflicts**

@@ -223,7 +223,7 @@ Next, let’s take a look at how StreamPark supports this capability:

- **Each restart of the task will result in one more Job instance**

- Under the premise that kubernetes-based HA is configured, when you need to stop the Flink task, you need to use cancel of StreamPark. Do not delete the Deployment of the Flink task directly through the kubernetes cluster. Because Flink's shutdown has its own shutdown process, when deleting a pod, the corresponding configuration files in the Configmap will also be deleted. Direct deletion of the pod will result in the remnants of the Configmap. When a task with the same name is restarted, two identical jobs will appear because at startup, the task will load the remaining configuration files and try to restore the closed task.
+ Under the premise that Kubernetes-based HA is configured, use Apache StreamPark's cancel operation when you need to stop a Flink task; do not delete the Flink task's Deployment directly through the Kubernetes cluster. Flink has its own shutdown process: when a pod is stopped through it, the corresponding configuration files in the ConfigMap are also deleted. Deleting the pod directly leaves remnants in the ConfigMap, so when a task with the same name is restarted, two identical jobs will appear, because at startup the task loads the leftover configuration files and tries to restore the closed task.

- **How to implement kubernetes pod domain name access**

@@ -295,14 +295,14 @@ push k8s-harbor.xxx.com/streamx/udf_flink_1.13.6-scala_2.11:latest

## **Future Expectations**

-- **StreamPark supports Flink job metric monitoring**
+- **Apache StreamPark supports Flink job metric monitoring**

- It would be great if StreamPark could connect to Flink Metric data and display Flink’s real-time consumption data at every moment on the StreamPark platform.
+ It would be great if Apache StreamPark could connect to Flink Metric data and display Flink’s real-time consumption data at every moment on the Apache StreamPark platform.

-- **StreamPark supports Flink job log persistence**
+- **Apache StreamPark supports Flink job log persistence**

- For Flink deployed to YARN, if the Flink program hangs, we can go to YARN to view the historical logs. However, for Kubernetes, if the program hangs, the Kubernetes pod will disappear and there will be no way to check the logs. Therefore, users need to use tools on Kubernetes for log persistence. 
It would be better if StreamPark supports the Kubernetes log persistence interface. + For Flink deployed to YARN, if the Flink program hangs, we can go to YARN to view the historical logs. However, for Kubernetes, if the program hangs, the Kubernetes pod will disappear and there will be no way to check the logs. Therefore, users need to use tools on Kubernetes for log persistence. It would be better if Apache StreamPark supports the Kubernetes log persistence interface. - **Improvement of the problem of too large image** - StreamPark's current image support for Flink on Kubernetes jobs is to combine the basic image and user code into a Fat image and push it to the Docker repository. The problem with this method is that it takes a long time when the image is too large. It is hoped that the basic image can be restored in the future. There is no need to hit the business code together every time, which can greatly improve development efficiency and save costs. + Apache StreamPark's current image support for Flink on Kubernetes jobs is to combine the basic image and user code into a Fat image and push it to the Docker repository. The problem with this method is that it takes a long time when the image is too large. It is hoped that the basic image can be restored in the future. There is no need to hit the business code together every time, which can greatly improve development efficiency and save costs. diff --git a/blog/1-flink-framework-streampark.md b/blog/1-flink-framework-streampark.md index 25070df41..e99fd99a5 100644 --- a/blog/1-flink-framework-streampark.md +++ b/blog/1-flink-framework-streampark.md @@ -1,7 +1,7 @@ --- slug: flink-development-framework-streampark -title: StreamPark - Powerful Flink Development Framework -tags: [StreamPark, DataStream, FlinkSQL] +title: Apache StreamPark - Powerful Flink Development Framework +tags: [Apache StreamPark, DataStream, FlinkSQL] --- Although the Hadoop system is widely used today, its architecture is complicated, it has a high maintenance complexity, version upgrades are challenging, and due to departmental reasons, data center scheduling is prolonged. We urgently need to explore agile data platform models. With the current popularization of cloud-native architecture and the integration between lake and warehous, we have decided to use Doris as an offline data warehouse and TiDB (which is already in production) as a real-time data platform. Furthermore, because Doris has ODBC capabilities on MySQL, it can integrate external database resources and uniformly output reports. @@ -56,18 +56,18 @@ However, because object storage requires the entire object to be rewritten for r
-## Introducing StreamPark +## Introducing Apache StreamPark Previously, when we wrote Flink SQL, we generally used Java to wrap SQL, packed it into a jar package, and submitted it to the S3 platform through the command line. This approach has always been unfriendly; the process is cumbersome, and the costs for development and operations are too high. We hoped to further streamline the process by abstracting the Flink TableEnvironment, letting the platform handle initialization, packaging, and running Flink tasks, and automating the building, testing, and deployment of Flink applications. -This is an era of open-source uprising. Naturally, we turned our attention to the open-source realm: among numerous open-source projects, after comparing various projects, we found that both Zeppelin and StreamPark provide substantial support for Flink and both claim to support Flink on K8s. Eventually, both were shortlisted for our selection. Here's a brief comparison of their support for K8s (if there have been updates since, please kindly correct). +This is an era of open-source uprising. Naturally, we turned our attention to the open-source realm: among numerous open-source projects, after comparing various projects, we found that both Zeppelin and Apache StreamPark provide substantial support for Flink and both claim to support Flink on K8s. Eventually, both were shortlisted for our selection. Here's a brief comparison of their support for K8s (if there have been updates since, please kindly correct). - + @@ -123,15 +123,15 @@ This is an era of open-source uprising. Naturally, we turned our attention to th
-During our research process, we communicated with the main developers of both tools multiple times. After our repeated studies and assessments, we eventually decided to adopt StreamPark as our primary Flink development tool for now. +During our research process, we communicated with the main developers of both tools multiple times. After our repeated studies and assessments, we eventually decided to adopt Apache StreamPark as our primary Flink development tool for now. -
(StreamPark's official splash screen)
+(Apache StreamPark's official splash screen)

-After extended development and testing by our team, StreamPark currently boasts: +After extended development and testing by our team, Apache StreamPark currently boasts: * Comprehensive SQL validation capabilities * It has achieved automatic build/push for images @@ -143,21 +143,21 @@ This effectively addresses most of the challenges we currently face in developme -
(Demo video showcasing StreamPark's support for multiple Flink versions)
+(Demo video showcasing Apache StreamPark's support for multiple Flink versions)

-In its latest release, version 1.2.0, StreamPark provides robust support for both K8s-Native-Application and K8s-Session-Application modes. +In its latest release, version 1.2.0, Apache StreamPark provides robust support for both K8s-Native-Application and K8s-Session-Application modes. -
(StreamPark's K8s deployment demo video)
+(Apache StreamPark's K8s deployment demo video)

### K8s Native Application Mode -Within StreamPark, all we need to do is configure the relevant parameters, fill in the corresponding dependencies in the Maven POM, or upload the dependency jar files. Once we click on 'Apply', the specified dependencies will be generated. This implies that we can also compile all the UDFs we use into jar files, as well as various connector.jar files, and use them directly in SQL. As illustrated below: +Within Apache StreamPark, all we need to do is configure the relevant parameters, fill in the corresponding dependencies in the Maven POM, or upload the dependency jar files. Once we click on 'Apply', the specified dependencies will be generated. This implies that we can also compile all the UDFs we use into jar files, as well as various connector.jar files, and use them directly in SQL. As illustrated below: ![](/blog/belle/dependency.png) @@ -169,7 +169,7 @@ We can also specify resources, designate dynamic parameters within Flink Run as ![](/blog/belle/pod.png) -After saving the program, when clicking to run, we can also specify a savepoint. Once the task is successfully submitted, StreamPark will, based on the FlinkPod's network Exposed Type (be it loadBalancer, NodePort, or ClusterIp), return the corresponding WebURL, seamlessly enabling a WebUI redirect. However, as of now, due to security considerations within our online private K8s cluster, there hasn't been a connection established between the Pod and client node network (and there's currently no plan for this). Hence, we only employ NodePort. If the number of future tasks increases significantly, and there's a need for ClusterIP, we might consider deploying StreamPark in K8s or further integrate it with Ingress. +After saving the program, when clicking to run, we can also specify a savepoint. Once the task is successfully submitted, Apache StreamPark will, based on the FlinkPod's network Exposed Type (be it loadBalancer, NodePort, or ClusterIp), return the corresponding WebURL, seamlessly enabling a WebUI redirect. However, as of now, due to security considerations within our online private K8s cluster, there hasn't been a connection established between the Pod and client node network (and there's currently no plan for this). Hence, we only employ NodePort. If the number of future tasks increases significantly, and there's a need for ClusterIP, we might consider deploying Apache StreamPark in K8s or further integrate it with Ingress. ![](/blog/belle/start.png) @@ -185,7 +185,7 @@ Below is the specific submission process in the K8s Application mode: ### K8s Native Session Mode -StreamPark also offers robust support for the K8s Native-Session mode, which lays a solid technical foundation for our subsequent offline FlinkSQL development or for segmenting certain resources. +Apache StreamPark also offers robust support for the K8s Native-Session mode, which lays a solid technical foundation for our subsequent offline FlinkSQL development or for segmenting certain resources. To use the Native-Session mode, one must first use the Flink command to create a Flink cluster that operates within K8s. For instance: @@ -203,7 +203,7 @@ To use the Native-Session mode, one must first use the Flink command to create a ![](/blog/belle/flinksql.png) -As shown in the image above, we use that ClusterId as the Kubernetes ClusterId task parameter for StreamPark. 
Once the task is saved and submitted, it quickly transitions to a 'Running' state: +As shown in the image above, we use that ClusterId as the Kubernetes ClusterId task parameter for Apache StreamPark. Once the task is saved and submitted, it quickly transitions to a 'Running' state: ![](/blog/belle/detail.png) @@ -211,13 +211,13 @@ Following the application info's WebUI link: ![](/blog/belle/dashboard.png) -It becomes evident that StreamPark essentially uploads the jar package to the Flink cluster through REST API and then schedules the task for execution. +It becomes evident that Apache StreamPark essentially uploads the jar package to the Flink cluster through REST API and then schedules the task for execution.
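As a side note, the Session cluster referred to above is created with Flink's bundled kubernetes-session.sh script before any job is submitted to it. A minimal sketch follows; the cluster-id, namespace, image and resource values are only illustrative:

```shell
# Create a long-running Flink Session cluster inside K8s; its cluster-id is what
# gets filled into the Kubernetes ClusterId task parameter mentioned above
./bin/kubernetes-session.sh \
    -Dkubernetes.cluster-id=flink-session-01 \
    -Dkubernetes.namespace=flink-session \
    -Dkubernetes.container.image=flink:1.14.3-scala_2.12 \
    -Dtaskmanager.numberOfTaskSlots=4 \
    -Dtaskmanager.memory.process.size=4096m \
    -Dkubernetes.taskmanager.cpu=2
```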
### Custom Code Mode -To our delight, StreamPark also provides support for coding DataStream/FlinkSQL tasks. For special requirements, we can achieve our implementations in Java/Scala. You can compose tasks following the scaffold method recommended by StreamPark or write a standard Flink task. By adopting this approach, we can delegate code management to git, utilizing the platform for automated compilation, packaging, and deployment. Naturally, if functionality can be achieved via SQL, we would prefer not to customize DataStream, thereby minimizing unnecessary operational complexities. +To our delight, Apache StreamPark also provides support for coding DataStream/FlinkSQL tasks. For special requirements, we can achieve our implementations in Java/Scala. You can compose tasks following the scaffold method recommended by Apache StreamPark or write a standard Flink task. By adopting this approach, we can delegate code management to git, utilizing the platform for automated compilation, packaging, and deployment. Naturally, if functionality can be achieved via SQL, we would prefer not to customize DataStream, thereby minimizing unnecessary operational complexities.
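For a sense of what that delegation to git amounts to, the platform's project build step is essentially a clone plus a package command, sketched below under the assumption of a Maven-built project with a hypothetical repository URL and branch:

```shell
# Fetch the job source from git (repository and branch are hypothetical)
git clone -b release https://github.com/your-org/flink-jobs.git
cd flink-jobs

# Build the deployable artifact; the resulting jar is what gets shipped to the cluster
mvn clean package -DskipTests
```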

@@ -225,26 +225,26 @@ To our delight, StreamPark also provides support for coding DataStream/FlinkSQL ## Suggestions for Improvement -StreamPark, similar to any other new tools, does have areas for further enhancement based on our current evaluations: +Apache StreamPark, similar to any other new tools, does have areas for further enhancement based on our current evaluations: * **Strengthening Resource Management**: Features like multi-file system jar resources and robust task versioning are still awaiting additions. * **Enriching Frontend Features**: For instance, once a task is added, functionalities like copying could be integrated. * **Visualization of Task Submission Logs**: The process of task submission involves loading class files, jar packaging, building and submitting images, and more. A failure at any of these stages could halt the task. However, error logs are not always clear, or due to some anomaly, the exceptions aren't thrown as expected, leaving users puzzled about rectifications. -It's a universal truth that innovations aren't perfect from the outset. Although minor issues exist and there are areas for improvement with StreamPark, its merits outweigh its limitations. As a result, we've chosen StreamPark as our Flink DevOps platform. We're also committed to collaborating with its main developers to refine StreamPark further. We wholeheartedly invite others to use it and contribute towards its advancement. +It's a universal truth that innovations aren't perfect from the outset. Although minor issues exist and there are areas for improvement with Apache StreamPark, its merits outweigh its limitations. As a result, we've chosen Apache StreamPark as our Flink DevOps platform. We're also committed to collaborating with its main developers to refine Apache StreamPark further. We wholeheartedly invite others to use it and contribute towards its advancement.
## Future Prospects * We'll keep our focus on Doris and plan to unify business data with log data in Doris, leveraging Flink to realize lakehouse capabilities. -* Our next step is to explore integrating StreamPark with DolphinScheduler 2.x. This would enhance DolphinScheduler's offline tasks, and gradually we aim to replace Spark with Flink for a unified batch-streaming solution. +* Our next step is to explore integrating Apache StreamPark with DolphinScheduler 2.x. This would enhance DolphinScheduler's offline tasks, and gradually we aim to replace Spark with Flink for a unified batch-streaming solution. * Drawing from our own experiments with S3, after building the fat-jar, we're considering bypassing image building. Instead, we'll mount PVC directly to the Flink Pod's directory using Pod Template, refining the code submission process even further. -* We plan to persistently implement StreamPark in our production environment. Collaborating with community developers, we aim to boost StreamPark's Flink stream development, deployment, and monitoring capabilities. Our collective vision is to evolve StreamPark into a holistic stream data DevOps platform. +* We plan to persistently implement Apache StreamPark in our production environment. Collaborating with community developers, we aim to boost Apache StreamPark's Flink stream development, deployment, and monitoring capabilities. Our collective vision is to evolve Apache StreamPark into a holistic stream data DevOps platform. Resources: -StreamPark GitHub: [https://github.com/apache/incubator-streampark](https://github.com/apache/incubator-streampark)
+Apache StreamPark GitHub: [https://github.com/apache/incubator-streampark](https://github.com/apache/incubator-streampark)
Doris GitHub: [https://github.com/apache/doris](https://github.com/apache/doris)

![](/blog/belle/author.png)

diff --git a/blog/2-streampark-usercase-chinaunion.md b/blog/2-streampark-usercase-chinaunion.md
index 6f4ea6bfd..a7a2121b2 100644
--- a/blog/2-streampark-usercase-chinaunion.md
+++ b/blog/2-streampark-usercase-chinaunion.md
@@ -1,7 +1,7 @@
---
slug: streampark-usercase-chinaunion
title: China Union's Flink Real-Time Computing Platform Ops Practice
-tags: [StreamPark, Production Practice, FlinkSQL]
+tags: [Apache StreamPark, Production Practice, FlinkSQL]
---

![](/blog/chinaunion/overall_architecture.png)

@@ -10,7 +10,7 @@ tags: [StreamPark, Production Practice, FlinkSQL]

- Introduction to the Real-Time Computing Platform Background
- Operational Challenges of Flink Real-Time Jobs
-- Integrated Management Based on StreamPark
+- Integrated Management Based on Apache StreamPark
- Future Planning and Evolution

@@ -68,28 +68,28 @@ In terms of job operation and maintenance dilemmas, firstly, the job deployment

Due to various factors in the job operation and maintenance difficulties, business support challenges arise, such as a high rate of failures during launch, impact on data quality, lengthy launch times, high data latency, and issues with missed alarm handling, leading to complaints. In addition, the impact on our business is unclear, and once a problem arises, addressing the issue becomes the top priority.

-## **基于 StreamPark 一体化管理**
+## **Integrated Management Based on Apache StreamPark**

![](/blog/chinaunion/job_management.png)

-In response to the two dilemmas mentioned above, we have resolved many issues through StreamPark's integrated management. First, let's take a look at the dual evolution of StreamPark, which includes Flink Job Management and Flink Job DevOps Platform. In terms of job management, StreamPark supports deploying Flink real-time jobs to different clusters, such as Flink's native Standalone mode, and the Session, Application, and PerJob modes of Flink on Yarn. In the latest version, it will support Kubernetes Native Session mode. The middle layer includes project management, job management, cluster management, team management, variable management, and alarm management.
+In response to the two dilemmas mentioned above, we have resolved many issues through Apache StreamPark's integrated management. First, let's take a look at the dual evolution of Apache StreamPark, which includes Flink Job Management and Flink Job DevOps Platform. In terms of job management, Apache StreamPark supports deploying Flink real-time jobs to different clusters, such as Flink's native Standalone mode, and the Session, Application, and PerJob modes of Flink on Yarn. In the latest version, it will support Kubernetes Native Session mode. The middle layer includes project management, job management, cluster management, team management, variable management, and alarm management.

- Project Management: When deploying a Flink program, you can fill in the git address in project management and select the branch you want to deploy.
-- Job Management: You can specify the execution mode of the Flink job, such as which type of cluster you want to submit to. You can also configure some resources, such as the number of TaskManagers, the memory size of TaskManager/JobManager, parallelism, etc. Additionally, you can set up some fault tolerance measures; for instance, if a Flink job fails, StreamPark can support automatic restarts, and it also supports the input of some dynamic parameters. 
+- Job Management: You can specify the execution mode of the Flink job, such as which type of cluster you want to submit to. You can also configure some resources, such as the number of TaskManagers, the memory size of TaskManager/JobManager, parallelism, etc. Additionally, you can set up some fault tolerance measures; for instance, if a Flink job fails, Apache StreamPark can support automatic restarts, and it also supports the input of some dynamic parameters. - Cluster Management: You can add and manage big data clusters through the interface. - Team Management: In the actual production process of an enterprise, there are multiple teams, and these teams are isolated from each other. - Variable Management: You can maintain some variables in one place. For example, you can define Kafka's Broker address as a variable. When configuring Flink jobs or SQL, you can replace the Broker's IP with a variable. Moreover, if this Kafka needs to be decommissioned later, you can also use this variable to check which jobs are using this cluster, facilitating some subsequent processes. - Alarm Management: Supports multiple alarm modes, such as WeChat, DingTalk, SMS, and email. -StreamPark supports the submission of Flink SQL and Flink Jar, allows for resource configuration, and supports state tracking, indicating whether the state is running, failed, etc. Additionally, it provides a metrics dashboard and supports the viewing of various logs. +Apache StreamPark supports the submission of Flink SQL and Flink Jar, allows for resource configuration, and supports state tracking, indicating whether the state is running, failed, etc. Additionally, it provides a metrics dashboard and supports the viewing of various logs. ![](/blog/chinaunion/devops_platform.png) The Flink Job DevOps platform primarily consists of the following parts: -- Teams: StreamPark supports multiple teams, each with its team administrator who has all permissions. There are also team developers who only have a limited set of permissions. +- Teams: Apache StreamPark supports multiple teams, each with its team administrator who has all permissions. There are also team developers who only have a limited set of permissions. - Compilation and Packaging: When creating a Flink project, you can configure the git address, branch, and packaging commands in the project, and then compile and package with a single click of the build button. - Release and Deployment: During release and deployment, a Flink job is created. Within the Flink job, you can choose the execution mode, deployment cluster, resource settings, fault tolerance settings, and fill in variables. Finally, the Flink job can be started or stopped with a single click. - State Monitoring: After the Flink job is started, real-time tracking of its state begins, including Flink's running status, runtime duration, Checkpoint information, etc. There is also support for one-click redirection to Flink's Web UI. @@ -97,7 +97,7 @@ The Flink Job DevOps platform primarily consists of the following parts: ![](/blog/chinaunion/multi_team_support.png) -Companies generally have multiple teams working on real-time jobs simultaneously. In our company, this includes a real-time data collection team, a data processing team, and a real-time marketing team. StreamPark supports resource isolation for multiple teams. +Companies generally have multiple teams working on real-time jobs simultaneously. 
In our company, this includes a real-time data collection team, a data processing team, and a real-time marketing team. Apache StreamPark supports resource isolation for multiple teams. ![](/blog/chinaunion/platformized_management.png) @@ -110,7 +110,7 @@ Management of the Flink job platform faces the following challenges: -Based on the challenges mentioned above, StreamPark has addressed the issues of unclear ownership and untraceable branches through project management. This is because when creating a project, you need to manually specify certain branches. Once the packaging is successful, these branches are recorded. Job management centralizes configurations, preventing scripts from being too dispersed. Additionally, there is strict control over the permissions for starting and stopping jobs, preventing an uncontrollable state due to script permissions. StreamPark interacts with clusters through interfaces to obtain job information, allowing for more precise job control. +Based on the challenges mentioned above, Apache StreamPark has addressed the issues of unclear ownership and untraceable branches through project management. This is because when creating a project, you need to manually specify certain branches. Once the packaging is successful, these branches are recorded. Job management centralizes configurations, preventing scripts from being too dispersed. Additionally, there is strict control over the permissions for starting and stopping jobs, preventing an uncontrollable state due to script permissions. Apache StreamPark interacts with clusters through interfaces to obtain job information, allowing for more precise job control. @@ -118,15 +118,15 @@ Referring to the image above, you can see at the bottom of the diagram that pack ![图片](/blog/chinaunion/development_efficiency.png) -In the early stages, we needed to go through seven steps for deployment, including connecting to a VPN, logging in through 4A, executing compile scripts, executing start scripts, opening Yarn, searching for the job name, and entering the Flink UI. StreamPark supports one-click deployment for four of these steps, including one-click packaging, one-click release, one-click start, and one-click access to the Flink UI. +In the early stages, we needed to go through seven steps for deployment, including connecting to a VPN, logging in through 4A, executing compile scripts, executing start scripts, opening Yarn, searching for the job name, and entering the Flink UI. Apache StreamPark supports one-click deployment for four of these steps, including one-click packaging, one-click release, one-click start, and one-click access to the Flink UI. ![图片](/blog/chinaunion/submission_process.png) -The image above illustrates the job submission process of our StreamPark platform. Firstly, StreamPark proceeds to release the job, during which some resources are uploaded. Following that, the job is submitted, accompanied by various configured parameters, and it is published to the cluster using the Flink Submit method via an API call. At this point, there are multiple Flink Submit instances corresponding to different execution modes, such as Yarn Session, Yarn Application, Kubernetes Session, Kubernetes Application, and so on; all of these are controlled here. After submitting the job, if it is a Flink on Yarn job, the platform will acquire the Application ID or Job ID of the Flink job. This ID is then stored in our database. Similarly, if the job is executed based on Kubernetes, a Job ID will be obtained. 
Subsequently, when tracking the job status, we primarily use these stored IDs to monitor the state of the job. +The image above illustrates the job submission process of our Apache StreamPark platform. Firstly, Apache StreamPark proceeds to release the job, during which some resources are uploaded. Following that, the job is submitted, accompanied by various configured parameters, and it is published to the cluster using the Flink Submit method via an API call. At this point, there are multiple Flink Submit instances corresponding to different execution modes, such as Yarn Session, Yarn Application, Kubernetes Session, Kubernetes Application, and so on; all of these are controlled here. After submitting the job, if it is a Flink on Yarn job, the platform will acquire the Application ID or Job ID of the Flink job. This ID is then stored in our database. Similarly, if the job is executed based on Kubernetes, a Job ID will be obtained. Subsequently, when tracking the job status, we primarily use these stored IDs to monitor the state of the job. ![图片](/blog/chinaunion/status_acquisition_bottleneck.png) -As mentioned above, in the case of Flink on Yarn jobs, two IDs are acquired upon job submission: the Application ID and the Job ID. These IDs are used to retrieve the job status. However, when there is a large number of Flink jobs, certain issues may arise. StreamPark utilizes a status retriever that periodically sends requests to the ResourceManager every five seconds, using the Application ID or Job ID stored in our database. If there are a considerable number of jobs, during each polling cycle, the ResourceManager is responsible for calling the Job Manager's address to access its status. This can lead to significant pressure on the number of connections to the ResourceManager and an overall increase in the number of connections. +As mentioned above, in the case of Flink on Yarn jobs, two IDs are acquired upon job submission: the Application ID and the Job ID. These IDs are used to retrieve the job status. However, when there is a large number of Flink jobs, certain issues may arise. Apache StreamPark utilizes a status retriever that periodically sends requests to the ResourceManager every five seconds, using the Application ID or Job ID stored in our database. If there are a considerable number of jobs, during each polling cycle, the ResourceManager is responsible for calling the Job Manager's address to access its status. This can lead to significant pressure on the number of connections to the ResourceManager and an overall increase in the number of connections. @@ -134,38 +134,38 @@ In the diagram mentioned earlier, the connection count to the ResourceManager sh ![图片](/blog/chinaunion/state_optimization.png) -To address the issues mentioned above, we have made some optimizations in StreamPark. Firstly, after submitting a job, StreamPark saves the Application ID or Job ID, and it also retrieves and stores the direct access address of the Job Manager in the database. Therefore, instead of polling the ResourceManager for job status, it can directly call the addresses of individual Job Managers to obtain the real-time status. This significantly reduces the number of connections to the ResourceManager. As can be seen from the latter part of the diagram above, there are basically no significant spikes in connection counts, which substantially alleviates the pressure on the ResourceManager. 
Moreover, this ensures that as the number of Flink jobs continues to grow, the system will not encounter bottlenecks in status retrieval. +To address the issues mentioned above, we have made some optimizations in Apache StreamPark. Firstly, after submitting a job, Apache StreamPark saves the Application ID or Job ID, and it also retrieves and stores the direct access address of the Job Manager in the database. Therefore, instead of polling the ResourceManager for job status, it can directly call the addresses of individual Job Managers to obtain the real-time status. This significantly reduces the number of connections to the ResourceManager. As can be seen from the latter part of the diagram above, there are basically no significant spikes in connection counts, which substantially alleviates the pressure on the ResourceManager. Moreover, this ensures that as the number of Flink jobs continues to grow, the system will not encounter bottlenecks in status retrieval. ![图片](/blog/chinaunion/state_recovery.png) -Another issue that StreamPark resolves is safeguarding Flink's state recovery. In the past, when we used scripts for operations and maintenance, especially during business upgrades, it was necessary to recover from the latest checkpoint when starting Flink. However, developers often forgot to recover from the previous checkpoint, leading to significant data quality issues and complaints. StreamPark's process is designed to mitigate this issue. Upon the initial start of a Flink job, it polls every five seconds to retrieve checkpoint records, saving them in a database. When manually stopping a Flink job through StreamPark, users have the option to perform a savepoint. If this option is selected, the path of the savepoint is saved in the database. In addition, records of each checkpoint are also stored in the database. When restarting a Flink job, the system defaults to using the latest checkpoint or savepoint record. This effectively prevents issues associated with failing to recover from the previous checkpoint. It also avoids the resource wastage caused by having to rerun jobs with offset rollbacks to address problems, while ensuring consistency in data processing. +Another issue that Apache StreamPark resolves is safeguarding Flink's state recovery. In the past, when we used scripts for operations and maintenance, especially during business upgrades, it was necessary to recover from the latest checkpoint when starting Flink. However, developers often forgot to recover from the previous checkpoint, leading to significant data quality issues and complaints. Apache StreamPark's process is designed to mitigate this issue. Upon the initial start of a Flink job, it polls every five seconds to retrieve checkpoint records, saving them in a database. When manually stopping a Flink job through Apache StreamPark, users have the option to perform a savepoint. If this option is selected, the path of the savepoint is saved in the database. In addition, records of each checkpoint are also stored in the database. When restarting a Flink job, the system defaults to using the latest checkpoint or savepoint record. This effectively prevents issues associated with failing to recover from the previous checkpoint. It also avoids the resource wastage caused by having to rerun jobs with offset rollbacks to address problems, while ensuring consistency in data processing. 
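For comparison, what the platform automates here corresponds to the manual Flink CLI routine that operators previously had to remember; a sketch with illustrative paths, class names and ids:

```shell
# Stop the job and take a savepoint, noting down the returned savepoint path
./bin/flink stop --savepointPath hdfs:///flink/savepoints <job-id>

# Restart the (possibly upgraded) job from that savepoint
./bin/flink run -s hdfs:///flink/savepoints/savepoint-abc123 -c com.example.MyJob my-job.jar
```

Forgetting the second step's -s flag is precisely the mistake that used to cause the data quality issues described above.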
![图片](/blog/chinaunion/multiple_environments_and_components.png) -StreamPark also addresses the challenges associated with referencing multiple components across various environments. In a corporate setting, there are typically multiple environments, such as development, testing, and production. Each environment generally includes multiple components, such as Kafka, HBase, Redis, etc. Additionally, within a single environment, there may be multiple instances of the same component. For example, in a real-time computing platform at China Union, when consuming data from an upstream Kafka cluster and writing the relevant data to a downstream Kafka cluster, two sets of Kafka are involved within the same environment. It can be challenging to determine the specific environment and component based solely on IP addresses. To address this, we define the IP addresses of all components as variables. For instance, the Kafka cluster variable, Kafka.cluster, exists in development, testing, and production environments, but it points to different Broker addresses in each. Thus, regardless of the environment in which a Flink job is configured, referencing this variable is sufficient. This approach significantly reduces the incidence of operational failures in production environments. +Apache StreamPark also addresses the challenges associated with referencing multiple components across various environments. In a corporate setting, there are typically multiple environments, such as development, testing, and production. Each environment generally includes multiple components, such as Kafka, HBase, Redis, etc. Additionally, within a single environment, there may be multiple instances of the same component. For example, in a real-time computing platform at China Union, when consuming data from an upstream Kafka cluster and writing the relevant data to a downstream Kafka cluster, two sets of Kafka are involved within the same environment. It can be challenging to determine the specific environment and component based solely on IP addresses. To address this, we define the IP addresses of all components as variables. For instance, the Kafka cluster variable, Kafka.cluster, exists in development, testing, and production environments, but it points to different Broker addresses in each. Thus, regardless of the environment in which a Flink job is configured, referencing this variable is sufficient. This approach significantly reduces the incidence of operational failures in production environments. ![图片](/blog/chinaunion/multiple_execution_modes.png) -StreamPark supports multiple execution modes for Flink, including three deployment modes based on Yarn: Application, Perjob, and Session. Additionally, it supports two deployment modes for Kubernetes: Application and Session, as well as some Remote modes. +Apache StreamPark supports multiple execution modes for Flink, including three deployment modes based on Yarn: Application, Perjob, and Session. Additionally, it supports two deployment modes for Kubernetes: Application and Session, as well as some Remote modes. ![图片](/blog/chinaunion/versioning.png) -StreamPark also supports multiple versions of Flink. For example, while our current version is 1.14.x, we would like to experiment with the new 1.16.x release. However, it’s not feasible to upgrade all existing jobs to 1.16.x. Instead, we can opt to upgrade only the new jobs to 1.16.x, allowing us to leverage the benefits of the new version while maintaining compatibility with the older version. 
+Apache StreamPark also supports multiple versions of Flink. For example, while our current version is 1.14.x, we would like to experiment with the new 1.16.x release. However, it’s not feasible to upgrade all existing jobs to 1.16.x. Instead, we can opt to upgrade only the new jobs to 1.16.x, allowing us to leverage the benefits of the new version while maintaining compatibility with the older version. ## **Future Planning and Evolution** ![图片](/blog/chinaunion/contribution_and_enhancement.png) -In the future, we will increase our involvement in the development of StreamPark, and we have planned the following directions for enhancement: -- High Availability: StreamPark currently does not support high availability, and this aspect needs further strengthening. +In the future, we will increase our involvement in the development of Apache StreamPark, and we have planned the following directions for enhancement: +- High Availability: Apache StreamPark currently does not support high availability, and this aspect needs further strengthening. - State Management: In enterprise practices, each operator in a Flink job has a UID. If the Flink UID is not set, it could lead to situations where state recovery is not possible when upgrading the Flink job. This issue cannot be solved through the platform at the moment. Therefore, we plan to add this functionality to the platform. We will introduce a feature that checks whether the operator has a UID set when submitting a Flink Jar. If not, a reminder will be issued to avoid state recovery issues every time a Flink job is deployed. Previously, when facing such situations, we had to use the state processing API to deserialize from the original state, and then create a new state using the state processing API for the upgraded Flink to load. -- More Detailed Monitoring: Currently, StreamPark supports sending alerts when a Flink job fails. We hope to also send alerts when a Task fails, and need to know the reason for the failure. In addition, enhancements are needed in job backpressure monitoring alerts, Checkpoint timeout alerts, failure alerts, and performance metric collection. +- More Detailed Monitoring: Currently, Apache StreamPark supports sending alerts when a Flink job fails. We hope to also send alerts when a Task fails, and need to know the reason for the failure. In addition, enhancements are needed in job backpressure monitoring alerts, Checkpoint timeout alerts, failure alerts, and performance metric collection. - Stream-Batch Integration: Explore a platform that integrates both streaming and batch processing, combining the Flink stream-batch unified engine with data lake storage that supports stream-batch unification. ![](/blog/chinaunion/road_map.png) -The above diagram represents the Roadmap for StreamPark. -- Data Source: StreamPark will support rapid integration with more data sources, achieving one-click data onboarding. +The above diagram represents the Roadmap for Apache StreamPark. +- Data Source: Apache StreamPark will support rapid integration with more data sources, achieving one-click data onboarding. - Operation Center: Acquire more Flink Metrics to further enhance the capabilities in monitoring and operation. - K8S-operator: The existing Flink on K8S is somewhat cumbersome, having gone through the processes of packaging Jars, building images, and pushing images. There is a need for future improvements and optimization, and we are actively embracing the upstream K8S-operator integration. 
- Streaming Data Warehouse: Enhance support for Flink SQL job capabilities, simplify the submission of Flink SQL jobs, and plan to integrate with Flink SQL Gateway. Enhance capabilities in the SQL data warehouse domain, including metadata storage, unified table creation syntax validation, runtime testing, and interactive queries, while actively embracing Flink upstream to explore real-time data warehouses and streaming data warehouses. diff --git a/blog/3-streampark-usercase-bondex-paimon.md b/blog/3-streampark-usercase-bondex-paimon.md index d3b645350..2c8d8abd6 100644 --- a/blog/3-streampark-usercase-bondex-paimon.md +++ b/blog/3-streampark-usercase-bondex-paimon.md @@ -1,12 +1,12 @@ --- slug: streampark-usercase-bondex-with-paimon -title: Based on Apache Paimon + StreamPark's Streaming Data Warehouse Practice by Bondex -tags: [StreamPark, Production Practice, paimon, streaming-warehouse] +title: Based on Apache Paimon + Apache StreamPark's Streaming Data Warehouse Practice by Bondex +tags: [Apache StreamPark, Production Practice, paimon, streaming-warehouse] --- ![](/blog/bondex/Bondex.png) -**Foreword: **This article mainly introduces the implementation of a streaming data warehouse by Bondex, a supply chain logistics service provider, in the process of digital transformation using the Paimon + StreamPark platform. We provide an easy-to-follow operational manual with the Apache StreamPark integrated stream-batch platform to help users submit Flink tasks and quickly master the use of Paimon. +**Foreword: **This article mainly introduces the implementation of a streaming data warehouse by Bondex, a supply chain logistics service provider, in the process of digital transformation using the Paimon + Apache StreamPark platform. We provide an easy-to-follow operational manual with the Apache StreamPark integrated stream-batch platform to help users submit Flink tasks and quickly master the use of Paimon. - Introduction to Company Business - Pain Points and Selection in Big Data Technology @@ -102,11 +102,11 @@ Continuing the characteristics of the Kappa architecture with a single stream pr ## 03 Production Practices -This solution adopts Flink Application on K8s clusters, with Flink CDC for real-time ingestion of relational database data from business systems. Tasks for Flink + Paimon Streaming Data Warehouse are submitted through the StreamPark task platform, with the Trino engine ultimately used to access Finereport for service provision and developer queries. Paimon's underlying storage supports the S3 protocol, and as the company's big data services rely on Alibaba Cloud, Object Storage Service (OSS) is used as the data filesystem. +This solution adopts Flink Application on K8s clusters, with Flink CDC for real-time ingestion of relational database data from business systems. Tasks for Flink + Paimon Streaming Data Warehouse are submitted through the Apache StreamPark task platform, with the Trino engine ultimately used to access Finereport for service provision and developer queries. Paimon's underlying storage supports the S3 protocol, and as the company's big data services rely on Alibaba Cloud, Object Storage Service (OSS) is used as the data filesystem. -[StreamPark](https://github.com/apache/incubator-streampark) is a real-time computing platform that leverages the powerful capabilities of [Paimon](https://github.com/apache/incubator-paimon) to process real-time data streams. 
This platform offers the following key features: +[Apache StreamPark](https://github.com/apache/incubator-streampark) is a real-time computing platform that leverages the powerful capabilities of [Paimon](https://github.com/apache/incubator-paimon) to process real-time data streams. This platform offers the following key features: -**Real-time Data Processing: **StreamPark supports the submission of real-time data stream tasks, capable of real-time acquisition, transformation, filtering, and analysis of data. This is extremely important for applications that require rapid response to real-time data, such as real-time monitoring, real-time recommendations, and real-time risk control. +**Real-time Data Processing: **Apache StreamPark supports the submission of real-time data stream tasks, capable of real-time acquisition, transformation, filtering, and analysis of data. This is extremely important for applications that require rapid response to real-time data, such as real-time monitoring, real-time recommendations, and real-time risk control. **Scalability: **Capable of efficiently processing large-scale real-time data with good scalability. It can operate in a distributed computing environment, automatically handling parallelization, fault recovery, and load balancing to ensure efficient and reliable data processing. @@ -114,7 +114,7 @@ This solution adopts Flink Application on K8s clusters, with Flink CDC for real- **Ease of Use: **Provides a straightforward graphical interface and simplified API, enabling easy construction and deployment of data processing tasks without needing to delve into underlying technical details. -By submitting Paimon tasks on the StreamPark platform, we can establish a full-link real-time flowing, queryable, and layered reusable Pipline. +By submitting Paimon tasks on the Apache StreamPark platform, we can establish a full-link real-time flowing, queryable, and layered reusable Pipline. ![](/blog/bondex/pipline.png) @@ -127,7 +127,7 @@ The main components versions used are as follows: ### **Environment Setup** -Download flink-1.16.0-scala-2.12.tar.gz which can be obtained from the official Flink website. Download the corresponding version of the package to the StreamPark deployment server. +Download flink-1.16.0-scala-2.12.tar.gz which can be obtained from the official Flink website. Download the corresponding version of the package to the Apache StreamPark deployment server. ```shell # Unzip @@ -182,7 +182,7 @@ export PATH=$PATH:$FLINK_HOME/bin source /etc/profile ``` -In StreamPark, add Flink conf: +In Apache StreamPark, add Flink conf: ![](/blog/bondex/flink_conf.jpg) @@ -236,7 +236,7 @@ docker push registry-vpc.cn-zhangjiakou.aliyuncs.com/xxxxx/flink-table-store:v1. Next, prepare the Paimon jar package. You can download the corresponding version from the Apache [Repository](https://repository.apache.org/content/groups/snapshots/org/apache/paimon). It's important to note that it should be consistent with the major version of Flink. -### **Managing Jobs with StreamPark** +### **Managing Jobs with Apache StreamPark** **Prerequisites:** @@ -247,7 +247,7 @@ Next, prepare the Paimon jar package. You can download the corresponding version **Kubernetes Client Connection Configuration:** -Copy the k8s master node's `~/.kube/config` configuration directly to the directory on the StreamPark server, then execute the following command on the StreamPark server to display the k8s cluster as running, which indicates successful permission and network verification. 
+Copy the k8s master node's `~/.kube/config` configuration directly to the directory on the Apache StreamPark server, then execute the following command on the Apache StreamPark server to display the k8s cluster as running, which indicates successful permission and network verification. ```shell kubectl cluster-info @@ -272,11 +272,11 @@ kubectl create secret docker-registry streamparksecret In this case, Alibaba Cloud's Container Registry Service (ACR) is used, but you can also substitute it with a self-hosted image service such as Harbor. -Create a namespace named StreamPark (set the security setting to private). +Create a namespace named Apache StreamPark (set the security setting to private). ![](/blog/bondex/aliyun.png) -Configure the image repository in StreamPark; task build images will be pushed to the repository. +Configure the image repository in Apache StreamPark; task build images will be pushed to the repository. ![](/blog/bondex/dockersystem_setting.png) @@ -935,4 +935,4 @@ In complex real-time tasks, resources can be increased by modifying dynamic para - Subsequently, we will integrate with Trino Catalog to access Doris, realizing a one-service solution for both offline and real-time data. - We will continue to advance the pace of building an integrated streaming and batch data warehouse within the group, adopting the architecture of Doris + Paimon. -Here, I would like to thank Teacher Zhixin and the StreamPark community for their strong support during the use of StreamPark + Paimon. The problems encountered in the learning process are promptly clarified and resolved. We will also actively participate in community exchanges and contributions in the future, enabling Paimon to provide more developers and enterprises with integrated stream and batch data lake solutions. +Here, I would like to thank Teacher Zhixin and the Apache StreamPark community for their strong support during the use of Apache StreamPark + Paimon. The problems encountered in the learning process are promptly clarified and resolved. We will also actively participate in community exchanges and contributions in the future, enabling Paimon to provide more developers and enterprises with integrated stream and batch data lake solutions. diff --git a/blog/4-streampark-usercase-shunwang.md b/blog/4-streampark-usercase-shunwang.md index 33d621618..fc4ef8ce6 100644 --- a/blog/4-streampark-usercase-shunwang.md +++ b/blog/4-streampark-usercase-shunwang.md @@ -1,17 +1,17 @@ --- slug: streampark-usercase-shunwang -title: StreamPark in the Large-Scale Production Practice at Shunwang Technology -tags: [StreamPark, Production Practice, FlinkSQL] +title: Apache StreamPark in the Large-Scale Production Practice at Shunwang Technology +tags: [Apache StreamPark, Production Practice, FlinkSQL] --- ![](/blog/shunwang/autor.png) -**Preface:** This article primarily discusses the challenges encountered by Shunwang Technology in using the Flink computation engine, and how StreamPark is leveraged as a real-time data platform to address these challenges, thus supporting the company's business on a large scale. +**Preface:** This article primarily discusses the challenges encountered by Shunwang Technology in using the Flink computation engine, and how Apache StreamPark is leveraged as a real-time data platform to address these challenges, thus supporting the company's business on a large scale. 
- Introduction to the company's business - Challenges encountered -- Why choose StreamPark +- Why choose Apache StreamPark - Implementation in practice - Benefits Brought - Future planning @@ -43,7 +43,7 @@ Flink, as one of the most popular real-time computing frameworks currently, boas Facing a series of pain points in the management and operation of Flink jobs, we have been looking for suitable solutions to lower the barrier to entry for our developers using Flink and improve work efficiency. -Before we encountered StreamPark, we researched some companies' Flink management solutions and found that they all developed and managed Flink jobs through self-developed real-time job platforms. Consequently, we decided to develop our own real-time computing management platform to meet the basic needs of our developers for Flink job management and operation. Our platform is called Streaming-Launcher, with the following main functions: +Before we encountered Apache StreamPark, we researched some companies' Flink management solutions and found that they all developed and managed Flink jobs through self-developed real-time job platforms. Consequently, we decided to develop our own real-time computing management platform to meet the basic needs of our developers for Flink job management and operation. Our platform is called Streaming-Launcher, with the following main functions: ![Image](/blog/shunwang/launcher.png) @@ -71,19 +71,19 @@ To view logs for a job, developers must go through multiple steps, which to some ![Image](/blog/shunwang/step.png) -## **Why Use StreamPark** +## **Why Use Apache StreamPark** Faced with the defects of our self-developed platform Streaming-Launcher, we have been considering how to further lower the barriers to using Flink and improve work efficiency. Considering the cost of human resources and time, we decided to seek help from the open-source community and look for an appropriate open-source project to manage and maintain our Flink tasks. -### 01 **StreamPark: A Powerful Tool for Solving Flink Issues** +### 01 **Apache StreamPark: A Powerful Tool for Solving Flink Issues** -Fortunately, in early June 2022, we stumbled upon StreamPark on GitHub and embarked on a preliminary exploration full of hope. We found that StreamPark's capabilities can be broadly divided into three areas: user permission management, job operation and maintenance management, and development scaffolding. +Fortunately, in early June 2022, we stumbled upon Apache StreamPark on GitHub and embarked on a preliminary exploration full of hope. We found that Apache StreamPark's capabilities can be broadly divided into three areas: user permission management, job operation and maintenance management, and development scaffolding. **User Permission Management** -In the StreamPark platform, to avoid the risk of users having too much authority and making unnecessary misoperations that affect job stability and the accuracy of environmental configurations, some user permission management functions are provided. This is very necessary for enterprise-level users. +In the Apache StreamPark platform, to avoid the risk of users having too much authority and making unnecessary misoperations that affect job stability and the accuracy of environmental configurations, some user permission management functions are provided. This is very necessary for enterprise-level users. 
@@ -93,13 +93,13 @@ In the StreamPark platform, to avoid the risk of users having too much authority **Job Operation and Maintenance Management** -**Our main focus during the research on StreamPark was its capability to manage the entire lifecycle of jobs:** from development and deployment to management and problem diagnosis. **Fortunately, StreamPark excels in this aspect, relieving developers from the pain points associated with Flink job management and operation.** Within StreamPark’s job operation and maintenance management, there are three main modules: basic job management functions, Jar job management, and FlinkSQL job management as shown below: +**Our main focus during the research on Apache StreamPark was its capability to manage the entire lifecycle of jobs:** from development and deployment to management and problem diagnosis. **Fortunately, Apache StreamPark excels in this aspect, relieving developers from the pain points associated with Flink job management and operation.** Within Apache StreamPark’s job operation and maintenance management, there are three main modules: basic job management functions, Jar job management, and FlinkSQL job management as shown below: ![Image](/blog/shunwang/homework_manager.png) **Development Scaffolding** -Further research revealed that StreamPark is not just a platform but also includes a development scaffold for Flink jobs. It offers a better solution for code-written Flink jobs, standardizing program configuration, providing a simplified programming model, and a suite of Connectors that lower the barrier to entry for DataStream development. +Further research revealed that Apache StreamPark is not just a platform but also includes a development scaffold for Flink jobs. It offers a better solution for code-written Flink jobs, standardizing program configuration, providing a simplified programming model, and a suite of Connectors that lower the barrier to entry for DataStream development. @@ -109,9 +109,9 @@ Further research revealed that StreamPark is not just a platform but also includ -### 02 **How StreamPark Addresses Issues of the Self-Developed Platform** +### 02 **How Apache StreamPark Addresses Issues of the Self-Developed Platform** -We briefly introduced the core capabilities of StreamPark above. During the technology selection process at Shunwang Technology, we found that StreamPark not only includes the basic functions of our existing Streaming-Launcher but also offers a more complete set of solutions to address its many shortcomings. Here, we focus on the solutions provided by StreamPark for the deficiencies of our self-developed platform, Streaming-Launcher. +We briefly introduced the core capabilities of Apache StreamPark above. During the technology selection process at Shunwang Technology, we found that Apache StreamPark not only includes the basic functions of our existing Streaming-Launcher but also offers a more complete set of solutions to address its many shortcomings. Here, we focus on the solutions provided by Apache StreamPark for the deficiencies of our self-developed platform, Streaming-Launcher. @@ -121,21 +121,21 @@ We briefly introduced the core capabilities of StreamPark above. During the tech **One-Stop Flink Job Development Capability** -To lower the barriers to Flink job development and improve developers' work efficiency, **StreamPark provides features like FlinkSQL IDE, parameter management, task management, code management, one-click compilation, and one-click job deployment and undeployment**. 
Our research found that these integrated features of StreamPark could further enhance developers’ work efficiency. To some extent, developers no longer need to concern themselves with the difficulties of Flink job management and operation and can focus on developing the business logic. These features also solve the pain point of cumbersome SQL development processes in Streaming-Launcher. +To lower the barriers to Flink job development and improve developers' work efficiency, **Apache StreamPark provides features like FlinkSQL IDE, parameter management, task management, code management, one-click compilation, and one-click job deployment and undeployment**. Our research found that these integrated features of Apache StreamPark could further enhance developers’ work efficiency. To some extent, developers no longer need to concern themselves with the difficulties of Flink job management and operation and can focus on developing the business logic. These features also solve the pain point of cumbersome SQL development processes in Streaming-Launcher. ![Image](/blog/shunwang/application.png) **Support for Multiple Deployment Modes** -The Streaming-Launcher was not flexible for developers since it only supported the Yarn Session mode. StreamPark provides a comprehensive solution for this aspect. **StreamPark fully supports all of Flink's deployment modes: Remote, Yarn Per-Job, Yarn Application, Yarn Session, K8s Session, and K8s Application,** allowing developers to freely choose the appropriate running mode for different business scenarios. +The Streaming-Launcher was not flexible for developers since it only supported the Yarn Session mode. Apache StreamPark provides a comprehensive solution for this aspect. **Apache StreamPark fully supports all of Flink's deployment modes: Remote, Yarn Per-Job, Yarn Application, Yarn Session, K8s Session, and K8s Application,** allowing developers to freely choose the appropriate running mode for different business scenarios. **Unified Job Management Center** -For developers, job running status is one of their primary concerns. In Streaming-Launcher, due to the lack of a unified job management interface, developers had to rely on alarm information and Yarn application status to judge the state of tasks, which was very unfriendly. StreamPark addresses this issue by providing a unified job management interface, allowing for a clear view of the running status of each task. +For developers, job running status is one of their primary concerns. In Streaming-Launcher, due to the lack of a unified job management interface, developers had to rely on alarm information and Yarn application status to judge the state of tasks, which was very unfriendly. Apache StreamPark addresses this issue by providing a unified job management interface, allowing for a clear view of the running status of each task. ![Image](/blog/shunwang/management.png) -In the Streaming-Launcher, developers had to go through multiple steps to diagnose job issues and locate job runtime logs. StreamPark offers a one-click jump feature that allows quick access to job runtime logs. +In the Streaming-Launcher, developers had to go through multiple steps to diagnose job issues and locate job runtime logs. Apache StreamPark offers a one-click jump feature that allows quick access to job runtime logs. 
![Image](/blog/shunwang/logs.png) @@ -143,17 +143,17 @@ In the Streaming-Launcher, developers had to go through multiple steps to diagno ## Practical Implementation -When introducing StreamPark to Shunwang Technology, due to the company's business characteristics and some customized requirements from the developers, we made some additions and optimizations to the functionalities of StreamPark. We have also summarized some problems encountered during the use and corresponding solutions. +When introducing Apache StreamPark to Shunwang Technology, due to the company's business characteristics and some customized requirements from the developers, we made some additions and optimizations to the functionalities of Apache StreamPark. We have also summarized some problems encountered during the use and corresponding solutions. ### 01 **Leveraging the Capabilities of Flink-SQL-Gateway** -At Shunwang Technology, we have developed the ODPS platform based on the Flink-SQL-Gateway to facilitate business developers in managing the metadata of Flink tables. Business developers perform DDL operations on Flink tables in ODPS, and then carry out analysis and query operations on the created Flink tables in StreamPark. Throughout the entire business development process, we have decoupled the creation and analysis of Flink tables, making the development process appear more straightforward. +At Shunwang Technology, we have developed the ODPS platform based on the Flink-SQL-Gateway to facilitate business developers in managing the metadata of Flink tables. Business developers perform DDL operations on Flink tables in ODPS, and then carry out analysis and query operations on the created Flink tables in Apache StreamPark. Throughout the entire business development process, we have decoupled the creation and analysis of Flink tables, making the development process appear more straightforward. -If developers wish to query real-time data in ODPS, we need to provide a Flink SQL runtime environment. We have used StreamPark to run a Yarn Session Flink environment to support ODPS in conducting real-time queries. +If developers wish to query real-time data in ODPS, we need to provide a Flink SQL runtime environment. We have used Apache StreamPark to run a Yarn Session Flink environment to support ODPS in conducting real-time queries. ![Image](/blog/shunwang/homework.png) -Currently, the StreamPark community is intergrating with Flink-SQL-Gateway to further lower the barriers to developing real-time jobs. +Currently, the Apache StreamPark community is integrating with Flink-SQL-Gateway to further lower the barriers to developing real-time jobs. https://github.com/apache/streampark/issues/2274 @@ -171,7 +171,7 @@ In practice, we found it difficult for business developers to intuitively know h ### 03 **Enhancing Alert Capabilities** -Since each company's SMS alert platform is implemented differently, the StreamPark community has not abstracted a unified SMS alert function. Here, we have implemented the SMS alert function through the Webhook method. +Since each company's SMS alert platform is implemented differently, the Apache StreamPark community has not abstracted a unified SMS alert function. Here, we have implemented the SMS alert function through the Webhook method.
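+As a rough illustration of the Webhook approach (a hypothetical sketch, not the code used at Shunwang Technology; the gateway URL and payload fields are made-up placeholders), an alert can be forwarded to an internal SMS gateway with a plain HTTP POST:
+
+```java
+import java.net.URI;
+import java.net.http.HttpClient;
+import java.net.http.HttpRequest;
+import java.net.http.HttpResponse;
+
+// Illustrative webhook sender: POST the alert as JSON to an internal SMS gateway.
+public class WebhookSmsAlert {
+
+    private static final HttpClient HTTP = HttpClient.newHttpClient();
+
+    // gatewayUrl and the payload format are assumptions for the sake of the example.
+    public static void send(String gatewayUrl, String phone, String message) throws Exception {
+        String payload = String.format("{\"phone\":\"%s\",\"content\":\"%s\"}", phone, message);
+        HttpRequest request = HttpRequest.newBuilder()
+                .uri(URI.create(gatewayUrl))
+                .header("Content-Type", "application/json")
+                .POST(HttpRequest.BodyPublishers.ofString(payload))
+                .build();
+        HttpResponse<String> response = HTTP.send(request, HttpResponse.BodyHandlers.ofString());
+        if (response.statusCode() != 200) {
+            throw new IllegalStateException("SMS webhook returned HTTP " + response.statusCode());
+        }
+    }
+}
+```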
![Image](/blog/shunwang/alarm.png) @@ -179,7 +179,7 @@ Since each company's SMS alert platform is implemented differently, the StreamPa ### 04 **Adding a Blocking Queue to Solve Throttling Issues** -In production practice, we found that when a large number of tasks fail at the same time, such as when a Yarn Session cluster goes down, platforms like Feishu/Lark and WeChat have throttling issues when multiple threads call the alert interface simultaneously. Consequently, only a portion of the alert messages will be sent by StreamPark due to the throttling issues of platforms like Feishu/Lark and WeChat, which can easily mislead developers in troubleshooting. At Shunwang Technology, we added a blocking queue and an alert thread to solve the throttling issue. +In production practice, we found that when a large number of tasks fail at the same time, such as when a Yarn Session cluster goes down, platforms like Feishu/Lark and WeChat have throttling issues when multiple threads call the alert interface simultaneously. Consequently, only a portion of the alert messages will be sent by Apache StreamPark due to the throttling issues of platforms like Feishu/Lark and WeChat, which can easily mislead developers in troubleshooting. At Shunwang Technology, we added a blocking queue and an alert thread to solve the throttling issue. ![Image](/blog/shunwang/queue.png) @@ -191,11 +191,11 @@ https://github.com/apache/streampark/issues/2142 ## Benefits Brought -We started exploring and using StreamX 1.2.3 (the predecessor of StreamPark) and after more than a year of running in, we found that StreamPark truly resolves many pain points in the development management and operation and maintenance of Flink jobs. +We started exploring and using StreamX 1.2.3 (the predecessor of Apache StreamPark) and after more than a year of running in, we found that Apache StreamPark truly resolves many pain points in the development management and operation and maintenance of Flink jobs. -The greatest benefit that StreamPark has brought to Shunwang Technology is the lowered threshold for using Flink and improved development efficiency. Previously, our business development colleagues had to use multiple tools such as vscode, GitLab, and a scheduling platform in the original Streaming-Launcher to complete a FlinkSQL job development. The process was tedious, going through multiple tools from development to compilation to release. StreamPark provides one-stop service, allowing job development, compilation, and release to be completed on StreamPark, simplifying the entire development process. +The greatest benefit that Apache StreamPark has brought to Shunwang Technology is the lowered threshold for using Flink and improved development efficiency. Previously, our business development colleagues had to use multiple tools such as vscode, GitLab, and a scheduling platform in the original Streaming-Launcher to complete a FlinkSQL job development. The process was tedious, going through multiple tools from development to compilation to release. Apache StreamPark provides one-stop service, allowing job development, compilation, and release to be completed on Apache StreamPark, simplifying the entire development process. 
-**At present, StreamPark has been massively deployed in the production environment at Shunwang Technology, with the number of FlinkSQL jobs managed by StreamPark increasing from the initial 500+ to nearly 700, while also managing more than 10 Yarn Session Clusters.** +**At present, Apache StreamPark has been massively deployed in the production environment at Shunwang Technology, with the number of FlinkSQL jobs managed by Apache StreamPark increasing from the initial 500+ to nearly 700, while also managing more than 10 Yarn Session Clusters.** ![Image](/blog/shunwang/achievements1.png) @@ -203,11 +203,11 @@ ## Future Plans -As one of the early users of StreamPark, Shunwang Technology has been communicating with the community for a year and participating in the polishing of StreamPark's stability. We have submitted the Bugs encountered in production operations and new Features to the community. In the future, we hope to manage the metadata information of Flink tables on StreamPark, and implement cross-data-source query analysis functions based on the Flink engine through multiple Catalogs. Currently, StreamPark is integrating with Flink-SQL-Gateway capabilities, which will greatly help in the management of table metadata and cross-data-source query functions in the future. +As one of the early users of Apache StreamPark, Shunwang Technology has been communicating with the community for a year and participating in the polishing of Apache StreamPark's stability. We have submitted the Bugs encountered in production operations and new Features to the community. In the future, we hope to manage the metadata information of Flink tables on Apache StreamPark, and implement cross-data-source query analysis functions based on the Flink engine through multiple Catalogs. Currently, Apache StreamPark is integrating with Flink-SQL-Gateway capabilities, which will greatly help in the management of table metadata and cross-data-source query functions in the future. -Since Shunwang Technology primarily runs jobs in Yarn Session mode, we hope that StreamPark can provide more support for Remote clusters, Yarn Session clusters, and K8s Session clusters, such as monitoring and alerts, and optimizing operational processes. +Since Shunwang Technology primarily runs jobs in Yarn Session mode, we hope that Apache StreamPark can provide more support for Remote clusters, Yarn Session clusters, and K8s Session clusters, such as monitoring and alerts, and optimizing operational processes. -Considering the future, as the business develops, we may use StreamPark to manage more Flink real-time jobs, and StreamPark in single-node mode may not be safe. Therefore, we are also looking forward to the High Availability (HA) of StreamPark. +Considering the future, as the business develops, we may use Apache StreamPark to manage more Flink real-time jobs, and Apache StreamPark in single-node mode may not be safe. Therefore, we are also looking forward to the High Availability (HA) of Apache StreamPark. -We will also participate in the construction of the capabilities of StreamPark is integrating with Flink-SQL-Gateway, enriching Flink Cluster functionality, and StreamPark HA. +We will also participate in building capabilities such as Apache StreamPark's integration with Flink-SQL-Gateway, enriching Flink Cluster functionality, and Apache StreamPark HA.
diff --git a/blog/5-streampark-usercase-dustess.md b/blog/5-streampark-usercase-dustess.md index 3a1910e6e..92a63a929 100644 --- a/blog/5-streampark-usercase-dustess.md +++ b/blog/5-streampark-usercase-dustess.md @@ -1,10 +1,10 @@ --- slug: streampark-usercase-dustess -title: StreamPark's Best Practices at Dustess, Simplifying Complexity for the Ultimate Experience -tags: [StreamPark, Production Practice, FlinkSQL] +title: Apache StreamPark's Best Practices at Dustess, Simplifying Complexity for the Ultimate Experience +tags: [Apache StreamPark, Production Practice, FlinkSQL] --- -**Abstract:** This article originates from the production practices of StreamPark at Dustess Information, written by the senior data development engineer, Gump. The main content includes: +**Abstract:** This article originates from the production practices of Apache StreamPark at Dustess Information, written by the senior data development engineer, Gump. The main content includes: 1. Technology selection 2. Practical implementation @@ -35,7 +35,7 @@ Firstly, in terms of the computing engine: We chose Flink for the following reas - Flink supports both batch and stream processing. Although the company's current batch processing architecture is based on Hive, Spark, etc., using Flink for stream computing facilitates the subsequent construction of unified batch and stream processing and lake-house architecture. - The domestic ecosystem of Flink has become increasingly mature, and Flink is starting to break boundaries towards the development of stream-based data warehousing. -At the platform level, we comprehensively compared StreamPark, Apache Zeppelin, and flink-streaming-platform-web, also thoroughly read their source code and conducted an analysis of their advantages and disadvantages. We won’t elaborate on the latter two projects in this article, but those interested can search for them on GitHub. We ultimately chose StreamPark for the following reasons: +At the platform level, we comprehensively compared Apache StreamPark, Apache Zeppelin, and flink-streaming-platform-web, also thoroughly read their source code and conducted an analysis of their advantages and disadvantages. We won’t elaborate on the latter two projects in this article, but those interested can search for them on GitHub. We ultimately chose Apache StreamPark for the following reasons: ### **High Completion** @@ -51,11 +51,11 @@ When creating a task, you can **freely choose the Flink version**. The Flink bin #### **2. Supports Multiple Deployment Modes** -StreamPark supports **all the mainstream submission modes** for Flink, such as standalone, yarn-session, yarn application, yarn-perjob, kubernetes-session, kubernetes-application. Moreover, StreamPark does not simply piece together Flink run commands to submit tasks. Instead, it introduces the Flink Client source package and directly calls the Flink Client API for task submission. The advantages of this approach include modular code, readability, ease of extension, stability, and the ability to quickly adapt to upgrades of the Flink version. +Apache StreamPark supports **all the mainstream submission modes** for Flink, such as standalone, yarn-session, yarn application, yarn-perjob, kubernetes-session, kubernetes-application. Moreover, Apache StreamPark does not simply piece together Flink run commands to submit tasks. Instead, it introduces the Flink Client source package and directly calls the Flink Client API for task submission. 
The advantages of this approach include modular code, readability, ease of extension, stability, and the ability to quickly adapt to upgrades of the Flink version. ![](/blog/dustess/execution_mode.png) -Flink SQL can greatly improve development efficiency and the popularity of Flink. StreamPark's support for **Flink SQL is very comprehensive**, with an excellent SQL editor, dependency management, multi-version task management, etc. The StreamPark official website states that it will introduce metadata management integration for Flink SQL in the future. Stay tuned. +Flink SQL can greatly improve development efficiency and the popularity of Flink. Apache StreamPark's support for **Flink SQL is very comprehensive**, with an excellent SQL editor, dependency management, multi-version task management, etc. The Apache StreamPark official website states that it will introduce metadata management integration for Flink SQL in the future. Stay tuned. ![](/blog/dustess/flink_sql.png) @@ -69,33 +69,33 @@ Flink SQL can greatly improve development efficiency and the popularity of Flink Although Flink SQL is now powerful enough, using JVM languages like Java and Scala to develop Flink tasks can be more flexible, more customizable, and better for tuning and improving resource utilization. The biggest problem with submitting tasks via Jar packages, compared to SQL, is the management of the Jar uploads. Without excellent tooling products, this can significantly reduce development efficiency and increase maintenance costs. -Besides supporting Jar uploads, StreamPark also provides an **online update build** feature, which elegantly solves the above problems: +Besides supporting Jar uploads, Apache StreamPark also provides an **online update build** feature, which elegantly solves the above problems: -1. Create Project: Fill in the GitHub/Gitlab (supports enterprise private servers) address and username/password, and StreamPark can Pull and Build the project. +1. Create Project: Fill in the GitHub/Gitlab (supports enterprise private servers) address and username/password, and Apache StreamPark can Pull and Build the project. -2. When creating a StreamPark Custom-Code task, refer to the Project, specify the main class, and optionally automate Pull, Build, and bind the generated Jar when starting the task, which is very elegant! +2. When creating an Apache StreamPark Custom-Code task, refer to the Project, specify the main class, and optionally automate Pull, Build, and bind the generated Jar when starting the task, which is very elegant! -At the same time, the StreamPark community is also perfecting the entire task compilation and launch process. The future StreamPark will be even more refined and professional on this foundation. +At the same time, the Apache StreamPark community is also perfecting the entire task compilation and launch process. The future Apache StreamPark will be even more refined and professional on this foundation. ![](/blog/dustess/system_list.png) #### **5. Comprehensive Task Parameter Configuration** -For data development using Flink, the parameters submitted with Flink run are almost impossible to maintain. StreamPark has also **elegantly solved** this kind of problem, mainly because, as mentioned above, StreamPark directly calls the Flink Client API and has connected the entire process through the StreamPark product frontend. +For data development using Flink, the parameters submitted with Flink run are almost impossible to maintain.
Apache StreamPark has also **elegantly solved** this kind of problem, mainly because, as mentioned above, Apache StreamPark directly calls the Flink Client API and has connected the entire process through the Apache StreamPark product frontend. ![](/blog/dustess/parameter_configuration.png) -As you can see, StreamPark's task parameter settings cover all the mainstream parameters, and every parameter has been thoughtfully provided with an introduction and an optimal recommendation based on best practices. This is also very beneficial for newcomers to Flink, helping them to avoid common pitfalls! +As you can see, Apache StreamPark's task parameter settings cover all the mainstream parameters, and every parameter has been thoughtfully provided with an introduction and an optimal recommendation based on best practices. This is also very beneficial for newcomers to Flink, helping them to avoid common pitfalls! #### **6. Excellent Configuration File Design** -In addition to the native parameters for Flink tasks, which are covered by the task parameters above, StreamPark also provides a powerful **Yaml configuration file** mode and **programming model**. +In addition to the native parameters for Flink tasks, which are covered by the task parameters above, Apache StreamPark also provides a powerful **Yaml configuration file** mode and **programming model**. ![](/blog/dustess/extended_parameters.jpg) -1. For Flink SQL tasks, you can configure the parameters that StreamPark has already built-in, such as **CheckPoint, retry mechanism, State Backend, table planner, mode**, etc., directly using the task's Yaml configuration file. +1. For Flink SQL tasks, you can configure the parameters that Apache StreamPark has already built-in, such as **CheckPoint, retry mechanism, State Backend, table planner, mode**, etc., directly using the task's Yaml configuration file. -2. For Jar tasks, StreamPark offers a generic programming model that encapsulates the native Flink API. Combined with the wrapper package provided by StreamPark, it can very elegantly retrieve custom parameters from the configuration file. For more details, see the documentation: +2. For Jar tasks, Apache StreamPark offers a generic programming model that encapsulates the native Flink API. Combined with the wrapper package provided by Apache StreamPark, it can very elegantly retrieve custom parameters from the configuration file. For more details, see the documentation: Programming model: @@ -111,13 +111,13 @@ http://www.streamxhub.com/docs/development/config In addition: -StreamPark also **supports Apache Flink native tasks**. The parameter configuration can be statically maintained within the Java task internal code, covering a wide range of scenarios, such as seamless migration of existing Flink tasks, etc. +Apache StreamPark also **supports Apache Flink native tasks**. The parameter configuration can be statically maintained within the Java task internal code, covering a wide range of scenarios, such as seamless migration of existing Flink tasks, etc. #### **7. Checkpoint Management** -Regarding Flink's Checkpoint (Savepoint) mechanism, the greatest difficulty is maintenance. StreamPark has also elegantly solved this problem: +Regarding Flink's Checkpoint (Savepoint) mechanism, the greatest difficulty is maintenance. Apache StreamPark has also elegantly solved this problem: -- StreamPark will **automatically maintain** the task Checkpoint directory and versions in the system for easy retrieval. 
+- Apache StreamPark will **automatically maintain** the task Checkpoint directory and versions in the system for easy retrieval. - When users need to update and restart an application, they can choose whether to save a Savepoint. - When restarting a task, it is possible to choose to recover from a specified version of Checkpoint/Savepoint. @@ -129,42 +129,42 @@ As shown below, developers can very intuitively and conveniently upgrade or deal #### **8. Comprehensive Alerting Features** -For streaming computations, which are 7*24H resident tasks, monitoring and alerting are very important. StreamPark also has a **comprehensive solution** for these issues: +For streaming computations, which are 7*24H resident tasks, monitoring and alerting are very important. Apache StreamPark also has a **comprehensive solution** for these issues: - It comes with an email-based alerting method, which has zero development cost and can be used once configured. -- Thanks to the excellent modularity of the StreamPark source code, it's possible to enhance the code at the Task Track point and introduce the company's internal SDK for telephone, group, and other alerting methods, all with a very low development cost. +- Thanks to the excellent modularity of the Apache StreamPark source code, it's possible to enhance the code at the Task Track point and introduce the company's internal SDK for telephone, group, and other alerting methods, all with a very low development cost. ![](/blog/dustess/alarm_email.png) ### **Excellent Source Code** -Following the principle of technology selection, a new technology must be sufficiently understood in terms of underlying principles and architectural ideas before it is considered for production use. Before choosing StreamPark, its architecture and source code were subjected to in-depth study and reading. It was found that the underlying technologies used by StreamPark are very familiar to Chinese developers: MySQL, Spring Boot, Mybatis Plus, Vue, etc. The code style is unified and elegantly implemented with complete annotations. The modules are independently abstracted and reasonable, employing numerous design patterns, and the code quality is very high, making it highly suitable for troubleshooting and further development in the later stages. +Following the principle of technology selection, a new technology must be sufficiently understood in terms of underlying principles and architectural ideas before it is considered for production use. Before choosing Apache StreamPark, its architecture and source code were subjected to in-depth study and reading. It was found that the underlying technologies used by Apache StreamPark are very familiar to Chinese developers: MySQL, Spring Boot, Mybatis Plus, Vue, etc. The code style is unified and elegantly implemented with complete annotations. The modules are independently abstracted and reasonable, employing numerous design patterns, and the code quality is very high, making it highly suitable for troubleshooting and further development in the later stages. ![](/blog/dustess/code_notebook.png) -In November 2021, StreamPark was successfully selected by Open Source China as a GVP - Gitee "Most Valuable Open Source Project," which speaks volumes about its quality and potential. +In November 2021, Apache StreamPark was successfully selected by Open Source China as a GVP - Gitee "Most Valuable Open Source Project," which speaks volumes about its quality and potential. 
![](/blog/dustess/certificate.png) ### **03 Active Community** -The community is currently very active. Since the end of November 2021, when StreamPark (based on 1.2.0-release) was implemented, StreamPark had just started to be recognized by everyone, and there were some minor bugs in the user experience (not affecting core functionality). At that time, in order to go live quickly, some features were disabled and some minor bugs were fixed. Just as we were preparing to contribute back to the community, we found that these had already been fixed, indicating that the community's iteration cycle is very fast. In the future, our company's team will also strive to stay in sync with the community, quickly implement new features, and improve data development efficiency while reducing maintenance costs. +The community is currently very active. Since the end of November 2021, when Apache StreamPark (based on 1.2.0-release) was implemented, Apache StreamPark had just started to be recognized by everyone, and there were some minor bugs in the user experience (not affecting core functionality). At that time, in order to go live quickly, some features were disabled and some minor bugs were fixed. Just as we were preparing to contribute back to the community, we found that these had already been fixed, indicating that the community's iteration cycle is very fast. In the future, our company's team will also strive to stay in sync with the community, quickly implement new features, and improve data development efficiency while reducing maintenance costs. ## **02 Implementation Practice** -StreamPark's environment setup is very straightforward, following the official website's building tutorial you can complete the setup within a few hours. It now supports a front-end and back-end separation packaging deployment model, which can meet the needs of more companies, and there has already been a Docker Build related PR, suggesting that StreamPark's compilation and deployment will become even more convenient and quick in the future. Related documentation is as follows: +Apache StreamPark's environment setup is very straightforward, following the official website's building tutorial you can complete the setup within a few hours. It now supports a front-end and back-end separation packaging deployment model, which can meet the needs of more companies, and there has already been a Docker Build related PR, suggesting that Apache StreamPark's compilation and deployment will become even more convenient and quick in the future. Related documentation is as follows: ``` http://www.streamxhub.com/docs/user-guide/deployment ``` -For rapid implementation and production use, we chose the reliable On Yarn resource management mode (even though StreamPark already supports K8S quite well), and there are already many companies that have deployed using StreamPark on K8S, which you can refer to: +For rapid implementation and production use, we chose the reliable On Yarn resource management mode (even though Apache StreamPark already supports K8S quite well), and there are already many companies that have deployed using Apache StreamPark on K8S, which you can refer to: ``` http://www.streamxhub.com/blog/flink-development-framework-streamx ``` -Integrating StreamPark with the Hadoop ecosystem can be said to be zero-cost (provided that Flink is integrated with the Hadoop ecosystem according to the Flink official website, and tasks can be launched via Flink scripts). 
+Integrating Apache StreamPark with the Hadoop ecosystem can be said to be zero-cost (provided that Flink is integrated with the Hadoop ecosystem according to the Flink official website, and tasks can be launched via Flink scripts). Currently, we are also conducting K8S testing and solution design, and will be migrating to K8S in the future. @@ -174,7 +174,7 @@ At present, our company's tasks based on Flink SQL are mainly for simple real-ti ![](/blog/dustess/online_flinksql.png) -StreamPark has thoughtfully prepared a demo SQL task that can be run directly on a newly set up platform. This attention to detail demonstrates the community's commitment to user experience. Initially, our simple tasks were written and executed using Flink SQL, and StreamPark's support for Flink SQL is excellent, with a superior SQL editor and innovative POM and Jar package dependency management that can meet many SQL scenario needs. +Apache StreamPark has thoughtfully prepared a demo SQL task that can be run directly on a newly set up platform. This attention to detail demonstrates the community's commitment to user experience. Initially, our simple tasks were written and executed using Flink SQL, and Apache StreamPark's support for Flink SQL is excellent, with a superior SQL editor and innovative POM and Jar package dependency management that can meet many SQL scenario needs. Currently, we are researching and designing solutions related to metadata, permissions, UDFs, etc. @@ -182,11 +182,11 @@ Currently, we are researching and designing solutions related to metadata, permi Since most of the data development team members have a background in Java and Scala, we've implemented Jar-based builds for more flexible development, transparent tuning of Flink tasks, and to cover more scenarios. Our implementation was in two phases: -**First Phase:** StreamPark provides support for native Apache Flink projects. We configured our existing tasks' Git addresses in StreamPark, used Maven to package them as Jar files, and created StreamPark Apache Flink tasks for seamless migration. In this process, StreamPark was merely used as a platform tool for task submission and state maintenance, without leveraging the other features mentioned above. +**First Phase:** Apache StreamPark provides support for native Apache Flink projects. We configured our existing tasks' Git addresses in Apache StreamPark, used Maven to package them as Jar files, and created Apache StreamPark Apache Flink tasks for seamless migration. In this process, Apache StreamPark was merely used as a platform tool for task submission and state maintenance, without leveraging the other features mentioned above. -**Second Phase:** After migrating tasks to StreamPark in the first phase and having them run on the platform, the tasks' configurations, such as checkpoint, fault tolerance, and adjustments to business parameters within Flink tasks, required source code modifications, pushes, and builds. This was very inefficient and opaque. +**Second Phase:** After migrating tasks to Apache StreamPark in the first phase and having them run on the platform, the tasks' configurations, such as checkpoint, fault tolerance, and adjustments to business parameters within Flink tasks, required source code modifications, pushes, and builds. This was very inefficient and opaque. -Therefore, following StreamPark's QuickStart, we quickly integrated StreamPark's programming model, which is an encapsulation for StreamPark Flink tasks (for Apache Flink). 
+Therefore, following Apache StreamPark's QuickStart, we quickly integrated Apache StreamPark's programming model, which is an encapsulation for Apache StreamPark Flink tasks (for Apache Flink). Example: @@ -194,7 +194,7 @@ Example: StreamingContext = ParameterTool + StreamExecutionEnvironment ``` -- StreamingContext is the encapsulation object for StreamPark +- StreamingContext is the encapsulation object for Apache StreamPark - ParameterTool is the parameter object after parsing the configuration file ``` @@ -205,7 +205,7 @@ StreamingContext = ParameterTool + StreamExecutionEnvironment ## **03 Business Support & Capability Opening** -Currently, Dustess Info's real-time computing platform based on StreamPark has been online since the end of November last year and has launched 50+ Flink tasks, including 10+ Flink SQL tasks and 40+ Jar tasks. At present, it is mainly used internally by the data team, and the real-time computing platform will be opened up for use by business teams across the company shortly, which will significantly increase the number of tasks. +Currently, Dustess Info's real-time computing platform based on Apache StreamPark has been online since the end of November last year and has launched 50+ Flink tasks, including 10+ Flink SQL tasks and 40+ Jar tasks. At present, it is mainly used internally by the data team, and the real-time computing platform will be opened up for use by business teams across the company shortly, which will significantly increase the number of tasks. ![](/blog/dustess/online_jar.png) @@ -213,17 +213,17 @@ Currently, Dustess Info's real-time computing platform based on StreamPark has b The real-time data warehouse mainly uses Jar tasks because the model is more generic. Using Jar tasks can generically handle a large number of data table synchronization and calculations, and even achieve configuration-based synchronization. Our real-time data warehouse mainly uses Apache Doris for storage, with Flink handling the cleaning and calculations (the goal being storage-computation separation). -Using StreamPark to integrate other components is also very straightforward, and we have also abstracted the configuration related to Apache Doris and Kafka into the configuration file, which greatly enhances our development efficiency and flexibility. +Using Apache StreamPark to integrate other components is also very straightforward, and we have also abstracted the configuration related to Apache Doris and Kafka into the configuration file, which greatly enhances our development efficiency and flexibility. ### **02 Capability Opening** -Other business teams outside the data team also have many stream processing scenarios. Hence, after secondary development of the real-time computing platform based on StreamPark, we opened up the following capabilities to all business teams in the company: +Other business teams outside the data team also have many stream processing scenarios. Hence, after secondary development of the real-time computing platform based on Apache StreamPark, we opened up the following capabilities to all business teams in the company: - Business capability opening: The upstream real-time data warehouse collects all business tables through log collection and writes them into Kafka. Business teams can base their business-related development on Kafka, or they can perform OLAP analysis through the real-time data warehouse (Apache Doris). 
- Computing capability opening: The server resources of the big data platform are made available for use by business teams. - Solution opening: The mature Connectors of the Flink ecosystem and support for Exactly Once semantics can reduce the development and maintenance costs related to stream processing for business teams. -Currently, StreamPark does not support multi-business group functions. The multi-business group function will be abstracted and contributed to the community. +Currently, Apache StreamPark does not support multi-business group functions. The multi-business group function will be abstracted and contributed to the community. ![](/blog/dustess/manager.png) @@ -240,7 +240,7 @@ Currently, all our company's Flink tasks run on Yarn, which meets current needs, - **Separation of Storage and Computation**. Flink's computational resources and state storage are separated; computational resources can be mixed with other component resources, improving machine utilization. - **Elastic Scaling**. It is capable of elastic scaling, better saving manpower and material costs. -I am also currently organizing and implementing related technical architectures and solutions and have completed the technical verification of Flink on Kubernetes using StreamPark in an experimental environment. With the support of the StreamPark platform and the enthusiastic help of the community, I believe that production implementation is not far off. +I am also currently organizing and implementing related technical architectures and solutions and have completed the technical verification of Flink on Kubernetes using Apache StreamPark in an experimental environment. With the support of the Apache StreamPark platform and the enthusiastic help of the community, I believe that production implementation is not far off. ### **02 Stream-Batch Unification Construction** @@ -259,7 +259,7 @@ Regarding the unification of stream and batch, I am also currently researching a ## **05 Closing Words** -That's all for the sharing of StreamPark in the production practice at Dustess Info. Thank you all for reading this far. The original intention of writing this article was to bring a bit of StreamPark's production practice experience and reference to everyone, and together with the buddies in the StreamPark community, to jointly build StreamPark. In the future, I plan to participate and contribute more. A big thank you to the developers of StreamPark for providing such an excellent product; in many details, we can feel everyone's dedication. Although the current production version used by the company (1.2.0-release) still has some room for improvement in task group search, edit return jump page, and other interactive experiences, the merits outweigh the minor issues. I believe that StreamPark will get better and better, **and I also believe that StreamPark will promote the popularity of Apache Flink**. Finally, let's end with a phrase from the Apache Flink community: The future is real-time! +That's all for the sharing of Apache StreamPark in the production practice at Dustess Info. Thank you all for reading this far. The original intention of writing this article was to bring a bit of Apache StreamPark's production practice experience and reference to everyone, and together with the buddies in the Apache StreamPark community, to jointly build Apache StreamPark. In the future, I plan to participate and contribute more. 
A big thank you to the developers of Apache StreamPark for providing such an excellent product; in many details, we can feel everyone's dedication. Although the current production version used by the company (1.2.0-release) still has some room for improvement in task group search, edit return jump page, and other interactive experiences, the merits outweigh the minor issues. I believe that Apache StreamPark will get better and better, **and I also believe that Apache StreamPark will promote the popularity of Apache Flink**. Finally, let's end with a phrase from the Apache Flink community: The future is real-time! diff --git a/blog/6-streampark-usercase-joyme.md b/blog/6-streampark-usercase-joyme.md index 511527813..522886f9f 100644 --- a/blog/6-streampark-usercase-joyme.md +++ b/blog/6-streampark-usercase-joyme.md @@ -1,12 +1,12 @@ --- slug: streampark-usercase-joyme -title: StreamPark's Production Practice in Joyme -tags: [StreamPark, Production Practice, FlinkSQL] +title: Apache StreamPark's Production Practice in Joyme +tags: [Apache StreamPark, Production Practice, FlinkSQL] --- -**Abstract:** This article presents the production practices of StreamPark at Joyme, written by Qin Jiyong, a big data engineer at Joyme. The main contents include: +**Abstract:** This article presents the production practices of Apache StreamPark at Joyme, written by Qin Jiyong, a big data engineer at Joyme. The main contents include: -- Encountering StreamPark +- Encountering Apache StreamPark - Flink SQL job development - Custom code job development - Monitoring and alerting @@ -16,9 +16,9 @@ tags: [StreamPark, Production Practice, FlinkSQL] -## 1 Encountering StreamPark +## 1 Encountering Apache StreamPark -Encountering StreamPark was inevitable. Based on our existing real-time job development mode, we had to find an open-source platform to support our company's real-time business. Our current situation was as follows: +Encountering Apache StreamPark was inevitable. Based on our existing real-time job development mode, we had to find an open-source platform to support our company's real-time business. Our current situation was as follows: - We wrote jobs and packaged them to servers, then executed the Flink run command to submit them, which was a cumbersome and inefficient process. - Flink SQL was submitted through a self-developed old platform. The developers of the old platform had left, and no one maintained the subsequent code, even if someone did, they would have to face the problem of high maintenance costs. @@ -27,13 +27,13 @@ Encountering StreamPark was inevitable. Based on our existing real-time job deve For all these reasons, we needed an open-source platform to manage our real-time jobs, and we also needed to refactor to unify the development mode and language and centralize project management. -The first encounter with StreamPark basically confirmed our choice. We quickly deployed and installed according to the official documentation, performed some operations after setting up, and were greeted with a user-friendly interface. StreamPark's support for multiple versions of Flink, permission management, job monitoring, and other series of functions already met our needs well. Further understanding revealed that its community is also very active. We have witnessed the process of StreamPark's feature completion since version 1.1.0. The development team is very ambitious, and we believe they will continue to improve. +The first encounter with Apache StreamPark basically confirmed our choice. 
We quickly deployed and installed according to the official documentation, performed some operations after setting up, and were greeted with a user-friendly interface. Apache StreamPark's support for multiple versions of Flink, permission management, job monitoring, and a series of other functions already met our needs well. Further understanding revealed that its community is also very active. We have witnessed the process of Apache StreamPark's feature completion since version 1.1.0. The development team is very ambitious, and we believe they will continue to improve. ## 2 Development of Flink SQL Jobs The Flink SQL development mode has brought great convenience. For some simple metric developments, it is possible to complete them with just a few SQL statements, without the need to write a single line of code. Flink SQL has facilitated the development work for many colleagues, especially since writing code can be somewhat difficult for those who work on warehouses. -To add a new task, you open the task addition interface of StreamPark, where the default Development Mode is Flink SQL mode. You can write the SQL logic directly in the Flink SQL section. +To add a new task, you open the task addition interface of Apache StreamPark, where the default Development Mode is Flink SQL mode. You can write the SQL logic directly in the Flink SQL section. For the Flink SQL part, you can progressively write the logic SQL following the Flink official website's documentation. Generally, for our company, it consists of three parts: the Source connection, intermediate logic processing, and finally the Sink. Essentially, the Source is consuming data from Kafka, the logic layer will involve MySQL for dimension table queries, and the Sink part is mostly Elasticsearch, Redis, or MySQL. @@ -76,7 +76,7 @@ SELECT Data.uid FROM source_table; ### **2. Add Dependency** -In terms of dependencies, it's an unique feature to Streampark. A complete Flink SQL job is innovatively split into two components within StreamPark: the SQL and the dependencies. The SQL part is easy to understand and requires no further explanation, but the dependencies are the various Connector JARs needed within the SQL, such as Kafka and MySQL Connectors. If these are used within the SQL, then these Connector dependencies must be introduced. In StreamPark, there are two ways to add dependencies: one is based on the standard Maven pom coordinates, and the other is by uploading the required Jars from a local source. These two methods can also be mixed and used as needed; simply apply, and these dependencies will be automatically loaded when the job is submitted. +In terms of dependencies, this is a unique feature of Apache StreamPark. A complete Flink SQL job is innovatively split into two components within Apache StreamPark: the SQL and the dependencies. The SQL part is easy to understand and requires no further explanation, but the dependencies are the various Connector JARs needed within the SQL, such as Kafka and MySQL Connectors. If these are used within the SQL, then these Connector dependencies must be introduced. In Apache StreamPark, there are two ways to add dependencies: one is based on the standard Maven pom coordinates, and the other is by uploading the required Jars from a local source. These two methods can also be mixed and used as needed; simply apply, and these dependencies will be automatically loaded when the job is submitted.
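For illustration, a Kafka connector declared through standard Maven pom coordinates might look like the snippet below; the artifact name and version are placeholders and should match the Flink version bound to the job, and a locally uploaded JAR can be mixed in alongside it as needed.

```xml
<!-- Illustrative coordinates only: choose the connector artifact and version that match your Flink release -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-sql-connector-kafka_2.12</artifactId>
    <version>1.14.5</version>
</dependency>
```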
![](/blog/joyme/add_dependency.png) @@ -106,7 +106,7 @@ For streaming jobs, we use Flink Java for development, having refactored previou ![](/blog/joyme/project_configuration.png) -Once the configuration is completed, compile the corresponding project to finish the packaging phase. Thus, the Custom code jobs can also reference it. Compilation is required every time the code needs to go live; otherwise, only the last compiled code is available. Here's an issue: for security reasons, our company’s GitLab account passwords are regularly updated. This leads to a situation where the StreamPark projects have the old passwords configured, resulting in a failure when pulling projects from Git during compilation. To address this problem, we contacted the community and learned that the capability to modify projects has been added in the subsequent version 1.2.1. +Once the configuration is completed, compile the corresponding project to finish the packaging phase. Thus, the Custom code jobs can also reference it. Compilation is required every time the code needs to go live; otherwise, only the last compiled code is available. Here's an issue: for security reasons, our company’s GitLab account passwords are regularly updated. This leads to a situation where the Apache StreamPark projects have the old passwords configured, resulting in a failure when pulling projects from Git during compilation. To address this problem, we contacted the community and learned that the capability to modify projects has been added in the subsequent version 1.2.1. ![](/blog/joyme/flink_system.png) @@ -120,7 +120,7 @@ As well as the task’s parallelism, monitoring method, etc., memory size should ## 4 Monitoring and Alerts -The monitoring in StreamPark requires configuration in the setting module to set up the basic information for sending emails. +The monitoring in Apache StreamPark requires configuration in the setting module to set up the basic information for sending emails. ![](/blog/joyme/system_setting.png) @@ -132,7 +132,7 @@ When our jobs fail, we can receive alerts through email. These alerts are quite ![](/blog/joyme/alarm_eamil.png) -Regarding alerts, we have developed a scheduled task based on StreamPark's t_flink_app table. Why do this? Because most people might not check their emails promptly when it comes to email notifications. Therefore, we opted to monitor the status of each task and send corresponding monitoring information to our Lark (Feishu) alert group, enabling us to promptly identify and address issues with the tasks. It's a simple Python script, then configured with crontab to execute at scheduled times. +Regarding alerts, we have developed a scheduled task based on Apache StreamPark's t_flink_app table. Why do this? Because most people might not check their emails promptly when it comes to email notifications. Therefore, we opted to monitor the status of each task and send corresponding monitoring information to our Lark (Feishu) alert group, enabling us to promptly identify and address issues with the tasks. It's a simple Python script, then configured with crontab to execute at scheduled times. ## 5 Common Issues @@ -152,10 +152,10 @@ If the task has started but fails during the running phase, this situation might ## 6 Community Impression -Often when we discuss issues in the StreamPark user group, we get immediate responses from community members. Issues that cannot be resolved at the moment are generally fixed in the next version or the latest code branch. 
In the group, we also see many non-community members actively helping each other out. There are many big names from other communities as well, and many members actively join the community development work. The whole community feels very active to me! +Often when we discuss issues in the Apache StreamPark user group, we get immediate responses from community members. Issues that cannot be resolved at the moment are generally fixed in the next version or the latest code branch. In the group, we also see many non-community members actively helping each other out. There are many big names from other communities as well, and many members actively join the community development work. The whole community feels very active to me! ## 7 Conclusion -Currently, our company runs 60 real-time jobs online, with Flink SQL and custom code each making up about half. More real-time tasks will be put online subsequently. Many colleagues worry about the stability of StreamPark, but based on several months of production practice in our company, StreamPark is just a platform to help you develop, deploy, monitor, and manage jobs. Whether it is stable or not depends on the stability of our own Hadoop Yarn cluster (we use the onyan mode) and has little to do with StreamPark itself. It also depends on the robustness of the Flink SQL or code you write. These two aspects should be the primary concerns. Only when these two aspects are problem-free can the flexibility of StreamPark be fully utilized to improve job performance. To discuss the stability of StreamPark in isolation is somewhat extreme. +Currently, our company runs 60 real-time jobs online, with Flink SQL and custom code each making up about half. More real-time tasks will be put online subsequently. Many colleagues worry about the stability of Apache StreamPark, but based on several months of production practice in our company, Apache StreamPark is just a platform to help you develop, deploy, monitor, and manage jobs. Whether it is stable or not depends on the stability of our own Hadoop Yarn cluster (we use the on yarn mode) and has little to do with Apache StreamPark itself. It also depends on the robustness of the Flink SQL or code you write. These two aspects should be the primary concerns. Only when these two aspects are problem-free can the flexibility of Apache StreamPark be fully utilized to improve job performance. To discuss the stability of Apache StreamPark in isolation is somewhat extreme. -That is all the content shared by StreamPark at Joyme. Thank you for reading this article. We are very grateful for such an excellent product provided by StreamPark, which is a true act of benefiting others. From version 1.0 to 1.2.1, the bugs encountered are promptly fixed, and every issue is taken seriously. We are still using the on yarn deployment mode.
Restarting yarn will still cause jobs to be lost, but restarting yarn is not something we do every day. The community will also look to fix this problem as soon as possible. I believe that Apache StreamPark will get better and better, with a promising future ahead. \ No newline at end of file diff --git a/blog/7-streampark-usercase-haibo.md b/blog/7-streampark-usercase-haibo.md index 5badbe01a..cde7deca5 100644 --- a/blog/7-streampark-usercase-haibo.md +++ b/blog/7-streampark-usercase-haibo.md @@ -1,13 +1,13 @@ --- slug: streampark-usercase-haibo title: An All-in-One Computation Tool in Haibo Tech's Production Practice and facilitation in Smart City Construction -tags: [StreamPark, Production Practice, FlinkSQL] +tags: [Apache StreamPark, Production Practice, FlinkSQL] --- -**Summary:** The author of this article, "StreamPark: An All-in-One Computation Tool in Haibo Tech's Production Practice and facilitation in Smart City Construction," is the Big Data Architect at Haibo Tech. The main topics covered include: +**Summary:** The author of this article, "Apache StreamPark: An All-in-One Computation Tool in Haibo Tech's Production Practice and facilitation in Smart City Construction," is the Big Data Architect at Haibo Tech. The main topics covered include: -1. Choosing StreamPark +1. Choosing Apache StreamPark 2. Getting Started Quickly 3. Application Scenarios 4. Feature Extensions @@ -18,11 +18,11 @@ Haibo Tech is an industry-leading company offering AI IoT products and solutions -## **01. Choosing StreamPark** +## **01. Choosing Apache StreamPark** Haibo Tech started using Flink SQL to aggregate and process various real-time IoT data since 2020. With the accelerated pace of smart city construction in various cities, the types and volume of IoT data to be aggregated are also increasing. This has resulted in an increasing number of Flink SQL tasks being maintained online, making a dedicated platform for managing numerous Flink SQL tasks an urgent need. -After evaluating Apache Zeppelin and StreamPark, we chose StreamPark as our real-time computing platform. Compared to Apache Zeppelin, StreamPark may not be as well-known. However, after experiencing the initial release of StreamPark and reading its design documentation, we recognized that its all-in-one design philosophy covers the entire lifecycle of Flink task development. This means that configuration, development, deployment, and operations can all be accomplished on a single platform. Our developers, operators, and testers can collaboratively work on StreamPark. The **low-code** + **all-in-one** design principles solidified our confidence in using StreamPark. +After evaluating Apache Zeppelin and Apache StreamPark, we chose Apache StreamPark as our real-time computing platform. Compared to Apache Zeppelin, Apache StreamPark may not be as well-known. However, after experiencing the initial release of Apache StreamPark and reading its design documentation, we recognized that its all-in-one design philosophy covers the entire lifecycle of Flink task development. This means that configuration, development, deployment, and operations can all be accomplished on a single platform. Our developers, operators, and testers can collaboratively work on Apache StreamPark. The **low-code** + **all-in-one** design principles solidified our confidence in using Apache StreamPark. // Video link (streampark official video) @@ -32,7 +32,7 @@ After evaluating Apache Zeppelin and StreamPark, we chose StreamPark as our real ### **1. 
Quick Start** -Using StreamPark to accomplish a real-time aggregation task is as simple as putting an elephant into a fridge, and it can be done in just three steps: +Using Apache StreamPark to accomplish a real-time aggregation task is as simple as putting an elephant into a fridge, and it can be done in just three steps: - Edit SQL @@ -50,11 +50,11 @@ With just the above three steps, you can complete the aggregation task from Mysq ### **2. Production Practice** -StreamPark is primarily used at Haibo for running real-time Flink SQL tasks: reading data from Kafka, processing it, and outputting to Clickhouse or Elasticsearch. +Apache StreamPark is primarily used at Haibo for running real-time Flink SQL tasks: reading data from Kafka, processing it, and outputting to Clickhouse or Elasticsearch. -Starting from October 2021, the company gradually migrated Flink SQL tasks to the StreamPark platform for centralized management. It supports the aggregation, computation, and alerting of our real-time IoT data. +Starting from October 2021, the company gradually migrated Flink SQL tasks to the Apache StreamPark platform for centralized management. It supports the aggregation, computation, and alerting of our real-time IoT data. -As of now, StreamPark has been deployed in various government and public security production environments, aggregating and processing real-time IoT data, as well as capturing data on people and vehicles. Below is a screenshot of the StreamPark platform deployed on a city's dedicated network: +As of now, Apache StreamPark has been deployed in various government and public security production environments, aggregating and processing real-time IoT data, as well as capturing data on people and vehicles. Below is a screenshot of the Apache StreamPark platform deployed on a city's dedicated network: ![](/blog/haibo/application.png) @@ -62,31 +62,31 @@ As of now, StreamPark has been deployed in various government and public securit #### **1. Real-time IoT Sensing Data Aggregation** -For aggregating real-time IoT sensing data, we directly use StreamPark to develop Flink SQL tasks. For methods not provided by Flink SQL, StreamPark also supports UDF-related functionalities. Users can upload UDF packages through StreamPark, and then call the relevant UDF in SQL to achieve more complex logical operations. +For aggregating real-time IoT sensing data, we directly use Apache StreamPark to develop Flink SQL tasks. For methods not provided by Flink SQL, Apache StreamPark also supports UDF-related functionalities. Users can upload UDF packages through Apache StreamPark, and then call the relevant UDF in SQL to achieve more complex logical operations. -The "SQL+UDF" approach meets most of our data aggregation scenarios. If business changes in the future, we only need to modify the SQL statement in StreamPark to complete business changes and deployment. +The "SQL+UDF" approach meets most of our data aggregation scenarios. If business changes in the future, we only need to modify the SQL statement in Apache StreamPark to complete business changes and deployment. ![](/blog/haibo/data_aggregation.png) #### **2. Flink CDC Database Synchronization** -To achieve synchronization between various databases and data warehouses, we use StreamPark to develop Flink CDC SQL tasks. With the capabilities of Flink CDC, we've implemented data synchronization between Oracle and Oracle, as well as synchronization between Mysql/Postgresql and Clickhouse. 
+To achieve synchronization between various databases and data warehouses, we use Apache StreamPark to develop Flink CDC SQL tasks. With the capabilities of Flink CDC, we've implemented data synchronization between Oracle and Oracle, as well as synchronization between Mysql/Postgresql and Clickhouse. ![](/blog/haibo/flink_cdc.png) **3. Data Analysis Model Management** -For tasks that can't use Flink SQL and need Flink code development, such as real-time control models and offline data analysis models, StreamPark offers a Custom code approach, allowing users to upload executable Flink Jar packages and run them. +For tasks that can't use Flink SQL and need Flink code development, such as real-time control models and offline data analysis models, Apache StreamPark offers a Custom code approach, allowing users to upload executable Flink Jar packages and run them. -Currently, we have uploaded over 20 analysis models, such as personnel and vehicles, to StreamPark, which manages and operates them. +Currently, we have uploaded over 20 analysis models, such as personnel and vehicles, to Apache StreamPark, which manages and operates them. ![](/blog/haibo/data_aggregation.png) -**In Summary:** Whether it's Flink SQL tasks or Custom code tasks, StreamPark provides excellent support to meet various business scenarios. However, StreamPark lacks task scheduling capabilities. If you need to schedule tasks regularly, StreamPark currently cannot meet this need. Community members are actively developing scheduling-related modules, and the soon-to-be-released version 1.2.3 will support task scheduling capabilities, so stay tuned. +**In Summary:** Whether it's Flink SQL tasks or Custom code tasks, Apache StreamPark provides excellent support to meet various business scenarios. However, Apache StreamPark lacks task scheduling capabilities. If you need to schedule tasks regularly, Apache StreamPark currently cannot meet this need. Community members are actively developing scheduling-related modules, and the soon-to-be-released version 1.2.3 will support task scheduling capabilities, so stay tuned. ## **04. Feature Extension** -Datahub is a metadata management platform developed by Linkedin, offering data source management, data lineage, data quality checks, and more. Haibo Tech has developed an extension based on StreamPark and Datahub, implementing table-level/field-level lineage features. With the data lineage feature, users can check the field lineage relationship of Flink SQL and save the lineage relationship to the Linkedin/Datahub metadata management platform. +Datahub is a metadata management platform developed by Linkedin, offering data source management, data lineage, data quality checks, and more. Haibo Tech has developed an extension based on Apache StreamPark and Datahub, implementing table-level/field-level lineage features. With the data lineage feature, users can check the field lineage relationship of Flink SQL and save the lineage relationship to the Linkedin/Datahub metadata management platform. // Two video links (Data lineage feature developed based on streampark) @@ -94,10 +94,10 @@ Datahub is a metadata management platform developed by Linkedin, offering data s ## **05. Future Expectations** -Currently, the StreamPark community's Roadmap indicates that StreamPark 1.3.0 will usher in a brand new Workbench experience, a unified resource management center (unified management of JAR/UDF/Connectors), batch task scheduling, and more. 
These are also some of the brand-new features we are eagerly anticipating. +Currently, the Apache StreamPark community's Roadmap indicates that Apache StreamPark 1.3.0 will usher in a brand new Workbench experience, a unified resource management center (unified management of JAR/UDF/Connectors), batch task scheduling, and more. These are also some of the brand-new features we are eagerly anticipating. -The Workbench will use a new workbench-style SQL development style. By selecting a data source, SQL can be generated automatically, further enhancing Flink task development efficiency. The unified UDF resource center will solve the current problem where each task has to upload its dependency package. The batch task scheduling feature will address StreamPark's current inability to schedule tasks. +The Workbench will use a new workbench-style SQL development style. By selecting a data source, SQL can be generated automatically, further enhancing Flink task development efficiency. The unified UDF resource center will solve the current problem where each task has to upload its dependency package. The batch task scheduling feature will address Apache StreamPark's current inability to schedule tasks. -Below is a prototype designed by StreamPark developers, so please stay tuned. +Below is a prototype designed by Apache StreamPark developers, so please stay tuned. ![](/blog/haibo/data_source.png) diff --git a/blog/8-streampark-usercase-ziru.md b/blog/8-streampark-usercase-ziru.md index eda608f9f..1b8b2a918 100644 --- a/blog/8-streampark-usercase-ziru.md +++ b/blog/8-streampark-usercase-ziru.md @@ -1,16 +1,16 @@ --- slug: streampark-usercase-ziru title: Ziroom's Real-Time Computing Platform Practice Based on Apache StreamPark -tags: [StreamPark, Production Practice] +tags: [Apache StreamPark, Production Practice] --- ![](/blog/ziru/cover.png) -**Introduction:** Ziroom, an O2O internet company focusing on providing rental housing products and services, has built an online, data-driven, and intelligent platform that covers the entire chain of urban living. Real-time computing has always played an important role in Ziroom. To date, Ziroom processes TB-level data daily. This article, brought by the real-time computing team from Ziroom, introduces the in-depth practice of Ziroom's real-time computing platform based on StreamPark. +**Introduction:** Ziroom, an O2O internet company focusing on providing rental housing products and services, has built an online, data-driven, and intelligent platform that covers the entire chain of urban living. Real-time computing has always played an important role in Ziroom. To date, Ziroom processes TB-level data daily. This article, brought by the real-time computing team from Ziroom, introduces the in-depth practice of Ziroom's real-time computing platform based on Apache StreamPark. - Challenges in real-time computing - The journey to the solution -- In-depth practice based on StreamPark +- In-depth practice based on Apache StreamPark - Summary of practical experience and examples - Benefits brought by the implementation - Future plans @@ -63,7 +63,7 @@ Therefore, there is a need to improve the efficiency of development and debuggin In the early stages of platform construction, we comprehensively surveyed almost all relevant projects in the industry, covering both commercial paid versions and open-source versions, starting from early 2022. 
After investigation and comparison, we found that these projects have their limitations to varying extents, and their usability and stability could not be effectively guaranteed. -Overall, StreamPark performed best in our evaluation. It was the only project without major flaws and with strong extensibility: supporting both SQL and JAR jobs, with the most complete and stable deployment mode for Flink jobs. Its unique architectural design not only avoids locking in specific Flink versions but also supports convenient version switching and parallel processing, effectively solving job dependency isolation and conflict issues. The job management & operations capabilities we focused on were also very complete, including monitoring, alerts, SQL validation, SQL version comparison, CI, etc. StreamPark's support for Flink on K8s was the most comprehensive among all the open-source projects we surveyed. However, StreamPark's K8s mode submission required local image building, leading to storage resource consumption. +Overall, Apache StreamPark performed best in our evaluation. It was the only project without major flaws and with strong extensibility: supporting both SQL and JAR jobs, with the most complete and stable deployment mode for Flink jobs. Its unique architectural design not only avoids locking in specific Flink versions but also supports convenient version switching and parallel processing, effectively solving job dependency isolation and conflict issues. The job management & operations capabilities we focused on were also very complete, including monitoring, alerts, SQL validation, SQL version comparison, CI, etc. Apache StreamPark's support for Flink on K8s was the most comprehensive among all the open-source projects we surveyed. However, Apache StreamPark's K8s mode submission required local image building, leading to storage resource consumption. In the latest 2.2 version, the community has already restructured this part. @@ -73,19 +73,19 @@ After analyzing the pros and cons of many open-source projects, we decided to pa 2. In the selection of open-source components, after comprehensive comparison and evaluation of various indicators, we finally chose what was then StreamX. Subsequent close communication with the community allowed us to deeply appreciate the serious and responsible attitude of the founders and the united and friendly atmosphere of the community. We also witnessed the project's inclusion in the Apache Incubator in September 2022, making us hopeful for its future. -3. On the basis of StreamPark, we aim to promote integration with the existing ecosystem of the company to better meet our business needs. +3. On the basis of Apache StreamPark, we aim to promote integration with the existing ecosystem of the company to better meet our business needs. -## **In-depth Practice Based on StreamPark** +## **In-depth Practice Based on Apache StreamPark** -Based on the above decisions, we initiated the evolution of the real-time computing platform, oriented by "pain point needs," and built a stable, efficient, and easy-to-maintain real-time computing platform based on StreamPark. Since the beginning of 2022, we have participated in the construction of the community while officially scheduling our internal platform construction. +Based on the above decisions, we initiated the evolution of the real-time computing platform, oriented by "pain point needs," and built a stable, efficient, and easy-to-maintain real-time computing platform based on Apache StreamPark. 
Since the beginning of 2022, we have participated in the construction of the community while officially scheduling our internal platform construction. -First, we further improved related functionalities on the basis of StreamPark: +First, we further improved related functionalities on the basis of Apache StreamPark: ![](/blog/ziru/platform_construction.png) ### **01 LDAP Login Support** -On the basis of StreamPark, we further improved related functionalities, including support for LDAP, so that in the future we can fully open up real-time capabilities, allowing analysts from the company's four business lines to use the platform, expected to reach about 170 people. With the increase in numbers, account management becomes increasingly important, especially in the case of personnel changes, account cancellation, and application become frequent and time-consuming operations. Therefore, integrating LDAP becomes particularly important. We communicated with the community in a timely manner and initiated a discussion, eventually contributing this Feature. Now, starting LDAP in StreamPark has become very simple, requiring just two steps: +On the basis of Apache StreamPark, we further improved related functionalities, including support for LDAP, so that in the future we can fully open up real-time capabilities, allowing analysts from the company's four business lines to use the platform, expected to reach about 170 people. With the increase in numbers, account management becomes increasingly important, especially in the case of personnel changes, account cancellation, and application become frequent and time-consuming operations. Therefore, integrating LDAP becomes particularly important. We communicated with the community in a timely manner and initiated a discussion, eventually contributing this Feature. Now, starting LDAP in Apache StreamPark has become very simple, requiring just two steps: #### step1: Fill in the corresponding LDAP @@ -116,7 +116,7 @@ On the login interface, click LDAP login method, then enter the corresponding ac ### **02 Automatic Ingress Generation for Job Submission** -Due to the company's network security policy, only port 80 is opened on the Kubernetes host machines by the operation team, making it impossible to directly access the job WebUI on Kubernetes via "domain + random port." To solve this problem, we needed to use Ingress to add a proxy layer to the access path, achieving the effect of access routing. In StreamPark version 2.0, we contributed the functionality related to Ingress [3]. We adopted a strategy pattern implementation, initially obtaining Kubernetes metadata information to identify its version and accordingly constructing respective objects, ensuring smooth use of the Ingress function across various Kubernetes environments. +Due to the company's network security policy, only port 80 is opened on the Kubernetes host machines by the operation team, making it impossible to directly access the job WebUI on Kubernetes via "domain + random port." To solve this problem, we needed to use Ingress to add a proxy layer to the access path, achieving the effect of access routing. In Apache StreamPark version 2.0, we contributed the functionality related to Ingress [3]. We adopted a strategy pattern implementation, initially obtaining Kubernetes metadata information to identify its version and accordingly constructing respective objects, ensuring smooth use of the Ingress function across various Kubernetes environments. 
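As background for why that version detection matters: different Kubernetes releases serve the Ingress resource from different API groups (for example `extensions/v1beta1`, `networking.k8s.io/v1beta1`, or `networking.k8s.io/v1`). A quick way to see what a given cluster exposes is the diagnostic query below — a sketch for checking your environment, not a platform command:

```shell
# List the API groups this cluster serves; the Ingress entry differs across Kubernetes releases
kubectl api-versions | grep networking
```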
The specific configuration steps are as follows: @@ -138,7 +138,7 @@ You will notice that the generated address consists of three parts: domain + job ### **03 Support for Viewing Job Deployment Logs** -In the process of continuous job deployment, we gradually realized that without logs, we cannot perform effective operations. Log retention, archiving, and viewing became an important part in our later problem-solving process. Therefore, in StreamPark version 2.0, we contributed the ability to archive startup logs in On Kubernetes mode and view them on the page [4]. Now, by clicking the log viewing button in the job list, it is very convenient to view the real-time logs of the job. +In the process of continuous job deployment, we gradually realized that without logs, we cannot perform effective operations. Log retention, archiving, and viewing became an important part in our later problem-solving process. Therefore, in Apache StreamPark version 2.0, we contributed the ability to archive startup logs in On Kubernetes mode and view them on the page [4]. Now, by clicking the log viewing button in the job list, it is very convenient to view the real-time logs of the job. ![](/blog/ziru/k8s_log.png) @@ -148,7 +148,7 @@ In actual use, as the number of jobs increased, the number of users rose, and mo To solve this problem, we proposed a demand in the community: we hoped that each job could directly jump to the corresponding monitoring chart and log archive page through a hyperlink, so that users could directly view the monitoring information and logs related to their jobs. This avoids tedious searches in complex system interfaces, thus improving the efficiency of troubleshooting. -We had a discussion in the community, and it was quickly responded to, as everyone thought it was a common need. Soon, a developer contributed a design and related PR, and the issue was quickly resolved. Now, to enable this feature in StreamPark has become very simple: +We had a discussion in the community, and it was quickly responded to, as everyone thought it was a common need. Soon, a developer contributed a design and related PR, and the issue was quickly resolved. Now, to enable this feature in Apache StreamPark has become very simple: #### step1: Create a badge label @@ -225,11 +225,11 @@ User B's actual execution SQL: SELECT name, Encryption_function(age), price, Sensitive_field_functions(phone) FROM user; ``` -### **06 Data Synchronization Platform Based on StreamPark** +### **06 Data Synchronization Platform Based on Apache StreamPark** -With the successful implementation of StreamPark's technical solutions in the company, we achieved deep support for Flink jobs, bringing a qualitative leap in data processing. This prompted us to completely revamp our past data synchronization logic, aiming to reduce operational costs through technical optimization and integration. Therefore, we gradually replaced historical Sqoop jobs, Canal jobs, and Hive JDBC Handler jobs with Flink CDC jobs, Flink stream, and batch jobs. In this process, we continued to optimize and strengthen StreamPark's interface capabilities, adding a status callback mechanism and achieving perfect integration with the DolphinScheduler [7] scheduling system, further enhancing our data processing capabilities. +With the successful implementation of Apache StreamPark's technical solutions in the company, we achieved deep support for Flink jobs, bringing a qualitative leap in data processing. 
This prompted us to completely revamp our past data synchronization logic, aiming to reduce operational costs through technical optimization and integration. Therefore, we gradually replaced historical Sqoop jobs, Canal jobs, and Hive JDBC Handler jobs with Flink CDC jobs, Flink stream, and batch jobs. In this process, we continued to optimize and strengthen Apache StreamPark's interface capabilities, adding a status callback mechanism and achieving perfect integration with the DolphinScheduler [7] scheduling system, further enhancing our data processing capabilities. -External system integration with StreamPark is simple, requiring only a few steps: +External system integration with Apache StreamPark is simple, requiring only a few steps: 1. First, create a token for API access: @@ -255,11 +255,11 @@ curl -X POST '/flink/app/start' \ ## **Summary of Practical Experience** -During our in-depth use of StreamPark, we summarized some common issues and explored solutions in the practice process, which we have compiled into examples for reference. +During our in-depth use of Apache StreamPark, we summarized some common issues and explored solutions in the practice process, which we have compiled into examples for reference. ### **01 Building Base Images** -To deploy a Flink job on Kubernetes using StreamPark, you first need to prepare a Base image built on Flink. Then, on the Kubernetes platform, the user-provided image is used to start the Flink job. If we continue to use the official "bare image," it is far from sufficient for actual development. Business logic developed by users often involves multiple upstream and downstream data sources, requiring related data source Connectors and dependencies like Hadoop. Therefore, these dependencies need to be included in the image. Below, I will introduce the specific operation steps. +To deploy a Flink job on Kubernetes using Apache StreamPark, you first need to prepare a Base image built on Flink. Then, on the Kubernetes platform, the user-provided image is used to start the Flink job. If we continue to use the official "bare image," it is far from sufficient for actual development. Business logic developed by users often involves multiple upstream and downstream data sources, requiring related data source Connectors and dependencies like Hadoop. Therefore, these dependencies need to be included in the image. Below, I will introduce the specific operation steps. #### step1: First, create a folder containing two folders and a Dockerfile file @@ -328,7 +328,7 @@ RUN unzip -d arthas-latest-bin arthas-packaging-latest-bin.zip ### **03 Resolution of Dependency Conflicts in Images** -In the process of using StreamPark, we often encounter dependency conflict exceptions like NoClassDefFoundError, ClassNotFoundException, and NoSuchMethodError in Flink jobs running on base images. The troubleshooting approach is to find the package path of the conflicting class indicated in the error. For example, if the error class is in org.apache.orc:orc-core, go to the corresponding module directory, run `mvn dependency::tree`, search for orc-core, see who brought in the dependency, and remove it using exclusion. Below, I will introduce in detail a method of custom packaging to resolve dependency conflicts, illustrated by a dependency conflict caused by the flink-shaded-hadoop-3-uber JAR package in a base image. 
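Before the custom-packaging walkthrough below, here is a quick sketch of the exclusion-based approach just described; the org.apache.orc:orc-core coordinates are simply the example from the text above, not a required value:

```shell
# Show only the branches of the dependency tree that pull in the conflicting artifact
mvn dependency:tree -Dincludes=org.apache.orc:orc-core

# Then add an <exclusion> for org.apache.orc:orc-core to the dependency that introduced it
# in that module's pom.xml and rebuild the job JAR.
```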
+In the process of using Apache StreamPark, we often encounter dependency conflict exceptions like NoClassDefFoundError, ClassNotFoundException, and NoSuchMethodError in Flink jobs running on base images. The troubleshooting approach is to find the package path of the conflicting class indicated in the error. For example, if the error class is in org.apache.orc:orc-core, go to the corresponding module directory, run `mvn dependency::tree`, search for orc-core, see who brought in the dependency, and remove it using exclusion. Below, I will introduce in detail a method of custom packaging to resolve dependency conflicts, illustrated by a dependency conflict caused by the flink-shaded-hadoop-3-uber JAR package in a base image. #### step1: Clone the flink-shaded project locally👇 @@ -346,7 +346,7 @@ git clone https://github.com/apache/flink-shaded.git ### **04 Centralized Job Configuration Example** -One of the great conveniences of using StreamPark is centralized configuration management. You can configure all settings in the conf file in the Flink directory bound to the platform. +One of the great conveniences of using Apache StreamPark is centralized configuration management. You can configure all settings in the conf file in the Flink directory bound to the platform. ```shell cd /flink-1.14.5/conf @@ -363,9 +363,9 @@ Clicking Sync Conf will synchronize the global configuration file, and new jobs ![](/blog/ziru/sync_conf.png) -### **05 StreamPark DNS Resolution Configuration** +### **05 Apache StreamPark DNS Resolution Configuration** -A correct and reasonable DNS resolution configuration is very important when submitting FlinkSQL on the StreamPark platform. It mainly involves the following points: +A correct and reasonable DNS resolution configuration is very important when submitting FlinkSQL on the Apache StreamPark platform. It mainly involves the following points: 1. Flink jobs' Checkpoint writing to HDFS requires a snapshot write to an HDFS node obtained through ResourceManager. If there are expansions in the Hadoop cluster in the enterprise, and these new nodes are not covered by the DNS resolution service, this will directly lead to Checkpoint failure, affecting online stability. @@ -449,7 +449,7 @@ To achieve isolation between production and testing environments, we introduced export HADOOP_CONF_DIR=/home/streamx/conf ``` -This effectively cut off the default logic of Flink on K8s loading HDFS configuration. This operation ensures that A StreamPark only connects to A Hadoop environment, while B StreamPark connects to B Hadoop environment, thus achieving complete isolation between testing and production environments. +This effectively cut off the default logic of Flink on K8s loading HDFS configuration. This operation ensures that A Apache StreamPark only connects to A Hadoop environment, while B Apache StreamPark connects to B Hadoop environment, thus achieving complete isolation between testing and production environments. Specifically, after this command takes effect, we can ensure that Flink jobs submitted on port 10002 connect to the B Hadoop environment. Thus, the B Hadoop environment is isolated from the Hadoop environment used by Flink jobs submitted on port 10000 in the past, effectively preventing interference between different environments and ensuring system stability and reliability. 
@@ -540,16 +540,16 @@ netstat -tlnp | grep 10002 ## **Benefits Brought** -Our team has been using StreamX (the predecessor of StreamPark) and, after more than a year of practice and refinement, StreamPark has significantly improved our challenges in developing, managing, and operating Apache Flink jobs. As a one-stop service platform, StreamPark greatly simplifies the entire development process. Now, we can complete job development, compilation, and release directly on the StreamPark platform, not only lowering the management and deployment threshold of Flink but also significantly improving development efficiency. +Our team has been using StreamX (the predecessor of Apache StreamPark) and, after more than a year of practice and refinement, Apache StreamPark has significantly improved our challenges in developing, managing, and operating Apache Flink jobs. As a one-stop service platform, Apache StreamPark greatly simplifies the entire development process. Now, we can complete job development, compilation, and release directly on the Apache StreamPark platform, not only lowering the management and deployment threshold of Flink but also significantly improving development efficiency. -Since deploying StreamPark, we have been using the platform on a large scale in a production environment. From initially managing over 50 FlinkSQL jobs to nearly 500 jobs now, as shown in the diagram, StreamPark is divided into 7 teams, each with dozens of jobs. This transformation not only demonstrates StreamPark's scalability and efficiency but also fully proves its strong practical value in actual business. +Since deploying Apache StreamPark, we have been using the platform on a large scale in a production environment. From initially managing over 50 FlinkSQL jobs to nearly 500 jobs now, as shown in the diagram, Apache StreamPark is divided into 7 teams, each with dozens of jobs. This transformation not only demonstrates Apache StreamPark's scalability and efficiency but also fully proves its strong practical value in actual business. ![](/blog/ziru/production_environment.png) ## **Future Expectations** -As one of the early users of StreamPark, we have maintained close communication with the community, participating in the stability improvement of StreamPark. We have submitted bugs encountered in production operation and new features to the community. In the future, we hope to manage the metadata information of Apache Paimon lake tables and the capability of auxiliary jobs for +As one of the early users of Apache StreamPark, we have maintained close communication with the community, participating in the stability improvement of Apache StreamPark. We have submitted bugs encountered in production operation and new features to the community. In the future, we hope to manage the metadata information of Apache Paimon lake tables and the capability of auxiliary jobs for -Paimon's Actions on StreamPark. Based on the Flink engine, by interfacing with the Catalog of lake tables and Action jobs, we aim to realize the management and optimization of lake table jobs in one integrated capability. Currently, StreamPark is working on integrating the capabilities with Paimon data, which will greatly assist in real-time data lake ingestion in the future. +Paimon's Actions on Apache StreamPark. Based on the Flink engine, by interfacing with the Catalog of lake tables and Action jobs, we aim to realize the management and optimization of lake table jobs in one integrated capability. 
Currently, Apache StreamPark is working on integrating the capabilities with Paimon data, which will greatly assist in real-time data lake ingestion in the future. -We are very grateful for the technical support that the StreamPark team has provided us all along. We wish Apache StreamPark continued success, more users, and its early graduation to become a top-level Apache project. +We are very grateful for the technical support that the Apache StreamPark team has provided us all along. We wish Apache StreamPark continued success, more users, and its early graduation to become a top-level Apache project. diff --git a/community/contribution_guide/become_committer.md b/community/contribution_guide/become_committer.md index 96c4b493d..c84cd7ff4 100644 --- a/community/contribution_guide/become_committer.md +++ b/community/contribution_guide/become_committer.md @@ -38,10 +38,10 @@ only by code. Apache StreamPark community strives to be meritocratic. Thus, once someone has contributed sufficiently to any area of CoPDoC they can be a -candidate for committer-ship and at last voted in as a StreamPark +candidate for committer-ship and at last voted in as a Apache StreamPark committer. Being an Apache StreamPark committer does not necessarily mean you must commit code with your commit privilege to the codebase; it -means you are committed to the StreamPark project and are productively +means you are committed to the Apache StreamPark project and are productively contributing to our community's success. ## Committer requirements: @@ -58,8 +58,8 @@ and fair. Committer candidates should have a decent amount of continuous engagements and contributions (fixing bugs, adding new features, writing documentation, maintaining issues boards, code review, or answering -community questions) to StreamPark either by contributing to the codebase -of the main website or StreamPark's GitHub repositories. +community questions) to Apache StreamPark either by contributing to the codebase +of the main website or Apache StreamPark's GitHub repositories. - +3 months with light activity and engagement. - +2 months of medium activity and engagement. diff --git a/community/contribution_guide/become_pmc_member.md b/community/contribution_guide/become_pmc_member.md index 725626622..d52abb902 100644 --- a/community/contribution_guide/become_pmc_member.md +++ b/community/contribution_guide/become_pmc_member.md @@ -38,10 +38,10 @@ only by code. Apache StreamPark community strives to be meritocratic. Thus, once someone has contributed sufficiently to any area of CoPDoC they can be a -candidate for PMC membership and at last voted in as a StreamPark +candidate for PMC membership and at last voted in as a Apache StreamPark PMC member. Being an Apache StreamPark PMC member does not necessarily mean you must commit code with your commit privilege to the codebase; it -means you are committed to the StreamPark project and are productively +means you are committed to the Apache StreamPark project and are productively contributing to our community's success. ## PMC member requirements: @@ -58,8 +58,8 @@ and fair. PMC member candidates should have a decent amount of continuous engagements and contributions (fixing bugs, adding new features, writing documentation, maintaining issues boards, code review, or answering -community questions) to StreamPark either by contributing to the codebase -of the main website or StreamPark's GitHub repositories. 
+community questions) to Apache StreamPark either by contributing to the codebase +of the main website or Apache StreamPark's GitHub repositories. - +5 months with light activity and engagement. - +4 months of medium activity and engagement. diff --git a/community/contribution_guide/mailing_lists.md b/community/contribution_guide/mailing_lists.md index 3f6affcf4..7f008966f 100644 --- a/community/contribution_guide/mailing_lists.md +++ b/community/contribution_guide/mailing_lists.md @@ -21,7 +21,7 @@ sidebar_position: 1 limitations under the License. --> -The StreamPark developer mailing list is the preferred means for all your questions when using StreamPark, which pushes your doubts out to the entire community. +The Apache StreamPark developer mailing list is the preferred means for all your questions when using Apache StreamPark, which pushes your doubts out to the entire community. This is the best way to keep up-to-date with the community. Before you post anything to the mailing lists, be sure that you already **subscribe** to them. @@ -31,12 +31,12 @@ The currently available lists are listed in the below table. ### Developer List -- Use this list for your StreamPark questions -- Used by StreamPark contributors to discuss development of StreamPark +- Use this list for your Apache StreamPark questions +- Used by Apache StreamPark contributors to discuss development of Apache StreamPark ### Commits List -- Notifications on changes to the StreamPark codebase +- Notifications on changes to the Apache StreamPark codebase | List Name | Address | Subscribe | Unsubscribe | Archive | |---------------------|------------------------------|-------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------------| diff --git a/community/contribution_guide/new_committer_process.md b/community/contribution_guide/new_committer_process.md index 4baadf7f6..6a1c3d995 100644 --- a/community/contribution_guide/new_committer_process.md +++ b/community/contribution_guide/new_committer_process.md @@ -71,7 +71,7 @@ Subject: [VOTE] New committer: ${NEW_COMMITTER_NAME} ``` ```text -Hi StreamPark PPMC, +Hi Apache StreamPark PPMC, This is a formal vote about inviting ${NEW_COMMITTER_NAME} as our new committer. @@ -92,7 +92,7 @@ Subject: [RESULT] [VOTE] New committer: ${NEW_COMMITTER_NAME} ``` ```text -Hi StreamPark PPMC, +Hi Apache StreamPark PPMC, The vote has now closed. The results are: @@ -110,13 +110,13 @@ The vote is ***successful/not successful*** ```text To: ${NEW_COMMITTER_EMAIL} Cc: private@streampark.apache.org -Subject: Invitation to become StreamPark committer: ${NEW_COMMITTER_NAME} +Subject: Invitation to become Apache StreamPark committer: ${NEW_COMMITTER_NAME} ``` ```text Hello ${NEW_COMMITTER_NAME}, -The StreamPark Project Management Committee (PMC) +The Apache StreamPark Project Management Committee (PMC) hereby offers you committer privileges to the project. These privileges are offered on the understanding that you'll use them reasonably and with common sense. @@ -165,7 +165,7 @@ establishing you as a committer. 
```text To: ${NEW_COMMITTER_EMAIL} Cc: private@streampark.apache.org -Subject: Re: invitation to become StreamPark committer +Subject: Re: invitation to become Apache StreamPark committer ``` ```text diff --git a/community/contribution_guide/new_pmc_member_process.md b/community/contribution_guide/new_pmc_member_process.md index 9828d0f31..8da0d4cb2 100644 --- a/community/contribution_guide/new_pmc_member_process.md +++ b/community/contribution_guide/new_pmc_member_process.md @@ -69,7 +69,7 @@ Subject: [VOTE] New PMC member candidate: ${NEW_PMC_NAME} ``` ```text -Hi StreamPark PPMC, +Hi Apache StreamPark PPMC, This is a formal vote about inviting ${NEW_PMC_NAME} as our new PMC member. @@ -90,7 +90,7 @@ Subject: [RESULT] [VOTE] New PMC member: ${NEW_PMC_NAME} ``` ```text -Hi StreamPark PPMC, +Hi Apache StreamPark PPMC, The vote has now closed. The results are: @@ -108,11 +108,11 @@ The vote is ***successful/not successful*** ```text To: board@apache.org Cc: private@.apache.org -Subject: [NOTICE] ${NEW_PMC_NAME} for StreamPark PMC member +Subject: [NOTICE] ${NEW_PMC_NAME} for Apache StreamPark PMC member ``` ```text -StreamPark proposes to invite ${NEW_PMC_NAME} to join the PMC. +Apache StreamPark proposes to invite ${NEW_PMC_NAME} to join the PMC. The vote result is available here: https://lists.apache.org/... ``` @@ -124,13 +124,13 @@ See [newpmc](https://www.apache.org/dev/pmc.html#newpmc) ```text To: ${NEW_PMC_EMAIL} Cc: private@streampark.apache.org -Subject: Invitation to become StreamPark PMC member: ${NEW_PMC_NAME} +Subject: Invitation to become Apache StreamPark PMC member: ${NEW_PMC_NAME} ``` ```text Hello ${NEW_PMC_NAME}, -The StreamPark Project Management Committee (PMC) +The Apache StreamPark Project Management Committee (PMC) hereby offers you committer privileges to the project as well as membership in the PMC. These privileges are offered on the understanding that @@ -178,7 +178,7 @@ establishing you as a PMC member. ```text To: ${NEW_PMC_EMAIL} Cc: private@streamparkv.apache.org -Subject: Re: invitation to become StreamPark PMC member +Subject: Re: invitation to become Apache StreamPark PMC member ``` ```text @@ -265,7 +265,7 @@ To: dev@streampark.apache.org ``` ```text -Hi StreamPark Community, +Hi Apache StreamPark Community, The Podling Project Management Committee (PPMC) for Apache StreamPark has invited ${NEW_PMC_NAME} to become our PMC member and diff --git a/community/release/How-to-release.md b/community/release/How-to-release.md index adb6f1440..1e8a6f73b 100644 --- a/community/release/How-to-release.md +++ b/community/release/How-to-release.md @@ -59,9 +59,9 @@ GnuPG needs to construct a user ID to identify your key. Real name: muchunjin # Please enter 'gpg real name' Email address: muchunjin@apache.org # Please enter your apache email address here -Comment: for apache StreamPark release create at 20230501 # Please enter some comments here +Comment: for Apache StreamPark release create at 20230501 # Please enter some comments here You selected this USER-ID: - "muchunjin (for apache StreamPark release create at 20230501) " + "muchunjin (for Apache StreamPark release create at 20230501) " Change (N)ame, (C)omment, (E)mail or (O)kay/(Q)uit? O # Please enter O here We need to generate a lot of random bytes. It is a good idea to perform @@ -94,7 +94,7 @@ public and secret key created and signed.
pub rsa4096 2023-05-01 [SC] 85778A4CE4DD04B7E07813ABACFB69E705016886 -uid muchunjin (for apache StreamPark release create at 20230501) +uid muchunjin (for Apache StreamPark release create at 20230501) sub rsa4096 2023-05-01 [E] ``` @@ -108,7 +108,7 @@ $ gpg --keyid-format SHORT --list-keys ------------------------ pub rsa4096/05016886 2023-05-01 [SC] 85778A4CE4DD04B7E07813ABACFB69E705016886 -uid [ultimate] muchunjin (for apache StreamPark release create at 20230501) +uid [ultimate] muchunjin (for Apache StreamPark release create at 20230501) sub rsa4096/0C5A4E1C 2023-05-01 [E] # Send public key to keyserver via key id @@ -122,7 +122,7 @@ Verify whether it is synchronized to the public network, it will take about a mi ```shell $ gpg --keyserver keyserver.ubuntu.com --recv-keys 05016886 # If the following content appears, it means success -gpg: key ACFB69E705016886: "muchunjin (for apache StreamPark release create at 20230501) " not changed +gpg: key ACFB69E705016886: "muchunjin (for Apache StreamPark release create at 20230501) " not changed gpg: Total number processed: 1 gpg: unchanged: 1 ``` @@ -380,15 +380,15 @@ $ for i in *.tar.gz; do echo $i; gpg --verify $i.asc $i ; done apache-streampark-2.1.0-incubating-src.tar.gz gpg: Signature made Tue May 2 12:16:35 2023 CST gpg: using RSA key 85778A4CE4DD04B7E07813ABACFB69E705016886 -gpg: Good signature from "muchunjin (for apache StreamPark release create at 20230501) " [ultimate] +gpg: Good signature from "muchunjin (for Apache StreamPark release create at 20230501) " [ultimate] apache-streampark_2.11-2.1.0-incubating-bin.tar.gz gpg: Signature made Tue May 2 12:16:36 2023 CST gpg: using RSA key 85778A4CE4DD04B7E07813ABACFB69E705016886 -gpg: Good signature from "muchunjin (for apache StreamPark release create at 20230501) " [ultimate] +gpg: Good signature from "muchunjin (for Apache StreamPark release create at 20230501) " [ultimate] apache-streampark_2.12-2.1.0-incubating-bin.tar.gz gpg: Signature made Tue May 2 12:16:37 2023 CST gpg: using RSA key 85778A4CE4DD04B7E07813ABACFB69E705016886 -gpg: BAD signature from "muchunjin (for apache StreamPark release create at 20230501) " [ultimate] +gpg: BAD signature from "muchunjin (for Apache StreamPark release create at 20230501) " [ultimate] # Verify SHA512 $ for i in *.tar.gz; do echo $i; sha512sum --check $i.sha512; done @@ -431,7 +431,7 @@ svn add 2.0.0-RC1 svn status # 3. Submit to svn remote server -svn commit -m "release for StreamPark 2.1.0" +svn commit -m "release for Apache StreamPark 2.1.0" ``` #### 3.7 Check Apache SVN Commit Results @@ -451,7 +451,7 @@ Send a voting email in the community requires at least three `+1` and no `-1`. > `Body`: ``` -Hello StreamPark Community: +Hello Apache StreamPark Community: This is a call for vote to release Apache StreamPark(Incubating) version release-2.1.0-RC1. @@ -482,11 +482,11 @@ Please vote accordingly: *Valid check is a requirement for a vote. *Checklist for reference: -[ ] Download StreamPark are valid. +[ ] Download links of Apache StreamPark are valid. [ ] Checksums and PGP signatures are valid. [ ] Source code distributions have correct names matching the current release. -[ ] LICENSE and NOTICE files are correct for each StreamPark repo. +[ ] LICENSE and NOTICE files are correct for each Apache StreamPark repo. [ ] All files have license headers if necessary. [ ] No compiled archives bundled in source archive. [ ] Can compile from source.
@@ -512,7 +512,7 @@ After 72 hours, the voting results will be counted, and the voting result email > `Body`: ``` -Dear StreamPark community, +Dear Apache StreamPark community, Thanks for your review and vote for "Release Apache StreamPark (Incubating) 2.1.0-rc1" I'm happy to announce the vote has passed: @@ -567,7 +567,7 @@ The Apache StreamPark community has voted on and approved a proposal to release We now kindly request the Incubator PMC members review and vote on this incubator release. Apache StreamPark, Make stream processing easier! easy-to-use streaming application development framework and operation platform. -StreamPark community vote thread: +Apache StreamPark community vote thread: https://lists.apache.org/thread/t01b2lbtqzyt7j4dsbdp5qjc3gngjsdq Vote result thread: @@ -658,7 +658,7 @@ Vote thread: https://lists.apache.org/thread/k3cvcbzxqs6qy62d1o6r9pqpykcgvvhm -Thanks everyone for your feedback and help with StreamPark apache release. The StreamPark team will take the steps to complete this release and will announce it soon. +Thanks everyone for your feedback and help with Apache StreamPark apache release. The Apache StreamPark team will take the steps to complete this release and will announce it soon. Best, ChunJin Mu @@ -752,12 +752,12 @@ Hi all, We are glad to announce the release of Apache StreamPark(incubating) 2.1.0. Once again I would like to express my thanks to your help. -StreamPark(https://streampark.apache.org/) Make stream processing easier! easy-to-use streaming application development framework and operation platform +Apache StreamPark(https://streampark.apache.org/) Make stream processing easier! easy-to-use streaming application development framework and operation platform Download Links: https://streampark.apache.org/download/ Release Notes: https://streampark.apache.org/download/release-note/2.1.0 -StreamPark Resources: +Apache StreamPark Resources: - Issue: https://github.com/apache/incubator-streampark/issues - Mailing list: dev@streampark.apache.org diff --git a/community/release/how-to-verify.md b/community/release/how-to-verify.md index abe157f2a..f1a5903f5 100644 --- a/community/release/how-to-verify.md +++ b/community/release/how-to-verify.md @@ -143,7 +143,7 @@ cd apache-streampark-${release_version}-incubating-src ***package mode, just select mixed mode *** ->[StreamPark] StreamPark supports front-end and server-side mixed / detached packaging mode, Which mode do you need ? +>[Apache StreamPark] Apache StreamPark supports front-end and server-side mixed / detached packaging mode, Which mode do you need ? > >1. mixed mode > @@ -151,7 +151,7 @@ cd apache-streampark-${release_version}-incubating-src > > select 1 ->[StreamPark] StreamPark supports Scala 2.11 and 2.12. Which version do you need ? +>[Apache StreamPark] Apache StreamPark supports Scala 2.11 and 2.12. Which version do you need ? > >1. 2.11 >2. 2.12 diff --git a/community/submit_guide/document.md b/community/submit_guide/document.md index 9329bb7b7..d24f4be94 100644 --- a/community/submit_guide/document.md +++ b/community/submit_guide/document.md @@ -21,11 +21,11 @@ sidebar_position: 1 limitations under the License. --> -Good documentation is critical for any type of software. Any contribution that can improve the StreamPark documentation is welcome. +Good documentation is critical for any type of software. Any contribution that can improve the Apache StreamPark documentation is welcome. 
## Get the document project -Documentation for the StreamPark project is maintained in a separate [git repository](https://github.com/apache/incubator-streampark-website). +Documentation for the Apache StreamPark project is maintained in a separate [git repository](https://github.com/apache/incubator-streampark-website). First you need to fork the document project into your own github repository, and then clone the document to your local computer. diff --git a/community/submit_guide/submit-code.md b/community/submit_guide/submit-code.md index 003b08abe..c281f92cb 100644 --- a/community/submit_guide/submit-code.md +++ b/community/submit_guide/submit-code.md @@ -85,4 +85,4 @@ sidebar_position: 2 * Then the community Committers will do CodeReview, and then he will discuss some details (including design, implementation, performance, etc.) with you. When everyone on the team is satisfied with this modification, the commit will be merged into the dev branch -* Finally, congratulations, you have become an official contributor to StreamPark ! +* Finally, congratulations, you have become an official contributor to Apache StreamPark ! diff --git a/docs/connector/1-kafka.md b/docs/connector/1-kafka.md index 04c80936a..88965682f 100644 --- a/docs/connector/1-kafka.md +++ b/docs/connector/1-kafka.md @@ -8,7 +8,7 @@ import TabItem from '@theme/TabItem'; [Flink officially](https://ci.apache.org/projects/flink/flink-docs-release-1.12/zh/dev/connectors/kafka.html) provides a connector to [Apache Kafka](https://kafka.apache.org) connector for reading from or writing to a Kafka topic, providing **exactly once** processing semantics -`KafkaSource` and `KafkaSink` in `StreamPark` are further encapsulated based on `kafka connector` from the official website, simplifying the development steps, making it easier to read and write data +`KafkaSource` and `KafkaSink` in `Apache StreamPark` are further encapsulated based on `kafka connector` from the official website, simplifying the development steps, making it easier to read and write data ## Dependencies @@ -66,7 +66,7 @@ val stream = env.addSource(new FlinkKafkaConsumer[String]("topic", new SimpleStr ``` -You can see a series of kafka connection information defined, this way the parameters are hard-coded, very insensitive, let's see how to use `StreamPark` to access `kafka` data, we just define the configuration file in the rule format and then write the code +You can see a series of kafka connection information defined, this way the parameters are hard-coded, very insensitive, let's see how to use `Apache StreamPark` to access `kafka` data, we just define the configuration file in the rule format and then write the code ### example @@ -173,7 +173,7 @@ Let's take a look at more usage and configuration methods ### Consume multiple Kafka instances -`StreamPark` has taken into account the configuration of kafka of multiple different instances at the beginning of development . How to unify the configuration, and standardize the format? The solution in streampark is this, if we want to consume two different instances of kafka at the same time, the configuration file is defined as follows, +`Apache StreamPark` has taken into account the configuration of kafka of multiple different instances at the beginning of development . How to unify the configuration, and standardize the format? 
The solution in streampark is this, if we want to consume two different instances of kafka at the same time, the configuration file is defined as follows, As you can see in the `kafka.source` directly under the kafka instance name, here we unified called **alias** , **alias** must be unique, to distinguish between different instances If there is only one kafka instance, then you can not configure `alias` When writing the code for consumption, pay attention to the corresponding **alias** can be specified, the configuration and code is as follows @@ -304,7 +304,7 @@ Regarding kafka's partition dynamics, by default, partition discovery is disable For more details, please refer to the [official website documentation](https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/connectors/kafka.html#partition-discovery) Flink Kafka Consumer is also able to discover Topics using regular expressions, please refer to the [official website documentation](https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/connectors/kafka.html#topic-discovery) -A simpler way is provided in `StreamPark`, you need to configure the regular pattern of the matching `topic` instance name in `pattern` +A simpler way is provided in `Apache StreamPark`, you need to configure the regular pattern of the matching `topic` instance name in `pattern` @@ -391,7 +391,7 @@ DataStream stream = env.addSource(myConsumer); -This setting is not recommended in `StreamPark`, a more convenient way is provided by specifying `auto.offset.reset` in the configuration +This setting is not recommended in `Apache StreamPark`, a more convenient way is provided by specifying `auto.offset.reset` in the configuration * `earliest` consume from earliest record * `latest` consume from latest record @@ -524,7 +524,7 @@ The returned object is wrapped in a `KafkaRecord`, which has the current `offset In many case, the timestamp of the record is embedded (explicitly or implicitly) in the record itself. In addition, users may want to specify in a custom way, for example a special record in a `Kafka` stream containing a `watermark` of the current event time. For these cases, `Flink Kafka Consumer` is allowed to specify `AssignerWithPeriodicWatermarks` or `AssignerWithPunctuatedWatermarks`. -In the `StreamPark` run pass a `WatermarkStrategy` as a parameter to assign a `Watermark`, for example, parse the data in the `topic` as a `user` object, there is an `orderTime` in `user` which is a time type, we use this as a base to assign a `Watermark` to it +In the `Apache StreamPark` run pass a `WatermarkStrategy` as a parameter to assign a `Watermark`, for example, parse the data in the `topic` as a `user` object, there is an `orderTime` in `user` which is a time type, we use this as a base to assign a `Watermark` to it @@ -651,7 +651,7 @@ If the `watermark assigner` relies on messages read from `Kafka` to raise the `w ## Kafka Sink (Producer) -In `StreamPark` the `Kafka Producer` is called `KafkaSink`, which allows messages to be written to one or more `Kafka topics`. +In `Apache StreamPark` the `Kafka Producer` is called `KafkaSink`, which allows messages to be written to one or more `Kafka topics`. 
@@ -980,7 +980,7 @@ class JavaUser implements Serializable { ### specific partitioner -`KafkaSink` allows you to specify a kafka partitioner, if you don't specify it, the default is to use `StreamPark` built-in **KafkaEqualityPartitioner** partitioner, as the name, the partitioner can write data to each partition evenly, the `scala` api is set by the ` partitioner` parameter to set the partitioner, +`KafkaSink` allows you to specify a kafka partitioner, if you don't specify it, the default is to use `Apache StreamPark` built-in **KafkaEqualityPartitioner** partitioner, as the name, the partitioner can write data to each partition evenly, the `scala` api is set by the ` partitioner` parameter to set the partitioner, `java` api is set by `partitioner()` method diff --git a/docs/connector/2-jdbc.md b/docs/connector/2-jdbc.md index 02992a004..4a7171117 100755 --- a/docs/connector/2-jdbc.md +++ b/docs/connector/2-jdbc.md @@ -9,11 +9,11 @@ import TabItem from '@theme/TabItem'; Flink officially provides the [JDBC](https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/connectors/jdbc.html) connector for reading from or writing to JDBC, which can provides **AT_LEAST_ONCE** (at least once) processing semantics -`StreamPark` implements **EXACTLY_ONCE** (Exactly Once) semantics of `JdbcSink` based on two-stage commit, and uses [`HikariCP`](https://github.com/brettwooldridge/HikariCP) as connection pool to make data reading and write data more easily and accurately +`Apache StreamPark` implements **EXACTLY_ONCE** (Exactly Once) semantics of `JdbcSink` based on two-stage commit, and uses [`HikariCP`](https://github.com/brettwooldridge/HikariCP) as connection pool to make data reading and write data more easily and accurately ## JDBC Configuration -The implementation of the `Jdbc Connector` in `StreamPark` uses the [`HikariCP`](https://github.com/brettwooldridge/HikariCP) connection pool, which is configured under the namespace of `jdbc`, and the agreed configuration is as follows: +The implementation of the `Jdbc Connector` in `Apache StreamPark` uses the [`HikariCP`](https://github.com/brettwooldridge/HikariCP) connection pool, which is configured under the namespace of `jdbc`, and the agreed configuration is as follows: ```yaml jdbc: @@ -60,7 +60,7 @@ Except for the special `semantic` configuration item, all other configurations m ## JDBC read -In `StreamPark`, `JdbcSource` is used to read data, and according to the data `offset` to read data can be replayed, we look at the specific how to use `JdbcSource` to read data, if the demand is as follows +In `Apache StreamPark`, `JdbcSource` is used to read data, and according to the data `offset` to read data can be replayed, we look at the specific how to use `JdbcSource` to read data, if the demand is as follows
@@ -208,7 +208,7 @@ public interface SQLResultFunction extends Serializable { ## JDBC Read Write -In `StreamPark`, `JdbcSink` is used to write data, let's see how to write data with `JdbcSink`, the example is to read data from `kakfa` and write to `mysql`. +In `Apache StreamPark`, `JdbcSink` is used to write data, let's see how to write data with `JdbcSink`, the example is to read data from `kakfa` and write to `mysql`. @@ -229,7 +229,7 @@ jdbc: password: 123456 ``` :::danger Cautions -The configuration under `jdbc` **semantic** is the semantics of writing, as described in [Jdbc Info Configuration](#jdbc-info-config), the configuration will only take effect on `JdbcSink`, `StreamPark` is based on two-phase commit to achieve **EXACTLY_ONCE** semantics, +The configuration under `jdbc` **semantic** is the semantics of writing, as described in [Jdbc Info Configuration](#jdbc-info-config), the configuration will only take effect on `JdbcSink`, `Apache StreamPark` is based on two-phase commit to achieve **EXACTLY_ONCE** semantics, This requires that the database being manipulated supports transactions(`mysql`, `oracle`, `MariaDB`, `MS SQL Server`), theoretically all databases that support standard Jdbc transactions can do EXACTLY_ONCE (exactly once) write ::: diff --git a/docs/connector/3-clickhouse.md b/docs/connector/3-clickhouse.md index dfaf1718c..e87608c72 100755 --- a/docs/connector/3-clickhouse.md +++ b/docs/connector/3-clickhouse.md @@ -11,7 +11,7 @@ import TabItem from '@theme/TabItem'; [ClickHouse](https://clickhouse.com/) is a columnar database management system (DBMS) for online analytics (OLAP). Currently, Flink does not officially provide a connector for writing to ClickHouse and reading from ClickHouse. Based on the access form supported by [ClickHouse - HTTP client](https://clickhouse.com/docs/zh/interfaces/http/) -and [JDBC driver](https://clickhouse.com/docs/zh/interfaces/jdbc), StreamPark encapsulates ClickHouseSink for writing data to ClickHouse in real-time. +and [JDBC driver](https://clickhouse.com/docs/zh/interfaces/jdbc), Apache StreamPark encapsulates ClickHouseSink for writing data to ClickHouse in real-time. `ClickHouse` writes do not support transactions, using JDBC write data to it could provide (AT_LEAST_ONCE) semanteme. Using the HTTP client to write asynchronously, it will retry the asynchronous write multiple times. The failed data will be written to external components (Kafka, MySQL, HDFS, HBase), @@ -65,10 +65,10 @@ public class ClickHouseUtil { The method of splicing various parameters into the request url is cumbersome and hard-coded, which is very inflexible. -### Write with StreamPark +### Write with Apache StreamPark -To access `ClickHouse` data with `StreamPark`, you only need to define the configuration file in the specified format and then write code. -The configuration and code are as follows. The configuration of `ClickHose JDBC` in `StreamPark` is in the configuration list, and the sample running program is scala +To access `ClickHouse` data with `Apache StreamPark`, you only need to define the configuration file in the specified format and then write code. +The configuration and code are as follows. The configuration of `ClickHose JDBC` in `Apache StreamPark` is in the configuration list, and the sample running program is scala #### configuration list @@ -147,14 +147,14 @@ Clickhouse INSERT must insert data through the POST method. 
The general operatio $ echo 'INSERT INTO t VALUES (1),(2),(3)' | curl 'http://localhost:8123/' --data-binary @- ``` -The operation of the above method is relatively simple. Sure java could also be used for writing. StreamPark adds many functions to the http post writing method, +The operation of the above method is relatively simple. Sure java could also be used for writing. Apache StreamPark adds many functions to the http post writing method, including encapsulation enhancement, adding cache, asynchronous writing, failure retry, and data backup after reaching the retry threshold, To external components (kafka, mysql, hdfs, hbase), etc., the above functions only need to define the configuration file in the prescribed format, and write the code. ### Write to ClickHouse -The configuration of `ClickHose JDBC` in `StreamPark` is in the configuration list, and the sample running program is scala, as follows: +The configuration of `ClickHose JDBC` in `Apache StreamPark` is in the configuration list, and the sample running program is scala, as follows: asynchttpclient is used as an HTTP asynchronous client for writing. first, import the jar of asynchttpclient ```xml diff --git a/docs/connector/4-doris.md b/docs/connector/4-doris.md index cdedc7fc0..017990326 100644 --- a/docs/connector/4-doris.md +++ b/docs/connector/4-doris.md @@ -12,11 +12,11 @@ import TabItem from '@theme/TabItem'; [Apache Doris](https://doris.apache.org/) is a high-performance, and real-time analytical database, which could support high-concurrent point query scenarios. -StreamPark encapsulates DoirsSink for writing data to Doris in real-time, based on [Doris' stream loads](https://doris.apache.org/administrator-guide/load-data/stream-load-manual.html) +Apache StreamPark encapsulates DoirsSink for writing data to Doris in real-time, based on [Doris' stream loads](https://doris.apache.org/administrator-guide/load-data/stream-load-manual.html) -### Write with StreamPark +### Write with Apache StreamPark -Use `StreamPark` to write data to `Doris`. DorisSink only supports JSON format (single-layer) writing currently, +Use `Apache StreamPark` to write data to `Doris`. DorisSink only supports JSON format (single-layer) writing currently, such as: {"id":1,"name":"streampark"} The example of the running program is java, as follows: #### configuration list diff --git a/docs/connector/5-es.md b/docs/connector/5-es.md index 7f1874f9b..1d8b1b14e 100755 --- a/docs/connector/5-es.md +++ b/docs/connector/5-es.md @@ -14,12 +14,12 @@ for Elasticsearch, which is used to write data to Elasticsearch, which can provi ElasticsearchSink uses TransportClient (before 6.x) or RestHighLevelClient (starting with 6.x) to communicate with the Elasticsearch cluster. -`StreamPark` further encapsulates Flink-connector-elasticsearch6, shields development details, and simplifies write +`Apache StreamPark` further encapsulates Flink-connector-elasticsearch6, shields development details, and simplifies write operations for Elasticsearch6 and above. :::tip hint -Because there are conflicts between different versions of Flink Connector Elasticsearch, StreamPark temporarily only +Because there are conflicts between different versions of Flink Connector Elasticsearch, Apache StreamPark temporarily only supports write operations of Elasticsearch6 and above. 
If you wants to using Elasticsearch5, you need to exclude the flink-connector-elasticsearch6 dependency and introduce the flink-connector-elasticsearch5 dependency to create org.apache.flink.streaming.connectors.elasticsearch5.ElasticsearchSink instance writes data. @@ -200,11 +200,11 @@ input.addSink(esSinkBuilder.build) -The ElasticsearchSink created above is very inflexible to add parameters. `StreamPark` follows the concept of convention over configuration and automatic configuration. -Users only need to configure es connection parameters and Flink operating parameters, and StreamPark will automatically assemble source and sink, +The ElasticsearchSink created above is very inflexible to add parameters. `Apache StreamPark` follows the concept of convention over configuration and automatic configuration. +Users only need to configure es connection parameters and Flink operating parameters, and Apache StreamPark will automatically assemble source and sink, which greatly simplifies development logic and improves development efficiency and maintainability. -## Using StreamPark writes to Elasticsearch +## Using Apache StreamPark writes to Elasticsearch Please ensure that operation requests are sent to the Elasticsearch cluster at least once after enabling Flink checkpointing in ESSink. @@ -234,7 +234,7 @@ host: localhost:9200 ### 2. 写入Elasticsearch -Using StreamPark writes to Elasticsearch +Using Apache StreamPark writes to Elasticsearch @@ -289,7 +289,7 @@ object ConnectorApp extends FlinkStreaming { -Flink ElasticsearchSinkFunction可以执行多种类型请求,如(DeleteRequest、 UpdateRequest、IndexRequest),StreamPark也对以上功能进行了支持,对应方法如下: +Flink ElasticsearchSinkFunction可以执行多种类型请求,如(DeleteRequest、 UpdateRequest、IndexRequest),Apache StreamPark也对以上功能进行了支持,对应方法如下: ```scala import org.apache.streampark.flink.core.scala.StreamingContext @@ -376,8 +376,8 @@ See [Official Documentation](https://nightlies.apache.org/flink/flink-docs-relea The BulkProcessor inside es can further configure its behavior of how to refresh the cache operation request, see the [official documentation](https://nightlies.apache.org/flink/flink-docs-release-1.14/zh/docs/connectors/datastream/elasticsearch/#elasticsearch-sink) for details - **Configuring the Internal** Bulk Processor -### StreamPark configuration +### Apache StreamPark configuration -All other configurations must comply with the StreamPark configuration. +All other configurations must comply with the Apache StreamPark configuration. For [specific configurable](/docs/development/conf) items and the role of each parameter, please refer to the project configuration diff --git a/docs/connector/6-hbase.md b/docs/connector/6-hbase.md index eeecb3bf3..2ae38e749 100755 --- a/docs/connector/6-hbase.md +++ b/docs/connector/6-hbase.md @@ -11,15 +11,15 @@ import TabItem from '@theme/TabItem'; large-scale structured storage clusters can be built on cheap PC Servers. Unlike general relational databases, HBase is a database suitable for unstructured data storage because HBase storage is based on a column rather than a row-based schema. -Flink does not officially provide a connector for Hbase DataStream. StreamPark encapsulates HBaseSource and HBaseSink based on `Hbase-client`. -It supports automatic connection creation based on configuration and simplifies development. StreamPark reading Hbase can record the latest status of the read data when the checkpoint is enabled, +Flink does not officially provide a connector for Hbase DataStream. 
Apache StreamPark encapsulates HBaseSource and HBaseSink based on `Hbase-client`. +It supports automatic connection creation based on configuration and simplifies development. Apache StreamPark reading Hbase can record the latest status of the read data when the checkpoint is enabled, and the offset corresponding to the source can be restored through the data itself. Implement source-side AT_LEAST_ONCE. HbaseSource implements Flink Async I/O to improve streaming throughput. The sink side supports AT_LEAST_ONCE by default. EXACTLY_ONCE is supported when checkpointing is enabled. :::tip hint -StreamPark reading HBASE can record the latest state of the read data when checkpoint is enabled. +Apache StreamPark reading HBASE can record the latest state of the read data when checkpoint is enabled. Whether the previous state can be restored after the job is resumed depends entirely on whether the data itself has an offset identifier, which needs to be manually specified in the code. The recovery logic needs to be specified in the func parameter of the getDataStream method of HBaseSource. ::: @@ -239,11 +239,11 @@ class HBaseWriter extends RichSinkFunction { -Reading and writing HBase in this way is cumbersome and inconvenient. `StreamPark` follows the concept of convention over configuration and automatic configuration. -Users only need to configure Hbase connection parameters and Flink operating parameters. StreamPark will automatically assemble source and sink, +Reading and writing HBase in this way is cumbersome and inconvenient. `Apache StreamPark` follows the concept of convention over configuration and automatic configuration. +Users only need to configure Hbase connection parameters and Flink operating parameters. Apache StreamPark will automatically assemble source and sink, which greatly simplifies development logic and improves development efficiency and maintainability. -## write and read Hbase with StreamPark +## write and read Hbase with Apache StreamPark ### 1. Configure policies and connection information @@ -260,7 +260,7 @@ hbase: ### 2. Read and write HBase -Writing to Hbase with StreamPark is very simple, the code is as follows: +Writing to Hbase with Apache StreamPark is very simple, the code is as follows: @@ -363,7 +363,7 @@ object HBaseSinkApp extends FlinkStreaming { -When StreamPark writes to HBase, you need to create the method of HBaseQuery, +When Apache StreamPark writes to HBase, you need to create the method of HBaseQuery, specify the method to convert the query result into the required object, identify whether it is running, and pass in the running parameters. details as follows ```scala @@ -391,7 +391,7 @@ class HBaseSource(@(transient@param) val ctx: StreamingContext, property: Proper } ``` -StreamPark HbaseSource implements flink Async I/O, which is used to improve the throughput of Streaming: first create a DataStream, +Apache StreamPark HbaseSource implements flink Async I/O, which is used to improve the throughput of Streaming: first create a DataStream, then create an HBaseRequest and call requestOrdered() or requestUnordered() to create an asynchronous stream, as follows: ```scala class HBaseRequest[T: TypeInformation](@(transient@param) private val stream: DataStream[T], property: Properties = new Properties()) { @@ -430,7 +430,7 @@ class HBaseRequest[T: TypeInformation](@(transient@param) private val stream: Da } ``` -StreamPark supports two ways to write data: 1. addSink() 2. 
writeUsingOutputFormat Examples are as follows: +Apache StreamPark supports two ways to write data: 1. addSink() 2. writeUsingOutputFormat Examples are as follows: ```scala //1)Insert way 1 HBaseSink().sink[TestEntity](source, "order") @@ -445,4 +445,4 @@ StreamPark supports two ways to write data: 1. addSink() 2. writeUsingOutputForm ## Other configuration -All other configurations must comply with the StreamPark configuration. For specific configurable items and the role of each parameter, please refer to the 【project configuration](/docs/development/conf) +All other configurations must comply with the Apache StreamPark configuration. For specific configurable items and the role of each parameter, please refer to the 【project configuration](/docs/development/conf) diff --git a/docs/connector/7-http.md b/docs/connector/7-http.md index e6fd6ed4f..2df14ea9a 100755 --- a/docs/connector/7-http.md +++ b/docs/connector/7-http.md @@ -9,7 +9,7 @@ import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; Some background services receive data through HTTP requests. In this scenario, Flink can write result data through HTTP -requests. Currently, Flink officially does not provide a connector for writing data through HTTP requests. StreamPark +requests. Currently, Flink officially does not provide a connector for writing data through HTTP requests. Apache StreamPark encapsulates HttpSink to write data asynchronously in real-time based on asynchttpclient. `HttpSink` writes do not support transactions, writing data to the target service provides AT_LEAST_ONCE semantics. Data @@ -29,7 +29,7 @@ Asynchronous writing uses asynchttpclient as the client, you need to import the ``` -## Write with StreamPark +## Write with Apache StreamPark ### http asynchronous write support type @@ -141,5 +141,5 @@ After the asynchronous write data reaches the maximum retry value, the data will ::: ## Other configuration -All other configurations must comply with the **StreamPark** configuration. +All other configurations must comply with the **Apache StreamPark** configuration. For specific configurable items and the role of each parameter, please refer [Project configuration](/docs/development/conf) diff --git a/docs/connector/8-redis.md b/docs/connector/8-redis.md index 94b2badea..141293a7c 100644 --- a/docs/connector/8-redis.md +++ b/docs/connector/8-redis.md @@ -9,10 +9,10 @@ import TabItem from '@theme/TabItem'; [Redis](http://www.redis.cn/) is an open source in-memory data structure storage system that can be used as a database, cache, and messaging middleware. It supports many types of data structures such as strings, hashes, lists, sets, ordered sets and range queries, bitmaps, hyperlogloglogs and geospatial index radius queries. Redis has built-in transactions and various levels of disk persistence, and provides high availability through Redis Sentinel and Cluster. - Flink does not officially provide a connector for writing reids data.StreamPark is based on [Flink Connector Redis](https://bahir.apache.org/docs/flink/current/flink-streaming-redis/) + Flink does not officially provide a connector for writing reids data.Apache StreamPark is based on [Flink Connector Redis](https://bahir.apache.org/docs/flink/current/flink-streaming-redis/) It encapsulates RedisSink, configures redis connection parameters, and automatically creates redis connections to simplify development. 
Currently, RedisSink supports the following connection methods: single-node mode, sentinel mode, and cluster mode because it does not support transactions. -StreamPark uses Redis' **MULTI** command to open a transaction and the **EXEC** command to commit a transaction, see the link for details: +Apache StreamPark uses Redis' **MULTI** command to open a transaction and the **EXEC** command to commit a transaction, see the link for details: http://www.redis.cn/topics/transactions.html , using RedisSink supports AT_LEAST_ONCE (at least once) processing semantics by default. EXACTLY_ONCE semantics are supported with checkpoint enabled. :::tip tip @@ -21,7 +21,7 @@ EXACTLY_ONCE semantics will write to redis in batch when the flink job checkpoin ::: ## Redis Write Dependency -Flink Connector Redis officially provides two kinds, the following two api are the same, StreamPark is using org.apache.bahir dependency. +Flink Connector Redis officially provides two kinds, the following two api are the same, Apache StreamPark is using org.apache.bahir dependency. ```xml org.apache.bahir @@ -163,10 +163,10 @@ public class FlinkRedisSink { } ``` -The above creation of FlinkJedisPoolConfig is tedious, and each operation of redis has to build RedisMapper, which is very insensitive. `StreamPark` uses a convention over configuration and automatic configuration. This only requires configuring redis -StreamPark automatically assembles the source and sink parameters, which greatly simplifies the development logic and improves development efficiency and maintainability. +The above creation of FlinkJedisPoolConfig is tedious, and each operation of redis has to build RedisMapper, which is very insensitive. `Apache StreamPark` uses a convention over configuration and automatic configuration. This only requires configuring redis +Apache StreamPark automatically assembles the source and sink parameters, which greatly simplifies the development logic and improves development efficiency and maintainability. -## StreamPark Writes to Redis +## Apache StreamPark Writes to Redis RedisSink defaults to AT_LEAST_ONCE (at least once) processing semantics, two-stage segment submission supports EXACTLY_ONCE semantics with checkpoint enabled, available connection types: single-node mode, sentinel mode. @@ -216,7 +216,7 @@ redis.sink: ### 2. Write to Redis -Writing to redis with StreamPark is very simple, the code is as follows: +Writing to redis with Apache StreamPark is very simple, the code is as follows: @@ -277,7 +277,7 @@ case class RedisMapper[T](cmd: RedisCommand, additionalKey: String, key: T => St -As the code shows, StreamPark automatically loads the configuration to create a RedisSink, and the user completes the redis write operation by creating the required RedisMapper object, **additionalKey is the outermost key when hset is invalid for other write commands**. +As the code shows, Apache StreamPark automatically loads the configuration to create a RedisSink, and the user completes the redis write operation by creating the required RedisMapper object, **additionalKey is the outermost key when hset is invalid for other write commands**. RedisSink.sink() write the corresponding key corresponding to the data is required to specify the expiration time, if not specified default expiration time is java Integer.MAX_VALUE (67 years). As shown in the code. ```scala @@ -357,11 +357,11 @@ public enum RedisCommand { ``` :::info Warning -RedisSink currently supports single-node mode and sentinel mode connections. 
And its cluster mode does not support transactions, but StreamPark is currently for support. Please call the official Flink Connector Redis api if you have a usage scenario.
+RedisSink currently supports single-node mode and sentinel mode connections. Cluster mode is not yet supported by Apache StreamPark because it does not support transactions; if you have such a usage scenario, please call the official Flink Connector Redis API directly.
Checkpoint must be enabled under EXACTLY_ONCE semantics, otherwise the program will throw parameter exceptions.
EXACTLY_ONCE semantics checkpoint data sink cache inside the memory, you need to reasonably set the checkpoint interval according to the actual data, otherwise there is a risk of **oom**.
::: ## Other Configuration -All other configurations must adhere to the **StreamPark** configuration, please refer to [project configuration](/docs/development/conf) for specific configurable items and the role of each parameter. +All other configurations must adhere to the **Apache StreamPark** configuration, please refer to [project configuration](/docs/development/conf) for specific configurable items and the role of each parameter. diff --git a/docs/development/alert-conf.md b/docs/development/alert-conf.md index b0680357c..56d07842c 100644 --- a/docs/development/alert-conf.md +++ b/docs/development/alert-conf.md @@ -7,14 +7,14 @@ sidebar_position: 3 import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; -`StreamPark` supports a variety of alerts, mainly as follows: +`Apache StreamPark` supports a variety of alerts, mainly as follows: * **E-mail**: Mail notification * **DingTalk**: DingTalk Custom Group Robot * **WeChat**: Enterprise WeChat Custom Group Robot * **Lark**: Feishu Custom Group Robot -StreamPark support any combination of alerts +Apache StreamPark support any combination of alerts :::info Future plan @@ -24,7 +24,7 @@ StreamPark support any combination of alerts ## Added alert configuration -`Click StreamPark` -> Setting on the left, then click `Alert Setting` to enter the alert configuration. +`Click Apache StreamPark` -> Setting on the left, then click `Alert Setting` to enter the alert configuration. ![alert_add_setting.png](/doc/image/alert/alert_add_setting.png) Click `Add New` to add alert configuration: diff --git a/docs/development/conf.md b/docs/development/conf.md index 3c8b1a303..f91eb81d0 100755 --- a/docs/development/conf.md +++ b/docs/development/conf.md @@ -16,7 +16,7 @@ ClientFailureRate, ClientTables } from '../components/TableData.jsx'; -Configuration is very important in `StreamPark`. +Configuration is very important in `Apache StreamPark`. ## Why do I need to configure @@ -100,9 +100,9 @@ A simpler method should be used, such as simplifying some environment initializa **Absolutely** -`StreamPark` proposes the concept of unified program configuration, which is generated by configuring a series of parameters from development to deployment in the `application.yml`according to a specific format a general configuration template, so that the initialization of the environment can be completed by transferring the configuration of the project to the program when the program is started. This is the concept of `configuration file`. +`Apache StreamPark` proposes the concept of unified program configuration, which is generated by configuring a series of parameters from development to deployment in the `application.yml`according to a specific format a general configuration template, so that the initialization of the environment can be completed by transferring the configuration of the project to the program when the program is started. This is the concept of `configuration file`. -`StreamPark` provides a higher level of abstraction for the `Flink SQL`, developers only need to define SQL to `sql.yaml`, when the program is started, the `sql.yaml` is transferred to the main program, and the SQL will be automatically loaded and executed. This is the concept of `sql file`. +`Apache StreamPark` provides a higher level of abstraction for the `Flink SQL`, developers only need to define SQL to `sql.yaml`, when the program is started, the `sql.yaml` is transferred to the main program, and the SQL will be automatically loaded and executed. 
This is the concept of `sql file`. ## Terms @@ -112,7 +112,7 @@ The SQL extracted in Flink SQL task is put into `sql.yaml`, this file with speci ## Configuration file -In StreamPark, the configuration file of `DataStream` job and `Flink Sql` are common. In other words, this configuration file can define the configurations of `DataStream` and `Flink Sql` (the configuration file in Flink SQL job is optional). The format of the configuration file must be `yaml` and must meet the requirements of yaml. +In Apache StreamPark, the configuration file of `DataStream` job and `Flink Sql` are common. In other words, this configuration file can define the configurations of `DataStream` and `Flink Sql` (the configuration file in Flink SQL job is optional). The format of the configuration file must be `yaml` and must meet the requirements of yaml. How to configure this configuration file and what to pay attention to. @@ -127,7 +127,7 @@ flink: jobmanager: property: #@see: https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/config.html $internal.application.main: org.apache.streampark.flink.quickstart.QuickStartApp - pipeline.name: StreamPark QuickStart App + pipeline.name: Apache StreamPark QuickStart App yarn.application.queue: taskmanager.numberOfTaskSlots: 1 parallelism.default: 2 @@ -227,7 +227,7 @@ There are many basic parameters. The five most basic parameters are as follows. `$internal.application.main` and `pipeline.name` must be set. ::: -If you need to set more parameters, please refer to [`here`](https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/config.html), These parameters must be placed under the property and the parameter names must be correct. StreamPark will automatically resolve these parameters and take effect. +If you need to set more parameters, please refer to [`here`](https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/config.html), These parameters must be placed under the property and the parameter names must be correct. Apache StreamPark will automatically resolve these parameters and take effect. ##### Memory parameters @@ -399,7 +399,7 @@ sql: | :::danger attention -In the above content, | after SQL is required. In addition, | will retain the format of the whole section. StreamPark can directly define multiple SQLs at once. Each SQLs must be separated by semicolons, and each section of SQLs must follow the format and specification specified by Flink SQL. +In the above content, | after SQL is required. In addition, | will retain the format of the whole section. Apache StreamPark can directly define multiple SQLs at once. Each SQLs must be separated by semicolons, and each section of SQLs must follow the format and specification specified by Flink SQL. ::: ## Summary diff --git a/docs/development/model.md b/docs/development/model.md index 809e61d95..06a8130de 100644 --- a/docs/development/model.md +++ b/docs/development/model.md @@ -9,7 +9,7 @@ import TabItem from '@theme/TabItem'; There are some rules and conventions to be followed in any framework. Only by following and mastering these rules can we use them more easily and achieve twice the result with half the effort.When we develop Flink job, we actually use the API provided by Flink to write an executable program (which must have a `main()` function) according to the development method required by Flink. 
We access various`Connector`in the program, and after a series of `operator`operations, we finally sink the data to the target storage through the `Connector` . -We call this method of step-by-step programming according to certain agreed rules the "programming paradigm". In this chapter, we will talk about the "programming paradigm" of StreamPark and the development considerations. +We call this method of step-by-step programming according to certain agreed rules the "programming paradigm". In this chapter, we will talk about the "programming paradigm" of Apache StreamPark and the development considerations. Let's start from these aspects @@ -25,11 +25,11 @@ Let's start from these aspects ![](/doc/image_en/streampark_archite.png) ## Programming paradigm -`streampark-core` is positioned as a programming time framework, rapid development scaffolding, specifically created to simplify Flink development. Developers will use this module during the development phase. Let's take a look at what the programming paradigm of `DataStream` and `Flink Sql` with StreamPark looks like, and what the specifications and requirements are. +`streampark-core` is positioned as a programming time framework, rapid development scaffolding, specifically created to simplify Flink development. Developers will use this module during the development phase. Let's take a look at what the programming paradigm of `DataStream` and `Flink Sql` with Apache StreamPark looks like, and what the specifications and requirements are. ### DataStream -StreamPark provides both `scala` and `Java` APIs to develop `DataStream` programs, the specific code development is as follows. +Apache StreamPark provides both `scala` and `Java` APIs to develop `DataStream` programs, the specific code development is as follows. @@ -77,7 +77,7 @@ To develop with the `scala` API, the program must inherit from `FlinkStreaming`. Development with the `Java` API can not omit the `main()` method due to the limitations of the language itself, so it will be a standard `main()` function,. The user needs to create the `StreamingContext` manually. `StreamingContext` is a very important class, which will be introduced later. :::tip tip -The above lines of `scala` and `Java` code are the basic skeleton code necessary to develop `DataStream` with StreamPark. Developing a `DataStream` program with StreamPark. Starting from these lines of code, Java API development requires the developer to manually start the task `start`. +The above lines of `scala` and `Java` code are the basic skeleton code necessary to develop `DataStream` with Apache StreamPark. Developing a `DataStream` program with Apache StreamPark. Starting from these lines of code, Java API development requires the developer to manually start the task `start`. ::: ### Flink Sql @@ -86,11 +86,11 @@ The TableEnvironment is used to create the contextual execution environment for The Flink community has been promoting the batch processing capability of DataStream and unifying the stream-batch integration, and in Flink 1.12, the stream-batch integration is truly unified, many historical APIs such as: DataSet API, BatchTableEnvironment API, etc. are deprecated and retired from the history stage. TableEnvironment** and **StreamTableEnvironment**. - StreamPark provides a more convenient API for the development of **TableEnvironment** and **StreamTableEnvironment** environments. + Apache StreamPark provides a more convenient API for the development of **TableEnvironment** and **StreamTableEnvironment** environments. 
#### TableEnvironment -To develop Table & SQL jobs, TableEnvironment will be the recommended entry class for Flink, supporting both Java API and Scala API, the following code demonstrates how to develop a TableEnvironment type job in StreamPark +To develop Table & SQL jobs, TableEnvironment will be the recommended entry class for Flink, supporting both Java API and Scala API, the following code demonstrates how to develop a TableEnvironment type job in Apache StreamPark @@ -128,7 +128,7 @@ public class JavaTableApp { :::tip tip -The above lines of Scala and Java code are the essential skeleton code for developing a TableEnvironment with StreamPark. +The above lines of Scala and Java code are the essential skeleton code for developing a TableEnvironment with Apache StreamPark. Scala API must inherit FlinkTable, Java API development needs to manually construct TableContext, and the developer needs to manually start the task `start`. ::: @@ -136,7 +136,7 @@ Scala API must inherit FlinkTable, Java API development needs to manually constr #### StreamTableEnvironment `StreamTableEnvironment` is used in stream computing scenarios, where the object of stream computing is a `DataStream`. Compared to `TableEnvironment`, `StreamTableEnvironment` provides an interface to convert between `DataStream` and `Table`. If your application is written using the `DataStream API` in addition to the `Table API` & `SQL`, you need to use the `StreamTableEnvironment`. -The following code demonstrates how to develop a `StreamTableEnvironment` type job in StreamPark. +The following code demonstrates how to develop a `StreamTableEnvironment` type job in Apache StreamPark. @@ -181,12 +181,12 @@ public class JavaStreamTableApp { :::tip tip -The above lines of scala and Java code are the essential skeleton code for developing `StreamTableEnvironment` with StreamPark, and for developing `StreamTableEnvironment` programs with StreamPark. Starting from these lines of code, Java code needs to construct `StreamTableContext` manually, and `Java API` development requires the developer to start the task `start` manually. +The above lines of scala and Java code are the essential skeleton code for developing `StreamTableEnvironment` with Apache StreamPark, and for developing `StreamTableEnvironment` programs with Apache StreamPark. Starting from these lines of code, Java code needs to construct `StreamTableContext` manually, and `Java API` development requires the developer to start the task `start` manually. ::: ## RunTime Context -**RunTime Context** - **StreamingContext** , **TableContext** , **StreamTableContext** are three very important objects in StreamPark, next we look at the definition and role of these three **Context**. +**RunTime Context** - **StreamingContext** , **TableContext** , **StreamTableContext** are three very important objects in Apache StreamPark, next we look at the definition and role of these three **Context**.
@@ -224,7 +224,7 @@ class StreamingContext(val parameter: ParameterTool, private val environment: St This object is very important and will be used throughout the lifecycle of the task in the `DataStream` job. The `StreamingContext` itself inherits from the `StreamExecutionEnvironment`, and the configuration file is fully integrated into the `StreamingContext`, so that it is very easy to get various parameters from the `StreamingContext`. ::: -In StreamPark, `StreamingContext` is also the entry class for the Java API to write `DataStream` jobs, one of the constructors of `StreamingContext` is specially built for the Java API, the constructor is defined as follows: +In Apache StreamPark, `StreamingContext` is also the entry class for the Java API to write `DataStream` jobs, one of the constructors of `StreamingContext` is specially built for the Java API, the constructor is defined as follows: ```scala /** @@ -301,7 +301,7 @@ class TableContext(val parameter: ParameterTool, } ``` -In StreamPark, `TableContext` is also the entry class for the Java API to write `Table Sql` jobs of type `TableEnvironment`. One of the constructor methods of `TableContext` is a constructor specifically built for the `Java API`, which is defined as follows: +In Apache StreamPark, `TableContext` is also the entry class for the Java API to write `Table Sql` jobs of type `TableEnvironment`. One of the constructor methods of `TableContext` is a constructor specifically built for the `Java API`, which is defined as follows: ```scala @@ -391,7 +391,7 @@ class StreamTableContext(val parameter: ParameterTool, ``` -In StreamPark, `StreamTableContext` is the entry class for the Java API to write `Table Sql` jobs of type `StreamTableEnvironment`. One of the constructors of `StreamTableContext` is a function built specifically for the Java API, which is defined as follows: +In Apache StreamPark, `StreamTableContext` is the entry class for the Java API to write `Table Sql` jobs of type `StreamTableEnvironment`. One of the constructors of `StreamTableContext` is a function built specifically for the Java API, which is defined as follows: ```scala @@ -526,7 +526,7 @@ The **destroy** stage is an optional stage that requires developer participation ## Catalog Structure -The recommended project directory structure is as follows, please refer to the directory structure and configuration in [StreamPark-flink-quickstart](https://github.com/apache/incubator-streampark-quickstart) +The recommended project directory structure is as follows, please refer to the directory structure and configuration in [Apache StreamPark-flink-quickstart](https://github.com/apache/incubator-streampark-quickstart) ``` tree . @@ -602,11 +602,11 @@ assembly.xml is the configuration file needed for the assembly packaging plugin, ## Packaged Deployment -The recommended packaging mode in [streampark-flink-quickstart](https://github.com/apache/incubator-streampark-quickstart/tree/dev/quickstart-flink) is recommended. It runs `maven package` directly to generate a standard StreamPark recommended project package, after unpacking the directory structure is as follows. +The recommended packaging mode in [streampark-flink-quickstart](https://github.com/apache/incubator-streampark-quickstart/tree/dev/quickstart-flink) is recommended. It runs `maven package` directly to generate a standard Apache StreamPark recommended project package, after unpacking the directory structure is as follows. ``` text . 
StreamPark-flink-quickstart-1.0.0
├── bin
│   ├── startup.sh        //Launch Script
│   ├── setclasspath.sh   //Java environment variable-related scripts (used internally, not of concern to users)
@@ -616,7 +616,7 @@ StreamPark-flink-quickstart-1.0.0
│   ├── application.yaml  //Project's configuration file
│   ├── sql.yaml          // flink sql file
├── lib
│   └── StreamPark-flink-quickstart-1.0.0.jar //The project's jar package
└── temp
```
diff --git a/docs/flink-k8s/1-deployment.md b/docs/flink-k8s/1-deployment.md
index 0701713a3..b2698eda3 100644
--- a/docs/flink-k8s/1-deployment.md
+++ b/docs/flink-k8s/1-deployment.md
@@ -5,22 +5,22 @@ sidebar_position: 1
---

-StreamPark Flink Kubernetes is based on [Flink Native Kubernetes](https://ci.apache.org/projects/flink/flink-docs-stable/docs/deployment/resource-providers/native_kubernetes/) and support deployment modes as below:
+Apache StreamPark Flink Kubernetes is based on [Flink Native Kubernetes](https://ci.apache.org/projects/flink/flink-docs-stable/docs/deployment/resource-providers/native_kubernetes/) and supports the deployment modes below:

* Native-Kubernetes Application
* Native-Kubernetes Session

-At now, one StreamPark only supports one Kubernetes cluster.You can submit [Fearure Request Issue](https://github.com/apache/incubator-streampark/issues) , when multiple Kubernetes clusters are needed.
+Currently, one Apache StreamPark instance only supports one Kubernetes cluster. You can submit a [Feature Request Issue](https://github.com/apache/incubator-streampark/issues) when multiple Kubernetes clusters are needed.
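Because a single platform instance is tied to a single cluster, it can help to confirm which Kubernetes cluster the Apache StreamPark node is actually pointed at before submitting jobs. A minimal sanity check, assuming `kubectl` is installed on the node and `~/.kube/config` holds the intended credentials (both are covered in the requirements and connection configuration below):

```shell
# Show the context that will be used by default (read from ~/.kube/config)
kubectl config current-context

# Confirm the API server of that cluster is reachable from this node
kubectl cluster-info

# List the namespaces visible to the configured credentials;
# the namespace intended for Flink jobs should appear here
kubectl get namespaces
```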

## Environments requirement

-Additional operating environment to run StreamPark Flink-K8s is as below:
+The additional operating environment required to run Apache StreamPark Flink-K8s is as below:

* Kubernetes
-* Maven(StreamPark runNode)
-* Docker(StreamPark runNode)
+* Maven (Apache StreamPark run node)
+* Docker (Apache StreamPark run node)

-StreamPark entity can be deployed on Kubernetes nodes, and can also be deployed on node out of Kubernetes cluster when there are **smooth network** between the node and cluster.
+The Apache StreamPark instance can be deployed on a Kubernetes node, or on a node outside the Kubernetes cluster, as long as there is a **smooth network** between that node and the cluster.
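A quick way to verify these requirements on the Apache StreamPark node is sketched below; it only assumes the standard `mvn`, `docker` and `kubectl` command-line tools are on the PATH:

```shell
# Maven is used on the StreamPark run node to build job artifacts
mvn -version

# Docker is used on the StreamPark run node to build and push Flink images
docker info

# The node needs a smooth network path to the Kubernetes API server,
# and the configured account must be allowed to manage workloads
kubectl auth can-i create deployments --namespace default
```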

@@ -29,7 +29,7 @@ StreamPark entity can be deployed on Kubernetes nodes, and can also be deployed ### configuration for connecting Kubernetes -StreamPark connects to Kubernetes cluster by default connection credentials `~/.kube/config `.User can copy `.kube/config` from Kubernetes node to StreamPark nodes,or download it from Kubernetes provided by cloud service providers.If considering Permission constraints, User also can +Apache StreamPark connects to Kubernetes cluster by default connection credentials `~/.kube/config `.User can copy `.kube/config` from Kubernetes node to Apache StreamPark nodes,or download it from Kubernetes provided by cloud service providers.If considering Permission constraints, User also can generate custom account`s configuration by themselves. ```shell @@ -51,12 +51,12 @@ kubectl create clusterrolebinding flink-role-binding-default --clusterrole=edit ### Configuration for remote Docker service -On Setting page of StreamPark, user can configure the connection information for Docker service of Kubernetes cluster. +On Setting page of Apache StreamPark, user can configure the connection information for Docker service of Kubernetes cluster. ![docker register setting](/doc/image/docker_register_setting.png) -Building a Namespace named `streampark`(other name should be set at Setting page of StreamPark) at remote Docker.The namespace is push/pull space of StreamPark Flink image and Docker Register User should own `pull`/`push` permission of this namespace. +Building a Namespace named `streampark`(other name should be set at Setting page of Apache StreamPark) at remote Docker.The namespace is push/pull space of Apache StreamPark Flink image and Docker Register User should own `pull`/`push` permission of this namespace. ```shell @@ -81,9 +81,9 @@ parameter descriptions are as below: * **Flink Base Docker Image**: Base Flink Docker Image Tag can be obtained from [DockerHub - offical/flink](https://hub.docker.com/_/flink) .And user can also use private image when Docker Register Account owns `pull` permission of it. * **Rest-Service Exposed Type**:Description of candidate values for native Flink K8s configuration [kubernetes.rest-service.exposed.type](https://ci.apache.org/projects/flink/flink-docs-stable/docs/deployment/config/#kubernetes) : - * `ClusterIP`:ip that StreamPark can access; - * `LoadBalancer`:resource of LoadBalancer should be allocated in advance, Flink Namespace own permission of automatic binding,and StreamPark can access LoadBalancer`s gateway; - * `NodePort`:StreamPark can access all K8s nodes; + * `ClusterIP`:ip that Apache StreamPark can access; + * `LoadBalancer`:resource of LoadBalancer should be allocated in advance, Flink Namespace own permission of automatic binding,and Apache StreamPark can access LoadBalancer`s gateway; + * `NodePort`:Apache StreamPark can access all K8s nodes; * **Kubernetes Pod Template**: It`s Flink custom configuration of pod-template.The container-name must be flink-main-container. If the k8s pod needs a secret key to pull the docker image, please fill in the information about the secret key in the pod template file.The example pod-template is as below: @@ -114,7 +114,7 @@ The additional configuration of Flink-Native-Kubernetes Session Job will be deci ## other configuration -StreamPark parameter related to Flink-K8s in `applicaton.yml` are as below.And in most condition, it is no need to change it. 
+Apache StreamPark parameters related to Flink-K8s in `application.yml` are listed below. In most conditions there is no need to change them.

| Configuration item | Description | Default value |
|-----------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------|---------------|

diff --git a/docs/flink-k8s/2-k8s-pvc-integration.md b/docs/flink-k8s/2-k8s-pvc-integration.md
index 3ed1d97f0..e2bb50cd3 100755
--- a/docs/flink-k8s/2-k8s-pvc-integration.md
+++ b/docs/flink-k8s/2-k8s-pvc-integration.md
@@ -8,7 +8,7 @@ sidebar_position: 2

The support for pvc resource(mount file resources such as checkpoint/savepoint/logs and so on) is based on pod-template at current version。

-Users do not have to concern the Native-Kubernetes Session.It will be processed when Session Cluster is constructed .Native-Kubernetes Application can be constructed by configuring on StreamPark webpage using `pod-template`、`jm-pod-template`、`tm-pod-template`.
+Users do not need to be concerned with the Native-Kubernetes Session mode; PVC mounting is handled when the Session Cluster is constructed. For Native-Kubernetes Application mode, PVC resources can be configured on the Apache StreamPark web page using `pod-template`, `jm-pod-template` and `tm-pod-template`.
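As an illustration only, the sketch below writes a `pod-template` that mounts an existing PVC so that checkpoint/savepoint data survives pod restarts. The claim name `flink-checkpoint-pvc` and the mount path are assumptions for this example; the main container must keep the name `flink-main-container`, and the resulting YAML can then be pasted into the `pod-template` / `jm-pod-template` / `tm-pod-template` fields mentioned above:

```shell
# Hypothetical pod-template: mount a pre-created PVC (flink-checkpoint-pvc)
# under /opt/flink/checkpoints inside the Flink containers.
cat > pod-template.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: pod-template
spec:
  containers:
    # the container name is fixed for Flink native Kubernetes
    - name: flink-main-container
      volumeMounts:
        - name: checkpoint-data
          mountPath: /opt/flink/checkpoints
  volumes:
    - name: checkpoint-data
      persistentVolumeClaim:
        claimName: flink-checkpoint-pvc
EOF
```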
@@ -45,9 +45,9 @@ There are three ways to provide the dependency when using `rocksdb-backend`. 1. Flink Base Docker Image contains the dependency(user fix the dependency conflict by themself); -2. Put the dependency `flink-statebackend-rocksdb_xx.jar` to the path `Workspace/jars` in StreamPark ; +2. Put the dependency `flink-statebackend-rocksdb_xx.jar` to the path `Workspace/jars` in Apache StreamPark ; -3. Add the rockdb-backend dependency to StreamPark Dependency(StreamPark will fix the conflict automatically) : +3. Add the rockdb-backend dependency to Apache StreamPark Dependency(Apache StreamPark will fix the conflict automatically) : ![rocksdb dependency](/doc/image/rocksdb_dependency.png) diff --git a/docs/flink-k8s/3-hadoop-resource-integration.md b/docs/flink-k8s/3-hadoop-resource-integration.md index c03274d18..90bd0b20d 100644 --- a/docs/flink-k8s/3-hadoop-resource-integration.md +++ b/docs/flink-k8s/3-hadoop-resource-integration.md @@ -6,7 +6,7 @@ sidebar_position: 3 ## Using Hadoop resource in Flink on K8s -Using Hadoop resources under the StreamPark Flink-K8s runtime, such as checkpoint mount HDFS, read and write Hive, etc. The general process is as follows: +Using Hadoop resources under the Apache StreamPark Flink-K8s runtime, such as checkpoint mount HDFS, read and write Hive, etc. The general process is as follows: #### 1、HDFS @@ -26,7 +26,7 @@ flink-json-1.14.5.jar log4j-1.2-api-2.17.1.jar log4j-slf4j-impl- ​ This is to download the shaded jar and put it in the lib directory of flink. Take hadoop2 as an example, download `flink-shaded-hadoop-2-uber`:https://repo1.maven.org/maven2/org/apache/flink/flink-shaded-hadoop-2-uber/2.7.5-9.0/flink-shaded-hadoop-2-uber-2.7.5-9.0.jar -​ In addition, you can configure the shade jar in a dependent manner in the `Dependency` in the StreamPark task configuration. the following configuration: +​ In addition, you can configure the shade jar in a dependent manner in the `Dependency` in the Apache StreamPark task configuration. the following configuration: ```xml @@ -128,7 +128,7 @@ public static String getHadoopConfConfigMapName(String clusterId) { ​ c、`flink-sql-connector-hive`:https://repo1.maven.org/maven2/org/apache/flink/flink-sql-connector-hive-2.3.6_2.12/1.14.5/flink-sql-connector-hive-2.3.6_2.12-1.14.5.jar -​ Similarly, the above-mentioned hive-related jars can also be dependently configured in the `Dependency` in the task configuration of StreamPark in a dependent manner, which will not be repeated here. +​ Similarly, the above-mentioned hive-related jars can also be dependently configured in the `Dependency` in the task configuration of Apache StreamPark in a dependent manner, which will not be repeated here. ##### ii、Add hive configuration file (hive-site.xml) diff --git a/docs/intro.md b/docs/intro.md index 6ed75b7c5..c2140f1fe 100644 --- a/docs/intro.md +++ b/docs/intro.md @@ -1,6 +1,6 @@ --- id: 'intro' -title: 'What is StreamPark' +title: 'What is Apache StreamPark' sidebar_position: 1 --- @@ -12,15 +12,15 @@ make stream processing easier!!! ## 🚀 What is Apache StreamPark™ -`Apache StreamPark` is an easy-to-use stream processing application development framework and one-stop stream processing operation platform, Aimed at ease building and managing streaming applications, StreamPark provides scaffolding for writing streaming process logics with Apache Flink and Apache Spark. 
-StreamPark also provides a professional task management including task development, scheduling, interactive query, deployment, operation, maintenance, etc.
+`Apache StreamPark` is an easy-to-use stream processing application development framework and one-stop stream processing operation platform. Aimed at making it easier to build and manage streaming applications, Apache StreamPark provides scaffolding for writing stream processing logic with Apache Flink and Apache Spark.
+Apache StreamPark also provides professional task management, including task development, scheduling, interactive query, deployment, operation, maintenance, etc.

## Why Apache StreamPark™

Apache Flink and Apache Spark are widely used as the next generation of big data streaming computing engines. Based on a bench of excellent experiences combined with best practices, we extracted the task deployment and runtime parameters into the configuration files. In this way, an easy-to-use RuntimeContext with out-of-the-box connectors would bring easier and more efficient task development experience. It reduces the learning cost and development barriers, hence developers can focus on the business logic.

-On the other hand, It can be challenge for enterprises to use Flink & Spark if there is no professional management platform for Flink & Spark tasks during the deployment phase. StreamPark provides such a professional task management platform, including task development, scheduling, interactive query, deployment, operation, maintenance, etc.
+On the other hand, it can be challenging for enterprises to use Flink & Spark if there is no professional management platform for Flink & Spark tasks during the deployment phase. Apache StreamPark provides such a professional task management platform, including task development, scheduling, interactive query, deployment, operation, maintenance, etc.

## 🎉 Features
* Apache Flink & Apache Spark application development scaffold
@@ -32,19 +32,19 @@ On the other hand, It can be challenge for enterprises to use Flink & Spark if t

## 🏳‍🌈 Architecture of Apache StreamPark™

-The overall architecture of Apache StreamPark is shown in the following figure. Apache StreamPark consists of three parts, they are StreamPark-core and StreamPark-console.
+The overall architecture of Apache StreamPark is shown in the following figure. Apache StreamPark consists of two parts: Apache StreamPark-core and Apache StreamPark-console.

-![StreamPark Archite](/doc/image_en/streampark_archite.png)
+![Apache StreamPark Architecture](/doc/image_en/streampark_archite.png)

-### 1️⃣ StreamPark-core
+### 1️⃣ Apache StreamPark-core

-The positioning of `StreamPark-core` is a framework uesd while developing, it focuses on coding development, regulates configuration files, and develops in the convention over configuration guide.
-StreamPark-core provides a development-time RunTime Content and a series of out-of-the-box Connectors. Cumbersome operations are simplified by extending `DataStream-related` methods and integrating DataStream and `Flink sql` api .
+The positioning of `Apache StreamPark-core` is a framework used while developing: it focuses on coding development, regulates the configuration files, and follows the convention-over-configuration approach.
+Apache StreamPark-core provides a development-time RunTime Context and a series of out-of-the-box connectors. Cumbersome operations are simplified by extending `DataStream`-related methods and integrating the DataStream and `Flink SQL` APIs, so that
development efficiency and development experience will be highly improved because users can focus on the business. -### 2️⃣ StreamPark-console +### 2️⃣ Apache StreamPark-console -`StreamPark-console` is a comprehensive real-time `low code` data platform that can manage `Flink` tasks more convenient. +`Apache StreamPark-console` is a comprehensive real-time `low code` data platform that can manage `Flink` tasks more convenient. It integrates the experience of many best practices and integrates many functions such as project compilation, release, parameter configuration, startup, `savepoint`, `flame graph`, `Flink SQL`, monitoring, etc., which greatly simplifies the daily operation of Flink tasks and maintenance. The ultimate goal is to create a one-stop big data platform, diff --git a/docs/user-guide/1-deployment.md b/docs/user-guide/1-deployment.md index 90b576664..4de20a9be 100755 --- a/docs/user-guide/1-deployment.md +++ b/docs/user-guide/1-deployment.md @@ -6,9 +6,9 @@ sidebar_position: 1 import { DeploymentEnvs } from '../components/TableData.jsx'; -The overall component stack structure of StreamPark is as follows. It consists of two major parts: streampark-core and streampark-console. streampark-console is a very important module, positioned as a **integrated real-time data platform**, ** streaming data warehouse Platform**, **Low Code**, **Flink & Spark task hosting platform**, can better manage Flink tasks, integrate project compilation, publishing, parameter configuration, startup, savepoint, flame graph ( flame graph ), Flink SQL, monitoring and many other functions are integrated into one, which greatly simplifies the daily operation and maintenance of Flink tasks and integrates many best practices. Its ultimate goal is to create a one-stop big data solution that integrates real-time data warehouses and batches +The overall component stack structure of Apache StreamPark is as follows. It consists of two major parts: streampark-core and streampark-console. streampark-console is a very important module, positioned as a **integrated real-time data platform**, ** streaming data warehouse Platform**, **Low Code**, **Flink & Spark task hosting platform**, can better manage Flink tasks, integrate project compilation, publishing, parameter configuration, startup, savepoint, flame graph ( flame graph ), Flink SQL, monitoring and many other functions are integrated into one, which greatly simplifies the daily operation and maintenance of Flink tasks and integrates many best practices. Its ultimate goal is to create a one-stop big data solution that integrates real-time data warehouses and batches -![StreamPark Archite](/doc/image_en/streampark_archite.png) +![Apache StreamPark Archite](/doc/image_en/streampark_archite.png) streampark-console provides an out-of-the-box installation package. Before installation, there are some requirements for the environment. The specific requirements are as follows: @@ -16,7 +16,7 @@ streampark-console provides an out-of-the-box installation package. Before insta -At present, StreamPark has released tasks for Flink, and supports both `Flink on YARN` and `Flink on Kubernetes` modes. +At present, Apache StreamPark has released tasks for Flink, and supports both `Flink on YARN` and `Flink on Kubernetes` modes. ### Hadoop To use `Flink on YARN`, you need to install and configure Hadoop-related environment variables in the deployed cluster. 
For example, if you installed the hadoop environment based on CDH, @@ -34,7 +34,7 @@ export HADOOP_YARN_HOME=$HADOOP_HOME/../hadoop-yarn ### Kubernetes -Using `Flink on Kubernetes` requires additional deployment/or use of an existing Kubernetes cluster, please refer to the entry: [**StreamPark Flink-K8s Integration Support**](../flink-k8s/1-deployment.md). +Using `Flink on Kubernetes` requires additional deployment/or use of an existing Kubernetes cluster, please refer to the entry: [**Apache StreamPark Flink-K8s Integration Support**](../flink-k8s/1-deployment.md). ## Build & Deploy @@ -118,7 +118,7 @@ Go to `conf`, modify `conf/application.yml`, find the spring item, find the prof ```yaml spring: profiles.active: mysql #[h2,pgsql,mysql] - application.name: StreamPark + application.name: Apache StreamPark devtools.restart.enabled: false mvc.pathmatch.matching-strategy: ant_path_matcher servlet: @@ -177,7 +177,7 @@ Relevant logs will be output to **streampark-console-service-1.0.0/logs/streampa After the above steps, even if the deployment is completed, you can directly log in to the system -![StreamPark Login](/doc/image/streampark_login.jpeg) +![Apache StreamPark Login](/doc/image/streampark_login.jpeg) :::tip hint Default password: admin / streampark @@ -185,9 +185,9 @@ Default password: admin / streampark ## System Configuration -After entering the system, the first thing to do is to modify the system configuration. Under the menu/StreamPark/Setting, the operation interface is as follows: +After entering the system, the first thing to do is to modify the system configuration. Under the menu/Apache StreamPark/Setting, the operation interface is as follows: -![StreamPark Settings](/doc/image/streampark_settings_2.0.0.png) +![Apache StreamPark Settings](/doc/image/streampark_settings_2.0.0.png) The main configuration items are divided into the following categories diff --git a/docs/user-guide/11-platformInstall.md b/docs/user-guide/11-platformInstall.md index 5bc92a70e..17e7cd63e 100644 --- a/docs/user-guide/11-platformInstall.md +++ b/docs/user-guide/11-platformInstall.md @@ -15,13 +15,13 @@ ## Software Requirements Notes: -1. **For installing StreamPark alone, Hadoop can be ignored.** +1. **For installing Apache StreamPark alone, Hadoop can be ignored.** 2. If using yarn application mode for executing Flink jobs, Hadoop is required. 
> - JDK : 1.8+ > - MySQL : 5.6+ > - Flink : 1.12.0+ > - Hadoop : 2.7.0+ -> - StreamPark : 2.0.0+ +> - Apache StreamPark : 2.0.0+ Software versions used in this document: > - **JDK: 1.8.0_181** @@ -75,7 +75,7 @@ flink -v cp mysql-connector-java-8.0.28.jar /usr/local/streampark/lib ``` ![4_mysql_dep](/doc/image/install/4_mysql_dep.png) -## Download StreamPark +## Download Apache StreamPark > Download URL: [https://dlcdn.apache.org/incubator/streampark/2.0.0/apache-streampark_2.12-2.0.0-incubating-bin.tar.gz](https://dlcdn.apache.org/incubator/streampark/2.0.0/apache-streampark_2.12-2.0.0-incubating-bin.tar.gz) > Upload [apache-streampark_2.12-2.0.0-incubating-bin.tar.gz](https://dlcdn.apache.org/incubator/streampark/2.0.0/apache-streampark_2.12-2.0.0-incubating-bin.tar.gz) to the server /usr/local path @@ -90,11 +90,11 @@ tar -zxvf apache-streampark_2.12-2.0.0-incubating-bin.tar.gz # Installation ## Initialize System Data -> **Purpose: Create databases (tables) dependent on StreamPark component deployment, and pre-initialize the data required for its operation (e.g., web page menus, user information), to facilitate subsequent operations.** +> **Purpose: Create databases (tables) dependent on Apache StreamPark component deployment, and pre-initialize the data required for its operation (e.g., web page menus, user information), to facilitate subsequent operations.** ### View Execution of SteamPark Metadata SQL File > Explanation: -> - StreamPark supports MySQL, PostgreSQL, H2 +> - Apache StreamPark supports MySQL, PostgreSQL, H2 > - This document uses MySQL as an example; the PostgreSQL process is basically the same > Database creation script: /usr/local/apache-st @@ -131,7 +131,7 @@ show tables; ``` ![13_show_streampark_db_tables](/doc/image/install/13_show_streampark_db_tables.png) -## StreamPark Configuration +## Apache StreamPark Configuration > Purpose: Configure the data sources needed for startup. > Configuration file location: /usr/local/streampark/conf @@ -178,7 +178,7 @@ vim application.yml ![18_application_yml_ldap](/doc/image_en/install/18_application_yml_ldap.png) ### 【Optional】Configuring Kerberos -> Background: Enterprise-level Hadoop cluster environments have set security access mechanisms, such as Kerberos. StreamPark can also be configured with Kerberos, allowing Flink to authenticate through Kerberos and submit jobs to the Hadoop cluster. +> Background: Enterprise-level Hadoop cluster environments have set security access mechanisms, such as Kerberos. Apache StreamPark can also be configured with Kerberos, allowing Flink to authenticate through Kerberos and submit jobs to the Hadoop cluster. > **Modifications are as follows:** > 1. **security.kerberos.login.enable=true** @@ -188,13 +188,13 @@ vim application.yml > 5. 
**java.security.krb5.conf=/etc/krb5.conf** ![19_kerberos_yml_config](/doc/image/install/19_kerberos_yml_config.png) -## Starting StreamPark -## Enter the StreamPark Installation Path on the Server +## Starting Apache StreamPark +## Enter the Apache StreamPark Installation Path on the Server ```bash cd /usr/local/streampark/ ``` ![20_enter_streampark_dir](/doc/image/install/20_enter_streampark_dir.png) -## Start the StreamPark Service +## Start the Apache StreamPark Service ```bash ./bin/startup.sh ``` diff --git a/docs/user-guide/12-platformBasicUsage.md b/docs/user-guide/12-platformBasicUsage.md index 9c140fe52..6b574addc 100644 --- a/docs/user-guide/12-platformBasicUsage.md +++ b/docs/user-guide/12-platformBasicUsage.md @@ -1,5 +1,5 @@ # Quick Start -> Note: This section is designed to provide a convenient process for submitting Flink jobs using the StreamPark platform through simple operational steps. +> Note: This section is designed to provide a convenient process for submitting Flink jobs using the Apache StreamPark platform through simple operational steps. ## Configure FLINK_HOME ![1_config_flink_home](/doc/image_en/platform-usage/1_config_flink_home.png) @@ -9,7 +9,7 @@ ![3_display_flink_home_config](/doc/image_en/platform-usage/3_display_flink_home_config.png) ## Configure Flink Cluster -> Depending on the Flink deployment mode and resource management method, StreamPark supports the following six job modes: +> Depending on the Flink deployment mode and resource management method, Apache StreamPark supports the following six job modes: > - **Standalone Session** > - **Yarn Session** > - **Yarn Per-job** @@ -79,8 +79,8 @@ start-cluster.sh ![22_submit_flink_job_2](/doc/image_en/platform-usage/22_submit_flink_job_2.png) ## Check Job Status -### View via StreamPark Dashboard -> StreamPark dashboard +### View via Apache StreamPark Dashboard +> Apache StreamPark dashboard ![23_flink_job_dashboard](/doc/image_en/platform-usage/23_flink_job_dashboard.png) @@ -97,21 +97,21 @@ start-cluster.sh _web_ui_2.png) -> With this, the process of submitting a Flink job using the StreamPark platform is essentially complete. Below is a brief summary of the general process for managing Flink jobs on the StreamPark platform. +> With this, the process of submitting a Flink job using the Apache StreamPark platform is essentially complete. Below is a brief summary of the general process for managing Flink jobs on the Apache StreamPark platform. -## StreamPark Platform's Process for Managing Flink Jobs +## Apache StreamPark Platform's Process for Managing Flink Jobs ![28_streampark_process_workflow](/doc/image_en/platform-usage/28_streampark_process_workflow.png) -> Stopping, modifying, and deleting Flink jobs through the StreamPark platform is relatively simple and can be experienced by users themselves. It is worth noting that: **If a job is in a running state, it cannot be deleted and must be stopped first**. +> Stopping, modifying, and deleting Flink jobs through the Apache StreamPark platform is relatively simple and can be experienced by users themselves. It is worth noting that: **If a job is in a running state, it cannot be deleted and must be stopped first**. 
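Alongside the dashboard and the Flink Web UI, the native Flink REST API can be used as an extra cross-check of a job's state before stopping or deleting it on the platform. A small sketch, assuming the JobManager REST endpoint is reachable (the address below is a placeholder, adjust it to your deployment):

```shell
# Hypothetical JobManager REST address; replace with your own endpoint
JM=http://localhost:8081

# List all jobs known to this JobManager together with their current state
curl -s "$JM/jobs/overview"

# Inspect one job in detail (take the job id from the overview response)
curl -s "$JM/jobs/<job-id>"

# Show checkpoint statistics for that job
curl -s "$JM/jobs/<job-id>/checkpoints"
```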
-# StreamPark System Module Introduction +# Apache StreamPark System Module Introduction ## System Settings > Menu location ![29_streampark_system_menu](/doc/image_en/platform-usage/29_streampark_system_menu.png) ### User Management -> For managing users of the StreamPark platform +> For managing users of the Apache StreamPark platform ![30_streampark_user_management_menu](/doc/image_en/platform-usage/30_streampark_user_management_menu.png) ### Token Management @@ -151,9 +151,9 @@ curl -X POST '/flink/app/cancel' \ ![36_streampark_menu_management](/doc/image_en/platform-usage/36_streampark_menu_management.png) -## StreamPark Menu Modules +## Apache StreamPark Menu Modules ### Project -> StreamPark integrates with code repositories to achieve CICD +> Apache StreamPark integrates with code repositories to achieve CICD ![37_streampark_project_menu](/doc/image_en/platform-usage/37_streampark_project_menu.png) > To use, click "+ Add new," configure repo information, and save. @@ -213,11 +213,11 @@ curl -X POST '/flink/app/cancel' \ ![54_visit_flink_cluster_web_ui](/doc/image_en/platform-usage/54_visit_flink_cluster_web_ui.png) -# Using Native Flink with StreamPark -> 【**To be improved**】In fact, a key feature of StreamPark is the optimization of the management mode for native Flink jobs at the user level, enabling users to rapidly develop, deploy, run, and monitor Flink jobs using the platform. Meaning, if users are familiar with native Flink, they will find StreamPark even more intuitive to use. +# Using Native Flink with Apache StreamPark +> 【**To be improved**】In fact, a key feature of Apache StreamPark is the optimization of the management mode for native Flink jobs at the user level, enabling users to rapidly develop, deploy, run, and monitor Flink jobs using the platform. Meaning, if users are familiar with native Flink, they will find Apache StreamPark even more intuitive to use. ## Flink Deployment Modes -### How to Use in StreamPark +### How to Use in Apache StreamPark > **Session Mode** 1. Configure Flink Cluster @@ -248,7 +248,7 @@ flink run-application -t yarn-application \ -Dyarn.provided.lib.dirs="hdfs://myhdfs/my-remote-flink-dist-dir" \ hdfs://myhdfs/jars/my-application.jar ``` -### How to Use in StreamPark +### How to Use in Apache StreamPark > When creating or modifying a job, add in “Dynamic Properties” as per the specified format ![67_dynamic_params_usage](/doc/image_en/platform-usage/67_dynamic_params_usage.png) @@ -261,7 +261,7 @@ flink run-application -t yarn-application \ ![68_native_flink_restart_strategy](/doc/image_en/platform-usage/68_native_flink_restart_strategy.png) -### How to Use in StreamPark +### How to Use in Apache StreamPark > 【**To be improved**】Generally, alerts are triggered when a job fails or an anomaly occurs 1. 
Configure alert notifications @@ -283,7 +283,7 @@ flink run-application -t yarn-application \ ![72_native_flink_save_checkpoint_gramma](/doc/image_en/platform-usage/72_native_flink_save_checkpoint_gramma.png) -### How to Configure Savepoint in StreamPark +### How to Configure Savepoint in Apache StreamPark > Users can set a savepoint when stopping a job ![73_streampark_save_checkpoint](/doc/image_en/platform-usage/73_streampark_save_checkpoint.png) @@ -298,7 +298,7 @@ flink run-application -t yarn-application \ ![77_show_checkpoint_file_name_2](/doc/image_en/platform-usage/77_show_checkpoint_file_name_2.png) -### How to Restore a Job from a Specified Savepoint in StreamPark +### How to Restore a Job from a Specified Savepoint in Apache StreamPark > Users have the option to choose during job startup ![78_usage_checkpoint_in_streampark](/doc/image_en/platform-usage/78_usage_checkpoint_in_streampark.png) @@ -311,7 +311,7 @@ flink run-application -t yarn-application \ ![79_native_flink_job_status](/doc/image_en/platform-usage/79_native_flink_job_status.svg) -### Job Status in StreamPark +### Job Status in Apache StreamPark > 【**To be improved**】 @@ -321,10 +321,10 @@ flink run-application -t yarn-application \ ![80_native_flink_job_details_page](/doc/image_en/platform-usage/80_native_flink_job_details_page.png) -### Job Details in StreamPark +### Job Details in Apache StreamPark ![81_streampark_flink_job_details_page](/doc/image_en/platform-usage/81_streampark_flink_job_details_page.png) -> In addition, for jobs in k8s mode, StreamPark also supports real-time display of startup logs, as shown below +> In addition, for jobs in k8s mode, Apache StreamPark also supports real-time display of startup logs, as shown below ![82_streampark_flink_job_starting_log_info](/doc/image_en/platform-usage/82_streampark_flink_job_starting_log_info.png) @@ -333,8 +333,8 @@ flink run-application -t yarn-application \ > Native Flink provides a REST API > Reference: [https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/ops/rest_api/](https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/ops/rest_api/) -### How StreamPark Integrates with Third-Party Systems -> StreamPark also provides Restful APIs, supporting integration with other systems. +### How Apache StreamPark Integrates with Third-Party Systems +> Apache StreamPark also provides Restful APIs, supporting integration with other systems. > For example, it offers REST API interfaces for starting and stopping jobs. ![83_streampark_restful_api_1](/doc/image_en/platform-usage/83_streampark_restful_api_1.png) diff --git a/docs/user-guide/2-quickstart.md b/docs/user-guide/2-quickstart.md index 160dc2641..21688125b 100644 --- a/docs/user-guide/2-quickstart.md +++ b/docs/user-guide/2-quickstart.md @@ -8,7 +8,7 @@ sidebar_position: 2 The installation of the one-stop platform `streampark-console` has been introduced in detail in the previous chapter. In this chapter, let's see how to quickly deploy and run a job with `streampark-console`. The official structure and specification) and projects developed with `streampark` are well supported. Let's use `streampark-quickstart` to quickly start the journey of `streampark-console` -`streampark-quickstart` is a sample program for developing Flink by StreamPark. For details, please refer to: +`streampark-quickstart` is a sample program for developing Flink by Apache StreamPark. 
For details, please refer to:

- Github: [https://github.com/apache/incubator-streampark-quickstart.git](https://github.com/apache/incubator-streampark-quickstart)

diff --git a/docs/user-guide/3-development.md b/docs/user-guide/3-development.md
index c9efc4464..a77c26dd6 100755
--- a/docs/user-guide/3-development.md
+++ b/docs/user-guide/3-development.md
@@ -44,7 +44,7 @@ Copy the path of the extracted directory, for example: `${workspace}/incubator-s

### Start the Backend Service

Navigate to `streampark-console/streampark-console-service/src/main/java/org/apache/streampark/console/StreamParkConsoleBootstrap.java`

Modify the launch configuration

diff --git a/docs/user-guide/4-dockerDeployment.md b/docs/user-guide/4-dockerDeployment.md
index b8d38c391..1a8347b78 100644
--- a/docs/user-guide/4-dockerDeployment.md
+++ b/docs/user-guide/4-dockerDeployment.md
@@ -4,7 +4,7 @@ title: 'Docker Tutorial'
sidebar_position: 4
---

-This tutorial uses the docker method to deploy StreamPark via Docker.
+This tutorial deploys Apache StreamPark with Docker.

## Prepare

Docker 1.13.1+

@@ -18,9 +18,9 @@ To start the service with docker, you need to install [docker](https://www.docke

To start the service with docker-compose, you need to install [docker-compose](https://docs.docker.com/compose/install/) first

-## StreamPark Deployment
+## Apache StreamPark Deployment

-### 1. StreamPark deployment based on h2 and docker-compose
+### 1. Apache StreamPark deployment based on h2 and docker-compose

This method is suitable for beginners to learn and become familiar with the features. The configuration will reset after the container is restarted. Below, you can configure Mysql or Pgsql for persistence.

@@ -32,7 +32,7 @@ wget https://raw.githubusercontent.com/apache/incubator-streampark/dev/deploy/do
docker-compose up -d
```

-Once the service is started, StreamPark can be accessed through http://localhost:10000 and also through http://localhost:8081 to access Flink. Accessing the StreamPark link will redirect you to the login page, where the default user and password for StreamPark are admin and streampark respectively. To learn more about the operation, please refer to the user manual for a quick start.
+Once the service is started, Apache StreamPark can be accessed at http://localhost:10000 and Flink at http://localhost:8081. Opening the Apache StreamPark link redirects you to the login page; the default user and password for Apache StreamPark are admin and streampark respectively. To learn more about the operation, please refer to the user manual for a quick start.

### 3.
Configure flink home @@ -49,7 +49,7 @@ Note:When configuring the flink-sessin cluster address, the ip address is not lo ![](/doc/image/remoteSubmission.png) #### Use existing Mysql services -This approach is suitable for enterprise production, where you can quickly deploy StreamPark based on docker and associate it with an online database +This approach is suitable for enterprise production, where you can quickly deploy Apache StreamPark based on docker and associate it with an online database Note: The diversity of deployment support is maintained through the .env configuration file, make sure there is one and only one .env file in the directory ```shell @@ -92,7 +92,7 @@ SPRING_DATASOURCE_PASSWORD=streampark docker-compose up -d ``` -## Build images based on source code for StreamPark deployment +## Build images based on source code for Apache StreamPark deployment ``` git clone https://github.com/apache/incubator-streampark.git cd incubator-streampark/deploy/docker diff --git a/docs/user-guide/6-Team.md b/docs/user-guide/6-Team.md index 328e93b0f..594248948 100644 --- a/docs/user-guide/6-Team.md +++ b/docs/user-guide/6-Team.md @@ -8,16 +8,16 @@ sidebar_position: 6 ADMIN can select the user type when creating or modifying a user. There are two user types: ADMIN and USER. -- ADMIN means the system administrator, that is: the super administrator of StreamPark, who has all the permissions of - the StreamPark management page and each team. +- ADMIN means the system administrator, that is: the super administrator of Apache StreamPark, who has all the permissions of + the Apache StreamPark management page and each team. - USER means a normal user of the platform. Creating a USER is just creating an account. By default, users don't have any permissions on the platform. After account is created and the ADMIN binds it to some teams, USER will have permissions in the corresponding teams. ## Team Management -In order to facilitate the management of applications in different departments within the company, StreamPark supports -team management. ADMIN can create different teams for different departments on StreamPark. +In order to facilitate the management of applications in different departments within the company, Apache StreamPark supports +team management. ADMIN can create different teams for different departments on Apache StreamPark.

@@ -34,9 +34,9 @@ can view or operate the applications of the corresponding team. ## Role Management In order to facilitate application management and prevent misoperation, the team also needs to distinguish between -administrator and developer, so StreamPark introduces role management. +administrator and developer, so Apache StreamPark introduces role management. -Currently, StreamPark supports two roles: team admin and developer. The team admin has all the permissions in the team. +Currently, Apache StreamPark supports two roles: team admin and developer. The team admin has all the permissions in the team. Compared with the team admin, the developer has fewer permissions to delete applications and add USER to the team.

diff --git a/docs/user-guide/8-YarnQueueManagement.md b/docs/user-guide/8-YarnQueueManagement.md index 5dcf6bd51..e8d44a7f2 100644 --- a/docs/user-guide/8-YarnQueueManagement.md +++ b/docs/user-guide/8-YarnQueueManagement.md @@ -19,7 +19,7 @@ will be time-consuming and accompanied by poor user experience. If a task is submitted to an incorrect queue due to the error-input, it's likely to affect the stability of yarn applications on the queue and the abuse of the queue resource. -So StreamPark introduced the queue management feature to ensure that a set of added queues are shared within the same team, +So Apache StreamPark introduced the queue management feature to ensure that a set of added queues are shared within the same team, that is, ensure that queues resource is isolated within the scope of the team. It can generate the following benefits: - When deploying Flink `yarn-application` applications or Flink `yarn-session` clusters, it could set quickly and accurately yarn queue(`yarn.application.queue`) & labels(`yarn.application.node-label`). @@ -88,13 +88,13 @@ due to the relationship between the queues and the `yarn-session` mode flink clu

- Session cluster is shared by all teams. Why is it that when creating a `yarn-session` flink cluster, only the queues in the current team instead of all teams can be used in the queues candidate list ? -> Based on the mentioned above, StreamPark hopes that when creating a `yarn-session` flink cluster, +> Based on the mentioned above, Apache StreamPark hopes that when creating a `yarn-session` flink cluster, administrators can specify the queue belonged to current of the current team only, which could be better for administrators to perceive the impact of current operations on the current team. - Why not support the isolation for `flink yarn-session clusters / general clusters` on team wide ? - The impact range caused by changes in cluster visibility is larger than that caused by changes in queue visibility. - - StreamPark need to face greater difficulties in backward compatibility while also considering the user experience. + - Apache StreamPark need to face greater difficulties in backward compatibility while also considering the user experience. - At present, there is no exact research on the users group and applications scale deployed using `yarn-application` & `yarn-session` cluster modes in the community. Based on this fact, the community didn't provide greater feature support. diff --git a/i18n/zh-CN/docusaurus-plugin-content-blog/0-streampark-flink-on-k8s.md b/i18n/zh-CN/docusaurus-plugin-content-blog/0-streampark-flink-on-k8s.md index 39e96328f..fa2f4fc54 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-blog/0-streampark-flink-on-k8s.md +++ b/i18n/zh-CN/docusaurus-plugin-content-blog/0-streampark-flink-on-k8s.md @@ -1,7 +1,7 @@ --- slug: streampark-flink-on-k8s -title: StreamPark Flink on Kubernetes 实践 -tags: [StreamPark, 生产实践, FlinkSQL, Kubernetes] +title: Apache StreamPark Flink on Kubernetes 实践 +tags: [Apache StreamPark, 生产实践, FlinkSQL, Kubernetes] --- 雾芯科技创立于2018年1月。目前主营业务包括 RELX 悦刻品牌产品的研发、设计、制造及销售。凭借覆盖全产业链的核心技术与能力,RELX 悦刻致力于为用户提供兼具高品质和安全性的产品。 @@ -20,9 +20,9 @@ Native Kubernetes 和 Standalone Kubernetes 主要区别在于 Flink 与 Kuberne ![](/blog/relx/nativekubernetes_architecture.png) -## **当 Flink On Kubernetes 遇见 StreamPark** +## **当 Flink On Kubernetes 遇见 Apache StreamPark** -Flink on Native Kubernetes 目前支持 Application 模式和 Session 模式,两者对比 Application 模式部署规避了 Session 模式的资源隔离问题、以及客户端资源消耗问题,因此**生产环境更推荐采用 Application Mode 部署 Flink 任务。**下面我们分别看看使用原始脚本的方式和使用 StreamPark 开发部署一个 Flink on Native Kubernetes 作业的流程。 +Flink on Native Kubernetes 目前支持 Application 模式和 Session 模式,两者对比 Application 模式部署规避了 Session 模式的资源隔离问题、以及客户端资源消耗问题,因此**生产环境更推荐采用 Application Mode 部署 Flink 任务。**下面我们分别看看使用原始脚本的方式和使用 Apache StreamPark 开发部署一个 Flink on Native Kubernetes 作业的流程。 ### ***使用脚本方式部署Kubernetes*** @@ -69,17 +69,17 @@ kubectl -n flink-cluster get svc 以上就是使用 Flink 提供的最原始的脚本方式把一个 Flink 任务部署到 Kubernetes 上的过程,只做到了最基本的任务提交,如果要达到生产使用级别,还有一系列的问题需要解决,如:方式过于原始无法适配大批量任务、无法记录任务checkpoint 和实时状态跟踪、任务运维和监控困难、无告警机制、 无法集中化管理等等。 -## **使用 StreamPark 部署 Flink on Kubernetes** +## **使用 Apache StreamPark 部署 Flink on Kubernetes** ------ 对于企业级生产环境使用 Flink on Kubernetes 会有着更高的要求, 一般会选择自建平台或者购买相关商业产品, 不论哪种方案在产品能力上满足: **大批量任务开发部署、状态跟踪、运维监控、失败告警、任务统一管理、高可用性** 等这些是普遍的诉求。 -针对以上问题我们调研了开源领域支持开发部署 Flink on Kubernetes 任务的开源项目,调研的过程中也不乏遇到了其他优秀的开源项目,在综合对比了多个开源项目后得出结论: **StreamPark 不论是完成度还是使用体验、稳定性等整体表现都非常出色,因此最终选择了 StreamPark 作为我们的一站式实时计算平台。**下面我们看看 StreamPark 是如何支持 Flink on Kubernetes : +针对以上问题我们调研了开源领域支持开发部署 Flink on Kubernetes 任务的开源项目,调研的过程中也不乏遇到了其他优秀的开源项目,在综合对比了多个开源项目后得出结论: **Apache StreamPark 不论是完成度还是使用体验、稳定性等整体表现都非常出色,因此最终选择了 Apache StreamPark 
作为我们的一站式实时计算平台。**下面我们看看 Apache StreamPark 是如何支持 Flink on Kubernetes : ### **基础环境配置** -基础环境配置包括 Kubernetes 和 Docker 仓库信息以及 Flink 客户端的信息配置。对于 Kubernetes 基础环境最为简单的方式是直接拷贝 Kubernetes 节点的 .kube/config 到 StreamPark 节点用户目录,之后使用 kubectl 命令创建 Flink 专用的 Kubernetes Namespace 以及进行 RBAC 配置。 +基础环境配置包括 Kubernetes 和 Docker 仓库信息以及 Flink 客户端的信息配置。对于 Kubernetes 基础环境最为简单的方式是直接拷贝 Kubernetes 节点的 .kube/config 到 Apache StreamPark 节点用户目录,之后使用 kubectl 命令创建 Flink 专用的 Kubernetes Namespace 以及进行 RBAC 配置。 ```shell # 创建Flink作业使用的k8s namespace @@ -92,17 +92,17 @@ Docker 账户信息直接在 Docker Setting 界面配置即可: ![](/blog/relx/docker_setting.png) -StreamPark 可适配多版本 Flink 作业开发,Flink 客户端直接在 StreamPark Setting 界面配置即可: +Apache StreamPark 可适配多版本 Flink 作业开发,Flink 客户端直接在 Apache StreamPark Setting 界面配置即可: ![](/blog/relx/flinkversion_setting.png) ### **作业开发** -StreamPark 做好基础环境配置之后只需要三步即可开发部署一个 Flink 作业: +Apache StreamPark 做好基础环境配置之后只需要三步即可开发部署一个 Flink 作业: ![](/blog/relx/development_process.png) -StreamPark 既支持 Upload Jar 也支持直接编写 Flink SQL 作业, **Flink SQL 作业只需要输入SQL 和 依赖项即可, 该方式大大提升了开发体验,** **并且规避了依赖冲突等问题,**对此部分本文不重点介绍。 +Apache StreamPark 既支持 Upload Jar 也支持直接编写 Flink SQL 作业, **Flink SQL 作业只需要输入SQL 和 依赖项即可, 该方式大大提升了开发体验,** **并且规避了依赖冲突等问题,**对此部分本文不重点介绍。 这里需要选择部署模式为 kubernetes application, 并且需要在作业开发页面进行以下参数的配置:红框中参数为 Flink on Kubernetes 基础参数。 @@ -118,7 +118,7 @@ StreamPark 既支持 Upload Jar 也支持直接编写 Flink SQL 作业, **Flink ### **作业上线** -作业开发完成之后是作业上线环节,在这一环节中 StreamPark 做了大量的工作,具体如下: +作业开发完成之后是作业上线环节,在这一环节中 Apache StreamPark 做了大量的工作,具体如下: - 准备环境 - 作业中的依赖下载 @@ -130,7 +130,7 @@ StreamPark 既支持 Upload Jar 也支持直接编写 Flink SQL 作业, **Flink ![](/blog/relx/operation.png) -在镜像构建和推送的时候我们可以看到 StreamPark 做的一系列工作: **读取配置、构建镜像、推送镜像到远程仓库...** 这里要给StreamPark 一个大大的赞! +在镜像构建和推送的时候我们可以看到 Apache StreamPark 做的一系列工作: **读取配置、构建镜像、推送镜像到远程仓库...** 这里要给Apache StreamPark 一个大大的赞! 
![](/blog/relx/step_details.png) @@ -140,19 +140,19 @@ StreamPark 既支持 Upload Jar 也支持直接编写 Flink SQL 作业, **Flink ![](/blog/relx/homework_submit.png) -整个过程仅需上述三步,即可完成在 StreamPark 上开发和部署一个Flink on Kubernetes 作业。而 StreamPark 对于 Flink on Kubernetes 的支持远远不止提交个任务这么简单。 +整个过程仅需上述三步,即可完成在 Apache StreamPark 上开发和部署一个Flink on Kubernetes 作业。而 Apache StreamPark 对于 Flink on Kubernetes 的支持远远不止提交个任务这么简单。 ### **作业管理** -**在作业提交之后,StreamPark 能实时获取到任务的最新 checkpoint 地址、任务的运行状态、集群实时的资源消耗信息,针对运行的任务可以非常方便的一键启停, 在停止作业时支持记录 savepoint 的位置, 以及再次启动时从 savepoint 恢复状态等功能,从而保证了生产环境的数据一致性,真正具备 Flink on Kubernetes 的 一站式开发、部署、运维监控的能力。** +**在作业提交之后,Apache StreamPark 能实时获取到任务的最新 checkpoint 地址、任务的运行状态、集群实时的资源消耗信息,针对运行的任务可以非常方便的一键启停, 在停止作业时支持记录 savepoint 的位置, 以及再次启动时从 savepoint 恢复状态等功能,从而保证了生产环境的数据一致性,真正具备 Flink on Kubernetes 的 一站式开发、部署、运维监控的能力。** -接下来我们来看看这一块的能力 StreamPark 是如何进行支持的: +接下来我们来看看这一块的能力 Apache StreamPark 是如何进行支持的: - **实时记录checkpoint** ------ -在作业提交之后,有时候需要更改作业逻辑但是要保证数据的一致性,那么就需要平台具有实时记录每一次 checkpoint 位置的能力, 以及停止时记录最后的 savepoint 位置的能力,StreamPark 在 Flink on Kubernetes 上很好的实现了该功能。默认会每隔5秒获取一次 checkpoint 信息记录到对应的表中,并且会按照 Flink 中保留 checkpoint 数量的策略,只保留 state.checkpoints.num-retained 个,超过的部分则删除。在任务停止时有勾选 savepoint 的选项,如勾选了savepoint 选项,在任务停止时会做 savepoint 操作,同样会记录 savepoint 具体位置到表中。 +在作业提交之后,有时候需要更改作业逻辑但是要保证数据的一致性,那么就需要平台具有实时记录每一次 checkpoint 位置的能力, 以及停止时记录最后的 savepoint 位置的能力,Apache StreamPark 在 Flink on Kubernetes 上很好的实现了该功能。默认会每隔5秒获取一次 checkpoint 信息记录到对应的表中,并且会按照 Flink 中保留 checkpoint 数量的策略,只保留 state.checkpoints.num-retained 个,超过的部分则删除。在任务停止时有勾选 savepoint 的选项,如勾选了savepoint 选项,在任务停止时会做 savepoint 操作,同样会记录 savepoint 具体位置到表中。 默认 savepoint 的根路径只需要在 Flink Home flink-conf.yaml 文件中配置即可自动识别,除了默认地址,在停止时也可以自定义指定 savepoint 的根路径。 @@ -164,7 +164,7 @@ StreamPark 既支持 Upload Jar 也支持直接编写 Flink SQL 作业, **Flink ------ -对于生产环境的挑战,很重要的一点就是监控是否到位,Flink on Kubernetes 更是如此。这点很重要, 也是最基本需要具备的能力,StreamPark 可实时监控 Flink on Kubernetes 作业的运行状态并在平台上展示给用户,在页面上可以很方便的根据各种运行状态来检索任务。 +对于生产环境的挑战,很重要的一点就是监控是否到位,Flink on Kubernetes 更是如此。这点很重要, 也是最基本需要具备的能力,Apache StreamPark 可实时监控 Flink on Kubernetes 作业的运行状态并在平台上展示给用户,在页面上可以很方便的根据各种运行状态来检索任务。 ![](/blog/relx/run_status.png) @@ -172,21 +172,21 @@ StreamPark 既支持 Upload Jar 也支持直接编写 Flink SQL 作业, **Flink ------ -除此之外 StreamPark 还具备完善的报警功能: 支持邮件、钉钉、微信和短信 等。这也是当初公司调研之后选择 StreamPark 作为 Flink on Kubernetes 一站式平台的重要原因。 +除此之外 Apache StreamPark 还具备完善的报警功能: 支持邮件、钉钉、微信和短信 等。这也是当初公司调研之后选择 Apache StreamPark 作为 Flink on Kubernetes 一站式平台的重要原因。 ![](/blog/relx/alarm.png) -通过以上我们看到 StreamPark 在支持 Flink on Kubernetes 开发部署过程中具备的能力, 包括:**作业的开发能力、部署能力、监控能力、运维能力、异常处理能力等,StreamPark 提供的是一套相对完整的解决方案。 且已经具备了一些 CICD/DevOps 的能力,整体的完成度还在持续提升。是在整个开源领域中对于 Flink on Kubernetes 一站式开发部署运维工作全链路都支持的产品,StreamPark 是值得被称赞的。** +通过以上我们看到 Apache StreamPark 在支持 Flink on Kubernetes 开发部署过程中具备的能力, 包括:**作业的开发能力、部署能力、监控能力、运维能力、异常处理能力等,Apache StreamPark 提供的是一套相对完整的解决方案。 且已经具备了一些 CICD/DevOps 的能力,整体的完成度还在持续提升。是在整个开源领域中对于 Flink on Kubernetes 一站式开发部署运维工作全链路都支持的产品,Apache StreamPark 是值得被称赞的。** -## **StreamPark 在雾芯科技的落地实践** +## **Apache StreamPark 在雾芯科技的落地实践** -StreamPark 在雾芯科技落地较晚,目前主要用于实时数据集成作业和实时指标计算作业的开发部署,有 Jar 任务也有 Flink SQL 任务,全部使用 Native Kubernetes 部署;数据源有CDC、Kafka 等,Sink 端有 Maxcompute、kafka、Hive 等,以下是公司开发环境StreamPark 平台截图: +Apache StreamPark 在雾芯科技落地较晚,目前主要用于实时数据集成作业和实时指标计算作业的开发部署,有 Jar 任务也有 Flink SQL 任务,全部使用 Native Kubernetes 部署;数据源有CDC、Kafka 等,Sink 端有 Maxcompute、kafka、Hive 等,以下是公司开发环境Apache StreamPark 平台截图: ![](/blog/relx/screenshot.png) ## 遇到的问题 -任何新技术都有探索与踩坑的过程,失败的经验是宝贵的,这里介绍下 StreamPark 在雾芯科技落地过程中踩的一些坑和经验,**这块的内容不仅仅关于 StreamPark 的部分, 相信会带给所有使用 Flink on Kubernetes 
的小伙伴一些参考**。 +任何新技术都有探索与踩坑的过程,失败的经验是宝贵的,这里介绍下 Apache StreamPark 在雾芯科技落地过程中踩的一些坑和经验,**这块的内容不仅仅关于 Apache StreamPark 的部分, 相信会带给所有使用 Flink on Kubernetes 的小伙伴一些参考**。 ### **常见问题总结如下** @@ -196,7 +196,7 @@ StreamPark 在雾芯科技落地较晚,目前主要用于实时数据集成作 - **scala 版本不一致** -由于 StreamPark 部署需要 Scala 环境,而且 Flink SQL 运行需要用到 StreamPark 提供的 Flink SQL Client,因此一定要保证 Flink 作业的 Scala 版本和 StreamPark 的 Scala 版本保持一致。 +由于 Apache StreamPark 部署需要 Scala 环境,而且 Flink SQL 运行需要用到 Apache StreamPark 提供的 Flink SQL Client,因此一定要保证 Flink 作业的 Scala 版本和 Apache StreamPark 的 Scala 版本保持一致。 - **注意类冲突** @@ -228,7 +228,7 @@ HDFS 阿里云 OSS/AWS S3 都可以进行 checkpoint 和 savepoint 存储,Flin - **任务每次重启都会导致多出一个 Job 实例** -在配置了基于 kubernetes 的HA的前提条件下,当需要停止 Flink 任务时,需要通过 StreamPark 的 cancel 来进行,不要直接通过 kubernetes 集群删除 Flink 任务的 Deployment。因为 Flink 的关闭有其自有的关闭流程,在删除 pod 同时 Configmap 中的相应配置文件也会被一并删除,而直接删除 pod 会导致 Configmap 的残留。当相同名称的任务重启时,会出现两个相同 Job 现象,因为在启动时,任务会加载之前残留的配置文件,尝试将已经关闭的任务恢复。 +在配置了基于 kubernetes 的HA的前提条件下,当需要停止 Flink 任务时,需要通过 Apache StreamPark 的 cancel 来进行,不要直接通过 kubernetes 集群删除 Flink 任务的 Deployment。因为 Flink 的关闭有其自有的关闭流程,在删除 pod 同时 Configmap 中的相应配置文件也会被一并删除,而直接删除 pod 会导致 Configmap 的残留。当相同名称的任务重启时,会出现两个相同 Job 现象,因为在启动时,任务会加载之前残留的配置文件,尝试将已经关闭的任务恢复。 - **kubernetes pod 域名访问怎么实现** @@ -300,14 +300,14 @@ push k8s-harbor.xxx.com/streamx/udf_flink_1.13.6-scala_2.11:latest ## **未来期待** -- **StreamPark 对于 Flink 作业 Metric 监控的支持** +- **Apache StreamPark 对于 Flink 作业 Metric 监控的支持** -StreamPark 如果可以对接 Flink Metric 数据而且可以在 StreamPark 平台上展示每时每刻 Flink 的实时消费数据情况就太棒了 +Apache StreamPark 如果可以对接 Flink Metric 数据而且可以在 Apache StreamPark 平台上展示每时每刻 Flink 的实时消费数据情况就太棒了 -- **StreamPark 对于Flink 作业日志持久化的支持** +- **Apache StreamPark 对于Flink 作业日志持久化的支持** -对于部署到 YARN 的 Flink 来说,如果 Flink 程序挂了,我们可以去 YARN 上看历史日志,但是对于 Kubernetes 来说,如果程序挂了,那么 Kubernetes 的 pod 就消失了,就没法查日志了。所以用户需要借助 Kubernetes 上的工具进行日志持久化,如果 StreamPark 支持 Kubernetes 日志持久化接口就更好了。 +对于部署到 YARN 的 Flink 来说,如果 Flink 程序挂了,我们可以去 YARN 上看历史日志,但是对于 Kubernetes 来说,如果程序挂了,那么 Kubernetes 的 pod 就消失了,就没法查日志了。所以用户需要借助 Kubernetes 上的工具进行日志持久化,如果 Apache StreamPark 支持 Kubernetes 日志持久化接口就更好了。 - **镜像过大的问题改进** -StreamPark 目前对于 Flink on Kubernetes 作业的镜像支持是将基础镜像和用户代码打成一个 Fat 镜像推送到 Docker 仓库,这种方式存在的问题就是镜像过大的时候耗时比较久,希望未来基础镜像可以复用不需要每次都与业务代码打到一起,这样可以极大地提升开发效率和节约成本。 +Apache StreamPark 目前对于 Flink on Kubernetes 作业的镜像支持是将基础镜像和用户代码打成一个 Fat 镜像推送到 Docker 仓库,这种方式存在的问题就是镜像过大的时候耗时比较久,希望未来基础镜像可以复用不需要每次都与业务代码打到一起,这样可以极大地提升开发效率和节约成本。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-blog/1-flink-framework-streampark.md b/i18n/zh-CN/docusaurus-plugin-content-blog/1-flink-framework-streampark.md index 3604cdf9f..381622535 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-blog/1-flink-framework-streampark.md +++ b/i18n/zh-CN/docusaurus-plugin-content-blog/1-flink-framework-streampark.md @@ -1,7 +1,7 @@ --- slug: flink-development-framework-streampark -title: Flink 开发利器 StreamPark -tags: [StreamPark, DataStream, FlinkSQL] +title: Flink 开发利器 Apache StreamPark +tags: [Apache StreamPark, DataStream, FlinkSQL] ---
@@ -58,18 +58,18 @@ Flink 从 1.13 版本开始,就支持 Pod Template,我们可以在 Pod Templ
-## 引入 StreamPark +## 引入 Apache StreamPark 之前我们写 Flink SQL 基本上都是使用 Java 包装 SQL,打 jar 包,提交到 S3 平台上。通过命令行方式提交代码,但这种方式始终不友好,流程繁琐,开发和运维成本太大。我们希望能够进一步简化流程,将 Flink TableEnvironment 抽象出来,有平台负责初始化、打包运行 Flink 任务,实现 Flink 应用程序的构建、测试和部署自动化。 -这是个开源兴起的时代,我们自然而然的将目光投向开源领域中:在一众开源项目中,经过对比各个项目综合评估发现 Zeppelin StreamPark 这两个项目对 Flink 的支持较为完善,都宣称支持 Flink on K8s ,最终进入到我们的目标选择范围中,以下是两者在 K8s 相关支持的简单比较(目前如果有更新,麻烦批评指正)。 +这是个开源兴起的时代,我们自然而然的将目光投向开源领域中:在一众开源项目中,经过对比各个项目综合评估发现 Zeppelin Apache StreamPark 这两个项目对 Flink 的支持较为完善,都宣称支持 Flink on K8s ,最终进入到我们的目标选择范围中,以下是两者在 K8s 相关支持的简单比较(目前如果有更新,麻烦批评指正)。
Feature ZeppelinStreamParkApache StreamPark
- + @@ -125,15 +125,15 @@ Flink 从 1.13 版本开始,就支持 Pod Template,我们可以在 Pod Templ
-调研过程中,我们与两者的主开发人员都进行了多次沟通。经过我们反复研究之后,还是决定将 StreamPark 作为我们目前的 Flink 开发工具来使用。 +调研过程中,我们与两者的主开发人员都进行了多次沟通。经过我们反复研究之后,还是决定将 Apache StreamPark 作为我们目前的 Flink 开发工具来使用。 -
(StreamPark 官网的闪屏)
+
(Apache StreamPark 官网的闪屏)

-经过开发同学长时间开发测试,StreamPark 目前已经具备: +经过开发同学长时间开发测试,Apache StreamPark 目前已经具备: * 完善的SQL 校验功能 * 实现了自动 build/push 镜像 @@ -145,21 +145,21 @@ Flink 从 1.13 版本开始,就支持 Pod Template,我们可以在 Pod Templ -
(StreamPark 对 Flink 多版本的支持演示视频)
+
(Apache StreamPark 对 Flink 多版本的支持演示视频)

-在目前最新发布的 1.2.0 版本中,StreamPark 较为完善地支持了 K8s-Native-Application 和 K8s-Session-Application 模式。 +在目前最新发布的 1.2.0 版本中,Apache StreamPark 较为完善地支持了 K8s-Native-Application 和 K8s-Session-Application 模式。 -
(StreamPark K8s 部署演示视频)
+
(Apache StreamPark K8s 部署演示视频)

### K8s Native Application 模式 -在 StreamPark 中,我们只需要配置相应的参数,并在 Maven POM 中填写相应的依赖,或者上传依赖 jar 包,点击 Apply,相应的依赖就会生成。这就意味着我们也可以将所有使用的 UDF 打成 jar 包,以及各种 connector.jar,直接在 SQL 中使用。如下图: +在 Apache StreamPark 中,我们只需要配置相应的参数,并在 Maven POM 中填写相应的依赖,或者上传依赖 jar 包,点击 Apply,相应的依赖就会生成。这就意味着我们也可以将所有使用的 UDF 打成 jar 包,以及各种 connector.jar,直接在 SQL 中使用。如下图: ![](/blog/belle/dependency.png) @@ -171,7 +171,7 @@ SQL 校验能力和 Zeppelin 基本一致: ![](/blog/belle/pod.png) -程序保存后,点击运行时,也可以指定 savepoint。任务提交成功后,StreamPark 会根据 FlinkPod 网络 Exposed Type(loadBalancer/NodePort/ClusterIp),返回相应的 WebURL,从而自然的实现 WebUI 跳转。但是,目前因为线上私有 K8s 集群出于安全性考虑,尚未打通 Pod 与客户端节点网络(目前也没有这个规划)。所以么,我们只使用 NodePort。如果后续任务数过多,有使用 ClusterIP 的需求的话,我们可能会将 StreamPark 部署在 K8s,或者同 Ingress 做进一步整合。 +程序保存后,点击运行时,也可以指定 savepoint。任务提交成功后,Apache StreamPark 会根据 FlinkPod 网络 Exposed Type(loadBalancer/NodePort/ClusterIp),返回相应的 WebURL,从而自然的实现 WebUI 跳转。但是,目前因为线上私有 K8s 集群出于安全性考虑,尚未打通 Pod 与客户端节点网络(目前也没有这个规划)。所以么,我们只使用 NodePort。如果后续任务数过多,有使用 ClusterIP 的需求的话,我们可能会将 Apache StreamPark 部署在 K8s,或者同 Ingress 做进一步整合。 ![](/blog/belle/start.png) @@ -187,7 +187,7 @@ SQL 校验能力和 Zeppelin 基本一致: ### K8s Native Session 模式 -StreamPark 还较好地支持了 K8s Native-Sesson 模式,这为我们后续做离线 FlinkSQL 开发或部分资源隔离做了较好的技术支持。 +Apache StreamPark 还较好地支持了 K8s Native-Sesson 模式,这为我们后续做离线 FlinkSQL 开发或部分资源隔离做了较好的技术支持。 Native-Session 模式需要事先使用 Flink 命令创建一个运行在 K8s 中的 Flink 集群。如下: @@ -205,7 +205,7 @@ Native-Session 模式需要事先使用 Flink 命令创建一个运行在 K8s ![](/blog/belle/flinksql.png) -如上图,使用该 ClusterId 作为 StreamPark 的任务参数 Kubernetes ClusterId。保存提交任务后,任务会很快处于 Running 状态: +如上图,使用该 ClusterId 作为 Apache StreamPark 的任务参数 Kubernetes ClusterId。保存提交任务后,任务会很快处于 Running 状态: ![](/blog/belle/detail.png) @@ -213,13 +213,13 @@ Native-Session 模式需要事先使用 Flink 命令创建一个运行在 K8s ![](/blog/belle/dashboard.png) -可以看到,其实 StreamPark 是将 jar 包通过 REST API 上传到 Flink 集群上,并调度执行任务的。 +可以看到,其实 Apache StreamPark 是将 jar 包通过 REST API 上传到 Flink 集群上,并调度执行任务的。
### Custom Code 模式 -另我们惊喜的是,StreamPark 还支持代码编写 DataStream/FlinkSQL 任务。对于特殊需求,我们可以自己写 Java/Scala 实现。可以根据 StreamPark 推荐的脚手架方式编写任务,也可以编写一个标准普通的 Flink 任务,通过这种方式我们可以将代码管理交由 git 实现,平台可以用来自动化编译打包与部署。当然,如果能用 SQL 实现的功能,我们会尽量避免自定义 DataStream,减少不必要的运维麻烦。 +另我们惊喜的是,Apache StreamPark 还支持代码编写 DataStream/FlinkSQL 任务。对于特殊需求,我们可以自己写 Java/Scala 实现。可以根据 Apache StreamPark 推荐的脚手架方式编写任务,也可以编写一个标准普通的 Flink 任务,通过这种方式我们可以将代码管理交由 git 实现,平台可以用来自动化编译打包与部署。当然,如果能用 SQL 实现的功能,我们会尽量避免自定义 DataStream,减少不必要的运维麻烦。

@@ -227,26 +227,26 @@ Native-Session 模式需要事先使用 Flink 命令创建一个运行在 K8s ## 改进意见 -当然 StreamPark 还有很多需要改进的地方,就目前测试来看: +当然 Apache StreamPark 还有很多需要改进的地方,就目前测试来看: * **资源管理还有待加强**:多文件系统jar包等资源管理功能尚未添加,任务版本功能有待加强。 * **前端 button 功能还不够丰富**:比如任务添加后续可以增加复制等功能按钮。 * **任务提交日志也需要可视化展示**:任务提交伴随着加载 class 文件,打 jar 包,build 镜像,提交镜像,提交任务等过程,每一个环节出错,都会导致任务的失败,但是失败日志往往不明确,或者因为某种原因导致异常未正常抛出,没有转换任务状态,用户会无从下手改进。 -众所周知,一个新事物的出现一开始总会不是那么完美。尽管有些许问题和需要改进的 point,但是瑕不掩瑜,我们仍然选择 StreamPark 作为我们的 Flink DevOps,我们也将会和主开发人员一道共同完善 StreamPark,也欢迎更多的人来使用,为 StreamPark 带来更多进步。 +众所周知,一个新事物的出现一开始总会不是那么完美。尽管有些许问题和需要改进的 point,但是瑕不掩瑜,我们仍然选择 Apache StreamPark 作为我们的 Flink DevOps,我们也将会和主开发人员一道共同完善 Apache StreamPark,也欢迎更多的人来使用,为 Apache StreamPark 带来更多进步。
## 未来规划 * 我们会继续跟进 Doris,并将业务数据 + 日志数据统一入 Doris,通过 Flink 实现湖仓一体; -* 我们也会逐步将探索 StreamPark 同 DolphinScheduler 2.x 进行整合,完善DolphinScheduler 离线任务,逐步用 Flink 替换掉 Spark,实现真正的流批一体; +* 我们也会逐步将探索 Apache StreamPark 同 DolphinScheduler 2.x 进行整合,完善DolphinScheduler 离线任务,逐步用 Flink 替换掉 Spark,实现真正的流批一体; * 基于我们自身在 S3 上的探索积累,fat-jar 包 build 完成之后不再构建镜像,直接利用 Pod Tempelet 挂载 PVC 到 Flink Pod 中的目录,进一步优化代码提交流程; -* 将 StreamPark 持续应用到我们生产中,并汇同社区开发人员,共同努力,增强 StreamPark 在 Flink 流上的开发部署能力与运行监控能力,努力把 StreamPark 打造成一个功能完善的流数据 DevOps。 +* 将 Apache StreamPark 持续应用到我们生产中,并汇同社区开发人员,共同努力,增强 Apache StreamPark 在 Flink 流上的开发部署能力与运行监控能力,努力把 Apache StreamPark 打造成一个功能完善的流数据 DevOps。 附: -StreamPark GitHub:[https://github.com/apache/incubator-streampark](https://github.com/apache/incubator-streampark)
+Apache StreamPark GitHub:[https://github.com/apache/incubator-streampark](https://github.com/apache/incubator-streampark)
Doris GitHub:[https://github.com/apache/doris](https://github.com/apache/doris) ![](/blog/belle/author.png) diff --git a/i18n/zh-CN/docusaurus-plugin-content-blog/2-streampark-usercase-chinaunion.md b/i18n/zh-CN/docusaurus-plugin-content-blog/2-streampark-usercase-chinaunion.md index ca77f9980..3c950e419 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-blog/2-streampark-usercase-chinaunion.md +++ b/i18n/zh-CN/docusaurus-plugin-content-blog/2-streampark-usercase-chinaunion.md @@ -1,7 +1,7 @@ --- slug: streampark-usercase-chinaunion title: 联通 Flink 实时计算平台化运维实践 -tags: [StreamPark, 生产实践, FlinkSQL] +tags: [Apache StreamPark, 生产实践, FlinkSQL] --- # 联通 Flink 实时计算平台化运维实践 @@ -10,7 +10,7 @@ tags: [StreamPark, 生产实践, FlinkSQL] - 实时计算平台背景介绍 - Flink 实时作业运维挑战 -- 基于 StreamPark 一体化管理 +- 基于 Apache StreamPark 一体化管理 - 未来规划与演进 @@ -71,16 +71,16 @@ tags: [StreamPark, 生产实践, FlinkSQL] 由于作业运维困境上的种种因素,会产生业务支撑困境,如导致上线故障率高、影响数据质量、上线时间长、数据延迟高、告警漏发处理等,引起的投诉,此外,我们的业务影响不明确,一旦出现问题,处理问题会成为第一优先级。 -## **基于 StreamPark 一体化管理** +## **基于 Apache StreamPark 一体化管理** ![](/blog/chinaunion/job_management.png) -对于以上的两种困境,我们基于 StreamPark 一体化管理解决了很多问题,首先来看一下 StreamPark 的双线演进,分别是 Flink 作业管理和 Flink 作业 DevOps 平台;在作业管理上,StreamPark 支持将 Flink 实时作业部署到不同的集群里去,比如 Flink 原生自带的 Standalone 模式,Flink on Yarn 的 Session、Application、PerJob 模式,在最新的版本中将支持 Kubernetes Native Session 模式;中间层是项目管理、作业管理、集群管理、团队管理、变量管理、告警管理。 +对于以上的两种困境,我们基于 Apache StreamPark 一体化管理解决了很多问题,首先来看一下 Apache StreamPark 的双线演进,分别是 Flink 作业管理和 Flink 作业 DevOps 平台;在作业管理上,Apache StreamPark 支持将 Flink 实时作业部署到不同的集群里去,比如 Flink 原生自带的 Standalone 模式,Flink on Yarn 的 Session、Application、PerJob 模式,在最新的版本中将支持 Kubernetes Native Session 模式;中间层是项目管理、作业管理、集群管理、团队管理、变量管理、告警管理。 - 项目管理:当部署 Flink 程序的时候,可以在项目管理里填写 git 地址,同时选择要部署的分支。 -- 作业管理:可以指定 Flink 作业的执行模式,比如你要提交到什么类型的集群里去,同时还可以配置一些资源,比如 TaskManager 的数量、TaskManager/JobManager 的内存大小、并行度等等,还可以设置一些容错,比如 Flink 作业失败后,StreamPark 可以支持它自动拉起,同时支持传入一些动态参数。 +- 作业管理:可以指定 Flink 作业的执行模式,比如你要提交到什么类型的集群里去,同时还可以配置一些资源,比如 TaskManager 的数量、TaskManager/JobManager 的内存大小、并行度等等,还可以设置一些容错,比如 Flink 作业失败后,Apache StreamPark 可以支持它自动拉起,同时支持传入一些动态参数。 - 集群管理:可以在界面上添加和管理大数据集群。 - 团队管理:在企业的实际生产过程中会有多个团队,团队之间是隔离的。 - 变量管理:可以把一些变量统一维护在一个地方,比如 Kafka 的 Broker 地址定义成一个变量,在配置 Flink 作业或者 SQL 的时候,就可以以变量的方式来替换 Broker 的 IP,且后续这个 Kafka 要下线的时候,也可以通过这个变量去查看到底哪些作业使用了这个集群,方便我们去做一些后续的流程。 @@ -88,12 +88,12 @@ tags: [StreamPark, 生产实践, FlinkSQL] -StreamPark 支持 Flink SQL、Flink Jar 的提交,支持资源配置,支持状态跟踪,如状态是运行状态,失败状态等,同时支持指标大屏和各种日志查看。 +Apache StreamPark 支持 Flink SQL、Flink Jar 的提交,支持资源配置,支持状态跟踪,如状态是运行状态,失败状态等,同时支持指标大屏和各种日志查看。 ![](/blog/chinaunion/devops_platform.png) Flink 作业 DevOps 平台,主要包括以下几部分: -- 团队:StreamPark 支持多个团队,每个团队都有团队的管理员,他拥有所有权限,同时还有团队的开发者,他只有少量的一部分权限。 +- 团队:Apache StreamPark 支持多个团队,每个团队都有团队的管理员,他拥有所有权限,同时还有团队的开发者,他只有少量的一部分权限。 - 编译、打包:在创建 Flink 项目时,可以把 git 地址、分支、打包的命令等配置在项目里,然后一键点击 build 按钮进行编译、打包。 - 发布、部署:发布和部署的时候会创建 Flink 作业,在 Flink 作业里可以选择执行模式、部署集群、资源设置、容错设置、变量填充,最后通过一键启动停止,启动 Flink 作业。 - 状态监测:Flink 作业启动完成之后,就是状态的实时跟踪,包括 Flink 的运行状态、运行时长、Checkpoint 信息等,并支持一键跳转到 Flink 的 Web UI。 @@ -101,7 +101,7 @@ Flink 作业 DevOps 平台,主要包括以下几部分: ![](/blog/chinaunion/multi_team_support.png) -企业一般有多个团队同时开发实时作业,在我们公司包含实时采集团队、数据处理团队和实时的营销团队,StreamPark 支持多个团队的资源隔离。 +企业一般有多个团队同时开发实时作业,在我们公司包含实时采集团队、数据处理团队和实时的营销团队,Apache StreamPark 支持多个团队的资源隔离。 ![](/blog/chinaunion/platformized_management.png) @@ -114,7 +114,7 @@ Flink 作业平台化管理面临如下挑战: -基于以上的挑战,StreamPark 通过项目管理来解决了责任人不明确,分支不可追溯的问题,因为在创建项目的时候需要手动指定一些分支,一旦打包成功,这些分支是有记录的;通过作业管理对配置进行了集中化,避免了脚本太过于分散,而且作业启动、停止的权限有严格的控制,避免了脚本化权限不可控的状态,StreamPark 以接口的方式与集群进行交互来获取作业信息,这样做会让作业控制更加精细。 
+基于以上的挑战,Apache StreamPark 通过项目管理来解决了责任人不明确,分支不可追溯的问题,因为在创建项目的时候需要手动指定一些分支,一旦打包成功,这些分支是有记录的;通过作业管理对配置进行了集中化,避免了脚本太过于分散,而且作业启动、停止的权限有严格的控制,避免了脚本化权限不可控的状态,Apache StreamPark 以接口的方式与集群进行交互来获取作业信息,这样做会让作业控制更加精细。 @@ -122,15 +122,15 @@ Flink 作业平台化管理面临如下挑战: ![图片](/blog/chinaunion/development_efficiency.png) -早期我们需要通过 7 步进行部署,包括连接 VPN、登录 4A、执行编译脚本、执行启动脚本、打开 Yarn、搜索作业名、进入 Flink UI 等 7 个步骤,StreamPark 可以支持 4 个一键进行部署,包括一键打包、一键发布、一键启动、一键到 Flink UI。 +早期我们需要通过 7 步进行部署,包括连接 VPN、登录 4A、执行编译脚本、执行启动脚本、打开 Yarn、搜索作业名、进入 Flink UI 等 7 个步骤,Apache StreamPark 可以支持 4 个一键进行部署,包括一键打包、一键发布、一键启动、一键到 Flink UI。 ![图片](/blog/chinaunion/submission_process.png) -上图是我们 StreamPark 的作业提交流程,首先 StreamPark 会将作业进行发布,发布的时候会上传一些资源,然后会进行作业的提交,提交的时候会带上配置的一些参数,以 Flink Submit 的方式调用接口发布到集群上;这里会有多个 Flink Submit 对应着不同的执行模式,比如 Yarn Session、Yarn Application、Kubernetes Session、Kubernetes Application 等都是在这里控制的,提交作业之后,如果是 Flink on Yarn 作业,会得到这个 Flink 作业的 Application ID 或者 Job ID,这个 ID 会保存在我们的数据库中,如果是基于 Kubernetes 执行的话,也会得到 Job ID,后面我们在跟踪作业状态的时候,主要就是通过保存的这些 ID 去跟踪作业的状态。 +上图是我们 Apache StreamPark 的作业提交流程,首先 Apache StreamPark 会将作业进行发布,发布的时候会上传一些资源,然后会进行作业的提交,提交的时候会带上配置的一些参数,以 Flink Submit 的方式调用接口发布到集群上;这里会有多个 Flink Submit 对应着不同的执行模式,比如 Yarn Session、Yarn Application、Kubernetes Session、Kubernetes Application 等都是在这里控制的,提交作业之后,如果是 Flink on Yarn 作业,会得到这个 Flink 作业的 Application ID 或者 Job ID,这个 ID 会保存在我们的数据库中,如果是基于 Kubernetes 执行的话,也会得到 Job ID,后面我们在跟踪作业状态的时候,主要就是通过保存的这些 ID 去跟踪作业的状态。 ![图片](/blog/chinaunion/status_acquisition_bottleneck.png) -如上所述,如果是 Flink on Yarn 作业,在提交作业的时候会获取两个 ID,Application ID 或者 Job ID,基于这两个 ID 可以获取我们的状态,但当 Flink 作业非常多的时候会遇到一些问题,StreamPark 它是有一个状态获取器,它会通过我们保存的数据库里的 Application ID 或者 Job ID,去向 ResourceManager 做一个请求,会做每五秒钟周期性的轮询,如果作业特别多,每次轮询 ResourceManager 会负责再去调用 Job Manager 的地址访问它的状态,这就会导致 ResourceManager 的连接数压力较大和连接数过高。 +如上所述,如果是 Flink on Yarn 作业,在提交作业的时候会获取两个 ID,Application ID 或者 Job ID,基于这两个 ID 可以获取我们的状态,但当 Flink 作业非常多的时候会遇到一些问题,Apache StreamPark 它是有一个状态获取器,它会通过我们保存的数据库里的 Application ID 或者 Job ID,去向 ResourceManager 做一个请求,会做每五秒钟周期性的轮询,如果作业特别多,每次轮询 ResourceManager 会负责再去调用 Job Manager 的地址访问它的状态,这就会导致 ResourceManager 的连接数压力较大和连接数过高。 @@ -138,38 +138,38 @@ Flink 作业平台化管理面临如下挑战: ![图片](/blog/chinaunion/state_optimization.png) -针对上面的问题,我们做了一些优化,首先 StreamPark 保存了提交作业之后的 Application ID 或者 Job ID,同时也会获取 Job Manager 直接访问的地址,并保存在数据库中,每次轮询时不再通过 ResourceManager 获取作业的状态,它可以直接调用各个 Job Manager 的地址实时获取状态,极大的降低了 ResourceManager 的连接数;从上图最后的部分可以看到,基本不会产生太大的连接数,大大减轻了 ResourceManager 的压力,且后续当 Flink 作业越来越多时获取状态也不会遇到瓶颈的问题。 +针对上面的问题,我们做了一些优化,首先 Apache StreamPark 保存了提交作业之后的 Application ID 或者 Job ID,同时也会获取 Job Manager 直接访问的地址,并保存在数据库中,每次轮询时不再通过 ResourceManager 获取作业的状态,它可以直接调用各个 Job Manager 的地址实时获取状态,极大的降低了 ResourceManager 的连接数;从上图最后的部分可以看到,基本不会产生太大的连接数,大大减轻了 ResourceManager 的压力,且后续当 Flink 作业越来越多时获取状态也不会遇到瓶颈的问题。 ![图片](/blog/chinaunion/state_recovery.png) -StreamPark 解决的另一个问题是 Flink 从状态恢复的保障,以前我们用脚本做运维的时候,在启动 Flink 的时候,尤其是在业务升级的时候,要从上一个最新的 Checkpoint 来恢复,但经常有开发人员忘记从上一个检查点进行恢复,导致数据质量产生很大的问题,遭到投诉,StreamPark 的流程是在首次启动的时候,每五秒钟轮询一次获取 Checkpoint 的记录,同时保存在数据库之中,在 StreamPark 上手动停止 Flink 作业的时候,可以选择做不做 Savepoint,如果选择了做 Savepoint,会将 Savepoint 的路径保存在数据库中,同时每次的 Checkpoint 记录也保存在数据库中,当下次启动 Flink 作业的时候,默认会选择最新的 Checkpoint 或者 Savepoint 记录,有效避免了无法从上一个检查点去恢复的问题,也避免了导致问题后要进行 offset 回拨重跑作业造成的资源浪费,同时也保证了数据处理的一致性。 +Apache StreamPark 解决的另一个问题是 Flink 从状态恢复的保障,以前我们用脚本做运维的时候,在启动 Flink 的时候,尤其是在业务升级的时候,要从上一个最新的 Checkpoint 来恢复,但经常有开发人员忘记从上一个检查点进行恢复,导致数据质量产生很大的问题,遭到投诉,Apache StreamPark 的流程是在首次启动的时候,每五秒钟轮询一次获取 Checkpoint 的记录,同时保存在数据库之中,在 Apache StreamPark 上手动停止 Flink 作业的时候,可以选择做不做 Savepoint,如果选择了做 
Savepoint,会将 Savepoint 的路径保存在数据库中,同时每次的 Checkpoint 记录也保存在数据库中,当下次启动 Flink 作业的时候,默认会选择最新的 Checkpoint 或者 Savepoint 记录,有效避免了无法从上一个检查点去恢复的问题,也避免了导致问题后要进行 offset 回拨重跑作业造成的资源浪费,同时也保证了数据处理的一致性。 ![图片](/blog/chinaunion/multiple_environments_and_components.png) -StreamPark 还解决了在多环境下多个组件的引用挑战,比如在企业中通常会有多套环境,如开发环境、测试环境、生产环境等,一般来说每套环境下都会有多个组件,比如 Kafka,HBase、Redis 等,而且在同一套环境里还可能会存在多个相同的组件,比如在联通的实时计算平台,从上游的 Kafka 消费数据的时候,将符合要求的数据再写到下游的 Kafka,这个时候同一套环境会涉及到两套 Kafka,单纯从 IP 很难判断是哪个环境哪个组件,所以我们将所有组件的 IP 地址都定义成一个变量,比如 Kafka 集群,开发环境、测试环境、生产环境都有 Kafka.cluster 这个变量,但它们指向的 Broker 的地址是不一样的,这样不管是在哪个环境下配置 Flink 作业,只要引用这个变量就可以了,大大降低了生产上的故障率。 +Apache StreamPark 还解决了在多环境下多个组件的引用挑战,比如在企业中通常会有多套环境,如开发环境、测试环境、生产环境等,一般来说每套环境下都会有多个组件,比如 Kafka,HBase、Redis 等,而且在同一套环境里还可能会存在多个相同的组件,比如在联通的实时计算平台,从上游的 Kafka 消费数据的时候,将符合要求的数据再写到下游的 Kafka,这个时候同一套环境会涉及到两套 Kafka,单纯从 IP 很难判断是哪个环境哪个组件,所以我们将所有组件的 IP 地址都定义成一个变量,比如 Kafka 集群,开发环境、测试环境、生产环境都有 Kafka.cluster 这个变量,但它们指向的 Broker 的地址是不一样的,这样不管是在哪个环境下配置 Flink 作业,只要引用这个变量就可以了,大大降低了生产上的故障率。 ![图片](/blog/chinaunion/multiple_execution_modes.png) -StreamPark 支持 Flink 多执行的模式,包括基于 on Yarn 的 Application/ Perjob / Session 三种部署模式,还支持 Kubernetes 的 Application 和 Session 两种部署模式,还有一些 Remote 的模式。 +Apache StreamPark 支持 Flink 多执行的模式,包括基于 on Yarn 的 Application/ Perjob / Session 三种部署模式,还支持 Kubernetes 的 Application 和 Session 两种部署模式,还有一些 Remote 的模式。 ![图片](/blog/chinaunion/versioning.png) -StreamPark 也支持 Flink 的多版本,比如联通现在用的是 1.14.x,现在 1.16.x 出来后我们也想体验一下,但不可能把所有的作业都升级到 1.16.x,我们可以把新上线的升级到 1.16.x,这样可以很好的满足使用新版本的要求,同时也兼容老版本。 +Apache StreamPark 也支持 Flink 的多版本,比如联通现在用的是 1.14.x,现在 1.16.x 出来后我们也想体验一下,但不可能把所有的作业都升级到 1.16.x,我们可以把新上线的升级到 1.16.x,这样可以很好的满足使用新版本的要求,同时也兼容老版本。 ## **未来规划与演进** ![图片](/blog/chinaunion/contribution_and_enhancement.png) -未来我们将加大力度参与 StreamPark 建设,以下我们计划要增强的方向。 -- 高可用:StreamPark 目前不支持高可用,这方面还需要做一些加强。 +未来我们将加大力度参与 Apache StreamPark 建设,以下我们计划要增强的方向。 +- 高可用:Apache StreamPark 目前不支持高可用,这方面还需要做一些加强。 - 状态的管理:在企业实践中 Flink 作业在上线时,每个算子会有 UID。如果 Flink UID 不做设置,做 Flink 作业的升级的时候,就有可能出现状态无法恢复的情况,目前通过平台还无法解决这个问题,所以我们想在平台上增加这个功能,在 Flink Jar 提交时,增加检测算子是否设置 UID 的功能,如果没有,会发出提醒,这样可以避免每次上线 Flink 作业时,作业无法恢复的问题;之前遇到这种情况的时候,我们需要使用状态处理的 API,从原来的状态里进行反序列化,然后再用状态处理 API 去制作新的状态,供升级后的 Flink 加载状态。 -- 更细致的监控:目前支持 Flink 作业失败之后,StreamPark 发出告警。我们希望 Task 失败之后也可以发出告警,我们需要知道失败的原因;还有作业反压监控告警、Checkpoint 超时、失败告警性能指标采集,也有待加强。 +- 更细致的监控:目前支持 Flink 作业失败之后,Apache StreamPark 发出告警。我们希望 Task 失败之后也可以发出告警,我们需要知道失败的原因;还有作业反压监控告警、Checkpoint 超时、失败告警性能指标采集,也有待加强。 - 流批一体:结合 Flink 流批一体引擎和数据湖流批一体存储探索流批一体平台。 ![](/blog/chinaunion/road_map.png) -上图是 StreamPark 的 Roadmap。 -- 数据源:StreamPark 会支持更多数据源的快速接入,达到数据一键入户。 +上图是 Apache StreamPark 的 Roadmap。 +- 数据源:Apache StreamPark 会支持更多数据源的快速接入,达到数据一键入户。 - 运维中心:获取更多 Flink Metrics 进一步加强监控运维的能力。 - K8S-operator:现有的 Flink on K8S 还是有点重,经历了打 Jar 包、打镜像、推镜像的过程,后续需要改进优化,积极拥抱上游对接的 K8S-operator。 - 流式数仓:增强对 Flink SQL 作业能力的支持,简化 Flink SQL 作业的提交,计划对接 Flink SQL Gateway;SQL 数仓方面的能力加强,包括元数据存储、统一建表语法校验、运行测试、交互式查询,积极拥抱 Flink 上游,探索实时数仓和流式数仓。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-blog/3-streampark-usercase-bondex-paimon.md b/i18n/zh-CN/docusaurus-plugin-content-blog/3-streampark-usercase-bondex-paimon.md index d66e83b72..8895314d8 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-blog/3-streampark-usercase-bondex-paimon.md +++ b/i18n/zh-CN/docusaurus-plugin-content-blog/3-streampark-usercase-bondex-paimon.md @@ -1,12 +1,12 @@ --- slug: streampark-usercase-bondex-with-paimon -title: 海程邦达基于 Apache Paimon + StreamPark 的流式数仓实践 -tags: [StreamPark, 生产实践, paimon, streaming-warehouse] +title: 海程邦达基于 Apache Paimon + Apache StreamPark 的流式数仓实践 
+tags: [Apache StreamPark, 生产实践, paimon, streaming-warehouse] --- ![](/blog/bondex/Bondex.png) -**导读:**本文主要介绍作为供应链物流服务商海程邦达在数字化转型过程中采用 Paimon + StreamPark 平台实现流式数仓的落地方案。我们以 Apache StreamPark 流批一体平台提供了一个易于上手的生产操作手册,以帮助用户提交 Flink 任务并迅速掌握 Paimon 的使用方法。 +**导读:**本文主要介绍作为供应链物流服务商海程邦达在数字化转型过程中采用 Paimon + Apache StreamPark 平台实现流式数仓的落地方案。我们以 Apache StreamPark 流批一体平台提供了一个易于上手的生产操作手册,以帮助用户提交 Flink 任务并迅速掌握 Paimon 的使用方法。 - 公司业务情况介绍 - 大数据技术痛点以及选型 @@ -102,11 +102,11 @@ kappa 架构只用一套数据流处理架构来解决离线和实时数据, ## 03 生 产 实 践 -本方案采用 Flink Application On K8s 集群,Flink CDC 实时摄取业务系统关系型数据库数据,通过 StreamPark 任务平台提交 Flink + Paimon Streaming Data Warehouse 任务, 最后采用 Trino 引擎接入 Finereport 提供服务和开发人员的查询。Paimon 底层存储支持 S3 协议,因为公司大数据服务依赖于阿里云所以使用对象存储OSS作为数据文件系统。 +本方案采用 Flink Application On K8s 集群,Flink CDC 实时摄取业务系统关系型数据库数据,通过 Apache StreamPark 任务平台提交 Flink + Paimon Streaming Data Warehouse 任务, 最后采用 Trino 引擎接入 Finereport 提供服务和开发人员的查询。Paimon 底层存储支持 S3 协议,因为公司大数据服务依赖于阿里云所以使用对象存储OSS作为数据文件系统。 -[StreamPark](https://github.com/apache/incubator-streampark) 是一个实时计算平台,与 [Paimon](https://github.com/apache/incubator-paimon) 结合使用其强大功能来处理实时数据流。此平台提供以下主要功能: +[Apache StreamPark](https://github.com/apache/incubator-streampark) 是一个实时计算平台,与 [Paimon](https://github.com/apache/incubator-paimon) 结合使用其强大功能来处理实时数据流。此平台提供以下主要功能: -**实时数据处理:**StreamPark 支持提交实时数据流任务,能够实时获取、转换、过滤和分析数据。这对于需要快速响应实时数据的应用非常重要,例如实时监控、实时推荐和实时风控等领域。 +**实时数据处理:**Apache StreamPark 支持提交实时数据流任务,能够实时获取、转换、过滤和分析数据。这对于需要快速响应实时数据的应用非常重要,例如实时监控、实时推荐和实时风控等领域。 **可扩展性:**可以高效处理大规模实时数据,具备良好的可扩展性。可以在分布式计算环境中运行,并能够自动处理并行化、故障恢复和负载均衡等问题,以确保高效且可靠地处理数据。 @@ -114,7 +114,7 @@ kappa 架构只用一套数据流处理架构来解决离线和实时数据, **易用性:**提供了直观的图形界面和简化的 API,可以轻松地构建和部署数据处理任务,而无需深入了解底层技术细节。 -通过在 StreamPark 平台上提交 Paimon 任务,我们可以建立一个全链路实时流动、可查询和分层可复用的 Pipline。 +通过在 Apache StreamPark 平台上提交 Paimon 任务,我们可以建立一个全链路实时流动、可查询和分层可复用的 Pipline。 ![](/blog/bondex/pipline.png) @@ -127,7 +127,7 @@ kappa 架构只用一套数据流处理架构来解决离线和实时数据, ### **环境构建** -下载 flink-1.16.0-scala-2.12.tar.gz 可以在 flink官网 下载对应版本的安装包到StreamPark 部署服务器 +下载 flink-1.16.0-scala-2.12.tar.gz 可以在 flink官网 下载对应版本的安装包到Apache StreamPark 部署服务器 ```shell #解压 @@ -182,7 +182,7 @@ export PATH=$PATH:$FLINK_HOME/bin source /etc/profile ``` -在 StreamPark 添加 Flink conf: +在 Apache StreamPark 添加 Flink conf: ![](/blog/bondex/flink_conf.jpg) @@ -236,7 +236,7 @@ docker push registry-vpc.cn-zhangjiakou.aliyuncs.com/xxxxx/flink-table-store:v1. 接下来准备 Paimon jar 包,可以在 Apache [Repository](https://repository.apache.org/content/groups/snapshots/org/apache/paimon) 下载对应版本,需要注意的是要和 flink 大版本保持一致 -### **使用 StreamPark 管理作业** +### **使用 Apache StreamPark 管理作业** **前提条件:** @@ -247,7 +247,7 @@ docker push registry-vpc.cn-zhangjiakou.aliyuncs.com/xxxxx/flink-table-store:v1. 
**Kubernetes 客户端连接配置:** -将 k8s master节点~/.kube/config 配置直接拷贝到 StreamPark 服务器的目录,之后在 StreamPark 服务器执行以下命令显示 k8s 集群 running 代表权限和网络验证成功。 +将 k8s master节点~/.kube/config 配置直接拷贝到 Apache StreamPark 服务器的目录,之后在 Apache StreamPark 服务器执行以下命令显示 k8s 集群 running 代表权限和网络验证成功。 ```shell kubectl cluster-info @@ -272,11 +272,11 @@ kubectl create secret docker-registry streamparksecret 案例中使用阿里云容器镜像服务ACR,也可以使用自建镜像服务harbor代替。 -创建命名空间 StreamPark (安全设置需要设置为私有) +创建命名空间 Apache StreamPark (安全设置需要设置为私有) ![](/blog/bondex/aliyun.png) -在 StreamPark 配置镜像仓库,任务构建镜像会推送到镜像仓库 +在 Apache StreamPark 配置镜像仓库,任务构建镜像会推送到镜像仓库 ![](/blog/bondex/dockersystem_setting.png) @@ -935,4 +935,4 @@ https://github.com/apache/incubator-paimon/pull/1308 - 后面将基于 trino Catalog接入Doris,实现真正的离线数据和实时数据的one service - 采用 doris + paimon 的架构方案继续推进集团内部流批一体数仓建设的步伐 -在这里要感谢之信老师和 StreamPark 社区在使用 StreamPark + Paimon 过程中的大力支持,在学习使用过程中遇到的问题,都能在第一时间给到解惑并得到解决,我们后面也会积极参与社区的交流和建设,让 paimon 能为更多开发者和企业提供流批一体的数据湖解决方案。 +在这里要感谢之信老师和 Apache StreamPark 社区在使用 Apache StreamPark + Paimon 过程中的大力支持,在学习使用过程中遇到的问题,都能在第一时间给到解惑并得到解决,我们后面也会积极参与社区的交流和建设,让 paimon 能为更多开发者和企业提供流批一体的数据湖解决方案。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-blog/4-streampark-usercase-shunwang.md b/i18n/zh-CN/docusaurus-plugin-content-blog/4-streampark-usercase-shunwang.md index 776036fb5..49fd179c7 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-blog/4-streampark-usercase-shunwang.md +++ b/i18n/zh-CN/docusaurus-plugin-content-blog/4-streampark-usercase-shunwang.md @@ -1,18 +1,18 @@ --- slug: streampark-usercase-shunwang -title: StreamPark 在顺网科技的大规模生产实践 -tags: [StreamPark, 生产实践, FlinkSQL] +title: Apache StreamPark 在顺网科技的大规模生产实践 +tags: [Apache StreamPark, 生产实践, FlinkSQL] --- ![](/blog/shunwang/autor.png) -**导读:**本文主要介绍顺网科技在使用 Flink 计算引擎中遇到的一些挑战,基于 StreamPark 作为实时数据平台如何来解决这些问题,从而大规模支持公司的业务。 +**导读:**本文主要介绍顺网科技在使用 Flink 计算引擎中遇到的一些挑战,基于 Apache StreamPark 作为实时数据平台如何来解决这些问题,从而大规模支持公司的业务。 - 公司业务介绍 - 遇到的挑战 -- 为什么用 StreamPark +- 为什么用 Apache StreamPark - 落地实践 - 带来的收益 - 未来规划 @@ -44,7 +44,7 @@ Flink 作为当下实时计算领域中最流行的技术框架之一,拥有 在面对 Flink 作业管理和运维上的的一系列痛点后,我们一直在寻找合适的解决方案来降低开发同学使用 Flink 门槛,提高工作效率。 -在没有遇到 StreamPark 之前,我们调研了部分公司的 Flink 管理解决方案,发现都是通过自研实时作业平台的方式来开发和管理 Flink 作业。于是,我们也决定自研一套实时计算管理平台,来满足了开发同学对于 Flink 作业管理和运维的基础需求,我们这套平台叫 Streaming-Launcher,大体功能如下: +在没有遇到 Apache StreamPark 之前,我们调研了部分公司的 Flink 管理解决方案,发现都是通过自研实时作业平台的方式来开发和管理 Flink 作业。于是,我们也决定自研一套实时计算管理平台,来满足了开发同学对于 Flink 作业管理和运维的基础需求,我们这套平台叫 Streaming-Launcher,大体功能如下: ![图片](/blog/shunwang/launcher.png) @@ -72,19 +72,19 @@ Streaming-Launcher 中,没有提供统一的作业管理界面。开发同学 ![图片](/blog/shunwang/step.png) -## **为什么用** **StreamPark** +## **为什么用** **Apache StreamPark** 面对自研平台 Streaming-Launcher 存在的缺陷,我们一直在思考如何将 Flink 的使用门槛降到更低,进一步提高工作效率。考虑到人员投入成本和时间成本,我们决定向开源社区求助寻找合适的开源项目来对我们的 Flink 任务进行管理和运维。 -### 01 **StreamPark 解决 Flink 问题的利器** +### 01 **Apache StreamPark 解决 Flink 问题的利器** -很幸运在 2022 年 6 月初,我们在 GitHub 机缘巧合之间认识到了 StreamPark,我们满怀希望地对 StreamPark 进行了初步的探索。发现 StreamPark 具备的能力大概分为三大块:用户权限管理、作业运维管理和开发脚手架。 +很幸运在 2022 年 6 月初,我们在 GitHub 机缘巧合之间认识到了 Apache StreamPark,我们满怀希望地对 Apache StreamPark 进行了初步的探索。发现 Apache StreamPark 具备的能力大概分为三大块:用户权限管理、作业运维管理和开发脚手架。 **用户权限管理** -在 StreamPark 平台中为了避免用户权限过大,发生一些不必要的误操作,影响作业运行稳定性和环境配置的准确性,提供了相应的一些用户权限管理功能,这对企业级用户来说,非常有必要。 +在 Apache StreamPark 平台中为了避免用户权限过大,发生一些不必要的误操作,影响作业运行稳定性和环境配置的准确性,提供了相应的一些用户权限管理功能,这对企业级用户来说,非常有必要。 @@ -94,13 +94,13 @@ Streaming-Launcher 中,没有提供统一的作业管理界面。开发同学 **作业运维管理** -**我们在对 StreamPark 做调研的时候,最关注的是 StreamPark 对于作业的管理的能力。**StreamPark 是否有能力管理作业一个完整的生命周期:作业开发、作业部署、作业管理、问题诊断等。**很幸运,StreamPark 在这一方面非常优秀,对于开发同学来说只需要关注业务本身,不再需要特别关心 Flink 
作业管理和运维上遇到的一系列痛点。**在 StreamPark 作业开发管理管理中,大致分为三个模块:作业管理基础功能,Jar 作业管理,FlinkSQL 作业管理。如下: +**我们在对 Apache StreamPark 做调研的时候,最关注的是 Apache StreamPark 对于作业的管理的能力。**Apache StreamPark 是否有能力管理作业一个完整的生命周期:作业开发、作业部署、作业管理、问题诊断等。**很幸运,Apache StreamPark 在这一方面非常优秀,对于开发同学来说只需要关注业务本身,不再需要特别关心 Flink 作业管理和运维上遇到的一系列痛点。**在 Apache StreamPark 作业开发管理管理中,大致分为三个模块:作业管理基础功能,Jar 作业管理,FlinkSQL 作业管理。如下: ![图片](/blog/shunwang/homework_manager.png) **开发脚手架** -通过进一步的研究发现,StreamPark 不仅仅是一个平台,还包含 Flink 作业开发脚手架, 在 StreamPark 中,针对编写代码的 Flink 作业,提供一种更好的解决方案,将程序配置标准化,提供了更为简单的编程模型,同时还提供了一些列 Connectors,降低了 DataStream 开发的门槛。 +通过进一步的研究发现,Apache StreamPark 不仅仅是一个平台,还包含 Flink 作业开发脚手架, 在 Apache StreamPark 中,针对编写代码的 Flink 作业,提供一种更好的解决方案,将程序配置标准化,提供了更为简单的编程模型,同时还提供了一些列 Connectors,降低了 DataStream 开发的门槛。 @@ -110,9 +110,9 @@ Streaming-Launcher 中,没有提供统一的作业管理界面。开发同学 -### 02 **StreamPark 解决自研平台的问题** +### 02 **Apache StreamPark 解决自研平台的问题** -上面我们简单介绍了 StreamPark 的核心能力。在顺网科技的技术选型过程中,我们发现 StreamPark 所具备强大的功能不仅包含了现有 Streaming-Launcher 的基础功能,还提供了更完整的对应方案解决了 Streaming-Launcher 的诸多不足。在这部分,着重介绍下 StreamPark 针对我们自研平台 Streaming-Launcher 的不足所提供的解决方案。 +上面我们简单介绍了 Apache StreamPark 的核心能力。在顺网科技的技术选型过程中,我们发现 Apache StreamPark 所具备强大的功能不仅包含了现有 Streaming-Launcher 的基础功能,还提供了更完整的对应方案解决了 Streaming-Launcher 的诸多不足。在这部分,着重介绍下 Apache StreamPark 针对我们自研平台 Streaming-Launcher 的不足所提供的解决方案。 @@ -122,21 +122,21 @@ Streaming-Launcher 中,没有提供统一的作业管理界面。开发同学 **Flink 作业一站式的开发能力** -StreamPark 为了降低 Flink 作业开发门槛,提高开发同学工作效率,**提供了 FlinkSQL IDE、参数管理、任务管理、代码管理、一键编译、一键作业上下线等使用的功能**。在调研中,我们发现 StreamPark 集成的这些功能可以进一步提升开发同学的工作效率。在某种程度上来说,开发同学不需要去关心 Flink 作业管理和运维的难题,只要专注于业务的开发。同时,这些功能也解决了 Streaming-Launcher 中 SQL 开发流程繁琐的痛点。 +Apache StreamPark 为了降低 Flink 作业开发门槛,提高开发同学工作效率,**提供了 FlinkSQL IDE、参数管理、任务管理、代码管理、一键编译、一键作业上下线等使用的功能**。在调研中,我们发现 Apache StreamPark 集成的这些功能可以进一步提升开发同学的工作效率。在某种程度上来说,开发同学不需要去关心 Flink 作业管理和运维的难题,只要专注于业务的开发。同时,这些功能也解决了 Streaming-Launcher 中 SQL 开发流程繁琐的痛点。 ![图片](/blog/shunwang/application.png) **支持多种部署模式** -在 Streaming-Launcher 中,由于只支持 Yarn Session 模式,对于开发同学来说,其实非常不灵活。StreamPark 对于这一方面也提供了完善的解决方案。**StreamPark 完整的支持了Flink 的所有部署模式:Remote、Yarn Per-Job、Yarn Application、Yarn Session、K8s Session、K8s Application****,**可以让开发同学针对不同的业务场景自由选择合适的运行模式。** +在 Streaming-Launcher 中,由于只支持 Yarn Session 模式,对于开发同学来说,其实非常不灵活。Apache StreamPark 对于这一方面也提供了完善的解决方案。**Apache StreamPark 完整的支持了Flink 的所有部署模式:Remote、Yarn Per-Job、Yarn Application、Yarn Session、K8s Session、K8s Application****,**可以让开发同学针对不同的业务场景自由选择合适的运行模式。** **作业统一管理中心** -对于开发同学来说,作业运行状态是他们最关心的内容之一。在 Streaming-Launcher 中由于缺乏作业统一管理界面,开发同学只能通过告警信息和 Yarn 中Application 的状态信息来判断任务状态,这对开发同学来说非常不友好。StreamPark 针对这一点,提供了作业统一管理界面,可以一目了然查看到每个任务的运行情况。 +对于开发同学来说,作业运行状态是他们最关心的内容之一。在 Streaming-Launcher 中由于缺乏作业统一管理界面,开发同学只能通过告警信息和 Yarn 中Application 的状态信息来判断任务状态,这对开发同学来说非常不友好。Apache StreamPark 针对这一点,提供了作业统一管理界面,可以一目了然查看到每个任务的运行情况。 ![图片](/blog/shunwang/management.png) -在 Streaming-Launcher 中,开发同学在作业问题诊断的时候,需要通过多个步骤才能定位作业运行日志。StreamPark 提供了一键跳转功能,能快速定位到作业运行日志。 +在 Streaming-Launcher 中,开发同学在作业问题诊断的时候,需要通过多个步骤才能定位作业运行日志。Apache StreamPark 提供了一键跳转功能,能快速定位到作业运行日志。 ![图片](/blog/shunwang/logs.png) @@ -144,17 +144,17 @@ StreamPark 为了降低 Flink 作业开发门槛,提高开发同学工作效 ## 落 地 实 践 -在 StreamPark 引入顺网科技时,由于公司业务的特点和开发同学的一些定制化需求,我们对 StreamPark 的功能做了一些增加和优化,同时也总结了一些在使用过程中遇到的问题和对应的解决方案。 +在 Apache StreamPark 引入顺网科技时,由于公司业务的特点和开发同学的一些定制化需求,我们对 Apache StreamPark 的功能做了一些增加和优化,同时也总结了一些在使用过程中遇到的问题和对应的解决方案。 ### 01 **结合 Flink-SQL-Gateway 的能力** -在顺网科技,我们基于 Flink-SQL-Gateway 自研了 ODPS 平台来方便业务开发同学管理 Flink 表的元数据。业务开发同学在 ODPS 上对 Flink 表进行 DDL 操作,然后在 StreamPark 上对创建的 Flink 表进行分析查询操作。在整个业务开发流程上,我们对 Flink 
表的创建和分析实现了解耦,让开发流程显得比较清晰。 +在顺网科技,我们基于 Flink-SQL-Gateway 自研了 ODPS 平台来方便业务开发同学管理 Flink 表的元数据。业务开发同学在 ODPS 上对 Flink 表进行 DDL 操作,然后在 Apache StreamPark 上对创建的 Flink 表进行分析查询操作。在整个业务开发流程上,我们对 Flink 表的创建和分析实现了解耦,让开发流程显得比较清晰。 -开发同学如果想在 ODPS 上查询实时数据,我们需要提供一个 Flink SQL 的运行环境。我们使用 StreamPark 运行了一个 Yarn Session 的 Flink 环境提供给 ODPS 做实时查询。 +开发同学如果想在 ODPS 上查询实时数据,我们需要提供一个 Flink SQL 的运行环境。我们使用 Apache StreamPark 运行了一个 Yarn Session 的 Flink 环境提供给 ODPS 做实时查询。 ![图片](/blog/shunwang/homework.png) -目前 StreamPark 社区为了进一步降低实时作业开发门槛,正在对接 Flink-SQL-Gateway。 +目前 Apache StreamPark 社区为了进一步降低实时作业开发门槛,正在对接 Flink-SQL-Gateway。 https://github.com/apache/streampark/issues/2274 @@ -172,7 +172,7 @@ https://github.com/apache/streampark/issues/2274 ### 03 **增强告警能力** -因为每个公司的短信告警平台实现都各不相同,所以 StreamPark 社区并没有抽象出统一的短信告警功能。在此,我们通过 Webhook 的方式,自己实现了短信告警功能。 +因为每个公司的短信告警平台实现都各不相同,所以 Apache StreamPark 社区并没有抽象出统一的短信告警功能。在此,我们通过 Webhook 的方式,自己实现了短信告警功能。 ![图片](/blog/shunwang/alarm.png) @@ -180,7 +180,7 @@ https://github.com/apache/streampark/issues/2274 ### 04 **增加阻塞队列解决限流问题** -在生产实践中,我们发现在大批量任务同时失败的时候,比如 Yarn Session 集群挂了,飞书 / 微信等平台在多线程同时调用告警接口时会存在限流的问题,那么大量的告警信息因为飞书 / 微信等平台限流问题,StreamPark 只会发送一部分的告警信息,这样非常容易误导开发同学排查问题。在顺网科技,我们增加了一个阻塞队列和一个告警线程,来解决限流问题。 +在生产实践中,我们发现在大批量任务同时失败的时候,比如 Yarn Session 集群挂了,飞书 / 微信等平台在多线程同时调用告警接口时会存在限流的问题,那么大量的告警信息因为飞书 / 微信等平台限流问题,Apache StreamPark 只会发送一部分的告警信息,这样非常容易误导开发同学排查问题。在顺网科技,我们增加了一个阻塞队列和一个告警线程,来解决限流问题。 ![图片](/blog/shunwang/queue.png) @@ -192,11 +192,11 @@ https://github.com/apache/streampark/issues/2142 ## 带来的收益 -我们从 StreamX 1.2.3(StreamPark 前身)开始探索和使用,经过一年多时间的磨合,我们发现 StreamPark 真实解决了 Flink 作业在开发管理和运维上的诸多痛点。 +我们从 StreamX 1.2.3(Apache StreamPark 前身)开始探索和使用,经过一年多时间的磨合,我们发现 Apache StreamPark 真实解决了 Flink 作业在开发管理和运维上的诸多痛点。 -StreamPark 给顺网科技带来的最大的收益就是降低了 Flink 的使用门槛,提升了开发效率。我们业务开发同学在原先的 Streaming-Launcher 中需要使用 vscode、GitLab 和调度平台等多个工具完成一个 FlinkSQL 作业开发,从开发到编译到发布的流程中经过多个工具使用,流程繁琐。StreamPark 提供一站式服务,可以在 StreamPark 上完成作业开发编译发布,简化了整个开发流程。 +Apache StreamPark 给顺网科技带来的最大的收益就是降低了 Flink 的使用门槛,提升了开发效率。我们业务开发同学在原先的 Streaming-Launcher 中需要使用 vscode、GitLab 和调度平台等多个工具完成一个 FlinkSQL 作业开发,从开发到编译到发布的流程中经过多个工具使用,流程繁琐。Apache StreamPark 提供一站式服务,可以在 Apache StreamPark 上完成作业开发编译发布,简化了整个开发流程。 -**目前 StreamPark 在顺网科技已经大规模在生产环境投入使用,StreamPark 从最开始管理的 500+ 个 FlinkSQL 作业增加到了近 700 个 FlinkSQL作业,同时管理了 10+ 个 Yarn Sesssion Cluster。** +**目前 Apache StreamPark 在顺网科技已经大规模在生产环境投入使用,Apache StreamPark 从最开始管理的 500+ 个 FlinkSQL 作业增加到了近 700 个 FlinkSQL作业,同时管理了 10+ 个 Yarn Sesssion Cluster。** ![图片](/blog/shunwang/achievements1.png) @@ -204,11 +204,11 @@ StreamPark 给顺网科技带来的最大的收益就是降低了 Flink 的使 ## 未 来 规 划 -顺网科技作为 StreamPark 早期的用户之一,在 1 年期间内一直和社区同学保持交流,参与 StreamPark 的稳定性打磨,我们将生产运维中遇到的 Bug 和新的 Feature 提交给了社区。在未来,我们希望可以在 StreamPark 上管理 Flink 表的元数据信息,基于 Flink 引擎通过多 Catalog 实现跨数据源查询分析功能。目前 StreamPark 正在对接 Flink-SQL-Gateway 能力,这一块在未来对于表元数据的管理和跨数据源查询功能会提供了很大的帮助。 +顺网科技作为 Apache StreamPark 早期的用户之一,在 1 年期间内一直和社区同学保持交流,参与 Apache StreamPark 的稳定性打磨,我们将生产运维中遇到的 Bug 和新的 Feature 提交给了社区。在未来,我们希望可以在 Apache StreamPark 上管理 Flink 表的元数据信息,基于 Flink 引擎通过多 Catalog 实现跨数据源查询分析功能。目前 Apache StreamPark 正在对接 Flink-SQL-Gateway 能力,这一块在未来对于表元数据的管理和跨数据源查询功能会提供了很大的帮助。 -由于在顺网科技多是已 Yarn Session 模式运行的作业,我们希望 StreamPark 可以提供更多对于 Remote集群、Yarn Session 集群和 K8s Session 集群功能支持,比如监控告警,优化操作流程等方面。 +由于在顺网科技多是已 Yarn Session 模式运行的作业,我们希望 Apache StreamPark 可以提供更多对于 Remote集群、Yarn Session 集群和 K8s Session 集群功能支持,比如监控告警,优化操作流程等方面。 -考虑到未来,随着业务发展可能会使用 StreamPark 管理更多的 Flink 实时作业,单节点模式下的 StreamPark 可能并不安全。所以我们对于 StreamPark 的 HA 也是非常期待。 +考虑到未来,随着业务发展可能会使用 Apache StreamPark 管理更多的 Flink 实时作业,单节点模式下的 Apache StreamPark 
可能并不安全。所以我们对于 Apache StreamPark 的 HA 也是非常期待。 -对于 StreamPark 对接 Flink-SQL-Gateway 能力、丰富 Flink Cluster 功能和 StreamPark HA,我们后续也会参与建设中。 +对于 Apache StreamPark 对接 Flink-SQL-Gateway 能力、丰富 Flink Cluster 功能和 Apache StreamPark HA,我们后续也会参与建设中。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-blog/5-streampark-usercase-dustess.md b/i18n/zh-CN/docusaurus-plugin-content-blog/5-streampark-usercase-dustess.md index bec4373ac..4c62a0700 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-blog/5-streampark-usercase-dustess.md +++ b/i18n/zh-CN/docusaurus-plugin-content-blog/5-streampark-usercase-dustess.md @@ -1,10 +1,10 @@ --- slug: streampark-usercase-dustess -title: StreamPark 在尘锋信息的最佳实践,化繁为简极致体验 -tags: [StreamPark, 生产实践, FlinkSQL] +title: Apache StreamPark 在尘锋信息的最佳实践,化繁为简极致体验 +tags: [Apache StreamPark, 生产实践, FlinkSQL] --- -**摘要:**本文源自 StreamPark 在尘锋信息的生产实践, 作者是资深数据开发工程师Gump。主要内容为: +**摘要:**本文源自 Apache StreamPark 在尘锋信息的生产实践, 作者是资深数据开发工程师Gump。主要内容为: 1. 技术选型 2. 落地实践 @@ -35,7 +35,7 @@ tags: [StreamPark, 生产实践, FlinkSQL] - Flink 支持批流一体,虽然目前公司的批处理架构还是基于 Hive、Spark 等。使用 Flink 进行流计算,便于后期建设批流一体和湖仓一体 - 目前国内 Flink 生态已经越来越成熟,Flink 也开始着手踏破边界向流式数仓发展 -在平台层面,我们综合对比了 StreamPark 、 Apache Zeppelin 和 flink-streaming-platform-web,也深入阅读了源码和并做了优缺点分析,关于后两个项目本文就不展开赘述,感兴趣的朋友可以去 GitHub 搜索,我们最终选择 StreamPark,理由如下: +在平台层面,我们综合对比了 Apache StreamPark 、 Apache Zeppelin 和 flink-streaming-platform-web,也深入阅读了源码和并做了优缺点分析,关于后两个项目本文就不展开赘述,感兴趣的朋友可以去 GitHub 搜索,我们最终选择 Apache StreamPark,理由如下: ### **完成度高** @@ -51,11 +51,11 @@ tags: [StreamPark, 生产实践, FlinkSQL] #### **2. 支持多种部署模式** -StreamPark 支持 Flink **所有主流的提交模式**,如 standalone、yarn-session 、yarn application、yarn-perjob、kubernetes-session、kubernetes-application 而且StreamPark 不是简单的拼接 Flink run 命令来进行的任务提交,而是引入了 Flink Client 源码包,直接调用 Flink Client API 来进行的任务提交。这样的好处是代码模块化、易读、便于扩展,稳定,且能在后期根据 Flink 版本升级进行很快的适配。 +Apache StreamPark 支持 Flink **所有主流的提交模式**,如 standalone、yarn-session 、yarn application、yarn-perjob、kubernetes-session、kubernetes-application 而且Apache StreamPark 不是简单的拼接 Flink run 命令来进行的任务提交,而是引入了 Flink Client 源码包,直接调用 Flink Client API 来进行的任务提交。这样的好处是代码模块化、易读、便于扩展,稳定,且能在后期根据 Flink 版本升级进行很快的适配。 ![](/blog/dustess/execution_mode.png) -Flink SQL 可以极大提升开发效率和提高 Flink 的普及。StreamPark 对于 **Flink SQL 的支持非常到位**,优秀的 SQL 编辑器,依赖管理,任务多版本管理等等。StreamPark 官网介绍后期会加入 Flink SQL 的元数据管理整合,大家拭目以待。 +Flink SQL 可以极大提升开发效率和提高 Flink 的普及。Apache StreamPark 对于 **Flink SQL 的支持非常到位**,优秀的 SQL 编辑器,依赖管理,任务多版本管理等等。Apache StreamPark 官网介绍后期会加入 Flink SQL 的元数据管理整合,大家拭目以待。 ![](/blog/dustess/flink_sql.png) @@ -69,33 +69,33 @@ Flink SQL 可以极大提升开发效率和提高 Flink 的普及。StreamPark Flink SQL 现在虽然足够强大,但使用 Java 和 Scala 等 JVM 语言开发 Flink 任务会更加灵活、定制化更强、便于调优和提升资源利用率。与 SQL 相比 Jar 包提交任务最大的问题是Jar包的上传管理等,没有优秀的工具产品会严重降低开发效率和加大维护成本。 -StreamPark 除了支持 Jar 上传,更提供了**在线更新构建**的功能,优雅解决了以上问题: +Apache StreamPark 除了支持 Jar 上传,更提供了**在线更新构建**的功能,优雅解决了以上问题: -1、新建 Project :填写 GitHub/Gitlab(支持企业私服)地址及用户名密码, StreamPark 就能 Pull 和 Build 项目。 +1、新建 Project :填写 GitHub/Gitlab(支持企业私服)地址及用户名密码, Apache StreamPark 就能 Pull 和 Build 项目。 -2、创建 StreamPark Custom-Code 任务时引用 Project,指定主类,启动任务时可选自动 Pull、Build 和绑定生成的 Jar,非常优雅! +2、创建 Apache StreamPark Custom-Code 任务时引用 Project,指定主类,启动任务时可选自动 Pull、Build 和绑定生成的 Jar,非常优雅! -同时 StreamPark 社区最近也在完善整个任务编译、上线的流程,以后的 StreamPark 会在此基础上更加完善和专业。 +同时 Apache StreamPark 社区最近也在完善整个任务编译、上线的流程,以后的 Apache StreamPark 会在此基础上更加完善和专业。 ![](/blog/dustess/system_list.png) #### **5. 
完善的任务参数配置** -对于使用 Flink 做数据开发而言,Flink run 提交的参数几乎是难以维护的。StreamPark 也非常**优雅的解决**了此类问题,原因是上面提到的 StreamPark 直接调用 Flink Client API,并且从 StreamPark 产品前端打通了整个流程。 +对于使用 Flink 做数据开发而言,Flink run 提交的参数几乎是难以维护的。Apache StreamPark 也非常**优雅的解决**了此类问题,原因是上面提到的 Apache StreamPark 直接调用 Flink Client API,并且从 Apache StreamPark 产品前端打通了整个流程。 ![](/blog/dustess/parameter_configuration.png) -大家可以看到,StreamPark 的任务参数设置涵盖了主流的所有参数,并且非常细心的对每个参数都做了介绍和最佳实践的最优推荐。这对于刚使用 Flink 的同学来说也是非常好的事情,避免踩坑! +大家可以看到,Apache StreamPark 的任务参数设置涵盖了主流的所有参数,并且非常细心的对每个参数都做了介绍和最佳实践的最优推荐。这对于刚使用 Flink 的同学来说也是非常好的事情,避免踩坑! #### **6. 优秀的配置文件设计** -对于 Flink 任务的原生参数,上面的任务参数已经涵盖了很大一部分。StreamPark 还提供了强大的**Yaml 配置文件** 模式和 **编程模型**。 +对于 Flink 任务的原生参数,上面的任务参数已经涵盖了很大一部分。Apache StreamPark 还提供了强大的**Yaml 配置文件** 模式和 **编程模型**。 ![](/blog/dustess/extended_parameters.jpg) -1、对于 Flink SQL 任务,直接使用任务的 Yaml 配置文件可以配置 StreamPark 已经内置的参数,如常见的 **CheckPoint、重试机制、State Backend、table planer 、mode** 等等。 +1、对于 Flink SQL 任务,直接使用任务的 Yaml 配置文件可以配置 Apache StreamPark 已经内置的参数,如常见的 **CheckPoint、重试机制、State Backend、table planer 、mode** 等等。 -2、对于 Jar 任务,StreamPark 提供了通用的编程模型,该模型封装了 Flink 原生 API ,结合 StreamPark 提供的封装包可以非常优雅的获取配置文件中的自定义参数,这块文档详见: +2、对于 Jar 任务,Apache StreamPark 提供了通用的编程模型,该模型封装了 Flink 原生 API ,结合 Apache StreamPark 提供的封装包可以非常优雅的获取配置文件中的自定义参数,这块文档详见: 编程模型: @@ -111,13 +111,13 @@ http://www.streamxhub.com/docs/development/config 除此之外: -StreamPark 也**支持Apache Flink 原生任务**,参数配置可以由 Java 任务内部代码静态维护,可以覆盖非常多的场景,比如存量 Flink 任务无缝迁移等等 +Apache StreamPark 也**支持Apache Flink 原生任务**,参数配置可以由 Java 任务内部代码静态维护,可以覆盖非常多的场景,比如存量 Flink 任务无缝迁移等等 #### **7. Checkpoint 管理** -关于 Flink 的 Checkpoint(Savepoint)机制,最大的困难是维护 ,StreamPark 也非常优雅的解决此问题: +关于 Flink 的 Checkpoint(Savepoint)机制,最大的困难是维护 ,Apache StreamPark 也非常优雅的解决此问题: -- StreamPark 会**自动维护**任务 Checkpoint 的目录及版本至系统中方便检索 +- Apache StreamPark 会**自动维护**任务 Checkpoint 的目录及版本至系统中方便检索 - 当用户需要更新重启应用时,可以选择是否保存 Savepoint - 重新启动任务时可选择 Checkpoint/Savepoint 从指定版本的恢复 @@ -129,42 +129,42 @@ StreamPark 也**支持Apache Flink 原生任务**,参数配置可以由 Java #### **8. 
完善的报警功能** -对于流式计算此类7*24H常驻任务来说,监控报警是非常重要的 ,StreamPark 对于此类问题也有**完善的解决方案**: +对于流式计算此类7*24H常驻任务来说,监控报警是非常重要的 ,Apache StreamPark 对于此类问题也有**完善的解决方案**: - 自带基于邮件的报警方式,0开发成本,配置即可使用 -- 得益于 StreamPark 源码优秀的模块化,可以在 Task Track 处进行代码增强,引入公司内部的 SDK 进行电话、群组等报警方式,开发成本也非常低 +- 得益于 Apache StreamPark 源码优秀的模块化,可以在 Task Track 处进行代码增强,引入公司内部的 SDK 进行电话、群组等报警方式,开发成本也非常低 ![](/blog/dustess/alarm_email.png) ### **源码优秀** -遵循技术选型原则,一个新的技术必须足够了解底层原理和架构思想后,才会考虑应用生产。在选择 StreamPark 之前,对其架构和源码进入过深入研究和阅读。发现 StreamPark 所选用的底层技术是国人非常熟悉的:MySQL、Spring Boot、Mybatis Plus、Vue 等,代码风格统一,实现优雅,注释完善,各模块独立抽象合理,使用了较多的设计模式,且代码质量很高,非常适合后期的排错及二次开发。 +遵循技术选型原则,一个新的技术必须足够了解底层原理和架构思想后,才会考虑应用生产。在选择 Apache StreamPark 之前,对其架构和源码进入过深入研究和阅读。发现 Apache StreamPark 所选用的底层技术是国人非常熟悉的:MySQL、Spring Boot、Mybatis Plus、Vue 等,代码风格统一,实现优雅,注释完善,各模块独立抽象合理,使用了较多的设计模式,且代码质量很高,非常适合后期的排错及二次开发。 ![](/blog/dustess/code_notebook.png) -StreamPark 于 2021年11月成功被开源中国评选为GVP - Gitee「最有价值开源项目」,足以见得其质量和潜力。 +Apache StreamPark 于 2021年11月成功被开源中国评选为GVP - Gitee「最有价值开源项目」,足以见得其质量和潜力。 ![](/blog/dustess/certificate.png) ### **03 社区活跃** -目前社区非常活跃,从2021年11月底落地 StreamPark (基于1.2.0-release),当时StreamPark 刚刚才被大家认识,还有一些体验上的小 Bug(不影响核心功能)。当时为了快速上线,屏蔽掉了一些功能和修复了一些小 Bug,正当准备贡献给社区时发现早已修复,这也可以看出目前社区的迭代周期非常快。以后我们公司团队也会努力和社区保持一致,将新特性快速落地,提升数据开发效率和降低维护成本。 +目前社区非常活跃,从2021年11月底落地 Apache StreamPark (基于1.2.0-release),当时Apache StreamPark 刚刚才被大家认识,还有一些体验上的小 Bug(不影响核心功能)。当时为了快速上线,屏蔽掉了一些功能和修复了一些小 Bug,正当准备贡献给社区时发现早已修复,这也可以看出目前社区的迭代周期非常快。以后我们公司团队也会努力和社区保持一致,将新特性快速落地,提升数据开发效率和降低维护成本。 ## **02 落地实践** -StreamPark 的环境搭建非常简单,跟随官网的搭建教程可以在小时内完成搭建。目前已经支持了前后端分离打包部署的模式,可以满足更多公司的需求,而且已经有 Docker Build 相关的 PR,相信以后 StreamPark 的编译部署会更加方便快捷。相关文档如下: +Apache StreamPark 的环境搭建非常简单,跟随官网的搭建教程可以在小时内完成搭建。目前已经支持了前后端分离打包部署的模式,可以满足更多公司的需求,而且已经有 Docker Build 相关的 PR,相信以后 Apache StreamPark 的编译部署会更加方便快捷。相关文档如下: ``` http://www.streamxhub.com/docs/user-guide/deployment ``` -为了快速落地和生产使用,我们选择了稳妥的 On Yarn 资源管理模式(虽然 StreamPark 已经很完善的支持 K8S),且已经有较多公司通过 StreamPark 落地了 K8S 部署方式,大家可以参考: +为了快速落地和生产使用,我们选择了稳妥的 On Yarn 资源管理模式(虽然 Apache StreamPark 已经很完善的支持 K8S),且已经有较多公司通过 Apache StreamPark 落地了 K8S 部署方式,大家可以参考: ``` http://www.streamxhub.com/blog/flink-development-framework-streamx ``` -StreamPark 整合 Hadoop 生态可以说是0成本的(前提是按照 Flink 官网将 Flink 与 Hadoop 生态整合,能够通过 Flink 脚本启动任务即可) +Apache StreamPark 整合 Hadoop 生态可以说是0成本的(前提是按照 Flink 官网将 Flink 与 Hadoop 生态整合,能够通过 Flink 脚本启动任务即可) 目前我们也正在进行 K8S 的测试及方案设计,在未来一段时间会整体迁移至 K8S @@ -174,7 +174,7 @@ StreamPark 整合 Hadoop 生态可以说是0成本的(前提是按照 Flink ![](/blog/dustess/online_flinksql.png) -StreamPark 非常贴心的准备了 Demo SQL 任务,可以直接在刚搭建的平台上运行,从这些细节可以看出社区对用户体验非常重视。前期我们的简单任务都是通过 Flink SQL 来编写及运行,StreamPark 对于 Flink SQL 的支持得非常好,优秀的 SQL 编辑器,创新型的 POM 及 Jar 包依赖管理,可以满足非常多的 SQL 场景下的问题。 +Apache StreamPark 非常贴心的准备了 Demo SQL 任务,可以直接在刚搭建的平台上运行,从这些细节可以看出社区对用户体验非常重视。前期我们的简单任务都是通过 Flink SQL 来编写及运行,Apache StreamPark 对于 Flink SQL 的支持得非常好,优秀的 SQL 编辑器,创新型的 POM 及 Jar 包依赖管理,可以满足非常多的 SQL 场景下的问题。 目前我们正在进行元数据层面、权限、UDF等相关的方案调研、设计等 @@ -182,11 +182,11 @@ StreamPark 非常贴心的准备了 Demo SQL 任务,可以直接在刚搭建 由于目前团队的数据开发同学大多有 Java 和 Scala 语言基础,为了更加灵活的开发、更加透明的调优 Flink 任务及覆盖更多场景,我们也快速的落地了基于 Jar 包的构建方式。我们落地分为了两个阶段 -第一阶段:**StreamPark 提供了原生 Apache Flink 项目的支持**,我们将存量的任务Git地址配置至 StreamPark,底层使用 Maven 打包为 Jar 包,创建 StreamPark 的 Apache Flink任务,无缝的进行了迁移。在这个过程中,StreamPark 只是作为了任务提交和状态维护的一个平台工具,远远没有使用到上面提到的其他功能。 +第一阶段:**Apache StreamPark 提供了原生 Apache Flink 项目的支持**,我们将存量的任务Git地址配置至 Apache StreamPark,底层使用 Maven 打包为 Jar 包,创建 Apache StreamPark 的 Apache Flink任务,无缝的进行了迁移。在这个过程中,Apache StreamPark 只是作为了任务提交和状态维护的一个平台工具,远远没有使用到上面提到的其他功能。 -第二阶段:第一阶段将任务都迁移至 StreamPark 上之后,任务已经在平台上运行,但是任务的配置,如 checkpoint,容错以及 
Flink 任务内部的业务参数的调整都需要修改源码 push 及 build,效率十分低下且不透明。 +第二阶段:第一阶段将任务都迁移至 Apache StreamPark 上之后,任务已经在平台上运行,但是任务的配置,如 checkpoint,容错以及 Flink 任务内部的业务参数的调整都需要修改源码 push 及 build,效率十分低下且不透明。 -于是,根据 StreamPark 的 QuickStart 我们快速整合了StreamPark 的编程模型,也就是StreamPark Flink 任务(对于 Apache Flink)的封装。 +于是,根据 Apache StreamPark 的 QuickStart 我们快速整合了Apache StreamPark 的编程模型,也就是Apache StreamPark Flink 任务(对于 Apache Flink)的封装。 如: @@ -194,7 +194,7 @@ StreamPark 非常贴心的准备了 Demo SQL 任务,可以直接在刚搭建 StreamingContext = ParameterTool + StreamExecutionEnvironment ``` -- StreamingContext 为 StreamPark 的封装对象 +- StreamingContext 为 Apache StreamPark 的封装对象 - ParameterTool 为解析配置文件后的参数对象 ``` @@ -205,7 +205,7 @@ StreamingContext = ParameterTool + StreamExecutionEnvironment ## **03 业务支撑 & 能力开放** -目前尘锋基于 StreamPark 的实时计算平台从去年11月底上线至今,已经上线 50+ Flink 任务,其中 10+为 Flink SQL 任务,40+ 为 Jar 任务。目前主要还是数据团队内部使用,近期会将实时计算平台开放全公司业务团队使用,任务量会大量增加。 +目前尘锋基于 Apache StreamPark 的实时计算平台从去年11月底上线至今,已经上线 50+ Flink 任务,其中 10+为 Flink SQL 任务,40+ 为 Jar 任务。目前主要还是数据团队内部使用,近期会将实时计算平台开放全公司业务团队使用,任务量会大量增加。 ![](/blog/dustess/online_jar.png) @@ -213,17 +213,17 @@ StreamingContext = ParameterTool + StreamExecutionEnvironment 时数仓主要是用 Jar 任务,因为模式比较通用,使用 Jar 任务可以通用化的处理大量的数据表同步和计算,甚至做到配置化同步等,我们的实时数仓主要基 Apache Doris 来存储,使用 Flink 来进行清洗计算(目标是存算分离) -使用 StreamPark 整合其他组件也是非常简单,同时我们也将 Apache Doris 和 Kafka 相关的配置也抽象到了配置文件中,大大提升了我们的开发效率和灵活度。 +使用 Apache StreamPark 整合其他组件也是非常简单,同时我们也将 Apache Doris 和 Kafka 相关的配置也抽象到了配置文件中,大大提升了我们的开发效率和灵活度。 ### **02 能力开放** -数据团队外的其他业务团队也有很多的流处理场景,于是我们将基于 StreamPark 的实时计算平台二次开发后,将以下能力开放全公司业务团队 +数据团队外的其他业务团队也有很多的流处理场景,于是我们将基于 Apache StreamPark 的实时计算平台二次开发后,将以下能力开放全公司业务团队 - 业务能力开放:实时数仓上游将所有业务表通过日志采集写入 Kafka,业务团队可基于 Kafka 进行业务相关开发,也可通过实时数仓(Apache Doris)进行 OLAP分析 - 计算能力开放:将大数据平台的服务器资源开放业务团队使用 - 解决方案开放:Flink 生态的成熟 Connector、Exactly Once 语义支持,可减少业务团队流处理相关的开发成本及维护成本 -目前 StreamPark 还不支持多业务组功能,多业务组功能会抽象后贡献社区。 +目前 Apache StreamPark 还不支持多业务组功能,多业务组功能会抽象后贡献社区。 ![](/blog/dustess/manager.png) @@ -240,7 +240,7 @@ StreamingContext = ParameterTool + StreamExecutionEnvironment - **存储计算分离**。Flink 计算资源和状态存储分离,计算资源能够和其他组件资源进行混部,提升机器使用率 - **弹性扩缩容**。能够弹性扩缩容,更好的节省人力和物力成本 -目前本人也在整理和落地相关的技术架构及方案,并已在实验环境使用 StreamPark 完成了 Flink on kubernetes 的技术验证,生产落地这一目标由于有 StreamPark 的平台支持,以及社区同学的热心帮心,相信在未来不久就能达成。 +目前本人也在整理和落地相关的技术架构及方案,并已在实验环境使用 Apache StreamPark 完成了 Flink on kubernetes 的技术验证,生产落地这一目标由于有 Apache StreamPark 的平台支持,以及社区同学的热心帮心,相信在未来不久就能达成。 ### **02 流批一体建设** @@ -259,7 +259,7 @@ StreamingContext = ParameterTool + StreamExecutionEnvironment ## **05 结束语** -以上就是 StreamPark 在尘锋信息生产实践的全部分享内容,感谢大家看到这里。写这篇文章的初心是为大家带来一点 StreamPark 的生产实践的经验和参考,并且和 StreamPark 社区的小伙伴们一道,共同建设 StreamPark ,未来也准备会有更多的参与和建设。非常感谢 StreamPark 的开发者们,能够提供这样优秀的产品,足够多的细节都感受到了大家的用心。虽然目前公司生产使用的(1.2.0-release)版本,在任务分组检索,编辑返回跳页等交互体验上还有些许不足,但瑕不掩瑜,相信 StreamPark 会越来越好,**也相信 StreamPark 会推动 Apache Flink 的普及**。最后用 Apache Flink 社区的一句话来作为结束吧:实时即未来! +以上就是 Apache StreamPark 在尘锋信息生产实践的全部分享内容,感谢大家看到这里。写这篇文章的初心是为大家带来一点 Apache StreamPark 的生产实践的经验和参考,并且和 Apache StreamPark 社区的小伙伴们一道,共同建设 Apache StreamPark ,未来也准备会有更多的参与和建设。非常感谢 Apache StreamPark 的开发者们,能够提供这样优秀的产品,足够多的细节都感受到了大家的用心。虽然目前公司生产使用的(1.2.0-release)版本,在任务分组检索,编辑返回跳页等交互体验上还有些许不足,但瑕不掩瑜,相信 Apache StreamPark 会越来越好,**也相信 Apache StreamPark 会推动 Apache Flink 的普及**。最后用 Apache Flink 社区的一句话来作为结束吧:实时即未来! 
diff --git a/i18n/zh-CN/docusaurus-plugin-content-blog/6-streampark-usercase-joyme.md b/i18n/zh-CN/docusaurus-plugin-content-blog/6-streampark-usercase-joyme.md index f983b3a16..a3a07d2e7 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-blog/6-streampark-usercase-joyme.md +++ b/i18n/zh-CN/docusaurus-plugin-content-blog/6-streampark-usercase-joyme.md @@ -1,12 +1,12 @@ --- slug: streampark-usercase-joyme -title: StreamPark 在 Joyme 的生产实践 -tags: [StreamPark, 生产实践, FlinkSQL] +title: Apache StreamPark 在 Joyme 的生产实践 +tags: [Apache StreamPark, 生产实践, FlinkSQL] --- -**摘要:** 本文带来 StreamPark 在 Joyme 中的生产实践, 作者是 Joyme 的大数据工程师秦基勇, 主要内容为: +**摘要:** 本文带来 Apache StreamPark 在 Joyme 中的生产实践, 作者是 Joyme 的大数据工程师秦基勇, 主要内容为: -- 遇见StreamPark +- 遇见Apache StreamPark - Flink Sql 作业开发 - Custom code 作业开发 - 监控告警 @@ -16,9 +16,9 @@ tags: [StreamPark, 生产实践, FlinkSQL] -## 1 遇见 StreamPark +## 1 遇见 Apache StreamPark -遇见 StreamPark 是必然的,基于我们现有的实时作业开发模式,不得不寻找一个开源的平台来支撑我司的实时业务。我们的现状如下: +遇见 Apache StreamPark 是必然的,基于我们现有的实时作业开发模式,不得不寻找一个开源的平台来支撑我司的实时业务。我们的现状如下: - 编写作业打包到服务器,然后执行 Flink run 命令进行提交,过程繁琐,效率低下 - Flink Sql 通过自研的老平台提交,老平台开发人员已离职,后续的代码无人维护,即便有人维护也不得不面对维护成本高的问题 @@ -27,13 +27,13 @@ tags: [StreamPark, 生产实践, FlinkSQL] 基于以上种种原因,我们需要一个开源平台来管理我们的实时作业,同时我们也需要进行重构,统一开发模式,统一开发语言,将项目集中管理。 -第一次遇见 StreamPark 就基本确定了,我们根据官网的文档快速进行了部署安装,搭建以后进行了一些操作,界面友好,Flink 多版本支持,权限管理,作业监控等一系列功能已能较好的满足我们的需求,进一步了解到其社区也很活跃,从 1.1.0 版本开始见证了 StreamPark 功能完善的过程,开发团队是非常有追求的,相信会不断的完善。 +第一次遇见 Apache StreamPark 就基本确定了,我们根据官网的文档快速进行了部署安装,搭建以后进行了一些操作,界面友好,Flink 多版本支持,权限管理,作业监控等一系列功能已能较好的满足我们的需求,进一步了解到其社区也很活跃,从 1.1.0 版本开始见证了 Apache StreamPark 功能完善的过程,开发团队是非常有追求的,相信会不断的完善。 ## 2 Flink SQL 作业开发 Flink Sql 开发模式带来了很大的便利,对于一些简单的指标开发,只需要简单的 Sql 就可以完成,不需要写一行代码。Flink Sql 方便了很多同学的开发工作,毕竟一些做仓库的同学在编写代码方面还是有些难度。 -打开 StreamPark 的任务新增界面进行添加新任务,默认 Development Mode 就是 Flink Sql 模式。直接在 Flink Sql 部分编写Sql 逻辑。 +打开 Apache StreamPark 的任务新增界面进行添加新任务,默认 Development Mode 就是 Flink Sql 模式。直接在 Flink Sql 部分编写Sql 逻辑。 Flink Sql 部分,按照 Flink 官网的文档逐步编写逻辑 Sql 即可,对于我司来说,一般就三部分: 接入 Source ,中间逻辑处理,最后 Sink。基本上 Source 都是消费 kafka 的数据,逻辑处理层会有关联 MySQL 去做维表查询,最后 Sink 部分大多是 Es,Redis,MySQL。 @@ -76,7 +76,7 @@ SELECT Data.uid FROM source_table; ### **2. 
添加依赖** -关于依赖这块是 StreamPark 里特有的,在 StreamPark 中创新型的将一个完整的 Flink Sql 任务拆分成两部分组成: Sql 和 依赖, Sql 很好理解不多啰嗦, 依赖是 Sql 里需要用到的一些 Connector 的 Jar, 如 Sql 里用到了 Kafka 和 MySQL 的 Connector, 那就需要引入这两个 Connector 的依赖, 在 StreamPark 中添加依赖两种方式,一种是基于标准的 Maven pom 坐标方式,另一种是从本地上传需要的 Jar 。这两种也可以混着用,按需添加,点击应用即可, 在提交作业的时候就会自动加载这些依赖。 +关于依赖这块是 Apache StreamPark 里特有的,在 Apache StreamPark 中创新型的将一个完整的 Flink Sql 任务拆分成两部分组成: Sql 和 依赖, Sql 很好理解不多啰嗦, 依赖是 Sql 里需要用到的一些 Connector 的 Jar, 如 Sql 里用到了 Kafka 和 MySQL 的 Connector, 那就需要引入这两个 Connector 的依赖, 在 Apache StreamPark 中添加依赖两种方式,一种是基于标准的 Maven pom 坐标方式,另一种是从本地上传需要的 Jar 。这两种也可以混着用,按需添加,点击应用即可, 在提交作业的时候就会自动加载这些依赖。 ![](/blog/joyme/add_dependency.png) @@ -106,7 +106,7 @@ Streaming 作业我们是使用 Flink java 进行开发,将之前 Spark scala ![](/blog/joyme/project_configuration.png) -配置完成以后,根据对应的项目进行编译,也就完成项目的打包环节。这样后面的 Constom code 作业也可以引用。每次需要上线都需要进行编译才可以,否则只能是上次编译的代码。这里有个问题,为了安全,我司的 gitlab 账号密码都是定期更新的。这样就会导致,StreamPark 已经配置好的项目还是之前的密码,结果导致编译时从 git 里拉取项目失败,导致整个编译环节失败,针对这个问题,我们联系到社区,了解到这部分已经在后续的 1.2.1 版本中支持了项目的修改操作。 +配置完成以后,根据对应的项目进行编译,也就完成项目的打包环节。这样后面的 Constom code 作业也可以引用。每次需要上线都需要进行编译才可以,否则只能是上次编译的代码。这里有个问题,为了安全,我司的 gitlab 账号密码都是定期更新的。这样就会导致,Apache StreamPark 已经配置好的项目还是之前的密码,结果导致编译时从 git 里拉取项目失败,导致整个编译环节失败,针对这个问题,我们联系到社区,了解到这部分已经在后续的 1.2.1 版本中支持了项目的修改操作。 ![](/blog/joyme/flink_system.png) @@ -120,7 +120,7 @@ Streaming 作业我们是使用 Flink java 进行开发,将之前 Spark scala ## 4 监控告警 -StreamPark 的监控需要在 setting 模块去配置发送邮件的基本信息。 +Apache StreamPark 的监控需要在 setting 模块去配置发送邮件的基本信息。 ![](/blog/joyme/system_setting.png) @@ -132,7 +132,7 @@ StreamPark 的监控需要在 setting 模块去配置发送邮件的基本信息 ![](/blog/joyme/alarm_eamil.png) -关于报警这一块目前我们基于 StreamPark 的 t_flink_app 表进行了一个定时任务的开发。为什么要这么做?因为发送邮件这种通知,大部分人可能不会去及时去看。所以我们选择监控每个任务的状态去把对应的监控信息发送我们的飞书报警群,这样可以及时发现问题去解决任务。一个简单的 python 脚本,然后配置了 crontab 去定时执行。 +关于报警这一块目前我们基于 Apache StreamPark 的 t_flink_app 表进行了一个定时任务的开发。为什么要这么做?因为发送邮件这种通知,大部分人可能不会去及时去看。所以我们选择监控每个任务的状态去把对应的监控信息发送我们的飞书报警群,这样可以及时发现问题去解决任务。一个简单的 python 脚本,然后配置了 crontab 去定时执行。 ## 5 常见问题 @@ -152,10 +152,10 @@ StreamPark 的监控需要在 setting 模块去配置发送邮件的基本信息 ## 6 社区印象 -很多时候我们在 StreamPark 用户群里讨论问题,都会得到社区小伙伴的即时响应。提交的一些 issue 在当下不能解决的,基本也会在下一个版本或者最新的代码分支中进行修复。在群里,我们也看到很多不是社区的小伙伴,也在积极互相帮助去解决问题。群里也有很多其他社区的大佬,很多小伙伴也积极加入了社区的开发工作。整个社区给我的感觉还是很活跃! +很多时候我们在 Apache StreamPark 用户群里讨论问题,都会得到社区小伙伴的即时响应。提交的一些 issue 在当下不能解决的,基本也会在下一个版本或者最新的代码分支中进行修复。在群里,我们也看到很多不是社区的小伙伴,也在积极互相帮助去解决问题。群里也有很多其他社区的大佬,很多小伙伴也积极加入了社区的开发工作。整个社区给我的感觉还是很活跃! 
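
回到前文监控告警部分提到的定时任务：我们基于 t_flink_app 表用一个简单的 python 脚本配合 crontab 轮询作业状态，并把异常作业推送到飞书告警群。下面给出一个最小化的示意脚本，其中数据库连接信息、字段名（job_name、state 及状态码取值）以及飞书机器人的 Webhook 地址都是假设的占位值，实际请以所部署的 Apache StreamPark 版本的表结构和公司内部的机器人配置为准：

```python
# -*- coding: utf-8 -*-
# 示意脚本：轮询 t_flink_app 表，发现非 RUNNING 状态的作业后推送到飞书群
# 注意：连接信息、字段名、状态码、Webhook 地址均为假设的占位值
import pymysql
import requests

DB_CONF = dict(host="127.0.0.1", port=3306, user="streampark",
               password="******", database="streampark", charset="utf8mb4")
FEISHU_WEBHOOK = "https://open.feishu.cn/open-apis/bot/v2/hook/xxxxxxxx"

# 这里假设 1 表示 RUNNING，实际取值请以表结构 / 状态枚举为准
RUNNING_STATE = 1


def fetch_abnormal_jobs():
    """查询状态不是 RUNNING 的作业，job_name / state 为假设的列名"""
    sql = "SELECT job_name, state FROM t_flink_app WHERE state != %s"
    conn = pymysql.connect(**DB_CONF)
    try:
        with conn.cursor() as cur:
            cur.execute(sql, (RUNNING_STATE,))
            return cur.fetchall()
    finally:
        conn.close()


def send_feishu_alert(jobs):
    """通过飞书自定义机器人 Webhook 发送文本告警"""
    lines = ["Flink 作业状态异常:"] + [f"{name} -> state={state}" for name, state in jobs]
    payload = {"msg_type": "text", "content": {"text": "\n".join(lines)}}
    requests.post(FEISHU_WEBHOOK, json=payload, timeout=10)


if __name__ == "__main__":
    abnormal = fetch_abnormal_jobs()
    if abnormal:
        send_feishu_alert(abnormal)
```

实际使用时可以通过 crontab 定时调度，例如 `*/1 * * * * python3 /path/to/check_flink_jobs.py`。
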
## 7 总结 -目前我司线上运行 60 个实时作业,Flink sql 与 Custom-code 差不多各一半。后续也会有更多的实时任务进行上线。很多同学都会担心 StreamPark 稳不稳定的问题,就我司根据几个月的生产实践而言,StreamPark 只是一个帮助你开发作业,部署,监控和管理的一个平台。到底稳不稳,还是要看自家的 Hadoop yarn 集群稳不稳定(我们用的onyan模式),其实已经跟 StreamPark关系不大了。还有就是你写的 Flink Sql 或者是代码健不健壮。更多的是这两方面应该是大家要考虑的,这两方面没问题再充分利用 StreamPark 的灵活性才能让作业更好的运行,单从一方面说 StreamPark 稳不稳定,实属偏激。 +目前我司线上运行 60 个实时作业,Flink sql 与 Custom-code 差不多各一半。后续也会有更多的实时任务进行上线。很多同学都会担心 Apache StreamPark 稳不稳定的问题,就我司根据几个月的生产实践而言,Apache StreamPark 只是一个帮助你开发作业,部署,监控和管理的一个平台。到底稳不稳,还是要看自家的 Hadoop yarn 集群稳不稳定(我们用的onyan模式),其实已经跟 Apache StreamPark关系不大了。还有就是你写的 Flink Sql 或者是代码健不健壮。更多的是这两方面应该是大家要考虑的,这两方面没问题再充分利用 Apache StreamPark 的灵活性才能让作业更好的运行,单从一方面说 Apache StreamPark 稳不稳定,实属偏激。 -以上就是 StreamPark 在乐我无限的全部分享内容,感谢大家看到这里。非常感谢 StreamPark 提供给我们这么优秀的产品,这就是做的利他人之事。从1.0 到 1.2.1 平时遇到那些bug都会被即时的修复,每一个issue都被认真对待。目前我们还是 onyarn的部署模式,重启yarn还是会导致作业的lost状态,重启yarn也不是天天都干的事,关于这个社区也会尽早的会去修复这个问题。相信 StreamPark 会越来越好,未来可期。 +以上就是 Apache StreamPark 在乐我无限的全部分享内容,感谢大家看到这里。非常感谢 Apache StreamPark 提供给我们这么优秀的产品,这就是做的利他人之事。从1.0 到 1.2.1 平时遇到那些bug都会被即时的修复,每一个issue都被认真对待。目前我们还是 onyarn的部署模式,重启yarn还是会导致作业的lost状态,重启yarn也不是天天都干的事,关于这个社区也会尽早的会去修复这个问题。相信 Apache StreamPark 会越来越好,未来可期。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-blog/7-streampark-usercase-haibo.md b/i18n/zh-CN/docusaurus-plugin-content-blog/7-streampark-usercase-haibo.md index d8713a1f6..e3c832010 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-blog/7-streampark-usercase-haibo.md +++ b/i18n/zh-CN/docusaurus-plugin-content-blog/7-streampark-usercase-haibo.md @@ -1,12 +1,12 @@ --- slug: streampark-usercase-haibo -title: StreamPark 一站式计算利器在海博科技的生产实践,助力智慧城市建设 -tags: [StreamPark, 生产实践, FlinkSQL] +title: Apache StreamPark 一站式计算利器在海博科技的生产实践,助力智慧城市建设 +tags: [Apache StreamPark, 生产实践, FlinkSQL] --- -**摘要:**本文「 StreamPark 一站式计算利器在海博科技的生产实践,助力智慧城市建设 」作者是海博科技大数据架构师王庆焕,主要内容为: +**摘要:**本文「 Apache StreamPark 一站式计算利器在海博科技的生产实践,助力智慧城市建设 」作者是海博科技大数据架构师王庆焕,主要内容为: -1. 选择 StreamPark +1. 选择 Apache StreamPark 2. 快速上手 3. 应用场景 4. 功能扩展 @@ -16,11 +16,11 @@ tags: [StreamPark, 生产实践, FlinkSQL] -## **01. 选择 StreamPark** +## **01. 选择 Apache StreamPark** 海博科技自 2020 年开始使用 Flink SQL 汇聚、处理各类实时物联数据。随着各地市智慧城市建设步伐的加快,需要汇聚的各类物联数据的数据种类、数据量也不断增加,导致线上维护的 Flink SQL 任务越来越多,一个专门的能够管理众多 Flink SQL 任务的计算平台成为了迫切的需求。 -在体验对比了 Apache Zeppelin 和 StreamPark 之后,我们选择了 StreamPark 作为公司的实时计算平台。相比 Apache Zeppelin, StreamPark 并不出名。‍‍‍‍‍‍‍‍‍‍‍‍但是在体验了 StreamPark 发行的初版,阅读其设计文档后,我们发现其基于 **一站式** 设计的思想,能够覆盖 Flink 任务开发的全生命周期,使得配置、开发、部署、运维全部在一个平台即可完成。我们的开发、运维、测试的同学可以使用 StreamPark 协同工作,**低代码** + **一站式** 的设计思想坚定了我们使用 StreamPark 的信心。 +在体验对比了 Apache Zeppelin 和 Apache StreamPark 之后,我们选择了 Apache StreamPark 作为公司的实时计算平台。相比 Apache Zeppelin, Apache StreamPark 并不出名。‍‍‍‍‍‍‍‍‍‍‍‍但是在体验了 Apache StreamPark 发行的初版,阅读其设计文档后,我们发现其基于 **一站式** 设计的思想,能够覆盖 Flink 任务开发的全生命周期,使得配置、开发、部署、运维全部在一个平台即可完成。我们的开发、运维、测试的同学可以使用 Apache StreamPark 协同工作,**低代码** + **一站式** 的设计思想坚定了我们使用 Apache StreamPark 的信心。 //视频链接( StreamX 官方视频) @@ -30,7 +30,7 @@ tags: [StreamPark, 生产实践, FlinkSQL] ### **1. 快速上手** -使用 StreamPark 完成一个实时汇聚任务就像把大象放进冰箱一样简单,仅需三步即可完成: +使用 Apache StreamPark 完成一个实时汇聚任务就像把大象放进冰箱一样简单,仅需三步即可完成: - 编辑 SQL @@ -48,11 +48,11 @@ tags: [StreamPark, 生产实践, FlinkSQL] ### **2. 
生产实践** -StreamPark 在海博主要用于运行实时 Flink SQL任务: 读取 Kafka 上的数据,进行处理输出至 Clickhouse 或者 Elasticsearch 中。 +Apache StreamPark 在海博主要用于运行实时 Flink SQL任务: 读取 Kafka 上的数据,进行处理输出至 Clickhouse 或者 Elasticsearch 中。 -从2021年10月开始,公司逐渐将 Flink SQL 任务迁移至 StreamPark 平台来集中管理,承载我司实时物联数据的汇聚、计算、预警。 +从2021年10月开始,公司逐渐将 Flink SQL 任务迁移至 Apache StreamPark 平台来集中管理,承载我司实时物联数据的汇聚、计算、预警。 -截至目前,StreamPark 已在多个政府、公安生产环境进行部署,汇聚处理城市实时物联数据、人车抓拍数据。以下是在某市专网部署的 StreamPark 平台截图 : +截至目前,Apache StreamPark 已在多个政府、公安生产环境进行部署,汇聚处理城市实时物联数据、人车抓拍数据。以下是在某市专网部署的 Apache StreamPark 平台截图 : ![](/blog/haibo/application.png) @@ -60,31 +60,31 @@ StreamPark 在海博主要用于运行实时 Flink SQL任务: 读取 Kafka 上 #### **1. 实时物联感知数据汇聚** -汇聚实时的物联感知数据,我们直接使用 StreamPark 开发 Flink SQL 任务,针对 Flink SQL 未提供的方法,StreamPark 也支持 Udf 相关功能,用户通过 StreamPark 上传 Udf 包,即可在 SQL 中调用相关 Udf,实现更多复杂的逻辑操作。 +汇聚实时的物联感知数据,我们直接使用 Apache StreamPark 开发 Flink SQL 任务,针对 Flink SQL 未提供的方法,Apache StreamPark 也支持 Udf 相关功能,用户通过 Apache StreamPark 上传 Udf 包,即可在 SQL 中调用相关 Udf,实现更多复杂的逻辑操作。 -“SQL+UDF” 的方式,能够满足我们绝大部分的数据汇聚场景,如果后期业务变动,也只需要在 StreamPark 中修改 SQL 语句,即可完成业务变更与上线。 +“SQL+UDF” 的方式,能够满足我们绝大部分的数据汇聚场景,如果后期业务变动,也只需要在 Apache StreamPark 中修改 SQL 语句,即可完成业务变更与上线。 ![](/blog/haibo/data_aggregation.png) #### **2. Flink CDC数据库同步** -为了实现各类数据库与数据仓库之前的同步,我们使用 StreamPark 开发 Flink CDC SQL 任务。借助于 Flink CDC 的能力,实现了 Oracle 与 Oracle 之间的数据同步, Mysql/Postgresql 与 Clickhouse 之间的数据同步。 +为了实现各类数据库与数据仓库之前的同步,我们使用 Apache StreamPark 开发 Flink CDC SQL 任务。借助于 Flink CDC 的能力,实现了 Oracle 与 Oracle 之间的数据同步, Mysql/Postgresql 与 Clickhouse 之间的数据同步。 ![](/blog/haibo/flink_cdc.png) **3. 数据分析模型管理** -针对无法使用 Flink SQL 需要开发 Flink 代码的任务,例如: 实时布控模型、离线数据分析模型,StreamPark 提供了 Custom code 的方式, 允许用户上传可执行的 Flink Jar 包并运行。 +针对无法使用 Flink SQL 需要开发 Flink 代码的任务,例如: 实时布控模型、离线数据分析模型,Apache StreamPark 提供了 Custom code 的方式, 允许用户上传可执行的 Flink Jar 包并运行。 -目前,我们已经将人员,车辆等 20 余类分析模型上传至 StreamPark,交由 StreamPark 管理运行。 +目前,我们已经将人员,车辆等 20 余类分析模型上传至 Apache StreamPark,交由 Apache StreamPark 管理运行。 ![](/blog/haibo/data_aggregation.png) -**综上:** 无论是 Flink SQL 任务还是 Custome code 任务,StreamPark 均提供了很好的支持,满足各种不同的业务场景。 但是 StreamPark 缺少任务调度的能力,如果你需要定期调度任务, StreamPark 目前无法满足。社区成员正在努力开发调度相关的模块,在即将发布的 1.2.3 中 会支持任务调度功能,敬请期待。 +**综上:** 无论是 Flink SQL 任务还是 Custome code 任务,Apache StreamPark 均提供了很好的支持,满足各种不同的业务场景。 但是 Apache StreamPark 缺少任务调度的能力,如果你需要定期调度任务, Apache StreamPark 目前无法满足。社区成员正在努力开发调度相关的模块,在即将发布的 1.2.3 中 会支持任务调度功能,敬请期待。 ## **04. 功能扩展** -Datahub 是 Linkedin 开发的一个元数据管理平台,提供了数据源管理、数据血缘、数据质量检查等功能。海博科技基于 StreamPark 和 Datahub 进行了二次开发,实现了数据表级/字段级的血缘功能。通过数据血缘功能,帮助用户检查 Flink SQL 的字段血缘关系。并将血缘关系保存至Linkedin/Datahub 元数据管理平台。 +Datahub 是 Linkedin 开发的一个元数据管理平台,提供了数据源管理、数据血缘、数据质量检查等功能。海博科技基于 Apache StreamPark 和 Datahub 进行了二次开发,实现了数据表级/字段级的血缘功能。通过数据血缘功能,帮助用户检查 Flink SQL 的字段血缘关系。并将血缘关系保存至Linkedin/Datahub 元数据管理平台。 //两个视频链接(基于 StreamX 开发的数据血缘功能) @@ -92,10 +92,10 @@ Datahub 是 Linkedin 开发的一个元数据管理平台,提供了数据源 ## **05. 
未来期待** -目前,StreamPark 社区的 Roadmap 显示 StreamPark 1.3.0 将迎来全新的 Workbench 体验、统一的资源管理中心 (JAR/UDF/Connectors 统一管理)、批量任务调度等功能。这也是我们非常期待的几个全新功能。 +目前,Apache StreamPark 社区的 Roadmap 显示 Apache StreamPark 1.3.0 将迎来全新的 Workbench 体验、统一的资源管理中心 (JAR/UDF/Connectors 统一管理)、批量任务调度等功能。这也是我们非常期待的几个全新功能。 -Workbench 将使用全新的工作台式的 SQL 开发风格,选择数据源即可生成 SQL,进一步提升 Flink 任务开发效率。统一的 UDF 资源中心将解决当前每个任务都要上传依赖包的问题。批量任务调度功能将解决当前 StreamPark 无法调度任务的遗憾。 +Workbench 将使用全新的工作台式的 SQL 开发风格,选择数据源即可生成 SQL,进一步提升 Flink 任务开发效率。统一的 UDF 资源中心将解决当前每个任务都要上传依赖包的问题。批量任务调度功能将解决当前 Apache StreamPark 无法调度任务的遗憾。 -下图是 StreamPark 开发者设计的原型图,敬请期待。 +下图是 Apache StreamPark 开发者设计的原型图,敬请期待。 ![](/blog/haibo/data_source.png) diff --git a/i18n/zh-CN/docusaurus-plugin-content-blog/8-streampark-usercase-ziru.md b/i18n/zh-CN/docusaurus-plugin-content-blog/8-streampark-usercase-ziru.md index 8532e5885..eef621d7a 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-blog/8-streampark-usercase-ziru.md +++ b/i18n/zh-CN/docusaurus-plugin-content-blog/8-streampark-usercase-ziru.md @@ -1,16 +1,16 @@ --- slug: streampark-usercase-ziru title: 自如基于Apache StreamPark 的实时计算平台实践 -tags: [StreamPark, 生产实践] +tags: [Apache StreamPark, 生产实践] --- ![](/blog/ziru/cover.png) -**导读:**自如作为一家专注于提供租房产品和服务的 O2O 互联网公司,构建了一个涵盖城市居住生活领域全链条的在线化、数据化、智能化平台,实时计算在自如一直扮演着重要的角色。到目前为止,自如每日需要处理 TB 级别的数据,本文由来自自如的实时计算小伙伴带来,介绍了自如基于 StreamPark 的实时计算平台深度实践。 +**导读:**自如作为一家专注于提供租房产品和服务的 O2O 互联网公司,构建了一个涵盖城市居住生活领域全链条的在线化、数据化、智能化平台,实时计算在自如一直扮演着重要的角色。到目前为止,自如每日需要处理 TB 级别的数据,本文由来自自如的实时计算小伙伴带来,介绍了自如基于 Apache StreamPark 的实时计算平台深度实践。 - 实时计算遇到的挑战 - 需求解决方案之路 -- 基于 StreamPark 的深度实践 +- 基于 Apache StreamPark 的深度实践 - 实践经验总结和示例 - 带来的收益 - 未来规划 @@ -62,7 +62,7 @@ tags: [StreamPark, 生产实践] 在平台构建的初期阶段,2022 年初开始我们就全面调查了行业内的几乎所有相关项目,涵盖了商业付费版和开源版。经过调查和对比发现,这些项目都或多或少地存在一定的局限性,可用性和稳定性也无法有效地保障。 -综合下来 StreamPark 在我们的评估中表现最优,是唯一一个没有硬伤且扩展性很强的项目:同时支持 SQL 和 JAR 作业,在 Flink 作业的部署模式上也是最为完善和稳定的,特有的架构设计使得不仅不会锁定 Flink 版本,还支持便捷的版本切换和并行处理,有效解决了作业依赖隔离和冲突的问题。我们重点关注的作业管理 & 运维能力也非常完善,包括监控、告警、SQL 校验、SQL 版本对比、CI 等功能,StreamPark 对 Flink on K8s 的支持也是我们调研的所有开源项目中最为完善的。但 StreamPark 的 K8s 模式提交需在本地构建镜像导致存储资源消耗。 +综合下来 Apache StreamPark 在我们的评估中表现最优,是唯一一个没有硬伤且扩展性很强的项目:同时支持 SQL 和 JAR 作业,在 Flink 作业的部署模式上也是最为完善和稳定的,特有的架构设计使得不仅不会锁定 Flink 版本,还支持便捷的版本切换和并行处理,有效解决了作业依赖隔离和冲突的问题。我们重点关注的作业管理 & 运维能力也非常完善,包括监控、告警、SQL 校验、SQL 版本对比、CI 等功能,Apache StreamPark 对 Flink on K8s 的支持也是我们调研的所有开源项目中最为完善的。但 Apache StreamPark 的 K8s 模式提交需在本地构建镜像导致存储资源消耗。 目前在最新的 2.2 版本中社区已经重构了这部分实现 @@ -72,19 +72,19 @@ tags: [StreamPark, 生产实践] 2.在开源组件的选择上,我们经过各项指标综合对比评估,最终选择了当时的 StreamX。后续和社区保持密切的沟通,在此过程中深刻感受到创始人认真负责的态度和社区的团结友善的氛围,也见证了项目 2022 年 09 月加入 Apache 孵化器的过程,这让我们对该项目的未来充满希望。 -3.在 StreamPark 基础上,我们要推动与公司已有生态的整合,以便更好地满足我们的业务需求。 +3.在 Apache StreamPark 基础上,我们要推动与公司已有生态的整合,以便更好地满足我们的业务需求。 -## **基于 StreamPark 的深度实践** +## **基于 Apache StreamPark 的深度实践** 基于上述决策,我们启动了以 “痛点需求” 为导向的实时计算平台演进工作,基于StremaPark 打造一个稳定、高效、易维护的实时计算平台。从 2022 年初开始我们便参与社区的建设,同时我们内部平台建设也正式提上日程。 -首先我们在 StreamPark 的基础上进一步完善相关的功能: +首先我们在 Apache StreamPark 的基础上进一步完善相关的功能: ![](/blog/ziru/platform_construction.png) ### **01 LDAP 登录支持** -在 StreamPark 的基础上,我们进一步完善了相关的功能,其中包括对 LDAP 的支持,以便我们未来可以完全开放实时能力,让公司的四个业务线所属的分析师能够使用该平台,预计届时人数将达到 170 人左右。随着人数的增加,账号的管理变得越发重要,特别是在人员变动时,账号的注销和申请将成为一项频繁且耗时的操作。所以,接入 LDAP 变得尤为重要。因此我们及时和社区沟通,并且发起讨论,最终我们贡献了该 Feature。现在在 StreamPark 开启 LDAP 已经变得非常简单,只需要简单两步即可: +在 Apache StreamPark 的基础上,我们进一步完善了相关的功能,其中包括对 LDAP 的支持,以便我们未来可以完全开放实时能力,让公司的四个业务线所属的分析师能够使用该平台,预计届时人数将达到 170 人左右。随着人数的增加,账号的管理变得越发重要,特别是在人员变动时,账号的注销和申请将成为一项频繁且耗时的操作。所以,接入 LDAP 变得尤为重要。因此我们及时和社区沟通,并且发起讨论,最终我们贡献了该 Feature。现在在 Apache StreamPark 开启 LDAP 已经变得非常简单,只需要简单两步即可: #### 
step1: 填写对应的 LDAP 配置: @@ -115,7 +115,7 @@ ldap: ### **02 提交作业自动生成 Ingress** -由于公司的网络安全政策,运维人员在 Kubernetes 的宿主机上仅开放了 80 端口,这导致我们无法直接通过 “域名+随机端口” 的方式访问在 Kubernetes 上的作业 WebUI。为了解决这个问题,我们需要使用Ingress在访问路径上增加一层代理,从而启到访问路由的效果。在 StreamPark 2.0 版本我们贡献了 Ingress 相关的功能[3]。采用了策略模式的实现方式,在初始构建阶段,获取 Kubernetes 的元数据信息来识别其版本,针对不同版本来进行相应的对象构建,确保了在各种 Kubernetes 环境中都能够顺利地使用 Ingress 功能。 +由于公司的网络安全政策,运维人员在 Kubernetes 的宿主机上仅开放了 80 端口,这导致我们无法直接通过 “域名+随机端口” 的方式访问在 Kubernetes 上的作业 WebUI。为了解决这个问题,我们需要使用Ingress在访问路径上增加一层代理,从而启到访问路由的效果。在 Apache StreamPark 2.0 版本我们贡献了 Ingress 相关的功能[3]。采用了策略模式的实现方式,在初始构建阶段,获取 Kubernetes 的元数据信息来识别其版本,针对不同版本来进行相应的对象构建,确保了在各种 Kubernetes 环境中都能够顺利地使用 Ingress 功能。 具体的配置步骤如下: @@ -137,7 +137,7 @@ ldap: ### **03 支持查看作业部署日志** -在持续部署作业的过程中,我们逐渐意识到,没有日志就无法进行有效的运维操作,日志的留存归档和查看成为了我们在后期排查问题时非常重要的一环。因此在 StreamPark 2.0 版本我们贡献了 On Kubernetes 模式下启动日志存档、页面查看的能力[4],现在点击作业列表里的日志查看按钮,可以很方便的查看作业的实时日志。 +在持续部署作业的过程中,我们逐渐意识到,没有日志就无法进行有效的运维操作,日志的留存归档和查看成为了我们在后期排查问题时非常重要的一环。因此在 Apache StreamPark 2.0 版本我们贡献了 On Kubernetes 模式下启动日志存档、页面查看的能力[4],现在点击作业列表里的日志查看按钮,可以很方便的查看作业的实时日志。 ![](/blog/ziru/k8s_log.png) @@ -147,7 +147,7 @@ ldap: 为了解决这个问题,我们在社区提出一个需求:希望每个作业都能够通过超链接直接跳转到对应的监控图表和日志归档页面,这样使用者就可以直接查看与自己作业相关的监控信息和日志。无需在复杂的系统界面中进行繁琐的搜索,从而提高故障排查的效率。 -我们在社区展开了讨论、并很快得到响应、大家都认为这是一个普遍存在的需求、因此很快有开发小伙伴提交了设计和相关PR,该问题也很快被解决,现在在 StreamPark 中要开启该功能已经变得非常简单: +我们在社区展开了讨论、并很快得到响应、大家都认为这是一个普遍存在的需求、因此很快有开发小伙伴提交了设计和相关PR,该问题也很快被解决,现在在 Apache StreamPark 中要开启该功能已经变得非常简单: #### step1: 创建徽章标签 @@ -222,11 +222,11 @@ SELECT Encryption_function(name), age, price, Sensitive_field_functions(phone) F SELECT name, Encryption_function(age), price, Sensitive_field_functions(phone) FROM user; ``` -### **06 基于 StreamPark 的数据同步平台** +### **06 基于 Apache StreamPark 的数据同步平台** -随着 StreamPark 的技术解决方案在公司的成功落地,我们实现了对 Flink 作业的深度支持,从而为数据处理带来质的飞跃。这促使我们对过往的数据同步逻辑进行彻底的革新,目标是通过技术的优化和整合,最大限度地降低运维成本。因此,我们逐步替换了历史上的 Sqoop 作业、Canal 作业和 Hive JDBC Handler 作业,转而采用 Flink CDC 作业、Flink 流和批作业。在这个过程中,我们也不断优化和强化 StreamPark 的接口能力,新增了状态回调机制,同时实现了与 DolphinScheduler[7] 调度系统的完美集成,进一步提升了我们的数据处理能力。 +随着 Apache StreamPark 的技术解决方案在公司的成功落地,我们实现了对 Flink 作业的深度支持,从而为数据处理带来质的飞跃。这促使我们对过往的数据同步逻辑进行彻底的革新,目标是通过技术的优化和整合,最大限度地降低运维成本。因此,我们逐步替换了历史上的 Sqoop 作业、Canal 作业和 Hive JDBC Handler 作业,转而采用 Flink CDC 作业、Flink 流和批作业。在这个过程中,我们也不断优化和强化 Apache StreamPark 的接口能力,新增了状态回调机制,同时实现了与 DolphinScheduler[7] 调度系统的完美集成,进一步提升了我们的数据处理能力。 -外部系统集成 StreamPark 步骤如下,只需要简单几个步骤即可: +外部系统集成 Apache StreamPark 步骤如下,只需要简单几个步骤即可: 1.首先创建 API 访问的 Token: @@ -252,11 +252,11 @@ curl -X POST '/flink/app/start' \ ## **实践经验总结** -在深度使用 StreamPark 实践过程中,我们总结了一些常见问题和实践过程中所探索出解决方案,我们把这些汇总成示例,仅供大家参考。 +在深度使用 Apache StreamPark 实践过程中,我们总结了一些常见问题和实践过程中所探索出解决方案,我们把这些汇总成示例,仅供大家参考。 ### **01 构建 Base 镜像** -要使用 StreamPark 在 Kubernetes 上部署一个 Flink 作业,首先要准备一个基于 Flink 构建的 Base 镜像。然后,在 Kubernetes 平台上,会使用用户所提供的镜像来启动 Flink 作业。如果是沿用官方所提供的 “裸镜像”,在实际开发中是远远不够的,用户开发的业务逻辑往往会涉及到上下游多个数据源,这就需要相关数据源的 Connector,以及 Hadoop 等关联依赖。因此需要将这部分依赖项打入镜像中,下面我将介绍具体操作步骤。 +要使用 Apache StreamPark 在 Kubernetes 上部署一个 Flink 作业,首先要准备一个基于 Flink 构建的 Base 镜像。然后,在 Kubernetes 平台上,会使用用户所提供的镜像来启动 Flink 作业。如果是沿用官方所提供的 “裸镜像”,在实际开发中是远远不够的,用户开发的业务逻辑往往会涉及到上下游多个数据源,这就需要相关数据源的 Connector,以及 Hadoop 等关联依赖。因此需要将这部分依赖项打入镜像中,下面我将介绍具体操作步骤。 #### step1: 首先创建一个文件夹,内部包含两个文件夹和一个 Dockerfile 文件 @@ -325,7 +325,7 @@ RUN unzip -d arthas-latest-bin arthas-packaging-latest-bin.zip ### **03 镜像中依赖冲突的解决方式** -在使用 StreamPark 的过程中,我们常遇到基于 Base 镜像运行的 Flink 作业中出现 NoClassDefFoundError、ClassNotFoundException 和 NoSuchMethodError 这三种依赖冲突异常。排查思路就是,找到报错中所示的冲突类,所在的包路径。例如这个报错的类在 org.apache.orc:orc-core, 就到相应模块的目录下跑 mvn dependency::tree 然后搜 
orc-core,看一下是谁带进来的依赖,用 exclusion 去掉就可以了。下面我通过一个 base 镜像中的 flink-shaded-hadoop-3-uber JAR 包引起的依赖冲突示例,来详细介绍通过自定义打包的方法来解决依赖冲突。 +在使用 Apache StreamPark 的过程中,我们常遇到基于 Base 镜像运行的 Flink 作业中出现 NoClassDefFoundError、ClassNotFoundException 和 NoSuchMethodError 这三种依赖冲突异常。排查思路就是,找到报错中所示的冲突类,所在的包路径。例如这个报错的类在 org.apache.orc:orc-core, 就到相应模块的目录下跑 mvn dependency::tree 然后搜 orc-core,看一下是谁带进来的依赖,用 exclusion 去掉就可以了。下面我通过一个 base 镜像中的 flink-shaded-hadoop-3-uber JAR 包引起的依赖冲突示例,来详细介绍通过自定义打包的方法来解决依赖冲突。 #### step1: Clone flink-shaded 项目到本地👇 @@ -343,7 +343,7 @@ git clone https://github.com/apache/flink-shaded.git ### **04 集中作业配置示例** -使用 StreamPark 有个非常大的便利就是可以进行配置的集中管理,可以将所有的配置项,配置到平台所绑定的 Flink 目录下的 conf 文件中。 +使用 Apache StreamPark 有个非常大的便利就是可以进行配置的集中管理,可以将所有的配置项,配置到平台所绑定的 Flink 目录下的 conf 文件中。 ```shell cd /flink-1.14.5/conf @@ -360,9 +360,9 @@ vim flink-conf.yaml ![](/blog/ziru/sync_conf.png) -### **05 StreamPark 配置 DNS 解析** +### **05 Apache StreamPark 配置 DNS 解析** -在使用 StreamPark 平台提交 FlinkSQL 的过程中,一个正确合理的 DNS 解析配置非常重要。主要涉及到以下几点: +在使用 Apache StreamPark 平台提交 FlinkSQL 的过程中,一个正确合理的 DNS 解析配置非常重要。主要涉及到以下几点: 1.Flink 作业的 Checkpoint 写入 HDFS 需要通过 ResourceManager 获取的一个 HDFS 节点进行快照写入,如果企业中同时有发生Hadoop集群的扩容,并且这些这些新扩容出来的节点,没有被DNS解析服务所覆盖,就直接会导致Checkpoint失败,从而影响线上稳定。 @@ -444,7 +444,7 @@ metadata: export HADOOP_CONF_DIR=/home/streamx/conf ``` -有效地切断了 Flink on K8s 加载 HDFS 配置的默认逻辑。这样的操作确保了 A StreamPark 仅连接至 A Hadoop 环境,而 B StreamPark 则对应连接至 B Hadoop 环境,从而达到了将测试和生产环境进行完整隔离的目的。 +有效地切断了 Flink on K8s 加载 HDFS 配置的默认逻辑。这样的操作确保了 A Apache StreamPark 仅连接至 A Hadoop 环境,而 B Apache StreamPark 则对应连接至 B Hadoop 环境,从而达到了将测试和生产环境进行完整隔离的目的。 具体来说,在这一操作指令生效后,我们就可以确保在 10002 端口提交的 Flink 作业所连接的 Hadoop 环境为 B Hadoop 环境。这样一来,B Hadoop 环境与过去在 10000 端口提交的 Flink 作业所使用的Hadoop环境就成功实现了隔离,有效防止了不同环境之间的相互干扰,确保了系统的稳定性和可靠性。 @@ -533,17 +533,17 @@ netstat -tlnp | grep 10002 ## **带来的收益** -我们的团队从 StreamX(即 StreamPark 的前身)开始使用,经过一年多的实践和磨合,StreamPark 显著改善了我们在 Apache Flink 作业的开发管理和运维上的诸多挑战。StreamPark 作为一站式服务平台,极大地简化了整个开发流程。现在,我们可以直接在 StreamPark 平台上完成作业的开发、编译和发布,这不仅降低了 Flink 的管理和部署门槛,还显著提高了开发效率。 +我们的团队从 StreamX(即 Apache StreamPark 的前身)开始使用,经过一年多的实践和磨合,Apache StreamPark 显著改善了我们在 Apache Flink 作业的开发管理和运维上的诸多挑战。Apache StreamPark 作为一站式服务平台,极大地简化了整个开发流程。现在,我们可以直接在 Apache StreamPark 平台上完成作业的开发、编译和发布,这不仅降低了 Flink 的管理和部署门槛,还显著提高了开发效率。 -自从部署 StreamPark 以来,我们已经在生产环境中大规模使用该平台。从最初管理的 50 多个 FlinkSQL 作业,增长到目前近 500 个作业,如图在 StreamPark 上划分为 7 个 team,每个 team 中有几十个作业。这一转变不仅展示了 StreamPark 的可扩展性和高效性,也充分证明了它在实际业务中的强大应用价值。 +自从部署 Apache StreamPark 以来,我们已经在生产环境中大规模使用该平台。从最初管理的 50 多个 FlinkSQL 作业,增长到目前近 500 个作业,如图在 Apache StreamPark 上划分为 7 个 team,每个 team 中有几十个作业。这一转变不仅展示了 Apache StreamPark 的可扩展性和高效性,也充分证明了它在实际业务中的强大应用价值。 ![](/blog/ziru/production_environment.png) ## **未 来 期 待** -自如作为 StreamPark 早期的用户之一,我们一直和社区同学保持密切交流,参与 StreamPark 的稳定性打磨,我们将生产运维中遇到的 Bug 和新的 Feature 提交给了社区。在未来,我们希望可以在 StreamPark 上管理 Apache Paimon 湖表的元数据信息和 Paimon 的 Action 辅助作业的能力,基于 Flink 引擎通过对接湖表的 Catalog 和 Action 作业,来实现湖表作业的管理、优化于一体的能力。目前 StreamPark 正在对接 Paimon 数据集成的能力,这一块在未来对于实时一键入湖会提供很大的帮助。 +自如作为 Apache StreamPark 早期的用户之一,我们一直和社区同学保持密切交流,参与 Apache StreamPark 的稳定性打磨,我们将生产运维中遇到的 Bug 和新的 Feature 提交给了社区。在未来,我们希望可以在 Apache StreamPark 上管理 Apache Paimon 湖表的元数据信息和 Paimon 的 Action 辅助作业的能力,基于 Flink 引擎通过对接湖表的 Catalog 和 Action 作业,来实现湖表作业的管理、优化于一体的能力。目前 Apache StreamPark 正在对接 Paimon 数据集成的能力,这一块在未来对于实时一键入湖会提供很大的帮助。 -在此也非常感谢 StreamPark 团队一直以来对我们的技术支持,祝 Apache StreamPark 越来越好,越来越多用户去使用,早日毕业成为顶级 Apache 项目。 +在此也非常感谢 Apache StreamPark 团队一直以来对我们的技术支持,祝 Apache StreamPark 越来越好,越来越多用户去使用,早日毕业成为顶级 Apache 项目。 diff --git 
a/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/contribution_guide/become_committer.md b/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/contribution_guide/become_committer.md index e2e6ca5b2..2cf492c5a 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/contribution_guide/become_committer.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/contribution_guide/become_committer.md @@ -31,7 +31,7 @@ sidebar_position: 2 - **Documentation** - 没有它,内容只会停留在作者的头脑中。 - **Code** - 没有代码,讨论就毫无意义。 -Apache StreamPark 社区努力追求基于功绩的原则。因此,一旦有人在 CoPDoC 的任何领域有了足够的贡献,他们就可以成为 Committer 的候选人,最终被投票选为 StreamPark 的 Committer。成为 Apache StreamPark 的 Committer 并不一定意味着你必须使用你的提交权限向代码库提交代码;它意味着你致力于 StreamPark 项目并为我们社区的成功做出了积极的贡献。 +Apache StreamPark 社区努力追求基于功绩的原则。因此,一旦有人在 CoPDoC 的任何领域有了足够的贡献,他们就可以成为 Committer 的候选人,最终被投票选为 Apache StreamPark 的 Committer。成为 Apache StreamPark 的 Committer 并不一定意味着你必须使用你的提交权限向代码库提交代码;它意味着你致力于 Apache StreamPark 项目并为我们社区的成功做出了积极的贡献。 ## Committer 的要求: @@ -39,7 +39,7 @@ Apache StreamPark 社区努力追求基于功绩的原则。因此,一旦有 ### 持续的贡献 -Committer 的候选人应该持续参与并为 StreamPark 做出大量的贡献(例如修复漏洞、添加新功能、编写文档、维护问题板、代码审查或回答社区问题),无论是向主网站的代码库还是 StreamPark 的 GitHub 仓库贡献。 +Committer 的候选人应该持续参与并为 Apache StreamPark 做出大量的贡献(例如修复漏洞、添加新功能、编写文档、维护问题板、代码审查或回答社区问题),无论是向主网站的代码库还是 Apache StreamPark 的 GitHub 仓库贡献。 - +3 个月的轻度活动和参与。 - +2 个月的中度活动和参与。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/contribution_guide/become_pmc_member.md b/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/contribution_guide/become_pmc_member.md index 449a03ef0..f4f85e4b8 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/contribution_guide/become_pmc_member.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/contribution_guide/become_pmc_member.md @@ -30,7 +30,7 @@ sidebar_position: 3 - **Documentation** - 没有它,内容只会停留在作者的头脑中。 - **Code** - 没有代码,讨论就毫无意义。 -Apache StreamPark 社区努力追求基于功绩的原则。因此,一旦有人在 CoPDoC 的任何领域有了足够的贡献,他们就可以成为 PMC 成员资格的候选人,最终被投票选为 StreamPark 的 PMC 成员。成为 Apache StreamPark 的 PMC 成员并不一定意味着您必须使用您的提交权限向代码库提交代码;它意味着您致力于 StreamPark 项目并为我们社区的成功做出了积极的贡献。 +Apache StreamPark 社区努力追求基于功绩的原则。因此,一旦有人在 CoPDoC 的任何领域有了足够的贡献,他们就可以成为 PMC 成员资格的候选人,最终被投票选为 Apache StreamPark 的 PMC 成员。成为 Apache StreamPark 的 PMC 成员并不一定意味着您必须使用您的提交权限向代码库提交代码;它意味着您致力于 Apache StreamPark 项目并为我们社区的成功做出了积极的贡献。 ## PMC 成员的要求: @@ -38,7 +38,7 @@ Apache StreamPark 社区努力追求基于功绩的原则。因此,一旦有 ### 持续的贡献 -PMC 成员的候选人应该持续参与并为 StreamPark 做出大量的贡献(例如修复漏洞、添加新功能、编写文档、维护问题板、代码审查或回答社区问题),无论是向主网站的代码库还是 StreamPark 的 GitHub 仓库贡献。 +PMC 成员的候选人应该持续参与并为 Apache StreamPark 做出大量的贡献(例如修复漏洞、添加新功能、编写文档、维护问题板、代码审查或回答社区问题),无论是向主网站的代码库还是 Apache StreamPark 的 GitHub 仓库贡献。 - +5 个月的轻度活动和参与。 - +4 个月的中度活动和参与。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/contribution_guide/mailing_lists.md b/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/contribution_guide/mailing_lists.md index 18b6f6503..fb6918fc3 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/contribution_guide/mailing_lists.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/contribution_guide/mailing_lists.md @@ -21,7 +21,7 @@ sidebar_position: 1 limitations under the License. 
--> -StreamPark 开发者邮件列表是您在使用 StreamPark 时提出所有问题的首选方式,它能将您的问题推向整个社区。 +Apache StreamPark 开发者邮件列表是您在使用 Apache StreamPark 时提出所有问题的首选方式,它能将您的问题推向整个社区。 这是与社区保持同步的最佳方式。 在您向邮件列表发送任何内容之前,请确保您已经**订阅**了它们。 @@ -31,12 +31,12 @@ StreamPark 开发者邮件列表是您在使用 StreamPark 时提出所有问题 ### 开发者列表 -- 使用此列表提出您对 StreamPark 的问题 -- 由 StreamPark 贡献者用来讨论 StreamPark 的开发 +- 使用此列表提出您对 Apache StreamPark 的问题 +- 由 Apache StreamPark 贡献者用来讨论 Apache StreamPark 的开发 ### 提交列表 -- 关于 StreamPark 代码库的更改的通知 +- 关于 Apache StreamPark 代码库的更改的通知 | 列表名称 | 地址 | 订阅 | 退订 | 归档 | |---------------------|------------------------------|-------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------------| diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/contribution_guide/new_committer_process.md b/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/contribution_guide/new_committer_process.md index 0240a5bec..033d8768b 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/contribution_guide/new_committer_process.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/contribution_guide/new_committer_process.md @@ -71,7 +71,7 @@ Subject: [VOTE] New committer: ${NEW_COMMITTER_NAME} ``` ```text -Hi StreamPark PPMC, +Hi Apache StreamPark PPMC, This is a formal vote about inviting ${NEW_COMMITTER_NAME} as our new committer. @@ -92,7 +92,7 @@ Subject: [RESULT] [VOTE] New committer: ${NEW_COMMITTER_NAME} ``` ```text -Hi StreamPark PPMC, +Hi Apache StreamPark PPMC, The vote has now closed. The results are: @@ -110,13 +110,13 @@ The vote is ***successful/not successful*** ```text To: ${NEW_COMMITTER_EMAIL} Cc: private@streampark.apache.org -Subject: Invitation to become StreamPark committer: ${NEW_COMMITTER_NAME} +Subject: Invitation to become Apache StreamPark committer: ${NEW_COMMITTER_NAME} ``` ```text Hello ${NEW_COMMITTER_NAME}, -The StreamPark Project Management Committee (PMC) +The Apache StreamPark Project Management Committee (PMC) hereby offers you committer privileges to the project. These privileges are offered on the understanding that you'll use them reasonably and with common sense. @@ -165,7 +165,7 @@ establishing you as a committer. ```text To: ${NEW_COMMITTER_EMAIL} Cc: private@streampark.apache.org -Subject: Re: invitation to become StreamPark committer +Subject: Re: invitation to become Apache StreamPark committer ``` ```text diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/contribution_guide/new_pmc_member_process.md b/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/contribution_guide/new_pmc_member_process.md index 786cce013..9b2d3d18e 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/contribution_guide/new_pmc_member_process.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/contribution_guide/new_pmc_member_process.md @@ -69,7 +69,7 @@ Subject: [VOTE] New PMC member candidate: ${NEW_PMC_NAME} ``` ```text -Hi StreamPark PPMC, +Hi Apache StreamPark PPMC, This is a formal vote about inviting ${NEW_PMC_NAME} as our new PMC member. @@ -91,7 +91,7 @@ Subject: [RESULT] [VOTE] New PMC member: ${NEW_PMC_NAME} ``` ```text -Hi StreamPark PPMC, +Hi Apache StreamPark PPMC, The vote has now closed. 
The results are: @@ -109,11 +109,11 @@ The vote is ***successful/not successful*** ```text To: board@apache.org Cc: private@.apache.org -Subject: [NOTICE] ${NEW_PMC_NAME} for StreamPark PMC member +Subject: [NOTICE] ${NEW_PMC_NAME} for Apache StreamPark PMC member ``` ```text -StreamPark proposes to invite ${NEW_PMC_NAME} to join the PMC. +Apache StreamPark proposes to invite ${NEW_PMC_NAME} to join the PMC. The vote result is available here: https://lists.apache.org/... ``` @@ -125,13 +125,13 @@ The vote result is available here: https://lists.apache.org/... ```text To: ${NEW_PMC_EMAIL} Cc: private@streampark.apache.org -Subject: Invitation to become StreamPark PMC member: ${NEW_PMC_NAME} +Subject: Invitation to become Apache StreamPark PMC member: ${NEW_PMC_NAME} ``` ```text Hello ${NEW_PMC_NAME}, -The StreamPark Project Management Committee (PMC) +The Apache StreamPark Project Management Committee (PMC) hereby offers you committer privileges to the project as well as membership in the PMC. These privileges are offered on the understanding that @@ -179,7 +179,7 @@ establishing you as a PMC member. ```text To: ${NEW_PMC_EMAIL} Cc: private@streamparkv.apache.org -Subject: Re: invitation to become StreamPark PMC member +Subject: Re: invitation to become Apache StreamPark PMC member ``` ```text @@ -266,7 +266,7 @@ To: dev@streampark.apache.org ``` ```text -Hi StreamPark Community, +Hi Apache StreamPark Community, The Podling Project Management Committee (PPMC) for Apache StreamPark has invited ${NEW_PMC_NAME} to become our PMC member and diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/release/How-to-release.md index a3f1d28e4..1f95afdf2 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/release/How-to-release.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/release/How-to-release.md @@ -59,9 +59,9 @@ GnuPG needs to construct a user ID to identify your key. Real name: muchunjin # Please enter 'gpg real name' Email address: muchunjin@apache.org # Please enter your apache email address here -Comment: for apache StreamPark release create at 20230501 # Please enter some comments here +Comment: for Apache StreamPark release create at 20230501 # Please enter some comments here You selected this USER-ID: - "muchunjin (for apache StreamPark release create at 20230501) " + "muchunjin (for Apache StreamPark release create at 20230501) " Change (N)ame, (C)omment, (E)mail or (O)kay/(Q)uit? O # Please enter O here We need to generate a lot of random bytes. It is a good idea to perform @@ -94,7 +94,7 @@ public and secret key created and signed. 
pub rsa4096 2023-05-01 [SC] 85778A4CE4DD04B7E07813ABACFB69E705016886 -uid muchunjin (for apache StreamPark release create at 20230501) +uid muchunjin (for Apache StreamPark release create at 20230501) sub rsa4096 2023-05-01 [E] ``` @@ -108,7 +108,7 @@ $ gpg --keyid-format SHORT --list-keys ------------------------ pub rsa4096/05016886 2023-05-01 [SC] 85778A4CE4DD04B7E07813ABACFB69E705016886 -uid [ultimate] muchunjin (for apache StreamPark release create at 20230501) +uid [ultimate] muchunjin (for Apache StreamPark release create at 20230501) sub rsa4096/0C5A4E1C 2023-05-01 [E] # Send public key to keyserver via key id $ gpg --keyserver keyserver.ubuntu.com --send-key 584EE68E ``` @@ -122,7 +122,7 @@ $ gpg --keyserver keyserver.ubuntu.com --recv-keys 05016886 # If the following content appears, it means success -gpg: key ACFB69E705016886: "muchunjin (for apache StreamPark release create at 20230501) " not changed +gpg: key ACFB69E705016886: "muchunjin (for Apache StreamPark release create at 20230501) " not changed gpg: Total number processed: 1 gpg: unchanged: 1 ``` @@ -380,15 +380,15 @@ $ for i in *.tar.gz; do echo $i; gpg --verify $i.asc $i ; done apache-streampark-2.1.0-incubating-src.tar.gz gpg: Signature made Tue May 2 12:16:35 2023 CST gpg: using RSA key 85778A4CE4DD04B7E07813ABACFB69E705016886 -gpg: Good signature from "muchunjin (for apache StreamPark release create at 20230501) " [ultimate] +gpg: Good signature from "muchunjin (for Apache StreamPark release create at 20230501) " [ultimate] apache-streampark_2.11-2.1.0-incubating-bin.tar.gz gpg: Signature made Tue May 2 12:16:36 2023 CST gpg: using RSA key 85778A4CE4DD04B7E07813ABACFB69E705016886 -gpg: Good signature from "muchunjin (for apache StreamPark release create at 20230501) " [ultimate] +gpg: Good signature from "muchunjin (for Apache StreamPark release create at 20230501) " [ultimate] apache-streampark_2.12-2.1.0-incubating-bin.tar.gz gpg: Signature made Tue May 2 12:16:37 2023 CST gpg: using RSA key 85778A4CE4DD04B7E07813ABACFB69E705016886 -gpg: BAD signature from "muchunjin (for apache StreamPark release create at 20230501) " [ultimate] +gpg: BAD signature from "muchunjin (for Apache StreamPark release create at 20230501) " [ultimate] # 验证 SHA512 $ for i in *.tar.gz; do echo $i; sha512sum --check $i.sha512; done @@ -431,7 +431,7 @@ svn add 2.0.0-RC1 svn status # 3. 提交到svn远端服务器 -svn commit -m "release for StreamPark 2.1.0" +svn commit -m "release for Apache StreamPark 2.1.0" ``` #### 3.7 检查Apache SVN提交结果 @@ -451,7 +451,7 @@ svn commit -m "release for StreamPark 2.1.0" > `Body`: ``` -Hello StreamPark Community: +Hello Apache StreamPark Community: This is a call for vote to release Apache StreamPark(Incubating) version release-2.1.0-RC1. @@ -482,11 +482,11 @@ Please vote accordingly: *Valid check is a requirement for a vote. *Checklist for reference: -[ ] Download StreamPark are valid. +[ ] Download links of Apache StreamPark are valid. [ ] Checksums and PGP signatures are valid. [ ] Source code distributions have correct names matching the current release. -[ ] LICENSE and NOTICE files are correct for each StreamPark repo. +[ ] LICENSE and NOTICE files are correct for each Apache StreamPark repo. [ ] All files have license headers if necessary. [ ] No compiled archives bundled in source archive. [ ] Can compile from source. @@ -512,7 +512,7 @@ Thanks! 
> `Body`: ``` -Dear StreamPark community, +Dear Apache StreamPark community, Thanks for your review and vote for "Release Apache StreamPark (Incubating) 2.1.0-rc1" I'm happy to announce the vote has passed: @@ -569,7 +569,7 @@ The Apache StreamPark community has voted on and approved a proposal to release We now kindly request the Incubator PMC members review and vote on this incubator release. Apache StreamPark, Make stream processing easier! easy-to-use streaming application development framework and operation platform. -StreamPark community vote thread: +Apache StreamPark community vote thread: https://lists.apache.org/thread/t01b2lbtqzyt7j4dsbdp5qjc3gngjsdq Vote result thread: @@ -660,7 +660,7 @@ Vote thread: https://lists.apache.org/thread/k3cvcbzxqs6qy62d1o6r9pqpykcgvvhm -Thanks everyone for your feedback and help with StreamPark apache release. The StreamPark team will take the steps to complete this release and will announce it soon. +Thanks everyone for your feedback and help with Apache StreamPark apache release. The Apache StreamPark team will take the steps to complete this release and will announce it soon. Best, ChunJin Mu @@ -758,12 +758,12 @@ Hi all, We are glad to announce the release of Apache StreamPark(incubating) 2.1.0. Once again I would like to express my thanks to your help. -StreamPark(https://streampark.apache.org/) Make stream processing easier! easy-to-use streaming application development framework and operation platform +Apache StreamPark(https://streampark.apache.org/) Make stream processing easier! easy-to-use streaming application development framework and operation platform Download Links: https://streampark.apache.org/download/ Release Notes: https://streampark.apache.org/download/release-note/2.1.0 -StreamPark Resources: +Apache StreamPark Resources: - Issue: https://github.com/apache/incubator-streampark/issues - Mailing list: dev@streampark.apache.org diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/release/how-to-verify.md b/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/release/how-to-verify.md index edd5dc4f2..c0ae0db4c 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/release/how-to-verify.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/release/how-to-verify.md @@ -143,7 +143,7 @@ cd apache-streampark-${release_version}-incubating-src ***选择编译模式, 这里只能选择1*** ->[StreamPark] StreamPark supports front-end and server-side mixed / detached packaging mode, Which mode do you need ? +>[Apache StreamPark] Apache StreamPark supports front-end and server-side mixed / detached packaging mode, Which mode do you need ? > >1. mixed mode > @@ -153,7 +153,7 @@ cd apache-streampark-${release_version}-incubating-src ***选择 scala 版本, 第一次编译 scala 2.11版本选择 1, 第二次编译 scala 2.12版本选择 2*** ->[StreamPark] StreamPark supports Scala 2.11 and 2.12. Which version do you need ? +>[Apache StreamPark] Apache StreamPark supports Scala 2.11 and 2.12. Which version do you need ? > >1. 2.11 >2. 
2.12 @@ -183,7 +183,7 @@ https://apache.org/legal/resolved.html) 回复的邮件一定要带上自己检查了那些项信息,仅仅回复`+1 approve`,是无效的。 -PPMC在dev@streampark.apache.org StreamPark 的社区投票时,请带上 binding后缀,表示对 StreamPark 社区中的投票具有约束性投票,方便统计投票结果。 +PPMC在dev@streampark.apache.org Apache StreamPark 的社区投票时,请带上 binding后缀,表示对 Apache StreamPark 社区中的投票具有约束性投票,方便统计投票结果。 IPMC在general@incubator.apache.org incubator社区投票,请带上 binding后缀,表示对incubator社区中的投票具有约束性投票,方便统计投票结果。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/submit_guide/document.md b/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/submit_guide/document.md index 52bddedf4..533566342 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/submit_guide/document.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/submit_guide/document.md @@ -21,11 +21,11 @@ sidebar_position: 1 limitations under the License. --> -对于任何类型的软件来说,良好的文档都是至关重要的。任何能够改进 StreamPark 文档的贡献都是受欢迎的。 +对于任何类型的软件来说,良好的文档都是至关重要的。任何能够改进 Apache StreamPark 文档的贡献都是受欢迎的。 ## 获取文档项目 -StreamPark 项目的文档在一个单独的 [git 仓库](https://github.com/apache/incubator-streampark-website) 中维护。 +Apache StreamPark 项目的文档在一个单独的 [git 仓库](https://github.com/apache/incubator-streampark-website) 中维护。 首先,您需要将文档项目 fork 到您自己的 github 仓库,然后将文`clone`到您的本地计算机。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/submit_guide/submit-code.md b/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/submit_guide/submit-code.md index 43e8704f6..14c913d0f 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/submit_guide/submit-code.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs-community/current/submit_guide/submit-code.md @@ -85,4 +85,4 @@ sidebar_position: 2 * 然后社区的 Committers 将进行 CodeReview,并与您讨论一些细节(包括设计、实现、性能等)。当团队的每个成员都对此修改感到满意时,提交将被合并到 dev 分支。 -* 最后,恭喜您,您已经成为 StreamPark 的官方贡献者! +* 最后,恭喜您,您已经成为 Apache StreamPark 的官方贡献者! 
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/1-kafka.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/1-kafka.md index 27a79dc93..4ecc49db0 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/1-kafka.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/1-kafka.md @@ -8,7 +8,7 @@ import TabItem from '@theme/TabItem'; [Flink 官方](https://ci.apache.org/projects/flink/flink-docs-release-1.12/zh/dev/connectors/kafka.html)提供了[Apache Kafka](http://kafka.apache.org)的连接器,用于从 Kafka topic 中读取或者向其中写入数据,可提供 **精确一次** 的处理语义 -`StreamPark`中`KafkaSource`和`KafkaSink`基于官网的`kafka connector`进一步封装,屏蔽很多细节,简化开发步骤,让数据的读取和写入更简单 +`Apache StreamPark`中`KafkaSource`和`KafkaSink`基于官网的`kafka connector`进一步封装,屏蔽很多细节,简化开发步骤,让数据的读取和写入更简单 ## 依赖 @@ -68,7 +68,7 @@ properties.setProperty("group.id", "test") val stream = env.addSource(new FlinkKafkaConsumer[String]("topic", new SimpleStringSchema(), properties)) ``` -可以看到一上来定义了一堆kafka的连接信息,这种方式各项参数都是硬编码的方式写死的,非常的不灵敏,下面我们来看看如何用`StreamPark`接入 `kafka`的数据,只需要按照规定的格式定义好配置文件然后编写代码即可,配置和代码如下 +可以看到一上来定义了一堆kafka的连接信息,这种方式各项参数都是硬编码的方式写死的,非常的不灵敏,下面我们来看看如何用`Apache StreamPark`接入 `kafka`的数据,只需要按照规定的格式定义好配置文件然后编写代码即可,配置和代码如下 ### 基础消费示例 @@ -314,7 +314,7 @@ DataStream source1 = new KafkaSource(context) 更多详情请参考[官网文档](https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/connectors/kafka.html#partition-discovery) Flink Kafka Consumer 还能够使用正则表达式基于 Topic 名称的模式匹配来发现 Topic,详情请参考[官网文档](https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/connectors/kafka.html#topic-discovery) -在`StreamPark`中提供更简单的方式,具体需要在 `pattern`下配置要匹配的`topic`实例名称的正则即可 +在`Apache StreamPark`中提供更简单的方式,具体需要在 `pattern`下配置要匹配的`topic`实例名称的正则即可 @@ -403,7 +403,7 @@ DataStream stream = env.addSource(myConsumer); -在`StreamPark`中不推荐这种方式进行设定,提供了更方便的方式,只需要在配置里指定 `auto.offset.reset` 即可 +在`Apache StreamPark`中不推荐这种方式进行设定,提供了更方便的方式,只需要在配置里指定 `auto.offset.reset` 即可 * `earliest` 从最早的记录开始 * `latest` 从最新的记录开始 @@ -535,7 +535,7 @@ class JavaUser implements Serializable { 在许多场景中,记录的时间戳是(显式或隐式)嵌入到记录本身中。此外,用户可能希望定期或以不规则的方式`Watermark`,例如基于`Kafka`流中包含当前事件时间的`watermark`的特殊记录。对于这些情况,`Flink Kafka Consumer`是允许指定`AssignerWithPeriodicWatermarks`或`AssignerWithPunctuatedWatermarks` -在`StreamPark`中运行传入一个`WatermarkStrategy`作为参数来分配`Watermark`,如下面的示例,解析`topic`中的数据为`user`对象,`user`中有个 `orderTime` 是时间类型,我们以这个为基准,为其分配一个`Watermark` +在`Apache StreamPark`中运行传入一个`WatermarkStrategy`作为参数来分配`Watermark`,如下面的示例,解析`topic`中的数据为`user`对象,`user`中有个 `orderTime` 是时间类型,我们以这个为基准,为其分配一个`Watermark` @@ -665,7 +665,7 @@ class JavaUser implements Serializable { ## Kafka Sink (Producer) -在`StreamPark`中`Kafka Producer` 被称为`KafkaSink`,它允许将消息写入一个或多个`Kafka topic中` +在`Apache StreamPark`中`Kafka Producer` 被称为`KafkaSink`,它允许将消息写入一个或多个`Kafka topic中` @@ -990,7 +990,7 @@ class JavaUser implements Serializable { ### 指定partitioner -`KafkaSink`允许显示的指定一个kafka分区器,不指定默认使用`StreamPark`内置的 **KafkaEqualityPartitioner** 分区器,顾名思义,该分区器可以均匀的将数据写到各个分区中去,`scala` api是通过`partitioner`参数来设置分区器, +`KafkaSink`允许显示的指定一个kafka分区器,不指定默认使用`Apache StreamPark`内置的 **KafkaEqualityPartitioner** 分区器,顾名思义,该分区器可以均匀的将数据写到各个分区中去,`scala` api是通过`partitioner`参数来设置分区器, `java` api中是通过`partitioner()`方法来设置的 :::tip 注意事项 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/2-jdbc.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/2-jdbc.md index 3ae5cf947..336b8f9e6 100755 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/2-jdbc.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/2-jdbc.md @@ 
-9,11 +9,11 @@ import TabItem from '@theme/TabItem'; Flink 官方 提供了[JDBC](https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/connectors/jdbc.html)的连接器,用于从 JDBC 中读取或者向其中写入数据,可提供 **AT_LEAST_ONCE** (至少一次)的处理语义 -`StreamPark`中基于两阶段提交实现了 **EXACTLY_ONCE** (精确一次)语义的`JdbcSink`,并且采用[`HikariCP`](https://github.com/brettwooldridge/HikariCP)为连接池,让数据的读取和写入更简单更准确 +`Apache StreamPark`中基于两阶段提交实现了 **EXACTLY_ONCE** (精确一次)语义的`JdbcSink`,并且采用[`HikariCP`](https://github.com/brettwooldridge/HikariCP)为连接池,让数据的读取和写入更简单更准确 ## JDBC 信息配置 -在`StreamPark`中`JDBC Connector`的实现用到了[` HikariCP `](https://github.com/brettwooldridge/HikariCP)连接池,相关的配置在`jdbc`的namespace下,约定的配置如下: +在`Apache StreamPark`中`JDBC Connector`的实现用到了[` HikariCP `](https://github.com/brettwooldridge/HikariCP)连接池,相关的配置在`jdbc`的namespace下,约定的配置如下: ```yaml jdbc: @@ -59,7 +59,7 @@ jdbc: ## JDBC 读取数据 -在`StreamPark`中`JdbcSource`用来读取数据,并且根据数据的`offset`做到数据读时可回放,我们看看具体如何用`JdbcSource`读取数据,假如需求如下 +在`Apache StreamPark`中`JdbcSource`用来读取数据,并且根据数据的`offset`做到数据读时可回放,我们看看具体如何用`JdbcSource`读取数据,假如需求如下
@@ -225,7 +225,7 @@ public interface SQLResultFunction extends Serializable { ## JDBC 读取写入 -`StreamPark`中`JdbcSink`是用来写入数据,我们看看具体如何用`JdbcSink`写入数据,假如需求是需要从`kakfa`中读取数据,写入到`mysql` +`Apache StreamPark`中`JdbcSink`是用来写入数据,我们看看具体如何用`JdbcSink`写入数据,假如需求是需要从`kakfa`中读取数据,写入到`mysql` @@ -246,7 +246,7 @@ jdbc: password: 123456 ``` :::danger 注意事项 -配置里`jdbc`下的 **semantic** 是写入的语义,在上面[Jdbc信息配置](#jdbc-信息配置)有介绍,该配置只会在`JdbcSink`下生效,`StreamPark`中基于两阶段提交实现了 **EXACTLY_ONCE** 语义, +配置里`jdbc`下的 **semantic** 是写入的语义,在上面[Jdbc信息配置](#jdbc-信息配置)有介绍,该配置只会在`JdbcSink`下生效,`Apache StreamPark`中基于两阶段提交实现了 **EXACTLY_ONCE** 语义, 这本身需要被操作的数据库(`mysql`,`oracle`,`MariaDB`,`MS SQL Server`)等支持事务,理论上所有支持标准Jdbc事务的数据库都可以做到EXACTLY_ONCE(精确一次)的写入 ::: diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/3-clickhouse.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/3-clickhouse.md index 3f4e47b0f..dd814936f 100755 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/3-clickhouse.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/3-clickhouse.md @@ -9,7 +9,7 @@ import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; [ClickHouse](https://clickhouse.com/)是一个用于联机分析(OLAP)的列式数据库管理系统(DBMS),主要面向OLAP场景。目前flink官方未提供写入 -读取clickhouse数据的连接器。StreamPark 基于ClickHouse 支持的访问形式[HTTP客户端](https://clickhouse.com/docs/zh/interfaces/http/)、 +读取clickhouse数据的连接器。Apache StreamPark 基于ClickHouse 支持的访问形式[HTTP客户端](https://clickhouse.com/docs/zh/interfaces/http/)、 [JDBC驱动](https://clickhouse.com/docs/zh/interfaces/jdbc/)封装了ClickHouseSink用于向clickhouse实时写入数据。 `ClickHouse`写入不支持事务,使用 JDBC 向其中写入数据可提供 AT_LEAST_ONCE (至少一次)的处理语义。使用 HTTP客户端 异步写入,对异步写入重试多次 @@ -63,9 +63,9 @@ public class ClickHouseUtil { 以上将各项参数拼接为请求 url 的方式较繁琐,并且是硬编码的方式写死的,非常的不灵敏. -### StreamPark 方式写入 +### Apache StreamPark 方式写入 -用`StreamPark`接入 `clickhouse`的数据, 只需要按照规定的格式定义好配置文件然后编写代码即可,配置和代码如下在`StreamPark`中`clickhose jdbc` 约定的配置见配置列表,运行程序样例为scala,如下: +用`Apache StreamPark`接入 `clickhouse`的数据, 只需要按照规定的格式定义好配置文件然后编写代码即可,配置和代码如下在`Apache StreamPark`中`clickhose jdbc` 约定的配置见配置列表,运行程序样例为scala,如下: #### 配置信息 @@ -149,12 +149,12 @@ clickhouse INSERT 必须通过POST方法来插入数据 常规操作如下: $ echo 'INSERT INTO t VALUES (1),(2),(3)' | curl 'http://localhost:8123/' --data-binary @- ``` -上述方式操作较简陋,当然也可以使用java 代码来进行写入, StreamPark 对 http post 写入方式进行封装增强,增加缓存、异步写入、失败重试、达到重试阈值后数据备份至外部组件(kafka,mysql,hdfs,hbase) +上述方式操作较简陋,当然也可以使用java 代码来进行写入, Apache StreamPark 对 http post 写入方式进行封装增强,增加缓存、异步写入、失败重试、达到重试阈值后数据备份至外部组件(kafka,mysql,hdfs,hbase) 等功能,以上功能只需要按照规定的格式定义好配置文件然后编写代码即可,配置和代码如下 -### StreamPark 方式写入 +### Apache StreamPark 方式写入 -在`StreamPark`中`clickhose jdbc` 约定的配置见配置列表,运行程序样例为scala,如下: +在`Apache StreamPark`中`clickhose jdbc` 约定的配置见配置列表,运行程序样例为scala,如下: 这里采用asynchttpclient作为http异步客户端来进行写入,先导入 asynchttpclient 的jar diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/4-doris.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/4-doris.md index ee968a7c0..9fe72aa5f 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/4-doris.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/4-doris.md @@ -9,11 +9,11 @@ import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; [Apache Doris](https://doris.apache.org/)是一款基于大规模并行处理技术的分布式 SQL 数据库,主要面向 OLAP 场景。 -StreamPark 基于Doris的[stream load](https://doris.apache.org/administrator-guide/load-data/stream-load-manual.html)封装了DoirsSink用于向Doris实时写入数据。 +Apache StreamPark 基于Doris的[stream 
load](https://doris.apache.org/administrator-guide/load-data/stream-load-manual.html)封装了DoirsSink用于向Doris实时写入数据。 -### StreamPark 方式写入 +### Apache StreamPark 方式写入 -用`StreamPark`写入 `doris`的数据, 目前 DorisSink 只支持 JSON 格式(单层)写入,如:{"id":1,"name":"streampark"} +用`Apache StreamPark`写入 `doris`的数据, 目前 DorisSink 只支持 JSON 格式(单层)写入,如:{"id":1,"name":"streampark"} 运行程序样例为java,如下: #### 配置信息 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/5-es.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/5-es.md index 358f12c81..477796164 100755 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/5-es.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/5-es.md @@ -12,10 +12,10 @@ import TabItem from '@theme/TabItem'; [Flink 官方](https://nightlies.apache.org/flink/flink-docs-release-1.14/zh/docs/connectors/)提供了[Elasticsearch](https://nightlies.apache.org/flink/flink-docs-release-1.14/zh/docs/connectors/datastream/elasticsearch/)的连接器,用于向 elasticsearch 中写入数据,可提供 **至少一次** 的处理语义 ElasticsearchSink 使用 TransportClient(6.x 之前)或者 RestHighLevelClient(6.x 开始)和 Elasticsearch 集群进行通信, -`StreamPark`对 flink-connector-elasticsearch6 进一步封装,屏蔽开发细节,简化Elasticsearch6及以上的写入操作。 +`Apache StreamPark`对 flink-connector-elasticsearch6 进一步封装,屏蔽开发细节,简化Elasticsearch6及以上的写入操作。 :::tip 提示 -因为Flink Connector Elasticsearch 不同版本之间存在冲突`StreamPark`暂时仅支持Elasticsearch6及以上的写入操作,如需写入Elasticsearch5需要使用者排除 +因为Flink Connector Elasticsearch 不同版本之间存在冲突`Apache StreamPark`暂时仅支持Elasticsearch6及以上的写入操作,如需写入Elasticsearch5需要使用者排除 flink-connector-elasticsearch6 依赖,引入 flink-connector-elasticsearch5依赖 创建 org.apache.flink.streaming.connectors.elasticsearch5.ElasticsearchSink 实例写入数据。 ::: @@ -177,10 +177,10 @@ input.addSink(esSinkBuilder.build) -以上创建ElasticsearchSink添加参数非常的不灵敏。`StreamPark`使用约定大于配置、自动配置的方式只需要配置es -连接参数、flink运行参数,StreamPark 会自动组装source和sink,极大的简化开发逻辑,提升开发效率和维护性。 +以上创建ElasticsearchSink添加参数非常的不灵敏。`Apache StreamPark`使用约定大于配置、自动配置的方式只需要配置es +连接参数、flink运行参数,Apache StreamPark 会自动组装source和sink,极大的简化开发逻辑,提升开发效率和维护性。 -## StreamPark 写入 Elasticsearch +## Apache StreamPark 写入 Elasticsearch ESSink 在启用 Flink checkpoint 后,保证至少一次将操作请求发送到 Elasticsearch 集群。 @@ -212,7 +212,7 @@ host: localhost:9200 ### 2. 
写入Elasticsearch -用 StreamPark 写入Elasticsearch非常简单,代码如下: +用 Apache StreamPark 写入Elasticsearch非常简单,代码如下: @@ -266,7 +266,7 @@ object ConnectorApp extends FlinkStreaming { -Flink ElasticsearchSinkFunction可以执行多种类型请求,如(DeleteRequest、 UpdateRequest、IndexRequest),StreamPark也对以上功能进行了支持,对应方法如下: +Flink ElasticsearchSinkFunction可以执行多种类型请求,如(DeleteRequest、 UpdateRequest、IndexRequest),Apache StreamPark也对以上功能进行了支持,对应方法如下: ```scala import org.apache.streampark.flink.core.scala.StreamingContext import org.apache.flink.streaming.api.datastream.DataStreamSink @@ -344,5 +344,5 @@ Elasticsearch 操作请求可能由于多种原因而失败,可以通过实现 [官方文档](https://nightlies.apache.org/flink/flink-docs-release-1.14/zh/docs/connectors/datastream/elasticsearch/#elasticsearch-sink)**处理失败的 Elasticsearch 请求** 单元 ### 配置内部批量处理器 es内部`BulkProcessor`可以进一步配置其如何刷新缓存操作请求的行为详细查看[官方文档](https://nightlies.apache.org/flink/flink-docs-release-1.14/zh/docs/connectors/datastream/elasticsearch/#elasticsearch-sink)**配置内部批量处理器** 单元 -### StreamPark配置 -其他的所有的配置都必须遵守 **StreamPark** 配置,具体可配置项和各个参数的作用请参考[项目配置](/docs/development/conf) +### Apache StreamPark配置 +其他的所有的配置都必须遵守 **Apache StreamPark** 配置,具体可配置项和各个参数的作用请参考[项目配置](/docs/development/conf) diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/6-hbase.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/6-hbase.md index cd4c98262..60e77ae52 100755 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/6-hbase.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/6-hbase.md @@ -10,12 +10,12 @@ import TabItem from '@theme/TabItem'; [Apache HBase](https://hbase.apache.org/book.html)是一个高可靠性、高性能、面向列、可伸缩的分布式存储系统,利用HBase技术可在廉价PC Server 上搭建起大规模结构化存储集群。 HBase不同于一般的关系数据库,它是一个适合于非结构化数据存储的数据库,HBase基于列的而不是基于行的模式。 -flink官方未提供Hbase DataStream的连接器。StreamPark 基于`Hbase-client`封装了HBaseSource、HBaseSink,支持依据配置自动创建连接,简化开发。 -StreamPark 读取Hbase在开启chekpoint情况下可以记录读取数据的最新状态,通过数据本身标识可以恢复source对应偏移量。实现source端AT_LEAST_ONCE(至少一次语义)。 +flink官方未提供Hbase DataStream的连接器。Apache StreamPark 基于`Hbase-client`封装了HBaseSource、HBaseSink,支持依据配置自动创建连接,简化开发。 +Apache StreamPark 读取Hbase在开启chekpoint情况下可以记录读取数据的最新状态,通过数据本身标识可以恢复source对应偏移量。实现source端AT_LEAST_ONCE(至少一次语义)。 HbaseSource 实现了flink Async I/O,用于提升streaming的吞吐量,sink端默认支持AT_LEAST_ONCE (至少一次)的处理语义。在开启checkpoint情况下支持EXACTLY_ONCE()精确一次语义。 :::tip 提示 -StreamPark 读取HBASE在开启chekpoint情况下可以记录读取数据的最新状态,作业恢复后从是否可以恢复之前状态完全取决于数据本身是否有偏移量的标识,需要在代码手动指定。 +Apache StreamPark 读取HBASE在开启chekpoint情况下可以记录读取数据的最新状态,作业恢复后从是否可以恢复之前状态完全取决于数据本身是否有偏移量的标识,需要在代码手动指定。 在HBaseSource的getDataStream方法func参数指定恢复逻辑。 ::: @@ -234,9 +234,9 @@ class HBaseWriter extends RichSinkFunction { -以方式读写Hbase较繁琐,非常的不灵敏。`StreamPark`使用约定大于配置、自动配置的方式只需要配置Hbase连接参数、flink运行参数,StreamPark 会自动组装source和sink,极大的简化开发逻辑,提升开发效率和维护性。 +以方式读写Hbase较繁琐,非常的不灵敏。`Apache StreamPark`使用约定大于配置、自动配置的方式只需要配置Hbase连接参数、flink运行参数,Apache StreamPark 会自动组装source和sink,极大的简化开发逻辑,提升开发效率和维护性。 -## StreamPark 读写 Hbase +## Apache StreamPark 读写 Hbase ### 1. 配置策略和连接信息 @@ -252,7 +252,7 @@ hbase: ``` ### 2. 
读写入Hbase -用 StreamPark 写入Hbase非常简单,代码如下: +用 Apache StreamPark 写入Hbase非常简单,代码如下: @@ -358,7 +358,7 @@ object HBaseSinkApp extends FlinkStreaming { -StreamPark 写入Hbase 需要创建HBaseQuery的方法、指定将查询结果转化为需要对象的方法、标识是否在运行、传入运行参数。具体如下: +Apache StreamPark 写入Hbase 需要创建HBaseQuery的方法、指定将查询结果转化为需要对象的方法、标识是否在运行、传入运行参数。具体如下: ```scala /** * @param ctx @@ -384,7 +384,7 @@ class HBaseSource(@(transient@param) val ctx: StreamingContext, property: Proper } ``` -StreamPark HbaseSource 实现了flink Async I/O 用于提升Streaming的吞吐量,先创建 DataStream 然后创建 HBaseRequest 调用 +Apache StreamPark HbaseSource 实现了flink Async I/O 用于提升Streaming的吞吐量,先创建 DataStream 然后创建 HBaseRequest 调用 requestOrdered() 或者 requestUnordered() 创建异步流,建如下代码: ```scala class HBaseRequest[T: TypeInformation](@(transient@param) private val stream: DataStream[T], property: Properties = new Properties()) { @@ -423,7 +423,7 @@ class HBaseRequest[T: TypeInformation](@(transient@param) private val stream: Da } ``` -StreamPark 支持两种方式写入数据:1.addSink() 2. writeUsingOutputFormat 样例如下: +Apache StreamPark 支持两种方式写入数据:1.addSink() 2. writeUsingOutputFormat 样例如下: ```scala //1)插入方式1 HBaseSink().sink[TestEntity](source, "order") @@ -438,4 +438,4 @@ StreamPark 支持两种方式写入数据:1.addSink() 2. writeUsingOutputForma ## 其他配置 -其他的所有的配置都必须遵守 **StreamPark** 配置,具体可配置项和各个参数的作用请参考[项目配置](/docs/development/conf) +其他的所有的配置都必须遵守 **Apache StreamPark** 配置,具体可配置项和各个参数的作用请参考[项目配置](/docs/development/conf) diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/7-http.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/7-http.md index 06973a406..f0823851c 100755 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/7-http.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/7-http.md @@ -9,7 +9,7 @@ import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; 一些后台服务通过http请求接收数据,这种场景下flink可以通过http请求写入结果数据,目前flink官方未提供通过http请求写入 -数据的连接器。StreamPark 基于asynchttpclient封装了HttpSink异步实时写入数据。 +数据的连接器。Apache StreamPark 基于asynchttpclient封装了HttpSink异步实时写入数据。 `HttpSink`写入不支持事务,向目标服务写入数据可提供 AT_LEAST_ONCE (至少一次)的处理语义。异步写入重试多次失败的数据会写入外部组件(kafka,mysql,hdfs,hbase) ,最终通过人为介入来恢复数据,达到最终数据一致。 @@ -26,7 +26,7 @@ import TabItem from '@theme/TabItem'; ``` -## StreamPark 方式写入 +## Apache StreamPark 方式写入 ### http异步写入支持类型 @@ -134,4 +134,4 @@ object HttpSinkApp extends FlinkStreaming { ::: ## 其他配置 -其他的所有的配置都必须遵守 **StreamPark** 配置,具体可配置项和各个参数的作用请参考[项目配置](/docs/development/conf) +其他的所有的配置都必须遵守 **Apache StreamPark** 配置,具体可配置项和各个参数的作用请参考[项目配置](/docs/development/conf) diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/8-redis.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/8-redis.md index 3c2f8c2b8..afb4deb0a 100755 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/8-redis.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/connector/8-redis.md @@ -12,10 +12,10 @@ import TabItem from '@theme/TabItem'; hyperloglogs 和 地理空间(geospatial) 索引半径查询。 Redis 内置了事务(transactions) 和不同级别的 磁盘持久化(persistence), 并通过 Redis哨兵(Sentinel)和自动 分区(Cluster)提供高可用性(high availability)。 -flink官方未提供写入reids数据的连接器。StreamPark 基于[Flink Connector Redis](https://bahir.apache.org/docs/flink/current/flink-streaming-redis/) +flink官方未提供写入reids数据的连接器。Apache StreamPark 基于[Flink Connector Redis](https://bahir.apache.org/docs/flink/current/flink-streaming-redis/) 封装了RedisSink、配置redis连接参数,即可自动创建redis连接简化开发。目前RedisSink支持连接方式有:单节点模式、哨兵模式,因集群模式不支持事务,目前未支持。 -StreamPark 使用Redis的 **MULTI** 命令开启事务,**EXEC** 命令提交事务,细节见链接: +Apache StreamPark 使用Redis的 **MULTI** 命令开启事务,**EXEC** 
命令提交事务,细节见链接: http://www.redis.cn/topics/transactions.html ,使用RedisSink 默认支持AT_LEAST_ONCE (至少一次)的处理语义。在开启checkpoint情况下支持EXACTLY_ONCE语义。 :::tip 提示 @@ -24,7 +24,7 @@ EXACTLY_ONCE语义下会在flink作业checkpoint整体完成情况下批量写 ::: ## Redis写入依赖 -Flink Connector Redis 官方提供两种,以下两种api均相同,StreamPark 使用的是org.apache.bahir依赖 +Flink Connector Redis 官方提供两种,以下两种api均相同,Apache StreamPark 使用的是org.apache.bahir依赖 ```xml org.apache.bahir @@ -166,10 +166,10 @@ public class FlinkRedisSink { } ``` -以上创建FlinkJedisPoolConfig较繁琐,redis的每种操作都要构建RedisMapper,非常的不灵敏。`StreamPark`使用约定大于配置、自动配置的方式只需要配置redis -连接参数、flink运行参数,StreamPark 会自动组装source和sink,极大的简化开发逻辑,提升开发效率和维护性。 +以上创建FlinkJedisPoolConfig较繁琐,redis的每种操作都要构建RedisMapper,非常的不灵敏。`Apache StreamPark`使用约定大于配置、自动配置的方式只需要配置redis +连接参数、flink运行参数,Apache StreamPark 会自动组装source和sink,极大的简化开发逻辑,提升开发效率和维护性。 -## StreamPark 写入 Redis +## Apache StreamPark 写入 Redis RedisSink 默认为AT_LEAST_ONCE (至少一次)的处理语义,在开启checkpoint情况下两阶段段提交支持EXACTLY_ONCE语义,可使用的连接类型: 单节点模式、哨兵模式。 @@ -219,7 +219,7 @@ redis.sink: ### 2. 写入Redis -用 StreamPark 写入redis非常简单,代码如下: +用 Apache StreamPark 写入redis非常简单,代码如下: @@ -280,7 +280,7 @@ case class RedisMapper[T](cmd: RedisCommand, additionalKey: String, key: T => St -如代码所示,StreamPark 会自动加载配置创建RedisSink,用户通过创建需要的RedisMapper对象即完成redis写入操作,**additionalKey为hset时为最外层key其他写入命令无效**。 +如代码所示,Apache StreamPark 会自动加载配置创建RedisSink,用户通过创建需要的RedisMapper对象即完成redis写入操作,**additionalKey为hset时为最外层key其他写入命令无效**。 RedisSink.sink()写入相应的key对应数据是需要指定过期时间,如果未指定默认过期时间为java Integer.MAX_VALUE (67年)。如代码所示: ```scala @@ -360,11 +360,11 @@ public enum RedisCommand { ``` :::info 警告 -RedisSink 目前支持单节点模式、哨兵模式连接,集群模式不支持事务,StreamPark 目前为支持,如有使用场景,请调用Flink Connector Redis官方api。

+RedisSink 目前支持单节点模式、哨兵模式连接,集群模式不支持事务,Apache StreamPark 目前未支持,如有使用场景,请调用 Flink Connector Redis 官方 API。

EXACTLY_ONCE语义下必须开启checkpoint,否则程序会抛出参数异常。

EXACTLY_ONCE语义下checkpoint的数据sink缓存在内存里面,需要根据实际数据合理设置checkpoint时间间隔,否则有**oom**的风险。

::: ## 其他配置 -其他的所有的配置都必须遵守 **StreamPark** 配置,具体可配置项和各个参数的作用请参考[项目配置](/docs/development/conf) +其他的所有的配置都必须遵守 **Apache StreamPark** 配置,具体可配置项和各个参数的作用请参考[项目配置](/docs/development/conf) diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/development/alert-conf.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/development/alert-conf.md index 96182e154..050f38c0e 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/development/alert-conf.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/development/alert-conf.md @@ -7,7 +7,7 @@ sidebar_position: 3 import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; -目前 `StreamPark` 中支持配置多种报警方式,主要有以下几种: +目前 `Apache StreamPark` 中支持配置多种报警方式,主要有以下几种: * **E-mail**:邮件通知 * **DingTalk**:钉钉自定义群机器人 @@ -22,7 +22,7 @@ import TabItem from '@theme/TabItem'; ::: ## 新增报警配置 -点击左侧 `StreamPark -> Setting`,然后选择`Alert Setting` 进入报警信息配置。 +点击左侧 `Apache StreamPark -> Setting`,然后选择`Alert Setting` 进入报警信息配置。 ![alert_add_setting.png](/doc/image/alert/alert_add_setting.png) 点击 `Add New` 进行报警新增配置: diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/development/conf.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/development/conf.md index da507863f..afbd57923 100755 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/development/conf.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/development/conf.md @@ -16,7 +16,7 @@ ClientFailureRate, ClientTables } from '../components/TableData.jsx'; -配置在`StreamPark`中是非常重要的概念,先说说为什么需要配置 +配置在`Apache StreamPark`中是非常重要的概念,先说说为什么需要配置 ## 为什么需要配置 @@ -101,10 +101,10 @@ public class JavaTableApp { **答案是肯定的** -针对参数设置的问题,在`StreamPark`中提出统一程序配置的概念,把程序的一系列参数从开发到部署阶段按照特定的格式配置到`application.yml`里,抽象出 +针对参数设置的问题,在`Apache StreamPark`中提出统一程序配置的概念,把程序的一系列参数从开发到部署阶段按照特定的格式配置到`application.yml`里,抽象出 一个通用的配置模板,按照这种规定的格式将上述配置的各项参数在配置文件里定义出来,在程序启动的时候将这个项目配置传入到程序中即可完成环境的初始化工作,在任务启动的时候也会自动识别启动时的参数. 
-针对Flink Sql作业在代码里写sql的问题,`StreamPark`针对`Flink Sql`作业做了更高层级封装和抽象,开发者只需要将sql按照一定的规范要求定义到`application.yaml`中,在程序启动时传入该文件到主程序中, 就会自动按照要求加载执行sql +针对Flink Sql作业在代码里写sql的问题,`Apache StreamPark`针对`Flink Sql`作业做了更高层级封装和抽象,开发者只需要将sql按照一定的规范要求定义到`application.yaml`中,在程序启动时传入该文件到主程序中, 就会自动按照要求加载执行sql 下面我们来详细看看这个配置文件的各项配置都是如何进行配置的,有哪些注意事项 @@ -245,7 +245,7 @@ option下的参数必须是 `完整参数名` `$internal.application.main` 和 `yarn.application.name` 这两个参数是必须的 ::: 如您需要设置更多的参数,可参考[`这里`](https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/config.html) -一定要将这些参数放到`property`下,并且参数名称要正确,`StreamPark`会自动解析这些参数并生效 +一定要将这些参数放到`property`下,并且参数名称要正确,`Apache StreamPark`会自动解析这些参数并生效 ##### Memory参数 Memory相关的参数设置也非常之多,一般常见的配置如下 @@ -325,7 +325,7 @@ sql: | :::danger 特别注意 -上面内容中 **sql:** 后面的 **|** 是必带的, 加上 **|** 会保留整段内容的格式,重点是保留了换行符, StreamPark封装了Flink Sql的提交,可以直接将多个Sql一次性定义出来,每个Sql必须用 **;** 分割,每段 Sql也必须遵循Flink Sql规定的格式和规范 +上面内容中 **sql:** 后面的 **|** 是必带的, 加上 **|** 会保留整段内容的格式,重点是保留了换行符, Apache StreamPark封装了Flink Sql的提交,可以直接将多个Sql一次性定义出来,每个Sql必须用 **;** 分割,每段 Sql也必须遵循Flink Sql规定的格式和规范 ::: ## 总结 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/development/model.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/development/model.md index bec708e74..39cb213cc 100755 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/development/model.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/development/model.md @@ -8,7 +8,7 @@ import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; 任何框架都有一些要遵循的规则和约定, 我们只有遵循并掌握了这些规则, 才能更加游刃有余的使用, 使其发挥事半功倍的效果, 我们开发 Flink 作业,其实就是利用 Flink 提供的 API , 按照 Flink 要求的开发方式, 写一个可以执行的(必须有`main()`函数)的程序, 在程序里接入各种`Connector`经过一系列的`算子`操作, 最终将数据通过`Connector` sink 到目标存储, -我们把这种按照某种约定的规则去逐步编程的方式称之为`编程模型`, 这一章节我们就来聊聊 StreamPark 的`编程模型`以及开发注意事项 +我们把这种按照某种约定的规则去逐步编程的方式称之为`编程模型`, 这一章节我们就来聊聊 Apache StreamPark 的`编程模型`以及开发注意事项 我们从这几个方面开始入手 @@ -25,11 +25,11 @@ import TabItem from '@theme/TabItem'; ## 编程模型 -`streampark-core` 定位是编程时框架,快速开发脚手架,专门为简化 Flink 开发而生,开发者在开发阶段会使用到该模块,下面我们来看看 `DataStream` 和 `Flink Sql` 用 StreamPark 来开发编程模型是什么样的,有什么规范和要求 +`streampark-core` 定位是编程时框架,快速开发脚手架,专门为简化 Flink 开发而生,开发者在开发阶段会使用到该模块,下面我们来看看 `DataStream` 和 `Flink Sql` 用 Apache StreamPark 来开发编程模型是什么样的,有什么规范和要求 ### DataStream - StreamPark 提供了`scala`和`Java`两种 API 来开发 `DataStream` 程序,具体代码开发如下 + Apache StreamPark 提供了`scala`和`Java`两种 API 来开发 `DataStream` 程序,具体代码开发如下 @@ -78,7 +78,7 @@ public class MyFlinkJavaApp { :::tip 提示 -以上几行 `scala` 和 `Java` 代码就是用 StreamPark 开发 `DataStream` 必不可少的最基本的骨架代码,用 StreamPark 开发 `DataStream` 程序,从这几行代码开始, Java API 开发需要开发者手动启动任务 `start` +以上几行 `scala` 和 `Java` 代码就是用 Apache StreamPark 开发 `DataStream` 必不可少的最基本的骨架代码,用 Apache StreamPark 开发 `DataStream` 程序,从这几行代码开始, Java API 开发需要开发者手动启动任务 `start` ::: @@ -88,11 +88,11 @@ TableEnvironment 是用来创建 Table & SQL 程序的上下文执行环境,也 Flink 社区一直在推进 DataStream 的批处理能力,统一流批一体,在 Flink 1.12 中流批一体真正统一运行,诸多历史 API 如: DataSet API, BatchTableEnvironment API 等被废弃,退出历史舞台,官方推荐使用 **TableEnvironment** 和 **StreamTableEnvironment** - StreamPark 针对 **TableEnvironment** 和 **StreamTableEnvironment** 这两种环境的开发,提供了对应的更方便快捷的 API + Apache StreamPark 针对 **TableEnvironment** 和 **StreamTableEnvironment** 这两种环境的开发,提供了对应的更方便快捷的 API #### TableEnvironment -开发Table & SQL 作业, TableEnvironment 会是 Flink 推荐使用的入口类, 同时能支持 Java API 和 Scala API,下面的代码演示了在 StreamPark 如何开发一个 TableEnvironment 类型的作业 +开发Table & SQL 作业, TableEnvironment 会是 Flink 推荐使用的入口类, 同时能支持 Java API 和 Scala API,下面的代码演示了在 Apache StreamPark 如何开发一个 TableEnvironment 类型的作业 @@ -130,7 +130,7 @@ public class JavaTableApp { :::tip 提示 -以上几行 
Scala 和 Java 代码就是用 StreamPark 开发 TableEnvironment 必不可少的最基本的骨架代码,用 StreamPark 开发 TableEnvironment 程序,从这几行代码开始, +以上几行 Scala 和 Java 代码就是用 Apache StreamPark 开发 TableEnvironment 必不可少的最基本的骨架代码,用 Apache StreamPark 开发 TableEnvironment 程序,从这几行代码开始, Scala API 必须继承 FlinkTable, Java API 开发需要手动构造 TableContext ,需要开发者手动启动任务 `start` ::: @@ -138,7 +138,7 @@ Scala API 必须继承 FlinkTable, Java API 开发需要手动构造 TableContex #### StreamTableEnvironment `StreamTableEnvironment` 用于流计算场景,流计算的对象是 `DataStream`。相比 `TableEnvironment`, `StreamTableEnvironment` 提供了 `DataStream` 和 `Table` 之间相互转换的接口,如果用户的程序除了使用 `Table API` & `SQL` 编写外,还需要使用到 `DataStream API`,则需要使用 `StreamTableEnvironment`。 -下面的代码演示了在 StreamPark 如何开发一个 `StreamTableEnvironment` 类型的作业 +下面的代码演示了在 Apache StreamPark 如何开发一个 `StreamTableEnvironment` 类型的作业 @@ -183,12 +183,12 @@ public class JavaStreamTableApp { :::tip 特别注意 -以上几行 scala 和 Java 代码就是用 StreamPark 开发 `StreamTableEnvironment` 必不可少的最基本的骨架代码,用 StreamPark 开发 `StreamTableEnvironment` 程序,从这几行代码开始,Java 代码需要手动构造 `StreamTableContext`,`Java API`开发需要开发者手动启动任务`start` +以上几行 scala 和 Java 代码就是用 Apache StreamPark 开发 `StreamTableEnvironment` 必不可少的最基本的骨架代码,用 Apache StreamPark 开发 `StreamTableEnvironment` 程序,从这几行代码开始,Java 代码需要手动构造 `StreamTableContext`,`Java API`开发需要开发者手动启动任务`start` ::: ## RunTime Context -**RunTime Context** — **StreamingContext** , **TableContext** , **StreamTableContext** 是 StreamPark 中几个非常重要三个对象,接下来我们具体看看这三个 **Context** 的定义和作用 +**RunTime Context** — **StreamingContext** , **TableContext** , **StreamTableContext** 是 Apache StreamPark 中几个非常重要三个对象,接下来我们具体看看这三个 **Context** 的定义和作用
@@ -226,7 +226,7 @@ class StreamingContext(val parameter: ParameterTool, private val environment: St 这个对象非常重要,在 `DataStream` 作业中会贯穿整个任务的生命周期, `StreamingContext` 本身继承自 `StreamExecutionEnvironment` ,配置文件会完全融合到 `StreamingContext` 中,这样就可以非常方便的从 `StreamingContext` 中获取各种参数 ::: -在 StreamPark 中, `StreamingContext` 也是 Java API 编写 `DataStream` 作业的入口类, `StreamingContext` 的构造方法中有一个是专门为 Java API 打造的,该构造函数定义如下: +在 Apache StreamPark 中, `StreamingContext` 也是 Java API 编写 `DataStream` 作业的入口类, `StreamingContext` 的构造方法中有一个是专门为 Java API 打造的,该构造函数定义如下: ```scala /** @@ -303,7 +303,7 @@ class TableContext(val parameter: ParameterTool, } ``` -在 StreamPark 中,`TableContext` 也是 Java API 编写 `TableEnvironment` 类型的 `Table Sql` 作业的入口类,`TableContext` 的构造方法中有一个是专门为 `Java API` 打造的,该构造函数定义如下: +在 Apache StreamPark 中,`TableContext` 也是 Java API 编写 `TableEnvironment` 类型的 `Table Sql` 作业的入口类,`TableContext` 的构造方法中有一个是专门为 `Java API` 打造的,该构造函数定义如下: ```scala @@ -393,7 +393,7 @@ class StreamTableContext(val parameter: ParameterTool, ``` -在StreamPark中,`StreamTableContext` 是 Java API 编写 `StreamTableEnvironment` 类型的 `Table Sql` 作业的入口类,`StreamTableContext` 的构造方法中有一个是专门为 Java API 打造的,该构造函数定义如下: +在Apache StreamPark中,`StreamTableContext` 是 Java API 编写 `StreamTableEnvironment` 类型的 `Table Sql` 作业的入口类,`StreamTableContext` 的构造方法中有一个是专门为 Java API 打造的,该构造函数定义如下: ```scala @@ -528,7 +528,7 @@ StreamTableContext context = new StreamTableContext(JavaConfig); ::: ## 目录结构 -推荐的项目目录结构如下,具体可以参考[StreamPark Quickstart](https://github.com/apache/incubator-streampark-quickstart) 里的目录结构和配置 +推荐的项目目录结构如下,具体可以参考[Apache StreamPark Quickstart](https://github.com/apache/incubator-streampark-quickstart) 里的目录结构和配置 ``` tree . @@ -604,7 +604,7 @@ assembly.xml 是assembly打包插件需要用到的配置文件,定义如下: ## 打包部署 -推荐 [streampark-flink-quickstart](https://github.com/apache/incubator-streampark-quickstart/tree/dev/quickstart-flink) 里的打包模式,直接运行`maven package`即可生成一个标准的StreamPark推荐的项目包,解包后目录结构如下 +推荐 [streampark-flink-quickstart](https://github.com/apache/incubator-streampark-quickstart/tree/dev/quickstart-flink) 里的打包模式,直接运行`maven package`即可生成一个标准的Apache StreamPark推荐的项目包,解包后目录结构如下 ``` text . diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/flink-k8s/1-deployment.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/flink-k8s/1-deployment.md index 4d4727643..d7d4790be 100755 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/flink-k8s/1-deployment.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/flink-k8s/1-deployment.md @@ -4,24 +4,24 @@ title: 'Flink K8s 集成支持' sidebar_position: 1 --- -StreamPark Flink Kubernetes 基于 [Flink Native Kubernetes](https://ci.apache.org/projects/flink/flink-docs-stable/docs/deployment/resource-providers/native_kubernetes/) 实现,支持以下 Flink 运行模式: +Apache StreamPark Flink Kubernetes 基于 [Flink Native Kubernetes](https://ci.apache.org/projects/flink/flink-docs-stable/docs/deployment/resource-providers/native_kubernetes/) 实现,支持以下 Flink 运行模式: * Native-Kubernetes Application * Native-Kubernetes Session -单个 StreamPark 实例当前只支持单个 Kubernetes 集群,如果您有多 Kubernetes 支持的诉求,欢迎提交相关的 [Fearure Request Issue](https://github.com/apache/incubator-streampark/issues) : ) +单个 Apache StreamPark 实例当前只支持单个 Kubernetes 集群,如果您有多 Kubernetes 支持的诉求,欢迎提交相关的 [Fearure Request Issue](https://github.com/apache/incubator-streampark/issues) : )

## 额外环境要求 -StreamPark Flink-K8s 需要具备以下额外的运行环境: +Apache StreamPark Flink-K8s 需要具备以下额外的运行环境: * Kubernetes -* Maven(StreamPark 运行节点具备) -* Docker(StreamPark 运行节点是具备) +* Maven(Apache StreamPark 运行节点具备) +* Docker(Apache StreamPark 运行节点具备) -StreamPark 实例并不需要强制部署在 Kubernetes 所在节点上,可以部署在 Kubernetes 集群外部节点,但是需要该 StreamPark 部署节点与 Kubernetes 集群**保持网络通信畅通**。 +Apache StreamPark 实例并不需要强制部署在 Kubernetes 所在节点上,可以部署在 Kubernetes 集群外部节点,但是需要该 Apache StreamPark 部署节点与 Kubernetes 集群**保持网络通信畅通**。

@@ -30,9 +30,9 @@ StreamPark 实例并不需要强制部署在 Kubernetes 所在节点上,可以 ### Kubernetes 连接配置 -StreamPark 直接使用系统 `~/.kube/config ` 作为 Kubernetes 集群的连接凭证,最为简单的方式是直接拷贝 Kubernetes 节点的 `.kube/config` 到 StreamPark 节点用户目录,各云服务商 Kubernetes 服务也都提供了相关配置的快速下载。当然为了权限约束,也可以自行生成对应 k8s 自定义账户的 config。 +Apache StreamPark 直接使用系统 `~/.kube/config ` 作为 Kubernetes 集群的连接凭证,最为简单的方式是直接拷贝 Kubernetes 节点的 `.kube/config` 到 Apache StreamPark 节点用户目录,各云服务商 Kubernetes 服务也都提供了相关配置的快速下载。当然为了权限约束,也可以自行生成对应 k8s 自定义账户的 config。 -完成后,可以通过 StreamPark 所在机器的 kubectl 快速检查目标 Kubernetes 集群的连通性: +完成后,可以通过 Apache StreamPark 所在机器的 kubectl 快速检查目标 Kubernetes 集群的连通性: ```shell kubectl cluster-info @@ -50,13 +50,13 @@ kubectl create clusterrolebinding flink-role-binding-default --clusterrole=edit ### Docker 远程容器服务配置 -在 StreamPark Setting 页面,配置目标 Kubernetes 集群所使用的 Docker 容器服务的连接信息。 +在 Apache StreamPark Setting 页面,配置目标 Kubernetes 集群所使用的 Docker 容器服务的连接信息。 ![docker register setting](/doc/image/docker_register_setting.png) -在远程 Docker 容器服务创建一个名为 `streampark` 的 Namespace(该Namespace可自定义命名,命名不为 streampark 请在setting页面修改确认) ,为 StreamPark 自动构建的 Flink image 推送空间,请确保使用的 Docker Register User 具有该 Namespace 的 `pull`/`push` 权限。 +在远程 Docker 容器服务创建一个名为 `streampark` 的 Namespace(该Namespace可自定义命名,命名不为 streampark 请在setting页面修改确认) ,为 Apache StreamPark 自动构建的 Flink image 推送空间,请确保使用的 Docker Register User 具有该 Namespace 的 `pull`/`push` 权限。 -可以在 StreamPark 所在节点通过 docker command 简单测试权限: +可以在 Apache StreamPark 所在节点通过 docker command 简单测试权限: ```shell # verify access @@ -81,9 +81,9 @@ docker pull /streampark/busybox * **Flink Base Docker Image**: 基础 Flink Docker 镜像的 Tag,可以直接从 [DockerHub - offical/flink](https://hub.docker.com/_/flink) 获取,也支持用户私有的底层镜像,此时在 setting 设置 Docker Register Account 需要具备该私有镜像 `pull` 权限。 * **Rest-Service Exposed Type**:对应 Flink 原生 [kubernetes.rest-service.exposed.type](https://ci.apache.org/projects/flink/flink-docs-stable/docs/deployment/config/#kubernetes) 配置,各个候选值说明: - * `ClusterIP`:需要 StreamPark 可直接访问 K8s 内部网络; - * `LoadBalancer`:需要 K8s 提前创建 LoadBalancer 资源,且 Flink Namespace 具备自动绑定权限,同时 StreamPark 可以访问该 LoadBalancer 网关; - * `NodePort`:需要 StreamPark 可以直接连通所有 K8s 节点; + * `ClusterIP`:需要 Apache StreamPark 可直接访问 K8s 内部网络; + * `LoadBalancer`:需要 K8s 提前创建 LoadBalancer 资源,且 Flink Namespace 具备自动绑定权限,同时 Apache StreamPark 可以访问该 LoadBalancer 网关; + * `NodePort`:需要 Apache StreamPark 可以直接连通所有 K8s 节点; * **Kubernetes Pod Template**: Flink 自定义 pod-template 配置,注意container-name必须为flink-main-container,如果k8s pod拉取docker镜像需要秘钥,请在pod template文件中补全秘钥相关信息,pod-template模板如下: ``` apiVersion: v1 @@ -112,7 +112,7 @@ Flink-Native-Kubernetes Session 任务 K8s 额外的配置(pod-template 等) ## 相关参数配置 -StreamPark 在 `applicaton.yml` Flink-K8s 相关参数如下,默认情况下不需要额外调整默认值。 +Apache StreamPark 在 `applicaton.yml` Flink-K8s 相关参数如下,默认情况下不需要额外调整默认值。 | 配置项 | 描述 | 默认值 | |:-----------------------------------------------------------------------|-----------------------------------------------------------| ------- | diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/flink-k8s/2-k8s-pvc-integration.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/flink-k8s/2-k8s-pvc-integration.md index b84819d35..025ee5235 100755 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/flink-k8s/2-k8s-pvc-integration.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/flink-k8s/2-k8s-pvc-integration.md @@ -6,9 +6,9 @@ sidebar_position: 2 ## K8s PVC 资源使用说明 -当前版本 StreamPark Flink-K8s 任务对 PVC 资源(挂载 checkpoint/savepoint/logs 等文件资源)的支持基于 pod-template。 +当前版本 Apache StreamPark Flink-K8s 任务对 PVC 资源(挂载 checkpoint/savepoint/logs 
等文件资源)的支持基于 pod-template。 -Native-Kubernetes Session 由创建 Session Cluster 时控制,这里不再赘述。Native-Kubernetes Application 支持在 StreamPark 页面上直接编写 `pod-template`,`jm-pod-template`,`tm-pod-template` 配置。 +Native-Kubernetes Session 由创建 Session Cluster 时控制,这里不再赘述。Native-Kubernetes Application 支持在 Apache StreamPark 页面上直接编写 `pod-template`,`jm-pod-template`,`tm-pod-template` 配置。
@@ -44,9 +44,9 @@ spec: 1. 提供的 Flink Base Docker Image 已经包含该依赖(用户自行解决依赖冲突); -2. 在 StreamPark 本地 `Workspace/jars` 目录下放置 `flink-statebackend-rocksdb_xx.jar` 依赖; +2. 在 Apache StreamPark 本地 `Workspace/jars` 目录下放置 `flink-statebackend-rocksdb_xx.jar` 依赖; -3. 在 StreamPark Dependency 配置中加入 rockdb-backend 依赖(此时 StreamPark 会自动解决依赖冲突): +3. 在 Apache StreamPark Dependency 配置中加入 rockdb-backend 依赖(此时 Apache StreamPark 会自动解决依赖冲突): ![rocksdb dependency](/doc/image/rocksdb_dependency.png) diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/flink-k8s/3-hadoop-resource-integration.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/flink-k8s/3-hadoop-resource-integration.md index 2380be896..b20609bce 100755 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/flink-k8s/3-hadoop-resource-integration.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/flink-k8s/3-hadoop-resource-integration.md @@ -6,7 +6,7 @@ sidebar_position: 3 ## 在 Flink on K8s 上使用 Hadoop 资源 -在 StreamPark Flink-K8s runtime 下使用 Hadoop 资源,如 checkpoint 挂载 HDFS、读写 Hive 等,大概流程如下: +在 Apache StreamPark Flink-K8s runtime 下使用 Hadoop 资源,如 checkpoint 挂载 HDFS、读写 Hive 等,大概流程如下: #### 1、HDFS @@ -26,7 +26,7 @@ flink-json-1.14.5.jar log4j-1.2-api-2.17.1.jar log4j-slf4j-impl- ​ 这是需要将 shade jar 下载下来,然后放在 flink 的 lib 目录下,这里 以hadoop2 为例,下载 `flink-shaded-hadoop-2-uber`:https://repo1.maven.org/maven2/org/apache/flink/flink-shaded-hadoop-2-uber/2.7.5-9.0/flink-shaded-hadoop-2-uber-2.7.5-9.0.jar -​ 另外,可以将 shade jar 以依赖的方式在 StreamPark 的任务配置中的`Dependency` 进行依赖配置,如下配置: +​ 另外,可以将 shade jar 以依赖的方式在 Apache StreamPark 的任务配置中的`Dependency` 进行依赖配置,如下配置: ```xml @@ -128,7 +128,7 @@ public static String getHadoopConfConfigMapName(String clusterId) { ​ c、`flink-sql-connector-hive`:https://repo1.maven.org/maven2/org/apache/flink/flink-sql-connector-hive-2.3.6_2.12/1.14.5/flink-sql-connector-hive-2.3.6_2.12-1.14.5.jar -​ 同样,也可以将上述 hive 相关 jar 以依赖的方式在 StreamPark 的任务配置中的`Dependency` 进行依赖配置,这里不再赘述。 +​ 同样,也可以将上述 hive 相关 jar 以依赖的方式在 Apache StreamPark 的任务配置中的`Dependency` 进行依赖配置,这里不再赘述。 ##### ii、添加 hive 的配置文件(hive-site.xml) diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/intro.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/intro.md index a94fbd35a..6784510f2 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/intro.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/intro.md @@ -12,8 +12,8 @@ make stream processing easier!!! ## 🚀 什么是 Apache StreamPark™ -实时即未来,在实时处理流域 `Apache Spark` 和 `Apache Flink` 是一个伟大的进步,尤其是 `Apache Flink` 被普遍认为是下一代大数据流计算引擎, 我们在使用 `Flink` & `Spark` 时发现从编程模型, 启动配置到运维管理都有很多可以抽象共用的地方, 我们将一些好的经验固化下来并结合业内的最佳实践, 通过不断努力诞生了今天的框架 —— `StreamPark`, 项目的初衷是 —— 让流处理更简单, -使用 `StreamPark` 开发流处理作业, 可以极大降低学习成本和开发门槛, 让开发者只用关心最核心的业务,`StreamPark` 规范了项目的配置,鼓励函数式编程,定义了最佳的编程方式,提供了一系列开箱即用的`Connectors`,标准化了配置、开发、测试、部署、监控、运维的整个过程, 提供了`scala`和`java`两套 Api, 并且提供了一个一站式的流处理作业开发管理平台, 从流处理作业开发到上线全生命周期都 +实时即未来,在实时处理流域 `Apache Spark` 和 `Apache Flink` 是一个伟大的进步,尤其是 `Apache Flink` 被普遍认为是下一代大数据流计算引擎, 我们在使用 `Flink` & `Spark` 时发现从编程模型, 启动配置到运维管理都有很多可以抽象共用的地方, 我们将一些好的经验固化下来并结合业内的最佳实践, 通过不断努力诞生了今天的框架 —— `Apache StreamPark`, 项目的初衷是 —— 让流处理更简单, +使用 `Apache StreamPark` 开发流处理作业, 可以极大降低学习成本和开发门槛, 让开发者只用关心最核心的业务,`Apache StreamPark` 规范了项目的配置,鼓励函数式编程,定义了最佳的编程方式,提供了一系列开箱即用的`Connectors`,标准化了配置、开发、测试、部署、监控、运维的整个过程, 提供了`scala`和`java`两套 Api, 并且提供了一个一站式的流处理作业开发管理平台, 从流处理作业开发到上线全生命周期都 做了支持, 是一个一站式的流处理计算平台. @@ -31,7 +31,7 @@ make stream processing easier!!! 
`Apache StreamPark` 核心由`streampark-core` 和 `streampark-console` 组成 -![StreamPark Archite](/doc/image/streampark_archite.png) +![Apache StreamPark Archite](/doc/image/streampark_archite.png) ### 1️⃣ streampark-core diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/1-deployment.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/1-deployment.md index 1dfec80b8..d10873449 100755 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/1-deployment.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/1-deployment.md @@ -6,9 +6,9 @@ sidebar_position: 1 import { DeploymentEnvs } from '../components/TableData.jsx'; -StreamPark 总体组件栈架构如下, 由 streampark-core 和 streampark-console 两个大的部分组成 , streampark-console 是一个非常重要的模块, 定位是一个**综合实时数据平台**,**流式数仓平台**, **低代码 ( Low Code )**, **Flink & Spark 任务托管平台**,可以较好的管理 Flink 任务,集成了项目编译、发布、参数配置、启动、savepoint,火焰图 ( flame graph ),Flink SQL,监控等诸多功能于一体,大大简化了 Flink 任务的日常操作和维护,融合了诸多最佳实践。其最终目标是打造成一个实时数仓,流批一体的一站式大数据解决方案 +Apache StreamPark 总体组件栈架构如下, 由 streampark-core 和 streampark-console 两个大的部分组成 , streampark-console 是一个非常重要的模块, 定位是一个**综合实时数据平台**,**流式数仓平台**, **低代码 ( Low Code )**, **Flink & Spark 任务托管平台**,可以较好的管理 Flink 任务,集成了项目编译、发布、参数配置、启动、savepoint,火焰图 ( flame graph ),Flink SQL,监控等诸多功能于一体,大大简化了 Flink 任务的日常操作和维护,融合了诸多最佳实践。其最终目标是打造成一个实时数仓,流批一体的一站式大数据解决方案 -![StreamPark Archite](/doc/image/streampark_archite.png) +![Apache StreamPark Archite](/doc/image/streampark_archite.png) streampark-console 提供了开箱即用的安装包,安装之前对环境有些要求,具体要求如下: @@ -16,7 +16,7 @@ streampark-console 提供了开箱即用的安装包,安装之前对环境有 -目前 StreamPark 对 Flink 的任务发布,同时支持 `Flink on YARN` 和 `Flink on Kubernetes` 两种模式。 +目前 Apache StreamPark 对 Flink 的任务发布,同时支持 `Flink on YARN` 和 `Flink on Kubernetes` 两种模式。 ### Hadoop 使用 `Flink on YARN`,需要部署的集群安装并配置 Hadoop的相关环境变量,如你是基于 CDH 安装的 hadoop 环境, @@ -34,7 +34,7 @@ export HADOOP_YARN_HOME=$HADOOP_HOME/../hadoop-yarn ### Kubernetes -使用 `Flink on Kubernetes`,需要额外部署/或使用已经存在的 Kubernetes 集群,请参考条目: [**StreamPark Flink-K8s 集成支持**](../flink-k8s/1-deployment.md)。 +使用 `Flink on Kubernetes`,需要额外部署/或使用已经存在的 Kubernetes 集群,请参考条目: [**Apache StreamPark Flink-K8s 集成支持**](../flink-k8s/1-deployment.md)。 ## 安装 @@ -100,7 +100,7 @@ streampark-console-service-1.2.1 ```yaml spring: profiles.active: mysql #[h2,pgsql,mysql] - application.name: StreamPark + application.name: Apache StreamPark devtools.restart.enabled: false mvc.pathmatch.matching-strategy: ant_path_matcher servlet: @@ -159,7 +159,7 @@ bash startup.sh 经过以上步骤,即可部署完成,可以直接登录系统 -![StreamPark Login](/doc/image/streampark_login.jpeg) +![Apache StreamPark Login](/doc/image/streampark_login.jpeg) :::tip 提示 默认密码: admin / streampark @@ -167,9 +167,9 @@ bash startup.sh ## 系统配置 -进入系统之后,第一件要做的事情就是修改系统配置,在菜单/StreamPark/Setting 下,操作界面如下: +进入系统之后,第一件要做的事情就是修改系统配置,在菜单/Apache StreamPark/Setting 下,操作界面如下: -![StreamPark Settings](/doc/image/streampark_settings_2.0.0.png) +![Apache StreamPark Settings](/doc/image/streampark_settings_2.0.0.png) 主要配置项分为以下几类 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/11-platformInstall.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/11-platformInstall.md index 6579f95c7..50f8d38e7 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/11-platformInstall.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/11-platformInstall.md @@ -15,13 +15,13 @@ ## 软件要求 Notes: -1. **单纯安装StreamPark,可忽略hadoop** +1. **单纯安装Apache StreamPark,可忽略hadoop** 2. 
若采用 yarn application 模式 执行flink作业,需要hadoop > - JDK : 1.8+ > - MySQL : 5.6+ > - Flink : 1.12.0+ > - Hadoop : 2.7.0+ -> - StreamPark : 2.0.0+ +> - Apache StreamPark : 2.0.0+ 本文档采用的软件版本信息 > - **JDK:1.8.0_181** @@ -75,7 +75,7 @@ flink -v cp mysql-connector-java-8.0.28.jar /usr/local/streampark/lib ``` ![4_mysql_dep](/doc/image/install/4_mysql_dep.png) -## 下载StreamPark +## 下载Apache StreamPark > 下载URL:[https://dlcdn.apache.org/incubator/streampark/2.0.0/apache-streampark_2.12-2.0.0-incubating-bin.tar.gz](https://dlcdn.apache.org/incubator/streampark/2.0.0/apache-streampark_2.12-2.0.0-incubating-bin.tar.gz) > 上传 [apache-streampark_2.12-2.0.0-incubating-bin.tar.gz](https://dlcdn.apache.org/incubator/streampark/2.0.0/apache-streampark_2.12-2.0.0-incubating-bin.tar.gz) 至 服务器 /usr/local 路径 @@ -90,11 +90,11 @@ tar -zxvf apache-streampark_2.12-2.0.0-incubating-bin.tar.gz # 安装 ## 初始化系统数据 -> **目的:创建StreamPark组件部署依赖的数据库(表),同时将其运行需要的数据提前初始化(比如:web页面的菜单、用户等信息),便于后续操作。** +> **目的:创建Apache StreamPark组件部署依赖的数据库(表),同时将其运行需要的数据提前初始化(比如:web页面的菜单、用户等信息),便于后续操作。** ### 查看执行SteamPark元数据SQL文件 > 说明: -> - StreamPark支持MySQL、PostgreSQL、H2 +> - Apache StreamPark支持MySQL、PostgreSQL、H2 > - 本次以MySQL为例,PostgreSQL流程基本一致 > 数据库创建脚本: /usr/local/apache-streampark_2.12-2.0.0-incubating-bin/script/schema/mysql-schema.sql @@ -133,7 +133,7 @@ show tables; ``` ![13_show_streampark_db_tables](/doc/image/install/13_show_streampark_db_tables.png) -## StreamPark配置 +## Apache StreamPark配置 > 目的:配置启动需要的数据源。 > 配置文件所在路径:/usr/local/streampark/conf @@ -180,7 +180,7 @@ vim application.yml ![18_application_yml_ldap](/doc/image/install/18_application_yml_ldap.png) ### 【可选】配置kerberos -> 背景:企业级hadoop集群环境都有设置安全访问机制,比如kerberos。StreamPark也可配置kerberos,使得flink可通过kerberos认证,向hadoop集群提交作业。 +> 背景:企业级hadoop集群环境都有设置安全访问机制,比如kerberos。Apache StreamPark也可配置kerberos,使得flink可通过kerberos认证,向hadoop集群提交作业。 > **修改项如下:** > 1. **security.kerberos.login.enable=true** @@ -190,13 +190,13 @@ vim application.yml > 5. 
**java.security.krb5.conf=/etc/krb5.conf** ![19_kerberos_yml_config](/doc/image/install/19_kerberos_yml_config.png) -## 启动StreamPark -## 进入服务器StreamPark安装路径 +## 启动Apache StreamPark +## 进入服务器Apache StreamPark安装路径 ```bash cd /usr/local/streampark/ ``` ![20_enter_streampark_dir](/doc/image/install/20_enter_streampark_dir.png) -## 启动StreamPark服务 +## 启动Apache StreamPark服务 ```bash ./bin/startup.sh ``` diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/12-platformBasicUsage.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/12-platformBasicUsage.md index 40afc17f4..ed829f9aa 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/12-platformBasicUsage.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/12-platformBasicUsage.md @@ -1,5 +1,5 @@ # 快速上手 -> 说明:该部分旨在通过简单的操作步骤,体验使用StreamPark平台提交flink作业的便捷流程。 +> 说明:该部分旨在通过简单的操作步骤,体验使用Apache StreamPark平台提交flink作业的便捷流程。 ## 配置FLINK_HOME ![1_config_flink_home](/doc/image/platform-usage/1_config_flink_home.png) @@ -9,7 +9,7 @@ ![3_display_flink_home_config](/doc/image/platform-usage/3_display_flink_home_config.png) ## 配置Flink Cluster -> 根据flink 部署模式 以及 资源管理方式,StreamPark 支持以下6种作业模式 +> 根据flink 部署模式 以及 资源管理方式,Apache StreamPark 支持以下6种作业模式 > - **Standalone Session** > - **Yarn Session** > - **Yarn Per-job** @@ -79,8 +79,8 @@ start-cluster.sh ![22_submit_flink_job_2](/doc/image/platform-usage/22_submit_flink_job_2.png) ## 查看作业状态 -### 通过StreamPark看板查看 -> StreamPark dashboard +### 通过Apache StreamPark看板查看 +> Apache StreamPark dashboard ![23_flink_job_dashboard](/doc/image/platform-usage/23_flink_job_dashboard.png) @@ -95,22 +95,22 @@ start-cluster.sh ![27_display_native_flink_job_web_ui_2](/doc/image/platform-usage/27_display_native_flink_job_web_ui_2.png) -> 至此,一个使用StreamPark平台提交flink job的流程基本完成。下面简单总结下StreamPark平台管理flink作业的大致流程。 +> 至此,一个使用Apache StreamPark平台提交flink job的流程基本完成。下面简单总结下Apache StreamPark平台管理flink作业的大致流程。 -## StreamPark平台管理flink job的流程 +## Apache StreamPark平台管理flink job的流程 ![28_streampark_process_workflow](/doc/image/platform-usage/28_streampark_process_workflow.png) -> 通过 StreamPark 平台 停止、修改、删除 flink job 相对简单,大家可自行体验,需要说明的一点是:**若作业为running状态,则不可删除,需先停止**。 +> 通过 Apache StreamPark 平台 停止、修改、删除 flink job 相对简单,大家可自行体验,需要说明的一点是:**若作业为running状态,则不可删除,需先停止**。 -# StreamPark系统模块简介 +# Apache StreamPark系统模块简介 ## 系统设置 > 菜单位置 ![29_streampark_system_menu](/doc/image/platform-usage/29_streampark_system_menu.png) ### User Management -> 用于管理StreamPark平台用户 +> 用于管理Apache StreamPark平台用户 ![30_streampark_user_management_menu](/doc/image/platform-usage/30_streampark_user_management_menu.png) ### Token Management @@ -150,9 +150,9 @@ curl -X POST '/flink/app/cancel' \ ![36_streampark_menu_management](/doc/image/platform-usage/36_streampark_menu_management.png) -## StreamPark菜单模块 +## Apache StreamPark菜单模块 ### Project -> StreamPark结合代码仓库实现CICD +> Apache StreamPark结合代码仓库实现CICD ![37_streampark_project_menu](/doc/image/platform-usage/37_streampark_project_menu.png) > 使用时,点击 “+ Add new ”,配置repo信息,保存。 @@ -212,8 +212,8 @@ curl -X POST '/flink/app/cancel' \ ![54_visit_flink_cluster_web_ui](/doc/image/platform-usage/54_visit_flink_cluster_web_ui.png) -# 原生flink 与 StreamPark关联使用 -> 【**待完善**】其实,个人理解,StreamPark一大特点是对flink原生作业的管理模式在用户使用层面进行了优化,使得用户能利用该平台快速开发、部署、运行、监控flink作业。所以,想表达的意思是:如果用户对原生flink比较熟悉的话,那StreamPark使用起来就会更加得心应手。 +# 原生flink 与 Apache StreamPark关联使用 +> 【**待完善**】其实,个人理解,Apache StreamPark一大特点是对flink原生作业的管理模式在用户使用层面进行了优化,使得用户能利用该平台快速开发、部署、运行、监控flink作业。所以,想表达的意思是:如果用户对原生flink比较熟悉的话,那Apache 
StreamPark使用起来就会更加得心应手。 ## flink部署模式 > 下面内容摘自 **张利兵 老师 极客时间专栏** 《[Flink核心技术与实战](https://time.geekbang.org/course/intro/100058801)》 @@ -231,7 +231,7 @@ curl -X POST '/flink/app/cancel' \ ![60_flink_deployment_difference_6](/doc/image/platform-usage/60_flink_deployment_difference_6.png) -### 如何在StreamPark中使用 +### 如何在Apache StreamPark中使用 > **Session 模式** 1. 配置 Flink Cluster @@ -262,7 +262,7 @@ flink run-application -t yarn-application \ -Dyarn.provided.lib.dirs="hdfs://myhdfs/my-remote-flink-dist-dir" \ hdfs://myhdfs/jars/my-application.jar ``` -### 如何在StreamPark中使用 +### 如何在Apache StreamPark中使用 > 创建 或 修改 作业时,在“Dynamic Properties”里面按指定格式添加即可 ![67_dynamic_params_usage](/doc/image/platform-usage/67_dynamic_params_usage.png) @@ -275,7 +275,7 @@ flink run-application -t yarn-application \ ![68_native_flink_restart_strategy](/doc/image/platform-usage/68_native_flink_restart_strategy.png) -### 如何在StreamPark中使用 +### 如何在Apache StreamPark中使用 > 【**待完善**】一般在作业失败或出现异常时,会触发告警 1. 配置告警通知 @@ -297,7 +297,7 @@ flink run-application -t yarn-application \ ![72_native_flink_save_checkpoint_gramma](/doc/image/platform-usage/72_native_flink_save_checkpoint_gramma.png) -### 如何在StreamPark中配置savepoint +### 如何在Apache StreamPark中配置savepoint > 当停止作业时,可以让用户设置savepoint ![73_streampark_save_checkpoint](/doc/image/platform-usage/73_streampark_save_checkpoint.png) @@ -312,7 +312,7 @@ flink run-application -t yarn-application \ ![77_show_checkpoint_file_name_2](/doc/image/platform-usage/77_show_checkpoint_file_name_2.png) -### 如何在StreamPark中由指定savepoint恢复作业 +### 如何在Apache StreamPark中由指定savepoint恢复作业 > 启动作业时,会让选择 ![78_usage_checkpoint_in_streampark](/doc/image/platform-usage/78_usage_checkpoint_in_streampark.png) @@ -325,7 +325,7 @@ flink run-application -t yarn-application \ ![79_native_flink_job_status](/doc/image/platform-usage/79_native_flink_job_status.svg) -### StreamPark中的作业状态 +### Apache StreamPark中的作业状态 > 【**待完善**】 @@ -335,10 +335,10 @@ flink run-application -t yarn-application \ ![80_native_flink_job_details_page](/doc/image/platform-usage/80_native_flink_job_details_page.png) -### StreamPark中作业详情 +### Apache StreamPark中作业详情 ![81_streampark_flink_job_details_page](/doc/image/platform-usage/81_streampark_flink_job_details_page.png) -> 同时在k8s模式下的作业,StreamPark还支持启动日志实时展示,如下 +> 同时在k8s模式下的作业,Apache StreamPark还支持启动日志实时展示,如下 ![82_streampark_flink_job_starting_log_info](/doc/image/platform-usage/82_streampark_flink_job_starting_log_info.png) @@ -347,7 +347,7 @@ flink run-application -t yarn-application \ > 原生flink提供了 rest api > 参考:[https://nightlies.apache.org/flink/flink-docs-release-1.14/zh/docs/ops/rest_api/](https://nightlies.apache.org/flink/flink-docs-release-1.14/zh/docs/ops/rest_api/) -### StreamPark如何与第三方系统集成 +### Apache StreamPark如何与第三方系统集成 > 也提供了Restful Api,支持与其他系统对接, > 比如:开启作业 启动|停止 restapi 接口 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/2-quickstart.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/2-quickstart.md index 5efb2d32a..ace1a1061 100755 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/2-quickstart.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/2-quickstart.md @@ -8,7 +8,7 @@ sidebar_position: 2 在上个章节已经详细介绍了一站式平台 `streampark-console` 的安装, 本章节看看如果用 `streampark-console` 快速部署运行一个作业, `streampark-console` 对标准的 Flink 程序 ( 按照 Flink 官方要求的结构和规范 ) 和用 `streampark` 开发的项目都做了很好的支持,下面我们使用 `streampark-quickstart` 来快速开启 `streampark-console` 之旅 -`streampark-quickstart` 是 StreamPark 开发 Flink 的上手示例程序,具体请查阅: +`streampark-quickstart` 是 
Apache StreamPark 开发 Flink 的上手示例程序,具体请查阅: - Github: [https://github.com/apache/incubator-streampark-quickstart.git](https://github.com/apache/incubator-streampark-quickstart) diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/3-development.md index 8dffa0399..7148cebc5 100755 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/3-development.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/3-development.md @@ -44,7 +44,7 @@ tar -zxvf apache-streampark-2.2.0-incubating-bin.tar.gz ### 启动后台服务 找到 `streampark-console/streampark-console-service/src/main/java/org/apache/streampark/console/StreamParkConsoleBootstrap.java` 修改启动配置 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/4-dockerDeployment.md index 193aa624e..901a3cbb3 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/4-dockerDeployment.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/4-dockerDeployment.md @@ -4,7 +4,7 @@ title: 'Docker 部署' sidebar_position: 4 --- -本教程使用 Docker 完成 StreamPark 的部署。 +本教程使用 Docker 完成 Apache StreamPark 的部署。 ## 前置条件 @@ -17,9 +17,9 @@ sidebar_position: 4 ### 2. 安装 docker-compose 使用 docker-compose 启动服务,需要先安装 [docker-compose](https://docs.docker.com/compose/install/) -## 部署 StreamPark +## 部署 Apache StreamPark -### 1. 基于 h2 和 docker-compose 部署 StreamPark +### 1. 基于 h2 和 docker-compose 部署 Apache StreamPark 该方式适用于入门学习、熟悉功能特性,容器重启后配置会失效,下方可以配置Mysql、Pgsql进行持久化 @@ -30,7 +30,7 @@ wget https://raw.githubusercontent.com/apache/incubator-streampark/dev/deploy/do wget https://raw.githubusercontent.com/apache/incubator-streampark/dev/deploy/docker/.env docker-compose up -d ``` -服务启动后,可以通过 http://localhost:10000 访问 StreamPark,同时也可以通过 http://localhost:8081访问Flink。访问StreamPark链接后会跳转到登陆页面,StreamPark 默认的用户和密码分别为 admin 和 streampark。想要了解更多操作请参考用户手册快速上手。 +服务启动后,可以通过 http://localhost:10000 访问 Apache StreamPark,同时也可以通过 http://localhost:8081访问Flink。访问Apache StreamPark链接后会跳转到登录页面,Apache StreamPark 默认的用户和密码分别为 admin 和 streampark。想要了解更多操作请参考用户手册快速上手。 ![](/doc/image/streampark_docker-compose.png) 该部署方式会自动给你启动一个flink-session集群供你去进行flink任务使用,同时也会挂载本地docker服务以及~/.kube来用于k8s模式的任务提交 @@ -51,7 +51,7 @@ docker-compose up -d #### 使用已有的 Mysql 服务 -该方式适用于企业生产,你可以基于 docker 快速部署 StreamPark 并将其和线上数据库进行关联 +该方式适用于企业生产,你可以基于 docker 快速部署 Apache StreamPark 并将其和线上数据库进行关联(.env 配置示例见下文) 注意:部署支持的多样性是通过.env这个配置文件来进行维护的,要保证目录下有且仅有一个.env文件 ```shell @@ -93,7 +93,7 @@ SPRING_DATASOURCE_PASSWORD=streampark docker-compose up -d ``` -## 基于源码构建镜像进行StreamPark部署 +## 基于源码构建镜像进行Apache StreamPark部署 ```shell git clone https://github.com/apache/incubator-streampark.git diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/6-Team.md index 8c8b8f032..b44f3135c 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/6-Team.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/6-Team.md @@ -8,12 +8,12 @@ sidebar_position: 6 ADMIN 创建或修改用户时可以指定用户类型,用户类型有 ADMIN 和 USER 两种。 -- ADMIN 表示系统管理员,即:StreamPark 的超级管理员,有 StreamPark 管理页面以及各个团队的所有权限。 +- ADMIN 表示系统管理员,即:Apache StreamPark 的超级管理员,有 Apache StreamPark 管理页面以及各个团队的所有权限。 - USER 表示平台的普通用户。创建 USER 
只是创建账号的过程,默认普通用户在平台没有任何权限。创建 USER 后且系统管理员给 USER 在一些团队绑定角色后,USER 才会在相应团队有权限。 ## 团队管理 -为了方便管理公司内不同部门的作业,StreamPark 支持了团队管理。系统管理员可以在 StreamPark 上为不同部门创建不同的团队。 +为了方便管理公司内不同部门的作业,Apache StreamPark 支持了团队管理。系统管理员可以在 Apache StreamPark 上为不同部门创建不同的团队。
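The existing-MySQL Docker deployment referenced above is driven entirely by the `.env` file. The sketch below is a minimal, hypothetical illustration of that step: only `SPRING_DATASOURCE_PASSWORD` and `docker-compose up -d` appear in the documentation itself, while the `SPRING_DATASOURCE_URL` / `SPRING_DATASOURCE_USERNAME` keys and the example JDBC URL are assumptions that should be checked against the `.env` template shipped with the release.

```shell
# Minimal sketch (key names other than SPRING_DATASOURCE_PASSWORD are assumptions):
# point the containerized console at an existing MySQL instance via a .env file,
# then start the services with docker-compose.
cat > .env <<'EOF'
SPRING_DATASOURCE_URL=jdbc:mysql://192.168.1.10:3306/streampark?useSSL=false&characterEncoding=UTF-8
SPRING_DATASOURCE_USERNAME=streampark
SPRING_DATASOURCE_PASSWORD=streampark
EOF

# Keep exactly one .env file in the working directory before starting.
docker-compose up -d
```

The target database is assumed to have already been initialized with the schema SQL described in the installation steps earlier in this document.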

@@ -27,9 +27,9 @@ ADMIN 创建或修改用户时可以指定用户类型,用户类型有 ADMIN ## 角色管理 -为了便于管理作业以及防止误操作,团队内部也需要区分管理员和普通开发者,所以 StreamPark 引入了角色管理。 +为了便于管理作业以及防止误操作,团队内部也需要区分管理员和普通开发者,所以 Apache StreamPark 引入了角色管理。 -当前,StreamPark 支持两者角色,分别是:team admin 和 developer。 team admin 拥有团队内的所有权限,developer 相比 team admin 而言,少了删除作业、添加 USER 到团队等权限。 +当前,Apache StreamPark 支持两种角色,分别是:team admin 和 developer。 team admin 拥有团队内的所有权限,developer 相比 team admin 而言,少了删除作业、添加 USER 到团队等权限。

diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/7-Variable.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/7-Variable.md index fa2e1fa77..67c3166e6 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/7-Variable.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/7-Variable.md @@ -8,7 +8,7 @@ sidebar_position: 7 在实际生产环境中,Flink作业一般很复杂,会依赖多个外部组件,例如,从Kafka中消费数据时要从HBase或Redis中去获取相关数据,然后将关联好的数据写入到外部组件,这样的情况下会导致如下问题: -- Flink作业想要关联这些组件,需要将外部组件的连接信息传递给Flink作业,这些连接信息是跟随StreamPark的Application一起配置的,一旦一些组件的连接信息有变化,依赖这些组件的Application都要修改,这会导致大量的操作且成本很高。 +- Flink作业想要关联这些组件,需要将外部组件的连接信息传递给Flink作业,这些连接信息是跟随Apache StreamPark的Application一起配置的,一旦一些组件的连接信息有变化,依赖这些组件的Application都要修改,这会导致大量的操作且成本很高。 - 团队中一般有很多开发人员,如果对组件连接信息没有统一的传递规范,会导致相同的组件有不同的参数名,这样难以统计外部组件到底被哪些作业依赖。 - 在企业生产中,通常有多个环境,比如有测试环境、生产环境等,很多时候无法通过IP和端口来判断属于哪个环境,这样的话本来属于生产环境的IP和端口可能配置到了测试环境,导致生产故障。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/8-YarnQueueManagement.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/8-YarnQueueManagement.md index 7079f83ad..5af6d6c8a 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/8-YarnQueueManagement.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/user-guide/8-YarnQueueManagement.md @@ -20,7 +20,7 @@ sidebar_position: 8 如果由于输入错误而将任务提交到错误的队列中, 可能会影响队列上 Yarn 应用程序的稳定性,并滥用队列资源。 -因此,StreamPark 引入了队列管理功能,以确保一组添加的队列在同一团队内共享, +因此,Apache StreamPark 引入了队列管理功能,以确保一组添加的队列在同一团队内共享, 也就是确保队列资源在团队范围内是隔离的。它可以产生以下好处: - 当部署 Flink `yarn-application` 应用程序或 Flink `yarn-session` 集群时, 它可以快速准确地设置 Yarn 队列(`yarn.application.queue`)和标签(`yarn.application.node-label`)。 @@ -81,11 +81,11 @@ sidebar_position: 8

- 会话集群被所有团队共享。为什么创建 `yarn-session` Flink 集群时,只能使用当前团队中的队列而非所有团队中的队列作为候选队列列表? -> 基于上述所提到的,StreamPark 希望在创建 `yarn-session` Flink 集群时,管理员只能指定当前团队所属的队列,这有助于管理员更好地感知当前操作对当前团队的影响。 +> 基于上述所提到的,Apache StreamPark 希望在创建 `yarn-session` Flink 集群时,管理员只能指定当前团队所属的队列,这有助于管理员更好地感知当前操作对当前团队的影响。 - 为什么不支持将 `flink yarn-session clusters / general clusters` 在团队范围内进行隔离? - 集群可见性的变化带来的影响范围比队列可见性的变化范围更大。 - - StreamPark 需要面对更大的向后兼容性难题,同时还需要考虑用户体验。 + - Apache StreamPark 需要面对更大的向后兼容性难题,同时还需要考虑用户体验。 - 目前,社区对使用 `yarn-application` 和 `yarn-session` 集群模式部署的用户群体和应用规模没有确切的研究。 基于这一事实,社区没有提供更大的功能支持。 diff --git a/src/pages/download/release-note/2.0.0.md b/src/pages/download/release-note/2.0.0.md index 1bacb2452..f6e93d5b8 100644 --- a/src/pages/download/release-note/2.0.0.md +++ b/src/pages/download/release-note/2.0.0.md @@ -3,7 +3,7 @@
-Apache StreamPark (incubating) 2.0.0 is the first Apache version since StreamPark joined the ASF incubator. In this version, many new features have been added, and many bugs have been fixed. It is important for the stability and functionality of the entire product this +Apache StreamPark (incubating) 2.0.0 is the first Apache release since Apache StreamPark joined the ASF incubator. In this version, many new features have been added and many bugs have been fixed, which is a very big improvement to the stability and functionality of the entire product, as follows:
@@ -87,7 +87,7 @@ A very big improvement, as follows: - Fix the wrong duration (#1585) - Flink auto recover taskmanager failed #1175 (#1178) - Flink cluster cannot modify the flink version #1992 (#1993) -- StreamPark cannot search for failed status tasks (#1917) +- Apache StreamPark cannot search for failed status tasks (#1917) - Taskmanager memory usage statistics calculation error on yarn mode #1061 (#1065) - The configured value cannot be cleared in the System Setting module #2084 (#2085) - The copy function of the application does not copy the args information (#2083)
-功能 Zeppelin StreamPark
+功能 Zeppelin Apache StreamPark