TODO: Please see this task board on GitHub.
Sample repo for understanding Spark Structured Streaming DataOps with Databricks and Azure IoT tools
- You can use the device simulator in my GitHub. It sends the expected telemetry to IoT Hub and Spark.
- This repo uses the Azure Event Hubs Connector for Apache Spark instead of the Kafka connector, because it supports Azure IoT Hub message properties (see the sketch after the setup steps below). See this blog for more detail.
After looking through the public documentation, I realized there are few documents that describe the following:
- How to write unit tests for streaming data in a local Spark environment.
- How to automate a CI/CD pipeline with Databricks.
To help developers keep their code quality high through testing and pipelines, I want to share how to achieve this.
- If you are new to developing inside a container, please read this document and set up your environment by referring to Getting started.
- Clone and open the repository inside the container by following this document.
- Set the environment variables with your Event Hub (IoT Hub) information:

  ```sh
  export EVENTHUB_CONNECTION_STRING="{Your event hub connection string}"
  export EVENTHUB_CONSUMER_GROUP="{Your consumer group name}"
  ```
- Run the following in the Visual Studio Code terminal to start the structured streaming job. It prints the telemetry to the console.

  ```sh
  pyspark --packages com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.16 < stream_app.py
  ```
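For reference, here is a minimal sketch of what such a script could look like. It is not the actual `stream_app.py` from this repo; it assumes the two environment variables above, and it uses the connector's `EventHubsUtils.encrypt` helper, which the connector requires from version 2.3.15 onward:

```python
import os

from pyspark.sql import SparkSession

# In the pyspark shell a SparkSession already exists; building one here
# keeps the script usable with spark-submit as well.
spark = SparkSession.builder.appName("stream_app").getOrCreate()

connection_string = os.environ["EVENTHUB_CONNECTION_STRING"]

# Connector 2.3.15+ expects an encrypted connection string.
encrypted = (spark.sparkContext._jvm
             .org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection_string))

eh_conf = {
    "eventhubs.connectionString": encrypted,
    "eventhubs.consumerGroup": os.environ["EVENTHUB_CONSUMER_GROUP"],
}

df = spark.readStream.format("eventhubs").options(**eh_conf).load()

# `body` arrives as binary; `properties` is where the connector surfaces
# the IoT Hub message properties mentioned above.
query = (df.selectExpr("CAST(body AS STRING) AS body", "properties")
           .writeStream
           .format("console")
           .option("truncate", "false")
           .start())

query.awaitTermination()
```

The `properties` column is the reason this repo prefers the Event Hubs connector over the Kafka connector, which does not surface it.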
Name | Example | IoT Hub built-in endpoints name |
---|---|---|
EVENTHUB_CONNECTION_STRING | `Endpoint=sb://xxx.servicebus.windows.net/;SharedAccessKeyName=xxxxx;SharedAccessKey=xxx;EntityPath=xxxx` | Event Hub-compatible endpoint |
EVENTHUB_CONSUMER_GROUP | Consumer group name which you created. The default is `$Default` | Consumer Groups |
- Uses `xxx` for mocking secrets.
- Please refer to the 3rd column above to pick up the connection settings from Azure IoT Hub's built-in endpoints.
In this repo, we use pytest for unit testing. To run the unit tests, run `pytest` in the root folder. You'll see the following output in the terminal.
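As a rough sketch, a test module could look like the following. The fixture spins up a small local SparkSession, so no cluster is needed; the test body and field names are hypothetical placeholders, not the actual tests in this repo:

```python
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # A small local SparkSession is enough for unit tests.
    session = (SparkSession.builder
               .master("local[2]")
               .appName("unit-tests")
               .getOrCreate())
    yield session
    session.stop()


def test_telemetry_fields(spark):
    # Hypothetical telemetry shape; replace with the simulator's real schema.
    df = spark.createDataFrame([("dev-1", 21.5)], ["deviceId", "temperature"])
    row = df.collect()[0]
    assert row["deviceId"] == "dev-1"
    assert row["temperature"] == pytest.approx(21.5)
```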
Please see this document.
In the notebook, the app fetches secrets from Azure Key Vault, so you need to set it up first.
- Save your `EVENTHUB_CONNECTION_STRING` value (the Event Hub-compatible endpoint in IoT Hub) as the `iot-connection-string` secret in Azure Key Vault. Please refer to this document.
- Set up an Azure Key Vault-backed scope in your Azure Databricks workspace, using `key-vault-secrets` as the scope name. Please refer to this document.
- Import `ProcessStreaming.py` under the `notebooks` folder into your Databricks workspace and run it on a cluster.
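Inside the notebook, reading the secret through the Key Vault-backed scope boils down to a `dbutils.secrets.get` call. A sketch of the relevant part (see `ProcessStreaming.py` for the real code):

```python
# Fetch the IoT Hub connection string from the Key Vault-backed scope
# configured above. `dbutils` and `sc` are predefined on Databricks clusters.
connection_string = dbutils.secrets.get(
    scope="key-vault-secrets", key="iot-connection-string")

eh_conf = {
    # Connector 2.3.15+ expects an encrypted connection string.
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection_string),
}

df = spark.readStream.format("eventhubs").options(**eh_conf).load()
```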
To understand the concepts behind this repo, please read the following repos, blogs, and docs:
- DataOps for the Modern Data Warehouse
- Spark Streaming part 3: DevOps, tools and tests for Spark applications
- Connect your Apache Spark application with Azure Event Hubs
- Structured Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher)
- Structured Streaming Programming Guide
We have two options for testing streaming data: reading a stored file as a stream, or using MemoryStream. Because I can easily generate a JSON file from real streaming data, I chose the first option (see the sketch after the list below).
- Please see the Streaming Our Data section in this document. It describes how to emulate a stream.
- Unit Testing Apache Spark Structured Streaming using MemoryStream
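A minimal sketch of the first option, emulating a stream from stored JSON files: the folder path and field names here are hypothetical, and the memory sink is used so the output can be asserted on with plain SQL:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = (SparkSession.builder
         .master("local[2]")
         .appName("stream-test")
         .getOrCreate())

# File-based streaming sources require an explicit schema.
schema = StructType([
    StructField("deviceId", StringType()),
    StructField("temperature", DoubleType()),
])

# Every JSON file in this folder is replayed as the stream;
# maxFilesPerTrigger=1 turns each file into its own micro-batch.
stream = (spark.readStream
          .schema(schema)
          .option("maxFilesPerTrigger", 1)
          .json("tests/data/telemetry"))

# The memory sink collects the output into an in-memory table
# that can be queried with SQL inside the test.
query = (stream.writeStream
         .format("memory")
         .queryName("telemetry")
         .outputMode("append")
         .start())

query.processAllAvailable()  # block until all available files are processed
rows = spark.sql("SELECT * FROM telemetry").collect()
query.stop()
```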