
TODO: Please see this task board in GitHub.

Streaming data-ops

Sample repo for understanding Spark Structured Streaming data-ops with Databricks and Azure IoT tools

Motivation (problem statement)

After reviewing some public documents, I realized there are few that describe the following:

  • How to write unit tests for streaming data in a local Spark environment.
  • How to automate a CI/CD pipeline with Databricks.

To help developers keep their code quality high through testing and pipelines, I want to share how to achieve this.

Architecture

[Architecture diagram]

How to run the app locally

  1. If you are new to developing inside a container, please read this document and set up your environment by referring to Getting Started.
  2. Clone the repository and open it inside the container, following this document.
  3. Set environment variables with your Event Hub (IoT Hub) information:
export EVENTHUB_CONNECTION_STRING="{Your event hub connection string}"
export EVENTHUB_CONSUMER_GROUP="{Your consumer group name}"
  4. Run pyspark --packages com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.16 < stream_app.py in the Visual Studio Code terminal to execute structured streaming. It shows telemetry in the console; a minimal sketch of such an app follows.
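For reference, here is a minimal sketch of such a streaming app, assuming the azure-eventhubs-spark 2.3.x connector API (the actual stream_app.py in this repo may differ). Note that recent versions of the connector expect the connection string to be encrypted via EventHubsUtils.

```python
import os

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("streaming-dataops").getOrCreate()

# The connector expects an encrypted connection string (2.3.15+).
connection_string = os.environ["EVENTHUB_CONNECTION_STRING"]
eh_conf = {
    "eventhubs.connectionString": (
        spark.sparkContext._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(
            connection_string
        )
    ),
    "eventhubs.consumerGroup": os.environ["EVENTHUB_CONSUMER_GROUP"],
}

# Read from the IoT Hub's Event Hub-compatible endpoint as a stream;
# the telemetry payload arrives in the binary `body` column.
telemetry = (
    spark.readStream.format("eventhubs")
    .options(**eh_conf)
    .load()
    .withColumn("body", col("body").cast("string"))
)

# Show each micro-batch in the console, as described in step 4.
query = telemetry.writeStream.outputMode("append").format("console").start()
query.awaitTermination()
```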

Environment variable example

| Name | Example | IoT Hub Built-in endpoints name |
| ---- | ------- | ------------------------------- |
| EVENTHUB_CONNECTION_STRING | Endpoint=sb://xxx.servicebus.windows.net/;SharedAccessKeyName=xxxxx;SharedAccessKey=xxx;EntityPath=xxxx | Event Hub-compatible endpoint |
| EVENTHUB_CONSUMER_GROUP | The consumer group name you created. Default is $Default | Consumer Groups |
  • xxx is used to mock secrets
  • Please refer to the 3rd column to pick the connection setting from Azure IoT Hub's built-in endpoints

How to run tests locally

In this repo, we use pytest for unit testing. To run the unit tests, run pytest in the root folder. You'll see the test results in the terminal.
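A common setup for this is a session-scoped fixture that creates a local SparkSession, so the tests run without a cluster. A sketch, assuming a conftest.py (the fixture name is illustrative, not necessarily what this repo uses):

```python
# conftest.py (illustrative)
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # Local SparkSession shared by all tests in the session.
    session = (
        SparkSession.builder.master("local[2]")
        .appName("streaming-dataops-tests")
        .getOrCreate()
    )
    yield session
    session.stop()
```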

Set up and run a CI/CD pipeline with Azure Databricks and Azure DevOps

Please see this document.

Utilize the pyot library from a Databricks notebook

In the notebook, the app fetches secrets from Azure Key Vault, so you need to set it up first.

  1. Save the value of EVENTHUB_CONNECTION_STRING (the Event Hub-compatible endpoint in IoT Hub) as the iot-connection-string secret in Azure Key Vault. Please refer to this document.
  2. Set up an Azure Key Vault-backed secret scope in your Azure Databricks workspace, using key-vault-secrets as the scope name. Please refer to this document.
  3. Import ProcessStreaming.py under the notebooks folder into your Databricks workspace and run it on a cluster. The sketch below shows how the notebook can read the secret.
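Inside the notebook, the secret can then be read through the Key Vault-backed scope. A minimal sketch, using the scope and secret names from the steps above (dbutils is available in Databricks notebooks):

```python
# Fetch the IoT Hub connection string from the Key Vault-backed scope.
connection_string = dbutils.secrets.get(
    scope="key-vault-secrets", key="iot-connection-string"
)
```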

Reference

DataOps strategy

To understand the concepts behind this repo, please read the following repo and blog.

Spark structured streaming and Azure Event Hubs

Unit testing with Spark Structured Streaming

We have two options for testing streaming data: reading a stored file as a stream, or using MemoryStream. Because I can easily generate a JSON file from real streaming data, I chose the first option, sketched below.
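As an illustration of the first option, a test can point readStream at a folder of stored JSON telemetry, drain it into a memory sink, and assert on the result. A sketch with a hypothetical path and schema, reusing the spark fixture sketched earlier:

```python
from pyspark.sql.types import StringType, StructField, StructType


def test_reads_stored_telemetry_as_stream(spark):
    # Streaming file sources require an explicit schema.
    schema = StructType([StructField("deviceId", StringType())])

    stream = spark.readStream.schema(schema).json("tests/data/telemetry/")

    # Drain the file source into an in-memory table we can query.
    query = (
        stream.writeStream.format("memory")
        .queryName("telemetry")
        .outputMode("append")
        .start()
    )
    query.processAllAvailable()
    query.stop()

    assert spark.sql("SELECT * FROM telemetry").count() > 0
```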

Setup development environment
