
DataOps - Parking Sensor (Synapse)

This sample demonstrates how DevOps principles can be applied to an end-to-end Data Pipeline Solution built according to the Modern Data Warehouse (MDW) pattern, implemented in Azure Synapse.

Contents

  • Solution Overview
  • Key Learnings
  • Key Concepts
  • How to use the sample
  • Known Issues, Limitations and Workarounds

Solution Overview

The solution pulls near real-time Melbourne Parking Sensor data from a publicly available REST API endpoint and saves it to Azure Data Lake Gen2. It then validates, cleanses, and transforms the data to a known schema using Azure Synapse Spark Pools. A second Spark job transforms the data into a Star Schema, which is then loaded into Azure Synapse SQL Dedicated Pool (formerly SQLDW) using PolyBase. The entire pipeline is orchestrated with Azure Synapse Data Pipelines.
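
The following is a minimal Python sketch of the ingestion step only: pull the latest readings from the parking sensor REST API and land the raw response, untouched, in the landing (Bronze) zone of the data lake. The endpoint URL and storage account are placeholders, and in the actual sample this step is performed by the orchestrated Synapse Data Pipeline rather than standalone code.

    import json
    from datetime import datetime, timezone

    import requests
    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    # Placeholder endpoint and storage account -- not the exact values used by the sample
    SENSOR_API_URL = "https://<parking-sensor-api>/parking-bay-sensors.json"
    ACCOUNT_URL = "https://<storage-account>.dfs.core.windows.net"

    # Pull the latest near real-time sensor readings from the public REST API
    payload = requests.get(SENSOR_API_URL, timeout=30).json()

    # Land the raw response as-is in the Bronze (landing) area of the data lake
    service = DataLakeServiceClient(account_url=ACCOUNT_URL, credential=DefaultAzureCredential())
    filesystem = service.get_file_system_client("datalake")
    landing_path = f"data/lnd/parking_sensors/{datetime.now(timezone.utc):%Y%m%d_%H%M%S}.json"
    filesystem.get_file_client(landing_path).upload_data(json.dumps(payload), overwrite=True)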

Architecture

The following shows the overall architecture of the solution.

Architecture

Sample PowerBI report

PowerBI report

Continuous Integration and Continuous Delivery (CI/CD)

The following shows the overall CI/CD process end to end.

CI/CD

See here for details.

Technologies used

It makes use of the following Azure services:

  • Azure Synapse Analytics (Spark Pools, SQL Dedicated Pool, and Data Pipelines)
  • Azure Data Lake Gen2
  • Azure Key Vault
  • Application Insights
  • Log Analytics
  • Azure DevOps
  • PowerBI

Key Learnings

The following summarizes key learnings and best practices demonstrated by this sample solution:

1. Use Data Tiering in your Data Lake

  • Generally, you want to divide your data lake into three major areas which contain your Bronze, Silver and Gold datasets.
    1. Bronze - This is the landing area for your raw datasets, with no or minimal data transformations applied; they are therefore optimized for writes / ingestion. Treat these datasets as an immutable, append-only store.
    2. Silver - These are cleansed, semi-processed datasets. These conform to a known schema and predefined data invariants and might have further data augmentation applied. These are typically used by Data Scientists.
    3. Gold - These are highly processed, highly read-optimized datasets primarily for consumption by business users. Typically, these are structured in your standard Fact and Dimension tables (a minimal sketch of this tiering follows this list).
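
The sketch below uses PySpark and the lake layout described later in this README (/data/lnd, /data/interim, /data/dw); the column names are illustrative rather than the sample's exact schema.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    lake = "abfss://datalake@<storage-account>.dfs.core.windows.net/data"  # placeholder account

    # Bronze: raw, append-only landing data, exactly as ingested
    bronze = spark.read.json(f"{lake}/lnd/parking_sensors/")

    # Silver: conform to a known schema and cleanse (typed columns, de-duplicated)
    silver = (bronze
              .select(F.col("bay_id").cast("int"),
                      F.col("status"),
                      F.col("lat").cast("double"),
                      F.col("lon").cast("double"))
              .dropDuplicates(["bay_id"]))
    silver.write.mode("overwrite").parquet(f"{lake}/interim/parking_bays/")

    # Gold: read-optimized, star-schema style tables for business consumption
    dim_parking_bay = silver.select("bay_id", "lat", "lon")
    dim_parking_bay.write.mode("overwrite").parquet(f"{lake}/dw/dim_parking_bay/")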

2. Validate data early in your pipeline

  • Add data validation between the Bronze and Silver datasets. By validating early in your pipeline, you can ensure all succeeding datasets conform to a specific schema and known data invariants. This can also prevent data pipeline failures in cases of unexpected changes to the input data (see the sketch after this list).
  • Data that does not pass this validation stage can be rerouted to a Malformed Record store for diagnostic purposes.
  • It may be tempting to add validation prior to landing in the Bronze area of your data lake. This is generally not recommended. Bronze datasets are there to ensure you have as close a copy of the source system data as possible. They can be used to replay the data pipeline for both testing (e.g. testing data validation logic) and data recovery purposes (e.g. data corruption is introduced due to a bug in the data transformation code and the pipeline needs to be replayed).
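
The sketch below illustrates such a validation step between Bronze and Silver with made-up invariants (a parking bay id must be present and the status must be a known value); records failing validation are routed to a Malformed Record store rather than failing the pipeline.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    lake = "abfss://datalake@<storage-account>.dfs.core.windows.net/data"  # placeholder account

    raw = spark.read.json(f"{lake}/lnd/parking_sensors/")

    # Illustrative invariants; the real solution validates against its known schema
    is_valid = F.col("bay_id").isNotNull() & F.col("status").isin("Present", "Unoccupied")
    flagged = raw.withColumn("is_valid", F.coalesce(is_valid, F.lit(False)))

    # Only validated records flow on to the Silver dataset
    flagged.filter("is_valid").drop("is_valid") \
        .write.mode("overwrite").parquet(f"{lake}/interim/parking_sensors/")

    # Everything else is kept for diagnostics in a Malformed Record store
    flagged.filter("NOT is_valid").drop("is_valid") \
        .write.mode("append").json(f"{lake}/malformed/parking_sensors/")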

3. Make your data pipelines replayable and idempotent

  • Silver and Gold datasets can get corrupted due to a number of reasons, such as unintended bugs, unexpected input data changes, and more. By making data pipelines replayable and idempotent, you can recover from this state through deployment of a code fix and replaying the data pipelines.
  • Idempotency also ensures data duplication is mitigated when replaying your data pipelines (see the sketch below).
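
One common way to achieve this, sketched below with placeholder paths, is to write each run's output to a deterministic location keyed by the run's slice (for example, a load-date partition) in overwrite mode, so replaying a slice replaces its previous output instead of appending duplicates.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    lake = "abfss://datalake@<storage-account>.dfs.core.windows.net/data"  # placeholder account

    run_date = "2023-01-01"  # in practice, passed in from the pipeline trigger time
    transformed = spark.read.parquet(f"{lake}/interim/parking_sensors/")

    # Overwriting a deterministic, run-scoped partition makes the write idempotent:
    # replaying the same slice replaces its previous output rather than duplicating it.
    (transformed
        .write
        .mode("overwrite")
        .parquet(f"{lake}/dw/fact_parking/load_date={run_date}"))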

4. Ensure data transformation code is testable

  • Abstracting away data transformation code from data access code is key to ensuring unit tests can be written against data transformation logic. An example of this is moving transformation code from notebooks into packages.
  • While it is possible to run tests against notebooks, shifting tests left shortens the feedback cycle and increases developer productivity (a minimal example follows this list).
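
Assuming the transformation logic lives in an installable Python package, such a test can run against a local SparkSession with hand-crafted input. The package and function names below are hypothetical, not the sample's exact ones.

    # test_standardize.py -- run with pytest; package/function names are hypothetical
    import pytest
    from pyspark.sql import SparkSession

    from ddo_transform.standardize import standardize_parking_bays  # hypothetical module


    @pytest.fixture(scope="session")
    def spark():
        return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()


    def test_standardize_deduplicates_bays(spark):
        raw = spark.createDataFrame(
            [(1, "Present"), (1, "Present"), (2, "Unoccupied")],
            ["bay_id", "status"],
        )
        result = standardize_parking_bays(raw)
        assert result.count() == 2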

5. Have a CI/CD pipeline

  • This means including all artifacts needed to build the data pipeline from scratch in source control. This includes infrastructure-as-code artifacts, database objects (schema definitions, functions, stored procedures, etc.), reference/application data, data pipeline definitions, and data validation and transformation logic.
  • There should also be a safe, repeatable process to move changes through dev, test and finally production.

6. Secure and centralize configuration

  • Maintain a central, secure location for sensitive configuration such as database connection strings that can be accessed by the appropriate services within each environment.
  • An example of this is securing secrets in KeyVault per environment, then having the relevant services query KeyVault for the configuration (see the sketch below).
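
The sketch below shows the second half of that pattern, reading a secret at runtime with the Azure SDK for Python; the vault URL and secret name are placeholders rather than the exact ones created by this deployment.

    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient

    # Each environment (DEV/STG/PROD) has its own KeyVault; placeholder name below
    vault_url = "https://<your-keyvault-name>.vault.azure.net"
    client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())

    # Read the SQL Dedicated Pool connection string at runtime instead of
    # hard-coding it in notebooks or pipeline definitions (secret name is illustrative)
    sqldw_connection_string = client.get_secret("sqldwConnectionString").value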

7. Monitor infrastructure, pipelines and data

  • A proper monitoring solution should be in place to ensure failures are identified, diagnosed and addressed in a timely manner. Aside from the base infrastructure and pipeline runs, data should also be monitored. A common area that should have data monitoring is the malformed record store (an illustrative query sketch follows).
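
Purely as an illustration (not part of the sample's deployed monitoring), a scheduled check could query the Log Analytics workspace for recent failed pipeline runs using the azure-monitor-query package; the workspace id is a placeholder and the table/column names assume the standard Synapse pipeline-run diagnostic log schema.

    from datetime import timedelta

    from azure.identity import DefaultAzureCredential
    from azure.monitor.query import LogsQueryClient

    client = LogsQueryClient(DefaultAzureCredential())

    # Count failed Synapse pipeline runs over the last day; table/column names assume
    # the SynapseIntegrationPipelineRuns diagnostic log schema.
    query = """
    SynapseIntegrationPipelineRuns
    | where Status == "Failed"
    | summarize failed_runs = count() by PipelineName
    """

    response = client.query_workspace(
        workspace_id="<log-analytics-workspace-id>",  # placeholder
        query=query,
        timespan=timedelta(days=1),
    )
    for table in response.tables:
        for row in table.rows:
            print(row)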

Key Concepts

Build and Release Pipeline

The Build and Release Pipelines definitions can be found here.

Environments

  1. Sandbox and Dev - the DEV resource group is used by developers to build and test their solutions. It contains two logical environments: (1) a Sandbox environment per developer, so each developer can make and test changes in isolation prior to committing to main, and (2) a shared Dev environment for integrating changes from the entire development team. "Isolated" Sandbox environments are accomplished through a number of practices, depending on the Azure service.
    • DataLake Gen2 - a "sandbox" file system is created. Each developer creates their own folder within this Sandbox filesystem.
    • AzureSQL or SQLDW - A transient database (restored from DEV) is spun up per developer on demand, if required.
    • Azure Synapse - git integration allows developers to make changes in their own branches and debug runs independently.
  2. Stage - the STG resource group is used to test deployments in a production-like environment prior to going to production. Integration tests are run in this environment.
  3. Production - the PROD resource group is the final Production environment.

Build and Release Sequence

There are eight numbered orange boxes describing the sequence from sandbox development to target environments:

CI/CD

  1. Developers develop in their own Sandbox environments within the DEV resource group and commit changes into their own short-lived git branches (e.g. <developer_name>/<branch_name>).
  2. When changes are complete, developers raise a PR to main for review. This automatically kicks off the PR validation pipeline which runs the unit tests, linting and DACPAC builds.
  3. On PR completion, the commit to main will trigger a Build pipeline -- publishing all necessary Build Artifacts.
  4. The completion of a successful Build pipeline will trigger the first stage of the Release pipeline. This deploys the published build artifacts into the DEV environment, with the exception of Azure Synapse Artifacts*.
  5. Developers perform a Manual Publish to the DEV Synapse Workspace from the collaboration branch (main). This updates the ARM templates in the workspace_publish branch.
  6. On the successful completion of the first stage, this triggers a Manual Approval Gate**. On Approval, the release pipeline continues with the second stage -- deploying changes to the Staging environment.
  7. Integration tests are run to test changes in the Staging environment.
  8. On the successful completion of the second stage, this triggers a second Manual Approval Gate***. On Approval, the release pipeline continues with the third stage -- deploying changes to the Production environment.

Notes:

More resources:

Testing

  • Unit Testing - These test small pieces of functionality within your code. Data transformation code should have unit tests and can be accomplished by abstracting Data Transformation logic into packages. Unit tests along with linting are automatically run when a PR is raised to main.

    • See here for unit tests within the solution and the corresponding QA Pipeline that executes the unit tests on every PR.
  • Integration Testing - These are run to ensure integration points of the solution function as expected. In this demo solution, an actual Synapse Data Pipeline run is automatically triggered and its output verified as part of the Release to the Staging Environment.
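
As a rough illustration of how such a test can be driven (not necessarily how this sample's integration tests are implemented), the Synapse data-plane REST API can trigger a pipeline run and poll it to completion. The workspace name below is a placeholder; the pipeline name matches the sample's P_Ingest_MelbParkingData.

    import time

    import requests
    from azure.identity import DefaultAzureCredential

    endpoint = "https://<workspace-name>.dev.azuresynapse.net"  # placeholder workspace
    token = DefaultAzureCredential().get_token("https://dev.azuresynapse.net/.default").token
    headers = {"Authorization": f"Bearer {token}"}

    # Trigger the ingestion pipeline
    run = requests.post(
        f"{endpoint}/pipelines/P_Ingest_MelbParkingData/createRun?api-version=2020-12-01",
        headers=headers,
    ).json()

    # Poll until the run reaches a terminal state, then assert on the outcome
    status = "InProgress"
    while status in ("Queued", "InProgress"):
        time.sleep(30)
        status = requests.get(
            f"{endpoint}/pipelineruns/{run['runId']}?api-version=2020-12-01",
            headers=headers,
        ).json()["status"]

    assert status == "Succeeded"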

Observability / Monitoring

Please check the details here.

How to use the sample

Prerequisites

  1. Github account
  2. Azure Account
  3. Azure DevOps Project

Software pre-requisites if you don't use dev container

Software pre-requisites if you use dev container

Setup and Deployment

IMPORTANT NOTE: As with all Azure deployments, this will incur associated costs. Remember to tear down all related resources after use to avoid unnecessary costs. See here for the list of deployed resources. See here for information on the clean_up script. This deployment was tested using WSL 2 (Ubuntu 20.04) and Debian GNU/Linux 9.9 (stretch).

  1. Initial Setup

    • Ensure that:

      • You are logged in to the Azure CLI. To login, run

        az login
      • Azure CLI is targeting the Azure Subscription you want to deploy the resources to. To set target Azure Subscription, run

        az account set -s <AZURE_SUBSCRIPTION_ID>
      • Azure CLI is targeting the Azure DevOps organization and project you want to deploy the pipelines to. To set target Azure DevOps project, run

        az devops configure --defaults organization=https://dev.azure.com/<MY_ORG>/ project=<MY_PROJECT>
      • Azure DevOps organization has the Synapse workspace deployment extension installed. For more information on how to install Azure DevOps extensions, see here. To install extension, run

        az devops extension install --extension-id "synapsecicd-deploy" --publisher-id "AzureSynapseWorkspace" --org "<MY_ORG>"

        Note that this requires the Project Collection Administrator role or organization Owner role.

    • Fork this repository into a new Github repo.

    • Set the following required environment variables:

      • GITHUB_REPO - Name of your forked github repo in the form <my_github_handle>/<repo> (e.g. "devlace/mdw-dataops-import").
      • GITHUB_PAT_TOKEN - a Github PAT token. Generate one here. This requires the "repo" scope.

      Optionally, set the following environment variables:

      • AZURE_LOCATION - Azure location to deploy resources. Default: westus.
      • AZURE_SUBSCRIPTION_ID - Azure subscription id to use to deploy resources. Default: default azure subscription. To see your default, run az account list.
      • DEPLOYMENT_ID - string appended to all resource names. This is to ensure uniqueness of Azure resource names. Default: random five-character string.
      • AZDO_PIPELINES_BRANCH_NAME - git branch where Azure DevOps pipelines definitions are retrieved from. Default: main.
      • SYNAPSE_SQL_PASSWORD - Password of the Synapse SQL instance. Default: random string.
    • If you are using the dev container, follow the steps below:

      • Rename .envtemplate under the ".devcontainer" folder to devcontainer.env and update the values there as mentioned above, instead of setting them as environment variables.
      • Open the project inside the vscode dev container (see details here).

      To further customize the solution, set parameters in arm.parameters files located in the infrastructure folder.

  2. Deploy Azure resources

    • cd into the e2e_samples/parking_sensors_synapse folder of the repo

    • Configure your default AzDo Organization and Project

      az devops configure --defaults organization="$AZDO_ORGANIZATION_URL" project="$AZDO_PROJECT"
    • Run ./deploy.sh.

      • This may take around 30 minutes or more to run end to end, so grab yourself a cup of coffee... ☕
      • After a successful deployment, you will find .env.{environment_name} files containing essential configuration information per environment. See here for the list of deployed resources.
      • Note that if you are using dev container, you would run the same script but inside the dev container terminal.
    • As part of the deployment, the script updates the Azure DevOps Release Pipeline YAML definition to point to your Github repository. Commit and push up these changes.

      • This will trigger a Build and Release which will fail due to a missing workspace_publish branch -- this is expected. The branch will be created once you've set up git integration with your DEV Synapse workspace and published a change.
  3. Setup Synapse git integration in DEV Synapse workspace

    IMPORTANT NOTE: Only the DEV Synapse workspace should be set up with Git integration. Do not set up git integration in the STG and PROD Synapse workspaces.

    • In the Azure Portal, navigate to the Synapse workspace in the DEV environment and launch the Synapse workspace portal.
    • Under "manage" > Source Control - Git configuration, select "Configure". For more information, see here.
    • Fill in the repository settings with the following:
      • Repository type: Github
      • Use GitHub Enterprise Server: Unselected, unless you are using GitHub Enterprise Server
      • GitHub repository owner: your_Github_account
      • Github Account: your_Github_account
      • Git repository (select Use repository link, if forked): forked Github repository url
      • Collaboration branch: main
      • Publish branch: workspace_publish
      • Root folder: /e2e_samples/parking_sensors_synapse/synapse/workspace
      • Import existing resources to repository: Selected
      • Import resource into this branch: main
    • When prompted to select a working branch, check Use existing and select main

    The final settings should look like the following:

    Synapse Github Integration

    Ensure you Import Existing Synapse resources to repository. The deployment script deployed Synapse Workspace objects with Linked Service configurations in line with the newly deployed environments. Importing existing Synapse Workspace resource definitions to the repository overrides any default Linked Service values so they are correctly in sync with your DEV environment.

  4. Run setup notebook in Synapse workspace per environment

    • Grant yourself the Storage Blob Data Contributor role on the Synapse main storage account (mdwdopsst2<ENV><DEPLOYMENT_ID>).
    • Navigate into DEV Synapse workspace notebooks tab and select the 00_setup notebook.
    • Run this notebook, attaching to the created Spark Pool.
    • Repeat this in the STG and PROD Synapse workspaces.
  5. Trigger an initial Release

    • In the DEV Synapse workspace, navigate to "Manage > Triggers". Select the T_Sched trigger and activate it by clicking on the "Play" icon next to it. Click Publish to publish changes.

      Publishing a change is required to generate the workspace_publish branch which is used in the Release pipelines. Note that publishing changes in a Synapse workspace currently cannot be automated. See Known Issues.

    • In Azure DevOps, notice a new run of the Build Pipeline (mdwdops-ci-artifacts) off main. This will build the Python package and SQL DACPAC, then publish these as Pipeline Artifacts.
    • After completion, this should automatically trigger the Release Pipeline (mdwdops-cd-release). This will deploy the artifacts across environments.
      • You may need to authorize the Pipelines initially to use the Service Connection and deploy the target environments for the first time. Release Pipeline
    • Optional. Trigger the Synapse Data Pipelines per environment.
      • In the Synapse workspace of each environment, navigate to "Author", then select the P_Ingest_MelbParkingData pipeline.
      • Select "Trigger > Trigger Now".
      • To monitor the run, go to "Monitor > Pipeline runs". Pipeline Run
  6. Optional. Visualize data in PowerBI

    This requires PowerBI Desktop App installed.

    • Open the provided PowerBI pbix (PowerBI_ParkingSensors.pbix) under the reports folder.
    • Under Queries, select "Transform Data" > "Data source settings".
    • Select "Change Source..." and enter the Server and Database details of your SQL Dedicated Pool. Click "Ok".

      You can retrieve these from the Azure Portal under "Connection Strings" of your SQL Dedicated Pool Instance.

    • Select "Edit Permissions...". Under "Credentials", select "Edit...". Select the "Database" tab. Enter the User name and password of your SQL Dedicated Pool Instance.

      You can retrieve these from the Secrets in your KeyVault instance.

    • Close the Data Source tabs.
    • Click on Refresh data.

      Your Dashboard will initially be empty. You will need your data pipeline to run a few times for the data in your SQL Dedicated Pool to populate.

Congratulations!! 🥳 You have successfully deployed the solution and accompanying Build and Release Pipelines.

If you've encountered any issues, please review the Troubleshooting section and the Known Issues section. If you are still stuck, please file a Github issue with the relevant error message, error screenshots, and replication steps.

Deployed Resources

After a successful deployment, you should have the following resources:

  • In Azure, three (3) Resource Groups (one per environment) each with the following Azure resources.
    • Azure Synapse Workspace including:
      • Data Pipelines - with pipelines, datasets, linked services, triggers deployed and configured correctly per environment.
      • Notebooks - Spark, SQL Serverless
      • Workspace package (i.e. the Python wheel package)
      • Spark Pool
        • Workspace package installed
        • Configured to point to the deployed Log Analytics workspace, under "Apache Spark Configuration".
      • SQL Dedicated Pool (formerly SQLDW) - Initially empty. The Release Pipeline deploys the SQL Database objects using the SQL DACPAC.
    • Data Lake Store Gen2 and a Service Principal (SP) with Storage Contributor rights assigned.
    • Log Analytics Workspace - including a Kusto query under Query explorer -> Saved queries, to verify results that will be logged from the Synapse notebooks (notebooks are not deployed yet).
    • Application Insights
    • KeyVault with all relevant secrets stored.
  • In Azure DevOps
    • Four (4) Azure Pipelines
      • mdwdops-cd-release - Release Pipeline
      • mdwdops-ci-artifacts - Build Pipeline
      • mdwdops-ci-qa-python - "QA" pipeline runs on PR to main
      • mdwdops-ci-qa-sql - "QA" pipeline runs on PR to main
    • Six (6) Variable Groups - two per environment
      • mdwdops-release-dev
      • mdwdops-secrets-dev*
      • mdwdops-release-stg
      • mdwdops-secrets-stg*
      • mdwdops-release-prod
      • mdwdops-secrets-prod*
    • Four (4) Service Connections
      • Three Azure Service Connections (one per environment) each with a Service Principal with Contributor rights to the corresponding Resource Group.
        • mdwdops-serviceconnection-dev
        • mdwdops-serviceconnection-stg
        • mdwdops-serviceconnection-prod
      • Github Service Connection for retrieving code from Github
        • mdwdops-github
    • Three additional Service Principals (one per environment) with Synapse Administrator role for running Integration Tests

Notes:

Clean up

This sample comes with an optional, interactive clean-up script which will delete resources with mdwdops in their names. It will list the resources to be deleted and will prompt before continuing.

IMPORTANT NOTE: As it simply searches for mdwdops in the resource name, it could list resources not part of the deployment! Use with care.

Data Lake Physical layout

ADLS Gen2 is structured as follows:

    datalake                    <- filesystem
        /sys/databricks/libs    <- contains all libs, jars, wheels needed for processing
        /data
            /lnd                <- Bronze - landing folder where all data files are ingested into.
            /interim            <- Silver - interim (cleansed) tables
            /dw                 <- Gold - final tables 

Known Issues, Limitations and Workarounds

The following lists some limitations of the solution and associated deployment script:

  • Azure DevOps Variable Groups linked to KeyVault can only be created via the UI; they cannot be created programmatically and so were not incorporated into the automated deployment of the solution.
    • Workaround: the deployment adds sensitive configuration as "secrets" in Variable Groups, with the downside of duplicated information. If you wish, you may manually link a second Variable Group to KeyVault to pull out the secrets. KeyVault secret names should line up with the required variables in the Azure DevOps pipelines. See here for more information.
  • Azure DevOps Environments and Approval Gates can only be managed via the UI; they cannot be managed programmatically and so were not incorporated into the automated deployment of the solution.
    • Workaround: Approval Gates can be easily configured manually. See here for more information.
  • Azure Synapse SQL Serverless artifacts (e.g. Tables, Views, etc.) are not currently updated as part of the CI/CD pipeline.
  • Manually publishing the Synapse workspace is required. Currently, this cannot be automated. Failing to publish will mean potentially releasing a stale data pipeline definition in the release pipeline.
    • Mitigation: Set Approval Gates between environments. This allows for an opportunity to verify whether the manual publish has been performed.