DBT (Data Building Tool) is a command-line tool that enables data analysts and engineers to transform data in their warehouses simply by writing select statements.
Cloud Composer is a fully managed data workflow orchestration service that empowers you to author, schedule, and monitor pipelines.
This repository demonstrate using the dbt to manage tables in BigQuery and using Cloud Composer for schedule the dbt run.
There are two sets of example:
- Basic
The basic example is demonstrating the minimum configuration that you need to run dbt on Cloud Composer - Optimized
The optimized example is demonstrating optimization on splitting the dbt run for each models,
implementing incremental in the dbt model, and using Airflow execution date to handle backfill.
These GCP services will be used in the example code:
- Cloud Composer
- BigQuery
- Google Cloud Storage (GCS)
- Cloud Build
- Google Container Repository (GCR)
- Cloud Source Repository (CSR)
This diagram explains the example solution's flow:
- The code starts from a dbt project stored in a repository. (The example is under [basic or optimized]/dbt-project folder)
- Any changes from the dbt project will trigger Cloud Build run
- The Cloud Build will create/update an image to GCR; and export dbt docs to GCS
- The Airflow DAG deployed to Cloud Composer (The example is under [basic or optimized]/dag folder)
- The dbt run triggered using KubernetesPodOperator that pulls image from the step #3
- At the end of the process the BigQuery objects will be created/updated (i.e datasets and tables)
- Cloud Composer environment
https://cloud.google.com/composer/docs/how-to/managing/creating - Set 3 ENVIRONMENT VARIABLES in the Cloud Composer (AIRFLOW_VAR_BIGQUERY_LOCATION, AIRFLOW_VAR_RUN_ENVIRONMENT, AIRFLOW_VAR_SOURCE_DATA_PROJECT)
https://cloud.google.com/composer/docs/how-to/managing/environment-variables - Cloud Source Repository (or any git provider)
Store the code from dbt-project in this dedicated repository
The repository should contain dbt_project.yml file (Check the example code under [basic or optimized]/dbt-project] folder)
Note that the dedicated dbt-project repository is not this example code repository (github repo) - Cloud Build triggers
Trigger build from the dbt project repository
https://cloud.google.com/build/docs/automating-builds/create-manage-triggers - BigQuery API enabled
- Service account to run dbt commands
- Kubernetes Secret to be binded with the service account
https://cloud.google.com/kubernetes-engine/docs/concepts/secret
Check in the /dbt-project/.dbt/profiles.yml, you will find 2 options to run the dbt:
-
local
You can run the dbt project using your local machine or Cloud Shell. To do that, rungcloud auth application-default login
Trigger dbt run by using this command:
dbt run --vars '{"project_id": [Your Project id], "bigquery_location": "us", "execution_date": "1970-01-01","source_data_project": "bigquery-public-data"}' --profiles-dir .dbt
-
remote
This option is for running dbt using service account
For example from Cloud build and Cloud Composer
Check cloudbuild.yaml and dag/dbt_with_kubernetes.py to see how to use this option
After all the Prerequisites are prepared. You will have:
- A dbt-project repository
- Airflow DAG to run the dbt
Here are the follow up steps for running the code:
- Push the code in dbt-project repository and make sure the Cloud Build triggered; and successfully create the docker image
- In the Cloud Composer UI, run the DAG (e.g dbt_with_kubernetes.py)
- If successfull, check the BigQuery console to check the tables
With this mechanism, you have 2 independent runs.
Updating the dbt-project, including models, schema and configurations will run the Cloud Build to create the docker image.
The DAG as dbt scheduler will run the dbt-project from the latest docker image available.
You can pass variables from Cloud Composer to the dbt run.
As an example, in this code we configure the BigQuery dataset location in the US as part of the DAG.
default_dbt_vars = {
"project_id": project,
# Example on using Cloud Composer's variable to be passed to dbt
"bigquery_location": Variable.get("bigquery_location"),
"key_file_dir": '/var/secrets/google/key.json',
"source_data_project": Variable.get("source_data_project")
}
In the dbt script, you can use the variable like this:
location: "{{ var('bigquery_location') }}"