
We are the boutique analytics consultancy that turns disorganised data into real business value. Get in touch to learn more about how Tasman can help solve your organisation's data challenges.

tasman_dbt_utils

What is tasman_dbt_utils?

tasman_dbt_utils is a dbt package with reusable macros. It includes macros for, among other things:

  • tests
  • monitoring & auditing
  • SQL functions
  • ops functions

The intention for this package is to have each macro available for both dbt-snowflake and dbt-bigquery. Some functions are only available on one of the platforms because of fundamental differences between them; for example, managing warehouse sizes only applies to Snowflake. This package does not intend to replace other commonly used packages such as dbt-utils or dbt-expectations; it adds functionality that those packages don't provide.

Installation

packages:
    - git: "https://github.com/TasmanAnalytics/tasman_dbt_utils.git"
      revision: 1.0.0
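
Add the entry above to your project's packages.yml, then install the package with:

    dbt deps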

Macros & tests

Tests

test_count_distinct_matches_source (source)

  1. Counts distinct values of the test column (e.g. transaction_id)
  2. Aggregates this by a specified field (e.g. date) to get an aggregated measure (e.g. date | count_transactions)
  3. Compares this against another model (ideally with the same granularity) (e.g. date | count_transactions)
  4. Returns any rows where there is a discrepancy between the aggregated measures
  5. (Bonus) If an exact match isn't required, you can specify a threshold within which discrepancies are tolerated, e.g. the transaction count may fluctuate within a ±5% range of the source (see the sketch below)
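
The macro's actual SQL isn't reproduced here, but conceptually the comparison works roughly like the following sketch (model, column, and field names are illustrative, taken from the example further down):

    -- illustrative sketch only, not the macro's actual SQL
    with model_agg as (
        select date_trunc(day, created_timestamp) as dt,
               count(distinct transaction_id) as model_count
        from dmn_jaffle_shop
        group by 1
    ),
    source_agg as (
        select sale_date as dt,
               count(distinct transaction_id) as source_count
        from raw_jaffle_shop
        group by 1
    )
    -- a returned row means a discrepancy beyond the allowed threshold, i.e. a test failure
    select dt, model_count, source_count
    from model_agg
    join source_agg using (dt)
    where abs(model_count - source_count) * 100.0 / nullif(source_count, 0) > 0  -- percent_mismatch_threshold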

Arguments

  • source_model (required): The name of the model that contains the source of truth. Specify this as a ref function e.g. ref('raw_jaffle_shop').
    • These can be seed files or dbt models, so there's a degree of flexibility here
  • source_metric (required): The name of the column/metric sourced from source_model
  • comparison_field (required): The name of the column/metric sourced from model in the YAML file i.e. the column/metric that is being compared against
  • percent_mismatch_threshold (optional, default = 0): The percentage discrepancy you are willing to tolerate, e.g. set to 5 to allow a ±5% difference.

Usage

This works similarly to the out-of-the-box tests offered by dbt (unique, not_null, etc.)

  1. Create the test in the YAML config, specifying values for all arguments marked as required above. Example:
    • Add any additional filtering conditions to your model via a config/where block
version: 2

models:
  - name: dmn_jaffle_shop
    description: ""
    columns:
      - name: transaction_id
        description: ""
        tests:
          - not_null
          - unique
          - count_aggregate_matches_source:
              name: count_transactions_matches_source__dmn_jaffle_shop
              source_model: ref('raw_jaffle_shop')
              source_field: sale_date
              source_metric: transaction_amount
              comparison_field: date_trunc(day, created_timestamp)
              config:
                where: date_trunc(day, created_timestamp) between '2022-01-11' and '2022-12-31' and sale_type != 'CANCELLED'
  2. Specify a unique test name for name
    • If this is not specified, dbt will by default concatenate all the test arguments into one long test name, making it unreadable.
  3. Run dbt test as you normally would, e.g. dbt test -s dmn_jaffle_shop

test_sum_matches_source (source)

  1. Sums the test column (e.g. revenue)
  2. Aggregates this by a specified field (e.g. date) to get an aggregated measure (e.g. date | sum_revenue)
  3. Compares this against another model (ideally with the same granularity) (e.g. date | sum_revenue)
  4. Returns any rows where there is a discrepancy between the aggregated measures
  5. (Bonus) If an exact match isn't required, you can specify a threshold within which discrepancies are tolerated, e.g. the revenue sum may fluctuate within a ±5% range of the source

Arguments

  • source_model (required): The name of the model that contains the source of truth. Specify this as a ref function e.g. ref('raw_jaffle_shop')
    • These can be seed files or dbt models, so there's a degree of flexibility here
  • source_metric (required): The name of the column/metric sourced from source_model
  • comparison_field (required): The name of the column/metric sourced from model in the YAML file i.e. the column/metric that is being compared against
  • percent_mismatch_threshold (optional, default = 0): The percentage discrepancy you are willing to tolerate, e.g. set to 5 to allow a ±5% difference.
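
For example, assuming the threshold is applied as the absolute percentage difference relative to the source value, a setting of 5 would behave roughly as follows:

    source sum_revenue = 1,000, model sum_revenue = 1,049  ->  |1,049 - 1,000| / 1,000 * 100 = 4.9%  ->  passes
    source sum_revenue = 1,000, model sum_revenue = 1,060  ->  |1,060 - 1,000| / 1,000 * 100 = 6.0%  ->  fails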

Usage

This works similarly to the out-of-the-box tests offered by dbt (unique, not_null, etc.)

  1. Create the test in the YAML config, specifying values for all arguments marked as required above. Example:
    • Add any additional filtering conditions to your model via a config/where block
version: 2

models:
  - name: dmn_jaffle_shop
    description: ""
    columns:
      - name: revenue
        description: ""
        tests:
          - not_null
          - sum_aggregate_matches_source:
              name: sum_revenue_matches_source__dmn_jaffle_shop
              source_model: ref('raw_jaffle_shop')
              source_field: sale_date
              source_metric: sum_revenue
              comparison_field: date_trunc(day, created_timestamp)
              config:
                where: date_trunc(day, created_timestamp) between '2022-01-11' and '2022-12-31' and sale_type != 'CANCELLED'
  2. Specify a unique test name for name
    • If this is not specified, dbt will by default concatenate all the test arguments into one long test name, making it unreadable.
  3. Run dbt test as you normally would, e.g. dbt test -s dmn_jaffle_shop

SQL Functions

include_source (source)

A frequently used pattern for creating initial CTEs that reference sources, creating a dbt model dependency.

Platform Support: BigQuery, Snowflake

Arguments

  • source: (required) Source name to be used in the script. This is also used to name the CTE.

Usage

    {{include_source('dbo','user')}}
    {{include_source('dbo','event')}}
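
The macro's exact output isn't shown in this README, but based on the description it presumably renders a CTE along these lines (assumed expansion, using the names from the usage above):

    -- assumed expansion of {{ include_source('dbo', 'user') }} (illustrative only)
    user as (
        select *
        from {{ source('dbo', 'user') }}
    )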

include_ref (source)

A frequently used pattern for creating initial CTEs that reference upstream models, creating a dbt model dependency.

Platform Support: BigQuery, Snowflake

Arguments

  • source: (required) The name of the upstream model to be used in the script. This is also used to name the CTE.
  • where_statement: (optional) Can be used to apply an initial filter to the model.

Usage

    {{include_ref('stg_user', 'where user_active = true')}}
    {{include_ref('dmn_pipeline')}}
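
As with include_source, the assumed expansion would be a CTE over ref(), with the optional where_statement appended inside it (illustrative only):

    -- assumed expansion of {{ include_ref('stg_user', 'where user_active = true') }}
    stg_user as (
        select *
        from {{ ref('stg_user') }}
        where user_active = true
    )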

Monitoring & auditing

create_table_profile (source)

Prints a summary of statistics about the target model to the terminal.

| database_name   | schema_name | table_name | column_name          | ordinal_position | row_count | distinct_count | null_count | is_unique | max        | min        |     avg |
| --------------- | ----------- | ---------- | -------------------- | ---------------- | --------- | -------------- | ---------- | --------- | ---------- | ---------- | ------- |
| tasman-internal | dbt_jurri   | customers  | customer_id          |                1 |       100 |            100 |          0 |      True | 100        | 1          | 50.500… |
| tasman-internal | dbt_jurri   | customers  | first_name           |                2 |       100 |             79 |          0 |     False |            |            |         |
| tasman-internal | dbt_jurri   | customers  | last_name            |                3 |       100 |             19 |          0 |     False |            |            |         |
| tasman-internal | dbt_jurri   | customers  | first_order          |                4 |       100 |             46 |         38 |     False | 2018-04-07 | 2018-01-01 |         |
| tasman-internal | dbt_jurri   | customers  | most_recent_order    |                5 |       100 |             52 |         38 |     False | 2018-04-09 | 2018-01-09 |         |
| tasman-internal | dbt_jurri   | customers  | number_of_orders     |                6 |       100 |              4 |         38 |     False | 5          | 1          |  1.597… |
| tasman-internal | dbt_jurri   | customers  | customer_lifetime... |                7 |       100 |             35 |         38 |     False | 99         | 1          | 26.968… |

Platform Support: BigQuery, Snowflake

Scope: model, seed, snapshot

Arguments

  • table: the name of the table to profile.
  • schema (optional, default=target.schema): the schema where the target table is located.
  • database (optional, default=target.database): the database where the target table is located.
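
No usage example is given for this macro; since it prints to the terminal, invoking it as a run-operation along these lines should work (the table name is illustrative, and omitted optional arguments fall back to the defaults above):

    dbt run-operation create_table_profile --args '{table: customers}'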

get_object_keys (source)

Gets all of the object keys (including nested keys) of a column and prints them to the terminal.

Platform Support: BigQuery, Snowflake

Scope: model, snapshot

Arguments

  • column: the name of the object column.
  • table: the name of the target table.
  • schema (optional, default=target.schema): the schema where the target table is located.
  • database (optional, default=target.database): the database where the target table is located.
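
Again assuming invocation via run-operation, with an illustrative column and table name:

    dbt run-operation get_object_keys --args '{column: properties, table: stg_events}'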

Ops

set_warehouse_size (source)

Sets a custom warehouse size for individual models.

Platform Support: BigQuery, Snowflake

Scope: model, snapshot

Arguments

  • size (required): the size of the warehouse

Usage

{{
    config(
        snowflake_warehouse=tasman_dbt_utils.set_warehouse('M')
    )
}}

This requires a dictionary of environments, warehouse prefixes, and the available warehouse sizes to be set in dbt_project.yml. If the environment is missing from dbt_project.yml, the default warehouse is used.

vars:
  tasman_dbt_utils:
    snowflake_warehouses:
      prod:
        warehouse_prefix: "PROD_WH_"
        size: ["XS", "S", "M"]
      ci:
        warehouse_prefix: "CI_WH_"
        size: ["XS", "S", "M"]

drop_old_relations (source)

This macro takes the relations in the manifest and compares them to the tables and views in the warehouse. Tables and views that exist in the warehouse but not in the manifest will be dropped.

Platform Support: BigQuery, Snowflake

Scope: model, seed, snapshot

Arguments

  • schema_prefix (optional, default=target.schema): the prefix of the schemas where relations should be dropped.
  • database (optional, default=target.database): the database where relations should be dropped.
  • dry_run (optional, default=True): when set to True the drop statements are only printed; when set to False the relations are actually removed.

Usage

dbt run-operation drop_old_relations --args '{dry_run: False, schema: dbt}'
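
Because dry_run defaults to True, a safe workflow is to run the operation without overriding it first, review the printed drop statements, and only then re-run with dry_run: False. For example, using the schema_prefix argument as documented above:

    dbt run-operation drop_old_relations --args '{schema_prefix: dbt}'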