From 9516cf65914c8a955a48ffa758cf417c35d3325e Mon Sep 17 00:00:00 2001
From: Valentin Matton
Date: Fri, 27 Sep 2024 15:32:21 +0200
Subject: [PATCH] docs(dbt): start guidelines

---
 pipeline/CONTRIBUTING.md     | 41 +++++++++++++---------------
 pipeline/dbt/CONTRIBUTING.md | 52 ++++++++++++++++++++++++++++++++++++
 2 files changed, 70 insertions(+), 23 deletions(-)
 create mode 100644 pipeline/dbt/CONTRIBUTING.md

diff --git a/pipeline/CONTRIBUTING.md b/pipeline/CONTRIBUTING.md
index ed58fe312..cc8d7950f 100644
--- a/pipeline/CONTRIBUTING.md
+++ b/pipeline/CONTRIBUTING.md
@@ -12,19 +12,12 @@ pip install -U pip setuptools wheel
 
 # Install the dev dependencies
 pip install -r requirements/dev/requirements.txt
-```
-
-## Running the test suite
-
-```bash
-# Copy (and optionally edit) the template .env
-cp .template.env .env
-
-# simply use tox (for reproducible environnement, packaging errors, etc.)
-tox
+
+# Install dbt
+pip install -r requirements/tasks/dbt/requirements.txt
 ```
 
-## dbt
+## Running `dbt`
 
 * dbt is configured to target the `target-db` postgres container (see the root `docker-compose.yml`).
 * all dbt commands must be run in the in the `pipeline/dbt` directory.
@@ -44,20 +37,12 @@ dbt run-operation create_udfs
 
 # run commands
 dbt ls
 
-# staging, basic processing/mapping:
-# - retrieve data from datalake table
-# - retrieve data from raw dedicated source tables
-# - retrieve data from the Soliguide S3
-dbt run --select staging
-
-# intermediate, specific transformations
-dbt run --select intermediate
-
-# marts, last touch
-dbt run --select marts
+dbt build --select models/staging
+dbt build --select models/intermediate
+dbt build --select models/marts
 ```
 
-## Update schema in dbt seeds
+## Updating schema in dbt seeds
 
 * Required when the schema changes.
@@ -65,7 +50,7 @@ dbt run --select marts
 python scripts/update_schema_seeds.py
 ```
 
-## Manage the pipeline requirements
+## Managing the pipeline requirements
 
 In order to prevent conflicts:
 
@@ -84,3 +69,13 @@ make all
 # to upgrade dependencies
 make upgrade all
 ```
+
+## Running the test suite
+
+```bash
+# Copy (and optionally edit) the template .env
+cp .template.env .env
+
+# simply use tox (for a reproducible environment, packaging errors, etc.)
+tox
+```
diff --git a/pipeline/dbt/CONTRIBUTING.md b/pipeline/dbt/CONTRIBUTING.md
new file mode 100644
index 000000000..49e915bf6
--- /dev/null
+++ b/pipeline/dbt/CONTRIBUTING.md
@@ -0,0 +1,52 @@
+# `dbt` guidelines
+
+## testing models
+
+#### `data_tests` vs `unit_tests` vs `contract`
+
+* with `dbt build`, `data_tests` are run **after** model execution, **on the actual data**. A failing test will not prevent faulty data from being propagated downstream, unless properly handled by the orchestration.
+* with `dbt build`, `unit_tests` are run **before** model execution, **on mock-up data**. This is great for testing logic, but requires making assumptions about the input data.
+* `contract`s are enforced using actual DB constraints, **on the actual data**. A failing constraint will stop the model execution and prevent faulty data from being propagated downstream. Unlike `data_tests`, we cannot set a severity level, so there is no middle ground, and the faulty data cannot easily be queried.
+
+✅ use `unit_tests` to test **complex logic** on well-defined data (e.g. converting opening hours).
+
+❌ avoid `unit_tests` for simple transformations. They are costly to maintain and will very often just duplicate the implementation.
+
+✅ always add a few `data_tests`.
+
+✅ use `contract`s on `marts`. Marts data can be consumed by clients.
+
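+For illustration, here is a minimal sketch of how the three mechanisms could be declared in a properties file. The model, column and test names below are made up and only meant as an example:
+
+```yaml
+models:
+  - name: marts_example__services         # hypothetical marts model
+    config:
+      contract:
+        enforced: true                    # a failing constraint stops this model's build
+    columns:
+      # with an enforced contract, every column must be declared with its data_type;
+      # only one column is shown here to keep the sketch short
+      - name: id
+        data_type: text
+        constraints:
+          - type: not_null
+        data_tests:
+          - unique                        # runs after the model is built, on the actual data
+
+unit_tests:
+  - name: test_example_opening_hours      # hypothetical unit test, run on mock-up rows
+    model: int_example__services
+    given:
+      - input: ref('stg_example__services')
+        rows:
+          - {id: "1", horaires: "du lundi au vendredi, 9h-17h"}
+    expect:
+      rows:
+        - {id: "1", opening_hours: "Mo-Fr 09:00-17:00"}
+```
+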
+#### which layer (`source`, `staging`, `intermediate`, `marts`) should I test?
+
+It's better to test data early, so that we can make assumptions on which we can later build.
+
+Our `source` layer is essentially tables containing the raw data in jsonb `data` columns. While this is very handy for loading data, it is impractical to test with `data_tests`.
+
+Therefore our tests start at the `staging` layer.
+
+✅ `staging`: use `data_tests` extensively. Assumptions about the data made in downstream models should be tested here.
+
+✅ `intermediate`: use `data_tests` for primary keys and foreign keys. Use the generic tests `check_structure`, `check_service` and `check_address`.
+
+✅ `marts`: use `contract`s + the generic tests `check_structure`, `check_service` and `check_address`.
+
+#### which type of `data_tests` should I use?
+
+* to stay manageable, our tests should be more or less uniform across the codebase (a worked example is given in the appendix below).
+
+✅ always use the native `unique` and `not_null` tests for primary keys.
+
+✅ always use `relationships` for foreign keys.
+
+✅ use `not_null`, `dbt_utils.not_empty_string` and `dbt_utils.not_constant` when possible.
+
+✅ use `accepted_values` for categorical columns coming from well-defined data.
+
+❌ avoid `accepted_values` for categorical columns coming from lower-quality data, or downgrade the test severity to `warn`; otherwise the test could fail too often.
+
+✅ for simple cases, prefer predefined generic data tests over custom data tests (in `tests/`). They usually require less code and are easier to read, *unless* you want to test complex logic.
+
+## references
+
+* https://www.datafold.com/blog/7-dbt-testing-best-practices
+* https://docs.getdbt.com/best-practices
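+
+## appendix: example `data_tests` declarations
+
+A minimal sketch of what the recommendations above could look like in a staging properties file. Model names, column names and accepted values are purely illustrative:
+
+```yaml
+models:
+  - name: stg_example__services           # hypothetical staging model
+    columns:
+      - name: id                          # primary key
+        data_tests:
+          - unique
+          - not_null
+      - name: structure_id                # foreign key
+        data_tests:
+          - relationships:
+              to: ref('stg_example__structures')
+              field: id
+      - name: nom
+        data_tests:
+          - not_null
+          - dbt_utils.not_empty_string
+          - dbt_utils.not_constant
+      - name: typologie                   # categorical column from lower-quality data
+        data_tests:
+          - accepted_values:
+              values: ["a", "b", "c"]     # illustrative values only
+              config:
+                severity: warn            # warn instead of error to avoid noisy failures
+```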