Refactor pytest unit tests to dbt unit tests #346

Open. Wants to merge 48 commits into base: main. Showing changes from 43 commits.

Commits
9d8f0c8
Replaced python unit test with dbt 1.8 unit test
adamribaudo-velir Jun 7, 2024
298f373
refactored unit tests for stg_ga4__session_conversions_daily
adamribaudo-velir Jun 7, 2024
906bec2
update test name
adamribaudo-velir Jun 7, 2024
e994408
Replaced Python unit test with dbt unit test
adamribaudo-velir Jun 22, 2024
34bbab9
variable override working properly
adamribaudo-velir Jun 22, 2024
4b66d1f
using overrides properly
adamribaudo-velir Jun 22, 2024
79f7e27
replaced another unit test
adamribaudo-velir Jun 22, 2024
cecf337
replaced python unit test
adamribaudo-velir Jun 22, 2024
63a7d86
add unit test for stg_ga4__client_key_first_last_pageviews
adamribaudo-velir Jun 22, 2024
6e709db
replace unit test
adamribaudo-velir Jun 22, 2024
9d53c9a
unit test for stg_ga4__sessions_traffic_sources_last_non_direct_daily…
adamribaudo-velir Jun 22, 2024
3425fdf
Add package-lock.yml to .gitignore
davidbooke4 Oct 22, 2024
c3ba7f7
Add vars to dbt_project.yml for testing
davidbooke4 Oct 23, 2024
10456ef
Merge branch 'main' into feature/dbt-unit-tests
davidbooke4 Oct 23, 2024
a1f10df
Add unit tests to stg_ga4__events.yml for the url_parsing macros
davidbooke4 Oct 23, 2024
5972788
Add conditions for cases when event_source is null for session parame…
davidbooke4 Oct 23, 2024
20598fb
Add unit test to stg_ga4__sessions_traffic_sources_daily for testing …
davidbooke4 Oct 23, 2024
282eeee
Add unit test to stg_ga4__user_id_mapping to test the latest mapping …
davidbooke4 Oct 23, 2024
c321197
Add descriptions for unit tests that were missing them
davidbooke4 Oct 23, 2024
8a1796e
Remove python unit tests that have been migrated to dbt unit tests
davidbooke4 Oct 23, 2024
c0aba5f
Add unit test to stg_ga4__events for testing transformations in stg_g…
davidbooke4 Oct 24, 2024
922ba07
Remove todo and example stg_ga4__events unit test files
davidbooke4 Oct 24, 2024
3a4f677
Add sessions_traffic_sources_last_non_direct_daily python unit test back
davidbooke4 Oct 24, 2024
c870130
Comment out unit tests for disabled models
davidbooke4 Oct 24, 2024
7386371
Remove edits from dbt_project.yml
davidbooke4 Oct 24, 2024
76f2c7f
Comment out unit test for sessions_traffic_sources_last_non_direct_da…
davidbooke4 Oct 24, 2024
697bafd
Update unit test section in README
davidbooke4 Oct 24, 2024
616da99
Simplify event_params construction in test_base_to_stg_ga4__events in…
davidbooke4 Oct 24, 2024
653e1ae
Update yml files to use consistent new line convention
davidbooke4 Oct 24, 2024
50ff2e8
update PR template
adamribaudo-velir Oct 24, 2024
68f9f87
Update default channel grouping test to use seed instead of fixture a…
davidbooke4 Oct 25, 2024
a3d9c1e
Comment out unit tests for disabled models
davidbooke4 Oct 28, 2024
1dd415e
Un-comment unit tests
davidbooke4 Oct 29, 2024
4ef2503
Add profiles.yml for Github Actions to execute dbt commands and add .…
davidbooke4 Oct 29, 2024
83bd23b
Add profile and variables to dbt_project.yml so Github Action can run…
davidbooke4 Oct 29, 2024
6f4335e
Add dbt unit tests job to github CI workflow
davidbooke4 Oct 29, 2024
947868d
Remove empty step
davidbooke4 Oct 29, 2024
8c879f7
Add repo to checkout step so PR code is checked out to test adding ne…
davidbooke4 Oct 29, 2024
95b3a60
Change workflow on behavior for testing changes
davidbooke4 Oct 29, 2024
555671e
Add comments related to unit tests and new Github Actions job to mark…
davidbooke4 Oct 31, 2024
f198262
Make updates for dbt unit test Github Action and allow for use of env…
davidbooke4 Oct 31, 2024
621e429
Add conditional logic to allow for use of --empty flag
davidbooke4 Oct 31, 2024
c82a36f
Fix spacing for comments added to README.md
davidbooke4 Oct 31, 2024
0ae02cc
Enable models dependent on project variables if environment variables…
davidbooke4 Nov 5, 2024
99a50c8
Set start_date to environment variable if it exists
davidbooke4 Nov 5, 2024
cd35ef9
Remove variables from dbt_project.yml and have models look for increm…
davidbooke4 Nov 5, 2024
ac415db
Add more environment variables to CI workflow
davidbooke4 Nov 5, 2024
7e44907
Update README after removing project variables in dbt_project.yml
davidbooke4 Nov 5, 2024
36 changes: 36 additions & 0 deletions .github/workflows/run_unit_tests_on_pr.yml
@@ -3,6 +3,9 @@ name: Run Unit Tests on Pull Request
on: [pull_request_target,workflow_dispatch]
env:
  BIGQUERY_PROJECT: ${{ secrets.BIGQUERY_PROJECT }}
  BIGQUERY_PROPERTY_ID: ${{ secrets.BIGQUERY_PROPERTY_ID }}
  BIGQUERY_DATASET: ${{ secrets.BIGQUERY_DATASET }}
  BIGQUERY_KEYFILE: ./unit_tests/dbt-service-account.json

jobs:
  pytest_run_all:
@@ -35,3 +38,36 @@ jobs:

      - name: Run tests
        run: python -m pytest .

  run_dbt_unit_tests:
    name: Run dbt Unit Tests
    runs-on: ubuntu-latest
    steps:
      - name: Check out
        uses: actions/checkout@v3
        with:
          ref: ${{ github.event.pull_request.head.sha }}

      - uses: actions/setup-python@v1
        with:
          python-version: "3.11.x"

      - name: Authenticate using service account
        run: 'echo "$KEYFILE" > ./unit_tests/dbt-service-account.json'
        shell: bash
        env:
          KEYFILE: ${{ secrets.GCP_BIGQUERY_USER_KEYFILE }}

      - name: Install dbt
        run: |
          pip install dbt-core
          pip install dbt-bigquery
          dbt deps

      - name: Materialize necessary dbt resources
        run: |
          dbt seed -f
          dbt run -s +test_type:unit -f --empty

      - name: Run dbt unit tests
        run: dbt test -s test_type:unit
2 changes: 2 additions & 0 deletions .gitignore
@@ -2,6 +2,8 @@
target/
dbt_packages/
logs/
package-lock.yml
.user.yml

google-cloud-sdk/
unit_tests/.env
36 changes: 34 additions & 2 deletions README.md
@@ -72,7 +72,7 @@ packages:
```
## Required Variables

-This package assumes that you have an existing DBT project with a BigQuery profile and a BigQuery GCP instance available with GA4 event data loaded. Source data is defined using the `project` and `dataset` variables below. The `static_incremental_days` variable defines how many days' worth of data to reprocess during incremental runs.
+This package assumes that you have an existing DBT project with a BigQuery profile and a BigQuery GCP instance available with GA4 event data loaded. Source data is defined using the `project` and `property_ids` variables below. The `static_incremental_days` variable defines how many days' worth of data to reprocess during incremental runs. The `start_date` variable defines the earliest date for which data is included and loaded into the models in this package.

```
vars:
@@ -214,6 +214,8 @@ vars:
value_type: "string_value"
```

The `derived_user_properties` set in `dbt_project.yml` should either be updated to reflect the derived user properties for your project or they should be removed if you don't wish to set any derived user properties.

### Derived Session Properties

Derived session properties are similar to derived user properties, but on a per-session basis, for properties that change slowly over time. This provides additional flexibility in allowing users to turn any event parameter into a session property.
@@ -247,6 +249,8 @@ vars:
value_type: "int_value"
```

The `derived_session_properties` set in `dbt_project.yml` should either be updated to reflect the derived session properties for your project or they should be removed if you don't wish to set any derived session properties.

### GA4 Recommended Events

See the README file at /dbt_packages/models/staging/recommended_events for instructions on enabling [Google's recommended events](https://support.google.com/analytics/answer/9267735?hl=en).
@@ -261,6 +265,8 @@ vars:
conversion_events: ['purchase','download']
```

The `conversion_events` set in `dbt_project.yml` should either be updated to reflect the conversion events for your project or they should be removed if you don't wish to set any conversion events.

### Session Attribution Lookback Window

The `stg_ga4__sessions_traffic_sources_last_non_direct_daily` model provides last non-direct session attribution within a configurable lookback window. The default is 30 days, but this can be overridden with the `session_attribution_lookback_window_days` variable.
@@ -302,9 +308,35 @@ The easiest option is using OAuth with your Google Account. Summarized instructi
```
gcloud auth application-default login --scopes=https://www.googleapis.com/auth/bigquery,https://www.googleapis.com/auth/iam.test
```

The `profiles.yml` file included in this package should be removed. The `profile: 'default'` line in `dbt_project.yml` in this package should also be removed.

Collaborator:

@davidbooke4 I understand that these files aren't needed by a package user, but what is the benefit of removing them? They'll be added back the next time the person runs `dbt deps`, right?

Collaborator Author:

Yep, you're right. I forgot that would be the case! I'll update the comments I made in the README accordingly.

One consequence of defining some of these project variables in `dbt_project.yml` within the package is that they carry through to a user's dbt project if they've installed dbt-ga4 but haven't set their own values for these variables in their own `dbt_project.yml`. Models such as `stg_ga4__page_conversions` and `stg_ga4__derived_session_properties` would then be enabled and have fields created based on the variables I've added to the package's `dbt_project.yml`.

I'm exploring whether there's an alternative, but what are your thoughts? Would that be okay, or do we need to find a way to keep these models disabled when someone doesn't set those variables?

Collaborator:

Shoot, yeah, that's a problem, because I don't think many package users use those advanced features or set those variables.

Can you look at alternatives?

Collaborator Author:

I found an alternative that boils down to creating more repo environment variables and updating the conditional enabled/disabled logic in some models to look for those environment variables.

However, that brings back an earlier problem: the project is unable to compile because unit tests are defined for models that are disabled. There's an open issue and PR related to this (with commits as recent as last week), so a fix should be in place soon 🤞.

What are your thoughts on waiting for that fix before merging this PR? The alternative would be to remove the dbt unit tests and add the pytest tests back in for the few models that are only enabled when the conversion events and derived user/session properties variables are set. If we go that route, I'd add a TODO for replacing those pytest tests with dbt unit tests.

Collaborator Author:

Hey @adamribaudo-velir! I wanted to give an update: the PR I mentioned in my comment above was merged last week. That means we're probably at least a couple of weeks out from it being included in a dbt release, but I don't think there's any rush to get my PR merged, so that should be fine. Let me know what you think though!

Collaborator:

(Initially replied in the main thread)

Sounds great! Yes, let's wait until this is released in dbt-core. Thanks for staying on top of it.


# Unit Testing

This package uses `pytest` as a method of unit testing individual models. More details can be found in the [unit_tests/README.md](unit_tests) folder.
The dbt-ga4 package treats each model and macro as a 'unit' of code. If we fix the input to each unit, we can test that we received the expected output.

This package currently uses a combination of dbt unit tests and `pytest` as a method of unit testing individual models. The remaining `pytest` unit test will be refactored to a dbt unit test when possible - progress on the bug preventing that work can be tracked [here](https://github.com/dbt-labs/dbt-core/issues/10353).

### dbt unit tests

dbt's documentation on unit tests can be found [here](https://docs.getdbt.com/docs/build/unit-tests). Unit tests are performed the same way other types of dbt tests are executed.

Execute a specific test:
```
dbt test -s <test_name>
```
Execute all tests configured for a model:
```
dbt test -s <model_name>
```
Execute all dbt unit tests:
```
dbt test -s test_type:unit
```

### pytest

More details on using `pytest` for unit testing can be found in the [unit_tests/README.md](unit_tests) folder.

# Overriding Default Channel Groupings

1 change: 1 addition & 0 deletions TODO.md
@@ -22,6 +22,7 @@
- Configuration and dynamic templates to create custom event tables and dimensions
- Configuration to create custom dimensions (session, user, event_*) from event parameters
- Use Fivetran's `union_data` method (or something similar) to handle multiple, unioned GA4 exports. https://github.com/fivetran/dbt_xero_source/blob/main/models/tmp/stg_xero__account_tmp.sql
- Un-comment unit test in `stg_ga4__sessions_traffic_sources_last_non_direct_daily.yml` once [this bug](https://github.com/dbt-labs/dbt-core/issues/10353) is resolved. Once that is complete, the `unit_tests` folder pertaining to the `pytest` unit tests should be removed along with the `pytest_run_all` job in `run_unit_tests_on_pr.yml`.

## Misc

17 changes: 17 additions & 0 deletions dbt_project.yml
@@ -8,6 +8,23 @@ seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

profile: 'default'

# These variables are used for unit tests during CI for the package
# These variables should either be removed or updated to reflect the needs of your GA data and dbt project
vars:
  start_date: "20230306"
  static_incremental_days: 3
  derived_session_properties:
    - event_parameter: "page_location"
      session_property_name: "most_recent_page_location"
      value_type: "string_value"
  derived_user_properties:
    - event_parameter: "page_title"
      user_property_name: "most_recent_page_title"
      value_type: "string_value"
  conversion_events: ['large_button_clicked', 'add_to_cart']

target-path: "target" # directory which will store compiled SQL files
clean-targets: # directories to be removed by `dbt clean`
  - "target"
3 changes: 2 additions & 1 deletion macros/base_select.sql
@@ -36,7 +36,8 @@
 , ecommerce.transaction_id
 , items
 , {%- if var('combined_dataset', false) != false %} cast(left(regexp_replace(_table_suffix, r'^(intraday_)?\d{8}', ''), 100) as int64)
-{%- else %} {{ var('property_ids')[0] }}
+{%- elif var('property_ids', false) != false %} {{ var('property_ids')[0] }}
+{%- else %} {{ env_var('BIGQUERY_PROPERTY_ID') }}
 {%- endif %} as property_id
 {% endmacro %}
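
The branch order in this macro change can be sketched in Python to make the precedence explicit. This is only an illustration: `resolve_property_id`, `project_vars`, and `env` are hypothetical stand-ins for the macro, dbt's `var()`, and dbt's `env_var()`, and the sketch assumes that a combined-dataset table suffix is the 8-digit date followed by the property ID, as the regular expression in the macro implies.

```python
import re


def resolve_property_id(table_suffix, project_vars, env):
    """Hypothetical helper mirroring the macro's three branches (not part of dbt-ga4)."""
    if project_vars.get("combined_dataset", False) is not False:
        # Drop an optional 'intraday_' prefix and the 8-digit date;
        # the remainder (capped at 100 characters) is read as the property ID.
        remainder = re.sub(r"^(intraday_)?\d{8}", "", table_suffix)
        return int(remainder[:100])
    if project_vars.get("property_ids", False) is not False:
        # Project variable set: use the first configured property ID.
        return int(project_vars["property_ids"][0])
    # Neither variable set: fall back to the environment variable used in CI.
    return int(env["BIGQUERY_PROPERTY_ID"])
```

Because the environment variable is only read in the last branch, a project that sets `property_ids` or `combined_dataset` never needs `BIGQUERY_PROPERTY_ID` to exist.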

8 changes: 5 additions & 3 deletions models/staging/base/base_ga4__events.sql
@@ -21,9 +21,11 @@ with source as (
 select
 {{ ga4.base_select_source() }}
 from {{ source('ga4', 'events') }}
-where cast(left(replace(_table_suffix, 'intraday_', ''), 8) as int64) >= {{var('start_date')}}
-{% if is_incremental() %}
-and parse_date('%Y%m%d', left(replace(_table_suffix, 'intraday_', ''), 8)) in ({{ partitions_to_replace | join(',') }})
+{% if not flags.EMPTY %}
+where cast(left(replace(_table_suffix, 'intraday_', ''), 8) as int64) >= {{var('start_date')}}
+{% if is_incremental() %}
+and parse_date('%Y%m%d', left(replace(_table_suffix, 'intraday_', ''), 8)) in ({{ partitions_to_replace | join(',') }})
+{% endif %}
 {% endif %}
 ),
 renamed as (
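
The effect of the new `flags.EMPTY` guard on the table-suffix filter can be sketched as follows. This is a rough Python model of the rendered `where` clause, not the model's SQL: `keep_shard` and its parameters are hypothetical names, and `partitions_to_replace` is represented here as a set of Python dates rather than the SQL expressions the model builds.

```python
from datetime import date, datetime


def suffix_to_date(table_suffix: str) -> date:
    # Same steps as the SQL: drop 'intraday_', keep the first 8 characters, parse as YYYYMMDD.
    return datetime.strptime(table_suffix.replace("intraday_", "")[:8], "%Y%m%d").date()


def keep_shard(table_suffix, start_date, partitions_to_replace=None, empty=False):
    """Hypothetical helper: would this clause keep rows from the given table shard?"""
    if empty:
        # Under --empty the where clause is not rendered, so it filters nothing.
        return True
    shard_date = suffix_to_date(table_suffix)
    if int(shard_date.strftime("%Y%m%d")) < int(start_date):
        return False
    if partitions_to_replace is not None:
        # Incremental run: only the partitions being replaced are read.
        return shard_date in partitions_to_replace
    return True
```

Passing `partitions_to_replace=None` stands in for a non-incremental run, where only the `start_date` comparison applies.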
6 changes: 4 additions & 2 deletions models/staging/src_ga4.yml
@@ -4,11 +4,13 @@ sources:
   - name: ga4
     database: | # Source from target.project if multi-property, otherwise source from source_project
       {%- if var('combined_dataset', false) != false -%} {{target.project}}
-      {%- else -%} {{var('source_project')}}
+      {%- elif var('source_project', false) != false -%} {{var('source_project')}}
+      {%- else -%} {{env_var('BIGQUERY_PROJECT')}}
       {%- endif -%}
     schema: | # Source from combined property dataset if set, otherwise source from original GA4 property
       {%- if var('combined_dataset', false) != false -%} {{var('combined_dataset')}}
-      {%- else -%} analytics_{{var('property_ids')[0]}}
+      {%- elif var('property_ids', false) != false -%} analytics_{{var('property_ids')[0]}}
+      {%- else -%} analytics_{{env_var('BIGQUERY_PROPERTY_ID')}}
       {%- endif -%}
     tables:
       - name: events
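
For reviewers who want to check the fallback order of the source definition, here is a small Python sketch of how `database` and `schema` would resolve. The function `resolve_source` and its arguments are hypothetical; they stand in for the Jinja above, with `project_vars` playing the role of `var()`, `env` the role of `env_var()`, and `target_project` the role of `target.project`.

```python
def resolve_source(project_vars, env, target_project):
    """Hypothetical helper returning (database, schema) for the ga4 source."""
    if project_vars.get("combined_dataset", False) is not False:
        # Multi-property setup: read the combined dataset in the target project.
        return target_project, project_vars["combined_dataset"]

    if project_vars.get("source_project", False) is not False:
        database = project_vars["source_project"]
    else:
        database = env["BIGQUERY_PROJECT"]  # CI fallback

    if project_vars.get("property_ids", False) is not False:
        property_id = project_vars["property_ids"][0]
    else:
        property_id = env["BIGQUERY_PROPERTY_ID"]  # CI fallback

    return database, f"analytics_{property_id}"
```

Note that the two fallbacks are independent: a project could set `source_project` and still pick up the property ID from the environment, or the reverse.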
18 changes: 17 additions & 1 deletion models/staging/stg_ga4__client_key_first_last_events.yml
@@ -7,4 +7,20 @@ models:
      - name: client_key
        description: Hashed combination of user_pseudo_id and stream_id
        tests:
          - unique
unit_tests:
  - name: test_stg_ga4__client_key_first_last_events
    description: Test pulling the first and last event per client key
    model: stg_ga4__client_key_first_last_events
    given:
      - input: ref('stg_ga4__events')
        format: csv
        rows: |
          stream_id,client_key,event_key,event_timestamp
          1,IX+OyYJBgjwqML19GB/XIQ==,H06dLW6OhNJJ6SoEPFsSyg==,1661339279816517
          1,IX+OyYJBgjwqML19GB/XIQ==,gt1SoAtrxDv33uDGwVeMVA==,1661339279816518
    expect:
      format: csv
      rows: |
        client_key,first_event,last_event
        IX+OyYJBgjwqML19GB/XIQ==,H06dLW6OhNJJ6SoEPFsSyg==,gt1SoAtrxDv33uDGwVeMVA==
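
What this unit test asserts can be restated in a few lines of Python: for each `client_key`, take the `event_key` with the smallest and the largest `event_timestamp`. The function below is a hypothetical reference implementation written for this review, not the model's SQL, and it ignores tie-breaking between events with identical timestamps.

```python
def first_last_events(rows):
    """rows: iterable of dicts with client_key, event_key and event_timestamp."""
    by_client = {}
    for row in rows:
        entry = by_client.setdefault(row["client_key"], {"first": row, "last": row})
        if row["event_timestamp"] < entry["first"]["event_timestamp"]:
            entry["first"] = row
        if row["event_timestamp"] > entry["last"]["event_timestamp"]:
            entry["last"] = row
    return [
        {
            "client_key": client_key,
            "first_event": entry["first"]["event_key"],
            "last_event": entry["last"]["event_key"],
        }
        for client_key, entry in by_client.items()
    ]


# The two fixture rows from the unit test above.
fixture = [
    {"client_key": "IX+OyYJBgjwqML19GB/XIQ==", "event_key": "H06dLW6OhNJJ6SoEPFsSyg==", "event_timestamp": 1661339279816517},
    {"client_key": "IX+OyYJBgjwqML19GB/XIQ==", "event_key": "gt1SoAtrxDv33uDGwVeMVA==", "event_timestamp": 1661339279816518},
]
result = first_last_events(fixture)
```

With the fixture rows, `result` contains a single row whose `first_event` and `last_event` match the `expect` block of the test.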
18 changes: 17 additions & 1 deletion models/staging/stg_ga4__client_key_first_last_pageviews.yml
@@ -7,4 +7,20 @@ models:
      - name: client_key
        description: Hashed combination of user_pseudo_id and stream_id
        tests:
          - unique
unit_tests:
  - name: test_stg_ga4__client_key_first_last_pageviews
    description: Test pulling the first and last page view per client key
    model: stg_ga4__client_key_first_last_pageviews
    given:
      - input: ref('stg_ga4__event_page_view')
        format: csv
        rows: |
          stream_id,client_key,event_key,event_timestamp,page_location
          1,IX+OyYJBgjwqML19GB/XIQ==,H06dLW6OhNJJ6SoEPFsSyg==,1661339279816517,A
          1,IX+OyYJBgjwqML19GB/XIQ==,gt1SoAtrxDv33uDGwVeMVA==,1661339279816518,B
    expect:
      format: csv
      rows: |
        client_key,first_page_view_event_key,last_page_view_event_key,first_page_location,last_page_location
        IX+OyYJBgjwqML19GB/XIQ==,H06dLW6OhNJJ6SoEPFsSyg==,gt1SoAtrxDv33uDGwVeMVA==,A,B
37 changes: 36 additions & 1 deletion models/staging/stg_ga4__derived_session_properties.yml
@@ -8,4 +8,39 @@ models:
    columns:
      - name: session_key
        tests:
          - unique
unit_tests:
  - name: test_derived_session_properties
    description: Test whether a derived property is successfully retrieved from multiple event payloads
    model: stg_ga4__derived_session_properties
    given:
      - input: ref('stg_ga4__events')
        format: sql
        rows: |
          select
            'AAA' as session_key
            , 1617691790431476 as event_timestamp
            , 'first_visit' as event_name
            , ARRAY[STRUCT('my_param' as key, STRUCT(1 as int_value) as value)] as event_params
            , ARRAY[STRUCT('my_property' as key, STRUCT('value1' as string_value) as value)] as user_properties
          union all
          select
            'AAA' as session_key
            , 1617691790431477 as event_timestamp
            , 'first_visit' as event_name
            , ARRAY[STRUCT('my_param' as key, STRUCT(2 as int_value) as value)] as event_params
            , ARRAY[] as user_properties
          union all
          select
            'BBB' as session_key
            , 1617691790431477 as event_timestamp
            , 'first_visit' as event_name
            , ARRAY[STRUCT('my_param' as key, STRUCT(1 as int_value) as value)] as event_params
            , ARRAY[STRUCT('my_property' as key, STRUCT('value2' as string_value) as value)] as user_properties
    expect:
      format: dict
      rows:
        - {session_key: AAA, my_derived_property: 2, my_derived_property2: value1}
        - {session_key: BBB, my_derived_property: 1, my_derived_property2: value2}
    overrides:
      vars: {derived_session_properties: [{event_parameter: 'my_param',session_property_name: 'my_derived_property',value_type: 'int_value'},{user_property: 'my_property',session_property_name: 'my_derived_property2',value_type: 'string_value'}]}
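
The expected rows of this test imply a "most recent non-null value per session" rule: session AAA takes `my_param` from the later event (2) but keeps `my_property` from the earlier one (`value1`), because the later event carries no user properties. A minimal Python sketch of that rule follows. It is inferred from the fixture only and is not the model's implementation; `derive_session_properties` is a hypothetical name, and the nested key/value structs are flattened into plain dicts for brevity (so `value_type` is ignored).

```python
def derive_session_properties(events, config):
    """Hypothetical helper: last non-null value per session for each configured property."""
    result = {}
    for event in sorted(events, key=lambda e: e["event_timestamp"]):
        session = result.setdefault(event["session_key"], {})
        for prop in config:
            if "event_parameter" in prop:
                value = event["event_params"].get(prop["event_parameter"])
            else:
                value = event["user_properties"].get(prop["user_property"])
            if value is not None:
                # Later events overwrite earlier ones; missing values leave the earlier value in place.
                session[prop["session_property_name"]] = value
    return result


# Same configuration as the `overrides.vars` block, minus value_type.
config = [
    {"event_parameter": "my_param", "session_property_name": "my_derived_property"},
    {"user_property": "my_property", "session_property_name": "my_derived_property2"},
]
# Same three events as the SQL fixture, flattened into dicts.
events = [
    {"session_key": "AAA", "event_timestamp": 1617691790431476, "event_params": {"my_param": 1}, "user_properties": {"my_property": "value1"}},
    {"session_key": "AAA", "event_timestamp": 1617691790431477, "event_params": {"my_param": 2}, "user_properties": {}},
    {"session_key": "BBB", "event_timestamp": 1617691790431477, "event_params": {"my_param": 1}, "user_properties": {"my_property": "value2"}},
]
derived = derive_session_properties(events, config)
```

Here `derived` maps each session key to the same property values listed in the test's `expect` rows.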
34 changes: 33 additions & 1 deletion models/staging/stg_ga4__derived_user_properties.yml
@@ -7,4 +7,36 @@ models:
      - name: client_key
        description: Hashed combination of user_pseudo_id and stream_id
        tests:
          - unique
unit_tests:
  - name: test_derived_user_properties
    description: Test whether a derived user property is successfully retrieved from multiple event payloads
    model: stg_ga4__derived_user_properties
    given:
      - input: ref('stg_ga4__events')
        format: sql
        rows: |
          select
            'AAA' as client_key
            , 1617691790431476 as event_timestamp
            , 'first_visit' as event_name
            , ARRAY[STRUCT('my_param' as key, STRUCT(1 as int_value) as value)] as event_params
          union all
          select
            'AAA' as client_key
            , 1617691790431477 as event_timestamp
            , 'first_visit' as event_name
            , ARRAY[STRUCT('my_param' as key, STRUCT(2 as int_value) as value)] as event_params
          union all
          select
            'BBB' as client_key
            , 1617691790431477 as event_timestamp
            , 'first_visit' as event_name
            , ARRAY[STRUCT('my_param' as key, STRUCT(1 as int_value) as value)] as event_params
    expect:
      format: dict
      rows:
        - {client_key: AAA, my_derived_property: 2}
        - {client_key: BBB, my_derived_property: 1}
    overrides:
      vars: {derived_user_properties: [{event_parameter: 'my_param',user_property_name: 'my_derived_property',value_type: 'int_value'}]}
21 changes: 20 additions & 1 deletion models/staging/stg_ga4__event_to_query_string_params.yml
@@ -3,4 +3,23 @@ version: 2
models:
  - name: stg_ga4__event_to_query_string_params
    description: This model pivots the query string parameters contained within the event's page_location field to become rows. Each row is a single parameter/value combination contained in a single event's query string.

unit_tests:
  - name: test_stg_ga4__event_to_query_string_params
    description: Test whether event query strings are flattened for each query string parameter
    model: stg_ga4__event_to_query_string_params
    given:
      - input: ref('stg_ga4__events')
        format: csv
        rows: |
          event_key,page_query_string
          aaa,param1=value1&param2=value2
          bbb,param1
          ccc,param1=
    expect:
      format: csv
      rows: |
        event_key,param,value
        aaa,param1,value1
        aaa,param2,value2
        bbb,param1,
        ccc,param1,
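
The three fixture cases (a full pair, a bare parameter, and a parameter with an empty value) can be illustrated with a short Python sketch of the flattening step. `query_string_to_rows` is a hypothetical helper written for this review, not the model's SQL. It distinguishes a missing value (`None`) from an empty one (`""`), which the CSV fixture above cannot express, so the sketch makes no claim about which of the two the model returns.

```python
def query_string_to_rows(event_key, page_query_string):
    """Hypothetical helper: one (event_key, param, value) row per query string parameter."""
    rows = []
    for pair in page_query_string.split("&"):
        # partition splits on the first '=' only, so values may themselves contain '='.
        param, separator, value = pair.partition("=")
        rows.append((event_key, param, value if separator else None))
    return rows


rows = (
    query_string_to_rows("aaa", "param1=value1&param2=value2")
    + query_string_to_rows("bbb", "param1")
    + query_string_to_rows("ccc", "param1=")
)
```

The standard library's `urllib.parse.parse_qsl` is deliberately avoided here because it drops blank values unless `keep_blank_values=True` is passed, which would hide the `bbb` and `ccc` cases.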