
Merge upstream #3 (Open)

wants to merge 84 commits into base: main
Conversation

ivan-toriya-precis (Collaborator)

Description & motivation

Checklist

  • I have verified that these changes work locally
  • I have updated the README.md (if applicable)
  • I have added tests & descriptions to my models (and macros if applicable)
  • I have run `dbt test` and `python -m pytest .` to validate existing tests

adamribaudo-velir and others added 30 commits March 11, 2023 09:21
* removing ga4 folder under staging

* update docs to remove extra ga4 folder

* updating references to ga4 folder in unit tests & docs
* rename traffic source fields and add docs

* more docs. breaking docs out into separate yml files per model

* finished moving docs out of stg_ga4.yml

* started renaming acquisition fields

* partial refactoring of acquisition fields

* more refactoring of event level acquisition fields

* finished refactoring. project runs

* fix unit test

* doc updates

* fix column name and test

* Multi property support using copy command (#150)

* macro to handle combined properties

* doc updates

* full refresh works. requires dbt_utils

* incremental run can handle if there are missing shards

* don't run pre-hook if property_ids var is not set

* added stream_id to all mart models

* fixing unit test

* exclude intraday table

I didn't have any problems without this exclude, but I figured it would be safer to exclude the intraday table because this 'clone' method doesn't currently support daily + streaming exports.
* Add 'Push Notifications' default channel group

The rule matches the current definition in https://support.google.com/analytics/answer/9756891, i.e. medium ends with "push", or medium contains "mobile" or "notification", or source exactly matches "firebase".

I've changed the name from "Mobile Push Notifications" to just "Push Notifications" because these kinds of notifications are also available in browsers and might have nothing to do with mobile devices.
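The rule described above can be paraphrased as a small predicate. This is a Python sketch of the stated logic only, not the package's actual SQL CASE expression:

```python
def is_push_notification(source: str, medium: str) -> bool:
    """Classify traffic as "Push Notifications" per the rule above:
    medium ends with "push", or medium contains "mobile" or
    "notification", or source exactly matches "firebase"."""
    medium = medium.lower()
    if medium.endswith("push"):
        return True
    if "mobile" in medium or "notification" in medium:
        return True
    return source.lower() == "firebase"
```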

* Add Push Notifications to test

---------

Co-authored-by: Guille <[email protected]>
…mentation in dbt_project.yml (should be in README.md)
The following error is corrected.
```
Parsing Error
  Error reading ga4: staging/base/base_ga4__events.yml - Runtime Error
    Syntax error near line 45
    ------------------------------
    42 |       - name: platform
    43 |         description: The data stream platform (Web, IOS or Android) from which the event originated.
    44 |       - name: privacy_info_analytics_storage
    45 |         description:   Whether Analytics storage is enabled for the user. Possible values are Yes, No, Unset
    46 |       - name: privacy_info_ads_storage
    47 |         description: Whether ad targeting is enabled for a user. Possible values are Yes, No, Unset
    
    Raw Error:
    ------------------------------
    while scanning for the next token
    found character '\t' that cannot start any token
      in "<unicode string>", line 45, column 22:
                description:    Whether Analytics storage is en ... 
```
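YAML's scanner rejects tab characters as token whitespace, which is exactly what the error above reports. A throwaway helper for locating offending tabs in a file (a standalone script, not part of the package) might look like:

```python
def find_tabs(yaml_text: str):
    """Return (line_number, column) pairs for every tab character,
    the character YAML's scanner rejects ("found character '\\t'")."""
    return [
        (lineno, line.index("\t") + 1)
        for lineno, line in enumerate(yaml_text.splitlines(), start=1)
        if "\t" in line
    ]
```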
* new macro and docs

* updated docs to indicate global namespace is required for custom events
* test code

* working time_on_page

* switch from time on page to engagement_time

* fix bad reference

* switch to total engagement time with denominator

* move engagement calculations outside of stg_ga4__page_view

* Added page_engagement_key documentation to stg_ga4.yml

* documentation

* restore stg_ga4.yml

* create stg_ga4__page_engaged_time.yml file

* delete tab (#155)

The following error is corrected.
```
Parsing Error
  Error reading ga4: staging/base/base_ga4__events.yml - Runtime Error
    Syntax error near line 45
    ------------------------------
    42 |       - name: platform
    43 |         description: The data stream platform (Web, IOS or Android) from which the event originated.
    44 |       - name: privacy_info_analytics_storage
    45 |         description:   Whether Analytics storage is enabled for the user. Possible values are Yes, No, Unset
    46 |       - name: privacy_info_ads_storage
    47 |         description: Whether ad targeting is enabled for a user. Possible values are Yes, No, Unset
    
    Raw Error:
    ------------------------------
    while scanning for the next token
    found character '\t' that cannot start any token
      in "<unicode string>", line 45, column 22:
                description:    Whether Analytics storage is en ... 
```

* update version for point release

* fix merge artifacts and remove hour from pages mart

* Set "Direct" instead of "(none)" as default (#157)

* Macro to facilitate easy creation of custom events + params (#154)

* new macro and docs

* updated docs to indicate global namespace is required for custom events

* Add recommended event model for "select_content" event. (#156)

* remove hour from page_key

* updating local main branch from remote

---------

Co-authored-by: ken kuwabara <[email protected]>
Co-authored-by: Adam Ribaudo <[email protected]>
Co-authored-by: Clemens Siebenhaar <[email protected]>
Co-authored-by: Adam Ribaudo <[email protected]>
Co-authored-by: Liam O'Boyle <[email protected]>
* base_select macro and combine property enhancement

* adapter dispatch pattern for base select macros

* add back in unit test .env example

---------

Co-authored-by: Adam Ribaudo <[email protected]>
…#176)

* fix to ensure that data is sourced from source db, not target db

* doc updates
…nsional modeling on very large installs (#175)

* incremental version of stg_ga4__sessions_traffic_sources_daily.sql

* incremental stg_ga4__session_conversions_daily

* daily session fact table and doc updates

* started daily session dim model

* working dim_sessions_daily model

* adding user_pseudo_id to allow for joins

* doc updates

* readme updates
Documentation updates based on some feedback that the "Multi-Property Support" section was unclear
* updating local main branch from remote

* updating local main branch from remote

* restore create_custom_event macro

* Restoring Readme updates

* Update dbt_project.yml

---------

Co-authored-by: Adam Ribaudo <[email protected]>
Infinitesimal GitHub cosmetics update
* Replace user_pseudo_id with client_key

* Remove stream_id from fct_ga4__user_ids

* Add descriptions for client_key throughout package

* restore stream_id to several models

* Move client_key creation to stg_ga4__events

---------

Co-authored-by: Adam Ribaudo <[email protected]>
* refactor combined_property_data

* typo
* multi-site support for streaming frequency

* tweaking documentation

---------

Co-authored-by: Adam Ribaudo <[email protected]>
adamribaudo-velir and others added 30 commits August 3, 2023 07:12
* add adapter dispatch pattern for path extraction

* update python version in unit test workflow
Removing confusion around grouping by hour
* initial transaction dedup concept

* first test

* working copy

* performance optimization

* docs

* code cleanup
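The dedup concept in these commits boils down to keeping a single purchase event per transaction_id. A minimal Python illustration, assuming timestamp-ordered "keep the first" semantics (the package's actual window ordering may differ):

```python
def dedup_transactions(purchases):
    """Keep only the earliest event per transaction_id.
    (Hypothetical field names; ordering rule is an assumption.)"""
    seen, out = set(), []
    for event in sorted(purchases, key=lambda e: e["event_timestamp"]):
        tid = event["transaction_id"]
        if tid not in seen:
            seen.add(tid)
            out.append(event)
    return out
```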
* Align default channel groupings with GA4 docs

* rename unit test class

* replace curly quotes with straight quotes

---------

Co-authored-by: Adam Ribaudo <[email protected]>
* item and ecommerce field fix test concept

* add missing parameters and readme update

* restore missing fields

* Will move the readme changes to the release notes

and will make a new 5.1 release

---------

Co-authored-by: Adam Ribaudo <[email protected]>
fix spacing that causes a syntax error when copied and pasted into a yml configuration
* use target.project for destination

* update project var to source_project

* dynamic source based on combined_dataset var

* doc updates

* conditionally apply prehook

* fix quotes

* dynamically swap source project as well as dataset
* fix fct_ga4__pages model and add tests to catch issue

* remove page_path and page_title from fct_ga4__pages
Updated description of fct_ga4__pages

* Fixed indentation

* Fix to filter based on dates equal to or later than the start_date
* Fix uniqueness test for fct_ga4__pages

Multiple records can have the same event_date_dt and page_location.
Adding stream_id as a condition makes them unique.

> Data stream: Lives within a property, and is the source of data from an app or website. The best practice is to use a maximum of 3 data streams per property: 1 single web data stream to measure the web user journey and 1 app data stream each for iOS and Android.
>
> [[GA4] Google Analytics account structure - Analytics Help](https://support.google.com/analytics/answer/9679158?hl=en)
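One way to express such a uniqueness test is with `dbt_utils.unique_combination_of_columns`; a sketch of the schema.yml entry (the test as actually written in the package may differ):

```yml
models:
  - name: fct_ga4__pages
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - event_date_dt
            - page_location
            - stream_id
```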

* Update README.md

Updated description of fct_ga4__pages.
* fix cpc classification and add test

* broaden test scope

* test original_page_location
* Fix error when setting a large number of properties

Bugfix

Fix #269.

This change greatly reduces the likelihood of an error when specifying a large number of property_ids in `ga4.combine_property_data()`.

* Fixed the following bug
  * Changed to copy a table for each property_id

dbt_project.yml

```yml
vars:
    ga4:
        source_project: source-project-id
        property_ids: [
          000000001
          , 000000002
          , ...
          , 000000040
        ]
        start_date: 20210101
        static_incremental_days: 3
        combined_dataset: combined_dataset_name
```

```shell
$ dbt run -s base_ga4__events --full-refresh
06:51:19  Running with dbt=1.5.0
06:52:05  Found 999 models, 999 tests, 999 snapshots, 999 analyses, 999 macros, 999 operations, 999 seed files, 999 sources, 999 exposures, 999 metrics, 999 groups
06:52:06
06:52:14  Concurrency: 4 threads (target='dev')
06:52:14
06:52:14  1 of 1 START sql view model dataset_name.base_ga4__events ......... [RUN]
06:56:17  BigQuery adapter: https://console.cloud.google.com/bigquery?project=project-id&j=bq:asia-northeast1:????????-????-????-????-????????????&page=queryresults
06:56:17  1 of 1 ERROR creating sql view model dataset_name.base_ga4__events  [ERROR in 243.80s]
06:56:18
06:56:18  Finished running 1 view model in 0 hours 4 minutes and 11.62 seconds (251.62s).
06:56:22
06:56:22  Completed with 1 error and 0 warnings:
06:56:22
06:56:23  Database Error in model base_ga4__events (models/staging/base/base_ga4__events.sql)
06:56:23    The query is too large. The maximum standard SQL query length is 1024.00K characters, including comments and white space characters.
06:56:23
06:56:23  Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1
```

Merging this pull request will enable execution.

```shell
$ dbt run -s base_ga4__events --full-refresh
HH:mm:ss  Running with dbt=1.5.0
HH:mm:ss  Found 999 models, 999 tests, 999 snapshots, 999 analyses, 999 macros, 999 operations, 999 seed files, 999 sources, 999 exposures, 999 metrics, 999 groups
HH:mm:ss
HH:mm:ss  Concurrency: 4 threads (target='dev')
HH:mm:ss
HH:mm:ss  1 of 1 START sql incremental model dataset_name.base_ga4__events ... [RUN]
HH:mm:ss  Cloned from `source-project-id.analytics_000000001.events_*[20210101-20240324]` to `project-id.combined_dataset_name.events_YYYYMMDD000000001`.
HH:mm:ss  Cloned from `source-project-id.analytics_000000002.events_*[20210101-20240324]` to `project-id.combined_dataset_name.events_YYYYMMDD000000002`.
....
HH:mm:ss  Cloned from `source-project-id.analytics_000000040.events_*[20210101-20240324]` to `project-id.combined_dataset_name.events_YYYYMMDD000000040`.
HH:mm:ss  1 of 1 OK created sql incremental model dataset_name.base_ga4__events  [CREATE TABLE (? rows, ? processed) in ?]
HH:mm:ss
HH:mm:ss  Finished running 1 incremental model in ? (?).
HH:mm:ss
HH:mm:ss  Completed successfully
HH:mm:ss
HH:mm:ss  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
```
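The fix can be illustrated in miniature: instead of one combined statement whose length grows with the number of properties (and eventually exceeds BigQuery's 1024K-character query limit), emit one clone statement per property_id. A Python sketch with a hypothetical SQL shape (the macro's real DDL differs):

```python
MAX_QUERY_CHARS = 1024 * 1000  # BigQuery's standard SQL query length limit


def clone_statements(property_ids, source_project, combined_dataset, target_project):
    """Build one CLONE statement per property so that no single
    statement grows with the number of properties.
    (Table naming here is illustrative, not the macro's exact DDL.)"""
    statements = []
    for pid in property_ids:
        stmt = (
            f"CREATE TABLE `{target_project}.{combined_dataset}.events_{pid}` "
            f"CLONE `{source_project}.analytics_{pid}.events_*`"
        )
        assert len(stmt) < MAX_QUERY_CHARS  # each statement stays small
        statements.append(stmt)
    return statements
```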

---

Fixed timeout in clone operation

The following error should almost never occur now, because the clone is performed separately for each property_id.

* Removed https://github.com/Velir/dbt-ga4/blame/6.0.1/README.md#L323-L332 from README.md
* Resolved the need for the following workaround

https://github.com/Velir/dbt-ga4/blame/6.0.1/README.md#L323-L332

> Jobs that run a large number of clone operations are prone to timing out. As a result, it is recommended that you increase the query timeout if you need to backfill or full-refresh the table, when first setting up or when the base model gets modified. Otherwise, it is best to prevent the base model from rebuilding on full refreshes unless needed to minimize timeouts.
>
> ```
> models:
>   ga4:
>     staging:
>       base:
>         base_ga4__events:
>           +full_refresh: false
> ```

* Changed the implementation of combine_property_data to the minimum necessary

* Remove latest_shard_to_retrieve
* feature: add property_id

* fix: remove "intraday_" for property_id
* Add valid_column_name macro

* Update models using conversion_events

* Add tests

* Update project name in tests to call macros

* Add new macro to unit tests

* Fix invalid escape character

* Fix page test by harmonizing event name

* Fix jinja range error in test
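The idea behind a `valid_column_name` macro can be sketched in Python: map an arbitrary event or conversion name to a safe SQL identifier. The exact substitution rules here are an assumption, not the macro's implementation:

```python
import re


def valid_column_name(name: str) -> str:
    """Replace characters that are illegal in SQL identifiers
    with underscores (substitution rule is an assumption)."""
    return re.sub(r"[^0-9a-zA-Z_]", "_", name)
```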

---------

Co-authored-by: Adam Ribaudo <[email protected]>
* stashing changes

* updated tests and query parameter removal regex

* added multi-parameter text and fixed multi-parameter regex

* make remove_query_parameters macro over-ridable

* readme update
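The behaviour these commits describe, stripping a configurable list of query parameters from page locations, can be sketched in Python. Function name and semantics are assumptions based on the commit messages, not the macro's actual regex:

```python
import re


def remove_query_parameters(url: str, params: list) -> str:
    """Strip the listed query parameters from a URL, mirroring the
    intent of an over-ridable remove_query_parameters macro."""
    for p in params:
        # drop "param=value" together with its leading ? or &
        url = re.sub(rf"[?&]{re.escape(p)}=[^&#]*", "", url)
    # if the first surviving parameter lost its "?", restore it
    if "?" not in url and "&" in url:
        url = url.replace("&", "?", 1)
    return url
```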

---------

Co-authored-by: Adam Ribaudo <[email protected]>
* generate event_key even when session_key is null

* Revert "generate event_key even when session_key is null"

This reverts commit 1753ca1.

* generate event_key even when session_key is null

* update yml

* fine-tune event key

* update comment

* update inline comment
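The "generate event_key even when session_key is null" idea can be illustrated by coalescing the null to a placeholder before hashing, so every event still receives a key. Field choice and hash function here are assumptions, not the package's surrogate-key recipe:

```python
import hashlib


def event_key(session_key, event_name, event_timestamp):
    """Build a deterministic event key; a null session_key is
    coalesced to a placeholder so no event is left keyless.
    (Fields and placeholder are illustrative assumptions.)"""
    parts = [session_key or "no_session", event_name, str(event_timestamp)]
    return hashlib.md5("|".join(parts).encode()).hexdigest()
```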
Update readme in prep for release
Update version number in prep for release