Deduplication in snowplow_web_page_context drops all events that have a duplicate #100

philipherrmann opened this issue Jan 4, 2021 · 1 comment

If I'm not mistaken, the deduplication implemented here
https://github.com/fishtown-analytics/snowplow/blob/3795d06f365213ca4930d2447bd1580cb7031557/models/page_views/default/snowplow_web_page_context.sql#L43
drops all events that have a duplicated event_id (named root_id there), instead of keeping only the first of them. That seems strange to me; is there a reason?
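
For illustration, here's the kind of keep-first dedupe I would have expected instead (a rough sketch with hypothetical table and column names, not the package's actual code):

```sql
with ranked as (

    select
        root_id,
        page_view_id,
        root_tstamp,
        row_number() over (
            partition by root_id
            order by root_tstamp
        ) as dedupe_rank
    from web_page_context

)

-- keep exactly one row per root_id (the earliest),
-- rather than dropping every root_id that appears more than once
select root_id, page_view_id, root_tstamp
from ranked
where dedupe_rank = 1
```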

Also, for BigQuery it would be quite easy to implement an incremental model for this, since all the timestamps are there, right? Should I try to submit a PR?
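
A rough sketch of what I have in mind (model and variable names are hypothetical; `is_incremental()` and the incremental materialization are standard dbt):

```sql
{{ config(materialized='incremental', unique_key='root_id') }}

select
    root_id,
    -- approximate "keep the first" by taking the minimum per root_id
    min(page_view_id) as page_view_id,
    min(root_tstamp)  as root_tstamp
from {{ var('snowplow:context:web_page') }}

{% if is_incremental() %}
-- on incremental runs, only scan context rows newer than
-- what has already been processed into this table
where root_tstamp > (select max(root_tstamp) from {{ this }})
{% endif %}

group by root_id
```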

jtcohen6 (Contributor) commented Jan 6, 2021

Hey @philipherrmann, that logic comes straight from Snowplow's original web data model for Redshift. In particular, this comment:

-- exclude all root ID with more than one page view ID

We interpreted that to mean, "Events associated with multiple different page view IDs are considered noise and should be excluded." Why it's there is a fair question, though: snowplow/snowplow-web-data-model#43
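
In other words, the filter amounts to something like this (paraphrasing with hypothetical column names, not the model's exact SQL):

```sql
select root_id, max(page_view_id) as page_view_id
from web_page_context
group by root_id
-- any root_id associated with two or more distinct page view IDs
-- is treated as noise and excluded entirely
having count(distinct page_view_id) = 1
```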

To be honest, that's a place where we've deferred to the Snowplow folks. They've just released a new web model (SQL transformations for Redshift); I took a look and couldn't find this exact logic replicated there, though there were several other steps in which duplicated event IDs and page view IDs are removed entirely.

> Also, for BigQuery it would be quite easy to implement an incremental model for this, since all the timestamps are there, right? Should I try to submit a PR?

The snowplow_web_page_context model actually does not run on BigQuery. Instead, page view IDs are pulled straight into the snowplow_page_views model, without any deduplication:
https://github.com/fishtown-analytics/snowplow/blob/f24a2bf91d4ce44f789d1cae0e33d85aa7f8eb58/models/page_views/bigquery/snowplow_page_views.sql#L62-L66
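
Roughly, that pattern looks like the following (paraphrased; on BigQuery the web page context arrives as a repeated record column, and the exact column name depends on the loader version):

```sql
select
    event_id,
    -- take the id from the first (and normally only) element
    -- of the web page context repeated record
    contexts_com_snowplowanalytics_snowplow_web_page_1_0_0[safe_offset(0)].id
        as page_view_id
from events
```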

Why? At this point, I'm not quite sure. I think we should aim for consistency, and presently that feels like removing this logic across the board.
