Deduplication in snowplow_web_page_context drops all events that have a duplicate #100

philipherrmann opened this issue Jan 4, 2021 · 1 comment

If I'm not mistaken, the deduplication implemented here
https://github.com/fishtown-analytics/snowplow/blob/3795d06f365213ca4930d2447bd1580cb7031557/models/page_views/default/snowplow_web_page_context.sql#L43
drops all events that have a duplicated event_id (named root_id there), instead of keeping only the first of them. That seems strange to me; is there a reason?
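
For illustration, here's the kind of keep-first dedupe I would have expected instead (a rough sketch with hypothetical table and column names, not the package's actual code):

```sql
with ranked as (

    select
        root_id,
        page_view_id,
        root_tstamp,
        row_number() over (
            partition by root_id
            order by root_tstamp
        ) as dedupe_rank
    from web_page_context

)

-- keep exactly one row per root_id (the earliest),
-- rather than dropping every root_id that appears more than once
select root_id, page_view_id, root_tstamp
from ranked
where dedupe_rank = 1
```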

Also, for BigQuery it would be quite easy to implement an incremental model for this, since all the timestamps are there, right? Should I try to submit a PR?
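
A rough sketch of what I have in mind (model and variable names are hypothetical; `is_incremental()` and the incremental materialization are standard dbt):

```sql
{{ config(materialized='incremental', unique_key='root_id') }}

select
    root_id,
    -- approximate "keep the first" by taking the minimum per root_id
    min(page_view_id) as page_view_id,
    min(root_tstamp)  as root_tstamp
from {{ var('snowplow:context:web_page') }}

{% if is_incremental() %}
-- on incremental runs, only scan context rows newer than
-- what has already been processed into this table
where root_tstamp > (select max(root_tstamp) from {{ this }})
{% endif %}

group by root_id
```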

jtcohen6 (Contributor) commented Jan 6, 2021

Hey @philipherrmann, that logic comes straight from Snowplow's original web data model for Redshift. In particular, this comment:

-- exclude all root ID with more than one page view ID

We interpreted that to mean, "Events associated with multiple different page view IDs are considered noise and should be excluded." Why it's there is a fair question, though: snowplow/snowplow-web-data-model#43
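
In other words, the filter amounts to something like this (paraphrasing with hypothetical column names, not the model's exact SQL):

```sql
select root_id, max(page_view_id) as page_view_id
from web_page_context
group by root_id
-- any root_id associated with two or more distinct page view IDs
-- is treated as noise and excluded entirely
having count(distinct page_view_id) = 1
```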

To be honest, that's a place where we've deferred to the Snowplow folks. They've just released a new web model (SQL transformations for Redshift); I took a look and couldn't find this exact logic replicated there, though there were several other steps in which duplicated event IDs and page view IDs are removed entirely.

> Also, for BigQuery it would be quite easy to implement an incremental model for this, since all the timestamps are there, right? Should I try to submit a PR?

The snowplow_web_page_context model actually does not run on BigQuery. Instead, page view IDs are pulled straight into the snowplow_page_views model, without any deduplication:
https://github.com/fishtown-analytics/snowplow/blob/f24a2bf91d4ce44f789d1cae0e33d85aa7f8eb58/models/page_views/bigquery/snowplow_page_views.sql#L62-L66
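
Roughly, that pattern looks like the following (paraphrased; on BigQuery the web page context arrives as a repeated record column, and the exact column name depends on the loader version):

```sql
select
    event_id,
    -- take the id from the first (and normally only) element
    -- of the web page context repeated record
    contexts_com_snowplowanalytics_snowplow_web_page_1_0_0[safe_offset(0)].id
        as page_view_id
from events
```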

Why? At this point, I'm not quite sure. I think we should aim for consistency, and presently that feels like removing this logic across the board.
