You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The default models assume that the web_page_context data will be landed in a separate table to the raw events. On Snowflake that is not the case; the data is inside a variant field. Could some models for Snowflake be skipped or made unnecessary by just flattening out the field in the source events something like CONTEXTS_COM_SNOWPLOWANALYTICS_SNOWPLOW_WEB_PAGE_1[0]['id'] as PAGE_VIEW_ID ? Has that been considered?
-- This one is a little tougher to make incremental
-- because there's no timestamp field here. We could
-- relocate the event collector_tstamp (by root_id)
-- onto snowplow_base_web_page_context, but that would
-- likely mitigate any performance gains!
I have found it necessary anyway to dedupe the data in my raw events table, because if I don't then the snowplow_sessions_tmp model fails on merge.
Describe alternatives you've considered
To achieve that deduping in my raw events, this was recommended ( by helpful people on the slack channel ):
QUALIFY
row_number() over (partition by EVENT_ID order by COLLECTOR_TSTAMP, DERIVED_TSTAMP) = 1;
That works but is incredibly slow so I have had to implement it as incremental, using the collector_tstamp as a watermark.
My concern is with the web_page_context model being separated from the events, not incremental, only to be later recombined into a number of other models.
If you think there is merit to this I can look at trying to put in Snowflake specific changes to optimise the model flow for that system. Otherwise if this has already been considered and ruled out could you let me know why it might be a bad idea?
Additional context
This is Snowflake specific.
Who will this benefit?
Those with huge amounts of evergrowing events from Snowplow pipelines.
The text was updated successfully, but these errors were encountered:
Sorry for the serious delay in responding here, @davehowell. This is something I've wanted to do for some time (#52). The potential time (+ cost) savings on Snowflake are immense. I think the required pieces are extensive enough that it could (+ should) be part of a larger refactor.
Unfortunately I have moved on from the previous project and don't have access to Snowplow data to look at this again, but I recall that I created a few override snowplow models in the main dbt project in order to bypass some of this, as well as adding those flattened out fields into some of the special "include" fields that are available as vars.
If you don't think it's too much to bite off at one time then sure include it as part of your larger refactor. I have to say the Snowplow models are pretty complicated, a lot to get your head around, and this particular issue is very Snowflake specific so might be easier to do ( and more importantly for the reviewer!) separately. If not then I guess this can be closed.
I created a few override snowplow models in the main dbt project in order to bypass some of this, as well as adding those flattened out fields into some of the special "include" fields that are available as vars.
Honestly, this is what I've done every time I use Snowflake + Snowplow together. It would obviously be better if the package could "just work" on Snowflake in the way we know to be best.
Describe the feature
The default models assume that the
web_page_context
data will be landed in a separate table to the raw events. On Snowflake that is not the case; the data is inside a variant field. Could some models for Snowflake be skipped or made unnecessary by just flattening out the field in the source events something likeCONTEXTS_COM_SNOWPLOWANALYTICS_SNOWPLOW_WEB_PAGE_1[0]['id'] as PAGE_VIEW_ID
? Has that been considered?The snowplow_web_page_context model contains this comment:
I have found it necessary anyway to dedupe the data in my raw events table, because if I don't then the
snowplow_sessions_tmp
model fails on merge.Describe alternatives you've considered
To achieve that deduping in my raw events, this was recommended ( by helpful people on the slack channel ):
That works but is incredibly slow so I have had to implement it as incremental, using the collector_tstamp as a watermark.
My concern is with the
web_page_context
model being separated from the events, not incremental, only to be later recombined into a number of other models.If you think there is merit to this I can look at trying to put in Snowflake specific changes to optimise the model flow for that system. Otherwise if this has already been considered and ruled out could you let me know why it might be a bad idea?
Additional context
This is Snowflake specific.
Who will this benefit?
Those with huge amounts of evergrowing events from Snowplow pipelines.
The text was updated successfully, but these errors were encountered: