[EPIC] Incremental Model Improvements - Microbatch #10624

Open · 58 of 70 tasks
QMalcolm opened this issue Aug 28, 2024 · 4 comments · Fixed by #10751

QMalcolm commented Aug 28, 2024

Incremental models in dbt are a materialization strategy designed to efficiently update your data warehouse tables by transforming and loading only new or changed data since the last run. Instead of processing your entire dataset every time, incremental models append or update only the new rows, significantly reducing the time and resources required for your data transformations.
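
For context, a minimal sketch of what a conventional incremental model looks like today (the events model name and created_at column are illustrative placeholders, not taken from this issue):

{{ config(materialized='incremental', unique_key='id') }}

select * from {{ ref('events') }}

{% if is_incremental() %}
-- on incremental runs, only pull rows newer than what is already in the target table
where created_at > (select max(created_at) from {{ this }})
{% endif %}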

Even with all the benefits of incremental models as they exist today, there are limitations with this approach, such as:

  • the burden is on YOU to calculate what’s “new” - what has already been loaded, what still needs to be loaded, etc.
  • it can be slow if you have many partitions to process (like when running in full-refresh mode), since it’s done in “one big” SQL statement - it can time out, and if it fails you end up needing to retry partitions that had already succeeded
  • if you want to name a specific partition for your incremental model to process, you have to add additional “hack”y logic, likely using vars (see the sketch after this list)
  • data tests run on your entire model, rather than just the "new" data
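
To illustrate the vars-based workaround mentioned above, a sketch of what people typically bolt on today (the process_date variable name and date_day column are hypothetical):

{% if is_incremental() %}
-- hypothetical workaround: the partition to (re)process is passed in explicitly
where date_day = '{{ var("process_date") }}'
{% endif %}

invoked with something like:

dbt run --select my_model --vars '{"process_date": "2024-08-01"}'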

In this project we're aiming to make incremental models easier to implement and more efficient to run.

P0s - Core Framework

Task list: 22 items, assigned to MichelleArk and QMalcolm

P0s - Adapters

Task list: 14 items, assigned to MichelleArk and QMalcolm

Beta bugs

Task list: 17 items, assigned to QMalcolm, MichelleArk, and jtcohen6

P1s

Task list: 6 items, assigned to QMalcolm and MichelleArk

P2s

@MaartenN1234

Just for my understanding: is it right that this issue seeks to address technical (performance/load) issues in models that take just a single ref as source (or, if a model has other sources as well, we assume them to be stale)?

I am looking for ways to support incremental processing of multi-table join models (e.g. https://discourse.getdbt.com/t/template-for-complex-incremental-models/10054, but I've seen many more similar help requests on community forums). To be sure, such features will not be in scope, right?

@QMalcolm (Contributor, Author)

Just for my understanding: is it right that this issue seeks to address technical (performance/load) issues in models that take just a single ref as source (or, if a model has other sources as well, we assume them to be stale)?

I am looking for ways to support incremental processing of multi-table join models (e.g. https://discourse.getdbt.com/t/template-for-complex-incremental-models/10054, but I've seen many more similar help requests on community forums). To be sure, such features will not be in scope, right?

@MaartenN1234 I'm not sure that I fully understand the question being asked. For my clarity, is the question whether this new functionality will support more than one input to an incremental model? If so, the answer is yes!

For example, say we turn the jaffle-shop customers model into an incremental microbatch model. It'd look like the following:

{{
    config(
        materialized='incremental',
        incremental_strategy='microbatch',
        unique_key='customer_id',
        event_time='created_at',
        batch_size='day'
    )
}}

with

customers as (
    select * from {{ ref('stg_customers') }}
),

orders as (
    select * from {{ ref('orders') }}
),

customer_orders_summary as (
    select
        orders.customer_id,
        count(distinct orders.order_id) as count_lifetime_orders,
        count(distinct orders.order_id) > 1 as is_repeat_buyer,
        min(orders.ordered_at) as first_ordered_at,
        max(orders.ordered_at) as last_ordered_at,
        sum(orders.subtotal) as lifetime_spend_pretax,
        sum(orders.tax_paid) as lifetime_tax_paid,
        sum(orders.order_total) as lifetime_spend
    from orders
    group by 1
),

joined as (
    select
        customers.*,
        customer_orders_summary.count_lifetime_orders,
        customer_orders_summary.first_ordered_at,
        customer_orders_summary.last_ordered_at,
        customer_orders_summary.lifetime_spend_pretax,
        customer_orders_summary.lifetime_tax_paid,
        customer_orders_summary.lifetime_spend,
        case
            when customer_orders_summary.is_repeat_buyer then 'returning'
            else 'new'
        end as customer_type
    from customers

    left join customer_orders_summary
        on customers.customer_id = customer_orders_summary.customer_id
)

select * from joined

If the models orders and stg_customers both have an event_time defined (they don't need to be incremental themselves), then they will automatically be filtered and batched by the generated event time filters.
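
For reference, setting an event_time on those upstream models could look roughly like this in their YAML config (the created_at and ordered_at column names here are assumptions for illustration):

models:
  - name: stg_customers
    config:
      event_time: created_at
  - name: orders
    config:
      event_time: ordered_at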

MaartenN1234 commented Sep 17, 2024

Just for my understanding: is it right that this issue seeks to address technical (performance/load) issues in models that take just a single ref as source (or, if a model has other sources as well, we assume them to be stale)?
I am looking for ways to support incremental processing of multi-table join models (e.g. https://discourse.getdbt.com/t/template-for-complex-incremental-models/10054, but I've seen many more similar help requests on community forums). To be sure, such features will not be in scope, right?

If the models orders and stg_customers both have an event_time defined (they don't need to be incremental themselves), then they will automatically be filtered and batched by the generated event time filters.

The critical requirement for me is that matching rows (on the join condition) in both sources are not necessarily created in the same batch. So if the filter is applied to each source independently:

select * from {{ ref('stg_customers') }} where event_time > last_processed_event_time

and

select * from {{ ref('orders') }} where event_time > last_processed_event_time

things will go wrong (e.g. if we load one more order, we lose all the previous orders from the aggregate, or when the customer data is updated while there are no new orders for that customer to process, the update will not be propagated).

To get it right, it should become somewhat like this:

select * from {{ ref('stg_customers') }}
where event_time > last_processed_event_time
   or customer_id in (select customer_id from {{ ref('orders') }} where event_time > last_processed_event_time)

and

select * from {{ ref('orders') }}
where customer_id in (
    select customer_id from {{ ref('stg_customers') }} where event_time > last_processed_event_time
    union all
    select customer_id from {{ ref('orders') }} where event_time > last_processed_event_time
)

So one needs to incorporate the join clause and the aggregation into the change detection.
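
In dbt terms, one rough sketch of that join-aware selection, using the conventional incremental strategy rather than microbatch (the updated_at, ordered_at, and loaded_at columns and the high-water-mark subqueries are assumptions for illustration only):

{{ config(materialized='incremental', unique_key='customer_id') }}

with customers_to_rebuild as (

    -- every customer touched by either change set since the last run
    select customer_id from {{ ref('stg_customers') }}
    {% if is_incremental() %}
    where updated_at > (select max(loaded_at) from {{ this }})
    {% endif %}

    union

    select customer_id from {{ ref('orders') }}
    {% if is_incremental() %}
    where ordered_at > (select max(loaded_at) from {{ this }})
    {% endif %}
),

orders_to_process as (

    -- pull all orders for those customers, not just the new ones,
    -- so the lifetime aggregates are recomputed from complete history
    select orders.*
    from {{ ref('orders') }} as orders
    join customers_to_rebuild
        on orders.customer_id = customers_to_rebuild.customer_id
)

select
    customer_id,
    count(distinct order_id) as count_lifetime_orders,
    max(ordered_at) as loaded_at
from orders_to_process
group by 1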


QMalcolm commented Oct 3, 2024

Sorry for accidentally closing this as completed last week. As penance, here is a photo of my cat Misu. He is very excited about microbatch models.

[photo: IMG_4238]
