clarify microbatch per feedback #6544
website/docs/docs/build/incremental-microbatch.md

@@ -8,7 +8,7 @@
:::info Microbatch

The new `microbatch` strategy is available in beta for [dbt Cloud Versionless](/docs/dbt-versions/upgrade-dbt-version-in-cloud#versionless) and dbt Core v1.9.

If you use a custom microbatch macro, set a [distinct behavior flag](/reference/global-configs/behavior-changes#custom-microbatch-strategy) in your `dbt_project.yml` to enable batched execution. If you don't have a custom microbatch macro, you don't need to set this flag; dbt handles microbatching automatically for any model using the [microbatch strategy](#how-microbatch-compares-to-other-incremental-strategies).
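
A minimal sketch of that opt-in, assuming the flag name shown on the linked behavior-changes page (`require_batched_execution_for_custom_microbatch_strategy`):

```yaml
# dbt_project.yml
flags:
  # Only needed if you define a custom microbatch macro;
  # opts that macro into dbt's batched execution.
  require_batched_execution_for_custom_microbatch_strategy: true
```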

@@ -22,17 +22,32 @@
Incremental models in dbt are a [materialization](/docs/build/materializations) designed to efficiently update your data warehouse tables by only transforming and loading _new or changed data_ since the last run. Instead of reprocessing an entire dataset every time, incremental models process a smaller number of rows, and then append, update, or replace those rows in the existing table. This can significantly reduce the time and resources required for your data transformations.

Microbatch is an incremental strategy designed for large time-series datasets:
- It complements, rather than replaces, existing incremental strategies by focusing on efficiency and simplicity in batch processing.
- Unlike traditional incremental strategies, microbatch doesn't require managing uniqueness constraints or implementing complex conditional logic for [backfilling](#backfills).
- It relies solely on a time column ([`event_time`](/reference/resource-configs/event-time)) to handle data partitioning and filtering.
- Note that microbatch might not be the best strategy for every use case. Consider another strategy if you don't have a reliable `event_time` column or if you want more control over the incremental logic. Read more in [How `microbatch` compares to other incremental strategies](#how-microbatch-compares-to-other-incremental-strategies).
Each "batch" corresponds to a single bounded time period (by default, a single day of data). Where other incremental strategies operate only on "old" and "new" data, microbatch models treat every batch as an atomic unit that can be built or replaced on its own. Each batch is independent and <Term id="idempotent" />. This is a powerful abstraction that makes it possible for dbt to run batches separately — in the future, concurrently — and to retry them independently. | ||
### How microbatch works | ||
|
||
When dbt runs a microbatch model — whether for the first time, during incremental runs, or in specified backfills — it will split the processing into multiple queries (or "batches"), based on the `event_time` and `batch_size` you configure.
Each "batch" corresponds to a single bounded time period (by default, a single day of data). Where other incremental strategies operate only on "old" and "new" data, microbatch models treat every batch as an atomic unit that can be built or replaced on its own. Each batch is independent and <Term id="idempotent" />. This is a powerful abstraction that makes it possible for dbt to run batches [separately](#backfills) — in the future, concurrently — and to [retry](#retry) them independently. | ||

> We just added support for concurrency - see #6550

> great stuff, thanks Grace. i've tweaked this and will link out to the 'concurrently' line once we add a section about it and resolve 6550 🙏 that way, i don't block this pr.
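
As a rough sketch, a microbatch model is configured like any other incremental model, plus the batching settings described above. Model, column, and date values here are hypothetical:

```sql
-- models/user_events_daily.sql (hypothetical model)
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='occurred_at',   -- time column that bounds each batch
    batch_size='day',           -- each batch covers one day of data
    begin='2024-01-01'          -- earliest date processed on a full rebuild
) }}

select
    event_id,
    user_id,
    occurred_at
-- dbt filters this ref to the current batch's time window,
-- provided raw_events also configures an event_time
from {{ ref('raw_events') }}
```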

### Example
A `sessions` model aggregates and enriches data that comes from two other models:
- `page_views` is a large, time-series table. It contains many rows, new records almost always arrive after existing ones, and existing records rarely update. It uses the `page_view_start` column as its `event_time`.
- `customers` is a relatively small dimensional table. Customer attributes update often, and not in a time-based manner — that is, older customers are just as likely to change column values as newer customers. The customers model doesn't configure an `event_time` column.

As a result:

- Each batch of `sessions` will filter `page_views` to the equivalent time-bounded batch.
- The `customers` table isn't filtered, resulting in a full scan for every batch.
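
For example, for the daily batch covering 2024-10-01 (hypothetical date and relation name), dbt would read `page_views` with a time-bounded filter along these lines, and `customers` with no filter at all:

```sql
-- Illustrative shape of the page_views input for one daily batch
select * from analytics.page_views
where page_view_start >= '2024-10-01'
  and page_view_start < '2024-10-02'
```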

:::tip
In addition to configuring `event_time` for the target table, you can also specify it for any upstream models that you want to filter, even if they have different time columns.
:::
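
A minimal sketch of the upstream configuration this example relies on, assuming `page_view_start` is set as the model's `event_time` as described above:
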
<File name="models/staging/page_views.yml">
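
```yaml
models:
  - name: page_views
    config:
      event_time: page_view_start
```

</File>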

> Hiya @mirnawong1
> Just to ask, should "complements" be "complements" or "compliments"? I wasn't sure if complements is a US spelling.
> Kind Regards
> Natalie

> hey @nataliefiann, good question! this is right IMHO as complements means something that enhances it. compliment generally means praise, so in this context, microbatch enhances existing strategies.