Add key adapter-specific config requirements #6582
Conversation
| `unique_key` | A column(s) (string or array) or expression for the record. Required for the `check` strategy. | N/A | String | Optional* |
| `partition_by` | A column(s) (string or array) or expression for the record. Required for the `check` strategy. | N/A | String | Optional* |

***Note:**
Suggested change:
***Note:** *There are scenarios where the following optional configs become required:*
I'm not sure if `optional` is the right terminology, but I also don't know of a better term in this instance. Specifically, `unique_key` is, I believe, an optional config for all models that does something regardless of adapter and has core implications. This is in contrast to `partition_by`, which isn't actually used in core for anything; it is only ever used by adapter implementations that use it. That is to say, defining `partition_by` on a model in, say, dbt-postgres will do nothing.
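For illustration, if that description holds, a hypothetical model on dbt-postgres might look like this sketch (model, column, and config values are invented for the example):

```sql
-- models/events.sql on dbt-postgres (hypothetical example)
{{
    config(
        materialized='incremental',
        incremental_strategy='microbatch',
        event_time='event_at',
        begin='2024-01-01',
        batch_size='day',
        unique_key='event_id',   -- read by core on any adapter
        partition_by='event_at'  -- per the comment above: a no-op on dbt-postgres
    )
}}

select * from {{ ref('stg_events') }}
```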
One comment
| `unique_key` | A column(s) (string or array) or expression for the record. Required for the `check` strategy. | N/A | String | Optional* |
| `partition_by` | A column(s) (string or array) or expression for the record. Required for the `check` strategy. | N/A | String | Optional* |
Where did the "Required for the `check` strategy." part of the description come from? It feels odd here; as in, what is a `check` strategy in the context of defining a microbatch model?

Additionally, I feel weird about including `unique_key` and `partition_by` in the table. They are only relevant for specific adapters, and not microbatch as a whole. The issue is that microbatch isn't a uniform strategy across the different adapters. That is (where `~=` means "approximately equals"):

- dbt-postgres: `microbatch` ~= `merge`
- dbt-redshift: `microbatch` ~= `delete+insert`
- dbt-bigquery: `microbatch` ~= `insert_overwrite`
- dbt-spark: `microbatch` ~= `insert_overwrite`
- dbt-databricks: `microbatch` ~= `insert_overwrite`
- dbt-snowflake: `microbatch` ~= `delete+insert`

The `unique_key` is required for dbt-postgres's `microbatch` because its `merge` strategy requires a `unique_key`, and that `unique_key` is used to identify which rows in the data warehouse need to get merged. For dbt-bigquery and dbt-spark, their underlying implementation of `insert_overwrite` requires a `partition_by`. My understanding of that strategy is that `partition_by` makes it efficient.
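To make the dbt-bigquery case concrete, here is a minimal sketch (model and column names are hypothetical; dbt-bigquery takes `partition_by` as a dict):

```sql
-- models/page_views.sql on dbt-bigquery (hypothetical example)
{{
    config(
        materialized='incremental',
        incremental_strategy='microbatch',
        event_time='viewed_at',
        begin='2024-01-01',
        batch_size='day',
        -- insert_overwrite replaces whole partitions, so the adapter
        -- needs to know how the target table is partitioned
        partition_by={'field': 'viewed_at', 'data_type': 'timestamp', 'granularity': 'day'}
    )
}}

select * from {{ ref('stg_page_views') }}
```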
Oh, thanks for catching this! This definitely shouldn't be there. It was tricky, and maybe we need a separate 'adapter-related' section to explain the microbatch x adapter strategy.
I'll rejig this!
| `begin` | The "beginning of time" for the microbatch model. This is the starting point for any initial or full-refresh builds. For example, a daily-grain microbatch model run on `2024-10-01` with `begin = '2023-10-01'` will process 366 batches (it's a leap year!) plus the batch for "today." | N/A | Date | Required |
| `batch_size` | The granularity of your batches. Supported values are `hour`, `day`, `month`, and `year`. | N/A | String | Required |
| `lookback` | Process X batches prior to the latest bookmark to capture late-arriving records. | `1` | Integer | Optional |
| `unique_key` | A column(s) (string or array) or expression for the record. Required for the `check` strategy. | N/A | String | Optional* |
Suggested change (drop the `check` strategy sentence):
| `unique_key` | A column(s) (string or array) or expression for the record. | N/A | String | Optional* |
| `batch_size` | The granularity of your batches. Supported values are `hour`, `day`, `month`, and `year`. | N/A | String | Required |
| `lookback` | Process X batches prior to the latest bookmark to capture late-arriving records. | `1` | Integer | Optional |
| `unique_key` | A column(s) (string or array) or expression for the record. Required for the `check` strategy. | N/A | String | Optional* |
| `partition_by` | A column(s) (string or array) or expression for the record. Required for the `check` strategy. | N/A | String | Optional* |
Suggested change (drop the `check` strategy sentence):
| `partition_by` | A column(s) (string or array) or expression for the record. | N/A | String | Optional* |
This PR adds support for the incremental strategy `microbatch`, which requires additional configs over the standard set:
- dbt-postgres requires a `unique_key`
- dbt-bigquery requires a `partition_by`
- dbt-spark requires a `partition_by`
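As a rough sketch of the dbt-spark case (hypothetical names; dbt-spark takes `partition_by` as a string or array of column names):

```sql
-- models/events.sql on dbt-spark (hypothetical example)
{{
    config(
        materialized='incremental',
        incremental_strategy='microbatch',
        event_time='event_ts',
        begin='2024-01-01',
        batch_size='day',
        partition_by=['event_date']  -- required: insert_overwrite swaps whole partitions
    )
}}

select *, to_date(event_ts) as event_date
from {{ ref('stg_events') }}
```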
Resolves #6576