
add key requirements about adapter requirement #6582

Open · wants to merge 3 commits into base: current

Conversation

@mirnawong1 (Contributor) commented Dec 3, 2024:

This PR adds documentation for the incremental strategy `microbatch`, which requires additional configs over the standard set (see the sketch after this list):
`dbt-postgres` requires a `unique_key`
`dbt-bigquery` requires a `partition_by`
`dbt-spark` requires a `partition_by`

Resolves #6576
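
For orientation, here is a minimal sketch of such a microbatch model on `dbt-postgres`; the model name and the `event_occurred_at`/`id` columns are hypothetical, not from this PR:

```sql
-- models/sessions.sql (hypothetical model on dbt-postgres)
{{
    config(
        materialized='incremental',
        incremental_strategy='microbatch',
        event_time='event_occurred_at',  -- hypothetical event-time column
        begin='2023-10-01',
        batch_size='day',
        unique_key='id'  -- the extra config dbt-postgres requires for microbatch
    )
}}

select * from {{ ref('stg_sessions') }}
```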



@mirnawong1 requested a review from a team as a code owner on December 3, 2024 15:17.

@github-actions bot added labels `content` (Improvements or additions to content), `Docs team` (Authored by the Docs team @dbt Labs), and `size: small` (This change will take 1 to 2 days to address) on Dec 3, 2024.
| `unique_key` | A column(s) (string or array) or expression for the record. Required for the `check` strategy. | N/A | String <br /> | Optional* |
| `partition_by` | A column(s) (string or array) or expression for the record. Required for the `check` strategy. | N/A | String | Optional* |

***Note:**
@matthewshaver (Contributor) commented Dec 3, 2024:

Suggested change:
- ***Note:**
+ *There are scenarios where the following optional configs become required:

@QMalcolm (Contributor) commented Dec 3, 2024:

I'm not sure if optional is the right terminology, but I also don't know of better terminology in this instance. Specifically, `unique_key` is, I believe, an optional config for all models that does something regardless of adapter and has core implications. This is in contrast to `partition_by`, which isn't actually used in core for anything; it is only ever used by the adapter implementations that support it. That is to say, defining `partition_by` on a model in, say, dbt-postgres will do nothing.
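
A hypothetical sketch of that contrast (model and column names are made up, not from this PR): both configs parse on any adapter, but only `unique_key` has core behavior, while `partition_by` takes effect only on adapters that implement it.

```sql
-- models/orders.sql (hypothetical model)
{{
    config(
        materialized='incremental',
        unique_key='order_id',                -- core config: meaningful on every adapter
        partition_by={'field': 'ordered_at'}  -- adapter config: per the comment above,
                                              -- this does nothing on dbt-postgres
    )
}}

select * from {{ ref('stg_orders') }}
```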

@QMalcolm (Contributor) left a comment:

One comment

Comment on lines +188 to +189
| `unique_key` | A column(s) (string or array) or expression for the record. Required for the `check` strategy. | N/A | String <br /> | Optional* |
| `partition_by` | A column(s) (string or array) or expression for the record. Required for the `check` strategy. | N/A | String | Optional* |
@QMalcolm (Contributor) commented:

Where did the "Required for the `check` strategy." part of the description come from? It feels odd here; as in, what is a `check` strategy in the context of defining a microbatch model?

Additionally, I feel weird about including `unique_key` and `partition_by` in the table. They are only relevant for specific adapters, not microbatch as a whole. The issue is that microbatch isn't a uniform strategy across the different adapters. That is (where ~= means "approximately equals"):

  • `dbt-postgres` microbatch ~= `merge`
  • `dbt-redshift` microbatch ~= `delete+insert`
  • `dbt-bigquery` microbatch ~= `insert_overwrite`
  • `dbt-spark` microbatch ~= `insert_overwrite`
  • `dbt-databricks` microbatch ~= `insert_overwrite`
  • `dbt-snowflake` microbatch ~= `delete+insert`

The `unique_key` is required for dbt-postgres's microbatch because its `merge` strategy requires a `unique_key`, and that `unique_key` is used to identify which rows in the data warehouse need to get merged. For dbt-bigquery and dbt-spark, their underlying implementation of `insert_overwrite` requires a `partition_by`. My understanding of that strategy is that the `partition_by` is what makes it efficient.
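
To make that concrete, here is a hedged sketch of a microbatch model on `dbt-bigquery`, where `partition_by` is required because the underlying `insert_overwrite` replaces whole partitions rather than merging rows (the model and column names are hypothetical):

```sql
-- models/page_views.sql (hypothetical model on dbt-bigquery)
{{
    config(
        materialized='incremental',
        incremental_strategy='microbatch',
        event_time='viewed_at',  -- hypothetical event-time column
        begin='2024-01-01',
        batch_size='day',
        partition_by={
            'field': 'viewed_at',
            'data_type': 'timestamp',
            'granularity': 'day'  -- partition granularity matching the batch size
        }
    )
}}

select * from {{ ref('stg_page_views') }}
```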

@mirnawong1 (PR author) replied:

Oh, thanks for catching this! This definitely shouldn't be there. It was tricky, and maybe we need a separate 'adapters-related' section to explain the microbatch x adapter strategy.

I'll rejig this!

| `begin` | The "beginning of time" for the microbatch model. This is the starting point for any initial or full-refresh builds. For example, a daily-grain microbatch model run on `2024-10-01` with `begin = '2023-10-01'` will process 366 batches (it's a leap year!) plus the batch for "today." | N/A | Date | Required |
| `batch_size` | The granularity of your batches. Supported values are `hour`, `day`, `month`, and `year`. | N/A | String | Required |
| `lookback` | Process X batches prior to the latest bookmark to capture late-arriving records. | `1` | Integer | Optional |
| `unique_key` | A column(s) (string or array) or expression for the record. Required for the `check` strategy. | N/A | String <br /> | Optional* |
@mirnawong1 (PR author) commented:

Suggested change:
- | `unique_key` | A column(s) (string or array) or expression for the record. Required for the `check` strategy. | N/A | String <br /> | Optional* |
+ | `unique_key` | A column(s) (string or array) or expression for the record. | N/A | String <br /> | Optional* |

| `batch_size` | The granularity of your batches. Supported values are `hour`, `day`, `month`, and `year`. | N/A | String | Required |
| `lookback` | Process X batches prior to the latest bookmark to capture late-arriving records. | `1` | Integer | Optional |
| `unique_key` | A column(s) (string or array) or expression for the record. Required for the `check` strategy. | N/A | String <br /> | Optional* |
| `partition_by` | A column(s) (string or array) or expression for the record. Required for the `check` strategy. | N/A | String | Optional* |
@mirnawong1 (PR author) commented:

Suggested change:
- | `partition_by` | A column(s) (string or array) or expression for the record. Required for the `check` strategy. | N/A | String | Optional* |
+ | `partition_by` | A column(s) (string or array) or expression for the record. | N/A | String | Optional* |


Successfully merging this pull request may close these issues.

[Core] Some adapter incremental strategies require additional config keys