# dbt: Play On (December 2024)

Oh hi there :)

We’ve had the opportunity to talk a lot about our investments in open source over the past few months:

- [I [Grace] renewed my vows with the dbt Community in a Vegas wedding ceremony officiated by Elvis](https://youtu.be/DC9sbZBYzpI?si=uK9Aie6Jl-FIHggm) (aka [@jtcohen6](https://github.com/jtcohen6)).
- [We spoke about dbt Labs’ continued commitment to Open Source in the Coalesce Keynote](https://youtu.be/I72yUtrmhbY?si=Iu6s9WXdHnFtyCvi).
- [We wrote about the importance of extensibility for how we build features in dbt Core.](https://www.getdbt.com/blog/dbt-core-v1-9-is-ga#:~:text=Extensibility%20is%20what%20powers%20the%20community)

All of these points and themes remain incredibly relevant to our team and are fueling our vision as we prepare for the new year:

- we're committed to defending dbt Core as the open source standard for data transformation, which will remain licensed under Apache 2.0
- the dbt framework will continue to be shaped by a collaborative effort between you (the community) and us (the maintainers)
- when we add something new to the standard, we are committing to the long term, which means we are intentional about *how* and *when* we expand its breadth - in the meantime, lean into dbt's extensibility (and show us how you're doing it!)

Now that we’ve gotten into our new rhythm…

Now that last year’s focus on stability has earned us the right to ship awesome new additions to the dbt framework…

Now that we’re actually seeing a *much* higher % of projects running the latest & greatest dbt…

…the next 6-12 months will feel similar to the last. dbt will keep getting better at doing what it does - being a mission-critical piece of your data stack, and a delightful part of your work. To that end, stability always comes first. And, we will be shipping some exciting new features to dbt Core.

# Oh What A Year It’s Been!

This year, we released two new minor versions of dbt Core.

| **Version** | **When** | **Namesake** | **Stuff** |
| --- | --- | --- | --- |
| [v1.8](https://docs.getdbt.com/docs/dbt-versions/core-upgrade/upgrading-to-v1.8) | May | [Julian Abele](https://github.com/dbt-labs/dbt-core/releases/tag/v1.8.0) | Unit testing. Decoupling of dbt-core and adapters. Flags for managing changes to legacy behaviors. |
| [v1.9](https://docs.getdbt.com/docs/dbt-versions/core-upgrade/upgrading-to-v1.9) | December | [Dr. Susan La Flesche Picotte](https://github.com/dbt-labs/dbt-core/releases/tag/v1.9.0) | Microbatch incremental strategy. New configurations and spec for snapshots. Standardizing support for Iceberg. |

In the [last roadmap post](https://github.com/dbt-labs/dbt-core/blob/main/docs/roadmap/2023-11-dbt-tng.md), we committed to prioritizing all-around interface stability. This included decoupling dbt-core and adapters, introducing [behavior change flags](https://docs.getdbt.com/reference/global-configs/behavior-changes) to give you time to adjust to ~~breaking~~ changes, and improving the stability of metadata artifacts. You can read more about those efforts [here](https://www.notion.so/E-team-offsite-competitive-deep-dive-SQLMesh-SDF-November-2024-129bb38ebda7807ca023ddf198fb1279?pvs=21).

Stability means you can upgrade with confidence.

Stability means fewer disruptions for our adapters and integrations in the dbt ecosystem.

Stability means we’ve earned the right to **ship some big new features**.

This past year, we were able to ship some long-awaited additions and enhancements to the dbt Core framework:

- [**Unit tests**](https://github.com/dbt-labs/dbt-core/discussions/8275) allow you to validate your SQL modeling logic on a small set of static inputs before you materialize your full model in production (a minimal sketch follows this list).
- [**Snapshots**](https://github.com/dbt-labs/dbt-core/discussions/7018) got the glow up they deserved - new configurations and a new spec make capturing your data changes easier to configure, run, and customize.
- [**Microbatch**](https://github.com/dbt-labs/dbt-core/discussions/10672) incremental models enable you to optimize your largest datasets by transforming your timeseries data in discrete periods with their own SQL queries, rather than all at once.
- [**Iceberg**](https://docs.getdbt.com/blog/icebeg-is-an-implementation-detail) table format (an open standard for storing data and accessing metadata) is standardized across adapters, enabling you to store your data in a way that promises interoperability with multiple compute engines.
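To give a flavor of the unit test spec, here’s a minimal sketch (the model, input, and column names are hypothetical):

```yaml
unit_tests:
  - name: test_discount_is_applied
    description: "The discount logic should reduce the order total by the discount percentage."
    model: stg_orders
    given:
      - input: ref('raw_orders')
        rows:
          - {order_id: 1, amount_cents: 1000, discount_pct: 10}
    expect:
      rows:
        - {order_id: 1, discounted_amount_cents: 900}
```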

We were also able to close out some highly-upvoted “paper cuts”:

- the `--empty` flag limits `ref`s and `source`s to zero rows, which you can use for schema-only dry runs that validate your model SQL and run unit tests
- dbt now issues a single (batch) query when calculating source freshness through metadata, instead of executing a query per source
- improvements to `state:modified` help reduce the risk of false positives due to environment-aware logic
- you can now document your data tests with `description`s (sketched below)
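A hypothetical sketch of that last one (illustrative model and column names; the exact YAML shape is best confirmed against the docs):

```yaml
models:
  - name: orders
    columns:
      - name: order_id
        data_tests:
          - unique:
              description: "Every order should appear exactly once in this model."
          - not_null:
              description: "Orders without an id point to an upstream loading problem."
```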

A lot of these features are things that you (the community) have been discussing and experimenting with for years. To everyone who opened an issue, commented on a discussion, joined us for a feedback session, developed a package, or contributed to our code base… however you made your voice heard this year, **thank you** for continuing to care, for continuing to lean in, for continuing to help shape dbt.

# New Year’s Resolutions

To start off the new year, we’re focusing on three major areas of development in the dbt framework:

- **typed macros** - configure Python type annotations for better macro validation
- **catalogs** - first-class support for materializing dbt models into external catalogs, providing a warehouse-agnostic interface for managing data in object storage
- **sample mode** - limit your data to smaller, time-based samples for faster development and CI testing

## Typed Macros

In the simplest terms, [macros](https://docs.getdbt.com/docs/build/jinja-macros#macros) are pieces of code, written in Jinja, that can be reused throughout your project – they are similar to "functions" in other programming languages.
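For example, a simple macro that wraps a common cents-to-dollars conversion, plus a model that reuses it, might look like this (the model and source names are purely illustrative):

```sql
-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name, scale=2) %}
    ({{ column_name }} / 100)::numeric(16, {{ scale }})
{% endmacro %}

-- models/stg_payments.sql
select
    id as payment_id,
    {{ cents_to_dollars('amount_cents') }} as amount_usd
from {{ source('stripe', 'payments') }}
```

(The same macro shows up again below, this time with type annotations.)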

In practice, you can reference a macro in a model’s SQL (or config block, or hook) to:

- make your SQL code more DRY by abstracting snippets of SQL into reusable “functions”
- use the results of one query to generate a set of logic
- change the way your project builds based on the current environment

Or, if you really need to, run arbitrary SQL in your warehouse via `dbt run-operation`.

Macros can depend on inputs, vars, `env_vars`, or even other macros — *and* there are “special” macros for defining custom generic tests and materializations.

This immense flexibility is one of the great benefits of macros — you can use them to solve **a lot** of different problems.

However, on the flip side, this flexibility — where macros can be whatever you want them to be — makes it challenging to validate that your macros are doing what you expect.

Without a way to define expected types for the inputs and outputs of macros, our adapter maintainers struggle to validate that a built-in macro override will produce the correct output. Furthermore, *any* analytics engineer writing a macro in dbt should be able to validate the expected behavior.

One of the things we aim to work on in the coming months is an interface for **typed macros.**

Imagine being able to codify the expected types* for the inputs and outputs of your macros:

```sql
{% macro cents_to_dollars(column_name: str, scale: int = 2) -> str %}
    ({{ column_name }} / 100)::numeric(16, {{ scale }})
{% endmacro %}
```

**Note: These are a subset of Python types (internal to dbt), not the data types within the data warehouse.*

Then, we could issue warnings at parse time when usage of a macro violates these types. By adding the ability to configure type expectations, user-created macros become more predictable, and [built-in macros become easier to override](https://github.com/dbt-labs/dbt-core/issues/9164).

Should this type checking be on by default, or something you opt into? Should we also support dbt-specific types, such as `Relation`? Head over to the [GitHub discussion](https://github.com/dbt-labs/dbt-core/discussions/11158) to participate in the conversation!

## Catalogs

In `v1.9`, we shipped a set of standard configs to materialize dbt models in Iceberg table format:

```sql
{{
    config(
        materialized = "table",
        table_format = "iceberg",
        external_volume = "s3_iceberg_snow"
    )
}}

...
```

Supporting the Iceberg table format was our first step towards empowering users to adopt Iceberg as a standard storage format for their critical datasets.

In the coming months, we want to add first-class support for “catalogs” in dbt. “Catalogs,” including Glue or Iceberg REST, operate at a level of abstraction above specific Iceberg tables — and they can provide a warehouse-agnostic interface for managing a large number of datasets in object storage.

Imagine a new top-level `catalogs.yml` that tells dbt about the catalog integrations you want to write to:

```yaml
catalogs:
  - catalog_name: my_first_catalog
    write_integrations:
      - integration_name: prod_glue_write_integration
        external_volume: my_prod_external_volume
        table_format: iceberg
        catalog_type: glue
```

Then, in your model’s configuration, simply specify the `catalog` field:

```sql
{{
    config(
        catalog = "my_first_catalog"
    )
}}

...
```

These new configurations will enable dbt to template the correct DDL statements for the platform (`CREATE GLUE ICEBERG TABLE` vs. `CREATE ICEBERG TABLE`, etc.). Now, when you run the above model, it will be materialized as an Iceberg table in S3, registered in the AWS Glue catalog.
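For a rough sense of what that templating can look like, here’s an illustrative sketch of Snowflake-style DDL for a Snowflake-managed Iceberg table (identifiers are hypothetical, and the statement dbt generates for a Glue write integration will differ):

```sql
-- Illustrative only: roughly the shape of a Snowflake-managed Iceberg CTAS.
-- A Glue write integration would template a different statement against the external catalog.
create or replace iceberg table my_db.my_schema.my_model
    external_volume = 's3_iceberg_snow'
    catalog = 'snowflake'
    base_location = 'my_model'
as (
    select ...
);
```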

We believe the approach of writing datasets to a platform-agnostic storage layer, and registering those datasets with a similarly agnostic *catalog service,* will become important to the technical foundations of the dbt workflow — the [Analytics Development Lifecycle (ADLC)](https://www.getdbt.com/resources/guides/the-analytics-development-lifecycle) — moving forward. Support for materializing Iceberg tables in external catalogs is another step down this path.

To read more about catalogs and participate in shaping this feature, head over to the [discussion on GitHub](https://github.com/dbt-labs/dbt-core/discussions/11171)!

## Sample Mode

We began our [“event_time” journey](https://github.com/dbt-labs/dbt-core/discussions/10672) with microbatch incremental models. Next up is [sample mode](https://github.com/dbt-labs/dbt-core/discussions/10672#:~:text=%F0%9F%8C%80%5BNext%5D%20%E2%80%9CSample%E2%80%9D%20mode%20for%20dev%20%26%20CI%20runs) - we want to support a pattern for speeding up development and CI testing by filtering your dataset to a *time-limited sample.*

Imagine your model contains a `ref` statement like so:

```sql
select * from {{ ref('fct_orders') }}
```

During standard runs, this compiles to:

```sql
select * from my_db.my_schema.fct_orders
```

But during a *sample* run, dbt could automatically filter down large tables to the last X days of data using the same `event_time` column used for microbatch. The exact syntax here is TBD, but imagine you run something like `dbt run --sample 3`, which then compiles your code to:

```sql
select * from my_db.my_schema.fct_orders
-- the event_time column in fct_orders is 'order_at'
where order_at > dateadd(day, -3, current_date)
```

No more [overriding the `source` or `ref` macro](https://discourse.getdbt.com/t/limiting-dev-runs-with-a-dynamic-date-range/508) to hack together this functionality.

Built-in support for “sample mode” — filtering to a consistent time-based “slice” across all models — means faster development and faster testing because you’re running on *less* data. This could be configured for a specific dbt invocation, a CI environment, or as a set-it-and-forget-it default for everyone who’s developing on your team’s project.

We’ll be opening an additional sample-mode-specific GitHub discussion in January, but feel free to queue up your thoughts [here](https://github.com/dbt-labs/dbt-core/discussions/10672) or [here](https://github.com/dbt-labs/dbt-core/issues/8378) in the meantime!

## and who could forget, paper cuts!

Those are the big things we plan to tackle in the coming months. As always, we want to tackle some smaller `paper_cut`s as well. Your upvotes and comments help us prioritize these, so please make some noise if there’s something you care a whole lot about. Some that are top of mind for me already:

- **DRY-er YML**, including:
- [Defining vars configs outside dbt_project.yml](https://github.com/dbt-labs/dbt-core/issues/2955)
- [Add ability to import/include YAML from other files](https://github.com/dbt-labs/dbt-core/issues/9695)
- **Enhancements to model versions and contracts**, including:
- [Automatically create view/clone of latest version](https://github.com/dbt-labs/dbt-core/issues/7442)
- [Support constraints independently from enforcing a full model contract](https://github.com/dbt-labs/dbt-core/issues/10195)
- and [warnings when configs are misspelled](https://github.com/dbt-labs/dbt-core/issues/8942) :)

# [**Call me**, **beep me** if you wanna reach me](https://www.youtube.com/watch?v=GIgLqN_rAXU)

If one of *your* New Year’s resolutions is to be more involved in the dbt community, here are some of the many ways you can contribute:

- open, upvote, and comment on GitHub [issues](https://docs.getdbt.com/community/resources/oss-expectations#issues)
- start a [discussion](https://docs.getdbt.com/community/resources/oss-expectations#discussions) or discourse post when you’ve got a Big Idea
- engage in conversations in the [dbt Community Slack](https://www.getdbt.com/community/join-the-community)
- [contribute code](https://docs.getdbt.com/community/resources/oss-expectations#pull-requests) back to one of our open source repos, for one of our issues tagged `help_wanted` or `good_first_issue`, and our engineering team will work with you to get it over the finish line
- join us on Zoom for feedback sessions (which we’ll post about in Slack and in the relevant GitHub discussion)
- share your creative solutions in a blog post, at a dbt meetup, or by talking at Coalesce

Your feedback and thoughts are incredibly valuable to us, so make yourself heard this year. We’re excited to build some awesome new features together.

[Your loving wife](https://youtu.be/DC9sbZBYzpI?si=xCGWoQDK-w13Fz6U&t=1594), Grace