From 7d73d8199e75a1608a637cfd7dc669eb1d8339bf Mon Sep 17 00:00:00 2001
From: Doug Beatty
Date: Thu, 28 Sep 2023 16:46:30 -0600
Subject: [PATCH 01/11] Explanation of Parsing vs. Compilation vs. Runtime

---
 .../parsing-vs-compilation-vs-runtime.md      | 151 ++++++++++++++++++
 1 file changed, 151 insertions(+)
 create mode 100644 core/dbt/parser/parsing-vs-compilation-vs-runtime.md

diff --git a/core/dbt/parser/parsing-vs-compilation-vs-runtime.md b/core/dbt/parser/parsing-vs-compilation-vs-runtime.md
new file mode 100644
index 00000000000..8c1834d6d46
--- /dev/null
+++ b/core/dbt/parser/parsing-vs-compilation-vs-runtime.md
@@ -0,0 +1,151 @@
+# Parsing vs. Compilation vs. Runtime
+
+## Context: Why this doc?
+
+There’s a lot of confusion about what dbt does at parse time vs. compile time / runtime. Even that separation is a relative simplification: parsing includes multiple steps, and while there are some distinctions between "compiling" and "running" a model, the two are **very** closely related.
+
+It's come up many times before, and we expect it will keep coming up! A decent number of bug reports in `dbt-core` are actually rooted in a misunderstanding of when configs are resolved, especially when folks are using pre/post hooks, or configs that alter materialization behavior (`partitions`, `merge_exclude_columns`, etc).
+
+So, here goes.
+
+## What is "parsing"?
+
+**In a sentence:** dbt reads all the files in your project, and constructs an internal representation of the project ("manifest").
+
+To keep it really simple, let’s say this happens in two steps: "Parsing" and "Resolving."
+
+### Parsing
+
+As a user, you write models as SQL + YAML. dbt wants to understand that model as a Python object, defined by an internal data structure. It also wants to know its dependencies and configuration (= its place in the DAG). dbt reads your code **for that one model,** and attempts to construct that object, raising a **validation** error if it can’t.
+
+
+(Toggle for many more details.)
+
+- (Because your SQL and YAML live in separate files, this is actually two steps. But for things like `sources`, `exposures`, `metrics`, `tests`, it’s a single pass.)
+- dbt needs to capture and store two vital pieces of information: **dependencies** and **configuration**.
+    - We need to know the shape of the DAG. That includes which models are disabled, in addition to which models will be depending on which other models.
+    - Plus, certain configurations have implications for **node selection**, which supports selecting models using the `tag:` and `config:` methods.
+- Parsing also resolves the configuration for that model, based on configs set in `dbt_project.yml`, and macros like `generate_schema_name`. (These are "special" macros, whose results are saved at parse time!)
+- The way dbt parses models depends on the language that model is written in.
+    - Python models are statically analyzed using the Python AST. [This is pretty fast.](https://www.notion.so/f113894b4cb7412e83b7f9fdb3a24cdb?pvs=21)
+    - Simple Jinja-SQL models (using just `ref()`, `source()`, &/or `config()` with literal inputs) are also [statically analyzed](https://docs.getdbt.com/reference/parsing#static-parser), using [a thing we built](https://github.com/dbt-labs/dbt-extractor). This is **very** fast (~0.3 ms).
+    - More complex Jinja-SQL models are parsed by actually rendering the Jinja, and "capturing" any instances of `ref()`, `source()`, &/or `config()`. This is kinda slow, but it’s more capable than our static parser. Those macros can receive `set` variables, or call other macros in turn, and we can still capture the right results because **we’re actually using real Jinja to render it.**
+        - We capture any other macros called in `depends_on.macros`. This enables us to do clever things later on, such as select models downstream of changed macros (`state:modified.macros`).
+        - **However:** If `ref()` is nested inside a conditional block that is false at parse time (e.g. `{% if execute %}`), we will miss capturing that `ref()` call at parse time. If the same conditional block resolves to true at runtime, we’re screwed! So [we have a runtime check](https://github.com/dbt-labs/dbt-core/blob/16f529e1d4e067bdbb6a659a622bead442f24b4e/core/dbt/context/providers.py#L495-L500) to validate that any `ref()` we see again at compile/runtime is one we also previously captured at parse time. If we find a new `ref()` we weren’t expecting, there’s a risk that we’re running the DAG out of order!
+
+
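As a rough illustration of the static-analysis path for Python models, here is a minimal sketch using the standard-library `ast` module. It is not dbt's actual parser: the `capture_calls` helper, the example model source, and the shape of the returned dictionary are all assumptions made for this example. It only shows how `dbt.ref()`, `dbt.source()`, and `dbt.config()` calls with literal arguments can be collected without ever running the model.

```python
import ast

# Hypothetical dbt-py model source, used only for this illustration.
MODEL_SOURCE = '''
def model(dbt, session):
    dbt.config(materialized="table")
    orders = dbt.ref("stg_orders")
    customers = dbt.source("crm", "customers")
    return orders
'''


def capture_calls(source):
    """Collect dbt.ref / dbt.source / dbt.config calls without executing the code."""
    captured = {"refs": [], "sources": [], "configs": []}
    for node in ast.walk(ast.parse(source)):
        # Only consider calls of the form `dbt.<name>(...)`.
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if not (
            isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Name)
            and func.value.id == "dbt"
        ):
            continue
        if func.attr not in ("ref", "source", "config"):
            continue
        # Static analysis only works for literal inputs; literal_eval raises
        # ValueError for anything else (variables, function calls, ...).
        args = [ast.literal_eval(arg) for arg in node.args]
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        if func.attr == "ref":
            captured["refs"].append(args)
        elif func.attr == "source":
            captured["sources"].append(args)
        else:
            captured["configs"].append(kwargs)
    return captured


print(capture_calls(MODEL_SOURCE))
```

A non-literal argument (say, `dbt.ref(some_variable)`) makes `ast.literal_eval` raise, which mirrors why static analysis is limited to literal inputs and anything more dynamic needs a real render.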
+
+### Resolving
+
+After we’ve parsed all the objects in a project, we need to resolve the links between them. This is when we look up all the `ref()`, `source()`, `metric()`, and `doc()` calls that we captured during parsing.
+
+This is the first step of (almost) every dbt command! When it's done, we have the **Manifest**.
+
+
+(Toggle for many more details.)
+
+- If we find another node matching the lookup, we add it to the first node’s `depends_on.nodes`.
+- If we don’t find an enabled node matching the lookup, we raise an error.
+    - (This is sometimes a failure mode for partial parsing, where we missed re-parsing a particular changed file/node, and it appears as though the node is missing when it clearly isn’t.)
+- Corollary: During the initial parse (previous step), we’re not actually ready to look up `ref()`, `source()`, etc. But during that first Jinja render, we still want them to return a `Relation` object, to avoid type errors if users are writing custom code that expects to operate on a `Relation`. (Otherwise, we’d see all sorts of errors like "NoneType has no attribute 'identifier'".) So, during parsing, we just have `ref()` and `source()` return a placeholder `Relation` pointing to the model currently being parsed. This can lead to some odd behavior, such as in [this recent issue](https://github.com/dbt-labs/dbt-core/issues/6382).
+
+
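The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the node dictionaries, the `resolve_refs` helper, the `UnresolvedRefError` exception, and the `jaffle` project name are hypothetical, and the real Manifest is far richer than this.

```python
class UnresolvedRefError(Exception):
    """Raised when a ref() does not match any enabled node."""


def resolve_refs(nodes):
    """Fill in depends_on.nodes for every node, using refs captured at parse time.

    `nodes` maps a unique id to a dict with `name`, `enabled`, and `refs` keys.
    This shape is an assumption for the sketch, not dbt's real Manifest schema.
    """
    enabled_by_name = {
        node["name"]: unique_id
        for unique_id, node in nodes.items()
        if node["enabled"]
    }
    for unique_id, node in nodes.items():
        depends_on = []
        for ref_name in node["refs"]:
            if ref_name not in enabled_by_name:
                raise UnresolvedRefError(
                    f"{unique_id} depends on '{ref_name}', "
                    "which was not found or is disabled"
                )
            depends_on.append(enabled_by_name[ref_name])
        node["depends_on"] = {"nodes": depends_on}
    return nodes


manifest = resolve_refs({
    "model.jaffle.stg_orders": {"name": "stg_orders", "enabled": True, "refs": []},
    "model.jaffle.orders": {"name": "orders", "enabled": True, "refs": ["stg_orders"]},
})
print(manifest["model.jaffle.orders"]["depends_on"])
```

Keeping the lookup as a separate pass is what lets parsing handle one file at a time: a model can `ref()` another model that has not been read yet, and the link is only checked once every node is known.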
+
+## What is "execution"?
+
+**In a sentence:** Now that dbt knows about all the stuff in your project, it can perform operations on top of it.
+
+Things it can do:
+
+- tell you about all the models that match certain criteria (`list`)
+- compile + run a set of models, in DAG order
+- interactively compile / preview some Jinja-SQL that includes calls to macros or `ref()`s to models defined in your project
+
+Depending on what’s involved, these operations may or may not require a live database connection. While executing, dbt produces metadata, which it returns as **log events** and **artifacts**.
+
+Put another way, dbt’s execution has required inputs, expected outputs, and the possibility for side effects:
+
+- **Inputs** (provided by user): project files, credentials, configuration → Manifest + runtime configuration
+- **Outputs** (returned to user): logs & artifacts
+- **Side effects** (not seen directly by user): changes in database state, depending on the operation being performed
+
+### Compiling a model
+
+We use the word "compiling" in a way that’s confusing for most software engineers (and many other people). Most of what’s described above, parsing + validating + constructing a Manifest (internal representation), falls more squarely in the traditional role of a language compiler. By contrast, when we talk about "compiling SQL," we’re really talking about something that happens at **runtime**.
+
+The devil’s in the details; toggle away.
+
+
+The mechanism of "compilation" varies by model language.
+
+- **Jinja-SQL** wants to compile down to "vanilla" SQL, appropriate for this database, where any calls to `ref('something')` have been replaced with `database.schema.something`.
+- dbt doesn’t directly modify or rewrite user-provided **Python** code at all. Instead, "compilation" looks like code generation: appending more methods that allow calls to `dbt.ref()`, `dbt.source()`, and `dbt.config.get()` to return the correct results at runtime.
+
+
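To make the Jinja-SQL case concrete, here is a deliberately simplified sketch of what "compiling down to vanilla SQL" produces. dbt renders real Jinja to do this; the regular expression below is a stand-in chosen so the example stays self-contained, and `compile_sql` and the `relations` mapping are hypothetical names.

```python
import re

# Matches {{ ref('name') }} or {{ ref("name") }} with a single literal argument.
REF_CALL = re.compile(r"\{\{\s*ref\(\s*['\"](\w+)['\"]\s*\)\s*\}\}")


def compile_sql(raw_sql, relations):
    """Replace each ref() call with the relation name of the referenced model."""
    return REF_CALL.sub(lambda match: relations[match.group(1)], raw_sql)


raw = "select * from {{ ref('something') }}"
print(compile_sql(raw, {"something": "database.schema.something"}))
```

The mapping from model name to `database.schema.identifier` is exactly the kind of information that is only settled once parsing and resolving have finished, which is why this substitution happens afterwards.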
+ +
+If your model’s code uses a dynamic query to template code, this requires a database connection.
+
+- At this point, `[execute](https://docs.getdbt.com/reference/dbt-jinja-functions/execute)` is set to `True`.
+- e.g. `dbt_utils.get_column_values`, `dbt_utils.star`
+- Jinja-SQL supports this sort of dynamic templating. Python does not; there are other imperative ways to do this, using DataFrame methods / the Python interpreter at runtime.
+
+
+ +
+Compilation is also when ephemeral model CTEs are interpolated into the models that `ref` them.
+
+- The code for this is *gnarly*. That’s all I’m going to say about it for now.
+
+
+ +
+When compiling happens for a given node varies by command.
+
+- In `dbt compile`, every model is compiled concurrently, up to the number of threads, rather than in DAG order.
+    - For example, if one model’s templated SQL depends on an introspective query that expects another model to have already been materialized, this can lead to errors.
+    - `dbt compile` has not historically supported `--defer`, but this was added in v1.3 (with [one known bug](https://github.com/dbt-labs/dbt-core/issues/6124)).
+- In `dbt run`, models are operated on in DAG order, where operating on one model means compiling it and then running its materialization. This way, if a downstream model’s compiled SQL will depend on an introspective query against the materialized results of an upstream model, we wait to compile it until the upstream model has completely finished running.
+
+
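The `dbt run` ordering described above (compile a model, then materialize it, and only then move on to its children) can be sketched with the standard-library `graphlib` module. This is a single-threaded illustration, not dbt's scheduler: `run_in_dag_order`, `fake_compile`, and `fake_materialize` are hypothetical names, and the real runner works across multiple threads.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_in_dag_order(depends_on, compile_node, materialize):
    """Compile and materialize each node only after all of its parents have finished."""
    for name in TopologicalSorter(depends_on).static_order():
        compiled_code = compile_node(name)
        materialize(name, compiled_code)


events = []


def fake_compile(name):
    # Stand-in for rendering the model's SQL; it may run introspective queries.
    events.append(f"compile {name}")
    return f"select 1 as placeholder_for_{name}"


def fake_materialize(name, compiled_code):
    # Stand-in for running the materialization against the database.
    events.append(f"run {name}")


run_in_dag_order(
    # Maps each model to the set of models it depends on.
    {"stg_orders": set(), "orders": {"stg_orders"}},
    fake_compile,
    fake_materialize,
)
print(events)
```

Because `orders` is not compiled until `stg_orders` has been both compiled and run, an introspective query issued while compiling `orders` can safely read the materialized upstream table.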
+ +
+
+The outcome of compiling a model is updating its Manifest entry in two important ways:
+- `compiled` is set to `True`
+- `compiled_code` is populated with (what else) the compiled code for this model
+
+### Running / materializing a model
+
+A model’s `compiled_code` is passed into the materialization macro, and the materialization macro is executed. That materialization macro will also call user-provided pre- and post-hooks, and other built-in macros that return the appropriate DDL + DML statements (`create`, `alter`, `merge`, etc.)
+
+(It’s set to a context variable named [`sql`](https://github.com/dbt-labs/dbt-core/blob/16f529e1d4e067bdbb6a659a622bead442f24b4e/core/dbt/context/providers.py#L1314-L1323), which you’ll see in some materializations — and we should really change that to just `model['compiled_code']`.)
+
+## Why does it matter?
+
+Keeping these pieces of logic separate is one of the most important & opinionated abstractions offered by dbt.
+
+- **The separation of "control plane" logic** (configurations & shape of the DAG) **from "data plane" logic** (how data should be manipulated & transformed remotely).
+    - You must declare all dependencies & configurations ahead of time, rather than imperatively redefining them at runtime. You cannot dynamically redefine the DAG on the basis of a query result.
+    - This is limiting for some advanced use cases, but it prevents you from solving hard problems in exactly the wrong ways.
+- **The separation of modeling code** ("logical" transformation written in SQL, or DataFrame manipulations) **from materialization code** ("physical" state changes via DDL/DML)**.**
+    - Every model is "just" a `select` statement, or a Python DataFrame. It can be developed, previewed, and tested as such, *without* mutating database state. Those mutations are defined declaratively, with reusable boilerplate ("view" vs. "table" vs. "incremental"), rather than imperatively each time.
+
+
+## Appendix
+
+
+Click to toggle notes on parsing
+
+### Notes on parsing
+
+- **dbt has not yet connected to a database.** Every step performed thus far has required only project files, configuration, and `dbt-core`. You can perform parsing without an Internet connection.
+- There is a command called `parse`, which does **just** "parsing" + "resolving," as a way to measure parsing performance in large projects. That command *could* write `manifest.json` once it's done; it doesn't, today, for no particular reason. (We're thinking about changing this.)
+- In large projects, the parsing step can also be quite slow: reading lots of files, doing lots of dataclass validation, creating lots of links between lots of nodes. (See below for details on two potential optimizations.)
+
+### Two potential optimizations
+
+1. [**"Partial parsing."**](https://docs.getdbt.com/reference/parsing#partial-parsing) dbt saves the mostly-done Manifest from last time, in a file called `target/partial_parse.msgpack`. dbt **just** reads the files that have changed (based on file system metadata), and makes partial updates to that mostly-done Manifest. Of course, if a user has updated configuration that could be relevant globally (e.g. `dbt_project.yml`, `--vars`), we have to opt for a full re-parse — better safe (slow & correct) than sorry (fast & incorrect).
+2. **A dbt Server.** Both the new `dbt-server` and the old `dbt-rpc` server have mechanisms to separate parsing from execution. They save a Manifest for reuse between commands. When files change, they trigger a re-parse and re-construction of the Manifest behind the scenes. This way, you don’t have to wait for re-parsing (even partial parsing) when you actually submit a command; it’s ready for you. The new approach, taken by `dbt-server` + Runtime, is much more resilient and less brittle than the previous approach in `dbt-rpc`.
+
+
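The partial-parsing decision described in this first patch can be sketched as a comparison between saved and current file fingerprints. Everything here is an assumption for illustration: the `plan_reparse` helper, the `GLOBAL_FILES` set, and the use of opaque fingerprint strings (which could stand for modification times or hashes). It does not read or write `partial_parse.msgpack`.

```python
# Files whose changes could affect every node, so a full re-parse is the safe choice.
GLOBAL_FILES = {"dbt_project.yml"}


def plan_reparse(saved, current):
    """Compare saved file fingerprints with current ones and decide what to re-parse.

    Both arguments map file paths to a fingerprint (for example an mtime or a hash).
    Returns ("full", all current paths) or ("partial", only the changed paths).
    """
    changed = {
        path
        for path in saved.keys() | current.keys()
        if saved.get(path) != current.get(path)
    }
    if changed & GLOBAL_FILES:
        # Better safe (slow and correct) than sorry (fast and incorrect).
        return "full", set(current)
    return "partial", changed


saved_state = {"dbt_project.yml": "v1", "models/orders.sql": "v1"}
print(plan_reparse(saved_state, {"dbt_project.yml": "v1", "models/orders.sql": "v2"}))
print(plan_reparse(saved_state, {"dbt_project.yml": "v2", "models/orders.sql": "v1"}))
```

Added and deleted files show up as changed paths too, since a path missing on one side compares unequal to its fingerprint on the other.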
From 9ffd5ab230ebc14b61509ba02c359c8dec5f850c Mon Sep 17 00:00:00 2001 From: Doug Beatty <44704949+dbeatty10@users.noreply.github.com> Date: Fri, 29 Sep 2023 09:40:41 -0600 Subject: [PATCH 02/11] Update core/dbt/parser/parsing-vs-compilation-vs-runtime.md --- core/dbt/parser/parsing-vs-compilation-vs-runtime.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/core/dbt/parser/parsing-vs-compilation-vs-runtime.md b/core/dbt/parser/parsing-vs-compilation-vs-runtime.md index 8c1834d6d46..7db5f4af8a0 100644 --- a/core/dbt/parser/parsing-vs-compilation-vs-runtime.md +++ b/core/dbt/parser/parsing-vs-compilation-vs-runtime.md @@ -16,7 +16,7 @@ To keep it really simple, let’s say this happens in two steps: "Parsing" and " ### Parsing -As a user, you write models as SQL + YAML. dbt wants to understand that model as a Python object, defined by an internal data structure. It also wants to know its dependencies and configuration (= its place in the DAG). dbt reads your code **for that one model,** and attempts to construct that object, raising a **validation** error if it can’t. +As a user, you write models as SQL + YAML. dbt wants to understand each model as a Python object, defined by an internal data structure. It also wants to know its dependencies and configuration (= its place in the DAG). dbt reads your code **for that one model,** and attempts to construct that object, raising a **validation** error if it can’t.
(Toggle for many more details.) From 82a4cdbcfe549e7a4ee5bad5dea50f7c8493ef5d Mon Sep 17 00:00:00 2001 From: Doug Beatty <44704949+dbeatty10@users.noreply.github.com> Date: Fri, 29 Sep 2023 09:45:41 -0600 Subject: [PATCH 03/11] Update core/dbt/parser/parsing-vs-compilation-vs-runtime.md --- core/dbt/parser/parsing-vs-compilation-vs-runtime.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/core/dbt/parser/parsing-vs-compilation-vs-runtime.md b/core/dbt/parser/parsing-vs-compilation-vs-runtime.md index 7db5f4af8a0..b7fea2e9284 100644 --- a/core/dbt/parser/parsing-vs-compilation-vs-runtime.md +++ b/core/dbt/parser/parsing-vs-compilation-vs-runtime.md @@ -23,7 +23,7 @@ As a user, you write models as SQL + YAML. dbt wants to understand each model as - (Because your SQL and YAML live in separate files, this is actually two steps. But for things like `sources`, `exposures`, `metrics`, `tests`, it’s a single pass.) - dbt needs to capture and store two vital pieces of information: **dependencies** and **configuration**. - - We need to know the shape of the DAG. That includes which models are disabled, in addition to which models will be depending on which other models. + - We need to know the shape of the DAG. This includes which models are disabled. It also includes dependency relationships between models. - Plus, certain configurations have implications for **node selection**, which supports selecting models using the `tag:` and `config:` methods. - Parsing also resolves the configuration for that model, based on configs set in `dbt_project.yml`, and macros like `generate_schema_name`. (These are "special" macros, whose results are saved at parse time!) - The way dbt parses models depends on the language that model is written in. 
From 273d5932cbe3c8ea422faf88ff230ce26a110760 Mon Sep 17 00:00:00 2001 From: Doug Beatty <44704949+dbeatty10@users.noreply.github.com> Date: Fri, 29 Sep 2023 09:46:52 -0600 Subject: [PATCH 04/11] Update core/dbt/parser/parsing-vs-compilation-vs-runtime.md --- core/dbt/parser/parsing-vs-compilation-vs-runtime.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/core/dbt/parser/parsing-vs-compilation-vs-runtime.md b/core/dbt/parser/parsing-vs-compilation-vs-runtime.md index b7fea2e9284..efff2f721ba 100644 --- a/core/dbt/parser/parsing-vs-compilation-vs-runtime.md +++ b/core/dbt/parser/parsing-vs-compilation-vs-runtime.md @@ -27,7 +27,7 @@ As a user, you write models as SQL + YAML. dbt wants to understand each model as - Plus, certain configurations have implications for **node selection**, which supports selecting models using the `tag:` and `config:` methods. - Parsing also resolves the configuration for that model, based on configs set in `dbt_project.yml`, and macros like `generate_schema_name`. (These are "special" macros, whose results are saved at parse time!) - The way dbt parses models depends on the language that model is written in. - - Python models are statically analyzed using the Python AST. [This is pretty fast.](https://www.notion.so/f113894b4cb7412e83b7f9fdb3a24cdb?pvs=21) + - Python models are statically analyzed using the Python AST. - Simple Jinja-SQL models (using just `ref()`, `source()`, &/or `config()` with literal inputs) are also [statically analyzed](https://docs.getdbt.com/reference/parsing#static-parser), using [a thing we built](https://github.com/dbt-labs/dbt-extractor). This is **very** fast (~0.3 ms). - More complex Jinja-SQL models are parsed by actually rendering the Jinja, and "capturing" any instances of `ref()`, `source()`, &/or `config()`. This is kinda slow, but it’s more capable than our static parser. 
Those macros can receive `set` variables, or call other macros in turn, and we can still capture the right results because **we’re actually using real Jinja to render it.** - We capture any other macros called in `depends_on.macros`. This enables us to do clever things later on, such as select models downstream of changed macros (`state:modified.macros`). From 49024755ea32b29aa16278be7a319a9aeb259c07 Mon Sep 17 00:00:00 2001 From: Doug Beatty <44704949+dbeatty10@users.noreply.github.com> Date: Fri, 29 Sep 2023 09:48:40 -0600 Subject: [PATCH 05/11] Update core/dbt/parser/parsing-vs-compilation-vs-runtime.md --- core/dbt/parser/parsing-vs-compilation-vs-runtime.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/core/dbt/parser/parsing-vs-compilation-vs-runtime.md b/core/dbt/parser/parsing-vs-compilation-vs-runtime.md index efff2f721ba..01a3f11d7df 100644 --- a/core/dbt/parser/parsing-vs-compilation-vs-runtime.md +++ b/core/dbt/parser/parsing-vs-compilation-vs-runtime.md @@ -86,7 +86,7 @@ Devils in the details; toggle away.
If your model’s code uses a dynamic query to template code, this requires a database connection. -- At this point, `[execute](https://docs.getdbt.com/reference/dbt-jinja-functions/execute)` is set to `True`. +- At this point, [`execute`](https://docs.getdbt.com/reference/dbt-jinja-functions/execute) is set to `True`. - e.g. `dbt_utils.get_column_values`, `dbt_utils.star` - Jinja-SQL supports this sort of dynamic templating. Python does not; there are other imperative ways to do this, using DataFrame methods / the Python interpreter at runtime. From 9c75e61f8a18b51a5f85b43d6229ea117e6bd945 Mon Sep 17 00:00:00 2001 From: Doug Beatty <44704949+dbeatty10@users.noreply.github.com> Date: Fri, 29 Sep 2023 09:49:19 -0600 Subject: [PATCH 06/11] Update core/dbt/parser/parsing-vs-compilation-vs-runtime.md --- core/dbt/parser/parsing-vs-compilation-vs-runtime.md | 1 - 1 file changed, 1 deletion(-) diff --git a/core/dbt/parser/parsing-vs-compilation-vs-runtime.md b/core/dbt/parser/parsing-vs-compilation-vs-runtime.md index 01a3f11d7df..d3ae8c5cf00 100644 --- a/core/dbt/parser/parsing-vs-compilation-vs-runtime.md +++ b/core/dbt/parser/parsing-vs-compilation-vs-runtime.md @@ -102,7 +102,6 @@ Devils in the details; toggle away.
When compiling happens for a given node varies by command. -- In `dbt compile`, every model is compiled concurrently, up to the number of threads, rather than in DAG order. - For example, if one model’s templated SQL depends on an introspective query that expects another model to have already been materialized, this can lead to errors. - `dbt compile` has not historically supported `--defer`, but this was added in v1.3 (with [one known bug](https://github.com/dbt-labs/dbt-core/issues/6124)). - In `dbt run`, models are operated on in DAG order, where operating on one model means compiling it and then running its materialization. This way, if a downstream model’s compiled SQL will depend on an introspective query against the materialized results of an upstream model, we wait to compile it until the upstream model has completely finishing running. From 8c9bf2a8c1e95490438339108f5f49e775c54d24 Mon Sep 17 00:00:00 2001 From: Doug Beatty <44704949+dbeatty10@users.noreply.github.com> Date: Fri, 29 Sep 2023 09:49:46 -0600 Subject: [PATCH 07/11] Update core/dbt/parser/parsing-vs-compilation-vs-runtime.md --- core/dbt/parser/parsing-vs-compilation-vs-runtime.md | 1 - 1 file changed, 1 deletion(-) diff --git a/core/dbt/parser/parsing-vs-compilation-vs-runtime.md b/core/dbt/parser/parsing-vs-compilation-vs-runtime.md index d3ae8c5cf00..9eb09e076fc 100644 --- a/core/dbt/parser/parsing-vs-compilation-vs-runtime.md +++ b/core/dbt/parser/parsing-vs-compilation-vs-runtime.md @@ -103,7 +103,6 @@ Devils in the details; toggle away. When compiling happens for a given node varies by command. - For example, if one model’s templated SQL depends on an introspective query that expects another model to have already been materialized, this can lead to errors. - - `dbt compile` has not historically supported `--defer`, but this was added in v1.3 (with [one known bug](https://github.com/dbt-labs/dbt-core/issues/6124)). 
- In `dbt run`, models are operated on in DAG order, where operating on one model means compiling it and then running its materialization. This way, if a downstream model’s compiled SQL will depend on an introspective query against the materialized results of an upstream model, we wait to compile it until the upstream model has completely finished running.
From b4241b10963361490849da26820ba29ea18a41ff Mon Sep 17 00:00:00 2001 From: Doug Beatty <44704949+dbeatty10@users.noreply.github.com> Date: Fri, 29 Sep 2023 09:50:54 -0600 Subject: [PATCH 08/11] Apply suggestions from code review Co-authored-by: Jeremy Cohen --- core/dbt/parser/parsing-vs-compilation-vs-runtime.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/core/dbt/parser/parsing-vs-compilation-vs-runtime.md b/core/dbt/parser/parsing-vs-compilation-vs-runtime.md index 9eb09e076fc..e4ea4033e53 100644 --- a/core/dbt/parser/parsing-vs-compilation-vs-runtime.md +++ b/core/dbt/parser/parsing-vs-compilation-vs-runtime.md @@ -117,7 +117,7 @@ The outcome of compiling a model is updating its Manifest entry in two important A model’s `compiled_code` is passed into the materialization macro, and the materialization macro is executed. That materialization macro will also call user-provided pre- and post-hooks, and other built-in macros that return the appropriate DDL + DML statements (`create`, `alter`, `merge`, etc.) -(It’s set to a context variable named [`sql`](https://github.com/dbt-labs/dbt-core/blob/16f529e1d4e067bdbb6a659a622bead442f24b4e/core/dbt/context/providers.py#L1314-L1323), which you’ll see in some materializations — and we should really change that to just `model['compiled_code']`.) +(For legacy reasons, `compiled_code` is also available as a context variable named [`sql`](https://github.com/dbt-labs/dbt-core/blob/16f529e1d4e067bdbb6a659a622bead442f24b4e/core/dbt/context/providers.py#L1314-L1323). You'll see it referenced as `sql` in some materializations. Going forward, `model['compiled_code']` is a better way to access this.) ## Why does it matter? @@ -138,12 +138,12 @@ Keeping these pieces of logic separate is one of the most important & opinionate ### Notes on parsing - **dbt has not yet connected to a database.** Every step performed thus far has required only project files, configuration, and `dbt-core`. 
You can perform parsing without an Internet connection. -- There is a command called `parse`, which does **just** "parsing" + "resolving," as a way to measure parsing performance in large projects. That command *could* write `manifest.json` once it's done; it doesn't, today, for no particular reason. (We're thinking about changing this.) +- There is a command called `parse`, which does **just** "parsing" + "resolving," as a way to measure parsing performance in large projects. That command is the fastest way to write `manifest.json` (since v1.5). - In large projects, the parsing step can also be quite slow: reading lots of files, doing lots of dataclass validation, creating lots of links between lots of nodes. (See below for details on two potential optimizations.) ### Two potential optimizations 1. [**"Partial parsing."**](https://docs.getdbt.com/reference/parsing#partial-parsing) dbt saves the mostly-done Manifest from last time, in a file called `target/partial_parse.msgpack`. dbt **just** reads the files that have changed (based on file system metadata), and makes partial updates to that mostly-done Manifest. Of course, if a user has updated configuration that could be relevant globally (e.g. `dbt_project.yml`, `--vars`), we have to opt for a full re-parse — better safe (slow & correct) than sorry (fast & incorrect). -2. **A dbt Server.** Both the new `dbt-server` and the old `dbt-rpc` server have mechanisms to separate parsing from execution. They save a Manifest for reuse between commands. When files change, they trigger a re-parse and re-construction of the Manifest behind the scenes. This way, you don’t have to wait for re-parsing (even partial parsing) when you actually submit a command; it’s ready for you. The new approach, taken by `dbt-server` + Runtime, is much more resilient and less brittle than the previous approach in `dbt-rpc`. +2. Reusing manifests: https://docs.getdbt.com/reference/programmatic-invocations#reusing-objects. 
Note that this is taking "full control," and there are failure modes (example: [dbt-core#7945](https://github.com/dbt-labs/dbt-core/issues/7945)).
From e66458f53107ac65dcb624ee76543730b07742fa Mon Sep 17 00:00:00 2001 From: Doug Beatty <44704949+dbeatty10@users.noreply.github.com> Date: Fri, 29 Sep 2023 10:05:05 -0600 Subject: [PATCH 09/11] Fix a couple markdown rendering issues --- core/dbt/parser/parsing-vs-compilation-vs-runtime.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/core/dbt/parser/parsing-vs-compilation-vs-runtime.md b/core/dbt/parser/parsing-vs-compilation-vs-runtime.md index e4ea4033e53..c6361ad456a 100644 --- a/core/dbt/parser/parsing-vs-compilation-vs-runtime.md +++ b/core/dbt/parser/parsing-vs-compilation-vs-runtime.md @@ -102,7 +102,7 @@ Devils in the details; toggle away.
When compiling happens for a given node varies by command. - - For example, if one model’s templated SQL depends on an introspective query that expects another model to have already been materialized, this can lead to errors. +- For example, if one model’s templated SQL depends on an introspective query that expects another model to have already been materialized, this can lead to errors. - In `dbt run`, models are operated on in DAG order, where operating on one model means compiling it and then running its materialization. This way, if a downstream model’s compiled SQL will depend on an introspective query against the materialized results of an upstream model, we wait to compile it until the upstream model has completely finished running.
@@ -144,6 +144,6 @@ Keeping these pieces of logic separate is one of the most important & opinionate ### Two potential optimizations 1. [**"Partial parsing."**](https://docs.getdbt.com/reference/parsing#partial-parsing) dbt saves the mostly-done Manifest from last time, in a file called `target/partial_parse.msgpack`. dbt **just** reads the files that have changed (based on file system metadata), and makes partial updates to that mostly-done Manifest. Of course, if a user has updated configuration that could be relevant globally (e.g. `dbt_project.yml`, `--vars`), we have to opt for a full re-parse — better safe (slow & correct) than sorry (fast & incorrect). -2. Reusing manifests: https://docs.getdbt.com/reference/programmatic-invocations#reusing-objects. Note that this is taking "full control," and there are failure modes (example: [dbt-core#7945](https://github.com/dbt-labs/dbt-core/issues/7945)). +2. [**"Reusing manifests."**](https://docs.getdbt.com/reference/programmatic-invocations#reusing-objects) Note that this is taking "full control," and there are failure modes (example: [dbt-core#7945](https://github.com/dbt-labs/dbt-core/issues/7945)).
From eef369bb6f387d324738c1ae60f790c9cd8d4e2c Mon Sep 17 00:00:00 2001 From: Doug Beatty Date: Fri, 29 Sep 2023 10:22:37 -0600 Subject: [PATCH 10/11] Move to the "explain it like im 64" folder When ELI5 just isnt detailed enough. --- .../parser => docs/eli64}/parsing-vs-compilation-vs-runtime.md | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename {core/dbt/parser => docs/eli64}/parsing-vs-compilation-vs-runtime.md (100%) diff --git a/core/dbt/parser/parsing-vs-compilation-vs-runtime.md b/docs/eli64/parsing-vs-compilation-vs-runtime.md similarity index 100% rename from core/dbt/parser/parsing-vs-compilation-vs-runtime.md rename to docs/eli64/parsing-vs-compilation-vs-runtime.md From 4fc26eb82d9e9c80475b400cd3480249db1df65f Mon Sep 17 00:00:00 2001 From: Doug Beatty <44704949+dbeatty10@users.noreply.github.com> Date: Tue, 10 Oct 2023 09:13:54 -0600 Subject: [PATCH 11/11] Disambiguate Python references Disambiguate Python references and delineate SQL models ("Jinja-SQL") from Python models ("dbt-py") --- docs/eli64/parsing-vs-compilation-vs-runtime.md | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/docs/eli64/parsing-vs-compilation-vs-runtime.md b/docs/eli64/parsing-vs-compilation-vs-runtime.md index c6361ad456a..5c784f3b3a1 100644 --- a/docs/eli64/parsing-vs-compilation-vs-runtime.md +++ b/docs/eli64/parsing-vs-compilation-vs-runtime.md @@ -16,7 +16,9 @@ To keep it really simple, let’s say this happens in two steps: "Parsing" and " ### Parsing -As a user, you write models as SQL + YAML. dbt wants to understand each model as a Python object, defined by an internal data structure. It also wants to know its dependencies and configuration (= its place in the DAG). dbt reads your code **for that one model,** and attempts to construct that object, raising a **validation** error if it can’t. +As a user, you write models as SQL (or Python!) + YAML. 
For sake of simplicity, we'll mostly consider SQL models ("Jinja-SQL") with additional notes for Python models ("dbt-py") as-needed. + +dbt wants to understand and define each SQL model as an object in an internal data structure. It also wants to know its dependencies and configuration (= its place in the DAG). dbt reads your code **for that one model,** and attempts to construct that object, raising a **validation** error if it can’t.
(Toggle for many more details.)
@@ -27,7 +29,7 @@ As a user, you write models as SQL + YAML. dbt wants to understand each model as
     - Plus, certain configurations have implications for **node selection**, which supports selecting models using the `tag:` and `config:` methods.
 - Parsing also resolves the configuration for that model, based on configs set in `dbt_project.yml`, and macros like `generate_schema_name`. (These are "special" macros, whose results are saved at parse time!)
 - The way dbt parses models depends on the language that model is written in.
-    - Python models are statically analyzed using the Python AST.
+    - dbt-py models are statically analyzed using the Python AST.
     - Simple Jinja-SQL models (using just `ref()`, `source()`, &/or `config()` with literal inputs) are also [statically analyzed](https://docs.getdbt.com/reference/parsing#static-parser), using [a thing we built](https://github.com/dbt-labs/dbt-extractor). This is **very** fast (~0.3 ms).
     - More complex Jinja-SQL models are parsed by actually rendering the Jinja, and "capturing" any instances of `ref()`, `source()`, &/or `config()`. This is kinda slow, but it’s more capable than our static parser. Those macros can receive `set` variables, or call other macros in turn, and we can still capture the right results because **we’re actually using real Jinja to render it.**
     - We capture any other macros called in `depends_on.macros`. This enables us to do clever things later on, such as select models downstream of changed macros (`state:modified.macros`).
@@ -79,7 +81,7 @@ Devils in the details; toggle away.
 The mechanism of "compilation" varies by model language.
 - **Jinja-SQL** wants to compile down to "vanilla" SQL, appropriate for this database, where any calls to `ref('something')` have been replaced with `database.schema.something`.
-- dbt doesn’t directly modify or rewrite user-provided **Python** code at all. Instead, "compilation" looks like code generation: appending more methods that allow calls to `dbt.ref()`, `dbt.source()`, and `dbt.config.get()` to return the correct results at runtime.
+- dbt doesn’t directly modify or rewrite user-provided **dbt-py** code at all. Instead, "compilation" looks like code generation: appending more methods that allow calls to `dbt.ref()`, `dbt.source()`, and `dbt.config.get()` to return the correct results at runtime.
@@ -88,7 +90,7 @@ Devils in the details; toggle away.
 - At this point, [`execute`](https://docs.getdbt.com/reference/dbt-jinja-functions/execute) is set to `True`.
     - e.g. `dbt_utils.get_column_values`, `dbt_utils.star`
-- Jinja-SQL supports this sort of dynamic templating. Python does not; there are other imperative ways to do this, using DataFrame methods / the Python interpreter at runtime.
+- Jinja-SQL supports this sort of dynamic templating. dbt-py does not; there are other imperative ways to do this, using DataFrame methods / the Python interpreter at runtime.
@@ -127,7 +129,7 @@ Keeping these pieces of logic separate is one of the most important & opinionate
     - You must declare all dependencies & configurations ahead of time, rather than imperatively redefining them at runtime. You cannot dynamically redefine the DAG on the basis of a query result.
         - This is limiting for some advanced use cases, but it prevents you from solving hard problems in exactly the wrong ways.
 - **The separation of modeling code** ("logical" transformation written in SQL, or DataFrame manipulations) **from materialization code** ("physical" state changes via DDL/DML)**.**
-    - Every model is "just" a `select` statement, or a Python DataFrame. It can be developed, previewed, and tested as such, *without* mutating database state. Those mutations are defined declaratively, with reusable boilerplate ("view" vs. "table" vs. "incremental"), rather than imperatively each time.
+    - Every model is "just" a `select` statement (for Jinja-SQL models), or a Python DataFrame (for dbt-py models). It can be developed, previewed, and tested as such, *without* mutating database state. Those mutations are defined declaratively, with reusable boilerplate ("view" vs. "table" vs. "incremental"), rather than imperatively each time.
 ## Appendix