Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] Fix ExecutionMode.AIRFLOW_ASYNC query #1260

Open
2 tasks
tatiana opened this issue Oct 21, 2024 · 9 comments
Open
2 tasks

[bug] Fix ExecutionMode.AIRFLOW_ASYNC query #1260

tatiana opened this issue Oct 21, 2024 · 9 comments
Assignees
Labels
area:execution Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc bug Something isn't working dbt:run Primarily related to dbt run command or functionality profile:bigquery Related to BigQuery ProfileConfig stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed stretch goal
Milestone

Comments

@tatiana
Copy link
Collaborator

tatiana commented Oct 21, 2024

Context

In Cosmos 1.7, we introduced experimental BQ support to run dbt models with ExecutionMode.AIRFLOW_ASYNC in #1224 and #1230.

While chatting with @joppevos , he identified that the dbt run command:

dbt run --full-refresh

the BQ adaptor seems to create or replace on top of the table, not a drop/create:
https://github.com/dbt-labs/dbt-bigquery/blob/455c76887c9886c517df9619335066bedb1e1a43/dbt/include/bigquery/macros/adapters.sql#L16

Only if the partitions or clusters have changed then it drops
https://github.com/dbt-labs/dbt-bigquery/blob/455c76887c9886c517df9619335066bedb1e1a43/dbt/include/bigquery/macros/materializations/table.sql#L27

Action

  • Confirm this behaviour, by adding a breakpoint in the dbt run command
  • Fix the query behaviour in Cosmos
@tatiana tatiana added the bug Something isn't working label Oct 21, 2024
@dosubot dosubot bot added area:execution Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc dbt:run Primarily related to dbt run command or functionality profile:bigquery Related to BigQuery ProfileConfig labels Oct 21, 2024
@joppevos
Copy link
Contributor

Thanks for creating the issue @tatiana! I see it's assigned, but I'd love to give it a shot or collaborate.

@pankajkoti
Copy link
Contributor

@joppevos just wished to quickly check on this one -- May I know please if in your deployments, you install Airflow & dbt in the same virtualenv? asking because we're thinking of an approach that could necessarily leverage if they're in the same env.

@joppevos
Copy link
Contributor

joppevos commented Nov 12, 2024

Hey @pankajkoti - I do not use the virtualenv mode. I had a look myself on how to approach this issue, but getting the sql header information from DBT like create, etc is difficult.

@pankajkoti
Copy link
Contributor

hi @joppevos , din't mean the virtualenv mode. But was more curious on whether dbt & Airflow python packages are installed in the same virtual environment or different virtual environments in your deployments. I guess it's fine we will try to cater to both scenarios.

@pankajkoti
Copy link
Contributor

@tatiana and I paired up to debug the dbt run command using the dbt runner included with the Python package. While it took some time to trace through dbt’s multiple abstraction layers—starting from the click CLI interface that triggers the dbt runner—we were able to use breakpoints to locate where the SQL gets constructed & is available for execution. This happens here in the raw_execute method for the dbt BigQuery adapter.

We realized there are several layers of abstraction involved in building the final SQL, including CTEs and DDLs, which account for modes like full-refresh, incremental, and snapshot. Since dbt doesn’t provide an interface to directly expose the SQL, we’ll need to intercept this flow, likely by monkey patching the raw_execute method to capture the SQL. Here are the proposed approaches:
1. Patch the raw_execute method: Capture the generated SQL by writing it to a temporary file, and upload these files to remote storage so they’re accessible to subsequent tasks running on different workers. All this could happen in our Compile task in the async mode that could run the dbt run command & potentially conduct these steps.
2. Use SQL from the target/run directory: According to this reference (thanks to Tatiana), dbt already writes SQL to the target/run directory, and I verified that this directory contains the same SQL we’d capture with approach 1. We could modify dbt run in the Compile tasks for our async execution mode, monkey-patching only the database calls in the raw_execute method to avoid making changes to the database itself while still generating these SQL files.

Monkey patching would be straightforward if dbt and Airflow run in the same virtual environment and process. However, if we use a subprocess or separate virtual environments, the patch won’t work in the subprocess. We’ll need to figure out an effective approach to modify and mock the source code to yield the SQL in these cases.

Since this is more than a quick bug fix, I’d like to discuss whether we should proceed with one of these proposals for Cosmos 1.8 or defer it to Cosmos 1.9. This approach would also help streamline a lot of the async work planned for Cosmos 1.9. If feasible, shifting other priorities to 1.9 to make room for this in Cosmos 1.8 could give us a chance to gather feedback sooner and accelerate our async support. I’m looking forward to everyone’s thoughts.

@tatiana
Copy link
Collaborator Author

tatiana commented Nov 13, 2024

From what we're seeing, the fix for this ticket will be to implement #1261

By implementing #1261, we'd also be closing #1271, #1265 and #1261

@tatiana
Copy link
Collaborator Author

tatiana commented Nov 15, 2024

@pankajkoti @phanikumv and I just spoke to @cmarteepants and she prefers we return to this in January, once we're working on the async support.

@pankajkoti
Copy link
Contributor

Thanks @tatiana for the update! I will move this ticket to the backlog & mark it for Cosmos 1.9.0

@pankajkoti pankajkoti added this to the Cosmos 1.9.0 milestone Nov 15, 2024
Copy link

This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Dec 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:execution Related to the execution environment/mode, like Docker, Kubernetes, Local, VirtualEnv, etc bug Something isn't working dbt:run Primarily related to dbt run command or functionality profile:bigquery Related to BigQuery ProfileConfig stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed stretch goal
Projects
None yet
Development

No branches or pull requests

5 participants