Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create skip_nodes_if_on_run_start_fails behavior change flag #10699

Merged
merged 32 commits into from
Sep 30, 2024
Merged

Conversation

aranke
Copy link
Member

@aranke aranke commented Sep 11, 2024

Resolves #7387

Problem

  • If there is a failure in an on-run-start hook, dbt should skip all selected resources from running. This is not true right now.
  • If there is a failure in an on-run-end hook, dbt should still write run_results.json for all previous results. This is true right now, and the GAP team is counting on it remaining true.

Solution

I'm documenting how this PR is currently implemented, open to questions/comments/concerns.

Behavior

  • on-run-start/end hooks are run in serial (--threads 1)
  • If a hook fails, subsequent hooks of the same type (start/end) are skipped (i.e., hooks “fail fast”)
  • Hook status is now saved in run results instead of panicking and exiting immediately
  • This status is written to run_results.json at the end of the run
  • Create a skip_nodes_if_on_run_start_fails project config flag
    • If the flag is set and any on-run-start hook fails, mark all selected nodes as skipped
  • on-run-start/end hooks are always run, regardless of whether they passed or failed last time
  • Cosmetic changes to log output to make it more useful

Implementation

  • Modify the type signature of before_run to return a RunStatus instead of None so we can implement some railway-oriented programming
  • Combine safe_run_hooks and run_hooks into a single method to make logic easier to follow
    • Create a new get_execution_status method to encapsulate RunStatus logic
  • Add HookNode to ResultNode union type
  • Change some existing tests to account for hooks now being part of run results (1 per hook) and new logging
  • Rewrite test_run_hookstest_on_run_hooks with the following tests:
    • Start hook fails, flag is None/False/True
    • End hook fails, logs and run_results.json are still surfaced
    • No hooks run if zero nodes selected
  • Add a new test for retry to run hooks even if they succeeded last time

Checklist

  • I have read the contributing guide and understand what's expected of me.
  • I have run this code in development, and it appears to resolve the stated issue.
  • This PR includes tests, or tests are not required or relevant for this PR.
  • This PR has no interface changes (e.g., macros, CLI, logs, JSON artifacts, config files, adapter interface, etc.) or this PR has already received feedback and approval from Product or DX.
  • This PR includes type annotations for new and modified functions.

@cla-bot cla-bot bot added the cla:yes label Sep 11, 2024
Copy link
Contributor

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

Copy link

codecov bot commented Sep 11, 2024

Codecov Report

Attention: Patch coverage is 98.14815% with 2 lines in your changes missing coverage. Please review.

Project coverage is 89.17%. Comparing base (5e3d418) to head (9accf08).
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #10699   +/-   ##
=======================================
  Coverage   89.16%   89.17%           
=======================================
  Files         183      183           
  Lines       23258    23288   +30     
=======================================
+ Hits        20738    20767   +29     
- Misses       2520     2521    +1     
Flag Coverage Δ
integration 86.44% <96.29%> (-0.02%) ⬇️
unit 62.19% <23.14%> (-0.10%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
Unit Tests 62.19% <23.14%> (-0.10%) ⬇️
Integration Tests 86.44% <96.29%> (-0.02%) ⬇️

@aranke aranke changed the title wip: refactor safe_run_hooks and add more metadata in Add new skip_nodes_if_on_run_start_fails to allow skipping nodes if on-run-start hooks fail Sep 18, 2024
@aranke aranke changed the title Add new skip_nodes_if_on_run_start_fails to allow skipping nodes if on-run-start hooks fail Add new skip_nodes_if_on_run_start_fails behavior change flag Sep 18, 2024
@aranke aranke changed the title Add new skip_nodes_if_on_run_start_fails behavior change flag Create skip_nodes_if_on_run_start_fails behavior change flag Sep 18, 2024
@aranke aranke marked this pull request as ready for review September 18, 2024 16:29
@aranke aranke requested a review from a team as a code owner September 18, 2024 16:29
core/dbt/task/runnable.py Outdated Show resolved Hide resolved
core/dbt/task/runnable.py Outdated Show resolved Hide resolved
core/dbt/task/run.py Show resolved Hide resolved
@graciegoheen graciegoheen added the user docs [docs.getdbt.com] Needs better documentation label Sep 24, 2024
@jtcohen6
Copy link
Contributor

jtcohen6 commented Sep 24, 2024

@aranke Behavior flag looks good, as does the "happy path" for skipping downstream nodes when it's enabled. When the behavior flag is not set / disabled, it works as before and the logs are even a bit nicer 👍

I found a few corner cases around dbt retry. Here's a reproduction case:

  1. Create a project with three on-run-start hooks and two models
  2. Put a syntax error in the second on-run-start hook (e.g. select column_does_not_exist)
  3. dbt run
  4. dbt retry - notice that the node numbering isn't right (models say they're 2 of 4 / 4 of 4)
  5. Fix the syntax error
  6. dbt retry - dbt successfully reruns the hooks, but then it also tries to materialize them as if they were models (?) using the default materialization (view)
# dbt_project.yml
on-run-start:
  - "select 1 as id"
  - "select column_does_not_exist"
  - "select 2 as id"

flags:
  skip_nodes_if_on_run_start_fails: True

The first run fails as expected, and skips models from running (good!):

% dbt run
22:44:11  Running with dbt=1.9.0-a1
22:44:11  Registered adapter: duckdb=1.8.3
22:44:11  Unable to do partial parsing because a project config has changed
22:44:11  Found 2 models, 3 operations, 4 data tests, 417 macros, 2 groups
22:44:11  
22:44:11  Concurrency: 1 threads (target='dev')
22:44:11  
22:44:11  1 of 3 START hook: testy.on-run-start.0 ........................................ [RUN]
22:44:11  1 of 3 OK hook: testy.on-run-start.0 ........................................... [OK in 0.01s]
22:44:11  2 of 3 START hook: testy.on-run-start.1 ........................................ [RUN]
22:44:11  2 of 3 ERROR hook: testy.on-run-start.1 ........................................ [ERROR in 0.01s]
22:44:11  3 of 3 SKIP hook: testy.on-run-start.2 ......................................... [SKIP in 0.00s]
22:44:11  
22:44:11  1 of 2 SKIP relation main.my_first_dbt_model ................................... [SKIP]
22:44:11  2 of 2 SKIP relation main.my_second_dbt_model .................................. [SKIP]
22:44:11  
22:44:11  Finished running 3 project hooks, 1 table model, 1 view model in 0 hours 0 minutes and 0.07 seconds (0.07s).
22:44:11  
22:44:11  Completed with 2 errors and 0 warnings:
22:44:11  
22:44:11    testy.on-run-start.1 failed, error:
 Binder Error: Referenced column "column_does_not_exist" not found in FROM clause!
LINE 2: select column_does_not_exist
               ^
22:44:11  
22:44:11  Done. PASS=1 WARN=0 ERROR=1 SKIP=3 TOTAL=5

Now retry, with weird numbering. dbt thinks it's running 4 total models (I think because the failed/skipped hooks are counted):

% dbt retry
22:44:15  Running with dbt=1.9.0-a1
22:44:15  Registered adapter: duckdb=1.8.3
22:44:15  Found 2 models, 3 operations, 4 data tests, 417 macros, 2 groups
22:44:15  
22:44:15  Concurrency: 1 threads (target='dev')
22:44:15  
22:44:15  1 of 3 START hook: testy.on-run-start.0 ........................................ [RUN]
22:44:15  1 of 3 OK hook: testy.on-run-start.0 ........................................... [OK in 0.01s]
22:44:15  2 of 3 START hook: testy.on-run-start.1 ........................................ [RUN]
22:44:15  2 of 3 ERROR hook: testy.on-run-start.1 ........................................ [ERROR in 0.01s]
22:44:15  3 of 3 SKIP hook: testy.on-run-start.2 ......................................... [SKIP in 0.00s]
22:44:15  
22:44:15  1 of 4 SKIP relation main.my_second_dbt_model .................................. [SKIP]
22:44:15  4 of 4 SKIP relation main.my_first_dbt_model ................................... [SKIP]
22:44:15  
22:44:15  Finished running 3 project hooks, 1 table model, 1 view model in 0 hours 0 minutes and 0.06 seconds (0.06s).
22:44:15  
22:44:15  Completed with 2 errors and 0 warnings:
22:44:15  
22:44:15    testy.on-run-start.1 failed, error:
 Binder Error: Referenced column "column_does_not_exist" not found in FROM clause!
LINE 2: select column_does_not_exist
               ^
22:44:15  
22:44:15  Done. PASS=1 WARN=0 ERROR=1 SKIP=3 TOTAL=5

Fix the syntax error in second hook, retry again. dbt will successfully run the hooks and build the models, but it also tries to build the hooks as models...

% dbt retry
22:44:33  Running with dbt=1.9.0-a1
22:44:33  Registered adapter: duckdb=1.8.3
22:44:33  Unable to do partial parsing because a project config has changed
22:44:34  Found 2 models, 3 operations, 4 data tests, 417 macros, 2 groups
22:44:34  
22:44:34  Concurrency: 1 threads (target='dev')
22:44:34  
22:44:34  1 of 3 START hook: testy.on-run-start.0 ........................................ [RUN]
22:44:34  1 of 3 OK hook: testy.on-run-start.0 ........................................... [OK in 0.01s]
22:44:34  2 of 3 START hook: testy.on-run-start.1 ........................................ [RUN]
22:44:34  2 of 3 OK hook: testy.on-run-start.1 ........................................... [OK in 0.01s]
22:44:34  3 of 3 START hook: testy.on-run-start.2 ........................................ [RUN]
22:44:34  3 of 3 OK hook: testy.on-run-start.2 ........................................... [OK in 0.01s]
22:44:34  
22:44:34  1 of 4 START sql table model main.my_first_dbt_model ........................... [RUN]
22:44:34  1 of 4 OK created sql table model main.my_first_dbt_model ...................... [OK in 0.04s]
22:44:34  2 of 4 START sql view model main.testy-on-run-start-1 .......................... [RUN]
22:44:34  2 of 4 ERROR creating sql view model main.testy-on-run-start-1 ................. [ERROR in 0.01s]
22:44:34  3 of 4 START sql view model main.testy-on-run-start-2 .......................... [RUN]
22:44:34  3 of 4 ERROR creating sql view model main.testy-on-run-start-2 ................. [ERROR in 0.00s]
22:44:34  4 of 4 START sql view model main.my_second_dbt_model ........................... [RUN]
22:44:34  4 of 4 OK created sql view model main.my_second_dbt_model ...................... [OK in 0.01s]
22:44:34  
22:44:34  Finished running 5 project hooks, 1 table model, 1 view model in 0 hours 0 minutes and 0.14 seconds (0.14s).
22:44:34  
22:44:34  Completed with 2 errors and 0 warnings:
22:44:34  
22:44:34    Compilation Error in operation testy-on-run-start-1 (./dbt_project.yml)
  'None' has no attribute 'database'
  
  > in macro materialization_view_default (macros/materializations/models/view.sql)
  > called by operation testy-on-run-start-1 (./dbt_project.yml)
22:44:34  
22:44:34    Compilation Error in operation testy-on-run-start-2 (./dbt_project.yml)
  'None' has no attribute 'database'
  
  > in macro materialization_view_default (macros/materializations/models/view.sql)
  > called by operation testy-on-run-start-2 (./dbt_project.yml)
22:44:34  
22:44:34  Done. PASS=5 WARN=0 ERROR=2 SKIP=0 TOTAL=7

@aranke
Copy link
Member Author

aranke commented Sep 27, 2024

@jtcohen6 Thanks for reviewing the behavior here, and good catch!
retry picked up these nodes because they're now recorded as “failed” in run_results.json.

I've tweaked the logic in retry to ignore operation nodes unless we're retrying run-operation, this seems to fix both the mis-numbering issue and the attempt to materialize a start hook.

I've also added a test for this case, but here are the new logs with the fix applied:

❯ dbt run
14:18:53  1 of 3 START hook: jaffle_shop.on-run-start.0 .................................. [RUN]
14:18:53  1 of 3 OK hook: jaffle_shop.on-run-start.0 ..................................... [OK in 0.01s]
14:18:53  2 of 3 START hook: jaffle_shop.on-run-start.1 .................................. [RUN]
14:18:53  2 of 3 ERROR hook: jaffle_shop.on-run-start.1 .................................. [ERROR in 0.02s]
14:18:53  3 of 3 SKIP hook: jaffle_shop.on-run-start.2 ................................... [SKIP in 0.00s]
14:18:53  
14:18:53  1 of 5 SKIP relation main.stg_orders ........................................... [SKIP]
14:18:53  2 of 5 SKIP relation main.orders ............................................... [SKIP]
14:18:53  3 of 5 SKIP relation main.stg_payments ......................................... [SKIP]
14:18:53  4 of 5 SKIP relation main.customers ............................................ [SKIP]
14:18:53  5 of 5 SKIP relation main.stg_customers ........................................ [SKIP]
❯ dbt retry
14:19:32  1 of 3 START hook: jaffle_shop.on-run-start.0 .................................. [RUN]
14:19:32  1 of 3 OK hook: jaffle_shop.on-run-start.0 ..................................... [OK in 0.01s]
14:19:32  2 of 3 START hook: jaffle_shop.on-run-start.1 .................................. [RUN]
14:19:32  2 of 3 ERROR hook: jaffle_shop.on-run-start.1 .................................. [ERROR in 0.02s]
14:19:32  3 of 3 SKIP hook: jaffle_shop.on-run-start.2 ................................... [SKIP in 0.00s]
14:19:32  
14:19:32  1 of 5 SKIP relation main.stg_customers ........................................ [SKIP]
14:19:32  2 of 5 SKIP relation main.orders ............................................... [SKIP]
14:19:32  3 of 5 SKIP relation main.stg_payments ......................................... [SKIP]
14:19:32  4 of 5 SKIP relation main.stg_orders ........................................... [SKIP]
14:19:32  5 of 5 SKIP relation main.customers ............................................ [SKIP]
  • Fix syntax error
❯ dbt retry
14:20:57  1 of 3 START hook: jaffle_shop.on-run-start.0 .................................. [RUN]
14:20:57  1 of 3 OK hook: jaffle_shop.on-run-start.0 ..................................... [OK in 0.01s]
14:20:57  2 of 3 START hook: jaffle_shop.on-run-start.1 .................................. [RUN]
14:20:57  2 of 3 OK hook: jaffle_shop.on-run-start.1 ..................................... [OK in 0.01s]
14:20:57  3 of 3 START hook: jaffle_shop.on-run-start.2 .................................. [RUN]
14:20:57  3 of 3 OK hook: jaffle_shop.on-run-start.2 ..................................... [OK in 0.02s]
14:20:57  
14:20:57  1 of 5 START sql view model main.stg_customers ................................. [RUN]
14:20:57  2 of 5 START sql view model main.stg_orders .................................... [RUN]
14:20:57  3 of 5 START sql view model main.stg_payments .................................. [RUN]
14:20:57  1 of 5 OK created sql view model main.stg_customers ............................ [OK in 0.07s]
14:20:57  3 of 5 OK created sql view model main.stg_payments ............................. [OK in 0.07s]
14:20:57  2 of 5 OK created sql view model main.stg_orders ............................... [OK in 0.08s]
14:20:57  4 of 5 START sql table model main.customers .................................... [RUN]
14:20:57  5 of 5 START sql table model main.orders ....................................... [RUN]
14:20:57  4 of 5 OK created sql table model main.customers ............................... [OK in 0.04s]
14:20:57  5 of 5 OK created sql table model main.orders .................................. [OK in 0.05s]

@aranke aranke requested a review from jtcohen6 September 27, 2024 14:22
@aranke
Copy link
Member Author

aranke commented Sep 27, 2024

I found a few corner cases around dbt retry.

@jtcohen6 are there any other corner cases here I should fix before merging in this PR? If not, would appreciate a 👍🏾 on the PR.

Thanks once again for taking the time to review this PR right before the Coalesce crunch.

Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aranke Thanks for fixing! That's the only issue I found in my testing. This looks fit to ship 👍

@aranke aranke merged commit 2ff3f20 into main Sep 30, 2024
65 of 66 checks passed
@aranke aranke deleted the fix_7387 branch September 30, 2024 12:53
@aranke aranke mentioned this pull request Oct 1, 2024
4 tasks
@jtcohen6 jtcohen6 mentioned this pull request Dec 12, 2024
12 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla:yes user docs [docs.getdbt.com] Needs better documentation
Projects
None yet
4 participants