-
Notifications
You must be signed in to change notification settings - Fork 179
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add transient configs #777
Conversation
User experience change
Changing the default behavior for Python tables to be created as The differenceThe key difference is that transient tables do not have a Fail-safe period (whereas permanent tables do). So a user can trade data protection and recovery for lower cost. The consequencesIf this change is adopted as-is, then users will need to explicitly add DocumentationIt would be nice to add some documentation in docs.getdbt.com how to configure the I don't know if this is the best place or if there's a better place. We'd probably want to show the non-default config:
# dbt_project.yml
name: my_dbt_project
profile: all
config-version: 2
version: 1.0
models:
my_dbt_project:
+materialized: table
+transient: false # models/snek.py
import pandas as pd
def model(dbt, session):
dbt.config(transient=False)
return pd.DataFrame({"id": [1]}) TestingWe'd probably want to add some test case to make sure that both transient true & false behave as intended. Would probably be accomplished by using config like shown above and then checking the value in the information schema afterwards. Would you be able to add those tests @jeremyyeo ? Or would you be looking for further guidance or assistance from the Adapters team on the testing portion? |
Certainly would love some guidance added those - couldn't find any examples in the repo that does this (but could just be me essentially having a lack of testing knowledge). |
@jeremyyeo We'll need to get insight from someone on the Adapters team of what kind of tests they'd like for this. I didn't see any tests here that seemed like a perfect match:
|
@jeremyyeo this is a great change To add some tests:
|
@@ -38,7 +38,7 @@ | |||
|
|||
{% endmaterialization %} | |||
|
|||
{% macro py_write_table(compiled_code, target_relation, temporary=False) %} | |||
{% macro py_write_table(compiled_code, target_relation, temporary=False, table_type='transient') %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd rather pass transient
as a boolean into this macro and then handle the translation here for the call to df.write.mode()
; the translation is specific to this call and we assume this is a boolean in the user config. Right now we have the translation in the create
macro, which maps to "transient"
or ""
. Also, is empty string correct here? What happens if we submit a call to df.write.mode()
with table_type=""
(versus omitting the kwarg table_type
entirely)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, is empty string correct here? What happens if we submit a call to
df.write.mode()
withtable_type=""
(versus omitting the kwargtable_type
entirely)?
@mikealfare that's a wise question 🧠
I didn't test this out personally, but the comments in the snowpark source code describe that case here.
So the way it is implemented in this PR currently, the string value of table_type
just becomes a pass-through in line 55 below:
df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', create_temp_table={{temporary}}, table_type='{{table_type}}')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From Snowflake docs (and what Doug linked to):
table_type – The table type of table to be created. The supported values are: temp, temporary, and transient. An empty string means to create a permanent table. Learn more about table types here.
So effectively:
# Table is transient.
save_as_table(..., table_type='transient')
# Table is non-transient.
save_as_table(..., table_type='')
# Table is non-transient (so default of `table_type` seems to be empty string).
save_as_table(...)
Could swap it out to (according to Mike's suggestion):
-- dbt/include/snowflake/macros/materializations/table.sql
{% macro py_write_table(compiled_code, target_relation, temporary=False, transient=False) %}
...
{% set table_type = 'transient' if transient else '' %}
df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', create_temp_table={{temporary}}, table_type='{{table_type}}')
Not too sure if it's worth writing additional logic to omit (or not) the table_type
arg when it is an empty string (or not) - i.e.
-- dbt/include/snowflake/macros/materializations/table.sql
{% macro py_write_table(compiled_code, target_relation, temporary=False, transient=False) %}
...
{% if transient %}
df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', create_temp_table={{temporary}}, table_type='transient')
{% else %}
df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', create_temp_table={{temporary}})
Happy to follow a suggested pattern @mikealfare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ftr, the docs linked above point out that the kwarg create_temp_table
is deprecated:
create_temp_table – (Deprecated) The to-be-created table will be temporary if this is set to True.
Which of course we still use in our save_as_table()
method calls - so potentially would be a good time to replace that with table_type
kwarg entirely.
Go from (status quo):
df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', create_temp_table={{temporary}})
Or (this PR's original intended change):
df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', create_temp_table={{temporary}}, table_type='{{table_type}}')
To just:
df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', table_type='{{table_type}}')
And then - we'll have all types via:
# Table is transient.
df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', table_type='transient')
# Table is non-transient.
df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', table_type='')
# Table is non-transient. Same as above - so pick a style we want I guess.
df.write.mode("overwrite").save_as_table('{{ target_relation_name }}')
# Table is temporary.
df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', table_type='temporary')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay, just getting around to plowing through my GH notifications. All of this context helps a lot.
Given that SF is deprecating create_temp_table
in favor of table_type
, I agree with using table_type
in the signature, as a string. And I would default it to empty string in alignment with the docs. This also keeps config parsing out of a macro that otherwise does not care about the jinja global config
that's floating around.
I wouldn't worry about removing the table_type
argument if table_type
is not in the call. However, I think we need to be cognizant of backwards compatibility for the temporary
argument in the py_write_table
macro. So that means we need to take both arguments in the py_write_table
macro and combine them. @dbeatty10, correct me if I'm wrong here, but I think that amounts to something like this:
{% if temporary == True %}
{% set table_type = "temporary" %} -- hence override the value of `table_type` that was passed in
{% else %}
-- this else is not needed, but communicates the concept
-- keep the value of `table_type` that was passed in (which could be the default empty string)
{% endif %}
Then we update the call to save_as_table
to exclude create_temp_table
in alignment with your third option above:
df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', table_type='{{table_type}}')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds about right @mikealfare.
Logic
To support both backwards-compatibility as well as forward-facing use cases, I'd suggest that the new table_type
parameter takes precedence over temporary
(but only when it is specified).
So like this (in alignment with the Snowflake docs here):
- use
table_type
when specified - use
"temporary"
as the table type only whentemporary
isTrue
andtable_type
is not specified - default the table type to
"transient"
if all else fails
e.g., something similar to what you wrote, but with the precedence flipped:
An untested implementation
{% macro py_write_table(compiled_code, target_relation, temporary=False, table_type=none) %}
...
{% if table_type is none and temporary %}
{% set table_type = "temporary" %}
{% elif table_type is none %}
{# Default to "transient", just like dbt SQL tables in Snowflake #}
{% set table_type = "transient" %}
{% elif table_type == "permanent" %}
{# Snowflake treats "" as meaning "permanent" #}
{% set table_type = "" %}
{% endif %}
...
df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', table_type='{{table_type}}')
Unnecessarily complicated?
This code might appear unnecessarily complicated on first blush. But it's a four-fold consequence of:
- Goal of giving precedence to
table_type
when a consumer uses the 3-parameter version of thepy_write_table
macro. - Goal of being backwards-compatible when a consumer uses the 2-parameter version of the
py_write_table
macro. - Goal of using
"transient"
as the default value (unless overridden) - Snowflake docs say: "The supported values of
table_type
are:temp
,temporary
, andtransient
. An empty string means to create a permanent table."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with this, but I have one suggestion.
Since we're mimicking the behavior of py_write_table
and we want folks to use table_type
moving forward, we want to support table_type="transient"
, table_type="temp"
, and table_type="temporary"
. This logic does actually do that, but I read it multiple times before I realized that happens because we're updating the parameter that comes in, hence it passes straight through. For the sake of readability, especially because it's jinja, I'd like to add an else clause to the if block that just contains the comment "otherwise leave table_type as it is" (or something along those lines). It would save folks some time in the future.
{# Snowflake treats "" as meaning "permanent" #}
I completely misread the docs. I conflated this with the default value of "transient". It's kind of wild that empty string is a valid value that means something and is also not the default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wild indeed! 🤠
Your suggestion works for me 👍
Here's what it would look like after adding that piece:
{% macro py_write_table(compiled_code, target_relation, temporary=False, table_type=none) %}
...
{% if table_type is none and temporary %}
{% set table_type = "temporary" %}
{% elif table_type is none %}
{#- Default to "transient", just like dbt SQL tables in Snowflake -#}
{% set table_type = "transient" %}
{% elif table_type == "permanent" %}
{#- Snowflake treats "" as meaning "permanent" -#}
{% set table_type = "" %}
{% else %}
{#- Otherwise leave table_type as it is -#}
{% endif %}
...
df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', table_type='{{table_type}}')
@@ -46,7 +46,8 @@ | |||
{%- endif -%} | |||
|
|||
{%- elif language == 'python' -%} | |||
{{ py_write_table(compiled_code=compiled_code, target_relation=relation, temporary=temporary) }} | |||
{%- set table_type = 'transient' if transient else '' -%} | |||
{{ py_write_table(compiled_code=compiled_code, target_relation=relation, temporary=temporary, table_type=table_type) }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change
{%- set table_type = 'transient' if transient else '' -%}
{{ py_write_table(compiled_code=compiled_code, target_relation=relation, temporary=temporary, table_type=table_type) }}
to
{{ py_write_table(compiled_code=compiled_code, target_relation=relation, temporary=temporary, transient=transient) }}
and handle within py_write_table
instead if we want to proceed with Mike's suggestion above (https://github.com/dbt-labs/dbt-snowflake/pull/777/files#r1330527575).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would do a few things here in alignment with my comment above. I would update the definition of table_type
here to consider temporary
as discussed above. And then I would not pass temporary
into the py_write_table
call at all. That tidies up all the config parsing. You would still need to keep that if block in the py_write_table
macro in the even that other folks are using it; but then when we eventually retire the temporary
argument in that macro, we don't need to remember to come back here and deal with it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dbeatty10 I think this still aligns with what you're saying above, correct? We're effectively forcing the use of table_type
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mikealfare there's one more thing to handle here:
- 👉 dbt-snowflake has a boolean config named
transient
that defaults toNone
Which brings us to three-valued logic ...
There's three-valued logic to handle for the transient
boolean config:
None
(pass-through and letpy_write_table
decide what to do)True
(usetransient
)False
(usepermanent
?)
How 'bout this?
So I think we'd need handle the null case first to be sure the table_type
is set (or not set!) correctly:
{% if transient is none %}
{% set table_type = none %}
{% elif transient %}
{% set table_type = "transient" %}
{% else %}
{% set table_type = "permanent" %}
{% endif %}
{{ py_write_table(compiled_code=compiled_code, target_relation=relation, table_type=table_type) }}
This is assuming an implementation of py_write_table
like our most recent iteration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If transient
is None
, then wouldn't we set table_table
based on temporary
since we're not passing temporary
into py_write_table
anymore? Put another way, I don't think you would be able to create a table with table_type="temporary"
in your logic flow without also passing temporary
into the call to py_write_table
. And I think we don't want to do that from what we said.
{% if transient is none and temporary %}
{% set table_type = "temporary" %}
{% elif transient is none %}
{% set table_type = none %}
{% elif transient %}
{% set table_type = "transient" %}
{% else %}
{% set table_type = "permanent" %}
{% endif %}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sharp eyes -- you read that right!
After examining the logic for sql
tables in dbt-snowflake it became clear that it prioritizes the temporary
config over the transient
config.
So I flipped-flopped from our earlier discussions and switched the implementation to standardize on the behavior of sql
tables.
Specifically, I just re-factored the code so that this logic applies to both sql
and python
tables.
Here is the relevant portion of the diff:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, but then if someone calls the py_write_table
macro directly, then we want the override to go in the other direction because then we're only in the scope of python tables, not sql tables. If that's right, then let's get this updated with what you have and up for review. I'm working on getting another thing over the line, but can help push this along when it's ready.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mikealfare Currently failing CI, but draft PR is up here: #802
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put a comment there. You're missing the close braces on the sets, so the code quality failed. Will we be moving forward with 802 then as the primary PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, let's consider #802 the primary PR. It preserves @jeremyyeo's commits history and authorship credit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I answered all of your questions. Let me know if I missed something. Thanks again for all of the write up; it really helps.
@@ -38,7 +38,7 @@ | |||
|
|||
{% endmaterialization %} | |||
|
|||
{% macro py_write_table(compiled_code, target_relation, temporary=False) %} | |||
{% macro py_write_table(compiled_code, target_relation, temporary=False, table_type='transient') %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay, just getting around to plowing through my GH notifications. All of this context helps a lot.
Given that SF is deprecating create_temp_table
in favor of table_type
, I agree with using table_type
in the signature, as a string. And I would default it to empty string in alignment with the docs. This also keeps config parsing out of a macro that otherwise does not care about the jinja global config
that's floating around.
I wouldn't worry about removing the table_type
argument if table_type
is not in the call. However, I think we need to be cognizant of backwards compatibility for the temporary
argument in the py_write_table
macro. So that means we need to take both arguments in the py_write_table
macro and combine them. @dbeatty10, correct me if I'm wrong here, but I think that amounts to something like this:
{% if temporary == True %}
{% set table_type = "temporary" %} -- hence override the value of `table_type` that was passed in
{% else %}
-- this else is not needed, but communicates the concept
-- keep the value of `table_type` that was passed in (which could be the default empty string)
{% endif %}
Then we update the call to save_as_table
to exclude create_temp_table
in alignment with your third option above:
df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', table_type='{{table_type}}')
@@ -46,7 +46,8 @@ | |||
{%- endif -%} | |||
|
|||
{%- elif language == 'python' -%} | |||
{{ py_write_table(compiled_code=compiled_code, target_relation=relation, temporary=temporary) }} | |||
{%- set table_type = 'transient' if transient else '' -%} | |||
{{ py_write_table(compiled_code=compiled_code, target_relation=relation, temporary=temporary, table_type=table_type) }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would do a few things here in alignment with my comment above. I would update the definition of table_type
here to consider temporary
as discussed above. And then I would not pass temporary
into the py_write_table
call at all. That tidies up all the config parsing. You would still need to keep that if block in the py_write_table
macro in the even that other folks are using it; but then when we eventually retire the temporary
argument in that macro, we don't need to remember to come back here and deal with it here.
Closing in favor of #802 Thanks for all your work on this @jeremyyeo between finding the issue, explaining it so thoroughly (including links to Snowflake docs), and even fixing it! 🏆 Your authorship and commits are preserved within #802 -- @mikealfare and I just added some tests and tweaked the logic slightly. |
resolves #776
docs dbt-labs/docs.getdbt.com/#
Problem
Python models don't respect
transient
config.Solution
A snowpark table created via
save_as_table
can be transient via setting the paramtable_type = 'transient'
(https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/latest/api/snowflake.snowpark.DataFrameWriter.save_as_table).snowflake__create_table_as()
) from being exclusive to SQL models.table_type
arg.There's probably something else to address.
+transient
config anywhere (i.e. profile, project, model level).This change would make Python tables also be created as transient - in line with (2).
Checklist