add transient configs #777

Closed
wants to merge 3 commits into from
6 changes: 6 additions & 0 deletions .changes/unreleased/Fixes-20230918-105721.yaml
@@ -0,0 +1,6 @@
kind: Fixes
body: Make python models use transient config
time: 2023-09-18T10:57:21.113134+12:00
custom:
Author: jeremyyeo
Issue: "776"
4 changes: 2 additions & 2 deletions dbt/include/snowflake/macros/materializations/table.sql
@@ -38,7 +38,7 @@

{% endmaterialization %}

-{% macro py_write_table(compiled_code, target_relation, temporary=False) %}
+{% macro py_write_table(compiled_code, target_relation, temporary=False, table_type='transient') %}
Contributor

I think I'd rather pass transient as a boolean into this macro and then handle the translation here for the call to df.write.mode(); the translation is specific to this call and we assume this is a boolean in the user config. Right now we have the translation in the create macro, which maps to "transient" or "". Also, is empty string correct here? What happens if we submit a call to df.write.mode() with table_type="" (versus omitting the kwarg table_type entirely)?

Contributor

Also, is empty string correct here? What happens if we submit a call to df.write.mode() with table_type="" (versus omitting the kwarg table_type entirely)?

@mikealfare that's a wise question 🧠

I didn't test this out personally, but the comments in the snowpark source code describe that case here.

So the way it is implemented in this PR currently, the string value of table_type just becomes a pass-through in line 55 below:

df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', create_temp_table={{temporary}}, table_type='{{table_type}}')

Contributor Author

From Snowflake docs (and what Doug linked to):

table_type – The table type of table to be created. The supported values are: temp, temporary, and transient. An empty string means to create a permanent table. Learn more about table types here.

So effectively:

# Table is transient.
save_as_table(..., table_type='transient')

# Table is non-transient.
save_as_table(..., table_type='')

# Table is non-transient (so default of `table_type` seems to be empty string).
save_as_table(...)

Could swap it out to (according to Mike's suggestion):

-- dbt/include/snowflake/macros/materializations/table.sql
{% macro py_write_table(compiled_code, target_relation, temporary=False, transient=False) %}
...
{% set table_type = 'transient' if transient else '' %}
df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', create_temp_table={{temporary}}, table_type='{{table_type}}')

Not too sure if it's worth writing additional logic to omit (or not) the table_type arg when it is an empty string (or not) - i.e.

-- dbt/include/snowflake/macros/materializations/table.sql
{% macro py_write_table(compiled_code, target_relation, temporary=False, transient=False) %}
...
{% if transient %}
df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', create_temp_table={{temporary}}, table_type='transient')
{% else %}
df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', create_temp_table={{temporary}})
{% endif %}
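
For comparison, the two options differ only in whether `table_type` is always passed or passed only when the table should be transient. A minimal Python sketch of the keyword arguments each option would produce, using plain dicts instead of a real Snowpark call; the helper names `kwargs_always` and `kwargs_when_needed` are hypothetical and exist only for illustration:

```python
from typing import Dict


def kwargs_always(temporary: bool, transient: bool) -> Dict[str, object]:
    """Option 1: always pass table_type; '' stands for a permanent table."""
    return {
        "create_temp_table": temporary,
        "table_type": "transient" if transient else "",
    }


def kwargs_when_needed(temporary: bool, transient: bool) -> Dict[str, object]:
    """Option 2: pass table_type only when the table should be transient."""
    kwargs: Dict[str, object] = {"create_temp_table": temporary}
    if transient:
        kwargs["table_type"] = "transient"
    return kwargs


if __name__ == "__main__":
    for transient in (True, False):
        print(transient, kwargs_always(False, transient), kwargs_when_needed(False, transient))
```

If the default of `table_type` really is the empty string, as inferred above, both options should create the same table; option 1 simply keeps the template free of branches.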

Happy to follow a suggested pattern @mikealfare

Contributor Author (@jeremyyeo, Sep 20, 2023)

Ftr, the docs linked above point out that the kwarg create_temp_table is deprecated:

create_temp_table – (Deprecated) The to-be-created table will be temporary if this is set to True.

Which of course we still use in our save_as_table() method calls - so potentially would be a good time to replace that with table_type kwarg entirely.

Go from (status quo):

df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', create_temp_table={{temporary}})

Or (this PR's original intended change):

df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', create_temp_table={{temporary}}, table_type='{{table_type}}')

To just:

df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', table_type='{{table_type}}')

And then - we'll have all types via:

# Table is transient.
df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', table_type='transient')

# Table is non-transient.
df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', table_type='')

# Table is non-transient. Same as above - so pick a style we want I guess.
df.write.mode("overwrite").save_as_table('{{ target_relation_name }}')

# Table is temporary.
df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', table_type='temporary')
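
To make the pass-through concrete, here is a small Python sketch of the line the template would emit once `create_temp_table` is dropped. It only builds strings; `render_save_call` and the relation name are hypothetical, and nothing here talks to Snowflake:

```python
def render_save_call(target_relation_name: str, table_type: str) -> str:
    """Return the Python line the template would emit for a given table_type."""
    return (
        'df.write.mode("overwrite").save_as_table('
        f"'{target_relation_name}', table_type='{table_type}')"
    )


if __name__ == "__main__":
    # "" means a permanent table according to the docs quoted in this thread.
    for table_type in ("transient", "", "temporary"):
        print(render_save_call("my_db.my_schema.my_model", table_type))
```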

Contributor

Sorry for the delay, just getting around to plowing through my GH notifications. All of this context helps a lot.

Given that SF is deprecating create_temp_table in favor of table_type, I agree with using table_type in the signature, as a string. And I would default it to empty string in alignment with the docs. This also keeps config parsing out of a macro that otherwise does not care about the jinja global config that's floating around.

I wouldn't worry about removing the table_type argument if table_type is not in the call. However, I think we need to be cognizant of backwards compatibility for the temporary argument in the py_write_table macro. So that means we need to take both arguments in the py_write_table macro and combine them. @dbeatty10, correct me if I'm wrong here, but I think that amounts to something like this:

{% if temporary == True %}
    {# override the value of `table_type` that was passed in #}
    {% set table_type = "temporary" %}
{% else %}
    {# this else is not needed, but communicates the concept: #}
    {# keep the value of `table_type` that was passed in (which could be the default empty string) #}
{% endif %}

Then we update the call to save_as_table to exclude create_temp_table in alignment with your third option above:

df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', table_type='{{table_type}}')
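
The same rule as a plain Python function, as a rough sketch rather than the macro itself (the name `combine` is made up for illustration): the legacy `temporary` flag overrides whatever `table_type` was passed in.

```python
def combine(temporary: bool = False, table_type: str = "") -> str:
    """Legacy temporary=True wins; otherwise keep the table_type that was passed in."""
    if temporary:
        return "temporary"
    # Could be the default empty string, which means a permanent table.
    return table_type


if __name__ == "__main__":
    for args in [(False, ""), (False, "transient"), (True, "transient")]:
        print(args, "->", repr(combine(*args)))
```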

Contributor

That sounds about right @mikealfare.

Logic

To support both backwards-compatibility as well as forward-facing use cases, I'd suggest that the new table_type parameter takes precedence over temporary (but only when it is specified).

So like this (in alignment with the Snowflake docs here):

  • use table_type when specified
  • use "temporary" as the table type only when temporary is True and table_type is not specified
  • default the table type to "transient" if all else fails

e.g., something similar to what you wrote, but with the precedence flipped:

An untested implementation

{% macro py_write_table(compiled_code, target_relation, temporary=False, table_type=none) %}

...

{% if table_type is none and temporary %}
    {% set table_type = "temporary" %}
{% elif table_type is none %}
    {# Default to "transient", just like dbt SQL tables in Snowflake #}
    {% set table_type = "transient" %}
{% elif table_type == "permanent" %}
    {# Snowflake treats "" as meaning "permanent" #}
    {% set table_type = "" %}
{% endif %}

...

df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', table_type='{{table_type}}')

Unnecessarily complicated?

This code might appear unnecessarily complicated at first blush, but it's a four-fold consequence of:

  1. Goal of giving precedence to table_type when a consumer uses the 3-parameter version of the py_write_table macro.
  2. Goal of being backwards-compatible when a consumer uses the 2-parameter version of the py_write_table macro.
  3. Goal of using "transient" as the default value (unless overridden)
  4. Snowflake docs say: "The supported values of table_type are: temp, temporary, and transient. An empty string means to create a permanent table."
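
A Python rendering of the untested Jinja above can make the four cases easier to check by hand. This is only a sketch of the same branching; `resolve_table_type` is a hypothetical name, and `None` plays the role of Jinja's `none`:

```python
from typing import Optional


def resolve_table_type(temporary: bool = False, table_type: Optional[str] = None) -> str:
    """Give table_type precedence over temporary, defaulting to transient."""
    if table_type is None and temporary:
        return "temporary"
    if table_type is None:
        # Default to "transient", like dbt SQL tables in Snowflake.
        return "transient"
    if table_type == "permanent":
        # Snowflake treats "" as meaning "permanent".
        return ""
    # Anything else ("transient", "temp", "temporary", "") passes straight through.
    return table_type


if __name__ == "__main__":
    print(repr(resolve_table_type()))
    print(repr(resolve_table_type(temporary=True)))
    print(repr(resolve_table_type(temporary=True, table_type="transient")))
    print(repr(resolve_table_type(table_type="permanent")))
```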

Contributor (@mikealfare, Oct 10, 2023)

I'm fine with this, but I have one suggestion.

Since we're mimicking the behavior of py_write_table and we want folks to use table_type moving forward, we want to support table_type="transient", table_type="temp", and table_type="temporary". This logic does actually do that, but I read it multiple times before I realized that happens because we're updating the parameter that comes in, hence it passes straight through. For the sake of readability, especially because it's jinja, I'd like to add an else clause to the if block that just contains the comment "otherwise leave table_type as it is" (or something along those lines). It would save folks some time in the future.

{# Snowflake treats "" as meaning "permanent" #}

I completely misread the docs. I conflated this with the default value of "transient". It's kind of wild that empty string is a valid value that means something and is also not the default.

Contributor

Wild indeed! 🤠

Your suggestion works for me 👍

Here's what it would look like after adding that piece:

{% macro py_write_table(compiled_code, target_relation, temporary=False, table_type=none) %}

...

{% if table_type is none and temporary %}
    {% set table_type = "temporary" %}
{% elif table_type is none %}
    {#- Default to "transient", just like dbt SQL tables in Snowflake -#}
    {% set table_type = "transient" %}
{% elif table_type == "permanent" %}
    {#- Snowflake treats "" as meaning "permanent" -#}
    {% set table_type = "" %}
{% else %}
    {#- Otherwise leave table_type as it is -#}
{% endif %}

...

df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', table_type='{{table_type}}')

{{ compiled_code }}
def materialize(session, df, target_relation):
# make sure pandas exists
@@ -52,7 +52,7 @@ def materialize(session, df, target_relation):
# session.write_pandas does not have overwrite function
df = session.createDataFrame(df)
{% set target_relation_name = resolve_model_name(target_relation) %}
-df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', create_temp_table={{temporary}})
+df.write.mode("overwrite").save_as_table('{{ target_relation_name }}', create_temp_table={{temporary}}, table_type='{{table_type}}')

def main(session):
dbt = dbtObj(session.table)
5 changes: 3 additions & 2 deletions dbt/include/snowflake/macros/relations/table/create.sql
@@ -1,6 +1,6 @@
{% macro snowflake__create_table_as(temporary, relation, compiled_code, language='sql') -%}
+{%- set transient = config.get('transient', default=true) -%}
{%- if language == 'sql' -%}
-{%- set transient = config.get('transient', default=true) -%}
{%- set cluster_by_keys = config.get('cluster_by', default=none) -%}
{%- set enable_automatic_clustering = config.get('automatic_clustering', default=false) -%}
{%- set copy_grants = config.get('copy_grants', default=false) -%}
@@ -46,7 +46,8 @@
{%- endif -%}

{%- elif language == 'python' -%}
-{{ py_write_table(compiled_code=compiled_code, target_relation=relation, temporary=temporary) }}
+{%- set table_type = 'transient' if transient else '' -%}
+{{ py_write_table(compiled_code=compiled_code, target_relation=relation, temporary=temporary, table_type=table_type) }}
Contributor Author

Change

{%- set table_type = 'transient' if transient else '' -%}
{{ py_write_table(compiled_code=compiled_code, target_relation=relation, temporary=temporary, table_type=table_type) }}

to

{{ py_write_table(compiled_code=compiled_code, target_relation=relation, temporary=temporary, transient=transient) }}

and handle within py_write_table instead if we want to proceed with Mike's suggestion above (https://github.com/dbt-labs/dbt-snowflake/pull/777/files#r1330527575).

Contributor

I would do a few things here in alignment with my comment above. I would update the definition of table_type here to consider temporary as discussed above. And then I would not pass temporary into the py_write_table call at all. That tidies up all the config parsing. You would still need to keep that if block in the py_write_table macro in the event that other folks are using it; but then when we eventually retire the temporary argument in that macro, we don't need to remember to come back and deal with it here.

Contributor

@dbeatty10 I think this still aligns with what you're saying above, correct? We're effectively forcing the use of table_type here.

Contributor

@mikealfare there's one more thing to handle here:

Which brings us to three-valued logic ...

There's three-valued logic to handle for the transient boolean config:

  • None (pass-through and let py_write_table decide what to do)
  • True (use transient)
  • False (use permanent?)

How 'bout this?

So I think we'd need to handle the null case first to be sure the table_type is set (or not set!) correctly:

{% if transient is none %}
    {% set table_type = none %}
{% elif transient %}
    {% set table_type = "transient" %}
{% else %}
    {% set table_type = "permanent" %}
{% endif %}

{{ py_write_table(compiled_code=compiled_code, target_relation=relation, table_type=table_type) }}
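
The three-valued mapping is small enough to state as a Python function. This is a sketch of the branching above, with a hypothetical name, where returning `None` means "let py_write_table decide":

```python
from typing import Optional


def table_type_from_transient(transient: Optional[bool]) -> Optional[str]:
    """Map the transient config (None / True / False) to a table_type value."""
    if transient is None:
        # Pass-through: the callee picks its own default.
        return None
    return "transient" if transient else "permanent"


if __name__ == "__main__":
    for value in (None, True, False):
        print(value, "->", table_type_from_transient(value))
```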

This is assuming an implementation of py_write_table like our most recent iteration.

Contributor

If transient is None, then wouldn't we set table_type based on temporary, since we're not passing temporary into py_write_table anymore? Put another way, I don't think you would be able to create a table with table_type="temporary" in your logic flow without also passing temporary into the call to py_write_table. And I think we don't want to do that from what we said.

{% if transient is none and temporary %}
    {% set table_type = "temporary" %}
{% elif transient is none %}
    {% set table_type = none %}
{% elif transient %}
    {% set table_type = "transient" %}
{% else %}
    {% set table_type = "permanent" %}
{% endif %}
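
To see what finally reaches save_as_table, the chain above can be composed with the py_write_table logic proposed earlier in this thread. The following Python sketch assumes that earlier proposal is the implementation in use; all three function names are hypothetical:

```python
from typing import Optional


def caller_table_type(transient: Optional[bool], temporary: bool) -> Optional[str]:
    """Mirror of the if/elif chain above (the calling macro's side)."""
    if transient is None and temporary:
        return "temporary"
    if transient is None:
        return None
    return "transient" if transient else "permanent"


def callee_table_type(table_type: Optional[str], temporary: bool = False) -> str:
    """Mirror of the py_write_table logic proposed earlier in the thread."""
    if table_type is None and temporary:
        return "temporary"
    if table_type is None:
        return "transient"
    if table_type == "permanent":
        return ""
    return table_type


def final_table_type(transient: Optional[bool], temporary: bool) -> str:
    # temporary is deliberately not forwarded to the callee.
    return callee_table_type(caller_table_type(transient, temporary))


if __name__ == "__main__":
    for transient in (None, True, False):
        for temporary in (False, True):
            print(transient, temporary, "->", repr(final_table_type(transient, temporary)))
```

In this flow a temporary table is reachable only when transient is None; with transient=True and temporary=True the result is "transient".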

Contributor

Yes, sharp eyes -- you read that right!

After examining the logic for sql tables in dbt-snowflake it became clear that it prioritizes the temporary config over the transient config.

So I flip-flopped from our earlier discussions and switched the implementation to standardize on the behavior of sql tables.

Specifically, I just re-factored the code so that this logic applies to both sql and python tables.

Here is the relevant portion of the diff:

[image: screenshot of the relevant portion of the diff]
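
A minimal Python sketch of the precedence described in this comment, where temporary wins over transient. It is not the actual diff: the function name is hypothetical, and returning "" for the neither-flag case is an assumption based on the docs quoted earlier (empty string means permanent).

```python
def table_type_sql_style(temporary: bool, transient: bool) -> str:
    """Resolve the table type with temporary taking priority over transient."""
    if temporary:
        return "temporary"
    if transient:
        return "transient"
    # Assumption: neither flag set means a permanent table, written as "".
    return ""


if __name__ == "__main__":
    for temporary in (False, True):
        for transient in (False, True):
            print(temporary, transient, "->", repr(table_type_sql_style(temporary, transient)))
```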

Contributor

Ok, but then if someone calls the py_write_table macro directly, then we want the override to go in the other direction because then we're only in the scope of python tables, not sql tables. If that's right, then let's get this updated with what you have and up for review. I'm working on getting another thing over the line, but can help push this along when it's ready.

Contributor

@mikealfare Currently failing CI, but draft PR is up here: #802

Contributor

I put a comment there. You're missing the close braces on the sets, so the code quality check failed. Will we be moving forward with #802 then as the primary PR?

Contributor

Yes, let's consider #802 the primary PR. It preserves @jeremyyeo's commit history and authorship credit.

{%- else -%}
{% do exceptions.raise_compiler_error("snowflake__create_table_as macro didn't get supported language, it got %s" % language) %}
{%- endif -%}