[CT-2271] [Feature] Compute seed file hashes incrementally #7124

noppaz · 2023-03-05T08:42:22Z

Is this your first time submitting a feature request?

I have read the expectations for open source contributors
I have searched the existing issues, and I could not find an existing issue for this feature
I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

This relates to a conversion on Slack which also resulted in issue #7117.

If we allow dbt to hash larger seed files I think these should be computed incrementally to avoid allocation of too much memory. This will be most important in CI where memory is more limited.

My idea is to add another classmethod to the FileHash dataclass which has the same signature as from_contents. This method can then be used specifically for seed file hashes and from_contents can be continued to be used alongside it.

@classmethod
  def from_path(cls, path: str, name="sha256") -> "FileHash":
      """Create a file hash from the file at given path.
      """
      pass

Describe alternatives you've considered

The alternative is to go with what we have which could spike memory usage for those who override the default limit and have a lot of large seed files. Yes this could be an anti pattern but there's no reason to limit the user like that here.

Who will this benefit?

I would argue that the default limit should be increased from 1 MB but as we have the environment variable override possibility in above PR I would be fine with keeping the default at 1 MB for now.

Therefore, the most benefit are more advanced projects which will override the 1 MB limit with the environment variable.

Are you interested in contributing this feature?

Yes, I will create the PR

Anything else?

No response

The text was updated successfully, but these errors were encountered:

dbeatty10 · 2023-03-06T16:46:35Z

Thanks for opening this issue and the associated PR @noppaz ! Excited to see you and @acurtis-evi collaborate and combine the complementary pieces.

Related to #6875

Current behavior

From the caveats to state comparison for seeds:

dbt stores a file hash of seed files that are <1 MiB in size. If the contents of these seeds is modified, the seed will be included in state:modified.

Proposed behavior

Lightly edited from here:

add a DBT_MAXIMUM_SEED_SIZE env variable (expressed in MiB)
default to 1
setting DBT_MAXIMUM_SEED_SIZE to 0 would remove the limit entirely
- i.e., if the derived constant MAXIMUM_SEED_SIZE is 0, then there is no limit to the seed size
compute file hashes incrementally for larger seed files

Functional approval

@jtcohen6 indicated the following in the discussion in Slack:

Ok! I'm not strictly opposed to either/both of:

making this limit configurable (you accept slowness as trade-off)

updating the logic to better handle larger files — if one of you would be interested in contributing that (not something we'd be prioritizing soon ourselves)

Acceptance criteria

enable configuration of the maximum seed size
- I'm guessing we'll want a corresponding click CLI option for any new environment variable (like here and here)
compute hash incrementally to support memory constrained environments
open an issue here to describe the new behavior
- e.g. update this documentation about hashes for seeds

noppaz · 2023-03-07T15:03:42Z

PR #7125 aims to solve two issues so the linking in the development pane to the right doesn't link to the PR atm.

noppaz · 2023-03-07T15:04:26Z

I've added an issue for documentation updates here dbt-labs/docs.getdbt.com#2958

dbeatty10 · 2023-03-07T19:06:45Z

PR #7125 aims to solve two issues so the linking in the development pane to the right doesn't link to the PR atm.

Listing the resolves keyword before each URL or issue number will do the trick. There's many different ways to express it. Here's one:

resolves https://github.com/dbt-labs/dbt-core/issues/7117
resolves https://github.com/dbt-labs/dbt-core/issues/7124

Here's another:

resolves #7117 resolves #7124

I made the update here, so you should see both linked now.

noppaz added enhancement New feature or request triage labels Mar 5, 2023

github-actions bot changed the title ~~[Feature] Compute seed file hashes incrementally~~ [CT-2271] [Feature] Compute seed file hashes incrementally Mar 5, 2023

noppaz linked a pull request Mar 5, 2023 that will close this issue

Make MAXIMUM_SEED_SIZE_MIB configurable #7125

Open

6 tasks

dbeatty10 removed the triage label Mar 6, 2023

jtcohen6 added help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors Team:Language labels Mar 22, 2023

jtcohen6 removed the Team:Language label Jul 19, 2023

jtcohen6 linked a pull request Dec 24, 2024 that will close this issue

Make MAXIMUM_SEED_SIZE_MIB configurable #11177

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CT-2271] [Feature] Compute seed file hashes incrementally #7124

[CT-2271] [Feature] Compute seed file hashes incrementally #7124

noppaz commented Mar 5, 2023 •

edited

Loading

dbeatty10 commented Mar 6, 2023

noppaz commented Mar 7, 2023

noppaz commented Mar 7, 2023

dbeatty10 commented Mar 7, 2023

[CT-2271] [Feature] Compute seed file hashes incrementally #7124

[CT-2271] [Feature] Compute seed file hashes incrementally #7124

Comments

noppaz commented Mar 5, 2023 • edited Loading

Is this your first time submitting a feature request?

Describe the feature

Describe alternatives you've considered

Who will this benefit?

Are you interested in contributing this feature?

Anything else?

dbeatty10 commented Mar 6, 2023

Current behavior

Proposed behavior

Functional approval

Acceptance criteria

noppaz commented Mar 7, 2023

noppaz commented Mar 7, 2023

dbeatty10 commented Mar 7, 2023

noppaz commented Mar 5, 2023 •

edited

Loading