Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-2271] [Feature] Compute seed file hashes incrementally #7124

Open
3 tasks done
noppaz opened this issue Mar 5, 2023 · 4 comments · May be fixed by #7125 or #11177
Open
3 tasks done

[CT-2271] [Feature] Compute seed file hashes incrementally #7124

noppaz opened this issue Mar 5, 2023 · 4 comments · May be fixed by #7125 or #11177
Labels
enhancement New feature or request help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors

Comments

@noppaz
Copy link

noppaz commented Mar 5, 2023

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

This relates to a conversion on Slack which also resulted in issue #7117.

If we allow dbt to hash larger seed files I think these should be computed incrementally to avoid allocation of too much memory. This will be most important in CI where memory is more limited.

My idea is to add another classmethod to the FileHash dataclass which has the same signature as from_contents. This method can then be used specifically for seed file hashes and from_contents can be continued to be used alongside it.

@classmethod
  def from_path(cls, path: str, name="sha256") -> "FileHash":
      """Create a file hash from the file at given path.
      """
      pass

Describe alternatives you've considered

The alternative is to go with what we have which could spike memory usage for those who override the default limit and have a lot of large seed files. Yes this could be an anti pattern but there's no reason to limit the user like that here.

Who will this benefit?

I would argue that the default limit should be increased from 1 MB but as we have the environment variable override possibility in above PR I would be fine with keeping the default at 1 MB for now.

Therefore, the most benefit are more advanced projects which will override the 1 MB limit with the environment variable.

Are you interested in contributing this feature?

Yes, I will create the PR

Anything else?

No response

@noppaz noppaz added enhancement New feature or request triage labels Mar 5, 2023
@github-actions github-actions bot changed the title [Feature] Compute seed file hashes incrementally [CT-2271] [Feature] Compute seed file hashes incrementally Mar 5, 2023
@noppaz noppaz linked a pull request Mar 5, 2023 that will close this issue
6 tasks
@dbeatty10
Copy link
Contributor

Thanks for opening this issue and the associated PR @noppaz ! Excited to see you and @acurtis-evi collaborate and combine the complementary pieces.

Related to #6875

Current behavior

From the caveats to state comparison for seeds:

dbt stores a file hash of seed files that are <1 MiB in size. If the contents of these seeds is modified, the seed will be included in state:modified.

Proposed behavior

Lightly edited from here:

  • add a DBT_MAXIMUM_SEED_SIZE env variable (expressed in MiB)
  • default to 1
  • setting DBT_MAXIMUM_SEED_SIZE to 0 would remove the limit entirely
    • i.e., if the derived constant MAXIMUM_SEED_SIZE is 0, then there is no limit to the seed size
  • compute file hashes incrementally for larger seed files

Functional approval

@jtcohen6 indicated the following in the discussion in Slack:

Ok! I'm not strictly opposed to either/both of:

  • making this limit configurable (you accept slowness as trade-off)
  • updating the logic to better handle larger files — if one of you would be interested in contributing that (not something we'd be prioritizing soon ourselves)

Acceptance criteria

  • enable configuration of the maximum seed size
    • I'm guessing we'll want a corresponding click CLI option for any new environment variable (like here and here)
  • compute hash incrementally to support memory constrained environments
  • open an issue here to describe the new behavior
    • e.g. update this documentation about hashes for seeds

@dbeatty10 dbeatty10 removed the triage label Mar 6, 2023
@noppaz
Copy link
Author

noppaz commented Mar 7, 2023

PR #7125 aims to solve two issues so the linking in the development pane to the right doesn't link to the PR atm.

@noppaz
Copy link
Author

noppaz commented Mar 7, 2023

I've added an issue for documentation updates here dbt-labs/docs.getdbt.com#2958

@dbeatty10
Copy link
Contributor

PR #7125 aims to solve two issues so the linking in the development pane to the right doesn't link to the PR atm.

Listing the resolves keyword before each URL or issue number will do the trick. There's many different ways to express it. Here's one:

resolves https://github.com/dbt-labs/dbt-core/issues/7117
resolves https://github.com/dbt-labs/dbt-core/issues/7124

Here's another:

resolves #7117 resolves #7124

I made the update here, so you should see both linked now.

@jtcohen6 jtcohen6 added help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors Team:Language labels Mar 22, 2023
@jtcohen6 jtcohen6 linked a pull request Dec 24, 2024 that will close this issue
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors
Projects
None yet
3 participants