[CT-1772] [Feature] Unexpected DuplicateProjectDependencyError in git packages with common transitive dependency #6552

Closed
2 tasks done
GiorgioBaldelli opened this issue Jan 9, 2023 · 15 comments
Labels
bug (Something isn't working), deps (dbt's package manager), stale (Issues that have gone stale), windows (Everyone's favorite OS that's sometimes a little weird)

Comments

@GiorgioBaldelli

GiorgioBaldelli commented Jan 9, 2023

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

Let's say we have two projects, project A & project B.

Both projects A and B import the same, custom dbt package from a git repository, let's call it example_package.

If we now modify project A's packages.yml to import project B, the package configurations would look like this:

Project A:

packages:
  - git: "https://gitlab.com/example/example_package.git"
    revision: main
  - git: "https://gitlab.com/example/project_b.git"
    revision: main

Project B:

packages:
  - git: "https://gitlab.com/example/example_package.git"
    revision: main

If we now attempt to run any dbt command on project A, for example dbt deps, we get the following error:

Found duplicate project "example_package". This occurs when a dependency has the same project name as some other dependency.

Is there a way to handle duplicate packages when using the git syntax to import custom, non-standard packages?

It looks like dbt may be able to handle duplicates only if we attempt to import standard packages from dbt hub? The recommended solution in this issue does not address situations in which the package is custom and does not therefore exist on dbt's package hub.

Expected Behavior

dbt knows how to handle duplicate package imports when using the git syntax.

Steps To Reproduce

Use the example package definition & attempt to run dbt deps.

Relevant log output

No response

Environment

No response

Which database adapter are you using with dbt?

No response

Additional Context

No response

@GiorgioBaldelli GiorgioBaldelli added bug Something isn't working triage labels Jan 9, 2023
@github-actions github-actions bot changed the title from "[Bug] dbt unable to handle duplicate package imports when using git syntax" to "[CT-1772] [Bug] dbt unable to handle duplicate package imports when using git syntax" Jan 9, 2023
@jtcohen6 jtcohen6 added the deps dbt's package manager label Jan 9, 2023
@jtcohen6 jtcohen6 self-assigned this Jan 9, 2023
@jtcohen6
Contributor

jtcohen6 commented Jan 9, 2023

@GiorgioBaldelli Thanks for opening!

There have been a handful of issues similar to this one, although none that's exactly the scenario you're outlining here:

> It looks like dbt may be able to handle duplicates only if we attempt to import standard packages from dbt hub?

That's more or less correct. The simple fact is, there are more user-friendly things we can do when you specify a Hub package, versus a git, local, or tarball (new in v1.4!) package. The Hub API includes metadata about each package's dependencies on other packages, which we can use to resolve versions and deduplicate common transitive dependencies.

As @dbeatty10 mentioned in #6502 (comment), you can optionally host the dbt package hub yourself, using these instructions, and set the DBT_PACKAGE_HUB_URL environment variable for each of your dbt projects so that dbt deps knows where to look (instead of hub.getdbt.com). That's a pretty significant amount of work on your end, though.
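
As a rough sketch of the environment-variable part of that approach: a small wrapper could set DBT_PACKAGE_HUB_URL before invoking dbt deps. The hub URL used in the usage note and the helper names are hypothetical placeholders, not anything dbt ships.

```python
import os
import subprocess


def build_deps_invocation(hub_url, base_env=None):
    """Return the command and environment for running `dbt deps` against a custom hub.

    `hub_url` is a placeholder for wherever the self-hosted hub lives.
    """
    env = dict(os.environ if base_env is None else base_env)
    # DBT_PACKAGE_HUB_URL is the variable named in the comment above;
    # everything else in the environment is inherited unchanged.
    env["DBT_PACKAGE_HUB_URL"] = hub_url
    return ["dbt", "deps"], env


def run_deps(hub_url, project_dir):
    """Run `dbt deps` inside `project_dir` (requires dbt on PATH)."""
    cmd, env = build_deps_invocation(hub_url)
    return subprocess.run(cmd, cwd=project_dir, env=env, check=True)
```

Calling `run_deps("https://hub.internal.example.com/", "path/to/project_a")` would then run dbt deps with the override in place; the same variable could just as well be exported in a shell profile or CI configuration.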

In the meantime, since you know that Project A depends on Project B which depends on example_package, could you just take the following approach?

# project_a/packages.yml
packages:
# this will be installed via project_b
#  - git: "https://gitlab.com/example/example_package.git"
#    revision: main
  - git: "https://gitlab.com/example/project_b.git"
    revision: main

# project_b/packages.yml
packages:
  - git: "https://gitlab.com/example/example_package.git"
    revision: main

I'm going to reclassify this from a bug to an enhancement. Unless I'm missing something, I think it would require a fairly significant lift to get this working, and it's not something we're likely to prioritize.

@jtcohen6 jtcohen6 added enhancement New feature or request and removed bug Something isn't working labels Jan 9, 2023
@jtcohen6 jtcohen6 changed the title from "[CT-1772] [Bug] dbt unable to handle duplicate package imports when using git syntax" to "[CT-1772] [Feature] dbt should handle duplicate package imports when using git syntax" Jan 9, 2023
@jtcohen6 jtcohen6 removed the triage label Jan 9, 2023
@jtcohen6 jtcohen6 removed their assignment Jan 9, 2023
@GiorgioBaldelli
Author

Thanks for the response, jtcohen6.

Knowing that it’s possible to self-host dbt package hub is useful information.

> In the meantime, since you know that Project A depends on Project B which depends on example_package, could you just take the following approach?

That would be a straightforward solution, I agree. In our case, it’s not as straightforward unfortunately: as a part of our dbt project deployment process, we apply some generic, strict linting rules to check if example_package is always defined in a given project’s packages.yml. Adding exceptions and de-duplication handling would require a significant amount of work.
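
For illustration only, a lint rule of this kind could in principle accept a package that is declared either directly or through a dependency's own packages.yml. The sketch below assumes the packages.yml files have already been parsed into a plain mapping; the function name and the `manifests` structure are hypothetical and not part of dbt.

```python
REQUIRED_GIT_URL = "https://gitlab.com/example/example_package.git"


def declares_required_package(project, manifests, required=REQUIRED_GIT_URL, _seen=None):
    """Return True if `project` reaches `required` directly or transitively.

    `manifests` maps a project key (here, its git URL or a label) to the list
    of git URLs found in that project's packages.yml. It stands in for parsed
    YAML files.
    """
    seen = set() if _seen is None else _seen
    if project in seen:
        return False  # already visited; also guards against cycles
    seen.add(project)
    for dep in manifests.get(project, []):
        if dep == required or declares_required_package(dep, manifests, required, seen):
            return True
    return False
```

With project_a listing only project_b, and project_b listing example_package, the check passes for both projects while still failing for a project that never reaches example_package.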

In any case, thanks again for the quick response and for providing details about how dbt package hub handles deduplication.

@xesf

xesf commented Jan 12, 2023

Without access to the Hub API in a production or CI environment, it is essential to have this issue sorted; otherwise we are forced to use only a single package.

@nathaniel-may nathaniel-may added the wontfix Not a bug or out of scope for dbt-core label Jan 12, 2023
@xesf

xesf commented Jan 12, 2023

It would be good to understand why this has been marked as wontfix, since the ability to use git repos in the packages.yml file seems to be a replacement for the Hub API or local packages.

@fabrice-etanchaud

fabrice-etanchaud commented Jan 27, 2023

Hi Jeremy !
I am stuck with a (git) package lineage for which I cannot see any workaround, for example:

my_source <-- my_enterprise_data_model__domain_A <--+
                                                    +------ my_cross_domains_mart
my_source <-- my_enterprise_data_model__domain_B <--+

I am going to try installing the hub locally. Implementing git package deduplication would really be a game changer, because for now this breaks dbt's incredible modularity.

How is life in Marseille? Candlemas is coming soon: time to walk up Rue Sainte, pray to the Black Madonna, and enjoy the still-warm "navettes" of Saint-Victor!

Regards!

(@jtcohen6 )

@jtcohen6
Contributor

jtcohen6 commented Jan 30, 2023

We closed this as wontfix because of my sense that this is a technical limitation of how dbt deps works with git packages. dbt isn't able to determine in advance the contents of git packages' packages.yml, and thereby understand their transitive dependencies, until those packages have already been installed.

But I was mistaken! dbt does (tries its best to do) exactly this:

def _checkout(self):
    """Performs a shallow clone of the repository into the downloads
    directory. This function can be called repeatedly. If the project has
    already been checked out at this version, it will be a no-op. Returns
    the path to the checked out directory."""
    try:
        dir_ = git.clone_and_checkout(
            self.git,
            get_downloads_path(),
            revision=self.revision,
            dirname=self._checkout_name,
            subdirectory=self.subdirectory,
        )
    except ExecutableError as exc:
        if exc.cmd and exc.cmd[0] == "git":
            fire_event(EnsureGitInstalled())
        raise
    return os.path.join(get_downloads_path(), dir_)

def _fetch_metadata(self, project, renderer) -> ProjectPackageMetadata:
    path = self._checkout()
    if (self.revision == "HEAD" or self.revision in ("main", "master")) and self.warn_unpinned:
        warn_or_error(DepsUnpinned(git=self.git))
    loaded = Project.from_project_root(path, renderer)
    return ProjectPackageMetadata.from_project(loaded)

Namely, dbt does use git to perform a "shallow clone" of the package, just to read its packages.yml, for use while resolving transitive dependencies and assembling the final set of resolved package names:

while pending:
    next_pending = PackageListing()
    # resolve the dependency in question
    for package in pending:
        final.incorporate(package)
        target = final[package].resolved().fetch_metadata(config, renderer)
        next_pending.update_from(target.packages)
    pending = next_pending
resolved = final.resolved()
_check_for_duplicate_project_names(resolved, config, renderer)

So then the question is: Why are you seeing the duplicate project name error, which we explicitly check for & raise there?
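
To make the two steps concrete, here is a deliberately simplified toy model of the flow: a breadth-first walk that keeps one entry per package key, followed by a duplicate-name check. It is not dbt's implementation. The `metadata` mapping, the function names, and the exception class are all hypothetical stand-ins, and the choice of the git URL as the package key is an assumption of the sketch.

```python
from collections import deque


class DuplicateProjectName(Exception):
    """Stand-in for the duplicate-project error; not dbt's own exception class."""


def resolve(root_specs, metadata):
    """Walk transitive dependencies breadth-first, keeping one entry per package key.

    `metadata` maps a package key (here, the git URL) to a dict holding the
    project `name` and the list of `packages` it depends on.
    """
    final = {}
    pending = deque(root_specs)
    while pending:
        key = pending.popleft()
        if key in final:
            continue  # same key seen again: a shared dependency, kept once
        final[key] = metadata[key]
        pending.extend(metadata[key]["packages"])
    return final


def check_duplicate_project_names(resolved):
    """Raise if two distinct package keys resolve to the same project name."""
    seen = {}
    for key, meta in resolved.items():
        name = meta["name"]
        if name in seen:
            raise DuplicateProjectName(f"{name!r} provided by {seen[name]} and {key}")
        seen[name] = key
```

In this toy model a shared dependency that is always referenced by the same key is deduplicated and passes the check. The error can only fire when the same project is reached under two different keys. That is a property of the sketch, not a confirmed diagnosis of the reports in this thread.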


When I try to reproduce the actual issue locally, to pinpoint where we'd need to make the change, I can't — it's working fine for me:

packages:
  - git: "https://github.com/fivetran/dbt_hubspot_source"
    revision: v0.3.0
    # this is also a dependency of fivetran/dbt_hubspot_source@v0.3.0
  - git: "https://github.com/fivetran/dbt_fivetran_utils.git"
    warn-unpinned: false

$ dbt --no-version-check deps
09:23:22  Running with dbt=1.5.0-a1
... [warnings for old package versions] ...
09:23:29  Installing https://github.com/fivetran/dbt_hubspot_source
09:23:30  Installed from revision v0.3.0
09:23:30  Installing https://github.com/fivetran/dbt_fivetran_utils.git
09:23:31  Installed from HEAD (default revision)
09:23:31  Installing fishtown-analytics/dbt_utils
09:23:31  Installed from version 0.6.6
09:23:31  Updated version available: 0.7.0
09:23:31
09:23:31  Updates available for packages: ['fishtown-analytics/dbt_utils']
Update your versions in packages.yml, then run dbt deps

If I drop a breakpoint between these two lines, all looks good:

ipdb> resolved
[<dbt.deps.git.GitPinnedPackage object at 0x10420fbb0>, <dbt.deps.git.GitPinnedPackage object at 0x10420ff70>, <dbt.deps.registry.RegistryPinnedPackage object at 0x10420fd60>]
ipdb> [res.name for res in resolved]
['https://github.com/fivetran/dbt_hubspot_source', 'https://github.com/fivetran/dbt_fivetran_utils.git', 'fishtown-analytics/dbt_utils']
ipdb> _check_for_duplicate_project_names(resolved, config, renderer)

@GiorgioBaldelli @xesf @fabrice-etanchaud Any chance you're doing something fun, like using Windows?

(Fabrice, it's been a long time! All is well in Marseille; I hope the same goes for Niort.)

@jtcohen6 jtcohen6 reopened this Jan 30, 2023
@jtcohen6 jtcohen6 added bug Something isn't working and removed enhancement New feature or request wontfix Not a bug or out of scope for dbt-core labels Jan 30, 2023
@jtcohen6 jtcohen6 changed the title from "[CT-1772] [Feature] dbt should handle duplicate package imports when using git syntax" to "[CT-1772] [Feature] Unexpected DuplicateProjectDependencyError in git packages with common transitive dependency" Jan 30, 2023
@fabrice-etanchaud

Ahah, fun on Windows! Shame on my company, because I have to use Windows there, although all my family's PCs are running Linux Q4OS! I'll try it on Linux and let you know...
Thank you Jérémy !

@jmussitsch

@jtcohen6 thanks for looking into this! Is it safe to assume then that this part of the documentation is not entirely accurate:

https://docs.getdbt.com/docs/build/packages#hub-packages-recommended

> Where possible, we recommend installing packages via dbt Hub, since this allows dbt to handle duplicate dependencies. This is helpful in situations such as:
>
>   • Your project uses both the dbt-utils and Snowplow packages; and the Snowplow package also uses the dbt-utils package.
>   • Your project uses both the Snowplow and Stripe packages, both of which use the dbt-utils package.
>
> In comparison, other package installation methods are unable to handle the duplicate dbt-utils package.

@fabrice-etanchaud

Yes ! it works. Thank you Jérémy.

@jtcohen6
Contributor

jtcohen6 commented Feb 1, 2023

> Is it safe to assume then that this part of the documentation is not entirely accurate

It does sound like it's worth updating that documentation—although it remains true that only Hub packages allow you to specify more-complex version requirements, with version resolution. With git packages, the shared dependency must be pinned to a single common revision (tag/release/branch/commit).
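
As a small illustration of that last constraint, a pre-flight check could gather the git entries from every packages.yml involved and flag any URL pinned to more than one revision. This is a hedged sketch: the tuple layout and the function name are made up for the example, and it compares URLs as plain strings.

```python
from collections import defaultdict


def conflicting_revisions(specs):
    """Group git package specs by URL and report URLs pinned to more than one revision.

    `specs` is an iterable of (declaring_project, git_url, revision) tuples,
    standing in for entries gathered from several packages.yml files.
    """
    revisions = defaultdict(set)
    for _project, git_url, revision in specs:
        revisions[git_url].add(revision)
    # Keep only URLs that were requested at two or more distinct revisions.
    return {url: sorted(revs) for url, revs in revisions.items() if len(revs) > 1}
```

An empty result means every shared git dependency is requested at one common revision; any entry in the result names a package whose pins would need to be aligned by hand.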

> Yes ! it works. Thank you Jérémy.

That sounds like confirmation this issue is Windows-only? My guess would be, something to do with lacking the right file permissions to remove/overwrite the already-installed package?

@jtcohen6 jtcohen6 added the windows Everyone's favorite OS that's sometimes a little weird label Feb 1, 2023
@xesf

xesf commented Feb 1, 2023

@jtcohen6 I was using mac m1.
My package file had 2 packages (dbt_utils and dbt_codegen) and I had to remove codegen and keep just the utils because of this issue.
I think both had revision as "main".

@GiorgioBaldelli
Author

GiorgioBaldelli commented Feb 2, 2023

> I was using mac m1.

Same here. In my attempts, I was running my dbt commands on a container application running on a mac m1. I was mounting a local directory that contained the dbt project definition, models, package config, etc.

@codsimmo7

codsimmo7 commented Feb 21, 2023

Hi all,

I am experiencing the same issue using the "git" reference in packages.yml. However, I have a different repository design that I want to implement, and it seems to be a limitation outside of GitHub (my team uses Azure DevOps Git repositories, which is why we rely on the "git" package tag instead of "hub" or "package").

Project A (common repo):

  • Intended for a common repository of all sources and macros. Define the sources once in "shared_sources.yml", and simply import them across other projects/repos.

Project B (application specific):

  • Import from Project A the shared_sources.yml and macros to be referenced in some models.

Project C (application specific):

  • Import from Project A the shared_sources.yml and macros to be referenced in some models.

Project D (documentation for all applications):

  • Import from Project B and C
  • Intended to be a parent-level repository that holds ALL documentation (dbt docs generate) in order to accurately see the lineage between Project A, B and C. In other words, regardless of which application references a common source, I want to be able to view the lineage of all models that use a particular source in one lineage graph, while ALSO having the convenience of defining those sources once in a separate, shared/common repository. Note: I do NOT want to accomplish this with a single repository that holds everything, because of the downsides of that approach, such as who manages approvals for code changes to application 1 vs application 2 in the same codebase, compile times for many models, etc.

As of right now, this type of implementation throws an error stating "Found a dependency with the same name as the root project <>. Package names must be unique in a project. Please rename one of these packages."

My second idea was to combine the common repo (Project A) and the documentation repo (Project D) into one. The underlying issue remains, though: Projects B and C need to import from this new combined repo, while the combined repo also needs to import from B and C (a duplicate dependency), which throws the same error.
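
The combined-repo layout can be checked for circular imports before ever running dbt deps. The following is a minimal sketch using a depth-first search over a hand-written dependency mapping; the project names and the `graph` structure are illustrative only.

```python
def find_cycle(graph, start):
    """Return one dependency cycle reachable from `start` as a list of names, or None.

    `graph` maps a project name to the names of the packages it imports.
    """
    path = []        # current chain of imports being explored
    on_path = set()  # same chain, as a set for fast membership tests
    done = set()     # projects fully explored without finding a cycle

    def visit(node):
        if node in on_path:
            # The chain has looped back onto itself: report the loop.
            return path[path.index(node):] + [node]
        if node in done:
            return None
        path.append(node)
        on_path.add(node)
        for dep in graph.get(node, []):
            cycle = visit(dep)
            if cycle:
                return cycle
        path.pop()
        on_path.discard(node)
        done.add(node)
        return None

    return visit(start)
```

For the combined layout, `find_cycle({"combined": ["project_b", "project_c"], "project_b": ["combined"], "project_c": ["combined"]}, "combined")` reports a loop through project_b, whereas the original four-project layout (D imports B and C, which both import A) has no cycle.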

Reading the comments above, it seems that GitHub resolves this issue due to its inherent ability to handle duplicate dependencies. However, my enterprise is not able to use GitHub; we have to use Azure DevOps, hence the need for the "git" package tag and "revision" qualifier. If the "git" qualifier could behave the way the "hub" or "package" qualifier does, or if there were a way to instruct dbt about the dependencies and/or ignore ones that already exist, I think this approach could work.

Very unfortunate as of right now :/

Last thing to note: there is very little documentation on inspecting the OpenSSL logs that the "dbt deps" command relies on, which makes it hard to troubleshoot why dbt loops at random, asking for your Azure DevOps username/personal access token, only to fail (seemingly due to a certificate/proxy issue). You simply get a generic SSL error stating it was "unable to checkout spec=None", or sometimes "unable to checkout spec=", and it still fails. Brute-force retries eventually seem to work all of a sudden, which makes the problem nearly impossible to reproduce reliably.

@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Aug 21, 2023
@github-actions
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Aug 28, 2023

7 participants