Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(vcs): VCS configuration to support common and custom attributes #9271

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

wkl3nk
Copy link
Contributor

@wkl3nk wkl3nk commented Oct 10, 2024

While VCS implementations are already plugins, they are not yet configurable. VCS implementations require common configurations (e.g., revision, recursive) and should support also VCS-specific configurations if they are consumed via their API.

This allows to add functionality to individual VCS implementations without the need to implement them for all of them.

This refactoring keeps the common configuration attributes as they are, while VCS-specific configurations are stored generically in an options attribute. The effects are limited to usage through the command-line interface of the
Downloader tool, designed primarily for scripted operations. In other stages of the ORT pipeline these configuration options have no effect.

The change also adds VCS-specific configuration options for Git: The strategy for checking out repositories with submodules now allows to only check out the first layer of submodules.

Example configuration in config.yml:

ort:
  analyzer:
    versionControlSystems:
      Git:
        options:
          submoduleHistoryDepth: 2
          updateNestedSubmodules: false

The change is split up into 2 commits:

  • First one deals with refactoring to use a new plugin interface for version control systems, that also allows to supply specific configurations.
  • Second one is for 2 configuration options for Github VCS, dealing with submodules.

For more details, please have a look at the respective commits.

Fixes #8556.

Copy link

codecov bot commented Oct 10, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 67.97%. Comparing base (ddfdef1) to head (c15ce99).

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #9271      +/-   ##
============================================
+ Coverage     67.93%   67.97%   +0.03%     
- Complexity     1289     1293       +4     
============================================
  Files           249      250       +1     
  Lines          8792     8802      +10     
  Branches        913      914       +1     
============================================
+ Hits           5973     5983      +10     
  Misses         2433     2433              
  Partials        386      386              
Flag Coverage Δ
funTest-docker 64.82% <ø> (ø)
funTest-non-docker 33.28% <0.00%> (-0.06%) ⬇️
test 35.83% <100.00%> (+0.07%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

@wkl3nk wkl3nk force-pushed the wkl3nk/make-vcs-plugins-configurable branch 5 times, most recently from 60b8d10 to ae31656 Compare October 11, 2024 14:06
@wkl3nk wkl3nk force-pushed the wkl3nk/make-vcs-plugins-configurable branch 5 times, most recently from 1cf9421 to ce5d30c Compare October 21, 2024 15:27
@wkl3nk wkl3nk marked this pull request as ready for review October 21, 2024 15:30
@wkl3nk wkl3nk requested a review from a team as a code owner October 21, 2024 15:30
@fviernau
Copy link
Member

I believe the options for configuring submodule recursion could be injected via package curations. Have you considered this?

@wkl3nk
Copy link
Contributor Author

wkl3nk commented Oct 22, 2024

@fviernau Hello, can you explain a little bit more please?

@fviernau
Copy link
Member

fviernau commented Oct 22, 2024

@fviernau Hello, can you explain a little bit more please?

I believe it makes sense to have the ability to configure this separate per dependency.
I expect the following trade-off / approach to be reasonable for some users:

  1. Enable submodule recursion by default
  2. Disable submodule recursion per package curation (after reviewing that this is ok, scan won't miss anything relevant)

@wkl3nk wkl3nk force-pushed the wkl3nk/make-vcs-plugins-configurable branch 2 times, most recently from d9ef04f to caa86ed Compare October 23, 2024 10:09
@wkl3nk wkl3nk force-pushed the wkl3nk/make-vcs-plugins-configurable branch from caa86ed to 8217cc4 Compare October 23, 2024 10:16
@wkl3nk wkl3nk force-pushed the wkl3nk/make-vcs-plugins-configurable branch from 8217cc4 to a7c9533 Compare October 23, 2024 10:23
@fviernau
Copy link
Member

I believe it makes sense to have the ability to configure this separate per dependency. I expect the following trade-off / approach to be reasonable for some users:

1. Enable submodule recursion by default

2. Disable submodule recursion per package curation (after reviewing that this is ok, scan won't miss anything relevant)

To add to this, I believe in order to implement the above it would be necessary to pass
the config flags as parameter to the functions of the VCS implementations which make the cloning.

We could right away unify this instead of introducing two ways to tell the behavior. In particular,
even though Git, GitRepo and so forth are plugins, I believe it would be fine to assume that these are built in.
So, we can add the configuration to OrtConfiguration or any model under it, instead of to a plugin configuration.
Then, the caller could just pass whatever is configured to the VCS instance as param.
Any thoughts on that?

@wkl3nk
Copy link
Contributor Author

wkl3nk commented Oct 23, 2024

@fviernau I think I understand your concerns. However, my key issue is that there are very large repositories that act as kind of umbrella around other repositories, having already more than 50 git submodules in the first layer. These submodules then also have submodules and so on. One can't even be sure that there are no cycles in this nested submodules. ORT currently cannot even download these repositories, we tried it out and stopped it after several hours.
On the other hand, the project developers say that they usually only check out the submodules of the first layer, nothing more. This change request is about just that: Enable ORT to checkout out just the first layer of submodules and not the complete nested, possibly never ending tree of nested submodules.

@fviernau
Copy link
Member

fviernau commented Oct 23, 2024

@fviernau I think I understand your concerns. However, my key issue is that there are very large repositories that act as kind of umbrella around other repositories, having already more than 50 git submodules in the first layer. These submodules then also have submodules and so on. One can't even be sure that there are no cycles in this nested submodules. ORT currently cannot even download these repositories, we tried it out and stopped it after several hours.
On the other hand, the project developers say that they usually only check out the submodules of the first layer, nothing more. This change request is about just that: Enable ORT to checkout out just the first layer of submodules and not the complete nested, possibly never ending tree of nested submodules.

My concerns were not about the what but about the how. So, I generally agree that it makes sense that ORT provides a way to configure this. It also wasn't meant as a strict concern, but I wanted to make sure this is considered before committing to some solution. Maybe @sschuberth @mnonnenmacher any thoughts?

Another related thing: A while ago we've invented "scan by repository", which changed the scanner to no more clone recursively IIRC. The only place where recursive cloning is happening is in the provenance resolution stage IIRC. There had been an idea of mine, I believe I brought this up already longer time ago, to rewrite that provenance resolved so that it works without cloning recursively. The resolved would itself in Kotlin the recurse, but not use the VCSes recursion capabilities. If we would go that road, then configuring the recursion depth would be a configuration of that resolved, not of the VCS plugin. This is another indication (to me) that configuring this at the VCS plugin level may not be the right place.

@wkl3nk can you confirm that the issue happens during provenance resolution?

All in all, I more and more think this should be considered configuration of ORT, not of VCS plugin.
The config could be agnostic of the concrete VCS, e.g. some clone rec depth which applies to any VCS / provenance resolution?

@wkl3nk
Copy link
Contributor Author

wkl3nk commented Oct 24, 2024

@fviernau wrote

can you confirm that the issue happens during provenance resolution?

Would "provenance resolution" mean that the Analyzer has scanned for dependency management files? If yes, then the answer is: No - The inflexibility lies in the downloader, it is the default to do a "git submodule --init --recursive" to all submodules. So the Analyzer never ever kicks in, because it just is not possible in a reasonable time and space to download all these > 50 nested submodules and their respective submodules and so on. That's the reason for going in the direction to be able to configure the ORT Downloader in config.yml.

@wkl3nk wkl3nk force-pushed the wkl3nk/make-vcs-plugins-configurable branch from 0509c0a to 5221536 Compare November 8, 2024 15:48
@wkl3nk wkl3nk force-pushed the wkl3nk/make-vcs-plugins-configurable branch 4 times, most recently from c66ccf6 to 307fa1a Compare November 11, 2024 09:14
@wkl3nk wkl3nk force-pushed the wkl3nk/make-vcs-plugins-configurable branch from 307fa1a to 04c00b5 Compare November 11, 2024 09:20
@wkl3nk wkl3nk force-pushed the wkl3nk/make-vcs-plugins-configurable branch from 04c00b5 to f8e6bb6 Compare November 13, 2024 08:57
@wkl3nk wkl3nk force-pushed the wkl3nk/make-vcs-plugins-configurable branch 3 times, most recently from baca8ab to 6c28dc0 Compare November 13, 2024 13:50
@wkl3nk
Copy link
Contributor Author

wkl3nk commented Nov 13, 2024

Now using the new Plugin concept that is based on TypedConfigurablePluginFactory and is already was implemented for the scanners before. VCS classes now have a factory inside, and can have a specific configuration class, that is injected in the contructor of the VCS implementation by the factory. As an example, I added such a specific configuration class for Git.

@wkl3nk wkl3nk marked this pull request as ready for review November 13, 2024 14:03
@sschuberth
Copy link
Member

Now using the new Plugin concept that is based on TypedConfigurablePluginFactory

Before I start review, I'd like @mnonnenmacher to comment whether this "intermediate step" is meaningful to do (now that it's already there...) instead of straight migrating all VCS to the KSP-based plugin mechanism in the context of #9403.

@sschuberth sschuberth requested a review from a team November 14, 2024 08:16
@mnonnenmacher
Copy link
Member

mnonnenmacher commented Nov 14, 2024

Now using the new Plugin concept that is based on TypedConfigurablePluginFactory

Before I start review, I'd like @mnonnenmacher to comment whether this "intermediate step" is meaningful to do (now that it's already there...) instead of straight migrating all VCS to the KSP-based plugin mechanism in the context of #9403.

It would have made sense to use the KSP plugin API right away, however, now that it is implemented, the effort to do it in this PR or another one would be the same, so both would be fine with me.

@wkl3nk However, I have a conceptual question: In a usual ORT pipeline the downloader is used in two places:

  • Before running the analyzer to download the project repository.
  • By the scanner to download the sources of the dependencies.

The new options you introduce are to my understanding only relevant for the first step to limit the downloaded sources of the project itself. But they should not be applied within the scanner, because you cannot know beforehand if the dependencies use any submodules and if they need to be scanned or not.
Did you consider this during the implementation?

@wkl3nk
Copy link
Contributor Author

wkl3nk commented Nov 14, 2024

Where can this functionality currently be used?

@mnonnenmacher The functionality can out of the box now be used for the Downloader stage, meaning in the typical workflow where the ORT CLI is first used to download the repository. It supports adding VCS specific configurations for e.g. Git. For instance, you can configure in .ort/config/config.yml if only the first layer of submodules shall be downloaded, in case you need to deal with repositories with many many nested submodule layers. Once the repository is downloaded, it can be analyzed and scanned with other options of the ORT CLI.

In ORT Server, where we don't have this Downloader stage, and instead the Downloader is instantiated and used in the code, just before the Analyzer, this change also allows to configure the VCSs in a flexible way. How this is done would be a question within the context of the ORT Server, so this PR is more or less only the groundwork to make it possible to configure VCSs.

@mnonnenmacher
Copy link
Member

Where can this functionality currently be used?

@mnonnenmacher The functionality can out of the box now be used for the Downloader stage, meaning in the typical workflow where the ORT CLI is first used to download the repository. It supports adding VCS specific configurations for e.g. Git. For instance, you can configure in .ort/config/config.yml if only the first layer of submodules shall be downloaded, in case you need to deal with repositories with many many nested submodule layers. Once the repository is downloaded, it can be analyzed and scanned with other options of the ORT CLI.

In ORT Server, where we don't have this Downloader stage, and instead the Downloader is instantiated and used in the code, just before the Analyzer, this change also allows to configure the VCSs in a flexible way. How this is done would be a question within the context of the ORT Server, so this PR is more or less only the groundwork to make it possible to configure VCSs.

Yes., that part is clear, my concern is that it is not obvious form the configuration file that these settings will only be applied in certain situations, namely when Git is used via the download command of the CLI, but not when Git is used by the scanner. This should at least be documented very clearly to avoid confusion.

@wkl3nk wkl3nk force-pushed the wkl3nk/make-vcs-plugins-configurable branch from 6c28dc0 to 33dc841 Compare November 21, 2024 08:12
While VCS implementations are already plugins, they are not yet
configurable. VCS implementations require common configurations
(e.g., `revision`, `recursive`) and should support also
VCS-specific configurations if they are consumed via their API.
This allows to add functionality to individual VCS implementations
without the need to implement them for all of them.

The effects are limited to usage through the command-line interface
of the Downloader tool, designed primarily for scripted operations.
In other stages of the ORT pipeline these configuration options
have no effect.

Fixes oss-review-toolkit#8556.

Signed-off-by: Wolfgang Klenk <[email protected]>
For large repositories with many layers of nested Git submodules, the
download process can be very time-consuming and often results in
duplicate projects in the tree of nested submodules.
This feature introduces configuration options to limit the recursive
checkout of nested Git submodules to the first layer, optimizing
performance and reducing redundancy. Additionally, it also allows to
limit the depth of commit history to fetch when downloading
the projects.

Signed-off-by: Wolfgang Klenk <[email protected]>
@wkl3nk wkl3nk force-pushed the wkl3nk/make-vcs-plugins-configurable branch from 33dc841 to c15ce99 Compare November 21, 2024 08:15
@fviernau
Copy link
Member

Yes., that part is clear, my concern is that it is not obvious form the configuration file that these settings will only be applied in certain situations, namely when Git is used via the download command of the CLI, but not when Git is used by the scanner. This should at least be documented very clearly to avoid confusion.

If this is really only for the downloader, would this mean often re-configuration depending on what project you clone. Was it considered to make it CLI option?

@fviernau
Copy link
Member

Would "provenance resolution" mean that the Analyzer has scanned for dependency management files?

Provenance resolution is done as part of the "scanner" stage.

@wkl3nk
Copy link
Contributor Author

wkl3nk commented Nov 21, 2024

@mnonnenmacher I added comments in the reference.yml to express that the functionality is limited to the Downloader tools only in command line use (scripted use), and has (out of the box) no effect when the Downloader is used in programmatical way. I added hints in the commit message.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make VCS plugins configurable
4 participants