Regression: fix results with out of order tasks #7169

Merged: 1 commit into tektoncd:main on Oct 3, 2023

Conversation

afrittoli
Member

@afrittoli afrittoli commented Oct 3, 2023

Changes

The pipeline run reconciler builds a pipeline run state on every run, which resolves task references, expands results and processes matrix fan outs.

The current process is incremental in a single loop, where each new PipelineTask resolution depends on the state of PipelineTasks resolved before. This is problematic because tasks are not necessarily defined in the pipeline in order of execution (which is undefined, given that pipelines are DAGs).
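The ordering problem can be sketched with a simplified model. The `task` type and the `resolveInOrder` function below are hypothetical stand-ins for illustration, not the reconciler's actual types: each task optionally consumes a result from one other task, and a single loop can only see tasks that appeared earlier in the declaration order.

```go
package main

import "fmt"

// task is a hypothetical, simplified stand-in for a PipelineTask.
type task struct {
	name  string
	needs string // name of the task whose result is consumed ("" if none)
}

// resolveInOrder walks the tasks in declaration order and returns the names
// of the tasks whose producer was not yet part of the state built so far.
func resolveInOrder(tasks []task) []string {
	seen := map[string]bool{}
	var unresolved []string
	for _, t := range tasks {
		if t.needs != "" && !seen[t.needs] {
			unresolved = append(unresolved, t.name)
		}
		seen[t.name] = true
	}
	return unresolved
}

func main() {
	inOrder := []task{{name: "producer"}, {name: "consumer", needs: "producer"}}
	reversed := []task{{name: "consumer", needs: "producer"}, {name: "producer"}}
	fmt.Println(resolveInOrder(inOrder))  // [] : the producer is already in the state
	fmt.Println(resolveInOrder(reversed)) // [consumer] : the producer is declared later
}
```

Both slices describe the same DAG; only the declaration order differs, which is enough to change the outcome of a single incremental pass.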

Since this PR is a fix to a regression, it aims to be as minimal as possible. The smallest solution available is to implement some sorting in the list of tasks, so that the incremental state can work correctly.

This PR splits the process into two runs: one for tasks that have already been started (and possibly completed), and a second one that includes all remaining tasks. The first group of tasks does not need matrix fan outs (they have already been processed) or result resolution, so its state can be safely built incrementally.

The second group is executed starting from the state of the first group. Any task that is a candidate for execution in this reconcile cycle must have its results resolved through the state of the first group.

Testing with the current code arrangement is a bit challenging, as we ignore result resolution errors in the code, which is ok only in some cases:

  • a resolution error caused by a missing task or an undefined result is permanent and should not be ignored
  • a resolution error caused by a result that is not available yet is (possibly) ephemeral and should not cause a failure

Currently one function checks for all these conditions and returns one error, so it's not possible to safely distinguish them. This will require some refactoring to be fixed in a follow up patch.

For now, a reconcile unit test can test the fix.

Fixes: #7103

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs if any changes are user facing, including updates to minimum requirements e.g. Kubernetes version bumps
  • Has Tests included if any functionality added or changed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings). See some examples of good release notes.
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

Fix a regression where the order in which tasks are defined in a pipeline could break result resolution

/kind bug

@tekton-robot tekton-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 3, 2023
@tekton-robot tekton-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 3, 2023
@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/pipelinerun.go 92.7% 92.4% -0.3

@tekton-robot
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/pipelinerun.go 92.7% 92.4% -0.3

Signed-off-by: Andrea Frittoli <[email protected]>
@tekton-robot tekton-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 3, 2023
@afrittoli afrittoli changed the title [WIP] Regression: fix results with out of order tasks Regression: fix results with out of order tasks Oct 3, 2023
@tekton-robot tekton-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 3, 2023
@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 3, 2023
Contributor

@EmmaMunley EmmaMunley left a comment

lgtm

@tekton-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: EmmaMunley, vdemeester

Member

@Yongxuanzhang Yongxuanzhang left a comment

/lgtm
Thanks!!

@@ -536,7 +536,46 @@ func (c *Reconciler) reconcile(ctx context.Context, pr *v1.PipelineRun, getPipel
if len(pipelineSpec.Finally) > 0 {
tasks = append(tasks, pipelineSpec.Finally...)
}
pipelineRunState, err := c.resolvePipelineState(ctx, tasks, pipelineMeta.ObjectMeta, pr)

// We spit tasks in two lists:
Member

nit: split

Member Author

Oh, nice, thanks, I will fix in a follow up

// a PipelineTask has at least one TaskRun associated, then all its TaskRuns have been
// created already.
// The second group takes as input the partial state built in the first iteration and finally
// the two results are collated
Member

nit: collected

Member Author

I actually meant collated

Member

Oh sorry I misunderstood here. 😄

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 3, 2023
@tekton-robot tekton-robot merged commit cbabe7f into tektoncd:main Oct 3, 2023
2 checks passed
@afrittoli afrittoli mentioned this pull request Oct 4, 2023
@afrittoli
Member Author

/cherry-pick release-v0.50.x

@afrittoli afrittoli added the needs-cherry-pick Indicates a PR needs to be cherry-pick to a release branch label Oct 4, 2023
@tekton-robot
Collaborator

@afrittoli: new pull request created: #7173

In response to this:

/cherry-pick release-v0.50.x

@afrittoli
Member Author

/cherry-pick release-v0.52.x

@tekton-robot
Collaborator

@afrittoli: new pull request created: #7174

In response to this:

/cherry-pick release-v0.52.x

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. needs-cherry-pick Indicates a PR needs to be cherry-pick to a release branch release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

task results not templated if task order is reversed
5 participants