Refactor run methods more into abstract method #4353

Open
wants to merge 20 commits into
base: main

@merelcht merelcht commented Nov 26, 2024

Description

Resolves #4290

Development notes

  • Refactored and merged logic of the various _run method implementations into the abstract _run method.
  • Added abstract _get_executor method, implemented by each runner.
  • Refactored logic to determine number of workers from ParallelRunner and ThreadRunner into shared validate_max_workers method.
  • Changed hook_manager argument in runners to allow it to be None, which is needed for the ParallelRunner, because hook manager can't be serialised.
  • Updated TestSuggestResumeScenario for SequentialRunner: because it now uses a ThreadPoolExecutor, the suggestions can vary per run. I manually created a project with the same pipelines and verified that the suggestions, even when they vary, always work.
  • Added TestSuggestResumeScenario tests for ThreadRunner.
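
To illustrate the shape of the refactor described in the notes above, here is a minimal sketch of a shared run loop that delegates only the choice of executor to subclasses. It is not Kedro's code: `SketchRunner`, `SketchThreadRunner`, `SketchSequentialRunner`, `_work`, and the `tasks` mapping are hypothetical names invented for this example, and the real `_run` handles catalogs, hooks, and dataset release that are omitted here.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from concurrent.futures import FIRST_COMPLETED, Executor, ThreadPoolExecutor, wait


def _work(name: str) -> str:
    """Stand-in for running a node; it just returns the task name."""
    return name


class SketchRunner(ABC):
    """Hypothetical base class (not Kedro's AbstractRunner)."""

    @abstractmethod
    def _get_executor(self, max_workers: int) -> Executor:
        """Each concrete runner decides which executor to schedule on."""

    def run(self, tasks: dict[str, set[str]], max_workers: int = 2) -> list[str]:
        """Run tasks in dependency order; `tasks` maps a name to its dependencies."""
        done: list[str] = []
        todo = dict(tasks)
        futures = {}
        with self._get_executor(max_workers) as pool:
            while todo or futures:
                # Submit every task whose dependencies have all completed.
                ready = [n for n, deps in todo.items() if deps <= set(done)]
                for name in ready:
                    del todo[name]
                    futures[pool.submit(_work, name)] = name
                if not futures:
                    raise ValueError("unresolvable dependencies")
                # Block until at least one submitted task finishes.
                finished, _ = wait(futures, return_when=FIRST_COMPLETED)
                for fut in finished:
                    done.append(fut.result())
                    del futures[fut]
        return done


class SketchThreadRunner(SketchRunner):
    def _get_executor(self, max_workers: int) -> Executor:
        return ThreadPoolExecutor(max_workers=max_workers)


class SketchSequentialRunner(SketchRunner):
    def _get_executor(self, max_workers: int) -> Executor:
        # Sequential behaviour is approximated with a single worker thread.
        return ThreadPoolExecutor(max_workers=1)
```

In this sketch, independent tasks submitted together may be collected in either order, so the completion order is only partially determined by the dependency graph.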

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

@merelcht merelcht self-assigned this Nov 26, 2024
merelcht and others added 2 commits November 27, 2024 11:22
Signed-off-by: Merel Theisen <[email protected]>
@merelcht merelcht marked this pull request as ready for review November 27, 2024 11:43
    ) -> ThreadPoolExecutor | ProcessPoolExecutor:
        """Abstract method to provide the correct executor (e.g., ThreadPoolExecutor or ProcessPoolExecutor)."""
        pass

    @abstractmethod  # pragma: no cover
Contributor

Is this still an abstractmethod?

@@ -46,11 +46,16 @@ def __init__(
            is_async=is_async, extra_dataset_patterns=self._extra_dataset_patterns
        )

    def _get_executor(self, max_workers: int) -> ThreadPoolExecutor:
Contributor

Is it necessary to create a thread for SequentialRunner?
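
As background for this question, the sketch below shows what a single-worker `ThreadPoolExecutor` does in plain Python: tasks run one at a time in submission order, but on a worker thread rather than on the main thread. The `record` helper and the task names are made up for the example; nothing here is taken from the PR's code.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def record(name, log):
    """Append the task name and whether it ran on the main thread."""
    on_main = threading.current_thread() is threading.main_thread()
    log.append((name, on_main))
    return name


log = []
# One worker means tasks are taken from the queue one at a time, in order.
with ThreadPoolExecutor(max_workers=1) as pool:
    futures = [pool.submit(record, name, log) for name in ["a", "b", "c"]]
    results = [future.result() for future in futures]

print(results)
print(log)
```

Whether the extra thread matters in practice depends on what the nodes do (for example, code that expects to run on the main thread), which this sketch does not try to assess.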

@@ -443,3 +546,27 @@ def run_node(
)
node = task.execute()
return node


def validate_max_workers(max_workers: int | None) -> int:
Contributor

Is there a reason why _validate_catalog and _validate_nodes are private methods and this one is public?
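
For readers without the diff open, the following is a rough sketch of what a shared worker-count helper with this signature could look like. The body is an assumption, not the PR's implementation: the fallback to `os.cpu_count()`, the Windows cap, and the error message are guesses, and the function is deliberately named `validate_max_workers_sketch` to mark it as hypothetical. The cap of 61 reflects the limit the Python documentation states for `ProcessPoolExecutor` on Windows.

```python
from __future__ import annotations

import os
import sys

# Assumption: mirror the documented ProcessPoolExecutor limit on Windows.
_MAX_WINDOWS_WORKERS = 61


def validate_max_workers_sketch(max_workers: int | None) -> int:
    """Return a usable worker count, or raise if the request is invalid."""
    if max_workers is None:
        # No explicit request: fall back to the CPU count (at least 1).
        max_workers = os.cpu_count() or 1
        if sys.platform == "win32":
            max_workers = min(_MAX_WINDOWS_WORKERS, max_workers)
    elif max_workers <= 0:
        raise ValueError("max_workers should be positive")
    return max_workers
```

Because the helper takes no `self`, it can live at module level and be called by several runners, which is one possible reason for it not following the private-method naming of `_validate_catalog` and `_validate_nodes`.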

)

self._release_datasets(node, catalog, load_counts, pipeline)
super()._run(
Contributor

Is it still needed if it's just using the base method?

Comment on lines +13 to +18
from collections import Counter, deque
from concurrent.futures import (
    FIRST_COMPLETED,
    ProcessPoolExecutor,
    ThreadPoolExecutor,
    wait,
Contributor

Would this import multiprocessing as a dependency? I recall that in the past we had issues with ShelveStore, because even importing the library caused issues in restricted environments like AWS Lambda.
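
One way to check this empirically is to run each import in a fresh interpreter and look for `multiprocessing` in `sys.modules` afterwards. The sketch below does that; `imports_multiprocessing` is a hypothetical helper written for this comment, and the result may differ between Python versions and environments, so no expected output is claimed.

```python
import subprocess
import sys


def imports_multiprocessing(statement: str) -> bool:
    """Run `statement` in a clean interpreter and report whether it loaded multiprocessing."""
    code = f"import sys\n{statement}\nprint('multiprocessing' in sys.modules)"
    completed = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip() == "True"


if __name__ == "__main__":
    for statement in (
        "from concurrent.futures import ThreadPoolExecutor",
        "from concurrent.futures import ProcessPoolExecutor",
    ):
        print(statement, "->", imports_multiprocessing(statement))
```

Running the same check inside the restricted environment itself (for example an AWS Lambda function) would be the more meaningful test, since the concern is about what happens at import time there.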

Development

Successfully merging this pull request may close these issues.

Abstract _run as much as possible
2 participants