Replies: 3 comments 4 replies
-
What is the goal of encapsulating I/O params into an object? Why not pass them as actual parameters to a task?
-
I think it would be very useful to somehow abstract away some of the filesystem-specific requirements in Hedwig. It is very difficult to write pytest tests, for example, because there are so many `Path` dependencies involved. There are lots of checks that a `Path` lives under the `working_dir`, so the whole path often has to be created on disk just to run a simple test - or at least that's what I remember. I'm not sure whether the TaskIO concept would relieve this, but it might help.
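To illustrate the testing pain, here is a minimal sketch; the function names and directory layout are hypothetical, not Hedwig's actual code. Because a task checks that its input lives under `working_dir`, a unit test has to materialize the whole directory tree before anything can run:

```python
from pathlib import Path

def check_in_working_dir(path: Path, working_dir: Path) -> bool:
    # The kind of Path check that forces tests to build real trees:
    # the input is only accepted if it sits under working_dir.
    return working_dir in path.parents

def test_task_input(tmp_path: Path) -> None:
    # pytest-style test; tmp_path would normally be the pytest fixture.
    working_dir = tmp_path / "project" / "working_dir"
    working_dir.mkdir(parents=True)
    input_tiff = working_dir / "sample.tiff"
    input_tiff.touch()  # the file must really exist on disk
    assert check_in_working_dir(input_tiff, working_dir)
```

A task that instead received its input explicitly could be tested with a plain in-memory value, without any of this setup.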
-
Currently, tasks in the workflows have an implicit dependency on `FilePath` objects. For example, the `gen_zarr` function could have directly received the `input_tiff` file as an argument. Instead, `input_tiff` is implicitly assumed to be stored at a specific location in the working dir, with exactly the same name. This becomes a problem when an upstream task decides to rename the tiff file to `{working_dir}/{base}__as_tiff.tiff`: the `gen_zarr` task then breaks with a `FileNotFound` exception.

A generic principle to fix these issues is:
- Each task should receive only the inputs it needs to perform its specific function. This promotes modularity and reusability by keeping tasks self-contained and focused on their individual responsibilities.
- Tasks should explicitly declare their dependencies by accepting input parameters or dependencies from upstream tasks. This makes the task dependencies clear and helps in understanding the data flow within the workflow.
- Avoid relying on global state or shared variables to pass data between tasks, as this can lead to hidden dependencies and make the workflow harder to understand and debug. Instead, pass inputs explicitly between tasks using parameters or context objects.
- Tasks should produce outputs that are consumed by downstream tasks, either by returning values from the execute method or by storing outputs in a location accessible to other tasks (e.g., a file in a shared filesystem or a database table).
- Implement error handling within each task to handle failures gracefully. This may involve catching exceptions, retrying failed operations, logging errors, and possibly triggering alerts or notifications.
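The fragile implicit dependency versus the explicit alternative can be sketched as follows; the function bodies here are illustrative assumptions (the real `gen_zarr` does more than swap a suffix):

```python
from pathlib import Path

def gen_zarr_implicit(working_dir: Path, base: str) -> Path:
    # Reconstructs the input path from a naming convention. If an
    # upstream task renamed the file (e.g. to f"{base}__as_tiff.tiff"),
    # this raises FileNotFoundError at runtime.
    input_tiff = working_dir / f"{base}.tiff"
    if not input_tiff.exists():
        raise FileNotFoundError(input_tiff)
    return input_tiff.with_suffix(".zarr")

def gen_zarr_explicit(input_tiff: Path) -> Path:
    # Receives exactly the input it needs; an upstream rename only
    # requires passing the new path, not changing this task.
    if not input_tiff.exists():
        raise FileNotFoundError(input_tiff)
    return input_tiff.with_suffix(".zarr")
```

In the explicit version, the dependency on the upstream task's output is visible in the signature instead of being buried in a naming convention.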
One approach we are considering is creating an object for task I/O. Each task takes in a `TaskIO` object, which holds its task-specific inputs. It executes and returns a new `TaskIO` object with outputs that would be consumed by downstream tasks.

In order to enforce task-specific inputs and error handling, we came up with a decorator called `stream`, or taskio handler. It makes sure that if an upstream task has observed or created an unhandled exception, the exception is passed straight through the current task.
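A minimal sketch of the idea, assuming a simple `TaskIO` shape; the field names and the decorator's exact behavior here are assumptions, not Hedwig's actual API:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class TaskIO:
    # Hypothetical fields: a single output and an optional recorded error.
    output_path: Optional[str] = None
    error: Optional[Exception] = None

def taskio_handler(func: Callable[[TaskIO], TaskIO]) -> Callable[[TaskIO], TaskIO]:
    """Pass upstream failures straight through; capture new ones."""
    def wrapper(taskio: TaskIO) -> TaskIO:
        if taskio.error is not None:
            # Upstream recorded an unhandled exception: skip this task
            # and propagate the failed TaskIO downstream untouched.
            return taskio
        try:
            return func(taskio)
        except Exception as exc:
            # Record the failure instead of crashing the workflow.
            return TaskIO(error=exc)
    return wrapper

@taskio_handler
def gen_zarr(taskio: TaskIO) -> TaskIO:
    # Illustrative body: produce a new TaskIO for downstream tasks.
    return TaskIO(output_path=taskio.output_path + ".zarr")
```

With this pattern, a failure anywhere in the chain flows through the remaining tasks as data rather than as an uncaught exception, and each task's inputs are confined to what the `TaskIO` carries.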