PR #3463 rearranges how classes are defined in the interchange, as an effect of rearranging how the interchange is launched. Prior to PR #3463, the interchange code was imported as a module, `parsl.executors.high_throughput.interchange`, and all top-level definitions in that module existed as, for example, `parsl.executors.high_throughput.interchange.ManagerLost`.
When sending a `ManagerLost` exception back to the submit side, it would be serialized (by pickle) with a reference to the module+name `parsl.executors.high_throughput.interchange.ManagerLost`. There is an implicit requirement that the interchange runs in the same environment as the submit side, so that this module+name is resolvable/importable on the submit side when the exception is unpickled.
PR #3463 changes how the interchange is launched: interchange.py now runs as the `__main__` module of the interchange process, and top-level definitions there exist as (for example) `__main__.ManagerLost`. References to `__main__.ManagerLost` cannot in general be deserialized on the submit side, because `__main__` there refers to the user workflow process.
When a `ManagerLost` is sent back to the submit side, it appears there as a deserialization error, rather than a properly deserialized `ManagerLost`:
`parsl.serialize.errors.DeserializationError: Failed to deserialize objects. Reason: Received exception, but handling also threw an exception: Can't get attribute 'ManagerLost' on <module '__main__' from '/home/benc/parsl/src/parsl/wl.py'>`
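The mechanism can be demonstrated outside Parsl with a minimal sketch (the `ManagerLost` here is a stand-in class, not the real one): pickle records a module+name reference to an exception's class, not the class body, so a class defined in `__main__` is only resolvable by a process with the same `__main__`.

```python
import pickle
import pickletools

class ManagerLost(Exception):  # stand-in; not the real Parsl class
    pass

# pickle stores a (module, name) reference to the class, not its body.
# Run this file as a script and that reference points at '__main__'.
payload = pickle.dumps(ManagerLost("manager timed out"))
pickletools.dis(payload)  # disassembly includes '__main__' and 'ManagerLost'

# A receiving process whose __main__ is a different script cannot resolve
# '__main__.ManagerLost', so pickle.loads there fails with
# "Can't get attribute 'ManagerLost' on <module '__main__' ...>"
```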
Discovered by @khk-globus as part of Globus Compute work.
To recreate
Start a long-running `python_app` in a Parsl process. Run `killall -STOP process_worker_pool.py` to suspend the worker pool, leading to a heartbeat timeout and a `ManagerLost` (don't terminate the worker pool, because Parsl will sometimes notice on the submit side that the process has died). Observe the broken exception.
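A sketch of such a workflow, assuming a local HighThroughputExecutor; the heartbeat settings are illustrative, chosen only so the manager is declared lost soon after the pool is suspended:

```python
import parsl
from parsl import python_app
from parsl.config import Config
from parsl.executors import HighThroughputExecutor

@python_app
def long_sleep():
    import time
    time.sleep(600)

# Illustrative heartbeat values: short enough that suspending the worker
# pool quickly trips the heartbeat timeout in the interchange.
parsl.load(Config(executors=[HighThroughputExecutor(heartbeat_period=2,
                                                    heartbeat_threshold=10)]))

future = long_sleep()
# In another shell: killall -STOP process_worker_pool.py
# Before the fix, this surfaces a DeserializationError rather than ManagerLost.
print(future.result())
```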
Proposed fix
Move `ManagerLost` and `VersionMismatch` into `parsl.executors.high_throughput.errors` (alongside `WorkerLost`).
Guiding principle: objects that will be serialized should be importable, not defined locally. (dill has some features that complicate this, to do with code movement, but I don't think they're relevant here.)
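The resulting layout would look roughly like this (class bodies abbreviated; only the module placement matters for pickling):

```python
# parsl/executors/high_throughput/errors.py -- importable from both the
# interchange process and the submit side.

class WorkerLost(Exception):
    """Already defined in this module."""

class ManagerLost(Exception):
    """Moved here from interchange.py."""

class VersionMismatch(Exception):
    """Moved here from interchange.py."""

# interchange.py then imports these, so a pickled ManagerLost references
# the stable errors module even though the interchange runs as __main__:
#
#     from parsl.executors.high_throughput.errors import ManagerLost, VersionMismatch
```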
Per the analysis in #3495, defining the `ManagerLost` and `VersionMismatch`
errors in `interchange.py` became a problem in #3463, where the interchange
now runs as `__main__`. This makes it difficult for Dill to get the serde
correct.

The organizational fix is simply to move these classes to an importable
location, following the expectation that classes are available in both
local and remote locations, an expectation that defining them in `__main__`
cannot easily guarantee.
Fixes: #3495
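With the classes in an importable module, the pickled reference resolves on the submit side. A quick sanity check along these lines (the `ManagerLost` constructor arguments are assumed here for illustration):

```python
import pickle
import pickletools

from parsl.executors.high_throughput.errors import ManagerLost

# Constructor arguments assumed for illustration.
payload = pickle.dumps(ManagerLost(b"mgr-1", "node001"))

# The disassembly now references 'parsl.executors.high_throughput.errors'
# rather than '__main__', so any process with parsl installed can resolve it.
pickletools.dis(payload)
```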