PR #3463 rearranges how classes are defined in the interchange, as an effect of rearranging how the interchange is launched. Prior to PR #3463, the interchange code was imported as a module, `parsl.executors.high_throughput.interchange`, and all top-level definitions in that module existed as, for example, `parsl.executors.high_throughput.interchange.ManagerLost`.
When sending a `ManagerLost` exception back to the submit side, it would be serialized (by pickle) with a reference to the module+name `parsl.executors.high_throughput.interchange.ManagerLost`. There is an implicit requirement that the interchange runs in the same environment as the submit side, so that this module+name is resolvable/importable on the submit side when the exception is unpickled.
PR #3463 changes how the interchange is launched: interchange.py now runs as the `__main__` module of the interchange process, and top-level definitions there exist as (for example) `__main__.ManagerLost`. References to `__main__.ManagerLost` cannot in general be deserialized on the submit side, because `__main__` there refers to the user workflow process.
When a `ManagerLost` is sent back to the submit side, it appears there as a deserialization error, rather than a properly deserialized `ManagerLost`:
`parsl.serialize.errors.DeserializationError: Failed to deserialize objects. Reason: Received exception, but handling also threw an exception: Can't get attribute 'ManagerLost' on <module '__main__' from '/home/benc/parsl/src/parsl/wl.py'>`
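The mechanism can be demonstrated outside Parsl with a minimal sketch (the `ManagerLost` here is a stand-in class, not the real one): pickle records a module+name reference to an exception's class, not the class body, so a class defined in `__main__` is only resolvable by a process with the same `__main__`.

```python
import pickle
import pickletools

class ManagerLost(Exception):  # stand-in; not the real Parsl class
    pass

# pickle stores a (module, name) reference to the class, not its body.
# Run this file as a script and that reference points at '__main__'.
payload = pickle.dumps(ManagerLost("manager timed out"))
pickletools.dis(payload)  # disassembly includes '__main__' and 'ManagerLost'

# A receiving process whose __main__ is a different script cannot resolve
# '__main__.ManagerLost', so pickle.loads there fails with
# "Can't get attribute 'ManagerLost' on <module '__main__' ...>"
```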
Discovered by @khk-globus as part of Globus Compute work.
To recreate
Start a long-running `python_app` in a Parsl process. Run `killall -STOP process_worker_pool.py` to suspend the worker pool, leading to a heartbeat timeout and a `ManagerLost` (don't terminate the worker pool, because Parsl will sometimes notice on the submit side that the process has died). Observe the broken exception.
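A sketch of such a workflow, assuming a local HighThroughputExecutor; the heartbeat settings are illustrative, chosen only so the manager is declared lost soon after the pool is suspended:

```python
import parsl
from parsl import python_app
from parsl.config import Config
from parsl.executors import HighThroughputExecutor

@python_app
def long_sleep():
    import time
    time.sleep(600)

# Illustrative heartbeat values: short enough that suspending the worker
# pool quickly trips the heartbeat timeout in the interchange.
parsl.load(Config(executors=[HighThroughputExecutor(heartbeat_period=2,
                                                    heartbeat_threshold=10)]))

future = long_sleep()
# In another shell: killall -STOP process_worker_pool.py
# Before the fix, this surfaces a DeserializationError rather than ManagerLost.
print(future.result())
```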
Proposed fix
Move `ManagerLost` and `VersionMismatch` into `parsl.executors.high_throughput.errors` (alongside `WorkerLost`).
Guiding principle: objects that will be serialized should be importable, not defined locally. (dill has some features that complicate this, to do with code movement, but I don't think they're relevant here.)
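The resulting layout would look roughly like this (class bodies abbreviated; only the module placement matters for pickling):

```python
# parsl/executors/high_throughput/errors.py -- importable from both the
# interchange process and the submit side.

class WorkerLost(Exception):
    """Already defined in this module."""

class ManagerLost(Exception):
    """Moved here from interchange.py."""

class VersionMismatch(Exception):
    """Moved here from interchange.py."""

# interchange.py then imports these, so a pickled ManagerLost references
# the stable errors module even though the interchange runs as __main__:
#
#     from parsl.executors.high_throughput.errors import ManagerLost, VersionMismatch
```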
Per the analysis in #3495, defining the `ManagerLost` and `VersionMismatch`
errors in `interchange.py` became a problem in #3463, where the interchange
now runs as `__main__`. This makes it difficult for Dill to get the serde
correct.

The organizational fix is simply to move these classes to an importable
location, following the expectation that classes are available in both
local and remote locations, an expectation that defining them in `__main__`
cannot easily guarantee.
Fixes: #3495
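With the classes in an importable module, the pickled reference resolves on the submit side. A quick sanity check along these lines (the `ManagerLost` constructor arguments are assumed here for illustration):

```python
import pickle
import pickletools

from parsl.executors.high_throughput.errors import ManagerLost

# Constructor arguments assumed for illustration.
payload = pickle.dumps(ManagerLost(b"mgr-1", "node001"))

# The disassembly now references 'parsl.executors.high_throughput.errors'
# rather than '__main__', so any process with parsl installed can resolve it.
pickletools.dis(payload)
```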