-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Filter out stdout from all ranks except the first? #958
Comments
I'm currently seeing duplicated I'm not particularly concerned by whatever changed under the hood, but I am concerned by the information pollution on the command line, and departure from "drop-in replacement" philosophy this leaves for the average Python user that just wants to naivly scale up and possibly print/debug/test their results as they always have.
|
Another thing I'm curious about is deprecation policy and behavior changes--the change in print behavior caught me by surprise and seems to be something that happened fairly recently, causing a change in debug print behavior between my legate deployments, preventing me from achieving an apples to apples comparison between one that works and one that does not. Also, launching an incorrect number of program copies can point to slurm or mpi issues, further muddying the waters. |
So currently, as we move out of limited availability towards general release we are not making any backwards compatibility guarantees or giving deprecation periods. This lack of deprecation period is temporary though. We plan on implementing proper deprecation periods in the near future as the low-level legate API's mature. But until then, you should consider every release to be breaking. Also, we keep a changelog every release (https://docs.nvidia.com/legate/latest/api/changes/index.html), which you can consult for any changes (though of course it does not include the |
One thing I just wanted to mention is that I'm pretty sure we added this feature for Legate in the first place: StanfordLegion/legion@b29c1a9 So I don't know the answers to your questions, but this was apparently a feature you wanted at some point in time. |
I added the feature because I thought it was ideal to maintain the appearance of sequential execution even though there was replicated execution occurring for scalability reasons in the implementation. Four years later it seems to be a reasonably successful decision since nobody complained about it during that time (either Legate or Pygion users) at least as far as I'm aware. |
We had an offline discussion with the dev team, they all agreed that getting duplicated output is just noise, but were surprised that we were achieving this by overriding the behavior of the default streams. Quoting from the discussion:
Nobody had a problem with e.g. providing a custom legate printer function that does this. The reason why that doesn't work is that we want a solution that:
The best compromise I can see at this moment is to offer a monkeypatching mode like the one from Note that, even with this change, the illusion of a single controller script is still prone to break, e.g. if you try to write files from the top-level, and a separate (opt-in) primitive would be needed for that. |
Some route to preserving the illusion/original experience of sequential execution sounds more useful than prefixing with rank number but retaining the replicated output. I don't quite follow the advantage of retaining the replicated prints as a default, but that's obviously up to you. It still seems more like a debug mode implementation detail that I'd only want to have opt-in, and even then only if you could generate useful information about decomposition details/division of work, which it seems it doesn't. |
@tylerjereddy is trying out multi-rank Legate runs, and was surprised to see duplicated output. This is inherent to the control replication execution model of Legion1, whereby all ranks execute the top-level program, and "mask out" the work that corresponds to each.
When we were using the Legion Python bindings (prior to Legate 24.09), we were inheriting this code from Legion, which filters out the stdout from all ranks except the first one.
The questions then are:
(a) do we want to replicate this behavior in post-Legion-bindings Legate?
(b) do we make it the default?
(c) do we do the same for C++?
I would ask @tylerjereddy to comment more on why this mode is desirable, and also @lightsighter @elliottslaughter, who originally enabled this mode in Legion.
Footnotes
Unless we introduce a client-server model on top of a control-replicated Legion run, which is a suggestion for making multi-node Jupyter runs work. ↩
The text was updated successfully, but these errors were encountered: