RFC-0030 Consolidate TorchElastic and TorchX #53
base: master
Conversation
cc @soumith, @msaroufim, @d4l3k, @kurman, @priyaramani

## **Motivation**

1. **Uniform user experience (UX)**
The motivations would lead me to believe that either of the below is a valid option:
- Moving elastic out of core into torchx
- Moving torchx into core with elastic

At a high level I agree that the needs served by torchrun and torchx are quite similar, so they should be coupled more closely. One worry I have is that core is typically strict about adding new dependencies and often prefers out-of-core registration mechanisms.
Yea, I don't want to bring additional deps + unnecessary bloat to core either. However, there are a number of benefits to PyTorch defining the specifications for how PyTorch-authored scripts/applications are installed and run on the target infrastructure:
- PyTorch can intentionally ask for specific features and requirements from the infra to define the runtime environment to its liking (e.g. a flash-storage side-car for checkpointing, node-level retries from schedulers for fault tolerance, specific network topologies for certain workloads)
- Helps & encourages the infra/platform providers to support PyTorch in the way PyTorch is natively designed to be run, which consequently makes the PyTorch UX more uniform, portable, and easy across different infra providers.
1. **Uniform user experience (UX)**
   1. Both `torchrun` and `torchx` **launch** PyTorch scripts
   2. For instance, `torchx run -s local dist.ddp -j 1x2 train.py` is basically equivalent to `torchrun --nnodes 1 --nproc_per_node 2 train.py`
this makes me feel that maybe `torchrun` is actually the better name; it's more descriptive
consolidate TorchElastic and TorchX *outside* the pytorch/pytorch repo. However, due to the prevalent usage
of `torchrun`, pulling torchelastic out of PyTorch would mean:

1. Makes `torch-2.x` backwards incompatible since users would have to separately install `torchx` to get `torchrun`.
I don't feel like these drawbacks are particularly bad
There's something to be said for having a built-in runner CLI that helps you launch simple and complex jobs onto some commonly used schedulers (e.g. slurm, k8s). It makes the UX frictionless (especially for distributed).

We discussed the torchvision model, where the repo is kept separate but the install instructions on https://pytorch.org by default install both torch and torchvision. The fundamental difference, though, is that not all pytorch users need torchvision, but all pytorch users need to run their scripts at some point.
While (i) is lighter weight compared to (ii), both are rather expensive to run at the pace of commits to PyTorch.
Therefore, we propose that these integ tests are run:
1. Only for commits that touch files in `torch/x/**`
this is what inductor does, for example, and it makes sense to me
in such a way that surviving hosts can wait for a failed host to be replaced and then admit the newly replaced host
into the worker group (provided by torchelastic).
3. **Out-of-the-box support for job launchers in PyTorch**
   1. Without additional installs, PyTorch users can run PyTorch scripts both locally and remotely onto widely used schedulers
Actually, if this is all living in core it might make sense to have all schedulers out of core, with maybe some exception for the very well maintained or popular ones like K8s. That way you can standardize a job-launching interface without having to contribute more experimental schedulers to core, where they would likely rot.
Yea, I think it makes sense to include the schedulers that are not CSP-specific (which would be SLURM and Kubernetes; SLURM doesn't even have a build-time dep since it doesn't have a python client), and keep the CSP ones (e.g. AWS Batch) in torchx so that eventually we can make the relevant service teams own the integration.
#### non-BC Breaking Changes

1. `torch/distributed/launch.py` and `torch/distributed/run.py` would have to remain for a few releases,
   but will both point to `torch/x/run.py`.
The consolidation makes a lot of sense to me. Do you and your team plan to support this change all the way until `torch/distributed/launch.py` and `torch/distributed/run.py` are eventually removed from the repo?
Yep, that's the plan. Ideally we'd like to keep `torchrun` still in pytorch (which is what this RFC is advocating for), but `torch/distributed/launch.py` / `run.py` would be moved out to `torch.x.run`.
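For concreteness, a transitional shim could look roughly like the sketch below. This is only a sketch under the RFC's assumptions: the `torch.x.run` module and its `main` entrypoint are proposals here, not an existing API.

```python
# torch/distributed/run.py (transitional shim -- hypothetical sketch)
# Keeps `python -m torch.distributed.run` and old imports working while the
# implementation would live in the proposed torch.x.run module.
import warnings

from torch.x.run import main  # assumed new location proposed by this RFC

warnings.warn(
    "torch.distributed.run has moved to torch.x.run and will be removed in a "
    "future release; please update your imports and launch commands.",
    DeprecationWarning,
    stacklevel=2,
)

if __name__ == "__main__":
    main()
```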
1. Programmatic usages of torchelastic (e.g. `import torch.distributed.elastic`) are **NOT** BC and the user has to codemod
   references to `torch.distributed.elastic` to `torch.x.elastic`.
   1. **Impact**: Besides Meta-internal use-cases (which can be easily codemoded) the only external programmatic usage
no problem, we can do the codemod for internal use-cases.
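For illustration, the codemod could be as mechanical as rewriting the module prefix; the rough sketch below does a plain textual rename (a real migration might prefer an AST-based tool, and the filename is just a placeholder):

```python
# codemod_elastic.py -- rough sketch of the rename described above.
# Rewrites `torch.distributed.elastic` -> `torch.x.elastic` in all .py files
# under the current directory, in place.
import pathlib
import re

PATTERN = re.compile(r"\btorch\.distributed\.elastic\b")

for path in pathlib.Path(".").rglob("*.py"):
    text = path.read_text()
    new_text = PATTERN.sub("torch.x.elastic", text)
    if new_text != text:
        path.write_text(new_text)
        print(f"rewrote {path}")
```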
references to `torch.distributed.elastic` to `torch.x.elastic`.
   1. **Impact**: Besides Meta-internal use-cases (which can be easily codemoded) the only external programmatic usage
      of torchelastic is in DeepSpeed (see [GitHub](https://github.com/microsoft/DeepSpeed/search?q=torch.distributed.elastic))
      for which we can work with the project owner to resolve.
Are we confident this is the only external usage? I do recall seeing many users asking questions about TorchElastic/TorchX on the forum.
AFAICT, for programmatic usage + more well-known 3rd party libs, yes. Most people use `torchrun` or `python -m torch.distributed.launch`. Even the kubeflow pytorch operator uses `python -m torch.distributed.run` (https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/elastic/imagenet/imagenet.yaml).
…certain sections. Discuss the option of selectively upstreaming TorchX specs only to core
> NOTE: The rest of the doc assumes Option 2 as this is the recommended option

3. (Option 3) Upstream everything in TorchX:
   1. PROS: Makes maintenance and CI/CD simple since changes to the specs can be tested for downstream BC easily
   2. CONS: bloats torch
do we know, by how much?

What about the option of upstreaming everything in torchx to torch, but only including the core torchx deps + kubernetes + slurm deps (mentioned in Option 2), and installing the other dependencies only when explicitly installing a variant with extras_require, like we do in torchx (`torch[batch]` or something like that)?
Right, so regardless of which option, the "extra deps" of torchx would never be added as deps (even extra deps) of `torch`. We won't mess with the `extras_require` of `torch` (aka no `torch[batch]`) and instead would fail fast with a warning prompting the user to install additional packages.
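A rough sketch of that fail-fast behavior, assuming the kubernetes scheduler as the example (the helper name and message below are illustrative, not an existing torchx API):

```python
# Hypothetical sketch: fail fast when an optional scheduler client is missing,
# instead of declaring it in torch's extras_require.
def _require_kubernetes():
    """Import the optional kubernetes client or fail fast with guidance."""
    try:
        import kubernetes  # optional dependency, intentionally NOT a dep of torch
        return kubernetes
    except ImportError as e:
        raise ModuleNotFoundError(
            "The kubernetes scheduler needs the `kubernetes` python client. "
            "Install it separately, e.g. `pip install kubernetes`."
        ) from e
```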
2. (OSS-only) `torchx` CLI users would have to switch over to `torchrun`.
   1. **Impact**: Every OSS user that currently uses the `torchx` CLI would have to one-time opt-in and switch over to torchrun
   2. **To make BC**:
      1. **(Option 1)** Add console script `torchx` to PyTorch's `setup.py`, which would act as a symlink discussed in the non-BC section for the Meta-internal use-case
I like the symlink option to keep disruptions for users minimal, but we need to migrate at some point
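For reference, a sketch of what the Option 1 console-script alias could look like in PyTorch's `setup.py`. The `torch.x.run:main` target is an assumption from this RFC (today PyTorch points `torchrun` at `torch.distributed.run:main`), and this fragment omits everything else `setup()` needs:

```python
# setup.py -- illustrative fragment only; module paths are assumptions from this RFC
from setuptools import setup

setup(
    name="torch",
    entry_points={
        "console_scripts": [
            "torchrun = torch.x.run:main",  # proposed new home of the runner
            "torchx = torch.x.run:main",    # BC alias for existing `torchx` CLI users
        ],
    },
)
```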
3. (Option 3) Upstream everything in TorchX:
   1. PROS: Makes maintenance and CI/CD simple since changes to the specs can be tested for downstream BC easily
   2. CONS: bloats torch
3. Merge the functionalities of the `torchx` CLI to
Consolidating around `torchrun` sounds like a good idea. My understanding from the above is that:

- `torchrun` would now take a scheduler arg, and users can launch on Slurm/Kubernetes out of the box with just the PyTorch package.
- Other schedulers will be implemented in the `torchx` repo, and users would need to install torchx + do some registration to use those schedulers.
- Runtime libraries like artifact tracking will remain in `torchx`, as will non-runtime stuff like pre-defined components, the pipelines API, etc.

With this approach what will be the framing for the torchx repo? It may come across as a set of disparate pieces that might be useful during training.
> `torchrun` would now take a scheduler arg, and users can launch on Slurm/Kubernetes out of the box with just the PyTorch package.

Yes, exactly. There's a bit of tricky BC to deal with in terms of CLI arguments (the `torchrun` and `torchx` CLIs currently have completely different args: `torchx` focuses more on actions that can be done at the job level, whereas `torchrun` focuses on a "single action": launching multiple procs on the local node).
> Other schedulers will be implemented in the torchx repo, and users would need to install torchx + do some registration to use those schedulers.

The idea is to upstream `torchx.specs.*` (iface and APIs) to `torch.x.*` and optionally (but strongly encouraged) upstream 1 non-trivial scheduler integration. IMO SLURM is a good candidate since it has no "client" dependencies (e.g. `torchx.schedulers.kubernetes` uses the `kubernetes` python client). The existing schedulers (k8s, aws-batch, gcp-batch, azure-batch, ray) should stay in the torchx repo as optional plugins.

As for registration of the schedulers, this is already possible in torchx by design (see Registering Custom Schedulers). In fact, this is how we currently make the internal schedulers visible only to Meta-internal uses.
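For context, registering an out-of-tree scheduler with torchx today is roughly an entry-point declaration like the sketch below. The package, module, and scheduler names are placeholders; see the torchx "Registering Custom Schedulers" docs for the exact factory signature expected behind `create_scheduler`:

```python
# setup.py of a hypothetical third-party scheduler plugin (illustrative sketch)
from setuptools import setup

setup(
    name="my-scheduler-plugin",
    entry_points={
        # torchx discovers out-of-tree schedulers through this entry-point group
        "torchx.schedulers": [
            "my_sched = my_plugin.scheduler:create_scheduler",
        ],
    },
)
```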
> Runtime libraries like artifact tracking will remain in torchx, as will non-runtime stuff like pre-defined components, the pipelines API, etc.

Yes, for now. The `torchx.runtime.*` APIs (including artifact tracking) are currently not stable and therefore, I think, should be excluded from the upstream. The pipeline integrations should be deprecated (since no one seems to be using them). As for the pre-defined components, I think it's useful to have `python` and `ddp` (which probably should be renamed to `spmd`) in the main repo since those are the two most used launch topologies. They would serve two purposes:

1. As vanilla/basic quick-start components to launch test and PTD jobs
2. As reference implementations to further customize those components to fit your exact needs/infra.
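To make the reference-implementation point concrete: a pre-defined component is just a function returning an `AppDef`. Below is a stripped-down, illustrative sketch of a `ddp`/`spmd`-style component using today's `torchx.specs` names (the proposed `torch.x` location, the `spmd` name, and the image placeholder are assumptions, not the real `dist.ddp` implementation):

```python
# Illustrative sketch of a minimal ddp/spmd-style component.
# Today these types live in torchx.specs; under this RFC they would move to torch.x.
from torchx import specs


def spmd(
    script: str,
    *script_args: str,
    nnodes: int = 1,
    nproc_per_node: int = 1,
) -> specs.AppDef:
    """Launch `script` with torchrun on `nnodes` nodes, `nproc_per_node` procs each."""
    return specs.AppDef(
        name="spmd",
        roles=[
            specs.Role(
                name="trainer",
                image="<your training image>",  # placeholder; scheduler-dependent
                entrypoint="torchrun",
                args=[
                    f"--nnodes={nnodes}",
                    f"--nproc_per_node={nproc_per_node}",
                    script,
                    *script_args,
                ],
                num_replicas=nnodes,
            )
        ],
    )
```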
> With this approach what will be the framing for the torchx repo? It may come across as a set of disparate pieces that might be useful during training.

IMHO the TorchX repo should serve as a place to hold commonly used plugins (e.g. widely used scheduler impls, the most commonly used resources (host types), etc.).
TorchX was originally created to help PyTorch users in OSS run their PyTorch applications on widely adopted infrastructure setups and schedulers. Today TorchX supports most AI infra setups, which use SLURM, Kubernetes, Ray, Batch services (AWS, GCP, and Azure) and Kubernetes-MCAD (IBM). In recent months we've seen TorchX gain traction, as evidenced by several blog posts detailing how to use PyTorch on XYZ using TorchX:
While TorchX launches PyTorch jobs onto local and remote schedulers, TorchElastic (aka `torchrun`) is responsible for launching PyTorch processes (ranks). But for the user, both tools run their training scripts and seemingly overlap in functionality. This RFC proposes that:

1. TorchX be upstreamed into PyTorch as `torch.x` (under a new submodule called `x`)
2. TorchElastic be moved out of `torch.distributed.elastic` and put under `torch.x.run`