# RFC-0030 Consolidate TorchElastic and TorchX #53

**Open** · wants to merge 3 commits into `master`

## Conversation

**@kiukchung** commented on Apr 18, 2023:

TorchX was originally created to help PyTorch users in OSS run their PyTorch applications on widely adopted infrastructure setups and schedulers. Today TorchX supports most AI infra setups that use SLURM, Kubernetes, Ray, Batch services (AWS, GCP, and Azure), and Kubernetes-MCAD (IBM). In recent months we've seen TorchX gain traction, as evidenced by several blog posts detailing how to use PyTorch on XYZ using TorchX:

  1. How to run PyTorch on Vertex AI using TorchX
  2. Large-scale distributed training with TorchX and Ray
  3. Scaling distributed training with AWS Trainium and Amazon EKS
  4. Rapidly deploy PyTorch applications on Batch using TorchX

While TorchX launches PyTorch jobs onto local and remote schedulers, TorchElastic (aka `torchrun`) is responsible for launching PyTorch processes (ranks). But from the user's perspective, both tools run their training scripts and seemingly overlap in functionality.

This RFC proposes that:

  1. We consolidate TorchElastic and TorchX into a single module
  2. That we do so by:
    1. Upstreaming TorchX as `torch.x` (under a new submodule called `x`)
    2. Pulling `torch.distributed.elastic` under `torch.x.run` (illustrated in the sketch below)
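A rough before/after sketch of the import paths under this proposal (everything under `torch.x` is hypothetical and exists only in this RFC; the exact destination for torchelastic, `torch.x.run` vs `torch.x.elastic`, is discussed later in the thread):

```python
# Today: TorchElastic ships inside torch, TorchX is a separate pip package.
from torch.distributed.elastic.multiprocessing.errors import record  # existing import path
# import torchx.specs as specs                                        # needs `pip install torchx`

# After the proposed consolidation (hypothetical paths, per this RFC):
# from torch.x.elastic.multiprocessing.errors import record
# import torch.x.specs as specs
```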

## **Motivation**

1. **Uniform user experience (UX)**
**Member** commented:

The motivations would lead me to believe that either of the below is a valid option:

  1. Moving elastic out of core into torchx
  2. Moving torchx into core with elastic

At a high level I agree that the needs served by torchrun and torchx are quite similar, so they should be coupled more closely. One worry I have is that core is typically strict about adding new dependencies and often prefers out-of-core registration mechanisms.

**Author (@kiukchung)** replied:

yea, I don't want to bring additional deps + unnecessary bloat to core either. However, there are a number of benefits to PyTorch defining the specifications for how PyTorch-authored scripts/applications are installed and run on the target infrastructure:

  1. PyTorch can intentionally ask for specific features and requirements from the infra to define the runtime environment to its liking (e.g. a flash-storage side-car for checkpointing, node-level retries from schedulers for fault tolerance, specific network topologies for certain workloads)
  2. Helps & encourages the infra/platform providers to support PyTorch in the way PyTorch is natively designed to be run.

(1) and (2) consequently make the PyTorch UX more uniform, portable, and easy across different infra providers (a rough sketch of such a spec follows below).
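For illustration, here is a minimal sketch of how such requirements could be expressed through a launcher-level job spec. It is loosely modeled on the existing `torchx.specs` API (the image name and resource numbers are made up, and field names may differ slightly across torchx versions):

```python
import torchx.specs as specs

# One role = one homogeneous group of replicas (containers/hosts) in the job.
trainer = specs.Role(
    name="trainer",
    image="example.registry/trainer:latest",  # hypothetical container image
    entrypoint="python",
    args=["-m", "torch.distributed.run", "--nproc_per_node", "8", "train.py"],
    num_replicas=4,   # 4 nodes
    max_retries=3,    # ask the scheduler for node-level retries (fault tolerance)
    resource=specs.Resource(cpu=32, gpu=8, memMB=240_000),
)

app = specs.AppDef(name="trainer-job", roles=[trainer])
```

A scheduler integration then only needs to translate `AppDef`/`Role`/`Resource` into its native job description, which is what keeps the UX uniform across infra providers.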


1. **Uniform user experience (UX)**
1. Both `torchrun` and `torchx` **launch** PyTorch scripts
2. For instance, `torchx run -s local dist.ddp -j 1x2 train.py` is basically equivalent to `torchrun --nnodes 1 --nproc_per_node 2 train.py`
**Member** commented:

this makes me feel that maybe torchrun is actually the better name; it's more descriptive

consolidate TorchElastic and TorchX *outside* the pytorch/pytorch repo. However, due to the prevalent usage
of `torchrun`, pulling torchelastic out of PyTorch would mean:

1. Making `torch-2.x` backwards incompatible, since users would have to separately install `torchx` to get `torchrun`.
**Member** commented:

I don't feel like these drawbacks are particularly bad

**Author (@kiukchung)** replied:

There's something to be said about having a built-in runner CLI that helps you launch simple and complex jobs onto some commonly used schedulers (e.g. slurm, k8s). It makes the UX frictionless (especially for distributed).

We discussed the torchvision model where the repo is kept separate but the install instructions on https://pytorch.org by default install both torch and torchvision

*(screenshot: the install selector on pytorch.org, whose default instructions install both torch and torchvision)*

The fundamental difference though is that not all pytorch users need torchvision, but all pytorch users need to run their scripts at some point.


While (i) is lighter weight compared to (ii), both are rather expensive to run at the pace of commits to PyTorch.
Therefore, we propose that these integration tests are run:
1. Only for commits that touch files in `torch/x/**`
**Member** commented:

this is what inductor does, for example, and it makes sense to me

in such a way that surviving hosts can wait for a failed host to be replaced and then admit the newly replaced host
into the worker group (provided by torchelastic).
3. **Out-of-the-box support for job launchers in PyTorch**
1. Without additional installs, PyTorch users can run PyTorch scripts both locally and remotely on widely used schedulers
**Member** commented:

Actually, if this is all living in core, it might make sense to have all schedulers out of core, with maybe some exceptions for the very well-maintained or popular ones like K8s. That way you can standardize a job-launching interface without having to contribute more experimental schedulers to core, where they may likely rot.

**Author (@kiukchung)** replied:

yea, I think it makes sense to include the schedulers that are not CSP-specific (which would be SLURM and Kubernetes; SLURM doesn't even add a build-time dep since it doesn't have a python client), and keep the CSP ones (e.g. AWS Batch) in torchx so that eventually we can make the relevant service teams own the integration.

#### Non-BC-Breaking Changes

1. `torch/distributed/launch.py` and `torch/distributed/run.py` would have to remain for a few releases,
but both would point to `torch/x/run.py`.

**Reviewer** commented:

The consolidation makes a lot of sense to me. Do you and your team plan to support this change all the way until torch/distributed/launch.py and torch/distributed/run.py are eventually removed from the repo?

**Author (@kiukchung)** replied:

Yep, that's the plan. Ideally we'd like to keep torchrun in pytorch (which is what this RFC is advocating for), but torch/distributed/launch.py and run.py would be moved out to torch.x.run.
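For concreteness, the forwarding shim could be as small as the sketch below (a minimal sketch only; `torch.x.run` is the hypothetical new module proposed in this RFC and does not exist today):

```python
# torch/distributed/run.py (hypothetical forwarding shim)
import warnings

# Re-export the CLI entry points from the (hypothetical) new home of torchrun.
from torch.x.run import get_args_parser, main  # noqa: F401

warnings.warn(
    "torch.distributed.run has moved to torch.x.run and this alias will be "
    "removed in a future release; please update your invocations/imports.",
    DeprecationWarning,
    stacklevel=2,
)

if __name__ == "__main__":
    main()
```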


1. Programmatic usages of torchelastic (e.g. `import torch.distributed.elastic`) are **NOT** BC, and users have to codemod
references to `torch.distributed.elastic` to `torch.x.elastic`.
1. **Impact**: Besides Meta-internal use-cases (which can be easily codemoded) the only external programmatic usage

**Reviewer** commented:

no problem, we can do the codemod for internal use-cases.
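For external users the codemod itself is mechanical; a throwaway sketch (the `torch.x.elastic` target is the hypothetical path from this RFC, and `src/` is a placeholder for the user's source tree):

```python
import pathlib

OLD = "torch.distributed.elastic"
NEW = "torch.x.elastic"  # hypothetical new path, per this RFC

for path in pathlib.Path("src").rglob("*.py"):
    text = path.read_text()
    if OLD in text:
        path.write_text(text.replace(OLD, NEW))
        print(f"rewrote {path}")
```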

references to `torch.distributed.elastic` to `torch.x.elastic`.
1. **Impact**: Besides Meta-internal use-cases (which can be easily codemoded) the only external programmatic usage
of torchelastic is in DeepSpeed (see [GitHub](https://github.com/microsoft/DeepSpeed/search?q=torch.distributed.elastic))
for which we can work with the project owner to resolve.

**Reviewer** commented:

Are we confident this is the only external usage? I do recall seeing many users asking questions about TorchElastic/TorchX on the forum.

**Author (@kiukchung)** replied:

AFAICT, for programmatic usage and the more well-known 3rd-party libs, yes. Most people use `torchrun` or `python -m torch.distributed.launch`. Even the Kubeflow PyTorch operator uses `python -m torch.distributed.run` (https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/elastic/imagenet/imagenet.yaml).

> NOTE: The rest of the doc assumes Option 2 as this is the recommended option
3. (Option 3) Upstream everything in TorchX:
1. PROS: Makes maintenance and CI/CD simple since changes to the specs can be tested for downstream BC easily
2. CONS: bloats torch

**Reviewer** commented:

do we know by how much?
What about the option of upstreaming everything in torchx to torch, but only including the core torchx deps + kubernetes + slurm deps (mentioned in Option 2), and installing the other dependencies only when explicitly installing a variant via extras_require like we do in torchx, e.g. `torch[batch]` or something like that?

**Author (@kiukchung)** replied:

Right, so regardless of which option, the "extra deps" of torchx would never be added as deps (even extra deps) of torch. We won't mess with the extras_require of torch (aka no `torch[batch]`) and would instead fail fast with a warning prompting the user to install the additional packages.
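A minimal sketch of that fail-fast behavior (the function name and error message are illustrative, not an actual torch/torchx API):

```python
def get_scheduler(name: str):
    """Resolve a scheduler by name, importing optional client libraries lazily."""
    if name == "kubernetes":
        try:
            from kubernetes import client  # optional dependency; never a hard dep of torch
        except ImportError as e:
            raise ImportError(
                "The 'kubernetes' scheduler requires the kubernetes python client. "
                "Install it with: pip install kubernetes"
            ) from e
        return client.ApiClient()  # stand-in for constructing the real scheduler object
    raise ValueError(f"unknown scheduler: {name!r}")
```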

2. (OSS-only) `torchx` CLI users would have to switch over to `torchrun`.
1. **Impact**: Every OSS user that currently uses the `torchx` CLI would have to opt in once and switch over to `torchrun`
2. **To make BC**:
1. **(Option 1)** Add a console script `torchx` to PyTorch's `setup.py`, which would act as the symlink discussed in the non-BC section for the Meta-internal use-case
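Option 1 could look roughly like the sketch below in PyTorch's `setup.py` (the `torch.x.run:main` entry-point target is hypothetical, per this RFC):

```python
from setuptools import setup

setup(
    name="torch",  # heavily abbreviated; the real setup.py has many more fields
    entry_points={
        "console_scripts": [
            "torchrun = torch.x.run:main",  # hypothetical new home of torchrun
            "torchx = torch.x.run:main",    # BC alias for existing `torchx` CLI users
        ],
    },
)
```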

**Reviewer** commented:

I like the symlink option to keep disruptions for users minimal, but we need to migrate at some point

3. (Option 3) Upstream everything in TorchX:
1. PROS: Makes maintenance and CI/CD simple since changes to the specs can be tested for downstream BC easily
2. CONS: bloats torch
3. Merge the functionalities of `torchx` CLI to
**Member** commented:

Consolidating around torchrun sounds like a good idea. My understanding from the above is that:

  • torchrun would now take a scheduler arg, and users can launch on Slurm/Kubernetes out of the box with just the PyTorch package.
  • Other schedulers will be implemented in the torchx repo, and users would need to install torchx + do some registration to use those schedulers.
  • Runtime libraries like artifact tracking will remain in torchx, as will non-runtime stuff like pre-defined components, the pipelines API, etc.

With this approach what will be the framing for the torchx repo? It may come across as a set of disparate pieces that might be useful during training.

**Author (@kiukchung)** replied:

> torchrun would now take a scheduler arg, and users can launch on Slurm/Kubernetes out of the box with just the PyTorch package.

Yes exactly. There's a bit of tricky BC to deal with in terms of CLI arguments (torchrun and torchx CLIs currently have completely different args - torchx focuses more on actions that can be done at the job level whereas torchrun focuses on a "single action": launching multiple procs on the local node).

> Other schedulers will be implemented in the torchx repo, and users would need to install torchx + do some registration to use those schedulers.

The idea is to upstream the `torchx.specs.*` interfaces and APIs to `torch.x.*` and optionally (but strongly encouraged) upstream one non-trivial scheduler integration. IMO SLURM is a good candidate since it has no "client" dependencies (e.g. torchx.schedulers.kubernetes uses the kubernetes python client). The existing schedulers (k8s, aws-batch, gcp-batch, azure-batch, ray) should stay in the torchx repo as optional plugins.

As for registration of the schedulers, this is already possible in torchx by design (see Registering Custom Schedulers). In fact, this is how we currently make the internal schedulers visible only to Meta-internal users.
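For reference, custom-scheduler registration in torchx is done through Python entry points; roughly as sketched below (see the TorchX "Registering Custom Schedulers" docs for the exact entry-point group and factory signature, which may differ from this sketch):

```python
# setup.py of a hypothetical out-of-tree scheduler plugin
from setuptools import setup

setup(
    name="my-scheduler-plugin",
    packages=["my_scheduler_plugin"],
    entry_points={
        # torchx discovers schedulers registered under this entry-point group,
        # making them available as `torchx run -s my_scheduler ...`
        "torchx.schedulers": [
            "my_scheduler = my_scheduler_plugin.scheduler:create_scheduler",
        ],
    },
)
```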

> Runtime libraries like artifact tracking will remain in torchx, as will non-runtime stuff like pre-defined components, the pipelines API, etc.

Yes, for now. The `torchx.runtime.*` APIs (including artifact tracking) are currently not stable and therefore, I think, should be excluded from the upstream. The pipeline integrations should be deprecated (since no one seems to be using them). As for the pre-defined components, I think it's useful to have `python` and `ddp` (which should probably be renamed to `spmd`) in the main repo since those are the two most-used launch topologies. They would serve two purposes:

1. As vanilla/basic quick-start components to launch test and PTD jobs
2. As reference implementations to further customize to fit your exact needs/infra
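To make that concrete, a component is just a python function that returns an `AppDef`. A minimal sketch in the spirit of `torchx.components.utils.python` (the image and defaults are illustrative and the real signature differs):

```python
import torchx.specs as specs

def python(*script_args: str, script: str, image: str = "ghcr.io/pytorch/torchx:latest",
           cpu: int = 2, memMB: int = 4096) -> specs.AppDef:
    """Run a single python script as a job (sketch of a basic component)."""
    return specs.AppDef(
        name="python",
        roles=[
            specs.Role(
                name="python",
                image=image,            # illustrative default image
                entrypoint="python",
                args=[script, *script_args],
                resource=specs.Resource(cpu=cpu, gpu=0, memMB=memMB),
            )
        ],
    )
```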

> With this approach what will be the framing for the torchx repo? It may come across as a set of disparate pieces that might be useful during training.

IMHO the TorchX repo should serve as a place to hold commonly used plugins (e.g. widely used scheduler impls, most commonly used resources (host types), etc).
