RFC-0030 Consolidate TorchElastic and TorchX #53

<details>
<summary>Instructions - click to expand</summary>

- Fork the rfcs repo: https://github.com/pytorch/rfcs
- Copy `RFC-0000-template.md` to `RFC-00xx-my-feature.md`, or write your own open-ended proposal. Put care into the details.
- Submit a pull request titled `RFC-00xx-my-feature`.
- Assign the `draft` label while composing the RFC. You may find it easier to use a WYSIWYG editor (like Google Docs) when working with a few close collaborators; feel free to use whatever platform you like. Ideally this document is publicly visible and is linked to from the PR.
- When opening the RFC for general discussion, copy your document into the `RFC-00xx-my-feature.md` file on the PR and assign the `commenting` label.
- Build consensus for your proposal, integrate feedback, revise it as needed, and summarize the outcome of the discussion via a [resolution template](https://github.com/pytorch/rfcs/blob/rfc-process/RFC-0000-template.md#resolution).
- If the RFC is idle here (no activity for 2 weeks), assign the label `stalled` to the PR.
- Once the discussion has settled, assign a new label based on the level of support:
  - `accepted` if a decision has been made in the RFC
  - `draft` if the author needs to rework the RFC's proposal
  - `shelved` if there are no plans to move ahead with the current RFC's proposal. We want neither to think about evaluating the proposal nor about implementing the described feature until some time in the future.
- A state of `accepted` means that the core team has agreed in principle to the proposal, and it is ready for implementation.
- The author (or any interested developer) should next open a tracking issue on GitHub corresponding to the RFC.
  - This tracking issue should contain the implementation next steps. Link to this tracking issue on the RFC (in the Resolution > Next Steps section).
- Once all relevant PRs are merged, the RFC's status label can finally be updated to `closed`.

</details>

# Consolidate TorchElastic and TorchX into `torch.x.run`

**Authors:**
* @kiukchung (Kiuk Chung)
* @d4l3k (Tristan Rice)

## **Summary**

Consolidate TorchElastic (`torch.distributed.elastic` + `torchrun`) and TorchX
(`torchx` on [pytorch/torchx](https://github.com/pytorch/torchx)) into a single module called `torch.x`
that launches PyTorch scripts (single-process and distributed) both locally and as jobs on remote
schedulers (e.g. SLURM, Kubernetes).

## **Background**

The sections below provide background and historical context on TorchElastic and TorchX.

### TorchElastic

[TorchElastic](https://pytorch.org/docs/stable/distributed.elastic.html), hereafter referred to as [`torchrun`](https://pytorch.org/docs/stable/elastic/run.html),
is the de facto local **process launcher** used to kick off PyTorch Distributed (PTD) scripts. Prior to `torch-1.9`, TorchElastic lived in the PyTorch GitHub organization under the [pytorch/elastic](https://github.com/pytorch/elastic)
repository. In `torch-1.9`, TorchElastic was upstreamed into torch as the `torch.distributed.elastic` submodule.

#### `torchrun` vs `torch.distributed.launch` vs `torch.distributed.run`

**TL;DR - All three tools use `torch.distributed.run` (torchelastic) under the hood.**

The following table summarizes when each tool was introduced to PyTorch and how they relate to each other.

| **PyTorch Version** | `torch.distributed.launch` | `torch.distributed.run` | `torchrun` |
|---------------------|----------------------------|-------------------------|------------|
| <1.9 | Uses `torch.multiprocessing` | -- | -- |
| 1.9 | Wraps `torch.distributed.run` with backwards-compatible args | Uses `torchelastic` | -- |
| 1.10 | [same as above] | [same as above] | CLI [console_script](https://github.com/pytorch/pytorch/blob/master/setup.py#L980) that points to `torch.distributed.run` |
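
To make the relationship concrete, the sketch below shows how the entry points converge on the same elastic launcher. It is a minimal sketch, assuming `torch>=1.10`, where `torch.distributed.run.main` accepts an argv-style argument list; `train.py` is a placeholder script.

```python
# Minimal sketch (assumption: torch>=1.10). `torchrun ...`,
# `python -m torch.distributed.run ...`, and the backwards-compatible
# `python -m torch.distributed.launch ...` all funnel into the same elastic
# launcher, which can also be invoked programmatically:
from torch.distributed.run import main as elastic_launch

# equivalent to: torchrun --nnodes 1 --nproc_per_node 2 train.py
elastic_launch(["--nnodes", "1", "--nproc_per_node", "2", "train.py"])
```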

### TorchX

[TorchX](https://pytorch.org/torchx), on the other hand, is a **job launcher** that can submit PyTorch scripts as jobs to various remote schedulers
such as SLURM, Kubernetes, and AWS/GCP/Azure Batch. Similar to TorchElastic prior to `torch-1.9`,
TorchX currently lives in the PyTorch GitHub organization under the [pytorch/torchx](https://github.com/pytorch/torchx)
repository.

## **Motivation**

1. **Uniform user experience (UX)**
   1. Both `torchrun` and `torchx` **launch** PyTorch scripts.
   2. For instance, `torchx run -s local dist.ddp -j 1x2 train.py` is essentially equivalent to `torchrun --nnodes 1 --nproc_per_node 2 train.py`.

      > **Reviewer comment:** this makes me feel that maybe torchrun is actually the better name, it's more descriptive

2. **More native, built-in support in PyTorch for launching large-scale, long-running, fault-tolerant distributed jobs**
   1. In practice, running large-scale jobs reliably requires coordination between the scheduler/infrastructure (controlled by torchx)
      and the process manager (torchelastic).
   2. Example: dealing with host failures in a job:
      1. torchelastic alone cannot deal with host failures. The job needs to be configured (done via torchx) for
         the scheduler to deal with host failures (e.g. replace the failed host).
      2. Conversely, the scheduler replacing failed hosts is not enough; the job itself needs to be authored
         in such a way that surviving hosts can wait for a failed host to be replaced and then admit the newly replaced host
         into the worker group (provided by torchelastic).
3. **Out-of-the-box support for job launchers in PyTorch**
   1. Without additional installs, PyTorch users can run PyTorch scripts both locally and remotely on widely used schedulers
      (Kubernetes, SLURM, cloud-provider-specific batch schedulers).

      > **Reviewer comment:** Actually, if this is all living in core it might make sense to have all schedulers out of core, with maybe some exception for the very well maintained or popular ones like K8s; that way you can standardize a job-launching interface without having to contribute more experimental schedulers to core, where they may likely rot.
      >
      > **Author reply:** yea, I think it makes sense to include the schedulers that are not CSP specific (which would be SLURM and Kubernetes; SLURM doesn't even have a build-time dep since it doesn't have a python client), and keep the CSP ones (e.g. AWS Batch) in torchx so that eventually we can make the relevant service teams own the integration.

4. **Well-defined and self-service plugin points to integrate PyTorch with the users' infrastructure**
   1. Examples include (but are not limited to):
      1. Job/experiment tracking (e.g. MLflow, Weights & Biases, FBLearner, etc.)
      2. Registration of custom compute resource types to launch jobs onto (e.g. p4d.24xlarge [AWS], a2-megagpu-16g [GCP])
      3. Code containerization (e.g. Docker, XAR, etc.)

## **Proposed Implementation**

The diagram below depicts the proposed module move/upstream (optional merge of the `multiprocessing` module not shown).

![`torch.x` Before and After](RFC-0030-assets/modules_before_after.png)

**The following changes are proposed (refer to the diagram above):**

1. Move `torch.distributed.elastic.*` → `torch.x.elastic.*`
2. Upstream `torchx.*` → `torch.x.*`
3. Merge the functionality of the `torchx` CLI into
   `torchrun` (a python [`console_script`](https://python-packaging.readthedocs.io/en/latest/command-line-scripts.html)
   that points to `torch.distributed.run`),
   allowing `torchrun` to:
   1. Submit PyTorch scripts as jobs to remote schedulers (9 OSS schedulers + 2 FB-internal schedulers)
   2. Run (simulate) multi-node distributed jobs locally on a single machine with multiple GPUs

   > **Reviewer comment:** Consolidating around … With this approach, what will be the framing for the torchx repo? It may come across as a set of disparate pieces that might be useful during training.
   >
   > **Author reply:** Yes exactly. There's a bit of tricky BC to deal with in terms of CLI arguments (…). The idea is to upstream the … As for registration of the schedulers, this is already possible in torchx by design (see Registering Custom Schedulers); in fact, this is how we currently make only the internal schedulers visible to Meta-internal users. Yes for now: as vanilla/basic quick-start components to launch test and PTD jobs. IMHO the TorchX repo should serve as a place to hold commonly used plugins (e.g. widely used scheduler impls, the most commonly used resources (host types), etc).

4. (alternatively) Keep `torchrun` for BC (eventually to be removed), and consolidate around the `torchx` CLI.
5. (optional) Merge `torch.distributed.elastic.multiprocessing` into `torch.multiprocessing`, adding the following features to `torch.multiprocessing` (see the sketch after this list):
   1. Launch binaries (e.g. `/home/kiuk/train.py`) in addition to functions (e.g. `pkg.my.train:main`)
   2. Write each sub-process's stdout and stderr to log files (configurable)
   3. Prefix each line of stdout and stderr with local rank information
   4. Propagate exception traces from the sub-processes back to the main process for better error handling and propagation
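
As a rough illustration of item 5 above, the sketch below shows the kind of usage the merged module would enable. It reuses today's `torch.distributed.elastic.multiprocessing.start_processes` API (whose signature has evolved across releases); the paths, arguments, and environment values are placeholders, not a committed interface for `torch.multiprocessing`.

```python
# Illustrative sketch only: this is the existing elastic multiprocessing API,
# shown to convey what a merged torch.multiprocessing would offer.
from torch.distributed.elastic.multiprocessing import Std, start_processes

ctx = start_processes(
    name="trainer",
    entrypoint="/home/kiuk/train.py",  # a binary, not just a Python function
    args={0: ("--epochs", "10"), 1: ("--epochs", "10")},    # per local rank
    envs={0: {"LOCAL_RANK": "0"}, 1: {"LOCAL_RANK": "1"}},
    log_dir="/tmp/trainer_logs",       # per-rank stdout/stderr written here
    tee=Std.ALL,                       # echo to console, prefixed with the local rank
)
result = ctx.wait()  # blocks; per-rank failures (with tracebacks) end up in result.failures
```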

### Backwards Compatibility (BC)

#### Non-BC-Breaking Changes

1. `torch/distributed/launch.py` and `torch/distributed/run.py` would have to remain for a few releases,
   but both would point to `torch/x/run.py`.

   > **Reviewer comment:** The consolidation makes a lot of sense to me. Do you and your team plan to support this change all the way till …?
   >
   > **Author reply:** Yep, that's the plan. Ideally we'd like to keep …

2. If we decide to consolidate the CLI into `torchrun`, the existing `torchrun` CLI arguments would have to remain to ensure BC.
   We propose an "opt-in" mechanism: if the user drops a marker file in (say) `~/.torch/.torchx`, then `torchrun`
   operates like today's `torchx` CLI; otherwise, if the marker file is not found, `torchrun` operates like the existing `torchrun`
   (a minimal sketch of this dispatch follows this list).
3. (Meta-internal only) The `torchx` CLI is widely used internally at Meta. To minimize disruption, the `torchx` CLI
   can be made a symlink to `torchrun`, and users can gradually shift over to `torchrun`. This can be measured by tracking
   direct calls to `torchx` in Scuba.
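
The opt-in mechanism in item 2 above could be wired up roughly as follows. This is a minimal sketch: the `torch.x.cli` import path is hypothetical (it does not exist today); only the marker-file location comes from the proposal above.

```python
# Sketch of the proposed opt-in behavior for a consolidated `torchrun`
# console_script. The `torch.x.cli` module path is hypothetical.
import os
import sys


def torchrun_main() -> None:
    marker = os.path.expanduser("~/.torch/.torchx")
    if os.path.isfile(marker):
        # user opted in: behave like today's `torchx` CLI
        from torch.x.cli import main as torchx_cli_main  # hypothetical module
        torchx_cli_main(sys.argv[1:])
    else:
        # default: behave like the existing torchrun (elastic launcher)
        from torch.distributed.run import main as elastic_main
        elastic_main(sys.argv[1:])


if __name__ == "__main__":
    torchrun_main()
```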

#### BC-Breaking Changes

The use cases below would be BC-breaking under this proposal. For each item, we discuss ways to make them non-BC-breaking.
However, we propose to clearly document the migration process and reference these docs in the release notes,
rather than introduce code complexity and/or tech debt to keep things BC.

1. Programmatic usages of torchelastic (e.g. `import torch.distributed.elastic`) are **NOT** BC; users have to codemod
   references to `torch.distributed.elastic` to `torch.x.elastic`.
   1. **Impact**: Besides Meta-internal use cases (which can easily be codemoded), the only external programmatic usage
      of torchelastic is in DeepSpeed (see [GitHub](https://github.com/microsoft/DeepSpeed/search?q=torch.distributed.elastic)),
      for which we can work with the project owners to resolve.

      > **Reviewer comment:** no problem, we can do the codemod for internal use-cases.
      >
      > **Reviewer comment:** Are we confident this is the only external usage? I do recall seeing many users asking questions about TorchElastic/TorchX on the forum.
      >
      > **Author reply:** AFAICT, for programmatic usage + a more well-known 3rd-party lib, yes. Most people use …

   2. **To make BC**: We could re-import `torch.x.elastic.**` from `torch/distributed/elastic/**/__init__.py`. For example,
      ```python
      # In torch/distributed/elastic/__init__.py
      from torch.x.elastic import *  # noqa: F401,F403
      ```
2. (OSS-only) `torchx` CLI users would have to switch over to `torchrun`.
   1. **Impact**: Every OSS user that currently uses the `torchx` CLI would have to opt in once and switch over to `torchrun`.
   2. **To make BC**:
      1. (option 1) Add a `torchx` console script to PyTorch's `setup.py`, which would act as the symlink discussed in the non-BC-breaking section for the Meta-internal use case (see the sketch after this list).
      2. (option 2) Release a final version of the `torchx` wheel where the `torchx` CLI walks the user
         through the opt-in step and asks the user to switch over to `torchrun`.

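For option 1 above, the console-script wiring could look roughly like the excerpt below. This is a hypothetical sketch: the `torch.x.run:main` entry point reflects the proposed module layout, not PyTorch's current `setup.py`.

```python
# Hypothetical excerpt of PyTorch's setup.py entry_points for BC option 1:
# `torchx` becomes an alias console script for the consolidated launcher.
entry_points = {
    "console_scripts": [
        "torchrun = torch.x.run:main",  # proposed consolidated launcher
        "torchx = torch.x.run:main",    # BC alias for existing torchx CLI users
    ],
}
```
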
### Dependencies and CI

#### Dependencies

The additional install-time dependencies that upstreaming TorchX to PyTorch would bring are:

```
docstring-parser (==0.8.1)
urllib3 (<1.27,>=1.21.1)
tabulate  # not strictly needed (can be removed, but requires a code change in the torchx CLI)
pyyaml    # already defined in pytorch/requirements.txt but not an install-time dep of torch
fsspec    # already defined in pytorch/requirements.txt but not an install-time dep of torch
```

TorchX defines [`extras-require`](https://github.com/pytorch/torchx/blob/main/setup.py#L83) based on
the type of infrastructure the user wants to use TorchX on. For instance:

```
pip install torchx[kubernetes]  # torchx + kubernetes>=11
pip install torchx[gcp_batch]   # torchx + google-cloud-(batch + logging + runtimeconfig)
pip install torchx[ray]         # torchx + ray>=1.12.1
```

For `extras-require` we can either:

1. (option 1) Create `[extras]-requirements.txt` files (e.g. `kubernetes-requirements.txt`) and
   have the user `pip install -r [extras]-requirements.txt`
2. (option 2) At runtime, display an error message if the user hits a codepath for which the deps are not installed
   (a sketch of such a check follows this list):
   ```shell-session
   # sample error message
   $ pip install torch
   $ torchrun -s aws_batch dist.ddp -j 1x2 train.py
   =================================================
   ERROR: Unsatisfied dependencies [boto3, docker]
   To install run:

       $ pip install boto3==1.24.59 docker
   =================================================
   ```
3. (option 3 - not desirable, included for completeness) Add `extras-require` to torch installs (e.g. `pip install torch[gcp]`)

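A rough sketch of the option-2 runtime check; the `SCHEDULER_EXTRA_DEPS` mapping and function name are illustrative assumptions, not the actual torchx implementation.

```python
# Illustrative sketch of option 2: fail fast with an actionable message when a
# scheduler codepath is hit without its extra deps installed.
import importlib.util
import sys

# hypothetical mapping: scheduler name -> pip packages its codepath needs
SCHEDULER_EXTRA_DEPS = {
    "aws_batch": ["boto3", "docker"],
    "kubernetes": ["kubernetes"],
}


def check_scheduler_deps(scheduler: str) -> None:
    missing = [
        pkg
        for pkg in SCHEDULER_EXTRA_DEPS.get(scheduler, [])
        if importlib.util.find_spec(pkg) is None
    ]
    if missing:
        sys.exit(
            f"ERROR: Unsatisfied dependencies {missing}\n"
            "To install run:\n\n"
            f"    $ pip install {' '.join(missing)}"
        )
```
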
#### CI

1. **Linting**: The static type checker needs to change from pyre (currently used by torchx) → mypy (used by torch).
   Once migrated to mypy, the regular PyTorch [Lint GitHub action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/lint.yml) should suffice.
2. **Conditional Integration Tests**: TorchX runs two types of end-to-end integration tests to ensure that it can submit jobs correctly to the supported schedulers:
   1. [Components Integration Tests](https://github.com/pytorch/torchx/blob/main/.github/workflows/components-integration-tests.yaml):
      run locally on the CI runner using mock/local schedulers (e.g. minikube for Kubernetes)
   2. [SCHEDULER] Integration Tests (e.g. [aws-batch-integration-tests](https://github.com/pytorch/torchx/blob/main/.github/workflows/aws-batch-integration-tests.yaml)):
      run against scheduler instances on an actual cloud (e.g. AWS, GCP) account.

   While (i) is lighter weight than (ii), both are rather expensive to run at the pace of commits to PyTorch.
   Therefore, we propose that these integration tests run:
   1. Only for commits that touch files in `torch/x/**`

      > **Reviewer comment:** this is what inductor does, for example, and makes sense to me

   2. On nightly releases of torch

### Documentation

The docs at https://pytorch.org/torchx/latest/ should move to https://pytorch.org/docs/stable/index.html
under the `torch.x` submodule. The `torch.distributed.elastic` docs should be merged into `torch.x`.

## **Drawbacks**

See the section on BC above.

## **Alternatives**

We considered pulling `torch.distributed.elastic` (torchelastic) out into the pytorch/torchx repository and
consolidating TorchElastic and TorchX *outside* the pytorch/pytorch repo. However, due to the prevalent usage
of `torchrun`, pulling torchelastic out of PyTorch would mean we either:

1. Make `torch-2.x` backwards incompatible, since users would have to separately install `torchx` to get `torchrun`.
   The installation instructions on https://pytorch.org could include `torchx` as part of the `pip` command
   (similar to `torchvision`).

   > **Reviewer comment:** I don't feel like these drawbacks are particularly bad.
   >
   > **Author reply:** There's something to be said about having a built-in runner CLI that helps you launch simple and complex jobs onto some commonly used schedulers (e.g. slurm, k8s). It makes the UX frictionless (especially for distributed). We discussed the … The fundamental difference though is that not all pytorch users need torchvision, but all pytorch users need to run their scripts at some point.

2. -- or -- (to make things BC) have PyTorch take a pip dependency on `torchx`.

For the reasons mentioned above, pulling torchelastic out of the PyTorch repo is sub-optimal.

### Level of Support
Pending...

Choose one of the following:
* 1: Overwhelming positive feedback.
* 2: Positive feedback.
* 3: Majority Acceptance, with Conflicting Feedback.
* 4: Acceptance, with Little Feedback.
* 5: Unclear Resolution.
* 6: RFC Rejected.
* 7: RFC Rejected, with Conflicting Feedback.

#### Additional Context
N/A

### Next Steps
Pending RFC feedback

#### Tracking issue
N/A

**General review discussion (from the PR):**

> **Reviewer comment:** The motivations would lead me to believe that either of the below are valid options: … At a high level, I agree that the needs served by torchrun and torchx are quite similar, so they should be coupled more closely. One worry I have is that typically in core they're strict about adding new dependencies and often prefer out-of-core registration mechanisms.
>
> **Author reply:** yea, I don't want to bring additional deps + unnecessary bloat to core either. However, there are a number of benefits for PyTorch to define the specifications of how PyTorch-authored scripts/applications are installed and run on the target infrastructure: …