Describe the bug
BytePS in a K8s Pod does not have DMLC_WORKER_ID configured, so bpslaunch complains that it cannot find the DMLC_WORKER_ID variable and errors out.
To Reproduce
Steps to reproduce the behavior:
byteps-mxnet-job-scheduler-0 1/1 Running 0 8s
byteps-mxnet-job-server-0 1/1 Running 0 8s
byteps-mxnet-job-server-1 1/1 Running 0 8s
byteps-mxnet-job-worker-0 0/1 Completed 0 8s
byteps-mxnet-job-worker-1 0/1 Completed 0 7s
$ kubectl describe pod byteps-mxnet-job-worker-0
The environment section of the output shows that DMLC_WORKER_ID is not set:
DMLC_PS_ROOT_PORT: 9091
DMLC_PS_ROOT_URI: byteps-mxnet-job-scheduler-0
DMLC_NUM_SERVER: 2
DMLC_NUM_WORKER: 2
DMLC_ROLE: worker
DMLC_USE_KUBERNETES: 1
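A quick way to confirm which variables are absent is to compare the container environment against the names listed above plus DMLC_WORKER_ID. The sketch below is only an illustration: `EXPECTED_VARS` and `missing_dmlc_vars` are names I made up, and the list is taken from this report, not from the bpslaunch source.

```python
import os

# Names copied from this report plus the one bpslaunch complains about.
# Assumption: this is NOT an authoritative list of what bpslaunch requires.
EXPECTED_VARS = [
    "DMLC_PS_ROOT_PORT",
    "DMLC_PS_ROOT_URI",
    "DMLC_NUM_SERVER",
    "DMLC_NUM_WORKER",
    "DMLC_ROLE",
    "DMLC_WORKER_ID",
]


def missing_dmlc_vars(env=None):
    """Return the expected variable names that are absent from env."""
    env = os.environ if env is None else env
    return [name for name in EXPECTED_VARS if name not in env]


if __name__ == "__main__":
    # Inside the Pod this prints whichever expected names are unset.
    print("missing:", missing_dmlc_vars())
```

Run inside the worker Pod, this should report only DMLC_WORKER_ID if the environment matches the listing above.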
To reproduce it inside the Pod, modify the YAML as below so that the Pod stays up without running bpslaunch:
command: ["/bin/bash", "-c"]
args: [
"sleep 3600"
]
This replaces the original entry point:
command: ["bpslaunch"]
args: ["python3", "/usr/local/byteps/example/mxnet/train_imagenet_byteps.py", "--benchmark", "1", "--batch-size=32"]
Then apply the YAML so that the Pods keep running:
byteps-mxnet-job-server-0 1/1 Running 0 15s
byteps-mxnet-job-server-1 1/1 Running 0 15s
byteps-mxnet-job-worker-0 1/1 Running 0 15s
byteps-mxnet-job-worker-1 1/1 Running 0 14s
Then log in to a worker Pod, check the environment, and run bpslaunch by hand:
$ kubectl exec -it byteps-mxnet-job-worker-0 -- bash
root@byteps-mxnet-job-worker-0:/#
root@byteps-mxnet-job-worker-0:/# env |grep DMLC_WORKER_ID
root@byteps-mxnet-job-worker-0:/# bpslaunch
BytePS launching worker
The env DMLC_WORKER_ID is missing
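One possible workaround, until the variable is injected properly, might be to derive the worker ID from the trailing ordinal of the pod name (for example `byteps-mxnet-job-worker-1` gives 1). This is a sketch under assumptions: the helper name `worker_id_from_pod_name` is hypothetical, and I am assuming that the pod ordinal is the value the worker ID should take, which I have not confirmed.

```python
import re


def worker_id_from_pod_name(pod_name):
    """Return the trailing ordinal of a pod name such as 'job-worker-1'.

    Assumption: the ordinal at the end of the pod name is the intended
    DMLC_WORKER_ID. This is a guess, not documented BytePS behavior.
    """
    match = re.search(r"-(\d+)$", pod_name)
    if match is None:
        raise ValueError(f"no trailing ordinal in pod name: {pod_name!r}")
    return int(match.group(1))


# Example with a pod name from this report.
print(worker_id_from_pod_name("byteps-mxnet-job-worker-1"))  # prints 1
```

The result could then be exported as DMLC_WORKER_ID before calling bpslaunch, but I would prefer the job spec to set it directly.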
Expected behavior
The worker Pods should start and keep running instead of exiting immediately.
Screenshots
If applicable, add screenshots to help explain your problem.
Environment (please complete the following information):
Additional context
If I need to run PyTorch DDP with BytePS on Kubernetes, do I still have to use the MXJob operator, or can I use the PyTorchJob operator?
Thanks
Jack