[feature] Support different PS/worker types #1369
Comments
I've implemented this feature for internal use. If the community believes this feature is desirable, I can push the changes to this repository.
It is widely needed in the industry, I think. /cc @kubeflow/wg-training-leads
Interesting. Can you give an example here? When would a user prefer GPU for some PSes and CPU for others?
I can offer two cases:

Case 1: Diversity in parameter servers. For a PS/worker distributed training task, we want to maximize the communication bandwidth between workers and parameter servers, so whenever possible we should co-locate workers and parameter servers. However, there may be some parameter servers left that cannot fit onto those nodes, since a node's resources are limited. In this case, we need to configure parameter servers with distinct pod affinities.

Case 2: Diversity in workers. When (tf/mpi/pytorch-)jobs are deployed in a cluster with fragmented resources left over, such as 2c4g on one node or 4c2g on another, these resources would be permanently abandoned under the traditional TFJob specification. However, a diverse TFJob can configure workers with multiple resource configurations, so any leftover fragmented resources can be utilized by a worker of some diverse TFJob. Of course, to run workers in a diverse TFJob, the TensorFlow optimizer should be carefully chosen to perform asynchronous gradient updates as well as learning-rate adjustment for workers with different batch sizes.
It's more about scheduling. For example, there is a GPU node with 2 GPUs, 64 CPUs, and 126G of memory. One worker uses 1 GPU, 25 CPUs, and 40G. Then there can be a PS on the same node that uses 14 CPUs and 46G of memory. Besides this, there are some CPU nodes with 16 CPUs and 32G. PSes on those CPU nodes should then use 16 CPUs and 32G.
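A minimal sketch of how such a heterogeneous spec could look, using the numbers above. This is illustrative only: today's TFJob API accepts a single `PS` entry, so the `PS-Colocated` and `PS-CPUNode` replica names, the image, and the ability to define multiple PS groups are hypothetical.

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: heterogeneous-ps-sketch
spec:
  tfReplicaSpecs:
    Worker:                       # GPU workers co-located with a large PS
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: my-train-image:latest   # hypothetical image
              resources:
                limits:
                  nvidia.com/gpu: 1
                  cpu: "25"
                  memory: 40Gi
    PS-Colocated:                 # hypothetical: a PS sized to fill the rest of the GPU node
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow
              image: my-train-image:latest
              resources:
                limits:
                  cpu: "14"
                  memory: 46Gi
    PS-CPUNode:                   # hypothetical: PSes sized for the 16-CPU / 32G CPU nodes
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: my-train-image:latest
              resources:
                limits:
                  cpu: "16"
                  memory: 32Gi
```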
Maybe we could wait until all-in-one is released.
Got it. The need arises because of resource allocation.
This would require a customized optimizer and a dynamic batch-sizing algorithm though. I am curious to see if there are any good practices around this from your internal experiments.
Definitely, a new optimizer that can cope with this volatile environment is required. But for the diverse parameter server mode, a regular optimizer supporting asynchronous updating is sufficient; dynamic batch-size adjustment mainly comes from the diverse worker mode. Regarding experiments, we probably need to provide this feature in tf-operator first, as the experiment environment for algorithm researchers to proceed.
Has this feature been implemented?
I only have a proof-of-concept implementation based on v0.5.3: https://github.com/zw0610/tf-operator/tree/diverse-worker
Can you please describe your use scenario? Maybe there is another workaround.
My earlier description was not accurate.
We want to support heterogeneous PS/workers. For example, some workers train on CPU while others use GPU.
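As a sketch of the worker side of this idea (again hypothetical: the current TFJob API allows only one `Worker` entry, and the `Worker-GPU`/`Worker-CPU` names, image, and resource numbers are made up):

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: heterogeneous-worker-sketch
spec:
  tfReplicaSpecs:
    Worker-GPU:            # hypothetical: GPU-backed workers
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: my-train-image:latest   # hypothetical image
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker-CPU:            # hypothetical: CPU-only workers picking up leftover capacity
      replicas: 4
      template:
        spec:
          containers:
            - name: tensorflow
              image: my-train-image:latest
              resources:
                limits:
                  cpu: "4"
                  memory: 8Gi
```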
Does this mean that a distributed training job has both PS tasks and worker tasks, so that CPU and GPU can take part in training at the same time?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Any update on this issue? Heterogeneous workers have become increasingly important for current LLM training.
I do not think it is on the roadmap.
@gaocegege @Windfarer I think we can implement this feature as part of the V2 APIs: #2171. Users will be able to create a TrainingRuntime using a different Job template for every PS (see the sketch below).
In some customer cases, users want to schedule one PS on a GPU machine and place the other PSes on CPU machines, like this:
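A rough sketch under those assumptions: the apiVersion, the JobSet-style `replicatedJobs` layout, the node labels, the image, and the resource numbers are illustrative and may not match the final v2 API.

```yaml
apiVersion: kubeflow.org/v2alpha1        # assumed group/version for the v2 API
kind: TrainingRuntime
metadata:
  name: heterogeneous-ps-runtime-sketch
spec:
  template:                              # JobSet-style template: one replicated Job per PS flavor
    spec:
      replicatedJobs:
        - name: ps-gpu-node              # the single PS pinned to the GPU machine
          replicas: 1
          template:
            spec:
              template:
                spec:
                  nodeSelector:
                    node-type: gpu       # hypothetical node label
                  containers:
                    - name: tensorflow
                      image: my-train-image:latest   # hypothetical image
                      resources:
                        limits:
                          cpu: "14"
                          memory: 46Gi
        - name: ps-cpu-node              # the remaining PSes on CPU machines
          replicas: 2
          template:
            spec:
              template:
                spec:
                  nodeSelector:
                    node-type: cpu       # hypothetical node label
                  containers:
                    - name: tensorflow
                      image: my-train-image:latest
                      resources:
                        limits:
                          cpu: "16"
                          memory: 32Gi
```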
/cc @zw0610