diff --git a/docs/proposals/images/fl-v2-architecture.png b/docs/proposals/images/fl-v2-architecture.png new file mode 100644 index 000000000..e2c4695fa Binary files /dev/null and b/docs/proposals/images/fl-v2-architecture.png differ diff --git a/docs/proposals/sedna-federated-learning-v2.md b/docs/proposals/sedna-federated-learning-v2.md index 9106b7ea3..13feb9668 100644 --- a/docs/proposals/sedna-federated-learning-v2.md +++ b/docs/proposals/sedna-federated-learning-v2.md @@ -6,8 +6,9 @@ - [Proposal](#proposal) - [Use Cases](#use-cases) - [Design Details](#design-details) - - [New Fields in FederatedLearningJob CRD](#new-fields-in-federatedlearningjob-crd) - - [Sedna Federated Learning Example](#sedna-federated-learning-example) + - [Subnet Management](#subnet-management) + - [FederatedLearningJob CRD](#federatedlearningjob-crd) + - [DataLoader Daemonset & Waiter Container](#dataloader-daemonset--waiter-container) - [Controller Design](#controller-design) - [Federated Learning Controller](#federated-learning-controller) - [Downstream Controller](#downstream-controller) @@ -59,31 +60,39 @@ The non-goals include: - Integrate inference jobs (currently) -## User Story +## Proposal -Nowadays, the number of edge devices explodes in an exponential way, making federated learning more popular in some security-sensitive cloud-edge scenarios. However, the edge devices are heterogenous and have different amounts of resources like CPU, GPU, and memory. It’s unnecessary and impossible to execute federated learning on all of them with the consideration of capability and efficiency. So we just need to place the training task on those devices with abundant resources and use a portion of data for training. +In the new design, we assume that: -By integrating training-operator, we can schedule training tasks to those devices where richer resources are prepared for training, thus making federated-learning more efficient and practical. And also, we will have various lifecycle managing abilities so that the training process would be more robust and scalable. +1. Data can be transferred within a secure subnet. -## Proposal +2. All training workers have the same parameters. -We propose **reusing Kubernetes Custom Resource Definitions (CRDs) for federated learning** (i.e. `FederatedLearningJob`) to enable distributed training with training-operator. In this way, we can implement new training runtime with training-operator while maintaining backward compatibility with the default one. +Since federated learning is a training task, it’s in fact data-driven. If we only schedule training tasks without scheduling the training data, the model will have unacceptable training bias. So we need to collect data and distribute it to different training workers to avoid training bias, after which we can execute the federated learning jobs. -The design details will be described in the following chapter. The main ideas of new design are: +Based on the reasons above, we propose **Two-Phase Federated Learning** to enable distributed training with training-operator: -1. Add new fields to the CRD of federated learning (i.e. `FederatedLearningJob`) +1. **Phase1 - Data Preparation**: Collect training data in edge nodes and distribute them to different training workers. In this phase, federated training tasks are scheduled to nodes but are **blocked** waiting for data. -2. Add training-operator runtime (e.g. `PyTorchJob`, `TFJob`) as an alternative to default runtime (i.e. `Pod`) +2. 
**Phase2 - Training**: Execute federated learning jobs when the training data is ready. -3. If not specified, `Pod` will be the default runtime so as to maintain backward compatibility. +The design details will be described in the following chapter. The main ideas of new design are: -![](./images/federated-learning-job-crd.png) +1. Define subnets using `NodeGroup` in KubeEdge. -### Use Cases +2. New version of `FederatedLearningJob`. + +3. Add DataLoader Daemonset to collect and distribute data in Federated Learning Phase1. + +4. Using `PyTorchJob` CRD in training-operator as the training runtime for federated learning. + +5. Add a waiter container to the `initContainer` of training pods to check the readiness of the data. + +![](./images/fl-v2-architecture.png) -Add “specifying training runtime” compared to [the proposal for federated learning](./federated-learning.md): +### Use Cases -- Users can create a federated learning job, with providing a training script, specifying the aggregation algorithm, configuring training hyperparameters, configuring training datasets, **specifying training runtime (default/training-operator)**. +- Users can create a federated learning job, with providing a training script, specifying the aggregation algorithm, configuring training hyperparameters, configuring training datasets. - Users can get the federated learning status, including the nodes participating in training, current training status, sample size of each node, current iteration times, and current aggregation times. @@ -91,33 +100,42 @@ Add “specifying training runtime” compared to [the proposal for federated le ## Design Details -### New Fields in `FederatedLearningJob` CRD - -We decided to add a `TrainingPolicy` field. It allows users to choose two training mode: +### Subnet Management -1. `Default`: Use the original training mode in Sedna. +We will use [NodeGroup](https://github.com/kubeedge/kubeedge/blob/master/docs/proposals/node-group-management.md) in KubeEdge to define subnets for nodes, within which data can be transferred among nodes. This ensures the privacy of the data and enhances the efficiency of training. -2. **`Distributed`: Use training-operator as the training runtime to orchestrate training tasks**. +### `FederatedLearningJob` CRD -When the `Mode` field is set to `Distributed`, users need to specify the framework (e.g. PyTorch, Tensorflow, etc.) they use to decide the CRD we use in training-operator (e.g. PyTorchJob). 
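For illustration, the subnet that the training workers run in could be described by a KubeEdge `NodeGroup` such as the sketch below. The API version and field names are assumptions taken from the NodeGroup proposal linked above and may differ in the final KubeEdge implementation; the group name matches the `targetNodeGroups` entry used in the example configuration later in this section, and the node names and label are purely illustrative.

```YAML
# Hypothetical NodeGroup defining the secure subnet for this federated learning job.
apiVersion: apps.kubeedge.io/v1alpha1   # assumed API group/version from the NodeGroup proposal
kind: NodeGroup
metadata:
  name: surface-detect-nodes-group
spec:
  # Nodes can be selected explicitly by name or by label; both are shown here for illustration.
  nodes:
    - edge-node-1
    - edge-node-2
  matchLabels:
    sedna.io/fl-subnet: surface-defect-detection
```

Training data is only transferred between nodes that belong to the same `NodeGroup`, which is what provides the privacy boundary assumed in Phase 1.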
+We defines new `FederatedLearningJob` CRD as follows: ```Golang // FLJobSpec is a description of a federated learning job type FLJobSpec struct { -+ TrainingPolicy TrainingPolicy `json:"trainingPolicy,omitempty"` + AggregationWorker AggregationWorker `json:"aggregationWorker"` + + TrainingWorkers TrainingWorker `json:"trainingWorkers"` + + PretrainedModel PretrainedModel `json:"pretrainedModel,omitempty"` + + Transmitter Transmitter `json:"transmitter,omitempty"` +} + +// TrainingWorker describes the data a training worker should have +type TrainingWorker struct { + Replicas int `json:"replicas"` - AggregationWorker AggregationWorker `json:"aggregationWorker"` + TargetNodeGroups []TargetNodeGroups `json:"targetNodeGroups"` - TrainingWorkers []TrainingWorker `json:"trainingWorkers"` + Datasets []TrainDataset `json:"datasets"` - PretrainedModel PretrainedModel `json:"pretrainedModel,omitempty"` + TrainingPolicy TrainingPolicy `json:"trainingPolicy,omitempty"` - Transmitter Transmitter `json:"transmitter,omitempty"` + Template v1.PodTemplateSpec `json:"template"` } // TrainingPolicy defines the policy we take in the training phase type TrainingPolicy struct { - // Mode defines the training mode, chosen from Default and Distributed + // Mode defines the training mode, chosen from Sequential and Distributed Mode string `json:"mode,omitempty"` // Framework indicates the framework we use(e.g. PyTorch). We will determine the training runtime(i.e. CRDs in training-operator) we adopt to orchestrate training tasks when the Mode field is set to Distributed @@ -125,8 +143,6 @@ type TrainingPolicy struct { } ``` -### Sedna Federated Learning Example - The configuration of federated learning jobs should look like: ```YAML @@ -135,9 +151,6 @@ kind: FederatedLearningJob metadata: name: surface-defect-detection spec: -+ trainingPolicy: -+ mode: Distributed -+ framework: PyTorch aggregationWorker: model: name: "surface-defect-detection-model" @@ -154,47 +167,46 @@ spec: resources: # user defined resources limits: memory: 2Gi - trainingWorkers: - - dataset: - name: "edge1-surface-defect-detection-dataset" - template: - spec: - nodeName: $EDGE1_NODE - containers: - - image: $TRAIN_IMAGE - name: train-worker - imagePullPolicy: IfNotPresent - env: # user defined environments - - name: "batch_size" - value: "32" - - name: "learning_rate" - value: "0.001" - - name: "epochs" - value: "2" - resources: # user defined resources - limits: - memory: 2Gi - - dataset: - name: "edge2-surface-defect-detection-dataset" - template: - spec: - nodeName: $EDGE2_NODE - containers: - - image: $TRAIN_IMAGE - name: train-worker - imagePullPolicy: IfNotPresent - env: # user defined environments - - name: "batch_size" - value: "32" - - name: "learning_rate" - value: "0.001" - - name: "epochs" - value: "2" - resources: # user defined resources - limits: - memory: 2Gi + trainingWorkers: + datasets: + - edge1-surface-defect-detection-dataset + - edge2-surface-defect-detection-dataset + replicas: 2 + targetNodesGroup: + - surface-detect-nodes-group + trainingPolicy: + mode: Distributed + framework: PyTorch + template: + spec: + containers: + - image: $TRAIN_IMAGE + name: train-worker + imagePullPolicy: IfNotPresent + env: # user defined environments + - name: "batch_size" + value: "32" + - name: "learning_rate" + value: "0.001" + - name: "epochs" + value: "2" + resources: # user defined resources + limits: + memory: 2Gi ``` +### DataLoader Daemonset & Waiter Container + +> **TBD**: DataLoader Daemonset may be implemented based on 
[edgemesh](https://github.com/kubeedge/edgemesh)
+
+The DataLoader daemonset watches for `FederatedLearningJob` events.
+
+When a new federated learning job is created, its training pods are blocked by the waiter container and stay in `pending` status. The DataLoader daemonset is notified of this event, reads the `.spec.datasets` field and the node placement of the training tasks, and then transfers the training data to the target directory of each training worker.
+
+The waiter container is listed in the `initContainers` field of every training pod and blocks the training task until the training data is ready (a sketch of such a waiter is given below).
+
+When the data is ready, the DataLoader daemonset notifies every waiter container. The waiter containers then finish with `completed` status, and the training tasks start executing.
+
 ## Controller Design
 
 The new design **will not change the main architecture** of the original Federated Learning Controller, which would start three separate goroutines called `federated-learning`, `upstream`, and `downstream` controllers.
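Returning to the data readiness check used in Phase 1, the snippet below is a minimal sketch of how the waiter could appear in a training pod. The image name, the marker-file convention (a `.ready` file written by the DataLoader daemonset into the shared data directory), and the host path are illustrative assumptions; the actual notification mechanism between the DataLoader daemonset and the waiter is still TBD, as noted above.

```YAML
# Hypothetical pod spec fragment for a Phase 1 training worker.
spec:
  initContainers:
    - name: data-waiter                        # blocks until the DataLoader signals readiness
      image: kubeedge/sedna-fl-waiter:latest   # hypothetical image name
      command: ["sh", "-c"]
      args:
        - "until [ -f /data/.ready ]; do echo waiting for training data; sleep 10; done"
      volumeMounts:
        - name: train-data
          mountPath: /data
  containers:
    - name: train-worker
      image: $TRAIN_IMAGE
      volumeMounts:
        - name: train-data
          mountPath: /data
  volumes:
    - name: train-data
      hostPath:
        path: /var/lib/sedna/fl-data           # hypothetical data directory on the edge node
```

Once the DataLoader daemonset has transferred the data and written the marker file, the waiter exits with `completed` status, the training container starts, and the job moves from Phase 1 to Phase 2.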