This repository contains a StreamFlow Federated Learning (FL) pipeline based on PyTorch. The workflow trains a VGG16 model with Group Normalization over two datasets: MNIST and SVHN.
The workflow is described in an extended version of CWL that introduces support for a Loop construct, which is necessary to express the train-aggregate iterations of FL workloads.
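As an illustrative sketch of such a loop, the fragment below uses the cwltool `Loop` extension to repeat a training-and-aggregation round until a round counter reaches a bound. All step, file, and input names (`round.cwl`, `weights`, `max_rounds`) are placeholders, not taken from this repository:

```yaml
cwlVersion: v1.2
class: Workflow
$namespaces:
  cwltool: "http://commonwl.org/cwltool#"
requirements:
  InlineJavascriptRequirement: {}
inputs:
  initial_weights: File
  max_rounds: int
outputs:
  final_weights:
    type: File
    outputSource: fl_round/weights
steps:
  fl_round:
    # One FL round: local training on each worker, then aggregation
    run: round.cwl
    in:
      weights: initial_weights
      round: {default: 0}
      max_rounds: max_rounds
    out: [weights]
    requirements:
      cwltool:Loop:
        loopWhen: $(inputs.round < inputs.max_rounds)
        loop:
          weights:
            loopSource: weights   # feed aggregated weights back in
          round:
            valueFrom: $(inputs.round + 1)
        outputMethod: last
```

With `outputMethod: last`, only the final iteration's aggregated weights are propagated to the workflow output.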
Datasets have been placed onto two different HPC facilities:
- MNIST has been trained on the EPITO cluster at the University of Torino (one 80-core Arm Neoverse N1 CPU, 512GB RAM, and 2 NVIDIA A100 GPUs per node);
- SVHN has been trained on the CINECA MARCONI100 cluster in Bologna (two 16-core IBM POWER9 AC922 CPUs, 256GB RAM, and 4 NVIDIA V100 GPUs per node).
Since HPC worker nodes cannot open outbound connections to the Internet, this workload cannot be managed by FL frameworks that require direct bidirectional connections between worker and aggregator nodes. StreamFlow, instead, relies on a pull-based data transfer mechanism that overcomes this limitation.
To also enable a direct comparison between StreamFlow and the Intel OpenFL framework, the pipeline has also been executed on two VMs (8 cores, 32GB RAM, and 1 NVIDIA T4 GPU each) hosted on the HPC4AI Cloud at the University of Torino, acting as workers. In both settings, the aggregation plane has always been placed in the Cloud.
To run the experiment as is, clone this repository on the aggregator node and use the following commands:

```shell
python -m venv venv
source venv/bin/activate
pip install "streamflow==0.2.0.dev1"
pip install -r requirements.txt
streamflow run streamflow.yml
```
Reproducing the experiments in the same environment requires access to both HPC facilities and the HPC4AI Cloud. However, interested users can run the same pipeline on their preferred infrastructure by changing the `deployments` definitions in the `streamflow.yml` file and the corresponding Slurm/SSH scripts inside the `environments` folder.
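As an illustrative sketch of what such a change involves (the deployment name, hostname, username, key path, script path, and step name below are all placeholders, not the ones used in this repository), a StreamFlow file binding a training step to a Slurm cluster reached via SSH may look like:

```yaml
version: v1.0
workflows:
  federated-learning:
    type: cwl
    config:
      file: cwl/main.cwl
      settings: cwl/config.yml
    bindings:
      # Map a workflow step onto a remote execution environment
      - step: /train_mnist
        target:
          deployment: hpc-cluster
deployments:
  hpc-cluster:
    type: slurm
    config:
      file: environments/hpc/sbatch.sh   # Slurm submission template
      hostname: login.example.org
      username: user
      sshKey: /home/user/.ssh/id_rsa
```

Pointing the `bindings` targets at different `deployments` entries is what relocates each training step to another infrastructure, without touching the CWL workflow description itself.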
Also, note that the Python dependencies listed in the `requirements.txt` file must be manually installed in every involved location (both the workers and the aggregator), and the datasets are expected to be already present on the worker nodes.
- Iacopo Colonnelli [email protected]
- Bruno Casella [email protected]
- Marco Aldinucci [email protected]