Examples of distributed (multi-node) training of machine learning and deep learning models in TensorFlow. Every training example can be run on a multi-node cluster and has been tested on a CPU cluster.
This repository contains the following models:
- Model 1, single-layer neural network: mnist_nn_distibuted_placeholder.py
- Model 2, softmax model: mnist_softmax_distibuted_placeholder.py
- Model 3, neural network with two hidden layers: mnist_2hiddenLayerNN_distributed_ph.py
- Model 4, CNN TensorFlow example
- For models 1, 2, and 3: each model has a script named xxx.py and a corresponding folder containing the shell scripts that launch the distributed training job.
- For model 4: please refer to the corresponding README.
- Change the default settings (e.g., Python path, HOME path, host names) to match your cluster before running each training job.
- Make sure you understand the basics of distributed TensorFlow. See the official tutorial for more details.
- Models 1, 2, and 3: TensorFlow 0.11.0rc0, Python 3, Ubuntu 16
- Model 4: TensorFlow 1.5.0, Python 3, Ubuntu 16
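
The host names mentioned above typically end up in a cluster definition that separates parameter servers ("ps") from workers. The following is a minimal sketch of that idea, not the code used in this repository: the helper names `build_cluster` and `task_address` and the host names `node0`, `node1`, `node2` are hypothetical placeholders. It uses only the Python standard library, so it can be run without TensorFlow installed.

```python
# Hypothetical helpers (not part of this repository) that illustrate the
# ps/worker layout used by distributed TensorFlow.

def build_cluster(ps_hosts, worker_hosts):
    """Turn comma-separated "host:port" strings into a cluster dictionary."""
    def split(hosts):
        # Drop surrounding whitespace and ignore empty entries.
        return [h.strip() for h in hosts.split(",") if h.strip()]

    cluster = {"ps": split(ps_hosts), "worker": split(worker_hosts)}
    if not cluster["ps"] or not cluster["worker"]:
        raise ValueError("need at least one ps host and one worker host")
    return cluster


def task_address(cluster, job_name, task_index):
    """Return the "host:port" address of one task in the cluster."""
    return cluster[job_name][task_index]


# Placeholder host names; replace them with the nodes of your own cluster.
cluster = build_cluster("node0:2222", "node1:2222, node2:2222")
print(cluster)
print(task_address(cluster, "worker", 1))

# With TensorFlow 1.x installed, a dictionary of this shape can be passed to
# tf.train.ClusterSpec(cluster), and each node then starts a
# tf.train.Server(cluster_spec, job_name=..., task_index=...).
```

Each node in the cluster runs the same script with its own job name and task index, which is why the launch scripts need the correct host names before a job is started.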