Increasing the number of nodes does not speed up train_baseline #158

gregSchwartz18 · 2019-08-18T16:13:03Z

I am trying to run train_baseline on a cluster. When I increase the number of CPUs used per node, the experiment speeds up. But when I use more nodes while keeping the CPUs per node constant, the speed stays the same (disregarding the longer initial iteration). Is this repo set up to use multiple nodes? If not, what is the best way to go about adding that functionality?

Looking at the output logs, I also noticed that only iterations 1, 4, 7, and 10 were being printed. Why might this be happening?

train_baseline on 1 node.txt
train_baseline on 2 nodes.txt
train_baseline on 5 nodes.txt

eugenevinitsky · 2019-08-28T18:10:53Z

Sorry about the delay! This repo is set up to use multiple nodes. Are you also increasing the number of workers? It won't use more CPUs than num_workers, so adding nodes without raising num_workers leaves the extra CPUs idle. As for the logging, I'm not sure; you may want to post on the RLlib GitHub.
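To make the num_workers point concrete, here is a minimal sketch of scaling the worker count with the cluster size. It is not code from this repo: the helper names `max_useful_workers` and `scaled_config`, the 8 CPUs per node, and the one CPU reserved for the driver process are all assumptions for illustration. Only the `num_workers` config key comes from the discussion above.

```python
def max_useful_workers(num_nodes, cpus_per_node, driver_cpus=1, cpus_per_worker=1):
    """Largest worker count that fits in the cluster's CPU budget.

    Assumption: the driver reserves `driver_cpus` CPUs and each rollout
    worker needs `cpus_per_worker` CPUs. Check your own config for the
    real values.
    """
    if cpus_per_worker < 1:
        raise ValueError("cpus_per_worker must be at least 1")
    total_cpus = num_nodes * cpus_per_node
    return max((total_cpus - driver_cpus) // cpus_per_worker, 0)


def scaled_config(base_config, num_nodes, cpus_per_node):
    """Return a copy of the config with num_workers sized to the cluster."""
    config = dict(base_config)  # shallow copy so the caller's dict is untouched
    config["num_workers"] = max_useful_workers(num_nodes, cpus_per_node)
    return config


if __name__ == "__main__":
    # Hypothetical starting point: a config tuned for a single 8-CPU node.
    base = {"num_workers": 7}
    for nodes in (1, 2, 5):
        print(nodes, "node(s):", scaled_config(base, nodes, cpus_per_node=8))
```

If num_workers stays at the single-node value while nodes are added, the extra CPUs have no workers to run, which would match the flat timings described above. The training script also has to connect to the running Ray cluster rather than start a local instance, otherwise the other nodes are never seen.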
