I am trying to run train_baseline on a cluster. When I increase the number of CPUs used per node, the experiment speeds up. But when I use more nodes while keeping the CPUs per node constant, the speed stays the same (disregarding the longer initial iteration). Is this repo set up to use multiple nodes? If not, what is the best way to add that functionality?
Looking at the output logs, I also noticed that only iterations 1, 4, 7, and 10 were being printed. Why might this be happening?
Sorry about the delay! This repo is set up to use multiple nodes. Are you increasing the number of workers? It won't use more CPUs than num_workers, so adding nodes without raising num_workers leaves the extra CPUs idle. As for the logging, I'm not sure; you may want to post on the RLlib GitHub.
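To make the num_workers point concrete, here is a rough back-of-the-envelope sketch in plain Python. It does not call Ray or RLlib. It assumes each rollout worker takes one CPU and the driver takes one more, which is my understanding of the usual RLlib defaults, so check this against your own config. The function name and the 8-CPUs-per-node figure are made up for illustration.

```python
def idle_cpus(num_nodes, cpus_per_node, num_workers,
              cpus_per_worker=1, driver_cpus=1):
    """Estimate how many cluster CPUs sit unused for a given num_workers.

    Assumption: the job requests driver_cpus + num_workers * cpus_per_worker
    CPUs in total and never uses more than that, however large the cluster is.
    """
    cluster_cpus = num_nodes * cpus_per_node
    requested = driver_cpus + num_workers * cpus_per_worker
    # CPUs beyond what the job requests do no work for this experiment.
    return max(cluster_cpus - requested, 0)


# Hypothetical cluster: 8 CPUs per node, num_workers fixed at 7.
for nodes in (1, 2, 5):
    print(nodes, "node(s):", idle_cpus(nodes, 8, num_workers=7), "idle CPUs")
```

Under these assumptions, going from 1 node to 5 with num_workers unchanged only adds idle CPUs, which would match the flat timings you saw. Raising num_workers to roughly the total CPU count minus one for the driver should let the extra nodes contribute.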
train_baseline on 1 node.txt
train_baseline on 2 nodes.txt
train_baseline on 5 nodes.txt