This repository has been archived by the owner on Dec 11, 2022. It is now read-only.

Release 0.9
Main changes are detailed below:

New features -
* CARLA 0.7 simulator integration
* Human control of gameplay
* Recording of human gameplay and storing/loading the replay buffer (see the sketch after this list)
* Behavioral cloning agent and presets
* Golden tests for several presets
* Selecting between deep / shallow image embedders
* Rendering through pygame (with some boost in performance)
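
As a reference for the replay-buffer recording feature above, here is a minimal sketch — not Coach's own implementation — of storing and loading a recorded buffer with pickle. The helper names `save_replay_buffer` and `load_replay_buffer` are assumptions for illustration; `load_memory_from_file_path` is the tuning parameter that consumes such a file in the agent.py diff further down.

```python
# Hedged sketch: a generic pickle round-trip for a recorded replay buffer.
# The helper names are illustrative and not part of Coach's API.
import pickle

def save_replay_buffer(memory, path):
    # Serialize a recorded (e.g. human gameplay) replay buffer to disk.
    with open(path, "wb") as f:
        pickle.dump(memory, f)

def load_replay_buffer(path):
    # Restore a previously recorded replay buffer from disk.
    with open(path, "rb") as f:
        return pickle.load(f)
```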

API changes -
* Improved environment wrapper API
* Added an evaluate flag to allow convenient evaluation of existing checkpoints
* Improved frameskip definition in Gym

Bug fixes -
* Fixed loading of checkpoints for agents with more than one network
* Fixed N-Step Q Learning agent Python 3 compatibility
itaicaspi-intel authored Dec 19, 2017
1 parent 11faf19 commit 125c7ee
Showing 41 changed files with 1,713 additions and 260 deletions.
68 changes: 40 additions & 28 deletions README.md
@@ -13,10 +13,16 @@ Training an agent to solve an environment is as easy as running:
python3 coach.py -p CartPole_DQN -r
```

<img src="img/doom.gif" alt="Doom Health Gathering" width="265" height="200"/><img src="img/minitaur.gif" alt="PyBullet Minitaur" width="265" height="200"/> <img src="img/ant.gif" alt="Gym Extensions Ant" width="250" height="200"/>
<img src="img/doom_deathmatch.gif" alt="Doom Deathmatch" width="267" height="200"/> <img src="img/carla.gif" alt="CARLA" width="284" height="200"/> <img src="img/montezuma.gif" alt="MontezumaRevenge" width="152" height="200"/>

A blog post from the Intel® Nervana™ website can be found [here](https://www.intelnervana.com/reinforcement-learning-coach-intel).


## Documentation

Framework documentation, algorithm descriptions, and instructions on how to contribute a new agent/environment can be found [here](http://coach.nervanasys.com).


## Installation

Note: Coach has only been tested on Ubuntu 16.04 LTS, and with Python 3.5.
@@ -103,6 +109,8 @@ For example:

It is easy to create new presets for different levels or environments by following the same pattern as in presets.py.

More usage examples can be found [here](http://coach.nervanasys.com/usage/index.html).

## Running Coach Dashboard (Visualization)
Training an agent to solve an environment can be tricky at times.

@@ -121,11 +129,6 @@ python3 dashboard.py
<img src="img/dashboard.png" alt="Coach Design" style="width: 800px;"/>


## Documentation

Framework documentation, algoritmic description and instructions on how to contribute a new agent/environment can be found [here](http://coach.nervanasys.com).


## Parallelizing an Algorithm

Since the introduction of [A3C](https://arxiv.org/abs/1602.01783) in 2016, many algorithms have been shown to benefit from running multiple instances in parallel on many CPU cores. So far, these algorithms include [A3C](https://arxiv.org/abs/1602.01783), [DDPG](https://arxiv.org/pdf/1704.03073.pdf), [PPO](https://arxiv.org/pdf/1707.06347.pdf), and [NAF](https://arxiv.org/pdf/1610.00633.pdf), and this is most probably only the beginning.
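
The toy sketch below — not Coach's implementation — illustrates the idea behind this kind of parallelism: several workers interact with their own copies of the world and asynchronously apply gradient updates to one shared set of parameters. The random "gradients" are placeholders for whatever the per-worker rollouts would actually produce.

```python
# Toy illustration of asynchronous workers updating shared parameters.
import threading
import numpy as np

class SharedParams:
    def __init__(self, size):
        self.weights = np.zeros(size)
        self.lock = threading.Lock()

    def apply_gradient(self, grad, lr=0.01):
        with self.lock:            # keep concurrent updates consistent
            self.weights -= lr * grad

def worker(shared, worker_id, num_updates=100):
    rng = np.random.default_rng(worker_id)
    for _ in range(num_updates):
        grad = rng.normal(size=shared.weights.shape)  # placeholder gradient
        shared.apply_gradient(grad)

shared = SharedParams(size=4)
threads = [threading.Thread(target=worker, args=(shared, i)) for i in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("shared weights after 16 asynchronous workers:", shared.weights)
```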
@@ -150,36 +153,45 @@ python3 coach.py -p Hopper_A3C -n 16

## Supported Environments

* OpenAI Gym
* *OpenAI Gym:*

Installed by default by Coach's installer.

* ViZDoom:
* *ViZDoom:*

Follow the instructions described in the ViZDoom repository -

https://github.com/mwydmuch/ViZDoom

Additionally, Coach assumes that the environment variable VIZDOOM_ROOT points to the ViZDoom installation directory.

* Roboschool:
* *Roboschool:*

Follow the instructions described in the roboschool repository -

https://github.com/openai/roboschool

* GymExtensions:
* *GymExtensions:*

Follow the instructions described in the GymExtensions repository -

https://github.com/Breakend/gym-extensions

Additionally, add the installation directory to the PYTHONPATH environment variable.

* PyBullet
* *PyBullet:*

Follow the instructions described in the [Quick Start Guide](https://docs.google.com/document/d/10sXEhzFRSnvFcl3XxNGhnD4N2SedqwdAvK3dsihxVUA) (basically just - 'pip install pybullet')

* *CARLA:*

Download release 0.7 from the CARLA repository -

https://github.com/carla-simulator/carla/releases

Create a new CARLA_ROOT environment variable pointing to CARLA's installation directory.

A simple CARLA settings file (```CarlaSettings.ini```) is supplied with Coach, and is located in the ```environments``` directory.
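
Several of the environments above depend on environment variables. The small helper below is an illustrative assumption (it is not shipped with Coach); the variable names VIZDOOM_ROOT, CARLA_ROOT, and PYTHONPATH are the ones listed above.

```python
# Quick sanity check for the environment variables mentioned above.
import os

REQUIRED_VARS = {
    "VIZDOOM_ROOT": "ViZDoom installation directory",
    "CARLA_ROOT": "CARLA 0.7 installation directory",
}

def check_environment_variables():
    for name, description in REQUIRED_VARS.items():
        value = os.environ.get(name)
        if value and os.path.isdir(value):
            print("{}={} (ok)".format(name, value))
        else:
            print("{} is missing or not a directory ({})".format(name, description))
    # GymExtensions must be importable, i.e. its directory added to PYTHONPATH.
    print("PYTHONPATH={}".format(os.environ.get("PYTHONPATH", "<not set>")))

if __name__ == "__main__":
    check_environment_variables()
```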


## Supported Algorithms
@@ -190,24 +202,24 @@



* [Deep Q Network (DQN)](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf)
* [Double Deep Q Network (DDQN)](https://arxiv.org/pdf/1509.06461.pdf)
* [Deep Q Network (DQN)](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf) ([code](agents/dqn_agent.py))
* [Double Deep Q Network (DDQN)](https://arxiv.org/pdf/1509.06461.pdf) ([code](agents/ddqn_agent.py))
* [Dueling Q Network](https://arxiv.org/abs/1511.06581)
* [Mixed Monte Carlo (MMC)](https://arxiv.org/abs/1703.01310)
* [Persistent Advantage Learning (PAL)](https://arxiv.org/abs/1512.04860)
* [Categorical Deep Q Network (C51)](https://arxiv.org/abs/1707.06887)
* [Quantile Regression Deep Q Network (QR-DQN)](https://arxiv.org/pdf/1710.10044v1.pdf)
* [Bootstrapped Deep Q Network](https://arxiv.org/abs/1602.04621)
* [N-Step Q Learning](https://arxiv.org/abs/1602.01783) | **Distributed**
* [Neural Episodic Control (NEC)](https://arxiv.org/abs/1703.01988)
* [Normalized Advantage Functions (NAF)](https://arxiv.org/abs/1603.00748.pdf) | **Distributed**
* [Policy Gradients (PG)](http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf) | **Distributed**
* [Asynchronous Advantage Actor-Critic (A3C)](https://arxiv.org/abs/1602.01783) | **Distributed**
* [Deep Deterministic Policy Gradients (DDPG)](https://arxiv.org/abs/1509.02971) | **Distributed**
* [Proximal Policy Optimization (PPO)](https://arxiv.org/pdf/1707.06347.pdf)
* [Clipped Proximal Policy Optimization](https://arxiv.org/pdf/1707.06347.pdf) | **Distributed**
* [Direct Future Prediction (DFP)](https://arxiv.org/abs/1611.01779) | **Distributed**

* [Mixed Monte Carlo (MMC)](https://arxiv.org/abs/1703.01310) ([code](agents/mmc_agent.py))
* [Persistent Advantage Learning (PAL)](https://arxiv.org/abs/1512.04860) ([code](agents/pal_agent.py))
* [Categorical Deep Q Network (C51)](https://arxiv.org/abs/1707.06887) ([code](agents/categorical_dqn_agent.py))
* [Quantile Regression Deep Q Network (QR-DQN)](https://arxiv.org/pdf/1710.10044v1.pdf) ([code](agents/qr_dqn_agent.py))
* [Bootstrapped Deep Q Network](https://arxiv.org/abs/1602.04621) ([code](agents/bootstrapped_dqn_agent.py))
* [N-Step Q Learning](https://arxiv.org/abs/1602.01783) | **Distributed** ([code](agents/n_step_q_agent.py))
* [Neural Episodic Control (NEC)](https://arxiv.org/abs/1703.01988) ([code](agents/nec_agent.py))
* [Normalized Advantage Functions (NAF)](https://arxiv.org/abs/1603.00748.pdf) | **Distributed** ([code](agents/naf_agent.py))
* [Policy Gradients (PG)](http://www-anw.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf) | **Distributed** ([code](agents/policy_gradients_agent.py))
* [Asynchronous Advantage Actor-Critic (A3C)](https://arxiv.org/abs/1602.01783) | **Distributed** ([code](agents/actor_critic_agent.py))
* [Deep Deterministic Policy Gradients (DDPG)](https://arxiv.org/abs/1509.02971) | **Distributed** ([code](agents/ddpg_agent.py))
* [Proximal Policy Optimization (PPO)](https://arxiv.org/pdf/1707.06347.pdf) ([code](agents/ppo_agent.py))
* [Clipped Proximal Policy Optimization](https://arxiv.org/pdf/1707.06347.pdf) | **Distributed** ([code](agents/clipped_ppo_agent.py))
* [Direct Future Prediction (DFP)](https://arxiv.org/abs/1611.01779) | **Distributed** ([code](agents/dfp_agent.py))
* Behavioral Cloning (BC) ([code](agents/bc_agent.py))



3 changes: 3 additions & 0 deletions agents/__init__.py
@@ -16,13 +16,16 @@

from agents.actor_critic_agent import *
from agents.agent import *
from agents.bc_agent import *
from agents.bootstrapped_dqn_agent import *
from agents.clipped_ppo_agent import *
from agents.ddpg_agent import *
from agents.ddqn_agent import *
from agents.dfp_agent import *
from agents.dqn_agent import *
from agents.categorical_dqn_agent import *
from agents.human_agent import *
from agents.imitation_agent import *
from agents.mmc_agent import *
from agents.n_step_q_agent import *
from agents.naf_agent import *
104 changes: 41 additions & 63 deletions agents/agent.py
@@ -50,6 +50,7 @@ def __init__(self, env, tuning_parameters, replicated_device=None, task_id=0):
self.task_id = task_id
self.sess = tuning_parameters.sess
self.env = tuning_parameters.env_instance = env
self.imitation = False

# i/o dimensions
if not tuning_parameters.env.desired_observation_width or not tuning_parameters.env.desired_observation_height:
@@ -61,7 +62,12 @@ def __init__(self, env, tuning_parameters, replicated_device=None, task_id=0):
self.measurements_size = tuning_parameters.env.measurements_size = (self.measurements_size[0] + 1,)

# modules
self.memory = eval(tuning_parameters.memory + '(tuning_parameters)')
if tuning_parameters.agent.load_memory_from_file_path:
screen.log_title("Loading replay buffer from pickle. Pickle path: {}"
.format(tuning_parameters.agent.load_memory_from_file_path))
self.memory = read_pickle(tuning_parameters.agent.load_memory_from_file_path)
else:
self.memory = eval(tuning_parameters.memory + '(tuning_parameters)')
# self.architecture = eval(tuning_parameters.architecture)

self.has_global = replicated_device is not None
@@ -121,11 +127,12 @@ def __init__(self, env, tuning_parameters, replicated_device=None, task_id=0):

def log_to_screen(self, phase):
# log to screen
if self.current_episode > 0:
if phase == RunPhase.TEST:
exploration = self.evaluation_exploration_policy.get_control_param()
else:
if self.current_episode >= 0:
if phase == RunPhase.TRAIN:
exploration = self.exploration_policy.get_control_param()
else:
exploration = self.evaluation_exploration_policy.get_control_param()

screen.log_dict(
OrderedDict([
("Worker", self.task_id),
@@ -135,7 +142,7 @@ def log_to_screen(self, phase):
("steps", self.total_steps_counter),
("training iteration", self.training_iteration)
]),
prefix="Heatup" if self.in_heatup else "Training" if phase == RunPhase.TRAIN else "Testing"
prefix=phase
)

def update_log(self, phase=RunPhase.TRAIN):
@@ -146,7 +153,7 @@ def update_log(self, phase=RunPhase.TRAIN):
# log all the signals to file
logger.set_current_time(self.current_episode)
logger.create_signal_value('Training Iter', self.training_iteration)
logger.create_signal_value('In Heatup', int(self.in_heatup))
logger.create_signal_value('In Heatup', int(phase == RunPhase.HEATUP))
logger.create_signal_value('ER #Transitions', self.memory.num_transitions())
logger.create_signal_value('ER #Episodes', self.memory.length())
logger.create_signal_value('Episode Length', self.current_episode_steps_counter)
@@ -197,24 +204,6 @@ def reset_game(self, do_not_reset_env=False):
network.curr_rnn_c_in = network.middleware_embedder.c_init
network.curr_rnn_h_in = network.middleware_embedder.h_init

def stack_observation(self, curr_stack, observation):
"""
Adds a new observation to an existing stack of observations from previous time-steps.
:param curr_stack: The current observations stack.
:param observation: The new observation
:return: The updated observation stack
"""

if curr_stack == []:
# starting an episode
curr_stack = np.vstack(np.expand_dims([observation] * self.tp.env.observation_stack_size, 0))
curr_stack = self.switch_axes_order(curr_stack, from_type='channels_first', to_type='channels_last')
else:
curr_stack = np.append(curr_stack, np.expand_dims(np.squeeze(observation), axis=-1), axis=-1)
curr_stack = np.delete(curr_stack, 0, -1)

return curr_stack

def preprocess_observation(self, observation):
"""
Preprocesses the given observation.
@@ -335,26 +324,6 @@ def preprocess_reward(self, reward):
reward = max(reward, self.tp.env.reward_clipping_min)
return reward

def switch_axes_order(self, observation, from_type='channels_first', to_type='channels_last'):
"""
transpose an observation axes from channels_first to channels_last or vice versa
:param observation: a numpy array
:param from_type: can be 'channels_first' or 'channels_last'
:param to_type: can be 'channels_first' or 'channels_last'
:return: a new observation with the requested axes order
"""
if from_type == to_type or len(observation.shape) == 1:
return observation
assert 2 <= len(observation.shape) <= 3, 'num axes of an observation must be 2 for a vector or 3 for an image'
assert type(observation) == np.ndarray, 'observation must be a numpy array'
if len(observation.shape) == 3:
if from_type == 'channels_first' and to_type == 'channels_last':
return np.transpose(observation, (1, 2, 0))
elif from_type == 'channels_last' and to_type == 'channels_first':
return np.transpose(observation, (2, 0, 1))
else:
return np.transpose(observation, (1, 0))

def act(self, phase=RunPhase.TRAIN):
"""
Take one step in the environment according to the network prediction and store the transition in memory
@@ -370,15 +339,15 @@ def act(self, phase=RunPhase.TRAIN):
is_first_transition_in_episode = (self.curr_state == [])
if is_first_transition_in_episode:
observation = self.preprocess_observation(self.env.observation)
observation = self.stack_observation([], observation)
observation = stack_observation([], observation, self.tp.env.observation_stack_size)

self.curr_state = {'observation': observation}
if self.tp.agent.use_measurements:
self.curr_state['measurements'] = self.env.measurements
if self.tp.agent.use_accumulated_reward_as_measurement:
self.curr_state['measurements'] = np.append(self.curr_state['measurements'], 0)

if self.in_heatup: # we do not have a stacked curr_state yet
if phase == RunPhase.HEATUP and not self.tp.heatup_using_network_decisions:
action = self.env.get_random_action()
else:
action, action_info = self.choose_action(self.curr_state, phase=phase)
@@ -394,11 +363,11 @@
observation = self.preprocess_observation(result['observation'])

# plot action values online
if self.tp.visualization.plot_action_values_online and not self.in_heatup:
if self.tp.visualization.plot_action_values_online and phase != RunPhase.HEATUP:
self.plot_action_values_online()

# initialize the next state
observation = self.stack_observation(self.curr_state['observation'], observation)
observation = stack_observation(self.curr_state['observation'], observation, self.tp.env.observation_stack_size)

next_state = {'observation': observation}
if self.tp.agent.use_measurements and 'measurements' in result.keys():
@@ -407,7 +376,7 @@
next_state['measurements'] = np.append(next_state['measurements'], self.total_reward_in_current_episode)

# store the transition only if we are training
if phase == RunPhase.TRAIN:
if phase == RunPhase.TRAIN or phase == RunPhase.HEATUP:
transition = Transition(self.curr_state, result['action'], shaped_reward, next_state, result['done'])
for key in action_info.keys():
transition.info[key] = action_info[key]
@@ -427,7 +396,7 @@
self.update_log(phase=phase)
self.log_to_screen(phase=phase)

if phase == RunPhase.TRAIN:
if phase == RunPhase.TRAIN or phase == RunPhase.HEATUP:
self.reset_game()

self.current_episode += 1
@@ -462,11 +431,12 @@ def evaluate(self, num_episodes, keep_networks_synced=False):
for network in self.networks:
network.sync()

if self.tp.visualization.dump_gifs and self.total_reward_in_current_episode > max_reward_achieved:
if self.total_reward_in_current_episode > max_reward_achieved:
max_reward_achieved = self.total_reward_in_current_episode
frame_skipping = int(5/self.tp.env.frame_skip)
logger.create_gif(self.last_episode_images[::frame_skipping],
name='score-{}'.format(max_reward_achieved), fps=10)
if self.tp.visualization.dump_gifs:
logger.create_gif(self.last_episode_images[::frame_skipping],
name='score-{}'.format(max_reward_achieved), fps=10)

average_evaluation_reward += self.total_reward_in_current_episode
self.reset_game()
@@ -496,7 +466,7 @@ def improve(self):
screen.log_title("Starting heatup {}".format(self.task_id))
num_steps_required_for_one_training_batch = self.tp.batch_size * self.tp.env.observation_stack_size
for step in range(max(self.tp.num_heatup_steps, num_steps_required_for_one_training_batch)):
self.act()
self.act(phase=RunPhase.HEATUP)

# training phase
self.in_heatup = False
@@ -509,7 +479,12 @@
# evaluate
evaluate_agent = (self.last_episode_evaluation_ran is not self.current_episode) and \
(self.current_episode % self.tp.evaluate_every_x_episodes == 0)
evaluate_agent = evaluate_agent or \
(self.imitation and self.training_iteration > 0 and
self.training_iteration % self.tp.evaluate_every_x_training_iterations == 0)

if evaluate_agent:
self.env.reset()
self.last_episode_evaluation_ran = self.current_episode
self.evaluate(self.tp.evaluation_episodes)

@@ -522,21 +497,24 @@
self.save_model(model_snapshots_periods_passed)

# play and record in replay buffer
if self.tp.agent.step_until_collecting_full_episodes:
step = 0
while step < self.tp.agent.num_consecutive_playing_steps or self.memory.get_episode(-1).length() != 0:
self.act()
step += 1
else:
for step in range(self.tp.agent.num_consecutive_playing_steps):
self.act()
if self.tp.agent.collect_new_data:
if self.tp.agent.step_until_collecting_full_episodes:
step = 0
while step < self.tp.agent.num_consecutive_playing_steps or self.memory.get_episode(-1).length() != 0:
self.act()
step += 1
else:
for step in range(self.tp.agent.num_consecutive_playing_steps):
self.act()

# train
if self.tp.train:
for step in range(self.tp.agent.num_consecutive_training_steps):
loss = self.train()
self.loss.add_sample(loss)
self.training_iteration += 1
if self.imitation:
self.log_to_screen(RunPhase.TRAIN)
self.post_training_commands()

def save_model(self, model_id):