Additional examples of CLI usage (#761)
* more cli examples
* fixed typo in getting started cli docs
* Illustrate the rollout_save_n_{timesteps,episodes} in cli.rst
* Illustrate use of eval_policy script to get rollouts from an already trained expert
* Adapt demonstration of significance testing in cli.rst to consume all generated monitor CSVs

---------

Co-authored-by: Jason Hoelscher-Obermaier <[email protected]>
Co-authored-by: Adam Gleave <[email protected]>
3 people authored Aug 11, 2023
1 parent fd4d8f0 commit 5b0b531
Showing 1 changed file with 195 additions and 6 deletions.
201 changes: 195 additions & 6 deletions docs/getting-started/cli.rst
You can always find out all the configurable values by running:
python -m imitation.scripts.<script> print_config
Run BC on the ``CartPole-v1`` environment with a pre-trained PPO policy as expert
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. note:: Here the cartpole environment is specified via a named configuration.

50 expert demonstrations are sampled from the PPO policy that is included in the testdata folder.
2000 batches are enough to train a good policy.
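
A rough sketch of such a command, assuming the ``bc`` subcommand of the ``train_imitation`` script and the config keys used elsewhere on this page; the ``bc.train_kwargs.n_batches`` key and the expert-path placeholder are assumptions, not the exact command from the docs:

.. code-block:: bash
python -m imitation.scripts.train_imitation bc with \
cartpole \
demonstrations.n_expert_demos=50 \
bc.train_kwargs.n_batches=2000 \
expert.policy_type=ppo \
expert.loader_kwargs.path=<path-to-testdata-ppo-policy>/model.zip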

Run DAgger on the ``CartPole-v0`` environment with a random policy as expert
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash
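# Hedged sketch, not the exact command from the docs; the "random"
# expert policy type is an assumption, the other config keys appear
# elsewhere on this page.
python -m imitation.scripts.train_imitation dagger with \
environment.gym_id='CartPole-v0' \
dagger.total_timesteps=2000 \
expert.policy_type=random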
This will not produce any meaningful results, since a random policy is not a good expert.


Run AIRL on the ``MountainCar-v0`` environment with an expert from the HuggingFace model hub
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash
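# Hedged sketch, not the exact command from the docs; the
# train_adversarial script and its "airl" subcommand are assumptions,
# the config keys appear elsewhere on this page.
python -m imitation.scripts.train_adversarial airl with \
environment.gym_id='MountainCar-v0' \
expert.policy_type=ppo-huggingface \
demonstrations.n_expert_demos=50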
When a seals environment is passed directly via ``environment.gym_id``, the ``seals:`` prefix ensures that the seals package is imported and the environment is registered.
demonstrations.n_expert_demos=50
Train an expert and save the rollouts explicitly, then train a policy on the saved rollouts
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

First, train an expert and save the demonstrations. By default, this will use ``PPO`` and train for 1M time steps.
We can set the number of time steps to train for by setting ``total_timesteps``.
After training the expert, we generate rollouts using the expert policy and save them to disk.
We can set a minimum number of episodes or time steps to be saved by setting one of ``rollout_save_n_episodes`` or
``rollout_save_n_timesteps``. Note that the number of episodes or time steps saved may be slightly larger than the
specified number.

By default the demonstrations are saved in ``<log_dir>/rollouts/final``
(for this script, ``<log_dir>`` defaults to ``output/train_rl/<environment>/<timestamp>``).
However, we can also pass an explicit path as the logging directory.

.. code-block:: bash
python -m imitation.scripts.train_rl with seals_cartpole \
total_timesteps=40000 \
logging.log_dir=output/ppo/seals_cartpole/trained \
rollout_save_n_episodes=50
Instead of training a new expert, we can also load a pre-trained expert policy and generate rollouts from it.
This can be achieved using the ``eval_policy`` script.

Note that ``rollout_save_path`` is relative to the ``log_dir`` of the ``eval_policy`` script.

.. code-block:: bash
python -m imitation.scripts.eval_policy with seals_cartpole \
expert.policy_type=ppo-huggingface \
eval_n_episodes=50 \
logging.log_dir=output/ppo/seals_cartpole/loaded \
rollout_save_path=rollouts/final
Now we can run the imitation script (in this case DAgger) and pass the path to the demonstrations we just generated:

.. code-block:: bash
python -m imitation.scripts.train_imitation dagger with \
seals_cartpole \
dagger.total_timesteps=2000 \
demonstrations.source=local \
demonstrations.path=output/ppo/seals_cartpole/loaded/rollouts/final
Visualise saved policies
^^^^^^^^^^^^^^^^^^^^^^^^
We can use the ``eval_policy`` script to visualise and render a saved policy.
Here we load a policy saved by a previous ``train_rl`` run on ``Pendulum-v1`` (adjust the path to match your own run).

.. code-block:: bash
python -m imitation.scripts.eval_policy with \
expert.policy_type=ppo \
expert.loader_kwargs.path=output/train_rl/Pendulum-v1/my_run/policies/final/model.zip \
environment.num_vec=1 \
render=True \
environment.gym_id='Pendulum-v1'
Comparing algorithms' performance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Let's use the CLI to compare the performance of two trained policies; the same workflow applies when comparing different algorithms.

First, let's train an expert on the ``CartPole-v1`` environment.

.. code-block:: bash
python -m imitation.scripts.train_rl with \
cartpole \
logging.log_dir=output/train_rl/CartPole-v1/expert \
total_timesteps=10000
Now let's train a weaker agent.

.. code-block:: bash
python -m imitation.scripts.train_rl with \
cartpole \
logging.log_dir=output/train_rl/CartPole-v1/non_expert \
total_timesteps=1000 # simply training less
We can evaluate each policy using the ``eval_policy`` script.
For the expert:

.. code-block:: bash
python -m imitation.scripts.eval_policy with \
expert.policy_type=ppo \
expert.loader_kwargs.path=output/train_rl/CartPole-v1/expert/policies/final/model.zip \
environment.gym_id='CartPole-v1' \
environment.num_vec=1 \
logging.log_dir=output/eval_policy/CartPole-v1/expert
which will return something like:

.. code-block:: bash
INFO - eval_policy - Result: {
'n_traj': 74,
'monitor_return_len': 74,
'return_min': 26.0,
'return_mean': 154.21621621621622,
'return_std': 79.94377589657559,
'return_max': 500.0,
'len_min': 26,
'len_mean': 154.21621621621622,
'len_std': 79.94377589657559,
'len_max': 500,
'monitor_return_min': 26.0,
'monitor_return_mean': 154.21621621621622,
'monitor_return_std': 79.94377589657559,
'monitor_return_max': 500.0
}
INFO - eval_policy - Completed after 0:00:12
For the non-expert:

.. code-block:: bash
python -m imitation.scripts.eval_policy with \
expert.policy_type=ppo \
expert.loader_kwargs.path=output/train_rl/CartPole-v1/non_expert/policies/final/model.zip \
environment.gym_id='CartPole-v1' \
environment.num_vec=1 \
logging.log_dir=output/eval_policy/CartPole-v1/non_expert
.. code-block:: bash
INFO - eval_policy - Result: {
'n_traj': 355,
'monitor_return_len': 355,
'return_min': 8.0,
'return_mean': 28.92676056338028,
'return_std': 15.686012049373561,
'return_max': 104.0,
'len_min': 8,
'len_mean': 28.92676056338028,
'len_std': 15.686012049373561,
'len_max': 104,
'monitor_return_min': 8.0,
'monitor_return_mean': 28.92676056338028,
'monitor_return_std': 15.686012049373561,
'monitor_return_max': 104.0
}
INFO - eval_policy - Completed after 0:00:17
This will save the monitor CSVs (one for each vectorised environment, controlled by ``environment.num_vec``).
The monitor CSVs follow the naming convention ``mon*.monitor.csv``.
We can load these CSV files with ``pandas`` and use the ``imitation.testing.reward_improvement``
module to compare the performance of the two policies.

.. TODO: replace the python block below once a CLI tool for handling significance testing becomes available.
.. code-block:: python
from pathlib import Path
import pandas as pd
from imitation.testing.reward_improvement import is_significant_reward_improvement
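# skiprows=1 skips the JSON metadata line that Monitor prepends to each CSV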
expert_monitor = pd.concat(
[
pd.read_csv(f, skiprows=1)
for f in Path("./output/train_rl/CartPole-v1/expert/monitor").glob(
"mon*.monitor.csv"
)
]
)
non_expert_monitor = pd.concat(
[
pd.read_csv(f, skiprows=1)
for f in Path("./output/train_rl/CartPole-v1/non_expert/monitor").glob(
"mon*.monitor.csv"
)
]
)
if is_significant_reward_improvement(non_expert_monitor["r"], expert_monitor["r"], 0.05):
print("The expert improved over the non-expert with >95% probability")
else:
print("No significant (p=0.05) reward improvement of expert over non-expert")
.. code-block:: bash
The expert improved over the non-expert with >95% probability
Algorithm Scripts
=================

