Additional examples of CLI usage (#761)
* more cli examples
* fixed typo in getting started cli docs
* Illustrate the rollout_save_n_{timesteps,episodes} in cli.rst
* Illustrate use of eval_policy script to get rollouts from an already trained expert
* Adapt demonstration of significance testing in cli.rst to consume all generated monitor CSVs

---------

Co-authored-by: Jason Hoelscher-Obermaier <[email protected]>
Co-authored-by: Adam Gleave <[email protected]>
3 people authored Aug 11, 2023
1 parent fd4d8f0 commit 5b0b531
Showing 1 changed file with 195 additions and 6 deletions.
201 changes: 195 additions & 6 deletions docs/getting-started/cli.rst
You can always find out all the configurable values by running:
python -m imitation.scripts.<script> print_config
Run BC on the ``CartPole-v1`` environment with a pre-trained PPO policy as expert
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. note:: Here the cartpole environment is specified via a named configuration.

50 expert demonstrations are sampled from the PPO policy that is included in the testdata folder.
2000 batches are enough to train a good policy.
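
A rough sketch of such a command, assuming the ``bc`` subcommand of the ``train_imitation`` script and the config keys used elsewhere on this page; the ``bc.train_kwargs.n_batches`` key and the expert-path placeholder are assumptions, not the exact command from the docs:

.. code-block:: bash
python -m imitation.scripts.train_imitation bc with \
cartpole \
demonstrations.n_expert_demos=50 \
bc.train_kwargs.n_batches=2000 \
expert.policy_type=ppo \
expert.loader_kwargs.path=<path-to-testdata-ppo-policy>/model.zip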

Run DAgger on the ``CartPole-v0`` environment with a random policy as expert
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash
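# Hedged sketch, not the exact command from the docs; the "random"
# expert policy type is an assumption, the other config keys appear
# elsewhere on this page.
python -m imitation.scripts.train_imitation dagger with \
environment.gym_id='CartPole-v0' \
dagger.total_timesteps=2000 \
expert.policy_type=random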
This will not produce any meaningful results, since a random policy is not a good expert.


Run AIRL on the ``MountainCar-v0`` environment with an expert from the HuggingFace model hub
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash
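# Hedged sketch, not the exact command from the docs; the
# train_adversarial script and its "airl" subcommand are assumptions,
# the config keys appear elsewhere on this page.
python -m imitation.scripts.train_adversarial airl with \
environment.gym_id='MountainCar-v0' \
expert.policy_type=ppo-huggingface \
demonstrations.n_expert_demos=50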
When a seals environment is passed directly via ``environment.gym_id``, the ``seals:`` prefix ensures that the seals package is imported and the environment is registered.
demonstrations.n_expert_demos=50
Train an expert and save the rollouts explicitly, then train a policy on the saved rollouts
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

First, train an expert and save the demonstrations. By default, this will use ``PPO`` and train for 1M time steps.
We can set the number of time steps to train for by setting ``total_timesteps``.
After training the expert, we generate rollouts using the expert policy and save them to disk.
We can set a minimum number of episodes or time steps to be saved by setting one of ``rollout_save_n_episodes`` or
``rollout_save_n_timesteps``. Note that the number of episodes or time steps saved may be slightly larger than the
specified number.

By default the demonstrations are saved in ``<log_dir>/rollouts/final``
(for this script, ``<log_dir>`` defaults to ``output/train_rl/<environment>/<timestamp>``).
However, we can also pass an explicit path as the logging directory.

.. code-block:: bash
python -m imitation.scripts.train_rl with seals_cartpole \
total_timesteps=40000 \
logging.log_dir=output/ppo/seals_cartpole/trained \
rollout_save_n_episodes=50
Instead of training a new expert, we can also load a pre-trained expert policy and generate rollouts from it.
This can be achieved using the ``eval_policy`` script.

Note that ``rollout_save_path`` is relative to the ``log_dir`` of the ``eval_policy`` script.

.. code-block:: bash
python -m imitation.scripts.eval_policy with seals_cartpole \
expert.policy_type=ppo-huggingface \
eval_n_episodes=50 \
logging.log_dir=output/ppo/seals_cartpole/loaded \
rollout_save_path=rollouts/final
Now we can run the imitation script (in this case DAgger) and pass the path to the demonstrations we just generated:

.. code-block:: bash
python -m imitation.scripts.train_imitation dagger with \
seals_cartpole \
dagger.total_timesteps=2000 \
demonstrations.source=local \
demonstrations.path=output/ppo/seals_cartpole/loaded/rollouts/final
Visualise saved policies
^^^^^^^^^^^^^^^^^^^^^^^^
We can use the ``eval_policy`` script to visualise and render a saved policy.
Here we load a policy saved by a previous ``train_rl`` run on ``Pendulum-v1`` (adjust the path to match your own run).

.. code-block:: bash
python -m imitation.scripts.eval_policy with \
expert.policy_type=ppo \
expert.loader_kwargs.path=output/train_rl/Pendulum-v1/my_run/policies/final/model.zip \
environment.num_vec=1 \
render=True \
environment.gym_id='Pendulum-v1'
Comparing algorithms' performance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Let's use the CLI to compare the performance of two trained policies; the same workflow applies when comparing different algorithms.

First, let's train an expert on the ``CartPole-v1`` environment.

.. code-block:: bash
python -m imitation.scripts.train_rl with \
cartpole \
logging.log_dir=output/train_rl/CartPole-v1/expert \
total_timesteps=10000
Now let's train a weaker agent.

.. code-block:: bash
python -m imitation.scripts.train_rl with \
cartpole \
logging.log_dir=output/train_rl/CartPole-v1/non_expert \
total_timesteps=1000 # simply training less
We can evaluate each policy using the ``eval_policy`` script.
For the expert:

.. code-block:: bash
python -m imitation.scripts.eval_policy with \
expert.policy_type=ppo \
expert.loader_kwargs.path=output/train_rl/CartPole-v1/expert/policies/final/model.zip \
environment.gym_id='CartPole-v1' \
environment.num_vec=1 \
logging.log_dir=output/eval_policy/CartPole-v1/expert
which will return something like:

.. code-block:: bash
INFO - eval_policy - Result: {
'n_traj': 74,
'monitor_return_len': 74,
'return_min': 26.0,
'return_mean': 154.21621621621622,
'return_std': 79.94377589657559,
'return_max': 500.0,
'len_min': 26,
'len_mean': 154.21621621621622,
'len_std': 79.94377589657559,
'len_max': 500,
'monitor_return_min': 26.0,
'monitor_return_mean': 154.21621621621622,
'monitor_return_std': 79.94377589657559,
'monitor_return_max': 500.0
}
INFO - eval_policy - Completed after 0:00:12
For the non-expert:

.. code-block:: bash
python -m imitation.scripts.eval_policy with \
expert.policy_type=ppo \
expert.loader_kwargs.path=output/train_rl/CartPole-v1/non_expert/policies/final/model.zip \
environment.gym_id='CartPole-v1' \
environment.num_vec=1 \
logging.log_dir=output/eval_policy/CartPole-v1/non_expert
.. code-block:: bash
INFO - eval_policy - Result: {
'n_traj': 355,
'monitor_return_len': 355,
'return_min': 8.0,
'return_mean': 28.92676056338028,
'return_std': 15.686012049373561,
'return_max': 104.0,
'len_min': 8,
'len_mean': 28.92676056338028,
'len_std': 15.686012049373561,
'len_max': 104,
'monitor_return_min': 8.0,
'monitor_return_mean': 28.92676056338028,
'monitor_return_std': 15.686012049373561,
'monitor_return_max': 104.0
}
INFO - eval_policy - Completed after 0:00:17
This will save the monitor CSVs (one for each vectorised environment, controlled by ``environment.num_vec``).
The monitor CSVs follow the naming convention ``mon*.monitor.csv``.
We can load these CSV files with ``pandas`` and use the ``imitation.testing.reward_improvement``
module to compare the performance of the two policies.

.. TODO: replace the python block below once a CLI tool for handling significance testing becomes available.
.. code-block:: python
from pathlib import Path
import pandas as pd
from imitation.testing.reward_improvement import is_significant_reward_improvement
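# skiprows=1 skips the JSON metadata line that Monitor prepends to each CSV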
expert_monitor = pd.concat(
[
pd.read_csv(f, skiprows=1)
for f in Path("./output/train_rl/CartPole-v1/expert/monitor").glob(
"mon*.monitor.csv"
)
]
)
non_expert_monitor = pd.concat(
[
pd.read_csv(f, skiprows=1)
for f in Path("./output/train_rl/CartPole-v1/non_expert/monitor").glob(
"mon*.monitor.csv"
)
]
)
if is_significant_reward_improvement(non_expert_monitor["r"], expert_monitor["r"], 0.05):
print("The expert improved over the non-expert with >95% probability")
else:
print("No significant (p=0.05) reward improvement of expert over non-expert")
.. code-block:: bash
The expert improved over the non-expert with >95% probability
Algorithm Scripts
=================

