Training code for Goal Misgeneralization in Deep Reinforcement Learning

This code is based on a fork of this repository by Hojoon Lee. It includes scripts for training RL agents on modified procgen environments and producing the figures for the paper Goal Misgeneralization in Deep Reinforcement Learning.

Requirements

  • python>=3.6
  • torch 1.3
  • procgen (you will need to install our custom procgen linked above)
  • pyyaml
  • pandas
  • tensorboard==2.5

Reproducing experiments

Coinrun

Train:

python train.py --exp_name coinrun --env_name coinrun --num_levels 100000 --distribution_mode hard --param_name hard-500 --num_timesteps 200000000 --num_checkpoints 5 --seed 6033 --random_percent 0

To reproduce the ablation experiments, change the value of --random_percent in the training command.
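A small helper can generate one training command per ablation setting. This is a minimal sketch and not a script shipped with the repository. The percentages in `ABLATION_PERCENTS` are placeholders, because the values used in the paper are not listed in this README. The `coinrun_rand<N>` experiment names are a hypothetical naming scheme that keeps runs easy to tell apart. All other flags are copied from the training command above.

```python
import shlex

# CoinRun training command from above, minus --random_percent.
BASE_ARGS = [
    "python", "train.py",
    "--exp_name", "coinrun",
    "--env_name", "coinrun",
    "--num_levels", "100000",
    "--distribution_mode", "hard",
    "--param_name", "hard-500",
    "--num_timesteps", "200000000",
    "--num_checkpoints", "5",
    "--seed", "6033",
]

# Placeholder values: the percentages used in the paper's ablation
# are not listed in this README.
ABLATION_PERCENTS = [0, 1, 2, 5]


def ablation_command(random_percent):
    """Return the train.py argv for one ablation run."""
    args = list(BASE_ARGS)  # copy so BASE_ARGS is never mutated
    # Hypothetical naming scheme: one distinct experiment name per setting.
    args[args.index("--exp_name") + 1] = f"coinrun_rand{random_percent}"
    return args + ["--random_percent", str(random_percent)]


if __name__ == "__main__":
    for percent in ABLATION_PERCENTS:
        print(" ".join(shlex.quote(a) for a in ablation_command(percent)))
```

Running the script prints one command per line. Each line can be launched directly or handed to a job scheduler.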

Test:

python render.py --exp_name coinrun_test --env_name coinrun_aisc --distribution_mode hard --param_name hard-500 --model_file PATH_TO_MODEL_FILE

where PATH_TO_MODEL_FILE is the path to the model file generated by the above training command.

Maze (Variant 1)

python train.py --exp_name maze1 --env_name maze_aisc --num_levels 100000 --distribution_mode hard --param_name hard-500 --num_timesteps 200000000 --num_checkpoints 5 --seed 1080
python render.py --exp_name maze1_test --env_name maze --distribution_mode hard --param_name hard-500  --model_file PATH_TO_MODEL_FILE

Maze (Variant 2)

python train.py --exp_name maze2 --env_name maze_yellowgem --num_levels 100000 --distribution_mode hard --param_name hard-500 --num_timesteps 200000000 --num_checkpoints 5 --seed 2809
python render.py --exp_name maze2_test --env_name maze_redgem_yellowstar --distribution_mode hard --param_name hard-500  --model_file PATH_TO_MODEL_FILE

Keys and Chests

python train.py --exp_name keys_chests --env_name heist_aisc_many_chests --num_levels 100000 --distribution_mode hard --param_name hard-500 --num_timesteps 200000000 --num_checkpoints 5 --seed 1111
python render.py --exp_name keys_chests_test --env_name heist_aisc_many_keys --distribution_mode hard --param_name hard-500 --model_file PATH_TO_MODEL_FILE



The original Readme (not our work) is reproduced below.


Training Procgen environment with Pytorch

🆕✅🎉 Updated code (10th September 2020): bug fixes and support for recurrent policies.

Introduction

This repository contains code for training a baseline PPO agent on Procgen, implemented in PyTorch.

This implementation aims to accelerate research on Procgen environments and to reproduce the results of the Procgen paper. The code is designed for both readability and productivity. I tried to match the code as closely as possible to OpenAI Baselines while following ikostrikov's coding style.

There are several key points to watch out for with Procgen that differ from general RL implementations:

  • Use Xavier uniform initialization for conv layers rather than orthogonal initialization.
  • Do not use observation normalization.
  • Use gradient accumulation to handle the large mini-batch size.
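For reference, Xavier (Glorot) uniform initialization samples each weight from U(-a, a), where a = gain * sqrt(6 / (fan_in + fan_out)). For a conv layer, fan_in and fan_out are the input and output channel counts multiplied by the kernel area. The pure-Python sketch below only illustrates that formula. In PyTorch the built-in equivalent is `torch.nn.init.xavier_uniform_(weight, gain=...)`. The gain this repository uses is not stated here, so `gain=1.0` is an assumption, and the 3-to-16-channel layer in the example is hypothetical.

```python
import math
import random


def conv_fans(in_channels, out_channels, kernel_h, kernel_w):
    """Fan-in and fan-out for a 2D conv weight of shape (out, in, kh, kw)."""
    receptive_field = kernel_h * kernel_w
    return in_channels * receptive_field, out_channels * receptive_field


def xavier_uniform_bound(fan_in, fan_out, gain=1.0):
    """Half-width a of the U(-a, a) distribution used by Xavier uniform init."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))


def xavier_uniform_conv(in_channels, out_channels, kernel_size, gain=1.0, rng=None):
    """Sample a flat list of conv weights with Xavier uniform initialization."""
    if rng is None:
        rng = random.Random()
    fan_in, fan_out = conv_fans(in_channels, out_channels, kernel_size, kernel_size)
    bound = xavier_uniform_bound(fan_in, fan_out, gain)
    count = out_channels * in_channels * kernel_size * kernel_size
    return [rng.uniform(-bound, bound) for _ in range(count)], bound


if __name__ == "__main__":
    # Hypothetical example: a 3x3 conv from 3 input channels (RGB frames) to 16.
    weights, bound = xavier_uniform_conv(3, 16, 3, rng=random.Random(0))
    print(len(weights), round(bound, 4))
```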

Training logs for starpilot can be found in logs/procgen/starpilot.

Requirements

  • python>=3.6
  • torch 1.3
  • procgen
  • pyyaml

Train

Use train.py to train the agent in a Procgen environment. It takes the following arguments:

  • --exp_name: ID that designates your experiment.
  • --env_name: Name of the Procgen environment.
  • --start_level: Start level for the environment.
  • --num_levels: Number of training levels for the environment.
  • --distribution_mode: Distribution mode of your environment (e.g., easy or hard).
  • --param_name: Name of the hyperparameter configuration for your training run. By default, the training loads hyperparameters from config.yml/procgen/param_name.
  • --num_timesteps: Total number of timesteps to train your agent.

After you start training your agent, logs and parameters are automatically stored in logs/procgen/env-name/exp-name/.
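The documented arguments map naturally onto `argparse`. The sketch below is hypothetical and is not the actual `train.py`. It covers only the flags listed above, although the commands in this README also pass others such as --seed and --num_checkpoints. Its default values are placeholders. It also shows how the log directory described above is derived from --env_name and --exp_name.

```python
import argparse
import os


def build_parser():
    """Hypothetical parser mirroring the documented train.py flags.

    Default values are placeholders, not the repository's actual defaults.
    Flags not listed in this README (e.g. --seed) would be rejected here.
    """
    parser = argparse.ArgumentParser(description="Train PPO on Procgen (sketch)")
    parser.add_argument("--exp_name", type=str, default="test",
                        help="ID that designates your experiment")
    parser.add_argument("--env_name", type=str, default="starpilot",
                        help="name of the Procgen environment")
    parser.add_argument("--start_level", type=int, default=0,
                        help="start level for the environment")
    parser.add_argument("--num_levels", type=int, default=0,
                        help="number of training levels")
    parser.add_argument("--distribution_mode", type=str, default="easy",
                        help="distribution mode, e.g. easy or hard")
    parser.add_argument("--param_name", type=str, default="easy",
                        help="name of the hyperparameter configuration")
    parser.add_argument("--num_timesteps", type=int, default=25_000_000,
                        help="total number of timesteps to train for")
    return parser


def log_dir(args):
    """Log directory layout described above: logs/procgen/<env_name>/<exp_name>/."""
    return os.path.join("logs", "procgen", args.env_name, args.exp_name)


if __name__ == "__main__":
    # Example: parse the flags of the "generalization on hard environments" run.
    args = build_parser().parse_args([
        "--exp_name", "hard-run-500", "--env_name", "coinrun",
        "--param_name", "hard-500", "--num_levels", "500",
        "--distribution_mode", "hard", "--num_timesteps", "200000000",
    ])
    print(log_dir(args))
```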

Try it out

Sample efficiency on easy environments

python train.py --exp_name easy-run-all --env_name ENV_NAME --param_name easy --num_levels 0 --distribution_mode easy --num_timesteps 25000000

Sample efficiency on hard environments

python train.py --exp_name hard-run-all --env_name ENV_NAME --param_name hard --num_levels 0 --distribution_mode hard --num_timesteps 200000000

Generalization on easy environments

python train.py --exp_name easy-run-200 --env_name ENV_NAME --param_name easy-200 --num_levels 200 --distribution_mode easy --num_timesteps 25000000

Generalization on hard environments

python train.py --exp_name hard-run-500 --env_name ENV_NAME --param_name hard-500 --num_levels 500 --distribution_mode hard --num_timesteps 200000000
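Each command above uses the placeholder ENV_NAME. To run a setting across all 16 Procgen games, the commands can be generated in a loop. This is a sketch, not a script shipped with the repository. The settings table is transcribed from the four commands above. In Procgen, `--num_levels 0` means an unlimited number of levels, which is why the two sample-efficiency settings use it.

```python
import shlex

# The 16 Procgen games.
PROCGEN_ENVS = [
    "bigfish", "bossfight", "caveflyer", "chaser", "climber", "coinrun",
    "dodgeball", "fruitbot", "heist", "jumper", "leaper", "maze",
    "miner", "ninja", "plunder", "starpilot",
]

# exp_name -> (param_name, num_levels, distribution_mode, num_timesteps),
# transcribed from the four commands above. num_levels=0 means unlimited levels.
SETTINGS = {
    "easy-run-all": ("easy", 0, "easy", 25_000_000),
    "hard-run-all": ("hard", 0, "hard", 200_000_000),
    "easy-run-200": ("easy-200", 200, "easy", 25_000_000),
    "hard-run-500": ("hard-500", 500, "hard", 200_000_000),
}


def command(exp_name, env_name):
    """Build the train.py argv for one setting and one game."""
    param_name, num_levels, mode, timesteps = SETTINGS[exp_name]
    return [
        "python", "train.py",
        "--exp_name", exp_name,
        "--env_name", env_name,
        "--param_name", param_name,
        "--num_levels", str(num_levels),
        "--distribution_mode", mode,
        "--num_timesteps", str(timesteps),
    ]


if __name__ == "__main__":
    for exp_name in SETTINGS:
        for env_name in PROCGEN_ENVS:
            print(" ".join(shlex.quote(a) for a in command(exp_name, env_name)))
```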

If your GPU has more than 5 GB of memory, increase the mini-batch size to speed up training.
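Gradient accumulation decouples the mini-batch size from GPU memory. A large mini-batch is split into chunks that fit in memory, the gradient of each chunk is computed and summed, and a single optimizer update is applied. The toy sketch below uses a two-parameter linear model with mean squared error, a hypothetical stand-in for the PPO loss, and the batch sizes in it are made up. It shows that weighting each chunk's mean gradient by its share of the mini-batch reproduces the full mini-batch gradient. In PyTorch the same effect comes from calling `loss.backward()` once per chunk, since gradients accumulate in `.grad` across calls, and then calling `optimizer.step()` once.

```python
import math


def grad_mse(w, b, xs, ys):
    """Mean-squared-error gradient for the toy model y_hat = w * x + b."""
    n = len(xs)
    residuals = [w * x + b - y for x, y in zip(xs, ys)]
    grad_w = sum(2 * r * x for r, x in zip(residuals, xs)) / n
    grad_b = sum(2 * r for r in residuals) / n
    return grad_w, grad_b


def accumulated_grad(w, b, xs, ys, chunk_size):
    """Same gradient, computed chunk by chunk so only chunk_size samples
    need to be processed at once."""
    n = len(xs)
    acc_w, acc_b = 0.0, 0.0
    for start in range(0, n, chunk_size):
        chunk_x = xs[start:start + chunk_size]
        chunk_y = ys[start:start + chunk_size]
        g_w, g_b = grad_mse(w, b, chunk_x, chunk_y)
        # Weight each chunk's mean gradient by its share of the mini-batch,
        # so a shorter final chunk is not over-counted.
        share = len(chunk_x) / n
        acc_w += share * g_w
        acc_b += share * g_b
    return acc_w, acc_b


def num_chunks(mini_batch_size, chunk_size):
    """Forward/backward passes needed per optimizer update."""
    return math.ceil(mini_batch_size / chunk_size)


if __name__ == "__main__":
    xs = [float(i) for i in range(10)]
    ys = [3.0 * x + 1.0 for x in xs]
    w, b, lr = 0.5, 0.0, 0.01
    full = grad_mse(w, b, xs, ys)
    accum = accumulated_grad(w, b, xs, ys, chunk_size=4)
    print(full, accum, num_chunks(len(xs), 4))
    # A single SGD update using the accumulated gradient.
    w, b = w - lr * accum[0], b - lr * accum[1]
```

A larger chunk size means fewer forward and backward passes per update. This is why more GPU memory speeds up training without changing the gradient, up to floating-point rounding.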

TODO

  • Implement Data Augmentation from RAD.
  • Create evaluation code to measure the test performance.

References

[1] PPO: Proximal Policy Optimization Algorithms
[2] GAE: High-Dimensional Continuous Control Using Generalized Advantage Estimation
[3] IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures
[4] Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO
[5] Leveraging Procedural Generation to Benchmark Reinforcement Learning