
Fixes to Batch RL tutorial (#378)
Gal Leibovich authored Jul 16, 2019
1 parent 0a4cc7e commit 718597c
Showing 2 changed files with 9,846 additions and 5 deletions.
16 changes: 11 additions & 5 deletions tutorials/4. Batch Reinforcement Learning.ipynb
@@ -227,15 +227,15 @@
"metadata": {},
"source": [
"### Off-Policy Evaluation\n",
"As we mentioned earlier, one of the hardest problems in Batch RL is that we do not have a simulator or cannot easily deploy a trained policy on the real-world environment, in order to test its goodness. This is where OPE comes in handy. </br>\n",
"As we mentioned earlier, one of the hardest problems in Batch RL is that we do not have a simulator or cannot easily deploy a trained policy on the real-world environment, in order to test its goodness. This is where OPE comes in handy. \n",
"\n",
"Coach supports several off-policy evaluators, some are useful for bandits problems (only evaluating a single step return), and others are for full-blown Reinforcement Learning problems. The main goal of the OPEs is to help us select the best model, either for collecting more data to do another round of Batch RL on, or for actual deployment in the real-world environment. \n",
"\n",
"Opening the experiment that we have just ran (under the `tutorials/Resources` folder, with Coach Dashboard), you will be able to plot the actual simulator's `Evaluation Reward`. Usually, we won't have this signal available as we won't have a simulator, but since we're using a dummy environment for demonstration purposes, we can take a look and examine how the OPEs correlate with it. \n",
"\n",
"Here are two example plots from Dashboard showing how well the `Weighted Importance Sampling` (RL estimator) and the `Doubly Robust` (bandits estimator) each correlate with the `Evaluation Reward`. </br>\n",
"Here are two example plots from Dashboard showing how well the `Weighted Importance Sampling` (RL estimator) and the `Doubly Robust` (bandits estimator) each correlate with the `Evaluation Reward`. \n",
"![Weighted Importance Sampling](Resources/img/wis.png \"Weighted Importance Sampling vs. Evaluation Reward\") \n",
"</br>\n",
"\n",
"![Doubly Robust](Resources/img/dr.png \"Doubly Robust vs. Evaluation Reward\") \n",
"\n"
]
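To make the `Weighted Importance Sampling` estimator mentioned in the cell above a bit more concrete, here is a minimal sketch of the trajectory-level version. This is an illustration under simplified assumptions, not Coach's actual implementation; the `episodes` data layout and the function name are made up for the example.

```python
# Illustrative sketch of trajectory-level Weighted Importance Sampling (WIS).
# Each episode is a list of (behavior_prob, evaluated_prob, reward) tuples for
# the action that was actually taken in the dataset.
import numpy as np

def weighted_importance_sampling(episodes):
    """Estimate the evaluated policy's mean episode return from off-policy data."""
    weights, returns = [], []
    for episode in episodes:
        rho, episode_return = 1.0, 0.0
        for behavior_prob, evaluated_prob, reward in episode:
            rho *= evaluated_prob / behavior_prob  # cumulative importance ratio
            episode_return += reward               # undiscounted return, for simplicity
        weights.append(rho)
        returns.append(episode_return)
    weights, returns = np.asarray(weights), np.asarray(returns)
    # Normalizing by the sum of the importance weights (rather than by the number
    # of episodes) is what makes this the *weighted* estimator and keeps its
    # variance manageable compared to ordinary importance sampling.
    return float(np.sum(weights * returns) / np.sum(weights))
```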
@@ -256,7 +256,13 @@
"### The CSV\n",
"Coach defines a csv data format that can be used to fill its replay buffer. We have created an example csv from the same `Acrobot-v1` environment, and have placed it under the [Tutorials' Resources folder](https://github.com/NervanaSystems/coach/tree/master/tutorials/Resources).\n",
"\n",
"Here are the first couple of lines from it so you can get a grip of what to expect - \n",
"Here are the first couple of lines from it so you can get a grip of what to expect - "
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
"\n",
"| action | all_action_probabilities | episode_id | episode_name | reward | transition_number | state_feature_0 | state_feature_1 | state_feature_2 | state_feature_3 | state_feature_4 | state_feature_5 \n",
"|---|---|---|---|---|---|---|---|---|---|---|---------------------------------------------------------------------------|\n",
@@ -340,7 +346,7 @@
"Now that we have ran this preset, we have 100 agents (one is saved after every training epoch), and we would have to decide which one we choose for deployment (either for running another round of experience collection and training, or for final deployment, meaning going into production). \n",
"\n",
"Opening the experiment csv in Dashboard and displaying the OPE signals, we can now choose a checkpoint file for deployment on the end-node. Here is an example run, where we show the `Weighted Importance Sampling` and `Sequential Doubly Robust` OPEs. \n",
"</br>\n",
"\n",
"![Model Selection](Resources/img/model_selection.png \"Model Selection using OPE\") \n",
"\n",
"Based on this plot we would probably have chosen a checkpoint from around Epoch 85. From here, if we are not satisfied with the deployed agent's performance, we can iteratively continue with data collection, policy training (maybe based on a combination of all the data collected so far), and deployment. \n"
