
Fixes to Batch RL tutorial (#378)
Gal Leibovich authored Jul 16, 2019
1 parent 0a4cc7e commit 718597c
Showing 2 changed files with 9,846 additions and 5 deletions.
16 changes: 11 additions & 5 deletions tutorials/4. Batch Reinforcement Learning.ipynb
@@ -227,15 +227,15 @@
"metadata": {},
"source": [
"### Off-Policy Evaluation\n",
"As we mentioned earlier, one of the hardest problems in Batch RL is that we do not have a simulator or cannot easily deploy a trained policy on the real-world environment, in order to test its goodness. This is where OPE comes in handy. </br>\n",
"As we mentioned earlier, one of the hardest problems in Batch RL is that we do not have a simulator or cannot easily deploy a trained policy on the real-world environment, in order to test its goodness. This is where OPE comes in handy. \n",
"\n",
"Coach supports several off-policy evaluators, some are useful for bandits problems (only evaluating a single step return), and others are for full-blown Reinforcement Learning problems. The main goal of the OPEs is to help us select the best model, either for collecting more data to do another round of Batch RL on, or for actual deployment in the real-world environment. \n",
"\n",
"Opening the experiment that we have just ran (under the `tutorials/Resources` folder, with Coach Dashboard), you will be able to plot the actual simulator's `Evaluation Reward`. Usually, we won't have this signal available as we won't have a simulator, but since we're using a dummy environment for demonstration purposes, we can take a look and examine how the OPEs correlate with it. \n",
"\n",
"Here are two example plots from Dashboard showing how well the `Weighted Importance Sampling` (RL estimator) and the `Doubly Robust` (bandits estimator) each correlate with the `Evaluation Reward`. </br>\n",
"Here are two example plots from Dashboard showing how well the `Weighted Importance Sampling` (RL estimator) and the `Doubly Robust` (bandits estimator) each correlate with the `Evaluation Reward`. \n",
"![Weighted Importance Sampling](Resources/img/wis.png \"Weighted Importance Sampling vs. Evaluation Reward\") \n",
"</br>\n",
"\n",
"![Doubly Robust](Resources/img/dr.png \"Doubly Robust vs. Evaluation Reward\") \n",
"\n"
]
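To make the `Weighted Importance Sampling` estimator mentioned in the cell above a bit more concrete, here is a minimal sketch of the trajectory-level version. This is an illustration under simplified assumptions, not Coach's actual implementation; the `episodes` data layout and the function name are made up for the example.

```python
# Illustrative sketch of trajectory-level Weighted Importance Sampling (WIS).
# Each episode is a list of (behavior_prob, evaluated_prob, reward) tuples for
# the action that was actually taken in the dataset.
import numpy as np

def weighted_importance_sampling(episodes):
    """Estimate the evaluated policy's mean episode return from off-policy data."""
    weights, returns = [], []
    for episode in episodes:
        rho, episode_return = 1.0, 0.0
        for behavior_prob, evaluated_prob, reward in episode:
            rho *= evaluated_prob / behavior_prob  # cumulative importance ratio
            episode_return += reward               # undiscounted return, for simplicity
        weights.append(rho)
        returns.append(episode_return)
    weights, returns = np.asarray(weights), np.asarray(returns)
    # Normalizing by the sum of the importance weights (rather than by the number
    # of episodes) is what makes this the *weighted* estimator and keeps its
    # variance manageable compared to ordinary importance sampling.
    return float(np.sum(weights * returns) / np.sum(weights))
```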
@@ -256,7 +256,13 @@
"### The CSV\n",
"Coach defines a csv data format that can be used to fill its replay buffer. We have created an example csv from the same `Acrobot-v1` environment, and have placed it under the [Tutorials' Resources folder](https://github.com/NervanaSystems/coach/tree/master/tutorials/Resources).\n",
"\n",
"Here are the first couple of lines from it so you can get a grip of what to expect - \n",
"Here are the first couple of lines from it so you can get a grip of what to expect - "
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
"\n",
"| action | all_action_probabilities | episode_id | episode_name | reward | transition_number | state_feature_0 | state_feature_1 | state_feature_2 | state_feature_3 | state_feature_4 | state_feature_5 \n",
"|---|---|---|---|---|---|---|---|---|---|---|---------------------------------------------------------------------------|\n",
@@ -340,7 +346,7 @@
"Now that we have ran this preset, we have 100 agents (one is saved after every training epoch), and we would have to decide which one we choose for deployment (either for running another round of experience collection and training, or for final deployment, meaning going into production). \n",
"\n",
"Opening the experiment csv in Dashboard and displaying the OPE signals, we can now choose a checkpoint file for deployment on the end-node. Here is an example run, where we show the `Weighted Importance Sampling` and `Sequential Doubly Robust` OPEs. \n",
"</br>\n",
"\n",
"![Model Selection](Resources/img/model_selection.png \"Model Selection using OPE\") \n",
"\n",
"Based on this plot we would probably have chosen a checkpoint from around Epoch 85. From here, if we are not satisfied with the deployed agent's performance, we can iteratively continue with data collection, policy training (maybe based on a combination of all the data collected so far), and deployment. \n"
