From ee6a18302afea0304ba54afd5b9f1d5e198a558c Mon Sep 17 00:00:00 2001
From: "Ilya V. Schurov"
Date: Thu, 1 Feb 2018 02:58:19 +0300
Subject: [PATCH 1/3] cp-in-an-observational-settings update

---
 .../cp-in-an-observational-settings.ipynb | 63 ++++++++++++++-----
 1 file changed, 49 insertions(+), 14 deletions(-)

diff --git a/src/comments/cp-in-an-observational-settings.ipynb b/src/comments/cp-in-an-observational-settings.ipynb
index c458749..8dacf4d 100644
--- a/src/comments/cp-in-an-observational-settings.ipynb
+++ b/src/comments/cp-in-an-observational-settings.ipynb
@@ -4,44 +4,79 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# CP in an Observational Setting\n",
+    "# CP in an Observational Setting, ver. 1.1\n",
     "\n",
-    "This is a formalization of a diagram posted by William Fedus.\n",
+    "This is a formalization of a diagram posted by William Fedus, updated after discussions.\n",
     "\n",
+    "## Notation\n",
+    "1. Denote the set $\{1, \ldots, m\}$ by $\mathbb N_m$ for any natural $m$.\n",
+    "2. For a time-dependent vector $y_t \in \mathbb R^m$, denote by $y_t[k]$ its $k$-th component. For $K=(k_1,\ldots,k_n) \in \mathbb N_m^n$ (a vector of indexes), we denote the vector of the corresponding values $(y_t[k_1],\ldots, y_t[k_n]) \in \mathbb R^n$ by $y_t[K]$.\n",
+    "\n",
+    "## Architecture\n",
+    "### Variables\n",
     "**Input**: a sequence of multidimensional vectors $x_t$, each $x_t \in \mathbb R^N$, $t\in \mathbb N$ (or $\mathbb Z$), where $N$ may be quite large (e.g. pixels of a video frame).\n",
     "\n",
-    "**Representation state** at the moment $t$ is $h_t \in \mathbb R^r$, $r$ is a dimension of the representation space (may be quite large). We denote $k$-th component of this vector as $h_t[k]$. Let us denote a set $\{1, \ldots, r\}$ by $\mathbb N_r$. If $K=(k_1,\ldots, k_s) \in \mathbb N_r^s$ is a vector of indexes, we denote the vector of the corresponding values $(h_t[k_1],\ldots, h_t[k_n]) \in \mathbb R^s$ by $h_t[K]$.\n",
+    "**Encoded state** at the moment $t$ is $e_t \in \mathbb R^u$, where $u$ can be quite large.\n",
+    "\n",
+    "**Representation state** at the moment $t$ is $h_t \in \mathbb R^r$, where $r$ is the dimension of the representation space (it may be quite large).\n",
     "\n",
-    "**Conscious state** at the moment $t$ is $c_t=(B_t, b_t, A_t)$[^1], where $B_t\in \mathbb N_r^s$ is a vector of indexes of $h_t$ we are interested in at a particular moment, $b_t\in \mathbb R^s$ is the corresponding values, i.e. $b_t=h_t[B_t]$ and $A_t=(A_{1, t}, \ldots, A_{p, t})\in \mathbb N_r^p$ is a vector of indexes we are going to predict.\n",
+    "**Conscious state** at the moment $t$ is $c_t=(B_t, b_t, A_t)$[^1], where $B_t\in \mathbb N_r^s$ is a vector of indexes of $h_t$ we are interested in at a particular moment, $b_t\in \mathbb R^s$ is the vector of corresponding values, i.e. $b_t=h_t[B_t]$, and $A_t=(A_{1, t}, \ldots, A_{p, t})\in \mathbb N_u^p$ is a vector of indexes of $e_{t+1}$ we are going to predict.\n",
     "\n",
     "[^1]: This notation is a bit different from William's but consistent with Bengio's paper: $c_t$ is the full conscious state, including $A_t$. In William's diagram $c_t=(B_t, b_t)$.\n",
     "\n",
-    "**Representation network** $R$ is a function (presented by RNN):\n",
+    "### Networks\n",
+    "**Encoder network** $E$ is a convolutional encoder that has access to a fixed number of previous frames:\n",
+    "\n",
+    "$$e_t = E(x_t, x_{t-1}, \ldots, x_{t-d}).$$\n",
+    "\n",
+    "Consider, for example, a convolutional embedding of an image that has access only to the current frame ($d=0$).\n",
+    "\n",
+    "**Recurrent representation network** $R$ is a function (represented by an RNN):\n",
     "$$\n",
-    "h_t=R(x_t, h_{t-1}).\n",
+    "h_t=R(e_t, h_{t-1}).\n",
     "$$\n",
-    "(In Bengio's paper the same thing is called $F$.)\n",
     "\n",
-    "**Consiousness network** is a function (presented by another RNN):\n",
+    "The composition of the encoder network and the recurrent representation network is collectively called the **representation network**. See [this comment](https://theconsciousnessprior.slack.com/archives/C87HW5Q9X/p1517188030000148) for details on why we need this split into encoder and recurrent parts.\n",
+    "\n",
+    "**Conscious network** is a function (represented by another RNN):\n",
     "$$\n",
     "c_{t} = C(h_t, c_{t-1}, z_t),\n",
     "$$\n",
-    "where $z_t$ is a random noise source. Actually, $C$ have to define only $B_t$ and $A_t$ parts of $c_t$ as $b_t$ is defined by the relation $b_t=h_t[B_t]$. So the domain of $C$ is $\mathbb N_r^s\times\mathbb N_r^p$.\n",
+    "where $z_t$ is a random noise source. Actually, $C$ has to define only the $B_t$ and $A_t$ parts of $c_t$, as $b_t$ is determined by the relation $b_t=h_t[B_t]$. So the codomain of $C$ is $\mathbb N_r^s\times\mathbb N_u^p$.\n",
     "\n",
     "**Generator network** is a function\n",
     "$$\n",
     " \widehat{a}_t = G(c_t).\n",
     "$$\n",
-    "The objective of the generator is to predict the value of the representation vector $h_{t+1}$ in the next moment at indexes $A_t$. So we have a **MSE loss function**:\n",
+    "The objective of the generator is to predict the values of (some part of) the encoded state vector $e_{t+1}$ at the next moment, at the indexes $A_t$.\n",
+    "\n",
+    "### Loss\n",
+    "The main objective of the conscious network is to select features that can be used to predict (parts of) the future encoded state. However, this objective can be satisfied trivially: the representation network can ignore $x_t$ and produce constant values (i.e. $h_t=0$ for all $t$), which would be perfectly predictable from their own previous values. To avoid this failure mode, we want to maximize the mutual information between $b_t = h_{t}[B_t]$ and $e_{t+1}[A_t]$. (To consider mutual information, we need random variables; to make $b_t$ and $e_{t+1}[A_t]$ random variables we just need to pick a random $t$; this defines the probability space we are working on.)\n",
+    "\n",
+    "Estimation of mutual information is a non-trivial problem, discussed in the literature (see e.g. [https://arxiv.org/abs/1801.04062](here)).\n",
+    "\n",
+    "However, to begin with, we can consider a simplified version of the mutual information objective that uses variance instead of entropy. This objective is a variant of $R^2$ maximization.\n",
+    "\n",
+    "For every index $i \in \mathbb N_u$, denote by $T_i$ the set of all time moments $t$ such that $i \in A_{t-1}$ (i.e. at step $t-1$ the conscious network selected $i$ as one of the elements we are going to predict).\n",
+    "\n",
+    "Now let us define $RSS_i$ (the residual sum of squares), which measures our error in predicting the encoded state at index $i$ from the previous step:\n",
+    "\n",
     "$$\n",
-    "\mathcal L=\sum_{i=1}^{p}(\widehat{a}_{i, t}-h_{t+1}[A_{i, t}])^2\n",
+    "RSS_i=\sum_{t \in T_i}(\widehat{a}_{t-1}[i]-e_{t}[i])^2\n",
     "$$\n",
     "\n",
-    "As the objective of the consciouss is to be able to predict the part of the future representation vector, we also introduce **mutual information loss function**: we want to maximize the following mutual information:[^2]\n",
+    "We also define $TSS_i$ (the total sum of squares) in the following way:\n",
+    "\n",
+    "$$\overline{e}[i]=\frac{1}{|T_i|}\sum_{t\in T_i} e_t[i]$$\n",
+    "\n",
+    "$$TSS_i=\sum_{t \in T_i} (e_t[i]-\overline{e}[i])^2$$\n",
+    "\n",
+    "Now our objective is\n",
     "\n",
-    "$$I(b_t, h_{t+1}[A_t]) = H(h_{t+1}[A_t]) - H(h_{t+1}[A_t]\mid b_t)$$\n",
+    "$$\tag{1}\n",
+    "\sum_{i=1}^u \log RSS_i - \sum_{i=1}^u \log TSS_i \to \min$$\n",
     "\n",
-    "[^2]: In William's diagram, we have $[A_{i, t}]$ instead of $[A_t]$, but I believe we are interested in a full vector of $A_t$ here and not only $i$'s component of it."
+    "**Remark.** The proposed objective is scale invariant. Moreover, it is closely related to the real mutual information. Indeed, for any random variables, $I(X, Y) = H(Y) - H(Y|X)$, and $H(\lambda X) = H(X) + \log |\lambda|$. It means that, e.g., for a Gaussian $X$, the entropy equals $\log SD(X) = \frac{1}{2}\log Var(X)$ up to an additive constant. The $TSS$ term in (1) is therefore related to the entropy of $e_t$, and the $RSS$ term is related to the entropy of $e_t$ conditioned on the conscious state under which the index was selected, so minimizing (1) approximately maximizes the mutual information."
    ]
   }
 ],
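
To make the architecture introduced in PATCH 1/3 easier to follow, here is a minimal sketch of one time step in PyTorch. It is only an illustration of the notes, not part of the patch: the sizes `N`, `u`, `r`, `s`, `p`, the use of `GRUCell`s and a fully connected stand-in for the convolutional encoder, the hard top-k index selection, the way the noise $z_t$ enters, and feeding the generator only $b_t$ and $A_t$ are all assumptions made for the example.

```python
import torch
import torch.nn as nn

N, u, r, s, p = 4096, 256, 128, 16, 16            # assumed sizes, not from the notes

encoder = nn.Sequential(nn.Linear(N, 512), nn.ReLU(), nn.Linear(512, u))  # E (stand-in for the conv encoder, d = 0)
rep_rnn = nn.GRUCell(input_size=u, hidden_size=r)                         # R: h_t = R(e_t, h_{t-1})
conscious = nn.GRUCell(input_size=r, hidden_size=r + u)                   # C: one score per candidate index
generator = nn.Linear(s + p, p)                                           # G: prediction of e_{t+1}[A_t]

def step(x_t, h_prev, c_prev):
    """One time step; returns the new states, the selected indexes, and the prediction a_hat."""
    e_t = encoder(x_t)                             # encoded state e_t
    h_t = rep_rnn(e_t, h_prev)                     # representation state h_t
    z_t = torch.randn_like(c_prev)                 # noise source z_t (here simply added to the carried state)
    scores = conscious(h_t, c_prev + z_t)          # scores for the r indexes of h_t and the u indexes of e_{t+1}
    B_t = scores[:, :r].topk(s, dim=1).indices     # indexes of h_t we attend to
    A_t = scores[:, r:].topk(p, dim=1).indices     # indexes of e_{t+1} we will predict
    b_t = h_t.gather(1, B_t)                       # b_t = h_t[B_t]
    a_hat = generator(torch.cat([b_t, A_t.float()], dim=1))
    return e_t, h_t, scores, A_t, a_hat

x_t = torch.randn(8, N)                            # a batch of 8 flattened "frames"
h0, c0 = torch.zeros(8, r), torch.zeros(8, r + u)
e_t, h_t, c_t, A_t, a_hat = step(x_t, h0, c0)
print(a_hat.shape)                                 # torch.Size([8, 16]): predictions for e_{t+1}[A_t]
```

Note that hard top-k selection is not differentiable; a soft attention over indexes (or a score-function estimator) would be needed if gradients should flow through the choice of $B_t$ and $A_t$.
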
From e713255072c9d7e8a829b553126f37f4f26d745a Mon Sep 17 00:00:00 2001
From: "Ilya V. Schurov"
Date: Thu, 1 Feb 2018 03:05:02 +0300
Subject: [PATCH 2/3] fix link to a paper

---
 src/comments/cp-in-an-observational-settings.ipynb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/comments/cp-in-an-observational-settings.ipynb b/src/comments/cp-in-an-observational-settings.ipynb
index 8dacf4d..188ac1f 100644
--- a/src/comments/cp-in-an-observational-settings.ipynb
+++ b/src/comments/cp-in-an-observational-settings.ipynb
@@ -53,7 +53,7 @@ "### Loss\n",
     "The main objective of the conscious network is to select features that can be used to predict (parts of) the future encoded state. However, this objective can be satisfied trivially: the representation network can ignore $x_t$ and produce constant values (i.e. $h_t=0$ for all $t$), which would be perfectly predictable from their own previous values. To avoid this failure mode, we want to maximize the mutual information between $b_t = h_{t}[B_t]$ and $e_{t+1}[A_t]$. (To consider mutual information, we need random variables; to make $b_t$ and $e_{t+1}[A_t]$ random variables we just need to pick a random $t$; this defines the probability space we are working on.)\n",
     "\n",
-    "Estimation of mutual information is a non-trivial problem, discussed in the literature (see e.g. [https://arxiv.org/abs/1801.04062](here)).\n",
+    "Estimation of mutual information is a non-trivial problem, discussed in the literature (see e.g. [here](https://arxiv.org/abs/1801.04062)).\n",
     "\n",
     "However, to begin with, we can consider a simplified version of the mutual information objective that uses variance instead of entropy. This objective is a variant of $R^2$ maximization.\n",
     "\n",
     "For every index $i \in \mathbb N_u$, denote by $T_i$ the set of all time moments $t$ such that $i \in A_{t-1}$ (i.e. at step $t-1$ the conscious network selected $i$ as one of the elements we are going to predict).\n",
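
For reference, objective (1) from the notes can be written down directly. The sketch below is not part of the patches; the way the per-step data is stored (`selected`, `preds`) and the handling of indexes selected fewer than two times are assumptions made purely for illustration.

```python
import numpy as np

def objective(e, selected, preds, eps=1e-8):
    """e: array of shape (T, u) with the encoded states e_t.
    selected[t]: set of indexes i chosen at step t-1 (so t is in T_i exactly when i is in selected[t]).
    preds[t][i]: the generator's prediction of e[t, i], made at step t-1."""
    total = 0.0
    T, u = e.shape
    for i in range(u):
        T_i = [t for t in range(T) if i in selected[t]]
        if len(T_i) < 2:
            continue                                            # need at least two samples for a variance
        targets = e[T_i, i]
        rss = sum((preds[t][i] - e[t, i]) ** 2 for t in T_i)    # residual sum of squares RSS_i
        tss = float(((targets - targets.mean()) ** 2).sum())    # total sum of squares TSS_i
        total += np.log(rss + eps) - np.log(tss + eps)
    return total                                                # objective (1): to be minimized
```
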
Schurov" Date: Thu, 1 Feb 2018 12:20:29 +0300 Subject: [PATCH 3/3] changelog added --- src/comments/cp-in-an-observational-settings.ipynb | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/src/comments/cp-in-an-observational-settings.ipynb b/src/comments/cp-in-an-observational-settings.ipynb index 188ac1f..a6e7026 100644 --- a/src/comments/cp-in-an-observational-settings.ipynb +++ b/src/comments/cp-in-an-observational-settings.ipynb @@ -6,7 +6,13 @@ "source": [ "# CP in an Observational Setting, ver. 1.1\n", "\n", - "This is a formalization of a diagram posted by William Fedus, updated after discussions.\n", + "## Changelog\n", + "### Ver 1.1\n", + "1. Representation network is splitted into convolution encoder and recurrent part, to address issue when recurrent part copies elements of previous state to new state and conscious choose them to predict. See [this comment](https://theconsciousnessprior.slack.com/archives/C87HW5Q9X/p1517188030000148) for details.\n", + "2. $R^2$-like approximation to mutual information loss is proposed.\n", + "\n", + "### Ver 1.0\n", + "Initial version, formalization of a diagram posted by William Fedus.\n", "\n", "## Notation\n", "1. Denote a set $\\{1, \\ldots, m\\}$ by $\\mathbb N_m$ for any natural $m$.\n",