doc: Add notebook that explores motivation for creating this package. Minor fixes in other doc files
Showing 8 changed files with 364 additions and 4 deletions.
@@ -4,4 +4,5 @@ dist/
 .vscode/
 .ipynb_checkpoints/
 **/*.egg-info
-**/build
+**/build
+.DS_Store
@@ -0,0 +1,133 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Overview\n",
"## Motivation\n",
"\n",
"Usually, when faced with prediction problems involving ordered labels (e.g. low, medium, high) and tabular data, data scientists turn to regular multinomial classifiers from the gradient boosted tree family of models because of their ease of use, speed of fitting, and good performance. Parametric ordinal models have been around for a while, but they have not been popular because of their poor performance compared to the gradient boosted models, especially for larger datasets.\n",
"\n",
"Although classifiers can predict ordinal labels adequately, they require building as many classifiers as there are labels to predict. This approach, however, leads to slower training times and confusing feature interpretations. For example, a feature which is positively associated with the increasing order of the label set (i.e. as the feature's value grows, so do the probabilities of the higher ordered labels) will have a positive association with the highest ordered label, a negative one with the lowest ordered label, and a \"concave\" association with the middle ones."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div>\n",
" <table>\n",
" <tr>\n",
" <th>\n",
" <img src=\"_static/feature_low_label.svg\" width=\"250\"/>\n",
" </th>\n",
" <th>\n",
" <img src=\"_static/feature_high_label.svg\" width=\"250\"/>\n",
" </th>\n",
" <th>\n",
" <img src=\"_static/feature_medium_label.svg\" width=\"250\"/>\n",
" </th>\n",
" </tr>\n",
" </table>\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"## Creating an ordinal loss\n",
"\n",
"We build an ordinal model using the \"threshold\" approach, where a regressor learns a latent variable $y$, which is then contrasted with the real line split into regions by a set of thresholds $\\Theta$ to produce probabilities for each label. For a K-level problem, we use K-1 thresholds $\\{\\theta_1,...,\\theta_{K-1}\\}$ that produce K regions in the real line. Each of these regions is associated with one of the levels of the label, and when the latent variable $y$ lies within a region, the probability of the label being at that level is maximised.\n",
"<div>\n",
"<img src=\"_static/thresholds.svg\" width=\"500\"/>\n",
"</div>\n",
"<div>\n",
"<img src=\"_static/thresholds_max_proba.svg\" width=\"500\"/>\n",
"</div>\n",
"\n",
"\n",
"Because we learn a single latent variable, we can calculate the cumulative probability of the label $z$ being at most at a certain level $n$ out of the $K$ levels (in contrast to the regular classifier, which assumes all labels are independent). This probability is **higher** when the latent variable gets smaller or when the level we consider is larger. In other words, in a 5-level ordinal problem, given a latent variable value $y$, the cumulative probability that our label is at most at the third level is always going to be higher than that of being at most at the second level.\n",
"$$\n",
"P(z \\leq 3^{\\text{rd}};y,\\Theta) > P(z \\leq 2^{\\text{nd}};y,\\Theta)\n",
"$$\n",
"\n",
"Using the same setup, given that we are calculating the cumulative probability of our label being at most at the third level, a **lower** latent value will lead to a higher probability.\n",
"\n",
"$$\n",
" \\text{Given that } y_1 > y_2,\n",
"$$\n",
"$$\n",
" P(z \\leq 3^{\\text{rd}};y_1,\\Theta) < P(z \\leq 3^{\\text{rd}};y_2,\\Theta)\n",
"$$\n",
"\n",
"We can create a cumulative distribution function $F$ that calculates this probability and satisfies the aforementioned conditions, in addition to the following properties that make it a good candidate for a CDF:\n",
"* It is continuously differentiable, and so is its derivative\n",
"* Its range is between 0 and 1\n",
"* It is monotonically increasing\n",
"\n",
"The probability of the label being at a particular level is then just the difference between the cumulative probability of being at most at that level and that of being at most at the level below\n",
"$$\n",
" P(z = n;y,\\Theta) = P(z \\leq n;y,\\Theta) - P(z \\leq n-1;y,\\Theta)\n",
"$$\n",
"\n",
"With this, [the negative log likelihood as our loss, and by calculating its gradient and hessian](maths.rst), we can (almost) build a gradient boosted ordinal model.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Optimising the thresholds\n",
"\n",
"GBT frameworks only allow building trees from the gradient and hessian of the loss with respect to the raw predictions of the model. Therefore, they won't allow us to also optimise the thresholds at the same time, as is done in other ordinal models.\n",
"\n",
"Instead, we can view this as a two-step optimisation problem. We first pick reasonable thresholds and build some trees, then optimise the thresholds given the predictions, and then re-build the trees given the new thresholds. This could be repeated as many times as we want, with any reasonable criterion for deciding when to start the threshold optimisation. In the current approach we do this only once and call it hot-starting the model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Problems\n",
"\n",
"Life would be boring if implementation followed smoothly from theory, but fortunately this is not the case here.\n",
"\n",
"When the latent variable becomes too large, the probability of the label being at any level other than the highest one tends to 0 very quickly (depending on the choice of the CDF $F$). This single probability estimate dominates the loss function and creates problems when optimising the thresholds. To combat this, the probabilities are capped to a lower and upper limit when calculating the loss.\n",
"\n",
"Another issue is that the sigmoid (which we have chosen as the $F$ function) tends to 1 and 0 fairly quickly, which presents two possibilities:\n",
"* If the thresholds are close to each other, then only the lowest and highest levels can reach a probability of 1; all other levels max out at a much lower probability.\n",
"* If the thresholds are further apart, most of the probability mass is concentrated on a smaller set of levels.\n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
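To make the threshold construction in the notebook concrete, here is a minimal NumPy sketch of how label probabilities can be obtained from a latent prediction and a set of ordered thresholds, using the sigmoid as the CDF $F$ (the choice the notebook itself makes later). The function names and the exact vectorisation are illustrative assumptions, not the package's API.

```python
import numpy as np

def sigmoid(x):
    """Logistic CDF F used to turn (threshold - latent) gaps into probabilities."""
    return 1.0 / (1.0 + np.exp(-x))

def ordinal_probas(y_latent, thresholds):
    """P(z = level) for each of the K levels implied by K-1 ordered thresholds.

    y_latent   : latent scores, shape (n,)
    thresholds : strictly increasing array of K-1 thresholds
    """
    y = np.asarray(y_latent, dtype=float).reshape(-1, 1)         # (n, 1)
    theta = np.asarray(thresholds, dtype=float).reshape(1, -1)   # (1, K-1)
    # Cumulative probabilities P(z <= k) = F(theta_k - y); a larger latent value
    # lowers every cumulative probability, as described in the notebook.
    cdf = sigmoid(theta - y)                                     # (n, K-1)
    # Pad with 0 on the left and 1 on the right, then take differences:
    # P(z = k) = P(z <= k) - P(z <= k-1).
    cdf = np.hstack([np.zeros((len(y), 1)), cdf, np.ones((len(y), 1))])
    return np.diff(cdf, axis=1)                                  # (n, K)

# Example with K = 4 levels and K-1 = 3 thresholds; each row sums to 1.
probas = ordinal_probas([-2.0, 0.5, 3.0], thresholds=[-1.0, 0.0, 1.5])
print(probas.round(3))
```

Each row of the returned array sums to 1, and increasing `y_latent` shifts probability mass toward the higher levels, matching the cumulative-probability behaviour stated in the notebook.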
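The loss described in the notebook is the negative log likelihood of these (capped) probabilities, and a GBT framework's custom-objective hook consumes its gradient and hessian with respect to the raw predictions. The sketch below is a simplified stand-in: the `eps` clipping mirrors the probability capping mentioned in the "Problems" section, the derivatives are approximated by finite differences rather than the analytic expressions derived in maths.rst, and it reuses the hypothetical `ordinal_probas` from the previous sketch.

```python
import numpy as np

def ordinal_nll(y_latent, labels, thresholds, eps=1e-15):
    """Per-sample negative log likelihood under the threshold model.

    `labels` are integer levels 0..K-1. `eps` caps the probabilities so that a
    single vanishing probability cannot dominate the loss.
    """
    probas = ordinal_probas(y_latent, thresholds)   # from the previous sketch
    labels = np.asarray(labels)
    p = probas[np.arange(len(labels)), labels]
    return -np.log(np.clip(p, eps, 1.0))

def grad_hess(y_latent, labels, thresholds, h=1e-4):
    """Finite-difference gradient and hessian of the NLL w.r.t. the raw predictions.

    These two arrays are the shape of input a GBT custom objective expects; the
    package derives them analytically (see maths.rst), this is only a sketch.
    """
    y = np.asarray(y_latent, dtype=float)
    nll = lambda v: ordinal_nll(v, labels, thresholds)
    grad = (nll(y + h) - nll(y - h)) / (2.0 * h)
    hess = (nll(y + h) - 2.0 * nll(y) + nll(y - h)) / h**2
    return grad, hess
```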
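Finally, a rough sketch of the second step of the two-step scheme ("hot-starting"): with the latent predictions held fixed, the thresholds are re-fitted by minimising the same negative log likelihood. The ordering constraint is kept by parameterising the thresholds as a first threshold plus exponentiated increments; this parameterisation, the use of `scipy.optimize.minimize`, and the function names are assumptions for illustration, not the package's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def _to_thresholds(params):
    """Map an unconstrained vector to strictly increasing thresholds.

    params = [theta_1, log(gap_2), ..., log(gap_{K-1})]
    """
    return np.concatenate([[params[0]], params[0] + np.cumsum(np.exp(params[1:]))])

def refit_thresholds(y_latent, labels, init_thresholds):
    """Re-fit the thresholds at fixed latent predictions by minimising the NLL."""
    init_thresholds = np.asarray(init_thresholds, dtype=float)  # must be strictly increasing
    init = np.concatenate([[init_thresholds[0]], np.log(np.diff(init_thresholds))])
    objective = lambda p: ordinal_nll(y_latent, labels, _to_thresholds(p)).sum()
    result = minimize(objective, init, method="Nelder-Mead")
    return _to_thresholds(result.x)
```

Alternating such a refit with further boosting rounds would give the repeated variant the notebook mentions; the approach described there runs the refit only once.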