doc: Add notebook that explores motivation for creating this package. Minor fixes in other doc files
Showing 8 changed files with 364 additions and 4 deletions.
@@ -4,4 +4,5 @@ dist/
 .vscode/
 .ipynb_checkpoints/
 **/*.egg-info
-**/build
+**/build
+.DS_Store
@@ -0,0 +1,133 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Overview\n",
"## Motivation\n",
"\n",
"Usually, when faced with prediction problems involving ordered labels (e.g. low, medium, high) and tabular data, data scientists turn to regular multinomial classifiers from the gradient boosted tree family of models because of their ease of use, speed of fitting, and good performance. Parametric ordinal models have been around for a while, but they have not been popular because of their poor performance compared to the gradient boosted models, especially for larger datasets.\n",
"\n",
"Although classifiers can predict ordinal labels adequately, they require building as many classifiers as there are labels to predict. This approach, however, leads to slower training times and confusing feature interpretations. For example, a feature which is positively associated with the increasing order of the label set (i.e. as the feature's value grows, so do the probabilities of the higher ordered labels) will have a positive association with the highest ordered label, a negative one with the lowest ordered label, and a \"concave\" association with the middle ones."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div>\n",
" <table>\n",
" <tr>\n",
" <th>\n",
" <img src=\"_static/feature_low_label.svg\" width=\"250\"/>\n",
" </th>\n",
" <th>\n",
" <img src=\"_static/feature_high_label.svg\" width=\"250\"/>\n",
" </th>\n",
" <th>\n",
" <img src=\"_static/feature_medium_label.svg\" width=\"250\"/>\n",
" </th>\n",
" </tr>\n",
" </table>\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {
"editable": true,
"slideshow": {
"slide_type": ""
},
"tags": []
},
"source": [
"## Creating an ordinal loss\n",
"\n",
"We build an ordinal model using the \"threshold\" approach, where a regressor learns a latent variable $y$, which is then contrasted with the real line split into regions by a set of thresholds $\\Theta$ to produce probabilities for each label. For a K-level problem, we use K-1 thresholds $\\{\\theta_1,...,\\theta_{K-1}\\}$ that produce K regions in the real line. Each of these regions is associated with one of the levels of the label, and when the latent variable $y$ lies within a region, the probability of the label being at that level is maximised.\n",
"<div>\n",
"<img src=\"_static/thresholds.svg\" width=\"500\"/>\n",
"</div>\n",
"<div>\n",
"<img src=\"_static/thresholds_max_proba.svg\" width=\"500\"/>\n",
"</div>\n",
"\n",
"\n",
"Because we learn a single latent variable, we can calculate the cumulative probability of the label $z$ being at most at a certain level $n$ out of the $K$ levels (in contrast to the regular classifier, which assumes all labels are independent). This probability is **higher** when the latent variable gets smaller or when the level we consider is larger. In other words, in a 5-level ordinal problem, given a latent variable value $y$, the cumulative probability that our label is at most at the third level is always going to be higher than that of being at most at the second level.\n",
"$$\n",
"P(z \\leq 3^{\\text{rd}};y,\\Theta) > P(z \\leq 2^{\\text{nd}};y,\\Theta)\n",
"$$\n",
"\n",
"Using the same setup, given that we are calculating the cumulative probability of our label being at most at the third level, a **lower** latent value will lead to a higher probability.\n",
"\n",
"$$\n",
" \\text{Given that } y_1 > y_2,\n",
"$$\n",
"$$\n",
" P(z \\leq 3^{\\text{rd}};y_1,\\Theta) < P(z \\leq 3^{\\text{rd}};y_2,\\Theta)\n",
"$$\n",
"\n",
"We can create a cumulative distribution function $F$ that calculates this probability and satisfies the aforementioned conditions, in addition to the following properties that make it a good candidate for a CDF:\n",
"* It is continuously differentiable, and so is its derivative\n",
"* Its range is between 0 and 1\n",
"* It is monotonically increasing\n",
"\n",
"The probability of the label being at a particular level is then just the difference between the cumulative probability of being at most at that level and that of being at most at the level below\n",
"$$\n",
" P(z = n;y,\\Theta) = P(z \\leq n;y,\\Theta) - P(z \\leq n-1;y,\\Theta)\n",
"$$\n",
"\n",
"With this, [the negative log likelihood as our loss, and by calculating its gradient and hessian](maths.rst), we can (almost) build a gradient boosted ordinal model.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Optimising the thresholds\n",
"\n",
"GBT frameworks only allow building trees from the gradient and hessian of the loss with respect to the raw predictions of the model. Therefore, they won't allow us to also optimise the thresholds at the same time, as is done in other ordinal models.\n",
"\n",
"Instead, we can view this as a two-step optimisation problem. We first pick reasonable thresholds and build some trees, then optimise the thresholds given the predictions, and then re-build the trees given the new thresholds. This could be repeated as many times as we want, with any reasonable criterion for deciding when to start the threshold optimisation. In the current approach we do this only once and call it hot-starting the model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Problems\n",
"\n",
"Life would be boring if implementation followed smoothly from theory, but fortunately this is not the case here.\n",
"\n",
"When the latent variable becomes too large, the probability of the label being at any level other than the highest one tends to 0 very quickly (depending on the choice of the CDF $F$). This single probability estimate dominates the loss function and creates problems when optimising the thresholds. To combat this, the probabilities are capped to a lower and upper limit when calculating the loss.\n",
"\n",
"Another issue is that the sigmoid (which we have chosen as the $F$ function) tends to 1 and 0 fairly quickly, which presents two possibilities:\n",
"* If the thresholds are close to each other, then only the lowest and highest levels can reach a probability of 1; all other levels max out at a much lower probability.\n",
"* If the thresholds are further apart, most of the probability mass is concentrated on a smaller set of levels.\n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
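To make the threshold construction in the notebook concrete, here is a minimal NumPy sketch of how label probabilities can be obtained from a latent prediction and a set of ordered thresholds, using the sigmoid as the CDF $F$ (the choice the notebook itself makes later). The function names and the exact vectorisation are illustrative assumptions, not the package's API.

```python
import numpy as np

def sigmoid(x):
    """Logistic CDF F used to turn (threshold - latent) gaps into probabilities."""
    return 1.0 / (1.0 + np.exp(-x))

def ordinal_probas(y_latent, thresholds):
    """P(z = level) for each of the K levels implied by K-1 ordered thresholds.

    y_latent   : latent scores, shape (n,)
    thresholds : strictly increasing array of K-1 thresholds
    """
    y = np.asarray(y_latent, dtype=float).reshape(-1, 1)         # (n, 1)
    theta = np.asarray(thresholds, dtype=float).reshape(1, -1)   # (1, K-1)
    # Cumulative probabilities P(z <= k) = F(theta_k - y); a larger latent value
    # lowers every cumulative probability, as described in the notebook.
    cdf = sigmoid(theta - y)                                     # (n, K-1)
    # Pad with 0 on the left and 1 on the right, then take differences:
    # P(z = k) = P(z <= k) - P(z <= k-1).
    cdf = np.hstack([np.zeros((len(y), 1)), cdf, np.ones((len(y), 1))])
    return np.diff(cdf, axis=1)                                  # (n, K)

# Example with K = 4 levels and K-1 = 3 thresholds; each row sums to 1.
probas = ordinal_probas([-2.0, 0.5, 3.0], thresholds=[-1.0, 0.0, 1.5])
print(probas.round(3))
```

Each row of the returned array sums to 1, and increasing `y_latent` shifts probability mass toward the higher levels, matching the cumulative-probability behaviour stated in the notebook.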
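The loss described in the notebook is the negative log likelihood of these (capped) probabilities, and a GBT framework's custom-objective hook consumes its gradient and hessian with respect to the raw predictions. The sketch below is a simplified stand-in: the `eps` clipping mirrors the probability capping mentioned in the "Problems" section, the derivatives are approximated by finite differences rather than the analytic expressions derived in maths.rst, and it reuses the hypothetical `ordinal_probas` from the previous sketch.

```python
import numpy as np

def ordinal_nll(y_latent, labels, thresholds, eps=1e-15):
    """Per-sample negative log likelihood under the threshold model.

    `labels` are integer levels 0..K-1. `eps` caps the probabilities so that a
    single vanishing probability cannot dominate the loss.
    """
    probas = ordinal_probas(y_latent, thresholds)   # from the previous sketch
    labels = np.asarray(labels)
    p = probas[np.arange(len(labels)), labels]
    return -np.log(np.clip(p, eps, 1.0))

def grad_hess(y_latent, labels, thresholds, h=1e-4):
    """Finite-difference gradient and hessian of the NLL w.r.t. the raw predictions.

    These two arrays are the shape of input a GBT custom objective expects; the
    package derives them analytically (see maths.rst), this is only a sketch.
    """
    y = np.asarray(y_latent, dtype=float)
    nll = lambda v: ordinal_nll(v, labels, thresholds)
    grad = (nll(y + h) - nll(y - h)) / (2.0 * h)
    hess = (nll(y + h) - 2.0 * nll(y) + nll(y - h)) / h**2
    return grad, hess
```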
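Finally, a rough sketch of the second step of the two-step scheme ("hot-starting"): with the latent predictions held fixed, the thresholds are re-fitted by minimising the same negative log likelihood. The ordering constraint is kept by parameterising the thresholds as a first threshold plus exponentiated increments; this parameterisation, the use of `scipy.optimize.minimize`, and the function names are assumptions for illustration, not the package's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def _to_thresholds(params):
    """Map an unconstrained vector to strictly increasing thresholds.

    params = [theta_1, log(gap_2), ..., log(gap_{K-1})]
    """
    return np.concatenate([[params[0]], params[0] + np.cumsum(np.exp(params[1:]))])

def refit_thresholds(y_latent, labels, init_thresholds):
    """Re-fit the thresholds at fixed latent predictions by minimising the NLL."""
    init_thresholds = np.asarray(init_thresholds, dtype=float)  # must be strictly increasing
    init = np.concatenate([[init_thresholds[0]], np.log(np.diff(init_thresholds))])
    objective = lambda p: ordinal_nll(y_latent, labels, _to_thresholds(p)).sum()
    result = minimize(objective, init, method="Nelder-Mead")
    return _to_thresholds(result.x)
```

Alternating such a refit with further boosting rounds would give the repeated variant the notebook mentions; the approach described there runs the refit only once.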