Commit

docs: Fiinish maths section, left to do code explanation
adamingas committed Jan 12, 2024
1 parent 6dfe99a commit aa82e83
Showing 2 changed files with 53 additions and 23 deletions.
1 change: 1 addition & 0 deletions docs/index.md
@@ -7,5 +7,6 @@
comparison_with_classifiers.ipynb
shap.ipynb
maths.md
autoapi/index
```
75 changes: 52 additions & 23 deletions docs/maths.md
@@ -1,13 +1,14 @@
# Maths
## Immediate Thresholds
# Loss
## Maths
### Immediate Thresholds

As is usual in ordinal regression model formulations, we build a regressor that learns a latent variable $y$, and then use a set of thresholds $\Theta$ to produce the probability estimates for each label. The set of thresholds is ordered, does not include infinities, and has as many members as the number of labels minus one.

We want to come up with a way to map the latent variable $y$ to the probability space such that when $y$ is in $(\theta_{k-1},\theta_{k})$ the probability of label $k$ is maximised.

In a problem with three ordered labels, we only need two thresholds, $\theta_1$ and $\theta_2$, to define the three regions associated with the labels: $(-\infty,\theta_1]$, $(\theta_1, \theta_2]$, and $(\theta_2, \infty)$.
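
As a minimal sketch of the three regions above, the snippet below assigns each latent value to the region it falls in. The threshold values and the helper name `region_index` are assumptions chosen for illustration, and labels are indexed from 0 rather than 1.

```python
import numpy as np

# Hypothetical thresholds for a three-label problem (illustrative values only).
thetas = np.array([-1.0, 1.0])  # theta_1 < theta_2


def region_index(y, thetas):
    """Return the 0-based index of the region each latent value falls in.

    The regions are (-inf, theta_1], (theta_1, theta_2], (theta_2, inf),
    so a value equal to a threshold belongs to the region on its left.
    """
    return np.searchsorted(thetas, y, side="left")


y = np.array([-2.0, -1.0, 0.0, 1.0, 3.0])
labels = region_index(y, thetas)
print(labels)
```

`side="left"` is what makes the right end of each interval closed, matching the half-open regions written above.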

## Deriving probabilities
### Deriving probabilities

We want the mapping from latent variable to probability to have the property that the cumulative probability of label $z$ being at most label $k$ increases as the label increases. This means that $P(z\leq k;y,\Theta)$ should increase as $k$ increases (i.e. as we consider more labels).

@@ -33,7 +34,7 @@ $$


A function that satisfies all these conditions is the sigmoid function, hereafter denoted as $\sigma$.
## Deriving the loss function
### Deriving the loss function

Given n samples, the likelihood of our set of predictions $\bf y$ is:
$$
@@ -43,13 +44,13 @@ $$
As is usual in machine learning, we use the negative log likelihood as our loss:

$$
\begin{align}
\begin{align*}
l({\bf y};\Theta) &= -\log L({\bf y};\Theta)\\
&= -\sum_{i=0}^n I(z_i=k)\log(P(z_i = k; y_i,\Theta)) \\
&= -\sum_{i=0}^n I(z_i=k)\log \left(\sigma(\theta_k - y_i) - \sigma(\theta_{k-1} - y_i)\right)
\end{align}
\end{align*}
$$
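
The loss above can be sketched numerically as follows. This is an illustration under stated assumptions, not the project's implementation: the function names are hypothetical, labels are 0-based integers, and the outermost labels are handled by treating the missing thresholds as $-\infty$ and $+\infty$ (so the cumulative probabilities are padded with 0 and 1).

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def label_probabilities(y, thetas):
    """P(z = k; y, Theta) for every sample (rows) and label (columns)."""
    y = np.asarray(y, dtype=float)
    thetas = np.asarray(thetas, dtype=float)
    # Cumulative probabilities P(z <= k) = sigmoid(theta_k - y).
    cum = sigmoid(thetas[None, :] - y[:, None])
    # Pad with 0 and 1 for the implicit thresholds at -inf and +inf.
    cum = np.hstack([np.zeros((len(y), 1)), cum, np.ones((len(y), 1))])
    # Differences of consecutive cumulative probabilities give P(z = k).
    return np.diff(cum, axis=1)


def negative_log_likelihood(y, z, thetas):
    """Sum over samples of -log P(z_i = observed label)."""
    probs = label_probabilities(y, thetas)
    return -np.sum(np.log(probs[np.arange(len(z)), z]))


thetas = np.array([-1.0, 1.0])   # illustrative thresholds
y = np.array([-1.5, 0.0, 2.0])   # latent predictions
z = np.array([0, 1, 2])          # observed labels (0-based)
probs = label_probabilities(y, thetas)
loss_value = negative_log_likelihood(y, z, thetas)
```

Each row of `probs` sums to one because the cumulative probabilities run from 0 to 1.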
## Deriving the gradient and hessian
### Deriving the gradient and hessian

To use a custom loss function with gradient boosting tree frameworks (e.g. LightGBM), we first have to derive the gradient and hessian of the loss with respect to **the raw predictions**, in our case the latent variable $y_i$.

@@ -62,19 +63,19 @@ $$
\begin{align*}
\mathcal{G}&=\frac{\partial l({\bf y};\Theta)}{\partial {\bf y}} \\
&= -\frac{\partial }{\partial {\bf y}} \sum_{i=0}^n I(z_i=k)\log \left(\sigma(\theta_k - y_i) - \sigma(\theta_{k-1} - y_i)\right) \\
&=
\begin{pmatrix}
-\frac{\partial }{\partial y_1} \sum_{i=0}^n I(z_i=k)\log \left(\sigma(\theta_k - y_i) - \sigma(\theta_{k-1} - y_i)\right) \\
&=
\begin{pmatrix}
-\frac{\partial }{\partial y_1} \sum_{i=0}^n I(z_i=k)\log \left(\sigma(\theta_k - y_i) - \sigma(\theta_{k-1} - y_i)\right) \\
... \\
-\frac{\partial }{\partial y_n} \sum_{i=0}^n I(z_i=k)\log \left(\sigma(\theta_k - y_i) - \sigma(\theta_{k-1} - y_i)\right) \\
\end{pmatrix} \\
&=
\begin{pmatrix}
I(z_1 = k) \left( \frac{\sigma'(\theta_k-y_1) - \sigma'(\theta_{k-1}-y_1)}{\sigma(\theta_k-y_1) - \sigma(\theta_{k-1}-y_1)} \right) \\
... \\
-\frac{\partial }{\partial y_n} \sum_{i=0}^n I(z_i=k)\log \left(\sigma(\theta_k - y_i) - \sigma(\theta_{k-1} - y_i)\right) \\
\end{pmatrix} \\
&=
\begin{pmatrix}
I(z_1 = k) \left( \frac{\sigma'(\theta_k-y_1) - \sigma'(\theta_{k-1}-y_1)}{\sigma(\theta_k-y_1) - \sigma(\theta_{k-1}-y_1)} \right) \\
... \\
I(z_n = k) \left( \frac{\sigma'(\theta_k-y_n) - \sigma'(\theta_{k-1}-y_n)}{\sigma(\theta_k-y_n) - \sigma(\theta_{k-1}-y_n)} \right) \\
\end{pmatrix}
I(z_n = k) \left( \frac{\sigma'(\theta_k-y_n) - \sigma'(\theta_{k-1}-y_n)}{\sigma(\theta_k-y_n) - \sigma(\theta_{k-1}-y_n)} \right) \\
\end{pmatrix}
\end{align*}
$$
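
One way to sanity-check the gradient derived above is to compare it with central finite differences of the loss. The sketch below does that under the same assumptions as before: hypothetical function names, 0-based labels, and implicit thresholds at $\pm\infty$ for the outermost labels.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def padded(thetas):
    """Thresholds with the implicit -inf and +inf added at the ends."""
    return np.concatenate([[-np.inf], np.asarray(thetas, dtype=float), [np.inf]])


def loss(y, z, thetas):
    t = padded(thetas)
    return -np.sum(np.log(sigmoid(t[z + 1] - y) - sigmoid(t[z] - y)))


def gradient(y, z, thetas):
    """Analytic gradient of the loss with respect to each latent value y_i."""
    t = padded(thetas)
    upper, lower = t[z + 1] - y, t[z] - y
    return (sigmoid_prime(upper) - sigmoid_prime(lower)) / (
        sigmoid(upper) - sigmoid(lower)
    )


thetas = np.array([-1.0, 1.0])   # illustrative thresholds
y = np.array([-1.5, 0.2, 2.0])   # latent predictions
z = np.array([0, 1, 2])          # observed labels (0-based)

# Central finite differences, perturbing one y_i at a time.
eps = 1e-6
numeric = np.array([
    (loss(y + eps * e, z, thetas) - loss(y - eps * e, z, thetas)) / (2 * eps)
    for e in np.eye(len(y))
])
analytic = gradient(y, z, thetas)
```

If the derivation is right, `analytic` and `numeric` should agree up to the finite-difference error.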

@@ -128,14 +129,40 @@ $$
\end{pmatrix}l({\bf y};\Theta) \\
&=
\begin{pmatrix}
\frac{\partial}{\partial y_1 y_1}(z_1 = k) \left( \frac{\sigma'(\theta_k-y_1) - \sigma'(\theta_{k-1}-y_1)}{\sigma(\theta_k-y_1) - \sigma(\theta_{k-1}-y_1)} \right) \\
... \\.. \\
\frac{\partial}{\partial y_n y_n}
(z_n = k) \left( \frac{\sigma'(\theta_k-y_n) - \sigma'(\theta_{k-1}-y_n)}{\sigma(\theta_k-y_n) - \sigma(\theta_{k-1}-y_n)} \right) \\
\frac{\partial}{\partial y_1 }I(z_1 = k) \left( \frac{\sigma'(\theta_k-y_1) - \sigma'(\theta_{k-1}-y_1)}{\sigma(\theta_k-y_1) - \sigma(\theta_{k-1}-y_1)} \right) \\
... \\
\frac{\partial}{\partial y_n }
I(z_n = k) \left( \frac{\sigma'(\theta_k-y_n) - \sigma'(\theta_{k-1}-y_n)}{\sigma(\theta_k-y_n) - \sigma(\theta_{k-1}-y_n)} \right)
\end{pmatrix}\\
&=
\begin{pmatrix}
-I(z_1 = k) \left( \frac{\sigma''(\theta_k-y_1) - \sigma''(\theta_{k-1}-y_1)}{\sigma(\theta_k-y_1) - \sigma(\theta_{k-1}-y_1)} \right) +
I(z_1 = k)\left( \frac{\sigma'(\theta_k-y_1) - \sigma'(\theta_{k-1}-y_1)}{\sigma(\theta_k-y_1) - \sigma(\theta_{k-1}-y_1)} \right)^2 \\
... \\
-I(z_n = k) \left( \frac{\sigma''(\theta_k-y_n) - \sigma''(\theta_{k-1}-y_n)}{\sigma(\theta_k-y_n) - \sigma(\theta_{k-1}-y_n)} \right) +
I(z_n = k)\left( \frac{\sigma'(\theta_k-y_n) - \sigma'(\theta_{k-1}-y_n)}{\sigma(\theta_k-y_n) - \sigma(\theta_{k-1}-y_n)} \right)^2 \\
\end{pmatrix}
\end{align*}
$$
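
Putting the gradient and the diagonal hessian together gives the pair of arrays a gradient boosting framework expects from a custom objective. The following is a sketch only: `ordinal_objective` is a hypothetical name, the thresholds are assumed fixed and known, labels are 0-based, and the outermost labels use implicit thresholds at $\pm\infty$.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def sigmoid_second(x):
    s = sigmoid(x)
    return s * (1.0 - s) * (1.0 - 2.0 * s)


def ordinal_objective(y_true, y_pred, thetas):
    """Return (grad, hess) of the loss with respect to the latent y_pred."""
    z = np.asarray(y_true).astype(int)  # labels may arrive as floats
    y = np.asarray(y_pred, dtype=float)
    # Thresholds with the implicit -inf and +inf added at the ends.
    t = np.concatenate([[-np.inf], np.asarray(thetas, dtype=float), [np.inf]])
    upper, lower = t[z + 1] - y, t[z] - y
    denom = sigmoid(upper) - sigmoid(lower)
    ratio = (sigmoid_prime(upper) - sigmoid_prime(lower)) / denom
    grad = ratio
    hess = -(sigmoid_second(upper) - sigmoid_second(lower)) / denom + ratio**2
    return grad, hess


thetas = np.array([-1.0, 1.0])        # illustrative thresholds
y_true = np.array([0.0, 1.0, 2.0])    # observed labels (0-based)
y_pred = np.array([-1.5, 0.2, 2.0])   # latent predictions
grad, hess = ordinal_objective(y_true, y_pred, thetas)
```

LightGBM's scikit-learn interface accepts a custom objective as a callable taking `(y_true, y_pred)` and returning `(grad, hess)`, so the thresholds could be bound beforehand with `functools.partial`. Whether the package wires it up this way is not shown in this commit.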

### Miscellaneous

The first derivative of the sigmoid function is:
$$
\sigma'(x) = \sigma(x)(1-\sigma(x))
$$
and the second derivative is:
$$
\begin{align*}
\sigma''(x) &= \frac{d}{dx}\sigma(x)(1-\sigma(x)) \\
&= \sigma'(x)(1-\sigma(x)) - \sigma'(x)\sigma(x)\\
&= \sigma(x)(1-\sigma(x))(1-\sigma(x)) -\sigma(x)(1-\sigma(x))\sigma(x) \\
&= (1-\sigma(x))\left(\sigma(x)-2\sigma(x)^2\right)
\end{align*}
$$


<!--
$$
\begin{align*}
@@ -158,4 +185,6 @@ P(y=k|\bbeta;\btheta;\tilde\bx) &= \begin{cases}
\sigma'(\theta_{1}-\tilde\eta) - 0 & \text{ if } k=1
\end{cases}
\end{align*}
$$
$$ -->

## Code
