-
Notifications
You must be signed in to change notification settings - Fork 1
/
subsection_Brief_comment_on_the__1.tex
132 lines (75 loc) · 11.4 KB
/
subsection_Brief_comment_on_the__1.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
\subsection{Brief comment on the consistency of the Training Error and the Vapnik–Chervonenkis (VC) Dimension} \label{section-VcDimension}
\cite{vapnik-nature2013}
\cite{cherkassky-learning2007}
The Vapnik–Chervonenkis (VC) Dimension is a number based on a theoretical framework now called \textit{Statistical Learning Theory} (SLT). This theory is used to estimate the true expected prediction error from finite samples. Surprisingly, very few assumptions are needed and results are distribution independent. Exact bounds are given for some common use cases with constructive analytical methods. For other cases authors propose sampling or experimental methods to estimate the VC dimension. For this part, most of the results and reasoning follows \cite{cherkassky-learning2007}. For a more broad development of this work, \cite{vapnik-nature2013} is recommended too.
In a supervised learning context, we use the empirical training ($\overline{err}$) or test errors ($\overline{err}_{test}$) to estimate the $EPE$. A particular learner is fit and then used to predict new values. We would have $\mathrm{T} = (\textbf{X},\textbf{Y})$ as our training set of $n$ samples and $Y = \{0,1 \}$. Now let $\mathcal {F}$ be a class of classifiers where $f: X \rightarrow Y \, \forall f \in \mathcal {F}$. As it was seen before in INSERT CORRECT SECTION, a common problem in supervised learning is overfitting. In practice this is common when $n$ is \textit{small} or when the number of features is \textit{large}. SLT theory provides theory to effectively characterize these dimensions in a constructive way. Analytical bounds are given for a number of algorithms which allows us to estimate the distance between the empirical error and the generalization error.
In Vapnik and Chervonenkis's work, they focus on estimation of the prediction error through the convergence and consistency of the training error. In applications, a learner is fit from a class of functions. These, for example in logistic regression, can be indexed by the values of the feature's weights $\theta$. This class will represent the set from which to produce the final model.
Incorporating the loss functions directly in the model, SLT looks at functions of the form $Q(t,\theta) \ = \ L(f_\theta(x,y))$, which is seen as a two variable function. One variable $t$ represents the data and the other the class index. Then the $EPE(\theta)$ for any model then takes the form of the functional
\begin{equation}\label{vapnik-risk}
\begin{split}
EPE(\theta) = & \ \Expect_{\textbf{t}} \left[ Q(t,\theta) \right]\\
= & \int Q(t,\theta) p(t) dt \\
= & \int L(f_\theta(x),y) p(x,y) dx dy
\end{split}
\end{equation}
where we let $p(t)$ be the \textbf{true} underlying probabilty function of the data
The SLT theory then focuses on functions which are indexed from a class of learners and which seek to minimize a certain functional characterized by the form of the loss function.\footnote{For simplicity, we will assume the functions $Q$ in the class to be bounded.}
However, as it was said before, we only have access to restricted data and we would like to estimate the risk functional for that class with the or the training/test error. In this framework, this is what is called the empirical risk:
\begin{equation}\label{vapnik-empiricalRisk}
Err_{train}(\theta) = \sum_{i=1}^n Q(t_i,\theta)
\end{equation}
In this approach, the learner is estimated directly from the risk functional. The loss function is put together with the approximating function and takes a strong part in the minimization procedure.
here, the unknown true underlying distribution $p(t)$ will not be estimated to minimize the functional and all of the effort will be put in directly using the training error as \textit{good} substitute for the prediction error.
The training error depends on the model learned from the data and in turn, this depends on the sample.
This non-deterministic aspect of the problem will surely output different models for varying samples, and even sometimes for the same sample. However, as we increase the sample size, we would still want to have the training error converge and to be consistent with the prediction error of the models.
and $\theta^{*}_n$
\begin{definition}{Consistency of the Training error}
Let $\{\mathcal {T}_1, \mathcal {T}_2, ..., \mathcal {T}_n, ... \}$ be a set of training samples such that $\forall n \ |T_n|=n$. Given a training sample of size $n$ and a class of learners $\Theta$, let $\theta^{*}_n$ be the argument minimizing the training error and let $\theta_0$ be the minimizing argument for the prediction error over the class of functions i.e. over all of the models of the class.
Then, the training error is said to be consistent if
$$\lim_{n\to\infty} Err_{train}(\theta^{*}_n) \ = \lim_{n\to\infty} EPE(\theta^{*}_n) \ = EPE(\theta_0)$$
\end{definition}
The property of asymptotic convergence in the training and prediction errors is expected in any learning algorithm. In SLT, the consistency requires that the training and prediction errors' sequences not only converge to the same values but that the sequence of minimal training errors also converge, to the minimizing value of the prediction error. At the same time, the sequence of prediction errors must converge to this same point.
In reality, it is natural to think that the approximation of the prediction error with the training error introduces a strong overestimation. The training error will also be biased by the sample used whilst the prediction error is given for the whole class and does not depend on the sample. Thus, in SLT theory, the minimization of the training error when using bounded loss functions is consistent if and only if:
%\begin{definition}{Uniform Consistency of the Training Error}
%\end{definition}
$$\forall \epsilon > 0 \ , \ \lim_{n\to\infty} P\left[ \sup_{\theta \in \Theta} \mid Err^{n}_{train}(\theta) - EPE(\theta) \mid \right] = 0 $$ % % $$
% $$
Here the training error $Err^{n}_{train}(\theta)$ is the value of this error when using a sample of size $n$. Then, the training error is said to be consistent if it converges uniformly in probability over the whole class of functions\footnote{Remember that approximating functions in the SLT setting are indexed by the $\theta$ parameter.}. This implies that it will characterizing the set of functions $Q(t,\theta)$ used to approximate the data will be important in finding an appropriate model. SLT theory then introduces
The next results will be focused on a binary supervised machine learning setting. However the extension to other forms of supervised learning can be found in \cite{cherkassky-learning2007}.
\begin{definition}{Shattering}
Let $\mathcal {A}= \{A_1,A_{2},\dots \}$ be a set family and $T$ a finite set. Let $t \subseteq T$, it is said that $\mathcal {A}$ picks out $t$ if there exists $A' \subseteq \mathcal {A} $ such that $ T \cap A' = t$. $T$ is said to be shattered by $\mathcal {A}$ if it picks out all of its subsets.
The VC dimension of $\mathcal {A}$ is the biggest cardinality of a set shattered by $\mathcal {A}$. Note that by definition this means \textit{any} possible set shattered by $\mathcal {A}$.
\end{definition}
The n-th shattering coefficient $\Delta_n$ of a class $\mathcal {A}$ is defined to be the maximum number of subsets of $n$ elements picked out by the class.
If we consider a function from our set of classifiers, given a training set of size $n$
$\mathcal {T} = \{ t_1,t_2,...,t_n \}$, we know that each classifier acts as an indicator function on the inputs $\{ x_1,x_2,...,x_n \}$. The \textit{diversity} of this set of classifiers intuitively represents all of the different ways in which the input sample is partitioned by the classifiers.
We would say that $t$ is picked out by $\Theta$ if there exists a classifier $f_{\theta} \in \Theta$ such that $T = f_{\theta}^{-1}(\{1\})$. The classifiers in $\Theta$ define a unique mapping to the class of sets where each classifier is positive. It is said that $\mathcal {F}$ shatters a set $A$ if all of its subsets are picked out by the class of functions.
We will now reproduce the main necessary and sufficient conditions to have uniform consistency in predictive error approximation by the class of loss functions. Also,
Vapnik et al. provide bounds for the exponential convergence of this error. These bounds are constructive and can be effectively calculated in a number of commonly used models such as linear regressors and support vector machines. Moreover, these bounds depend \textbf{only} of the structure of theapproximators rather than the true distribution of the data.
\begin{definition}{Vapnik–Chervonenkis (VC) Dimension}
The Vapnik–Chervonenkis Dimension (VC) of a class of binary functions is the cardinality of the largest set which is shattered by $\mathcal {F}$.
\end{definition}\footnote{The VC dimension is briefly introduced here for the purpose of giving a theoretical approach to error estimation in machine learning methods. For a complete explanation on this topic refer to \cite{vapnik-nature2013}}
The VC dimension gives a certain criteria for measuring the complexity of a class of binary functions by evaluating its expressiveness. Note that the VC dimension need not be finite.
As a simple example, one could use a linear regression of $d$ features $$g_{\theta} = \sum_{i=1}^d x_i \theta_i + \theta_0$$ as a classifier if we consider the indicator function of the positive half-plane induced by the regresssion: $$f_{\theta} = I(\sum_{i=1}^d x_i \theta_i + \theta_0)$$. This class of approximating functions can shatter up to $d+1$ samples, but no bigger sample. Thus the VC dimension is exactly $d+1$.
For more examples, refer to \cite{cherkassky-learning2007} Pg. 113 for examples of different VC classes, finite and infinite.
SLT then proves that if a class of binary classifiers is of finite VC dimension, accurate bounds can be given to estimate train and predictive errors. For a sample of size $n$, take $Err^n_{train}(\theta^*)$ to be the minimum training error and $EPE(\theta_0)$ the minimum true predictive error.
For classes which have a finite VC dimension $h$, then the n-th shattering coefficient is bounded by a polynomial of order equal to the dimension
i.e. $\Delta_n(\mathcal {F}) \leq O(n^(h))$ \footnote{$O(\cdot)$ corresponds to Big-O notation.} where $VC$ stands for the VC dimension of the class.
Also, with probability at least $1 - \eta$ and $\forall \theta \in \Theta$
\begin{equation}\label{vapnik-classificationBound}
EPE(\theta) \leq Err^n_{train}(\theta) + \frac{\epsilon}{2} \left(1 + \sqrt{1 + \frac{4 Err^n_{train}(\theta) }{\epsilon}} \right)
\end{equation}
where we have that $h$ is the VC dimension of the class and, when the class is infinite:
\begin{equation}\label{vapnik-epsilonBound}
\epsilon = a_1 \frac{h \left( ln(\frac{a_2 n}{h} ) - ln(\frac{\eta}{4} ) \right)}{n}
\end{equation}
or
\begin{equation}\label{vapnik-epsilonBound}
\epsilon = 2 \frac{ ln(d) - ln(\eta)}{n}
\end{equation}
when the class is finite and consists of $d$ elements.
The values of constants $a_1$ and $a_2$ are related to the nature of the density function $p(t)$ of the data. However, its values are proven to be bounded, where $a_1 \in (0,4 ]$ and $a2 \in (0,2 ]$.
A more precise bound can be given for the function $Err^n_{train}(\theta^*)$, where with probability $1 - 2\eta$
\begin{equation}\label{vapnik-classificationBoundPrecise}
Err^n_{train}(\theta^*) - EPE(\theta_0) \leq \sqrt{\frac{-ln(\eta)}{2n} } + \frac{\epsilon}{2}\left( \sqrt{1 + \frac{4}{\epsilon} } \right)
\end{equation}