A merely superficial understanding is not enough; rigorous derivation and hands-on practice are also required. --- Michelangelo
Other notes
To describe the supervised learning problem slightly more formally: our goal is, given a training set, to learn a function
![image-20200105220222073](.\2 Linear Regression with multiple variables.assets\image-20200105220222073.png)
We can measure the accuracy of our hypothesis function by using a cost function.
$$
J(\theta)=\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2
$$
This function is also called the "Squared error function", or "Mean squared error".
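As a concrete illustration, here is a minimal Octave sketch of this cost function (the names `computeCost`, `X`, `y`, and `theta` are my own conventions, and `X` is assumed to already contain a leading column of ones for $x_0$):

```matlab
% computeCost.m -- squared-error cost J(theta) for linear regression
function J = computeCost(X, y, theta)
  m = length(y);            % number of training examples
  errors = X * theta - y;   % h_theta(x^(i)) - y^(i), for every example at once
  J = (1 / (2 * m)) * sum(errors .^ 2);
end
```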
A contour plot is a graph that contains many contour lines. A contour line of a two-variable function has a constant value at all points on the same line. An example of such a graph is shown below.
![image-20200105221703361](.\2 Linear Regression with multiple variables.assets\image-20200105221703361.png)
We can graph our hypothesis function based on its parameters.
![image-20200105223736198](.\2 Linear Regression with multiple variables.assets\image-20200105223736198.png)
The way to reach the minimum points is by taking the derivative (the tangential line to a function) of our cost function. The slope of the tangent is the derivative at that point and it gives us a direction to move towards. We take steps down the cost function in the direction of steepest descent. The size of each step is determined by the parameter $\alpha$, which is called the learning rate.
The gradient descent algorithm is:
repeat until convergence: $$ \theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1) $$
where j=0,1 represents the feature index number.
Note that at each iteration, the parameters should be updated simultaneously rather than one by one.
![image-20200105224716883](.\2 Linear Regression with multiple variables.assets\image-20200105224716883.png)
The lowercase "n" is the number of features, whereas "m" is the number of rows in this table, i.e. the number of training examples.
Firstly, we are going to use $x^{(i)}$ to denote the input features of the $i$-th training example. Secondly, we are also going to use $x_j^{(i)}$ to denote the value of feature $j$ in the $i$-th training example.
![1578020699595](.\2 Linear Regression with multiple variables.assets\1578020699595.png)
To quickly summarize our notation: this is our hypothesis in multivariable linear regression, where we've adopted the convention that $x_0=1$. The parameters of this model are $\theta_0$ through $\theta_n$, but instead of thinking of these as n+1 separate parameters (which is valid), we will think of them as a single (n+1)-dimensional vector $\theta$.
Cost function: $$ J(\theta)=\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2 $$ Gradient descent: $$ \theta_0:= \theta_0-\alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_0^{(i)} $$
![image-20200104110124420](.\2 Linear Regression with multiple variables.assets\image-20200104110124420.png)
![image-20200104110405281](.\2 Linear Regression with multiple variables.assets\image-20200104110405281.png)
In the formula above $x_0=1$, so the update for $\theta_0$ can be simplified.
Note that the partial derivative of the cost function can be worked out explicitly; it is exactly the summation term inside the gradient descent formula.
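As a hedged sketch, the corresponding batch gradient descent loop in Octave might look like this (it reuses the hypothetical `computeCost` from above and assumes `X` carries the $x_0 = 1$ column):

```matlab
% gradientDescent.m -- run num_iters steps of batch gradient descent
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
  m = length(y);
  J_history = zeros(num_iters, 1);
  for iter = 1:num_iters
    % Simultaneous (vectorized) update of every theta_j:
    % theta_j := theta_j - alpha * (1/m) * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i)
    theta = theta - (alpha / m) * (X' * (X * theta - y));
    J_history(iter) = computeCost(X, y, theta);   % record J for debugging plots
  end
end
```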
Because gradient descent converges slowly when the input features are on very different scales ($\theta$ oscillates inefficiently on its way down to the optimum), we should get every feature into roughly the same range. Two techniques to help with this are feature scaling and mean normalization. Feature scaling involves dividing the inputs by the range, resulting in a new range of just 1.
$$
x_i=\frac{x_i}{R_i}\\
R_i=\max{x_i}-\min{x_i}
$$
In addition, mean normalization involves subtracting the average value of the input variable, resulting in a new average value of zero for that variable.
$$
x_i=\frac{x_i-\mu_i}{s_i}\\
$$
where $\mu_i$ is the average of all the values for feature $i$ and $s_i$ is the range of values (max − min), or alternatively the standard deviation.
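For example, a minimal Octave sketch of mean normalization (the helper name `featureNormalize` and the choice of the standard deviation for $s_i$ are my own assumptions):

```matlab
% featureNormalize.m -- mean-normalize every column (feature) of X
function [X_norm, mu, sigma] = featureNormalize(X)
  mu    = mean(X);              % 1 x n row vector of feature means (mu_i)
  sigma = std(X);               % 1 x n row vector of standard deviations (s_i)
  X_norm = (X - mu) ./ sigma;   % Octave broadcasts the row vectors over each row of X
end
```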
- Making a plot with the number of iterations on the x-axis shows how the cost function changes as gradient descent runs. If $J(\theta)$ ever increases, then we should decrease $\alpha$.
- Declaring convergence if $J(\theta)$ decreases by less than $\varepsilon$ in one iteration, where $\varepsilon$ is some small value such as $10^{-3}$. However, in practice it is difficult to choose this threshold value.
![image-20200105111715490](.\2 Linear Regression with multiple variables.assets\image-20200105111715490.png)
To summarize:
- If $\alpha$ is too small: slow convergence.
- If $\alpha$ is too large: $J(\theta)$ may not decrease on every iteration and thus may not converge.
![image-20200105111659339](.\2 Linear Regression with multiple variables.assets\image-20200105111659339.png)
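Assuming the `J_history` vector returned by the hypothetical `gradientDescent` sketch above, the debugging plot and a simple convergence test could look like:

```matlab
plot(1:numel(J_history), J_history, '-');
xlabel('number of iterations');
ylabel('J(theta)');

epsilon = 1e-3;    % the small threshold; in practice this is hard to choose well
if abs(J_history(end - 1) - J_history(end)) < epsilon
  disp('declaring convergence');
end
```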
We can improve the performance of our hypothesis by combining multiple features into one (for example adding one feature to another) or by transforming a feature (for example taking its square or square root).
For example,
$$
h_\theta(x)=\theta_0x_0+\theta_1(x_1+x_2)+\theta_2x_3^2+\theta_3\sqrt{x_4}\\
=\theta_0t_0+\theta_1t_1+\theta_2t_2+\theta_3t_3
$$
One important thing to keep in mind is: if you choose your features this way, **then feature scaling becomes very important**. E.g. if $x_1$ has range 1–1000, then the range of $x_1^2$ becomes 1–1,000,000 and that of $x_1^3$ becomes 1–$10^9$.
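As a hedged illustration, building such derived features in Octave and then normalizing them (the variables `x1`…`x4` are assumed to be m×1 column vectors, and `featureNormalize` is the earlier sketch):

```matlab
t1 = x1 + x2;        % combined feature
t2 = x3 .^ 2;        % squared feature: a 1..1000 range becomes 1..10^6
t3 = sqrt(x4);       % square-root feature
X  = [ones(m, 1), t1, t2, t3];                 % design matrix with the x_0 = 1 column
X(:, 2:end) = featureNormalize(X(:, 2:end));   % scaling matters a lot for these features
```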
In Octave, the normal equation is computed as `pinv(X'*X)*X'*y`.
![1578531270548](2 Linear Regression with multiple variables.assets/1578531270548.png)
![1578531702843](2 Linear Regression with multiple variables.assets/1578531702843-1578531723133.png)

Solutions to the above problems include deleting a feature that is linearly dependent on another, or deleting one or more features when there are too many features (e.g. m ≤ n).
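Putting it together, a hedged Octave sketch of the normal equation (the matrix `data` is an assumed m×(n+1) array with the target in its last column; `pinv` still returns a usable pseudo-inverse even when $X^TX$ is non-invertible):

```matlab
m = size(data, 1);                    % number of training examples
X = [ones(m, 1), data(:, 1:end-1)];   % add the x_0 = 1 column to the features
y = data(:, end);                     % targets
theta = pinv(X' * X) * X' * y;        % analytic solution: no alpha, no iterations
```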
To attempt classification, one method is to use linear regression and map all predictions greater than 0.5 as a 1 and all less than 0.5 as a 0. However, this method doesn't work well because classification is not actually a linear function.
The classification problem is just like the regression problem, except that the values we now want to predict take on only a small number of discrete values. For now, we will focus on the binary classification problem, in which y can take on only two values, 0 and 1.
g(z) is the sigmoid function, also called the logistic function: $$ g(z)=\frac{1}{1+e^{-z}} $$
![1578555033803](2 Linear Regression with multiple variables.assets/1578555033803.png)
![1578555899920](2 Linear Regression with multiple variables.assets/1578555899920.png)
![1578556453554](2 Linear Regression with multiple variables.assets/1578556453554.png)
Q: How do we decide which region is labeled 1? (Predict y = 1 wherever $\theta^\mathrm{T}x \ge 0$, i.e. $h_\theta(x) \ge 0.5$.)
![1578556237739](2 Linear Regression with multiple variables.assets/1578556237739.png)
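In Octave, a minimal hedged sketch of the sigmoid hypothesis and this decision rule:

```matlab
sigmoid = @(z) 1 ./ (1 + exp(-z));   % elementwise logistic function g(z)
h = sigmoid(X * theta);              % h_theta(x) for every row of X
p = (h >= 0.5);                      % predicted labels: 1 wherever theta' * x >= 0
```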
Non-linear decision boundary
![1578556381294](2 Linear Regression with multiple variables.assets/1578556381294.png)
If we plug the sigmoid hypothesis into the squared-error cost, $J(\theta)$ becomes a non-convex function with many local optima:
![1578556742065](2 Linear Regression with multiple variables.assets/1578556742065.png)
Logistic regression cost function: $$ \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) = \begin{cases} -\log(h_\theta(x)), & \text{if } y = 1\\ -\log(1 - h_\theta(x)), & \text{if } y = 0 \end{cases} = -y\log(h_\theta(x)) - (1-y)\log(1-h_\theta(x)) $$ This function looks like this:
![1578558110894](2 Linear Regression with multiple variables.assets/1578558110894.png)
If y = 1, the cost is 0 when $h_\theta(x) = 1$, but it grows to infinity as $h_\theta(x) \to 0$; the case y = 0 is symmetric.
Convex optimization problems are much easier to solve than non-convex ones!
![1578559067602](2 Linear Regression with multiple variables.assets/1578559067602.png)
The logistic regression algorithm is similar to linear regression: we still have to simultaneously update all values in theta. A vectorized implementation is:
![1578559409299](2 Linear Regression with multiple variables.assets/1578559409299.png)
The derivative of the cost function is not obvious at first glance: substitute the logistic function $g(X^\mathrm{T}\theta)$ for $h(x)$ and differentiate.
By the way, feature scaling also applies to the logistic regression model.
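A hedged vectorized sketch of one such gradient descent step in Octave (same shape conventions as for linear regression):

```matlab
% One batch gradient-descent step for logistic regression (all theta updated simultaneously)
h     = 1 ./ (1 + exp(-(X * theta)));   % sigmoid(X * theta)
grad  = (1 / m) * (X' * (h - y));       % vectorized partial derivatives of J(theta)
theta = theta - alpha * grad;
```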
![1578559640272](2 Linear Regression with multiple variables.assets/1578559640272.png)
There are a lot of advanced optimization algorithms.
![1578559793856](2 Linear Regression with multiple variables.assets/1578559793856.png)
"Conjugate gradient", "BFGS", and "L-BFGS" are more sophisticated, faster ways to optimize θ that can be used instead of gradient descent. We suggest that you should not write these more sophisticated algorithms yourself (unless you are an expert in numerical computing) but use the libraries instead, as they're already tested and highly optimized. Octave provides them.
![1578560171365](2 Linear Regression with multiple variables.assets/1578560171365.png)
![1578560454788](2 Linear Regression with multiple variables.assets/1578560454788.png)
An example of how to use fminunc, together with the meaning of each part of the options, is illustrated above.
(exitflag = 1: the iteration terminated normally, i.e. the algorithm converged.)
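A hedged, self-contained sketch of the same workflow in Octave; the cost-function-with-gradient below is my own version of what the screenshots call `costFunction`, and the data matrices `X`, `y` are assumed to exist:

```matlab
% costFunction.m -- logistic regression cost J and its gradient, as fminunc expects
function [J, grad] = costFunction(theta, X, y)
  m = length(y);
  h = 1 ./ (1 + exp(-(X * theta)));                      % sigmoid hypothesis
  J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));  % logistic cost
  grad = (1 / m) * (X' * (h - y));                       % gradient vector
end
```

It is then handed to fminunc, telling the optimizer that our function also returns the gradient:

```matlab
options = optimset('GradObj', 'on', 'MaxIter', 400);
initial_theta = zeros(size(X, 2), 1);
[theta, cost, exitflag] = fminunc(@(t) costFunction(t, X, y), initial_theta, options);
```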
Decomposing the task
![1578617682038](2 Linear Regression with multiple variables.assets/1578617682038.png)
The mathematical model
![1578617816498](2 Linear Regression with multiple variables.assets/1578617816498.png)
![1578617735829](2 Linear Regression with multiple variables.assets/1578617735829.png)
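As a hedged sketch of the one-vs-all idea (train one logistic regression classifier per class and predict the class whose classifier outputs the highest probability); the matrix `all_theta`, with one row of parameters per class, is my own convention:

```matlab
% all_theta : K x (n+1) matrix, row k holds the theta learned for "class k vs. the rest"
probs = 1 ./ (1 + exp(-(X * all_theta')));   % m x K matrix of h_theta^{(k)}(x) values
[~, prediction] = max(probs, [], 2);         % for each example, pick the most probable class
```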
Without formally defining what these terms mean, we’ll say the figure on the left shows an instance of underfitting, or high bias—in which the data clearly shows structure not captured by the model—and the figure on the right is an example of overfitting, high variance.
![img](2 Linear Regression with multiple variables.assets/0cOOdKsMEeaCrQqTpeD5ng_2a806eb8d988461f716f4799915ab779_Screenshot-2016-11-15-00.23.30.png)
This terminology is applied to both linear and logistic regression. There are two main options to address the issue of overfitting:
- Reduce the number of features:
  - Manually select which features to keep.
  - Use a model selection algorithm (studied later in the course).
- Regularization:
  - Keep all the features, but reduce the magnitude of the parameters $\theta_j$.
  - Regularization works well when we have a lot of slightly useful features.
If we have overfitting from our hypothesis function, we can reduce the weight that some of the terms in our function carry by increasing their cost.
Say we wanted to make the following function more quadratic:
We'll want to eliminate the influence of the higher-order terms $\theta_3x^3$ and $\theta_4x^4$. Without actually removing these features or changing the form of our hypothesis, we can instead modify our cost function to inflate the cost of $\theta_3$ and $\theta_4$, so that their learned values end up close to zero.
![img](2 Linear Regression with multiple variables.assets/j0X9h6tUEeawbAp5ByfpEg_ea3e85af4056c56fa704547770da65a6_Screenshot-2016-11-15-08.53.32.png)
More generally, we introduce a penalty term to balance fitting the training data against overfitting. We could also regularize all of our theta parameters in a single summation:
![1578619731712](2 Linear Regression with multiple variables.assets/1578619731712.png)
Using the above cost function with the extra summation, we can smooth the output of our hypothesis function to reduce overfitting. If lambda is chosen to be too large, it may smooth out the function too much and cause underfitting. Hence, what would happen if $\lambda$ were set to an extremely large value?
A: It doesn't work well: the penalized parameters are driven towards zero and the hypothesis underfits, ending up close to a flat line.
We will modify our gradient descent function to separate out $\theta_0$ from the rest of the parameters, because we do not want to penalize $\theta_0$.
![1578621078808](2 Linear Regression with multiple variables.assets/1578621078808.png)
The term $\frac{\lambda}{m}\theta_j$ performs the regularization. With some manipulation, the update rule for $j \ge 1$ can also be written as: $$ \theta_j := \theta_j(1 - \alpha\frac{\lambda}{m}) - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)} $$
The first term in the above equation, $1 - \alpha\frac{\lambda}{m}$, will always be less than 1, so intuitively each update shrinks $\theta_j$ a little before performing the same gradient step as before.
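A hedged Octave sketch of one regularized gradient descent step for linear regression, keeping $\theta_0$ (stored in `theta(1)`) unpenalized:

```matlab
h    = X * theta;                                          % linear hypothesis
grad = (1 / m) * (X' * (h - y));                           % unregularized gradient
grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);   % regularize all but theta_0
theta = theta - alpha * grad;
```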
Now let's approach regularization using the alternate method of the non-iterative normal equation.
To add in regularization, the equation is the same as our original, except that we add another term inside the parentheses:
![1578620859889](2 Linear Regression with multiple variables.assets/1578620859889.png)
Recall that if m < n, then $X^TX$ is non-invertible. However, when we add the term λ⋅L, then $X^TX + \lambda \cdot L$ becomes invertible.
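In Octave, a hedged sketch of this regularized normal equation, with `L` built as the identity matrix whose top-left entry is zeroed so that $\theta_0$ stays unregularized:

```matlab
L = eye(size(X, 2));                        % (n+1) x (n+1) identity
L(1, 1) = 0;                                % do not regularize the bias term theta_0
theta = (X' * X + lambda * L) \ (X' * y);   % regularized matrix is invertible, so "\" works
```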
Just as with gradient descent for linear regression above, the modifications to the cost function and to the gradient descent update are analogous.
We can regularize logistic regression in a similar way that we regularize linear regression.
The following image shows how the regularized function, displayed by the pink line, is less likely to overfit than the non-regularized function represented by the blue line:
![img](2 Linear Regression with multiple variables.assets/Od9mobDaEeaCrQqTpeD5ng_4f5e9c71d1aa285c1152ed4262f019c1_Screenshot-2016-11-22-09.31.21.png)
We can regularize this equation by adding a term to the end. The second sum, $\sum_{j=1}^{n}\theta_j^2$, explicitly excludes the bias term $\theta_0$.
![1578621857694](2 Linear Regression with multiple variables.assets/1578621857694.png)
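A hedged Octave sketch of the regularized logistic cost and gradient, ready to be handed to fminunc as before (the name `costFunctionReg` is my own; the penalty skips $\theta_0$):

```matlab
% costFunctionReg.m -- regularized logistic regression cost and gradient
function [J, grad] = costFunctionReg(theta, X, y, lambda)
  m = length(y);
  h = 1 ./ (1 + exp(-(X * theta)));
  J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h)) ...
      + (lambda / (2 * m)) * sum(theta(2:end) .^ 2);        % second sum excludes theta_0
  grad = (1 / m) * (X' * (h - y));
  grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);  % do not penalize theta_0
end
```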
Judge (true or false): introducing regularization to the model always results in equal or better performance on the training set.
Adding regularization does not always give better results: if λ is chosen too large, the model underfits, which is bad both on the training set and on new examples. So the statement is false.