diff --git a/_toc.yml b/_toc.yml index 7e666a0..90f4b7f 100644 --- a/_toc.yml +++ b/_toc.yml @@ -11,3 +11,4 @@ parts: - file: docs/回归/最小二乘法 - file: docs/回归/正则化线性回归 - file: docs/回归/贝叶斯回归 + - file: docs/回归/逻辑回归 diff --git "a/docs/\345\233\236\345\275\222/\351\200\273\350\276\221\345\233\236\345\275\222.ipynb" "b/docs/\345\233\236\345\275\222/\351\200\273\350\276\221\345\233\236\345\275\222.ipynb" index 4c4c8cf..8ca6cbe 100644 --- "a/docs/\345\233\236\345\275\222/\351\200\273\350\276\221\345\233\236\345\275\222.ipynb" +++ "b/docs/\345\233\236\345\275\222/\351\200\273\350\276\221\345\233\236\345\275\222.ipynb" @@ -5,87 +5,178 @@ "id": "bab0ad6e", "metadata": {}, "source": [ - "## logistic \n", + "# Logistic Regression \n", "\n", - "Logistic regression is a statistical learning method widely used for classification problems. It builds on the linear regression model and uses a nonlinear logistic function (also called the “sigmoid function”) to map predictions to probability values, which are then used to classify samples.\n", + "Despite the word “regression” in its name, logistic regression is a linear model for classification. It is also known as logit regression, maximum-entropy classification (MaxEnt), or the log-linear classifier.\n", "\n", - "In logistic regression we assume a probabilistic relationship between the input features X and the output label $Y$, and we want to predict from the features the probability that a sample belongs to a given class. Logistic regression handles binary classification and can be extended to multiclass problems.\n", + "Logistic regression builds on the linear regression model: a nonlinear logistic function (also called the “sigmoid function”) maps the linear prediction to a probability, which is then used to classify the sample. The model is:\n", "\n", - "Specifically, the mathematical model of logistic regression is as follows:\n", + "$h_\\theta(x) = g(\\theta^T x)$, where $g(z) = \\frac{1}{1 + e^{-z}}$ is the logistic function.\n", "\n", - "* Hypothesis function: \n", + "## The sigmoid function\n", + "The function is defined as:\n", "\n", - "$hθ(x) = g(θ^T x)$ \n", + "$\n", + "f(x) = \\frac{1}{1 + e^{-x}}\n", + "$\n", "\n", - "where \n", + "The following code plots this function:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "d3b6e683", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "\n", + "def sigmoid(x):\n", + " return 1 / (1 + np.exp(-x))\n", + "\n", + "# 生成一系列x值\n", + "x = np.linspace(-10, 10, 100)\n", "\n", - "$g(z) = \\frac{1}{1 + e^{-z}} $ \n", + "# 计算对应的sigmoid值\n", + "y = sigmoid(x)\n", "\n", - "是逻辑函数。\n", + "# 绘制sigmoid函数曲线\n", + "plt.plot(x, y)\n", + "plt.xlabel('x')\n", + "plt.ylabel('sigmoid(x)')\n", + "plt.title('Sigmoid Function')\n", + "plt.grid(True)\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "b37b7f5a", + "metadata": {}, + "source": [ + "这个函数图像看着就很对称,也很优雅。记为$f(x) = \\frac{1}{1 + e^{-x}}$\n", + "\n", + "对这个函数求导为:\n", "\n", - "* 损失函数(Loss Function):\n", + "$f’(x) = \\frac{d}{dx} \\left(\\frac{1}{1 + e^{-x}}\\right)$\n", "\n", - "对于二分类问题,使用对数损失函数(Log Loss):\n", + "进一步可以得到(推导略)\n", "\n", - "$ J(\\theta) = -\\frac{1}{m} \\sum_{i=1}^{m} [y^{(i)} \\log(h_\\theta(x^{(i)})) + (1-y^{(i)}) \\log(1 - h_\\theta(x^{(i)}))] $\n", + "$f’(x) = 2f(x)(1 - f(x))$\n", "\n", - "其中 $m$ 是样本数量,$y$是实际标签。\n", + "换句话说,sigmoid函数的导数值,仅仅依靠该函数值本身的四则运算就可以计算得到,计算可谓是非常简便。" + ] + }, + { + "cell_type": "markdown", + "id": "ae7d69d8", + "metadata": {}, + "source": [ "\n", - "* 参数估计:\n", + "## 损失函数(Loss Function):\n", "\n", - "使用梯度下降等优化算法最小化损失函数 $J(θ)$,得到参数θ的估计值。\n", + "参考[1]\n", "\n", - "* 预测:\n", + "我们知道一个样本最终的计算值在 $(0,1)$ 之间浮动,可以被视作概率值。并且由于sigmoid函数的单调性,我们可以认为,概率值越接近1,说明样本越接近正样本,评价质量越好\n", "\n", - "根据学习到的参数 $θ$,对新样本进行预测,即计算 $hθ(x)$ 的概率值,并根据阈值进行分类。" + "因此,我们可以构造一个似然函数,它衡量了计算得分值和真实样本标签之间的差距,当样本只有一个的时候:\n", + "\n", + "$L(w) = (p(x))^y (1-p(x))^{(1-y)}$\n", + "\n", + "当样本有N个的时候\n", + "\n", + "$L(w) = \\prod_{i=0}^{n} (p(x_i))^y_i (1-p(x_i))^{(1-y_i)}$\n", + "\n", + "再考虑到该函数的连乘,也未对样本的数量做归一化,应当转换成\n", + "\n", + "$J(w) = -\\frac{1}{N} \\ln L(w)$\n", + "\n", + "加上L2正则化项为:\n", + "\n", + "$\n", + "J(w) = -\\frac{1}{N} \\ln L(w) + \\frac{\\lambda}{2N}\\sum_{j=1}^{m} w_j^2\n", + "$\n", + "\n", + 
"其中,$J(w)$ 是目标函数,$w$ 是模型的参数向量,$N$ 是训练样本数量,$w_j$ 是第 $j$ 个特征的权重,$λ$ 是正则化参数,$m$ 是参数个数。\n" + ] + }, + { + "cell_type": "markdown", + "id": "01b483ad", + "metadata": {}, + "source": [ + "下面演示一个逻辑回归分类的示例:" ] }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 10, "id": "c93290d6", "metadata": {}, "outputs": [ { - "ename": "ValueError", - "evalue": "y should be a 1d array, got an array of shape (100, 5) instead.", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", - "\u001b[1;32m/Users/xue/work/code/github/introduction-to-machine-learning/docs/回归/逻辑回归.ipynb Cell 2\u001b[0m line \u001b[0;36m1\n\u001b[1;32m 11\u001b[0m logistic_reg \u001b[39m=\u001b[39m LogisticRegression()\n\u001b[1;32m 13\u001b[0m \u001b[39m# 拟合模型\u001b[39;00m\n\u001b[0;32m---> 14\u001b[0m logistic_reg\u001b[39m.\u001b[39;49mfit(X, y)\n\u001b[1;32m 16\u001b[0m \u001b[39m# 输出参数估计结果\u001b[39;00m\n\u001b[1;32m 17\u001b[0m \u001b[39mprint\u001b[39m(\u001b[39m\"\u001b[39m\u001b[39m参数估计结果:\u001b[39m\u001b[39m\"\u001b[39m)\n", - "File \u001b[0;32m~/tool/anaconda3/envs/data-explore/lib/python3.11/site-packages/sklearn/base.py:1152\u001b[0m, in \u001b[0;36m_fit_context..decorator..wrapper\u001b[0;34m(estimator, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1145\u001b[0m estimator\u001b[39m.\u001b[39m_validate_params()\n\u001b[1;32m 1147\u001b[0m \u001b[39mwith\u001b[39;00m config_context(\n\u001b[1;32m 1148\u001b[0m skip_parameter_validation\u001b[39m=\u001b[39m(\n\u001b[1;32m 1149\u001b[0m prefer_skip_nested_validation \u001b[39mor\u001b[39;00m global_skip_validation\n\u001b[1;32m 1150\u001b[0m )\n\u001b[1;32m 1151\u001b[0m ):\n\u001b[0;32m-> 1152\u001b[0m \u001b[39mreturn\u001b[39;00m fit_method(estimator, \u001b[39m*\u001b[39;49margs, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n", - "File 
\u001b[0;32m~/tool/anaconda3/envs/data-explore/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:1208\u001b[0m, in \u001b[0;36mLogisticRegression.fit\u001b[0;34m(self, X, y, sample_weight)\u001b[0m\n\u001b[1;32m 1205\u001b[0m \u001b[39melse\u001b[39;00m:\n\u001b[1;32m 1206\u001b[0m _dtype \u001b[39m=\u001b[39m [np\u001b[39m.\u001b[39mfloat64, np\u001b[39m.\u001b[39mfloat32]\n\u001b[0;32m-> 1208\u001b[0m X, y \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_validate_data(\n\u001b[1;32m 1209\u001b[0m X,\n\u001b[1;32m 1210\u001b[0m y,\n\u001b[1;32m 1211\u001b[0m accept_sparse\u001b[39m=\u001b[39;49m\u001b[39m\"\u001b[39;49m\u001b[39mcsr\u001b[39;49m\u001b[39m\"\u001b[39;49m,\n\u001b[1;32m 1212\u001b[0m dtype\u001b[39m=\u001b[39;49m_dtype,\n\u001b[1;32m 1213\u001b[0m order\u001b[39m=\u001b[39;49m\u001b[39m\"\u001b[39;49m\u001b[39mC\u001b[39;49m\u001b[39m\"\u001b[39;49m,\n\u001b[1;32m 1214\u001b[0m accept_large_sparse\u001b[39m=\u001b[39;49msolver \u001b[39mnot\u001b[39;49;00m \u001b[39min\u001b[39;49;00m [\u001b[39m\"\u001b[39;49m\u001b[39mliblinear\u001b[39;49m\u001b[39m\"\u001b[39;49m, \u001b[39m\"\u001b[39;49m\u001b[39msag\u001b[39;49m\u001b[39m\"\u001b[39;49m, \u001b[39m\"\u001b[39;49m\u001b[39msaga\u001b[39;49m\u001b[39m\"\u001b[39;49m],\n\u001b[1;32m 1215\u001b[0m )\n\u001b[1;32m 1216\u001b[0m check_classification_targets(y)\n\u001b[1;32m 1217\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mclasses_ \u001b[39m=\u001b[39m np\u001b[39m.\u001b[39munique(y)\n", - "File \u001b[0;32m~/tool/anaconda3/envs/data-explore/lib/python3.11/site-packages/sklearn/base.py:622\u001b[0m, in \u001b[0;36mBaseEstimator._validate_data\u001b[0;34m(self, X, y, reset, validate_separately, cast_to_ndarray, **check_params)\u001b[0m\n\u001b[1;32m 620\u001b[0m y \u001b[39m=\u001b[39m check_array(y, input_name\u001b[39m=\u001b[39m\u001b[39m\"\u001b[39m\u001b[39my\u001b[39m\u001b[39m\"\u001b[39m, 
\u001b[39m*\u001b[39m\u001b[39m*\u001b[39mcheck_y_params)\n\u001b[1;32m 621\u001b[0m \u001b[39melse\u001b[39;00m:\n\u001b[0;32m--> 622\u001b[0m X, y \u001b[39m=\u001b[39m check_X_y(X, y, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mcheck_params)\n\u001b[1;32m 623\u001b[0m out \u001b[39m=\u001b[39m X, y\n\u001b[1;32m 625\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mnot\u001b[39;00m no_val_X \u001b[39mand\u001b[39;00m check_params\u001b[39m.\u001b[39mget(\u001b[39m\"\u001b[39m\u001b[39mensure_2d\u001b[39m\u001b[39m\"\u001b[39m, \u001b[39mTrue\u001b[39;00m):\n", - "File \u001b[0;32m~/tool/anaconda3/envs/data-explore/lib/python3.11/site-packages/sklearn/utils/validation.py:1162\u001b[0m, in \u001b[0;36mcheck_X_y\u001b[0;34m(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)\u001b[0m\n\u001b[1;32m 1142\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\n\u001b[1;32m 1143\u001b[0m \u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39m{\u001b[39;00mestimator_name\u001b[39m}\u001b[39;00m\u001b[39m requires y to be passed, but the target y is None\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m 1144\u001b[0m )\n\u001b[1;32m 1146\u001b[0m X \u001b[39m=\u001b[39m check_array(\n\u001b[1;32m 1147\u001b[0m X,\n\u001b[1;32m 1148\u001b[0m accept_sparse\u001b[39m=\u001b[39maccept_sparse,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 1159\u001b[0m input_name\u001b[39m=\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mX\u001b[39m\u001b[39m\"\u001b[39m,\n\u001b[1;32m 1160\u001b[0m )\n\u001b[0;32m-> 1162\u001b[0m y \u001b[39m=\u001b[39m _check_y(y, multi_output\u001b[39m=\u001b[39;49mmulti_output, y_numeric\u001b[39m=\u001b[39;49my_numeric, estimator\u001b[39m=\u001b[39;49mestimator)\n\u001b[1;32m 1164\u001b[0m check_consistent_length(X, y)\n\u001b[1;32m 1166\u001b[0m \u001b[39mreturn\u001b[39;00m X, y\n", - "File 
\u001b[0;32m~/tool/anaconda3/envs/data-explore/lib/python3.11/site-packages/sklearn/utils/validation.py:1183\u001b[0m, in \u001b[0;36m_check_y\u001b[0;34m(y, multi_output, y_numeric, estimator)\u001b[0m\n\u001b[1;32m 1181\u001b[0m \u001b[39melse\u001b[39;00m:\n\u001b[1;32m 1182\u001b[0m estimator_name \u001b[39m=\u001b[39m _check_estimator_name(estimator)\n\u001b[0;32m-> 1183\u001b[0m y \u001b[39m=\u001b[39m column_or_1d(y, warn\u001b[39m=\u001b[39;49m\u001b[39mTrue\u001b[39;49;00m)\n\u001b[1;32m 1184\u001b[0m _assert_all_finite(y, input_name\u001b[39m=\u001b[39m\u001b[39m\"\u001b[39m\u001b[39my\u001b[39m\u001b[39m\"\u001b[39m, estimator_name\u001b[39m=\u001b[39mestimator_name)\n\u001b[1;32m 1185\u001b[0m _ensure_no_complex_data(y)\n", - "File \u001b[0;32m~/tool/anaconda3/envs/data-explore/lib/python3.11/site-packages/sklearn/utils/validation.py:1244\u001b[0m, in \u001b[0;36mcolumn_or_1d\u001b[0;34m(y, dtype, warn)\u001b[0m\n\u001b[1;32m 1233\u001b[0m warnings\u001b[39m.\u001b[39mwarn(\n\u001b[1;32m 1234\u001b[0m (\n\u001b[1;32m 1235\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mA column-vector y was passed when a 1d array was\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 1240\u001b[0m stacklevel\u001b[39m=\u001b[39m\u001b[39m2\u001b[39m,\n\u001b[1;32m 1241\u001b[0m )\n\u001b[1;32m 1242\u001b[0m \u001b[39mreturn\u001b[39;00m _asarray_with_order(xp\u001b[39m.\u001b[39mreshape(y, (\u001b[39m-\u001b[39m\u001b[39m1\u001b[39m,)), order\u001b[39m=\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mC\u001b[39m\u001b[39m\"\u001b[39m, xp\u001b[39m=\u001b[39mxp)\n\u001b[0;32m-> 1244\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\n\u001b[1;32m 1245\u001b[0m \u001b[39m\"\u001b[39m\u001b[39my should be a 1d array, got an array of shape \u001b[39m\u001b[39m{}\u001b[39;00m\u001b[39m instead.\u001b[39m\u001b[39m\"\u001b[39m\u001b[39m.\u001b[39mformat(shape)\n\u001b[1;32m 1246\u001b[0m )\n", - "\u001b[0;31mValueError\u001b[0m: y should be a 
1d array, got an array of shape (100, 5) instead." - ] + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" } ], "source": [ - "from sklearn.linear_model import LogisticRegression\n", "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "from sklearn.linear_model import LogisticRegression\n", "\n", - "# 生成样本数据\n", + "# 创建一些随机数据用于训练\n", "np.random.seed(0)\n", - "n_samples, n_features = 100, 10\n", - "X = np.random.randn(n_samples, n_features)\n", - "y = np.random.randn(n_samples) # 生成5个相关联的目标变量\n", - "\n", - "# 创建逻辑回归模型\n", - "logistic_reg = LogisticRegression()\n", - "\n", - "# 拟合模型\n", - "logistic_reg.fit(X, y)\n", - "\n", - "# 输出参数估计结果\n", - "print(\"参数估计结果:\")\n", - "print(logistic_reg.coef_)\n", + "X1 = np.random.randn(50, 2) + [2, 2]\n", + "X2 = np.random.randn(50, 2) + [-2, -2]\n", + "X = np.concatenate((X1, X2))\n", + "y = np.concatenate((np.ones(50), np.zeros(50)))\n", + "\n", + "# 训练逻辑回归模型\n", + "model = LogisticRegression()\n", + "model.fit(X, y)\n", + "\n", + "# 可视化分类结果\n", + "x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1\n", + "y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1\n", + "xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))\n", + "Z = model.predict(np.c_[xx.ravel(), yy.ravel()])\n", + "Z = Z.reshape(xx.shape)\n", + "\n", + "plt.contourf(xx, yy, Z, alpha=0.8)\n", + "plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k')\n", + "plt.xlabel('Feature 1')\n", + "plt.ylabel('Feature 2')\n", + "plt.title('Logistic Regression')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "0b9c848d", + "metadata": {}, + "source": [ + "## 参考\n", "\n", - "# 进行预测\n", - "y_pred = logistic_reg.predict(X)" + "[1] 逻辑回归 https://zhuanlan.zhihu.com/p/580207932?utm_id=0" ] } ],