- Unsupervised Learning cheatsheet
⟶
非監督式學習參考手冊
- Introduction to Unsupervised Learning
⟶
非監督式學習介紹
- Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.
⟶
動機 - 非監督式學習的目的是要找出未標籤資料 {x(1),...,x(m)} 之間的隱藏模式
- Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:
⟶
Jensen's 不等式 - 令 f 為一個凸函數、X 為一個隨機變數,我們可以得到底下這個不等式:
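The inequality itself appears only as an image in the source cheatsheet; for reference, the standard statement for a convex f (recalled here, not copied from the source) is:

```latex
% Jensen's inequality for a convex function f and a random variable X
\mathbb{E}[f(X)] \geqslant f(\mathbb{E}[X])
```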
- Clustering
⟶
分群
- Expectation-Maximization
⟶
最大期望演算法 (Expectation-Maximization)
- Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:
⟶
潛在變數 (Latent variables) - 潛在變數指的是隱藏/沒有觀察到的變數,這會讓問題的估計變得困難,我們通常使用 z 來代表它。底下是潛在變數的常見設定:
- [Setting, Latent variable z, Comments]
⟶
[設定, 潛在變數 z, 評論]
- [Mixture of k Gaussians, Factor analysis]
⟶
[k 個高斯分佈的混合模型, 因素分析]
- Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method at estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:
⟶
演算法 - 最大期望演算法 (EM Algorithm) 是一種透過最大概似估計來估計參數 θ 的高效率方法,它反覆地建構概似函數的下界 (E-step),並最佳化該下界 (M-step),如下所示:
- E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:
⟶
E-step: 評估每個資料點 x(i) 來自特定群集 z(i) 的後驗機率 Qi(z(i)),如下:
- M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:
⟶
M-step: 使用後驗機率 Qi(z(i)) 作為資料點 x(i) 在各群集中的權重,分別重新估計每個群集的模型,如下:
- [Gaussians initialization, Expectation step, Maximization step, Convergence]
⟶
[高斯分佈初始化, E-Step, M-Step, 收斂]
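The alternating E-step and M-step above can be made concrete with a small sketch. Below is a minimal NumPy/SciPy implementation of EM for a mixture of k Gaussians, assuming data X of shape (m, n); the function name em_gmm, the random initialization, and the fixed iteration count are illustrative choices of this example, not part of the cheatsheet.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100, seed=0):
    m, n = X.shape
    rng = np.random.default_rng(seed)
    # Gaussians initialization: random means, identity covariances, uniform priors
    mu = X[rng.choice(m, size=k, replace=False)].copy()
    sigma = np.array([np.eye(n) for _ in range(k)])
    phi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: posterior Q_i(z^(i) = j) that x^(i) came from cluster j
        q = np.stack([phi[j] * multivariate_normal.pdf(X, mu[j], sigma[j])
                      for j in range(k)], axis=1)            # shape (m, k)
        q /= q.sum(axis=1, keepdims=True)
        # M-step: re-estimate each cluster using the posteriors as weights
        nj = q.sum(axis=0)                                   # effective cluster sizes
        phi = nj / m
        mu = (q.T @ X) / nj[:, None]
        for j in range(k):
            d = X - mu[j]
            sigma[j] = (q[:, j, None] * d).T @ d / nj[j]
    return phi, mu, sigma, q
```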
- k-means clustering
⟶
k-means 分群法
- We note c(i) the cluster of data point i and μj the center of cluster j.
⟶
我們以 c(i) 表示資料點 i 所屬的群集,而 μj 則是群集 j 的中心
- Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:
⟶
演算法 - 在隨機初始化群集中心點 μ1,μ2,...,μk∈Rn 後,k-means 演算法重複以下步驟直到收斂:
- [Means initialization, Cluster assignment, Means update, Convergence]
⟶
[中心點初始化, 指定群集, 更新中心點, 收斂]
- Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:
⟶
畸變函數 - 為了確認演算法是否收斂,我們定義以下的畸變函數:
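As a concrete illustration of the two alternating steps and of the distortion function J used to monitor convergence, here is a minimal NumPy sketch; the function names and the centroid-movement convergence test are assumptions of this example.

```python
import numpy as np

def distortion(X, c, mu):
    # J(c, mu) = sum_i || x^(i) - mu_{c^(i)} ||^2
    return np.sum((X - mu[c]) ** 2)

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Means initialization: pick k distinct data points as initial centroids
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Cluster assignment: c^(i) = argmin_j || x^(i) - mu_j ||^2
        c = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1), axis=1)
        # Means update: mu_j = mean of the points currently assigned to cluster j
        new_mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):    # convergence: centroids stopped moving
            break
        mu = new_mu
    return c, mu, distortion(X, c, mu)
```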
- Hierarchical clustering
⟶
階層式分群法
- Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that build nested clusters in a successive manner.
⟶
演算法 - 這是一種採用凝聚式 (agglomerative) 階層作法的分群演算法,以逐步合併的方式建立巢狀的群集。
- Types ― There are different sorts of hierarchical clustering algorithms that aims at optimizing different objective functions, which is summed up in the table below:
⟶
類型 - 階層式分群法有幾種不同的類型,差別在於所要最佳化的目標函數不同,整理如下表:
- [Ward linkage, Average linkage, Complete linkage]
⟶
[Ward 鏈結距離, 平均鏈結距離, 完整鏈結距離]
- [Minimize within cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance of between cluster pairs]
⟶
[最小化群內距離, 最小化各群彼此的平均距離, 最小化各群彼此的最大距離]
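For illustration, the three linkage criteria in the table above can be tried with SciPy's agglomerative hierarchical clustering; the toy data and the choice of cutting the tree into 3 clusters are arbitrary assumptions of this sketch.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy data; in practice X is your (m, n) feature matrix
X = np.random.default_rng(0).normal(size=(50, 2))

for method in ("ward", "average", "complete"):
    Z = linkage(X, method=method)                      # nested, agglomerative merges
    labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into 3 clusters
    print(method, np.bincount(labels)[1:])             # cluster sizes (labels start at 1)
```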
- Clustering assessment metrics
⟶
分群衡量指標
- In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.
⟶
在非監督式學習中,通常很難評估一個模型的表現,因為我們不像監督式學習那樣擁有代表正確答案的標籤
- Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:
⟶
輪廓係數 (Silhouette coefficient) - 我們指定 a 為一個樣本點和相同群集中其他資料點的平均距離、b 為一個樣本點和下一個最接近群集其他資料點的平均距離,輪廓係數 s 對於此一樣本點的定義為:
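A direct NumPy sketch of this definition for a single sample i, with a and b computed as described above; the helper name silhouette_sample is illustrative (scikit-learn also exposes aggregate silhouette metrics).

```python
import numpy as np

def silhouette_sample(X, labels, i):
    d = np.linalg.norm(X - X[i], axis=1)                     # distances from sample i
    same = (labels == labels[i]) & (np.arange(len(X)) != i)
    a = d[same].mean()                                       # mean distance within own cluster
    b = min(d[labels == lab].mean()                          # mean distance to nearest other cluster
            for lab in np.unique(labels) if lab != labels[i])
    return (b - a) / max(a, b)
```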
- Calinski-Harabaz index ― By noting k the number of clusters, Bk and Wk the between and within-clustering dispersion matrices respectively defined as
⟶
Calinski-Harabaz 指標 - 定義 k 是群集的數量,Bk 和 Wk 分別是群集之間與群集之內的離差矩陣 (dispersion matrices),定義如下:
- the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:
⟶
Calinski-Harabaz 指標 s(k) 可以衡量分群模型對各群集的劃分是否良好:分數越高,代表各群越密集且彼此分離得越清楚。其定義如下:
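A NumPy sketch of the index built from the traces of the between-cluster (Bk) and within-cluster (Wk) dispersion matrices, assuming data X and integer cluster labels; an equivalent score is also available in scikit-learn as calinski_harabasz_score.

```python
import numpy as np

def calinski_harabasz(X, labels):
    m, n = X.shape
    clusters = np.unique(labels)
    k = len(clusters)
    mu = X.mean(axis=0)                      # global centroid
    Bk = np.zeros((n, n))
    Wk = np.zeros((n, n))
    for j in clusters:
        Xj = X[labels == j]
        diff = (Xj.mean(axis=0) - mu)[:, None]
        Bk += len(Xj) * (diff @ diff.T)      # between-cluster dispersion
        Dj = Xj - Xj.mean(axis=0)
        Wk += Dj.T @ Dj                      # within-cluster dispersion
    return (np.trace(Bk) / (k - 1)) * ((m - k) / np.trace(Wk))
```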
- Dimension reduction
⟶
維度縮減
- Principal component analysis
⟶
主成份分析
- It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.
⟶
這是一種維度縮減的技巧,目的是找出能讓投影後資料變異數最大的方向,並將資料投影到這些方向上
- Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:
⟶
特徵值、特徵向量 - 給定一個矩陣 A∈Rn×n,我們說 λ 是 A 的特徵值,當存在一個特徵向量 z∈Rn∖{0},使得:
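The defining relation is shown as an image in the source cheatsheet; the standard form (recalled here) is:

```latex
% Defining relation of an eigenvalue/eigenvector pair of A
Az = \lambda z
```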
- Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:
⟶
譜定理 - 令 A∈Rn×n,如果 A 是對稱矩陣,則 A 可以透過一個實正交矩陣 U∈Rn×n 對角化。記 Λ=diag(λ1,...,λn),我們得到:
- diagonal
⟶
對角線
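The statement of the theorem appears as an image in the source cheatsheet, with "diagonal" labelling Λ; the standard form (recalled here) is:

```latex
% Spectral theorem: a symmetric A is diagonalized by a real orthogonal U
\exists \Lambda \text{ diagonal}, \quad A = U \Lambda U^{T}
```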
- Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of matrix A.
⟶
注意:與最大特徵值相關聯的特徵向量,稱為矩陣 A 的主特徵向量 (principal eigenvector)
- Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k dimensions by maximizing the variance of the data as follows:
⟶
演算法 - 主成份分析 (PCA) 是一種維度縮減的技巧,它會透過尋找資料最大變異的方式,將資料投影在 k 維空間上:
- Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.
⟶
第一步:正規化資料,讓資料平均值為 0、標準差為 1
- Step 2: Compute Σ = (1/m) ∑i=1..m x(i)x(i)T ∈ Rn×n, which is symmetric with real eigenvalues.
⟶
第二步:計算 Σ = (1/m) ∑i=1..m x(i)x(i)T ∈ Rn×n,它是一個具有實數特徵值的對稱矩陣
- Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.
⟶
第三步:計算 Σ 的 k 個正交主特徵向量 u1,...,uk∈Rn,也就是 k 個最大特徵值所對應的正交特徵向量
- Step 4: Project the data on spanR(u1,...,uk).
⟶
第四步:將資料投影到 spanR(u1,...,uk) 上
- This procedure maximizes the variance among all k-dimensional spaces.
⟶
在所有的 k 維空間之中,這個做法能讓投影後的資料保有最大的變異數
- [Data in feature space, Find principal components, Data in principal components space]
⟶
[資料在特徵空間, 尋找主成分, 資料在主成分空間]
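A minimal NumPy sketch of the four PCA steps above; the function name pca and the use of np.linalg.eigh are implementation choices of this example.

```python
import numpy as np

def pca(X, k):
    # Step 1: normalize the data to mean 0 and standard deviation 1
    Xn = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: Sigma = (1/m) sum_i x^(i) x^(i)^T, symmetric with real eigenvalues
    sigma = Xn.T @ Xn / len(Xn)
    # Step 3: orthogonal eigenvectors of the k largest eigenvalues
    eigvals, eigvecs = np.linalg.eigh(sigma)          # eigh: for symmetric matrices
    U = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    # Step 4: project the data on span(u_1, ..., u_k)
    return Xn @ U
```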
- Independent component analysis
⟶
獨立成分分析
- It is a technique meant to find the underlying generating sources.
⟶
這是用來尋找潛在生成來源的技巧
- Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:
⟶
假設 - 我們假設資料 x 是由 n 維來源向量 s=(s1,...,sn) 經由一個混合且非奇異的矩陣 A 產生,其中 si 為相互獨立的隨機變數,如下所示:
- The goal is to find the unmixing matrix W=A−1.
⟶
目的在於找到解混矩陣 (unmixing matrix) W=A−1
- Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:
⟶
Bell 和 Sejnowski 獨立成份分析演算法 - 此演算法透過以下步驟找出解混矩陣 W:
- Write the probability of x=As=W−1s as:
⟶
將 x=As=W−1s 的機率寫成:
- Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:
⟶
在給定訓練資料 {x(i),i∈[[1,m]]},並以 g 表示 sigmoid 函數的情況下,將對數概似函數寫成:
- Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:
⟶
因此,隨機梯度上升 (stochastic gradient ascent) 的學習規則是:對每個訓練樣本 x(i),我們透過以下方式更新 W:
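A minimal NumPy sketch of this learning rule. The update used here, W := W + α[(1 − 2 g(W x(i))) x(i)T + (WT)−1], is the standard Bell and Sejnowski form recalled from the CS229 notes (the source shows it only as an image); the step size, number of passes, and identity initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def bell_sejnowski_ica(X, alpha=0.01, n_passes=20, seed=0):
    m, n = X.shape
    rng = np.random.default_rng(seed)
    W = np.eye(n)                                    # unmixing matrix, W = A^(-1)
    for _ in range(n_passes):
        for i in rng.permutation(m):                 # one stochastic update per example
            x = X[i][:, None]                        # column vector x^(i)
            W += alpha * ((1 - 2 * sigmoid(W @ x)) @ x.T + np.linalg.inv(W.T))
    return W
```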