  1. Unsupervised Learning cheatsheet

⟶ 非監督式學習參考手冊

  1. Introduction to Unsupervised Learning

⟶ 非監督式學習介紹

  1. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.

⟶ 動機 - 非監督式學習的目的是要找出未標籤資料 {x(1),...,x(m)} 之間的隱藏模式

  1. Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:

⟶ Jensen 不等式 - 令 f 為一個凸函數、X 為一個隨機變數,我們可以得到以下不等式:
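
The inequality itself is not reproduced in this text export; its standard statement is:

```latex
% Jensen's inequality: for a convex function f and a random variable X
f(\mathbb{E}[X]) \leqslant \mathbb{E}[f(X)]
```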

  1. Clustering

⟶ 分群

  1. Expectation-Maximization

⟶ 最大期望演算法 (Expectation-Maximization)

  1. Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:

⟶ 潛在變數 (Latent variables) - 潛在變數指的是隱藏/沒有觀察到的變數,這會讓問題的估計變得困難,我們通常使用 z 來代表它。底下是潛在變數的常見設定:

  1. [Setting, Latent variable z, Comments]

⟶ [設定, 潛在變數 z, 評論]

  1. [Mixture of k Gaussians, Factor analysis]

⟶ [k 個高斯混合模型, 因素分析]

  1. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method for estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:

⟶ 演算法 - 最大期望演算法 (EM Algorithm) 透過重複建構概似函數的下界 (E-step) 並最佳化該下界 (M-step),以最大概似估計有效率地估計參數 θ,步驟如下:

  1. E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:

⟶ E-step: 如下評估每個資料點 x(i) 來自特定群集 z(i) 的後驗機率 Qi(z(i)):

  1. M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:

⟶ M-step: 使用後驗機率 Qi(z(i)) 作為資料點 x(i) 的群集特定權重,分別重新估計每個群集模型,如下:
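
The E-step and M-step formulas are not reproduced in this text export; a standard way to write them is (with the sum replaced by an integral for continuous latent variables):

```latex
% E-step: posterior probability of the latent assignment
Q_i\left(z^{(i)}\right) = P\left(z^{(i)} \mid x^{(i)}; \theta\right)

% M-step: maximize the resulting lower bound on the log-likelihood
\theta = \underset{\theta}{\arg\max} \sum_i \sum_{z^{(i)}} Q_i\left(z^{(i)}\right) \log\frac{P\left(x^{(i)}, z^{(i)}; \theta\right)}{Q_i\left(z^{(i)}\right)}
```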

  1. [Gaussians initialization, Expectation step, Maximization step, Convergence]

⟶ [高斯分佈初始化, E-Step, M-Step, 收斂]
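
As a usage sketch (not part of the original cheatsheet), scikit-learn's GaussianMixture fits a mixture of Gaussians with this EM loop; the data below is a placeholder:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(5, 1, size=(100, 2))])

# Fit a 2-component Gaussian mixture with the EM algorithm.
gm = GaussianMixture(n_components=2, random_state=0).fit(X)

labels = gm.predict(X)             # hard cluster assignments
posteriors = gm.predict_proba(X)   # the Qi(z(i)) of the E-step
```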

  1. k-means clustering

⟶ k-means 分群法

  1. We note c(i) the cluster of data point i and μj the center of cluster j.

⟶ 我們用 c(i) 表示資料點 i 所屬的群集,μj 表示群集 j 的中心

  1. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:

⟶ 演算法 - 在隨機初始化群集中心點 μ1,μ2,...,μk∈Rn 後,k-means 演算法重複以下步驟直到收斂:
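
The two repeated steps, not reproduced in this text export, are the cluster-assignment and mean-update rules:

```latex
% Cluster assignment: send each point to its closest centroid
c^{(i)} = \underset{j}{\arg\min} \left\| x^{(i)} - \mu_j \right\|^2

% Means update: each centroid becomes the mean of its assigned points
\mu_j = \frac{\sum_{i=1}^{m} 1_{\{c^{(i)} = j\}}\, x^{(i)}}{\sum_{i=1}^{m} 1_{\{c^{(i)} = j\}}}
```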

  1. [Means initialization, Cluster assignment, Means update, Convergence]

⟶ [中心點初始化, 指定群集, 更新中心點, 收斂]

  1. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:

⟶ 畸變函數 - 為了確認演算法是否收斂,我們定義以下的畸變函數:
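
The distortion is not reproduced in this text export; it is the total squared distance of each point to its assigned centroid:

```latex
J(c, \mu) = \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2
```

A minimal NumPy sketch of the whole k-means loop, returning the distortion so convergence can be monitored (function and variable names are illustrative):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: random centroid initialization, then assign/update steps."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]   # means initialization
    for _ in range(n_iter):
        # Cluster assignment: index of the closest centroid for each point.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # Means update: mean of the points assigned to each cluster.
        new_mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):                      # convergence
            break
        mu = new_mu
    # Distortion J(c, mu): sum of squared distances to the assigned centroids.
    J = ((X - mu[c]) ** 2).sum()
    return c, mu, J
```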

  1. Hierarchical clustering

⟶ 階層式分群法

  1. Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that builds nested clusters in a successive manner.

⟶ 演算法 - 這是一種採用凝聚式 (agglomerative) 階層做法的分群演算法,以逐步合併的方式建立巢狀的群集。

  1. Types ― There are different sorts of hierarchical clustering algorithms that aim at optimizing different objective functions, which are summed up in the table below:

⟶ 類型 - 階層式分群法有數種不同類型,差別在於所要最佳化的目標函數不同,整理如下表:

  1. [Ward linkage, Average linkage, Complete linkage]

⟶ [Ward 鏈結距離, 平均鏈結距離, 完整鏈結距離]

  1. [Minimize within cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance of between cluster pairs]

⟶ [最小化群內距離, 最小化各群彼此的平均距離, 最小化各群彼此的最大距離]
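
As a usage sketch, these three objectives correspond to the linkage parameter of scikit-learn's AgglomerativeClustering (the data X below is a placeholder):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.default_rng(0).normal(size=(50, 2))   # placeholder data

# 'ward' minimizes the within-cluster variance; 'average' and 'complete'
# minimize the average / maximum distance between cluster pairs.
for linkage in ("ward", "average", "complete"):
    labels = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit_predict(X)
```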

  1. Clustering assessment metrics

⟶ 分群衡量指標

  1. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.

⟶ 在非監督式學習中,通常很難評估一個模型的好壞,因為我們不像監督式學習那樣擁有正確答案 (ground truth) 的標籤

  1. Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:

⟶ 輪廓係數 (Silhouette coefficient) - 我們指定 a 為一個樣本點和相同群集中其他資料點的平均距離、b 為一個樣本點和下一個最接近群集其他資料點的平均距離,輪廓係數 s 對於此一樣本點的定義為:
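
The formula is not reproduced in this text export; it reads:

```latex
% Silhouette coefficient of a single sample
s = \frac{b - a}{\max(a, b)}
```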

  1. Calinski-Harabaz index ― By noting k the number of clusters, Bk and Wk the between- and within-cluster dispersion matrices respectively defined as

⟶ Calinski-Harabaz 指標 - 令 k 為群集的數量,Bk 和 Wk 分別是群集之間與群集之內的離差矩陣 (dispersion matrices),定義為:
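
A standard way to write these dispersion matrices, with nj the number of points in cluster j, μj its centroid and μ the overall mean of the data:

```latex
B_k = \sum_{j=1}^{k} n_j \left(\mu_j - \mu\right)\left(\mu_j - \mu\right)^T,
\qquad
W_k = \sum_{i=1}^{m} \left(x^{(i)} - \mu_{c^{(i)}}\right)\left(x^{(i)} - \mu_{c^{(i)}}\right)^T
```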

  1. the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:

⟶ Calinski-Harabaz 指標 s(k) 用來衡量分群模型定義群集的好壞,分數越高,代表各群集越密集且彼此分離得越好。定義如下:
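
With N the total number of samples, the index is:

```latex
s(k) = \frac{\mathrm{Tr}(B_k)}{\mathrm{Tr}(W_k)} \times \frac{N - k}{k - 1}
```

Both this index and the silhouette coefficient above are available in scikit-learn (the data X is a placeholder):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

X = np.random.default_rng(0).normal(size=(60, 2))   # placeholder data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)         # closer to +1 is better
ch = calinski_harabasz_score(X, labels)   # higher is better
```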

  1. Dimension reduction

⟶ 維度縮減

  1. Principal component analysis

⟶ 主成份分析

  1. It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.

⟶ 這是一種維度縮減的技巧,目的是找出能使投影後資料變異數最大的方向,並將資料投影到這些方向上

  1. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:

⟶ 特徵值、特徵向量 - 給定一個矩陣 A∈Rn×n,我們說 λ 是 A 的特徵值,當存在一個特徵向量 z∈Rn∖{0},使得:
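
The defining relation, not reproduced in this text export, is:

```latex
% z is an eigenvector of A associated with eigenvalue lambda
A z = \lambda z
```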

  1. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:

⟶ 譜定理 - 令 A∈Rn×n,如果 A 是對稱的,則 A 可以透過一個實正交矩陣 U∈Rn×n 對角化。令 Λ=diag(λ1,...,λn),我們得到:
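
The decomposition reads:

```latex
% Eigendecomposition of a real symmetric matrix by an orthogonal U
\exists\, \Lambda \text{ diagonal}, \quad A = U \Lambda U^T
```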

  1. diagonal

⟶ 對角線

  1. Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of matrix A.

⟶ 注意:與最大特徵值相關聯的特徵向量稱為矩陣 A 的主特徵向量

  1. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k dimensions by maximizing the variance of the data as follows:

⟶ 演算法 - 主成份分析 (PCA) 是一種維度縮減的技巧,它會透過尋找資料最大變異的方式,將資料投影在 k 維空間上:

  1. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.

⟶ 第一步:將資料正規化,使其平均值為 0、標準差為 1

  1. Step 2: Compute $\Sigma=\frac{1}{m}\sum_{i=1}^{m}x^{(i)}{x^{(i)}}^T\in\mathbb{R}^{n\times n}$, which is symmetric with real eigenvalues.

⟶ 第二步:計算 $\Sigma=\frac{1}{m}\sum_{i=1}^{m}x^{(i)}{x^{(i)}}^T\in\mathbb{R}^{n\times n}$,它是一個具有實數特徵值的對稱矩陣

  1. Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.

⟶ 第三步:計算 Σ 的 k 個正交主特徵向量 u1,...,uk∈Rn,也就是 k 個最大特徵值所對應的正交特徵向量

  1. Step 4: Project the data on spanR(u1,...,uk).

⟶ 第四步:將資料投影到 spanR(u1,...,uk) 上

  1. This procedure maximizes the variance among all k-dimensional spaces.

⟶ 在所有的 k 維空間中,這個做法能讓投影後資料的變異數最大

  1. [Data in feature space, Find principal components, Data in principal components space]

⟶ [資料在特徵空間, 尋找主成分, 資料在主成分空間]
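
A minimal NumPy sketch of the four steps above (names are illustrative; in practice a library implementation such as sklearn.decomposition.PCA would be used):

```python
import numpy as np

def pca(X, k):
    """Project the m-by-n data matrix X onto its top-k principal components."""
    # Step 1: normalize each feature to mean 0 and standard deviation 1.
    Xn = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: Sigma = (1/m) * sum of x(i) x(i)^T, symmetric with real eigenvalues.
    sigma = Xn.T @ Xn / len(Xn)
    # Step 3: orthogonal eigenvectors of the k largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(sigma)          # eigh: for symmetric matrices
    U = eigvecs[:, np.argsort(eigvals)[::-1][:k]]     # columns u1, ..., uk
    # Step 4: project the data on span(u1, ..., uk).
    return Xn @ U
```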

  1. Independent component analysis

⟶ 獨立成分分析

  1. It is a technique meant to find the underlying generating sources.

⟶ 這是用來尋找潛在生成來源的技巧

  1. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:

⟶ 假設 - 我們假設資料 x 是由 n 維來源向量 s=(s1,...,sn) 經過一個混合且非奇異的矩陣 A 產生,其中 si 為彼此獨立的隨機變數,如下:
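
The generative model, not reproduced in this text export, is simply:

```latex
% Observed data as a non-singular mixing of independent sources
x = A s
```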

  1. The goal is to find the unmixing matrix W=A−1.

⟶ 目標是找到解混矩陣 (unmixing matrix) W=A−1

  1. Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:

⟶ Bell 和 Sejnowski 獨立成份分析演算法 - 此演算法透過以下步驟找到解混矩陣 W:

  1. Write the probability of x=As=W−1s as:

⟶ 將 x=As=W−1s 的機率寫成:
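
With wi denoting the rows of W and ps the density of each source, one common way to write this probability is:

```latex
p(x) = \prod_{i=1}^{n} p_s\left(w_i^T x\right) \cdot |W|
```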

  1. Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:

⟶ 在給定訓練資料 {x(i),i∈[[1,m]]},並令 g 為 sigmoid 函數的情況下,將對數概似函數寫成:
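
One common form of this log-likelihood is:

```latex
l(W) = \sum_{i=1}^{m} \left( \sum_{j=1}^{n} \log\left( g'\left(w_j^T x^{(i)}\right) \right) + \log |W| \right)
```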

  1. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:

⟶ 因此,隨機梯度上升 (stochastic gradient ascent) 的學習規則為:對每一個訓練樣本 x(i),我們用以下方式更新 W:
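
With α the learning rate, the update rule is commonly written as:

```latex
W \leftarrow W + \alpha \left(
\begin{pmatrix}
1 - 2g\left(w_1^T x^{(i)}\right) \\
\vdots \\
1 - 2g\left(w_n^T x^{(i)}\right)
\end{pmatrix} {x^{(i)}}^T
+ \left(W^T\right)^{-1}
\right)
```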