From d3f07406e3e85806f86912a4ffd4563648197e86 Mon Sep 17 00:00:00 2001
From: Lillian Weng
Date: Wed, 24 Apr 2024 17:42:13 -0700
Subject: [PATCH] update pca2
---
 pca_2/pca_2.qmd | 43 ++++++++++++++++++++++++++-----------------
 1 file changed, 26 insertions(+), 17 deletions(-)
diff --git a/pca_2/pca_2.qmd b/pca_2/pca_2.qmd
index c8f3fc75..ea247bce 100644
--- a/pca_2/pca_2.qmd
+++ b/pca_2/pca_2.qmd
@@ -34,29 +34,37 @@ jupyter:
 ## Dimensionality Reduction
-We often work with high-dimensional data or data containing *many* columns/features. Given the many dimensions, this data can be difficult to visualize and model. To use this data for visualization, EDA, and some modeling tasks, we look for a smaller **intrinsic dimension** to represent our data. More specifically, we want to take high-dimensional data and find a **smaller set** of **new features** (columns) that approximately capture the information contained in the original dataset; this is known as **dimensionality reduction**.
-
-We can frame dimensionality reduction as a matrix factorization problem. We want to factor our $n$ by $d$ data matrix $X$ into a lower-dimensional($n$ by $k$) matrix $C$ that when multiplied by $k$ by $d$ matrix $W$, approximately recovers the original data.
+We often work with high-dimensional data that contains *many* columns/features. Given all these dimensions, this data can be difficult to visualize and model. However, not all the data in this high-dimensional space is useful: there could be repeated features or outliers that make the data seem more complex than it really is. The most concise representation of high-dimensional data is its **intrinsic dimension**. Our goal with this lecture is to use **dimensionality reduction** to find the intrinsic dimension of a high-dimensional dataset. In other words, we want to find a smaller set of new features/columns that approximates the original data well without losing much information. This is especially useful because this smaller set of features allows us to better visualize the data and do EDA to understand which modeling techniques would fit the data well.
 ### Loss Minimization
-As with any model, our goal for this matrix factorization model is to minimize the reconstruction loss denoted as:
+In order to find the intrinsic dimension of a high-dimensional dataset, we'll use techniques from linear algebra. Suppose we have a high-dimensional dataset, $X$, that has $n$ rows and $d$ columns. We want to factor (split) $X$ into two matrices, $Z$ and $W$. $Z$ has $n$ rows and $k$ columns; $W$ has $k$ rows and $d$ columns.
+
+$$ X \approx ZW$$
+
+We can reframe this as a loss minimization problem: if we want $X$ to roughly equal $ZW$, their difference should be as small as possible, ideally 0. We measure this difference with a loss function, $L(Z, W)$:
 $$L(Z, W) = \frac{1}{n}\sum_{i=1}^{n}||X_i - Z_iW||^2$$
 Breaking down the variables in this formula:
 * $X_i$: A row vector from the original data matrix $X$, which we can assume is centered to a mean of 0.
 * $Z_i$: A row vector from the lower-dimension matrix $Z$. The rows of $Z$ are also known as **latent vectors** and are used for EDA.
-* $W$: We constrain our model so that $W$ is a row-orthonormal matrix (e.g., $WW^T = I$). The rows of $W$ are the **principal components**.
+* $W$: The rows of $W$ are the **principal components**. We constrain our model so that $W$ is a row-orthonormal matrix (i.e., $WW^T = I$).
 
-Using calculus, we know that this loss is minimized with respect to $W$ when $Z = XW^T$. We won't go through this proof in-depth, since it is out of scope for Data 100, but generally, we:
-* Used Lagrangian multipliers to introduce the orthonormality constraint on $W$.
-* Took the derivative with respect to $W$ (which requires vector calculus) and solved for 0.
+Using calculus and optimization techniques (take EECS 127 if you're interested!), we find that this loss is minimized when
+$$Z = XW^T$$
+The proof for this is out of scope for Data 100, but for those who are interested, we:
+* Use Lagrangian multipliers to introduce the orthonormality constraint on $W$.
+* Take the derivative with respect to $W$ (which requires vector calculus) and solve for 0.
 
-Conducting this loss minimization, we find that $\Sigma w^T = \lambda w^T$, which implies:
-1. $w$ is a **unitary eigenvector** of the covariance matrix $\Sigma$.
-2. The error is minimized when $w$ is the eigenvector with the **largest eigenvalue** $\lambda$.
+This gives us a very cool result:
 
-From this minimization, it becomes clear that the principal components are the eigenvectors with the largest eigenvalues of the covariance matrix. They represent the directions of **maximum variance** in the data. We can construct the latent factors, or the $Z$ matrix, by projecting the centered data $X$ onto the principal component vectors, $W^T$.
+$$\Sigma w^T = \lambda w^T$$
+
+$\Sigma$ is the covariance matrix of $X$. The equation above implies that:
+1. $w$ is a **unitary eigenvector** of the covariance matrix $\Sigma$. In other words, its norm is equal to 1: $||w||^2 = ww^T = 1$.
+2. The loss is minimized when $w$ is the eigenvector with the **largest eigenvalue** $\lambda$.
+
+This tells us that the principal components (rows of $W$) are the eigenvectors with the largest eigenvalues of the covariance matrix $\Sigma$. They represent the directions of **maximum variance** in the data. We can construct the latent factors, or the $Z$ matrix, by projecting the centered data $X$ onto the principal component vectors, $W^T$.
@@ -236,7 +244,7 @@ Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) can be
 ### Deriving Principal Components From SVD
-After centering the data matrix $X$ so that each column has a mean of 0, we find its SVD:
+After centering the original data matrix $X$ so that each column has a mean of 0, we find its SVD:
 $$ X = U S V^T $$
 Because $X$ is centered, the covariance matrix of $X$, $\Sigma$, is equal to $X^T X$. Rearranging this equation, we get
@@ -297,10 +305,11 @@ Note that to recover the original (uncentered) $X$ matrix, we would also need to
 **Note**: The notation used for PCA this semester differs from previous semesters a bit. Please bay careful attention to the terminology presented in this note.
 To summarize the terminology and concepts we've covered so far:
-1. Principal Component: The columns of $V$ . These vectors specify the principal coordinate system and represent the directions along which the most variance in the data is captured.
-2. Latent Vector Representation of $X$: The projection of our data matrix $X$ onto the principal components, $Z = XV = US$. In previous semesters, the terminology was different and this was termed the principal components of $X$. In other classes, the term principal coordinate is also used. The $i$-th latent vector refers to the $i$-th column of $V$, corresponding to the $i$-th largest singular value of $X$. The latent vector representation of $X$ using the first $k$ principal components is described as the ”best” rank-$k$ approximation of $X$.
-3. $S$ (as in SVD): The diagonal matrix containing all the singular values of $X$.
-4. $\Sigma$: The covariance matrix of $X$. Assuming $X$ is centered, $\Sigma = X^T X$. In previous semesters, the singular value decomposition of $X$ was written out as $X = U{\Sigma}V^T$. Note the difference between $\Sigma$ in that context compared to this semester.
+
+1. **Principal Component**: The columns of $V$. These vectors specify the principal coordinate system and represent the directions along which the most variance in the data is captured.
+2. **Latent Vector Representation** of $X$: The projection of our data matrix $X$ onto the principal components, $Z = XV = US$. In previous semesters, the terminology was different and this was termed the principal components of $X$. In other classes, the term principal coordinate is also used. The $i$-th latent vector refers to the $i$-th column of $V$, corresponding to the $i$-th largest singular value of $X$. The latent vector representation of $X$ using the first $k$ principal components is described as the “best” rank-$k$ approximation of $X$.
+3. **$S$** (as in SVD): The diagonal matrix containing all the singular values of $X$.
+4. **$\Sigma$**: The covariance matrix of $X$. Assuming $X$ is centered, $\Sigma = X^T X$. In previous semesters, the singular value decomposition of $X$ was written out as $X = U{\Sigma}V^T$. Note the difference between $\Sigma$ in that context compared to this semester.
 :::
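
The relationships stated in the revised text ($Z = XW^T$, $WW^T = I$, $Z = XV = US$, and the eigenvector property of the principal components) can be checked numerically. The following is a minimal sketch rather than code from the course notes: it assumes NumPy, a small synthetic dataset, and illustrative names (`X_raw`, `reconstruction_loss`, `W_rand`) that do not appear in the patch. It also scales the covariance by $1/n$, which changes the eigenvalues but not the eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n rows, d columns, with most variance in k directions.
n, d, k = 200, 5, 2
X_raw = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))
X_raw = X_raw + 0.1 * rng.normal(size=(n, d))

# Center each column to a mean of 0, as the notes assume.
X = X_raw - X_raw.mean(axis=0)

# SVD of the centered data: X = U S V^T (s holds the diagonal of S).
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Rows of W are the first k principal components (first k columns of V).
W = Vt[:k]            # shape (k, d)
Z = X @ W.T           # latent vectors, shape (n, k)


def reconstruction_loss(X, Z, W):
    """L(Z, W) = (1/n) * sum_i ||X_i - Z_i W||^2."""
    return np.mean(np.sum((X - Z @ W) ** 2, axis=1))


# W is row-orthonormal, and Z = XV = US restricted to the first k components.
w_is_orthonormal = np.allclose(W @ W.T, np.eye(k))
z_equals_us = np.allclose(Z, U[:, :k] * s[:k])

# Principal components are eigenvectors of the covariance matrix.
cov = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
top_vecs = eigvecs[:, ::-1][:, :k]          # k largest, as columns
top_vals = eigvals[::-1][:k]
# Eigenvectors are only defined up to sign, so compare absolute inner products.
eigvecs_match = np.allclose(np.abs(top_vecs.T @ W.T), np.eye(k), atol=1e-8)
eigvals_match = np.allclose(top_vals, s[:k] ** 2 / n)

# Compare against an arbitrary row-orthonormal W built from a QR decomposition.
Q, _ = np.linalg.qr(rng.normal(size=(d, k)))
W_rand = Q.T
loss_pca = reconstruction_loss(X, Z, W)
loss_rand = reconstruction_loss(X, X @ W_rand.T, W_rand)

print(w_is_orthonormal, z_equals_us, eigvecs_match, eigvals_match)
print(loss_pca, loss_rand)
```

With the principal components as the rows of `W`, the loss should equal the sum of the discarded squared singular values divided by $n$, and it should be no larger than the loss for any other row-orthonormal choice such as `W_rand`.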