PCA vs FA Theory and Steps
Wold, S., Esbensen, K., & Geladi, P. (1987). Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3), 37-52.
Jolliffe, I. T. (1986). Principal Component Analysis (1st ed.). Springer.
Jolliffe, I. T. (2002). Principal Component Analysis (2nd ed.). Springer. https://doi.org/10.1007/b98835
Abdi, H., & Williams, L. J. (2010). Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4), 433-459.
Bro, R., & Smilde, A. K. (2014). Principal component analysis. Analytical Methods, 6(9), 2812-2831.
Jolliffe, I. (2005). Principal component analysis. In Encyclopedia of Statistics in Behavioral Science.
Fabrigar, L. R., & Wegener, D. T. (2011). Exploratory Factor Analysis. Oxford University Press.
Cudeck, R. (2000). Exploratory factor analysis. In Handbook of Applied Multivariate Statistics and Mathematical Modeling (pp. 265-296). Academic Press.
Costello, A. B., & Osborne, J. (2005). Best practices in exploratory factor analysis: Four recommendations for getting the most from your analysis. Practical Assessment, Research, and Evaluation, 10(1), 7.
- How PCA can take more than 2 survey items in a particular construct and make a 2-D PCA plot.
- PCA can tell us which survey item is the most valuable for clustering the data.
- We calculate the average measurement of all samples for survey item 1 and survey item 2. With these average values, we calculate the center of the data.
- Then we shift the data so that its center sits on the origin. Shifting the data does not change how the data points are positioned relative to one another.
- Then we rotate the line until it fits the data as well as it can.
- PCA projects the data onto the line.
- We can measure the distances from the data points to the line and try to find the line that minimizes those distances...
- ...or we can measure the distances from the projected points to the origin (the green lines), which get larger as the line fits the data better.
- We get a right angle between the black dotted line and the red dotted line when we project the data onto the red dotted line.
- Based on the Pythagorean theorem, since a^2 doesn't change, if b gets bigger then c must get smaller, and vice versa. So PCA can either minimize the distance from the data to the line or maximize the distance from the projected point to the origin.
- PCA measures the distance from each projected point to the origin. We have 6 points on the line, so there are d1, d2, d3, d4, d5, d6.
- Then we square all 6 distances so that negative values don't cancel out positive values.
- d1^2 + d2^2 + d3^2 + d4^2 + d5^2 + d6^2
- We call the end result the sum of squared distances.
- Then we rotate the line and repeat the calculation until we end up with the line that has the largest sum of squared distances between the projected points and the origin.
- In the end, we have the line with the largest sum of squared distances, as sketched below.
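To make the rotate-and-measure idea concrete, here is a minimal R sketch. The two-item scores are made up for illustration; the code centers the data and tries many candidate lines through the origin, keeping the one with the largest sum of squared projected distances:

```r
# Made-up scores for two survey items (illustration only)
set.seed(1)
item1 <- rnorm(6, mean = 10, sd = 4)
item2 <- 0.25 * item1 + rnorm(6, sd = 1)
X <- cbind(item1, item2)

# Center the data: subtract the average of each item so the center sits at the origin
X_centered <- scale(X, center = TRUE, scale = FALSE)

# Try many candidate lines through the origin, one per angle
angles <- seq(0, pi, length.out = 1000)
ss <- sapply(angles, function(a) {
  u <- c(cos(a), sin(a))          # unit vector describing the candidate line
  sum((X_centered %*% u)^2)       # sum of squared distances from the projections to the origin
})

best <- which.max(ss)
c(cos(angles[best]), sin(angles[best]))   # the "recipe" (loading scores) for PC1
max(ss)                                   # the eigenvalue of PC1 (largest sum of squared distances)
```

prcomp() finds the same direction analytically; the brute-force loop is only meant to mirror the rotate-and-measure description above.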
- This line is called Principal Component 1, or PC1 for short. PC1 has a slope of 0.25, which means that for every 4 units we go out along the survey item 1 axis, we go up 1 unit along the survey item 2 axis.
- This means that the data are mostly spread out along the survey item 1 axis, and only a little bit spread out along the survey item 2 axis.
- We call this recipe of 4 parts survey item 1 and 1 part survey item 2 a "linear combination" of survey item 1 and survey item 2.
- Using the Pythagorean theorem, we get the length of the red line: sqrt(4^2 + 1^2) ≈ 4.12.
- When we do PCA with SVD, the recipe for PC1 is scaled so that its length is 1 instead of 4.12.
- With SVD, we just divide all 3 sides of the triangle by 4.12.
- This 1-unit-long vector (0.97 parts survey item 1 and 0.242 parts survey item 2) is called the "singular vector" or "eigenvector" of PC1, and its entries are the loading scores, as checked in the snippet below.
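That scaling step can be verified with a couple of lines of R; dividing the 4-and-1 recipe by its length gives the loading scores quoted above:

```r
recipe <- c(item1 = 4, item2 = 1)   # 4 parts survey item 1, 1 part survey item 2
len <- sqrt(sum(recipe^2))          # length of the red line, about 4.12
len
recipe / len                        # about 0.97 and 0.242, the unit-length eigenvector for PC1
```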
- PCA calls the sum of squared distances for the best-fitting line the eigenvalue of PC1.
- The square root of the eigenvalue of PC1 is called the singular value of PC1.
- PC2 is perpendicular to PC1.
- Based on the diagrams below, the loading scores of PC2 are -0.242 for survey item 1 and 0.97 for survey item 2. We call the blue arrow the singular vector or eigenvector of PC2. For PC2, survey item 2 is 4 times as important as survey item 1.
- The eigenvalue of PC2 is the sum of squared distances between the projected points and the origin.
- To draw the PCA plot, we rotate everything so that PC1 is horizontal.
- That's how PCA is done using Singular Value Decomposition (SVD); a sketch follows below.
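As a rough sketch of the SVD route, svd() on the centered matrix returns the singular vectors and singular values directly, and prcomp() reports the same information. The two-item data are again made up:

```r
# Made-up two-item data, centered so the average of each item is at the origin
set.seed(1)
item1 <- rnorm(6, mean = 10, sd = 4)
item2 <- 0.25 * item1 + rnorm(6, sd = 1)
X <- cbind(item1, item2)
X_centered <- scale(X, center = TRUE, scale = FALSE)

s <- svd(X_centered)
s$v[, 1]       # singular vector (eigenvector) for PC1
s$d[1]         # singular value of PC1
s$d[1]^2       # eigenvalue of PC1 = sum of squared distances to the origin

# prcomp() gives the same directions; its sdev is the singular value divided by sqrt(n - 1)
pca <- prcomp(X, center = TRUE, scale. = FALSE)
pca$rotation                 # loading scores (eigenvectors) for PC1 and PC2
pca$sdev^2 * (nrow(X) - 1)   # recover the eigenvalues (sums of squared distances)
```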
- SS(distances for PC1) = eigenvalue for PC1
- SS(distances for PC2) = eigenvalue for PC2
- The sum of squared distances is obtained by measuring the distances from the projected points to the origin, squaring them, and adding them together.
- We can convert these sums of squared distances into variation around the origin (0,0) by dividing by the sample size minus 1.
- For example, if the variation for PC1 is 15 and the variation for PC2 is 3, the total variation for both PCs is 18. That means PC1 accounts for 15/18 = 83% of the total variation around the PCs, and PC2 accounts for 3/18 = 17%.
- A scree plot is a graphical representation of the percentage of variation that each PC accounts for (see the sketch below).
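Continuing the same made-up example, the per-PC variation and the scree plot fall out of prcomp() directly:

```r
# Reusing the made-up two-item matrix X from the earlier sketch
pca <- prcomp(X, center = TRUE, scale. = FALSE)

variation <- pca$sdev^2                 # sum of squared distances divided by (n - 1)
round(variation / sum(variation), 2)    # proportion of total variation per PC, e.g. 0.83 and 0.17

# Scree plot: variance accounted for by each principal component
screeplot(pca, type = "lines", main = "Scree plot")
```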
- PCA with 3 variables is pretty much the same as with 2 variables.
- Center the data.
- Find the best-fitting line that goes through the origin.
- If there are 3 variables, the recipe for PC1 now has 3 ingredients.

<img src="https://github.com/ironhacks/analysis-2017/blob/main/images/PCA-image30.png" width="400" height="300">

- For example, 0.62 parts survey item 1, 0.15 parts survey item 2, and 0.77 parts survey item 3.
- Now survey item 3 is the most important ingredient for PC1.
- Then we find PC2, the next-best-fitting line that goes through the origin and is perpendicular to PC1.
- If we had more survey items, we would just keep finding more principal components by adding perpendicular lines and rotating through them.
- In theory, there is one PC per variable, but in practice the number of PCs is the smaller of the number of variables and the number of samples.
- Once we have all the principal components figured out, we can use the eigenvalues (sums of squared distances) to determine the proportion of variation that each PC accounts for.
- In the example with PC1, PC2, and PC3: PC1 accounts for 79% of the variation, PC2 accounts for 15%, and PC3 accounts for 6%.
- That means a 2-D graph using just PC1 and PC2 would account for 94% of the variation in the data.
- To convert the 3-D graph into a 2-D graph, we strip away everything but the data, PC1, and PC2.
- Then we project the samples onto PC1 and PC2.
- Then we rotate so that PC1 is horizontal and PC2 is vertical (this just makes the plot easier to look at).
- Even when there are too many survey items to draw the data, this doesn't stop us from doing the PCA math, which doesn't care whether we can draw the data or not. In the case below, PC1 and PC2 account for 90% of the variation, so we can just use those two PCs to draw a 2-dimensional PCA graph by projecting the samples onto the first 2 PCs (see the sketch below).
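With three or more survey items the steps are the same; here is a small sketch (again with made-up data) of projecting the samples onto PC1 and PC2 to draw the 2-D PCA plot:

```r
# Made-up data: 10 samples on 3 survey items (illustration only)
set.seed(2)
X3 <- cbind(item1 = rnorm(10, 10, 4),
            item2 = rnorm(10, 10, 1),
            item3 = rnorm(10, 10, 5))

pca3 <- prcomp(X3, center = TRUE, scale. = FALSE)

summary(pca3)    # proportion of variation accounted for by PC1, PC2, PC3

# The projected coordinates of each sample on the PCs; the first two columns
# give the 2-D PCA plot, with PC1 horizontal and PC2 vertical
plot(pca3$x[, 1], pca3$x[, 2], xlab = "PC1", ylab = "PC2")
```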
- PCA and FA are both data reduction methods used to express multivariate data with fewer dimensions.
- The goal of these methods is to re-orient the data so that a multitude of original variables can be summarized with relatively few factors or components that capture the maximum possible information from the variables.
- PCA is also useful in identifying patterns of association across variables.
- Factor analysis and principal component analysis are similar methods used for the reduction of multivariate data; the difference between them is that factor analysis assumes the existence of a few common factors driving the variation in the data, while principal component analysis does not make such an assumption.
- The goal of EFA is to develop sound instruments.
- Say we are interested in a particular construct; we then write a number of survey items informed by theory.
- Analysis of the covariance matrix -> extraction of factors (just like creating a matrix of correlations).
- The goal is to group items together, explaining as much of the covariance as possible.
- Each group of items is called a factor.
- A few factors are more parsimonious than many items.
- Initially, the number of factors extracted equals the number of items.
- The type of rotation chosen depends on which factor categories make more sense (more trial and error).
- Common factor model: the observed variance in each measure is attributable to a relatively small number of common factors and a single specific factor (unrelated to the other factors in the model).
- The common factors contribute to the variation in all of the variables X.
- The specific factor can be thought of as the error term.
- Factor analysis is appropriate when there is a "latent trait" or "unobservable characteristic".
- The factor scores can be obtained from the analysis of dependence.
- Factor analysis is used with survey questions about attitudes; the goal is to identify common factors that capture the variance from these questions and that can also be used as factor scores.
- Assumptions needed to determine a solution to the common factor model:
  - The common factors are uncorrelated with each other.
  - The specific factors are uncorrelated with each other.
  - The common factors and the specific factors are uncorrelated with each other.
- Perform a literature search on the construct. Let's say our construct is motivation.
- Develop a hypothesis based on the possible number of latent factors.
- Check the dataset for excessive skewness and kurtosis
- Perform correlations such as Pearson to check for non-trivial/trivial pairwise relationships.
- Perform parallel analysis to identify the number of factors.
- Perform factor analysis comparing the different types of rotation (Oblimin, Varimax, Promax).
- Validate the EFA with the chi-square test of model fit, RMSEA, and the Tucker-Lewis Index.
- The Tucker-Lewis Index (TLI) is a model fit statistic. A rule of thumb is that models with TLI > .90 fit the data well (Little, 2013).
- The chi-square test of model fit tests the null hypothesis that "the model fits the data." We want the p-value to be > .05, suggesting that the model fits the data well. With "large" samples (say, greater than 400-500), minor misfit between each person's data and the model may accumulate to produce a significant chi-square value.
- The root mean square error of approximation (RMSEA) is another model fit statistic. A rule of thumb is that models with RMSEA ≤ .05 fit the data well, ≤ .08 fit the data acceptably, and .08-.10 is a marginally acceptable fit (Little, 2013).
- When performing EFA using Principal Axis Factoring with Promax rotation, Osborne, Costello, & Kellow (2008) suggest that communalities above 0.4 are acceptable; the sketch below shows where these statistics come out of an R analysis.
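A rough sketch of this EFA workflow in R, assuming the psych package is installed and using simulated responses with two latent factors in place of real survey data:

```r
library(psych)

# Simulated responses to 6 items driven by two latent factors (illustration only)
set.seed(3)
n  <- 200
f1 <- rnorm(n); f2 <- rnorm(n)
survey <- data.frame(
  q1 = 0.8 * f1 + rnorm(n, sd = 0.5),
  q2 = 0.7 * f1 + rnorm(n, sd = 0.5),
  q3 = 0.6 * f1 + rnorm(n, sd = 0.5),
  q4 = 0.8 * f2 + rnorm(n, sd = 0.5),
  q5 = 0.7 * f2 + rnorm(n, sd = 0.5),
  q6 = 0.6 * f2 + rnorm(n, sd = 0.5)
)

# Check pairwise relationships
round(cor(survey), 2)

# Parallel analysis to suggest the number of factors
fa.parallel(survey, fa = "fa")

# Principal axis factoring with promax rotation
efa <- fa(survey, nfactors = 2, fm = "pa", rotate = "promax")

efa$loadings       # factor loadings
efa$communality    # communalities (rule of thumb: above 0.4 is acceptable)
efa$TLI            # Tucker-Lewis Index (want > .90)
efa$RMSEA          # RMSEA (want roughly <= .05 - .08)
efa$PVAL           # p-value of the chi-square test of model fit (want > .05)
```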
Bandalos, D. L. (2018). Measurement Theory and Applications for the Social Sciences. The Guilford Press.
Costello, A. B., & Osborne, J. (2005). Best practices in exploratory factor analysis: Four recommendations for getting the most from your analysis. Practical Assessment, Research & Evaluation, 10, 1-9.
Little, T. D. (2013). Longitudinal Structural Equation Modeling. Guilford Press.
- Read in the Data
- Plot a Correlation Matrix
- Call prcomp
- DotPlot the PCA loadings
- Apply the Kaiser Criterion
- Make a screeplot
- Plot the Biplot
- Apply the varimax rotation.
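A rough end-to-end sketch of those steps in R; the file name, column contents, and the decision to standardize the items are assumptions, not part of the original notes:

```r
# Read in the data (hypothetical file name)
survey <- read.csv("survey_items.csv")

# Plot a correlation matrix
cor_mat <- cor(survey, use = "pairwise.complete.obs")
heatmap(cor_mat, symm = TRUE, scale = "none")

# Call prcomp on the standardized items
pca <- prcomp(survey, center = TRUE, scale. = TRUE)

# Dot plot of the PC1 loadings
dotchart(sort(pca$rotation[, 1]), main = "PC1 loadings")

# Kaiser criterion: keep components with eigenvalue (variance) greater than 1
eigenvalues <- pca$sdev^2
sum(eigenvalues > 1)

# Scree plot and biplot
screeplot(pca, type = "lines")
biplot(pca)

# Varimax rotation of the retained loadings (assuming at least two components are kept)
varimax(pca$rotation[, eigenvalues > 1, drop = FALSE])
```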
Abdi, H. and Williams, L.J. (2010), Principal component analysis. WIREs Comp Stat, 2: 433-459. https://doi.org/10.1002/wics.101
Multivariate Statistics — Reducing the number of variables
Multivariate Statistics is a group of statistical methods that study multiple variables together, focusing on the variation that those variables have in common.
Multivariate Statistics deals with the treatment of data sets with a large number of dimensions. Its goals are therefore different from supervised modeling, but also different from segmentation and clustering models. There are many models in the family of Multivariate Statistics. In this article, I will focus on the difference between PCA and Factor Analysis, two commonly used Multivariate models.
Modern datasets often have a very large number of variables. This makes it difficult to inspect each of the variables individually, due to the practical fact that the human mind cannot easily digest data on such a large scale. When a dataset contains a large number of variables, there is often a serious amount of overlap between those variables.
The components found by PCA are ordered from the highest information content to the lowest information content.
PCA is a statistical method that allows you to “regroup” your variables into a smaller number of variables, called components. This regrouping is done based on variation that is common to multiple variables.
The goal of PCA is to regroup variables in such a way that the first (newly created) component contains a maximum of variation. The second component contains the second-largest amount of variation, and so on. The last component logically contains the smallest amount of variation.
Thanks to this ordering of components, it is made possible to retain only a few of the newly created components, while still retaining a maximum amount of variation. We can then use the components rather than the original variables for data exploration.
The mathematical definition of the PCA problem is to find a linear combination of the original variables with maximum variance.
This means that we are going to create a (new) component. Let’s call it z. This z is going to be computed based on our original variables (X1, X2, …) multiplied by a weight for each of our variables (u1, u2, …).
This can be written as z = Xu.
The mathematical goal is to find the values for u that will maximize the variance of z, with a constraint of unit length on u. This problem is mathematically called a constrained optimization using Lagrange Multiplier, but in practice, we use computers to do the whole PCA operation at once.
This can also be described as applying matrix decomposition to the correlation matrix of the original variables. PCA is efficient in finding the components that maximize variance. This is great if we are interested in reducing the number of variables while keeping a maximum of variance.
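In R terms, that amounts to an eigen-decomposition of the correlation matrix; a small sketch with made-up data shows that the first eigenvector is the weight vector u and the first eigenvalue is the maximized variance of z = Xu:

```r
# Made-up data with 4 variables and some overlap between them (illustration only)
set.seed(4)
X <- matrix(rnorm(100 * 4), ncol = 4)
X[, 2] <- X[, 1] + rnorm(100, sd = 0.5)

# Eigen-decomposition of the correlation matrix of the original variables
R <- cor(X)
decomposition <- eigen(R)

u <- decomposition$vectors[, 1]    # weights with unit length (the constraint on u)
z <- scale(X) %*% u                # z = Xu, the first component

decomposition$values[1]            # the maximized variance of z
var(as.vector(z))                  # same value
```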
Sometimes, however, we are not purely interested in maximizing variance: we might want to give the most useful interpretations to our newly defined dimensions. And this is not always easiest with the solution found by a PCA. We can then apply Factor Analysis: an alternative to PCA that has a little bit more flexibility.
Goal — Finding latent variables in a data set
Just like PCA, Factor Analysis is also a model that allows reducing information in a larger number of variables into a smaller number of variables. In Factor Analysis we call those “latent variables”.
Factor Analysis tries to find latent variables that make sense to us. We can rotate the solution until we find latent variables that have a clear interpretation and "make sense".
Factor Analysis is based on a model called the common factor model. It starts from the principle that there is a certain number of factors in a data set, and that each of the measured variables captures a part of one or more of those factors.
An example of Factor Analysis is given in the following schema. Imagine many students in a school. They all get grades for many subject matters. We could imagine that these different grades are partly correlated: a more intellectually gifted student would have higher grades overall. This would be an example of a latent variable.
But we could also imagine having students who are overall good in languages, but bad in technical subjects. In this case, we could try to find a latent variable for language ability and a second latent variable for technical ability.
We now have latent variables that measure the general ability of a student for Language and Technical subjects. But it would still be possible that some students are great at languages overall, but that they are just bad at German. This is why the Common Factor Model has specific factors: they capture the part of each measured variable that is not explained by the common factors. We could describe it as "ability for learning German while taking into account the general ability for learning languages".
As said, the mathematical model in Factor Analysis is much more conceptual than the PCA model. Where the PCA model is more of a pragmatic approach, in Factor Analysis we are hypothesizing that latent variables exist. In a case with two latent variables, we can compute our original variables X by attributing one part of its variation to our first common latent variable (let’s call it k1), part to the second common latent variable (k2), and part to a specific factor (specific to this variable; called d). In a case with 4 original variables, the Factor Analysis model would be as follows:
X1 = c11 * k1 + c12 * k2 + d1
X2 = c21 * k1 + c22 * k2 + d2
X3 = c31 * k1 + c32 * k2 + d3
X4 = c41 * k1 + c42 * k2 + d4
The c's are the coefficients of the coefficient matrix; these are the values that we need to estimate. To solve this, the same mathematical solution as in PCA is used, except for a small difference. In PCA, we apply matrix decomposition to the correlation matrix.
In Factor Analysis, we apply matrix decomposition to a correlation matrix in which the diagonal entries are replaced by **1 - var(d)**, one minus the variance of the specific factor of the variable.
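Here is a small sketch of that reduced correlation matrix idea, using squared multiple correlations as the usual starting estimate of the communalities (1 - var(d)); this illustrates one non-iterative principal-axis step with made-up single-factor data, not a full Factor Analysis routine:

```r
# Made-up data with a single underlying factor (illustration only)
set.seed(5)
n <- 200
k <- rnorm(n)
X <- sapply(c(0.9, 0.8, 0.7, 0.6), function(lam) lam * k + rnorm(n, sd = 0.5))

R <- cor(X)

# Initial communality estimates: squared multiple correlation of each variable
# with all the others (1 - 1/diag(solve(R)) is the standard shortcut)
h2 <- 1 - 1 / diag(solve(R))

# Reduced correlation matrix: replace the diagonal 1s with the communalities
R_reduced <- R
diag(R_reduced) <- h2

# Decompose the reduced matrix instead of R itself; the leading eigenvector,
# rescaled by the square root of its eigenvalue, gives the factor loadings
e <- eigen(R_reduced)
loadings <- e$vectors[, 1] * sqrt(e$values[1])
round(loadings, 2)
```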
The difference between PCA and Factor Analysis
So in short, the mathematical difference between PCA and Factor Analysis is the use of specific factors for each original variable.
Let's get back to the example with students in a school who take 4 exams: two language exams and two technical exams. We expect two underlying factors: language ability and technical ability.
PCA does not estimate specific effects, so it simply finds the mathematical definition of the "best" components (components that maximize variance). Those could be a component for language ability and a component for technical ability, but it could also be something else.
Factor Analysis will also estimate the components, but we now call them common factors. Besides that, it also estimates the specific factors. It will therefore give us two common factors (language and technical) and four specific factors (abilities on test 1, test 2, test 3, and test 4 that are unexplained by language or technical ability).
Even though we don’t really care for the specific factors, the fact that they have been estimated gives us a different definition of the common factors / components.
Resulting from this mathematical difference, we also have a big difference between the applications of PCA and Factor Analysis. In PCA, there is one fixed outcome that orders the components from the highest explanatory value to the lowest explanatory value. In Factor Analysis, we can apply rotations to our solution, which allows us to find a solution that has a more coherent business interpretation for each of the factors that was identified.
The possibility to apply rotation to a Factor Analysis makes it a great tool for treating multivariate questionnaire studies in marketing and psychology.
The fact that Factor Analysis is much more flexible for interpretation makes it a great tool for exploration and interpretation.
Two examples:
In marketing: summarize a product evaluation questionnaire with many questions into a few latent factors for product improvement. In psychology: reduce very long personality test responses into a small number of personality traits.
PCA allows us to find the components that contain the maximum amount of information in fewer variables. This makes it a great tool for dimension reduction.
PCA, on the other hand, is used in cases where we want to retain the largest amount of variation in the smallest possible number of variables. This can, for example, be used to simplify further analysis. PCA is also widely used in data preparation for Machine Learning tasks, where we want to help the Machine Learning model by already "summarising" the data in an easier-to-digest form.
So in conclusion, we observe the difference between PCA and Factor Analysis in three points:
Different Goal
Firstly, the goal is different. PCA has as a goal to define new variables based on the highest variance explained and so forth.
FA has as a goal to define new variables that we can understand and interpret in a business / practical manner.
Different Mathematical Model
Then as a consequence, the mathematics behind the two methods are, while close to each other, not exactly the same. Although both methods use decomposition, they differ in the details.
This is also why Factor Analysis has an additional possibility for rotation of the final solution, while PCA does not.
Different Applications
As Factor Analysis is more flexible for interpretation, due to the possibility of rotation of the solution, it is very valuable in studies for marketing and psychology.
PCA’s advantage is that it allows for dimension reduction while still keeping a maximum amount of information in a data set. This is often used to simplify exploratory analyses or to prepare data for Machine Learning pipelines.
PCA is based on the formative model, where the variation in the component is based on the variation in the item responses (e.g., level of income will affect socio-economic status), while EFA is based on the reflective model, where the variation of the items is based on the variation of a construct (e.g., a person's happiness will change their responses to the items, not the other way around). We can see this representation in the following figure.