StatQuest: Principal Component Analysis (PCA), Step-by-Step
Video Overview & Insights
Principal Component Analysis (PCA) is one of the most useful data analysis and machine learning methods out there. It can be used to identify patterns in highly complex datasets, and it can tell you which variables in your data are the most important. Lastly, it can tell you how accurate your new understanding of the data actually is.
NOTE 1: The StatQuest PCA Study Guide is available! https://app.gumroad.com/statquest
NOTE 2: A lot of people ask about how, in 3-D, the 3rd PC can be perpendicular to both PC1 and PC2. Regardless of the number of dimensions, all principal components are perpendicular to each other. If that sounds insane, consider a 2-D graph: the x and y axes are perpendicular to each other. Now consider a 3-D graph: the x, y and z axes are all perpendicular to each other. Now consider a 4-D graph... etc.
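To make NOTE 2 concrete, here is a minimal sketch that checks perpendicularity numerically in 4-D, where nothing can be plotted. It assumes NumPy is available and uses randomly generated, made-up data; the variable names are illustrative and not taken from the video.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed so the run is repeatable
X = rng.normal(size=(50, 4))             # 50 made-up samples, 4 variables
X_centered = X - X.mean(axis=0)          # center each variable on 0

# Rows of Vt are the unit-length principal component directions.
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Dot products between every pair of PCs: about 1 on the diagonal
# (unit length) and about 0 everywhere else (perpendicular).
gram = Vt @ Vt.T
is_orthonormal = np.allclose(gram, np.eye(4))
print(is_orthonormal)
```

A dot product of 0 between two directions is what "perpendicular" means in any number of dimensions, which is why the check works even when the graph cannot be drawn.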
NOTE 3: A lot of people ask about the covariance matrix. There are two ways to do PCA: 1) the old way, which applies eigen-decomposition to the covariance matrix, and 2) the new way, which applies singular value decomposition to the raw data. This video describes the new way, which is preferred because, from a computational standpoint, it is more stable.
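The two routes in NOTE 3 can be compared side by side. The sketch below is an illustration under stated assumptions, not the video's own code: it assumes NumPy, uses made-up data, and centers the data before the SVD. It then checks that both routes give the same directions (up to sign) and the same variances.

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up data: 30 samples, 3 variables with clearly different spreads.
X = rng.normal(size=(30, 3)) @ np.array([[2.0, 0.5, 0.0],
                                         [0.0, 1.0, 0.3],
                                         [0.0, 0.0, 0.2]])
Xc = X - X.mean(axis=0)                   # center each column
n = Xc.shape[0]

# Way 1: eigen-decomposition of the covariance matrix.
cov = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending order
order = np.argsort(eigvals)[::-1]         # re-sort largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Way 2: singular value decomposition of the centered data.
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Squared singular values divided by n - 1 match the covariance eigenvalues.
same_variances = np.allclose(eigvals, s**2 / (n - 1))

# Directions may be flipped in sign, so compare absolute dot products.
same_directions = np.allclose(np.abs(np.sum(eigvecs * Vt.T, axis=0)), 1.0)
print(same_variances, same_directions)
```

The SVD route never forms the covariance matrix explicitly, which is one common explanation for why it tends to be numerically better behaved.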
NOTE 4: A lot of people ask how fitting this line is different from Linear Regression. In Linear Regression we are trying to maintain a relationship between a value on the x-axis, and the value it would predict on the y-axis. In other words, the x-axis is used to predict values on the y-axis. This is why we use the vertical distance to measure error - because that tells us how far off our prediction is from the true value. In PCA, no such relationship exists, so we minimize the perpendicular distances between the data and the line.
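The difference described in NOTE 4 shows up as two different slopes for the same data. This is a rough sketch assuming NumPy and made-up 2-D data; the helper names `vertical_ss` and `perpendicular_ss` are hypothetical and exist only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200)                          # made-up x values
y = 0.5 * x + rng.normal(scale=0.5, size=200)     # noisy made-up y values
x = x - x.mean()                                  # center both variables
y = y - y.mean()

# Linear regression through the origin: minimizes vertical distances.
ols_slope = np.sum(x * y) / np.sum(x * x)

# PC1: minimizes perpendicular distances to a line through the origin.
_, _, Vt = np.linalg.svd(np.column_stack([x, y]), full_matrices=False)
pc1_slope = Vt[0, 1] / Vt[0, 0]

def vertical_ss(slope):
    """Sum of squared vertical distances to the line y = slope * x."""
    return np.sum((y - slope * x) ** 2)

def perpendicular_ss(slope):
    """Sum of squared perpendicular distances to the line y = slope * x."""
    return np.sum((y - slope * x) ** 2) / (1 + slope ** 2)

print(ols_slope, pc1_slope)
print(vertical_ss(ols_slope) <= vertical_ss(pc1_slope))
print(perpendicular_ss(pc1_slope) <= perpendicular_ss(ols_slope))
```

Each line wins on its own error measure and loses on the other, so the two fits only coincide when the points lie exactly on a line.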
NOTE 5: A lot of people wonder why we divide the sums of the squares by n-1 instead of n. To be honest, in this context, you can probably use 'n' or 'n-1'. 'n-1' is traditionally used because it prevents us from underestimating the variance - in other words, it's related to how statistics are calculated. If you want to learn more, see: https://youtu.be/vikkiwjQqfU https://youtu.be/SzZ6GpcfoQY and https://youtu.be/sHRBg6BhKjI (the last video specifically addresses the 'n' vs 'n-1' thing, but the first two give background that you need to understand first).
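One way to see why 'n' or 'n-1' makes little difference here: the divisor rescales every PC's variation by the same factor, so the proportions shown in a scree plot do not change. The sketch below assumes NumPy and made-up data, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
# Made-up data: 40 samples, 3 variables with different spreads.
X = rng.normal(size=(40, 3)) * np.array([3.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

_, s, _ = np.linalg.svd(Xc, full_matrices=False)
ss_distances = s**2                    # sum of squared distances along each PC

var_n_minus_1 = ss_distances / (n - 1)
var_n = ss_distances / n

# The proportion of variation per PC is identical with either divisor.
prop_n_minus_1 = var_n_minus_1 / var_n_minus_1.sum()
prop_n = var_n / var_n.sum()
same_proportions = np.allclose(prop_n_minus_1, prop_n)

# With n - 1, the total matches the summed sample variances of the columns.
matches_sample_variance = np.allclose(var_n_minus_1.sum(),
                                      Xc.var(axis=0, ddof=1).sum())
print(same_proportions, matches_sample_variance)
```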
Support StatQuest by buying my books The StatQuest Illustrated Guide to Machine Learning, The StatQuest Illustrated Guide to Neural Networks and AI, or a Study Guide or Merch!!! https://statquest.org/statquest-store/
In this video, I go one step at a time through PCA, and the method used to solve it, Singular Value Decomposition. I take it nice and slowly so that the simplicity of the method is revealed and clearly explained.
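For readers who want to follow along in code, the steps can be sketched roughly as below. This is not the video's own code: it assumes NumPy, and the measurements are made-up numbers standing in for two variables measured on six samples.

```python
import numpy as np

# Hypothetical measurements: 6 samples (rows) x 2 variables (columns).
data = np.array([[9.0, 5.5],
                 [10.5, 4.5],
                 [7.5, 6.0],
                 [2.5, 3.0],
                 [3.0, 1.5],
                 [1.5, 2.0]])

# Step 1: shift the data so its center sits on the origin.
centered = data - data.mean(axis=0)

# Step 2: SVD finds the best-fitting lines through the origin.
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

# Step 3: row i of Vt is the unit vector for PC(i+1); its entries are the
# loading scores, i.e. the recipe of variables that makes up that PC.
loading_scores = Vt

# Step 4: squared singular values are the sums of squared distances
# between the projected points and the origin.
ss_distances = s**2
variation = ss_distances / (len(data) - 1)

# Step 5: percent of variation per PC (the heights of a scree plot).
percent_variation = 100 * variation / variation.sum()

# Step 6: coordinates of each sample on the PC axes, for the PCA plot.
pc_coords = centered @ Vt.T

print(loading_scores)
print(percent_variation)
print(pc_coords)
```

The same steps apply unchanged to data with more columns, since nothing in the sketch depends on being able to draw the points.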
If you are interested in doing PCA in R see: https://youtu.be/0Jp4gsfOLMs
If you are interested in learning more about how to determine the number of principal components, see: https://youtu.be/oRvgq966yZg
For a complete index of all the StatQuest videos, check out:
https://statquest.org/video-index/
If you'd like to support StatQuest, please consider...
Patreon: https://www.patreon.com/statquest
...or...
YouTube Membership: https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw/join
...buying one of my books, a study guide, a t-shirt or hoodie, or a song from the StatQuest store...
https://statquest.org/statquest-store/
...or just donating to StatQuest!
https://www.paypal.me/statquest
Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
https://twitter.com/joshuastarmer
0:00 Awesome song and introduction
I couldn't have understood it any better. Amazing BAM!!
0:30 Conceptual motivation for PCA
3:23 PCA worked out for 2-Dimensional data
This video opened my eyes after 8 years!!!!!!!!
5:03 Finding PC1
12:08 Singular vector/value, Eigenvector/value and loading scores defined
🎯 Key points for quick navigation:
00:00 PCA SVD introduction
00:33 Genes mice dataset
01:04 One gene line
01:37 Two genes clustering
02:15 Three genes 3D
02:53 PCA dimensional reduction
03:32 Center data origin
04:06 Shift center origin
04:41 Fit line origin
05:16 Project measure distances
05:46 Maximize projected distances
06:54 Pythagorean inverse relation
07:25 Maximize squared distances
09:02 PC1 best line
09:36 PC1 slope recipe
10:14 Linear combination variables
11:16 Scale unit vector
12:23 Eigenvector loading scores
12:57 PC2 perpendicular PC1
14:24 Project samples plot
14:52 Eigenvalues variation measure
15:44 Variation proportions PCs
16:14 Scree plot variation
16:44 3D PCA process
18:22 Scree plot proportions
19:40 2D approximation good
20:03 High-D PCA works
20:58 Clusters despite noise
12:56 Finding PC2
14:14 Drawing the PCA graph
So far the best video :)
15:03 Calculating percent variation for each PC and scree plot
16:30 PCA worked out for 3-Dimensional data
Okay, now I somewhat understand what the linear algebra equations were doing.
#statquest #PCA #ML
How do you get PC1 and PC2 without being able to plot the n-dimensional graph? That whole jump confused me.
More User Perspectives
Very good, I like this !
@ZHANGXiang-m7h: Words cannot describe the contrast in quality between this video and the teaching methods of my professor. Absolute life saver
@Michael-DiMarzo: So are each of those green dots SNP positions on a gene? Like I'm so confused where these hypothetical numbers came from and why closeness is considered similar
@ShawnMccabe-i2e: Man, you are good. Your explanation is clearer than a course I am taking at Cornell. They gloss over the many details that you take the time to explain, which are crucial to achieve good understanding. T H A N K Y O U! BAM!
@elzorro7235: That's it? That's the eigenvalue? That's amazing! So, maximum average distance value.
@isaaca3849: Wish I had access to this channel in 1977, it would have saved me hours of hard work without seeing much advance. Thank you a lot.
@yutub6928: omgg thaank youu smm! really have no clue for my thesis:((( but finally i undrstandd!
@chocoberry434: Deep diving science topics brought me here and I found this really interesting. It makes me want to find a class on linear algebra. Thanks for the awesome overview!
@Melds: This is too good
@mrrealnobody4382: You made this topic look easy. Thank you so much.
@thechronicler7461: This still remains a great video, thanks Josh.
@ishaanbanwait3948: OMG....I got it I got it...I finally understood PCA💃🎉 Best explanation ever👏👏👏
@shanthalakvshan789: Triple bam! Finally i understood a bit of PCA
@639himanshuyadav: little bam!!! lol
@ronenalter281: We made it to the end of another existing StatQuest!
@ender-11c: this is gold
@ankitkondilkar4832: It's 1am and my exam is at 9am in the morning. I love you forever Mr StatQuest💖
@nikkirobinson2727: What a discussion!Fabulous session
@mdmahmudulhasanmiddya9632: Do you have any video doing PCA in python?
@xxlemmonxx68: Explained everything really well. Even a beginner like me could understand with at most interest. Thank you.
@MaheshYeole-g2y: This is really helpful but I want to be clear, from your explanation I gather that if you are using the loadings to see which variable is "most important" to a PC, then the sign is irrelevant. the absolute value tells you which is contributing most to the vector of the PC... right?
@fallingwithjess8803: Thank you so much ❤
@Jassersghaier777: This is on another level of helpful!
@SaimirGjoni-l5f: @statquest you are the GOAT
@ENTJ616: This is just beautifull !!Thankyouu
@VishvaModh: Simply thank you, you are a legend man ❤️🔥
@Gunem-h2m: Isnt it kind of same as linear regression?
@krishnasutapalli: Believe in the Lord Jesus Christ and you shall be saved
@seesmof: This channel is awesome! Thanks for your effort
@HANYUSU-o5x: I really enjoyed the video! I hope you don't mind a quick technical clarification regarding the centering step. In standard Principal Component Analysis (PCA), the proper method is to subtract the mean of each variable (column) from its corresponding values. This involves calculating the mean vector X and then performing the subtraction column-wise. Subtracting a single overall average value from the entire dataset is generally avoided because it fails to shift the data's centroid (the mean vector) to the origin, which is crucial for correctly calculating the covariance matrix. Could you clarify if the method you showed serves a specific, non-standard purpose? Otherwise, the procedure should be performed column-wise.
@laszlomadar3240: 0:07 into the video and already loved it
@Study-e1b9e: i don't know why people like it's video, I didn't like any of the explanations of this channel. I always think I will learn something but what I got is unnecessary build-ups, stretching things
@createownhappiness3227: I think that another useful way to think about how to choose the “best” line (or why the best line is the one that maximises the distances between the origin and the projected points) is by thinking about variance as the “sum of the squared differences”. The goal of PCA is to capture variance within the data and project it in fewer dimensions, so intuitively it might make sense that we want to maximise the variance within those dimensions - and that means picking a line to maximise the sum of squared distances (that is, variance)!
I hope I’ve written this correctly, but that’s the reasoning that helped me (alongside this excellent explanation) understand the link between what PCA is doing, and what it hopes to achieve. (I’m only 8 minutes into the video, so now I’m crossing my fingers and hoping that he doesn’t say this later)
Thank you so much much much !!! cannot express how gratefull I am !!!! every video each video is just toooooo goooodddd <3
@prathameshwagh5503