StatQuest with Josh Starmer

1,640,000 subscribers

3,614,341 views

StatQuest: Principal Component Analysis (PCA), Step-by-Step

Video Overview & Insights

Principal Component Analysis (PCA) is one of the most useful data analysis and machine learning methods out there. It can be used to identify patterns in highly complex datasets, and it can tell you which variables in your data are the most important. Lastly, it can tell you how accurate your new understanding of the data actually is.

“

NOTE 1: The StatQuest PCA Study Guide is available! https://app.gumroad.com/statquest
NOTE 2: A lot of people ask about how, in 3-D, the 3rd PC can be perpendicular to both PC1 and PC2. Regardless of the number of dimensions, all principal components are perpendicular to each other. If that sounds insane, consider a 2-d graph: the x and y axes are perpendicular to each other. Now consider a 3-d graph: the x, y and z axes are all perpendicular to each other. Now consider a 4-d graph... etc.
NOTE 3: A lot of people ask about the covariance matrix. There are two ways to do PCA: 1) The old way, which applies eigen-decomposition to the covariance matrix and 2) The new way, which applies singular value decomposition to the raw data. This video describes the new way, which is preferred because, from a computational standpoint, it is more stable.
NOTE 4: A lot of people ask how fitting this line is different from Linear Regression. In Linear Regression we are trying to maintain a relationship between a value on the x-axis, and the value it would predict on the y-axis. In other words, the x-axis is used to predict values on the y-axis. This is why we use the vertical distance to measure error - because that tells us how far off our prediction is from the true value. In PCA, no such relationship exists, so we minimize the perpendicular distances between the data and the line.
NOTE 5: A lot of people wonder why we divide the sums of the squares by n-1 instead of n. To be honest, in this context, you can probably use 'n' or 'n-1'. 'n-1' is traditionally used because it prevents us from underestimating the variance - in other words, it's related to how statistics are calculated. If you want to learn more, see: https://youtu.be/vikkiwjQqfU https://youtu.be/SzZ6GpcfoQY and https://youtu.be/sHRBg6BhKjI (the last video specifically addresses the 'n' vs 'n-1' thing, but the first two give background that you need to understand first).
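The two routes in NOTE 3 can be compared directly. The sketch below is only an illustration, not the video's own code: the toy numbers are made up, and it assumes that "raw data" means the data after each variable has been centered. Under those assumptions, the squared singular values divided by n-1 should match the eigenvalues of the covariance matrix, and the first singular vector should match the first eigenvector up to sign.

```python
import numpy as np

# Toy data, made up for illustration: 6 samples (rows) x 2 variables (columns).
X = np.array([
    [10.0, 6.0],
    [11.0, 4.0],
    [8.0, 5.0],
    [3.0, 3.0],
    [2.0, 2.8],
    [1.0, 1.0],
])
n = X.shape[0]

# Center each variable (column) so the centroid of the data sits at the origin.
Xc = X - X.mean(axis=0)

# "Old way": eigen-decomposition of the covariance matrix.
# np.cov divides by n - 1 by default.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]  # eigh sorts ascending; reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# "New way": singular value decomposition of the centered data.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_eigvals = S**2 / (n - 1)

print("eigenvalues from covariance matrix:", eigvals)
print("eigenvalues from SVD:              ", svd_eigvals)
print("|cosine| between the two PC1s:     ", abs(eigvecs[:, 0] @ Vt[0]))
```

Dividing by n instead of n-1 would rescale every eigenvalue by the same factor, so the directions of the PCs and the proportion of variation each one accounts for would not change.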

Support StatQuest by buying my books The StatQuest Illustrated Guide to Machine Learning, The StatQuest Illustrated Guide to Neural Networks and AI, or a Study Guide or Merch!!! https://statquest.org/statquest-store/

— @statquest

In this video, I go one step at a time through PCA, and the method used to solve it, Singular Value Decomposition. I take it nice and slowly so that the simplicity of the method is revealed and clearly explained.
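For readers who want to follow along in code, here is a minimal sketch of those steps using NumPy. The function name `pca_svd`, the variable names, and the toy numbers are all made up for illustration; this is one plausible way to arrange the steps, not the video's own implementation.

```python
import numpy as np

def pca_svd(X):
    """Sketch of PCA via SVD for a samples-by-variables array X.

    Returns (scores, loadings, eigenvalues, proportions).
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Step 1: center each variable so the data is centered on the origin.
    Xc = X - X.mean(axis=0)

    # Step 2: SVD. Row k of Vt is the unit vector for PC(k+1);
    # its entries are the loading scores for each variable.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt

    # Step 3: project the samples onto the PCs to get plot coordinates.
    scores = Xc @ Vt.T

    # Step 4: variation for each PC (sum of squared distances / (n - 1)).
    eigenvalues = S**2 / (n - 1)

    # Step 5: proportion of variation per PC, i.e. the bar heights of a scree plot.
    proportions = eigenvalues / eigenvalues.sum()
    return scores, loadings, eigenvalues, proportions

# Made-up measurements: 6 samples x 3 variables.
X = [
    [10.0, 6.0, 12.0],
    [11.0, 4.0, 9.0],
    [8.0, 5.0, 10.0],
    [3.0, 3.0, 2.5],
    [2.0, 2.8, 1.3],
    [1.0, 1.0, 2.0],
]
scores, loadings, eigenvalues, proportions = pca_svd(X)
print("loading scores for PC1:", loadings[0])
print("percent variation per PC:", np.round(100 * proportions, 1))
print("2-D coordinates (PC1, PC2):")
print(scores[:, :2])
```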

If you are interested in doing PCA in R see: https://youtu.be/0Jp4gsfOLMs

“

Womp womp

— @SoiBoi_Kelda1059

If you are interested in learning more about how to determine the number of principal components, see: https://youtu.be/oRvgq966yZg

For a complete index of all the StatQuest videos, check out: https://statquest.org/video-index/

“

What tool or statistical package can I use for PCA

— @amadigloria4019

If you'd like to support StatQuest, please consider...

Patreon: https://www.patreon.com/statquest

“

best best....super best video!!!❤

— @learning_science_withme

...or...

YouTube Membership: https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw/join

“

I used to study with your videos when i was a student and even now after working for 4 years in the industry as a ML engineer i still come back to your videos to understand things simply. Thank you for making all these videos. ❤

— @amarpratapsingh4057

...buying one of my books, a study guide, a t-shirt or hoodie, or a song from the StatQuest store...

https://statquest.org/statquest-store/

“

Again Claude recommended this to me❤

— @heissensei

...or just donating to StatQuest!

https://www.paypal.me/statquest

“

I love you

— @Makushinio

Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:

https://twitter.com/joshuastarmer

“

Thank you a lot for this amazing video BAM!!.

— @ksup6780

0:00 Awesome song and introduction

“

I couldn't have understood it any better. Amazing BAM!!

— @ayseguldalgic

0:30 Conceptual motivation for PCA

3:23 PCA worked out for 2-Dimensional data

“

This video opened my eyes after 8 years!!!!!!!!

— @mahalingam8928

5:03 Finding PC1

12:08 Singular vector/value, Eigenvector/value and loading scores defined

“

🎯 Key points for quick navigation:

00:00 PCA SVD introduction
00:33 Genes mice dataset
01:04 One gene line
01:37 Two genes clustering
02:15 Three genes 3D
02:53 PCA dimensional reduction
03:32 Center data origin
04:06 Shift center origin
04:41 Fit line origin
05:16 Project measure distances
05:46 Maximize projected distances
06:54 Pythagorean inverse relation
07:25 Maximize squared distances
09:02 PC1 best line
09:36 PC1 slope recipe
10:14 Linear combination variables
11:16 Scale unit vector
12:23 Eigenvector loading scores
12:57 PC2 perpendicular PC1
14:24 Project samples plot
14:52 Eigenvalues variation measure
15:44 Variation proportions PCs
16:14 Scree plot variation
16:44 3D PCA process
18:22 Scree plot proportions
19:40 2D approximation good
20:03 High-D PCA works
20:58 Clusters despite noise

Made with HARPA AI

— @ArmaanQWF

12:56 Finding PC2

14:14 Drawing the PCA graph

“

So far the best video :)

— @manishGupta-i1o1y

15:03 Calculating percent variation for each PC and scree plot

16:30 PCA worked out for 3-Dimensional data

“

Okay, now I somewhat understand what the linear algebra equations were doing.

— @ismailmuhammad6620

#statquest #PCA #ML

“

How do you get PC1 and PC2 without being able to plot the n-dimensional graph? That whole jump confused me.

— @quarantina1293
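One way to see that no plot is needed: the PCs come out of a matrix calculation, so the same steps work for any number of variables. The sketch below is only an illustration with randomly generated numbers (20 samples, 6 variables, and the seed are arbitrary assumptions). It also checks that every PC is perpendicular to every other PC, as NOTE 2 in the description states.

```python
import numpy as np

# Randomly generated stand-in data: 20 samples x 6 variables.
# Six dimensions cannot be drawn, but the arithmetic is unchanged.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))

# Center each variable, then let SVD find all of the PCs at once.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Each row of Vt is one PC (a unit vector in 6-D).
# The dot product of two different PCs should be 0 (perpendicular),
# and the dot product of a PC with itself should be 1 (unit length).
gram = Vt @ Vt.T
print(np.round(gram, 6))

# Keeping only PC1 and PC2 gives coordinates that can be drawn on a 2-D graph.
coords_2d = Xc @ Vt[:2].T
print(coords_2d.shape)
```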

More User Perspectives

@

Very good, I like this !

@ZHANGXiang-m7h

@

Words cannot describe the contrast in quality between this video and the teaching methods of my professor. Absolute life saver

@Michael-DiMarzo

@

So are each of those green dots SNP positions on a gene? Like I'm so confused where these hypothetical numbers came from and why closeness is considered similar

@ShawnMccabe-i2e

@

Man, you are good. Your explanation is clearer than a course I am taking at Cornell. They gloss over the many details that you take the time to explain, which are crucial to achieve good understanding. T H A N K Y O U! BAM!

@elzorro7235

@

That's it? That's the eigenvalue? That's amazing! So, maximum average distance value.

@isaaca3849

@

Wish I had access to this channel in 1977, it would have saved me hours of hard work without seeing much advance. Thank you a lot.

@yutub6928

@

omgg thaank youu smm! really have no clue for my thesis:((( but finally i undrstandd!

@chocoberry434

@

Deep diving science topics brought me here and I found this really interesting. It makes me want to find a class on linear algebra. Thanks for the awesome overview!

@Melds

@

This is too good

@mrrealnobody4382

@

You made this topic look easy. Thank you so much.

@thechronicler7461

@

What is the correct formula for the eigenvalue for PC1? At 12:42, shouldn't we divide by n? The same question applies at 14:04 for PC2.

@iharharkusha

@

This still remains a great video, thanks Josh.

@ishaanbanwait3948

@

OMG....I got it I got it...I finally understood PCA💃🎉 Best explanation ever👏👏👏

@shanthalakvshan789

@

Triple bam! Finally i understood a bit of PCA

@639himanshuyadav

@

little bam!!! lol

@ronenalter281

@

We made it to the end of another exciting StatQuest!

@ender-11c

@

this is gold

@ankitkondilkar4832

@

It's 1am and my exam is at 9am in the morning. I love you forever Mr StatQuest💖

@nikkirobinson2727

@

What a discussion! Fabulous session

@mdmahmudulhasanmiddya9632

@

Do you have any video doing PCA in python?

@xxlemmonxx68

@

Explained everything really well. Even a beginner like me could understand with utmost interest. Thank you.

@MaheshYeole-g2y

@

This is really helpful but I want to be clear: from your explanation I gather that if you are using the loadings to see which variable is "most important" to a PC, then the sign is irrelevant. The absolute value tells you which is contributing most to the vector of the PC... right?

@fallingwithjess8803
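A small numerical check of the question above, offered as a sketch on made-up numbers rather than an authoritative answer. It relies on one fact: a unit vector and its negative describe the same line, so flipping the sign of every loading in a PC leaves the sum of squared projected distances unchanged. Ranking variables by the absolute value of their loading is therefore unaffected by that overall flip. The relative signs of loadings within one PC can still matter, since they indicate whether two variables move together or in opposite directions along that PC.

```python
import numpy as np

# Toy data, made up for illustration: 6 samples x 2 variables.
X = np.array([
    [10.0, 6.0],
    [11.0, 4.0],
    [8.0, 5.0],
    [3.0, 3.0],
    [2.0, 2.8],
    [1.0, 1.0],
])
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]  # loading scores for PC1

# Sum of squared projected distances for PC1 and for its sign-flipped twin.
ss_pos = np.sum((Xc @ pc1) ** 2)
ss_neg = np.sum((Xc @ -pc1) ** 2)
print(ss_pos, ss_neg)

# Rank the variables by the size of their loading, ignoring sign.
ranking = np.argsort(-np.abs(pc1))
print("loadings:", pc1, "-> variables ranked by |loading|:", ranking)
```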

@

Thank you so much ❤

@Jassersghaier777

@

This is on another level of helpful!

@SaimirGjoni-l5f

@

@statquest you are the GOAT

@ENTJ616

@

@Gamemaster60101

@

This is just beautifull !!Thankyouu

@VishvaModh

@

Simply thank you, you are a legend man ❤️‍🔥

@Gunem-h2m

@

Isn't it kind of the same as linear regression?

@krishnasutapalli
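NOTE 4 in the description answers this: regression minimizes vertical distances, while PCA minimizes perpendicular distances. The sketch below illustrates the difference on made-up numbers. The helper names `vertical_ss` and `perpendicular_ss` are hypothetical, and both lines are assumed to pass through the center of the data.

```python
import numpy as np

# Toy data, made up for illustration.
x = np.array([10.0, 11.0, 8.0, 3.0, 2.0, 1.0])
y = np.array([6.0, 4.0, 5.0, 3.0, 2.8, 1.0])
xc, yc = x - x.mean(), y - y.mean()

# Least-squares regression of y on x: slope = sum(x*y) / sum(x*x) on centered data.
ols_slope = np.sum(xc * yc) / np.sum(xc**2)

# PC1 from SVD of the centered data; its slope is rise over run of the unit vector.
_, _, Vt = np.linalg.svd(np.column_stack([xc, yc]), full_matrices=False)
pca_slope = Vt[0, 1] / Vt[0, 0]

def vertical_ss(slope):
    """Sum of squared vertical distances to the line y = slope * x."""
    return np.sum((yc - slope * xc) ** 2)

def perpendicular_ss(slope):
    """Sum of squared perpendicular distances to the line y = slope * x."""
    return np.sum((yc - slope * xc) ** 2) / (1.0 + slope**2)

print("regression slope:", ols_slope, " PC1 slope:", pca_slope)
print("vertical SS      (regression, PC1):", vertical_ss(ols_slope), vertical_ss(pca_slope))
print("perpendicular SS (regression, PC1):", perpendicular_ss(ols_slope), perpendicular_ss(pca_slope))
```

On this data the two slopes differ: the regression line does better on vertical distances and PC1 does better on perpendicular distances.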

@

Believe in the Lord Jesus Christ and you shall be saved

@seesmof

@

This channel is awesome! Thanks for your effort

@HANYUSU-o5x

@

I really enjoyed the video! I hope you don't mind a quick technical clarification regarding the centering step. In standard Principal Component Analysis (PCA), the proper method is to subtract the mean of each variable (column) from its corresponding values. This involves calculating the mean vector X and then performing the subtraction column-wise. Subtracting a single overall average value from the entire dataset is generally avoided because it fails to shift the data's centroid (the mean vector) to the origin, which is crucial for correctly calculating the covariance matrix. Could you clarify if the method you showed serves a specific, non-standard purpose? Otherwise, the procedure should be performed column-wise.

@laszlomadar324
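The distinction raised in this comment can be checked numerically. The sketch below uses made-up numbers and makes no claim about which method the video uses. It only shows that subtracting each column's mean moves the centroid to the origin, while subtracting one overall average generally does not.

```python
import numpy as np

# Toy data, made up for illustration: 6 samples x 2 variables.
X = np.array([
    [10.0, 6.0],
    [11.0, 4.0],
    [8.0, 5.0],
    [3.0, 3.0],
    [2.0, 2.8],
    [1.0, 1.0],
])

# Column-wise centering: subtract each variable's own mean.
col_centered = X - X.mean(axis=0)

# Subtracting a single overall average from every value.
globally_shifted = X - X.mean()

print("centroid after column-wise centering:", col_centered.mean(axis=0))
print("centroid after subtracting one overall mean:", globally_shifted.mean(axis=0))
```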

@

00:07 into the video and already loved it

@Study-e1b9e

@

I don't know why people like this video. I didn't like any of the explanations on this channel. I always think I will learn something, but what I get is unnecessary build-ups, stretching things

@createownhappiness3227

@

I think that another useful way to think about how to choose the “best” line (or why the best line is the one that maximises the distances between the origin and the projected points) is by thinking about variance as the “sum of the squared differences”. The goal of PCA is to capture variance within the data and project it in fewer dimensions, so intuitively it might make sense that we want to maximise the variance within those dimensions - and that means picking a line to maximise the sum of squared distances (that is, variance)!

I hope I’ve written this correctly, but that’s the reasoning that helped me (alongside this excellent explanation) understand the link between what PCA is doing, and what it hopes to achieve. (I’m only 8 minutes into the video, so now I’m crossing my fingers and hoping that he doesn’t say this later)

@JecIsBec
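The reasoning in this comment can be tried out by brute force. The sketch below uses made-up numbers and an arbitrary 0.1-degree grid. It rotates a line through the origin, measures the sum of squared projected distances at each angle, and compares the best angle with PC1 from SVD. It also checks the Pythagorean relation: for every candidate line, the squared projected distances plus the squared perpendicular distances add up to the same total, so maximizing one is the same as minimizing the other.

```python
import numpy as np

# Toy data, made up for illustration: 6 samples x 2 variables.
X = np.array([
    [10.0, 6.0],
    [11.0, 4.0],
    [8.0, 5.0],
    [3.0, 3.0],
    [2.0, 2.8],
    [1.0, 1.0],
])
Xc = X - X.mean(axis=0)

# Candidate lines through the origin, one every 0.1 degrees from 0 to 180.
angles = np.linspace(0.0, np.pi, 1801)
directions = np.column_stack([np.cos(angles), np.sin(angles)])  # unit vectors along each line
normals = np.column_stack([-np.sin(angles), np.cos(angles)])    # unit vectors perpendicular to each line

# For each candidate line: sum of squared projected and perpendicular distances.
ss_projected = np.sum((Xc @ directions.T) ** 2, axis=0)
ss_perpendicular = np.sum((Xc @ normals.T) ** 2, axis=0)
total_ss = np.sum(Xc**2)  # squared distances from the points to the origin

# The line with the largest sum of squared projected distances.
best = directions[np.argmax(ss_projected)]

# PC1 from SVD, for comparison (equal to `best` up to sign and grid resolution).
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
print("best direction from the scan:", best)
print("PC1 from SVD:                ", Vt[0])
```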

@

Thank you so much much much !!! cannot express how gratefull I am !!!! every video each video is just toooooo goooodddd <3

@prathameshwagh5503

#Joshua Starmer #StatQuest #PCA #Clearly Explained #SVD #Singular Value Decomposition #Principal Component Analysis #Machine Learning #Data Science #Statistics #Data Analysis