title

StatQuest: Principal Component Analysis (PCA), Step-by-Step

description

Principal Component Analysis is one of the most useful data analysis and machine learning methods out there. It can be used to identify patterns in highly complex datasets and it can tell you which variables in your data are the most important. Lastly, it can tell you how accurate your new understanding of the data actually is.
In this video, I go one step at a time through PCA, and the method used to solve it, Singular Value Decomposition. I take it nice and slowly so that the simplicity of the method is revealed and clearly explained.
If you are interested in doing PCA in R see: https://youtu.be/0Jp4gsfOLMs
If you are interested in learning more about how to determine the number of principal components, see: https://youtu.be/oRvgq966yZg
For a complete index of all the StatQuest videos, check out:
https://statquest.org/video-index/
If you'd like to support StatQuest, please consider...
Buying The StatQuest Illustrated Guide to Machine Learning!!!
PDF - https://statquest.gumroad.com/l/wvtmc
Paperback - https://www.amazon.com/dp/B09ZCKR4H6
Kindle eBook - https://www.amazon.com/dp/B09ZG79HXC
Patreon: https://www.patreon.com/statquest
...or...
YouTube Membership: https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw/join
...a cool StatQuest t-shirt or sweatshirt:
https://shop.spreadshirt.com/statquest-with-josh-starmer/
...buying one or two of my songs (or go large and get a whole album!)
https://joshuastarmer.bandcamp.com/
...or just donating to StatQuest!
https://www.paypal.me/statquest
Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on Twitter:
https://twitter.com/joshuastarmer
0:00 Awesome song and introduction
0:30 Conceptual motivation for PCA
3:23 PCA worked out for 2-Dimensional data
5:03 Finding PC1
12:08 Singular vector/value, Eigenvector/value and loading scores defined
12:56 Finding PC2
14:14 Drawing the PCA graph
15:03 Calculating percent variation for each PC and scree plot
16:30 PCA worked out for 3-Dimensional data
#statquest #PCA #ML
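
The steps listed in the chapters above (centering, finding the PCs with SVD, loading scores, eigenvalues, percent variation) can be sketched in a few lines of NumPy. This is only a rough sketch, not the code used in the video: the function name pca_svd and the six-mouse measurements are made up for illustration, and dividing by n - 1 to get the eigenvalues is the usual sample-variance convention, assumed here.

```python
import numpy as np


def pca_svd(data):
    """PCA via SVD. `data` has one row per sample and one column per variable."""
    # Shift the data so that its center sits on the origin.
    centered = data - data.mean(axis=0)
    # Rows of vt are unit-length singular vectors; their entries are the loading scores.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    # Squared singular values are the sums of squared distances from the
    # projected points to the origin, one per PC.
    ss_distances = singular_values ** 2
    # Assumption: eigenvalue = SS(distances) / (n - 1), the sample-variance convention.
    eigenvalues = ss_distances / (data.shape[0] - 1)
    # Proportion of the total variation each PC accounts for (scree plot heights).
    percent_var = ss_distances / ss_distances.sum()
    # Coordinates of each sample along each PC (what gets drawn in the PCA graph).
    scores = centered @ vt.T
    return vt, singular_values, eigenvalues, percent_var, scores


# Hypothetical measurements: six mice (rows) by two genes (columns).
mice = np.array([
    [10.0, 6.0],
    [11.0, 4.0],
    [8.0, 5.0],
    [3.0, 3.0],
    [2.0, 2.8],
    [1.0, 1.0],
])

loadings, singular_values, eigenvalues, percent_var, scores = pca_svd(mice)
print("loading scores for PC1:", loadings[0])
print("proportion of variation per PC:", percent_var)
```

The sign of each singular vector is arbitrary, so a PC may come out flipped relative to a hand-drawn picture without changing the analysis.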

detail

{'title': 'StatQuest: Principal Component Analysis (PCA), Step-by-Step', 'heatmap': [{'end': 781.168, 'start': 732.281, 'weight': 0.805}], 'summary': 'Provides a step-by-step explanation of principal component analysis (pca), demonstrating its application in gene transcription analysis in mice, data centering, distance measurement, data spread visualization, and 2d data representation, emphasizing key concepts and their practical implications.', 'chapters': [{'end': 219.584, 'segs': [{'end': 122.684, 'src': 'embed', 'start': 92.6, 'weight': 2, 'content': [{'end': 100.962, 'text': "Even though it's a simple graph, it shows us that mice 1, 2, and 3 are more similar to each other than they are to mice 4, 5, and 6.", 'start': 92.6, 'duration': 8.362}, {'end': 109.964, 'text': 'If we measured two genes, then we can plot the data on a two-dimensional XY graph.', 'start': 100.962, 'duration': 9.002}, {'end': 116.766, 'text': 'Gene 1 is the x-axis and spans one of the two dimensions in this graph.', 'start': 111.945, 'duration': 4.821}, {'end': 122.684, 'text': 'Gene 2 is the y-axis and spans the other dimension.', 'start': 118.922, 'duration': 3.762}], 'summary': 'Simple graph indicates mice 1, 2, and 3 are more similar to each other than to mice 4, 5, and 6.', 'duration': 30.084, 'max_score': 92.6, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FgakZw6K1QQ/pics/FgakZw6K1QQ92600.jpg'}, {'end': 219.584, 'src': 'embed', 'start': 156.688, 'weight': 0, 'content': [{'end': 161.111, 'text': 'If we measured four genes, however, we can no longer plot the data.', 'start': 156.688, 'duration': 4.423}, {'end': 164.273, 'text': 'Four genes require four dimensions.', 'start': 161.891, 'duration': 2.382}, {'end': 176.82, 'text': "So we're going to talk about how PCA can take four or more gene measurements and thus four or more dimensions of data,", 'start': 168.115, 'duration': 8.705}, {'end': 179.362, 'text': 'and make a two-dimensional PCA plot.', 
'start': 176.82, 'duration': 2.542}, {'end': 184.432, 'text': 'This plot will show us that similar mice cluster together.', 'start': 180.831, 'duration': 3.601}, {'end': 194.135, 'text': "We'll also talk about how PCA can tell us which gene, or variable, is the most valuable for clustering the data.", 'start': 186.272, 'duration': 7.863}, {'end': 202.977, 'text': 'For example, PCA might tell us that gene 3 is responsible for separating samples along the x-axis.', 'start': 195.735, 'duration': 7.242}, {'end': 210.701, 'text': "Lastly, we'll talk about how PCA can tell us how accurate the 2D graph is.", 'start': 205.6, 'duration': 5.101}, {'end': 219.584, 'text': "To understand what PCA does and how it works, let's go back to the data set that only had two genes.", 'start': 212.662, 'duration': 6.922}], 'summary': 'Pca can reduce 4d gene data to 2d, revealing gene impact and graph accuracy.', 'duration': 62.896, 'max_score': 156.688, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FgakZw6K1QQ/pics/FgakZw6K1QQ156688.jpg'}], 'start': 7.98, 'title': 'Principal component analysis and gene clustering in mice', 'summary': "Delves into principal component analysis using singular value decomposition to measure gene transcription in mice, providing a simple dataset for analysis. it also discusses gene clustering in mice, illustrating similarities and potential 3d representation based on gene measurements. 
additionally, it explores pca's transformation of gene measurements and its accuracy in identifying valuable genes for clustering.", 'chapters': [{'end': 90.684, 'start': 7.98, 'title': 'Principal component analysis using singular value decomposition', 'summary': 'Explains principal component analysis using singular value decomposition, illustrating its application in measuring gene transcription in mice and providing a simple data set for analysis.', 'duration': 82.704, 'highlights': ['PCA provides deeper insight into data by reducing dimensionality, allowing visualization and identification of patterns and outliers.', 'Illustrates using a simple data set of gene transcription in six mice, showcasing the practical application of PCA.', 'Explains how measuring one gene allows data plotting on a number line, highlighting differences in gene expression among the six mice.']}, {'end': 154.687, 'start': 92.6, 'title': 'Mouse gene clustering', 'summary': 'Explains how a simple graph illustrates the similarity between mice 1, 2, and 3 compared to mice 4, 5, and 6 based on gene measurements, and further discusses the potential 3d representation for additional genes.', 'duration': 62.087, 'highlights': ['The smaller dots have larger values for gene 3 and are further away, while the larger dots have smaller values for gene 3 and are closer.', 'If we measured two genes, then we can plot the data on a two-dimensional XY graph with Gene 1 on the x-axis and Gene 2 on the y-axis, showing the clustering of mice 1, 2, and 3 on the right side and mice 4, 5, and 6 on the lower left-hand side.']}, {'end': 219.584, 'start': 156.688, 'title': 'Pca for gene measurement', 'summary': 'Explains how pca can transform four or more gene measurements into a two-dimensional plot, revealing clusters of similar mice and identifying the most valuable gene for clustering, while also assessing the accuracy of the plot.', 'duration': 62.896, 'highlights': ['PCA can transform four or more gene 
measurements into a two-dimensional plot, revealing clusters of similar mice.', 'PCA can identify the most valuable gene for clustering the data, such as gene 3 responsible for separating samples along the x-axis.', 'PCA can assess the accuracy of the 2D graph, providing insights into its reliability and representation of the data.']}], 'duration': 211.604, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FgakZw6K1QQ/pics/FgakZw6K1QQ7980.jpg', 'highlights': ['PCA provides deeper insight into data by reducing dimensionality, allowing visualization and identification of patterns and outliers.', 'Illustrates using a simple data set of gene transcription in six mice, showcasing the practical application of PCA.', 'Explains how measuring one gene allows data plotting on a number line, highlighting differences in gene expression among the six mice.', 'PCA can transform four or more gene measurements into a two-dimensional plot, revealing clusters of similar mice.', 'PCA can identify the most valuable gene for clustering the data, such as gene 3 responsible for separating samples along the x-axis.', 'If we measured two genes, then we can plot the data on a two-dimensional XY graph with Gene 1 on the x-axis and Gene 2 on the y-axis, showing the clustering of mice 1, 2, and 3 on the right side and mice 4, 5, and 6 on the lower left-hand side.']}, {'end': 557.785, 'segs': [{'end': 279.538, 'src': 'embed', 'start': 253.679, 'weight': 2, 'content': [{'end': 258.961, 'text': 'Shifting the data did not change how the data points are positioned relative to each other.', 'start': 253.679, 'duration': 5.282}, {'end': 267.311, 'text': 'This point is still the highest one, and this is still the rightmost point, etc.', 'start': 260.728, 'duration': 6.583}, {'end': 273.014, 'text': 'Now that the data are centered on the origin, we can try to fit a line to it.', 'start': 268.612, 'duration': 4.402}, {'end': 279.538, 'text': 'To do this, we start by drawing a 
random line that goes through the origin.', 'start': 273.775, 'duration': 5.763}], 'summary': 'Data shifted, but relative positions remain unchanged. attempting to fit a line through the centered data.', 'duration': 25.859, 'max_score': 253.679, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FgakZw6K1QQ/pics/FgakZw6K1QQ253679.jpg'}, {'end': 345.116, 'src': 'embed', 'start': 309.748, 'weight': 0, 'content': [{'end': 315.673, 'text': 'To quantify how good this line fits the data, PCA projects the data onto it.', 'start': 309.748, 'duration': 5.925}, {'end': 327.224, 'text': 'and then it can either measure the distances from the data to the line and try to find the line that minimizes those distances,', 'start': 317.677, 'duration': 9.547}, {'end': 333.508, 'text': 'or it can try to find the line that maximizes the distances from the projected points to the origin.', 'start': 327.224, 'duration': 6.284}, {'end': 345.116, 'text': "If those options don't seem equivalent to you, we can build intuition by looking at how these distances shrink when the line fits better.", 'start': 335.449, 'duration': 9.667}], 'summary': 'Pca quantifies data fit by minimizing the distances from the data to the line or, equivalently, maximizing the distances from the projected points to the origin.', 'duration': 35.368, 'max_score': 309.748, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FgakZw6K1QQ/pics/FgakZw6K1QQ309748.jpg'}, {'end': 400.668, 'src': 'embed', 'start': 367.503, 'weight': 5, 'content': [{'end': 374.387, 'text': "In other words, the distance from the point to the origin doesn't change when the red dotted line rotates.", 'start': 367.503, 'duration': 6.884}, {'end': 384.535, 'text': 'When we project the point onto the line, we get a right angle between the black dotted line and the red dotted line.', 'start': 376.308, 'duration': 8.227}, {'end': 400.668, 'text': 'That means that if we label the sides like this, A, B, and C, then we can use the Pythagorean theorem to show how B and C 
are inversely related.', 'start': 385.776, 'duration': 14.892}], 'summary': 'Distance from point to origin remains constant during rotation, forming right angle for pythagorean theorem application.', 'duration': 33.165, 'max_score': 367.503, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FgakZw6K1QQ/pics/FgakZw6K1QQ367503.jpg'}, {'end': 459.857, 'src': 'embed', 'start': 433.922, 'weight': 3, 'content': [{'end': 443.628, 'text': "The reason I'm making such a fuss about this is that, intuitively, it makes sense to minimize b and the distance from the point to the line.", 'start': 433.922, 'duration': 9.706}, {'end': 451.532, 'text': "But it's actually easier to calculate c, the distance from the projected point to the origin.", 'start': 444.888, 'duration': 6.644}, {'end': 459.857, 'text': 'so PCA finds the best fitting line by maximizing the sum of the squared distances from the projected points to the origin.', 'start': 451.532, 'duration': 8.325}], 'summary': 'Pca finds best fitting line by maximizing sum of squared distances from projected points to origin.', 'duration': 25.935, 'max_score': 433.922, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FgakZw6K1QQ/pics/FgakZw6K1QQ433922.jpg'}], 'start': 221.024, 'title': 'Pca and distance in math', 'summary': 'Covers pca and data centering, line fitting, and maximizing sum of squared distances, emphasizing the relationship between distance, line fitting, and the pythagorean theorem.', 'chapters': [{'end': 333.508, 'start': 221.024, 'title': 'Pca and data centering', 'summary': 'Explains the process of plotting and centering data, followed by the steps involved in fitting a line using pca, which includes projecting data onto the line and finding the best fit.', 'duration': 112.484, 'highlights': ['The chapter explains the process of plotting and centering data, followed by the steps involved in fitting a line using PCA Process of plotting and centering data, 
fitting a line using PCA', 'PCA projects the data onto a line and measures the distances from the data to the line, trying to find the line that minimizes those distances. PCA projects data onto a line, measures distances, finding the line that minimizes distances', 'The line is rotated until it fits the data as well as it can, given that it has to go through the origin. Line is rotated to fit data, line goes through the origin']}, {'end': 400.668, 'start': 335.449, 'title': 'Distance and projection in math', 'summary': 'Explains the relationship between distance and line fitting in mathematical terms, demonstrating how the distances change when the line fits better and the inverse relationship between sides a, b, and c using the pythagorean theorem.', 'duration': 65.219, 'highlights': ['The distances from the points to the line shrink when the line fits better, while the distances from the projected points to the origin get larger, illustrating the relationship between distance and line fitting.', "The distance from the point to the origin doesn't change when the red dotted line rotates, demonstrating a fixed point and its distance from the origin.", 'The Pythagorean theorem shows the inverse relationship between sides A, B, and C, providing a mathematical explanation for their relationship.']}, {'end': 557.785, 'start': 402.785, 'title': 'Pca: maximizing sum of squared distances', 'summary': 'Explains how principal component analysis (pca) maximizes the sum of squared distances from the projected points to the origin to find the best fitting line, emphasizing the process of measuring, squaring, and summing the distances to identify the line with the largest sum of squared distances.', 'duration': 155, 'highlights': ['PCA maximizes the sum of squared distances from the projected points to the origin to find the best fitting line. 
This demonstrates the core objective of PCA, which is to maximize the sum of squared distances to identify the best fitting line.', 'Measuring, squaring, and summing the distances is essential in the process of identifying the line with the largest sum of squared distances. The process of measuring, squaring, and summing the distances plays a crucial role in determining the line with the largest sum of squared distances, showcasing the key steps involved in PCA.', 'Explains the rationale behind squaring the distances to prevent negative values from canceling out positive values. This highlights the reasoning behind squaring the distances, which is to ensure that negative values do not cancel out positive values, thereby maintaining the accuracy of the measurement process.']}], 'duration': 336.761, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FgakZw6K1QQ/pics/FgakZw6K1QQ221024.jpg', 'highlights': ['PCA projects data onto a line, measures distances, finding the line that minimizes distances', 'The chapter explains the process of plotting and centering data, followed by the steps involved in fitting a line using PCA', 'The line is rotated to fit data, line goes through the origin', 'PCA maximizes the sum of squared distances from the projected points to the origin to find the best fitting line', 'The process of measuring, squaring, and summing the distances plays a crucial role in determining the line with the largest sum of squared distances, showcasing the key steps involved in PCA', 'The Pythagorean theorem shows the inverse relationship between sides A, B, and C, providing a mathematical explanation for their relationship']}, {'end': 998.218, 'segs': [{'end': 635.756, 'src': 'embed', 'start': 559.426, 'weight': 0, 'content': [{'end': 564.589, 'text': 'This line is called principal component 1, or PC1 for short.', 'start': 559.426, 'duration': 5.163}, {'end': 569.992, 'text': 'PC1 has a slope of 0.25.', 'start': 566.37, 'duration': 
3.622}, {'end': 581.661, 'text': 'In other words, for every 4 units that we go out along the gene 1 axis, we go up one unit along the gene 2 axis.', 'start': 569.992, 'duration': 11.669}, {'end': 593.163, 'text': 'That means that the data are mostly spread out along the gene 1 axis, and only a little bit spread out along the gene 2 axis.', 'start': 583.521, 'duration': 9.642}, {'end': 599.584, 'text': 'One way to think about PC1 is in terms of a cocktail recipe.', 'start': 595.323, 'duration': 4.261}, {'end': 610.302, 'text': 'To make PC1, Mix four parts Gene 1 with one part Gene 2.', 'start': 601.145, 'duration': 9.157}, {'end': 612.344, 'text': 'Pour over ice and serve.', 'start': 610.302, 'duration': 2.042}, {'end': 623.113, 'text': 'The ratio of Gene 1 to Gene 2 tells you that Gene 1 is more important when it comes to describing how the data are spread out.', 'start': 614.405, 'duration': 8.708}, {'end': 635.756, 'text': 'Oh no! Terminology alert! Mathematicians call this cocktail recipe a linear combination of genes 1 and 2.', 'start': 625.094, 'duration': 10.662}], 'summary': 'Pc1 has a slope of 0.25, indicating data spread along gene 1 axis.', 'duration': 76.33, 'max_score': 559.426, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FgakZw6K1QQ/pics/FgakZw6K1QQ559426.jpg'}, {'end': 707.396, 'src': 'embed', 'start': 671.805, 'weight': 3, 'content': [{'end': 676.586, 'text': 'So the length of the red line is 4.12.', 'start': 671.805, 'duration': 4.781}, {'end': 685.515, 'text': 'When you do PCA with SVD, the recipe for PC1 is scaled so that this length equals 1.', 'start': 676.586, 'duration': 8.929}, {'end': 692.001, 'text': 'All we have to do to scale the triangle so that the red line is one unit long is to divide each side by 4.12.', 'start': 685.515, 'duration': 6.486}, {'end': 705.794, 'text': "For those of you keeping score, here's the math worked out that shows that all we need to do is divide all three sides by 4.12.", 
'start': 692.001, 'duration': 13.793}, {'end': 707.396, 'text': 'Here are the scaled values.', 'start': 705.794, 'duration': 1.602}], 'summary': 'Using pca with svd, scaling the triangle sides by dividing each by 4.12 makes the red line 1 unit long.', 'duration': 35.591, 'max_score': 671.805, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FgakZw6K1QQ/pics/FgakZw6K1QQ671805.jpg'}, {'end': 762.614, 'src': 'embed', 'start': 732.281, 'weight': 4, 'content': [{'end': 746.247, 'text': 'This one unit long vector consisting of 0.97 parts gene 1 and 0.242 parts gene 2 is called the singular vector or the eigenvector for PC1..', 'start': 732.281, 'duration': 13.966}, {'end': 751.569, 'text': 'And the proportions of each gene are called loading scores.', 'start': 747.807, 'duration': 3.762}, {'end': 762.614, 'text': "Also, while I'm at it, PCA calls the average of the sums of the squared distances for the best fit line the eigenvalue for PC1.", 'start': 753.507, 'duration': 9.107}], 'summary': 'A singular vector consists of 0.97 parts gene 1 and 0.242 parts gene 2, with the loading scores indicating the proportions of each gene. 
pca calculates the eigenvalue for pc1 as the average of the sums of the squared distances for the best fit line.', 'duration': 30.333, 'max_score': 732.281, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FgakZw6K1QQ/pics/FgakZw6K1QQ732281.jpg'}, {'end': 781.168, 'src': 'heatmap', 'start': 732.281, 'weight': 0.805, 'content': [{'end': 746.247, 'text': 'This one unit long vector consisting of 0.97 parts gene 1 and 0.242 parts gene 2 is called the singular vector or the eigenvector for PC1..', 'start': 732.281, 'duration': 13.966}, {'end': 751.569, 'text': 'And the proportions of each gene are called loading scores.', 'start': 747.807, 'duration': 3.762}, {'end': 762.614, 'text': "Also, while I'm at it, PCA calls the average of the sums of the squared distances for the best fit line the eigenvalue for PC1.", 'start': 753.507, 'duration': 9.107}, {'end': 770.901, 'text': 'And the square root of the sums of the squared distances is called the singular value for PC1.', 'start': 764.596, 'duration': 6.305}, {'end': 775.404, 'text': "Bam! That's a lot of terminology.", 'start': 772.802, 'duration': 2.602}, {'end': 781.168, 'text': "Now that we've got PC1 all figured out, let's work on PC2.", 'start': 777.546, 'duration': 3.622}], 'summary': 'Singular vector for pc1: 0.97 gene 1, 0.242 gene 2. loading scores, eigenvalue, singular value defined. 
moving on to pc2.', 'duration': 48.887, 'max_score': 732.281, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FgakZw6K1QQ/pics/FgakZw6K1QQ732281.jpg'}, {'end': 969.408, 'src': 'embed', 'start': 936.326, 'weight': 5, 'content': [{'end': 944.326, 'text': 'That means that the total variation around both PCs is 15 plus 3 equals 18.', 'start': 936.326, 'duration': 8}, {'end': 955.416, 'text': 'And that means PC1 accounts for 15 divided by 18 equals 0.83 or 83% of the total variation around the PCs.', 'start': 944.326, 'duration': 11.09}, {'end': 965.224, 'text': 'PC2 accounts for 3 divided by 18 equals 17% of the total variation around the PCs.', 'start': 957.437, 'duration': 7.787}, {'end': 969.408, 'text': 'Oh no, another terminology alert.', 'start': 967.086, 'duration': 2.322}], 'summary': 'Pc1 accounts for 83% and pc2 accounts for 17% of the total variation around the pcs.', 'duration': 33.082, 'max_score': 936.326, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FgakZw6K1QQ/pics/FgakZw6K1QQ936326.jpg'}], 'start': 559.426, 'title': 'Pca and data spread', 'summary': 'Explains principal component 1 (pc1) with a slope of 0.25, signifying that the data are mostly spread out along the gene 1 axis, and only a little bit along the gene 2 axis, illustrated in a cocktail recipe analogy. 
it also explains the process of principal component analysis (pca) using singular value decomposition (svd), including the calculation of principal components pc1 and pc2, scaling of vectors, terminology, and variation percentages, demonstrating the fundamental concepts of pca.', 'chapters': [{'end': 635.756, 'start': 559.426, 'title': 'Pc1 and data spread', 'summary': 'Explains principal component 1 (pc1) with a slope of 0.25, signifying that the data are mostly spread out along the gene 1 axis, and only a little bit along the gene 2 axis, illustrated in a cocktail recipe analogy.', 'duration': 76.33, 'highlights': ['PC1 has a slope of 0.25, indicating that for every 4 units along gene 1 axis, there is a 1 unit increase along gene 2 axis.', 'Data are mostly spread out along the gene 1 axis and only a little bit spread out along the gene 2 axis, emphasizing the importance of Gene 1 in describing data spread.', 'The cocktail recipe analogy explains PC1 as mixing four parts Gene 1 with one part Gene 2, highlighting the significance of Gene 1 in the data spread.']}, {'end': 998.218, 'start': 635.756, 'title': 'Understanding pca using singular value decomposition', 'summary': 'Explains the process of principal component analysis (pca) using singular value decomposition (svd), including the calculation of principal components pc1 and pc2, scaling of vectors, terminology, and variation percentages, demonstrating the fundamental concepts of pca.', 'duration': 362.462, 'highlights': ['PC1 is a linear combination of variables, with the recipe for PC1 scaling its length to 1 when using PCA with SVD.', 'The singular vector for PC1 consists of 0.97 parts gene 1 and 0.242 parts gene 2, with gene proportions called loading scores, and the eigenvalue and singular value defining PC1.', 'The recipe for PC2 is negative 1 part gene 1 to 4 parts gene 2, and the singular vector for PC2 is -0.242 parts gene 1 and 0.97 parts gene 2, with corresponding loading scores and eigenvalue 
calculations.', 'The total variation around both PCs is 18, with PC1 and PC2 accounting for 83% and 17% of the total variation, respectively.', 'A scree plot visually represents the percentages of variation that each PC accounts for.']}], 'duration': 438.792, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FgakZw6K1QQ/pics/FgakZw6K1QQ559426.jpg', 'highlights': ['PC1 has a slope of 0.25, indicating 4 units along gene 1 axis for 1 unit increase along gene 2 axis.', 'Data are mostly spread out along gene 1 axis, emphasizing its importance in describing data spread.', "Cocktail recipe analogy explains PC1 as mixing four parts Gene 1 with one part Gene 2, highlighting Gene 1's significance.", 'PC1 is a linear combination of variables, scaling its length to 1 when using PCA with SVD.', 'Singular vector for PC1 consists of 0.97 parts gene 1 and 0.242 parts gene 2, defining PC1.', 'Total variation around both PCs is 18, with PC1 and PC2 accounting for 83% and 17% of the total variation, respectively.']}, {'end': 1316.75, 'segs': [{'end': 1074.908, 'src': 'embed', 'start': 1038.738, 'weight': 0, 'content': [{'end': 1043.319, 'text': 'In this case, gene 1 is the most important ingredient for PC2.', 'start': 1038.738, 'duration': 4.581}, {'end': 1052.541, 'text': 'Lastly, we find PC3, the best fitting line that goes through the origin and is perpendicular to PC1 and PC2.', 'start': 1044.839, 'duration': 7.702}, {'end': 1063.119, 'text': 'If we had more genes, we just keep on finding more and more principal components by adding perpendicular lines and rotating them.', 'start': 1055.273, 'duration': 7.846}, {'end': 1074.908, 'text': 'In theory, there is one per gene or variable, but in practice the number of PCs is either the number of variables or the number of samples,', 'start': 1065.281, 'duration': 9.627}], 'summary': 'Finding principal components for genes in practice: pcs= number of variables or samples.', 'duration': 36.17, 'max_score': 
1038.738, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FgakZw6K1QQ/pics/FgakZw6K1QQ1038738.jpg'}, {'end': 1157.062, 'src': 'embed', 'start': 1087.074, 'weight': 2, 'content': [{'end': 1095.336, 'text': 'Once you have all the principal components figured out, you can use the eigenvalues i.e., the sums of squares of the distances,', 'start': 1087.074, 'duration': 8.262}, {'end': 1099.258, 'text': 'to determine the proportion of variation that each PC accounts for.', 'start': 1095.336, 'duration': 3.922}, {'end': 1109.08, 'text': 'In this case, PC1 accounts for 79% of the variation, PC2 accounts for 15% of the variation, and PC3 accounts for 6% of the variation.', 'start': 1100.978, 'duration': 8.102}, {'end': 1117.376, 'text': "Here's the scree plot.", 'start': 1116.135, 'duration': 1.241}, {'end': 1124.018, 'text': 'PC1 and PC2 account for the vast majority of the variation.', 'start': 1119.196, 'duration': 4.822}, {'end': 1134.442, 'text': 'That means that a 2D graph using just PC1 and PC2 would be a good approximation of this 3D graph,', 'start': 1125.659, 'duration': 8.783}, {'end': 1137.123, 'text': 'since it would account for 94% of the variation in the data.', 'start': 1134.442, 'duration': 2.681}, {'end': 1149.777, 'text': 'To convert the 3D graph into a two-dimensional PCA graph, we just strip away everything but the data and PC1 and PC2.', 'start': 1140.25, 'duration': 9.527}, {'end': 1157.062, 'text': 'Then project the samples onto PC1 and PC2.', 'start': 1151.578, 'duration': 5.484}], 'summary': 'Pc1 accounts for 79%, pc2 for 15%, and pc3 for 6% of the variation. 
pc1 and pc2 together account for 94% of the variation, making them a good approximation for a 3d graph.', 'duration': 69.988, 'max_score': 1087.074, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FgakZw6K1QQ/pics/FgakZw6K1QQ1087074.jpg'}], 'start': 999.638, 'title': 'Principal component analysis and 2d visualization', 'summary': 'Covers the process of finding principal components and their application in reducing data dimensionality. it emphasizes that pc1 is determined by gene 3, pc2 by gene 1, and pc1 and pc2 account for 94% of the variation, making a 2d graph a good approximation of the 3d graph. the importance of using eigenvalues to determine the proportion of variation each pc accounts for is also highlighted.', 'chapters': [{'end': 1085.076, 'start': 999.638, 'title': 'Principal component analysis', 'summary': 'Explains the process of finding principal components, where pc1 is determined by gene 3, pc2 is determined by gene 1, and the number of principal components is limited by the number of variables or samples.', 'duration': 85.438, 'highlights': ['PC1 is determined by gene 3 The best fitting line for PC1 is determined by gene 3 as the most important ingredient.', 'PC2 is determined by gene 1 PC2 is determined by gene 1 as the most important ingredient in its recipe.', 'Limitation on number of principal components The number of principal components is limited by the number of variables or samples, whichever is smaller.']}, {'end': 1316.75, 'start': 1087.074, 'title': 'Pca analysis and 2d visualization', 'summary': 'Explains the process of principal component analysis (pca) and its application in reducing the dimensionality of data. 
it highlights that pc1 and pc2 account for 94% of the variation, making a 2d graph a good approximation of the 3d graph, and emphasizes the importance of using the eigenvalues to determine the proportion of variation each pc accounts for.', 'duration': 229.676, 'highlights': ['PC1 and PC2 account for 94% of the variation, making a 2D graph a good approximation of the 3D graph.', 'PC1 accounts for 79% of the variation, PC2 accounts for 15% of the variation, and PC3 accounts for 6% of the variation.', 'Even a noisy PCA plot can be used to identify clusters of data.', 'PC1 and PC2 account for 90% of the variation, allowing the creation of a two-dimensional PCA graph.', 'The chapter emphasizes the importance of using the eigenvalues to determine the proportion of variation that each PC accounts for.']}], 'duration': 317.112, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/FgakZw6K1QQ/pics/FgakZw6K1QQ999638.jpg', 'highlights': ['PC1 is determined by gene 3 as the most important ingredient.', 'PC2 is determined by gene 1 as the most important ingredient in its recipe.', 'PC1 and PC2 account for 94% of the variation, making a 2D graph a good approximation of the 3D graph.', 'PC1 accounts for 79% of the variation, PC2 accounts for 15% of the variation, and PC3 accounts for 6% of the variation.', 'The importance of using eigenvalues to determine the proportion of variation each PC accounts for is highlighted.']}], 'highlights': ['PCA provides deeper insight into data by reducing dimensionality, allowing visualization and identification of patterns and outliers.', 'Illustrates using a simple data set of gene transcription in six mice, showcasing the practical application of PCA.', 'Explains how measuring one gene allows data plotting on a number line, highlighting differences in gene expression among the six mice.', 'PCA can transform four or more gene measurements into a two-dimensional plot, revealing clusters of similar mice.', 'PCA can 
identify the most valuable gene for clustering the data, such as gene 3 responsible for separating samples along the x-axis.', 'If we measured two genes, then we can plot the data on a two-dimensional XY graph with Gene 1 on the x-axis and Gene 2 on the y-axis, showing the clustering of mice 1, 2, and 3 on the right side and mice 4, 5, and 6 on the lower left-hand side.', 'PCA projects data onto a line, measures distances, finding the line that minimizes distances', 'The chapter explains the process of plotting and centering data, followed by the steps involved in fitting a line using PCA', 'The line is rotated to fit data, line goes through the origin', 'PCA maximizes the sum of squared distances from the projected points to the origin to find the best fitting line', 'The process of measuring, squaring, and summing the distances plays a crucial role in determining the line with the largest sum of squared distances, showcasing the key steps involved in PCA', 'The Pythagorean theorem shows the inverse relationship between sides A, B, and C, providing a mathematical explanation for their relationship', 'PC1 has a slope of 0.25, indicating 4 units along gene 1 axis for 1 unit increase along gene 2 axis.', 'Data are mostly spread out along gene 1 axis, emphasizing its importance in describing data spread.', "Cocktail recipe analogy explains PC1 as mixing four parts Gene 1 with one part Gene 2, highlighting Gene 1's significance.", 'PC1 is a linear combination of variables, scaling its length to 1 when using PCA with SVD.', 'Singular vector for PC1 consists of 0.97 parts gene 1 and 0.242 parts gene 2, defining PC1.', 'Total variation around both PCs is 18, with PC1 and PC2 accounting for 83% and 17% of the total variation, respectively.', 'PC1 is determined by gene 3 as the most important ingredient.', 'PC2 is determined by gene 1 as the most important ingredient in its recipe.', 'PC1 and PC2 account for 94% of the variation, making a 2D graph a good approximation of 
the 3D graph.', 'PC1 accounts for 79% of the variation, PC2 accounts for 15% of the variation, and PC3 accounts for 6% of the variation.', 'The importance of using eigenvalues to determine the proportion of variation each PC accounts for is highlighted.']}
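
The numbers quoted in the transcript above can be checked with a little arithmetic. The sketch below only reuses figures that appear in the transcript (the 4-to-1 recipe for PC1, the variations 15 and 3 for the two PCs, and 79%, 15% and 6% in the three-gene example); the helper split_distances and the test point (3, 2) are hypothetical and exist only to illustrate the Pythagorean argument.

```python
import math

# PC1 "cocktail recipe" from the transcript: 4 parts gene 1 to 1 part gene 2 (slope 0.25).
recipe = (4.0, 1.0)
length = math.hypot(*recipe)                   # length of the red line, about 4.12
pc1 = tuple(part / length for part in recipe)  # unit-length singular vector (loading scores)
pc2 = (-pc1[1], pc1[0])                        # perpendicular to PC1: -1 part gene 1, 4 parts gene 2

# Variation around each PC in the two-gene example from the transcript.
variation = {"PC1": 15.0, "PC2": 3.0}
total = sum(variation.values())
percent = {pc: 100.0 * value / total for pc, value in variation.items()}

# Three-gene example: PC1 and PC2 together account for 79 + 15 percent of the variation.
two_pc_percent = 79 + 15


def split_distances(point, angle):
    """Hypothetical helper: for a line through the origin at `angle` (radians),
    return (b, c), where b is the distance from `point` to the line and c is the
    distance from the projected point to the origin."""
    ux, uy = math.cos(angle), math.sin(angle)
    c = abs(point[0] * ux + point[1] * uy)
    b = abs(point[0] * uy - point[1] * ux)
    return b, c


print("length:", round(length, 2))
print("PC1:", pc1, "PC2:", pc2)
print("percent variation:", percent)
for angle in (0.0, 0.3, 0.6):
    b, c = split_distances((3.0, 2.0), angle)
    # a**2 = b**2 + c**2 stays fixed, so making b smaller forces c to get bigger.
    print(angle, b, c, b * b + c * c)
```

Because a, the distance from the point to the origin, never changes as the line rotates, minimizing b and maximizing c pick out the same line, which is why the two descriptions of the best fit are equivalent.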