title

Principal Component Analysis (PCA) clearly explained (2015)

description

NOTE: On April 2, 2018, I posted an updated video that goes through PCA, step by step, and shows how it is performed. Check it out!
https://youtu.be/FgakZw6K1QQ
RNA-seq results often contain a PCA or MDS plot. This StatQuest explains how these graphs are generated, how to interpret them, and how to determine if the plot is informative or not. I've got example code (in R) for how to do PCA and extract the most important information from it on the StatQuest GitHub: https://github.com/StatQuest/pca_demo/blob/master/pca_demo.R
For a complete index of all the StatQuest videos, check out:
https://statquest.org/video-index/
If you'd like to support StatQuest, please consider...
Buying The StatQuest Illustrated Guide to Machine Learning!!!
PDF - https://statquest.gumroad.com/l/wvtmc
Paperback - https://www.amazon.com/dp/B09ZCKR4H6
Kindle eBook - https://www.amazon.com/dp/B09ZG79HXC
Patreon: https://www.patreon.com/statquest
...or...
YouTube Membership: https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw/join
...a cool StatQuest t-shirt or sweatshirt:
https://shop.spreadshirt.com/statquest-with-josh-starmer/
...buying one or two of my songs (or go large and get a whole album!)
https://joshuastarmer.bandcamp.com/
...or just donating to StatQuest!
https://www.paypal.me/statquest
Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on Twitter:
https://twitter.com/joshuastarmer
0:00 Awesome song and introduction
1:45 An introduction to dimensions
6:29 Why we can omit dimensions
8:53 Principal components in terms of variance and covariance!!!
12:47 Transforming samples with loading scores
17:49 Review of main ideas
19:11 Scree plots for diagnostics
19:39 Loadings and Eigenvectors
#statquest #PCA

detail

{'title': 'Principal Component Analysis (PCA) clearly explained (2015)', 'heatmap': [], 'summary': 'Provides a comprehensive explanation of principal component analysis (pca) and its application to single-cell rna sequencing data, demonstrating how it compresses 10,000 genes into a single graph, identifies cell type clusters, understands dimensions, flattens data for 2d visualization, explores 2d data variation, and explains the process of calculating principal components.', 'chapters': [{'end': 110.932, 'segs': [{'end': 62.967, 'src': 'embed', 'start': 36.296, 'weight': 0, 'content': [{'end': 40.319, 'text': "Here's an example PCA plot that I got from an article that I was just reading.", 'start': 36.296, 'duration': 4.023}, {'end': 42.761, 'text': 'It shows clusters of cell types.', 'start': 40.98, 'duration': 1.781}, {'end': 47.626, 'text': 'This graph was drawn from single-cell RNA sequencing data.', 'start': 44.363, 'duration': 3.263}, {'end': 51.601, 'text': 'there were about 10,000 transcribed genes in each cell.', 'start': 48.459, 'duration': 3.142}, {'end': 57.004, 'text': 'And each dot in this graph represents a single cell and its transcription profile.', 'start': 52.181, 'duration': 4.823}, {'end': 62.967, 'text': 'The general idea is that cells with similar transcription profiles should cluster.', 'start': 57.884, 'duration': 5.083}], 'summary': 'Pca plot shows cell type clusters from 10,000 genes in single-cell rna data', 'duration': 26.671, 'max_score': 36.296, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_UVHneBUBW0/pics/_UVHneBUBW036296.jpg'}], 'start': 13.636, 'title': 'Introduction to principal component analysis', 'summary': 'Introduces principal component analysis (pca) and its application to single-cell rna sequencing data. 
it demonstrates how pca compresses data from 10,000 genes into a single graph while capturing the essence of the original data and identifying clusters of cell types.', 'chapters': [{'end': 110.932, 'start': 13.636, 'title': 'Introduction to principal component analysis', 'summary': 'Introduces principal component analysis (pca) and its application to single-cell rna sequencing data, demonstrating how pca compresses data from 10,000 genes into a single graph while capturing the essence of the original data and identifying clusters of cell types.', 'duration': 97.296, 'highlights': ['PCA compresses transcription data from 10,000 genes into a single graph, allowing cells with similar transcription profiles to cluster, as demonstrated in a single-cell RNA sequencing data example.', 'The method captures the essence of the original data, demonstrating its ability to compress a lot of data into a meaningful representation.', 'The example graph shows distinct clusters for different cell types, such as blood cells, pluripotent cells, neuronal cells, and dermal/epidermal cells, indicating the effectiveness of PCA in identifying patterns and clusters within the data.', "The chapter provides background material before delving into the details of PCA, aiming to educate viewers on the method's compression capabilities and the interpretation of its output labels."]}], 'duration': 97.296, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_UVHneBUBW0/pics/_UVHneBUBW013636.jpg', 'highlights': ['PCA compresses transcription data from 10,000 genes into a single graph, allowing cells with similar transcription profiles to cluster.', 'The example graph shows distinct clusters for different cell types, indicating the effectiveness of PCA in identifying patterns and clusters within the data.', 'The method captures the essence of the original data, demonstrating its ability to compress a lot of data into a meaningful representation.', "The chapter provides 
background material before delving into the details of PCA, aiming to educate viewers on the method's compression capabilities and the interpretation of its output labels."]}, {'end': 326.938, 'segs': [{'end': 146.01, 'src': 'embed', 'start': 111.332, 'weight': 0, 'content': [{'end': 113.693, 'text': "We're going to have an introduction to dimensions.", 'start': 111.332, 'duration': 2.361}, {'end': 118.574, 'text': 'Just to warn you, this is going to seem very, very simple.', 'start': 114.573, 'duration': 4.001}, {'end': 121.015, 'text': 'But just hang in there.', 'start': 120.015, 'duration': 1}, {'end': 122.535, 'text': "You'll be glad we did this.", 'start': 121.335, 'duration': 1.2}, {'end': 124.736, 'text': "It'll keep your head from exploding.", 'start': 122.956, 'duration': 1.78}, {'end': 133.399, 'text': "If you can remember all the way back to first or second grade, you'll remember that one dimension equals a number line.", 'start': 125.897, 'duration': 7.502}, {'end': 138.746, 'text': 'Now, imagine we had a pretend RNA-seq dataset for a single cell.', 'start': 134.204, 'duration': 4.542}, {'end': 146.01, 'text': "Here, I've labeled the genes just A, B, and C, and the read counts are 10, 0, and 14 for those genes.", 'start': 139.727, 'duration': 6.283}], 'summary': 'Introduction to dimensions using rna-seq dataset, with 10, 0, and 14 read counts for genes a, b, and c.', 'duration': 34.678, 'max_score': 111.332, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_UVHneBUBW0/pics/_UVHneBUBW0111332.jpg'}, {'end': 220.482, 'src': 'embed', 'start': 190.976, 'weight': 1, 'content': [{'end': 196.138, 'text': 'Even though our number line is a very simple graph, we can get some useful information out of it.', 'start': 190.976, 'duration': 5.162}, {'end': 202.14, 'text': "Now, let's fast forward to fifth or sixth grade, when we learned about two-dimensional graphs.", 'start': 197.138, 'duration': 5.002}, {'end': 209.836, 'text': 'Now we 
have two axes instead of just one, and now we can plot data from two different cells instead of just one.', 'start': 203.112, 'duration': 6.724}, {'end': 214.419, 'text': "Here's a pretend RNA sequencing data set for two single cells.", 'start': 210.697, 'duration': 3.722}, {'end': 220.482, 'text': 'Just like before, we have the same genes, but now we have read counts for two separate cells.', 'start': 215.359, 'duration': 5.123}], 'summary': 'Introduction to two-dimensional graphs for rna sequencing data in 5th or 6th grade', 'duration': 29.506, 'max_score': 190.976, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_UVHneBUBW0/pics/_UVHneBUBW0190976.jpg'}, {'end': 298.139, 'src': 'embed', 'start': 274.903, 'weight': 2, 'content': [{'end': 283.749, 'text': "meaning if a gene is highly transcribed in cell 1, that doesn't tell us anything about whether it's highly or lowly transcribed in cell 2..", 'start': 274.903, 'duration': 8.846}, {'end': 290.113, 'text': 'Okay, so maybe sometime when we took calculus, we started drawing three-dimensional graphs.', 'start': 283.749, 'duration': 6.364}, {'end': 292.615, 'text': "That's just a fancy graph that has depth.", 'start': 290.614, 'duration': 2.001}, {'end': 298.139, 'text': 'With three separate axes, we can now plot data from three separate cells.', 'start': 293.476, 'duration': 4.663}], 'summary': 'Gene transcription varies between cells, visualized in 3d graphs.', 'duration': 23.236, 'max_score': 274.903, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_UVHneBUBW0/pics/_UVHneBUBW0274903.jpg'}], 'start': 111.332, 'title': 'Understanding dimensions and rna sequencing data analysis', 'summary': 'Introduces dimensions through transcript count plotting, demonstrates two-dimensional graphing with rna sequencing data for single cells, and explains gene expression correlation and visualization using three-dimensional graphs.', 'chapters': [{'end': 190.496, 'start': 
111.332, 'title': 'Introduction to dimensions', 'summary': 'Introduces dimensions with a simple concept of plotting transcript counts on a number line, demonstrating uniform and non-uniform distributions, using an rna-seq dataset for a single cell.', 'duration': 79.164, 'highlights': ['The concept of dimensions is introduced with a simple analogy to plotting transcript counts on a number line, similar to first or second-grade mathematics.', 'An RNA-seq dataset for a single cell is used to demonstrate the plotting of genes A, B, and C, with read counts of 10, 0, and 14, respectively.', 'The illustration shows the potential outcomes of plotting all genes, including a uniform distribution of transcript counts or a non-uniform distribution, based on their transcription levels.']}, {'end': 248.76, 'start': 190.976, 'title': 'Understanding two-dimensional graphs', 'summary': 'Explains the concept of two-dimensional graphs by using a pretend rna sequencing data set for two single cells, illustrating how to plot the data for gene a, gene b, and gene c.', 'duration': 57.784, 'highlights': ['The chapter explains the concept of two-dimensional graphs using a pretend RNA sequencing data set for two single cells.', 'Illustrates how to plot the data for gene A by going over to 10 for cell 1 and going up to 8 for cell 2.', 'Explains how to plot the data for gene B by going over 0 for cell 1 and going up 2 for cell 2.', 'Describes how to plot the data for gene C by going over 14 and up 10.']}, {'end': 326.938, 'start': 248.76, 'title': 'Rna sequencing data analysis', 'summary': 'Explains the correlation of gene expression in single cells and the visualization of rna sequencing data using three-dimensional graphs.', 'duration': 78.178, 'highlights': ['The chapter explains the correlation of gene expression in single cells, where highly transcribed genes in one cell are also highly transcribed in another cell, and lowly transcribed genes in one cell are also lowly transcribed in 
another cell.', 'It describes the visualization of RNA sequencing data using three-dimensional graphs, allowing the plotting of data from three separate cells and the representation of gene expression for each cell.']}], 'duration': 215.606, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_UVHneBUBW0/pics/_UVHneBUBW0111332.jpg', 'highlights': ['Introduces dimensions through transcript count plotting with a simple analogy to first or second-grade mathematics.', 'Demonstrates plotting of genes A, B, and C with read counts of 10, 0, and 14, respectively for a single cell RNA-seq dataset.', 'Explains gene expression correlation in single cells and visualization using three-dimensional graphs.']}, {'end': 421.481, 'segs': [{'end': 354.881, 'src': 'embed', 'start': 327.788, 'weight': 1, 'content': [{'end': 330.87, 'text': "I'm not going to do too many examples of this because you get the idea.", 'start': 327.788, 'duration': 3.082}, {'end': 334.711, 'text': 'So this is what we know about dimensions so far.', 'start': 332.55, 'duration': 2.161}, {'end': 341.335, 'text': "If we have one cell's worth of data, we only need to have a one-dimensional graph, which is just a number line.", 'start': 335.372, 'duration': 5.963}, {'end': 350.359, 'text': 'If we have data from two cells, then we need a two-dimensional graph, which is just an XY graph that we learned about in fifth grade.', 'start': 342.155, 'duration': 8.204}, {'end': 354.881, 'text': 'If we have data from three cells, then we need a three-dimensional graph.', 'start': 351.459, 'duration': 3.422}], 'summary': 'Data from one cell requires one-dimensional graph, two cells need two-dimensional, and three cells need three-dimensional graph.', 'duration': 27.093, 'max_score': 327.788, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_UVHneBUBW0/pics/_UVHneBUBW0327788.jpg'}, {'end': 425.949, 'src': 'embed', 'start': 400.626, 'weight': 0, 'content': [{'end': 406.871, 
'text': 'That is to say, cell 1 has some genes that are lowly transcribed and some genes that are highly transcribed.', 'start': 400.626, 'duration': 6.245}, {'end': 412.735, 'text': 'But it looks like all of cell 2 genes are all transcribed at the same level.', 'start': 407.611, 'duration': 5.124}, {'end': 421.481, 'text': "If we flatten the data, that is, removed the up and down variation, our graph wouldn't look much different from what it looked like before.", 'start': 413.936, 'duration': 7.545}, {'end': 425.949, 'text': 'And if we flatten the data, we could just graph it with a single number line.', 'start': 422.628, 'duration': 3.321}], 'summary': 'Cell 1 has variable gene transcription, while cell 2 shows consistent transcription. flattening the data reveals minimal change in the graph.', 'duration': 25.323, 'max_score': 400.626, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_UVHneBUBW0/pics/_UVHneBUBW0400626.jpg'}], 'start': 327.788, 'title': 'Understanding dimensions in data', 'summary': 'Explains the concept of dimensions in data, highlighting the relationship between cell data and the corresponding dimensional graphs, emphasizing the limitations of drawing high-dimensional graphs on paper, and discussing the importance of different dimensions using a hypothetical two-cell data set.', 'chapters': [{'end': 421.481, 'start': 327.788, 'title': 'Understanding dimensions in data', 'summary': 'Explains the concept of dimensions in data, highlighting the relationship between cell data and the corresponding dimensional graphs, emphasizing the limitations of drawing high-dimensional graphs on paper, and discussing the importance of different dimensions using a hypothetical two-cell data set.', 'duration': 93.693, 'highlights': ['The concept of dimensions in data is explained, with the relationship between the number of cells and the corresponding dimensional graphs being highlighted, such as the need for one-dimensional, 
two-dimensional, and three-dimensional graphs based on the data from one, two, and three cells, respectively.', 'The limitation of drawing high-dimensional graphs on paper is discussed, illustrating the impracticality of representing a four-dimensional graph and emphasizing the impossibility of drawing a 200-dimensional graph.', 'The importance of different dimensions is discussed using a hypothetical two-cell data set, emphasizing the variation in the data and the impact of flattening the data on the graph representation.']}], 'duration': 93.693, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_UVHneBUBW0/pics/_UVHneBUBW0327788.jpg', 'highlights': ['The concept of dimensions in data is explained, emphasizing the relationship between cell data and dimensional graphs.', 'The limitation of drawing high-dimensional graphs on paper is discussed, highlighting the impracticality of representing a four-dimensional graph.', 'The importance of different dimensions is discussed using a hypothetical two-cell data set, emphasizing the impact of flattening the data on graph representation.']}, {'end': 549.834, 'segs': [{'end': 500.934, 'src': 'embed', 'start': 475.658, 'weight': 3, 'content': [{'end': 482.002, 'text': 'Anyways, people look like people, things look like things, even when they have no depth and are flattened on a screen.', 'start': 475.658, 'duration': 6.344}, {'end': 489.667, 'text': 'Basically, a movie camera takes 3D information and flattens it to 2D without too much loss of information.', 'start': 483.003, 'duration': 6.664}, {'end': 497.251, 'text': 'To summarize what we know so far we know that each cell that we sequence adds another dimension,', 'start': 490.827, 'duration': 6.424}, {'end': 500.934, 'text': 'and we also know that some dimensions are more important than others.', 'start': 497.251, 'duration': 3.683}], 'summary': 'Movie camera flattens 3d to 2d with minimal loss. 
each sequenced cell adds a dimension, some more important than others.', 'duration': 25.276, 'max_score': 475.658, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_UVHneBUBW0/pics/_UVHneBUBW0475658.jpg'}, {'end': 549.834, 'src': 'embed', 'start': 524.915, 'weight': 0, 'content': [{'end': 533.138, 'text': 'For any biologists out there, this is sort of like flattening a Z-stack of microscope images to make a single, two-dimensional image for publication.', 'start': 524.915, 'duration': 8.223}, {'end': 535.587, 'text': "So let's start with an example.", 'start': 534.266, 'duration': 1.321}, {'end': 538.148, 'text': "Again, we'll just start with two cells.", 'start': 536.407, 'duration': 1.741}, {'end': 539.689, 'text': "Here's the data.", 'start': 538.968, 'duration': 0.721}, {'end': 546.693, 'text': "Like before, the genes are imaginary, so I've just listed them from A to I.", 'start': 540.349, 'duration': 6.344}, {'end': 549.834, 'text': "And here's a 2D plot from the data from two cells.", 'start': 546.693, 'duration': 3.141}], 'summary': 'Data visualization involves flattening z-stack images to create 2d plots for publication.', 'duration': 24.919, 'max_score': 524.915, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_UVHneBUBW0/pics/_UVHneBUBW0524915.jpg'}], 'start': 422.628, 'title': 'Flattening data for 2d visualization', 'summary': 'Discusses the effectiveness of flattening data to two or three dimensions for visualization, using examples like tv and movies represented in 2d despite being 3d. 
it also explains how pca flattens high-dimensional data to two or three dimensions by focusing on differences between cells.', 'chapters': [{'end': 549.834, 'start': 422.628, 'title': 'Flattening data for 2d visualization', 'summary': 'Discusses how flattening data from multiple dimensions to two or three dimensions can be effective in visualizing the important variations, such as in the case of tv and movies being represented in 2d despite the subjects being 3d, and how pca flattens high-dimensional data to two or three dimensions by focusing on the differences between cells.', 'duration': 127.206, 'highlights': ['Flattening data to just two or three dimensions using PCA helps in visualizing important variations and differences between cells in the dataset, making it meaningful for analysis.', "The concept of TV and movies being represented in 2D despite the subjects being 3D is highlighted, emphasizing that the third dimension usually doesn't add much to the story.", 'The process of flattening a Z-stack of microscope images to make a single, two-dimensional image for publication is compared to the concept of PCA flattening high-dimensional data to two or three dimensions, providing a relatable analogy for biologists.', "The example of flattening two cells' data into a 2D plot is demonstrated, showcasing the effective visualization of multidimensional data in a reduced dimension space."]}], 'duration': 127.206, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_UVHneBUBW0/pics/_UVHneBUBW0422628.jpg', 'highlights': ['Flattening data to just two or three dimensions using PCA helps in visualizing important variations and differences between cells in the dataset, making it meaningful for analysis.', "The example of flattening two cells' data into a 2D plot is demonstrated, showcasing the effective visualization of multidimensional data in a reduced dimension space.", 'The process of flattening a Z-stack of microscope images to make a single, 
two-dimensional image for publication is compared to the concept of PCA flattening high-dimensional data to two or three dimensions, providing a relatable analogy for biologists.', "The concept of TV and movies being represented in 2D despite the subjects being 3D is highlighted, emphasizing that the third dimension usually doesn't add much to the story."]}, {'end': 941.57, 'segs': [{'end': 677.236, 'src': 'embed', 'start': 622.571, 'weight': 1, 'content': [{'end': 628.956, 'text': 'These two new, or rotated, axes that describe the variation in the data are principal components.', 'start': 622.571, 'duration': 6.385}, {'end': 638.323, 'text': 'Principal component 1, or PC1, the first principal component, is the axis that spans the most variation in the data.', 'start': 630.237, 'duration': 8.086}, {'end': 645.668, 'text': 'PC2, or principal component number 2, is the axis that spans the second most variation.', 'start': 639.844, 'duration': 5.824}, {'end': 650.232, 'text': "So these are the general ideas we've covered so far.", 'start': 647.79, 'duration': 2.442}, {'end': 656.046, 'text': 'For each gene, we plotted a point based on how many reads were from each cell.', 'start': 651.584, 'duration': 4.462}, {'end': 661.869, 'text': 'Principal component one captures the direction where most of the variation is.', 'start': 657.567, 'duration': 4.302}, {'end': 667.651, 'text': 'Principal component two captures the direction of the second most variation.', 'start': 663.029, 'duration': 4.622}, {'end': 677.236, 'text': 'What if we had three cells? 
Just like before, principal component one would span the direction of the most variation.', 'start': 669.732, 'duration': 7.504}], 'summary': 'Principal components capture data variation, pc1 spans most variation, pc2 spans second most.', 'duration': 54.665, 'max_score': 622.571, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_UVHneBUBW0/pics/_UVHneBUBW0622571.jpg'}, {'end': 760.877, 'src': 'embed', 'start': 729.995, 'weight': 0, 'content': [{'end': 735.457, 'text': 'Principal component 200 would span the direction of the 200th most variation.', 'start': 729.995, 'duration': 5.462}, {'end': 744.205, 'text': 'Hooray! Now that we know what PC1 and PC2 are, we know what the x and y axes are in this figure.', 'start': 737, 'duration': 7.205}, {'end': 753.212, 'text': 'PC1 is the direction of the most variation in gene expression, and PC2 is the second most variation in gene expression.', 'start': 745.066, 'duration': 8.146}, {'end': 757.875, 'text': "But I bet just right now you're asking yourself this question.", 'start': 754.893, 'duration': 2.982}, {'end': 760.877, 'text': 'This is a plot of cells, not genes.', 'start': 758.435, 'duration': 2.442}], 'summary': 'Principal components pc1 and pc2 represent gene expression variation in cell plot.', 'duration': 30.882, 'max_score': 729.995, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_UVHneBUBW0/pics/_UVHneBUBW0729995.jpg'}], 'start': 550.895, 'title': '2d data variation and principal components in gene expression', 'summary': 'Explores the variation in 2d data, focusing on diagonal spread, maximum and second largest variation, and visualization techniques. 
it also delves into principal components in gene expression, detailing the capture of variation by pc1 and pc2, gene influences, and cell plotting using gene read counts.', 'chapters': [{'end': 621.423, 'start': 550.895, 'title': 'Variation in 2d data', 'summary': 'Discusses the spread of data along a diagonal line, emphasizing the maximum and second largest variation, and the ease of visualizing left-right and above-below variation using new axes.', 'duration': 70.528, 'highlights': ['The maximum variation in the data is between the two endpoints of a diagonal line. The data shows maximum variation along a diagonal line.', 'The second largest amount of variation is at the endpoints of the new line. The second largest variation occurs at the endpoints of the new line.', 'The ease of visualizing left-right and above-below variation using new axes. New axes make it easier to see left-right and above-below variation.']}, {'end': 941.57, 'start': 622.571, 'title': 'Principal components and gene expression', 'summary': 'Explains the concept of principal components in gene expression analysis, highlighting how pc1 and pc2 capture the most and second most variation, and discusses how genes influence principal components and how cells can be plotted using the combined read counts of all genes.', 'duration': 318.999, 'highlights': ['Principal component 1 (PC1) captures the direction of the most variation in gene expression, and PC2 captures the direction of the second most variation. PC1 and PC2 are key axes that describe the variation in gene expression data, with PC1 capturing the most variation and PC2 capturing the second most variation.', 'Explains how genes influence principal components through qualitative and quantitative scores, with genes at the ends of the line having higher influence and receiving higher scores. 
Genes close to the ends of the line have high influence on principal components and receive high scores, while those in the middle have low influence and lower scores.', "Describes the process of combining read counts for all genes in a cell to calculate a score for the cell, involving multiplying each gene's read count by its influence on the principal component and summing the results. The calculation of a cell's score involves multiplying each gene's read count by its influence on the principal component and summing the results, allowing for the plotting of cells based on gene expression data."]}], 'duration': 390.675, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_UVHneBUBW0/pics/_UVHneBUBW0550895.jpg', 'highlights': ['The maximum variation in the data is between the two endpoints of a diagonal line. The data shows maximum variation along a diagonal line.', 'Principal component 1 (PC1) captures the direction of the most variation in gene expression, and PC2 captures the direction of the second most variation. PC1 and PC2 are key axes that describe the variation in gene expression data, with PC1 capturing the most variation and PC2 capturing the second most variation.', 'The second largest amount of variation is at the endpoints of the new line. The second largest variation occurs at the endpoints of the new line.', 'Explains how genes influence principal components through qualitative and quantitative scores, with genes at the ends of the line having higher influence and receiving higher scores. 
Genes close to the ends of the line have high influence on principal components and receive high scores, while those in the middle have low influence and lower scores.']}, {'end': 1213.994, 'segs': [{'end': 1190.527, 'src': 'embed', 'start': 1144.885, 'weight': 0, 'content': [{'end': 1148.106, 'text': 'we could look at the influence scores in principal component number two.', 'start': 1144.885, 'duration': 3.221}, {'end': 1151.047, 'text': "But wait! There's even more? Yes.", 'start': 1149.127, 'duration': 1.92}, {'end': 1157.037, 'text': "There's a couple diagnostics you can do if you're drawing your own PCA plot.", 'start': 1153.035, 'duration': 4.002}, {'end': 1161.358, 'text': 'These are ways you can tell if your PCA is actually worth anything.', 'start': 1158.397, 'duration': 2.961}, {'end': 1169.701, 'text': 'One diagnostic plot is called a scree plot, where you plot how much variation each principal component can account for.', 'start': 1162.238, 'duration': 7.463}, {'end': 1178.324, 'text': 'What you want to see in this diagnostic plot is that most of the variation is accounted for by the first two principal components.', 'start': 1170.722, 'duration': 7.602}, {'end': 1182.266, 'text': "Lastly, here's a terminology alert.", 'start': 1179.905, 'duration': 2.361}, {'end': 1190.527, 'text': "The ways I've been describing things has been fairly intuitive, but there's actually a lot of technical jargon for principal component analysis.", 'start': 1183.305, 'duration': 7.222}], 'summary': "Analyzing influence scores in principal component two, with diagnostic plots to assess pca's worth and variation accounted for by first two components.", 'duration': 45.642, 'max_score': 1144.885, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_UVHneBUBW0/pics/_UVHneBUBW01144885.jpg'}], 'start': 942.491, 'title': 'Understanding principal component analysis', 'summary': 'Explains the process of calculating principal components, plotting them on a 
graph, and using the graph to identify key genes, as well as the diagnostic plots for assessing the quality of the pca.', 'chapters': [{'end': 1213.994, 'start': 942.491, 'title': 'Understanding principal component analysis', 'summary': 'Explains the process of calculating principal components, plotting them on a graph, and using the graph to identify key genes, as well as the diagnostic plots for assessing the quality of the pca.', 'duration': 271.503, 'highlights': ['The first principal component captures the most variation in the data. The first principal component captures the most variation in the data, providing a quantitative understanding of the impact of gene expression on the overall variation.', 'Cells with similar transcription patterns will cluster together on the graph. Cells with similar transcription patterns will cluster together on the graph, enabling the identification of distinct clusters based on gene expression patterns.', 'A diagnostic plot called a scree plot is used to assess the variation accounted for by each principal component. 
A diagnostic plot called a scree plot is used to assess the variation accounted for by each principal component, ensuring the effectiveness of the PCA by evaluating the distribution of variation across components.']}], 'duration': 271.503, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_UVHneBUBW0/pics/_UVHneBUBW0942491.jpg', 'highlights': ['The first principal component captures the most variation in the data, providing a quantitative understanding of the impact of gene expression on the overall variation.', 'Cells with similar transcription patterns will cluster together on the graph, enabling the identification of distinct clusters based on gene expression patterns.', 'A diagnostic plot called a scree plot is used to assess the variation accounted for by each principal component, ensuring the effectiveness of the PCA by evaluating the distribution of variation across components.']}], 'highlights': ['PCA compresses transcription data from 10,000 genes into a single graph, allowing cells with similar transcription profiles to cluster.', 'The example graph shows distinct clusters for different cell types, indicating the effectiveness of PCA in identifying patterns and clusters within the data.', 'The method captures the essence of the original data, demonstrating its ability to compress a lot of data into a meaningful representation.', 'Flattening data to just two or three dimensions using PCA helps in visualizing important variations and differences between cells in the dataset, making it meaningful for analysis.', 'Principal component 1 (PC1) captures the direction of the most variation in gene expression, and PC2 captures the direction of the second most variation. 
PC1 and PC2 are key axes that describe the variation in gene expression data, with PC1 capturing the most variation and PC2 capturing the second most variation.', 'The first principal component captures the most variation in the data, providing a quantitative understanding of the impact of gene expression on the overall variation.']}
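
The transcript describes three steps: find the axes that span the most variation (PC1, PC2), give each gene an influence (loading) score on each axis, and score each cell by multiplying each gene's read count by its influence and summing. The sketch below is one rough way to try those steps in plain Python. It is not the code from the StatQuest GitHub repository (which is in R). The first two rows of counts come from the transcript (genes A, B, C in cells 1 and 2); cells 3 and 4, the function names, and the use of power iteration on a covariance matrix are assumptions made for illustration. The read counts are also mean-centered before scoring, which is standard practice rather than something the transcript states.

```python
import math

# Rows are cells, columns are genes A, B, C.
# Cells 1 and 2 use the counts quoted in the transcript; cells 3 and 4 are invented.
counts = [
    [10, 0, 14],  # cell 1
    [8, 2, 10],   # cell 2
    [1, 12, 3],   # cell 3 (hypothetical)
    [2, 11, 1],   # cell 4 (hypothetical)
]


def center(rows):
    """Subtract each gene's mean read count so variation is measured around zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Gene-by-gene sample covariance matrix of the centered counts."""
    n, g = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(g)]
            for i in range(g)]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def leading_component(m, iterations=1000):
    """Power iteration: returns (unit vector, variance) for the direction of most variation."""
    v = [float(i + 1) for i in range(len(m))]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = matvec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variation left in this matrix
            break
        v = [x / norm for x in w]
    variance = sum(a * b for a, b in zip(v, matvec(m, v)))
    return v, variance


def principal_components(rows, k=2):
    centered = center(rows)
    cov = covariance(centered)
    total = sum(cov[i][i] for i in range(len(cov)))  # total variation across genes
    loadings, variances = [], []
    for _ in range(k):
        v, lam = leading_component(cov)
        loadings.append(v)
        variances.append(lam)
        # Remove the variation already explained so the next pass finds the next axis.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(len(v))]
               for i in range(len(v))]
    # Cell score on each PC = sum over genes of (centered read count * loading score).
    scores = [[sum(x * w for x, w in zip(row, v)) for v in loadings]
              for row in centered]
    # Scree-plot values: percent of total variation each PC accounts for.
    pct = [100.0 * lam / total for lam in variances]
    return loadings, scores, pct


loadings, scores, pct = principal_components(counts)

for name, pc_loadings in zip(("PC1", "PC2"), loadings):
    print(name, "loading scores for genes A, B, C:", [round(x, 3) for x in pc_loadings])
for i, (pc1, pc2) in enumerate(scores, start=1):
    print(f"cell {i}: PC1 = {pc1:.2f}, PC2 = {pc2:.2f}")
print("percent of variation per PC:", [round(p, 1) for p in pct])
```

Plotting the two scores for each cell gives the kind of PCA plot the video opens with, and the percentages are what a scree plot would show. The sign of a principal component is arbitrary, so only the relative positions of cells matter. For real RNA-seq data with thousands of genes, a library routine (for example `prcomp` in R, which the linked demo uses) would be the practical choice.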