title
Drawing and Interpreting Heatmaps

description
This StatQuest is about heatmaps. We see these all the time, but there are lots of arbitrary decisions that go into drawing them. Here, I show you what those decisions are and how they affect the results. For a complete index of all the StatQuest videos, check out: https://statquest.org/video-index/ #statquest #rnaseq #heatmap

detail
{'title': 'Drawing and Interpreting Heatmaps', 'heatmap': [{'end': 108.071, 'start': 27.725, 'weight': 0.763}, {'end': 156.766, 'start': 126.447, 'weight': 0.766}, {'end': 322.949, 'start': 269.933, 'weight': 0.722}], 'summary': 'Covers visualizing gene expression data using heat maps, hierarchical clustering, z-score scaling, gene distance metrics, and comparing clustering methods, providing insights into scaling, clustering, and distance calculations for gene analysis.', 'chapters': [{'end': 327.751, 'segs': [{'end': 60.007, 'src': 'embed', 'start': 27.725, 'weight': 0, 'content': [{'end': 31.088, 'text': 'The rows are genes and the columns are RNA-seq samples.', 'start': 27.725, 'duration': 3.363}, {'end': 38.876, 'text': 'The data displayed in this heat map has been modified in two ways so that we can gain some insights from it.', 'start': 33.07, 'duration': 5.806}, {'end': 43.92, 'text': 'The first way is that the relative abundancies have been scaled.', 'start': 39.917, 'duration': 4.003}, {'end': 47.504, 'text': 'In this case, this was done on a per gene basis.', 'start': 44.641, 'duration': 2.863}, {'end': 51.608, 'text': "Other heat maps you've seen out there scale all the genes at once.", 'start': 48.184, 'duration': 3.424}, {'end': 60.007, 'text': 'Anyways, this makes it easy to see that sample X has more or less of gene Y than sample Z.', 'start': 52.722, 'duration': 7.285}], 'summary': 'Gene expression data in heat map scaled by gene for easy comparison of samples.', 'duration': 32.282, 'max_score': 27.725, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/oMtDyOn2TCc/pics/oMtDyOn2TCc27725.jpg'}, {'end': 122.465, 'src': 'heatmap', 'start': 27.725, 'weight': 2, 'content': [{'end': 31.088, 'text': 'The rows are genes and the columns are RNA-seq samples.', 'start': 27.725, 'duration': 3.363}, {'end': 38.876, 'text': 'The data displayed in this heat map has been modified in two ways so that we can gain some insights from it.', 
'start': 33.07, 'duration': 5.806}, {'end': 43.92, 'text': 'The first way is that the relative abundancies have been scaled.', 'start': 39.917, 'duration': 4.003}, {'end': 47.504, 'text': 'In this case, this was done on a per gene basis.', 'start': 44.641, 'duration': 2.863}, {'end': 51.608, 'text': "Other heat maps you've seen out there scale all the genes at once.", 'start': 48.184, 'duration': 3.424}, {'end': 60.007, 'text': 'Anyways, this makes it easy to see that sample X has more or less of gene Y than sample Z.', 'start': 52.722, 'duration': 7.285}, {'end': 68.272, 'text': 'For example, the scaling makes it easy to see that sample 1 expresses this gene highlighted in the black box more than the others.', 'start': 60.007, 'duration': 8.265}, {'end': 76.088, 'text': "However, this specific gene-by-gene scaling means that we can't compare across genes.", 'start': 69.704, 'duration': 6.384}, {'end': 85.073, 'text': "The dark red bar in sample 1 for this gene doesn't mean that sample 1 transcribes it more than other genes, just other samples.", 'start': 76.748, 'duration': 8.325}, {'end': 93.418, 'text': 'The other modification that was done to this data is that the rows, that is to say the genes, have been grouped according to similarity.', 'start': 86.274, 'duration': 7.144}, {'end': 102.246, 'text': 'This grouping, or clustering, makes it easy to see genes that are transcribed most in the second sample and least in the fourth sample.', 'start': 94.298, 'duration': 7.948}, {'end': 108.071, 'text': 'These genes are transcribed most in the first sample and also least in the fourth sample.', 'start': 103.067, 'duration': 5.004}, {'end': 114.998, 'text': 'And lastly, these genes are transcribed most in the second sample and least in the third sample.', 'start': 109.273, 'duration': 5.725}, {'end': 117.501, 'text': "The clustering isn't by chance.", 'start': 116.039, 'duration': 1.462}, {'end': 122.465, 'text': 'but due to a computer program that tries to put 
similar things close together.', 'start': 118.103, 'duration': 4.362}], 'summary': 'Rna-seq data heat map shows scaled gene abundances and gene grouping by similarity.', 'duration': 94.74, 'max_score': 27.725, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/oMtDyOn2TCc/pics/oMtDyOn2TCc27725.jpg'}, {'end': 156.766, 'src': 'heatmap', 'start': 126.447, 'weight': 0.766, 'content': [{'end': 131.229, 'text': "Without clustering, the data would look like this, a mishmash that's harder to interpret.", 'start': 126.447, 'duration': 4.782}, {'end': 137.632, 'text': 'Without clustering or scaling, the data would look like this, which is completely uninterpretable.', 'start': 132.409, 'duration': 5.223}, {'end': 141.813, 'text': 'Notice that one gene is highly transcribed compared to the others.', 'start': 138.872, 'duration': 2.941}, {'end': 143.094, 'text': "It's an outlier.", 'start': 142.394, 'duration': 0.7}, {'end': 151.164, 'text': "Okay, now that we've seen one heatmap, let's look at another, slightly more complicated heatmap.", 'start': 145.842, 'duration': 5.322}, {'end': 156.766, 'text': 'Like the heatmap before, this heatmap has been scaled and clustered.', 'start': 152.744, 'duration': 4.022}], 'summary': 'Clustering and scaling data helps interpret patterns, as seen in heatmaps.', 'duration': 30.319, 'max_score': 126.447, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/oMtDyOn2TCc/pics/oMtDyOn2TCc126447.jpg'}, {'end': 247.985, 'src': 'embed', 'start': 218.82, 'weight': 3, 'content': [{'end': 221.862, 'text': 'What if we had used global scaling with the first heat map?', 'start': 218.82, 'duration': 3.042}, {'end': 229.755, 'text': "When we do this, we see that the outlier skews the scale so much that it's impossible to see the other genes.", 'start': 223.171, 'duration': 6.584}, {'end': 235.558, 'text': 'Also, notice that the clustering changes and the genes have a new order.', 'start': 231.396, 
'duration': 4.162}, {'end': 245.003, 'text': 'Scaling can affect two things, how brightly colored the genes are and whether you can compare between them and the clustering.', 'start': 237.399, 'duration': 7.604}, {'end': 247.985, 'text': 'And now back to the action.', 'start': 246.444, 'duration': 1.541}], 'summary': 'Global scaling in the first heat map skewed the scale, affecting gene visibility and clustering.', 'duration': 29.165, 'max_score': 218.82, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/oMtDyOn2TCc/pics/oMtDyOn2TCc218820.jpg'}, {'end': 327.751, 'src': 'heatmap', 'start': 269.933, 'weight': 1, 'content': [{'end': 274.335, 'text': "Let's talk about converting to z-scores or z-score scaling.", 'start': 269.933, 'duration': 4.402}, {'end': 279.177, 'text': 'In this example, we have RNA-seq read counts from six samples.', 'start': 275.315, 'duration': 3.862}, {'end': 283.476, 'text': 'The first step is to calculate the mean of the data.', 'start': 280.455, 'duration': 3.021}, {'end': 287.477, 'text': "In this case, that's 16.5.", 'start': 283.956, 'duration': 3.521}, {'end': 291.119, 'text': 'The second step is to subtract the mean from each value.', 'start': 287.477, 'duration': 3.642}, {'end': 296.301, 'text': 'By subtracting the mean from each value, we center the data around zero.', 'start': 292.039, 'duration': 4.262}, {'end': 305.824, 'text': 'Samples with relatively high transcription get positive values, and samples with relatively low transcription get negative values.', 'start': 297.501, 'duration': 8.323}, {'end': 310.205, 'text': 'The third step is to calculate the standard deviation.', 'start': 307.284, 'duration': 2.921}, {'end': 313.326, 'text': "In this case, it's 6.28.", 'start': 310.925, 'duration': 2.401}, {'end': 319.468, 'text': 'The fourth step is to divide each data point by the standard deviation.', 'start': 313.326, 'duration': 6.142}, {'end': 322.949, 'text': 'Notice that the scale on the axis has 
changed.', 'start': 320.188, 'duration': 2.761}, {'end': 327.751, 'text': 'The data used to be spread from minus 8 to plus 8.', 'start': 323.689, 'duration': 4.062}], 'summary': 'Converting rna-seq read counts to z-scores involves centering the data around zero and scaling it by standard deviation.', 'duration': 30.25, 'max_score': 269.933, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/oMtDyOn2TCc/pics/oMtDyOn2TCc269933.jpg'}], 'start': 0.329, 'title': 'Visualizing gene expression data', 'summary': 'Covers the use of heat maps for gene expression data, emphasizing scaling and clustering techniques, global scaling significance, heat map clustering analysis for transcription patterns, and z-score scaling for rna-seq read counts.', 'chapters': [{'end': 169.47, 'start': 0.329, 'title': 'Understanding heat maps for gene expression', 'summary': 'Discusses the use of heat maps in visualizing gene expression data, highlighting the importance of scaling and clustering in gaining insights from the data, and the impact of outliers. 
It also emphasizes the significance of global scaling in the absence of outliers.', 'duration': 169.141, 'highlights': ['The rows in the heat map represent genes and the columns represent RNA-seq samples, with the data modified through gene-by-gene scaling and grouping based on similarity, allowing for comparison of gene expression levels across samples.', 'Clustering of genes according to similarity enables the visualization of genes transcribed most and least in different samples, facilitated by a computer program that groups similar genes together.', 'The significance of scaling is demonstrated, as it helps in identifying outliers in the gene expression data, making it easier to interpret and gain insights from the heat map.', 'Global scaling is utilized in a more complicated heat map, as there are no outliers present in the dataset, allowing for a more effective comparison of gene expression levels across all genes and samples.']}, {'end': 245.003, 'start': 170.506, 'title': 'Heat map clustering analysis', 'summary': 'Discusses the use of clustering in a heat map to analyze both columns and rows, illustrating how it helps identify similar transcription patterns and the impact of scaling on gene visualization and clustering.', 'duration': 74.497, 'highlights': ['The heat map demonstrates clustering for both columns and rows, revealing similar transcription patterns in samples and genes (e.g., black box highlights).', 'Global scaling skews the scale due to outliers, making it impossible to visualize other genes and causing changes in clustering and gene order.', "Clustering is essential for identifying patterns, as without it, the data is a mishmash that's hard to interpret.", 'The impact of scaling on gene visualization and clustering is significant, affecting the brightness of genes and the ability to compare them.']}, {'end': 327.751, 'start': 246.444, 'title': 'Z-score scaling for data', 'summary': 'Discusses z-score scaling for RNA-seq read 
counts, involving calculating mean and standard deviation, and dividing each data point by the standard deviation, resulting in centered data around zero and scaled to z-scores.', 'duration': 81.307, 'highlights': ['By subtracting the mean from each value, we center the data around zero, with samples having relatively high or low transcription getting positive or negative values, respectively.', 'The fourth step involves dividing each data point by the standard deviation, resulting in a change in the scale on the axis.', 'The first step is to calculate the mean of the data, which is 16.5 in this case.', 'The third step is to calculate the standard deviation, which is 6.28 in this example.']}], 'duration': 327.422, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/oMtDyOn2TCc/pics/oMtDyOn2TCc329.jpg', 'highlights': ['The rows in the heat map represent genes and the columns represent RNA-seq samples, with the data modified through gene-by-gene scaling and grouping based on similarity, allowing for comparison of gene expression levels across samples.', 'By subtracting the mean from each value, we center the data around zero, with samples having relatively high or low transcription getting positive or negative values, respectively.', 'Clustering of genes according to similarity enables the visualization of genes transcribed most and least in different samples, facilitated by a computer program that groups similar genes together.', 'The significance of scaling is demonstrated, as it helps in identifying outliers in the gene expression data, making it easier to interpret and gain insights from the heat map.']}, {'end': 612.969, 'segs': [{'end': 379.369, 'src': 'embed', 'start': 350.69, 'weight': 0, 'content': [{'end': 356.653, 'text': "Regardless of the variation in the original data, dividing by the standard deviation ensures that it's tightly grouped.", 'start': 350.69, 'duration': 5.963}, {'end': 366.927, 'text': 'And you might ask yourself, 
why do we need to ensure the data is tightly grouped? We do this because we can only discern so many shades of colors.', 'start': 357.794, 'duration': 9.133}, {'end': 371.067, 'text': 'The wider the range, the more subtle the differences in the shades.', 'start': 367.567, 'duration': 3.5}, {'end': 379.369, 'text': "By tightly grouping the data, we use fewer shades and it's easier to see sample one has more transcription than sample two.", 'start': 372.268, 'duration': 7.101}], 'summary': 'Dividing by the standard deviation tightens data grouping, making it easier to discern differences and compare samples.', 'duration': 28.679, 'max_score': 350.69, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/oMtDyOn2TCc/pics/oMtDyOn2TCc350690.jpg'}, {'end': 481.782, 'src': 'embed', 'start': 450.043, 'weight': 1, 'content': [{'end': 454.945, 'text': 'For this example, we are just going to use clustering to reorder the rows, or the genes.', 'start': 450.043, 'duration': 4.902}, {'end': 458.126, 'text': "Conceptually, here's what we do.", 'start': 456.485, 'duration': 1.641}, {'end': 464.548, 'text': 'First, we figure out which gene is most similar to gene 1.', 'start': 459.006, 'duration': 5.542}, {'end': 468.009, 'text': 'So we look and we see that genes 1 and 2 are different.', 'start': 464.548, 'duration': 3.461}, {'end': 471.55, 'text': 'Genes 1 and 3 are similar, and genes 1 and 4 are also similar.', 'start': 469.529, 'duration': 2.021}, {'end': 481.782, 'text': 'However, gene 1 is most similar to gene 3.', 'start': 476.718, 'duration': 5.064}], 'summary': 'Using clustering to reorder genes based on similarity.', 'duration': 31.739, 'max_score': 450.043, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/oMtDyOn2TCc/pics/oMtDyOn2TCc450043.jpg'}, {'end': 559.984, 'src': 'embed', 'start': 534.172, 'weight': 4, 'content': [{'end': 540.755, 'text': "We go back to step one, but now we treat the new cluster, that's cluster 
number one, like it's a single gene.", 'start': 534.172, 'duration': 6.583}, {'end': 547.258, 'text': "That is to say, we compare cluster one to find out which gene it's most similar to.", 'start': 542.015, 'duration': 5.243}, {'end': 549.919, 'text': "In this case, it's most similar to gene number four.", 'start': 547.558, 'duration': 2.361}, {'end': 559.984, 'text': 'Gene 2 is most similar to gene number 4, and genes 2 and 4 are the most similar combination.', 'start': 552.116, 'duration': 7.868}], 'summary': 'Comparing clusters to find most similar genes, cluster 1 is most similar to gene 4.', 'duration': 25.812, 'max_score': 534.172, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/oMtDyOn2TCc/pics/oMtDyOn2TCc534172.jpg'}, {'end': 612.969, 'src': 'embed', 'start': 584.559, 'weight': 3, 'content': [{'end': 589.822, 'text': 'Cluster number two was formed second, and the genes within it are the second most similar.', 'start': 584.559, 'duration': 5.263}, {'end': 597.667, 'text': 'Cluster number three, which contains all of the genes and merges the two clusters, was formed last.', 'start': 591.543, 'duration': 6.124}, {'end': 604.711, 'text': "Now that we have a conceptual idea of what's going on, let's get down to the nitpicky details.", 'start': 599.508, 'duration': 5.203}, {'end': 612.969, 'text': 'In the first step, we have to figure out which gene is most similar to gene 1.', 'start': 606.024, 'duration': 6.945}], 'summary': 'Clusters 2 and 3 formed sequentially, with specific gene similarities analyzed.', 'duration': 28.41, 'max_score': 584.559, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/oMtDyOn2TCc/pics/oMtDyOn2TCc584559.jpg'}], 'start': 327.751, 'title': 'Hierarchical clustering and z-score scaling', 'summary': 'Covers z-score scaling, outlining its formula, need for tightly grouped data, and impact of outliers. 
it also introduces hierarchical clustering, explaining its process and application to reordering genes in a simple example. additionally, it details the hierarchical clustering process, illustrating the merging of genes into clusters based on similarity and the formation of clusters accompanied by a dendrogram.', 'chapters': [{'end': 507.081, 'start': 327.751, 'title': 'Z-score scaling and hierarchical clustering', 'summary': 'Discusses z-score scaling, emphasizing its formula, the need for tightly grouped data, and the impact of outliers, along with an introduction to hierarchical clustering, including its process and application to reordering genes in a simple example.', 'duration': 179.33, 'highlights': ['Z-score scaling ensures tightly grouped data by dividing by the standard deviation, making it easier to discern differences in shades, and its impact on outlier data. Z-score scaling ensures data is tightly grouped by dividing by the standard deviation, making it easier to discern differences in shades, and it impacts outlier data by compressing values near zero when there is an outlier.', 'The chapter introduces hierarchical clustering and explains its process using a simple example, where genes are reordered based on similarity. The chapter introduces hierarchical clustering and explains its process using a simple example, where genes are reordered based on similarity.', 'Explains the formula for z-score scaling and its purpose to ensure tightly grouped data, making it easier to discern differences in shades. 
The formula for z-score scaling is explained, emphasizing its purpose to ensure tightly grouped data, making it easier to discern differences in shades.']}, {'end': 612.969, 'start': 508.281, 'title': 'Hierarchical clustering process', 'summary': 'Explains the process of hierarchical clustering, where genes are merged into clusters based on similarity, with genes 1 and 3 forming cluster one, then genes 2 and 4 forming cluster two, and the formation of cluster three, accompanied by a dendrogram.', 'duration': 104.688, 'highlights': ['Genes 1 and 3 are merged into a cluster, forming cluster number one.', 'Genes 2 and 4 are merged into a cluster, forming cluster number two.', 'Cluster number three is formed last, merging the two clusters and indicating the order of formation with a dendrogram.', 'The process involves comparing genes to find the most similar combinations, as seen with gene 1 being most similar to gene 3.']}], 'duration': 285.218, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/oMtDyOn2TCc/pics/oMtDyOn2TCc327751.jpg', 'highlights': ['Z-score scaling ensures tightly grouped data by dividing by the standard deviation, making it easier to discern differences in shades, and its impact on outlier data.', 'The chapter introduces hierarchical clustering and explains its process using a simple example, where genes are reordered based on similarity.', 'Explains the formula for z-score scaling and its purpose to ensure tightly grouped data, making it easier to discern differences in shades.', 'Cluster number three is formed last, merging the two clusters and indicating the order of formation with a dendrogram.', 'The process involves comparing genes to find the most similar combinations, as seen with gene 1 being most similar to gene 3.']}, {'end': 801.415, 'segs': [{'end': 696.091, 'src': 'embed', 'start': 612.969, 'weight': 0, 'content': [{'end': 617.712, 'text': 'However, before we do that, we have to define what most similar 
means.', 'start': 612.969, 'duration': 4.743}, {'end': 623.816, 'text': 'Unfortunately, the method for determining similarity is arbitrarily chosen.', 'start': 619.033, 'duration': 4.783}, {'end': 626.998, 'text': 'However, there are some common practices.', 'start': 624.577, 'duration': 2.421}, {'end': 632.302, 'text': 'The first and most common one is the Euclidean distance between genes.', 'start': 627.899, 'duration': 4.403}, {'end': 634.984, 'text': "Here's the formula for that.", 'start': 633.823, 'duration': 1.161}, {'end': 646.618, 'text': "It's the square root of the square of the difference in sample 1 between gene 1 and gene 2,", 'start': 635.99, 'duration': 10.628}, {'end': 654.444, 'text': 'plus the square of the difference in sample 2 between gene 1 and gene 2.', 'start': 646.618, 'duration': 7.826}, {'end': 661.772, 'text': 'Lastly, we add the square of the difference in sample 3 between gene 1 and gene 2.', 'start': 654.444, 'duration': 7.328}, {'end': 666.715, 'text': 'And if you have more samples, the equation just keeps going on and on and on off the page.', 'start': 661.772, 'duration': 4.943}, {'end': 667.875, 'text': 'But you get the idea.', 'start': 666.955, 'duration': 0.92}, {'end': 674.519, 'text': "To see the Euclidean distance in action, let's assume there are only two samples and two genes.", 'start': 669.216, 'duration': 5.303}, {'end': 680.262, 'text': "So here we've restricted our data set to two samples and two genes.", 'start': 676.58, 'duration': 3.682}, {'end': 686.726, 'text': 'When there are only two samples and two genes, the formula boils down to the Pythagorean theorem.', 'start': 681.423, 'duration': 5.303}, {'end': 692.97, 'text': 'Just to show everything in action, here are some values that we can use to compute the distance.', 'start': 688.208, 'duration': 4.762}, {'end': 696.091, 'text': "Now we've plugged those numbers into our formula.", 'start': 693.87, 'duration': 2.221}], 'summary': 'Defining similarity using Euclidean distance 
formula for genetic comparison.', 'duration': 83.122, 'max_score': 612.969, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/oMtDyOn2TCc/pics/oMtDyOn2TCc612969.jpg'}, {'end': 801.415, 'src': 'embed', 'start': 749.134, 'weight': 3, 'content': [{'end': 750.895, 'text': 'And yes, it makes a difference.', 'start': 749.134, 'duration': 1.761}, {'end': 756.66, 'text': "Bummer. Here's that first heat map I showed.", 'start': 752.957, 'duration': 3.703}, {'end': 759.162, 'text': 'I drew it using the Euclidean distance.', 'start': 756.68, 'duration': 2.482}, {'end': 763.625, 'text': "Here's what it looks like when we use the Manhattan distance instead.", 'start': 760.623, 'duration': 3.002}, {'end': 770.783, 'text': 'We see that the large clusters remain intact even though they might be in different orders than they were before.', 'start': 764.639, 'duration': 6.144}, {'end': 776.866, 'text': 'However, with some of the smaller clustering in the finer resolution, we see more differences.', 'start': 771.523, 'duration': 5.343}, {'end': 783.17, 'text': 'However, the choice to use the Euclidean distance or the Manhattan distance is arbitrary.', 'start': 778.007, 'duration': 5.163}, {'end': 787.833, 'text': "There's no real biological reason why one metric might work better than another.", 'start': 783.55, 'duration': 4.283}, {'end': 794.217, 'text': "Bummer. Here's some more nitpicky details about how hierarchical clustering works.", 'start': 789.054, 'duration': 5.163}, {'end': 801.415, 'text': 'Do you remember how we merged genes 1 and 3 into cluster number 1 and then compared that cluster to the other genes?', 'start': 795.034, 'duration': 6.381}], 'summary': 'Comparison of Euclidean and Manhattan distance in heat maps for clustering analysis, with arbitrary metric choice.', 'duration': 52.281, 'max_score': 749.134, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/oMtDyOn2TCc/pics/oMtDyOn2TCc749134.jpg'}], 'start': 
612.969, 'title': 'Gene distance metrics', 'summary': 'Discusses defining similarity, euclidean distance calculation, and the differences between euclidean and manhattan distance methods for gene clustering, demonstrating with an example of two samples and two genes.', 'chapters': [{'end': 721.474, 'start': 612.969, 'title': 'Defining similarity and euclidean distance calculation', 'summary': 'Explains the concept of defining similarity and the calculation of euclidean distance, using the pythagorean theorem as an example with two samples and two genes.', 'duration': 108.505, 'highlights': ['The Euclidean distance calculation is explained using the example of two samples and two genes, demonstrating the use of the Pythagorean theorem.', 'The method for determining similarity is arbitrarily chosen, with the common practice being the Euclidean distance between genes.', 'The formula for the Euclidean distance calculation involves the square root of the square of the differences in samples between genes, providing a clear understanding of the calculation process.']}, {'end': 801.415, 'start': 722.194, 'title': 'Gene distance metrics: euclidean vs manhattan', 'summary': 'Discusses the differences between euclidean and manhattan distance methods for gene clustering, highlighting the impact on clustering results and emphasizing the arbitrary nature of the choice between the two methods.', 'duration': 79.221, 'highlights': ['The choice between Euclidean and Manhattan distance methods for gene clustering is arbitrary, with no real biological reason favoring one over the other.', 'Using the Manhattan distance method results in large clusters remaining intact even though they might be in different orders than before, while smaller clusters at finer resolutions exhibit more differences.', 'The Euclidean method was used to draw the first heat map, and the subsequent use of the Manhattan distance method resulted in observable differences in clustering.', 'The chapter introduces 
various distance methods for gene clustering, such as the Euclidean, Manhattan, and Canberra methods, emphasizing that the Euclidean distance is just one of many options.']}], 'duration': 188.446, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/oMtDyOn2TCc/pics/oMtDyOn2TCc612969.jpg', 'highlights': ['The formula for the Euclidean distance calculation involves the square root of the square of the differences in samples between genes, providing a clear understanding of the calculation process.', 'The Euclidean distance calculation is explained using the example of two samples and two genes, demonstrating the use of the Pythagorean theorem.', 'The method for determining similarity is arbitrarily chosen, with the common practice being the Euclidean distance between genes.', 'The choice between Euclidean and Manhattan distance methods for gene clustering is arbitrary, with no real biological reason favoring one over the other.', 'The chapter introduces various distance methods for gene clustering, such as the Euclidean, Manhattan, and Canberra methods, emphasizing that the Euclidean distance is just one of many options.', 'Using the Manhattan distance method results in large clusters remaining intact even though they might be in different orders than before, while smaller clusters at finer resolutions exhibit more differences.']}, {'end': 1007.063, 'segs': [{'end': 854.225, 'src': 'embed', 'start': 823.39, 'weight': 0, 'content': [{'end': 829.633, 'text': 'For the sake of visualizing how different methods work, imagine our data was spread out on an xy plane.', 'start': 823.39, 'duration': 6.243}, {'end': 836.636, 'text': 'Now, imagine that we have already formed these two clusters, the green dots and the orange dots.', 'start': 830.713, 'duration': 5.923}, {'end': 842.218, 'text': 'And we just want to figure out which cluster this last point belongs to.', 'start': 837.996, 'duration': 4.222}, {'end': 850.604, 'text': 'We can compare that 
point to: one, the average; two, the closest point.', 'start': 843.718, 'duration': 6.886}, {'end': 854.225, 'text': 'Three, the furthest point.', 'start': 852.324, 'duration': 1.901}], 'summary': 'Explaining clustering methods using visualized data on an xy plane.', 'duration': 30.835, 'max_score': 823.39, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/oMtDyOn2TCc/pics/oMtDyOn2TCc823390.jpg'}, {'end': 949.166, 'src': 'embed', 'start': 872.19, 'weight': 1, 'content': [{'end': 879.133, 'text': "So we're comparing a gene to a cluster and we're finding the gene that's most different from that one we're comparing it to.", 'start': 872.19, 'duration': 6.943}, {'end': 883.774, 'text': 'This is the default setting in R.', 'start': 879.153, 'duration': 4.621}, {'end': 887.996, 'text': "Here's what the clustering looks like when we compare points to the cluster average.", 'start': 883.774, 'duration': 4.222}, {'end': 896.341, 'text': "As you can see, the major blocks of clustered genes and samples have been retained even though they've been reordered.", 'start': 889.457, 'duration': 6.884}, {'end': 899.363, 'text': 'However, there are differences in the details.', 'start': 897.142, 'duration': 2.221}, {'end': 905.067, 'text': "Here's an example where we compare points to the closest point in the cluster.", 'start': 900.624, 'duration': 4.443}, {'end': 911.731, 'text': 'Again, the major features of the clustering have been retained but are now reordered again.', 'start': 905.908, 'duration': 5.823}, {'end': 915.174, 'text': 'And again, there are differences in the details.', 'start': 912.612, 'duration': 2.562}, {'end': 923.678, 'text': 'In summary, to make a heat map, you: first, scale the data, either per gene or globally.', 'start': 916.354, 'duration': 7.324}, {'end': 930.784, 'text': 'Second, you cluster the data, either by gene or sample or both gene and sample.', 'start': 924.999, 'duration': 5.785}, {'end': 938.917, 'text': "In this 
StatQuest, we focused on hierarchical clustering, and we've seen that within that, we've got to make some decisions.", 'start': 932.011, 'duration': 6.906}, {'end': 945.403, 'text': "The first is what's the distance metric going to be? Will it be Euclidean, Manhattan, or something else?", 'start': 939.458, 'duration': 5.945}, {'end': 949.166, 'text': 'And we also have to decide what the clustering method is going to be.', 'start': 946.023, 'duration': 3.143}], 'summary': 'Comparing genes and clusters, retaining major blocks. Making decisions on distance metric and clustering method in hierarchical clustering.', 'duration': 76.976, 'max_score': 872.19, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/oMtDyOn2TCc/pics/oMtDyOn2TCc872190.jpg'}, {'end': 1007.063, 'src': 'embed', 'start': 978.925, 'weight': 3, 'content': [{'end': 982.988, 'text': 'When we do this, we have to decide how many clusters there should be in advance.', 'start': 978.925, 'duration': 4.063}, {'end': 986.89, 'text': 'Then the computer figures out which samples go in which cluster.', 'start': 983.508, 'duration': 3.382}, {'end': 993.554, 'text': "by trying to minimize some metric of dispersion, i.e., it's trying to reduce the amount of variance within each cluster.", 'start': 986.89, 'duration': 6.664}, {'end': 996.416, 'text': 'This deserves a separate StatQuest.', 'start': 994.595, 'duration': 1.821}, {'end': 998.017, 'text': "We're not going to talk about it today.", 'start': 996.676, 'duration': 1.341}, {'end': 1000.459, 'text': 'So that brings us to the end.', 'start': 999.278, 'duration': 1.181}, {'end': 1007.063, 'text': 'Thanks for listening to StatQuest, and look forward to more exciting quests in the land of statistics.', 'start': 1001.119, 'duration': 5.944}], 'summary': 'Decide clusters in advance to minimize dispersion. 
Stay tuned for more StatQuests.', 'duration': 28.138, 'max_score': 978.925, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/oMtDyOn2TCc/pics/oMtDyOn2TCc978925.jpg'}], 'start': 802.396, 'title': 'Comparing and using clustering methods', 'summary': 'Covers various methods of comparing clusters, including comparing points to the average, closest point, and furthest point. It also discusses comparing genes to clusters, highlighting differences in default settings and the need to scale the data before creating a heat map. Additionally, it delves into the process of clustering data using hierarchical or KMeans clustering methods, discussing decisions such as distance metrics and clustering methods.', 'chapters': [{'end': 871.67, 'start': 802.396, 'title': 'Comparing cluster methods', 'summary': 'The chapter discusses different methods to compare clusters, including comparing points to the average, closest point, and furthest point, which can impact clustering and visualization.', 'duration': 69.274, 'highlights': ['Different methods to compare clusters include comparing points to the average, closest point, and furthest point, impacting clustering and visualization.', 'Visualizing the data on an xy plane and forming clusters of green and orange dots to compare different methods of clustering.', 'Swapping out ways of comparing clusters in the second heat map, such as comparing points to the furthest point in the cluster.']}, {'end': 923.678, 'start': 872.19, 'title': 'Comparing gene clusters for heat maps', 'summary': 'The chapter discusses comparing genes to clusters, highlighting differences in default settings, comparing points to cluster average, and emphasizing the need to scale the data before creating a heat map.', 'duration': 51.488, 'highlights': ["The major blocks of clustered genes and samples have been retained even though they've been reordered.", 'Comparing points to the closest point in the cluster retains major features of the clustering but reorders 
them.', 'To make a heat map, it is essential to first scale the data, either per gene or globally.']}, {'end': 1007.063, 'start': 924.999, 'title': 'Clustering methods in statistics', 'summary': 'The chapter covers the process of clustering data using hierarchical or KMeans clustering methods, discussing decisions such as distance metrics and clustering methods, with the advice to use default values if using R, and the mention of the KMeans method requiring the prior determination of the number of clusters.', 'duration': 82.064, 'highlights': ['The process of clustering data involves making decisions on distance metrics (e.g., Euclidean, Manhattan) and clustering methods (e.g., centroid, average, furthest point, closest point), with most choices having default values.', 'KMeans clustering method requires deciding the number of clusters in advance, with the computer determining which samples go in each cluster by minimizing dispersion, aiming to reduce the amount of variance within each cluster.', 'The chapter concludes by mentioning the possibility of using a method called KMeans if hierarchical clustering is not preferred and hints at more exciting quests in the land of statistics in the future.']}], 'duration': 204.667, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/oMtDyOn2TCc/pics/oMtDyOn2TCc802396.jpg', 'highlights': ['Different methods to compare clusters include comparing points to the average, closest point, and furthest point, impacting clustering and visualization.', 'The process of clustering data involves making decisions on distance metrics (e.g., Euclidean, Manhattan) and clustering methods (e.g., centroid, average, furthest point, closest point), with most choices having default values.', 'To make a heat map, it is essential to first scale the data, either per gene or globally.', 'KMeans clustering method requires deciding the number of clusters in advance, with the computer determining which samples go in each cluster by minimizing 
dispersion, aiming to reduce the amount of variance within each cluster.', "The major blocks of clustered genes and samples have been retained even though they've been reordered."]}], 'highlights': ['The rows in the heat map represent genes and the columns represent RNA-seq samples, with the data modified through gene-by-gene scaling and grouping based on similarity, allowing for comparison of gene expression levels across samples.', 'By subtracting the mean from each value, we center the data around zero, with samples having relatively high or low transcription getting positive or negative values, respectively.', 'Clustering of genes according to similarity enables the visualization of genes transcribed most and least in different samples, facilitated by a computer program that groups similar genes together.', 'The significance of scaling is demonstrated, as it helps in identifying outliers in the gene expression data, making it easier to interpret and gain insights from the heat map.', 'Z-score scaling divides by the standard deviation so that the data are tightly grouped, making it easier to discern differences in shades; its effect on outlier data is also discussed.', 'The chapter introduces hierarchical clustering and explains its process using a simple example, where genes are reordered based on similarity.', 'The formula for z-score scaling is presented, along with its purpose: to ensure tightly grouped data, making it easier to discern differences in shades.', 'Cluster number three is formed last, merging the two clusters and indicating the order of formation with a dendrogram.', 'The process involves comparing genes to find the most similar combinations, as seen with gene 1 being most similar to gene 4.', 'The formula for the Euclidean distance calculation involves the square root of the sum of the squared differences between genes across samples, providing a clear understanding of the calculation process.', 'The Euclidean distance calculation is explained using the example of two samples and two genes, 
demonstrating the use of the Pythagorean theorem.', 'The method for determining similarity is arbitrarily chosen, with the common practice being the Euclidean distance between genes.', 'The choice between Euclidean and Manhattan distance methods for gene clustering is arbitrary, with no real biological reason favoring one over the other.', 'The chapter introduces various distance methods for gene clustering, such as the Euclidean, Manhattan, and Canberra methods, emphasizing that the Euclidean distance is just one of many options.', 'Using the Manhattan distance method results in large clusters remaining intact even though they might be in different orders than before, while smaller clusters at finer resolutions exhibit more differences.', 'Different methods to compare clusters include comparing points to the average, closest point, and furthest point, impacting clustering and visualization.', 'The process of clustering data involves making decisions on distance metrics (e.g., Euclidean, Manhattan) and clustering methods (e.g., centroid, average, furthest point, closest point), with most choices having default values.', 'To make a heat map, it is essential to first scale the data, either per gene or globally.', 'KMeans clustering method requires deciding the number of clusters in advance, with the computer determining which samples go in each cluster by minimizing dispersion, aiming to reduce the amount of variance within each cluster.', "The major blocks of clustered genes and samples have been retained even though they've been reordered."]}
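
The steps summarized in this record (scale per gene, pick a distance metric, pick a way of comparing a point to a cluster) can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the code used in the video: the function names (`zscore_rows`, `hierarchical_merges`, and so on) are hypothetical, the sample standard deviation is assumed for the z-score, "average" is assumed to mean the average of the pairwise distances, and ties are broken by first occurrence.

```python
import math
from statistics import mean, stdev


def zscore_rows(matrix):
    """Scale each gene (row): subtract its mean, divide by its sample SD (assumed)."""
    scaled = []
    for row in matrix:
        mu, sd = mean(row), stdev(row)
        scaled.append([(x - mu) / sd for x in row])
    return scaled


def euclidean(a, b):
    """Square root of the sum of squared differences across samples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute differences across samples."""
    return sum(abs(x - y) for x, y in zip(a, b))


# How to compare clusters: average of pairwise distances (assumed meaning of
# "the average"), closest point ("single"), or furthest point ("complete").
LINKAGE = {"average": mean, "single": min, "complete": max}


def cluster_distance(c1, c2, rows, dist, method):
    """Distance between two clusters, given as tuples of row indices."""
    return LINKAGE[method]([dist(rows[i], rows[j]) for i in c1 for j in c2])


def hierarchical_merges(rows, dist=euclidean, method="complete"):
    """Repeatedly merge the two closest clusters; return the merges in order."""
    clusters = [(i,) for i in range(len(rows))]
    merges = []
    while len(clusters) > 1:
        pairs = [(a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]]
        a, b = min(pairs, key=lambda p: cluster_distance(p[0], p[1], rows, dist, method))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b))
    return merges


if __name__ == "__main__":
    # Hypothetical expression values: 4 genes (rows) x 3 samples (columns).
    genes = zscore_rows([[1, 2, 3], [2, 4, 6], [3, 2, 1], [10, 0, 5]])
    for method in LINKAGE:
        print(method, hierarchical_merges(genes, dist=manhattan, method=method))
```

The order of the returned merges corresponds to what a dendrogram encodes. Swapping `dist` or `method` is a quick way to see which groupings stay stable and which details change, which is the comparison the summary describes.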