title

ROC and AUC in R

description

This tutorial walks you through, step by step, how to draw ROC curves and calculate AUC in R. We start with a basic ROC graph, learn how to extract thresholds for decision making, calculate the AUC and partial AUC, and then layer multiple ROC curves on the same graph.
You can get a copy of the code from the StatQuest GitHub, here:
https://github.com/StatQuest/roc_and_auc_demo/blob/master/roc_and_auc_demo.R
NOTE: This StatQuest builds on the example in the original ROC and AUC StatQuest:
https://youtu.be/xugjARegisk
Also, if you're curious, here are some links to StatQuests about...
...Logistic Regression
https://youtu.be/yIYKR4sgzI8
...and Random Forests...
https://youtu.be/J4Wdy0Wc_xQ
For a complete index of all the StatQuest videos, check out:
https://statquest.org/video-index/
#statquest #ROC #AUC
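
The workflow summarized in this description (simulate weights, label samples as obese or not obese, fit a logistic regression, then draw the ROC curve and report the AUC) can be sketched roughly as follows. This is a minimal sketch, not the linked demo script itself: the seed, the explicit `levels` and `direction` arguments, and the colors and axis labels are assumptions chosen for illustration.

```r
# Sketch only. Assumes the pROC package is installed: install.packages("pROC")
library(pROC)

set.seed(420)  # assumed seed, used only to make the sketch reproducible

num.samples <- 100

# Simulate 100 weights and sort them from low to high
weight <- sort(rnorm(n = num.samples, mean = 172, sd = 29))

# Label heavier samples as obese (1) more often than lighter ones (0)
# by comparing each sample's scaled rank with a uniform random number
obese <- ifelse(runif(n = num.samples) < rank(weight) / num.samples,
                yes = 1, no = 0)

# Fit a logistic regression curve to the data
glm.fit <- glm(obese ~ weight, family = binomial)

# Use a square plotting region so the ROC graph is not stretched
par(pty = "s")

# Known classes plus estimated probabilities give the ROC curve (in percent)
roc.info <- roc(obese, glm.fit$fitted.values,
                levels = c(0, 1), direction = "<",  # 0 = control, 1 = case
                plot = TRUE, legacy.axes = TRUE, percent = TRUE,
                xlab = "False Positive Percentage",
                ylab = "True Positive Percentage",
                col = "#377eb8", lwd = 4, print.auc = TRUE)

# Area under the curve, reported in percent because percent = TRUE
auc(roc.info)

# Restore the default (maximal) plotting region
par(pty = "m")
```

Setting `levels` and `direction` explicitly is a design choice here: it keeps pROC from guessing which class is the case and which way the predictor points.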

detail

{'title': 'ROC and AUC in R', 'heatmap': [{'end': 377.123, 'start': 338.623, 'weight': 0.908}, {'end': 814.415, 'start': 793.826, 'weight': 0.812}, {'end': 881.998, 'start': 852.786, 'weight': 0.732}], 'summary': 'Learn to draw roc graphs, calculate auc, and analyze sample data in r, including generating a data set with 100 samples, fitting a logistic regression curve, and comparing roc curves for logistic regression and random forest, with practical example code for implementation.', 'chapters': [{'end': 107.022, 'segs': [{'end': 30.21, 'src': 'embed', 'start': 0.563, 'weight': 0, 'content': [{'end': 2.323, 'text': "Let's make a graph.", 'start': 0.563, 'duration': 1.76}, {'end': 5.184, 'text': "Let's make it look cool.", 'start': 3.684, 'duration': 1.5}, {'end': 11.946, 'text': 'Thank goodness StatQuest is here because StatQuest rules.', 'start': 6.804, 'duration': 5.142}, {'end': 17.887, 'text': "Hello, I'm Josh Starmer and welcome to StatQuest.", 'start': 13.026, 'duration': 4.861}, {'end': 25.069, 'text': "Today we're going to talk about drawing ROC graphs and calculating the AUC in R.", 'start': 18.407, 'duration': 6.662}, {'end': 30.21, 'text': "If you're interested in doing this at home, there's a link to the example code in the description below.", 'start': 25.069, 'duration': 5.141}], 'summary': 'Josh Starmer discusses drawing roc graphs and calculating auc in r.', 'duration': 29.647, 'max_score': 0.563, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qcvAqAH60Yw/pics/qcvAqAH60Yw563.jpg'}, {'end': 93.678, 'src': 'embed', 'start': 62.677, 'weight': 1, 'content': [{'end': 69.128, 'text': 'The first thing we need to do is load in pROC, the library that will draw ROC graphs for us.', 'start': 62.677, 'duration': 6.451}, {'end': 77.161, 'text': "If you don't have pROC installed, just use install.packages('pROC') to install it.", 'start': 70.35, 'duration': 6.811}, {'end': 82.474, 'text': "We're also going to use the 
random forest package as part of the example.", 'start': 78.513, 'duration': 3.961}, {'end': 85.975, 'text': 'For the purposes of this StatQuest,', 'start': 83.695, 'duration': 2.28}, {'end': 93.678, 'text': 'you just need to know that a random forest is a way to classify samples, and we can change the threshold that we use to make those decisions.', 'start': 85.975, 'duration': 7.703}], 'summary': 'Loading the pROC library for drawing roc graphs and using the random forest package for classifying samples.', 'duration': 31.001, 'max_score': 62.677, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qcvAqAH60Yw/pics/qcvAqAH60Yw62677.jpg'}], 'start': 0.563, 'title': 'Drawing roc graphs and calculating auc in r', 'summary': 'Covers drawing roc graphs, calculating auc in r, using pROC and the random forest package, and provides example code for practical implementation.', 'chapters': [{'end': 107.022, 'start': 0.563, 'title': 'Drawing roc graphs and calculating auc in r', 'summary': 'Discusses drawing roc graphs, calculating the auc in r, and using pROC and the random forest package to achieve this. 
it also mentions the availability of example code for practical implementation.', 'duration': 106.459, 'highlights': ['Drawing ROC graphs and calculating the AUC in R The chapter emphasizes the focus on drawing ROC graphs and calculating the AUC in R as the main topic of discussion.', 'Availability of example code in the description for practical implementation It mentions the availability of example code in the description for practical implementation, providing additional resources for the audience.', 'Use of PROC for drawing ROC graphs and random forest package for classification It discusses the use of PROC for drawing ROC graphs and the random forest package for classifying samples, providing insight into the technical aspects of the process.']}], 'duration': 106.459, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qcvAqAH60Yw/pics/qcvAqAH60Yw563.jpg', 'highlights': ['Drawing ROC graphs and calculating the AUC in R is the main focus.', 'Availability of example code for practical implementation.', 'Use of PROC for drawing ROC graphs and random forest package for classification.']}, {'end': 545.423, 'segs': [{'end': 169.587, 'src': 'embed', 'start': 137.734, 'weight': 2, 'content': [{'end': 142.886, 'text': "So let's start by setting num.samples to 100.", 'start': 137.734, 'duration': 5.152}, {'end': 146.996, 'text': "Now we'll create 100 measurements and store them in a variable called weight.", 'start': 142.886, 'duration': 4.11}, {'end': 154.737, 'text': 'We do this by using the rnorm function to generate 100 random values from a normal distribution,', 'start': 148.513, 'duration': 6.224}, {'end': 160.941, 'text': 'with the mean set to 172 and the standard deviation set to 29..', 'start': 154.737, 'duration': 6.204}, {'end': 169.587, 'text': "Psst, just in case you're interested, the internet told me that the average man weighs 172 pounds with a standard deviation of 29.", 'start': 160.941, 'duration': 8.646}], 'summary': 
'Generated 100 measurements of weight from a normal distribution with mean 172 and standard deviation 29.', 'duration': 31.853, 'max_score': 137.734, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qcvAqAH60Yw/pics/qcvAqAH60Yw137734.jpg'}, {'end': 266.455, 'src': 'embed', 'start': 234.744, 'weight': 1, 'content': [{'end': 244.27, 'text': 'The "if smaller, then obese; otherwise, not obese" logic is performed by the ifelse function and the results are stored in a variable called obese.', 'start': 234.744, 'duration': 9.526}, {'end': 250.614, 'text': 'To see what that fancy line of code just did, we can print out the contents of obese.', 'start': 245.731, 'duration': 4.883}, {'end': 257.053, 'text': 'The zeros stand for not obese, and the ones stand for obese.', 'start': 252.472, 'duration': 4.581}, {'end': 262.014, 'text': 'The lighter samples are mostly zeros, not obese.', 'start': 258.632, 'duration': 3.382}, {'end': 266.455, 'text': 'And the heavier samples are mostly ones, obese.', 'start': 263.194, 'duration': 3.261}], 'summary': 'Using the ifelse function to determine obesity status with 0s for not obese and 1s for obese.', 'duration': 31.711, 'max_score': 234.744, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qcvAqAH60Yw/pics/qcvAqAH60Yw234744.jpg'}, {'end': 377.123, 'src': 'heatmap', 'start': 328.573, 'weight': 0, 'content': [{'end': 336.941, 'text': 'In other words, glm.fit$fitted.values contains estimated probabilities that each sample is obese.', 'start': 328.573, 'duration': 8.368}, {'end': 344.749, 'text': 'We will use the known classifications and the estimated probabilities to draw an ROC curve.', 'start': 338.623, 'duration': 6.126}, {'end': 352.477, 'text': 'We use the roc function from the pROC library to draw the ROC graph.', 'start': 346.791, 'duration': 5.686}, {'end': 363.793, 'text': 'We pass in the known classifications, obese or not obese, for each sample, and the estimated probabilities that 
each sample is obese.', 'start': 354.025, 'duration': 9.768}, {'end': 371.679, 'text': 'And we tell the ROC function to draw the graph, not just calculate all of the numbers used to draw the graph.', 'start': 365.154, 'duration': 6.525}, {'end': 377.123, 'text': 'When you use the ROC function, it prints out a bunch of stuff.', 'start': 373.5, 'duration': 3.623}], 'summary': 'Using estimated probabilities from glm.fit to draw an roc curve for obesity classification.', 'duration': 35.22, 'max_score': 328.573, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qcvAqAH60Yw/pics/qcvAqAH60Yw328573.jpg'}], 'start': 108.809, 'title': 'Generating and analyzing sample data', 'summary': "Details the process of generating a data set with 100 samples, analyzing the weight distribution, classifying individuals as obese or not obese, fitting a logistic regression curve, and drawing an roc curve to assess the model's performance with an emphasis on the area under the curve.", 'chapters': [{'end': 545.423, 'start': 108.809, 'title': 'Generating and analyzing sample data', 'summary': "Details the process of generating a data set with 100 samples, analyzing the weight distribution, classifying individuals as obese or not obese, fitting a logistic regression curve, and drawing an roc curve to assess the model's performance with an emphasis on the area under the curve.", 'duration': 436.614, 'highlights': ['Generating a data set with 100 samples and analyzing weight distribution The chapter explains the process of creating a data set with 100 samples, using the rnorm function to generate 100 random values from a normal distribution with a mean of 172 and a standard deviation of 29, and sorting the numbers from low to high.', 'Classifying individuals as obese or not obese The chapter describes the method of classifying individuals as obese by scaling ranks, comparing them to random numbers, and using the if else function to store results in a variable called 
obese.', "Fitting a logistic regression curve and drawing an ROC curve The chapter details the process of fitting a logistic regression curve to the data using the GLM function, obtaining estimated probabilities for obesity, and drawing an ROC curve to assess the model's performance, with a focus on the area under the curve."]}], 'duration': 436.614, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qcvAqAH60Yw/pics/qcvAqAH60Yw108809.jpg', 'highlights': ['Fitting a logistic regression curve and drawing an ROC curve to assess model performance.', 'Classifying individuals as obese or not obese using the if else function.', 'Generating a data set with 100 samples and analyzing weight distribution using rnorm function.']}, {'end': 908.381, 'segs': [{'end': 579.573, 'src': 'embed', 'start': 547.017, 'weight': 0, 'content': [{'end': 552.881, 'text': "Now, imagine we're interested in the range of thresholds that resulted in this part of the ROC curve.", 'start': 547.017, 'duration': 5.864}, {'end': 568.13, 'text': 'We can access those thresholds by saving the calculations that the ROC function does in a variable and then make a data frame that contains all of the true positive percentages by multiplying the sensitivities by 100,', 'start': 554.562, 'duration': 13.568}, {'end': 579.573, 'text': 'and the false positive percentages by multiplying 1 minus specificities by 100, And, last but not least, the thresholds.', 'start': 568.13, 'duration': 11.443}], 'summary': 'Analyzing roc curve for threshold range with true and false positive percentages.', 'duration': 32.556, 'max_score': 547.017, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qcvAqAH60Yw/pics/qcvAqAH60Yw547017.jpg'}, {'end': 713.589, 'src': 'embed', 'start': 684.116, 'weight': 1, 'content': [{'end': 691.459, 'text': 'If we want to print the AUC directly on the graph, then we set the print.auc parameter to true.', 'start': 684.116, 'duration': 7.343}, 
{'end': 697.222, 'text': 'You can also draw and calculate a partial area under the curve.', 'start': 693.52, 'duration': 3.702}, {'end': 705.425, 'text': 'These are useful when you want to focus on the part of the ROC curve that only allows for a small number of false positives.', 'start': 698.542, 'duration': 6.883}, {'end': 713.589, 'text': 'To print and draw the partial AUC, we start by setting the print.auc parameter to true.', 'start': 707.126, 'duration': 6.463}], 'summary': 'Setting print.auc parameter to true prints auc directly on graph, partial auc useful for focusing on part of roc curve.', 'duration': 29.473, 'max_score': 684.116, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qcvAqAH60Yw/pics/qcvAqAH60Yw684116.jpg'}, {'end': 814.415, 'src': 'heatmap', 'start': 778.376, 'weight': 2, 'content': [{'end': 783.36, 'text': 'However, I added two digits to the end to make the color semi-transparent.', 'start': 778.376, 'duration': 4.984}, {'end': 791.968, 'text': "Bam! 
For those of you keeping track, we're up to three exclamation points on the bam.", 'start': 784.541, 'duration': 7.427}, {'end': 799.509, 'text': "Lastly, let's talk about how to overlap two ROC curves so that they are easy to compare.", 'start': 793.826, 'duration': 5.683}, {'end': 805.211, 'text': "We'll start by making a random forest classifier with the same dataset.", 'start': 801.149, 'duration': 4.062}, {'end': 810.113, 'text': 'Now we draw the original ROC curve for the logistic regression.', 'start': 806.592, 'duration': 3.521}, {'end': 814.415, 'text': 'Then we add the ROC curve for the random forest.', 'start': 811.314, 'duration': 3.101}], 'summary': 'Added transparency, 3 exclamations, overlapped roc curves for logistic regression and random forest.', 'duration': 31.737, 'max_score': 778.376, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qcvAqAH60Yw/pics/qcvAqAH60Yw778376.jpg'}, {'end': 881.998, 'src': 'heatmap', 'start': 852.786, 'weight': 0.732, 'content': [{'end': 862.09, 'text': 'And we set print.auc.y to 40 so that the AUC for the random forest is printed below the AUC for the logistic regression.', 'start': 852.786, 'duration': 9.304}, {'end': 867.192, 'text': 'Lastly, we draw legend in the bottom right hand corner.', 'start': 863.831, 'duration': 3.361}, {'end': 881.998, 'text': "Bam! 
Once we're all done drawing ROC graphs, we need to reset the pty graphical parameter back to its default value, m, which is short for maximum.", 'start': 868.871, 'duration': 13.127}], 'summary': 'Set print.auc.y to 40, draw legend in bottom right corner, reset pty graphical parameter to default value m.', 'duration': 29.212, 'max_score': 852.786, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qcvAqAH60Yw/pics/qcvAqAH60Yw852786.jpg'}], 'start': 547.017, 'title': 'Analyzing and comparing roc curves', 'summary': 'Covers analyzing roc curve thresholds, including accessing and analyzing thresholds, and comparing roc curves with logistic regression and random forest, demonstrating how to print auc on the graph and draw partial auc, with specific details on setting parameters and visual elements.', 'chapters': [{'end': 682.368, 'start': 547.017, 'title': 'Analyzing roc curve thresholds', 'summary': 'Discusses accessing and analyzing thresholds in the roc curve by saving calculations, creating a data frame with true positive and false positive percentages, and isolating thresholds for a specific true positive rate range.', 'duration': 135.351, 'highlights': ['By saving the calculations that the ROC function does in a variable, we can create a data frame containing true positive percentages, false positive percentages, and thresholds, allowing us to analyze the thresholds that resulted in different parts of the ROC curve.', 'When the threshold is set to negative infinity, the true positive percentage is 100 and the false positive percentage is also 100, indicating that all obese samples were correctly classified and all non-obese samples were incorrectly classified, corresponding to the upper right-hand corner of the ROC curve.', 'Isolating the true positive rate between 60 and 80 allows for the selection of a threshold with an optimal balance of true positives and false positives, providing a method for choosing a threshold within a specific 
range.', 'Customizing what the ROC function draws allows for tailoring the visualization to specific analysis needs, providing flexibility in analyzing and interpreting the ROC curve.']}, {'end': 777.735, 'start': 684.116, 'title': 'Printing auc on graph and drawing partial auc', 'summary': 'Demonstrates how to print the auc on the graph and draw a partial area under the curve, including setting parameters such as print.auc, partial.auc, auc.polygon, and auc.polygon.col.', 'duration': 93.619, 'highlights': ['Setting print.auc parameter to true allows direct printing of AUC on the graph, enabling visualization of the area under the curve.', 'Drawing partial AUC is useful for focusing on specific parts of the ROC curve with a small number of false positives, achieved by setting partial.auc to desired specificity values.', 'Specifying the location and range of specificity values for printing and drawing the partial AUC ensures accurate representation on the graph, with 45 being a chosen x-axis location and a range from 100% to 90% specificity.', "Drawing the partial area under the curve involves setting auc.polygon to true and specifying the polygon's color using RGB numbers, providing visual distinction from the curve line."]}, {'end': 908.381, 'start': 778.376, 'title': 'Comparing roc curves with logistic regression and random forest', 'summary': 'Discusses how to overlap two roc curves for logistic regression and random forest classifiers: fitting a random forest to the same dataset, setting its color to green instead of blue, adding its roc curve to an existing graph, and drawing the legend in the bottom right-hand corner.', 'duration': 130.005, 'highlights': ['The chapter explains how to overlap two ROC curves for logistic regression and random forest classifiers by using the same dataset and setting the color to green instead of blue.', 'The process includes adding the ROC curve for the random forest, passing in the number of trees in the forest that 
voted correctly and setting add to true so that this ROC curve is added to an existing graph.', 'The chapter also discusses drawing the legend in the bottom right-hand corner and resetting the PTY graphical parameter back to its default value, M, at the end of the process.']}], 'duration': 361.364, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/qcvAqAH60Yw/pics/qcvAqAH60Yw547017.jpg', 'highlights': ['By saving the calculations in a variable, create a data frame with true positive percentages, false positive percentages, and thresholds for analyzing the ROC curve.', 'Setting print.auc parameter to true allows direct printing of AUC on the graph, enabling visualization of the area under the curve.', 'The chapter explains how to overlap two ROC curves for logistic regression and random forest classifiers by using the same dataset and setting the color to green instead of blue.']}], 'highlights': ['Drawing ROC graphs and calculating the AUC in R is the main focus.', 'Fitting a logistic regression curve and drawing an ROC curve to assess model performance.', 'By saving the calculations in a variable, create a data frame with true positive percentages, false positive percentages, and thresholds for analyzing the ROC curve.', 'Availability of example code for practical implementation.', 'Classifying individuals as obese or not obese using the if else function.', 'Generating a data set with 100 samples and analyzing weight distribution using rnorm function.', 'Setting print.auc parameter to true allows direct printing of AUC on the graph, enabling visualization of the area under the curve.', 'The chapter explains how to overlap two ROC curves for logistic regression and random forest classifiers by using the same dataset and setting the color to green instead of blue.', 'Use of PROC for drawing ROC graphs and random forest package for classification.']}
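
The later chapters of the transcript (extracting thresholds, drawing a partial AUC, and overlaying a random forest ROC curve on the logistic regression curve) can be sketched in one short script. Treat it as an illustration under assumptions rather than the video's exact code: the seed, the variable names `roc.df`, `mid.range`, `partial.roc`, `glm.roc`, and `rf.roc`, and the choice to score the random forest by its out-of-bag vote fraction for class "1" are all made up for this sketch.

```r
# Sketch only. Assumes the pROC and randomForest packages are installed.
library(pROC)
library(randomForest)

set.seed(420)  # assumed seed, used only to make the sketch reproducible
num.samples <- 100
weight <- sort(rnorm(num.samples, mean = 172, sd = 29))
obese <- ifelse(runif(num.samples) < rank(weight) / num.samples, 1, 0)
glm.fit <- glm(obese ~ weight, family = binomial)

## 1. Thresholds: save the ROC calculations, then tabulate them
roc.info <- roc(obese, glm.fit$fitted.values,
                levels = c(0, 1), direction = "<")
roc.df <- data.frame(
  tpp = roc.info$sensitivities * 100,        # true positive percentage
  fpp = (1 - roc.info$specificities) * 100,  # false positive percentage
  thresholds = roc.info$thresholds)
head(roc.df)  # first row: threshold -Inf, where tpp and fpp are both 100

# Thresholds whose true positive percentage lies between 60 and 80
mid.range <- roc.df[roc.df$tpp > 60 & roc.df$tpp < 80, ]

## 2. Partial AUC between 100% and 90% specificity, shaded on the graph
par(pty = "s")
partial.roc <- roc(obese, glm.fit$fitted.values,
                   levels = c(0, 1), direction = "<",
                   plot = TRUE, legacy.axes = TRUE, percent = TRUE,
                   xlab = "False Positive Percentage",
                   ylab = "True Positive Percentage",
                   col = "#377eb8", lwd = 4,
                   print.auc = TRUE, print.auc.x = 45,
                   partial.auc = c(100, 90),
                   auc.polygon = TRUE,
                   auc.polygon.col = "#377eb822")  # last 2 digits: transparency

## 3. Overlay logistic regression and random forest ROC curves
rf.model <- randomForest(factor(obese) ~ weight)
glm.roc <- roc(obese, glm.fit$fitted.values,
               levels = c(0, 1), direction = "<", percent = TRUE)
# Assumption: score each sample by its out-of-bag vote fraction for class "1"
rf.roc <- roc(obese, rf.model$votes[, "1"],
              levels = c(0, 1), direction = "<", percent = TRUE)

plot(glm.roc, legacy.axes = TRUE,
     xlab = "False Positive Percentage", ylab = "True Positive Percentage",
     col = "#377eb8", lwd = 4, print.auc = TRUE)
plot(rf.roc, add = TRUE, col = "#4daf4a", lwd = 4,
     print.auc = TRUE, print.auc.y = 40)  # print this AUC below the first one
legend("bottomright", legend = c("Logistic Regression", "Random Forest"),
       col = c("#377eb8", "#4daf4a"), lwd = 4)

par(pty = "m")  # back to the default plotting region
```

Building each ROC object first and plotting it afterwards keeps the AUC values available for inspection, for example with `auc(glm.roc)` and `auc(rf.roc)`, without redrawing the graph.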