title

Logistic Regression Details Pt 3: R-squared and p-value

description

This video follows from where we left off in Part 2 in this series on the details of Logistic Regression. Last time we saw how to fit a squiggly line to the data. This time we'll learn how to evaluate if that squiggly line is worth anything. In short, we'll calculate the R-squared value and its associated p-value.
NOTE: This StatQuest assumes that you are already familiar with Part 1 in this series, Logistic Regression Details Pt1: Coefficients:
https://youtu.be/vN5cNN2-HWE
For a complete index of all the StatQuest videos, check out:
https://statquest.org/video-index/
If you'd like to support StatQuest, please consider...
Buying The StatQuest Illustrated Guide to Machine Learning!!!
PDF - https://statquest.gumroad.com/l/wvtmc
Paperback - https://www.amazon.com/dp/B09ZCKR4H6
Kindle eBook - https://www.amazon.com/dp/B09ZG79HXC
Patreon: https://www.patreon.com/statquest
...or...
YouTube Membership: https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw/join
...a cool StatQuest t-shirt or sweatshirt:
https://shop.spreadshirt.com/statquest-with-josh-starmer/
...buying one or two of my songs (or go large and get a whole album!)
https://joshuastarmer.bandcamp.com/
...or just donating to StatQuest!
https://www.paypal.me/statquest
Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
https://twitter.com/joshuastarmer
Correction:
13:58 The formula should be 2[(LL(saturated) - LL(overall)) - (LL(saturated) - LL(fit))]. I got the terms flipped.
#statquest #logistic
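The R-squared and p-value covered in the video can be sketched in a few lines of Python. This is a minimal illustration using the log likelihoods quoted in the video (LL(fit) = -3.77, LL(overall) = -6.18) for McFadden's pseudo R-squared and the likelihood-ratio chi-squared test; the variable names are my own, not from the video.

```python
import math

# Log likelihoods quoted in the video for the mouse-weight example
ll_fit = -3.77      # log likelihood of the data given the fitted squiggle
ll_overall = -6.18  # log likelihood given the overall probability of obesity

# McFadden's pseudo R-squared: (LL(overall) - LL(fit)) / LL(overall)
r_squared = (ll_overall - ll_fit) / ll_overall
print(round(r_squared, 2))  # 0.39, matching the video

# Likelihood-ratio test statistic: 2 * (LL(fit) - LL(overall)),
# chi-squared distributed with degrees of freedom equal to the
# difference in parameter counts between the two models (2 - 1 = 1)
chi_squared = 2 * (ll_fit - ll_overall)

# Survival function of a chi-squared distribution with 1 degree of freedom
p_value = math.erfc(math.sqrt(chi_squared / 2))
print(round(p_value, 2))  # 0.03
```

The same p-value can be had from scipy.stats.chi2.sf(chi_squared, df=1); math.erfc is used here only to keep the sketch dependency-free.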

detail

{'title': 'Logistic Regression Details Pt 3: R-squared and p-value', 'heatmap': [{'end': 412.259, 'start': 284.942, 'weight': 0.734}, {'value': 0.7729226342988315, 'end_time': 412.259, 'start_time': 385.835}, {'end': 761.668, 'start': 735.337, 'weight': 0.705}], 'summary': "Covers logistic regression, emphasizing r squared and p-values, and the process of fitting a line with maximum likelihood. it also discusses over 10 methods for calculating r squared in logistic regression, with a focus on mcfadden's pseudo r squared. in logistic regression, an r-squared value of 0.39 is obtained, calculated using log likelihoods and demonstrated with examples, including p-value calculation.", 'chapters': [{'end': 63.001, 'segs': [{'end': 63.001, 'src': 'embed', 'start': 36.121, 'weight': 0, 'content': [{'end': 45.15, 'text': 'And we converted the y-axis from probability to the log odds of obesity and then fit a line to that data using maximum likelihood.', 'start': 36.121, 'duration': 9.029}, {'end': 54.217, 'text': 'Well, technically, we maximize the log likelihood, but either way you do it, you get the same best fitting line.', 'start': 46.533, 'duration': 7.684}, {'end': 58.178, 'text': 'However, we ended with a bit of a cliffhanger.', 'start': 55.537, 'duration': 2.641}, {'end': 63.001, 'text': 'We know that the line is best fit, but how do we know if it is useful?', 'start': 58.899, 'duration': 4.102}], 'summary': 'Converted y-axis to log odds of obesity, fit line using maximum likelihood. 
need to determine if line is useful.', 'duration': 26.88, 'max_score': 36.121, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/xxFYro8QuXA/pics/xxFYro8QuXA36121.jpg'}], 'start': 9.436, 'title': 'Logistic regression', 'summary': 'Discusses logistic regression, emphasizing r squared and p-values, and the process of fitting a line with maximum likelihood.', 'chapters': [{'end': 63.001, 'start': 9.436, 'title': 'Logistic regression: r squared and p-values', 'summary': 'Discusses logistic regression, focusing on r squared and p-values, highlighting the process of fitting a line with maximum likelihood and addressing the usefulness of the best fitting line.', 'duration': 53.565, 'highlights': ['The process of fitting a line with maximum likelihood for weight measurements of obese and non-obese mice is explained, emphasizing the conversion of the y-axis from probability to the log odds of obesity and the determination of the best fitting line.', 'The discussion addresses the question of how to determine the usefulness of the best fitting line, creating a sense of anticipation for the viewer.']}], 'duration': 53.565, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/xxFYro8QuXA/pics/xxFYro8QuXA9436.jpg', 'highlights': ['The process of fitting a line with maximum likelihood for weight measurements of obese and non-obese mice is explained, emphasizing the conversion of the y-axis from probability to the log odds of obesity and the determination of the best fitting line.', 'The discussion addresses the question of how to determine the usefulness of the best fitting line, creating a sense of anticipation for the viewer.']}, {'end': 283.642, 'segs': [{'end': 219.572, 'src': 'embed', 'start': 89.239, 'weight': 0, 'content': [{'end': 96.685, 'text': 'Even though pretty much everyone agrees on how to calculate R squared and the associated p-value for linear models,', 'start': 89.239, 'duration': 7.446}, {'end': 101.069, 
'text': 'there is no consensus on how to calculate R squared for logistic regression.', 'start': 96.685, 'duration': 4.384}, {'end': 103.791, 'text': 'There are more than 10 different ways to do it.', 'start': 101.649, 'duration': 2.142}, {'end': 113.527, 'text': 'So, before you settle on a way to calculate R squared for logistic regression, look and see what other people are already doing in your field.', 'start': 105.419, 'duration': 8.108}, {'end': 115.87, 'text': 'That will give you a good starting point.', 'start': 114.088, 'duration': 1.782}, {'end': 118.833, 'text': 'For this stat quest.', 'start': 117.191, 'duration': 1.642}, {'end': 122.957, 'text': 'rather than describe every single R squared for logistic regression,', 'start': 118.833, 'duration': 4.124}, {'end': 128.543, 'text': "I'm focusing on one that is commonly used and is easily calculated from the output that R gives you.", 'start': 122.957, 'duration': 5.586}, {'end': 135.445, 'text': "Just so you know, this R squared is called McFadden's pseudo R squared.", 'start': 130.222, 'duration': 5.223}, {'end': 143.909, 'text': 'Another bonus is that this method is very similar to how R squared is calculated for regular old linear models.', 'start': 136.866, 'duration': 7.043}, {'end': 153.754, 'text': "So let's do a super quick review of R squared for regular old linear models, using size and weight measurements as an example,", 'start': 145.47, 'duration': 8.284}, {'end': 156.196, 'text': 'so that the concepts are fresh in your mind.', 'start': 153.754, 'duration': 2.442}, {'end': 158.457, 'text': 'Wow, that was a long sentence.', 'start': 156.796, 'duration': 1.661}, {'end': 167.416, 'text': 'In linear regression and other linear models, r squared and the related p-value are calculated using the residuals.', 'start': 159.867, 'duration': 7.549}, {'end': 172.683, 'text': 'In brief, we square the residuals and then add them up.', 'start': 169.058, 'duration': 3.625}, {'end': 179.211, 'text': 'I call 
this ssfit for sum of squares of the residuals around the best fitting line.', 'start': 173.344, 'duration': 5.867}, {'end': 187.782, 'text': 'and we compare that to the sum of squared residuals around the worst fitting line, the mean of the y-axis values.', 'start': 180.68, 'duration': 7.102}, {'end': 190.243, 'text': 'This is called SS mean.', 'start': 188.482, 'duration': 1.761}, {'end': 200.046, 'text': 'R squared compares a measure of a good fit, SS fit, to a measure of a bad fit, SS mean.', 'start': 191.983, 'duration': 8.063}, {'end': 208.148, 'text': 'R squared is the percentage of variation around the mean that goes away when you fit a line to the data.', 'start': 201.406, 'duration': 6.742}, {'end': 215.13, 'text': "Also, because I want to refer to this later, I'm going to point out another thing you already know.", 'start': 209.528, 'duration': 5.602}, {'end': 219.572, 'text': 'R squared goes from 0 to 1.', 'start': 215.95, 'duration': 3.622}], 'summary': "There is no consensus on calculating r squared for logistic regression; mcfadden's pseudo r squared is commonly used and easily calculated.", 'duration': 130.333, 'max_score': 89.239, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/xxFYro8QuXA/pics/xxFYro8QuXA89239.jpg'}, {'end': 283.642, 'src': 'embed', 'start': 252.925, 'weight': 7, 'content': [{'end': 261.447, 'text': 'And that means when we plug in the values for ssfit and ssmean, we get ssmean minus zero in the numerator.', 'start': 252.925, 'duration': 8.522}, {'end': 264.936, 'text': 'and then r squared equals one.', 'start': 262.935, 'duration': 2.001}, {'end': 270.537, 'text': 'Duh I told you this was something you already knew.', 'start': 266.076, 'duration': 4.461}, {'end': 275.739, 'text': "Now let's talk about r squared in terms of logistic regression.", 'start': 272.058, 'duration': 3.681}, {'end': 283.642, 'text': 'Like linear regression, we need to find a measure of a good fit to compare to a measure of a 
bad fit.', 'start': 277.36, 'duration': 6.282}], 'summary': 'R-squared equals 1 when values for ssfit and ssmean are plugged in, indicating a perfect fit in linear regression.', 'duration': 30.717, 'max_score': 252.925, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/xxFYro8QuXA/pics/xxFYro8QuXA252925.jpg'}], 'start': 64.081, 'title': 'Calculating and understanding r squared in regression', 'summary': "Discusses the challenges of calculating r squared in logistic regression, highlighting over 10 different methods and emphasizing mcfadden's pseudo r squared as a commonly used and easily calculated approach. it also explains r squared as a measure of fit in linear regression and its extension to logistic regression, with r squared ranging from 0 to 1.", 'chapters': [{'end': 135.445, 'start': 64.081, 'title': 'Calculating r squared for logistic regression', 'summary': "Discusses the challenge of calculating r squared and a p-value for the relationship between weight and obesity, highlighting the lack of consensus in the field, with over 10 different ways to calculate r squared for logistic regression, leading to a focus on mcfadden's pseudo r squared as a commonly used and easily calculated method.", 'duration': 71.364, 'highlights': ['There are more than 10 different ways to calculate R squared for logistic regression, leading to a lack of consensus in the field.', "McFadden's pseudo R squared is commonly used and easily calculated from the output that R gives.", 'The chapter emphasizes the importance of looking at what other people are already doing in the field as a starting point for calculating R squared for logistic regression.']}, {'end': 283.642, 'start': 136.866, 'title': 'R squared in linear and logistic regression', 'summary': 'Explains r squared as a measure of fit in linear regression, comparing the sum of squares of residuals around the best fitting line to the sum of squares around the mean, with r squared ranging from 0 to 
1, and its extension to logistic regression.', 'duration': 146.776, 'highlights': ['R squared compares a measure of a good fit, SS fit, to a measure of a bad fit, SS mean. R squared is the percentage of variation around the mean that goes away when you fit a line to the data. It explains how R squared compares measures of good and bad fit, representing the percentage of variation around the mean that diminishes when fitting a line to the data.', 'R squared goes from 0 to 1. It mentions the range of R squared, which is from 0 to 1, indicating the degree of fit in the model.', 'In linear regression and other linear models, r squared and the related p-value are calculated using the residuals. It states that in linear regression and other linear models, R squared and the related p-value are computed utilizing the residuals.', 'Another bonus is that this method is very similar to how R squared is calculated for regular old linear models. It highlights the similarity between the method described and the calculation of R squared for regular linear models.', "Now let's talk about r squared in terms of logistic regression. 
It introduces the transition to discussing R squared in the context of logistic regression, indicating an upcoming change in the topic."]}], 'duration': 219.561, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/xxFYro8QuXA/pics/xxFYro8QuXA64081.jpg', 'highlights': ['There are more than 10 different ways to calculate R squared for logistic regression, leading to a lack of consensus in the field.', "McFadden's pseudo R squared is commonly used and easily calculated from the output that R gives.", 'The chapter emphasizes the importance of looking at what other people are already doing in the field as a starting point for calculating R squared for logistic regression.', 'R squared compares a measure of a good fit, SS fit, to a measure of a bad fit, SS mean.', 'R squared goes from 0 to 1, indicating the degree of fit in the model.', 'In linear regression and other linear models, R squared and the related p-value are calculated using the residuals.', 'Another bonus is that this method is very similar to how R squared is calculated for regular old linear models.', 'It introduces the transition to discussing R squared in the context of logistic regression, indicating an upcoming change in the topic.']}, {'end': 919.936, 'segs': [{'end': 412.259, 'src': 'heatmap', 'start': 284.942, 'weight': 0.734, 'content': [{'end': 290.724, 'text': "Unfortunately, the residuals for logistic regression are all infinite, so we can't use them.", 'start': 284.942, 'duration': 5.782}, {'end': 299.512, 'text': 'but we can project the data onto the best-fitting line, and then we translate the log odds back to probabilities.', 'start': 292.11, 'duration': 7.402}, {'end': 306.414, 'text': 'And lastly, calculate the log likelihood of the data given the best-fitting squiggle.', 'start': 301.052, 'duration': 5.362}, {'end': 313.036, 'text': 'In this case, that gives us negative 3.77.', 'start': 307.734, 'duration': 5.302}, {'end': 320.758, 'text': 'We can call this llfit 
for the log likelihood of the fitted line and use it as a substitute for ssfit.', 'start': 313.036, 'duration': 7.722}, {'end': 327.556, 'text': 'Now we need a measure of a poorly fitted line that is analogous to SS mean.', 'start': 322.614, 'duration': 4.942}, {'end': 335.298, 'text': 'We do this by calculating the log odds of obesity without taking weight into account.', 'start': 329.556, 'duration': 5.742}, {'end': 345.842, 'text': 'The overall log odds of obesity is just the total number of obese mice divided by the total number of mice that are not obese.', 'start': 336.699, 'duration': 9.143}, {'end': 350.784, 'text': 'Then we just take the log of the whole thing and do the math.', 'start': 347.503, 'duration': 3.281}, {'end': 357.966, 'text': 'In this case, we get a horizontal line at 0.22.', 'start': 352.219, 'duration': 5.747}, {'end': 360.328, 'text': 'Then project the data onto this line.', 'start': 357.966, 'duration': 2.362}, {'end': 365.274, 'text': 'And then we translate the log odds back to probabilities.', 'start': 361.61, 'duration': 3.664}, {'end': 373.952, 'text': 'That gives us a horizontal line at p equals 0.56.', 'start': 366.836, 'duration': 7.116}, {'end': 385.835, 'text': 'the overall log odds, 0.22, translates to the overall probability of being obese, 0.56.', 'start': 373.952, 'duration': 11.883}, {'end': 392.456, 'text': 'In other words, we can arrive at the same solution by calculating the overall probability of obesity.', 'start': 385.835, 'duration': 6.621}, {'end': 399.118, 'text': 'Hooray! They are the same! 
So we have two different ways to calculate the exact same number.', 'start': 393.597, 'duration': 5.521}, {'end': 406.595, 'text': 'Now calculate the log likelihood of the data given the overall probability of obesity.', 'start': 401.03, 'duration': 5.565}, {'end': 412.259, 'text': 'This gives us negative 6.18.', 'start': 407.856, 'duration': 4.403}], 'summary': 'Residuals for logistic regression are infinite, log likelihood is -3.77, overall probability of obesity is 0.56, log likelihood with probability is -6.18.', 'duration': 127.317, 'max_score': 284.942, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/xxFYro8QuXA/pics/xxFYro8QuXA284942.jpg'}, {'end': 444.992, 'src': 'embed', 'start': 420.486, 'weight': 1, 'content': [{'end': 429.734, 'text': 'So we have LL overall probability, a measure of a bad fit, and LL fit, hopefully a measure of a good fit.', 'start': 420.486, 'duration': 9.248}, {'end': 439.81, 'text': 'And it makes intuitive sense that we could combine them just like we combined SSMEAN and SSFIT to calculate R squared.', 'start': 431.246, 'duration': 8.564}, {'end': 444.992, 'text': 'Plugging in the numbers gives us an R squared value equals 0.39.', 'start': 441.111, 'duration': 3.881}], 'summary': 'Combining ll overall probability and ll fit gives r squared value of 0.39.', 'duration': 24.506, 'max_score': 420.486, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/xxFYro8QuXA/pics/xxFYro8QuXA420486.jpg'}, {'end': 610.922, 'src': 'embed', 'start': 575.949, 'weight': 3, 'content': [{'end': 582.092, 'text': 'The maximum likelihood best fitting line for this data has an intercept of negative 63.72 and a slope of 22.42.', 'start': 575.949, 'duration': 6.143}, {'end': 584.854, 'text': 'And this translates to a squiggly line on the probability scale.', 'start': 582.092, 'duration': 2.762}, {'end': 599.555, 'text': 'LL fit is the log likelihood of the data projected onto the best fitting line.', 'start': 
593.991, 'duration': 5.564}, {'end': 603.517, 'text': 'In this case, LL fit equals zero.', 'start': 600.635, 'duration': 2.882}, {'end': 607.3, 'text': "That's because the log of one equals zero.", 'start': 604.858, 'duration': 2.442}, {'end': 610.922, 'text': 'So LL fit is just the sum of a bunch of zeros.', 'start': 607.78, 'duration': 3.142}], 'summary': 'The best fitting line has intercept -63.72 and slope 22.42, resulting in ll fit of zero.', 'duration': 34.973, 'max_score': 575.949, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/xxFYro8QuXA/pics/xxFYro8QuXA575949.jpg'}, {'end': 761.668, 'src': 'heatmap', 'start': 677.481, 'weight': 0, 'content': [{'end': 680.062, 'text': 'Okay, back to calculating R-squared.', 'start': 677.481, 'duration': 2.581}, {'end': 684.785, 'text': "Let's plug in the values for LL overall probability and LL fit.", 'start': 680.583, 'duration': 4.202}, {'end': 688.227, 'text': 'This gives us an R-squared equals one.', 'start': 685.445, 'duration': 2.782}, {'end': 690.968, 'text': 'Double bam!.', 'start': 689.947, 'duration': 1.021}, {'end': 696.131, 'text': 'So we see that, at least on an intuitive level,', 'start': 692.609, 'duration': 3.522}, {'end': 702.394, 'text': 'the R-squared calculated with log likelihoods behaves like the R-squared calculated from sums of squares.', 'start': 696.131, 'duration': 6.263}, {'end': 709.897, 'text': 'The log-likelihood R-squared values go from 0 for poor models to 1 for good models.', 'start': 703.814, 'duration': 6.083}, {'end': 712.459, 'text': 'Now we need a p-value.', 'start': 711.178, 'duration': 1.281}, {'end': 718.222, 'text': 'The good news is that calculating the p-value is pretty straightforward.', 'start': 714.2, 'duration': 4.022}, {'end': 733.677, 'text': 'Two times the difference between LL fit and LL overall probability equals a chi-squared value with degrees freedom equal to the difference in the number of parameters in the two models.', 'start': 
719.583, 'duration': 14.094}, {'end': 741.881, 'text': 'LL fit has two parameters since it needs estimates for a y-axis intercept and a slope.', 'start': 735.337, 'duration': 6.544}, {'end': 750.445, 'text': 'LL overall probability has one parameter since it only needs an estimate for a y-axis intercept.', 'start': 743.882, 'duration': 6.563}, {'end': 755.228, 'text': 'So, in this case, the degrees of freedom equals one.', 'start': 751.766, 'duration': 3.462}, {'end': 761.668, 'text': "Here's a graph of a chi-square distribution with one degree of freedom.", 'start': 757.244, 'duration': 4.424}], 'summary': 'R-squared equals 1 with log likelihoods, representing good models. p-value calculation straightforward.', 'duration': 64.4, 'max_score': 677.481, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/xxFYro8QuXA/pics/xxFYro8QuXA677481.jpg'}], 'start': 284.942, 'title': 'Logistic regression and r-squared', 'summary': 'Covers logistic regression, log likelihood, and r-squared, with an r-squared value of 0.39 obtained in logistic regression. it also explains the calculation of r-squared using log likelihoods and demonstrates its behavior with examples, including the p-value calculation.', 'chapters': [{'end': 444.992, 'start': 284.942, 'title': 'Logistic regression and log likelihood', 'summary': 'Explores the concept of logistic regression, projecting data onto the best-fitting line, and calculating log likelihood to derive a substitute for ss fit and a measure of a bad fit. 
the r squared value obtained is 0.39.', 'duration': 160.05, 'highlights': ['The R squared value obtained from combining LL overall probability and LL fit is 0.39, indicating the goodness of fit.', 'The overall log odds of obesity is calculated as the total number of obese mice divided by the total number of mice that are not obese, resulting in a horizontal line at p equals 0.56.', 'The log likelihood of the data given the best-fitting squiggle is calculated as negative 3.77, serving as a measure of a good fit.']}, {'end': 919.936, 'start': 444.992, 'title': 'R-squared and log likelihood in logistic regression', 'summary': 'Explains the calculation of r-squared using log likelihoods in logistic regression, demonstrating how it behaves like r-squared from sums of squares, with examples showing r-squared values going from 0 for poor models to 1 for good models, and includes a detailed explanation of calculating the p-value.', 'duration': 474.944, 'highlights': ['The R-squared calculated with log likelihoods behaves like the R-squared calculated from sums of squares, with examples showing R-squared values going from 0 for poor models to 1 for good models.', 'The p-value is calculated as two times the difference between LL fit and LL overall probability, resulting in a chi-squared value with degrees freedom equal to the difference in the number of parameters in the two models, with an example yielding a p-value of 0.03.', "The log likelihood of the saturated model equals zero, so it can be omitted in the simple equations presented in this StatQuest, but it's usually included in other situations involving generalized linear models.", 'The log likelihood for logistic regression will always be between zero and negative infinity, with good fits resulting in log likelihoods close to zero and bad fits resulting in larger negative log likelihoods.']}], 'duration': 634.994, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/xxFYro8QuXA/pics/xxFYro8QuXA284942.jpg', 'highlights': ['The R-squared calculated with log likelihoods behaves like the R-squared calculated from sums of squares, with examples showing R-squared values going from 0 for poor models to 1 for good models.', 'The R squared value obtained from combining LL overall probability and LL fit is 0.39, indicating the goodness of fit.', 'The p-value is calculated as two times the difference between LL fit and LL overall probability, resulting in a chi-squared value with degrees freedom equal to the difference in the number of parameters in the two models, with an example yielding a p-value of 0.03.', 'The log likelihood of the data given the best-fitting squiggle is calculated as negative 3.77, serving as a measure of a good fit.']}], 'highlights': ['The R squared value obtained from combining LL overall probability and LL fit is 0.39, indicating the goodness of fit.', 'The process of fitting a line with maximum likelihood for weight measurements of obese and non-obese mice is explained, emphasizing the conversion of the y-axis from probability to the log odds of obesity and the determination of the best fitting line.', 'The p-value is calculated as two times the difference between LL fit and LL overall probability, resulting in a chi-squared value with degrees freedom equal to the difference in the number of parameters in the two models, with an example yielding a p-value of 0.03.', 'The R-squared calculated with log likelihoods behaves like the R-squared calculated from sums of squares, with examples showing R-squared values going from 0 for poor models to 1 for good models.', "McFadden's pseudo R squared is commonly used and easily calculated from the output that R gives.", 'In linear regression and other linear models, R squared and the related p-value are calculated using the residuals.', 'There are more than 10 different ways to calculate R squared for 
logistic regression, leading to a lack of consensus in the field.', 'The discussion addresses the question of how to determine the usefulness of the best fitting line, creating a sense of anticipation for the viewer.', 'R squared compares a measure of a good fit, SS fit, to a measure of a bad fit, SS mean.', 'The chapter emphasizes the importance of looking at what other people are already doing in the field as a starting point for calculating R squared for logistic regression.']}
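The correction note in the description gives the fuller likelihood-ratio formula, which includes the saturated model. As the transcript explains, LL(saturated) = 0 for plain logistic regression (it is a sum of log(1) terms), so the fuller formula reduces to the simpler 2 * (LL(fit) - LL(overall)). A quick sketch verifying that with the video's numbers (variable names are mine):

```python
# Log likelihoods from the video's mouse-weight example
ll_saturated = 0.0   # saturated model fits each point perfectly: sum of log(1) = 0
ll_fit = -3.77
ll_overall = -6.18

# Full likelihood-ratio statistic from the correction note:
# 2 * [(LL(saturated) - LL(overall)) - (LL(saturated) - LL(fit))]
full = 2 * ((ll_saturated - ll_overall) - (ll_saturated - ll_fit))

# Because LL(saturated) = 0 here, it cancels, leaving 2 * (LL(fit) - LL(overall))
simple = 2 * (ll_fit - ll_overall)

print(round(full, 2), round(simple, 2))  # 4.82 4.82
```

In other generalized linear models the saturated term does not vanish, which is why the fuller formula is the one to remember.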