title

Logistic Regression in R, Clearly Explained!!!!

description

This video describes how to do Logistic Regression in R, step-by-step. We start by importing a dataset and cleaning it up, then we perform logistic regression on a very simple model, followed by a fancy model. Lastly, we draw a graph of the predicted probabilities that come from the logistic regression.
The code that I use in this video can be found on the StatQuest GitHub:
https://github.com/StatQuest/logistic_regression_demo/blob/master/logistic_regression_demo.R
For more details on what's going on, check out the following StatQuests:
For a general overview of Logistic Regression:
https://youtu.be/yIYKR4sgzI8
The odds and log(odds), clearly explained:
https://youtu.be/ARfXDSkQf1Y
The odds ratio and log(odds ratio), clearly explained:
https://youtu.be/8nm0G-1uJzA
Logistic Regression, Details Part 1, Coefficients:
https://youtu.be/vN5cNN2-HWE
Logistic Regression, Details Part 2, Fitting a line with Maximum Likelihood:
https://youtu.be/BfKanl1aSG0
Logistic Regression Details Part 3, R-squared and its p-value:
https://youtu.be/xxFYro8QuXA
Saturated Models and Deviance Statistics, Clearly Explained:
https://youtu.be/9T0wlKdew6I
Deviance Residuals, Clearly Explained:
https://youtu.be/JC56jS2gVUE
For a complete index of all the StatQuest videos, check out:
https://statquest.org/video-index/
If you'd like to support StatQuest, please consider...
Buying The StatQuest Illustrated Guide to Machine Learning!!!
PDF - https://statquest.gumroad.com/l/wvtmc
Paperback - https://www.amazon.com/dp/B09ZCKR4H6
Kindle eBook - https://www.amazon.com/dp/B09ZG79HXC
Patreon: https://www.patreon.com/statquest
...or...
YouTube Membership: https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw/join
...a cool StatQuest t-shirt or sweatshirt:
https://shop.spreadshirt.com/statquest-with-josh-starmer/
...buying one or two of my songs (or go large and get a whole album!)
https://joshuastarmer.bandcamp.com/
...or just donating to StatQuest!
https://www.paypal.me/statquest
Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on Twitter:
https://twitter.com/joshuastarmer
0:00 Awesome song and introduction
0:29 Load and format data
3:54 Dealing with missing data
5:03 Verifying that the data is not imbalanced
6:44 Logistic regression with one independent variable
12:48 Logistic regression with many independent variables
15:13 Graphing the predicted probabilities
#statquest #logistic

detail

Summary: This tutorial covers logistic regression in R using real heart disease data from the UCI Machine Learning Repository: cleaning the data, handling missing values and checking class balance (297 samples remain after removing rows with NAs), fitting a simple model and then a fancy model (pseudo R-squared = 0.55), and graphing the predicted probabilities.

Chapter 1 (0:11-1:45): Logistic regression and heart disease data

Hello, I'm Josh Starmer, and welcome to StatQuest. Today, at long last, we're going to cover logistic regression. A link to the code, which is chock full of comments and should be easy to follow, is in the description below. For this example, we're going to get a real data set from the UCI Machine Learning Repository; specifically, we want the heart disease data set. Unfortunately, none of the columns are labeled. Wah, wah. So we name the columns after the names listed on the UCI website. Hooray! Now, when we look at the first six rows with the head() function, things look a lot better. However, the str() function, which describes the structure of the data, tells us that some of the columns are messed up.

Chapter 2 (1:46-8:18): Data cleaning and the simple model

So we've got some cleaning up to do. The first thing we do is change the question marks to NAs. Then, just to make the data easier on the eyes, we convert the zeros in sex to F for female and the ones to M for male, and convert the column into a factor. Then we convert a bunch of other columns into factors, since that's what they're supposed to be; see the UCI website or the sample code on the StatQuest blog for more details. Since the ca column originally had a question mark in it, rather than NA, R thinks it's a column of strings. We correct that assumption by telling R that it's a column of integers, and then we convert it to a factor. Then we do the same thing for thal. We could have done a similar trick for sex, but I wanted to show you both ways to convert numbers to words. Once we're done fixing up the data, we check that we made the appropriate changes with the str() function. Hooray, it worked!

Now we see how many samples (rows of data) have NA values; later, we will decide whether we can just toss those samples out or whether we should impute values for the NAs. Including the six samples with NAs, there are 303 samples; after removing those six, 297 samples remain. Now we need to make sure that healthy and diseased samples come from each gender, female and male. If only male samples had heart disease, we should probably remove all females from the model. Fortunately, healthy and unhealthy patients are both well represented by female and male samples. One caveat: only four patients represent level one of the resting electrocardiographic results, which may affect the model fit.

Then we fit a logistic regression that uses gender to predict heart disease. The output gives a summary of the deviance residuals; they look good, since they are close to being centered on zero and are roughly symmetrical. (If you want to know more about deviance residuals, check out the StatQuest "Deviance Residuals, Clearly Explained.") Then we have the coefficients, which correspond to the model: heart disease = -1.0438 + 1.2737 × (the patient is male).

Chapter 3 (8:19-17:11): Interpreting the output, the fancy model, and the graph

For a female patient, the equation reduces to heart disease = -1.0438; thus, the log(odds) that a female has heart disease equals -1.0438. For a male patient, the equation becomes heart disease = -1.0438 + 1.2737, so 1.2737 is the increase in the log(odds) that a male has of having heart disease over a female. The output then shows how the Wald test was computed for both coefficients, along with the p-values. Both p-values are well below 0.05, and thus the log(odds) and the log(odds ratio) are both statistically significant. But remember, a small p-value alone isn't interesting.

Next, we see the default dispersion parameter used for this logistic regression. When we do normal linear regression, we estimate both the mean and the variance from the data. In contrast, with logistic regression, we estimate the mean of the data and the variance is derived from the mean. Since we are not estimating the variance from the data, it is possible that the variance is underestimated; if so, the dispersion parameter can be adjusted. Then come the null deviance and the residual deviance, which can be used to compare models, compute R-squared, and compute an overall p-value; for details, check out the StatQuests "Logistic Regression Details, Part 3: R-squared and its p-value" and "Saturated Models and Deviance Statistics, Clearly Explained." Then we have the AIC, the Akaike Information Criterion, which in this context is just the residual deviance adjusted for the number of parameters in the model; the AIC can be used to compare one model to another.

Next, we fit a fancy model that uses all of the variables to predict heart disease. We can pull the log-likelihood of the null model out of the fitted object by taking the null deviance and dividing by -2, and the log-likelihood of the fancy model by taking the residual deviance and dividing by -2. Then we just do the math, and we end up with a pseudo R-squared of 0.55, which is the overall effect size.

Lastly, we draw a graph that shows the predicted probability that each patient has heart disease, along with their actual heart disease status. Most of the patients with heart disease (in turquoise) are predicted to have a high probability of having heart disease, and most of the patients without heart disease (in salmon) are predicted to have a low probability of having it. Thus, our logistic regression has done a pretty good job. However, we could use cross-validation to get a better idea of how well it might perform with new data, but we'll save that for another day. To draw the graph, we create a new data frame containing the predicted probabilities along with the actual heart disease status, sort it from low to high probability, and add a column with each sample's rank. Then we load the ggplot2 library, load the cowplot library so that ggplot has nice-looking defaults, call ggplot() with geom_point() to draw the data, and call ggsave() to save the graph as a PDF file. Triple bam!

If you like this StatQuest and want to see more of them, please subscribe. And if you want to support StatQuest, please click the like button below and consider buying one or two of my original songs.
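The cleaning steps described in the transcript (turning "?" into missing values, recoding sex 0/1 as F/M, and tossing rows with NAs) are done in R in the video. As a rough, hypothetical mirror of the same logic in Python (the row values below are made up for illustration, not taken from the real UCI dataset):

```python
# Hypothetical miniature of the cleaning step the video performs in R:
# replace "?" with a missing marker, recode sex 0/1 as "F"/"M",
# and drop rows that still contain missing values.
rows = [
    {"sex": 0, "ca": "3"},
    {"sex": 1, "ca": "?"},   # "?" stands in for a missing value
    {"sex": 1, "ca": "0"},
]

for row in rows:
    row["ca"] = None if row["ca"] == "?" else int(row["ca"])  # "?" -> NA, then integer
    row["sex"] = "M" if row["sex"] == 1 else "F"              # 0/1 -> F/M

complete = [r for r in rows if r["ca"] is not None]           # toss rows with NAs
print(len(complete))  # → 2
```

In R the same recode-then-factor pattern is what makes str() report the corrected column types.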
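The coefficients the transcript reports are on the log(odds) scale: -1.0438 for a female patient and -1.0438 + 1.2737 for a male patient. A small sketch of the arithmetic for converting those log(odds) into predicted probabilities (the coefficients come from the video; the function name is mine):

```python
import math

def logodds_to_prob(log_odds):
    """Convert log(odds) to a probability via the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Coefficients reported for the video's simple model:
# heart disease = -1.0438 + 1.2737 * (patient is male)
intercept = -1.0438   # log(odds) for a female patient
male_coef = 1.2737    # log(odds ratio), male vs. female

p_female = logodds_to_prob(intercept)              # ≈ 0.26
p_male = logodds_to_prob(intercept + male_coef)    # ≈ 0.56
```

This is the same transformation the graph at the end of the video applies to every patient's fitted log(odds).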
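The pseudo R-squared computation the transcript walks through (log-likelihood = deviance / -2 for both the null and the proposed model, then compare) can be sketched as follows. The deviance values here are hypothetical placeholders chosen only so the ratio works out to the 0.55 the video reports; the excerpt does not give the actual deviances.

```python
def mcfadden_pseudo_r2(null_deviance, residual_deviance):
    """McFadden's pseudo R-squared, computed as the video describes:
    pull each log-likelihood out of its deviance (deviance / -2),
    then R^2 = (ll_null - ll_proposed) / ll_null."""
    ll_null = null_deviance / -2.0
    ll_proposed = residual_deviance / -2.0
    return (ll_null - ll_proposed) / ll_null

# Hypothetical deviances (only the final 0.55 appears in the excerpt):
print(mcfadden_pseudo_r2(400.0, 180.0))  # → 0.55
```

Note that the formula simplifies to 1 - residual_deviance / null_deviance, which is why a fancy model that shrinks the residual deviance raises the pseudo R-squared.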