title
Comparing machine learning models in scikit-learn
description
We've learned how to train different machine learning models and make predictions, but how do we actually choose which model is "best"? We'll cover the train/test split process for model evaluation, which allows you to avoid "overfitting" by estimating how well a model is likely to perform on new data. We'll use that same process to locate optimal tuning parameters for a KNN model, and then we'll re-train our model so that it's ready to make real predictions.
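The evaluation procedure described above maps onto a few lines of scikit-learn. This is a minimal sketch rather than the video's exact notebook: the test_size and random_state values are illustrative assumptions, and the exact accuracy scores will depend on the split and the scikit-learn version.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# Load the iris data: 150 observations, 4 features, 3 species
X, y = load_iris(return_X_y=True)

# Split into training and testing sets (40% held out for testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=4)

# Fit each candidate model on the training set, then score it on the
# testing set so the evaluation uses data the model has never seen
candidates = {
    "logistic regression": LogisticRegression(max_iter=200),
    "KNN (k=1)": KNeighborsClassifier(n_neighbors=1),
    "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
}
results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = metrics.accuracy_score(y_test, y_pred)
    print(name, results[name])
```

Because the testing set was not used for training, a model that merely memorizes the data (such as KNN with k=1) no longer gets an automatic perfect score.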
Download the notebook: https://github.com/justmarkham/scikit-learn-videos
Quora explanation of overfitting: http://www.quora.com/What-is-an-intuitive-explanation-of-overfitting/answer/Jessica-Su
Estimating prediction error: https://www.youtube.com/watch?v=_2ij6eaaSl0&t=2m34s
Understanding the Bias-Variance Tradeoff: http://scott.fortmann-roe.com/docs/BiasVariance.html
Guiding questions for that article: https://github.com/justmarkham/DAT8/blob/master/homework/09_bias_variance.md
Visualizing bias and variance: http://work.caltech.edu/library/081.html
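The bias-variance resources above pair naturally with the tuning loop from the video: try every value of k from 1 through 25, record each testing accuracy in a list called scores, and plot testing accuracy against k with matplotlib. A minimal sketch; the split parameters and the off-screen Agg backend are assumptions, not from the video.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; assumes no display is attached
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=4)

# Try k = 1 through 25 and record the testing accuracy for each value
k_range = range(1, 26)
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    scores.append(metrics.accuracy_score(y_test, knn.predict(X_test)))

# Plot testing accuracy versus model complexity (k); the typical shape
# is a rise and then a fall as k increases
plt.plot(k_range, scores)
plt.xlabel("Value of k for KNN")
plt.ylabel("Testing accuracy")
plt.savefig("knn_accuracy.png")
```

Plotting testing accuracy against a complexity parameter like k is a general tuning technique, not something specific to KNN.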
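Once the best value of k has been chosen (k=11 in the video), the final step is to retrain the model on all available data so that no training observations are thrown away. A minimal sketch; the sample measurements passed to predict are made up for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Load the full iris dataset
X, y = load_iris(return_X_y=True)

# Retrain the chosen model (k=11) on ALL available data,
# rather than only on the training portion of the split
knn = KNeighborsClassifier(n_neighbors=11)
knn.fit(X, y)

# Predict the species of a previously unseen iris
# (these four measurements are hypothetical)
prediction = knn.predict([[3, 5, 4, 2]])
print(prediction)
```

The train/test split is used only to estimate performance and pick parameters; the deployed model should see every labeled observation you have.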
detail
{'title': 'Comparing machine learning models in scikit-learn', 'heatmap': [{'end': 214.457, 'start': 202.469, 'weight': 0.702}, {'end': 264.708, 'start': 215.738, 'weight': 0.703}, {'end': 1012.089, 'start': 945.528, 'weight': 0.795}, {'end': 1234.397, 'start': 1158.044, 'weight': 0.761}, {'end': 1319.076, 'start': 1293.501, 'weight': 0.722}], 'summary': 'Compares machine learning models in scikit-learn, covering supervised learning, model evaluation, overfitting, train-test split procedure, knn model training, bias-variance tradeoff, and regression in scikit-learn, achieving up to 100% accuracy with k-nn model and highlighting the importance of generalization in model performance.', 'chapters': [{'end': 137.865, 'segs': [{'end': 137.865, 'src': 'embed', 'start': 0.865, 'weight': 0, 'content': [{'end': 5.246, 'text': 'Welcome back to my video series on scikit-learn for machine learning.', 'start': 0.865, 'duration': 4.381}, {'end': 17.449, 'text': 'In the previous video we learned about the k-nearest neighbors classification model and the four key steps for model training and prediction in scikit-learn.', 'start': 6.406, 'duration': 11.043}, {'end': 24.47, 'text': 'Then, we applied those steps to the IRIS dataset using three different models.', 'start': 18.769, 'duration': 5.701}, {'end': 27.991, 'text': "In this video, I'll be covering the following.", 'start': 25.71, 'duration': 2.281}, {'end': 33.913, 'text': 'How do I choose which model to use for my supervised learning task?', 'start': 29.81, 'duration': 4.103}, {'end': 38.336, 'text': 'How do I choose the best tuning parameters for that model?', 'start': 34.953, 'duration': 3.383}, {'end': 44.54, 'text': 'And how do I estimate the likely performance of my model on out-of-sample data?', 'start': 39.316, 'duration': 5.224}, {'end': 49.223, 'text': "Let's start by reviewing where we ended up last time.", 'start': 46.441, 'duration': 2.782}, {'end': 54.907, 'text': 'Our classification task was to predict the 
species of an unknown iris.', 'start': 50.244, 'duration': 4.663}, {'end': 67.537, 'text': 'We tried using KNN with k equals one, KNN with k equals five, and logistic regression, and received three different sets of predictions.', 'start': 55.95, 'duration': 11.587}, {'end': 78.284, 'text': "Because this is out of sample data, we don't know the true response values, and thus we can't actually say which model made the best predictions.", 'start': 68.658, 'duration': 9.626}, {'end': 83.548, 'text': 'However, we still need to choose between these three models.', 'start': 79.845, 'duration': 3.703}, {'end': 91.73, 'text': 'The goal of supervised learning is always to build a model that generalizes to out-of-sample data,', 'start': 84.707, 'duration': 7.023}, {'end': 101.734, 'text': 'and thus what we really need is a procedure that allows us to estimate how well a given model is likely to perform on out-of-sample data.', 'start': 91.73, 'duration': 10.004}, {'end': 106.265, 'text': 'This is known as a model evaluation procedure.', 'start': 102.984, 'duration': 3.281}, {'end': 117.528, 'text': 'If we can estimate the likely performance of our three models, then we can use that performance estimate to choose between the models.', 'start': 107.465, 'duration': 10.063}, {'end': 125.65, 'text': "There are many possible model evaluation procedures, but in this video, I'm going to focus on two procedures.", 'start': 119.228, 'duration': 6.422}, {'end': 137.865, 'text': "The first procedure is widely known, but it doesn't have an official name that I'm aware of,", 'start': 131.96, 'duration': 5.905}], 'summary': 'Covered knn and logistic regression models on iris dataset, focusing on model evaluation procedures.', 'duration': 137, 'max_score': 0.865, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/0pP4EwWJgIU/pics/0pP4EwWJgIU865.jpg'}], 'start': 0.865, 'title': 'Scikit-learn for supervised learning', 'summary': 'Covers model training and prediction 
in scikit-learn, choosing models for supervised learning, selecting tuning parameters, and estimating model performance on out-of-sample data, using the example of knn and logistic regression models applied to the iris dataset.', 'chapters': [{'end': 137.865, 'start': 0.865, 'title': 'Scikit-learn for supervised learning', 'summary': 'Covers the model training and prediction in scikit-learn, choosing models for supervised learning, selecting tuning parameters, and estimating model performance on out-of-sample data, using the example of knn and logistic regression models applied to the iris dataset.', 'duration': 137, 'highlights': ['We applied the four key steps for model training and prediction in scikit-learn to three different models (KNN with k=1, KNN with k=5, and logistic regression) on the IRIS dataset.', 'The need to choose between the three models despite not knowing the true response values for out-of-sample data.', 'The focus on evaluating model performance on out-of-sample data to choose between the models, with a specific emphasis on two evaluation procedures.', 'The introduction of the concept of model evaluation procedure, emphasizing the goal of building a model that generalizes to out-of-sample data.', 'The lack of an official name for a widely known evaluation procedure.', 'Covering the steps for selecting tuning parameters and estimating the likely performance of a model on out-of-sample data.']}], 'duration': 137, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/0pP4EwWJgIU/pics/0pP4EwWJgIU865.jpg', 'highlights': ['Applied four key steps for model training and prediction in scikit-learn to three different models on the IRIS dataset.', 'Focused on evaluating model performance on out-of-sample data to choose between the models, with specific emphasis on two evaluation procedures.', 'Covered steps for selecting tuning parameters and estimating the likely performance of a model on out-of-sample data.', 'Introduced the 
concept of model evaluation procedure, emphasizing the goal of building a model that generalizes to out-of-sample data.', 'Emphasized the need to choose between the three models despite not knowing the true response values for out-of-sample data.']}, {'end': 482.48, 'segs': [{'end': 214.457, 'src': 'heatmap', 'start': 137.865, 'weight': 3, 'content': [{'end': 142.29, 'text': "so I'm just going to call it train and test on the entire data set.", 'start': 137.865, 'duration': 4.425}, {'end': 144.311, 'text': 'The idea is simple.', 'start': 143.431, 'duration': 0.88}, {'end': 153.54, 'text': 'We train our model on the entire data set, and then we test our model by checking how well it performs on that same data.', 'start': 145.152, 'duration': 8.388}, {'end': 164.698, 'text': "This appears to solve our original problem, which was that we made some predictions, but we couldn't check whether those predictions were correct.", 'start': 155.233, 'duration': 9.465}, {'end': 173.143, 'text': 'By testing our model on a dataset for which we do actually know the true response values.', 'start': 165.979, 'duration': 7.164}, {'end': 181.267, 'text': 'we can check how well our model is doing by comparing the predicted response values with the true response values.', 'start': 173.143, 'duration': 8.124}, {'end': 190.573, 'text': "Let's start by reading in the iris data and then creating our feature matrix X and our response vector Y.", 'start': 183.086, 'duration': 7.487}, {'end': 204.751, 'text': "we'll try logistic regression first.", 'start': 202.469, 'duration': 2.282}, {'end': 214.457, 'text': 'We follow the usual pattern, which is to import the class, instantiate the model, and fit the model with the training data.', 'start': 205.731, 'duration': 8.726}], 'summary': 'Train and test model on entire dataset to check performance and compare predictions with true values.', 'duration': 66.886, 'max_score': 137.865, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/0pP4EwWJgIU/pics/0pP4EwWJgIU137865.jpg'}, {'end': 264.708, 'src': 'heatmap', 'start': 215.738, 'weight': 0.703, 'content': [{'end': 226.546, 'text': "Then we'll make our predictions by passing the entire feature matrix X to the predict method of the fitted model and print out those predictions.", 'start': 215.738, 'duration': 10.808}, {'end': 234.387, 'text': "Let's store those predictions in an object called yPred.", 'start': 230.745, 'duration': 3.642}, {'end': 244.514, 'text': 'As you can see, it made 150 predictions, which is one prediction for each observation.', 'start': 237.649, 'duration': 6.865}, {'end': 256.241, 'text': 'Now, we need a numerical way to evaluate how well our model performed.', 'start': 250.978, 'duration': 5.263}, {'end': 264.708, 'text': 'The most obvious choice would be classification accuracy, which is the proportion of correct predictions.', 'start': 257.382, 'duration': 7.326}], 'summary': "Using feature matrix x, made 150 predictions, evaluating model's classification accuracy.", 'duration': 48.97, 'max_score': 215.738, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/0pP4EwWJgIU/pics/0pP4EwWJgIU215738.jpg'}, {'end': 329.82, 'src': 'embed', 'start': 299.431, 'weight': 2, 'content': [{'end': 308.337, 'text': 'Then we use the accuracy score function and pass it the true response values followed by the predicted response values.', 'start': 299.431, 'duration': 8.906}, {'end': 316.331, 'text': 'It returns a value of 0.96.', 'start': 312.419, 'duration': 3.912}, {'end': 329.82, 'text': 'That means that it compared the 150 true responses with the corresponding 150 predicted responses and calculated that 96% of our predictions were correct.', 'start': 316.331, 'duration': 13.489}], 'summary': 'Accuracy score is 0.96, indicating 96% correct predictions.', 'duration': 30.389, 'max_score': 299.431, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/0pP4EwWJgIU/pics/0pP4EwWJgIU299431.jpg'}, {'end': 408.559, 'src': 'embed', 'start': 368.583, 'weight': 1, 'content': [{'end': 374.684, 'text': 'This time, we get 0.967, which is slightly better than logistic regression.', 'start': 368.583, 'duration': 6.101}, {'end': 385.737, 'text': "Finally, we'll try KNN using the value k equals 1.", 'start': 376.934, 'duration': 8.803}, {'end': 390.999, 'text': 'This time, we get a score of 1.0, meaning 100% accuracy.', 'start': 385.737, 'duration': 5.262}, {'end': 402.203, 'text': 'It performed even better than the other two models, and so we would conclude that KNN with k equals 1 is the best model to use with this data.', 'start': 391.98, 'duration': 10.223}, {'end': 408.559, 'text': 'Or would we draw that conclusion? Think about that for a second.', 'start': 404.084, 'duration': 4.475}], 'summary': 'Knn with k=1 achieved 100% accuracy, outperforming other models.', 'duration': 39.976, 'max_score': 368.583, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/0pP4EwWJgIU/pics/0pP4EwWJgIU368583.jpg'}, {'end': 482.48, 'src': 'embed', 'start': 459.456, 'weight': 0, 'content': [{'end': 468.021, 'text': 'KNN would search for the one nearest observation in the training set and it would find that exact same observation.', 'start': 459.456, 'duration': 8.565}, {'end': 479.527, 'text': "In other words, KNN has memorized the training set, and because we're testing on the exact same data, it will always make correct predictions.", 'start': 469.382, 'duration': 10.145}, {'end': 482.48, 'text': 'At this point,', 'start': 481.178, 'duration': 1.302}], 'summary': 'Knn memorizes training set for exact predictions.', 'duration': 23.024, 'max_score': 459.456, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/0pP4EwWJgIU/pics/0pP4EwWJgIU459456.jpg'}], 'start': 137.865, 'title': 'Model evaluation and selection', 'summary': 
'Discusses evaluating and comparing models for classification accuracy, with logistic regression achieving 96% accuracy, k-nn with k=5 achieving 96.7% accuracy, and k-nn with k=1 achieving 100% accuracy, leading to a critical examination of the best model choice.', 'chapters': [{'end': 234.387, 'start': 137.865, 'title': 'Train and test on the entire dataset', 'summary': 'Discusses the process of training and testing a model on the entire dataset to assess its performance, enabling the comparison of predicted response values with the true response values, using the example of logistic regression on the iris dataset.', 'duration': 96.522, 'highlights': ['The process involves training the model on the entire dataset and then testing its performance on the same data, addressing the issue of being unable to check the correctness of predictions made.', "By testing the model on a dataset with known true response values, the comparison between predicted and true response values allows for an assessment of the model's performance.", 'The example uses logistic regression on the iris dataset, following the standard pattern of importing the class, instantiating the model, fitting it with the training data, and making predictions using the entire feature matrix X.']}, {'end': 482.48, 'start': 237.649, 'title': 'Model evaluation and selection', 'summary': 'Discusses the process of evaluating and comparing different models using classification accuracy, with logistic regression achieving 96% accuracy, k-nn with k=5 achieving 96.7% accuracy, and k-nn with k=1 achieving 100% accuracy, leading to a critical examination of the best model choice.', 'duration': 244.831, 'highlights': ['k-NN with k=1 achieved 100% accuracy, indicating that it memorized the training set and always made correct predictions when tested on the same data. 
The k-NN model with k=1 achieved a score of 1.0, signifying 100% accuracy, due to memorizing the training set and consistently making correct predictions on the same data.', 'k-NN with k=5 achieved 96.7% accuracy, slightly outperforming logistic regression. k-NN with k=5 achieved a score of 0.967, slightly surpassing the accuracy of logistic regression, suggesting its improved performance in this scenario.', 'Logistic regression achieved 96% accuracy in classification. Logistic regression model achieved a classification accuracy of 96%, demonstrating its effectiveness in making correct predictions for the given data.', 'The chapter discusses the process of evaluating and comparing different models using classification accuracy. The chapter focuses on the evaluation and comparison of diverse models using classification accuracy as a metric to determine their performance and effectiveness.']}], 'duration': 344.615, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/0pP4EwWJgIU/pics/0pP4EwWJgIU137865.jpg', 'highlights': ['k-NN with k=1 achieved 100% accuracy, indicating that it memorized the training set and always made correct predictions when tested on the same data.', 'k-NN with k=5 achieved 96.7% accuracy, slightly outperforming logistic regression.', 'Logistic regression achieved 96% accuracy in classification.', 'The process involves training the model on the entire dataset and then testing its performance on the same data, addressing the issue of being unable to check the correctness of predictions made.', "By testing the model on a dataset with known true response values, the comparison between predicted and true response values allows for an assessment of the model's performance."]}, {'end': 660.307, 'segs': [{'end': 568.139, 'src': 'embed', 'start': 482.48, 'weight': 0, 'content': [{'end': 492.658, 'text': 'you might conclude that training and testing your models on the same data is not a useful procedure for deciding which models 
to choose,', 'start': 482.48, 'duration': 10.178}, {'end': 493.58, 'text': 'and you would be correct.', 'start': 492.658, 'duration': 0.922}, {'end': 507.416, 'text': 'Remember that our goal here is to estimate how well each model is likely to perform on out-of-sample data,', 'start': 498.669, 'duration': 8.747}, {'end': 512.119, 'text': "meaning future observations in which we don't know the true response values.", 'start': 507.416, 'duration': 4.703}, {'end': 524.708, 'text': "If what we try to maximize is training accuracy, then we're rewarding overly complex models that won't necessarily generalize to future cases.", 'start': 513.159, 'duration': 11.549}, {'end': 534.695, 'text': 'In other words, models with a high training accuracy may not actually do well when making predictions on out-of-sample data.', 'start': 525.925, 'duration': 8.77}, {'end': 541.522, 'text': 'Creating an unnecessarily complex model is known as overfitting.', 'start': 536.797, 'duration': 4.725}, {'end': 547.929, 'text': 'Models that overfit have learned the noise in the data rather than the signal.', 'start': 542.663, 'duration': 5.266}, {'end': 559.192, 'text': 'In the case of KNN, a very low value of k creates a high complexity model because it follows the noise in the data.', 'start': 549.284, 'duration': 9.908}, {'end': 568.139, 'text': 'This is a nice diagram that I think explains overfitting quite well.', 'start': 564.256, 'duration': 3.883}], 'summary': 'Training and testing models on the same data leads to overfitting and rewards overly complex models.', 'duration': 85.659, 'max_score': 482.48, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/0pP4EwWJgIU/pics/0pP4EwWJgIU482480.jpg'}, {'end': 643.946, 'src': 'embed', 'start': 609.761, 'weight': 1, 'content': [{'end': 614.766, 'text': 'is a good boundary for classifying future observations as red or blue.', 'start': 609.761, 'duration': 5.005}, {'end': 625.056, 'text': "It doesn't do a perfect job 
classifying the training observations, but it's likely to do a great job classifying out-of-sample data.", 'start': 615.927, 'duration': 9.129}, {'end': 633.238, 'text': 'A model that instead learns the green line as the decision boundary is overfitting the data.', 'start': 627.293, 'duration': 5.945}, {'end': 643.946, 'text': "It does a perfect job classifying the training observations, but it probably won't do as well as the black line when classifying out-of-sample data.", 'start': 634.358, 'duration': 9.588}], 'summary': 'The black line is a good boundary for classifying future data, likely to perform well out-of-sample.', 'duration': 34.185, 'max_score': 609.761, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/0pP4EwWJgIU/pics/0pP4EwWJgIU609761.jpg'}], 'start': 482.48, 'title': 'Overfitting and model evaluation', 'summary': 'Explains the concept of overfitting, the consequences of training accuracy maximization, and the need for a better model evaluation procedure when testing on the same data, highlighting the importance of generalization in model performance.', 'chapters': [{'end': 660.307, 'start': 482.48, 'title': 'Overfitting and model evaluation', 'summary': 'Explains the concept of overfitting, the consequences of training accuracy maximization, and the need for a better model evaluation procedure when testing on the same data, highlighting the importance of generalization in model performance.', 'duration': 177.827, 'highlights': ['The chapter emphasizes that training and testing models on the same data leads to inaccurate model evaluation, as it rewards overly complex models and may not generalize to future cases.', 'The concept of overfitting is explained, highlighting that overly complex models learn the noise in the data rather than the signal, impacting their performance on out-of-sample data.', 'The consequences of maximizing training accuracy are discussed, indicating that it rewards overly complex models that may 
not generalize well to future cases.', 'A clear explanation is provided using a diagram to illustrate the concept of overfitting, emphasizing the importance of the decision boundary in classifying future observations accurately.', 'The need for a better model evaluation procedure is highlighted, underscoring the importance of estimating model performance on out-of-sample data for accurate evaluation and model selection.']}], 'duration': 177.827, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/0pP4EwWJgIU/pics/0pP4EwWJgIU482480.jpg', 'highlights': ['The need for a better model evaluation procedure is highlighted, underscoring the importance of estimating model performance on out-of-sample data for accurate evaluation and model selection.', 'A clear explanation is provided using a diagram to illustrate the concept of overfitting, emphasizing the importance of the decision boundary in classifying future observations accurately.', 'The concept of overfitting is explained, highlighting that overly complex models learn the noise in the data rather than the signal, impacting their performance on out-of-sample data.', 'The consequences of maximizing training accuracy are discussed, indicating that it rewards overly complex models that may not generalize well to future cases.', 'The chapter emphasizes that training and testing models on the same data leads to inaccurate model evaluation, as it rewards overly complex models and may not generalize to future cases.']}, {'end': 975.857, 'segs': [{'end': 718.629, 'src': 'embed', 'start': 689.735, 'weight': 2, 'content': [{'end': 696.456, 'text': 'First, we split the data into two pieces, which we call a training set and a testing set.', 'start': 689.735, 'duration': 6.721}, {'end': 705.418, 'text': 'We train the model on the training set, and then we test the model on the testing set to evaluate how well we did.', 'start': 697.937, 'duration': 7.481}, {'end': 708.905, 'text': "That's really all there 
is to it.", 'start': 707.284, 'duration': 1.621}, {'end': 718.629, 'text': "The key idea here is that, because we're evaluating the model on data that was not used to train the model,", 'start': 709.825, 'duration': 8.804}], 'summary': 'Data split into training and testing sets to evaluate model performance.', 'duration': 28.894, 'max_score': 689.735, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/0pP4EwWJgIU/pics/0pP4EwWJgIU689735.jpg'}, {'end': 818.046, 'src': 'embed', 'start': 787.102, 'weight': 0, 'content': [{'end': 795.008, 'text': "It's important to understand what's happening and why, so I've created a diagram to explain the train test split function.", 'start': 787.102, 'duration': 7.906}, {'end': 803.815, 'text': 'For the moment, forget about the iris data.', 'start': 801.373, 'duration': 2.442}, {'end': 812.101, 'text': 'Pretend that we have a dataset with five observations consisting of two features and a response value.', 'start': 804.995, 'duration': 7.106}, {'end': 818.046, 'text': 'The response value is numeric, meaning that this is a regression problem.', 'start': 813.322, 'duration': 4.724}], 'summary': 'Diagram created to explain train test split for a regression problem with 5 observations and 2 features.', 'duration': 30.944, 'max_score': 787.102, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/0pP4EwWJgIU/pics/0pP4EwWJgIU787102.jpg'}, {'end': 975.857, 'src': 'embed', 'start': 915.667, 'weight': 1, 'content': [{'end': 923.509, 'text': 'This optional test size parameter determines the proportion of observations assigned to the testing set.', 'start': 915.667, 'duration': 7.842}, {'end': 932.864, 'text': "In this case, I've assigned 40% of observations to the testing set, which means that 60% will be assigned to the training set.", 'start': 924.761, 'duration': 8.103}, {'end': 943.327, 'text': "There's no general rule as to what percentage is best, but people generally use between 20 and 
40% of their data for testing.", 'start': 934.124, 'duration': 9.203}, {'end': 950.47, 'text': "In terms of how the observations are assigned, it's actually a random process.", 'start': 945.528, 'duration': 4.942}, {'end': 961.253, 'text': "You'll find that if you run this function five different times on the same set of data, it will split the data five different ways.", 'start': 951.63, 'duration': 9.623}, {'end': 970.815, 'text': 'However, if you use an optional parameter called random state and give it an integer value,', 'start': 963.453, 'duration': 7.362}, {'end': 975.857, 'text': 'it will split a given data set the exact same way every single time.', 'start': 970.815, 'duration': 5.042}], 'summary': '40% testing set, 60% training set, random split, consistent with random state', 'duration': 60.19, 'max_score': 915.667, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/0pP4EwWJgIU/pics/0pP4EwWJgIU915667.jpg'}], 'start': 661.187, 'title': 'Train test split procedure', 'summary': 'Explains the process of dividing data into training and testing sets, using a 60/40 split ratio, and evaluates model performance on the testing set, emphasizing its importance in accurately simulating model performance on out-of-sample data.', 'chapters': [{'end': 718.629, 'start': 661.187, 'title': 'Train test split procedure', 'summary': "Explains the train test split procedure, involving splitting data into training and testing sets, and testing the model's performance on the testing set to evaluate its effectiveness.", 'duration': 57.442, 'highlights': ['The procedure involves splitting the data into a training set and a testing set, then training the model on the training set and testing its performance on the testing set.', 'The key idea is evaluating the model on data that was not used to train it, ensuring unbiased assessment of its effectiveness.']}, {'end': 975.857, 'start': 718.629, 'title': 'Train test split procedure', 'summary': 'Explains the 
train test split function, which divides the data into training and testing sets, using a 60/40 split ratio and emphasizes the importance of this method in accurately simulating model performance on out-of-sample data.', 'duration': 257.228, 'highlights': ['The train test split function divides the data into training and testing sets, using a 60/40 split ratio, providing a better estimate of model performance on future data. 60/40 split ratio, better estimate of model performance on future data', 'The proportion of observations assigned to the testing set is determined by the optional test size parameter, with 20-40% commonly used for testing. 20-40% commonly used for testing', 'Using the optional random state parameter with an integer value ensures the data set is split the exact same way every single time. Ensures consistent data split every time']}], 'duration': 314.67, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/0pP4EwWJgIU/pics/0pP4EwWJgIU661187.jpg', 'highlights': ['The train test split function divides the data into training and testing sets, using a 60/40 split ratio, providing a better estimate of model performance on future data.', 'The proportion of observations assigned to the testing set is determined by the optional test size parameter, with 20-40% commonly used for testing.', 'The procedure involves splitting the data into a training set and a testing set, then training the model on the training set and testing its performance on the testing set.', 'Using the optional random state parameter with an integer value ensures the data set is split the exact same way every single time.', 'The key idea is evaluating the model on data that was not used to train it, ensuring unbiased assessment of its effectiveness.']}, {'end': 1356.323, 'segs': [{'end': 1234.397, 'src': 'heatmap', 'start': 1111.479, 'weight': 0, 'content': [{'end': 1119.227, 'text': "Let's repeat steps two and three for our KNN models, again with K equals 
five and K equals one.", 'start': 1111.479, 'duration': 7.748}, {'end': 1130.313, 'text': 'For K equals five, we achieve a testing accuracy of 0.967.', 'start': 1120.728, 'duration': 9.585}, {'end': 1139.578, 'text': 'And for k equals one, we achieve a testing accuracy of 0.95.', 'start': 1130.313, 'duration': 9.265}, {'end': 1143.801, 'text': 'We would therefore conclude that, out of these three models,', 'start': 1139.578, 'duration': 4.223}, {'end': 1151.185, 'text': 'KNN with k equals five is likely to be the best model for making predictions on out-of-sample data.', 'start': 1143.801, 'duration': 7.384}, {'end': 1164.089, 'text': 'Naturally, you might wonder whether we can find an even better value for k.', 'start': 1158.044, 'duration': 6.045}, {'end': 1172.415, 'text': "I've written a for loop to do exactly that, in which I try every value of k from 1 through 25,", 'start': 1164.089, 'duration': 8.326}, {'end': 1178.4, 'text': "and then record KNN's testing accuracy in this Python list called scores.", 'start': 1172.415, 'duration': 5.985}, {'end': 1190.49, 'text': "I'm then going to use matplotlib, the predominant Python library for scientific plotting,", 'start': 1183.587, 'duration': 6.903}, {'end': 1195.492, 'text': 'to plot the relationship between the value of k and the testing accuracy.', 'start': 1190.49, 'duration': 5.002}, {'end': 1213.749, 'text': 'In general, as the value of k increases, there appears to be a rise in the testing accuracy and then a fall.', 'start': 1204.236, 'duration': 9.513}, {'end': 1225.854, 'text': 'This rise and fall is actually quite typical when examining the relationship between model complexity and testing accuracy.', 'start': 1216.09, 'duration': 9.764}, {'end': 1234.397, 'text': 'As we talked about earlier, training accuracy rises as model complexity increases,', 'start': 1227.134, 'duration': 7.263}], 'summary': 'KNN with k=5 achieves 0.967 testing accuracy, likely the best model. 
testing accuracy rises and then falls with increasing k values.', 'duration': 79.011, 'max_score': 1111.479, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/0pP4EwWJgIU/pics/0pP4EwWJgIU1111479.jpg'}, {'end': 1319.076, 'src': 'heatmap', 'start': 1293.501, 'weight': 0.722, 'content': [{'end': 1304.683, 'text': 'Regardless, plotting testing accuracy versus model complexity is a very useful way to tune any parameters that relate to model complexity.', 'start': 1293.501, 'duration': 11.182}, {'end': 1319.076, 'text': "Once you've chosen a model and its optimal parameters and are ready to make predictions on out-of-sample data,", 'start': 1311.504, 'duration': 7.572}], 'summary': 'Plotting testing accuracy vs model complexity is useful for tuning parameters.', 'duration': 25.575, 'max_score': 1293.501, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/0pP4EwWJgIU/pics/0pP4EwWJgIU1293501.jpg'}, {'end': 1356.323, 'src': 'embed', 'start': 1330.819, 'weight': 2, 'content': [{'end': 1341.342, 'text': "In this case, we'll choose a value of 11 for K, since that's in the middle of the K range with the highest testing accuracy,", 'start': 1330.819, 'duration': 10.523}, {'end': 1343.603, 'text': "and we'll call that our best model.", 'start': 1341.342, 'duration': 2.261}, {'end': 1356.323, 'text': 'Thus, we instantiate the KNN model with n neighbors equals 11, we fit the model with x and y, and we use the model to make a prediction.', 'start': 1345.06, 'duration': 11.263}], 'summary': 'Chose k=11 for knn model with highest testing accuracy.', 'duration': 25.504, 'max_score': 1330.819, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/0pP4EwWJgIU/pics/0pP4EwWJgIU1330819.jpg'}], 'start': 977.637, 'title': 'Model training with knn', 'summary': 'Discusses the process of training and testing a knn model, achieving testing accuracies of 0.967, 0.95, and exploring the relationship between model complexity 
and testing accuracy using matplotlib.', 'chapters': [{'end': 1356.323, 'start': 977.637, 'title': 'Model training and testing with knn', 'summary': 'Discusses the process of training and testing a knn model, achieving testing accuracies of 0.967, 0.95, and exploring the relationship between model complexity and testing accuracy by plotting it using matplotlib.', 'duration': 378.686, 'highlights': ['The KNN model achieved a testing accuracy of 0.967 for k equals 5, and 0.95 for k equals 1, with a tentative conclusion that a k value in the range of 6 to 17 would be better than k equals 5.', "The for loop tried every value of k from 1 through 25, recording KNN's testing accuracy in a Python list called scores, and plotted the relationship between the value of k and the testing accuracy using matplotlib.", "It's important to retrain the model on all available training data to avoid throwing away valuable data, and in this case, a value of 11 for K is chosen as the best model."]}], 'duration': 378.686, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/0pP4EwWJgIU/pics/0pP4EwWJgIU977637.jpg', 'highlights': ['The KNN model achieved a testing accuracy of 0.967 for k equals 5, and 0.95 for k equals 1, with a tentative conclusion that a k value in the range of 6 to 17 would be better than k equals 5.', "The for loop tried every value of k from 1 through 25, recording KNN's testing accuracy in a Python list called scores, and plotted the relationship between the value of k and the testing accuracy using matplotlib.", "It's important to retrain the model on all available training data to avoid throwing away valuable data, and in this case, a value of 11 for K is chosen as the best model."]}, {'end': 1599.152, 'segs': [{'end': 1412.971, 'src': 'embed', 'start': 1391.5, 'weight': 0, 'content': [{'end': 1400.964, 'text': "There's an alternative model evaluation procedure called k-fold cross-validation that largely overcomes this limitation,", 'start': 
1391.5, 'duration': 9.464}, {'end': 1408.808, 'text': 'by repeating the train-test-split process multiple times in a systematic way and averaging the results.', 'start': 1400.964, 'duration': 7.844}, {'end': 1412.971, 'text': "We'll go over that procedure in a future video.", 'start': 1410.249, 'duration': 2.722}], 'summary': 'K-fold cross-validation improves model evaluation by repeating train-test-split process multiple times.', 'duration': 21.471, 'max_score': 1391.5, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/0pP4EwWJgIU/pics/0pP4EwWJgIU1391500.jpg'}, {'end': 1501.472, 'src': 'embed', 'start': 1470.521, 'weight': 1, 'content': [{'end': 1484.125, 'text': 'and this is the article you should read if you want to really understand why testing accuracy exhibits that upside-down U-shaped curve when you vary model parameters such as k.', 'start': 1470.521, 'duration': 13.604}, {'end': 1493.098, 'text': 'Not only does it do a great job explaining a difficult concept,', 'start': 1488.59, 'duration': 4.508}, {'end': 1501.472, 'text': 'but it also has this extremely cool interactive visualization that will help you to better understand k-nearest neighbors.', 'start': 1493.098, 'duration': 8.374}], 'summary': "Article explains testing accuracy's u-shaped curve when varying k, with interactive visualization.", 'duration': 30.951, 'max_score': 1470.521, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/0pP4EwWJgIU/pics/0pP4EwWJgIU1470521.jpg'}, {'end': 1573.621, 'src': 'embed', 'start': 1543.688, 'weight': 2, 'content': [{'end': 1552.306, 'text': 'So far in this series, We focused on classification problems in which the goal is to predict a categorical response.', 'start': 1543.688, 'duration': 8.618}, {'end': 1560.69, 'text': "In the next video, we'll expand our scikit-learn toolbox by learning about a machine learning model for regression,", 'start': 1553.247, 'duration': 7.443}, {'end': 1563.692, 'text': 'in 
which the goal is to predict a continuous response.', 'start': 1560.69, 'duration': 3.002}, {'end': 1573.621, 'text': "We'll also learn how to read a dataset into Pandas, a very popular library for data analysis and exploration,", 'start': 1564.636, 'duration': 8.985}], 'summary': 'Introduction to regression model in scikit-learn and pandas for data analysis', 'duration': 29.933, 'max_score': 1543.688, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/0pP4EwWJgIU/pics/0pP4EwWJgIU1543688.jpg'}], 'start': 1363.525, 'title': 'Model evaluation methods, understanding bias-variance tradeoff, and regression in scikit-learn', 'summary': 'Covers model evaluation methods, emphasizing the drawbacks of the train-test-split procedure and introducing k-fold cross-validation. it delves into understanding the bias-variance tradeoff, providing various resources. additionally, it introduces regression in scikit-learn with a focus on continuous response prediction and pandas for data analysis.', 'chapters': [{'end': 1444.938, 'start': 1363.525, 'title': 'Model evaluation methods', 'summary': 'Explains the downsides of the train-test-split procedure for model evaluation, highlighting its high variance estimate of out-of-sample accuracy. it introduces k-fold cross-validation as an alternative method that overcomes this limitation by repeating the train-test-split process multiple times in a systematic way and averaging the results. 
despite its limitations, train-test-split remains a useful procedure due to its flexibility and speed.', 'duration': 81.413, 'highlights': ['k-fold cross-validation provides a more reliable estimate of out-of-sample accuracy by repeating the train-test-split process multiple times and averaging the results.', 'Train-test-split procedure has a high variance estimate of out-of-sample accuracy, which can change depending on the observations in the training and testing sets.', 'Despite the limitations, train-test-split remains a useful procedure due to its flexibility and speed.']}, {'end': 1540.906, 'start': 1446.96, 'title': 'Understanding bias-variance tradeoff in data science', 'summary': "The chapter discusses resources for understanding the bias-variance tradeoff in data science, including a video on statistical learning, an educational article explaining the upside-down U-shaped curve in testing accuracy, an interactive visualization for k-nearest neighbors, guiding questions for focused reading, and a video from Caltech's Learning from Data course.", 'duration': 93.946, 'highlights': ["The article 'Understanding the Bias-Variance Tradeoff' provides an explanation for the upside-down U-shaped curve in testing accuracy when varying model parameters such as k.", 'The article also includes an interactive visualization that enhances understanding of k-nearest neighbors.', 'A video from Hastie and Tibshirani, the authors of An Introduction to Statistical Learning, covers many of the same concepts as the chapter.', 'Guiding questions for understanding the bias-variance tradeoff are provided for focused reading.', "A video from Caltech's Learning from Data course is linked for visualizing bias and variance after reading the article."]}, {'end': 1599.152, 'start': 1543.688, 'title': 'Introduction to regression in scikit-learn', 'summary': 'Introduces regression in scikit-learn, focusing on predicting a continuous response and using pandas for data analysis, while seeking feedback
for improvement and future topics.', 'duration': 55.464, 'highlights': ['The chapter introduces regression in Scikit-Learn, focusing on predicting a continuous response.', 'It also covers using Pandas for data analysis and exploration to work with the data in Scikit-Learn.', 'The speaker seeks feedback on the series and asks for suggestions for future topics and improvements.']}], 'duration': 235.627, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/0pP4EwWJgIU/pics/0pP4EwWJgIU1363525.jpg', 'highlights': ['k-fold cross-validation provides a more reliable estimate of out-of-sample accuracy by repeating the train-test-split process multiple times and averaging the results.', "The article 'Understanding the Bias-Variance Tradeoff' provides an explanation for the upside-down U-shaped curve in testing accuracy when varying model parameters such as k.", 'The chapter introduces regression in Scikit-Learn, focusing on predicting a continuous response.']}], 'highlights': ['k-NN with k=1 achieved 100% accuracy, indicating that it memorized the training set and always made correct predictions when tested on the same data.', 'The KNN model achieved a testing accuracy of 0.967 for k equals 5, and 0.95 for k equals 1, with a tentative conclusion that a k value in the range of 6 to 17 would be better than k equals 5.', 'The train test split function divides the data into training and testing sets, using a 60/40 split ratio, providing a better estimate of model performance on future data.', 'k-fold cross-validation provides a more reliable estimate of out-of-sample accuracy by repeating the train-test-split process multiple times and averaging the results.', 'A clear explanation is provided using a diagram to illustrate the concept of overfitting, emphasizing the importance of the decision boundary in classifying future observations accurately.']}
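The tuning procedure described in the transcript — split the iris data 60/40, try every value of k from 1 through 25, record KNN's testing accuracy in a Python list called `scores`, and plot accuracy versus k — can be sketched as follows. The `random_state=4` value is an assumption made for reproducibility (the split the video uses is not guaranteed to match), and the modern `sklearn.model_selection` import path is used in place of the deprecated `sklearn.cross_validation` module from the era of the video.

```python
# Sketch of the k-tuning loop from the transcript: a 60/40 train/test
# split, then one KNN fit/predict/score per candidate value of k.
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=4)  # 60/40 split; seed is an assumption

k_range = range(1, 26)   # k = 1 through 25, as in the video
scores = []              # one testing-accuracy value per k
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, knn.predict(X_test)))

# Plotting the upside-down-U curve with matplotlib, as described:
# plt.plot(k_range, scores); plt.xlabel('Value of K for KNN');
# plt.ylabel('Testing Accuracy')
```

The plot of `scores` against `k_range` is what reveals the rise-and-fall of testing accuracy with increasing model complexity (low k = complex model, high k = simple model).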
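The retraining step — choose k=11 from the middle of the high-accuracy range, then refit on all of X and y so no labeled data is thrown away — might look like the sketch below. The sample observation is a hypothetical unseen measurement chosen for illustration, not a value guaranteed to match the video.

```python
# After tuning, retrain the chosen model on ALL available data
# (not just the training split) before making real predictions.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=11)  # best k from the tuning curve
knn.fit(X, y)                               # fit on the full dataset

# A hypothetical out-of-sample iris observation
# (sepal length, sepal width, petal length, petal width):
new_observation = [[3, 5, 4, 2]]
prediction = knn.predict(new_observation)   # array with one class label
```

Note that the testing accuracy from the 60/40 split remains the performance estimate; the refit on all data is only to make the final model as strong as possible.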
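The k-fold cross-validation procedure previewed at the end — repeating the train-test-split process multiple times in a systematic way and averaging the results — is handled in scikit-learn by `cross_val_score`. The choice of 5 folds below is an assumption, since the video defers the details to a later lesson.

```python
# k-fold cross-validation: cross_val_score performs the repeated
# splitting, fits one model per fold, and returns one accuracy score
# per fold; their mean is a lower-variance estimate of out-of-sample
# accuracy than a single train/test split.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=11)
scores = cross_val_score(knn, X, y, cv=5, scoring="accuracy")
mean_accuracy = scores.mean()  # average over the 5 folds
```

The same loop-over-k tuning shown earlier can be repeated with `cross_val_score` in place of a single split, trading extra computation for a more reliable accuracy-versus-k curve.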