title

Selecting the best model in scikit-learn using cross-validation

description

In this video, we'll learn about K-fold cross-validation and how it can be used for selecting optimal tuning parameters, choosing between models, and selecting features. We'll compare cross-validation with the train/test split procedure, and we'll also discuss some variations of cross-validation that can result in more accurate estimates of model performance.
Download the notebook: https://github.com/justmarkham/scikit-learn-videos
Documentation on cross-validation: http://scikit-learn.org/stable/modules/cross_validation.html
Documentation on model evaluation: http://scikit-learn.org/stable/modules/model_evaluation.html
GitHub issue on negative mean squared error: https://github.com/scikit-learn/scikit-learn/issues/2439
An Introduction to Statistical Learning: http://www-bcf.usc.edu/~gareth/ISL/
K-fold and leave-one-out cross-validation: https://www.youtube.com/watch?v=nZAM5OXrktY
Cross-validation the right and wrong ways: https://www.youtube.com/watch?v=S06JpVoNaA0
Accurately Measuring Model Prediction Error: http://scott.fortmann-roe.com/docs/MeasuringError.html
An Introduction to Feature Selection: http://machinelearningmastery.com/an-introduction-to-feature-selection/
Harvard CS109: https://github.com/cs109/content/blob/master/lec_10_cross_val.ipynb
Cross-validation pitfalls: http://www.jcheminf.com/content/pdf/1758-2946-6-10.pdf
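
As a rough illustration of the K-fold procedure covered in the video, here is a minimal pure-Python sketch of how observations (not features) can be divided into folds, with each fold serving as the testing set exactly once. The names `k_fold_indices`, `cross_validated_score`, and `score_fn` are hypothetical helpers made up for this sketch; this is not how scikit-learn implements it, and in practice you would likely rely on scikit-learn's own cross-validation utilities instead.

```python
import random


def k_fold_indices(n_observations, k, seed=None):
    """Yield (train, test) index lists, one pair per fold.

    Hypothetical helper for illustration only; not scikit-learn's code.
    """
    indices = list(range(n_observations))
    if seed is not None:
        # Optional shuffle so folds are not tied to the original row order
        random.Random(seed).shuffle(indices)

    # Spread any remainder so fold sizes differ by at most one observation
    base, extra = divmod(n_observations, k)
    fold_sizes = [base + (1 if i < extra else 0) for i in range(k)]

    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]                 # current fold
        train = indices[:start] + indices[start + size:]   # union of the others
        yield train, test
        start += size


def cross_validated_score(score_fn, n_observations, k, seed=None):
    """Average score_fn(train, test) over all k folds.

    score_fn is an assumed callback that fits a model on the training
    indices and returns its accuracy on the testing indices.
    """
    scores = [score_fn(train, test)
              for train, test in k_fold_indices(n_observations, k, seed)]
    return sum(scores) / len(scores)


# 25 observations numbered 0 through 24, split with 5-fold cross-validation
folds = list(k_fold_indices(25, 5))
for iteration, (train, test) in enumerate(folds, start=1):
    print(iteration, "test:", test)
```

Each of the five iterations prints a different block of five observation IDs as the testing set, and the remaining twenty form the training set for that iteration.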

detail

{'title': 'Selecting the best model in scikit-learn using cross-validation', 'heatmap': [{'end': 411.694, 'start': 315.98, 'weight': 0.715}, {'end': 828.45, 'start': 790.723, 'weight': 0.741}, {'end': 1210.445, 'start': 1156.927, 'weight': 1}, {'end': 1366.91, 'start': 1247.833, 'weight': 0.707}, {'end': 1471.99, 'start': 1437.464, 'weight': 0.718}, {'end': 1552.314, 'start': 1527.023, 'weight': 0.791}, {'end': 1628.082, 'start': 1568.47, 'weight': 0.818}, {'end': 1811.615, 'start': 1792.079, 'weight': 0.804}, {'end': 1926.858, 'start': 1863.953, 'weight': 0.78}], 'summary': 'Covers the drawbacks of train-test-split, benefits of k-fold cross-validation for selecting tuning parameters, choosing between models, and selecting features. it also discusses the importance of model evaluation in supervised learning, achieving testing accuracies of 100%, 97%, and 95% through k-fold cross-validation, and best practices for cross-validation, including using k=10 for more reliable estimates of out-of-sample accuracy and employing stratified sampling for equal representation of response classes in each fold.', 'chapters': [{'end': 264.106, 'segs': [{'end': 77.886, 'src': 'embed', 'start': 46.672, 'weight': 1, 'content': [{'end': 54.718, 'text': 'How can cross-validation be used for selecting tuning parameters, choosing between models and selecting features?', 'start': 46.672, 'duration': 8.046}, {'end': 59.501, 'text': 'And what are some possible improvements to cross-validation?', 'start': 55.899, 'duration': 3.602}, {'end': 69.522, 'text': "Let's start by reviewing what we've learned so far about model evaluation procedures.", 'start': 64.659, 'duration': 4.863}, {'end': 77.886, 'text': 'One of the primary reasons we evaluate machine learning models is so that we can choose the best available model.', 'start': 70.802, 'duration': 7.084}], 'summary': 'Cross-validation aids in model selection and evaluation for machine learning.', 'duration': 31.214, 'max_score': 46.672, 
'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6dbrR-WymjI/pics/6dbrR-WymjI46672.jpg'}, {'end': 264.106, 'src': 'embed', 'start': 134.724, 'weight': 0, 'content': [{'end': 140.21, 'text': 'The alternative procedure we came up with is called train-test-split,', 'start': 134.724, 'duration': 5.486}, {'end': 145.896, 'text': 'in which we split the dataset into two pieces known as the training and testing sets.', 'start': 140.21, 'duration': 5.686}, {'end': 153.023, 'text': 'we train the model on the training set and we evaluate the model by testing its performance on the testing set.', 'start': 145.896, 'duration': 7.127}, {'end': 165.546, 'text': 'The resulting evaluation metric is known as testing accuracy, which is a better estimate of out-of-sample performance than training accuracy,', 'start': 154.52, 'duration': 11.026}, {'end': 169.948, 'text': 'because we trained and tested the model on different sets of data.', 'start': 165.546, 'duration': 4.402}, {'end': 179.974, 'text': 'As well, testing accuracy does not reward overly complex models, and thus it helps us to avoid overfitting.', 'start': 171.349, 'duration': 8.625}, {'end': 186.475, 'text': 'However, there is a drawback to the train-test-split procedure.', 'start': 182.472, 'duration': 4.003}, {'end': 195.201, 'text': 'It turns out, the testing accuracy is a high variance estimate of out-of-sample accuracy,', 'start': 187.636, 'duration': 7.565}, {'end': 203.967, 'text': 'meaning that testing accuracy can change a lot depending on which observations happen to be in the testing set.', 'start': 195.201, 'duration': 8.766}, {'end': 211.933, 'text': "Let's see an example of this in scikit-learn using the iris dataset, which we've worked with previously.", 'start': 205.448, 'duration': 6.485}, {'end': 224.49, 'text': "First, we'll import the relevant functions, classes, and modules.", 'start': 220.026, 'duration': 4.464}, {'end': 235.659, 'text': "Then, we'll read in the iris 
dataset and define our feature matrix X and our response vector Y.", 'start': 228.713, 'duration': 6.946}, {'end': 247.879, 'text': "In this next cell, we'll use the train test split function to split x and y into four pieces.", 'start': 240.396, 'duration': 7.483}, {'end': 260.224, 'text': "The splits are random, but I've set the random state parameter so that if you run this code at home using the same random state value,", 'start': 248.96, 'duration': 11.264}, {'end': 264.106, 'text': 'your data will be split in the exact same way as my data.', 'start': 260.224, 'duration': 3.882}], 'summary': 'Train-test-split procedure evaluates model performance with testing accuracy, which helps avoid overfitting, but has high variance.', 'duration': 129.382, 'max_score': 134.724, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6dbrR-WymjI/pics/6dbrR-WymjI134724.jpg'}], 'start': 0.749, 'title': 'Machine learning model evaluation', 'summary': 'Covers drawbacks of train-test-split, benefits of k-fold cross-validation, and its use for selecting tuning parameters, choosing between models, and selecting features. it also discusses the importance of model evaluation in supervised learning, with an example using the iris dataset.', 'chapters': [{'end': 77.886, 'start': 0.749, 'title': 'Machine learning in scikit-learn', 'summary': 'Covers the drawbacks of train-test-split for model evaluation, the benefits of k-fold cross-validation, and how cross-validation can be used for selecting tuning parameters, choosing between models, and selecting features. 
it also discusses the primary reasons for evaluating machine learning models to choose the best available model.', 'duration': 77.137, 'highlights': ['K-fold cross-validation overcomes the drawback of using the train-test-split procedure for model evaluation by providing a more reliable estimate of model performance.', "Cross-validation can be used for selecting tuning parameters, choosing between models, and selecting features, enhancing the model's performance and generalization.", 'Evaluating machine learning models helps in choosing the best available model, ensuring optimal performance and accuracy.']}, {'end': 264.106, 'start': 79.047, 'title': 'Supervised learning evaluation', 'summary': 'Discusses the importance of model evaluation in supervised learning, introducing the train-test-split procedure as a better estimate of out-of-sample performance, highlighting its limitations and providing an example using the iris dataset.', 'duration': 185.059, 'highlights': ['The train-test-split procedure provides a better estimate of out-of-sample performance than training accuracy, as it avoids rewarding overly complex models and helps in avoiding overfitting. Better estimate of out-of-sample performance, avoids rewarding overly complex models, helps in avoiding overfitting', 'Testing accuracy is a high variance estimate of out-of-sample accuracy, meaning that it can change a lot depending on which observations happen to be in the testing set. Testing accuracy is a high variance estimate, its changes depending on observations in the testing set', 'The chapter provides an example using the iris dataset, demonstrating the train-test-split procedure and the use of relevant functions, classes, and modules in scikit-learn. 
Example using the iris dataset, demonstration of train-test-split procedure, use of relevant functions, classes, and modules in scikit-learn']}], 'duration': 263.357, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6dbrR-WymjI/pics/6dbrR-WymjI749.jpg', 'highlights': ['K-fold cross-validation provides a more reliable estimate of model performance than train-test-split', 'Cross-validation enhances model performance and generalization by selecting tuning parameters, models, and features', 'Evaluating machine learning models helps in choosing the best available model for optimal performance and accuracy', 'Train-test-split provides a better estimate of out-of-sample performance and helps in avoiding overfitting', 'Testing accuracy is a high variance estimate that can change depending on the observations in the testing set', 'The chapter provides an example using the iris dataset to demonstrate the train-test-split procedure and scikit-learn usage']}, {'end': 632.481, 'segs': [{'end': 411.694, 'src': 'heatmap', 'start': 291.687, 'weight': 0, 'content': [{'end': 301.352, 'text': 'But what if we reran this code and the only thing we changed was which observations were assigned to the training and testing sets?', 'start': 291.687, 'duration': 9.665}, {'end': 315.98, 'text': 'If we change the random state parameter to three and rerun the cell, we get a testing accuracy of 95%.', 'start': 303.013, 'duration': 12.967}, {'end': 329.497, 'text': 'If we change the random state to two, we get a testing accuracy of 100%.', 'start': 315.98, 'duration': 13.517}, {'end': 335.722, 'text': 'This is why testing accuracy is known as a high variance estimate.', 'start': 329.497, 'duration': 6.225}, {'end': 347.752, 'text': 'Naturally, you might think that we could solve this problem by creating a bunch of different train test splits,', 'start': 339.886, 'duration': 7.866}, {'end': 355.818, 'text': 'calculating the testing accuracy each time and then averaging 
the results together in order to reduce the variance.', 'start': 347.752, 'duration': 8.066}, {'end': 361.2, 'text': 'In fact, that is the essence of how cross-validation works.', 'start': 356.959, 'duration': 4.241}, {'end': 372.944, 'text': "Let's walk through the steps for k-fold cross-validation, which is the most common type of cross-validation.", 'start': 366.442, 'duration': 6.502}, {'end': 383.127, 'text': 'First, we choose a number for k and split the entire dataset into k partitions of equal size.', 'start': 374.584, 'duration': 8.543}, {'end': 387.326, 'text': 'These partitions are known as folds.', 'start': 384.465, 'duration': 2.861}, {'end': 398.41, 'text': 'So if k was equal to five and the dataset had 150 observations, each of the five folds would contain 30 observations.', 'start': 388.786, 'duration': 9.624}, {'end': 411.694, 'text': 'Second, we designate the observations in fold one as the testing set and the union of all other folds as the training set.', 'start': 400.65, 'duration': 11.044}], 'summary': 'Changing random state affects testing accuracy; cross-validation reduces variance.', 'duration': 44.035, 'max_score': 291.687, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6dbrR-WymjI/pics/6dbrR-WymjI291687.jpg'}, {'end': 533.158, 'src': 'embed', 'start': 473.43, 'weight': 1, 'content': [{'end': 486.398, 'text': 'During the third iteration, fold three would be the testing set, and the union of folds one, two, four, and five would be the training set, and so on.', 'start': 473.43, 'duration': 12.968}, {'end': 499.705, 'text': 'Finally, the average testing accuracy, also known as the cross-validated accuracy, is used as the estimate of out-of-sample accuracy.', 'start': 488.433, 'duration': 11.272}, {'end': 509.415, 'text': "Here's a diagram of five-fold cross-validation that may be helpful to you.", 'start': 504.43, 'duration': 4.985}, {'end': 520.735, 'text': 'As you can see, each fold acts as the testing set for 
one iteration and is part of the training set for the other four iterations.', 'start': 510.825, 'duration': 9.91}, {'end': 528.934, 'text': 'One thing I want to make clear is that we are dividing the observations into five folds.', 'start': 522.929, 'duration': 6.005}, {'end': 533.158, 'text': 'We are not dividing the features into five folds.', 'start': 529.575, 'duration': 3.583}], 'summary': 'Five-fold cross-validation ensures robust model evaluation and out-of-sample accuracy estimation.', 'duration': 59.728, 'max_score': 473.43, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6dbrR-WymjI/pics/6dbrR-WymjI473430.jpg'}, {'end': 600.824, 'src': 'embed', 'start': 571.348, 'weight': 3, 'content': [{'end': 580.913, 'text': 'Pretend that you have a dataset with 25 observations, numbered 0 through 24, and you want to use 5-fold cross-validation.', 'start': 571.348, 'duration': 9.565}, {'end': 587.091, 'text': 'This is an example of how that dataset might be split into the five folds.', 'start': 582.066, 'duration': 5.025}, {'end': 598.482, 'text': 'Each line represents one iteration of cross-validation in which the dataset has been split into training and testing sets.', 'start': 589.013, 'duration': 9.469}, {'end': 600.824, 'text': 'For each iteration,', 'start': 599.763, 'duration': 1.061}], 'summary': '25 observations split into 5 folds for cross-validation.', 'duration': 29.476, 'max_score': 571.348, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6dbrR-WymjI/pics/6dbrR-WymjI571348.jpg'}], 'start': 265.407, 'title': 'Model accuracy and cross-validation', 'summary': 'Delves into testing accuracy, highlighting its variance and the solution through k-fold cross-validation, with examples achieving testing accuracies of 100%, 97%, and 95%. 
it also explains the 5-fold cross-validation process using a dataset of 25 observations for training/testing set divisions.', 'chapters': [{'end': 509.415, 'start': 265.407, 'title': 'Cross-validation for model accuracy', 'summary': 'Explores the concept of testing accuracy, demonstrating its high variance and the solution through k-fold cross-validation, which involves dividing the dataset into k partitions, designating one as the testing set, training the model and calculating the testing accuracy, and repeating this process to obtain the average cross-validated accuracy, with specific examples achieving testing accuracies of 100%, 97%, and 95%.', 'duration': 244.008, 'highlights': ['The reported testing accuracy of our model is 97%. The model achieved a testing accuracy of 97%, demonstrating its performance on the testing data.', 'If we change the random state to two, we get a testing accuracy of 100%. Changing the random state parameter to two resulted in a 100% testing accuracy, showcasing the variance in testing accuracy based on the data split.', 'If we change the random state parameter to three and rerun the cell, we get a testing accuracy of 95%. Modifying the random state parameter to three led to a testing accuracy of 95%, further illustrating the impact of data split on testing accuracy.', 'During the second iteration, fold two would be the testing set, and the union of folds one, three, four, and five would be the training set. In the second iteration of k-fold cross-validation, fold two served as the testing set, while the remaining folds were used as the training set, demonstrating the process of partitioning the data for cross-validation.', "the average testing accuracy, also known as the cross-validated accuracy, is used as the estimate of out-of-sample accuracy. 
The average testing accuracy obtained through k-fold cross-validation serves as the cross-validated accuracy, providing an estimate of the model's out-of-sample performance."]}, {'end': 632.481, 'start': 510.825, 'title': '5-fold cross-validation process', 'summary': 'Explains the process of splitting a dataset into five folds for 5-fold cross-validation, using a dataset of 25 observations and demonstrating the iterations and training/testing set divisions.', 'duration': 121.656, 'highlights': ['The dataset is divided into five folds for 5-fold cross-validation, with each fold acting as the testing set for one iteration and as part of the training set for the other four iterations. This demonstrates the key concept of 5-fold cross-validation, where the dataset is divided into five folds, with each fold serving as the testing set for one iteration and being part of the training set for the other four iterations.', 'An example of splitting a dataset with 25 observations into five folds is provided, showing the ID numbers of observations in the training and testing sets for each iteration. The example illustrates the process of splitting a dataset with 25 observations into five folds, displaying the ID numbers of observations in the training and testing sets for each iteration, providing a visual understanding of the 5-fold cross-validation process.', 'Clarification is given that the observations, not the features, are divided into five folds for 5-fold cross-validation. 
The chapter emphasizes that the division into five folds pertains to the observations, not the features, providing clarity on a potentially confusing aspect of the 5-fold cross-validation process.']}], 'duration': 367.074, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6dbrR-WymjI/pics/6dbrR-WymjI265407.jpg', 'highlights': ['The model achieved a testing accuracy of 100% with a random state of two, showcasing variance in testing accuracy based on data split.', "The average testing accuracy obtained through k-fold cross-validation serves as the cross-validated accuracy, providing an estimate of the model's out-of-sample performance.", 'The dataset is divided into five folds for 5-fold cross-validation, with each fold acting as the testing set for one iteration and as part of the training set for the other four iterations.', 'The example illustrates the process of splitting a dataset with 25 observations into five folds, displaying the ID numbers of observations in the training and testing sets for each iteration, providing a visual understanding of the 5-fold cross-validation process.']}, {'end': 832.913, 'segs': [{'end': 675.043, 'src': 'embed', 'start': 639.622, 'weight': 0, 'content': [{'end': 645.984, 'text': "Let's briefly compare cross-validation to train-test-split to clarify the advantages of each.", 'start': 639.622, 'duration': 6.362}, {'end': 659.167, 'text': 'The main reason we prefer cross-validation to train-test-split is that cross-validation generates a more accurate estimate of out-of-sample accuracy,', 'start': 647.424, 'duration': 11.743}, {'end': 662.208, 'text': 'which is what we need in order to choose the best model.', 'start': 659.167, 'duration': 3.041}, {'end': 675.043, 'text': "As you've seen, it also uses the data more efficiently than train test split since every observation is used for both training and testing the model.", 'start': 663.613, 'duration': 11.43}], 'summary': 'Cross-validation provides more 
accurate out-of-sample accuracy estimate and uses data more efficiently than train-test-split.', 'duration': 35.421, 'max_score': 639.622, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6dbrR-WymjI/pics/6dbrR-WymjI639622.jpg'}, {'end': 828.45, 'src': 'heatmap', 'start': 693.293, 'weight': 1, 'content': [{'end': 700.617, 'text': 'since k-fold cross-validation essentially repeats the train-test-split process k times.', 'start': 693.293, 'duration': 7.324}, {'end': 708.682, 'text': 'This is an important consideration for larger datasets, as well as models that take a long time to train.', 'start': 701.718, 'duration': 6.964}, {'end': 717.824, 'text': "Second, it's much easier to examine the detailed results of the testing process from train-test-split.", 'start': 710.74, 'duration': 7.084}, {'end': 727.769, 'text': "As you'll see below, Scikit-learn makes cross-validation very easy to implement, but all you get back are the resulting scores.", 'start': 719.344, 'duration': 8.425}, {'end': 737.894, 'text': 'This makes it difficult to inspect the results using a confusion matrix or ROC curve, which are tools for model evaluation,', 'start': 729.069, 'duration': 8.825}, {'end': 743.069, 'text': 'whereas train test split makes it easy to examine those results.', 'start': 738.902, 'duration': 4.167}, {'end': 755.388, 'text': 'Before we walk through some cross-validation code, I want to present two recommendations for the use of cross-validation.', 'start': 747.762, 'duration': 7.626}, {'end': 765.776, 'text': "First, we've been using the value k equals 5 for our examples, and in fact any number can be used for k.", 'start': 756.609, 'duration': 9.167}, {'end': 776.885, 'text': 'However, k equals 10 is generally recommended because it has been shown experimentally to produce the most reliable estimates of out-of-sample accuracy.', 'start': 765.776, 'duration': 11.109}, {'end': 789.342, 'text': 'Second, when you use cross-validation with 
classification problems, it is recommended that you use stratified sampling to create the folds.', 'start': 778.931, 'duration': 10.411}, {'end': 798.531, 'text': 'This means that each response class should be represented with approximately equal proportions in each of the folds.', 'start': 790.723, 'duration': 7.808}, {'end': 809.766, 'text': 'For example, if your dataset has two response classes HAM and SPAM and 20% of your observations were HAM,', 'start': 799.778, 'duration': 9.988}, {'end': 814.97, 'text': 'then each of your cross-validation folds should consist of approximately 20% HAM.', 'start': 809.766, 'duration': 5.204}, {'end': 828.45, 'text': "Thankfully, scikit-learn uses stratified sampling by default when using the cross-val score function, which is what we'll use below,", 'start': 817.864, 'duration': 10.586}], 'summary': 'K-fold cross-validation is recommended, with k=10 for reliable estimates. stratified sampling is suggested for classification problems.', 'duration': 116.473, 'max_score': 693.293, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6dbrR-WymjI/pics/6dbrR-WymjI693293.jpg'}], 'start': 639.622, 'title': 'Cross-validation methods and best practices', 'summary': 'Compares cross-validation to train-test-split, highlighting that cross-validation provides a more accurate estimate of out-of-sample accuracy and utilizes data more efficiently, while train-test-split runs k times faster and allows for easier examination of detailed testing results. 
it also discusses best practices for cross-validation, including using k=10 for more reliable estimates of out-of-sample accuracy and employing stratified sampling for equal representation of response classes in each fold.', 'chapters': [{'end': 743.069, 'start': 639.622, 'title': 'Cross-validation vs train-test-split', 'summary': 'Compares cross-validation to train-test-split, highlighting that cross-validation provides a more accurate estimate of out-of-sample accuracy and utilizes data more efficiently, while train-test-split runs k times faster and allows for easier examination of detailed testing results.', 'duration': 103.447, 'highlights': ['Cross-validation provides a more accurate estimate of out-of-sample accuracy and uses data more efficiently than train-test-split.', 'Train-test-split runs k times faster than k-fold cross-validation, which is important for larger datasets and time-consuming models.', 'Examining detailed results using confusion matrix or ROC curve is easier with train-test-split compared to cross-validation.']}, {'end': 832.913, 'start': 747.762, 'title': 'Cross-validation best practices', 'summary': 'Discusses the best practices for cross-validation, including using k=10 for more reliable estimates of out-of-sample accuracy and employing stratified sampling for equal representation of response classes in each fold.', 'duration': 85.151, 'highlights': ['Using k=10 is generally recommended for cross-validation, as it has been experimentally shown to produce the most reliable estimates of out-of-sample accuracy.', 'Employing stratified sampling in cross-validation for classification problems ensures that each response class is represented with approximately equal proportions in each fold, leading to better model evaluation.']}], 'duration': 193.291, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6dbrR-WymjI/pics/6dbrR-WymjI639622.jpg', 'highlights': ['Cross-validation provides a more accurate estimate of 
out-of-sample accuracy and uses data more efficiently than train-test-split.', 'Using k=10 is generally recommended for cross-validation, as it has been experimentally shown to produce the most reliable estimates of out-of-sample accuracy.', 'Employing stratified sampling in cross-validation for classification problems ensures that each response class is represented with approximately equal proportions in each fold, leading to better model evaluation.', 'Train-test-split runs k times faster than k-fold cross-validation, which is important for larger datasets and time-consuming models.', 'Examining detailed results using confusion matrix or ROC curve is easier with train-test-split compared to cross-validation.']}, {'end': 1019.38, 'segs': [{'end': 923.512, 'src': 'embed', 'start': 838.776, 'weight': 0, 'content': [{'end': 846.461, 'text': "Let's now go through an example for how cross-validation can be used in scikit-learn to help us with parameter tuning.", 'start': 838.776, 'duration': 7.685}, {'end': 855.654, 'text': "We're again using the IRIS dataset and our goal in this case is to select the best tuning parameters,", 'start': 848.001, 'duration': 7.653}, {'end': 860.722, 'text': 'also known as hyperparameters for the k-nearest neighbors classification model.', 'start': 855.654, 'duration': 5.068}, {'end': 872.404, 'text': 'In other words, we want to select the tuning parameters for k and n, which will produce a model that best generalizes to out-of-sample data.', 'start': 862.096, 'duration': 10.308}, {'end': 877.768, 'text': "We'll focus on tuning the k in k-nearest neighbors,", 'start': 873.545, 'duration': 4.223}, {'end': 883.493, 'text': 'which represents the number of nearest neighbors that are taken into account when making a prediction.', 'start': 877.768, 'duration': 5.725}, {'end': 890.418, 'text': 'Note that this k has nothing to do with the k in k-fold cross-validation.', 'start': 885.074, 'duration': 5.344}, {'end': 905.258, 'text': "Our primary 
function for cross-validation in scikit-learn will be CrossValScore, which we'll import from the sklearn.crossvalidation module.", 'start': 892.972, 'duration': 12.286}, {'end': 910.906, 'text': "We're going to try out the value.", 'start': 909.585, 'duration': 1.321}, {'end': 923.512, 'text': 'k equals 5, so we instantiate a kNeighborsClassifier model with the nNeighbors parameter set to that value and save the model as an object called knn.', 'start': 910.906, 'duration': 12.606}], 'summary': 'Using cross-validation to tune k-nearest neighbors model parameters, focusing on k=5.', 'duration': 84.736, 'max_score': 838.776, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6dbrR-WymjI/pics/6dbrR-WymjI838776.jpg'}, {'end': 988.853, 'src': 'embed', 'start': 957.697, 'weight': 5, 'content': [{'end': 966.32, 'text': "It's very important to note that we are passing the entirety of x and y to CrossValScore, not xtrain and ytrain.", 'start': 957.697, 'duration': 8.623}, {'end': 974.202, 'text': "As we'll discuss below, CrossValScore takes care of splitting the data into folds,", 'start': 967.72, 'duration': 6.482}, {'end': 979.164, 'text': 'and thus we do not need to split the data ourselves using train test split.', 'start': 974.202, 'duration': 4.962}, {'end': 988.853, 'text': 'The fourth parameter is CV equals 10, which means that we want it to use tenfold cross-validation.', 'start': 981.371, 'duration': 7.482}], 'summary': 'Crossvalscore uses tenfold cross-validation to split data into folds', 'duration': 31.156, 'max_score': 957.697, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6dbrR-WymjI/pics/6dbrR-WymjI957697.jpg'}], 'start': 838.776, 'title': 'Cross-validation in scikit-learn', 'summary': 'Discusses the use of cross-validation in scikit-learn for parameter tuning, focusing on the k-nearest neighbors classification model using the iris dataset. 
it demonstrates the use of crossvalscore to perform 10-fold cross-validation with an evaluation metric of accuracy.', 'chapters': [{'end': 890.418, 'start': 838.776, 'title': 'Cross-validation for parameter tuning', 'summary': 'Discusses how cross-validation is used in scikit-learn for parameter tuning, focusing on selecting the best tuning parameters for the k-nearest neighbors classification model using the iris dataset.', 'duration': 51.642, 'highlights': ['Using cross-validation in scikit-learn to select tuning parameters for the k-nearest neighbors classification model', 'The goal is to select the best tuning parameters for k and n to best generalize to out-of-sample data', 'Focusing on tuning the k in k-nearest neighbors, representing the number of nearest neighbors considered for prediction']}, {'end': 1019.38, 'start': 892.972, 'title': 'Cross-validation in scikit-learn', 'summary': 'Demonstrates the use of crossvalscore in scikit-learn to perform k-fold cross-validation with a kneighborsclassifier model, utilizing 10-fold cross-validation and accuracy as the evaluation metric.', 'duration': 126.408, 'highlights': ["CrossValScore function is used to perform k-fold cross-validation with scikit-learn's kNeighborsClassifier model. The chapter demonstrates the use of CrossValScore function for k-fold cross-validation with scikit-learn's kNeighborsClassifier model.", 'kNeighborsClassifier model is instantiated with k=5 and nNeighbors parameter set to that value. The kNeighborsClassifier model is instantiated with k=5 and nNeighbors parameter set to that value.', 'The CrossValScore function utilizes tenfold cross-validation (CV=10) and accuracy as the evaluation metric. 
The CrossValScore function uses tenfold cross-validation (CV=10) and accuracy as the evaluation metric.']}], 'duration': 180.604, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6dbrR-WymjI/pics/6dbrR-WymjI838776.jpg', 'highlights': ['Using cross-validation in scikit-learn to select tuning parameters for the k-nearest neighbors classification model', 'The goal is to select the best tuning parameters for k and n to best generalize to out-of-sample data', 'Focusing on tuning the k in k-nearest neighbors, representing the number of nearest neighbors considered for prediction', "CrossValScore function is used to perform k-fold cross-validation with scikit-learn's kNeighborsClassifier model", "The chapter demonstrates the use of CrossValScore function for k-fold cross-validation with scikit-learn's kNeighborsClassifier model", 'The CrossValScore function utilizes tenfold cross-validation (CV=10) and accuracy as the evaluation metric', 'kNeighborsClassifier model is instantiated with k=5 and nNeighbors parameter set to that value']}, {'end': 1527.023, 'segs': [{'end': 1062.608, 'src': 'embed', 'start': 1021.681, 'weight': 1, 'content': [{'end': 1029.803, 'text': "Before we run this code, let's discuss what the CrossValScore function actually does and what it returns.", 'start': 1021.681, 'duration': 8.122}, {'end': 1038.606, 'text': 'Basically, CrossValScore executes the first four steps of k-fold cross-validation.', 'start': 1031.243, 'duration': 7.363}, {'end': 1049.79, 'text': 'It will split x and y into 10 equal folds.', 'start': 1046.525, 'duration': 3.265}, {'end': 1062.608, 'text': 'It will train the K&N model on the union of folds 2 through 10, test the model on fold 1, and calculate the testing accuracy.', 'start': 1051.893, 'duration': 10.715}], 'summary': 'Crossvalscore function splits data into 10 folds for k-fold cross-validation.', 'duration': 40.927, 'max_score': 1021.681, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6dbrR-WymjI/pics/6dbrR-WymjI1021681.jpg'}, {'end': 1210.445, 'src': 'heatmap', 'start': 1107.149, 'weight': 0, 'content': [{'end': 1114.535, 'text': 'We can see that during the first iteration, the model achieved a testing accuracy of 100%.', 'start': 1107.149, 'duration': 7.386}, {'end': 1117.418, 'text': 'In the second iteration, the accuracy was 93%, and so on.', 'start': 1114.535, 'duration': 2.883}, {'end': 1132.535, 'text': "As mentioned above, we'll usually average the testing accuracy across all 10 iterations and use that as our estimate of out-of-sample accuracy.", 'start': 1121.912, 'duration': 10.623}, {'end': 1144.558, 'text': 'It happens that NumPy arrays have a method called mean, so we can simply print scores.mean() in order to see the mean accuracy score.', 'start': 1134.135, 'duration': 10.423}, {'end': 1152.765, 'text': 'It turns out to be about 97%.', 'start': 1148.559, 'duration': 4.206}, {'end': 1156.227, 'text': 'Because we used cross-validation to arrive at this result,', 'start': 1152.765, 'duration': 3.462}, {'end': 1165.531, 'text': "we're more confident that it's an accurate estimate of out-of-sample accuracy than we would be if we had used train/test split.", 'start': 1156.927, 'duration': 8.604}, {'end': 1181.278, 'text': 'Our goal here is to find an optimal value of k for KNN, which we set using the n_neighbors parameter.', 'start': 1173.414, 'duration': 7.864}, {'end': 1194.596, 'text': 'Thus, we will loop through a range of reasonable values for k and, for each value, 
use tenfold cross-validation to estimate the out-of-sample accuracy.', 'start': 1182.569, 'duration': 12.027}, {'end': 1204.801, 'text': "We first create a list of the integers 1 through 30, which are the values we'll try for k.", 'start': 1196.437, 'duration': 8.364}, {'end': 1210.445, 'text': "Then we create k-scores, which is an empty list in which we'll store the 30 scores.", 'start': 1204.801, 'duration': 5.644}], 'summary': 'Model achieved 97% average testing accuracy using cross-validation method.', 'duration': 58.382, 'max_score': 1107.149, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6dbrR-WymjI/pics/6dbrR-WymjI1107149.jpg'}, {'end': 1471.99, 'src': 'heatmap', 'start': 1247.833, 'weight': 3, 'content': [{'end': 1251.898, 'text': 'These numbers are a bit difficult to scan through visually,', 'start': 1247.833, 'duration': 4.065}, {'end': 1261.15, 'text': "so we'll use a line plot with matplotlib in order to visualize how the accuracy changes as we vary the number for k.", 'start': 1251.898, 'duration': 9.252}, {'end': 1280.837, 'text': 'The maximum cross-validated accuracy occurs at k equals 13 through k equals 20.', 'start': 1270.771, 'duration': 10.066}, {'end': 1287.021, 'text': 'The general shape of the curve is an upside-down U,', 'start': 1280.837, 'duration': 6.184}, {'end': 1294.445, 'text': 'which is quite typical when examining the relationship between a model complexity parameter and the model accuracy.', 'start': 1287.021, 'duration': 7.424}, {'end': 1303.187, 'text': 'As I mentioned briefly in a previous video, this is an example of the bias-variance tradeoff,', 'start': 1296.218, 'duration': 6.969}, {'end': 1314.601, 'text': 'in which low values of k produce a model with low bias and high variance, and high values of k produce a model with high bias and low variance.', 'start': 1303.187, 'duration': 11.414}, {'end': 1327.166, 'text': 'The best model is found in the middle because it appropriately balances bias 
and variance and thus is most likely to generalize to out-of-sample data.', 'start': 1316.124, 'duration': 11.042}, {'end': 1339.129, 'text': 'When deciding which exact value of k to call the best, it is generally recommended to choose the value which produces the simplest model.', 'start': 1329.987, 'duration': 9.142}, {'end': 1354.424, 'text': "In the case of KNN, higher values of k produce lower complexity models, and thus, we'll choose k equals 20 as our single best KNN model.", 'start': 1340.598, 'duration': 13.826}, {'end': 1366.91, 'text': "So far, we've used cross-validation to help us with parameter tuning.", 'start': 1362.648, 'duration': 4.262}, {'end': 1376.074, 'text': "Let's look at a brief example to demonstrate how cross-validation can help us to choose between different types of models.", 'start': 1368.191, 'duration': 7.883}, {'end': 1385.377, 'text': 'Specifically, we want to compare the best KNN model on the iris dataset with the logistic regression model,', 'start': 1377.274, 'duration': 8.103}, {'end': 1387.678, 'text': 'which is a popular model for classification.', 'start': 1385.377, 'duration': 2.301}, {'end': 1397.101, 'text': "First, let's run tenfold cross-validation with the best KNN model to see our accuracy.", 'start': 1390.279, 'duration': 6.822}, {'end': 1404.003, 'text': 'As we saw above, the accuracy is 98%.', 'start': 1401.641, 'duration': 2.362}, {'end': 1414.93, 'text': 'Note that instead of saving the 10 scores in an object called scores and then calculating the mean of that object,', 'start': 1404.003, 'duration': 10.927}, {'end': 1418.392, 'text': "I'm just running the mean method directly on the results.", 'start': 1414.93, 'duration': 3.462}, {'end': 1432.62, 'text': "We'll compare this with logistic regression by importing and instantiating a logistic regression model and then again running tenfold cross-validation.", 'start': 1422.011, 'duration': 10.609}, {'end': 1449.214, 'text': 'This gives us an accuracy of 
95%, and so we would conclude that KNN is likely a better choice than logistic regression for this particular task.', 'start': 1437.464, 'duration': 11.75}, {'end': 1462.002, 'text': "Finally, let's check out how cross-validation can help us with feature selection.", 'start': 1456.497, 'duration': 5.505}, {'end': 1471.99, 'text': 'If you remember the advertising dataset from the last video, you may recall that we were using linear regression to predict sales.', 'start': 1463.423, 'duration': 8.567}], 'summary': 'Cross-validation favors the KNN model (98% accuracy) over logistic regression (95%), and k=20 is chosen as the best value of k for KNN.', 'duration': 201.381, 'max_score': 1247.833, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6dbrR-WymjI/pics/6dbrR-WymjI1247833.jpg'}], 'start': 1021.681, 'title': 'Cross-validation and bias-variance tradeoff', 'summary': "Explains the cross_val_score function for k-fold cross-validation, yielding a mean accuracy of about 97% and determining the optimal k value for the KNN model. 
It also delves into the bias-variance tradeoff, comparing KNN with logistic regression and showcasing cross-validation's role in feature selection for linear regression.", 'chapters': [{'end': 1287.021, 'start': 1021.681, 'title': 'Cross-validation for KNN model', 'summary': 'Explains the cross_val_score function, which performs k-fold cross-validation, splitting the data into 10 folds and returning 10 accuracy scores as a NumPy array, with the mean accuracy score being about 97% for k=5, and uses it to find the optimal value of k for the KNN model, with the maximum cross-validated accuracy occurring at k equals 13 through k equals 20.', 'duration': 265.34, 'highlights': ['The maximum cross-validated accuracy occurs at k equals 13 through k equals 20; with k=5, the mean accuracy score is about 97%.', 'The cross_val_score function performs k-fold cross-validation, splitting the data into 10 folds and returning 10 accuracy scores as a NumPy array.', 'During the first iteration, the model achieved a testing accuracy of 100%, and during the second iteration, the accuracy was 93%.']}, {'end': 1527.023, 'start': 1287.021, 'title': 'Bias-variance tradeoff in model complexity', 'summary': 'Discusses the bias-variance tradeoff in model complexity, highlighting the importance of finding the appropriate balance to achieve the best model, with a specific example of choosing the best KNN model using cross-validation and comparing the KNN model with a logistic regression model, concluding that KNN is likely a better choice, and demonstrating how cross-validation can help with feature selection in linear regression.', 'duration': 240.002, 'highlights': ['The best model is found in the middle because it appropriately balances bias and variance and thus is most likely to generalize to out-of-sample data.', 'k equals 20 is chosen as the single best KNN model because higher values of k produce lower complexity models. 
Choosing k equals 20 as the best KNN model yields a lower complexity model.', 'The accuracy of the best KNN model on the iris dataset through tenfold cross-validation is 98%.', 'The accuracy of the logistic regression model through tenfold cross-validation is 95%.']}], 'duration': 505.342, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6dbrR-WymjI/pics/6dbrR-WymjI1021681.jpg', 'highlights': ['The mean accuracy score for k-fold cross-validation is about 97%.', 'The cross_val_score function performs k-fold cross-validation, returning 10 accuracy scores as a NumPy array.', 'The model achieved a testing accuracy of 100% during the first iteration and 93% during the second iteration.', 'Choosing k equals 20 as the best KNN model yields a lower complexity model.', 'The accuracy of the best KNN model on the iris dataset through tenfold cross-validation is 98%.', 'The accuracy of the logistic regression model through tenfold cross-validation is 95%.']}, {'end': 1785.534, 'segs': [{'end': 1628.082, 'src': 'heatmap', 'start': 1527.023, 'weight': 1, 'content': [{'end': 1530.705, 'text': 'since it works with both classification and regression models.', 'start': 1527.023, 'duration': 3.682}, {'end': 1538.548, 'text': "We can't use accuracy as our evaluation metric since that's only relevant for classification problems.", 'start': 1531.845, 'duration': 6.703}, {'end': 1547.21, 'text': 'We want to use root mean squared error, but that is not directly available via the scoring parameter,', 'start': 1539.744, 'duration': 7.466}, {'end': 1552.314, 'text': "so we'll instead ask for mean squared error and then later take the square root.", 'start': 1547.21, 'duration': 5.104}, {'end': 1559.7, 'text': "Anyway, let's run tenfold cross-validation on our model and examine the 
scores.", 'start': 1554.155, 'duration': 5.545}, {'end': 1567.27, 'text': "This doesn't seem right.", 'start': 1566.209, 'duration': 1.061}, {'end': 1576.514, 'text': "If you remember the formula for mean squared error from the last video, you'd agree that the results should be positive numbers,", 'start': 1568.47, 'duration': 8.044}, {'end': 1578.915, 'text': 'and all of these numbers are negative.', 'start': 1576.514, 'duration': 2.401}, {'end': 1580.696, 'text': 'What happened here?', 'start': 1579.996, 'duration': 0.7}, {'end': 1594.815, 'text': "It's complicated to explain, but it boils down to the fact that classification accuracy is a reward function, meaning something you want to maximize,", 'start': 1582.921, 'duration': 11.894}, {'end': 1600.081, 'text': 'whereas mean squared error is a loss function, meaning something you want to minimize.', 'start': 1594.815, 'duration': 5.266}, {'end': 1607.905, 'text': 'There are other scikit-learn functions that depend on the results of cross_val_score,', 'start': 1601.86, 'duration': 6.045}, {'end': 1615.091, 'text': 'and those functions select the best model by looking for the highest value of cross_val_score.', 'start': 1607.905, 'duration': 7.186}, {'end': 1622.898, 'text': 'Finding the highest value of a reward function makes sense for choosing the best model,', 'start': 1616.592, 'duration': 6.306}, {'end': 1628.082, 'text': 'but finding the highest value for a loss function would select the worst model.', 'start': 1622.898, 'duration': 5.184}], 'summary': 'Cross-validation with mean squared error returns negative scores because accuracy is a reward function to maximize, whereas mean squared error is a loss function to minimize, and picking the highest loss would select the worst model.', 'duration': 53.673, 'max_score': 1527.023, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6dbrR-WymjI/pics/6dbrR-WymjI1527023.jpg'}, {'end': 1638.005, 'src': 'embed', 'start': 1607.905, 'weight': 3, 'content': [{'end': 1615.091, 'text': 'and those functions select the 
best model by looking for the highest value of cross_val_score.', 'start': 1607.905, 'duration': 7.186}, {'end': 1622.898, 'text': 'Finding the highest value of a reward function makes sense for choosing the best model,', 'start': 1616.592, 'duration': 6.306}, {'end': 1628.082, 'text': 'but finding the highest value for a loss function would select the worst model.', 'start': 1622.898, 'duration': 5.184}, {'end': 1638.005, 'text': 'Thus a design decision was made for cross_val_score to negate the output for all loss functions,', 'start': 1629.463, 'duration': 8.542}], 'summary': 'cross_val_score negates the output of loss functions so that functions relying on it can select the best model by the highest score.', 'duration': 30.1, 'max_score': 1607.905, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6dbrR-WymjI/pics/6dbrR-WymjI1607905.jpg'}, {'end': 1785.534, 'src': 'embed', 'start': 1755.712, 'weight': 0, 'content': [{'end': 1759.796, 'text': 'taking the square root and then taking the mean.', 'start': 1755.712, 'duration': 4.084}, {'end': 1772.124, 'text': 'The resulting estimate is 1.68.', 'start': 1767.261, 'duration': 4.863}, {'end': 1781.051, 'text': 'Since this is a lower number than for the model that included newspaper, and root mean squared error is something we want to minimize,', 'start': 1772.124, 'duration': 8.927}, {'end': 1785.534, 'text': 'we would conclude that the model excluding newspaper is a better model.', 'start': 1781.051, 'duration': 4.483}], 'summary': 'Excluding newspaper from the model results in a lower RMSE of 1.68, indicating that it is the better model.', 'duration': 29.822, 'max_score': 1755.712, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6dbrR-WymjI/pics/6dbrR-WymjI1755712.jpg'}], 'start': 1527.023, 'title': 'Evaluation metrics and model comparison', 'summary': 'Covers the use of root mean squared error as the evaluation metric for regression models, issues with negative scores during cross-validation, the distinction between reward and loss 
functions in scikit-learn, and a comparison of models with and without newspaper, revealing a lower root mean squared error of 1.68 for the latter.', 'chapters': [{'end': 1580.696, 'start': 1527.023, 'title': 'Evaluation metrics for models', 'summary': 'Discusses the need to use root mean squared error as the evaluation metric for regression models, and the issue of obtaining negative scores during cross-validation due to the absence of directly available mean squared error.', 'duration': 53.673, 'highlights': ['Obtaining the root mean squared error as the evaluation metric for regression models is necessary, as it is not directly available via the scoring parameter and is more suitable than accuracy. The need to use root mean squared error as the evaluation metric for regression models due to its relevance and unavailability via the scoring parameter.', 'The issue of obtaining negative numbers during cross-validation, which contradicts the expected positive results based on the formula for mean squared error, requires investigation. 
The problem of obtaining negative scores during cross-validation, contrary to the expected positive results based on the formula for mean squared error, needing further examination.']}, {'end': 1666.757, 'start': 1582.921, 'title': 'Reward vs loss functions in scikit-learn', 'summary': "Explains the distinction between reward and loss functions in scikit-learn, highlighting the impact on model selection and the decision to negate the output for all loss functions in crossvalscore, with a reference to a long discussion in scikit-learn's github repository since 2013.", 'duration': 83.836, 'highlights': ['The design decision was made for CrossValScore to negate the output for all loss functions, ensuring that higher results consistently indicate better models.', 'The distinction between reward and loss functions impacts model selection, with the need to maximize reward functions and minimize loss functions for optimal model selection.', "There is a long discussion in scikit-learn's GitHub repository about this issue since 2013, providing further insights into the topic."]}, {'end': 1785.534, 'start': 1668.097, 'title': 'Comparison of models with and without newspaper', 'summary': 'Discusses a workaround for a behavior in scikit-learn, converting mean squared error to root mean squared error, and comparing the model including newspaper with a model excluding newspaper, with the latter showing a lower root mean squared error of 1.68.', 'duration': 117.437, 'highlights': ['The model excluding newspaper shows an out-of-sample root mean squared error of 1.68, lower than the model including newspaper with an estimated error of 1.69.', 'A workaround for the behavior in scikit-learn is to take the negative of scores and store it in MSE scores, likely to change in a future version.', 'The process of converting mean squared error to root mean squared error involves taking the square root.', 'The average root mean squared error is calculated across 10 cross-validation 
folds.']}], 'duration': 258.511, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6dbrR-WymjI/pics/6dbrR-WymjI1527023.jpg', 'highlights': ['The model excluding newspaper shows an out-of-sample root mean squared error of 1.68, lower than the model including newspaper with an estimated error of 1.69.', 'The need to use root mean squared error as the evaluation metric for regression models due to its relevance and unavailability via the scoring parameter.', 'The issue of obtaining negative scores during cross-validation, contrary to the expected positive results based on the formula for mean squared error, needing further examination.', 'The distinction between reward and loss functions impacts model selection, with the need to maximize reward functions and minimize loss functions for optimal model selection.', 'The design decision was made for CrossValScore to negate the output for all loss functions, ensuring that higher results consistently indicate better models.']}, {'end': 2151.982, 'segs': [{'end': 1817.3, 'src': 'heatmap', 'start': 1792.079, 'weight': 0, 'content': [{'end': 1801.527, 'text': 'To wrap up, I want to briefly go through some common variations to cross-validation that are likely to make it an even better procedure.', 'start': 1792.079, 'duration': 9.448}, {'end': 1811.615, 'text': 'The first is repeated cross-validation, in which k-fold cross-validation is repeated multiple times,', 'start': 1803.188, 'duration': 8.427}, {'end': 1817.3, 'text': 'with different random splits of the data into the k-folds, and the results are averaged.', 'start': 1811.615, 'duration': 5.685}], 'summary': 'Repeated cross-validation improves model evaluation by averaging results of multiple random splits.', 'duration': 25.221, 'max_score': 1792.079, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6dbrR-WymjI/pics/6dbrR-WymjI1792079.jpg'}, {'end': 1926.858, 'src': 'heatmap', 'start': 1849.178, 'weight': 1, 
'content': [{'end': 1855.849, 'text': 'The best model is located and tuned using cross-validation on the remaining data.', 'start': 1849.178, 'duration': 6.671}, {'end': 1862.641, 'text': 'At the end of this process, the holdout set is then used to test the best model.', 'start': 1856.991, 'duration': 5.65}, {'end': 1874.382, 'text': 'the performance on this holdout set is considered to be a more reliable estimate of out-of-sample performance than the cross-validated performance,', 'start': 1863.953, 'duration': 10.429}, {'end': 1878.946, 'text': 'since the holdout set is out of sample for the entire process.', 'start': 1874.382, 'duration': 4.564}, {'end': 1890.736, 'text': 'The final improvement is for all feature engineering and selection to take place within each cross-validation iteration.', 'start': 1881.248, 'duration': 9.488}, {'end': 1902.091, 'text': 'Performing these tasks before cross-validation does not fully mimic the application of the model to out-of-sample data,', 'start': 1892.321, 'duration': 9.77}, {'end': 1914.524, 'text': 'since those processes will have unfair knowledge of the entire dataset and thus the cross-validated estimate of out-of-sample performance will be biased upward.', 'start': 1902.091, 'duration': 12.433}, {'end': 1926.858, 'text': 'As such, a more reliable performance estimate is generated when these tasks only take place within the cross-validation iterations.', 'start': 1916.135, 'duration': 10.723}], 'summary': 'Best model is located and tuned using cross-validation, holdout set used for testing, feature engineering and selection within each cross-validation iteration for unbiased estimate of out-of-sample performance.', 'duration': 52.913, 'max_score': 1849.178, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6dbrR-WymjI/pics/6dbrR-WymjI1849178.jpg'}], 'start': 1792.079, 'title': 'Improving cross-validation techniques, feature selection, and model performance', 'summary': 'Discusses the 
benefits of repeated cross-validation and creating a holdout set to improve reliability, emphasizes the importance of feature engineering and selection within each cross-validation iteration, and recommends repeated cross-validation for optimal model evaluation, potentially reducing variance and providing a more reliable estimate despite added complexity and potential computational expenses.', 'chapters': [{'end': 1878.946, 'start': 1792.079, 'title': 'Improving cross-validation techniques', 'summary': 'Discusses the benefits of repeated cross-validation and creating a holdout set to improve the reliability of estimating out-of-sample performance, potentially reducing variance and providing a more reliable estimate.', 'duration': 86.867, 'highlights': ['Repeated cross-validation involves repeating k-fold cross-validation multiple times with different random splits, providing a more reliable estimate of out-of-sample performance and reducing variance.', 'Creating a holdout set involves setting aside a portion of the data, locating and tuning the best model using cross-validation on the remaining data, and then testing the best model using the holdout set for a more reliable estimate of out-of-sample performance.']}, {'end': 2057.717, 'start': 1881.248, 'title': 'Improving cross-validation for model performance', 'summary': 'Discusses the importance of performing feature engineering and selection within each cross-validation iteration to generate a more reliable out-of-sample performance estimate, despite the added complexity and potential computational expenses. 
it also provides resources on cross-validation strategies and model evaluation.', 'duration': 176.469, 'highlights': ['Performing feature engineering and selection within each cross-validation iteration generates a more reliable out-of-sample performance estimate, avoiding biased upward estimates from unfair knowledge of the entire dataset.', 'The chapter emphasizes that the benefits of improved procedures may not always outweigh the costs, suggesting that the decision to use simple k-fold cross-validation depends on the specific problem.', "The resources provided include scikit-learn's documentation on cross-validation strategies and model evaluation metrics, as well as additional materials for gaining conceptual depth on cross-validation and model performance estimation."]}, {'end': 2151.982, 'start': 2059.716, 'title': 'Importance of feature selection and cross-validation in machine learning', 'summary': "Emphasizes the significance of feature selection within cross-validation iterations, as highlighted by a harvard ipython notebook illustrating its importance when the number of features is significantly larger than the dataset's observations, and recommends repeated cross-validation for optimal model evaluation.", 'duration': 92.266, 'highlights': ['The Harvard IPython notebook demonstrates the importance of feature selection within cross-validation iterations when the number of features is significantly larger than the number of observations in the dataset.', 'The majority of the audience prefers focusing on scikit-learn, with the introduction of useful pandas functionality as needed.', 'The chapter provides general advice on feature selection and recommends repeated cross-validation for optimal model evaluation.']}], 'duration': 359.903, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6dbrR-WymjI/pics/6dbrR-WymjI1792079.jpg', 'highlights': ['Repeated cross-validation involves repeating k-fold cross-validation multiple times with 
different random splits, providing a more reliable estimate of out-of-sample performance and reducing variance.', 'Performing feature engineering and selection within each cross-validation iteration generates a more reliable out-of-sample performance estimate, avoiding biased upward estimates from unfair knowledge of the entire dataset.', 'Creating a holdout set involves setting aside a portion of the data, locating and tuning the best model using cross-validation on the remaining data, and then testing the best model using the holdout set for a more reliable estimate of out-of-sample performance.']}], 'highlights': ['K-fold cross-validation provides a more reliable estimate of model performance than train-test-split', 'Cross-validation enhances model performance and generalization by selecting tuning parameters, models, and features', 'Evaluating machine learning models helps in choosing the best available model for optimal performance and accuracy', "The average testing accuracy obtained through k-fold cross-validation serves as the cross-validated accuracy, providing an estimate of the model's out-of-sample performance", 'Using k=10 is generally recommended for cross-validation, as it has been experimentally shown to produce the most reliable estimates of out-of-sample accuracy', 'Employing stratified sampling in cross-validation for classification problems ensures that each response class is represented with approximately equal proportions in each fold, leading to better model evaluation', 'Using cross-validation in scikit-learn to select tuning parameters for the k-nearest neighbors classification model', 'The mean accuracy score for k-fold cross-validation is about 97%', 'Repeated cross-validation involves repeating k-fold cross-validation multiple times with different random splits, providing a more reliable estimate of out-of-sample performance and reducing variance', 'Creating a holdout set involves setting aside a portion of the data, locating and tuning 
the best model using cross-validation on the remaining data, and then testing the best model using the holdout set for a more reliable estimate of out-of-sample performance']}
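
The workflow described in the transcript above (tenfold cross-validation of a KNN model, a loop over candidate values of k, and the sign flip needed to recover RMSE for a regression model) can be sketched roughly as follows. This is a minimal sketch, not the notebook's code. Several things are assumptions: the import path `sklearn.model_selection`, the scoring name `neg_mean_squared_error` (the spelling used by recent scikit-learn releases, whereas the video asks for mean squared error and negates it by hand), and the synthetic regression data from `make_regression`, which stands in for the advertising dataset because that dataset does not ship with scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_iris, make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Classification: tenfold cross-validation of KNN on the iris dataset.
X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=10, scoring="accuracy")
mean_accuracy = scores.mean()  # average of the 10 per-fold accuracies

# Parameter tuning: estimate out-of-sample accuracy for k = 1..30.
k_range = range(1, 31)
k_scores = []
for k in k_range:
    model = KNeighborsClassifier(n_neighbors=k)
    fold_scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    k_scores.append(fold_scores.mean())

# Regression: synthetic data (an assumption, used in place of the
# advertising dataset). Loss functions are reported negated so that
# "higher is better" holds for every scorer.
X_reg, y_reg = make_regression(
    n_samples=200, n_features=3, noise=10.0, random_state=0
)
neg_mse_scores = cross_val_score(
    LinearRegression(), X_reg, y_reg, cv=10, scoring="neg_mean_squared_error"
)
mse_scores = -neg_mse_scores          # flip the sign back to positive MSE
rmse_scores = np.sqrt(mse_scores)     # per-fold RMSE
mean_rmse = rmse_scores.mean()        # average RMSE across the 10 folds

print(mean_accuracy, max(k_scores), mean_rmse)
```

To compare feature subsets as in the transcript, the same regression call could be repeated on a column subset of the feature matrix and the two mean RMSE values compared, with the lower value preferred.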