Coursnap

title
K Nearest Neighbors Application - Practical Machine Learning Tutorial with Python p.14

description
In the last part we introduced Classification, which is a supervised form of machine learning, and explained the K Nearest Neighbors algorithm intuition. In this tutorial, we're actually going to apply a simple example of the algorithm using Scikit-Learn, and then in the subsequent tutorials we'll build our own algorithm to learn more about how it works under the hood. To exemplify classification, we're going to use a Breast Cancer Dataset, which is a dataset donated to the University of California, Irvine (UCI) collection from the University of Wisconsin-Madison. UCI has a large Machine Learning Repository. https://pythonprogramming.net https://twitter.com/sentdex https://www.facebook.com/pythonprogramming.net/ https://plus.google.com/+sentdex
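
The K Nearest Neighbors intuition mentioned above (measure the Euclidean distance from a new point to the known points, then let the closest ones vote) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the code from the video: the function name `knn_predict`, the `toy_points` list, and the choice of `k=3` are all assumptions made up for this example.

```python
from collections import Counter
from math import dist  # Euclidean distance between two points (Python 3.8+)


def knn_predict(training_points, new_point, k=3):
    """Return the majority label among the k training points closest to new_point.

    training_points is a list of (features, label) pairs.
    """
    # Sort every known point by its Euclidean distance to the new point.
    by_distance = sorted(training_points, key=lambda pair: dist(pair[0], new_point))
    # Keep the labels of the k closest points and take a majority vote.
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up 2-D points, for illustration only (not taken from the dataset).
toy_points = [
    ((1, 2), "benign"),
    ((2, 1), "benign"),
    ((2, 3), "benign"),
    ((7, 8), "malignant"),
    ((8, 7), "malignant"),
    ((9, 9), "malignant"),
]

print(knn_predict(toy_points, (2, 2)))  # the three closest points are all "benign"
print(knn_predict(toy_points, (8, 8)))  # the three closest points are all "malignant"
```

Scikit-Learn's classifier does the same job with far more optimized neighbor searches, which is why the tutorial uses it for the real dataset.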

detail
{'title': 'K Nearest Neighbors Application - Practical Machine Learning Tutorial with Python p.14', 'heatmap': [{'end': 121.325, 'start': 90.689, 'weight': 0.948}, {'end': 1080.279, 'start': 1066.191, 'weight': 0.713}], 'summary': 'Using k nearest neighbors algorithm, the video demonstrates breast cancer classification, handling missing data, preprocessing, and outlier handling, achieving 96.4% accuracy in testing the model and 98.5% accuracy in making predictions with numpy and scikit-learn.', 'chapters': [{'end': 308.791, 'segs': [{'end': 63.34, 'src': 'embed', 'start': 6.885, 'weight': 0, 'content': [{'end': 29.593, 'text': "what we're going to be doing is continuing along with classification and specifically right now we're talking about k nearest neighbors which takes any new data point and just compares the quite literally euclidean distance between that point and the other data points to determine which which class it happens to belong to,", 'start': 6.885, 'duration': 22.708}, {'end': 33.095, 'text': 'by comparing its kind of distance from one class to another class.', 'start': 29.593, 'duration': 3.502}, {'end': 41.94, 'text': "basically so we're gonna be using scikit-learn to do this and then we're gonna use a data set from from the UCI edu data set.", 'start': 33.095, 'duration': 8.845}, {'end': 42.68, 'text': 'so here we go.', 'start': 41.94, 'duration': 0.74}, {'end': 48.605, 'text': 'so This is for University of California at Irvine.', 'start': 42.68, 'duration': 5.925}, {'end': 49.867, 'text': 'I believe is UCI.', 'start': 48.605, 'duration': 1.262}, {'end': 51.809, 'text': 'Okay, so this is the website.', 'start': 50.528, 'duration': 1.281}, {'end': 57.356, 'text': "If I happen to forget, it's relatively simple, archive.ics.uci.edu.ml.datasets.html.", 'start': 51.869, 'duration': 5.487}, {'end': 62.24, 'text': 'They have a ton of datasets here.', 'start': 60.539, 'duration': 1.701}, {'end': 63.34, 'text': "They've got them.", 'start': 62.36, 
'duration': 0.98}], 'summary': 'Continuing with k nearest neighbors for classification using scikit-learn with uci dataset.', 'duration': 56.455, 'max_score': 6.885, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/1i0zu9jHN6U/pics/1i0zu9jHN6U6885.jpg'}, {'end': 122.586, 'src': 'heatmap', 'start': 84.967, 'weight': 3, 'content': [{'end': 87.228, 'text': 'Anyway, there you have that.', 'start': 84.967, 'duration': 2.261}, {'end': 90.229, 'text': "It's a great site for practicing machine learning.", 'start': 87.248, 'duration': 2.981}, {'end': 93.971, 'text': "And the dataset that we're going to use is the breast cancer dataset.", 'start': 90.689, 'duration': 3.282}, {'end': 95.711, 'text': "We're going to go with the original one.", 'start': 93.991, 'duration': 1.72}, {'end': 96.752, 'text': "I've already got that up though.", 'start': 95.731, 'duration': 1.021}, {'end': 103.939, 'text': "So that would bring you to a page like this, and then you'd be interested in the data folder which would open up page.", 'start': 97.332, 'duration': 6.607}, {'end': 108.121, 'text': "and then, if you click both of these, you'll get this and this.", 'start': 103.939, 'duration': 4.182}, {'end': 113.643, 'text': 'so the first one is the actual data that corresponds to the breast cancer set.', 'start': 108.121, 'duration': 5.522}, {'end': 121.325, 'text': "so we'll right click and save that and I'm gonna save it in the folder or the thing that working in, which is this right here,", 'start': 113.643, 'duration': 7.682}, {'end': 122.586, 'text': 'and then the second one.', 'start': 121.325, 'duration': 1.261}], 'summary': 'Practicing machine learning using the breast cancer dataset, downloading data files.', 'duration': 37.619, 'max_score': 84.967, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/1i0zu9jHN6U/pics/1i0zu9jHN6U84967.jpg'}, {'end': 187.666, 'src': 'embed', 'start': 162.032, 'weight': 4, 'content': [{'end': 169.595, 
'text': 'And then finally, the 11th attribute is actually the class, which is either benign or malignant for the tumor.', 'start': 162.032, 'duration': 7.563}, {'end': 174.557, 'text': "So benign or malignant is an attribute, but that's also what we're actually trying to predict for.", 'start': 170.035, 'duration': 4.522}, {'end': 180.28, 'text': "So we're going to actually make that the class or the label, so to speak.", 'start': 174.597, 'duration': 5.683}, {'end': 182.021, 'text': "so anyways, that's our data.", 'start': 180.72, 'duration': 1.301}, {'end': 183.442, 'text': "we've downloaded the data.", 'start': 182.021, 'duration': 1.421}, {'end': 184.383, 'text': 'i have it now.', 'start': 183.442, 'duration': 0.941}, {'end': 187.666, 'text': "uh, let's see if it's yeah down here.", 'start': 184.383, 'duration': 3.283}], 'summary': 'The 11th attribute classifies tumors as benign or malignant for prediction.', 'duration': 25.634, 'max_score': 162.032, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/1i0zu9jHN6U/pics/1i0zu9jHN6U162032.jpg'}, {'end': 314.954, 'src': 'embed', 'start': 288.252, 'weight': 5, 'content': [{'end': 294.357, 'text': "you most machine learning algorithms don't want the class to be a string.", 'start': 288.252, 'duration': 6.105}, {'end': 295.318, 'text': "right, it's supposed to be a number.", 'start': 294.357, 'duration': 0.961}, {'end': 300.763, 'text': "so this is how you can convert it, but later on i'll show you you can kind of get away with using a string.", 'start': 295.318, 'duration': 5.445}, {'end': 301.804, 'text': 'but anyway,', 'start': 300.763, 'duration': 1.041}, {'end': 308.791, 'text': "for the most part you're going to be using like numpy or some other really optimized library that does number crunching and that's going to want numbers,", 'start': 301.804, 'duration': 6.987}, {'end': 309.492, 'text': 'not a string.', 'start': 308.791, 'duration': 0.701}, {'end': 314.954, 'text': 'So the other 
thing to pay attention to is this says it does indeed have missing attributes.', 'start': 310.212, 'duration': 4.742}], 'summary': 'Most machine learning algorithms require class as a number, not a string. numpy or similar optimized libraries need numbers, not strings.', 'duration': 26.702, 'max_score': 288.252, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/1i0zu9jHN6U/pics/1i0zu9jHN6U288252.jpg'}], 'start': 1.943, 'title': 'Using k nearest neighbors for breast cancer classification', 'summary': 'Discusses implementing the k nearest neighbors algorithm for classifying breast cancer using scikit-learn and a dataset from uci, covering the process of obtaining the breast cancer dataset for machine learning practice and understanding tumor data attributes for predicting benign or malignant tumors.', 'chapters': [{'end': 84.407, 'start': 1.943, 'title': 'K nearest neighbors for classification', 'summary': 'Discusses the use of k nearest neighbors algorithm for classification using scikit-learn and a dataset from uci, which provides various datasets categorized by tasks, attribute types, data types, area, and instances.', 'duration': 82.464, 'highlights': ['The k nearest neighbors algorithm is used to determine the class to which a new data point belongs by comparing its distance from other data points. The k nearest neighbors algorithm calculates the euclidean distance between a new data point and existing data points to determine its class, making it a useful tool for classification tasks.', 'UCI provides datasets categorized by tasks, attribute types, data types, area, and instances. The University of California at Irvine (UCI) offers a wide range of datasets categorized by tasks, attribute types, data types, area, and instances, making it a valuable resource for machine learning projects.', 'The UCI datasets cover various areas such as sciences, engineering, social sciences, and more. 
The UCI datasets cover a wide range of areas including sciences, engineering, social sciences, and more, providing diverse data for different types of machine learning projects.']}, {'end': 144.344, 'start': 84.967, 'title': 'Breast cancer dataset and machine learning practice', 'summary': 'Covers the process of obtaining the breast cancer dataset for machine learning practice, including the steps of accessing the data folder, saving the actual data, and reviewing additional information on the dataset.', 'duration': 59.377, 'highlights': ['The dataset used for machine learning practice is the breast cancer dataset.', 'The process involves accessing the data folder, saving the actual data, and reviewing additional information on the dataset.', 'The additional information includes citation requests, past usage, and details about who has used the dataset.']}, {'end': 308.791, 'start': 144.344, 'title': 'Understanding tumor data attributes', 'summary': "Discusses the structure of tumor data, including the attributes and their relevance to predicting benign or malignant tumors, and the process of converting the 'class' attribute to numerical values for machine learning algorithms.", 'duration': 164.447, 'highlights': ['The data contains 11 attributes, with the 11th attribute being the class, representing benign or malignant tumors.', 'The process of adding a header row to the data for easier interpretation and manipulation is demonstrated, ensuring the correct alignment of attributes and their values.', "Highlighting the importance of converting the 'class' attribute from string to numerical values for compatibility with most machine learning algorithms."]}], 'duration': 306.848, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/1i0zu9jHN6U/pics/1i0zu9jHN6U1943.jpg', 'highlights': ['The k nearest neighbors algorithm calculates the euclidean distance between a new data point and existing data points to determine its class, making it a useful 
tool for classification tasks.', 'UCI offers a wide range of datasets categorized by tasks, attribute types, data types, area, and instances, making it a valuable resource for machine learning projects.', 'The UCI datasets cover a wide range of areas including sciences, engineering, social sciences, and more, providing diverse data for different types of machine learning projects.', 'The dataset used for machine learning practice is the breast cancer dataset.', 'The data contains 11 attributes, with the 11th attribute being the class, representing benign or malignant tumors.', "Highlighting the importance of converting the 'class' attribute from string to numerical values for compatibility with most machine learning algorithms."]}, {'end': 443.298, 'segs': [{'end': 339.409, 'src': 'embed', 'start': 308.791, 'weight': 0, 'content': [{'end': 309.492, 'text': 'not a string.', 'start': 308.791, 'duration': 0.701}, {'end': 314.954, 'text': 'So the other thing to pay attention to is this says it does indeed have missing attributes.', 'start': 310.212, 'duration': 4.742}, {'end': 316.135, 'text': "There's 16 missing.", 'start': 315.014, 'duration': 1.121}, {'end': 320.177, 'text': "And if they are missing, it's denoted by a simple question mark.", 'start': 316.755, 'duration': 3.422}, {'end': 325.539, 'text': "Also, you get the class distribution here, which just says basically we've got 458 benign tumors.", 'start': 320.877, 'duration': 4.662}, {'end': 330.242, 'text': 'and 241 of them are malignant.', 'start': 327.94, 'duration': 2.302}, {'end': 339.409, 'text': "So it's not totally a balanced dataset, but it might actually be more realistic in terms of how many are benign or malignant.", 'start': 330.602, 'duration': 8.807}], 'summary': 'Dataset contains 16 missing attributes. 
class distribution: 458 benign tumors, 241 malignant.', 'duration': 30.618, 'max_score': 308.791, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/1i0zu9jHN6U/pics/1i0zu9jHN6U308791.jpg'}, {'end': 443.298, 'src': 'embed', 'start': 416.75, 'weight': 2, 'content': [{'end': 421.951, 'text': 'And we already know out of the gate we have some missing data that is denoted by a question mark.', 'start': 416.75, 'duration': 5.201}, {'end': 424.492, 'text': "So let's go ahead and replace that information.", 'start': 422.191, 'duration': 2.301}, {'end': 426.713, 'text': "So we're going to say df.replace.", 'start': 424.532, 'duration': 2.181}, {'end': 432.174, 'text': "And we're going to replace all question marks with a negative 99,999.", 'start': 427.873, 'duration': 4.301}, {'end': 440.757, 'text': "And then we're going to say, whoops, comma, comma, in place equals true.", 'start': 432.174, 'duration': 8.583}, {'end': 443.298, 'text': 'So that just modifies the data frame right away.', 'start': 440.837, 'duration': 2.461}], 'summary': 'Replace missing data denoted by question marks with -99,999 in the dataframe.', 'duration': 26.548, 'max_score': 416.75, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/1i0zu9jHN6U/pics/1i0zu9jHN6U416750.jpg'}], 'start': 308.791, 'title': 'Handling missing data in breast cancer dataset', 'summary': 'Discusses strategies for addressing 16 missing attributes in a breast cancer dataset, with class distribution of 458 benign tumors and 241 malignant tumors, highlighting a slightly imbalanced dataset.', 'chapters': [{'end': 443.298, 'start': 308.791, 'title': 'Handling missing data in breast cancer dataset', 'summary': 'Discusses handling missing data in a breast cancer dataset, which contains 16 missing attributes and a class distribution of 458 benign tumors and 241 malignant tumors, presenting a slightly imbalanced dataset.', 'duration': 134.507, 'highlights': ['The dataset contains 16 
missing attributes denoted by a question mark. This indicates that there are 16 missing attributes in the dataset.', 'There are 458 benign tumors and 241 malignant tumors in the dataset. The class distribution shows 458 benign tumors and 241 malignant tumors, indicating a slightly imbalanced dataset.', 'The process involves replacing all question marks with a value of -99,999. The missing data denoted by question marks are replaced with a value of -99,999 in the dataset.']}], 'duration': 134.507, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/1i0zu9jHN6U/pics/1i0zu9jHN6U308791.jpg', 'highlights': ['The class distribution shows 458 benign tumors and 241 malignant tumors, indicating a slightly imbalanced dataset.', 'The dataset contains 16 missing attributes denoted by a question mark.', 'The process involves replacing all question marks with a value of -99,999.']}, {'end': 602.908, 'segs': [{'end': 473.523, 'src': 'embed', 'start': 444.298, 'weight': 0, 'content': [{'end': 455.367, 'text': 'and then remember before i was asking is there any, you know you want to watch out for useless data and, in this case, what might be useless data?', 'start': 444.298, 'duration': 11.069}, {'end': 459.931, 'text': 'well, when looking at the data set here, the first column is id.', 'start': 455.367, 'duration': 4.564}, {'end': 467.037, 'text': 'does id have any implication as to whether or not a tumor is benign or malignant?', 'start': 459.931, 'duration': 7.106}, {'end': 470.5, 'text': 'The answer, of course, is no.', 'start': 468.478, 'duration': 2.022}, {'end': 473.523, 'text': "So we don't want the ID column.", 'start': 471.121, 'duration': 2.402}], 'summary': 'Data analysis revealed the id column as useless for tumor classification.', 'duration': 29.225, 'max_score': 444.298, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/1i0zu9jHN6U/pics/1i0zu9jHN6U444298.jpg'}, {'end': 540.938, 'src': 'embed', 'start': 513.118, 
'weight': 1, 'content': [{'end': 517.222, 'text': "Now, as you can remember before, we're going to define our X's and Y's.", 'start': 513.118, 'duration': 4.104}, {'end': 521.605, 'text': "Also, I believe I've already explained the negative 99,999.", 'start': 517.922, 'duration': 3.683}, {'end': 529.012, 'text': "The whole point of that is just so most algorithms recognize that's an outlier and will treat it like an outlier as opposed to dumping the data.", 'start': 521.605, 'duration': 7.407}, {'end': 534.835, 'text': 'Again, a lot of real world data sets have a lot of missing data, just a ton of holes, to the point where,', 'start': 529.332, 'duration': 5.503}, {'end': 540.938, 'text': 'if you were to drop everything with missing data, you would maybe sacrifice 50% of your data set, or something like that.', 'start': 534.835, 'duration': 6.103}], 'summary': 'Explaining outlier handling and data loss due to missing data in real-world datasets, potentially sacrificing 50% of the dataset.', 'duration': 27.82, 'max_score': 513.118, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/1i0zu9jHN6U/pics/1i0zu9jHN6U513118.jpg'}, {'end': 613.513, 'src': 'embed', 'start': 582.421, 'weight': 3, 'content': [{'end': 583.782, 'text': "anyway, we're going to have that there.", 'start': 582.421, 'duration': 1.361}, {'end': 587.203, 'text': 'so maybe in this case you might want to actually say zero or something like that.', 'start': 583.782, 'duration': 3.421}, {'end': 589.083, 'text': 'you just kind of keep that in mind.', 'start': 587.203, 'duration': 1.88}, {'end': 593.205, 'text': "depending on how you want to do it, we'll see that this probably won't have a huge effect anyway.", 'start': 589.083, 'duration': 4.122}, {'end': 602.908, 'text': 'also, we only had 16, so you could in theory, rather than replacing the question mark, you can do df.drop na in place equals true,', 'start': 593.205, 'duration': 9.703}, {'end': 604.008, 'text': 'and just get rid 
of those.', 'start': 602.908, 'duration': 1.1}, {'end': 605.569, 'text': "so anyway, that's really.", 'start': 604.008, 'duration': 1.561}, {'end': 609.57, 'text': "it comes down to opinion and also your specific data set and the algorithm that you're going to use.", 'start': 605.569, 'duration': 4.001}, {'end': 610.651, 'text': "but anyway we're going to do it this way.", 'start': 609.57, 'duration': 1.081}, {'end': 613.513, 'text': "so now we're going to define x and y.", 'start': 611.571, 'duration': 1.942}], 'summary': 'Discussion on handling missing data and defining variables in a dataset.', 'duration': 31.092, 'max_score': 582.421, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/1i0zu9jHN6U/pics/1i0zu9jHN6U582421.jpg'}], 'start': 444.298, 'title': 'Data preprocessing and outlier handling', 'summary': "Discusses the identification and removal of useless data, such as the 'id' column, and the handling of outliers with a specific value like -99,999 to prevent data loss and its impact on algorithms like k-nearest-neighbors.", 'chapters': [{'end': 602.908, 'start': 444.298, 'title': 'Data preprocessing and outlier handling', 'summary': "Discusses the identification and removal of useless data, such as the 'id' column, and the handling of outliers with the use of a specific value like -99,999, to prevent data loss and its impact on algorithms like k-nearest-neighbors.", 'duration': 158.61, 'highlights': ["The 'id' column is identified as useless data for tumor classification as it has no implication on whether a tumor is benign or malignant, and is removed to prevent it from affecting the algorithm's performance.", 'The use of a specific outlier value like -99,999 is explained to ensure that most algorithms recognize it as an outlier and treat it as such, preventing data loss that could occur from dropping missing data, which can be as high as 50% in real world datasets.', 'The significance of the outlier value -99,999 is highlighted, 
especially in the context of k-nearest-neighbors algorithm, and the consideration of using alternative values like zero is mentioned, with the acknowledgment that it may not have a significant effect on the outcome.', 'The option of replacing missing data with a specific value or dropping them is mentioned, with the observation that in the given dataset, there were only 16 missing values, providing the flexibility for different handling approaches.']}], 'duration': 158.61, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/1i0zu9jHN6U/pics/1i0zu9jHN6U444298.jpg', 'highlights': ["The 'id' column is identified as useless data for tumor classification and is removed to prevent it from affecting the algorithm's performance.", 'The use of a specific outlier value like -99,999 is explained to prevent data loss that could occur from dropping missing data, which can be as high as 50% in real world datasets.', 'The significance of the outlier value -99,999 is highlighted, especially in the context of k-nearest-neighbors algorithm, and the consideration of using alternative values like zero is mentioned.', 'The option of replacing missing data with a specific value or dropping them is mentioned, with the observation that in the given dataset, there were only 16 missing values.']}, {'end': 865.428, 'segs': [{'end': 733.585, 'src': 'embed', 'start': 602.908, 'weight': 1, 'content': [{'end': 604.008, 'text': 'and just get rid of those.', 'start': 602.908, 'duration': 1.1}, {'end': 605.569, 'text': "so anyway, that's really.", 'start': 604.008, 'duration': 1.561}, {'end': 609.57, 'text': "it comes down to opinion and also your specific data set and the algorithm that you're going to use.", 'start': 605.569, 'duration': 4.001}, {'end': 610.651, 'text': "but anyway we're going to do it this way.", 'start': 609.57, 'duration': 1.081}, {'end': 613.513, 'text': "so now we're going to define x and y.", 'start': 611.571, 'duration': 1.942}, {'end': 615.374, 
'text': "so we've got x and y again capital.", 'start': 613.513, 'duration': 1.861}, {'end': 619.677, 'text': 'x is for your features, y is for your labels or your class or whatever.', 'start': 615.374, 'duration': 4.303}, {'end': 630.225, 'text': "so x is going to be np.array and we're going to do basically the exact same thing that we did before when we were using linear regression.", 'start': 619.677, 'duration': 10.548}, {'end': 639.392, 'text': 'so the features are basically everything except for the class column, class, df.drop class, comma one,', 'start': 630.225, 'duration': 9.167}, {'end': 643.696, 'text': 'and then the label is basically just the class right.', 'start': 640.012, 'duration': 3.684}, {'end': 649.702, 'text': 'so np array, df class.', 'start': 643.696, 'duration': 6.006}, {'end': 651.663, 'text': 'so now we have our features and our labels.', 'start': 649.702, 'duration': 1.961}, {'end': 654.646, 'text': "now we're going to do the cross validation.", 'start': 651.663, 'duration': 2.983}, {'end': 670.121, 'text': 'so again, x, train, x test y train y test equals cross validation dot train test.', 'start': 654.646, 'duration': 15.475}, {'end': 680.969, 'text': "whoops, train test split and we're going to split what x and y And the test size will be 0.2, so 20%.", 'start': 670.121, 'duration': 10.848}, {'end': 682.27, 'text': 'Feel free to change that if you want.', 'start': 680.969, 'duration': 1.301}, {'end': 684.871, 'text': "But this line, again, we've already covered it.", 'start': 682.33, 'duration': 2.541}, {'end': 693.116, 'text': "That's just how we can quickly shuffle the data and separate it into training and testing chunks that meet our hopes.", 'start': 684.911, 'duration': 8.205}, {'end': 695.958, 'text': "So now we're going to define the classifier.", 'start': 694.557, 'duration': 1.401}, {'end': 697.519, 'text': "So you've already seen this before.", 'start': 696.038, 'duration': 1.481}, {'end': 700.301, 'text': 'Not fit yet.', 
'start': 699.801, 'duration': 0.5}, {'end': 703.323, 'text': 'Classifier equals neighbors.k classifier.', 'start': 700.621, 'duration': 2.702}, {'end': 709.977, 'text': 'neighbors classifier, k neighbors classifier.', 'start': 704.768, 'duration': 5.209}, {'end': 711.62, 'text': 'okay, i think i spelled that right.', 'start': 709.977, 'duration': 1.643}, {'end': 713.183, 'text': "so we've defined the classifier.", 'start': 711.62, 'duration': 1.563}, {'end': 715.687, 'text': 'now we can fit, so clf.fit.', 'start': 713.183, 'duration': 2.504}, {'end': 716.689, 'text': 'and then we fit what.', 'start': 715.687, 'duration': 1.002}, {'end': 720.556, 'text': 'well, we fit the x train and y train.', 'start': 716.689, 'duration': 3.867}, {'end': 724.839, 'text': 'again all this code is almost identical to when we did the linear regression code.', 'start': 720.556, 'duration': 4.283}, {'end': 732.444, 'text': "that's what's really just so nice about scikit-learn is, even with different machine learning algorithm, i hate to say classes,", 'start': 724.839, 'duration': 7.605}, {'end': 733.585, 'text': 'but really they are classes.', 'start': 732.444, 'duration': 1.141}], 'summary': 'Using np.array, cross validation, and k-neighbors classifier to train and test data.', 'duration': 130.677, 'max_score': 602.908, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/1i0zu9jHN6U/pics/1i0zu9jHN6U602908.jpg'}, {'end': 842.659, 'src': 'embed', 'start': 815.681, 'weight': 0, 'content': [{'end': 824.327, 'text': 'so with um, just with the defaults and all of that, our accuracy is 0.964, which is 96.4 percent.', 'start': 815.681, 'duration': 8.646}, {'end': 826.809, 'text': "that's huge accuracy.", 'start': 824.327, 'duration': 2.482}, {'end': 827.95, 'text': 'now, of course, like i said before,', 'start': 826.809, 'duration': 1.141}, {'end': 836.016, 'text': "with the self-driving car and all that um differentiating between a baby and a blanket and a blob of 
tar on the road, um, I don't know about you,", 'start': 827.95, 'duration': 8.066}, {'end': 842.659, 'text': "but I don't want to be told that I do or don't have cancer, even with a 4% inaccuracy, right?", 'start': 836.016, 'duration': 6.643}], 'summary': 'Default settings yield 96.4% accuracy. concerns raised about 4% inaccuracy in life-threatening decisions.', 'duration': 26.978, 'max_score': 815.681, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/1i0zu9jHN6U/pics/1i0zu9jHN6U815681.jpg'}], 'start': 602.908, 'title': 'Defining features and classifying data', 'summary': "Discusses the process of defining features and labels for machine learning, emphasizing the use of np.array and specific algorithms. it also covers the implementation of cross validation with a 20% test size and the use of k nearest neighbors classifier, achieving a 96.4% accuracy in testing the model, and highlighting scikit-learn's versatility.", 'chapters': [{'end': 651.663, 'start': 602.908, 'title': 'Defining features and labels for machine learning', 'summary': 'Discusses the process of defining features and labels for machine learning, emphasizing the importance of specific data sets and algorithms, and the use of np.array to create features and labels.', 'duration': 48.755, 'highlights': ['The process of defining features and labels for machine learning is influenced by opinion, specific data sets, and the chosen algorithm.', 'Using np.array to create features and labels involves excluding the class column for features and using the class column for labels.', 'The importance of specific data sets and algorithms in defining features and labels for machine learning.']}, {'end': 865.428, 'start': 651.663, 'title': 'Implementing cross validation and k nearest neighbors classifier', 'summary': 'Covers the implementation of cross validation with a test size of 20% and the use of k nearest neighbors classifier, achieving an accuracy of 96.4% in testing the model, 
highlighting the versatility of scikit-learn.', 'duration': 213.765, 'highlights': ['The accuracy achieved in testing the model with the defaults is 96.4%, showcasing the effectiveness of the K Nearest Neighbors classifier.', "The implementation of cross validation includes a test size of 20%, providing a reliable method for evaluating the model's performance.", 'The code demonstrates the quick shuffling and separation of data into training and testing chunks to meet the desired requirements for model evaluation.', 'The versatility of scikit-learn is highlighted as the implementation of different machine learning algorithms follows almost identical coding patterns, promoting ease of use and consistency in model development.']}], 'duration': 262.52, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/1i0zu9jHN6U/pics/1i0zu9jHN6U602908.jpg', 'highlights': ['The accuracy achieved in testing the model with the defaults is 96.4%, showcasing the effectiveness of the K Nearest Neighbors classifier.', "The implementation of cross validation includes a test size of 20%, providing a reliable method for evaluating the model's performance.", 'The versatility of scikit-learn is highlighted as the implementation of different machine learning algorithms follows almost identical coding patterns, promoting ease of use and consistency in model development.', 'Using np.array to create features and labels involves excluding the class column for features and using the class column for labels.', 'The process of defining features and labels for machine learning is influenced by opinion, specific data sets, and the chosen algorithm.', 'The importance of specific data sets and algorithms in defining features and labels for machine learning.', 'The code demonstrates the quick shuffling and separation of data into training and testing chunks to meet the desired requirements for model evaluation.']}, {'end': 1299.992, 'segs': [{'end': 1080.279, 'src': 'heatmap', 
'start': 1041.656, 'weight': 0, 'content': [{'end': 1045.579, 'text': 'Okay, so now our accuracy is 98.5%, and we did get our prediction.', 'start': 1041.656, 'duration': 3.923}, {'end': 1053.564, 'text': "But you'll see we got this deprecation warning that it wants us to reshape our data.", 'start': 1045.759, 'duration': 7.805}, {'end': 1062.789, 'text': 'Before, the problem was we were passing an incorrect number of attributes because we had one more because of the ID column.', 'start': 1053.944, 'duration': 8.845}, {'end': 1065.911, 'text': 'Okay, anyway, fixed.', 'start': 1063.129, 'duration': 2.782}, {'end': 1070.013, 'text': 'But now to get rid of that value error, we would do this and then run it.', 'start': 1066.191, 'duration': 3.822}, {'end': 1078.978, 'text': "then now you don't have the stupid deprecation warning, so that makes things maybe a little more complex.", 'start': 1071.653, 'duration': 7.325}, {'end': 1080.279, 'text': 'so i just want to.', 'start': 1078.978, 'duration': 1.301}], 'summary': 'Accuracy at 98.5%, resolved deprecation warning and value error.', 'duration': 28.357, 'max_score': 1041.656, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/1i0zu9jHN6U/pics/1i0zu9jHN6U1041656.jpg'}, {'end': 1270.395, 'src': 'embed', 'start': 1234.584, 'weight': 1, 'content': [{'end': 1244.066, 'text': "just how to properly reshape your numpy array for a variety of reasons but that's just one of the reasons why you would is so you can feed it through scikit-learn.", 'start': 1234.584, 'duration': 9.482}, {'end': 1246.887, 'text': "okay, so there's your real world example.", 'start': 1244.066, 'duration': 2.821}, {'end': 1254.089, 'text': 'we found that we can predict based on, you know, somewhere between 94 and 98 accuracy, which is pretty impressive,', 'start': 1246.887, 'duration': 7.202}, {'end': 1257.51, 'text': 'especially now that you know the intuition of how simple k-nearest neighbors is.', 'start': 1254.089, 
'duration': 3.421}, {'end': 1263.032, 'text': "And now what we're going to do is actually write our very own k-nearest neighbors algorithm.", 'start': 1258.55, 'duration': 4.482}, {'end': 1270.395, 'text': "And then what we can do is we'll run it on these exact same data, and then we'll see how we compare to scikit-learn.", 'start': 1263.372, 'duration': 7.023}], 'summary': 'Reshaping numpy array for scikit-learn, achieving 94-98% prediction accuracy, and implementing k-nearest neighbors algorithm.', 'duration': 35.811, 'max_score': 1234.584, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/1i0zu9jHN6U/pics/1i0zu9jHN6U1234584.jpg'}], 'start': 866.008, 'title': 'Making predictions with numpy and scikit-learn', 'summary': 'Demonstrates making predictions using numpy and scikit-learn, achieving an accuracy of 98.5% and highlighting the importance of properly reshaping numpy arrays for scikit-learn classifiers.', 'chapters': [{'end': 1299.992, 'start': 866.008, 'title': 'Making predictions with numpy and scikit-learn', 'summary': 'Demonstrates making predictions using numpy and scikit-learn, achieving an accuracy of 98.5% and highlighting the importance of properly reshaping numpy arrays for scikit-learn classifiers.', 'duration': 433.984, 'highlights': ['The accuracy of the classifier rises from 56% to 98.5% once the ID column is dropped from the features; reshaping the prediction input only removes the scikit-learn deprecation warning.', 'Reshaping the numpy array to one row per sample lets the same code make predictions on any number of data points, with the classifier scoring 98.5% accuracy in this real-world example.', 'The chapter emphasizes the simplicity of the k-nearest neighbors algorithm and announces a custom implementation to be compared against scikit-learn.']}], 'duration': 433.984, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/1i0zu9jHN6U/pics/1i0zu9jHN6U866008.jpg', 'highlights': ['The accuracy 
of the classifier rises from 56% to 98.5% once the ID column is dropped from the features; reshaping the prediction input only removes the scikit-learn deprecation warning.', 'Reshaping the numpy array to one row per sample lets the same code make predictions on any number of data points, with the classifier scoring 98.5% accuracy in this real-world example.', 'The chapter emphasizes the simplicity of the k-nearest neighbors algorithm and announces a custom implementation to be compared against scikit-learn.']}], 'highlights': ['The accuracy achieved in testing the model with the defaults is 96.4%, showcasing the effectiveness of the K Nearest Neighbors classifier.', 'The accuracy of the classifier rises from 56% to 98.5% once the ID column is dropped from the features; reshaping the prediction input only removes the scikit-learn deprecation warning.', 'Reshaping the numpy array to one row per sample lets the same code make predictions on any number of data points, with the classifier scoring 98.5% accuracy in this real-world example.', 'Missing values, recorded as question marks in the dataset, are replaced with -99,999 so that they are treated as outliers.', 'The train/test split holds out 20% of the data for testing, so the model is scored on samples it was not trained on.', 'The k nearest neighbors algorithm calculates the Euclidean distance between a new data point and existing data points to determine its class, making it a useful tool for classification tasks.', 'The dataset used for machine learning practice is the breast cancer dataset.', 'The class distribution shows 458 benign tumors and 241 malignant tumors, indicating a slightly imbalanced dataset.', 'The process of defining features and labels for machine learning is influenced by opinion, specific data sets, and the chosen algorithm.', 'The versatility of scikit-learn is highlighted as the implementation of different machine learning algorithms follows almost identical coding patterns, promoting ease of use and consistency in model development.']}
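
The highlights above describe k nearest neighbors as measuring the Euclidean distance from a new point to the existing points and taking the class of the closest ones. A minimal sketch of that idea in plain Python follows. It is not scikit-learn's implementation and not the code from the video; the function name knn_predict, its parameters, and the toy data are all invented for illustration. Passing a list of rows, one per sample, mirrors the one-row-per-sample shape that lets the same call classify one point or many.

```python
from collections import Counter
from math import dist  # Euclidean distance between two points (Python 3.8+)


def knn_predict(train_rows, train_labels, new_rows, k=3):
    """Classify each row in new_rows by majority vote of its k nearest training rows.

    Hypothetical helper for illustration only; not a scikit-learn API.
    """
    predictions = []
    for row in new_rows:
        # Pair the distance from `row` to every training row with that row's label,
        # then sort so the closest training rows come first.
        distances = sorted(
            (dist(row, train_row), label)
            for train_row, label in zip(train_rows, train_labels)
        )
        # Count the labels of the k closest rows and keep the most common one.
        votes = Counter(label for _, label in distances[:k])
        predictions.append(votes.most_common(1)[0][0])
    return predictions


# Toy data, made up for this sketch: two features per sample.
train_rows = [(1, 1), (2, 1), (1, 2), (8, 9), (9, 8), (9, 9)]
train_labels = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

# One row per sample, so any number of points can be classified in one call.
new_rows = [(2, 2), (8, 8)]
print(knn_predict(train_rows, train_labels, new_rows))
```

With this toy data the first point sits among the "benign" samples and the second among the "malignant" ones, so the vote is unanimous in both cases. Real data would also need the preprocessing the highlights mention, such as dropping the ID column and handling missing values.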