title
Support Vector Machines in Python from Start to Finish.

description
NOTE: You can support StatQuest by purchasing the Jupyter Notebook and Python code seen in this video here: http://statquest.gumroad.com/l/iulnea

This webinar was recorded 2020-06-09 at 11:00am (New York Time)

NOTE: This StatQuest assumes that you are already familiar with:
Support Vector Machines: https://youtu.be/efR1C6CvhmE
The Radial Basis Function: https://youtu.be/Qc5IyLW_hns
Regularization: https://youtu.be/Q81RR3yKn30
Cross Validation: https://youtu.be/fSytzGwwBVw
Confusion Matrices: https://youtu.be/Kdsp6soqA7o

For a complete index of all the StatQuest videos, check out: https://statquest.org/video-index/

If you'd like to support StatQuest, please consider...
Buying my book, The StatQuest Illustrated Guide to Machine Learning:
PDF - https://statquest.gumroad.com/l/wvtmc
Paperback - https://www.amazon.com/dp/B09ZCKR4H6
Kindle eBook - https://www.amazon.com/dp/B09ZG79HXC
Patreon: https://www.patreon.com/statquest
...or...
YouTube Membership: https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw/join
...a cool StatQuest t-shirt or sweatshirt: https://shop.spreadshirt.com/statquest-with-josh-starmer/
...buying one or two of my songs (or go large and get a whole album!) https://joshuastarmer.bandcamp.com/
...or just donating to StatQuest! https://www.paypal.me/statquest

Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter: https://twitter.com/joshuastarmer

0:00 Awesome song and introduction
4:16 Import Modules
6:36 Import Data
11:27 Missing Data Part 1: Identifying
16:57 Missing Data Part 2: Dealing with it
21:04 Downsampling the data
24:35 Format Data Part 1: X and y
26:35 Format Data Part 2: One-Hot Encoding
31:25 Format Data Part 3: Centering and Scaling
32:45 Build a Preliminary SVM
34:55 Optimize Parameters with Cross Validation (GridSearchCV)
37:58 Build and Draw Final SVM

#StatQuest #ML #SVM
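The 4:16 "Import Modules" chapter loads everything used later in the notebook. A minimal sketch of that step; the exact import list in the notebook may differ slightly:

```python
import pandas as pd                       # load and manipulate data frames
import numpy as np                        # number crunching
import matplotlib.pyplot as plt           # drawing graphs
from sklearn.utils import resample        # downsampling the data
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import scale   # centering and scaling
from sklearn.svm import SVC               # the support vector classifier
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import PCA     # for the 2-D drawing at the end
```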

detail
{'title': 'Support Vector Machines in Python from Start to Finish.', 'heatmap': [{'end': 1916.235, 'start': 1852.607, 'weight': 0.721}, {'end': 1990.578, 'start': 1932.628, 'weight': 0.772}, {'end': 2458.161, 'start': 2419.273, 'weight': 0.848}, {'end': 2610.217, 'start': 2551.548, 'weight': 0.75}], 'summary': 'Covers the basics and theory of support vector machines in python, using scikit-learn and radial basis function for classification, data manipulation with pandas, and optimization of support vector machine using gridsearchcv, resulting in a slight improvement in classification accuracy.', 'chapters': [{'end': 505.789, 'segs': [{'end': 65.826, 'src': 'embed', 'start': 39.028, 'weight': 0, 'content': [{'end': 47.053, 'text': 'In this lesson, we will build a support vector machine for classification using scikit-learn and the radial basis function.', 'start': 39.028, 'duration': 8.025}, {'end': 59.582, 'text': 'Our training data set contains continuous and categorical data from the UCI machine learning repository to predict whether or not a person will default on their credit card.', 'start': 48.194, 'duration': 11.388}, {'end': 65.826, 'text': 'And note, throughout this Jupyter notebook, all these links are live.', 'start': 60.042, 'duration': 5.784}], 'summary': 'Building an svm model to predict credit card default using scikit-learn and radial basis function.', 'duration': 26.798, 'max_score': 39.028, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8A7L0GsBiLQ/pics/8A7L0GsBiLQ39028.jpg'}, {'end': 152.92, 'src': 'embed', 'start': 107.503, 'weight': 1, 'content': [{'end': 110.485, 'text': 'Okay, so support vector machines.', 'start': 107.503, 'duration': 2.982}, {'end': 111.306, 'text': 'why would you want to do them??', 'start': 110.485, 'duration': 0.821}, {'end': 116.069, 'text': 'They are one of the best machine learning algorithms out there,', 'start': 111.686, 'duration': 4.383}, {'end': 124.195, 'text': 'for when getting the correct answer is a higher priority than actually understanding why you get the correct answer.', 'start': 116.069, 'duration': 8.126}, {'end': 130.84, 'text': 'They work really well with relatively small data sets, and they tend to work well out of the box.', 'start': 124.935, 'duration': 5.905}, {'end': 135.343, 'text': 'In other words, they tend to not require much optimization.', 'start': 131.52, 'duration': 3.823}, {'end': 140.914, 'text': "So in this lesson, we're going to learn about importing data from a file.", 'start': 137.813, 'duration': 3.101}, {'end': 147.898, 'text': "We're going to learn about missing data, downsampling data, formatting the data for support vector machines.", 'start': 140.934, 'duration': 6.964}, {'end': 151.499, 'text': "And we're going to build a preliminary support vector machine.", 'start': 148.538, 'duration': 2.961}, {'end': 152.92, 'text': "Then we're going to optimize it.", 'start': 151.719, 'duration': 1.201}], 'summary': 'Support vector machines are best for high-priority correct answers, work well with small datasets, and require minimal optimization.', 'duration': 45.417, 'max_score': 107.503, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8A7L0GsBiLQ/pics/8A7L0GsBiLQ107503.jpg'}, {'end': 292.429, 'src': 'embed', 'start': 263.598, 'weight': 5, 'content': [{'end': 267.14, 'text': 'Python itself just gives us a basic programming language.', 'start': 263.598, 'duration': 3.542}, {'end': 279.247, 'text': 'These modules give us extra functionality to import 
the data, clean it up and format it, and then build, evaluate and draw the support vector machine.', 'start': 267.78, 'duration': 11.467}, {'end': 288.788, 'text': "Note, you're going to need Python 3, and I've got instructions on how to install that and make sure all your modules are up to date down here.", 'start': 280.505, 'duration': 8.283}, {'end': 292.429, 'text': "We don't need to go through that because I've already got this set up on my own computer.", 'start': 288.828, 'duration': 3.601}], 'summary': 'Python provides basic programming language, modules offer extra functionality for data processing and support vector machine building.', 'duration': 28.831, 'max_score': 263.598, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8A7L0GsBiLQ/pics/8A7L0GsBiLQ263598.jpg'}, {'end': 454.055, 'src': 'embed', 'start': 411.449, 'weight': 3, 'content': [{'end': 420.357, 'text': 'This data set will allow us to predict if someone will default on their credit card payments based on their sex, age, and a variety of other metrics.', 'start': 411.449, 'duration': 8.908}, {'end': 425.261, 'text': "Note when pandas, which is what we're going to use to read in the data.", 'start': 421.237, 'duration': 4.024}, {'end': 430.209, 'text': 'When it reads in data, it returns a data frame which is a lot like a spreadsheet.', 'start': 425.967, 'duration': 4.242}, {'end': 437.753, 'text': 'The data are organized in rows and columns, and each row can contain a mixture of text and columns.', 'start': 430.869, 'duration': 6.884}, {'end': 438.994, 'text': 'Excuse me.', 'start': 438.693, 'duration': 0.301}, {'end': 441.407, 'text': 'text and numbers.', 'start': 440.666, 'duration': 0.741}, {'end': 447.27, 'text': "The standard variable name for a data frame is the initials DF, and that's what we're going to use here.", 'start': 442.687, 'duration': 4.583}, {'end': 449.932, 'text': "And I've got two blocks of code here.", 'start': 447.811, 'duration': 2.121}, {'end': 454.055, 'text': "One is to read in the file that we're going to use.", 'start': 450.533, 'duration': 3.522}], 'summary': 'Using a dataset to predict credit card default based on demographic and other factors.', 'duration': 42.606, 'max_score': 411.449, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8A7L0GsBiLQ/pics/8A7L0GsBiLQ411449.jpg'}], 'start': 0.74, 'title': 'Support vector machines in python and data loading', 'summary': 'Covers the basics and theory of support vector machines in python, emphasizing the use of scikit-learn and radial basis function for classification. it also includes guidance on importing necessary modules and loading and manipulating data using pandas data frames.', 'chapters': [{'end': 196.355, 'start': 0.74, 'title': 'Support vector machines in python', 'summary': 'Introduces the use of support vector machines in python for classification, using scikit-learn and the radial basis function to predict credit card default. it covers importing data, missing data handling, downsampling, formatting for support vector machines, building, optimizing, evaluating, drawing, and interpreting support vector machines, and comparing the preliminary and final models.', 'duration': 195.615, 'highlights': ['The chapter covers the use of support vector machines in Python for classification, using scikit-learn and the radial basis function to predict credit card default, demonstrating practical steps and techniques. 
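The load step described above, as a hedged sketch: pandas reads the tab-delimited UCI file, skipping the junk first row so the second row becomes the header. The local filename is an assumption:

```python
import pandas as pd

# Read the tab-delimited file. header=1 skips the junk first row and
# uses the second row as column names (the filename is an assumption).
df = pd.read_csv('default_of_credit_card_clients.tsv',
                 header=1, sep='\t')

# Print the first five rows to verify the load worked.
print(df.head())
```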
It also addresses data handling, model optimization, evaluation, and comparison. (Relevance score: 5)', 'Support vector machines are highlighted as one of the best machine learning algorithms, particularly suitable when obtaining the correct answer is a priority over understanding why, and for relatively small datasets. They are noted to work well out of the box and not require significant optimization. (Relevance score: 4)', 'The chapter explores practical steps including importing data from a file, handling missing data, downsampling, and formatting the data for support vector machines, emphasizing a comprehensive approach to model building and evaluation. (Relevance score: 3)']}, {'end': 338.53, 'start': 199.176, 'title': 'Support vector machines with python', 'summary': 'Introduces the basics of python and the theory behind support vector machines, emphasizing the importance of playing with the code for learning. it also provides guidance on importing necessary modules and using jupyter notebooks for data manipulation.', 'duration': 139.354, 'highlights': ['The chapter stresses the importance of playing with the code for learning and encourages experimenting with different machine learning algorithms to gain insights.', 'The chapter provides guidance on importing necessary modules and using Jupyter Notebooks for data manipulation and visualization, emphasizing the importance of Python 3 and up-to-date modules.', 'The chapter assumes familiarity with the basics of Python and theory behind support vector machines, the radial basis function, regularization, cross-validation, and confusion matrices, providing links for reference.']}, {'end': 505.789, 'start': 338.53, 'title': 'Running and loading data in python', 'summary': 'Explains how to run code in python using different methods, and then loads a data set from the uci machine learning repository to predict credit card defaults based on various metrics using pandas data frames.', 'duration': 167.259, 'highlights': ['The data set will allow us to predict if someone will default on their credit card payments based on their sex, age, and a variety of other metrics. The data set enables prediction of credit card defaults based on various metrics such as sex, age, and other factors.', 'Pandas returns a data frame when it reads in data, which is organized in rows and columns, and contains a mixture of text and numbers. Pandas returns a data frame organized in rows and columns, containing a mix of text and numbers when it reads in data.', 'The file has a header row, and the first row contains some other kind of nonsense, while the data are tab delimited. The file has a header row, with the first row containing irrelevant data, and the data is tab delimited.']}], 'duration': 505.049, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8A7L0GsBiLQ/pics/8A7L0GsBiLQ740.jpg', 'highlights': ['The chapter covers the use of support vector machines in Python for classification, using scikit-learn and the radial basis function to predict credit card default, demonstrating practical steps and techniques. It also addresses data handling, model optimization, evaluation, and comparison.', 'Support vector machines are highlighted as one of the best machine learning algorithms, particularly suitable when obtaining the correct answer is a priority over understanding why, and for relatively small datasets. 
They are noted to work well out of the box and not require significant optimization.', 'The chapter explores practical steps including importing data from a file, handling missing data, downsampling, and formatting the data for support vector machines, emphasizing a comprehensive approach to model building and evaluation.', 'The data set will allow us to predict if someone will default on their credit card payments based on their sex, age, and a variety of other metrics. The data set enables prediction of credit card defaults based on various metrics such as sex, age, and other factors.', 'Pandas returns a data frame when it reads in data, which is organized in rows and columns, and contains a mixture of text and numbers. Pandas returns a data frame organized in rows and columns, containing a mix of text and numbers when it reads in data.', 'The chapter provides guidance on importing necessary modules and using Jupyter Notebooks for data manipulation and visualization, emphasizing the importance of Python 3 and up-to-date modules.']}, {'end': 747.161, 'segs': [{'end': 620.644, 'src': 'embed', 'start': 530.491, 'weight': 0, 'content': [{'end': 532.895, 'text': "The columns are ID, that's just an ID number.", 'start': 530.491, 'duration': 2.404}, {'end': 537.599, 'text': 'Limit balance is the credit limit for that customer.', 'start': 534.054, 'duration': 3.545}, {'end': 540.222, 'text': 'Sex is male or female.', 'start': 538.16, 'duration': 2.062}, {'end': 544.628, 'text': 'Education, marriage, age, payment.', 'start': 540.883, 'duration': 3.745}, {'end': 550.516, 'text': 'These columns tell us whether or not the last payment was on time or how late it was.', 'start': 544.688, 'duration': 5.828}, {'end': 559.24, 'text': "There's one column for different months of payments.", 'start': 554.096, 'duration': 5.144}, {'end': 562.403, 'text': "So it's not just one last month's payment.", 'start': 559.541, 'duration': 2.862}, {'end': 564.065, 'text': 'It goes back six months.', 'start': 562.844, 'duration': 1.221}, {'end': 572.592, 'text': 'Then we have the bill amounts for the last six months, the last six bills, and how much was paid for the last six bills.', 'start': 564.985, 'duration': 7.607}, {'end': 577.356, 'text': 'And then lastly, we have default payment next month.', 'start': 573.533, 'duration': 3.823}, {'end': 580.499, 'text': "This is the variable that we're going to try to predict.", 'start': 577.836, 'duration': 2.663}, {'end': 584.208, 'text': "And here I've listed the column names as well.", 'start': 581.807, 'duration': 2.401}, {'end': 591.511, 'text': 'But note, the last column name, this one, default payment next month, that is a mouthful.', 'start': 584.668, 'duration': 6.843}, {'end': 595.473, 'text': "So we're going to change it to just be default.", 'start': 592.111, 'duration': 3.362}, {'end': 603.056, 'text': 'So did they default or not? 
And we do that by doing dataframe.rename.', 'start': 596.893, 'duration': 6.163}, {'end': 608.761, 'text': 'And we pass in the column name that we want to change, and then the name we want to change it to.', 'start': 603.84, 'duration': 4.921}, {'end': 615.003, 'text': "When we set access to columns, we're specifying that we want to change a column name.", 'start': 609.661, 'duration': 5.342}, {'end': 620.644, 'text': "Lastly, we're saying we want to do it in place, meaning we're going to modify this data frame.", 'start': 616.103, 'duration': 4.541}], 'summary': 'Data includes id, credit limit, gender, education, marriage, age, payment history, bill amounts, and default payment prediction for next month.', 'duration': 90.153, 'max_score': 530.491, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8A7L0GsBiLQ/pics/8A7L0GsBiLQ530491.jpg'}, {'end': 722.344, 'src': 'embed', 'start': 672.451, 'weight': 5, 'content': [{'end': 677.394, 'text': "And there it is, like we've got ID up here, and now we no longer have the ID.", 'start': 672.451, 'duration': 4.943}, {'end': 680.236, 'text': "So hooray, we've cleaned up the columns a bit.", 'start': 677.654, 'duration': 2.582}, {'end': 686.479, 'text': 'And now that we have the data in a data frame called DF, we are ready to identify and deal with missing data.', 'start': 680.956, 'duration': 5.523}, {'end': 699.087, 'text': 'Unfortunately, The biggest part of any data analysis project is making sure that the data are correctly formatted and fixing it when it is not.', 'start': 688.08, 'duration': 11.007}, {'end': 704.056, 'text': 'The first part of this process is identifying and dealing with missing data.', 'start': 699.994, 'duration': 4.062}, {'end': 713.98, 'text': 'Missing data is simply a blank space or a surrogate value like NA that indicates that we failed to collect data for one of the features.', 'start': 704.776, 'duration': 9.204}, {'end': 722.344, 'text': "For example, if we forgot to ask someone's age or forgot to write it down, then we would have a blank space in the data set for that person's age.", 'start': 714.4, 'duration': 7.944}], 'summary': 'Data cleaning process: identifying and dealing with missing data in a dataframe called df.', 'duration': 49.893, 'max_score': 672.451, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8A7L0GsBiLQ/pics/8A7L0GsBiLQ672451.jpg'}], 'start': 505.809, 'title': 'Data analysis techniques', 'summary': 'Covers analyzing credit card default data, renaming and dropping columns in a dataframe, and dealing with missing data in a data analysis. it includes loading and examining credit card default data, renaming and dropping columns in a dataframe, and addressing missing data in a data frame. the techniques aim to modify dataframes and handle missing data effectively for accurate analysis.', 'chapters': [{'end': 591.511, 'start': 505.809, 'title': 'Analyzing credit card default data', 'summary': "Explains how to load and examine credit card default data in a data frame, 'df', with columns for id, limit balance, sex, education, marriage, age, payment history, bill amounts, and default payment next month.", 'duration': 85.702, 'highlights': ["The data frame 'df' contains columns for ID, limit balance, sex, education, marriage, age, payment history, bill amounts, and default payment next month. 
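A sketch of the two cleanup steps just described: rename the unwieldy target column to 'default' and drop the ID column, both in place. The original column names follow the UCI file:

```python
# Shorten the unwieldy target column name to just 'default'.
df.rename({'default payment next month': 'default'},
          axis='columns', inplace=True)

# The ID column only identifies rows, so drop it in place.
df.drop('ID', axis=1, inplace=True)

print(df.head())  # verify both changes took effect
```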
The data frame 'df' contains information on ID, limit balance, sex, education, marriage, age, payment history, bill amounts, and default payment next month.", 'The payment history columns track whether the last payment was on time or how late it was, with details for different months. The payment history columns track the timeliness of payments and include details for different months.', "The data includes default payment next month, which is the variable to be predicted. The data includes 'default payment next month' as the variable to be predicted."]}, {'end': 669.869, 'start': 592.111, 'title': 'Rename and drop columns in dataframe', 'summary': "Demonstrates renaming a column in a dataframe using 'dataframe.rename' and then dropping a column using 'df.drop' in place, with the aim of modifying the dataframe without creating a new variable.", 'duration': 77.758, 'highlights': ["The chapter demonstrates using 'dataframe.rename' to rename a column by passing in the column name and the new name, and setting access to columns.", 'The process of renaming the column is done in place to modify the dataframe without creating a new variable, verified by printing out the first five rows to ensure the column was renamed correctly.', "The chapter illustrates using 'df.drop' to remove a column by specifying the ID column and performing the operation in place, followed by verifying the column removal by printing out the first five rows."]}, {'end': 747.161, 'start': 672.451, 'title': 'Dealing with missing data in data analysis', 'summary': 'Focuses on identifying and dealing with missing data in a data frame called df, highlighting the importance of correctly formatting data and presenting two main ways to deal with missing data: removing rows or imputing values.', 'duration': 74.71, 'highlights': ["The first part of this process is identifying and dealing with missing data, which can be a blank space or a surrogate value like NA, indicating failed data collection (e.g., forgetting to ask for someone's age).", 'There are two main ways to deal with missing data - removing the rows containing missing data from the dataset or imputing the missing values, which involves making an educated guess about the value.', 'The biggest part of any data analysis project is ensuring the data are correctly formatted and fixing it when it is not, emphasizing the importance of data quality in data analysis.']}], 'duration': 241.352, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8A7L0GsBiLQ/pics/8A7L0GsBiLQ505809.jpg', 'highlights': ["The data frame 'df' contains columns for ID, limit balance, sex, education, marriage, age, payment history, bill amounts, and default payment next month.", 'The payment history columns track whether the last payment was on time or how late it was, with details for different months.', 'The data includes default payment next month, which is the variable to be predicted.', "The chapter demonstrates using 'dataframe.rename' to rename a column by passing in the column name and the new name, and setting access to columns.", 'The process of renaming the column is done in place to modify the dataframe without creating a new variable, verified by printing out the first five rows to ensure the column was renamed correctly.', "The first part of this process is identifying and dealing with missing data, which can be a blank space or a surrogate value like NA, indicating failed data collection (e.g., forgetting to ask for someone's age).", 'The biggest part of any data 
analysis project is ensuring the data are correctly formatted and fixing it when it is not, emphasizing the importance of data quality in data analysis.']}, {'end': 1290.733, 'segs': [{'end': 801.743, 'src': 'embed', 'start': 747.881, 'weight': 0, 'content': [{'end': 750.884, 'text': "First, let's see what sort of data is in each column.", 'start': 747.881, 'duration': 3.003}, {'end': 757.053, 'text': "To do that, we've got our data frame and we're asking for the data type.", 'start': 752.691, 'duration': 4.362}, {'end': 761.574, 'text': "So we're gonna check out the data types with the D types command.", 'start': 757.073, 'duration': 4.501}, {'end': 776.082, 'text': 'And when we run that code, we see that every column is int64, which is good, or at least it looks good,', 'start': 763.315, 'duration': 12.767}, {'end': 781.086, 'text': "because it doesn't tell us off the bat that the person mixed letters and numbers.", 'start': 776.082, 'duration': 5.004}, {'end': 784.789, 'text': 'In other words, there are no NA values,', 'start': 782.347, 'duration': 2.442}, {'end': 794.277, 'text': "and that suggests that maybe things are in good hands because they didn't use character-based placeholders for missing data and data frame.", 'start': 784.789, 'duration': 9.488}, {'end': 801.743, 'text': 'That said, we should still make sure that each column contains acceptable values.', 'start': 795.837, 'duration': 5.906}], 'summary': 'Data types check: all columns are int64, no na values found, ensuring acceptable values in each column.', 'duration': 53.862, 'max_score': 747.881, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8A7L0GsBiLQ/pics/8A7L0GsBiLQ747881.jpg'}, {'end': 968.644, 'src': 'embed', 'start': 941.419, 'weight': 3, 'content': [{'end': 946.109, 'text': "Now we're going to look at marriage and make sure it only contains 1, 2, and 3.", 'start': 941.419, 'duration': 4.69}, {'end': 951.137, 'text': "So we do that, the exact same code as before, only this time we're specifying the column named marriage.", 'start': 946.109, 'duration': 5.028}, {'end': 958.729, 'text': "And like education, marriage contains zero, which I'm guessing represents missing data.", 'start': 953.221, 'duration': 5.508}, {'end': 966.583, 'text': 'Now, note, This data set is part of an academic publication that is not open access.', 'start': 959.431, 'duration': 7.152}, {'end': 968.644, 'text': "It's owned by a company called Elsevier.", 'start': 966.663, 'duration': 1.981}], 'summary': 'Data analysis of marriage column shows 0 missing values. 
dataset owned by elsevier.', 'duration': 27.225, 'max_score': 941.419, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8A7L0GsBiLQ/pics/8A7L0GsBiLQ941419.jpg'}, {'end': 1166.619, 'src': 'embed', 'start': 1117.261, 'weight': 1, 'content': [{'end': 1123.946, 'text': "And we're going to use that len or length function, just like we did before, only this time we're not specifying which rows we want.", 'start': 1117.261, 'duration': 6.685}, {'end': 1127.829, 'text': "And when we don't specify specific rows, we get them all.", 'start': 1124.366, 'duration': 3.463}, {'end': 1133.173, 'text': 'And we see that there are 30, 000 rows in the data set to begin with.', 'start': 1129.39, 'duration': 3.783}, {'end': 1140.571, 'text': 'And so 68 of the 30, 000 rows, or less than 1%, contain missing values.', 'start': 1134.269, 'duration': 6.302}, {'end': 1146.793, 'text': 'Since that still leaves us with more data than we need for a support vector machine.', 'start': 1142.071, 'duration': 4.722}, {'end': 1151.694, 'text': 'we will remove the rows with missing values rather than try to impute their values.', 'start': 1146.793, 'duration': 4.901}, {'end': 1158.096, 'text': "And like I said, we're going to try to do imputing imputation in a future webinar, hopefully in two months.", 'start': 1151.774, 'duration': 6.322}, {'end': 1166.619, 'text': 'So the way we do this, we select all the rows that do not contain zero in either education or marriage.', 'start': 1159.761, 'duration': 6.858}], 'summary': 'Dataset has 30,000 rows, with 68 (less than 1%) containing missing values. rows with missing values will be removed.', 'duration': 49.358, 'max_score': 1117.261, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8A7L0GsBiLQ/pics/8A7L0GsBiLQ1117261.jpg'}], 'start': 747.881, 'title': 'Data validation and missing value handling', 'summary': 'Covers the validation of data columns including limit balance, sex, education, marriage, and age, and the handling of 68 missing values out of 30,000 rows, with the decision to remove the rows with missing values.', 'chapters': [{'end': 801.743, 'start': 747.881, 'title': 'Data type checking and validation', 'summary': "Discusses how to check the data types of columns in a data frame using the 'd types' command, revealing that all columns are of type int64, indicating the absence of na values and the potential absence of character-based placeholders for missing data.", 'duration': 53.862, 'highlights': ["The data types of columns are checked using the 'D types' command, revealing that all columns are of type int64, indicating the absence of NA values and the potential absence of character-based placeholders for missing data.", 'The absence of NA values suggests that there are no missing data in the data frame, which is a positive indicator of data quality.']}, {'end': 1290.733, 'start': 802.704, 'title': 'Data validation and missing value handling', 'summary': 'Covers the validation of data columns including limit balance, sex, education, marriage, and age, and the handling of missing values in the dataset, with 68 out of 30,000 rows containing missing values, leading to the decision to remove the rows with missing values.', 'duration': 488.029, 'highlights': ['68 out of 30,000 rows contain missing values The dataset contains 68 rows with missing values in the education or marriage columns, representing less than 1% of the total rows.', 'Decision to remove rows with missing values It is decided to remove the rows with 
missing values rather than trying to impute their values, as the remaining data is still sufficient for a support vector machine.', 'Validation of data columns The validation process ensures that the columns for sex, education, and marriage contain only specified numerical values and categories, with checks for unique values and potential missing data.']}], 'duration': 542.852, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8A7L0GsBiLQ/pics/8A7L0GsBiLQ747881.jpg', 'highlights': ['The absence of NA values suggests that there are no missing data in the data frame, which is a positive indicator of data quality.', 'Decision to remove rows with missing values It is decided to remove the rows with missing values rather than trying to impute their values, as the remaining data is still sufficient for a support vector machine.', '68 out of 30,000 rows contain missing values The dataset contains 68 rows with missing values in the education or marriage columns, representing less than 1% of the total rows.', 'Validation of data columns The validation process ensures that the columns for sex, education, and marriage contain only specified numerical values and categories, with checks for unique values and potential missing data.', "The data types of columns are checked using the 'D types' command, revealing that all columns are of type int64, indicating the absence of NA values and the potential absence of character-based placeholders for missing data."]}, {'end': 1572.504, 'segs': [{'end': 1311.89, 'src': 'embed', 'start': 1290.753, 'weight': 0, 'content': [{'end': 1299.757, 'text': "So we're going to downsample both categories, customers who did not default and customers that did, down to 1, 000 each.", 'start': 1290.753, 'duration': 9.004}, {'end': 1306.081, 'text': "So first thing we're going to do is just remind ourselves how many rows of data we're working with because we removed some of them.", 'start': 1300.198, 'duration': 5.883}, {'end': 1310.988, 'text': "So we use the length function again, and that tells us we've got 29, 932 samples.", 'start': 1306.741, 'duration': 4.247}, {'end': 1311.89, 'text': "That's relatively large.", 'start': 1311.028, 'duration': 0.862}], 'summary': 'Downsampled 1,000 non-default and default customers from 29,932 samples.', 'duration': 21.137, 'max_score': 1290.753, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8A7L0GsBiLQ/pics/8A7L0GsBiLQ1290753.jpg'}, {'end': 1371.675, 'src': 'embed', 'start': 1340.988, 'weight': 1, 'content': [{'end': 1343.469, 'text': "We're storing them in another data frame.", 'start': 1340.988, 'duration': 2.481}, {'end': 1350.651, 'text': "So we're splitting the data into two variables here, one for people that defaulted, and one for people that did not default.", 'start': 1343.489, 'duration': 7.162}, {'end': 1356.279, 'text': "And now what we're doing is we're downsampling the people that did not default.", 'start': 1352.014, 'duration': 4.265}, {'end': 1364.189, 'text': "We're using the resample function and we're passing it the data frame that consists of people that did not default.", 'start': 1356.32, 'duration': 7.869}, {'end': 1366.891, 'text': "We're setting replace to false.", 'start': 1365.23, 'duration': 1.661}, {'end': 1371.675, 'text': 'So that means when we pull something out of there and we put it in our new data frame,', 'start': 1367.151, 'duration': 4.524}], 'summary': 'Data is split into defaulted and non-defaulted groups, with non-defaulted data being 
downsampled using the resample function and replace set to false.', 'duration': 30.687, 'max_score': 1340.988, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8A7L0GsBiLQ/pics/8A7L0GsBiLQ1340988.jpg'}, {'end': 1469.561, 'src': 'embed', 'start': 1436.395, 'weight': 2, 'content': [{'end': 1441.396, 'text': "Now we're going to do the exact same thing, but this time we're using the people that defaulted.", 'start': 1436.395, 'duration': 5.001}, {'end': 1445.434, 'text': "And we're going to print out the number of rows there and there.", 'start': 1442.773, 'duration': 2.661}, {'end': 1450.835, 'text': "So we've got two new variables, each containing 1, 000 rows each.", 'start': 1446.634, 'duration': 4.201}, {'end': 1460.798, 'text': 'And now what we want to do is we want to merge them back into a single data frame and print out the total number of rows to make sure everything is hunky dory.', 'start': 1451.135, 'duration': 9.663}, {'end': 1469.561, 'text': "To merge the two data frames that we're creating, we're using this pandas function called concat, which will concatenate the two things.", 'start': 1460.838, 'duration': 8.723}], 'summary': 'Merging 2 data frames, each with 1000 rows, using pandas concat function.', 'duration': 33.166, 'max_score': 1436.395, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8A7L0GsBiLQ/pics/8A7L0GsBiLQ1436395.jpg'}, {'end': 1511.747, 'src': 'embed', 'start': 1486.393, 'weight': 5, 'content': [{'end': 1491.958, 'text': "We're going to have one part that contains the columns of data that we will use to make classifications,", 'start': 1486.393, 'duration': 5.565}, {'end': 1498.023, 'text': 'and one part is going to be a column of data that contains the things we want to predict.', 'start': 1491.958, 'duration': 6.065}, {'end': 1506.485, 'text': "So we're going to use the conventional notation of capital X to represent the columns of data that we will use to make classifications.", 'start': 1498.982, 'duration': 7.503}, {'end': 1511.747, 'text': "And we're going to use lowercase y to represent the thing we want to predict.", 'start': 1506.965, 'duration': 4.782}], 'summary': 'Data will be used for classification with x and prediction with y.', 'duration': 25.354, 'max_score': 1486.393, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8A7L0GsBiLQ/pics/8A7L0GsBiLQ1486393.jpg'}], 'start': 1290.753, 'title': 'Data downsampling and svm preparation', 'summary': 'Involves downsampling the dataset to 1,000 samples for both default and non-default customers from an initial 29,932 samples and preparing for support vector machine (svm) modeling for classification.', 'chapters': [{'end': 1366.891, 'start': 1290.753, 'title': 'Data downsampling for default and non-default customers', 'summary': 'Involves downsampling the dataset to 1,000 samples for both default and non-default customers from an initial 29,932 samples, using the resample function and storing the split data into new variables.', 'duration': 76.138, 'highlights': ['The dataset, initially containing 29,932 samples, is downsampled to 1,000 samples for both default and non-default customers.', 'The resample function is utilized with replace set to false to downsample the data for non-default customers.', 'The data is split into two variables, one for customers who defaulted and another for customers who did not default.']}, {'end': 1572.504, 'start': 1367.151, 'title': 'Data sampling and preparing for svm', 'summary': 'Discusses 
down sampling data to 1000 rows, merging two data frames into one, splitting the data into x and y for classification, and preparing for support vector machine (svm) modeling.', 'duration': 205.353, 'highlights': ['Merging two data frames to create a single data frame and confirming 2000 rows in total. The process involves merging two data frames using the pandas concat function, resulting in a total of 2000 rows.', 'Down sampling data to create two new variables, each containing 1000 rows. The data is down sampled to create two new variables, each with 1000 rows, representing a subset of the original data.', "Splitting the data into X and y for classification, with X representing columns for classification and y representing the target variable. The data set is split into X and y, where X contains the columns used for classification, and y represents the target variable 'default' indicating whether someone defaulted on their payments."]}], 'duration': 281.751, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8A7L0GsBiLQ/pics/8A7L0GsBiLQ1290753.jpg', 'highlights': ['The dataset is downsampled to 1,000 samples for both default and non-default customers from an initial 29,932 samples.', 'The resample function is utilized with replace set to false for downsampling the data for non-default customers.', 'Merging two data frames to create a single data frame and confirming 2000 rows in total.', 'The data is split into two variables, one for customers who defaulted and another for customers who did not default.', 'The data is downsampled to create two new variables, each containing 1000 rows.', 'Splitting the data into X and y for classification, with X representing columns for classification and y representing the target variable.']}, {'end': 1855.529, 'segs': [{'end': 1654.968, 'src': 'embed', 'start': 1622.149, 'weight': 2, 'content': [{'end': 1624.931, 'text': "That's male or one for male, two for female.", 'start': 1622.149, 'duration': 2.782}, {'end': 1628.354, 'text': 'Education has a bunch of categories, one, two, three, and four.', 'start': 1625.552, 'duration': 2.802}, {'end': 1629.514, 'text': "We've already talked about these.", 'start': 1628.394, 'duration': 1.12}, {'end': 1638.46, 'text': 'So it looks like sex, education, marriage, and pay are supposed to be categorical, and they need to be modified.', 'start': 1630.955, 'duration': 7.505}, {'end': 1644.723, 'text': 'This is because scikit-learn support vector machines.', 'start': 1639.341, 'duration': 5.382}, {'end': 1654.968, 'text': 'while they natively support continuous data like limit, balance and age, they do not natively support categorical data like marriage,', 'start': 1644.723, 'duration': 10.245}], 'summary': 'Data categories need modification for scikit-learn support vector machines.', 'duration': 32.819, 'max_score': 1622.149, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8A7L0GsBiLQ/pics/8A7L0GsBiLQ1622149.jpg'}, {'end': 1702.396, 'src': 'embed', 'start': 1670.52, 'weight': 0, 'content': [{'end': 1673.321, 'text': 'And this trick is called one-hot encoding.', 'start': 1670.52, 'duration': 2.801}, {'end': 1682.406, 'text': "So at this point, you may be wondering what's wrong with treating categorical data like it's continuous?", 'start': 1676.983, 'duration': 5.423}, {'end': 1685.428, 'text': "So to answer that question, let's look at an example.", 'start': 1683.147, 'duration': 2.281}, {'end': 1691.673, 'text': 'For the marriage columns, we have three options, 
one married, two single, and three other.', 'start': 1686.391, 'duration': 5.282}, {'end': 1702.396, 'text': 'If we treated those values one, two and three like continuous data then we would assume that three, which means other, is more similar to two,', 'start': 1692.733, 'duration': 9.663}], 'summary': 'One-hot encoding is used to avoid treating categorical data as continuous, illustrated with a marriage column example.', 'duration': 31.876, 'max_score': 1670.52, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8A7L0GsBiLQ/pics/8A7L0GsBiLQ1670520.jpg'}, {'end': 1813.046, 'src': 'embed', 'start': 1784.081, 'weight': 3, 'content': [{'end': 1785.522, 'text': 'Get dummies is a Panda function.', 'start': 1784.081, 'duration': 1.441}, {'end': 1789.403, 'text': 'And so we pass it X, capital X.', 'start': 1786.202, 'duration': 3.201}, {'end': 1791.804, 'text': "That's the columns that we want to transform.", 'start': 1789.403, 'duration': 2.401}, {'end': 1795.365, 'text': 'And we specify the columns that we want to transform.', 'start': 1791.884, 'duration': 3.481}, {'end': 1798.827, 'text': "For this demonstration, we're just going to transform marriage.", 'start': 1795.685, 'duration': 3.142}, {'end': 1803.148, 'text': "And then we're going to print out the first five rows to see what it did.", 'start': 1799.487, 'duration': 3.661}, {'end': 1813.046, 'text': "scroll over here and you'll see that on the left side of the data frame are the columns that we did not touch,", 'start': 1806.084, 'duration': 6.962}], 'summary': 'Demonstration of using the panda function get dummies to transform columns, with marriage being transformed for this instance.', 'duration': 28.965, 'max_score': 1784.081, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8A7L0GsBiLQ/pics/8A7L0GsBiLQ1784081.jpg'}], 'start': 1573.817, 'title': 'One-hot encoding basics', 'summary': "Discusses the basics of one hot encoding using the get dummies function from pandas, demonstrating the transformation of a 'marriage' column into three separate columns containing binary values.", 'chapters': [{'end': 1740.686, 'start': 1573.817, 'title': 'One-hot encoding for support vector machine', 'summary': 'Discusses the process of formatting the data for support vector machine, including splitting the data frame, identifying categorical variables, and using one-hot encoding to convert them into binary values, to ensure compatibility with scikit-learn support vector machines.', 'duration': 166.869, 'highlights': ['The process of one-hot encoding is used to convert categorical data into binary values to make it compatible with scikit-learn support vector machines. This is important because scikit-learn support vector machines natively support continuous data but not categorical data, therefore, one-hot encoding is necessary to process categorical variables.', 'Explanation of why treating categorical data as continuous is not suitable, using an example of marriage categories, one being married, two being single, and three being other. This highlights the difference in treating categorical data as continuous, showcasing the potential misrepresentation of similarity between categories and the significance of treating them as separate and equal categories.', 'Discussion on identifying categorical variables such as sex, education, marriage, and pay, and the need to modify them for compatibility with support vector machines. 
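Pulling the data-preparation chapters together in one hedged sketch, continuing from the frames above: treat zeros in EDUCATION and MARRIAGE as missing and drop those rows, downsample each class to 1,000 rows, split into X and y, and one-hot encode the categorical columns. The column names follow the UCI file and the random_state values are assumptions:

```python
import pandas as pd
from sklearn.utils import resample

# Zeros in EDUCATION or MARRIAGE are undocumented codes, so treat
# them as missing and keep only the rows without them.
print(len(df.loc[(df['EDUCATION'] == 0) | (df['MARRIAGE'] == 0)]))  # 68 rows
df_no_missing = df.loc[(df['EDUCATION'] != 0) & (df['MARRIAGE'] != 0)]

# Split by outcome, then downsample each group to 1,000 rows
# without replacement so no row is picked twice.
df_no_default = df_no_missing[df_no_missing['default'] == 0]
df_default = df_no_missing[df_no_missing['default'] == 1]
df_no_default_down = resample(df_no_default, replace=False,
                              n_samples=1000, random_state=42)
df_default_down = resample(df_default, replace=False,
                           n_samples=1000, random_state=42)

# Concatenate the two 1,000-row frames back into one 2,000-row frame.
df_downsample = pd.concat([df_no_default_down, df_default_down])

# X = the columns used for classification, y = the thing to predict.
X = df_downsample.drop('default', axis=1).copy()
y = df_downsample['default'].copy()

# One-hot encode the categorical columns so 1/2/3 codes are treated
# as separate categories, not ordered numbers.
X_encoded = pd.get_dummies(X, columns=['SEX', 'EDUCATION', 'MARRIAGE',
                                       'PAY_0', 'PAY_2', 'PAY_3',
                                       'PAY_4', 'PAY_5', 'PAY_6'])
```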
This emphasizes the process of identifying and modifying categorical variables for compatibility with support vector machines, highlighting the specific variables that require modification.']}, {'end': 1855.529, 'start': 1740.706, 'title': 'One hot encoding basics', 'summary': "Explains the basics of one hot encoding using the get dummies function from pandas, demonstrating the transformation of a 'marriage' column into three separate columns containing binary values.", 'duration': 114.823, 'highlights': ["The get dummies function from pandas is used to transform the 'marriage' column into three separate columns, each containing binary values, demonstrating the basics of one hot encoding.", "The original 'marriage' column containing three values is transformed into three separate columns, each containing a 0 or 1 based on the original values, providing a practical demonstration of the one hot encoding process."]}], 'duration': 281.712, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8A7L0GsBiLQ/pics/8A7L0GsBiLQ1573817.jpg', 'highlights': ['One-hot encoding converts categorical data into binary values for scikit-learn SVM compatibility.', 'Treating categorical data as continuous can misrepresent similarity between categories.', 'Identifying and modifying categorical variables like sex, education, marriage, and pay for SVM compatibility.', "Pandas' get dummies function transforms 'marriage' column into three separate binary value columns.", "Practical demonstration of one-hot encoding using get dummies function on 'marriage' column."]}, {'end': 2349.393, 'segs': [{'end': 1913.713, 'src': 'embed', 'start': 1886.066, 'weight': 2, 'content': [{'end': 1891.288, 'text': 'The last part of formatting the data for a support vector machine is to center and scale the data.', 'start': 1886.066, 'duration': 5.222}, {'end': 1898.981, 'text': "The radial basis function that we're gonna use assumes that the data are centered and scaled.", 'start': 1893.356, 'duration': 5.625}, {'end': 1904.345, 'text': 'So in other words, each column should have a mean of zero and a standard deviation of one.', 'start': 1899.481, 'duration': 4.864}, {'end': 1913.713, 'text': "So what we're gonna do is, first what we're gonna do is we're gonna split the data into training and test data sets.", 'start': 1905.746, 'duration': 7.967}], 'summary': 'Data for support vector machine: center, scale, split into training and test sets.', 'duration': 27.647, 'max_score': 1886.066, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8A7L0GsBiLQ/pics/8A7L0GsBiLQ1886066.jpg'}, {'end': 1990.578, 'src': 'heatmap', 'start': 1932.628, 'weight': 0.772, 'content': [{'end': 1936.692, 'text': "And we're creating X train, X test, Y train, and Y test.", 'start': 1932.628, 'duration': 4.064}, {'end': 1945.198, 'text': 'And I believe the default setting is for 70% of the data to go into the training data set and 30% to go into testing.', 'start': 1936.732, 'duration': 8.466}, {'end': 1947.941, 'text': "Don't quote me on that, I just believe that's the case.", 'start': 1945.539, 'duration': 2.402}, {'end': 1955.049, 'text': 'After we do that, we are scaling the data sets using the scale function.', 'start': 1948.761, 'duration': 6.288}, {'end': 1962.478, 'text': 'Moving along.', 'start': 1961.657, 'duration': 0.821}, {'end': 1970.153, 'text': "Now we're going to talk about building a preliminary support vector machine.", 'start': 1966.232, 'duration': 3.921}, {'end': 1972.994, 'text': "We've done 
a lot of stuff.", 'start': 1971.753, 'duration': 1.241}, {'end': 1976.054, 'text': "We've almost spent a whole hour just formatting data.", 'start': 1973.214, 'duration': 2.84}, {'end': 1978.675, 'text': "And now we're finally getting to the good part.", 'start': 1976.834, 'duration': 1.841}, {'end': 1983.376, 'text': 'The way we do that is we call SVC for support vector classifier.', 'start': 1979.335, 'duration': 4.041}, {'end': 1985.356, 'text': 'We set the random state to 42.', 'start': 1983.416, 'duration': 1.94}, {'end': 1990.578, 'text': 'And what this does is it kind of makes an untrained shell of a support vector classifier.', 'start': 1985.356, 'duration': 5.222}], 'summary': 'Data split 70% train, 30% test. building preliminary support vector machine.', 'duration': 57.95, 'max_score': 1932.628, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8A7L0GsBiLQ/pics/8A7L0GsBiLQ1932628.jpg'}, {'end': 2133.231, 'src': 'embed', 'start': 2101.361, 'weight': 1, 'content': [{'end': 2112.668, 'text': "And when we optimize a support vector machine, it's all about finding the best value for gamma and potentially the regularization parameter C.", 'start': 2101.361, 'duration': 11.307}, {'end': 2119.692, 'text': "So what we're going to do is when we use grid search CV, we specify the parameters that we want to try in a matrix.", 'start': 2112.668, 'duration': 7.024}, {'end': 2133.231, 'text': "And so we've got, this is the parameter C, which is the regularization parameter, and we're going to try these values.", 'start': 2124.403, 'duration': 8.828}], 'summary': 'Optimizing a support vector machine involves finding the best values for gamma and regularization parameter c through grid search cv.', 'duration': 31.87, 'max_score': 2101.361, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8A7L0GsBiLQ/pics/8A7L0GsBiLQ2101361.jpg'}, {'end': 2294.315, 'src': 'embed', 'start': 2260.187, 'weight': 0, 'content': [{'end': 2268.427, 'text': "And we can see that the ideal value for C, because that's alliteration, is 100.", 'start': 2260.187, 'duration': 8.24}, {'end': 2270.467, 'text': 'which means that we will use regularization.', 'start': 2268.427, 'duration': 2.04}, {'end': 2277.589, 'text': 'And the ideal value for gamma is 0.001.', 'start': 2270.567, 'duration': 7.022}, {'end': 2282.51, 'text': "So now we're ready to build, evaluate, draw, and interpret the final support vector machine.", 'start': 2277.589, 'duration': 4.921}, {'end': 2285.03, 'text': "So now we're doing exactly what we did before.", 'start': 2282.69, 'duration': 2.34}, {'end': 2294.315, 'text': "However, this time we're specifying c equals 100 and gamma equals 0.001.", 'start': 2285.97, 'duration': 8.345}], 'summary': 'The ideal values for c and gamma are 100 and 0.001 respectively, for building a support vector machine.', 'duration': 34.128, 'max_score': 2260.187, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8A7L0GsBiLQ/pics/8A7L0GsBiLQ2260187.jpg'}], 'start': 1856.77, 'title': 'Support vector machines', 'summary': 'Covers formatting categorical data, centering and scaling, splitting data, building a preliminary support vector machine, evaluating its performance, and optimizing parameters using grid search cross-validation. 
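A consolidated sketch of the model-building chapters: split and scale the data, fit a preliminary SVC, inspect a confusion matrix, then search for better C and gamma with GridSearchCV. The parameter grid below is a plausible reconstruction, not necessarily the exact grid from the video:

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import scale
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

# Split into training and test sets. Note: sklearn's default holds
# out 25% for testing, not the 30% guessed at in the transcript.
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, random_state=42)

# The radial basis function assumes centered and scaled data:
# each column with mean 0 and standard deviation 1.
X_train_scaled = scale(X_train)
X_test_scaled = scale(X_test)

# Preliminary support vector classifier with default parameters.
clf_svm = SVC(random_state=42)
clf_svm.fit(X_train_scaled, y_train)
print(confusion_matrix(y_test, clf_svm.predict(X_test_scaled)))

# Optimize C (regularization) and gamma with cross-validation.
param_grid = [{'C': [0.5, 1, 10, 100],
               'gamma': ['scale', 1, 0.1, 0.01, 0.001, 0.0001],
               'kernel': ['rbf']}]
optimal_params = GridSearchCV(SVC(), param_grid, cv=5,
                              scoring='accuracy')
optimal_params.fit(X_train_scaled, y_train)
print(optimal_params.best_params_)  # the video reports C=100, gamma=0.001

# Final SVM with the optimized parameters.
clf_svm = SVC(random_state=42, C=100, gamma=0.001)
clf_svm.fit(X_train_scaled, y_train)
```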
it details the process of optimizing a support vector machine using gridsearchcv to find the optimal parameters, resulting in a slight improvement in classification accuracy, with four more people correctly classified as not defaulting and one person incorrectly classified as defaulting.', 'chapters': [{'end': 2158.181, 'start': 1856.77, 'title': 'Support vector machine data formatting and preliminary model building', 'summary': 'Covers formatting categorical data, centering and scaling, splitting data into training and test sets, building a preliminary support vector machine, evaluating its performance with a confusion matrix, and optimizing parameters using grid search cross-validation.', 'duration': 301.411, 'highlights': ["The support vector machine was not awesome with 79% correctly classified for non-default and 61% correctly classified for default cases. The support vector machine's performance was evaluated with a confusion matrix, showing 79% correctly classified for non-default and 61% correctly classified for default cases.", 'Formatting categorical data, centering and scaling, and splitting data into training and test sets were essential steps in preparing the data for the support vector machine. Key steps included formatting categorical data, centering and scaling, and splitting the data into training and test sets, crucial for preparing the data for the support vector machine.', 'Grid search cross-validation was used to optimize parameters, focusing on finding the best values for gamma and the regularization parameter C. Grid search cross-validation was employed to optimize parameters, particularly focusing on finding the best values for gamma and the regularization parameter C.']}, {'end': 2349.393, 'start': 2160.564, 'title': 'Optimizing support vector machines', 'summary': 'Details the process of optimizing a support vector machine using gridsearchcv to find the optimal parameters, resulting in a slight improvement in classification accuracy, with four more people correctly classified as not defaulting and one person incorrectly classified as defaulting.', 'duration': 188.829, 'highlights': ['The ideal value for C is 100, and the ideal value for gamma is 0.001, resulting in a slight improvement in classification accuracy, with four more people correctly classified as not defaulting and one person incorrectly classified as defaulting.', 'The process involves running GridSearchCV to find the optimal parameters for the support vector machine, which includes the number of folds for cross-validation and the scoring metric.', 'Despite downsizing the dataset from 30,000 rows to 2,000, the cross-validation process still took a considerable amount of time but resulted in a slight improvement in classification accuracy.', 'Support vector machines tend to perform well out of the box, and even after optimization, the improvement in classification accuracy was marginal, indicating the initial effectiveness of the algorithm.']}], 'duration': 492.623, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8A7L0GsBiLQ/pics/8A7L0GsBiLQ1856770.jpg', 'highlights': ['The ideal value for C is 100, and the ideal value for gamma is 0.001, resulting in a slight improvement in classification accuracy, with four more people correctly classified as not defaulting and one person incorrectly classified as defaulting.', 'Grid search cross-validation was used to optimize parameters, focusing on finding the best values for gamma and the regularization parameter C.', 'Formatting 
categorical data, centering and scaling, and splitting data into training and test sets were essential steps in preparing the data for the support vector machine.']}, {'end': 2688.057, 'segs': [{'end': 2462.585, 'src': 'heatmap', 'start': 2419.273, 'weight': 0, 'content': [{'end': 2423.834, 'text': "And if you don't know what principal component analysis is right now,", 'start': 2419.273, 'duration': 4.561}, {'end': 2428.736, 'text': "just know that what we're doing is we're taking those 24 columns and we're going to shrink them down to two.", 'start': 2423.834, 'duration': 4.902}, {'end': 2433.467, 'text': 'This is the code for doing that.', 'start': 2431.386, 'duration': 2.081}, {'end': 2436.828, 'text': "We're also going to plot what's called a scree plot.", 'start': 2433.547, 'duration': 3.281}, {'end': 2444.99, 'text': 'The scree plot tells us how good this approximation of the true classifier is.', 'start': 2438.508, 'duration': 6.482}, {'end': 2458.161, 'text': 'What we would like is for the first two principal components, the first two columns here, to be much taller than all the rest.', 'start': 2450.034, 'duration': 8.127}, {'end': 2462.585, 'text': 'And that means that those first two components can be,', 'start': 2458.561, 'duration': 4.024}], 'summary': 'Applying principal component analysis to reduce 24 columns to 2 and evaluating with a scree plot for classifier approximation.', 'duration': 57.316, 'max_score': 2419.273, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8A7L0GsBiLQ/pics/8A7L0GsBiLQ2419273.jpg'}, {'end': 2543.838, 'src': 'embed', 'start': 2485.271, 'weight': 1, 'content': [{'end': 2489.715, 'text': 'And that tells us that this approximation is not going to be great.', 'start': 2485.271, 'duration': 4.444}, {'end': 2495.441, 'text': "That said, I'm also including in this.", 'start': 2491.837, 'duration': 3.604}, {'end': 2508.395, 'text': "in the email I'm gonna send out, you're gonna be able to download how to do support vector machines with this heart disease dataset.", 'start': 2498.607, 'duration': 9.788}, {'end': 2514.74, 'text': 'And in that case, the image is better and actually classification in general is better.', 'start': 2508.715, 'duration': 6.025}, {'end': 2517.602, 'text': 'So make sure you run through both of these,', 'start': 2515.04, 'duration': 2.562}, {'end': 2525.908, 'text': "because you'll get different results and you'll see that how to decide when it's gonna be a good sort of collapsing of data and when it's not.", 'start': 2517.602, 'duration': 8.306}, {'end': 2531.942, 'text': 'The next data is pretty complicated, next code chunk.', 'start': 2527.615, 'duration': 4.327}, {'end': 2540.375, 'text': "However, just know that what we're doing is we're retraining and re-optimizing a support vector machine on just those two columns that we collapse the data down to.", 'start': 2531.982, 'duration': 8.393}, {'end': 2543.838, 'text': 'We run that.', 'start': 2543.257, 'duration': 0.581}], 'summary': 'Comparing support vector machine performance on heart disease dataset, highlighting the importance of data collapsing for better classification results.', 'duration': 58.567, 'max_score': 2485.271, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8A7L0GsBiLQ/pics/8A7L0GsBiLQ2485271.jpg'}, {'end': 2610.217, 'src': 'heatmap', 'start': 2551.548, 'weight': 0.75, 'content': [{'end': 2560.119, 'text': "Yes, we are recording this as a video and I'll put it online and you'll be able to access this 
later on at your leisure and I'll email you the link.", 'start': 2551.548, 'duration': 8.571}, {'end': 2562.595, 'text': "Okay, so it's done running.", 'start': 2561.514, 'duration': 1.081}, {'end': 2564.496, 'text': "We've got optimal parameters.", 'start': 2563.035, 'duration': 1.461}, {'end': 2567.338, 'text': "Notice we've got different optimal parameters than before.", 'start': 2564.596, 'duration': 2.742}, {'end': 2569.96, 'text': "That's because we're actually using a different data set.", 'start': 2567.378, 'duration': 2.582}, {'end': 2573.142, 'text': "Instead of all 24 columns, we're just using two now.", 'start': 2570, 'duration': 3.142}, {'end': 2580.615, 'text': "And that's, like I said, we're approximating what we did with the 24.", 'start': 2574.102, 'duration': 6.513}, {'end': 2583.778, 'text': "Here, this is when we're actually drawing that decision boundary.", 'start': 2580.615, 'duration': 3.163}, {'end': 2586.702, 'text': 'Lots of code, but also very well commented.', 'start': 2583.959, 'duration': 2.743}, {'end': 2588.664, 'text': 'So you can go through this at your leisure.', 'start': 2586.722, 'duration': 1.942}, {'end': 2591.908, 'text': "Right now, we're just going to run it really quickly.", 'start': 2589.945, 'duration': 1.963}, {'end': 2594.989, 'text': "and we're going to look at our picture, and here it is.", 'start': 2592.748, 'duration': 2.241}, {'end': 2605.314, 'text': 'Kind of a mess, but kind of what we expected, because the first two principal components, which form the x- and y-axis of this graph,', 'start': 2595.529, 'duration': 9.785}, {'end': 2610.217, 'text': "don't do a great job capturing, sort of all the variation that's in the data.", 'start': 2605.314, 'duration': 4.903}], 'summary': 'Data analysis results indicate different optimal parameters using a reduced data set.', 'duration': 58.669, 'max_score': 2551.548, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8A7L0GsBiLQ/pics/8A7L0GsBiLQ2551548.jpg'}, {'end': 2637.924, 'src': 'embed', 'start': 2611.217, 'weight': 4, 'content': [{'end': 2615.579, 'text': "And we knew that going into this, so when we get a kind of a messy thing, it's what it is.", 'start': 2611.217, 'duration': 4.362}, {'end': 2617.38, 'text': 'Okay, so bam.', 'start': 2616.3, 'duration': 1.08}, {'end': 2628.377, 'text': "The pink part of this graph represents the decision area where if someone falls in there, we'll classify them as not defaulted.", 'start': 2620.051, 'duration': 8.326}, {'end': 2633.041, 'text': "The yellow part is where we'll classify people as defaulted.", 'start': 2628.918, 'duration': 4.123}, {'end': 2637.924, 'text': 'Red dots are from the training data set that are known to have defaulted.', 'start': 2633.941, 'duration': 3.983}], 'summary': 'Graph indicates decision areas for classifying defaulters, based on red dots in training data set.', 'duration': 26.707, 'max_score': 2611.217, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8A7L0GsBiLQ/pics/8A7L0GsBiLQ2611217.jpg'}], 'start': 2349.913, 'title': 'Dimensionality reduction and support vector machines', 'summary': 'Discusses visualizing high-dimensional data using principal component analysis to reduce dimensions to 2 and evaluates approximation quality. 
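The drawing chapter collapses the 24 one-hot-encoded columns to two principal components so the decision boundary can be plotted, and uses a scree plot to judge how rough that 2-D approximation is. A sketch of the PCA step (the notebook's full boundary-drawing code is considerably longer):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Fit PCA on the scaled training data and transform it.
pca = PCA()
X_train_pca = pca.fit_transform(X_train_scaled)

# Scree plot: percent of total variation captured per component.
# Ideally the first two bars tower over the rest; here they don't,
# which is why the 2-D picture is only a rough approximation.
per_var = np.round(pca.explained_variance_ratio_ * 100, decimals=1)
labels = [str(i) for i in range(1, len(per_var) + 1)]
plt.bar(x=range(1, len(per_var) + 1), height=per_var, tick_label=labels)
plt.ylabel('Percentage of Explained Variance')
plt.xlabel('Principal Component')
plt.title('Scree Plot')
plt.show()

# Keep only the first two components for the drawable 2-D SVM.
pc1, pc2 = X_train_pca[:, 0], X_train_pca[:, 1]
```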
To recap the two parts of this segment: the first discussed the challenge of visualizing high-dimensional data (the data set contains 24 columns, which cannot be plotted directly), introduced principal component analysis to reduce those 24 dimensions down to 2, and evaluated the quality of the approximation with a scree plot, where we want the first two principal components to stand much taller than all the rest. The second covered retraining and re-optimizing the support vector machine on the two-component approximation (which yields different optimal parameters, since it is effectively a different data set), and then drawing and interpreting the decision boundary graph for defaulted versus not-defaulted; the companion heart disease data set, available as a separate download, gives different and better results, which is useful for seeing when this kind of dimensionality reduction works well. Earlier in the lesson, handling missing data, downsampling the data, and one-hot encoding were the key steps in formatting the data for the support vector machine; a rough sketch of those formatting steps appears below.
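As promised, here is a rough sketch of those formatting steps. The data frame name df, the column names, and the 1,000-samples-per-class figure are assumptions for illustration, not taken from the notebook:

```python
import pandas as pd
from sklearn.utils import resample

# Split the full data frame (assumed name: df) by class, then
# downsample each class to the same manageable size (1,000 here
# is illustrative, not necessarily the notebook's exact figure).
df_no_default = df[df['DEFAULT'] == 0]
df_default = df[df['DEFAULT'] == 1]

df_no_default_down = resample(df_no_default, replace=False,
                              n_samples=1000, random_state=42)
df_default_down = resample(df_default, replace=False,
                           n_samples=1000, random_state=42)
df_downsampled = pd.concat([df_no_default_down, df_default_down])

# One-hot encode the categorical columns (assumed names) so the
# support vector machine only ever sees numeric input.
X = df_downsampled.drop('DEFAULT', axis=1).copy()
y = df_downsampled['DEFAULT'].copy()
X_encoded = pd.get_dummies(X, columns=['SEX', 'EDUCATION', 'MARRIAGE'])
```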
In terms of results: the ideal value for C is 100 and the ideal value for gamma is 0.001, which gives a slight improvement in classification accuracy, with four more people correctly classified as not defaulting and one more person incorrectly classified as defaulting. Overall, the lesson demonstrated the practical steps for using support vector machines in Python for classification, with scikit-learn and the radial basis function, to predict whether someone will default on their credit card payments based on their sex, age, and a variety of other metrics. One early check worth remembering: the absence of NA values in the data frame suggested there was no missing data, a positive indicator of data quality.
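For completeness, a hedged sketch of fitting the final model with those ideal values and checking it with a confusion matrix. X_test_scaled and y_test are assumed names for the held-out data, and ConfusionMatrixDisplay is the current scikit-learn API; the notebook, recorded in 2020, may use an older plotting helper:

```python
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.metrics import ConfusionMatrixDisplay

# Final RBF support vector machine with the ideal values found
# by cross validation: C=100 and gamma=0.001.
clf_svm = SVC(C=100, gamma=0.001)
clf_svm.fit(X_train_scaled, y_train)  # assumed names for the scaled training data

# The confusion matrix on the test set shows how many people were
# correctly and incorrectly classified as defaulting.
ConfusionMatrixDisplay.from_estimator(
    clf_svm, X_test_scaled, y_test,
    display_labels=['Did not default', 'Defaulted'])
plt.show()
```

Comparing this matrix against the one from the preliminary, un-tuned SVM is what reveals the small gain reported above: a few more correct "did not default" calls at the cost of one extra false "defaulted" call.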