title
How do I encode categorical features using scikit-learn?

description
In order to include categorical features in your Machine Learning model, you have to encode them numerically using "dummy" or "one-hot" encoding. But how do you do this correctly using scikit-learn? In this video, you'll learn how to use OneHotEncoder and ColumnTransformer to encode your categorical features and prepare your feature matrix in a single step. You'll also learn how to include this step within a Pipeline so that you can cross-validate your model and preprocessing steps simultaneously. Finally, you'll learn why you should use scikit-learn (rather than pandas) for preprocessing your dataset.

AGENDA:
0:00 Introduction
0:22 Why should you use a Pipeline?
2:30 Preview of the lesson
3:35 Loading and preparing a dataset
6:11 Cross-validating a simple model
10:00 Encoding categorical features with OneHotEncoder
15:01 Selecting columns for preprocessing with ColumnTransformer
19:00 Creating a two-step Pipeline
19:54 Cross-validating a Pipeline
21:44 Making predictions on new data
23:43 Recap of the lesson
24:50 Why should you use scikit-learn (rather than pandas) for preprocessing?

CODE FROM THIS VIDEO: https://github.com/justmarkham/scikit-learn-videos/blob/master/10_categorical_features.ipynb

WANT TO JOIN MY NEXT LIVE WEBCAST? Become a member ($5/month): https://www.patreon.com/dataschool

=== RELATED RESOURCES ===
OneHotEncoder documentation: https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features
ColumnTransformer documentation: https://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data
Pipeline documentation: https://scikit-learn.org/stable/modules/compose.html#pipeline
My video on cross-validation: https://www.youtube.com/watch?v=6dbrR-WymjI&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=7
My video on grid search: https://www.youtube.com/watch?v=Gol_qOgRqfA&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=8
My lesson notebook on StandardScaler: https://nbviewer.jupyter.org/github/justmarkham/DAT8/blob/master/notebooks/19_advanced_sklearn.ipynb

=== WANT TO GET BETTER AT MACHINE LEARNING? ===
1) WATCH my scikit-learn video series: https://www.youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A
2) SUBSCRIBE for more videos: https://www.youtube.com/dataschool?sub_confirmation=1
3) ENROLL in my Machine Learning course: https://www.dataschool.io/learn/
4) LET'S CONNECT!
- Newsletter: https://www.dataschool.io/subscribe/
- Twitter: https://twitter.com/justmarkham
- Facebook: https://www.facebook.com/DataScienceSchool/
- LinkedIn: https://www.linkedin.com/in/justmarkham/
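The workflow described above (OneHotEncoder + ColumnTransformer inside a Pipeline, cross-validated as one unit) can be sketched in a few lines. This is a minimal sketch, not the notebook's exact code: it uses a tiny inline stand-in for the Titanic dataset (the example rows are made up; the real lesson reads the full CSV), and `handle_unknown='ignore'` is added so small cross-validation folds don't fail on a category they didn't see during fit.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny stand-in for the Titanic dataset used in the video
# (same columns: Survived, Pclass, Sex, Embarked).
df = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1, 0, 0, 1],
    'Pclass':   [3, 1, 3, 2, 1, 3, 2, 1],
    'Sex':      ['male', 'female', 'female', 'male',
                 'female', 'male', 'male', 'female'],
    'Embarked': ['S', 'C', 'S', 'Q', 'S', 'C', 'S', 'Q'],
})
X = df.drop('Survived', axis='columns')
y = df['Survived']

# One-hot encode the two categorical columns; pass Pclass through
# unchanged, since the lesson treats it as numeric.
ct = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), ['Sex', 'Embarked']),
    remainder='passthrough',
)

# Chain preprocessing and model so they are cross-validated together:
# each fold is encoded using only that fold's training data.
pipe = make_pipeline(ct, LogisticRegression(solver='lbfgs'))
scores = cross_val_score(pipe, X, y, cv=2, scoring='accuracy')
print(scores.mean())

# Fit on all rows, then predict for "new" passengers.
pipe.fit(X, y)
X_new = X.sample(3, random_state=0)
print(pipe.predict(X_new))
```

With the real dataset you would use `cv=5` as in the video; `cv=2` here is only to suit the eight illustrative rows.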

detail
{'title': 'How do I encode categorical features using scikit-learn?', 'heatmap': [{'end': 405.29, 'start': 363.874, 'weight': 1}, {'end': 460.862, 'start': 433.737, 'weight': 0.713}, {'end': 924.71, 'start': 904.163, 'weight': 0.719}, {'end': 1093.444, 'start': 1065.147, 'weight': 0.718}, {'end': 1397.775, 'start': 1350.895, 'weight': 0.711}, {'end': 1638.059, 'start': 1593.879, 'weight': 0.717}], 'summary': "Explores the use of scikit-learn pipeline for cross-validation and grid search, discusses the significance of scikit-learn 0.20 for column transformer and one hot encoder, introduces the titanic dataset with 891 rows and 12 columns, and demonstrates model accuracy improvement from 0.67 to 0.77 through one hot encoding using scikit-learn's one hot encoder.", 'chapters': [{'end': 163.095, 'segs': [{'end': 33.082, 'src': 'embed', 'start': 0.329, 'weight': 2, 'content': [{'end': 3.47, 'text': 'Next question from Vishwanatha.', 'start': 0.329, 'duration': 3.141}, {'end': 12.493, 'text': 'I was wondering if you could help me to understand the process of building a pipeline using the scikit-learn pipeline module.', 'start': 3.49, 'duration': 9.003}, {'end': 19.115, 'text': 'It would be great if you could use Scalar and one hot encoder as part of the tutorial as well.', 'start': 12.993, 'duration': 6.122}, {'end': 21.576, 'text': 'This is a great question.', 'start': 19.755, 'duration': 1.821}, {'end': 33.082, 'text': 'So what is the point of pipeline? 
The point of pipeline is to chain steps together sequentially.', 'start': 22.096, 'duration': 10.986}], 'summary': 'Building a scikit-learn pipeline involves chaining steps like scalar and one hot encoder sequentially.', 'duration': 32.753, 'max_score': 0.329, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/irHhDMbw3xo/pics/irHhDMbw3xo329.jpg'}, {'end': 113.966, 'src': 'embed', 'start': 84.597, 'weight': 0, 'content': [{'end': 93.843, 'text': 'So pipeline, generally speaking, is useful because you can cross-validate a process that includes preprocessing as well as model building.', 'start': 84.597, 'duration': 9.246}, {'end': 113.966, 'text': 'The other reason pipeline is useful is because you can do a grid search or a randomized search of a pipeline which allows you to do a grid search or randomized search of both tuning parameters for model and the preprocessing steps.', 'start': 94.443, 'duration': 19.523}], 'summary': 'Pipelines enable cross-validation and grid/randomized search of preprocessing and model tuning parameters.', 'duration': 29.369, 'max_score': 84.597, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/irHhDMbw3xo/pics/irHhDMbw3xo84597.jpg'}], 'start': 0.329, 'title': 'Building a pipeline with scikit-learn', 'summary': 'Emphasizes the significance of using the scikit-learn pipeline for cross-validation and grid search, providing a clear understanding of its importance in preprocessing and model building.', 'chapters': [{'end': 163.095, 'start': 0.329, 'title': 'Building a pipeline with scikit-learn', 'summary': 'Discusses the importance of using the scikit-learn pipeline, emphasizing its role in cross-validation and grid search for preprocessing and model building, providing a clear understanding of its significance.', 'duration': 162.766, 'highlights': ['The pipeline in scikit-learn is crucial for cross-validating a process that includes preprocessing and model building, rather than just a 
model, allowing for more accurate results in cross-validation.', 'Using a pipeline enables the capability to perform grid search or randomized search of both tuning parameters for the model and the preprocessing steps, providing the flexibility to search parameters for preprocessing in combination with the model.', 'The tutorial aims to provide a clear understanding of the process of building a pipeline using the scikit-learn pipeline module, including the use of Scalar and one hot encoder, to demonstrate its practical application.']}], 'duration': 162.766, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/irHhDMbw3xo/pics/irHhDMbw3xo329.jpg', 'highlights': ['Using a pipeline enables the capability to perform grid search or randomized search of both tuning parameters for the model and the preprocessing steps, providing the flexibility to search parameters for preprocessing in combination with the model.', 'The pipeline in scikit-learn is crucial for cross-validating a process that includes preprocessing and model building, rather than just a model, allowing for more accurate results in cross-validation.', 'The tutorial aims to provide a clear understanding of the process of building a pipeline using the scikit-learn pipeline module, including the use of Scalar and one hot encoder, to demonstrate its practical application.']}, {'end': 621.31, 'segs': [{'end': 219.562, 'src': 'embed', 'start': 184.704, 'weight': 0, 'content': [{'end': 185.665, 'text': "So I'm going to teach you that.", 'start': 184.704, 'duration': 0.961}, {'end': 191.577, 'text': "The other thing that's really important is I'm going to be using scikit-learn 0.20.", 'start': 186.465, 'duration': 5.112}, {'end': 195.7, 'text': "It's probably 0.20, 0.2.", 'start': 191.577, 'duration': 4.123}, {'end': 198.543, 'text': "If you're running scikit-learn previous to 0.20,", 'start': 195.701, 'duration': 2.842}, {'end': 205.37, 'text': "you're not going to have column transformer 
and one hot encoder is going to work slightly differently.", 'start': 198.543, 'duration': 6.827}, {'end': 211.255, 'text': "So you won't be able to reuse this code unless you're using at least 0.20 in scikit-learn.", 'start': 205.97, 'duration': 5.285}, {'end': 219.562, 'text': 'Okay With all of that being said, let me go over to my empty notebook.', 'start': 211.275, 'duration': 8.287}], 'summary': 'Teaching with scikit-learn 0.20, column transformer, and one hot encoder.', 'duration': 34.858, 'max_score': 184.704, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/irHhDMbw3xo/pics/irHhDMbw3xo184704.jpg'}, {'end': 324.792, 'src': 'embed', 'start': 297.958, 'weight': 1, 'content': [{'end': 306.248, 'text': "I'm going to select the survived column, which is our target, the P class column, the sex column.", 'start': 297.958, 'duration': 8.29}, {'end': 309.629, 'text': 'column and the embarked column.', 'start': 306.708, 'duration': 2.921}, {'end': 317.991, 'text': "If you've never heard of the Titanic data set, you're predicting whether passengers on the Titanic survived or did not survive.", 'start': 310.949, 'duration': 7.042}, {'end': 319.631, 'text': 'So survived is the target.', 'start': 318.391, 'duration': 1.24}, {'end': 321.591, 'text': "And then we're going to use these three features.", 'start': 319.971, 'duration': 1.62}, {'end': 323.212, 'text': 'P class is passenger class.', 'start': 321.671, 'duration': 1.541}, {'end': 324.792, 'text': 'Sex is male or female.', 'start': 323.612, 'duration': 1.18}], 'summary': 'Predicting titanic survival using p class, sex, and embarked features.', 'duration': 26.834, 'max_score': 297.958, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/irHhDMbw3xo/pics/irHhDMbw3xo297958.jpg'}, {'end': 405.29, 'src': 'heatmap', 'start': 363.874, 'weight': 1, 'content': [{'end': 369.838, 'text': 'Here are my four columns, okay? 
And there are now no null values in a.sum.', 'start': 363.874, 'duration': 5.964}, {'end': 374.321, 'text': "Let's just take a look at the head of the data frame.", 'start': 371.899, 'duration': 2.422}, {'end': 376.122, 'text': 'And here we are.', 'start': 374.341, 'duration': 1.781}, {'end': 383.103, 'text': 'And once again, let me just say, Survived is our target, passenger class.', 'start': 377.042, 'duration': 6.061}, {'end': 391.165, 'text': "It's technically a categorical variable, but for reasons I'm not gonna explain, we're treating it as a numeric variable,", 'start': 383.663, 'duration': 7.502}, {'end': 392.626, 'text': "and that's actually the best approach.", 'start': 391.165, 'duration': 1.461}, {'end': 395.927, 'text': "And then I've got two categorical variables.", 'start': 393.486, 'duration': 2.441}, {'end': 405.29, 'text': "okay?. Now I'm gonna start by cross-validating a model that predicts survived using only P class.", 'start': 395.927, 'duration': 9.363}], 'summary': 'Data analysis includes 4 columns with no null values, with a focus on predicting survival using passenger class.', 'duration': 41.416, 'max_score': 363.874, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/irHhDMbw3xo/pics/irHhDMbw3xo363874.jpg'}, {'end': 460.862, 'src': 'heatmap', 'start': 433.737, 'weight': 0.713, 'content': [{'end': 437.298, 'text': 'Okay And X.shape.', 'start': 433.737, 'duration': 3.561}, {'end': 441.023, 'text': 'is 889 by one.', 'start': 438.741, 'duration': 2.282}, {'end': 444.026, 'text': 'y dot shape is 889 by nothing.', 'start': 441.023, 'duration': 3.003}, {'end': 449.771, 'text': 'um, just remember, even if you have only one feature in your x in your training.', 'start': 444.026, 'duration': 5.745}, {'end': 453.214, 'text': 'uh, matrix, it needs to be two-dimensional.', 'start': 449.771, 'duration': 3.443}, {'end': 460.862, 'text': "okay, it can't be one-dimensional for reasons that take a while to explain, but That is on 
purpose.", 'start': 453.214, 'duration': 7.648}], 'summary': 'X.shape: 889x1, y.shape: 889x0, training matrix must be 2d', 'duration': 27.125, 'max_score': 433.737, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/irHhDMbw3xo/pics/irHhDMbw3xo433737.jpg'}], 'start': 163.595, 'title': 'Scikit-learn 0.20 and titanic dataset', 'summary': 'Discusses the significance of using scikit-learn 0.20 for column transformer and one hot encoder, highlighting the impact of version differences. additionally, it introduces the titanic dataset with 891 rows and 12 columns, emphasizing feature selection and cross-validation of a logistic regression model with 67.8% mean accuracy, outperforming the null accuracy of 61%.', 'chapters': [{'end': 211.255, 'start': 163.595, 'title': 'Pipeline and column transformer in scikit-learn 0.20', 'summary': 'Discusses the importance of using scikit-learn 0.20 for column transformer and one hot encoder, emphasizing that using a version prior to 0.20 will lead to differences in functionality and code reusability.', 'duration': 47.66, 'highlights': ['Column transformer and one hot encoder are important in scikit-learn 0.20, as using a version prior to 0.20 will result in differences in functionality and code reusability.', 'The chapter emphasizes the necessity of using scikit-learn 0.20 for column transformer and one hot encoder, as versions prior to 0.20 will not support these functionalities.', 'Using scikit-learn versions prior to 0.20 will lead to differences in functionality and code reusability for column transformer and one hot encoder.']}, {'end': 621.31, 'start': 211.275, 'title': 'Data analysis: titanic dataset', 'summary': 'Introduces the titanic dataset with 891 rows and 12 columns, focusing on feature selection and cross-validation of a logistic regression model with 67.8% mean accuracy, outperforming the null accuracy of 61%.', 'duration': 410.035, 'highlights': ['Introducing the Titanic dataset with 891 rows 
and 12 columns, focusing on feature selection and cross-validation of a logistic regression model with 67.8% mean accuracy. The dataset consists of 891 rows and 12 columns, and the logistic regression model achieved a 67.8% mean accuracy.', 'Explaining the concept of null accuracy and the goal to outperform it, which stands at 61%. The null accuracy, representing the accuracy achieved by predicting the most frequent class, is 61%, and the goal is to outperform this accuracy.', "Demonstrating the process of selecting features for machine learning purposes, including the choice of 'survived', 'P class', 'sex', and 'embarked' columns. Feature selection process involves choosing 'survived', 'P class', 'sex', and 'embarked' columns for machine learning purposes."]}], 'duration': 457.715, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/irHhDMbw3xo/pics/irHhDMbw3xo163595.jpg', 'highlights': ['Using scikit-learn 0.20 for column transformer and one hot encoder is crucial due to differences in functionality and code reusability with prior versions.', 'The Titanic dataset comprises 891 rows and 12 columns, with a focus on feature selection and cross-validation of a logistic regression model achieving 67.8% mean accuracy, outperforming the 61% null accuracy.']}, {'end': 1256.117, 'segs': [{'end': 742.894, 'src': 'embed', 'start': 622.697, 'weight': 1, 'content': [{'end': 630.458, 'text': 'The answer is pipeline, but first we have to talk about encoding the sex column and the embarked column.', 'start': 622.697, 'duration': 7.761}, {'end': 637.332, 'text': "okay?. So let's go back and show the head of the data frame.", 'start': 630.458, 'duration': 6.874}, {'end': 654.364, 'text': 'okay?. For encoding categorical features if they are unordered, usually the best approach is called dummy encoding, which is also known as one hot encoding.', 'start': 637.332, 'duration': 17.032}, {'end': 657.626, 'text': 'okay?. 
Scikit-learn calls it one hot encoding.', 'start': 654.364, 'duration': 3.262}, {'end': 660.408, 'text': 'Pandas calls it dummy encoding.', 'start': 658.646, 'duration': 1.762}, {'end': 662.149, 'text': "It's the same thing, okay?", 'start': 660.868, 'duration': 1.281}, {'end': 670.691, 'text': "Now we're going to do it in scikit-learn, and there's a bunch of reasons for this that I will explain at the end of this kind of lesson, all right?", 'start': 663.169, 'duration': 7.522}, {'end': 674.312, 'text': "So if you want to use dummy encoder, here's how you do it.", 'start': 671.211, 'duration': 3.101}, {'end': 683.494, 'text': "From sklearn.preprocessing, import one hot encoder, okay? It's a funny name, but it has a reason for it.", 'start': 674.752, 'duration': 8.742}, {'end': 685.234, 'text': "But I'll save that for another time.", 'start': 683.914, 'duration': 1.32}, {'end': 690.936, 'text': 'Then you instantiate a one-hot encoder just like you instantiate a model.', 'start': 686.335, 'duration': 4.601}, {'end': 693.437, 'text': 'You make an instance of it.', 'start': 691.416, 'duration': 2.021}, {'end': 697.138, 'text': "Here's my one-hot encoder.", 'start': 693.457, 'duration': 3.681}, {'end': 702.059, 'text': 'For teaching purposes, I need to make it not sparse.', 'start': 698.518, 'duration': 3.541}, {'end': 703.059, 'text': 'I need to make it dense.', 'start': 702.119, 'duration': 0.94}, {'end': 704.06, 'text': "Don't worry about it.", 'start': 703.179, 'duration': 0.881}, {'end': 707.961, 'text': 'You never have to write that in the real world.', 'start': 704.8, 'duration': 3.161}, {'end': 721.375, 'text': 'One hot encoder, like any scikit-learn transformer, has a fit and a transform method and a fit transform that allows you to do both at the same time.', 'start': 710.913, 'duration': 10.462}, {'end': 732.558, 'text': 'So if I pass it a data frame column, the one hot encoder is going to one hot encode the sex column.', 'start': 722.036, 'duration': 
10.522}, {'end': 735.238, 'text': "Now let's look at these first three rows.", 'start': 733.018, 'duration': 2.22}, {'end': 742.894, 'text': 'What it has done is create a numpy array with two columns.', 'start': 735.318, 'duration': 7.576}], 'summary': 'Teaching one hot encoding using sklearn.preprocessing to encode categorical features for data transformation.', 'duration': 120.197, 'max_score': 622.697, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/irHhDMbw3xo/pics/irHhDMbw3xo622697.jpg'}, {'end': 930.512, 'src': 'heatmap', 'start': 904.163, 'weight': 0.719, 'content': [{'end': 912.566, 'text': "Okay I'm going to define my X as, if I previously, oh, I guess previously I defined X as one feature.", 'start': 904.163, 'duration': 8.403}, {'end': 916.467, 'text': "Okay But now I'm going to define it as three features.", 'start': 912.586, 'duration': 3.881}, {'end': 924.71, 'text': "So I'm actually going to do df.drop survived axis equals columns.", 'start': 916.787, 'duration': 7.923}, {'end': 930.512, 'text': "Okay And you can see that here's my X.", 'start': 924.73, 'duration': 5.782}], 'summary': 'Re-defining x from one feature to three features in the dataframe', 'duration': 26.349, 'max_score': 904.163, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/irHhDMbw3xo/pics/irHhDMbw3xo904163.jpg'}, {'end': 997.989, 'src': 'embed', 'start': 967.194, 'weight': 2, 'content': [{'end': 977.317, 'text': 'You use Column Transformer anytime you have features in your data frame that need different pre-processing.', 'start': 967.194, 'duration': 10.123}, {'end': 979.577, 'text': 'okay?, What do I mean?', 'start': 977.317, 'duration': 2.26}, {'end': 985.321, 'text': 'Well, dummy encoding, or one hot encoding, is a preprocessing step.', 'start': 980.318, 'duration': 5.003}, {'end': 996.448, 'text': "I want to apply it to embarked and sex, but I do not want to apply it to p class, because we're treating that as a numeric 
variable,", 'start': 986.202, 'duration': 10.246}, {'end': 997.989, 'text': 'not a categorical variable.', 'start': 996.448, 'duration': 1.541}], 'summary': 'Use column transformer for different feature pre-processing, e.g., dummy encoding and one hot encoding, excluding specific variables.', 'duration': 30.795, 'max_score': 967.194, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/irHhDMbw3xo/pics/irHhDMbw3xo967194.jpg'}, {'end': 1093.444, 'src': 'heatmap', 'start': 1065.147, 'weight': 0.718, 'content': [{'end': 1074.811, 'text': "So with a column transformer, we'll do a fit transform and we'll pass it our training data and here's what comes out.", 'start': 1065.147, 'duration': 9.664}, {'end': 1086.762, 'text': "And you'll notice what we have is these two columns, the first two columns are the one hot encoded sex.", 'start': 1075.899, 'duration': 10.863}, {'end': 1093.444, 'text': 'The next three columns, these three columns are the one hot encoded embarked.', 'start': 1087.482, 'duration': 5.962}], 'summary': 'Using column transformer to one hot encode sex and embarked columns.', 'duration': 28.297, 'max_score': 1065.147, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/irHhDMbw3xo/pics/irHhDMbw3xo1065147.jpg'}, {'end': 1189.646, 'src': 'embed', 'start': 1160.629, 'weight': 0, 'content': [{'end': 1162.03, 'text': "I'm going to make a pipeline.", 'start': 1160.629, 'duration': 1.401}, {'end': 1172.701, 'text': 'of my column transformer and my logistic regression model, okay? 
So remember, pipeline is for chaining steps together.', 'start': 1163.178, 'duration': 9.523}, {'end': 1176.822, 'text': "So I'm creating a pipeline that does the following things.", 'start': 1173.221, 'duration': 3.601}, {'end': 1180.843, 'text': 'It takes my data that I pass it.', 'start': 1177.302, 'duration': 3.541}, {'end': 1189.646, 'text': 'it transforms the columns, which is my pre-processing steps, and then it builds my model, which is logistic regression okay?', 'start': 1180.843, 'duration': 8.803}], 'summary': 'Creating a pipeline for logistic regression model and column transformation.', 'duration': 29.017, 'max_score': 1160.629, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/irHhDMbw3xo/pics/irHhDMbw3xo1160629.jpg'}], 'start': 622.697, 'title': 'Encoding categorical data and model improvement', 'summary': "Covers encoding categorical features using dummy encoding, one hot encoding in scikit-learn, focusing on the one hot encoder, and the need to make it dense for teaching purposes. 
it also explains the process of one hot encoding using scikit-learn's one hot encoder, demonstrates the use of column transformer and pipeline, resulting in a model accuracy improvement from 0.67 to 0.77.", 'chapters': [{'end': 707.961, 'start': 622.697, 'title': 'Encoding categorical data in machine learning', 'summary': 'Discusses encoding categorical features using dummy encoding, also known as one hot encoding, in scikit-learn, with a focus on the one hot encoder, and the need to make it dense for teaching purposes.', 'duration': 85.264, 'highlights': ['The best approach for encoding unordered categorical features is dummy encoding, also known as one hot encoding, which is used in scikit-learn and Pandas.', "To use the dummy encoder in scikit-learn, the 'one hot encoder' is imported from the 'sklearn.preprocessing' module.", 'In scikit-learn, the one hot encoder needs to be instantiated just like a model, and it can be made dense for teaching purposes.']}, {'end': 1256.117, 'start': 710.913, 'title': 'One hot encoding and column transformer', 'summary': "Explains the process of one hot encoding using scikit-learn's one hot encoder and demonstrates the use of column transformer and pipeline, resulting in a model accuracy improvement from 0.67 to 0.77.", 'duration': 545.204, 'highlights': ['The one hot encoder is used to one hot encode the sex and embarked columns, creating numpy arrays with two and three columns respectively. The one hot encoder is used to one hot encode the sex and embarked columns, creating numpy arrays with two and three columns respectively.', "The process of dummy encoding or one hot encoding is explained, highlighting the improvement in scikit-learn's process and the use of column transformer for different preprocessing needs. 
The process of dummy encoding or one hot encoding is explained, highlighting the improvement in scikit-learn's process and the use of column transformer for different preprocessing needs.", 'The demonstration of using pipeline with column transformer and logistic regression model resulting in an accuracy improvement from 0.67 to 0.77 when evaluated using cross-validation. The demonstration of using pipeline with column transformer and logistic regression model resulting in an accuracy improvement from 0.67 to 0.77 when evaluated using cross-validation.']}], 'duration': 633.42, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/irHhDMbw3xo/pics/irHhDMbw3xo622697.jpg', 'highlights': ['The demonstration of using pipeline with column transformer and logistic regression model resulting in an accuracy improvement from 0.67 to 0.77 when evaluated using cross-validation.', 'The best approach for encoding unordered categorical features is dummy encoding, also known as one hot encoding, which is used in scikit-learn and Pandas.', "The process of dummy encoding or one hot encoding is explained, highlighting the improvement in scikit-learn's process and the use of column transformer for different preprocessing needs.", 'In scikit-learn, the one hot encoder needs to be instantiated just like a model, and it can be made dense for teaching purposes.', "To use the dummy encoder in scikit-learn, the 'one hot encoder' is imported from the 'sklearn.preprocessing' module.", 'The one hot encoder is used to one hot encode the sex and embarked columns, creating numpy arrays with two and three columns respectively.']}, {'end': 1421.244, 'segs': [{'end': 1330.362, 'src': 'embed', 'start': 1276.803, 'weight': 1, 'content': [{'end': 1289.329, 'text': 'In other words, I am not cross-validating a model, I am cross-validating a pipeline of steps that include pre-processing of data and model building.', 'start': 1276.803, 'duration': 12.526}, {'end': 1297.17, 'text': 
'In other words, cross-val score is going to do my split of data, my five-fold split.', 'start': 1290.009, 'duration': 7.161}, {'end': 1303.232, 'text': 'And then after it splits the data, it will then run the pipeline.', 'start': 1297.651, 'duration': 5.581}, {'end': 1311.313, 'text': "The point of cross-validation, remember, is to evaluate your model so that you can decide whether you're building a good model.", 'start': 1304.152, 'duration': 7.161}, {'end': 1313.874, 'text': 'And then you can use it to make predictions on new data.', 'start': 1311.413, 'duration': 2.461}, {'end': 1318.775, 'text': "So let's go ahead and make up some new data to pass to the model.", 'start': 1314.414, 'duration': 4.361}, {'end': 1321.756, 'text': "So I'm going to make something called X new.", 'start': 1320.155, 'duration': 1.601}, {'end': 1330.362, 'text': "And as a kind of lazy way of doing this, I'm going to sample five rows from X.", 'start': 1322.597, 'duration': 7.765}], 'summary': 'Cross-validating a pipeline with pre-processing and model building, using five-fold split to evaluate and make predictions on new data.', 'duration': 53.559, 'max_score': 1276.803, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/irHhDMbw3xo/pics/irHhDMbw3xo1276803.jpg'}, {'end': 1389.508, 'src': 'embed', 'start': 1350.895, 'weight': 0, 'content': [{'end': 1360.739, 'text': "How normally if you've built your model and evaluated it and you want to make predictions, what do you do? 
You do like model.fit.", 'start': 1350.895, 'duration': 9.844}, {'end': 1363, 'text': "Well, I don't have a model.", 'start': 1361.08, 'duration': 1.92}, {'end': 1364.901, 'text': 'I have a pipeline that includes a model.', 'start': 1363.1, 'duration': 1.801}, {'end': 1370.524, 'text': "So I do pipe.fit and I say I'm training it on X and Y.", 'start': 1365.221, 'duration': 5.303}, {'end': 1389.508, 'text': 'Okay And then I do pipe.predict So pipe.fit is like model.fit, except it runs the preprocessing as well as the model fitting.', 'start': 1370.524, 'duration': 18.984}], 'summary': 'To make predictions, use pipe.fit on the pipeline including a model, then use pipe.predict.', 'duration': 38.613, 'max_score': 1350.895, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/irHhDMbw3xo/pics/irHhDMbw3xo1350895.jpg'}, {'end': 1397.775, 'src': 'heatmap', 'start': 1350.895, 'weight': 0.711, 'content': [{'end': 1360.739, 'text': "How normally if you've built your model and evaluated it and you want to make predictions, what do you do? 
You do like model.fit.", 'start': 1350.895, 'duration': 9.844}, {'end': 1363, 'text': "Well, I don't have a model.", 'start': 1361.08, 'duration': 1.92}, {'end': 1364.901, 'text': 'I have a pipeline that includes a model.', 'start': 1363.1, 'duration': 1.801}, {'end': 1370.524, 'text': "So I do pipe.fit and I say I'm training it on X and Y.", 'start': 1365.221, 'duration': 5.303}, {'end': 1389.508, 'text': 'Okay And then I do pipe.predict So pipe.fit is like model.fit, except it runs the preprocessing as well as the model fitting.', 'start': 1370.524, 'duration': 18.984}, {'end': 1397.775, 'text': 'Pipe.predict is just like model.predict, except it runs the preprocessing on X new.', 'start': 1390.209, 'duration': 7.566}], 'summary': 'Using pipeline for model training and prediction, including preprocessing.', 'duration': 46.88, 'max_score': 1350.895, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/irHhDMbw3xo/pics/irHhDMbw3xo1350895.jpg'}], 'start': 1256.637, 'title': 'Cross-validation and pipelines in machine learning', 'summary': 'Delves into cross-validation in machine learning, highlighting the use of pipelines for evaluation and prediction. it showcases a five-fold split and model training.', 'chapters': [{'end': 1421.244, 'start': 1256.637, 'title': 'Cross-validation and pipelines in machine learning', 'summary': 'Explains the concept of cross-validation in machine learning, emphasizing the use of pipelines to evaluate and make predictions on new data, illustrated with a five-fold split and model training.', 'duration': 164.607, 'highlights': ["The purpose of cross-validation is to evaluate the model's performance and make predictions on new data. Cross-validation is used to assess whether the model is good and to make predictions on new data.", "Using pipelines in machine learning allows for preprocessing of data and model fitting in a single step, demonstrated by the 'pipe.fit' and 'pipe.predict' methods. 
Chapter: Recap of data analysis workflow

Transcript: "And I'll summarize it briefly. Here are my imports. Here's where I read in my data frame. I selected my columns. I defined my X and y. Here I made my column transformer, made up of a one-hot encoder and passing through the remaining columns. Here's my model. Here's my pipeline: a column transformer and a model. Here's the cross-validation of the pipeline. Here is building my X_new data frame, here is fitting, and here is making predictions on new data. So this is everything we just did."

Chapter: One hot encoding vs get dummies

Transcript: "Number one: you'll notice that OneHotEncoder does not affect our data frame. Our data frame stays three or four columns, and that's it. And that's easier to explore and easier to manage. Number two: when new data comes in, you don't have to use get_dummies on it. Because if you are using get_dummies on all of your training data, then when out-of-sample data comes in, you still have to use get_dummies on it, and if that data has different categories, it's not going to produce the correctly shaped data frame. This is going to cause problems."
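The shape problem described in reason two can be demonstrated directly: pd.get_dummies builds its columns from whatever data it happens to see, while a fitted OneHotEncoder always emits the columns it learned during fit. The column name here is a hypothetical example:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
new = pd.DataFrame({'Embarked': ['S', 'S']})  # out-of-sample: only 'S'

# pandas derives columns from the data it is given, so the shapes
# disagree, which breaks a model trained on three dummy columns
print(pd.get_dummies(train).shape)  # (4, 3)
print(pd.get_dummies(new).shape)    # (2, 1)

# scikit-learn learns the categories once during fit, then always
# produces the same three columns for new data
ohe = OneHotEncoder()
ohe.fit(train)
print(ohe.transform(new).shape)     # (2, 3)
```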
Chapter: Benefits of using scikit-learn

Transcript: "All right, number three as to why this process with scikit-learn is better: you can do a grid search, as I was mentioning, with both model parameters and preprocessing parameters. And then finally, reason number four: in some cases, preprocessing outside of scikit-learn can make cross-validation scores less reliable. This gets complicated, but basically, if you're using a standard scaler, if you're doing missing value imputation, if you're using text data, or in a variety of other circumstances, and you do your preprocessing before scikit-learn, your cross-validation scores are possibly going to be unreliable."

Chapter summaries:
- Recap of data analysis workflow: a condensed version of the full workflow, covering data import, column selection, defining X and y, creating the column transformer, building the model and pipeline, cross-validating, and predicting on new data. The recap code keeps only the essential components, eliminating exploratory and teaching code, and will be shared in a notebook after the webcast.
- One hot encoding vs get dummies: OneHotEncoder does not alter the data frame, keeping it small, manageable, and easier to explore, and new data does not need get_dummies applied to it, which avoids complications when out-of-sample data contains different categories.
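Reason three can be illustrated with a grid search over model and preprocessing parameters together. The data and parameter values below are illustrative assumptions; the nested parameter names follow make_pipeline's and make_column_transformer's convention of lowercased class names joined by double underscores:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

# Hypothetical toy data
X = pd.DataFrame({'Sex': ['male', 'female'] * 10,
                  'Fare': list(range(20))})
y = pd.Series([0, 1] * 10)

ct = make_column_transformer((OneHotEncoder(), ['Sex']),
                             remainder='passthrough')
pipe = make_pipeline(ct, LogisticRegression(solver='liblinear'))

# One preprocessing parameter and one model parameter, searched together
params = {
    'columntransformer__onehotencoder__drop': [None, 'first'],
    'logisticregression__C': [0.1, 1, 10],
}
grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy')
grid.fit(X, y)
print(grid.best_params_)
```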
- Benefits of using scikit-learn: you can grid search both model parameters and preprocessing parameters; preprocessing outside of scikit-learn (standard scaling, missing value imputation, text data, and other circumstances) can make cross-validation scores less reliable; so preprocessing with scikit-learn is more reliable and effective than using get_dummies in pandas. Joining the membership program at the $5 level at patreon.com/dataschool gives access to monthly webcasts and the opportunity to ask questions.

Key takeaways:
- Using a pipeline with a column transformer and a logistic regression model improved cross-validated accuracy from 0.67 to 0.77.
- A pipeline enables grid search or randomized search over both the model's tuning parameters and the preprocessing parameters.
- A scikit-learn pipeline lets you cross-validate the whole process of preprocessing plus model building, not just the model, which gives more accurate cross-validation results.
- scikit-learn 0.20 or later is needed for ColumnTransformer and OneHotEncoder as used here, due to differences in functionality and code reusability with prior versions.
- The best approach for encoding unordered categorical features is dummy encoding, also known as one-hot encoding, available in both scikit-learn and pandas.
- pipe.fit and pipe.predict run preprocessing and model fitting or prediction in a single step.
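As a minimal illustration of the dummy-encoding takeaway, here is OneHotEncoder applied to a hypothetical unordered categorical column:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical unordered categorical column
X = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
ohe = OneHotEncoder()
dummies = ohe.fit_transform(X).toarray()
print(ohe.categories_)  # learned categories, sorted: ['C', 'Q', 'S']
print(dummies)
# each row contains a single 1 marking that row's category:
# 'S' -> [0, 0, 1], 'C' -> [1, 0, 0], 'Q' -> [0, 1, 0]
```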