title
Kaggle Competition - House Prices: Advanced Regression Techniques Part1

description
In this video I will be showing how we can participate in a Kaggle competition by solving a problem statement. #Kaggle #MachineLearning github: https://github.com/krishnaik06/Kaggle-Competitions

detail
{'title': 'Kaggle Competition - House Prices: Advanced Regression Techniques Part1', 'heatmap': [{'end': 1280.383, 'start': 1222.473, 'weight': 0.836}, {'end': 1433.078, 'start': 1371.591, 'weight': 0.738}, {'end': 1583.977, 'start': 1485.431, 'weight': 0.704}, {'end': 1788.095, 'start': 1636.135, 'weight': 0.714}], 'summary': "Provides an overview of the 'house prices advanced regression technique' competition, covering data analysis, data exploration, handling missing values, and feature engineering, achieving a rank of 2521 out of 4384 participants with an initial score of 0.141 in just four hours of work.", 'chapters': [{'end': 229.739, 'segs': [{'end': 56.347, 'src': 'embed', 'start': 22.288, 'weight': 2, 'content': [{'end': 24.549, 'text': 'I have submitted it in the Kaggle website itself.', 'start': 22.288, 'duration': 2.261}, {'end': 29.912, 'text': "i'm going to show you that problem statement, what i have done, but still it is just in the initial stages.", 'start': 25.169, 'duration': 4.743}, {'end': 32.613, 'text': 'i need to hyper tune it more.', 'start': 29.912, 'duration': 2.701}, {'end': 37.115, 'text': 'uh, you know, i have to apply a proper algorithm to it, try to play with each and every parameter.', 'start': 32.613, 'duration': 4.502}, {'end': 45.78, 'text': 'still, but apart from that, feature engineering, feature selection and a simple model creation has been already done, and the rank is also good,', 'start': 37.115, 'duration': 8.665}, {'end': 47, 'text': "which i'm just going to show you.", 'start': 45.78, 'duration': 1.22}, {'end': 53.325, 'text': 'and, uh, the best way to start is that i found out, uh, uh, you know, kaggle competition, which is named as house prices,', 'start': 47, 'duration': 6.325}, {'end': 56.347, 'text': 'advanced regulation techniques, and it is ongoing.', 'start': 53.325, 'duration': 3.022}], 'summary': 'Submitted kaggle competition entry, initial stages, hyper tuning required, good rank achieved in house prices competition', 'duration': 34.059, 'max_score': 22.288, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU22288.jpg'}, {'end': 111.962, 'src': 'embed', 'start': 75.657, 'weight': 0, 'content': [{'end': 81.203, 'text': 'what Kaggle is all about how the data set is basically present, how you have to program.', 'start': 75.657, 'duration': 5.546}, {'end': 85.587, 'text': 'you have to write the programs and, after writing the programs, once you get your output,', 'start': 81.203, 'duration': 4.384}, {'end': 88.289, 'text': 'how do you have to submit it so that you get a ranking in this?', 'start': 85.587, 'duration': 2.702}, {'end': 89.45, 'text': 'So let us begin.', 'start': 88.79, 'duration': 0.66}, {'end': 94.655, 'text': "So I'm going to take this particular problem statement, which is called House Prices Advanced Regression Technique.", 'start': 90.291, 'duration': 4.364}, {'end': 98.479, 'text': 'So inside this, you just have to, first of all, you have to log in.', 'start': 95.115, 'duration': 3.364}, {'end': 100.495, 'text': 'Make sure you log in.', 'start': 99.474, 'duration': 1.021}, {'end': 102.596, 'text': "without logging in, they'll not be.", 'start': 100.495, 'duration': 2.101}, {'end': 105.418, 'text': "they'll not allow you to basically download the data set.", 'start': 102.596, 'duration': 2.822}, {'end': 109.661, 'text': 'So, after you log in, this is the description of the project that you will be seeing.', 'start': 105.418, 'duration': 4.243}, {'end': 111.962, 
'text': 'Okay, all the information are basically given.', 'start': 109.661, 'duration': 2.301}], 'summary': 'Kaggle involves programming and submitting for ranking, such as in the house prices advanced regression technique project.', 'duration': 36.305, 'max_score': 75.657, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU75657.jpg'}, {'end': 246.084, 'src': 'embed', 'start': 211.289, 'weight': 1, 'content': [{'end': 213.37, 'text': 'you need to find out the price of the house.', 'start': 211.289, 'duration': 2.081}, {'end': 215.951, 'text': 'okay, based on various features that you can see over here.', 'start': 213.37, 'duration': 2.581}, {'end': 220.413, 'text': 'okay, and it has literally around 81 features.', 'start': 216.531, 'duration': 3.882}, {'end': 223.976, 'text': 'it has literally around 81 features and there are many, many.', 'start': 220.413, 'duration': 3.563}, {'end': 227.258, 'text': 'you know, there are many, many um category features.', 'start': 223.976, 'duration': 3.282}, {'end': 228.519, 'text': 'there are a lot of null values.', 'start': 227.258, 'duration': 1.261}, {'end': 229.739, 'text': "how you're going to handle that?", 'start': 228.519, 'duration': 1.22}, {'end': 234.602, 'text': 'a lot of things will be basically over there and you need to do a lot of stuffs in this.', 'start': 229.739, 'duration': 4.863}, {'end': 238.745, 'text': "so i've taken this particular problem statement so that you'll get a complete idea about it how it is done.", 'start': 234.602, 'duration': 4.143}, {'end': 246.084, 'text': 'Now the next thing is that after I go over here and see about the training.csv, you can also see test.csv.', 'start': 239.42, 'duration': 6.664}], 'summary': 'Analyzing a house with 81 features, many null values, and data in training and test files.', 'duration': 34.795, 'max_score': 211.289, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU211289.jpg'}], 'start': 0.981, 'title': 'Kaggle competition overview', 'summary': "Provides an overview of the 'house prices advanced regression technique' competition, covering the process of tackling the competition, a demo of completing and submitting a problem statement, and the importance of feature engineering and selection within a four to five-hour timeframe, as well as details on the competition dataset and problem statement.", 'chapters': [{'end': 111.962, 'start': 0.981, 'title': 'Kaggle competition tutorial', 'summary': "Details the process of tackling a kaggle competition, specifically the 'house prices advanced regression technique', including a demo of completing and submitting a problem statement, and the importance of feature engineering and selection, within four to five hours.", 'duration': 110.981, 'highlights': ["The chapter details the process of tackling a Kaggle competition, specifically the 'House Prices Advanced Regression Technique', including a demo of completing and submitting a problem statement, and the importance of feature engineering and selection, within four to five hours. Tackling a Kaggle competition, 'House Prices Advanced Regression Technique', completing and submitting a problem statement, importance of feature engineering and selection", 'The presenter mentions completing and submitting a Kaggle problem statement within four to five hours, and the need for hyper tuning, proper algorithm application, and parameter experimentation. 
Completing and submitting a Kaggle problem statement, need for hyper tuning, proper algorithm application, parameter experimentation', 'The process of logging in and accessing the description of the project on Kaggle is explained, emphasizing the requirement to log in to download the dataset. Logging in, accessing project description, requirement to log in to download dataset']}, {'end': 229.739, 'start': 111.962, 'title': 'Kaggle data set evaluation', 'summary': 'Provides an overview of a kaggle competition dataset, including details on the data, training and testing files, and the problem statement of predicting house prices based on 81 features with null values.', 'duration': 117.777, 'highlights': ['The training data consists of around 81 features and many category features with numerous null values. The dataset for the Kaggle competition has around 81 features and numerous category features, with a significant amount of null values.', "The problem statement involves predicting house prices based on various features. The competition's problem statement requires participants to predict house prices based on various features present in the dataset.", 'The training data needs to be used to train the model, while the test data is for predictions, with submissions made in a specific file format (sample_submission.csv). Participants are required to train their models using the training data and make predictions for the test data, with submissions to be made in the specified file format (sample_submission.csv).']}], 'duration': 228.758, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU981.jpg', 'highlights': ["The competition's problem statement requires participants to predict house prices based on various features present in the dataset.", 'The dataset for the Kaggle competition has around 81 features and numerous category features, with a significant amount of null values.', "Tackling a Kaggle competition, 'House Prices Advanced Regression Technique', completing and submitting a problem statement, importance of feature engineering and selection", 'The presenter mentions completing and submitting a Kaggle problem statement within four to five hours, and the need for hyper tuning, proper algorithm application, and parameter experimentation.', 'The process of logging in and accessing the description of the project on Kaggle is explained, emphasizing the requirement to log in to download the dataset.']}, {'end': 426.447, 'segs': [{'end': 280.604, 'src': 'embed', 'start': 251.088, 'weight': 1, 'content': [{'end': 253.069, 'text': 'So this is the ID of the test data itself.', 'start': 251.088, 'duration': 1.981}, {'end': 256.37, 'text': 'And this is the output that you have to predict and give it to them.', 'start': 253.169, 'duration': 3.201}, {'end': 264.377, 'text': 'Okay So after you make this kind of CSV file, you have to just submit in the predictions and that CSV file, you have to just upload it.', 'start': 256.692, 'duration': 7.685}, {'end': 267.239, 'text': "Once you upload it, you'll be getting a score based on that particular score.", 'start': 264.497, 'duration': 2.742}, {'end': 269.08, 'text': 'You will be basically getting a rank.', 'start': 267.639, 'duration': 1.441}, {'end': 276.622, 'text': 'Now, as we had already seen in the evaluation technique of this, you can see over here in the evaluation.', 'start': 269.6, 'duration': 7.022}, {'end': 280.604, 'text': 'here in this evaluation, they are basically saying that they are 
going to use root, mean squared error.', 'start': 276.622, 'duration': 3.982}], 'summary': 'Predict and submit csv file, get rank based on root mean squared error.', 'duration': 29.516, 'max_score': 251.088, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU251088.jpg'}, {'end': 335.227, 'src': 'embed', 'start': 304.553, 'weight': 0, 'content': [{'end': 308.374, 'text': 'if you see, there are so many people who have actually got very good ranks.', 'start': 304.553, 'duration': 3.821}, {'end': 317.457, 'text': 'You can see 0.05, 0.08, 0.099 and one of my, if you see my, my rank currently is 2521.', 'start': 308.394, 'duration': 9.063}, {'end': 322.439, 'text': "So I'm getting somewhere around 0.141 and this is just, I've written the code in four hours guys.", 'start': 317.457, 'duration': 4.982}, {'end': 323.759, 'text': 'Understand that thing.', 'start': 322.919, 'duration': 0.84}, {'end': 327.44, 'text': "Still I've not done hyperparameter optimization and this will take time.", 'start': 323.779, 'duration': 3.661}, {'end': 329.441, 'text': 'It is not just our four hours work.', 'start': 327.8, 'duration': 1.641}, {'end': 335.227, 'text': 'Okay, so after getting the first, this was my first trial, I have uploaded it, I have got a very good score, what I think.', 'start': 329.861, 'duration': 5.366}], 'summary': 'Several people achieved good ranks, with the speaker achieving a rank of 2521 and a score of around 0.141 after 4 hours of coding.', 'duration': 30.674, 'max_score': 304.553, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU304553.jpg'}, {'end': 384.347, 'src': 'embed', 'start': 351.742, 'weight': 2, 'content': [{'end': 354.163, 'text': 'I have solved many, but I am just giving you an example.', 'start': 351.742, 'duration': 2.421}, {'end': 356.924, 'text': 'In this, I have not applied any hyperparameter optimization.', 'start': 354.423, 'duration': 2.501}, {'end': 362.767, 'text': 'I have just followed a feature engineering, all the feature engineering work I have basically done, I have done all the feature selection.', 'start': 356.944, 'duration': 5.823}, {'end': 366.349, 'text': 'But again, let me just show you what all code I have basically written over.', 'start': 363.227, 'duration': 3.122}, {'end': 367.93, 'text': 'So you have to just go to the data.', 'start': 366.669, 'duration': 1.261}, {'end': 372.379, 'text': 'and make sure that you click on this download all button.', 'start': 368.996, 'duration': 3.383}, {'end': 375.961, 'text': 'okay, so, once you download it, you will be downloading all this whole files.', 'start': 372.379, 'duration': 3.582}, {'end': 384.347, 'text': 'you can just click on download all all the files will get downloaded and let us go and start how we can basically solve this particular problem.', 'start': 375.961, 'duration': 8.386}], 'summary': 'The speaker has done feature engineering and selection without hyperparameter optimization, and has solved many problems using this approach.', 'duration': 32.605, 'max_score': 351.742, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU351742.jpg'}], 'start': 229.739, 'title': 'Data analysis and kaggle competition success', 'summary': 'Explains the process of creating and submitting csv files for data analysis, including the requirement to upload predictions and receive a score and rank, and discusses achieving a rank of 2521 out of 4384 
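The evaluation described here is root mean squared error. A minimal sketch of computing it locally (the function names and the sample prices below are illustrative, not from the video; the competition's evaluation applies the error to the logarithm of the sale price, which is why a log variant is shown as well):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def rmsle(y_true, y_pred):
    """RMSE on log prices, which weights cheap and expensive houses equally."""
    return rmse(np.log(y_true), np.log(y_pred))

# Illustrative sale prices only.
actual = [208500, 181500, 223500]
predicted = [200000, 190000, 215000]
print(round(rmse(actual, predicted), 2))
print(round(rmsle(actual, predicted), 4))
```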
participants with an initial score of 0.141, without hyperparameter optimization, in just four hours of work.', 'chapters': [{'end': 269.08, 'start': 229.739, 'title': 'Data analysis and submission process', 'summary': 'Explains the process of creating and submitting csv files for data analysis, including the requirement to upload predictions and receive a score and rank based on the submission.', 'duration': 39.341, 'highlights': ['After creating the CSV file, it needs to be submitted for evaluation, and based on the submission, a score and rank will be provided.', 'The process involves creating a sample submission file.csv, with the test data ID and the output to be predicted.', 'The training.csv and test.csv files are mentioned as part of the data analysis process.']}, {'end': 426.447, 'start': 269.6, 'title': 'Kaggle competition success', 'summary': 'Discusses a kaggle competition where the speaker achieved a rank of 2521 out of 4384 participants with an initial score of 0.141, without hyperparameter optimization, in just four hours of work.', 'duration': 156.847, 'highlights': ['The speaker achieved a rank of 2521 out of 4384 participants in a Kaggle competition with a score of 0.141, without hyperparameter optimization, in just four hours of work.', 'The evaluation technique for the competition involved the use of root mean squared error (RMSE) to submit predictions and achieve leaderboard rankings.', 'The speaker emphasized the efficiency of their approach by achieving a competitive rank and score without hyperparameter optimization, showcasing their proficiency in feature engineering and selection.']}], 'duration': 196.708, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU229739.jpg', 'highlights': ['Achieved rank 2521 out of 4384 with score 0.141 in 4 hours', 'Submission yields score and rank', 'Efficient approach without hyperparameter optimization']}, {'end': 692.956, 'segs': [{'end': 458.519, 'src': 'embed', 'start': 426.547, 'weight': 0, 'content': [{'end': 432.11, 'text': "I'm uploading the train.csv file and this is my dataset head which I see over here.", 'start': 426.547, 'duration': 5.563}, {'end': 433.992, 'text': 'They are around 81 columns.', 'start': 432.451, 'duration': 1.541}, {'end': 438.095, 'text': 'Okay 81 columns and they are somewhere around, you know, 4,000, not 4,000, 1,400 records, I guess.', 'start': 434.152, 'duration': 3.943}, {'end': 439.135, 'text': 'Let me just have a look.', 'start': 438.135, 'duration': 1}, {'end': 445.375, 'text': "Okay So I'll just make a cell above.", 'start': 439.155, 'duration': 6.22}, {'end': 450.656, 'text': 'So here you can see that it is 1460 records, 81 columns.', 'start': 446.275, 'duration': 4.381}, {'end': 458.519, 'text': 'And this particular code is basically to see this sns.heatmap, df.isNull, yTickableFalse.', 'start': 451.197, 'duration': 7.322}], 'summary': 'Dataset has 81 columns and 1460 records.', 'duration': 31.972, 'max_score': 426.547, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU426547.jpg'}, {'end': 571.507, 'src': 'embed', 'start': 546.003, 'weight': 2, 'content': [{'end': 551.488, 'text': 'because, just understand, guys, the total number of records that we have is somewhere around 1460..', 'start': 546.003, 'duration': 5.485}, {'end': 557.854, 'text': 'And if I take this examples over here, 1453, 1179, 1406, so many missing values are there.', 'start': 551.488, 'duration': 6.366}, {'end': 
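The garbled "sns.heatmap, df.isNull, yTickableFalse" in the transcript refers to seaborn's heatmap of df.isnull() with the y tick labels turned off. A minimal sketch of that missing-value check, assuming the competition's train.csv has been downloaded into the working directory:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the training data downloaded from the competition page.
df = pd.read_csv('train.csv')
print(df.shape)  # expected to be (1460, 81) for this dataset

# Dark/light bands in the heatmap show where values are missing.
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
plt.show()

# A numeric view of the same information.
print(df.isnull().sum().sort_values(ascending=False).head(10))
```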
563.719, 'text': "So what I've done is that I've planned to drop all these columns pool, QC, fence, miscellaneous features.", 'start': 558.194, 'duration': 5.525}, {'end': 571.507, 'text': "you know, because it is not necessary to do that, i'm not dropping this, okay, not this also fireplace q you, but instead i'm dropping,", 'start': 563.719, 'duration': 7.788}], 'summary': 'The dataset has around 1460 records, with many missing values. columns pool, qc, fence, and miscellaneous features are planned to be dropped.', 'duration': 25.504, 'max_score': 546.003, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU546003.jpg'}, {'end': 623.641, 'src': 'embed', 'start': 597.449, 'weight': 3, 'content': [{'end': 604.375, 'text': 'now, just if you have 81 features, it is difficult to just understand directly which, without any domain expert knowledge, right,', 'start': 597.449, 'duration': 6.926}, {'end': 606.477, 'text': 'which feature to drop, which feature to not to drop.', 'start': 604.375, 'duration': 2.102}, {'end': 613.908, 'text': 'So for that, what I have done is that I have dropped features like alley and some more features which you will be seeing towards the down,', 'start': 606.897, 'duration': 7.011}, {'end': 616.292, 'text': 'where my missing values is more than 50%.', 'start': 613.908, 'duration': 2.384}, {'end': 617.754, 'text': 'like this kind of features, I have deleted it.', 'start': 616.292, 'duration': 1.462}, {'end': 623.641, 'text': 'So what I have done is that after that, what I will do I will go and see what is the heat map.', 'start': 618.175, 'duration': 5.466}], 'summary': 'Dropped features with over 50% missing values, then used a heatmap for analysis.', 'duration': 26.192, 'max_score': 597.449, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU597449.jpg'}, {'end': 705.684, 'src': 'embed', 'start': 674.378, 'weight': 1, 'content': [{'end': 680.564, 'text': "They are also lot number of null values over there, and what I've done is that, initially without judging anything,", 'start': 674.378, 'duration': 6.186}, {'end': 684.908, 'text': "without understanding about all the features, I've just taken the mean.", 'start': 680.564, 'duration': 4.344}, {'end': 692.956, 'text': "I'm telling that fill all the NA values with the frontage mean, that same column mean and I've basically replaced the missing values over here.", 'start': 685.328, 'duration': 7.628}, {'end': 698.139, 'text': 'Now always remember guys you should go line by line, you should go feature by feature.', 'start': 693.916, 'duration': 4.223}, {'end': 705.684, 'text': 'Now, since we have 81 features and obviously you will get confused to make sure that first of the feature, you target one feature, like lot,', 'start': 698.399, 'duration': 7.285}], 'summary': 'Data cleaning involved replacing null values with column means.', 'duration': 31.306, 'max_score': 674.378, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU674378.jpg'}], 'start': 426.547, 'title': 'Data exploration, handling missing values, and data cleaning', 'summary': 'Covers the exploration of a dataset with 81 columns and 1460 records, visualization of null values using sns.heatmap, handling missing values in the dataset, dropping columns with more than 50 missing values, and cleaning the data by dropping features with over 50% missing values and filling null values with respective 
column means, resulting in the removal of 259 missing values.', 'chapters': [{'end': 503.441, 'start': 426.547, 'title': 'Exploring dataset with 81 columns', 'summary': 'Showcases the exploration of a dataset with 81 columns and 1460 records, including the use of sns.heatmap to visualize null values and the attempt to execute the function df.isnull.sum, encountering an error in the process.', 'duration': 76.894, 'highlights': ['The dataset contains 81 columns and 1460 records.', 'The code utilizes sns.heatmap to visualize null values in the dataset.', 'An attempt is made to execute the function df.isNull.sum, resulting in an error.']}, {'end': 582.597, 'start': 503.441, 'title': 'Handling missing values in data', 'summary': "Discusses handling missing values in a dataset with around 1460 records, identifying features with large numbers of missing values, such as 'pool qc' and 'fence', and planning to drop columns with more than 50 missing values.", 'duration': 79.156, 'highlights': ["Identified features like 'pool qc' and 'fence' with 1453 and 1179 missing values respectively, in a dataset with around 1460 records.", "Planned to drop columns with more than 50 missing values, such as 'pool qc' and 'fence', as it is not necessary to retain them in the dataset."]}, {'end': 692.956, 'start': 582.597, 'title': 'Data cleaning and feature selection', 'summary': 'Discusses handling missing values in a dataset with 81 features, where features with over 50% missing values are dropped and null values are filled with respective column means, resulting in the removal of 259 missing values.', 'duration': 110.359, 'highlights': ["Features with over 50% missing values are dropped The speaker dropped features with more than 50% missing values, such as 'alley' and others, to facilitate data cleaning and improve dataset quality.", "Null values are filled with respective column means The speaker filled null values with the mean of the respective column, specifically demonstrating this with the 'lot front age' feature to handle the numerous null values present.", '259 missing values were addressed The speaker addressed a total of 259 missing values in the dataset through feature selection and filling null values with respective column means to improve data completeness and quality.']}], 'duration': 266.409, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU426547.jpg', 'highlights': ['The dataset contains 81 columns and 1460 records.', '259 missing values were addressed through feature selection and filling null values with respective column means.', "Planned to drop columns with more than 50 missing values, such as 'pool qc' and 'fence', as it is not necessary to retain them in the dataset.", 'Features with over 50% missing values are dropped to facilitate data cleaning and improve dataset quality.']}, {'end': 858.13, 'segs': [{'end': 758.206, 'src': 'embed', 'start': 730.678, 'weight': 0, 'content': [{'end': 735.301, 'text': 'Why the shape is 80? Now see in my training data, I have shape as 81, right? 
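As described above, the columns with the heaviest missing-value counts (PoolQC, Fence, MiscFeature, and later Alley) are dropped outright, while the numeric LotFrontage column is filled with its mean. A sketch of those two steps, assuming the standard column names from the House Prices dataset:

```python
import pandas as pd

df = pd.read_csv('train.csv')

# LotFrontage is numeric, so impute its missing entries with the column mean.
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].mean())

# These columns have well over 50% missing values, so they are dropped entirely.
df.drop(['Alley', 'PoolQC', 'Fence', 'MiscFeature'], axis=1, inplace=True)

print(df.shape)                            # four fewer columns than the original 81
print(df['LotFrontage'].isnull().sum())    # 0 after imputation
```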
81 features.', 'start': 730.678, 'duration': 4.623}, {'end': 739.303, 'text': 'And remember this sales price over here you have.', 'start': 735.701, 'duration': 3.602}, {'end': 746.183, 'text': 'if I go and show you my data, set over here, At the end of the column there will be something called a sales price.', 'start': 739.303, 'duration': 6.88}, {'end': 749.405, 'text': 'This sale price is basically a dependent feature.', 'start': 747.124, 'duration': 2.281}, {'end': 755.588, 'text': 'You need to find out this particular value based on all the other features, price of the house with respect to all the other features.', 'start': 749.425, 'duration': 6.163}, {'end': 758.206, 'text': 'In the test data you will not have that.', 'start': 756.445, 'duration': 1.761}], 'summary': 'Training data has 81 features, predicting house price', 'duration': 27.528, 'max_score': 730.678, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU730678.jpg'}, {'end': 818.501, 'src': 'embed', 'start': 790.33, 'weight': 2, 'content': [{'end': 792.771, 'text': 'When I check the null values over here, right?', 'start': 790.33, 'duration': 2.441}, {'end': 798.814, 'text': 'You could see that my leverage, my MS zoning, did not have any null value.', 'start': 793.311, 'duration': 5.503}, {'end': 803.876, 'text': 'Whereas over here in my test data set, MS zoning is having some null values, okay?', 'start': 799.374, 'duration': 4.502}, {'end': 805.157, 'text': 'Now this is again.', 'start': 804.217, 'duration': 0.94}, {'end': 810.86, 'text': "you have to, so I'm telling you, you have to simultaneously work along with test data and train data, both, okay?", 'start': 805.157, 'duration': 5.703}, {'end': 812.54, 'text': 'Now for MS zoning.', 'start': 811.32, 'duration': 1.22}, {'end': 814.941, 'text': 'let us understand what this MS zoning column is all about.', 'start': 812.54, 'duration': 2.401}, {'end': 818.501, 'text': 'If you go and see over here, MS zoning is basically an object type.', 'start': 815.341, 'duration': 3.16}], 'summary': 'Ms zoning in train data has no null values, while test data has some null values. 
both datasets need simultaneous work.', 'duration': 28.171, 'max_score': 790.33, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU790330.jpg'}], 'start': 693.916, 'title': 'Data analysis and feature engineering', 'summary': 'Covers the process of data analysis and feature engineering for a dataset with 81 features, emphasizing the importance of handling test data simultaneously and identifying and handling null values in categorical features.', 'chapters': [{'end': 858.13, 'start': 693.916, 'title': 'Data analysis and feature engineering', 'summary': 'Covers the process of data analysis and feature engineering for a dataset with 81 features, emphasizing the importance of handling test data simultaneously and identifying and handling null values in categorical features.', 'duration': 164.214, 'highlights': ['The dataset contains 81 features, and the process emphasizes the importance of handling test data simultaneously.', 'The sales price is a dependent feature that needs to be predicted based on other features.', "Identifying and handling null values in categorical features, such as 'MS zoning', is crucial for data analysis and feature engineering."]}], 'duration': 164.214, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU693916.jpg', 'highlights': ['The dataset contains 81 features, emphasizing the importance of handling test data simultaneously.', 'The sales price is a dependent feature that needs to be predicted based on other features.', "Identifying and handling null values in categorical features, such as 'MS zoning', is crucial for data analysis and feature engineering."]}, {'end': 1053.169, 'segs': [{'end': 882.958, 'src': 'embed', 'start': 858.672, 'weight': 0, 'content': [{'end': 866.58, 'text': 'So what you can do is that, in order to handle the null values over here, what I have done is that I have basically taken the mode of this category.', 'start': 858.672, 'duration': 7.908}, {'end': 870.904, 'text': 'Now mode basically means that which is your most frequent category that will get replaced by that.', 'start': 866.64, 'duration': 4.264}, {'end': 878.812, 'text': 'So you will be seeing that that will be my first step that I will do over here after I have actually done my frontage mean.', 'start': 871.725, 'duration': 7.087}, {'end': 882.958, 'text': 'So I have already shown you I am going to compute this frontage mean right.', 'start': 879.192, 'duration': 3.766}], 'summary': 'Handling null values by taking the mode of the most frequent category, followed by computing the frontage mean.', 'duration': 24.286, 'max_score': 858.672, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU858672.jpg'}, {'end': 927.904, 'src': 'embed', 'start': 896.787, 'weight': 1, 'content': [{'end': 903.093, 'text': "Why I'm dropping the alley column? Again, guys, there is so many number of features that are having null values over there.", 'start': 896.787, 'duration': 6.306}, {'end': 910.24, 'text': "So that is the reason I'm just dropping this alley column, okay? 
Make sure you do this and simultaneously we'll go back to the test data.", 'start': 903.433, 'duration': 6.807}, {'end': 916.397, 'text': 'Now in test data also we are doing this and again now see in that in the training data we did not do anything with MS zoning.', 'start': 910.794, 'duration': 5.603}, {'end': 919.219, 'text': 'But here we are performing the mode of it.', 'start': 916.938, 'duration': 2.281}, {'end': 927.904, 'text': 'So in order to perform I am just writing test underscore DF MS zoning dot fill NA test underscore DF MS zoning dot mode of zero.', 'start': 919.279, 'duration': 8.625}], 'summary': 'Dropping alley column due to numerous null values, filling ms zoning with mode.', 'duration': 31.117, 'max_score': 896.787, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU896787.jpg'}, {'end': 973.954, 'src': 'embed', 'start': 940.158, 'weight': 2, 'content': [{'end': 941.879, 'text': 'I have to handle the missing values.', 'start': 940.158, 'duration': 1.721}, {'end': 944.1, 'text': 'okay, that is the first step in feature engineering.', 'start': 941.879, 'duration': 2.221}, {'end': 946.441, 'text': 'then only I can decide something else now.', 'start': 944.1, 'duration': 2.341}, {'end': 951.803, 'text': "after that, I'll go back again to my final project over here and you can see over here what all I have done.", 'start': 946.441, 'duration': 5.362}, {'end': 960.347, 'text': 'I have actually again, guys, you can all write one inbuilt function, one, one, one, one custom function where you can write all this code.', 'start': 951.803, 'duration': 8.544}, {'end': 963.888, 'text': 'but I was going, you know, feature by feature.', 'start': 960.347, 'duration': 3.541}, {'end': 967.25, 'text': 'I was just seeing features by feature and I was trying to solve this particular problem.', 'start': 963.888, 'duration': 3.362}, {'end': 973.954, 'text': "So you'll be seeing that wherever I saw that the feature work category features, I have just replaced it with the mode.", 'start': 967.93, 'duration': 6.024}], 'summary': 'Handling missing values is the first step in feature engineering, replacing category features with mode.', 'duration': 33.796, 'max_score': 940.158, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU940158.jpg'}, {'end': 1005.517, 'src': 'embed', 'start': 978.357, 'weight': 4, 'content': [{'end': 984.701, 'text': "You should try to understand about the data, but later on, see now I've actually submitted, I know what is my accuracy.", 'start': 978.357, 'duration': 6.344}, {'end': 991.125, 'text': 'Now I have to try to reduce that particular Error that is basically coming over there right now.', 'start': 985.141, 'duration': 5.984}, {'end': 995.729, 'text': "in order to do that, What I'll do is that I'll start exploring more about this particular feature,", 'start': 991.125, 'duration': 4.604}, {'end': 999.332, 'text': "But currently I've just taken the mode of all the category features.", 'start': 995.729, 'duration': 3.603}, {'end': 1005.517, 'text': 'I have replaced it, So that is the reason you will find for all these features, which are basically category feature I am replacing.', 'start': 999.352, 'duration': 6.165}], 'summary': 'Identifying accuracy and reducing errors in data through feature exploration and replacement.', 'duration': 27.16, 'max_score': 978.357, 'thumbnail': 
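For a categorical column such as MSZoning, the transcript replaces the few missing entries in the test file with the most frequent category (the mode). A minimal sketch, assuming the competition's test.csv:

```python
import pandas as pd

test_df = pd.read_csv('test.csv')

# MSZoning is a categorical (object) column; replace its few missing values
# with the most frequent category. .mode() returns a Series, hence the [0].
test_df['MSZoning'] = test_df['MSZoning'].fillna(test_df['MSZoning'].mode()[0])

# The same pattern can be repeated for other categorical columns with gaps.
print(test_df['MSZoning'].isnull().sum())  # 0
```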
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU978357.jpg'}, {'end': 1053.169, 'src': 'embed', 'start': 1017.791, 'weight': 3, 'content': [{'end': 1023.212, 'text': 'Again, see guys, the first feature that I went was with MS zoning, right? Because this was having null values.', 'start': 1017.791, 'duration': 5.421}, {'end': 1024.912, 'text': 'Then I went with lot frontage.', 'start': 1023.252, 'duration': 1.66}, {'end': 1027.073, 'text': 'You have to go just by feature by feature.', 'start': 1025.291, 'duration': 1.782}, {'end': 1032.934, 'text': "Now you found out that the number of missing values of more than 50% is I've dropped it.", 'start': 1027.893, 'duration': 5.041}, {'end': 1035.755, 'text': "In order to drop it, you can see I've written a code over here also.", 'start': 1033.294, 'duration': 2.461}, {'end': 1044.037, 'text': "I've written a code over here, right? So similarly, I'm dropping pool QC fence Micheliners features with axis equal to two.", 'start': 1037.515, 'duration': 6.522}, {'end': 1047.346, 'text': 'This is done and now you can basically see my shape.', 'start': 1044.864, 'duration': 2.482}, {'end': 1048.106, 'text': 'This is my shape.', 'start': 1047.406, 'duration': 0.7}, {'end': 1051.688, 'text': "I'm also dropping the ID column because ID is unique identifier.", 'start': 1048.486, 'duration': 3.202}, {'end': 1052.469, 'text': "I don't require it.", 'start': 1051.708, 'duration': 0.761}, {'end': 1053.169, 'text': "So I'm dropping it.", 'start': 1052.549, 'duration': 0.62}], 'summary': 'Data cleaning process involved dropping features with more than 50% missing values and unique id column.', 'duration': 35.378, 'max_score': 1017.791, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU1017791.jpg'}], 'start': 858.672, 'title': 'Dataset feature engineering', 'summary': 'Covers handling null values by replacing them with mode or dropping columns with many null values, emphasizing the importance of handling missing values. 
it also discusses data analysis, dropping features with more than 50% missing values, and unique identifier column, resulting in a new shape of the dataset.', 'chapters': [{'end': 960.347, 'start': 858.672, 'title': 'Handling null values in data', 'summary': 'Covers handling null values in a dataset by replacing them with the mode or dropping columns with many null values, and performing these operations on both training and test data, emphasizing the importance of handling missing values as the first step in feature engineering.', 'duration': 101.675, 'highlights': ['Replacing null values with the mode of the category is demonstrated as the first step in handling missing values, emphasizing its importance in feature engineering.', 'Dropping the alley column due to a high number of null values is emphasized as a necessary data cleaning step.', 'Performing operations to handle missing values on both training and test data indicates a comprehensive approach to data cleaning and preparation for modeling.']}, {'end': 1053.169, 'start': 960.347, 'title': 'Data analysis and feature replacement', 'summary': 'Discusses the process of analyzing and replacing features in a dataset, dropping features with more than 50% missing values, and dropping the unique identifier column, resulting in a new shape of the dataset.', 'duration': 92.822, 'highlights': ['The process involves analyzing and replacing features in the dataset, with a focus on understanding the data and improving accuracy.', 'Features with more than 50% missing values are dropped from the dataset, as shown in the code, resulting in a new shape of the dataset.', 'The unique identifier column is also dropped from the dataset during the process.']}], 'duration': 194.497, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU858672.jpg', 'highlights': ['Replacing null values with the mode of the category is demonstrated as the first step in handling missing values, emphasizing its importance in feature engineering.', 'Dropping the alley column due to a high number of null values is emphasized as a necessary data cleaning step.', 'Performing operations to handle missing values on both training and test data indicates a comprehensive approach to data cleaning and preparation for modeling.', 'Features with more than 50% missing values are dropped from the dataset, as shown in the code, resulting in a new shape of the dataset.', 'The process involves analyzing and replacing features in the dataset, with a focus on understanding the data and improving accuracy.', 'The unique identifier column is also dropped from the dataset during the process.']}, {'end': 1453.369, 'segs': [{'end': 1140.677, 'src': 'embed', 'start': 1111.839, 'weight': 0, 'content': [{'end': 1116.246, 'text': 'okay, now you can observe that there are very less number of null values remaining.', 'start': 1111.839, 'duration': 4.407}, {'end': 1118.23, 'text': "okay. so again i've done for one more feature.", 'start': 1116.246, 'duration': 1.984}, {'end': 1126.582, 'text': "finally, what i'll do is that after, after executing this line, you can again, if i go and click over here, you'll be able to see that there are very,", 'start': 1119.534, 'duration': 7.048}, {'end': 1128.705, 'text': 'very small number of null values, around eight.', 'start': 1126.582, 'duration': 2.123}, {'end': 1132.709, 'text': 'okay. 
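The video repeats this mode replacement feature by feature; an equivalent shortcut (my generalisation, not code shown in the video) is to loop over all object-typed columns, and to drop the Id column since it is only a unique row identifier:

```python
import pandas as pd

df = pd.read_csv('train.csv')

# The Id column is a unique row identifier, not a predictive feature.
df.drop(['Id'], axis=1, inplace=True)

# Fill every remaining categorical (object) column's gaps with its mode.
for col in df.select_dtypes(include='object').columns:
    if df[col].isnull().any():
        df[col] = df[col].fillna(df[col].mode()[0])

print(df.select_dtypes(include='object').isnull().sum().sum())  # 0
```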
so what i can do is that i can finally drop the records if i want.', 'start': 1128.705, 'duration': 4.004}, {'end': 1135.953, 'text': "okay, because there's very less number of null values for that particular feature.", 'start': 1132.709, 'duration': 3.244}, {'end': 1140.677, 'text': 'But I can also find the record and try to use the mode similarly.', 'start': 1136.934, 'duration': 3.743}], 'summary': 'After data cleaning, only around 8 null values remain, allowing for potential record dropping.', 'duration': 28.838, 'max_score': 1111.839, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU1111839.jpg'}, {'end': 1183.631, 'src': 'embed', 'start': 1153.949, 'weight': 3, 'content': [{'end': 1156.591, 'text': 'have handled all the missing values here right here.', 'start': 1153.949, 'duration': 2.642}, {'end': 1158.253, 'text': 'I have handled all the missing values.', 'start': 1156.591, 'duration': 1.662}, {'end': 1166.92, 'text': 'I know that my handling of the missing values is based on mode and for integer variables I have done it with the help of integer mean right.', 'start': 1158.253, 'duration': 8.667}, {'end': 1168.601, 'text': 'but now why have done that?', 'start': 1166.92, 'duration': 1.681}, {'end': 1171.48, 'text': 'because I just want to start with the problem.', 'start': 1168.601, 'duration': 2.879}, {'end': 1174.562, 'text': 'okay, I have still not done any statistical analysis.', 'start': 1171.48, 'duration': 3.082}, {'end': 1178.726, 'text': 'we will do that in our next section because I know my accuracy over here.', 'start': 1174.562, 'duration': 4.164}, {'end': 1183.631, 'text': 'now I have to make that accuracy better now, after that, after handling the missing values.', 'start': 1178.726, 'duration': 4.905}], 'summary': 'Handled missing values using mode and integer mean to start problem-solving, aiming to improve accuracy in next section.', 'duration': 29.682, 'max_score': 1153.949, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU1153949.jpg'}, {'end': 1284.267, 'src': 'heatmap', 'start': 1214.246, 'weight': 1, 'content': [{'end': 1222.473, 'text': 'Okay, so what I thought of is that can I write one function wherein I will be considering I will be taking this category features.', 'start': 1214.246, 'duration': 8.227}, {'end': 1228.959, 'text': "I'll apply a get dummies you know, pandas.get dummies in order to convert the category feature into dummy variables.", 'start': 1222.473, 'duration': 6.486}, {'end': 1235.645, 'text': "And then I'll append that same variables directly to my data frames, right? 
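Once only a handful of scattered nulls remain, the affected records can simply be dropped. A toy sketch of that final clean-up (the small stand-in frame below is illustrative only, representing the data after the earlier imputation steps):

```python
import pandas as pd

# Stand-in for the training data after imputation, with a few nulls left over.
df = pd.DataFrame({
    'GarageType': ['Attchd', None, 'Detchd', 'Attchd'],
    'SalePrice':  [208500, 181500, 223500, 140000],
})

print(df.isnull().sum().sum(), 'missing values before dropna')
df.dropna(inplace=True)                    # drop the few remaining incomplete rows
df.reset_index(drop=True, inplace=True)
print(df.isnull().sum().sum(), 'missing values after dropna')
```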
So that is what I was planning to do that.", 'start': 1229.399, 'duration': 6.246}, {'end': 1241.509, 'text': 'So for that what I did is that I started creating a separate list of all the category features.', 'start': 1235.945, 'duration': 5.564}, {'end': 1244.232, 'text': 'So this is my list of category features and I created one column.', 'start': 1241.529, 'duration': 2.703}, {'end': 1249.576, 'text': 'Now if I go and see the length of column there are 39 category features in that 81 columns you know.', 'start': 1244.272, 'duration': 5.304}, {'end': 1255.821, 'text': 'Now I created one function which will be handling all the features and converting that into category feature.', 'start': 1250.176, 'duration': 5.645}, {'end': 1257.362, 'text': 'So this is basically my function.', 'start': 1256.141, 'duration': 1.221}, {'end': 1259.606, 'text': 'okay, and this is my function.', 'start': 1257.844, 'duration': 1.762}, {'end': 1267.232, 'text': 'after converting, you just have to provide the list of columns here and then it will be returning you all the category features directly into this.', 'start': 1259.606, 'duration': 7.626}, {'end': 1274.158, 'text': 'okay, sorry, all the data frames concatenated with the category features, because you can see where I am concatenating it.', 'start': 1267.232, 'duration': 6.926}, {'end': 1280.383, 'text': 'okay, now, after that, after you do this right, make a copy of your data frame into one variable.', 'start': 1274.158, 'duration': 6.225}, {'end': 1284.267, 'text': 'so it is always good, because this value should not be getting changed again and again.', 'start': 1280.383, 'duration': 3.884}], 'summary': 'Created a function to convert 39 category features into dummy variables and append them to the data frames.', 'duration': 70.021, 'max_score': 1214.246, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU1214246.jpg'}, {'end': 1435.379, 'src': 'heatmap', 'start': 1359.065, 'weight': 2, 'content': [{'end': 1364.528, 'text': "since we don't know in the training data set that whether we are having the complete whole, some categories.", 'start': 1359.065, 'duration': 5.463}, {'end': 1371.591, 'text': 'what I have done is that I will take this test data concatenate with the training data.', 'start': 1364.528, 'duration': 7.063}, {'end': 1374.753, 'text': 'okay, concatenate row wise with the training data.', 'start': 1371.591, 'duration': 3.162}, {'end': 1376.734, 'text': 'just understand this very important thing.', 'start': 1374.753, 'duration': 1.981}, {'end': 1385.945, 'text': 'so what I will do is that, first of all, The reason why I am concatenating is that after concatenation you can see over here I have written the code.', 'start': 1376.734, 'duration': 9.211}, {'end': 1388.186, 'text': 'I will just tell you about the code in a while.', 'start': 1386.145, 'duration': 2.041}, {'end': 1397.613, 'text': 'So after I concatenate, now I will apply the pandas.get underscore dummies which converts into a one hot encoding to the entire column itself.', 'start': 1388.806, 'duration': 8.807}, {'end': 1406.439, 'text': 'Now if I combine both training and test data set, I know that I have all the specific number of categories within each and every feature.', 'start': 1398.153, 'duration': 8.286}, {'end': 1409.721, 'text': 'and it will never, never increase after that.', 'start': 1407.199, 'duration': 2.522}, {'end': 1415.606, 'text': 'okay, so that is the reason why i will combine my training and the test 
data now, before combining that.', 'start': 1409.721, 'duration': 5.885}, {'end': 1422.271, 'text': 'i have to perform all the operations that i have done to handle the missing values, like how we did it in the training data set.', 'start': 1415.606, 'duration': 6.665}, {'end': 1424.633, 'text': 'so that is why i have created this different file.', 'start': 1422.271, 'duration': 2.362}, {'end': 1426.034, 'text': 'now you can see over here.', 'start': 1424.633, 'duration': 1.401}, {'end': 1429.336, 'text': 'i have done all these particular steps, all these particular steps.', 'start': 1426.034, 'duration': 3.302}, {'end': 1433.078, 'text': 'you can see that i have applied mode, mode, mode mode for everything.', 'start': 1429.336, 'duration': 3.742}, {'end': 1435.379, 'text': 'you know, all the categories to handle it.', 'start': 1433.078, 'duration': 2.301}], 'summary': 'Concatenated test data with training data and applied one hot encoding to ensure complete categories for each feature.', 'duration': 76.314, 'max_score': 1359.065, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU1359065.jpg'}], 'start': 1053.69, 'title': 'Handling data features', 'summary': 'Covers handling null values in a dataset, reducing null values to around eight using mode and mean, then addressing category features by creating a function to convert them into dummy variables and ensuring consistency between training and test data.', 'chapters': [{'end': 1174.562, 'start': 1053.69, 'title': 'Handling null values in data', 'summary': 'Covers the process of handling null values in a dataset, identifying and addressing null values using mode and mean, resulting in a significant reduction in the number of null values to around eight, before considering the option of dropping records.', 'duration': 120.872, 'highlights': ['The process significantly reduces the number of null values to around eight After identifying and addressing null values using mode and mean, the number of null values in the dataset is reduced to around eight.', 'The use of mode and mean to handle missing values for integer variables The speaker explains the use of mode and mean to handle missing values, specifically using mode for categorical features and mean for integer variables.', 'Considering the option of dropping records due to very few null values The speaker considers the option of dropping records due to the very small number of remaining null values for a particular feature.']}, {'end': 1453.369, 'start': 1174.562, 'title': 'Handling category features in data analysis', 'summary': 'Covers handling category features in a dataset, including creating a function to convert category features into dummy variables, concatenating training and test data, and applying mode to handle missing values, while ensuring consistency in category variables between the two datasets.', 'duration': 278.807, 'highlights': ['Creating a function to convert category features into dummy variables The speaker plans to write a function to convert category features into dummy variables using pandas.get_dummies, with most category features having just two to four categories.', 'Concatenating test data with training data to ensure consistency in category variables The speaker concatenates the test data with the training data row-wise to ensure that the number of categories in each feature remains consistent after applying pandas.get_dummies.', 'Applying mode to handle missing values in category features The speaker 
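The key point here is that pandas.get_dummies only creates columns for the categories it actually sees, so encoding the train and test files separately can yield mismatched column sets; concatenating them row-wise first avoids that. A sketch of the idea (the helper name, file names, and the three example columns are mine; the video reads its own pre-cleaned test file and passes all 39 categorical columns):

```python
import pandas as pd

def one_hot_multicols(df, columns):
    """Replace each listed categorical column with prefixed one-hot dummies."""
    dummies = pd.get_dummies(df[columns], drop_first=True)
    return pd.concat([df.drop(columns=columns), dummies], axis=1)

train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')   # the video uses its already-cleaned test file here

# Stack the test rows under the train rows so every category level is seen once;
# SalePrice is absent from the test file and becomes NaN for those rows.
final_df = pd.concat([train_df, test_df], axis=0).reset_index(drop=True)

categorical_cols = ['MSZoning', 'Street', 'LotShape']   # illustrative subset of the 39
final_df = one_hot_multicols(final_df, categorical_cols)
print(final_df.shape)
```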
applies the mode to handle missing values in category features, with around three to four categories and plans to replace missing values with the mode.']}], 'duration': 399.679, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU1053690.jpg', 'highlights': ['The process significantly reduces the number of null values to around eight After identifying and addressing null values using mode and mean, the number of null values in the dataset is reduced to around eight.', 'Creating a function to convert category features into dummy variables The speaker plans to write a function to convert category features into dummy variables using pandas.get_dummies, with most category features having just two to four categories.', 'Concatenating test data with training data to ensure consistency in category variables The speaker concatenates the test data with the training data row-wise to ensure that the number of categories in each feature remains consistent after applying pandas.get_dummies.', 'The use of mode and mean to handle missing values for integer variables The speaker explains the use of mode and mean to handle missing values, specifically using mode for categorical features and mean for integer variables.']}, {'end': 1880.438, 'segs': [{'end': 1583.977, 'src': 'heatmap', 'start': 1485.431, 'weight': 0.704, 'content': [{'end': 1489.235, 'text': 'now you can see over here when i go to my training data set.', 'start': 1485.431, 'duration': 3.804}, {'end': 1491.618, 'text': 'okay, till here i have already explained you.', 'start': 1489.235, 'duration': 2.383}, {'end': 1493.319, 'text': "i've taken the column of all the categories.", 'start': 1491.618, 'duration': 1.701}, {'end': 1498.605, 'text': "i've created a category, a function which converts a category variable into one hot encoding.", 'start': 1493.319, 'duration': 5.286}, {'end': 1503.649, 'text': 'Now what I will do is that I will go and read that same formulated test.csv.', 'start': 1499.125, 'duration': 4.524}, {'end': 1510.336, 'text': 'Now you can see that my test.csv is basically having 74 columns and this many rows.', 'start': 1504.11, 'duration': 6.226}, {'end': 1512.518, 'text': 'And this is how my test data looks like.', 'start': 1510.816, 'duration': 1.702}, {'end': 1519.725, 'text': 'Now what I will do, I will combine my train data which is present inside my df and my test underscore df.', 'start': 1512.898, 'duration': 6.827}, {'end': 1522.727, 'text': "okay, row wise, I'll try to combine it row wise.", 'start': 1520.205, 'duration': 2.522}, {'end': 1526.489, 'text': "so in order to do that, I'll just write pd dot, concat df, comma,", 'start': 1522.727, 'duration': 3.762}, {'end': 1532.313, 'text': 'test underscore df with axis is equal to 0 and that variable is basically stored in our final underscore df.', 'start': 1526.489, 'duration': 5.824}, {'end': 1537.976, 'text': 'okay, now, when you go and see this, my final underscore df dot shape is somewhere around 2 to 8 1,', 'start': 1532.313, 'duration': 5.663}, {'end': 1542.91, 'text': 'which is the combination of both training and test, and my records are 75.', 'start': 1537.976, 'duration': 4.934}, {'end': 1544.311, 'text': 'Remember, in my test data set.', 'start': 1542.91, 'duration': 1.401}, {'end': 1546.853, 'text': "I don't have a column called a sales price.", 'start': 1544.371, 'duration': 2.482}, {'end': 1550.316, 'text': "So if I'm concatenating for all my test data to have no man values,", 'start': 1546.853, 
'duration': 3.463}, {'end': 1559.943, 'text': 'you remember that now I can easily apply my this particular function which converts all my category features into one hot inco.', 'start': 1550.316, 'duration': 9.627}, {'end': 1562.305, 'text': "So for that I'm just calling this function over here.", 'start': 1559.943, 'duration': 2.362}, {'end': 1566.908, 'text': "Okay, and I'm giving my list of columns over here itself.", 'start': 1562.305, 'duration': 4.603}, {'end': 1570.651, 'text': 'now, here you can see that, for which all columns it has basically performed it.', 'start': 1566.908, 'duration': 3.743}, {'end': 1583.977, 'text': 'and okay, now, after performing this, you will be seeing that I will now have 235 columns created from that 75 column after applying one hot encoding.', 'start': 1570.651, 'duration': 13.326}], 'summary': 'Combined training and test data, applied one hot encoding to create 235 columns from 75.', 'duration': 98.546, 'max_score': 1485.431, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU1485431.jpg'}, {'end': 1621.471, 'src': 'embed', 'start': 1594.603, 'weight': 0, 'content': [{'end': 1601.106, 'text': 'that basically means that both are internally correlated right And both are having the same importance.', 'start': 1594.603, 'duration': 6.503}, {'end': 1605.908, 'text': "So I'm just deleting the duplicate columns from this and my final DF is basically created.", 'start': 1601.126, 'duration': 4.782}, {'end': 1612.369, 'text': 'And now you can see that my final DF is basically having two 2881 records and 175 columns.', 'start': 1606.228, 'duration': 6.141}, {'end': 1615.29, 'text': 'Initially, how much columns I had? I had 235 columns.', 'start': 1612.769, 'duration': 2.521}, {'end': 1617.31, 'text': 'Now I have 175 columns.', 'start': 1615.73, 'duration': 1.58}, {'end': 1621.471, 'text': 'Okay So this is how I have basically done.', 'start': 1618.491, 'duration': 2.98}], 'summary': 'Cleaned data to reduce from 235 to 175 columns, resulting in 2881 records.', 'duration': 26.868, 'max_score': 1594.603, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU1594603.jpg'}, {'end': 1788.095, 'src': 'heatmap', 'start': 1636.135, 'weight': 0.714, 'content': [{'end': 1638.557, 'text': 'Now this is how my category features look like.', 'start': 1636.135, 'duration': 2.422}, {'end': 1643.502, 'text': 'You can see that if I go to the last, you all have values like zeros, ones, zeros, ones, zeros, ones.', 'start': 1638.598, 'duration': 4.904}, {'end': 1649.527, 'text': "Okay Now again, what I'll do, I'll divide this into my training data set and test data set.", 'start': 1644.122, 'duration': 5.405}, {'end': 1652.369, 'text': 'Again, I know how many records were there in the training data set.', 'start': 1649.567, 'duration': 2.802}, {'end': 1653.13, 'text': "I'll take that much.", 'start': 1652.409, 'duration': 0.721}, {'end': 1655.332, 'text': "How many records are there in the test data set? 
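When dummy columns are named only by their category levels, different features that share levels (for example quality ratings like 'Ex' or 'Gd') produce duplicate column labels, which is why the 235 encoded columns here shrink to 175 after dropping duplicates. One common way to keep only the first occurrence of each label:

```python
import pandas as pd

# Toy frame with a duplicated column label, standing in for the encoded data.
final_df = pd.DataFrame([[1, 0, 1], [0, 1, 1]], columns=['Ex', 'Gd', 'Ex'])
print(final_df.shape)   # (2, 3): 'Ex' appears twice

# Keep only the first occurrence of each column label.
final_df = final_df.loc[:, ~final_df.columns.duplicated()]
print(final_df.shape)   # (2, 2)
```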
I'll take that.", 'start': 1653.21, 'duration': 2.122}, {'end': 1660.067, 'text': "my test data set, i'll drop the sales price column because i don't have any.", 'start': 1656.325, 'duration': 3.742}, {'end': 1664.03, 'text': 'all the values are none over there and my x is equal to 1 in place is equal to.', 'start': 1660.067, 'duration': 3.963}, {'end': 1665.071, 'text': "i'm doing that now.", 'start': 1664.03, 'duration': 1.041}, {'end': 1667.432, 'text': 'you can see my shape in my test data set.', 'start': 1665.071, 'duration': 2.361}, {'end': 1670.534, 'text': 'it is 1459 174.', 'start': 1667.432, 'duration': 3.102}, {'end': 1676.198, 'text': "now what i'll do is that i will drop the sales price for my training data set in my x train.", 'start': 1670.534, 'duration': 5.664}, {'end': 1680.162, 'text': "I'll just from this training data set I'll create my X train and Y train right?", 'start': 1676.758, 'duration': 3.404}, {'end': 1683.345, 'text': 'So X train is basically having sales price.', 'start': 1680.562, 'duration': 2.783}, {'end': 1685.848, 'text': 'I mean, drop all the columns apart from sales price.', 'start': 1683.425, 'duration': 2.423}, {'end': 1688.631, 'text': 'Y train is basically having only the sales price.', 'start': 1686.208, 'duration': 2.423}, {'end': 1692.015, 'text': 'Now what we need, we have our X train and Y train.', 'start': 1689.352, 'duration': 2.663}, {'end': 1694.958, 'text': 'the best thing is that you start applying algorithm.', 'start': 1692.455, 'duration': 2.503}, {'end': 1696.78, 'text': "Now here I've just selected XGBoost.", 'start': 1694.998, 'duration': 1.782}, {'end': 1703.106, 'text': 'I could have also selected Random Forest, but I want to try with Random Forest, XGBoost, with hyperparameter optimization,', 'start': 1697.36, 'duration': 5.746}, {'end': 1704.768, 'text': 'but initially I just wanted to try it out.', 'start': 1703.106, 'duration': 1.662}, {'end': 1707.551, 'text': "I've used XGB Regressor, XGBoost Regressor.", 'start': 1705.128, 'duration': 2.423}, {'end': 1709.813, 'text': "I've done this and this has got executed.", 'start': 1708.011, 'duration': 1.802}, {'end': 1712.495, 'text': 'My ensemble techniques over here.', 'start': 1710.294, 'duration': 2.201}, {'end': 1716.076, 'text': "i was trying for a random forest regression, which i'll do it later.", 'start': 1712.495, 'duration': 3.581}, {'end': 1719.958, 'text': 'so once your function gets executed, once your classifier gets fit,', 'start': 1716.076, 'duration': 3.882}, {'end': 1728.921, 'text': "you can also save it as a pickle file so that you don't have to train it again and again because the training will take some amount of time.", 'start': 1719.958, 'duration': 8.963}, {'end': 1735.583, 'text': "and after that, what you do is that you just use classifier.predict on your test data set and now you'll be getting your y pred.", 'start': 1728.921, 'duration': 6.662}, {'end': 1738.265, 'text': 'okay, so this is your prediction data set.', 'start': 1736.323, 'duration': 1.942}, {'end': 1739.706, 'text': 'now this particular data set.', 'start': 1738.265, 'duration': 1.441}, {'end': 1742.568, 'text': 'remember the submission dot csv file.', 'start': 1739.706, 'duration': 2.862}, {'end': 1746.912, 'text': "okay, so what i'll do is that first of all i'll convert this y parade into a data frame.", 'start': 1742.568, 'duration': 4.344}, {'end': 1751.442, 'text': 'Okay, then I will read the sample underscore submission dot CSV file.', 'start': 1747.619, 'duration': 3.823}, {'end': 
1756.345, 'text': 'I will take the ID column from this sample dot submission dot CSV.', 'start': 1752.382, 'duration': 3.963}, {'end': 1762.789, 'text': 'If you remember guys, inside this particular data set, right? We have an ID and sales price.', 'start': 1756.745, 'duration': 6.044}, {'end': 1765.091, 'text': "I don't require the sales price.", 'start': 1763.21, 'duration': 1.881}, {'end': 1766.352, 'text': 'I have actually computed it.', 'start': 1765.171, 'duration': 1.181}, {'end': 1768.213, 'text': 'I will just take this ID column.', 'start': 1766.912, 'duration': 1.301}, {'end': 1770.014, 'text': "I'll put up my sales price over here.", 'start': 1768.333, 'duration': 1.681}, {'end': 1773.077, 'text': "So for that I'm writing these two lines of code.", 'start': 1770.595, 'duration': 2.482}, {'end': 1779.304, 'text': "and finally, i'm converting this into sample underscore submission.csv.", 'start': 1774.057, 'duration': 5.247}, {'end': 1784.59, 'text': 'now, after that, once your submission file gets created, you just have to go over here, click on submit prediction.', 'start': 1779.304, 'duration': 5.286}, {'end': 1788.095, 'text': 'then they will ask you the file to load over here.', 'start': 1784.59, 'duration': 3.505}], 'summary': 'Performed data split and applied xgboost regressor for prediction on test data, achieving a shape of 1459 by 174.', 'duration': 151.96, 'max_score': 1636.135, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU1636135.jpg'}, {'end': 1683.345, 'src': 'embed', 'start': 1656.325, 'weight': 1, 'content': [{'end': 1660.067, 'text': "my test data set, i'll drop the sales price column because i don't have any.", 'start': 1656.325, 'duration': 3.742}, {'end': 1664.03, 'text': 'all the values are none over there, and my axis is equal to 1 and inplace is equal to True.', 'start': 1660.067, 'duration': 3.963}, {'end': 1665.071, 'text': "i'm doing that now.", 'start': 1664.03, 'duration': 1.041}, {'end': 1667.432, 'text': 'you can see my shape in my test data set.', 'start': 1665.071, 'duration': 2.361}, {'end': 1670.534, 'text': 'it is 1459 by 174.', 'start': 1667.432, 'duration': 3.102}, {'end': 1676.198, 'text': "now what i'll do is that i will drop the sales price for my training data set in my x train.", 'start': 1670.534, 'duration': 5.664}, {'end': 1680.162, 'text': "I'll just from this training data set I'll create my X train and Y train right?", 'start': 1676.758, 'duration': 3.404}, {'end': 1683.345, 'text': 'So X train is basically dropping the sales price.', 'start': 1680.562, 'duration': 2.783}], 'summary': 'Dropped sales price column with none values, test data set shape is 1459x174, and created x train and y train from training data set.', 'duration': 27.02, 'max_score': 1656.325, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU1656325.jpg'}, {'end': 1738.265, 'src': 'embed', 'start': 1697.36, 'weight': 2, 'content': [{'end': 1703.106, 'text': 'I could have also selected Random Forest, but I want to try with Random Forest, XGBoost, with hyperparameter optimization,', 'start': 1697.36, 'duration': 5.746}, {'end': 1704.768, 'text': 'but initially I just wanted to try it out.', 'start': 1703.106, 'duration': 1.662}, {'end': 1707.551, 'text': "I've used XGB Regressor, XGBoost Regressor.", 'start': 1705.128, 'duration': 2.423}, {'end': 1709.813, 'text': "I've done this and this has got executed.", 'start': 1708.011, 'duration': 1.802}, {'end':
1712.495, 'text': 'My ensemble techniques over here.', 'start': 1710.294, 'duration': 2.201}, {'end': 1716.076, 'text': "i was trying for a random forest regression, which i'll do later.", 'start': 1712.495, 'duration': 3.581}, {'end': 1719.958, 'text': 'so once your function gets executed, once your classifier gets fit,', 'start': 1716.076, 'duration': 3.882}, {'end': 1728.921, 'text': "you can also save it as a pickle file so that you don't have to train it again and again because the training will take some amount of time.", 'start': 1719.958, 'duration': 8.963}, {'end': 1735.583, 'text': "and after that, what you do is that you just use classifier.predict on your test data set and now you'll be getting your y pred.", 'start': 1728.921, 'duration': 6.662}, {'end': 1738.265, 'text': 'okay, so this is your prediction data set.', 'start': 1736.323, 'duration': 1.942}], 'summary': 'Experimented with xgboost and random forest for regression, emphasizing saving time by using a pickle file for the trained classifier.', 'duration': 40.905, 'max_score': 1697.36, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU1697360.jpg'}, {'end': 1848.302, 'src': 'embed', 'start': 1803.344, 'weight': 4, 'content': [{'end': 1814.15, 'text': "it will go and compute the score and you will be able to see your score over here in the leaderboard section and whatever rank you're getting, okay.", 'start': 1803.344, 'duration': 10.806}, {'end': 1817.913, 'text': 'so currently you see that i am getting somewhere around 2522 rank.', 'start': 1814.15, 'duration': 3.763}, {'end': 1825.101, 'text': 'I am definitely sure I will be able to come into the top 100, because I have to do a lot of things over there.', 'start': 1819.155, 'duration': 5.946}, {'end': 1832.788, 'text': 'Some of the best practices that I will definitely follow, and I will also give you suggestions on how to do it in my later videos.', 'start': 1825.742, 'duration': 7.046}, {'end': 1834.27, 'text': 'But right now I have done till here.', 'start': 1832.828, 'duration': 1.442}, {'end': 1838.254, 'text': 'The best thing is that I was able to do this in just 4 hours.', 'start': 1834.71, 'duration': 3.544}, {'end': 1841.356, 'text': 'That was the most motivating thing for me.', 'start': 1838.914, 'duration': 2.442}, {'end': 1844.439, 'text': 'uh, you know, i had to do a lot of stuff.', 'start': 1842.277, 'duration': 2.162}, {'end': 1848.302, 'text': 'you know, understand how that training data was, how the test data was, remember.', 'start': 1844.439, 'duration': 3.863}], 'summary': 'Rank 2522, aiming for top 100, completed task in 4 hours', 'duration': 44.958, 'max_score': 1803.344, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU1803344.jpg'}], 'start': 1453.369, 'title': 'Data processing, model training, and submission process', 'summary': 'Details data processing, model training using xgboost and random forest resulting in 2881 records and 175 columns, and the submission process, achieving a rank of 2522 in about 4 hours with plans to reach a top 100 rank.', 'chapters': [{'end': 1716.076, 'start': 1453.369, 'title': 'Data processing and model training', 'summary': 'Details the process of handling null values, converting data to csv, combining training and test datasets, applying one hot encoding, removing duplicate columns, and preparing the data for model training using xgboost and random forest, resulting in a final dataset with 2881 records and 175
columns.', 'duration': 262.707, 'highlights': ['The final dataset has 2881 records and 175 columns after handling null values, combining training and test datasets, applying one hot encoding, and removing duplicate columns. The combined final dataset contains 2881 records and 175 columns, showcasing the successful completion of data processing.', 'The test dataset has 1459 records and 174 columns after processing for model training. The test dataset is prepared with 1459 records and 174 columns, crucial for the subsequent model training process.', 'XGBoost and Random Forest are mentioned as the algorithms for model training, with a plan to utilize hyperparameter optimization for Random Forest in the future. The chapter discusses the selection of XGBoost and plans for utilizing Random Forest with hyperparameter optimization in the future for model training.']}, {'end': 1880.438, 'start': 1716.076, 'title': 'Submission process and result analysis', 'summary': 'Outlines the process of saving and using a trained classifier to make predictions, preparing a submission file, and submitting predictions to achieve a rank of 2522, with plans to improve and achieve a top 100 rank in a time frame of 4 hours.', 'duration': 164.362, 'highlights': ['The process of saving and using a trained classifier to make predictions Explains how to save a trained classifier as a pickle file to avoid repeated training and make predictions on test data.', 'Achievement of a rank of 2522 with plans to improve and achieve a top 100 rank Expresses confidence in improving the rank and achieving a top 100 rank by following best practices and strategies.', 'Efficient completion of the process within 4 hours Highlights the accomplishment of completing the entire process in just 4 hours, indicating a high level of efficiency.']}], 'duration': 427.069, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/vtm35gVP8JU/pics/vtm35gVP8JU1453369.jpg', 'highlights': ['The combined final dataset contains 2881 records and 175 columns, showcasing the successful completion of data processing.', 'The test dataset is prepared with 1459 records and 174 columns, crucial for the subsequent model training process.', 'The chapter discusses the selection of XGBoost and plans for utilizing Random Forest with hyperparameter optimization in the future for model training.', 'Explains how to save a trained classifier as a pickle file to avoid repeated training and make predictions on test data.', 'Expresses confidence in improving the rank and achieving a top 100 rank by following best practices and strategies.', 'Highlights the accomplishment of completing the entire process in just 4 hours, indicating a high level of efficiency.']}], 'highlights': ['Achieved rank 2521 out of 4384 with score 0.141 in 4 hours', 'The dataset for the Kaggle competition has around 81 features and numerous category features, with a significant amount of null values', 'The combined final dataset contains 2881 records and 175 columns, showcasing the successful completion of data processing', 'Features with over 50% missing values are dropped to facilitate data cleaning and improve dataset quality', 'Explains how to save a trained classifier as a pickle file to avoid repeated training and make predictions on test data']}
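
The concatenation and one-hot-encoding step walked through in the segments above can be sketched in pandas roughly as follows. This is a minimal sketch, not the exact code from the video: the file names, the dtype-based column selection, and the helper name are illustrative assumptions.

import pandas as pd

# Cleaned train and test files produced after the missing-value handling
# (the file names here are assumptions, not the exact ones from the video).
df = pd.read_csv('formulatedtrain.csv')
test_df = pd.read_csv('formulatedtest.csv')

# Stack the test rows under the training rows (axis=0) so that get_dummies
# sees every category level present in either file; ignore_index avoids
# duplicate row labels. SalePrice is absent from the test file, so those
# rows simply get NaN in that column.
final_df = pd.concat([df, test_df], axis=0, ignore_index=True)

def onehot_multiple_columns(data, columns):
    # Replace each listed categorical column with un-prefixed dummy columns;
    # drop_first removes one redundant dummy per feature.
    out = data.drop(columns, axis=1)
    for field in columns:
        dummies = pd.get_dummies(data[field], drop_first=True)
        out = pd.concat([out, dummies], axis=1)
    return out

# The video passes an explicit list of category feature names; selecting
# the remaining object-dtype columns is an equivalent shortcut.
category_columns = [c for c in final_df.columns if final_df[c].dtype == 'object']
final_df = onehot_multiple_columns(final_df, category_columns)

# Dummy columns keep the raw category values as names, so different features
# can emit identically named columns; keep only the first occurrence.
final_df = final_df.loc[:, ~final_df.columns.duplicated()]
print(final_df.shape)

Dropping the duplicated dummy columns is what takes the frame from the 235 columns mentioned above down to 175.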
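Splitting the encoded frame back into train and test by row count, then building X train and Y train, might look like the sketch below; it assumes the cleaned training frame df from the previous snippet is still in scope and that the training rows sit first in final_df.

# Take exactly as many rows as the cleaned training frame had; the rest are test rows.
n_train = df.shape[0]

df_train = final_df.iloc[:n_train, :].copy()
df_test = final_df.iloc[n_train:, :].copy()

# The test rows carry no target, so drop the all-NaN SalePrice column.
df_test.drop(['SalePrice'], axis=1, inplace=True)

# X train keeps every column except the target; Y train is the target only.
X_train = df_train.drop(['SalePrice'], axis=1)
y_train = df_train['SalePrice']
print(df_test.shape, X_train.shape, y_train.shape)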
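The baseline model fit and the pickle step described above could be sketched like this; using default XGBoost hyperparameters and the pickle file name are assumptions, and hyperparameter tuning is left out just as in the video.

import pickle
import xgboost

# Baseline XGBoost regressor with default hyperparameters; tuning the
# parameters (and the Random Forest comparison) is deliberately left for later.
classifier = xgboost.XGBRegressor()
classifier.fit(X_train, y_train)

# Persist the fitted model so it does not have to be retrained every time.
with open('finalized_model.pkl', 'wb') as f:
    pickle.dump(classifier, f)

# Later, reload it instead of refitting.
with open('finalized_model.pkl', 'rb') as f:
    classifier = pickle.load(f)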
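Finally, generating the predictions and assembling the submission file from the Id column of sample_submission.csv, as described in the last segment; the output file name is an assumption.

# Predict on the prepared test rows (same columns and order as X_train).
y_pred = classifier.predict(df_test)

# sample_submission.csv for this competition has an Id and a SalePrice column;
# keep the Id and attach the predicted prices next to it.
sample = pd.read_csv('sample_submission.csv')
submission = pd.concat([sample['Id'], pd.DataFrame(y_pred, columns=['SalePrice'])], axis=1)

# Write the file without the index, then upload it via "Submit Predictions".
submission.to_csv('my_submission.csv', index=False)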