title

Advance House Price Prediction- Exploratory Data Analysis- Part 1

description

GitHub URL: https://github.com/krishnaik06/Advanced-House-Price-Prediction-
Please join my channel as a member to get additional benefits such as Data Science materials, live streams for members, and more:
https://www.youtube.com/channel/UCNU_lfiiWBdtULKOw6X0Dig/join
Please subscribe to my other channel too:
https://www.youtube.com/channel/UCjWY5hREA6FFYrthD0rZNIw
Connect with me here:
Twitter: https://twitter.com/Krishnaik06
Facebook: https://www.facebook.com/krishnaik06
Instagram: https://www.instagram.com/krishnaik06

detail

{'title': 'Advance House Price Prediction- Exploratory Data Analysis- Part 1', 'heatmap': [{'end': 368.705, 'start': 345.524, 'weight': 0.708}, {'end': 409.742, 'start': 373.887, 'weight': 0.72}, {'end': 570.547, 'start': 549.472, 'weight': 0.739}, {'end': 738.989, 'start': 717.038, 'weight': 0.826}, {'end': 861.21, 'start': 842.482, 'weight': 0.725}, {'end': 987.934, 'start': 971.862, 'weight': 0.75}, {'end': 1124.271, 'start': 1042.367, 'weight': 0.717}], 'summary': 'Series explores machine learning pipelines and data analysis for the advanced house price regression technique from Kaggle, focusing on 80+ features and covering data analysis, feature engineering, selection, model building, and deployment. It also emphasizes the manipulation of a data frame with 1460 rows and 81 columns, the impact of NaN values on house prices, and the relationship analysis in data visualization.', 'chapters': [{'end': 176.025, 'segs': [{'end': 89.346, 'src': 'embed', 'start': 61.878, 'weight': 2, 'content': [{'end': 65.181, 'text': "So, in this particular video, we'll try to understand that.", 'start': 61.878, 'duration': 3.303}, {'end': 70.586, 'text': "Today's video will try to understand data analysis phase only in the upcoming videos.", 'start': 65.642, 'duration': 4.944}, {'end': 78.894, 'text': "And this is just the part one of data analysis phase, because the project that I've actually taken is from a Kaggle problem statement,", 'start': 70.626, 'duration': 8.268}, {'end': 82.857, 'text': 'which is basically called as Advanced House Price Regression Technique.', 'start': 78.894, 'duration': 3.963}, {'end': 89.346, 'text': "i've taken this particular data set, guys, because, uh, in some, uh, few months back,", 'start': 83.818, 'duration': 5.528}], 'summary': 'Introduction to the data analysis phase with the Kaggle problem statement on the advanced house price regression technique.', 'duration': 27.468, 'max_score': 61.878, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ioN1jcWxbv8/pics/ioN1jcWxbv861878.jpg'}, {'end': 152.662, 'src': 'embed', 'start': 126.932, 'weight': 0, 'content': [{'end': 133.255, 'text': 'and the best thing about this particular data set is that it has more than 80 features.', 'start': 126.932, 'duration': 6.323}, {'end': 136.117, 'text': 'you know, so a lot of NaN values.', 'start': 133.255, 'duration': 2.862}, {'end': 137.217, 'text': 'how did I handle that?', 'start': 136.117, 'duration': 1.1}, {'end': 142.019, 'text': 'in feature engineering, feature selection. This will be a group of four to five videos.', 'start': 137.217, 'duration': 4.802}, {'end': 144.92, 'text': "And first of all, we'll just start with data analysis.", 'start': 142.619, 'duration': 2.301}, {'end': 149.561, 'text': 'But I can also say that suppose, if you are getting a project for the first time,', 'start': 145.4, 'duration': 4.161}, {'end': 152.662, 'text': 'what are the things you should do? 
How can you actually write clean code?', 'start': 149.561, 'duration': 3.101}], 'summary': 'Data set has 80+ features, covering feature engineering, selection, and clean code writing.', 'duration': 25.73, 'max_score': 126.932, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ioN1jcWxbv8/pics/ioN1jcWxbv8126932.jpg'}], 'start': 1.363, 'title': 'Machine learning pipelines and data analysis', 'summary': 'Delves into machine learning pipelines and the data analysis phase, focusing on the advanced house price regression technique from Kaggle with 80+ features and outlining a series of four to five videos covering data analysis, feature engineering, feature selection, model building, and model deployment.', 'chapters': [{'end': 176.025, 'start': 1.363, 'title': 'Machine learning pipelines and data analysis', 'summary': 'Discusses machine learning pipelines and the data analysis phase of a data science project, focusing on the advanced house price regression technique from Kaggle, with 80+ features and the plan for a series of four to five videos covering data analysis, feature engineering, feature selection, model building, and model deployment.', 'duration': 174.662, 'highlights': ['The video covers the data analysis phase of a data science project, focusing on the Advanced House Price Regression Technique from Kaggle, which includes 80+ features. Data analysis of the Advanced House Price Regression Technique from Kaggle with 80+ features.', 'The plan includes a series of four to five videos covering data analysis, feature engineering, feature selection, model building, and model deployment. Planned series of four to five videos covering data analysis, feature engineering, feature selection, model building, and model deployment.', 'The data set used for the project has more than 80 features and includes handling of missing values during feature engineering and feature selection. 
Data set with more than 80 plus features and handling of missing values during feature engineering and feature selection.']}], 'duration': 174.662, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ioN1jcWxbv8/pics/ioN1jcWxbv81363.jpg', 'highlights': ['Series of four to five videos covering data analysis, feature engineering, feature selection, model building, and model deployment.', 'Data set with more than 80 plus features and handling of missing values during feature engineering and feature selection.', 'Data analysis of the Advanced House Spice Regression Technique from Kaggle with over 80+ features.']}, {'end': 567.084, 'segs': [{'end': 203.606, 'src': 'embed', 'start': 176.946, 'weight': 0, 'content': [{'end': 181.85, 'text': "So initially, I'll be requiring some libraries like pandas, numpy, matplotlib, seaborn.", 'start': 176.946, 'duration': 4.904}, {'end': 183.751, 'text': 'Pretty much simple for every one of you.', 'start': 181.97, 'duration': 1.781}, {'end': 189.115, 'text': "Then I'll also try to display all the columns of a data frame by using this which is present in partners,", 'start': 184.312, 'duration': 4.803}, {'end': 191.777, 'text': 'which is called as pd.partners.set,option.', 'start': 189.115, 'duration': 2.662}, {'end': 195.7, 'text': 'And here, we have to specify display.max,columns.', 'start': 192.297, 'duration': 3.403}, {'end': 203.606, 'text': 'If you specify display.max,rows, then whenever you display any data frame, every, even though it has million records,', 'start': 196.16, 'duration': 7.446}], 'summary': 'Using libraries like pandas, numpy, matplotlib, seaborn to display columns of a data frame, with options to specify max columns and rows.', 'duration': 26.66, 'max_score': 176.946, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ioN1jcWxbv8/pics/ioN1jcWxbv8176946.jpg'}, {'end': 291.017, 'src': 'embed', 'start': 217.341, 'weight': 2, 'content': [{'end': 219.803, 'text': 
"I'm reading in this particular variable, which is called as dataset.", 'start': 217.341, 'duration': 2.462}, {'end': 228.831, 'text': 'And the string.csv, if you go and see the shape of this particular data set, it has 1460 rows at 81 columns, guys.', 'start': 220.744, 'duration': 8.087}, {'end': 235.877, 'text': 'So this will be a very, very good example for you to deal with complicated data set, which has more number of features.', 'start': 229.431, 'duration': 6.446}, {'end': 241.481, 'text': 'How do we do feature engineering? How do we do feature selection? How do we do model creation? So stay tuned.', 'start': 236.457, 'duration': 5.024}, {'end': 242.963, 'text': 'This is pretty much important, guys.', 'start': 241.581, 'duration': 1.382}, {'end': 247.626, 'text': "If you follow this, trust me, you'll be able to do any other problem statements.", 'start': 243.223, 'duration': 4.403}, {'end': 249.367, 'text': 'So here it is.', 'start': 248.787, 'duration': 0.58}, {'end': 252.008, 'text': "We I'm just trying to see the top five records.", 'start': 249.827, 'duration': 2.181}, {'end': 254.348, 'text': 'So here are my top five records guys.', 'start': 252.388, 'duration': 1.96}, {'end': 261.67, 'text': 'They are some of the Features which is called as a my subclass and my zoning lot, frontage lot, area, street alley,', 'start': 254.408, 'duration': 7.262}, {'end': 269.772, 'text': 'lot shape the various properties Related to the house, which actually determines your sales price of that particular house in dollars.', 'start': 261.67, 'duration': 8.102}, {'end': 272.733, 'text': 'Okay so pretty much important in this.', 'start': 269.772, 'duration': 2.961}, {'end': 274.433, 'text': 'now, in data analysis, What do we do?', 'start': 272.733, 'duration': 1.7}, {'end': 283.493, 'text': "So here I've actually written in data analysis we will try to analyze to find out the below stuff, that is, missing values, all the numerical values,", 'start': 275.669, 'duration': 
7.824}, {'end': 285.094, 'text': 'distribution of the numerical values.', 'start': 283.493, 'duration': 1.601}, {'end': 291.017, 'text': 'because, understand, since this is a regression problem statement, we have to focus on the distribution of numerical values.', 'start': 285.094, 'duration': 5.923}], 'summary': 'Data set has 1460 rows, 81 columns. focus on feature engineering, selection, and model creation. analyze missing values and numerical distribution.', 'duration': 73.676, 'max_score': 217.341, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ioN1jcWxbv8/pics/ioN1jcWxbv8217341.jpg'}, {'end': 368.705, 'src': 'heatmap', 'start': 345.524, 'weight': 0.708, 'content': [{'end': 355.772, 'text': "so i'm just reading each and every columns from the data set and i'm saying that if that data set of that specific feature and i'm putting a condition which is called as is null,", 'start': 345.524, 'duration': 10.248}, {'end': 364.462, 'text': 'okay dot sum is greater than one, that basically means if there is at least one nan value, then i can consider that that feature has a nan value.', 'start': 355.772, 'duration': 8.69}, {'end': 368.705, 'text': 'right. 
so i will be considering all that particular features.', 'start': 364.462, 'duration': 4.243}], 'summary': 'Analyzing dataset for null values in each feature.', 'duration': 23.181, 'max_score': 345.524, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ioN1jcWxbv8/pics/ioN1jcWxbv8345524.jpg'}, {'end': 409.742, 'src': 'heatmap', 'start': 373.887, 'weight': 0.72, 'content': [{'end': 380.631, 'text': "then what i will do is that, uh, in the next statement, and here i'll i'll get all my features name, yes, understand that, okay,", 'start': 373.887, 'duration': 6.744}, {'end': 383.332, 'text': "all my feature name i'll be getting okay.", 'start': 380.631, 'duration': 2.701}, {'end': 391.755, 'text': "now i'll say for featuring features with na, that basically means i'm iterating through each and every feature and i'm printing the feature name,", 'start': 383.332, 'duration': 8.423}, {'end': 397.117, 'text': "okay, and then i'm saying that whatever the null value mean, you are getting okay,", 'start': 391.755, 'duration': 5.362}, {'end': 403.32, 'text': "i'm rounding it up till the four decimal points and i'm printing a statement which is called a missing value.", 'start': 397.117, 'duration': 6.203}, {'end': 407.761, 'text': "so i'm basically finding out the percentage of missing values in each and every feature.", 'start': 403.32, 'duration': 4.441}, {'end': 409.742, 'text': 'okay, pretty much simple, okay.', 'start': 407.761, 'duration': 1.981}], 'summary': 'Iterating through features to find missing values, rounding to 4 decimal points, and printing percentage.', 'duration': 35.855, 'max_score': 373.887, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ioN1jcWxbv8/pics/ioN1jcWxbv8373887.jpg'}, {'end': 470.715, 'src': 'embed', 'start': 442.023, 'weight': 5, 'content': [{'end': 448.588, 'text': 'but right now we have found out that, yes, they are missing values in this many features and these many features have nan 
values in chat.', 'start': 442.023, 'duration': 6.565}, {'end': 455.894, 'text': 'okay, now, the next thing is that, since there are many missing values, we need to find the relationship between the missing value and sales price.', 'start': 448.588, 'duration': 7.306}, {'end': 458.439, 'text': 'so I can do one thing right?', 'start': 455.894, 'duration': 2.545}, {'end': 460.502, 'text': 'If there is a missing value, I can drop that particular row.', 'start': 458.499, 'duration': 2.003}, {'end': 470.715, 'text': 'But understand that whether that missing value has some dependency or is there is a relationship with the dependent feature, which is the sales price.', 'start': 461.023, 'duration': 9.692}], 'summary': 'Many features have missing values; need to find relationship with sales price.', 'duration': 28.692, 'max_score': 442.023, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ioN1jcWxbv8/pics/ioN1jcWxbv8442023.jpg'}], 'start': 176.946, 'title': 'Data analysis and feature engineering', 'summary': 'Focuses on using libraries like pandas, numpy, matplotlib, and seaborn to manipulate a data frame with 1460 rows and 81 columns, emphasizing feature engineering, selection, and model creation. 
it also discusses the importance of house properties in determining sales price, steps in data analysis for regression problems, and handling missing values.', 'chapters': [{'end': 254.348, 'start': 176.946, 'title': 'Data analysis with pandas and matplotlib', 'summary': 'Focuses on using libraries like pandas, numpy, matplotlib, and seaborn to display and manipulate a data frame, with the dataset consisting of 1460 rows and 81 columns, providing a good example for dealing with complex data sets, and emphasizing the importance of feature engineering, selection, and model creation for problem-solving.', 'duration': 77.402, 'highlights': ['Illustrating the use of libraries like pandas, numpy, matplotlib, and seaborn for data manipulation and visualization, essential for analyzing complex data sets and model creation.', 'Explaining the process of displaying all the columns of a data frame using the pd.partners.set option, with a specific focus on the display.max_columns attribute.', 'Demonstrating the importance of feature engineering, selection, and model creation, emphasizing their significance for addressing problem statements and providing a foundation for handling various data analysis tasks.', 'Reading a dataset with 1460 rows and 81 columns, serving as a practical example for dealing with complex data sets and showcasing essential data analysis skills.']}, {'end': 567.084, 'start': 254.408, 'title': 'Data analysis and feature engineering', 'summary': 'Discusses the importance of various house properties in determining sales price, the steps involved in data analysis for a regression problem, and the process of identifying and handling missing values, including finding the relationship between missing values and sales price through visualizations and statistical analysis.', 'duration': 312.676, 'highlights': ['The importance of various house properties in determining the sales price is discussed, emphasizing the significance of these features in dollars. 
The chapter emphasizes the significance of various house properties in determining the sales price.', 'The process of data analysis for a regression problem includes analyzing missing values, numerical value distribution, feature engineering such as transformation, handling category value variables and cardinality, and exploring the relationship between independent and dependent features. The data analysis process for a regression problem involves analyzing missing values, numerical value distribution, and exploring the relationship between independent and dependent features.', 'The approach to handling missing values involves identifying features with null values, calculating the percentage of missing values for each feature, and understanding the relationship between missing values and sales price through visualizations and statistical analysis. The approach to handling missing values involves identifying features with null values, calculating the percentage of missing values for each feature, and understanding the relationship between missing values and sales price.']}], 'duration': 390.138, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ioN1jcWxbv8/pics/ioN1jcWxbv8176946.jpg', 'highlights': ['Illustrating the use of libraries like pandas, numpy, matplotlib, and seaborn for data manipulation and visualization, essential for analyzing complex data sets and model creation.', 'Explaining the process of displaying all the columns of a data frame using the pd.partners.set option, with a specific focus on the display.max_columns attribute.', 'Demonstrating the importance of feature engineering, selection, and model creation, emphasizing their significance for addressing problem statements and providing a foundation for handling various data analysis tasks.', 'Reading a dataset with 1460 rows and 81 columns, serving as a practical example for dealing with complex data sets and showcasing essential data analysis skills.', 'The process of 
data analysis for a regression problem includes analyzing missing values, numerical value distribution, feature engineering such as transformation, handling category value variables and cardinality, and exploring the relationship between independent and dependent features.', 'The approach to handling missing values involves identifying features with null values, calculating the percentage of missing values for each feature, and understanding the relationship between missing values and sales price through visualizations and statistical analysis.', 'The importance of various house properties in determining the sales price is discussed, emphasizing the significance of these features in dollars.']}, {'end': 1019.839, 'segs': [{'end': 635.548, 'src': 'embed', 'start': 589.7, 'weight': 0, 'content': [{'end': 592.881, 'text': 'So here, because of this NAND values, the price is also high.', 'start': 589.7, 'duration': 3.181}, {'end': 596.923, 'text': 'Now from this particular diagram, you see that.', 'start': 594.221, 'duration': 2.702}, {'end': 603.565, 'text': 'understand guys, whichever feature had a NAND values we had actually replaced to one, and whichever feature did not had a NAND value,', 'start': 596.923, 'duration': 6.642}, {'end': 604.526, 'text': 'we replaced to zero.', 'start': 603.565, 'duration': 0.961}, {'end': 614.132, 'text': 'Now here you can see that when the NAND values has a higher number And based on the NAND values, the house price the sales price,', 'start': 605.166, 'duration': 8.966}, {'end': 616.433, 'text': 'when we are trying to find out the median is also high.', 'start': 614.132, 'duration': 2.301}, {'end': 617.534, 'text': 'It is somewhere here.', 'start': 616.834, 'duration': 0.7}, {'end': 619.616, 'text': 'So this plays a very important role.', 'start': 618.055, 'duration': 1.561}, {'end': 626.821, 'text': 'Even lot frontage, other features like alley, mass variance type, mass variance area.', 'start': 619.696, 'duration': 7.125}, {'end': 
631.085, 'text': 'You can see that wherever we had NaN values, that has a higher median sales price.', 'start': 627.162, 'duration': 3.923}, {'end': 634.247, 'text': 'So it is definitely playing a major role.', 'start': 631.605, 'duration': 2.642}, {'end': 635.548, 'text': 'All the missing values.', 'start': 634.667, 'duration': 0.881}], 'summary': 'NaN values impact house prices: rows with missing values show higher median sales prices.', 'duration': 45.848, 'max_score': 589.7, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ioN1jcWxbv8/pics/ioN1jcWxbv8589700.jpg'}, {'end': 752.435, 'src': 'heatmap', 'start': 717.038, 'weight': 3, 'content': [{'end': 721.601, 'text': 'Okay, guys, so if it is not object, by default it will become numerical.', 'start': 717.038, 'duration': 4.563}, {'end': 726.724, 'text': "Okay, so I'm just going to take those particular numerical values and I'm going to see the length of numerical features will be 38.", 'start': 721.601, 'duration': 5.123}, {'end': 731.366, 'text': 'Okay, And here you can see that these are all my numerical features.', 'start': 726.724, 'duration': 4.642}, {'end': 738.989, 'text': 'but understand, guys, we have some of the features, like year built, you know, and I guess year sold.', 'start': 731.366, 'duration': 7.623}, {'end': 742.03, 'text': 'So all these features need to be handled also.', 'start': 739.129, 'duration': 2.901}, {'end': 748.952, 'text': 'And if I just consider this year sold or year built, right, it is also called as a temporal variable.', 'start': 742.65, 'duration': 6.302}, {'end': 752.435, 'text': 'the reason why we call it as a temporal variable.', 'start': 749.634, 'duration': 2.801}], 'summary': 'Identified 38 numerical features and mentioned the need to handle temporal variables like year built and year sold.', 'duration': 25.711, 'max_score': 717.038, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ioN1jcWxbv8/pics/ioN1jcWxbv8717038.jpg'}, {'end': 875.364, 'src': 'heatmap', 'start': 842.482, 'weight': 0.725, 'content': [{'end': 845.643, 'text': 'Year built, year remote add, garage year built, year sold.', 'start': 842.482, 'duration': 3.161}, {'end': 852.766, 'text': "Okay Now, if I'm printing this particular features over here, you can see that these are all my values.", 'start': 846.144, 'duration': 6.622}, {'end': 855.28, 'text': 'These are all my features.', 'start': 854.278, 'duration': 1.002}, {'end': 861.21, 'text': 'Now, let us go ahead and analyze this temporal date time variables.', 'start': 856.802, 'duration': 4.408}, {'end': 862.192, 'text': 'How do we analyze it?', 'start': 861.25, 'duration': 0.942}, {'end': 867.262, 'text': "now, first of all, guys, we'll try to understand whether there is.", 'start': 863.621, 'duration': 3.641}, {'end': 869.042, 'text': 'there are so many features, right?', 'start': 867.262, 'duration': 1.78}, {'end': 875.364, 'text': 'let us understand whether there is a relationship between year sold and the output variable, that is, sales price.', 'start': 869.042, 'duration': 6.322}], 'summary': 'Analyzing the relationship between year sold and sales price.', 'duration': 32.882, 'max_score': 842.482, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ioN1jcWxbv8/pics/ioN1jcWxbv8842482.jpg'}, {'end': 911.666, 'src': 'embed', 'start': 889.789, 'weight': 4, 'content': [{'end': 898.936, 'text': 'Now what we are doing, grouping by year sold feature, and then we are actually considering with respect to each and every group for this year sold.', 'start': 889.789, 'duration': 9.147}, {'end': 902.799, 'text': "we'll try to find out the median of the sales price and then we'll plot it.", 'start': 898.936, 'duration': 3.863}, {'end': 911.666, 'text': 'After we plot this information, you can see that the price is decreasing, which is pretty much 
amazing, right?', 'start': 903.659, 'duration': 8.007}], 'summary': 'Grouped by year sold, found median sales price, showing decreasing trend.', 'duration': 21.877, 'max_score': 889.789, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ioN1jcWxbv8/pics/ioN1jcWxbv8889789.jpg'}, {'end': 987.934, 'src': 'heatmap', 'start': 957.618, 'weight': 5, 'content': [{'end': 968.601, 'text': "Okay, So I'm just taking this up, Creating a new feature which is like Whatever feature is actually present over here,", 'start': 957.618, 'duration': 10.983}, {'end': 971.862, 'text': "and I'm saying that date of year of sold, minus date of feature.", 'start': 968.601, 'duration': 3.261}, {'end': 979.504, 'text': "Okay, so I'm basically finding the difference between the year variable and the year of the house was sold for.", 'start': 971.862, 'duration': 7.642}, {'end': 987.934, 'text': "okay. so that difference and I'm trying to plot that so that will give you an idea of in the year built this was the sales price.", 'start': 979.504, 'duration': 8.43}], 'summary': 'Creating a new feature to calculate the difference between year of sale and year built for plotting sales price trends.', 'duration': 21.886, 'max_score': 957.618, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ioN1jcWxbv8/pics/ioN1jcWxbv8957618.jpg'}], 'start': 567.584, 'title': 'Impact of nan values on house price', 'summary': 'Discusses the impact of nan values on house prices, highlighting that the conversion of nan values to one has led to higher median sales prices, indicating a significant role in determining house prices. 
it also covers the analysis of temporal variables in feature engineering, including identifying numerical features, analyzing the relationship between year sold and sales price, and comparing the difference between year features and sales price.', 'chapters': [{'end': 635.548, 'start': 567.584, 'title': 'Impact of nan values on house price', 'summary': 'Discusses the impact of nan values on house prices, highlighting that the conversion of nan values to one has led to higher median sales prices, indicating a significant role in determining house prices.', 'duration': 67.964, 'highlights': ['The conversion of NAN values to one has led to higher median sales prices, indicating a significant role in determining house prices.', 'Features with NAN values converted to one have a higher median sales price, such as lot frontage, alley, mass variance type, and mass variance area.', 'The presence of NAN values in features has resulted in higher sales prices, emphasizing the significant influence of missing values on house prices.']}, {'end': 1019.839, 'start': 635.988, 'title': 'Temporal variables analysis', 'summary': 'Discusses the analysis of temporal variables in feature engineering, including identifying numerical features, analyzing the relationship between year sold and sales price, and comparing the difference between year features and sales price.', 'duration': 383.851, 'highlights': ['Identifying numerical features The speaker identifies 38 numerical features in the dataset by checking the data types of the features, emphasizing the need to handle temporal variables like year built and year sold.', 'Analyzing the relationship between year sold and sales price The speaker groups the data by year sold and plots the median sales price for each group, revealing a decreasing price trend over the years, contrary to the expected increase.', 'Comparing the difference between year features and sales price The speaker creates new features representing the difference between 
the year variables and year sold, then plots these differences to visualize the impact of the age of the house on its sales price.']}], 'duration': 452.255, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ioN1jcWxbv8/pics/ioN1jcWxbv8567584.jpg', 'highlights': ['The conversion of NAN values to one has led to higher median sales prices, indicating a significant role in determining house prices.', 'Features with NAN values converted to one have a higher median sales price, such as lot frontage, alley, mass variance type, and mass variance area.', 'The presence of NAN values in features has resulted in higher sales prices, emphasizing the significant influence of missing values on house prices.', 'Identifying numerical features The speaker identifies 38 numerical features in the dataset by checking the data types of the features, emphasizing the need to handle temporal variables like year built and year sold.', 'Analyzing the relationship between year sold and sales price The speaker groups the data by year sold and plots the median sales price for each group, revealing a decreasing price trend over the years, contrary to the expected increase.', 'Comparing the difference between year features and sales price The speaker creates new features representing the difference between the year variables and year sold, then plots these differences to visualize the impact of the age of the house on its sales price.']}, {'end': 1402.657, 'segs': [{'end': 1124.271, 'src': 'heatmap', 'start': 1042.367, 'weight': 0.717, 'content': [{'end': 1049.27, 'text': "now, discrete variables are basically like they'll also be having some integers values, okay, but they will be having some fixed set of integers.", 'start': 1042.367, 'duration': 6.903}, {'end': 1052.542, 'text': 'Now, if I want to find out the discrete variable,', 'start': 1050.071, 'duration': 2.471}, {'end': 1059.865, 'text': 'what I have done is that have written a simple code which says that 
feature for feature in numerical underscore features.', 'start': 1052.542, 'duration': 7.323}, {'end': 1069.012, 'text': "I'm considering all the numerical underscore features and I'm saying that if length of data of feature dot unique, if it is less than 25, okay,", 'start': 1059.865, 'duration': 9.147}, {'end': 1075.878, 'text': "so that basically means for each and every category, I'm considering 25 as my threshold, unique parameters or unique values.", 'start': 1069.012, 'duration': 6.866}, {'end': 1080.002, 'text': "if it is less than that, then I'm going to consider that as a discrete variable.", 'start': 1075.878, 'duration': 4.124}, {'end': 1081.822, 'text': 'And one more condition.', 'start': 1080.922, 'duration': 0.9}, {'end': 1086.203, 'text': "I'm saying that the feature should not be a part of year feature and it should not be the part of ID.", 'start': 1081.822, 'duration': 4.381}, {'end': 1088.323, 'text': 'Pretty much simple, right? It should not be.', 'start': 1086.663, 'duration': 1.66}, {'end': 1090.604, 'text': 'Then only we can consider that as a discrete feature.', 'start': 1088.723, 'duration': 1.881}, {'end': 1099.505, 'text': 'Because year may be having less than 25 unique features because if we have only 25 years present in my data set, then it may look like it.', 'start': 1091.084, 'duration': 8.421}, {'end': 1101.406, 'text': "But we're not going to consider for our year data set.", 'start': 1099.525, 'duration': 1.881}, {'end': 1106.686, 'text': 'So after executing this, you can see that my total discrete variable count was 17.', 'start': 1102.006, 'duration': 4.68}, {'end': 1108.567, 'text': 'If I want to display, this is how it looks like.', 'start': 1106.686, 'duration': 1.881}, {'end': 1109.667, 'text': 'These are all my features.', 'start': 1108.687, 'duration': 0.98}, {'end': 1124.271, 'text': "okay. 
and now, this is the head of my data set for all my discrete features, and then we'll try to find out the relationship between them and sales price.", 'start': 1110.907, 'duration': 13.364}], 'summary': 'Identified 17 discrete variables: numerical features with fewer than 25 unique values, excluding the year features and Id.', 'duration': 81.904, 'max_score': 1042.367, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ioN1jcWxbv8/pics/ioN1jcWxbv81042367.jpg'}, {'end': 1081.822, 'src': 'embed', 'start': 1059.865, 'weight': 0, 'content': [{'end': 1069.012, 'text': "I'm considering all the numerical_features and I'm saying that if len(data[feature].unique()) is less than 25, okay,", 'start': 1059.865, 'duration': 9.147}, {'end': 1075.878, 'text': "so that basically means, for each and every feature, I'm considering 25 unique values as my threshold.", 'start': 1069.012, 'duration': 6.866}, {'end': 1080.002, 'text': "If it is less than that, then I'm going to consider that as a discrete variable.", 'start': 1075.878, 'duration': 4.124}, {'end': 1081.822, 'text': 'And one more condition.', 'start': 1080.922, 'duration': 0.9}], 'summary': 'Treat numerical features with fewer than 25 unique values as discrete variables.', 'duration': 21.957, 'max_score': 1059.865, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ioN1jcWxbv8/pics/ioN1jcWxbv81059865.jpg'}, {'end': 1151.13, 'src': 'embed', 'start': 1124.271, 'weight': 1, 'content': [{'end': 1128.092, 'text': 'exploratory data analysis is to gather some information from the data.', 'start': 1124.271, 'duration': 3.821}, {'end': 1131.013, 'text': 'so we always have to compare with the dependent feature.', 'start': 1128.092, 'duration': 2.921}, {'end': 1131.613, 'text': 'always remember.', 'start': 1131.013, 'duration': 0.6}, {'end': 1134.855, 'text': "so again, I'm writing for feature in 
discrete_feature.", 'start': 1132.333, 'duration': 2.522}, {'end': 1136.997, 'text': 'data is equal to dataset.copy(), then a groupby.', 'start': 1134.855, 'duration': 2.142}, {'end': 1142.461, 'text': "I'm grouping by each of the discrete features and then I'm comparing with the median sales price.", 'start': 1136.997, 'duration': 5.464}, {'end': 1145.323, 'text': 'okay, here is your amazing thing.', 'start': 1142.461, 'duration': 2.862}, {'end': 1146.644, 'text': 'that will happen.', 'start': 1145.323, 'duration': 1.321}, {'end': 1151.13, 'text': 'Okay, so let me just show you.', 'start': 1146.644, 'duration': 4.486}], 'summary': 'Perform exploratory data analysis by comparing each discrete feature with the median sales price.', 'duration': 26.859, 'max_score': 1124.271, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ioN1jcWxbv8/pics/ioN1jcWxbv81124271.jpg'}, {'end': 1220.137, 'src': 'embed', 'start': 1193.208, 'weight': 2, 'content': [{'end': 1198.434, 'text': 'right. 
as the overall quality increases to somewhere around 9, the price is somewhere here.', 'start': 1193.208, 'duration': 5.226}, {'end': 1200.176, 'text': 'if it is 10, the price is somewhere here.', 'start': 1198.434, 'duration': 1.742}, {'end': 1205.543, 'text': 'so this basically shows that there is a relationship between the discrete variable and the output feature also.', 'start': 1200.176, 'duration': 5.367}, {'end': 1207.725, 'text': "so you're learning some information from this, right?", 'start': 1205.543, 'duration': 2.182}, {'end': 1209.207, 'text': 'What about other features?', 'start': 1208.246, 'duration': 0.961}, {'end': 1212.03, 'text': 'The other features again show a zigzag pattern over here.', 'start': 1209.467, 'duration': 2.563}, {'end': 1216.153, 'text': "You can see that again, you'll not be able to find an exponential increase.", 'start': 1212.11, 'duration': 4.043}, {'end': 1220.137, 'text': 'This is also called a monotonic relationship, this particular thing.', 'start': 1216.474, 'duration': 3.663}], 'summary': 'Sales price rises with overall quality, showing a monotonic relationship.', 'duration': 26.929, 'max_score': 1193.208, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ioN1jcWxbv8/pics/ioN1jcWxbv81193208.jpg'}, {'end': 1281.686, 'src': 'embed', 'start': 1251.188, 'weight': 3, 'content': [{'end': 1254.311, 'text': "Again, we'll try to see or learn more in the feature engineering part.", 'start': 1251.188, 'duration': 3.123}, {'end': 1257.774, 'text': 'Now, let us go ahead and try to see for the continuous variables also, guys.', 'start': 1254.811, 'duration': 2.963}, {'end': 1262.718, 'text': 'Now, for these continuous variables, if I want to find them out, you can think of a logic over here.', 'start': 1258.414, 'duration': 4.304}, {'end': 1264.6, 'text': 'And the logic looks something like this.', 'start': 1263.098, 'duration': 1.502}, {'end': 1267.449, 'text': "Okay, I'll just show it to you.", 'start': 
1266.007, 'duration': 1.442}, {'end': 1268.55, 'text': 'Over here.', 'start': 1268.19, 'duration': 0.36}, {'end': 1277.08, 'text': "the logic that I've actually written is, again, a list comprehension saying: feature for feature in numerical_features if feature not in discrete_feature.", 'start': 1268.55, 'duration': 8.53}, {'end': 1281.686, 'text': 'okay, it should not be a part of discrete_feature, which was a list that I had actually created earlier.', 'start': 1277.08, 'duration': 4.606}], 'summary': 'Selecting the continuous variables with a list comprehension that excludes the discrete features.', 'duration': 30.498, 'max_score': 1251.188, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ioN1jcWxbv8/pics/ioN1jcWxbv81251188.jpg'}, {'end': 1372.169, 'src': 'embed', 'start': 1344.452, 'weight': 4, 'content': [{'end': 1350.936, 'text': 'uh, this has a gaussian distribution, but other features do not have a gaussian distribution, you know,', 'start': 1344.452, 'duration': 6.484}, {'end': 1354.178, 'text': 'and this basically shows that the data is actually skewed.', 'start': 1350.936, 'duration': 3.242}, {'end': 1361.602, 'text': "So in my next part of the data analysis I'll be showing you how you can actually perform a normalization.", 'start': 1355.038, 'duration': 6.564}, {'end': 1365.865, 'text': "Always remember, guys, whenever you're solving a regression problem statement,", 'start': 1362.083, 'duration': 3.782}, {'end': 1372.169, 'text': 'you should try to convert this kind of non-Gaussian distribution into a Gaussian distribution or a standard normal distribution.', 'start': 1365.865, 'duration': 6.304}], 'summary': 'Most of the continuous features are skewed; the speaker recommends transforming them toward a Gaussian distribution for regression.', 'duration': 27.717, 'max_score': 1344.452, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ioN1jcWxbv8/pics/ioN1jcWxbv81344452.jpg'}], 'start': 1020.479, 'title': 
'Exploratory data analysis and relationship analysis in data visualization', 'summary': 'Covers the process of identifying 17 discrete variables and their relationship with sales price, emphasizing the comparison with the dependent feature. It also explores how discrete and continuous variables relate to sales price through visualization, highlighting the presence of exponential and monotonic relationships, skewed data, and the need for normalization and handling of missing values in regression problem statements.', 'chapters': [{'end': 1145.323, 'start': 1020.479, 'title': 'Exploratory data analysis', 'summary': 'Covers the process of identifying discrete variables in the dataset and their relationship with sales price, revealing 17 discrete variables and emphasizing the comparison with the dependent feature, the median sales price.', 'duration': 124.844, 'highlights': ['Identifying 17 discrete variables in the dataset based on a threshold of fewer than 25 unique values per feature. The code keeps numerical features with fewer than 25 unique values, excluding the year features and Id, resulting in 17 discrete variables.', 'Emphasizing the importance of comparing discrete variables with the dependent feature, the median sales price, in the exploratory data analysis. The process involves grouping by each discrete feature and comparing the median sales price across groups to understand their relationship.', 'Providing an overview of the process of identifying and analyzing discrete variables in the dataset as part of exploratory data analysis. 
The chapter highlights the methodology of identifying and analyzing discrete variables and their relationship with sales price as a part of exploratory data analysis.']}, {'end': 1402.657, 'start': 1145.323, 'title': 'Relationship analysis in data visualization', 'summary': 'Explores the relationship between discrete and continuous variables in data visualization, highlighting the presence of exponential and monotonic relationships, skewed data, and the need for normalization and handling of missing values in regression problem statements.', 'duration': 257.334, 'highlights': ['The graph demonstrates the relationship between discrete variables and sales price, revealing an exponential rise as the overall quality increases, with some features exhibiting a monotonic relationship as well.', 'A logic is presented to identify and analyze the 16 continuous features, showcasing the skewed nature of some features and emphasizing the importance of converting non-Gaussian distributions into a standard normal distribution for effective linear model prediction in regression problems.', 'The chapter emphasizes the significance of normalizing skewed data and addressing missing values in the context of regression problem statements, with the promise of providing related code in a GitHub link for further exploration.']}], 'duration': 382.178, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ioN1jcWxbv8/pics/ioN1jcWxbv81020479.jpg', 'highlights': ['Identifying 17 discrete variables based on less than 25 unique values for each category.', 'Emphasizing the importance of comparing discrete variables with the dependent feature, sales price median, in exploratory data analysis.', 'Demonstrating the relationship between discrete variables and sales price, revealing an exponential rise as the overall quality increases.', 'Presenting a logic to identify and analyze the 16 continuous features, showcasing the skewed nature of some features.', 'Emphasizing the 
significance of normalizing skewed data and addressing missing values in the context of regression problem statements.']}], 'highlights': ['Series explores machine learning pipelines and data analysis for the Kaggle competition House Prices: Advanced Regression Techniques, focusing on 80+ features and covering data analysis, feature engineering, selection, model building, and deployment.', 'Data set with more than 80 features and handling of missing values during feature engineering and feature selection.', 'Data analysis of the House Prices: Advanced Regression Techniques data set from Kaggle, with over 80 features.', 'Illustrating the use of libraries like pandas, numpy, matplotlib, and seaborn for data manipulation and visualization, essential for analyzing complex data sets and model creation.', 'Explaining the process of displaying all the columns of a data frame using pd.pandas.set_option, with a specific focus on the display.max_columns option.', 'Demonstrating the importance of feature engineering, selection, and model creation, emphasizing their significance for addressing problem statements and providing a foundation for handling various data analysis tasks.', 'Reading a dataset with 1460 rows and 81 columns, serving as a practical example for dealing with complex data sets and showcasing essential data analysis skills.', 'The process of data analysis for a regression problem includes analyzing missing values, numerical value distribution, feature engineering such as transformation, handling categorical variables and their cardinality, and exploring the relationship between independent and dependent features.', 'The approach to handling missing values involves identifying features with null values, calculating the percentage of missing values for each feature, and understanding the relationship between missing values and sales price through visualizations and statistical analysis.', 'The importance of various house properties in determining the sales price is discussed, 
emphasizing the significance of these features in dollars.', 'After NaN values are flagged as 1 and present values as 0, the rows with missing values show a higher median sales price for several features, suggesting that missingness carries information about house prices.', 'Features whose NaN rows have a higher median sales price include LotFrontage, Alley, MasVnrType, and MasVnrArea.', 'The presence of NaN values in these features is associated with higher sales prices, so the missing values cannot simply be ignored.', 'Identifying numerical features: the speaker identifies 38 numerical features in the dataset by checking the data types of the features, emphasizing the need to handle temporal variables like year built and year sold.', 'Analyzing the relationship between year sold and sales price: the speaker groups the data by year sold and plots the median sales price for each group, revealing a decreasing price trend over the years, contrary to the expected increase.', 'Comparing the difference between year features and sales price: the speaker creates new features representing the difference between the year variables and year sold, then plots these differences to visualize the impact of the age of the house on its sales price.', 'Identifying 17 discrete variables based on fewer than 25 unique values per feature.', 'Emphasizing the importance of comparing discrete variables with the dependent feature, the median sales price, in exploratory data analysis.', 'Demonstrating the relationship between discrete variables and sales price, revealing an exponential rise as the overall quality increases.', 'Presenting a logic to identify and analyze the 16 continuous features, showcasing the skewed nature of some features.', 'Emphasizing the significance of normalizing skewed data and addressing missing values in the context of regression problem statements.']}
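
The discrete/continuous split and the median comparison described in the transcript above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the speaker's notebook: the tiny DataFrame is synthetic, the column names only mimic the Kaggle data, and the "Yr"/"Year" substring rule for finding year features is an assumption. The list names (`numerical_features`, `year_feature`, `discrete_feature`, `continuous_feature`) follow the names spoken in the video. `is_numeric_dtype` and `nunique()` are swapped in for the dtype check and the `len(...unique())` call mentioned in the transcript; note that `nunique()` ignores NaN by default, while `unique()` counts it as a value.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Synthetic stand-in for the house price data; every value here is made up.
dataset = pd.DataFrame({
    "Id": range(1, 31),
    "OverallQual": [q for q in (4, 5, 6, 7, 8, 9) for _ in range(5)],
    "YrSold": [2006 + (i % 5) for i in range(30)],
    "GrLivArea": [800 + 37 * i for i in range(30)],
    "Street": ["Pave"] * 30,
    "SalePrice": [100_000 + 5_000 * i for i in range(30)],
})

# Numerical features: every column with a numeric dtype.
numerical_features = [f for f in dataset.columns if is_numeric_dtype(dataset[f])]

# Year features (assumption: identified by "Yr" or "Year" in the column name).
year_feature = [f for f in numerical_features if "Yr" in f or "Year" in f]

# Discrete features: fewer than 25 unique values, and neither a year feature nor Id.
discrete_feature = [
    f for f in numerical_features
    if dataset[f].nunique() < 25 and f not in year_feature + ["Id"]
]

# Continuous features: the remaining numerical features.
continuous_feature = [
    f for f in numerical_features
    if f not in discrete_feature + year_feature + ["Id"]
]

# Compare each discrete feature with the dependent feature via the group median.
for feature in discrete_feature:
    data = dataset.copy()
    print(data.groupby(feature)["SalePrice"].median())

median_by_quality = dataset.groupby("OverallQual")["SalePrice"].median()
```

Appending `.plot.bar()` to the grouped median would draw the kind of bar chart discussed in the video, provided matplotlib is installed. The threshold of 25 is the speaker's choice, not a rule; it is worth checking the resulting lists by eye.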