title
Linear Regression Algorithm | Linear Regression Machine Learning | Linear Regression Full Course

description
This "Linear Regression" tutorial by Great Learning will help you comprehensively learn the underlying concepts of linear regression. The session is taught by Professor Mukesh Rao, who is the academic director at Great Learning. Professor Mukesh Rao has over 20 years of industry experience in Market Research, Project Management, and Data Science.
The following topics are covered in the session:
* Introduction - 00:00:00
* Case Study to understand the need of Linear Regression - 00:01:12
* Introduction to Linear Regression - 00:04:18
* Introduction to Multiple Linear Regression - 00:09:49
* Simple Demo in R and Python - 00:11:00
* Comprehensive explanation of Linear Regression Algorithm - 00:35:47

detail
{'title': 'Linear Regression Algorithm | Linear Regression Machine Learning | Linear Regression Full Course', 'heatmap': [{'end': 558.243, 'start': 415.253, 'weight': 0.849}, {'end': 2231.808, 'start': 2087.206, 'weight': 0.728}, {'end': 4458.968, 'start': 4318.257, 'weight': 0.781}, {'end': 7108.425, 'start': 6546.277, 'weight': 0.897}, {'end': 7402.349, 'start': 7237.242, 'weight': 0.796}, {'end': 8637.651, 'start': 8492.928, 'weight': 0.709}, {'end': 13919.548, 'start': 13783.663, 'weight': 0.712}], 'summary': 'This full linear regression course covers the basics, model comparison, correlation, techniques, model evaluation, ensemble techniques, data preprocessing, and data analysis for car and embedded chip data, including demos in python and r, with a focus on practical applications such as predicting gre scores based on cgpa, achieving a mean squared error of 0.19 in predicting diamond prices, and evaluating driving data risk factors with a 75-25 data split.', 'chapters': [{'end': 37.703, 'segs': [{'end': 37.703, 'src': 'embed', 'start': 0.229, 'weight': 0, 'content': [{'end': 2.37, 'text': 'Hey guys, this is Bharani from Great Learning.', 'start': 0.229, 'duration': 2.141}, {'end': 5.29, 'text': 'And I welcome you all to this session on Linear Regression.', 'start': 2.69, 'duration': 2.6}, {'end': 11.212, 'text': 'So, Linear Regression is one of the simplest and most widely used algorithms in machine learning.', 'start': 5.75, 'duration': 5.462}, {'end': 18.253, 'text': "And I've created this tutorial in such a way that at the end of this session, you'll have complete understanding of Linear Regression.", 'start': 11.812, 'duration': 6.441}, {'end': 26.475, 'text': "Now, before we start off with the session, I'd also like to inform you guys that we'll be creating a series of high quality tutorials on Data Science,", 'start': 18.754, 'duration': 7.721}, {'end': 28.576, 'text': 'Artificial Intelligence and Computer Vision.', 'start': 26.475, 'duration': 2.101}, 
{'end': 35.361, 'text': "So please do subscribe to Great Learning's YouTube channel and click on the bell icon so that you have a notification of our upcoming videos.", 'start': 29.056, 'duration': 6.305}, {'end': 37.703, 'text': "Now let's have a quick glance at the agenda.", 'start': 35.881, 'duration': 1.822}], 'summary': 'Linear regression tutorial for complete understanding, with upcoming high-quality tutorials on data science, ai, and computer vision.', 'duration': 37.474, 'max_score': 0.229, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM229.jpg'}], 'start': 0.229, 'title': 'Linear regression tutorial', 'summary': 'Introduces linear regression, a widely used algorithm in machine learning, and promises comprehensive understanding, with hints of upcoming tutorials on data science, artificial intelligence, and computer vision.', 'chapters': [{'end': 37.703, 'start': 0.229, 'title': 'Linear regression tutorial', 'summary': 'Introduces linear regression, one of the most widely used algorithms in machine learning, and promises a comprehensive understanding by the end of the session, while also hinting at upcoming high quality tutorials on data science, artificial intelligence, and computer vision.', 'duration': 37.474, 'highlights': ['The tutorial promises a complete understanding of Linear Regression by the end of the session.', 'The speaker plans to create high quality tutorials on Data Science, Artificial Intelligence, and Computer Vision in the future.', 'Linear Regression is highlighted as one of the simplest and most widely used algorithms in machine learning.', "The speaker encourages the audience to subscribe to Great Learning's YouTube channel for upcoming videos."]}], 'duration': 37.474, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM229.jpg', 'highlights': ['The tutorial promises a complete understanding of Linear Regression by the end of 
the session.', 'Linear Regression is highlighted as one of the simplest and most widely used algorithms in machine learning.', 'The speaker plans to create high quality tutorials on Data Science, Artificial Intelligence, and Computer Vision in the future.', "The speaker encourages the audience to subscribe to Great Learning's YouTube channel for upcoming videos."]}, {'end': 588.315, 'segs': [{'end': 73.607, 'src': 'embed', 'start': 44.38, 'weight': 0, 'content': [{'end': 47.821, 'text': "We'll start off with a case study where we'll understand the need for regression analysis.", 'start': 44.38, 'duration': 3.441}, {'end': 52.582, 'text': "Then we'll have a brief introduction to simple linear regression and multiple linear regression.", 'start': 48.361, 'duration': 4.221}, {'end': 56.163, 'text': "After that, we'll have a demo in both Python and R languages.", 'start': 53.042, 'duration': 3.121}, {'end': 57.223, 'text': 'Going ahead.', 'start': 56.763, 'duration': 0.46}, {'end': 64.644, 'text': "we'll have Professor Mukesh Rao, who is the academic director at Great Learning, explaining all the concepts of linear regression comprehensively.", 'start': 57.223, 'duration': 7.421}, {'end': 71.067, 'text': 'Professor Mukesh Rao has over 20 years of industry experience in market research, project management, and data science.', 'start': 65.125, 'duration': 5.942}, {'end': 73.607, 'text': "So let's start off with this case study.", 'start': 71.767, 'duration': 1.84}], 'summary': 'Case study on regression analysis, introduction to simple and multiple linear regression, followed by a demo in python and r. 
professor mukesh rao, with over 20 years of industry experience, will explain the concepts comprehensively.', 'duration': 29.227, 'max_score': 44.38, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM44380.jpg'}, {'end': 306.07, 'src': 'embed', 'start': 280.275, 'weight': 3, 'content': [{'end': 284.858, 'text': 'So over here, GRE score is our dependent variable, which is a continuous numerical value.', 'start': 280.275, 'duration': 4.583}, {'end': 290.621, 'text': 'And we are trying to understand how it changes with respect to the CGPA of the student, which would be our independent variable.', 'start': 284.978, 'duration': 5.643}, {'end': 294.943, 'text': 'And when it comes to linear regression, we basically have a straight line.', 'start': 291.221, 'duration': 3.722}, {'end': 298.005, 'text': 'So a straight line is a linear equation.', 'start': 295.304, 'duration': 2.701}, {'end': 301.147, 'text': 'And that is why this algorithm is known as linear regression.', 'start': 298.285, 'duration': 2.862}, {'end': 306.07, 'text': 'So we have Y and we are trying to understand how Y varies with X.', 'start': 301.567, 'duration': 4.503}], 'summary': "Analyzing how gre score changes with student's cgpa using linear regression.", 'duration': 25.795, 'max_score': 280.275, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM280275.jpg'}, {'end': 558.243, 'src': 'heatmap', 'start': 415.253, 'weight': 0.849, 'content': [{'end': 420.999, 'text': 'But then again, the value which is being predicted by this line would be around 2.2.', 'start': 415.253, 'duration': 5.746}, {'end': 430.006, 'text': 'So the difference between this observed value of Y and the predicted value of Y by this line is what is known as the residual or the error in prediction.', 'start': 420.999, 'duration': 9.007}, {'end': 433.949, 'text': "And similarly, you'll have a residual for all of these points 
over here.", 'start': 430.446, 'duration': 3.503}, {'end': 435.33, 'text': "So let's take this point now.", 'start': 434.269, 'duration': 1.061}, {'end': 443.676, 'text': 'Here the X value is equal to 2 and the observed value of Y is 4, but the predicted value of Y would be around 3.8.', 'start': 435.71, 'duration': 7.966}, {'end': 449.16, 'text': 'Similarly, if you take this point, x value is 4 and the observed value of y is 2.', 'start': 443.676, 'duration': 5.484}, {'end': 452.583, 'text': 'But the predicted value of y over here is 5.', 'start': 449.16, 'duration': 3.423}, {'end': 460.528, 'text': 'And if you take this point, x value is 5, observed value of y is 7 and the predicted value of y is around 6.5.', 'start': 452.583, 'duration': 7.945}, {'end': 467.013, 'text': 'So, these dotted red lines which you see, these basically indicate the residuals or the error in prediction.', 'start': 460.528, 'duration': 6.485}, {'end': 474.4, 'text': "And to get that best fit line, don't you agree that we would basically have to reduce these residuals as much as possible?", 'start': 467.433, 'duration': 6.967}, {'end': 477.662, 'text': 'And this is where we have something known as the residual sum of squares.', 'start': 474.9, 'duration': 2.762}, {'end': 482.827, 'text': "So what we do is we'll go to this first residual and we'll square this up.", 'start': 478.063, 'duration': 4.764}, {'end': 488.192, 'text': "Then we'll go to the second residual, we'll square this up, go to the third residual, and then we'll square this up.", 'start': 483.107, 'duration': 5.085}, {'end': 490.894, 'text': "And we'll do this for all of the residuals.", 'start': 488.532, 'duration': 2.362}, {'end': 493.337, 'text': "And then we'll take the sum of all of the squared residuals.", 'start': 491.235, 'duration': 2.102}, {'end': 497.879, 'text': 'Now, when we take the sum, this has to be as low as possible,', 'start': 493.737, 'duration': 4.142}, {'end': 503.621, 'text': 'and whichever line would have the 
lowest value of residual sum of squares, that would be the best fit line.', 'start': 497.879, 'duration': 5.742}, {'end': 507.462, 'text': "Now let's extend this concept to the previous problem statement.", 'start': 504.381, 'duration': 3.081}, {'end': 514.325, 'text': 'So this was our problem statement where the GRE score was on the Y axis and the CGPA of the student was on the X axis.', 'start': 507.902, 'duration': 6.423}, {'end': 516.826, 'text': "And we've got these three lines over here.", 'start': 514.684, 'duration': 2.142}, {'end': 521.99, 'text': 'So, for the first line, the residual sum of squares comes to be around 28.', 'start': 517.566, 'duration': 4.424}, {'end': 525.632, 'text': 'For the second line, the residual sum of squares comes to be around 22.', 'start': 521.99, 'duration': 3.642}, {'end': 528.675, 'text': 'And for the third line, the residual sum of squares comes to be 24.', 'start': 525.632, 'duration': 3.043}, {'end': 537.102, 'text': 'Now, when we compare all three residual sums of squares, the second line, with the lowest value of 22, would be the best fit line for our problem statement.', 'start': 528.675, 'duration': 8.427}, {'end': 544.616, 'text': "So now we've got the best fit line and we already know that a straight line would have an equation associated with it,", 'start': 537.952, 'duration': 6.664}, {'end': 549.158, 'text': 'which is basically Y equals MX plus C, and that is the equation over here.', 'start': 544.616, 'duration': 4.542}, {'end': 550.639, 'text': 'So over here.', 'start': 549.638, 'duration': 1.001}, {'end': 555.421, 'text': 'Y is basically the GRE score of the student and X is the CGPA of the student.', 'start': 550.639, 'duration': 4.782}, {'end': 558.243, 'text': 'and these two which you see, the M value and the C value.', 'start': 555.421, 'duration': 2.822}], 'summary': 'Residual sum of squares helps find best fit line, with second line as the best fit for gre vs. 
cgpa problem.', 'duration': 142.99, 'max_score': 415.253, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM415253.jpg'}, {'end': 488.192, 'src': 'embed', 'start': 460.528, 'weight': 4, 'content': [{'end': 467.013, 'text': 'So, these dotted red lines which you see, these basically indicate the residuals or the error in prediction.', 'start': 460.528, 'duration': 6.485}, {'end': 474.4, 'text': "And to get that best fit line, don't you agree that we would basically have to reduce these residuals as much as possible?", 'start': 467.433, 'duration': 6.967}, {'end': 477.662, 'text': 'And this is where we have something known as the residual sum of squares.', 'start': 474.9, 'duration': 2.762}, {'end': 482.827, 'text': "So what we do is we'll go to this first residual and we'll square this up.", 'start': 478.063, 'duration': 4.764}, {'end': 488.192, 'text': "Then we'll go to the second residual, we'll square this up, go to the third residual, and then we'll square this up.", 'start': 483.107, 'duration': 5.085}], 'summary': 'The best fit line is the one that minimizes the residual sum of squares.', 'duration': 27.664, 'max_score': 460.528, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM460528.jpg'}], 'start': 44.38, 'title': 'Understanding and using linear regression for predictive analysis', 'summary': 'Covers the need for regression analysis, introduces simple and multiple linear regression, includes a demo in python and r, and features professor mukesh rao with over 20 years of industry experience. 
it also explains the process of using linear regression to predict gre scores based on cgpa, with a focus on understanding the relationship between the two variables, building a best fit line based on residual sum of squares, and using the linear regression equation to predict the gre score for a given cgpa.', 'chapters': [{'end': 73.607, 'start': 44.38, 'title': 'Understanding linear regression', 'summary': 'The chapter covers the need for regression analysis, introduces simple and multiple linear regression, includes a demo in python and r, and features professor mukesh rao with over 20 years of industry experience explaining linear regression comprehensively.', 'duration': 29.227, 'highlights': ['Professor Mukesh Rao, academic director at Great Learning, with over 20 years of industry experience in market research, project management, and data science, explains linear regression comprehensively.', 'The chapter includes a case study to understand the need for regression analysis, as well as introductions to simple and multiple linear regression.', 'A demo in both Python and R languages will be provided as part of the chapter.']}, {'end': 588.315, 'start': 74.027, 'title': 'Linear regression for predicting gre scores', 'summary': 'The chapter explains the process of using linear regression to predict gre scores based on cgpa, with a focus on understanding the relationship between the two variables, building a best fit line based on residual sum of squares, and using the linear regression equation to predict the gre score for a given cgpa.', 'duration': 514.288, 'highlights': ['Use of Linear Regression to Predict GRE Scores: The chapter introduces the concept of using linear regression to predict GRE scores based on CGPA, demonstrating the process of understanding the relationship between the two variables and creating a predictive model.', 'Building a Best Fit Line Based on Residual Sum of Squares: It explains the process of determining the best fit line by calculating the residual sum of 
squares for different lines and selecting the line with the lowest value, providing a quantitative measure for identifying the best fit line.', 'Predicting GRE Scores Using Linear Regression Equation: The chapter illustrates the use of the linear regression equation, Y = MX + C, to predict the GRE score for a given CGPA, providing a demonstration of how the linear regression algorithm works in practice.']}], 'duration': 543.935, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM44380.jpg', 'highlights': ['Professor Mukesh Rao, academic director at Great Learning, with over 20 years of industry experience in market research, project management, and data science, explains linear regression comprehensively.', 'The chapter includes a case study to understand the need for regression analysis, as well as introductions to simple and multiple linear regression.', 'A demo in both Python and R languages will be provided as part of the chapter.', 'Use of Linear Regression to Predict GRE Scores: The chapter introduces the concept of using linear regression to predict GRE scores based on CGPA, demonstrating the process of understanding the relationship between the two variables and creating a predictive model.', 'Building a Best Fit Line Based on Residual Sum of Squares: It explains the process of determining the best fit line by calculating the residual sum of squares for different lines and selecting the line with the lowest value, providing a quantitative measure for identifying the best fit line.', 'Predicting GRE Scores Using Linear Regression Equation: The chapter illustrates the use of the linear regression equation, Y = MX + C, to predict the GRE score for a given CGPA, providing a demonstration of how the linear regression algorithm works in practice.']}, {'end': 2281.174, 'segs': [{'end': 685.244, 'src': 'embed', 'start': 656.38, 'weight': 0, 'content': [{'end': 659.442, 'text': 'And this would be 
the equation of the hyperplane.', 'start': 656.38, 'duration': 3.062}, {'end': 663.745, 'text': "So to implement the linear regression algorithm, we'll be working with this diamonds dataset.", 'start': 659.982, 'duration': 3.763}, {'end': 666.408, 'text': "And let's have a glance at the description of this dataset.", 'start': 664.086, 'duration': 2.322}, {'end': 670.972, 'text': 'So this dataset comprises 10 columns and there are 53,940 rows.', 'start': 667.148, 'duration': 3.824}, {'end': 676.637, 'text': 'Or in other words, there are 53,940 diamonds.', 'start': 671.192, 'duration': 5.445}, {'end': 681.861, 'text': "So we've got the price column, which basically denotes the price of the diamond in US dollars.", 'start': 677.357, 'duration': 4.504}, {'end': 685.244, 'text': "Then we've got carat, which denotes the weight of the diamond.", 'start': 682.002, 'duration': 3.242}], 'summary': 'Linear regression algorithm will be applied to a dataset with 53,940 rows and 10 columns, representing diamonds with price and carat data.', 'duration': 28.864, 'max_score': 656.38, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM656380.jpg'}, {'end': 905.774, 'src': 'embed', 'start': 878.382, 'weight': 1, 'content': [{'end': 885.049, 'text': 'So, as you see over here, if you check this length column which is x, you see that the length is greater than 4 millimeters.', 'start': 878.382, 'duration': 6.667}, {'end': 888.252, 'text': 'So, these were some basic data manipulation operations.', 'start': 885.61, 'duration': 2.642}, {'end': 894.119, 'text': "Now, we'll go ahead and perform some data visualization operations using the ggplot2 package.", 'start': 889.093, 'duration': 5.026}, {'end': 896.601, 'text': 'So, I load up the ggplot2 package.', 'start': 895, 'duration': 1.601}, {'end': 899.745, 'text': "I'll type in library(ggplot2) over here.", 'start': 896.701, 'duration': 3.044}, {'end': 905.774, 'text': "Now this 
dataset basically which I'm working with, the diamonds dataset, is also part of the ggplot2 package.", 'start': 900.891, 'duration': 4.883}], 'summary': 'Data analysis includes length check (>4mm) and ggplot2 visualization.', 'duration': 27.392, 'max_score': 878.382, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM878382.jpg'}, {'end': 1101.577, 'src': 'embed', 'start': 1076.258, 'weight': 2, 'content': [{'end': 1081.242, 'text': 'Now you guys must be wondering why you have to divide a data set into train and test sets?', 'start': 1076.258, 'duration': 4.984}, {'end': 1083.604, 'text': "So let's take this example to understand it better.", 'start': 1081.602, 'duration': 2.002}, {'end': 1088.187, 'text': "So let's say you have a math test tomorrow and you haven't learned anything at all.", 'start': 1084.104, 'duration': 4.083}, {'end': 1092.57, 'text': "But your friend luckily steals the question paper from the principal's room.", 'start': 1088.607, 'duration': 3.963}, {'end': 1099.976, 'text': 'And when you see the question paper, you see that there are five questions and you perfectly learn all of those five questions.', 'start': 1093.251, 'duration': 6.725}, {'end': 1101.577, 'text': 'Now there are two situations.', 'start': 1100.396, 'duration': 1.181}], 'summary': 'Data set division explained using a relatable math test scenario.', 'duration': 25.319, 'max_score': 1076.258, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM1076258.jpg'}, {'end': 1243.197, 'src': 'embed', 'start': 1219.457, 'weight': 3, 'content': [{'end': 1227.123, 'text': 'So we see that there are 42,126 entries in the train set and there are 11,814 entries in the test set.', 'start': 1219.457, 'duration': 7.666}, {'end': 1231.467, 'text': "So now it's time to go ahead and build the model on top of the train set.", 'start': 1227.844, 'duration': 3.623}, {'end': 1235.47, 'text': 
"And to build the linear regression model, we've got this lm function.", 'start': 1231.967, 'duration': 3.503}, {'end': 1243.197, 'text': "So I'll just type in lm and the first parameter is the formula where we give the dependent variable and the independent variable.", 'start': 1236.011, 'duration': 7.186}], 'summary': 'Train set: 42,126 entries, test set: 11,814 entries. building linear regression model using lm function.', 'duration': 23.74, 'max_score': 1219.457, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM1219457.jpg'}, {'end': 1791.018, 'src': 'embed', 'start': 1764.101, 'weight': 4, 'content': [{'end': 1770.208, 'text': 'So for the first model, the root mean square error is 1221 and for the second model the root mean square error is 1574.', 'start': 1764.101, 'duration': 6.107}, {'end': 1771.009, 'text': 'So, for the second.', 'start': 1770.208, 'duration': 0.801}, {'end': 1776.933, 'text': 'So for the second model, the root mean square error is higher.', 'start': 1773.852, 'duration': 3.081}, {'end': 1781.995, 'text': 'This means that the first model is considerably better than the second model.', 'start': 1777.373, 'duration': 4.622}, {'end': 1786.816, 'text': 'So this is how we can build multiple models and compare their accuracy with each other.', 'start': 1782.415, 'duration': 4.401}, {'end': 1791.018, 'text': 'So we have successfully implemented the linear regression algorithm in the R language.', 'start': 1787.297, 'duration': 3.721}], 'summary': 'First model has a lower rmse of 1221, indicating it is considerably better than the second model with rmse of 1574. 
successfully implemented linear regression algorithm.', 'duration': 26.917, 'max_score': 1764.101, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM1764101.jpg'}, {'end': 2231.808, 'src': 'heatmap', 'start': 2087.206, 'weight': 0.728, 'content': [{'end': 2089.388, 'text': 'so the actual value was 5.7.', 'start': 2087.206, 'duration': 2.182}, {'end': 2092.27, 'text': 'the value predicted by this model was 5.9.', 'start': 2089.388, 'duration': 2.882}, {'end': 2093.15, 'text': 'the actual value was 7.4.', 'start': 2092.27, 'duration': 0.88}, {'end': 2095.371, 'text': 'predicted was 6.1.', 'start': 2093.15, 'duration': 2.221}, {'end': 2097.553, 'text': "then we've got 5.1, which was predicted as 4.9.", 'start': 2095.371, 'duration': 2.182}, {'end': 2100.595, 'text': 'right, and if we take this value 7, it was predicted as 6.15.', 'start': 2097.553, 'duration': 3.042}, {'end': 2104.737, 'text': "so we've got the actual values and the predicted values.", 'start': 2100.595, 'duration': 4.142}, {'end': 2107.379, 'text': "now it's time to find out the error in prediction.", 'start': 2104.737, 'duration': 2.642}, {'end': 2111.523, 'text': 'So I load the mean squared error from the metrics module of sklearn.', 'start': 2107.839, 'duration': 3.684}, {'end': 2120.414, 'text': "So, I'll type in from sklearn.metrics, I would basically be importing mean_squared_error.", 'start': 2111.864, 'duration': 8.55}, {'end': 2123.745, 'text': 'so now that i import it,', 'start': 2121.784, 'duration': 1.961}, {'end': 2136.851, 'text': "i will go ahead and find out the mean squared error of this model and inside this i'll pass in the actual values which are present in y_test and then the second parameter would be the predicted values which are present in y_pred.", 'start': 2123.745, 'duration': 13.106}, {'end': 2141.073, 'text': 'so the mean squared error comes out to be 0.19, which is actually quite low.', 'start': 2136.851, 'duration': 
4.222}, {'end': 2143.734, 'text': 'so it seems that this model which you built is quite good.', 'start': 2141.073, 'duration': 2.661}, {'end': 2152.596, 'text': "So linear regression, it's one of the most popular algorithms in production.", 'start': 2148.154, 'duration': 4.442}, {'end': 2159.46, 'text': 'because it can be used for classification, it can be used for regression.', 'start': 2152.596, 'duration': 6.864}, {'end': 2166.664, 'text': 'it also happens to be one of the core algorithms that is used by many other algorithms.', 'start': 2159.46, 'duration': 7.204}, {'end': 2170.786, 'text': 'Logistic regression is a linear regression based model, right.', 'start': 2166.824, 'duration': 3.962}, {'end': 2176.009, 'text': "If you take support vector machine, another very famous algorithm in data science, it's a linear model.", 'start': 2171.406, 'duration': 4.603}, {'end': 2185.952, 'text': 'Or, if you build linear models using binned data, what you call binned, b-i-n,', 'start': 2177.745, 'duration': 8.207}, {'end': 2192.918, 'text': 'binned data sets, then the model that you get behaves like a decision tree, right.', 'start': 2185.952, 'duration': 6.966}, {'end': 2200.184, 'text': 'So many of the other algorithms which are very popular in the industry, internally they are linked to linear models.', 'start': 2193.379, 'duration': 6.805}, {'end': 2205.269, 'text': 'You can spend your entire lifetime studying linear models.', 'start': 2202.346, 'duration': 2.923}, {'end': 2213.155, 'text': 'It has so much depth and breadth that it takes a lot of time to actually get used to all these things.', 'start': 2206.67, 'duration': 6.485}, {'end': 2222.321, 'text': 'So today I am going to start by introducing you to simple linear models, the core concepts,', 'start': 2215.136, 'duration': 7.185}, {'end': 2231.808, 'text': 'and then subsequently I will leave some advanced discussions for the next visit, where we will be covering FMT feature engineering model 
tuning.', 'start': 2222.321, 'duration': 9.487}], 'summary': 'Mean squared error of the model is 0.19, indicating good performance in linear regression.', 'duration': 144.602, 'max_score': 2087.206, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM2087206.jpg'}, {'end': 2143.734, 'src': 'embed', 'start': 2123.745, 'weight': 5, 'content': [{'end': 2136.851, 'text': "i will go ahead and find out the mean squared error of this model and inside this i'll pass in the actual values which are present in y_test and then the second parameter would be the predicted values which are present in y_pred.", 'start': 2123.745, 'duration': 13.106}, {'end': 2141.073, 'text': 'so the mean squared error comes out to be 0.19, which is actually quite low.', 'start': 2136.851, 'duration': 4.222}, {'end': 2143.734, 'text': 'so it seems that this model which you built is quite good.', 'start': 2141.073, 'duration': 2.661}], 'summary': 'Mean squared error of model is 0.19, indicating good performance.', 'duration': 19.989, 'max_score': 2123.745, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM2123745.jpg'}], 'start': 589.85, 'title': 'Linear regression and model comparison', 'summary': "Covers multiple linear regression with 'mtcars' and 'diamonds' datasets, data splitting with a 75% - 25% split, model comparison for predicting diamond prices, and implementing linear regression in python using the iris dataset, achieving a mean squared error of 0.19.", 'chapters': [{'end': 1075.318, 'start': 589.85, 'title': 'Multiple linear regression', 'summary': "The chapter discusses the concept of multiple linear regression, demonstrating its application using the 'mtcars' dataset with 5 independent variables and the 'diamonds' dataset with 10 columns and 53,940 rows, followed by data manipulation and visualization using the ggplot2 package, and the process of splitting the dataset into training and 
testing sets.", 'duration': 485.468, 'highlights': ["The chapter discusses the concept of multiple linear regression using the 'mtcars' dataset with 5 independent variables and the 'diamonds' dataset with 10 columns and 53,940 rows. Demonstrates the transition from simple linear regression to multiple linear regression, showcasing the 'mtcars' dataset with 5 independent variables and the 'diamonds' dataset with 10 columns and 53,940 rows.", "Data manipulation and data visualization operations using the ggplot2 package are performed, including extracting a subset of the 'diamonds' dataset based on specific conditions. Illustrates the use of the ggplot2 package for data manipulation and visualization, showcasing the extraction of a subset of the 'diamonds' dataset based on specific conditions.", 'The process of splitting the dataset into training and testing sets is explained using the caTools package and the sample.split function. Explains the process of splitting the dataset into training and testing sets using the caTools package and the sample.split function.']}, {'end': 1463.975, 'start': 1076.258, 'title': 'Splitting data for model training', 'summary': "Explains the importance of dividing a data set into training and testing sets, and it details the process of splitting the data, building a linear regression model, and evaluating the model's predictions, with a 75% - 25% split resulting in 42,126 entries in the train set and 11,814 entries in the test set.", 'duration': 387.717, 'highlights': ['The process of splitting the data into training and testing sets is crucial to ensure model accuracy and avoid potential failure on new data, with a 75% - 25% split resulting in 42,126 entries in the train set and 11,814 entries in the test set.', 'The lm function is used to build a linear regression model to understand the relationship between the dependent variable (price) and the independent variable (carat), resulting in a model with a coefficient of 7894 
for carat and the corresponding intercept value.', "The prediction of values on the test set is performed using the predict function, and the actual and predicted values are combined using the cbind function to evaluate the model's accuracy, revealing significant errors in the predictions."]}, {'end': 1786.816, 'start': 1464.296, 'title': 'Comparing model accuracy', 'summary': "Discusses the process of building and comparing two models for predicting diamond prices, revealing the root mean square errors of 1221 and 1574, indicating the first model's superior accuracy.", 'duration': 322.52, 'highlights': ['The root mean square error (RMSE) for the first model is 1221, suggesting its inadequacy in delivering precise results, leading to its storage as RMSE1.', 'A multiple linear regression model is built, utilizing the length, width, and depth of diamonds as independent variables, resulting in a higher RMSE of 1574, indicating its inferior accuracy compared to the first model.', 'Comparing the RMSE values of the two models reveals that the first model is considerably better with a lower RMSE of 1221, presenting a clear contrast in accuracy between the two models.']}, {'end': 2281.174, 'start': 1787.297, 'title': 'Implementing linear regression in python with iris dataset', 'summary': "Covers implementing linear regression in python using the iris dataset with a focus on loading data, splitting into training and testing sets, building the model, making predictions, and evaluating the model's performance, achieving a mean squared error of 0.19. 
Additionally, it discusses the popularity and versatility of linear models in data science.", 'duration': 493.877, 'highlights': ["Mean Squared Error Evaluation The mean squared error of the linear regression model is found to be 0.19, indicating the model's accuracy in predicting sepal length.", 'Splitting into Training and Testing Sets The dataset is divided into a training set (70% of records) and a testing set (30% of records) using the train_test_split function from the sklearn package.', 'Loading Iris Dataset and Data Exploration The pandas library is used to load the iris dataset from a CSV file and explore the first five rows of the dataset, which contains information about different species of the iris flower and their features.']}], 'duration': 1691.324, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM589850.jpg', 'highlights': ["Implementing multiple linear regression with 'mtcars' and 'diamonds' datasets.", 'Performing data manipulation and visualization using the ggplot2 package.', 'Explaining the process of splitting datasets into training and testing sets.', 'Utilizing the lm function to build a linear regression model and predict values on the test set.', "Evaluating the models' accuracy using root mean square error (RMSE).", 'Achieving a mean squared error of 0.19 in predicting sepal length with the iris dataset.']}, {'end': 3590.125, 'segs': [{'end': 2487.366, 'src': 'embed', 'start': 2461.417, 'weight': 0, 'content': [{'end': 2467.442, 'text': 'So when you build k nearest neighbors, in k nearest neighbors you have different classes of data points.', 'start': 2461.417, 'duration': 6.025}, {'end': 2476.129, 'text': 'Suppose these are different classes and let me make it slightly more complex.', 'start': 2468.202, 'duration': 7.927}, {'end': 2483.895, 'text': 'What k nearest neighbor does is, we call it breaking a mathematical space into regions.', 'start': 2477.009, 'duration': 6.886}, 
{'end': 2487.366, 'text': 'It breaks a mathematical space into regions.', 'start': 2485.205, 'duration': 2.161}], 'summary': 'K-nearest neighbors breaks mathematical space into regions.', 'duration': 25.949, 'max_score': 2461.417, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM2461417.jpg'}, {'end': 2830.739, 'src': 'embed', 'start': 2785.725, 'weight': 5, 'content': [{'end': 2789.927, 'text': 'Where in your mathematical space on the y dimension?', 'start': 2785.725, 'duration': 4.202}, {'end': 2795.989, 'text': 'where on the y dimension does the line intercept or cut the y axis?', 'start': 2789.927, 'duration': 6.062}, {'end': 2802.592, 'text': 'So this is the generic equation of a line.', 'start': 2799.23, 'duration': 3.362}, {'end': 2806.65, 'text': 'y equal to mx plus c.', 'start': 2804.409, 'duration': 2.241}, {'end': 2810.111, 'text': 'It tells you how y and x are related to each other.', 'start': 2806.65, 'duration': 3.461}, {'end': 2822.516, 'text': 'If you look at this line l1, if you look at l1 then this becomes 0.', 'start': 2813.813, 'duration': 8.703}, {'end': 2823.496, 'text': 'So y equal to mx.', 'start': 2822.516, 'duration': 0.98}, {'end': 2830.739, 'text': 'If you look at l2, l2 this becomes not 0 but 2.', 'start': 2824.277, 'duration': 6.462}], 'summary': 'The line intersects the y-axis at 2 on the y dimension.', 'duration': 45.014, 'max_score': 2785.725, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM2785725.jpg'}, {'end': 2926.212, 'src': 'embed', 'start': 2877.882, 'weight': 4, 'content': [{'end': 2885.266, 'text': 'From this data that you have given to the algorithm, the algorithm has found out for you the M and the C.', 'start': 2877.882, 'duration': 7.384}, {'end': 2886.687, 'text': 'We call these coefficients.', 'start': 2885.266, 'duration': 1.421}, {'end': 2891.029, 'text': 'Coefficients of the model.', 'start': 2889.608, 
'duration': 1.421}, {'end': 2899.255, 'text': 'This M and the C They reflect the relationship between Y and X in your data set.', 'start': 2893.23, 'duration': 6.025}, {'end': 2904.697, 'text': 'So the M and the C is what forms the model for you.', 'start': 2901.496, 'duration': 3.201}, {'end': 2916.722, 'text': 'Shall I complicate it a bit more? We will complicate it a bit more.', 'start': 2911.68, 'duration': 5.042}, {'end': 2926.212, 'text': 'Suppose instead of having 1.', 'start': 2922.704, 'duration': 3.508}], 'summary': 'Algorithm found the coefficients m and c to reflect the relationship between y and x in the data set.', 'duration': 48.33, 'max_score': 2877.882, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM2877882.jpg'}, {'end': 3040.513, 'src': 'embed', 'start': 3005.903, 'weight': 2, 'content': [{'end': 3017.911, 'text': 'suppose, instead of only one variable x, now you have x1 and you have another variable x2, and you have only one dependent variable y.', 'start': 3005.903, 'duration': 12.008}, {'end': 3030.724, 'text': 'In such case, the algorithm will find out the relationship between x1 and y, x2 and y.', 'start': 3020.856, 'duration': 9.868}, {'end': 3040.513, 'text': 'So it will express that relationship as y, equal to m1 x1 plus m2 x2, and there will be only one constant term.', 'start': 3030.724, 'duration': 9.789}], 'summary': 'Algorithm finds relationship between x1, x2 and y using y = m1x1 + m2x2', 'duration': 34.61, 'max_score': 3005.903, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM3005903.jpg'}, {'end': 3137.161, 'src': 'embed', 'start': 3104.531, 'weight': 1, 'content': [{'end': 3107.693, 'text': 'Yeah, so human minds can imagine only three dimensions.', 'start': 3104.531, 'duration': 3.162}, {'end': 3112.215, 'text': 'But suppose they go beyond three dimensions, there also the linear models work.', 'start': 3108.713, 
'duration': 3.502}, {'end': 3115.396, 'text': 'Those planes are called hyperplanes.', 'start': 3113.635, 'duration': 1.761}, {'end': 3121.139, 'text': "Hyperplanes, how do they look? We can't imagine, so we don't know that, but they will also be one single plane.", 'start': 3115.797, 'duration': 5.342}, {'end': 3125.541, 'text': 'And there will be no ups and downs, no curves, nothing.', 'start': 3122.96, 'duration': 2.581}, {'end': 3126.602, 'text': 'It will be straight plane.', 'start': 3125.581, 'duration': 1.021}, {'end': 3130.784, 'text': 'What do you mean by straight plane in four dimensions? No idea.', 'start': 3127.122, 'duration': 3.662}, {'end': 3137.161, 'text': 'it will have the same properties as the properties of a plane in three dimensions.', 'start': 3132.679, 'duration': 4.482}], 'summary': 'Human minds can imagine only three dimensions, but linear models work in higher dimensions, such as hyperplanes.', 'duration': 32.63, 'max_score': 3104.531, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM3104531.jpg'}, {'end': 3298.181, 'src': 'embed', 'start': 3264.005, 'weight': 6, 'content': [{'end': 3265.786, 'text': 'The number of dimensions of a model.', 'start': 3264.005, 'duration': 1.781}, {'end': 3270.068, 'text': 'This is the model right now, the surface.', 'start': 3267.427, 'duration': 2.641}, {'end': 3271.869, 'text': 'The number of dimensions.', 'start': 3270.809, 'duration': 1.06}, {'end': 3273.17, 'text': 'how many dimensions is this model??', 'start': 3271.869, 'duration': 1.301}, {'end': 3278.045, 'text': 'dimensions length into breadth.', 'start': 3275.643, 'duration': 2.402}, {'end': 3279.927, 'text': 'the model is only two dimensions.', 'start': 3278.045, 'duration': 1.882}, {'end': 3282.689, 'text': 'the feature space is three dimensions,', 'start': 3279.927, 'duration': 2.762}, {'end': 3290.935, 'text': 'so the number of dimensions in a model will always be one less than the number 
of dimensions in the feature space.', 'start': 3282.689, 'duration': 8.246}, {'end': 3298.181, 'text': 'right, all right, you can call it anything you like.', 'start': 3290.935, 'duration': 7.246}], 'summary': "A model's dimensions are one less than the feature space's dimensions.", 'duration': 34.176, 'max_score': 3264.005, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM3264005.jpg'}, {'end': 3475.073, 'src': 'embed', 'start': 3445.853, 'weight': 3, 'content': [{'end': 3451.377, 'text': 'this collinearity can lead you to a problem when you productionize your models.', 'start': 3445.853, 'duration': 5.524}, {'end': 3461.464, 'text': 'The models may be less effective, less what you call predictive power than required to be because of the collinearity.', 'start': 3452.418, 'duration': 9.046}, {'end': 3469.571, 'text': 'So one of the things that we do when building models is we see what the collinearity is between the dimensions.', 'start': 3463.328, 'duration': 6.243}, {'end': 3475.073, 'text': 'If the collinearity is very strong they are very strongly related to each other,', 'start': 3470.291, 'duration': 4.782}], 'summary': 'Collinearity can reduce model effectiveness and predictive power, impacting production models.', 'duration': 29.22, 'max_score': 3445.853, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM3445853.jpg'}], 'start': 2282.674, 'title': 'Linear regression and models', 'summary': 'Covers the concept of linear regression, including basics, models in multi-dimensions, and collinearity in model building, emphasizing the role of k-nearest neighbors and the impact of collinearity on model effectiveness.', 'chapters': [{'end': 2514.556, 'start': 2282.674, 'title': 'Linear regression and models in data science', 'summary': 'Covers the concept of linear regression, where a line represents the relationship between independent and dependent 
variables, and explores how models in data science are represented as lines, surfaces, and hypersurfaces in feature space, along with the role of k-nearest neighbors in breaking mathematical space into regions.', 'duration': 231.882, 'highlights': ['Linear regression is based on representing the relationship between independent variable x and dependent variable y as a straight line, illustrating how they are related to each other. Concept of linear regression, representation of x and y as a straight line.', 'Models in data science are represented as lines, surfaces, and hypersurfaces in feature space, which explore the interaction between independent and dependent attributes. Representation of models, exploration of interaction between attributes.', 'K-nearest neighbors breaks a mathematical space into regions, called Voronoi regions, to classify different classes of data points. Role of K-nearest neighbors, classification of data points into regions.']}, {'end': 3005.903, 'start': 2517.27, 'title': 'Linear regression basics', 'summary': 'Covers the basics of linear regression, including the mathematical expression y=mx+c, the meaning of the slope and intercept, and the relationship between coefficients m and c and the data set.', 'duration': 488.633, 'highlights': ['The mathematical expression y=mx+c represents the relationship between y and x, with m representing the slope and c representing the intercept. The coefficients M and C reflect the relationship between Y and X in the data set. The generic equation of a line is y=mx+c, where y is the target column and x is the independent column. The algorithm finds the coefficients M and C from the given data set, which reflect the relationship between Y and X.', 'The value of the slope M and the intercept C form the model for predicting the value of y given the value of x in the data set. 
The slope M and intercept C determine the model for predicting y given x in the data set, with M representing the angle the line makes with the x-axis and C representing the value of y when x is 0. These coefficients form the model for prediction.', 'The angle of the line, represented by the slope M, determines the value of M and the equation of the line, with a slope of 45 degrees resulting in M=1. The angle of the line, represented by the slope M, determines the value of M and the equation of the line. A slope of 45 degrees results in M=1, indicating that whenever x changes by 1, y changes by 1, reflecting a 1:1 ratio.']}, {'end': 3290.935, 'start': 3005.903, 'title': 'Linear models in multi-dimensions', 'summary': 'Introduces linear models in multi-dimensions, explaining the relationship between variables and the concept of hyperplanes, highlighting the mathematical properties and dimensions of the models.', 'duration': 285.032, 'highlights': ['The number of dimensions in a model is always one less than the number of dimensions in the feature space, reflecting the relationship between the model and the feature space.', 'The algorithm finds the relationship between multiple independent variables and a dependent variable, expressing it as y = m1x1 + m2x2, with the intercept determining the value of y when x1 and x2 are 0.', 'In higher dimensions, linear models work as hyperplanes, which are single planes with no curves or ups and downs, maintaining the same mathematical properties as a plane in three dimensions.', 'The concept of hyperplanes is introduced, explaining that even in dimensions beyond three, linear models work with hyperplanes having the same properties as a plane in three dimensions, with the angles being orthogonal to each other.']}, {'end': 3590.125, 'start': 3290.935, 'title': 'Collinearity in model building', 'summary': 'Explores the issue of collinearity in model building and its impact on model effectiveness, emphasizing the need for 
dimensionality reduction in cases of strong collinearity.', 'duration': 299.19, 'highlights': ['Collinearity between independent variables can lead to less effective models with lower predictive power, necessitating the use of dimensionality reduction techniques. Impact of collinearity on model effectiveness, need for dimensionality reduction', 'R value measurement indicates strong correlation between related independent variables, which contradicts the assumption of independence required by the algorithm. Explanation of R value and its indication of strong correlation', 'Dimensionality reduction involves creating synthetic dimensions from related independent variables to use in model building. Explanation of dimensionality reduction and synthetic dimensions']}], 'duration': 1307.451, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM2282674.jpg', 'highlights': ['K-nearest neighbors breaks a mathematical space into regions, called Voronoi regions, to classify different classes of data points.', 'The concept of hyperplanes is introduced, explaining that even in dimensions beyond three, linear models work with hyperplanes having the same properties as a plane in three dimensions, with the angles being orthogonal to each other.', 'The algorithm finds the relationship between multiple independent variables and a dependent variable, expressing it as y = m1x1 + m2x2, with the intercept determining the value of y when x1 and x2 are 0.', 'Collinearity between independent variables can lead to less effective models with lower predictive power, necessitating the use of dimensionality reduction techniques.', 'The value of the slope M and the intercept C form the model for predicting the value of y given the value of x in the data set.', 'The mathematical expression y=mx+c represents the relationship between y and x, with m representing the slope and c representing the intercept.', 'The number of dimensions in a model is 
always one less than the number of dimensions in the feature space, reflecting the relationship between the model and the feature space.']}, {'end': 4929.355, 'segs': [{'end': 3640.961, 'src': 'embed', 'start': 3612.254, 'weight': 2, 'content': [{'end': 3617.477, 'text': "That depends on which dimension is more prone to errors when you calculate, when you're capturing the data.", 'start': 3612.254, 'duration': 5.223}, {'end': 3625.982, 'text': 'So instead of doing all that analysis, I might convert into a synthetic dimension using a technique called PCA, principal component analysis.', 'start': 3619.198, 'duration': 6.784}, {'end': 3630.485, 'text': 'So in the previous example, we took two variables.', 'start': 3626.002, 'duration': 4.483}, {'end': 3633.707, 'text': 'So in the previous example, we took two variables.', 'start': 3630.505, 'duration': 3.202}, {'end': 3636.729, 'text': 'So in the previous example, we took two variables.', 'start': 3633.727, 'duration': 3.002}, {'end': 3640.961, 'text': 'No, no, that is what I am saying.', 'start': 3639.92, 'duration': 1.041}], 'summary': 'Using pca can reduce errors in data analysis and capture data in synthetic dimensions.', 'duration': 28.707, 'max_score': 3612.254, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM3612254.jpg'}, {'end': 3959.959, 'src': 'embed', 'start': 3911.362, 'weight': 0, 'content': [{'end': 3914.304, 'text': 'Linear models are built only when you see linear relationships.', 'start': 3911.362, 'duration': 2.942}, {'end': 3919.608, 'text': 'If you see non-linear relationships, then you might want to resort to non-linear models.', 'start': 3915.405, 'duration': 4.203}, {'end': 3928.233, 'text': 'Linear models, they expect linear relationship between y and the independent variables.', 'start': 3921.789, 'duration': 6.444}, {'end': 3931.996, 'text': "Then your model's predictive power will be very high.", 'start': 3929.394, 'duration': 2.602}, 
{'end': 3936.684, 'text': 'However, as you will see now, those are all very perfect conditions.', 'start': 3933.322, 'duration': 3.362}, {'end': 3941.767, 'text': 'In the real world, we hardly have such perfect conditions, right.', 'start': 3937.825, 'duration': 3.942}, {'end': 3946.35, 'text': 'So, sometimes even when the relationship is not linear, we go up, build a linear model.', 'start': 3942.048, 'duration': 4.302}, {'end': 3948.572, 'text': 'We will see that down the line.', 'start': 3947.691, 'duration': 0.881}, {'end': 3955.336, 'text': 'Ok, How do you measure the strength of the relationship between independent variable and the target?', 'start': 3950.933, 'duration': 4.403}, {'end': 3959.959, 'text': 'For that we use a metric called coefficient of correlation.', 'start': 3955.716, 'duration': 4.243}], 'summary': 'Linear models expect linear relationship, predictive power high. non-linear models for non-linear relationships. measure with coefficient of correlation.', 'duration': 48.597, 'max_score': 3911.362, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM3911362.jpg'}, {'end': 4033.149, 'src': 'embed', 'start': 3996.925, 'weight': 1, 'content': [{'end': 4000.688, 'text': 'R value can be 0, can be 0.', 'start': 3996.925, 'duration': 3.763}, {'end': 4008.475, 'text': 'When you have spread like this, your R value will be very close to 0.', 'start': 4000.688, 'duration': 7.787}, {'end': 4015.801, 'text': 'When you have spread like this, your R value will be very close to minus 1.', 'start': 4008.475, 'duration': 7.326}, {'end': 4024.847, 'text': 'When you have spread in the other way, your R value will be very close to plus 1, okay.', 'start': 4015.801, 'duration': 9.046}, {'end': 4029.268, 'text': 'So R value is an indicator of how strong the relationships are.', 'start': 4025.387, 'duration': 3.881}, {'end': 4033.149, 'text': 'So let us understand this R value slightly more detail.', 'start': 
4029.908, 'duration': 3.241}], 'summary': 'R value indicates strength of relationships, can be 0, close to -1, or close to +1.', 'duration': 36.224, 'max_score': 3996.925, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM3996925.jpg'}, {'end': 4300.46, 'src': 'embed', 'start': 4271.69, 'weight': 3, 'content': [{'end': 4278.036, 'text': 'if the variance is too large, the central value is not reliable, right.', 'start': 4271.69, 'duration': 6.346}, {'end': 4283.682, 'text': 'So, the variance gives you the reliability of the central values, how reliable the central values are.', 'start': 4278.777, 'duration': 4.905}, {'end': 4289.248, 'text': 'So, formula for variance is this, right.', 'start': 4286.265, 'duration': 2.983}, {'end': 4293.392, 'text': 'Just look at the numerator, just look at the numerator.', 'start': 4289.988, 'duration': 3.404}, {'end': 4297.959, 'text': 'This is the case when you have only one variable x.', 'start': 4294.537, 'duration': 3.422}, {'end': 4300.46, 'text': 'Now I have two variables x and y.', 'start': 4297.959, 'duration': 2.501}], 'summary': 'Variance measures reliability of central values for one or two variables.', 'duration': 28.77, 'max_score': 4271.69, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM4271690.jpg'}, {'end': 4458.968, 'src': 'heatmap', 'start': 4318.257, 'weight': 0.781, 'content': [{'end': 4322.761, 'text': 'So we find out x, i, minus, x bar multiplied with y, i, minus, y bar.', 'start': 4318.257, 'duration': 4.504}, {'end': 4328.445, 'text': 'that tells me how the data points in the feature space they vary together.', 'start': 4322.761, 'duration': 5.684}, {'end': 4329.466, 'text': 'that is called covariance.', 'start': 4328.445, 'duration': 1.021}, {'end': 4334.73, 'text': 'So they should not vary together to have a high covariance.', 'start': 4331.508, 'duration': 3.222}, {'end': 4337.413, 'text': 
'Independent variables, independent variables.', 'start': 4335.631, 'duration': 1.782}, {'end': 4340.295, 'text': 'Independent variables should not have covariance.', 'start': 4338.413, 'duration': 1.882}, {'end': 4344.699, 'text': 'Target variable and independent variable should have strong covariance.', 'start': 4341.416, 'duration': 3.283}, {'end': 4348.681, 'text': "otherwise you don't use those independent variables.", 'start': 4346.98, 'duration': 1.701}, {'end': 4359.306, 'text': 'When you are building linear models, the target variable, the Y, and the independent variables, IVs, should have very strong covariance,', 'start': 4349.521, 'duration': 9.785}, {'end': 4363.908, 'text': 'but within the independent variables the covariance should be 0.', 'start': 4359.306, 'duration': 4.602}, {'end': 4366.549, 'text': 'that is an ideal situation.', 'start': 4363.908, 'duration': 2.641}, {'end': 4370.291, 'text': 'practically it never happens.', 'start': 4366.549, 'duration': 3.742}, {'end': 4373.773, 'text': 'alright now, given this covariance.', 'start': 4370.291, 'duration': 3.482}, {'end': 4376.494, 'text': 'now look at the formula for R.', 'start': 4373.773, 'duration': 2.721}, {'end': 4382.536, 'text': 'okay, on the top in the numerator, you are seeing covariance right.', 'start': 4376.494, 'duration': 6.042}, {'end': 4388.419, 'text': 'And if you look at it carefully the units on the top and the units on the denominator are the same.', 'start': 4383.177, 'duration': 5.242}, {'end': 4394.562, 'text': 'This is also x i minus x bar, this is also x i minus x, this is y i minus y bar, this is also y i minus y bar.', 'start': 4389.66, 'duration': 4.902}, {'end': 4397.063, 'text': 'It is squared term but there is an under root.', 'start': 4395.242, 'duration': 1.821}, {'end': 4400.284, 'text': 'This is also squared term but it is an under root.', 'start': 4398.343, 'duration': 1.941}, {'end': 4405.947, 'text': 'So both numerator and denominator are going to have 
same units of measurement agreed?', 'start': 4401.285, 'duration': 4.662}, {'end': 4412.633, 'text': 'So if I have a kilogram by kilogram case, then the units cancel each other out.', 'start': 4407.608, 'duration': 5.025}, {'end': 4418.578, 'text': 'It becomes a unit less quantity, it is a ratio, R value is a ratio.', 'start': 4414.895, 'duration': 3.683}, {'end': 4429.488, 'text': 'Ok Now when you have a diagram, when you have a plot like this between two variables.', 'start': 4422.562, 'duration': 6.926}, {'end': 4434.03, 'text': 'could be independent variables, target and independent, could be anything.', 'start': 4431.308, 'duration': 2.722}, {'end': 4438.154, 'text': 'When you have this kind of plot, you will see that r value is close to 0.', 'start': 4434.831, 'duration': 3.323}, {'end': 4442.237, 'text': 'We will see why this happens.', 'start': 4438.154, 'duration': 4.083}, {'end': 4450.004, 'text': 'When you have this kind of a plot where you see a trend, but as x increases, y decreases, they are going in different directions.', 'start': 4443.779, 'duration': 6.225}, {'end': 4453.327, 'text': 'x increases, y decreases, they are opposing each other.', 'start': 4451.005, 'duration': 2.322}, {'end': 4458.968, 'text': 'Then we will have r value close to minus 1.', 'start': 4454.147, 'duration': 4.821}], 'summary': 'Covariance measures data points variation, essential for linear models. 
r value indicates the strength of covariance between variables.', 'duration': 140.711, 'max_score': 4318.257, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM4318257.jpg'}, {'end': 4359.306, 'src': 'embed', 'start': 4331.508, 'weight': 4, 'content': [{'end': 4334.73, 'text': 'So they should not vary together to have a high covariance.', 'start': 4331.508, 'duration': 3.222}, {'end': 4337.413, 'text': 'Independent variables, independent variables.', 'start': 4335.631, 'duration': 1.782}, {'end': 4340.295, 'text': 'Independent variables should not have covariance.', 'start': 4338.413, 'duration': 1.882}, {'end': 4344.699, 'text': 'Target variable and independent variable should have strong covariance.', 'start': 4341.416, 'duration': 3.283}, {'end': 4348.681, 'text': "otherwise you don't use those independent variables.", 'start': 4346.98, 'duration': 1.701}, {'end': 4359.306, 'text': 'When you are building linear models, the target variable, the Y, and the independent variables, IVs, should have very strong covariance,', 'start': 4349.521, 'duration': 9.785}], 'summary': 'Covariance between target variable and ivs is crucial for linear models.', 'duration': 27.798, 'max_score': 4331.508, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM4331508.jpg'}, {'end': 4526.08, 'src': 'embed', 'start': 4494.928, 'weight': 5, 'content': [{'end': 4503.674, 'text': 'Keep in mind, all linear models that we build, linear regression models, they will always go through the point where x bar and the y bar meet.', 'start': 4494.928, 'duration': 8.746}, {'end': 4515.042, 'text': 'So the linear model that we saw a few minutes ago, this model will always go through wherever the x bar and the y bar meet.', 'start': 4506.676, 'duration': 8.366}, {'end': 4520.417, 'text': 'it will be hinged on to that point.', 'start': 4518.416, 'duration': 2.001}, {'end': 4526.08, 'text': 'That 
point will act as a fulcrum, alright.', 'start': 4522.498, 'duration': 3.582}], 'summary': 'Linear regression models go through x bar and y bar, acting as a fulcrum.', 'duration': 31.152, 'max_score': 4494.928, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM4494928.jpg'}, {'end': 4676.075, 'src': 'embed', 'start': 4644.808, 'weight': 6, 'content': [{'end': 4646.829, 'text': 'All these data points will have negative areas.', 'start': 4644.808, 'duration': 2.021}, {'end': 4650.351, 'text': 'All these data points will have negative areas.', 'start': 4648.37, 'duration': 1.981}, {'end': 4656.556, 'text': 'When you total this up, sigma, you see sigma over here? Sigma means sum up all the areas.', 'start': 4651.932, 'duration': 4.624}, {'end': 4664.049, 'text': 'When you sum up all the areas here, all the positives and the negatives will cancel each other out.', 'start': 4657.605, 'duration': 6.444}, {'end': 4672.113, 'text': 'The reason why they will cancel each other out is, each quadrant has same density of these points.', 'start': 4666.91, 'duration': 5.203}, {'end': 4676.075, 'text': 'The points are distributed like a symmetrical circle.', 'start': 4672.734, 'duration': 3.341}], 'summary': 'Data points will cancel out due to symmetrical distribution.', 'duration': 31.267, 'max_score': 4644.808, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM4644808.jpg'}, {'end': 4794.708, 'src': 'embed', 'start': 4732.561, 'weight': 9, 'content': [{'end': 4736.403, 'text': 'Right? All of you okay? 
So this is a concept of r-value.', 'start': 4732.561, 'duration': 3.842}, {'end': 4742.907, 'text': 'So what we do in linear models is, we test the r-values of every independent dimension with the target variable.', 'start': 4736.763, 'duration': 6.144}, {'end': 4751.135, 'text': 'we select only those whose r value is either close to minus 1 or plus 1.', 'start': 4743.769, 'duration': 7.366}, {'end': 4756.079, 'text': 'Any dimension with r value close to 0, that dimension is a useless dimension.', 'start': 4751.135, 'duration': 4.944}, {'end': 4760.883, 'text': 'How close to 0 is useless? That depends on your particular problem statement.', 'start': 4757.12, 'duration': 3.763}, {'end': 4771.191, 'text': "So generally any r value greater than 7, 0.7, minus or plus, it's considered as a good variable.", 'start': 4763.605, 'duration': 7.586}, {'end': 4778.278, 'text': 'Any variable with r value between 0.5 and 0.7,', 'start': 4772.395, 'duration': 5.883}, {'end': 4785.343, 'text': 'it requires further research whether this relationship is real or is it a fluke by chance that you are seeing in your data set.', 'start': 4778.278, 'duration': 7.065}, {'end': 4794.708, 'text': 'Any r value close to 0, not worth considering.', 'start': 4787.204, 'duration': 7.504}], 'summary': 'In linear models, r-value is used to select independent dimensions with r-value close to -1 or 1, while r-value close to 0 is considered useless. 
r-value greater than 0.7 is considered good, 0.5-0.7 requires further research, and close to 0 is not worth considering.', 'duration': 62.147, 'max_score': 4732.561, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM4732561.jpg'}, {'end': 4904.009, 'src': 'embed', 'start': 4873.263, 'weight': 11, 'content': [{'end': 4875.984, 'text': 'sampling brings in some asymmetry in the distribution.', 'start': 4873.263, 'duration': 2.721}, {'end': 4878.124, 'text': 'There will be some asymmetry in the distribution.', 'start': 4876.004, 'duration': 2.12}, {'end': 4883.14, 'text': 'You will end up with an r value greater than 0.', 'start': 4878.624, 'duration': 4.516}, {'end': 4885.781, 'text': 'this r value we call it statistical fluke.', 'start': 4883.14, 'duration': 2.641}, {'end': 4896.045, 'text': 'So, when r value is very close to 0s, it could be by chance sampling problem, it may not be a real relationship.', 'start': 4888.022, 'duration': 8.023}, {'end': 4904.009, 'text': 'So, you have to go and further investigate whether the relationship is real, it might be real or is it fake.', 'start': 4897.226, 'duration': 6.783}], 'summary': 'Sampling introduces asymmetry, leading to an r value > 0, but close to 0 could indicate a chance sampling problem.', 'duration': 30.746, 'max_score': 4873.263, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM4873263.jpg'}], 'start': 3590.185, 'title': 'Correlation and linear modeling', 'summary': 'Covers linear modeling, principal component analysis, and correlation analysis, emphasizing the importance of strong relationships between independent variables and the target. 
It explains the calculation and significance of the coefficient of correlation, the importance of variance and covariance in measuring the reliability of central values and the influence of variables, and discusses correlation, linear regression, and r-value, including their calculations, interpretations, and significance.', 'chapters': [{'end': 4091.817, 'start': 3590.185, 'title': 'Linear modeling and correlation analysis', 'summary': 'Covers linear modeling, principal component analysis, and correlation analysis, emphasizing the importance of strong relationships between independent variables and the target, and explaining the calculation and significance of the coefficient of correlation.', 'duration': 501.632, 'highlights': ['Linear models are built only when a linear relationship exists between y and the independent variables, and non-linear relationships may require non-linear models. The chapter emphasizes the necessity of linear relationships for building linear models and mentions the possibility of resorting to non-linear models for non-linear relationships.', 'The coefficient of correlation (R) measures the strength of the relationship between independent variables and the target, ranging from -1 to +1, with values close to 0 indicating weak relationships and values close to -1 or +1 indicating strong relationships. Explanation of the coefficient of correlation (R) and its range from -1 to +1, with close to 0 values indicating weak relationships and close to -1 or +1 values indicating strong relationships.', 'Principal component analysis (PCA) can be used to convert variables into synthetic dimensions, allowing for the analysis of the most influential dimensions and the reduction of dimensionality. 
Introduction of principal component analysis (PCA) as a technique to convert variables into synthetic dimensions, enabling the analysis of influential dimensions and dimensionality reduction.']}, {'end': 4382.536, 'start': 4096.341, 'title': 'Understanding variance and covariance', 'summary': 'Explains the importance of variance and covariance in measuring the reliability of central values and the influence of variables, emphasizing the need for strong covariance between target and independent variables in linear models.', 'duration': 286.195, 'highlights': ['The variance measures the reliability of central values, with a large variance indicating unreliability.', 'Covariance signifies how variables influence each other, with strong covariance desired between the target and independent variables in linear models.', 'The formula for variance is discussed, highlighting its significance in determining the reliability of central values.']}, {'end': 4672.113, 'start': 4383.177, 'title': 'Correlation and linear regression', 'summary': 'Discusses the concept of correlation and linear regression, including the calculation of correlation coefficient, its interpretation, and the relationship between r value and the trend in a plot, while highlighting the point that all linear regression models go through the point of x bar and y bar, and the significance of the cancellation of positive and negative areas in the context of correlation.', 'duration': 288.936, 'highlights': ['All linear regression models go through the point of x bar and y bar, acting as a fulcrum for the model. 
This point serves as a pivotal point for all linear regression models, ensuring that they always intersect at the x bar and y bar.', 'Explanation of the significance of the cancellation of positive and negative areas in the context of correlation, where the sum of all areas in each quadrant will cancel each other out due to the same density of points. The cancellation of positive and negative areas in each quadrant due to the same density of points results in the sum of all areas canceling each other out, impacting the correlation analysis.', 'The interpretation of r value in correlation, with r close to 0 indicating a weak relationship, r close to -1 indicating opposing directional trends, and r close to 1 indicating synchronous directional trends. The r value in correlation is interpreted as close to 0 for a weak relationship, close to -1 for opposing directional trends, and close to 1 for synchronous directional trends, providing insights into the relationship between variables.', 'The calculation of correlation coefficient and its significance in representing the ratio of two variables, where the units cancel each other out to become a unitless quantity. The calculation of correlation coefficient represents the ratio of two variables, with the units canceling each other out, resulting in a unitless quantity that indicates the relationship between the variables.']}, {'end': 4929.355, 'start': 4672.734, 'title': 'Understanding correlation and r-value', 'summary': 'Explains the concept of r-value, highlighting the significance of r-value close to 0, 0.7, and the impact of sampling on r-value, emphasizing the need for further investigation when r-value is close to 0 or between 0.5 and 0.7.', 'duration': 256.621, 'highlights': ['The significance of r-value close to 0, 0.7, and the impact of sampling on r-value. 
The chapter discusses the importance of r-value close to 0, 0.7, and the potential impact of sampling on the r-value, emphasizing the need for further investigation when r-value is close to 0 or between 0.5 and 0.7.', 'The concept of r-value and its relevance to linear models. The chapter explains the concept of r-value and its significance in linear models, where only dimensions with r-value close to -1 or +1 are selected, while those close to 0 are considered useless.', 'The impact of sampling on r-value and the need for further investigation. The transcript highlights the impact of sampling on r-value, with asymmetric distribution due to sampling leading to r-value greater than 0 and the need for further investigation to differentiate between real and fake relationships.']}], 'duration': 1339.17, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM3590185.jpg', 'highlights': ['Linear models require linear relationships between y and independent variables, resorting to non-linear models for non-linear relationships.', 'Coefficient of correlation (R) measures strength of relationship, ranging from -1 to +1, with close to 0 indicating weak and close to -1 or +1 indicating strong relationships.', 'Principal component analysis (PCA) converts variables into synthetic dimensions, enabling analysis of influential dimensions and dimensionality reduction.', 'Variance measures reliability of central values, with large variance indicating unreliability.', 'Covariance signifies how variables influence each other, with strong covariance desired between target and independent variables in linear models.', 'All linear regression models intersect at the point of x bar and y bar, serving as a fulcrum for the model.', 'Cancellation of positive and negative areas in each quadrant impacts correlation analysis due to the same density of points.', 'Interpretation of r value in correlation: r close to 0 for a weak relationship, close to -1 for opposing 
directional trends, and close to 1 for synchronous directional trends.', 'Calculation of correlation coefficient represents the ratio of two variables, resulting in a unitless quantity indicating the relationship between the variables.', 'Importance of r-value close to 0, 0.7, and the impact of sampling on r-value, emphasizing the need for further investigation.', "R-value's relevance in linear models: only dimensions with r-value close to -1 or +1 are selected, while those close to 0 are considered useless.", 'Impact of sampling on r-value, with asymmetric distribution leading to r-value greater than 0 and the need for further investigation.']}, {'end': 6398.59, 'segs': [{'end': 5088.54, 'src': 'embed', 'start': 5063.667, 'weight': 1, 'content': [{'end': 5069.71, 'text': 'and it will evaluate the different possible lines before it finds what is called the best fit line.', 'start': 5063.667, 'duration': 6.043}, {'end': 5073.852, 'text': 'Now, what is this best fit line?', 'start': 5072.552, 'duration': 1.3}, {'end': 5086.139, 'text': 'Amongst all the lines possible that it has checked, best fit line is that which goes through the maximum number of data points.', 'start': 5074.973, 'duration': 11.166}, {'end': 5088.54, 'text': "it can't go through all of them.", 'start': 5086.139, 'duration': 2.401}], 'summary': 'Algorithm evaluates lines to find best fit line through maximum data points.', 'duration': 24.873, 'max_score': 5063.667, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM5063667.jpg'}, {'end': 5188.299, 'src': 'embed', 'start': 5142.76, 'weight': 0, 'content': [{'end': 5144.041, 'text': "Don't mistake the word error.", 'start': 5142.76, 'duration': 1.281}, {'end': 5149.304, 'text': "Error doesn't mean anything negative, but it's used for something that you already know.", 'start': 5144.761, 'duration': 4.543}, {'end': 5150.485, 'text': "I'll tell you what it is.", 'start': 5149.684, 'duration': 
0.801}, {'end': 5158.766, 'text': 'So amongst all the different possible lines, there will be only one line which minimizes the sum of all these errors.', 'start': 5152.264, 'duration': 6.502}, {'end': 5167.149, 'text': 'If I sum up all these errors across all the data points, the minimum sum error will be given to you by only one line.', 'start': 5158.866, 'duration': 8.283}, {'end': 5170.31, 'text': 'That line is called the best fit line.', 'start': 5168.469, 'duration': 1.841}, {'end': 5178.132, 'text': 'The algorithm will find out for you from infinite number of possibilities the best fit line for you.', 'start': 5172.21, 'duration': 5.922}, {'end': 5179.993, 'text': 'It is like looking for a needle in a haystack.', 'start': 5178.212, 'duration': 1.781}, {'end': 5188.299, 'text': 'and to do this it makes use of a process which is called the gradient descent.', 'start': 5182.893, 'duration': 5.406}], 'summary': 'Finding the best fit line minimizes errors across data points using gradient descent.', 'duration': 45.539, 'max_score': 5142.76, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM5142760.jpg'}, {'end': 5543.916, 'src': 'embed', 'start': 5514.773, 'weight': 6, 'content': [{'end': 5519.495, 'text': "Isn't this formula for variance? So error is nothing but variance.", 'start': 5514.773, 'duration': 4.722}, {'end': 5527.259, 'text': 'How data points vary, scatter across the best fit line? 
The lesser the variance, more reliable the central value is.', 'start': 5520.435, 'duration': 6.824}, {'end': 5532.427, 'text': 'the lesser the variance of the points across the model, the better the model is.', 'start': 5528.424, 'duration': 4.003}, {'end': 5536.19, 'text': 'Same concept comes to you in a different way.', 'start': 5534.089, 'duration': 2.101}, {'end': 5543.916, 'text': 'So, sum of squared errors is nothing but variance, variance of the data points across the model.', 'start': 5539.133, 'duration': 4.783}], 'summary': 'Sum of squared errors represents the variance of data points across the model, indicating model reliability.', 'duration': 29.143, 'max_score': 5514.773, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM5514773.jpg'}, {'end': 5740.549, 'src': 'embed', 'start': 5705.496, 'weight': 5, 'content': [{'end': 5707.397, 'text': 'One is called stochastic variance.', 'start': 5705.496, 'duration': 1.901}, {'end': 5713.736, 'text': 'which means nothing but probabilistic variance.', 'start': 5709.875, 'duration': 3.861}, {'end': 5726.241, 'text': 'the other one is called deterministic variance, which means variance that happens for reasons that I know.', 'start': 5713.736, 'duration': 12.505}, {'end': 5727.762, 'text': 'I know why that variance happens.', 'start': 5726.241, 'duration': 1.521}, {'end': 5729.703, 'text': 'okay, that is called deterministic.', 'start': 5727.762, 'duration': 1.941}, {'end': 5734.504, 'text': "but variance also happens in your data set for reasons that I don't know.", 'start': 5729.703, 'duration': 4.801}, {'end': 5735.325, 'text': 'we call them noise.', 'start': 5734.504, 'duration': 0.821}, {'end': 5740.549, 'text': 'okay, that is called noise, alright.', 'start': 5738.928, 'duration': 1.621}], 'summary': 'Stochastic variance is probabilistic, while deterministic variance is known. 
Variance in datasets for unknown reasons is called noise.', 'duration': 35.053, 'max_score': 5705.496, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM5705496.jpg'}, {'end': 6118.302, 'src': 'embed', 'start': 6080.772, 'weight': 7, 'content': [{'end': 6097.218, 'text': 'Descent, as the name suggests, we have to descend, ok, gradient descent algorithm which internally makes use of what is called partial derivatives,', 'start': 6080.772, 'duration': 16.446}, {'end': 6102.081, 'text': 'partial derivatives and this term, partial derivatives.', 'start': 6097.218, 'duration': 4.863}, {'end': 6103.482, 'text': 'I will tell you what it is.', 'start': 6102.081, 'duration': 1.401}, {'end': 6105.623, 'text': 'It is not rocket science.', 'start': 6103.482, 'duration': 2.141}, {'end': 6108.585, 'text': 'It is dy by dx.', 'start': 6105.623, 'duration': 2.962}, {'end': 6110.486, 'text': 'It is exactly dy by dx.', 'start': 6108.585, 'duration': 1.901}, {'end': 6118.302, 'text': 'What it does is, for a small change in M in this direction or this direction.', 'start': 6110.486, 'duration': 7.816}], 'summary': 'Gradient descent algorithm uses partial derivatives for directional changes.', 'duration': 37.53, 'max_score': 6080.772, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM6080772.jpg'}], 'start': 4929.515, 'title': 'Linear regression techniques', 'summary': 'Explains dimension selection, linear regression, variance in data, and gradient descent. 
It emphasizes finding the best fit line, error minimization, and managing noise in data for predictive modeling.', 'chapters': [{'end': 5220.297, 'start': 4929.515, 'title': 'Dimension selection and best fit line', 'summary': 'Explains the process of selecting independent dimensions and finding the best fit line using an algorithm, such as evaluating multiple lines to identify the best fit line that minimizes the sum of errors and goes through the maximum number of data points.', 'duration': 290.782, 'highlights': ['The algorithm evaluates multiple lines to find the best fit line that minimizes the sum of errors and maximizes the number of data points it goes through. The algorithm evaluates different possible lines to find the best fit line, which minimizes the sum of errors and goes through the maximum number of data points.', 'The process includes using gradient descent to find the best model for the given dataset. The algorithm uses the gradient descent process to find the best model for the given dataset, resembling a needle in a haystack search.', 'The relationship between independent and target variables is represented by a line, and the best fit line minimizes the distance between other points and the line. The best fit line represents the relationship between independent and target variables and minimizes the distance between other points and the line.']}, {'end': 5679.762, 'start': 5238.291, 'title': 'Linear regression and error minimization', 'summary': 'Covers the concept of linear regression, emphasizing the priority of minimizing error through variance reduction and the impact of outliers, aiming to find the best fit line out of infinite possibilities.', 'duration': 441.471, 'highlights': ['The priority is to minimize error, with variance reduction being crucial in finding the best fit line. 
The error should be minimized, and variance reduction is essential in finding the best fit line with the least sum of squared errors.', 'The concept of variance and its impact on the reliability of the model, where lesser variance implies a more reliable model. Variance directly affects the reliability of the model, where lesser variance indicates a more reliable central value.', 'The discussion on the impact of outliers and the relevance of linear regression in data modeling. Outliers are removed to ensure the linear regression model is based on data that affects the model, with the presence of outliers causing noise and variance around the model.']}, {'end': 5991.25, 'start': 5679.762, 'title': 'Understanding variance in data', 'summary': 'Explains the concept of variance in data, distinguishing between stochastic and deterministic variances, and highlights the impact of noise on data points and the challenges it poses in predictive modeling, emphasizing the need to address stochastic noise and the role of multivariate analysis in understanding and managing noise in data.', 'duration': 311.488, 'highlights': ['The chapter discusses the distinction between stochastic and deterministic variances, highlighting the impact of noise on data points and the challenges it poses in predictive modeling.', 'It emphasizes the need to address stochastic noise, which causes a scatter of data points across the model, and the role of multivariate analysis in understanding and managing noise in data.', 'The speaker explains the concept of least mean error and the formulation of error as the sum of squared errors, which is related to the variance across the line and refers to it as a quadratic equation in mathematics.', 'The chapter mentions the potential impact of collinearity in the data set, where the interaction of different variables may either cancel out or magnify the noise, highlighting the complexity of managing noise in data analysis.', 'It introduces the concept of 
convex functions in mathematics, specifically in the context of quadratic equations, and its relevance to understanding variance in data.', 'The speaker addresses the challenge of identifying the best fit line with the least error, emphasizing the exploration of different combinations of slopes and intercepts in a mathematical space to minimize the error, which is related to variance across the line.']}, {'end': 6398.59, 'start': 5991.25, 'title': 'Gradient descent and best fit line', 'summary': 'Explains the concept of gradient descent in finding the best fit line using partial derivatives to reach the global minima, ensuring absolute minimum error in linear regression, and the importance of adjusting the learning step to prevent oscillation and improve convergence.', 'duration': 407.34, 'highlights': ['The algorithm utilizes gradient descent and partial derivatives to iteratively move from a random best fit line to the global minima, ensuring the least sum of squared errors and determining the best fit line for the given data points.', 'In linear regression, the error function being quadratic guarantees reaching the absolute minimum, but adjusting the learning step is crucial to prevent oscillation and ensure convergence towards the global minima.', 'The concept of partial derivatives, specifically dy by dx, is used to determine the direction of change in the parameters (M and C) to minimize the error, ensuring the algorithm converges towards the best fit line with the least error.']}], 'duration': 1469.075, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM4929515.jpg', 'highlights': ['The algorithm utilizes gradient descent and partial derivatives to iteratively move from a random best fit line to the global minima, ensuring the least sum of squared errors and determining the best fit line for the given data points.', 'The algorithm evaluates multiple lines to find the best fit line that minimizes the sum 
of errors and maximizes the number of data points it goes through.', 'The priority is to minimize error, with variance reduction being crucial in finding the best fit line.', 'The process includes using the gradient descent to find the best model for the given dataset.', 'The relationship between independent and target variables is represented by a line, and the best fit line minimizes the distance between other points and the line.', 'The chapter discusses the distinction between stochastic and deterministic variances, highlighting the impact of noise on data points and the challenges it poses in predictive modeling.', 'The concept of variance and its impact on the reliability of the model, where lesser variance implies a more reliable model.', 'The concept of partial derivatives, specifically dy by dx, is used to determine the direction of change in the parameters (M and C) to minimize the error, ensuring the algorithm converges towards the best fit line with the least error.']}, {'end': 7600.656, 'segs': [{'end': 6493.062, 'src': 'embed', 'start': 6463.066, 'weight': 3, 'content': [{'end': 6466.388, 'text': 'which is nothing but xi minus x bar, x bar here is your predicted lines.', 'start': 6463.066, 'duration': 3.322}, {'end': 6469.089, 'text': 'So formula remains same.', 'start': 6468.068, 'duration': 1.021}, {'end': 6473.871, 'text': 'That variance is called sum of squared errors, that variance we have to minimize.', 'start': 6469.829, 'duration': 4.042}, {'end': 6478.133, 'text': 'The minimal the variance is, the better your model is.', 'start': 6475.392, 'duration': 2.741}, {'end': 6490.9, 'text': 'Alright, so let me explain this to you because you are going to come across these terms down the line.', 'start': 6483.936, 'duration': 6.964}, {'end': 6493.062, 'text': 'So let me explain this to you in slightly more detail.', 'start': 6490.98, 'duration': 2.082}], 'summary': 'Minimize sum of squared errors for better model performance.', 'duration': 29.996, 
'max_score': 6463.066, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM6463066.jpg'}, {'end': 7108.425, 'src': 'heatmap', 'start': 6546.277, 'weight': 0.897, 'content': [{'end': 6556.444, 'text': 'but the model, the best fit line says the value should be yellow p2, the distance between p2 and p1.', 'start': 6546.277, 'duration': 10.167}, {'end': 6559.606, 'text': 'this is what we call error, the variance.', 'start': 6556.444, 'duration': 3.162}, {'end': 6562.008, 'text': 'this is what we call sum of squared errors.', 'start': 6559.606, 'duration': 2.402}, {'end': 6568.472, 'text': "okay, I'm showing you only one data point here, but you have to imagine there are multiple data points here.", 'start': 6562.008, 'duration': 6.464}, {'end': 6574.088, 'text': 'sum of squared error applies all together okay.', 'start': 6568.472, 'duration': 5.616}, {'end': 6576.69, 'text': 'So this contributes to the sum of squared error.', 'start': 6574.569, 'duration': 2.121}, {'end': 6584.716, 'text': 'There is another term that we use which is called total error, total error across all the data points put together.', 'start': 6578.671, 'duration': 6.045}, {'end': 6594.262, 'text': 'The total error is nothing but the difference between y bar, the expected value of y and the actual value of y.', 'start': 6584.756, 'duration': 9.506}, {'end': 6602.348, 'text': 'In your data there is a distance between the actual value of y and the predicted value, the expected value of y, expected value is y bar.', 'start': 6594.262, 'duration': 8.086}, {'end': 6606.405, 'text': 'that distance is called total error.', 'start': 6604.124, 'duration': 2.281}, {'end': 6616.047, 'text': 'Of the total error, your model has captured this much.', 'start': 6610.426, 'duration': 5.621}, {'end': 6621.388, 'text': 'Your model is a regression model, hence the name given to this is regression error.', 'start': 6617.567, 'duration': 3.821}, {'end': 6625.369, 
'text': 'Of the total error, your model is predicted this much error.', 'start': 6622.329, 'duration': 3.04}, {'end': 6631.671, 'text': 'Of the total error in your dataset, your model is predicted this much.', 'start': 6627.39, 'duration': 4.281}, {'end': 6635.569, 'text': 'the difference between predicted and the y bar.', 'start': 6633.868, 'duration': 1.701}, {'end': 6638.57, 'text': 'This distance is called regression error.', 'start': 6636.829, 'duration': 1.741}, {'end': 6642.391, 'text': 'What it has not been able to explain is this.', 'start': 6640.19, 'duration': 2.201}, {'end': 6645.412, 'text': 'This is called sum of squared errors.', 'start': 6643.852, 'duration': 1.56}, {'end': 6649.494, 'text': 'Sum of squared error has another term which is called residual errors.', 'start': 6646.072, 'duration': 3.422}, {'end': 6653.355, 'text': "Residuals are those things which we don't have a reason for.", 'start': 6650.434, 'duration': 2.921}, {'end': 6654.416, 'text': "We don't know why they occur.", 'start': 6653.435, 'duration': 0.981}, {'end': 6655.936, 'text': 'They are called residuals.', 'start': 6655.016, 'duration': 0.92}, {'end': 6660.698, 'text': 'Sum of squared errors is also known as residual errors.', 'start': 6657.537, 'duration': 3.161}, {'end': 6669.293, 'text': 'So when you build these models, especially when you go to your deep learning part and all these things, you will come across these terms.', 'start': 6663.871, 'duration': 5.422}, {'end': 6674.714, 'text': 'Total error is the distance between your actual data point and expected value.', 'start': 6670.453, 'duration': 4.261}, {'end': 6683.837, 'text': 'So if the average height of this class is 5.5, the difference between your height and 5.5 is the total error.', 'start': 6677.235, 'duration': 6.602}, {'end': 6692.74, 'text': 'Now you have built a model to predict the height of a person in the class based on the characteristics of the person.', 'start': 6686.818, 'duration': 5.922}, {'end': 
6696.577, 'text': 'So your model predicts some value for you.', 'start': 6694.656, 'duration': 1.921}, {'end': 6702.482, 'text': 'So the model predicts the height to be 5.3.', 'start': 6698.399, 'duration': 4.083}, {'end': 6706.144, 'text': 'Your actual height is 6.', 'start': 6702.482, 'duration': 3.662}, {'end': 6709.867, 'text': '6 minus 5.3, that is 0.7.', 'start': 6706.144, 'duration': 3.723}, {'end': 6714.87, 'text': '0.7 is the distance between, 0.7 is this distance.', 'start': 6709.867, 'duration': 5.003}, {'end': 6717.492, 'text': 'Your actual height minus predicted height.', 'start': 6715.791, 'duration': 1.701}, {'end': 6729.041, 'text': 'Right? The distance between your height and the class average, that is the total error.', 'start': 6719.834, 'duration': 9.207}, {'end': 6734.783, 'text': 'So you will come across three terms: total error, sum of squared errors, and regression error.', 'start': 6729.722, 'duration': 5.061}, {'end': 6742.444, 'text': 'Keep in mind, regression error and residual error, these two together make up the total error.', 'start': 6735.583, 'duration': 6.861}, {'end': 6747.085, 'text': "And don't mistake the word error, error means variance.", 'start': 6743.864, 'duration': 3.221}, {'end': 6749.769, 'text': 'Error means variance.', 'start': 6748.928, 'duration': 0.841}, {'end': 6751.991, 'text': 'Of the total variance in your data set.', 'start': 6749.969, 'duration': 2.022}, {'end': 6756.194, 'text': 'I told you one part of the variance is caused by the deterministic.', 'start': 6751.991, 'duration': 4.203}, {'end': 6762.079, 'text': 'The other part of the variance is caused by the stochastic, non-deterministic, same thing.', 'start': 6756.194, 'duration': 5.885}, {'end': 6766.342, 'text': 'Non-deterministic is your residual error, sum of squared errors.', 'start': 6763.46, 'duration': 2.882}, {'end': 6784.445, 'text': 'No, in this case I am showing you only one point, but you have to imagine all the points together, because this SSE, SSR,', 
'start': 6777.62, 'duration': 6.825}, {'end': 6786.046, 'text': 'they are across all points put together.', 'start': 6784.445, 'duration': 1.601}, {'end': 6790.87, 'text': 'What I am saying is what contributes to SSR and SSE, that is what I am showing you.', 'start': 6787.187, 'duration': 3.683}, {'end': 6795.534, 'text': 'So, you have to imagine all the points here, that becomes too cluttered, that is what I am showing you.', 'start': 6791.911, 'duration': 3.623}, {'end': 6800.477, 'text': 'What contributes to SSR, what contributes to SSE, that is what I am showing over here.', 'start': 6796.594, 'duration': 3.883}, {'end': 6818.412, 'text': 'Total error is this, I want to minimize this unexplained error, that is what my objective is.', 'start': 6810.907, 'duration': 7.505}, {'end': 6825.616, 'text': 'The gradient descent works in finding the best fit line for you, where best fit line is that line where this error is minimized.', 'start': 6818.452, 'duration': 7.164}, {'end': 6834.621, 'text': 'Because I do not want a model which is not able to tell me why, I want a model which tells me why this error.', 'start': 6828.577, 'duration': 6.044}, {'end': 6844.974, 'text': 'So, we will not try to reduce the regression error.', 'start': 6841.573, 'duration': 3.401}, {'end': 6846.035, 'text': 'No, we will not try.', 'start': 6845.415, 'duration': 0.62}, {'end': 6847.595, 'text': 'We will not do that.', 'start': 6846.075, 'duration': 1.52}, {'end': 6849.176, 'text': 'Because that is a definite mistake.', 'start': 6847.896, 'duration': 1.28}, {'end': 6850.176, 'text': 'I know why it happens.', 'start': 6849.256, 'duration': 0.92}, {'end': 6854.858, 'text': 'It is because of the variance of the It is because of the natural variance in the process.', 'start': 6850.677, 'duration': 4.181}, {'end': 6859, 'text': 'When you are running a process, many things come into play.', 'start': 6855.779, 'duration': 3.221}, {'end': 6863.662, 'text': 'factors come into play, because of 
which the values change here and there around the expected value.', 'start': 6859, 'duration': 4.662}, {'end': 6864.882, 'text': 'That is okay with me.', 'start': 6864.142, 'duration': 0.74}, {'end': 6868.904, 'text': 'But for a given value of input, there are multiple values of y that I do not understand.', 'start': 6865.603, 'duration': 3.301}, {'end': 6873.483, 'text': 'So for a given value of input, there should be only one value of y.', 'start': 6870.902, 'duration': 2.581}, {'end': 6877.485, 'text': 'Why are there multiple values of y? That I do not understand.', 'start': 6873.483, 'duration': 4.002}, {'end': 6879.425, 'text': 'That part is my unexplained error.', 'start': 6877.585, 'duration': 1.84}, {'end': 6882.947, 'text': 'I want a model in which unexplained error is minimized.', 'start': 6879.545, 'duration': 3.402}, {'end': 6887.349, 'text': 'I do not want a model where unexplained error is very high.', 'start': 6884.908, 'duration': 2.441}, {'end': 6895.612, 'text': 'What is the use of that model? 
It will not predict anything.', 'start': 6887.509, 'duration': 8.103}, {'end': 6915.831, 'text': 'When you say given value of x you have one, P2 is predicted value, but you have multiple values of P1 spread around the line, that gap is unexplained.', 'start': 6897.573, 'duration': 18.258}, {'end': 6919.254, 'text': 'that variance is what we call stochastic random variance.', 'start': 6915.831, 'duration': 3.423}, {'end': 6920.135, 'text': 'we do not know why it happens.', 'start': 6919.254, 'duration': 0.881}, {'end': 6925.439, 'text': 'So, basically the idea is to minimize SSE that is all, the bottom line is minimize SSE.', 'start': 6921.336, 'duration': 4.103}, {'end': 6933.913, 'text': 'So, increase the SSR to SST ratio that is what our objective is, yeah.', 'start': 6927.711, 'duration': 6.202}, {'end': 6935.654, 'text': 'Alright, let us move on.', 'start': 6934.714, 'duration': 0.94}, {'end': 6944.818, 'text': 'So, before you build the model you have to evaluate each independent dimension and see what is the R value.', 'start': 6938.515, 'duration': 6.303}, {'end': 6954.922, 'text': 'R value coefficient of correlation comes into play to help you identify good predictors given the target, ok.', 'start': 6946.158, 'duration': 8.764}, {'end': 6962.619, 'text': 'Once you build the model, I want to see how good the model is, how reliable the model is.', 'start': 6956.115, 'duration': 6.504}, {'end': 6972.605, 'text': 'For that we use another metric, that metric is called coefficient of determination.', 'start': 6963.059, 'duration': 9.546}, {'end': 6982.331, 'text': 'And this is represented as R square.', 'start': 6979.129, 'duration': 3.202}, {'end': 6996.723, 'text': 'but before you jump to it I should caution you R square is not always R into R, R is coefficient of correlation.', 'start': 6986.616, 'duration': 10.107}, {'end': 7001.806, 'text': 'So naturally to think that R square will be R into R.', 'start': 6997.923, 'duration': 3.883}, {'end': 7008.43, 
'text': 'that is true only when you build a simple linear model between one dimension, one target.', 'start': 7001.806, 'duration': 6.624}, {'end': 7009.711, 'text': 'it is only true at that point.', 'start': 7008.43, 'duration': 1.281}, {'end': 7019.551, 'text': "In multidimensional multivariate analysis where you have more than 2, 3 dimensions, you can't say R square is R into R, ok.", 'start': 7010.852, 'duration': 8.699}, {'end': 7024.575, 'text': 'So R square is just a symbol which represents another metric called coefficient of determination.', 'start': 7020.152, 'duration': 4.423}, {'end': 7033.443, 'text': 'How much of the total variance in your Y has been explained by your model?', 'start': 7026.757, 'duration': 6.686}, {'end': 7042.01, 'text': 'Keep in mind the total variance in Y is consisting of two components deterministic, non-deterministic, stochastic, random.', 'start': 7033.823, 'duration': 8.187}, {'end': 7052.074, 'text': 'How much of this total variance in your Y has been explained by your model? 
That measure is called coefficient of determination.', 'start': 7043.168, 'duration': 8.906}, {'end': 7067.664, 'text': 'Obviously a model which captures, explains maximal of the variance in Y, it leaves behind very small residuals, that model is the best one.', 'start': 7056.216, 'duration': 11.448}, {'end': 7076.747, 'text': 'So R square, it ranges between 0 and 1.', 'start': 7069.966, 'duration': 6.781}, {'end': 7078.768, 'text': '0 means your model is not able to explain anything.', 'start': 7076.747, 'duration': 2.021}, {'end': 7081.768, 'text': 'Absolutely useless.', 'start': 7080.968, 'duration': 0.8}, {'end': 7086.849, 'text': '1 means your model is able to completely explain all the variance in your target.', 'start': 7082.769, 'duration': 4.08}, {'end': 7087.71, 'text': 'There is no residual.', 'start': 7086.869, 'duration': 0.841}, {'end': 7091.91, 'text': "Such models don't exist.", 'start': 7090.81, 'duration': 1.1}, {'end': 7094.111, 'text': 'We can never build such models.', 'start': 7093.011, 'duration': 1.1}, {'end': 7100.639, 'text': 'Usually r value will be between 0 and 1, the closer they are to 1 the better the model is.', 'start': 7095.655, 'duration': 4.984}, {'end': 7108.425, 'text': 'Because it is in simple linear models it is a squared term it can never be negative.', 'start': 7101.82, 'duration': 6.605}], 'summary': 'Understanding and minimizing sum of squared errors, regression errors, and coefficient of determination in model evaluation.', 'duration': 562.148, 'max_score': 6546.277, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM6546277.jpg'}, {'end': 6762.079, 'src': 'embed', 'start': 6735.583, 'weight': 2, 'content': [{'end': 6742.444, 'text': 'Keep in mind regression error and residual errors these two together is a total error.', 'start': 6735.583, 'duration': 6.861}, {'end': 6747.085, 'text': "And don't mistake the word error, error means variance.", 'start': 6743.864, 
'duration': 3.221}, {'end': 6749.769, 'text': 'error means variance.', 'start': 6748.928, 'duration': 0.841}, {'end': 6751.991, 'text': 'Of the total variance in your data set.', 'start': 6749.969, 'duration': 2.022}, {'end': 6756.194, 'text': 'I told you one part of the variance is caused by the deterministic.', 'start': 6751.991, 'duration': 4.203}, {'end': 6762.079, 'text': 'the other part of the variance is caused by the stochastic, non-deterministic same thing.', 'start': 6756.194, 'duration': 5.885}], 'summary': 'Regression and residual errors together account for total variance in the data set.', 'duration': 26.496, 'max_score': 6735.583, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM6735583.jpg'}, {'end': 7091.91, 'src': 'embed', 'start': 7056.216, 'weight': 1, 'content': [{'end': 7067.664, 'text': 'Obviously a model which captures, explains maximal of the variance in Y, it leaves behind very small residuals, that model is the best one.', 'start': 7056.216, 'duration': 11.448}, {'end': 7076.747, 'text': 'So R square, it ranges between 0 and 1.', 'start': 7069.966, 'duration': 6.781}, {'end': 7078.768, 'text': '0 means your model is not able to explain anything.', 'start': 7076.747, 'duration': 2.021}, {'end': 7081.768, 'text': 'Absolutely useless.', 'start': 7080.968, 'duration': 0.8}, {'end': 7086.849, 'text': '1 means your model is able to completely explain all the variance in your target.', 'start': 7082.769, 'duration': 4.08}, {'end': 7087.71, 'text': 'There is no residual.', 'start': 7086.869, 'duration': 0.841}, {'end': 7091.91, 'text': "Such models don't exist.", 'start': 7090.81, 'duration': 1.1}], 'summary': 'R-squared ranges from 0 to 1; 0 means no explanation, 1 means complete explanation.', 'duration': 35.694, 'max_score': 7056.216, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM7056216.jpg'}, {'end': 7418.903, 'src': 'heatmap', 
'start': 7237.242, 'weight': 0, 'content': [{'end': 7247.03, 'text': 'How much of this total variance has been explained, captured by your model? R square is the ratio of the dark area to light area.', 'start': 7237.242, 'duration': 9.788}, {'end': 7253.074, 'text': 'What you are seeing on the top, this is the residuals, unexplained variance.', 'start': 7248.811, 'duration': 4.263}, {'end': 7261.02, 'text': 'So R square is again a ratio, where the ratio is between the total variance in the data points', 'start': 7255.055, 'duration': 5.965}, {'end': 7265.196, 'text': 'and the variance explained by your model.', 'start': 7263.015, 'duration': 2.181}, {'end': 7268.738, 'text': 'That ratio is called r square.', 'start': 7267.057, 'duration': 1.681}, {'end': 7274.441, 'text': 'Obviously, the lesser the residual unexplained variance, the better the model is.', 'start': 7269.319, 'duration': 5.122}, {'end': 7276.182, 'text': 'We want more r square.', 'start': 7274.942, 'duration': 1.24}, {'end': 7277.323, 'text': 'Yes, exactly.', 'start': 7276.263, 'duration': 1.06}, {'end': 7282.246, 'text': 'But there is a problem.', 'start': 7277.383, 'duration': 4.863}, {'end': 7284.287, 'text': 'There is a problem here.', 'start': 7283.206, 'duration': 1.081}, {'end': 7287.669, 'text': 'The problem I can explain to you in simple terms.', 'start': 7285.047, 'duration': 2.622}, {'end': 7300.332, 'text': 'The problem is, go back to our discussion on r, where I told you, if there is no relationship between these two,', 'start': 7290.13, 'duration': 10.202}, {'end': 7305.513, 'text': 'the target and the independent variable, you have to get a perfect symmetrical distribution.', 'start': 7300.332, 'duration': 5.181}, {'end': 7313.576, 'text': 'But the probability of getting a perfect symmetrical distribution is 0, which means you are going to get some asymmetrical distribution.', 'start': 7307.274, 'duration': 6.302}, {'end': 7321.959, 'text': 'The moment you have asymmetrical 
distribution, your r value will be greater than 0, magnitude this way or that way.', 'start': 7315.077, 'duration': 6.882}, {'end': 7327.034, 'text': 'which means R square will be greater than 0.', 'start': 7323.151, 'duration': 3.883}, {'end': 7336.241, 'text': 'R square is not considered as a very good metric for evaluating the models because it is easily impacted by fluke relationship.', 'start': 7327.034, 'duration': 9.207}, {'end': 7347.309, 'text': 'So we make use of another metric that is called adjusted R square which is nothing but R square minus fluke.', 'start': 7339.963, 'duration': 7.346}, {'end': 7362.216, 'text': 'There is a formula for this, how do you find out the fluke content which I am going to avoid right now because it is too detailed to discuss.', 'start': 7354.093, 'duration': 8.123}, {'end': 7364.917, 'text': 'In case you are interested let me know I can share it with you.', 'start': 7362.836, 'duration': 2.081}, {'end': 7372.26, 'text': "Conceptually it is easy to understand but the argument is very lengthy, we won't have time for that, that is why I am going to ignore it right now.", 'start': 7366.177, 'duration': 6.083}, {'end': 7378.802, 'text': 'So the metric that we will make use of to evaluate our models is adjusted R square not R square.', 'start': 7373.72, 'duration': 5.082}, {'end': 7402.349, 'text': 'The beauty of adjusted R square is whenever in your model building exercise to predict your target, when you include useless variables,', 'start': 7391.424, 'duration': 10.925}, {'end': 7405.471, 'text': 'variables which do not really have any significant impact on the target.', 'start': 7402.349, 'duration': 3.122}, {'end': 7408.816, 'text': 'But their relationship is statistical fluke.', 'start': 7406.735, 'duration': 2.081}, {'end': 7414.46, 'text': 'When you use such useless variables, R square will keep on going up and up and up.', 'start': 7409.136, 'duration': 5.324}, {'end': 7418.903, 'text': 'It will become closer and closer 
to 1.', 'start': 7414.7, 'duration': 4.203}], 'summary': 'R square is not a good metric; adjusted r square accounts for fluke relationships and is preferred for model evaluation.', 'duration': 27.479, 'max_score': 7237.242, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM7237242.jpg'}], 'start': 6399.051, 'title': 'Understanding regression models and model evaluation', 'summary': 'Explains the concepts of error in regression models, emphasizes minimizing sum of squared errors, discusses regression error and residual errors, evaluates model performance using r square and coefficient of determination, and emphasizes the importance of adjusted r square for model evaluation.', 'chapters': [{'end': 6734.783, 'start': 6399.051, 'title': 'Understanding error in regression models', 'summary': 'Explains the concept of error in regression models, emphasizing the importance of minimizing the sum of squared errors to improve model accuracy, providing insights into total error, regression error, and residual errors.', 'duration': 335.732, 'highlights': ['The minimal the variance is, the better your model is, emphasizing the importance of minimizing the sum of squared errors to improve model accuracy.', 'The total error is the distance between the actual data point and expected value, providing insights into the concept of total error in regression models.', 'Your model is a regression model, hence the name given to this is regression error, explaining the concept of regression error in the context of regression models.', 'Sum of squared errors is also known as residual errors, highlighting the relationship between sum of squared errors and residual errors in regression models.']}, {'end': 6925.439, 'start': 6735.583, 'title': 'Understanding regression error and minimizing unexplained variance', 'summary': 'Discusses the concept of regression error and residual errors, emphasizing the importance of minimizing unexplained 
error (sse) to improve model prediction, with a focus on understanding and reducing stochastic random variance.', 'duration': 189.856, 'highlights': ['The total error in regression analysis is the combination of regression error and residual errors, which represent the variance in the dataset.', "The objective is to minimize the unexplained error (SSE) to improve the model's predictive capability, by reducing the stochastic random variance and understanding the factors contributing to it.", 'The model aims to minimize the sum of squared errors (SSE) to enhance prediction accuracy and reduce the unexplained variance, emphasizing the significance of understanding and addressing stochastic random variance.']}, {'end': 7327.034, 'start': 6927.711, 'title': 'Model evaluation: r square and coefficient of determination', 'summary': 'Discusses the importance of r square and coefficient of determination in evaluating model performance, with r square representing the explained variance ranging from 0 to 1, and its significance in multivariate analysis, while emphasizing the need for a high r square close to 1 for a good model.', 'duration': 399.323, 'highlights': ['R square represents the explained variance in the model, ranging from 0 to 1, with a value of 1 indicating that the model explains all variance in the target. R square measures how much of the total variance in the target Y has been explained by the model, with 0 meaning the model is unable to explain anything and 1 indicating complete explanation of variance.', 'The significance of R square in multivariate analysis, where it does not equate to R multiplied by R, emphasizing the need for a high R square close to 1 for a good model. 
In multivariate analysis, R square does not equate to R multiplied by R, highlighting the importance of achieving a high R square close to 1 for a good model performance.', 'The importance of capturing maximal variance in the target Y to produce the best model, with R square being a metric used to evaluate the performance of the model. A model that captures and explains maximal variance in the target Y is considered the best, and R square is used as a metric to evaluate the performance of the model.']}, {'end': 7600.656, 'start': 7327.034, 'title': 'Adjusted r square for model evaluation', 'summary': "Discusses the importance of using adjusted r square over r square for model evaluation, as adjusted r square accounts for the impact of useless variables and provides a more accurate measure of a model's goodness of fit, unlike r square which can be easily influenced by fake relationships and useless variables.", 'duration': 273.622, 'highlights': ["The adjusted R square is preferred over R square for model evaluation as it accounts for the impact of useless variables and provides a more accurate measure of a model's goodness of fit, unlike R square which can be easily influenced by fake relationships and useless variables.", 'R square is not considered a good metric for evaluating models as it can be easily impacted by fluke relationships, while adjusted R square provides a more reliable evaluation by accounting for the impact of useless variables.', 'When including useless variables in a model, R square will keep increasing, while adjusted R square will decrease, making it a more reliable metric for model evaluation.', "Attributes used in model building always contain both good and fake components, with good attributes having a significantly higher percentage of 'good R' compared to 'fake R', emphasizing the importance of using adjusted R square for model evaluation.", "The concept of sampling is highlighted as a source of introducing fake relationships in models, 
as the sample's distribution differs from the population's distribution, emphasizing the need for further investigation when using sampled data in models."]}], 'duration': 1201.605, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM6399051.jpg', 'highlights': ["The adjusted R square is preferred over R square for model evaluation as it accounts for the impact of useless variables and provides a more accurate measure of a model's goodness of fit, unlike R square which can be easily influenced by fake relationships and useless variables.", 'R square represents the explained variance in the model, ranging from 0 to 1, with a value of 1 indicating that the model explains all variance in the target. R square measures how much of the total variance in the target Y has been explained by the model, with 0 meaning the model is unable to explain anything and 1 indicating complete explanation of variance.', 'The total error in regression analysis is the combination of regression error and residual errors, which represent the variance in the dataset.', 'The minimal the variance is, the better your model is, emphasizing the importance of minimizing the sum of squared errors to improve model accuracy.']}, {'end': 8846.336, 'segs': [{'end': 7664.808, 'src': 'embed', 'start': 7603.918, 'weight': 0, 'content': [{'end': 7608.221, 'text': 'Some assumptions that this linear regression models make, we already discussed this.', 'start': 7603.918, 'duration': 4.303}, {'end': 7614.866, 'text': 'Assumption of linearity, the relationship between the target and the independent variables are expected to be linear.', 'start': 7608.802, 'duration': 6.064}, {'end': 7617.508, 'text': 'That brings me to another point.', 'start': 7616.368, 'duration': 1.14}, {'end': 7619.25, 'text': 'Please be careful.', 'start': 7618.549, 'duration': 0.701}, {'end': 7634.715, 'text': "When your r value is close to 0, what does it mean? 
I think that's what I ended up saying.", 'start': 7619.83, 'duration': 14.885}, {'end': 7639.577, 'text': 'There is no linear relationship between x and y.', 'start': 7635.335, 'duration': 4.242}, {'end': 7645.019, 'text': 'When r value is close to 0 between y and x, that means there is no linear relationship.', 'start': 7639.577, 'duration': 5.442}, {'end': 7647.52, 'text': 'That does not mean there is no relationship.', 'start': 7645.479, 'duration': 2.041}, {'end': 7654.783, 'text': 'For example, if I give you a distribution like this, this is x and this is y and this is parabolic.', 'start': 7649.321, 'duration': 5.462}, {'end': 7657.867, 'text': 'there is no linear relationship.', 'start': 7656.547, 'duration': 1.32}, {'end': 7662.068, 'text': 'When you find R value for this distribution, you come close to 0.', 'start': 7657.907, 'duration': 4.161}, {'end': 7662.788, 'text': 'But they are related.', 'start': 7662.068, 'duration': 0.72}, {'end': 7664.808, 'text': 'There is non-linear relationships.', 'start': 7663.388, 'duration': 1.42}], 'summary': "Linear regression assumes linear relationship between variables. 
r value near 0 means no linear relationship, but doesn't imply no relationship at all.", 'duration': 60.89, 'max_score': 7603.918, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM7603918.jpg'}, {'end': 7785.356, 'src': 'embed', 'start': 7719.407, 'weight': 2, 'content': [{'end': 7729.977, 'text': 'So, unless we draw this pair plot, we will not come to know that we have this kind of non-linear, so is it always suggestible to draw a pair for each.', 'start': 7719.407, 'duration': 10.57}, {'end': 7730.517, 'text': '100 percent.', 'start': 7729.997, 'duration': 0.52}, {'end': 7738.585, 'text': 'if you ask me, I will always say pair plot is the most important tool you have in your toolbox, which you should use to understand your data.', 'start': 7730.517, 'duration': 8.068}, {'end': 7752.35, 'text': 'Try a different model, may be a non-linear model, ok.', 'start': 7748.167, 'duration': 4.183}, {'end': 7756.795, 'text': 'So, now that brings me to a question, what is linear and what is non-linear, ok.', 'start': 7752.751, 'duration': 4.044}, {'end': 7763.941, 'text': 'Now, look at this, this is slightly funny and at the same time something which is actually serious.', 'start': 7757.315, 'duration': 6.626}, {'end': 7776.673, 'text': 'This y and x, Mathematicians will say, suppose this is x, this is y and the expression is y is equal to x square.', 'start': 7767.344, 'duration': 9.329}, {'end': 7781.255, 'text': 'Mathematicians will say this is a non-linear function.', 'start': 7778.153, 'duration': 3.102}, {'end': 7785.356, 'text': 'Data scientists call this linear function.', 'start': 7783.476, 'duration': 1.88}], 'summary': 'Pair plot is 100% suggestible for understanding non-linear data in a toolbox.', 'duration': 65.949, 'max_score': 7719.407, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM7719407.jpg'}, {'end': 7961.858, 'src': 'embed', 'start': 7927.63, 
'weight': 6, 'content': [{'end': 7930.372, 'text': 'Combo is what we call ensemble techniques.', 'start': 7927.63, 'duration': 2.742}, {'end': 7934.17, 'text': 'In production, we never put one single model into play.', 'start': 7931.027, 'duration': 3.143}, {'end': 7937.313, 'text': 'We always put multiple models grouped together into play.', 'start': 7934.551, 'duration': 2.762}, {'end': 7940.116, 'text': 'We always install multiple models put together.', 'start': 7938.254, 'duration': 1.862}, {'end': 7940.797, 'text': 'Varun, good morning.', 'start': 7940.136, 'duration': 0.661}, {'end': 7943.119, 'text': 'No problem, sir.', 'start': 7942.619, 'duration': 0.5}, {'end': 7943.74, 'text': "That's okay.", 'start': 7943.299, 'duration': 0.441}, {'end': 7945.542, 'text': 'Camera is facing me, not you.', 'start': 7944.44, 'duration': 1.102}, {'end': 7945.942, 'text': "It's okay.", 'start': 7945.622, 'duration': 0.32}, {'end': 7946.563, 'text': 'All right.', 'start': 7946.382, 'duration': 0.181}, {'end': 7948.845, 'text': "Okay. But don't knock the camera off here.", 'start': 7946.883, 'duration': 1.962}, {'end': 7949.886, 'text': 'He is also sleeping.', 'start': 7949.105, 'duration': 0.781}, {'end': 7961.858, 'text': 'So what we do is, have you seen this Kaun Banega Crorepati?', 'start': 7957.034, 'duration': 4.824}], 'summary': 'Ensemble techniques involve using multiple models in production for better results.', 'duration': 34.228, 'max_score': 7927.63, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM7927630.jpg'}, {'end': 8011.649, 'src': 'embed', 'start': 7980.31, 'weight': 5, 'content': [{'end': 7982.171, 'text': 'we always put a collection of models into play.', 'start': 7980.31, 'duration': 1.861}, {'end': 7983.532, 'text': 'That is called ensemble.', 'start': 7982.651, 'duration': 0.881}, {'end': 7990.52, 'text': 'The algorithm assumes that there is a linear relationship between independent variables 
and the target variable.', 'start': 7985.117, 'duration': 5.403}, {'end': 7995.762, 'text': 'It also assumes that there is no relationship between the independent variables.', 'start': 7992.16, 'duration': 3.602}, {'end': 7998.523, 'text': 'That all algorithms do.', 'start': 7997.163, 'duration': 1.36}, {'end': 8000.384, 'text': 'There is nothing particular about this algorithm.', 'start': 7998.723, 'duration': 1.661}, {'end': 8004.906, 'text': 'Assumption of normality of error distribution.', 'start': 8002.285, 'duration': 2.621}, {'end': 8011.649, 'text': 'This algorithm assumes that the model that you build the best fit line,', 'start': 8006.227, 'duration': 5.422}], 'summary': 'Ensemble modeling of algorithms assumes linear relationship between independent variables and target variable, as well as normality of error distribution.', 'duration': 31.339, 'max_score': 7980.31, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM7980310.jpg'}, {'end': 8117.329, 'src': 'embed', 'start': 8089.459, 'weight': 8, 'content': [{'end': 8098.565, 'text': 'okay, so when you build up models, linear models, any model, the first thing we have to do is handle the outliers first.', 'start': 8089.459, 'duration': 9.106}, {'end': 8104.94, 'text': 'okay, There are various ways of testing this out, whether this is happening or not.', 'start': 8098.565, 'duration': 6.375}, {'end': 8113.266, 'text': 'One of the ways of testing this is you do a scatter plot between the actual values of y and predicted values of y.', 'start': 8106.361, 'duration': 6.905}, {'end': 8117.329, 'text': 'Between actual values of y and predicted values of y.', 'start': 8113.266, 'duration': 4.063}], 'summary': 'Handle outliers when building models, test using scatter plot between actual and predicted y values.', 'duration': 27.87, 'max_score': 8089.459, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM8089459.jpg'}, {'end': 8257.011, 'src': 'embed', 'start': 8228.163, 'weight': 9, 'content': [{'end': 8230.824, 'text': 'the independent variables x.', 'start': 8228.163, 'duration': 2.661}, {'end': 8233.525, 'text': 'Such things indicate heteroscedasticity, homoscedasticity.', 'start': 8230.824, 'duration': 2.701}, {'end': 8237.227, 'text': 'The expectation is it will be homoscedasticity, homo means homogeneous.', 'start': 8234.026, 'duration': 3.201}, {'end': 8245.049, 'text': 'Okay, so the residuals irrespective of what the range of x is will be similar across all the range.', 'start': 8238.746, 'duration': 6.303}, {'end': 8257.011, 'text': 'Assumption of independence of errors, I mean error done in one record is no way influenced by the error done in another record.', 'start': 8250.665, 'duration': 6.346}], 'summary': 'Discussion on homoscedasticity and independence of errors in regression analysis.', 'duration': 28.848, 'max_score': 8228.163, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM8228163.jpg'}, {'end': 8412.004, 'src': 'embed', 'start': 8381.159, 'weight': 11, 'content': [{'end': 8382.32, 'text': 'You cannot interpret those models.', 'start': 8381.159, 'duration': 1.161}, {'end': 8383.421, 'text': 'You do not know what they are.', 'start': 8382.719, 'duration': 0.702}, {'end': 8384.882, 'text': 'They are all black box.', 'start': 8384.04, 'duration': 0.842}, {'end': 8393.529, 'text': 'Sir, what do we mean by physical definition? 
I can convert this, whatever this formula is telling me, I can map it to English, right.', 'start': 8385.541, 'duration': 7.988}, {'end': 8396.111, 'text': 'I can tell you in day to day language what it is telling you.', 'start': 8393.649, 'duration': 2.462}, {'end': 8402.7, 'text': "I won't be able to do that if I use say random forest, you're going to do ensemble where you'll do random forest.", 'start': 8398.038, 'duration': 4.662}, {'end': 8406.882, 'text': "Random forest is a black box model, we don't know what it's actually doing inside.", 'start': 8403.22, 'duration': 3.662}, {'end': 8409.263, 'text': 'Support vector machine is a black box.', 'start': 8407.702, 'duration': 1.561}, {'end': 8412.004, 'text': "we don't know what black box.", 'start': 8409.263, 'duration': 2.741}], 'summary': 'Complex models like random forest and support vector machine are black box models, making it difficult to interpret their workings.', 'duration': 30.845, 'max_score': 8381.159, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM8381159.jpg'}, {'end': 8640.092, 'src': 'heatmap', 'start': 8478.383, 'weight': 1, 'content': [{'end': 8482.404, 'text': 'So the more time you spend in selecting your attributes to build the models,', 'start': 8478.383, 'duration': 4.021}, {'end': 8487.764, 'text': 'the higher the quality of the attributes that you use to build the models, the better your models will be.', 'start': 8483.561, 'duration': 4.203}, {'end': 8492.768, 'text': 'And the last, boundaries are linear.', 'start': 8490.686, 'duration': 2.082}, {'end': 8495.87, 'text': 'This is a problem not in regression, this is a problem in classification.', 'start': 8492.928, 'duration': 2.942}, {'end': 8500.633, 'text': 'So if we can use linear model for classification also.', 'start': 8497.511, 'duration': 3.122}, {'end': 8512.202, 'text': 'So if you are having a distribution of classes like this and then this, and you are trying 
to build linear classifiers,', 'start': 8501.774, 'duration': 10.428}, {'end': 8515.935, 'text': 'you will get probably a line like this as a best fit line.', 'start': 8513.373, 'duration': 2.562}, {'end': 8525.781, 'text': 'So, above is this triangle, below this is circles and one triangle is, whereas, on the other hand, if you have made use of non-linear stuff,', 'start': 8517.936, 'duration': 7.845}, {'end': 8530.263, 'text': 'non-linear models, then probably you would have got a better result.', 'start': 8525.781, 'duration': 4.482}, {'end': 8536.347, 'text': 'maybe you would have gotten something like this ok.', 'start': 8530.263, 'duration': 6.084}, {'end': 8541.15, 'text': 'So, if you use non-linear models, it might do a better classification than linear models.', 'start': 8536.827, 'duration': 4.323}, {'end': 8545.999, 'text': 'So, this limitation is not in regression it is in classification.', 'start': 8543.237, 'duration': 2.762}, {'end': 8551.584, 'text': 'Let us do one hands on ok.', 'start': 8549.562, 'duration': 2.022}, {'end': 8558.529, 'text': 'The hands on is not auto MPG data set as it is given here it is the other one which is simplistic.', 'start': 8552.084, 'duration': 6.445}, {'end': 8559.971, 'text': 'So, we will start with simpler ones.', 'start': 8558.87, 'duration': 1.101}, {'end': 8565.075, 'text': 'The other one which I have given to you is CARP ok.', 'start': 8562.533, 'duration': 2.542}, {'end': 8570.059, 'text': 'CARP is linear regression I have taken the data set directly from the UCI site.', 'start': 8565.375, 'duration': 4.684}, {'end': 8579.663, 'text': 'it is called imports underscore imports hyphen 85 dot data, this is the name of the data file ok.', 'start': 8573.959, 'duration': 5.704}, {'end': 8583.405, 'text': 'So, instead of downloading it I am directly reading from there.', 'start': 8579.683, 'duration': 3.722}, {'end': 8595.433, 'text': 'So, let me begin matplotlib is an instruction to the Jupyter notebook to plot make 
all the plots in the notebook itself sorry Jupyter server.', 'start': 8587.147, 'duration': 8.286}, {'end': 8603.378, 'text': 'I am importing pandas numpy and from scikit-learn linear model libraries I am importing the linear regression.', 'start': 8596.353, 'duration': 7.025}, {'end': 8607.158, 'text': 'You can have linear classifiers also.', 'start': 8605.657, 'duration': 1.501}, {'end': 8612.2, 'text': 'Then I am reading the CSV file which is available in this URL.', 'start': 8609.139, 'duration': 3.061}, {'end': 8620.304, 'text': 'It is a comma separated file and I have taken all these column names from the UCI data set.', 'start': 8614.621, 'duration': 5.683}, {'end': 8627.547, 'text': 'So, that gives me the data frame which I call car underscore df.', 'start': 8624.086, 'duration': 3.461}, {'end': 8632.81, 'text': 'This is my data frame.', 'start': 8632.11, 'duration': 0.7}, {'end': 8637.651, 'text': 'All of you okay? So, this is like loading a data into a table.', 'start': 8634.69, 'duration': 2.961}, {'end': 8640.092, 'text': 'So, let us look at two records.', 'start': 8637.991, 'duration': 2.101}], 'summary': 'Selecting high-quality attributes improves model performance. linear models have limitations in classification, while non-linear models may yield better results. 
hands-on exercise uses carp dataset for linear regression.', 'duration': 161.709, 'max_score': 8478.383, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM8478383.jpg'}, {'end': 8595.433, 'src': 'embed', 'start': 8562.533, 'weight': 14, 'content': [{'end': 8565.075, 'text': 'The other one which I have given to you is CARP ok.', 'start': 8562.533, 'duration': 2.542}, {'end': 8570.059, 'text': 'CARP is linear regression I have taken the data set directly from the UCI site.', 'start': 8565.375, 'duration': 4.684}, {'end': 8579.663, 'text': 'it is called imports underscore imports hyphen 85 dot data, this is the name of the data file ok.', 'start': 8573.959, 'duration': 5.704}, {'end': 8583.405, 'text': 'So, instead of downloading it I am directly reading from there.', 'start': 8579.683, 'duration': 3.722}, {'end': 8595.433, 'text': 'So, let me begin matplotlib is an instruction to the Jupyter notebook to plot make all the plots in the notebook itself sorry Jupyter server.', 'start': 8587.147, 'duration': 8.286}], 'summary': 'Using carp for linear regression with data from uci site, reading directly, and plotting with matplotlib in jupyter notebook.', 'duration': 32.9, 'max_score': 8562.533, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM8562533.jpg'}], 'start': 7603.918, 'title': 'Linear regression and ensemble techniques', 'summary': 'Covers linear regression assumptions, the distinction between linear and non-linear models, the use of ensemble techniques in data science, and model interpretation. it emphasizes the implications of an r value close to 0 indicating a lack of linear relationship, the importance of pair plot in understanding non-linear relationships, and the use of multiple models grouped together for production. 
it also discusses the assumptions and limitations of linear models, the physical interpretation of linear regression, the limitation of black box models like neural networks, the importance of reliable attributes in building linear models, and the impact of linear boundaries on classification models.', 'chapters': [{'end': 7717.766, 'start': 7603.918, 'title': 'Linear regression assumptions and r value', 'summary': 'Discusses the assumptions of linearity in linear regression, emphasizing the implications of an r value close to 0 indicating a lack of linear relationship, but not necessarily no relationship, and advises to check for non-linear relationships in the data before using linear models.', 'duration': 113.848, 'highlights': ['An r value close to 0 indicates no linear relationship, but not necessarily no relationship at all. The chapter emphasizes that when the r value is close to 0, it signifies a lack of linear relationship between x and y, but it does not imply the absence of any relationship. This is exemplified through scenarios with non-linear relationships, such as parabolic or sine omega distributions.', 'Caution advised when using linear models with non-linear relationships. It warns against the use of linear models when the relationship between variables is non-linear, as indicated by an r value close to 0. The chapter emphasizes the importance of checking for non-linear patterns in pair panels or pair plots before employing linear models.', 'Importance of reconsidering the use of linear models with attributes having non-linear relationships. 
The chapter stresses the need to carefully reconsider the use of linear models when dealing with attributes that exhibit non-linear relationships, as the presence of such attributes may necessitate a reevaluation of the suitability of linear models.']}, {'end': 7925.549, 'start': 7719.407, 'title': 'Understanding linear and non-linear models', 'summary': 'Emphasizes the importance of pair plot in understanding non-linear relationships, explains the distinction between linear and non-linear functions, and highlights the significance of choosing models with minimal bias variance errors.', 'duration': 206.142, 'highlights': ['Pair plot is the most important tool for understanding data. The speaker stresses the significance of using a pair plot to understand non-linear relationships, advocating for its use in analyzing data.', 'Explanation of linear and non-linear functions. The speaker provides a clear explanation of the distinction between linear and non-linear functions, using the example of y = x square to emphasize the concept.', 'Importance of choosing models with minimal bias variance errors. The chapter concludes by highlighting the objective of choosing models with minimal bias variance errors, emphasizing the importance of determining which model generalizes better.']}, {'end': 8353.298, 'start': 7927.63, 'title': 'Ensemble techniques in data science', 'summary': 'Discusses the concept of ensemble techniques in data science, emphasizing the use of multiple models grouped together for production, the assumptions and limitations of linear models, and the importance of handling outliers and testing for homoscedasticity and independence of errors.', 'duration': 425.668, 'highlights': ['Ensemble techniques involve using multiple models grouped together for production. 
Emphasizes the use of multiple models for production.', 'Linear models assume a linear relationship between independent variables and the target variable, and the absence of a relationship between independent variables. Explains the assumptions of linear models.', "It is important to handle outliers when building linear models, as they can significantly impact the model's performance. Stresses the significance of handling outliers in model building.", "Testing for homoscedasticity and independence of errors is crucial in evaluating the model's performance and correctness of the data sets. Highlights the importance of testing for homoscedasticity and independence of errors."]}, {'end': 8846.336, 'start': 8354.403, 'title': 'Interpreting linear regression and model selection', 'summary': 'Explains the physical interpretation of linear regression, the limitation of black box models like neural networks, the importance of reliable attributes in building linear models, and the impact of linear boundaries on classification models. it also discusses the data preprocessing steps for a linear regression model using the carp dataset.', 'duration': 491.933, 'highlights': ['Linear regression allows for physical interpretation of the model, providing insight into the relationship between variables and the impact of their changes. Linear regression enables the physical interpretation of the model, allowing for a clear understanding of the relationship between variables and their impact on the outcome.', "Black box models like neural networks lack physical interpretability, limiting the understanding of the model's functioning and outputs. Black box models like neural networks lack physical interpretability, hindering the understanding of the model's functioning and outputs.", 'The reliability of attributes used to build linear models is crucial, as unreliable attributes can lead to an unreliable model and its susceptibility to outliers. 
The reliability of attributes used to build linear models is crucial, as unreliable attributes can lead to an unreliable model and its susceptibility to outliers.', 'The impact of linear boundaries on classification models is discussed, highlighting the limitations of using linear models for classification tasks compared to non-linear models. The impact of linear boundaries on classification models is discussed, emphasizing the limitations of using linear models for classification tasks compared to non-linear models.', 'Data preprocessing steps for a linear regression model using the CARP dataset are demonstrated, including handling missing values, converting string data types to numbers, and applying a low variance filter to remove irrelevant columns. The transcript demonstrates data preprocessing steps for a linear regression model using the CARP dataset, including handling missing values, converting string data types to numbers, and applying a low variance filter to remove irrelevant columns.']}], 'duration': 1242.418, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM7603918.jpg', 'highlights': ['An r value close to 0 indicates no linear relationship, but not necessarily no relationship at all.', 'Importance of reconsidering the use of linear models with attributes having non-linear relationships.', 'Caution advised when using linear models with non-linear relationships.', 'Pair plot is the most important tool for understanding data.', 'Explanation of linear and non-linear functions.', 'Importance of choosing models with minimal bias variance errors.', 'Ensemble techniques involve using multiple models grouped together for production.', 'Linear models assume a linear relationship between independent variables and the target variable, and the absence of a relationship between independent variables.', "It is important to handle outliers when building linear models, as they can significantly impact the model's 
performance.", "Testing for homoscedasticity and independence of errors is crucial in evaluating the model's performance and correctness of the data sets.", 'Linear regression allows for physical interpretation of the model, providing insight into the relationship between variables and the impact of their changes.', "Black box models like neural networks lack physical interpretability, limiting the understanding of the model's functioning and outputs.", 'The reliability of attributes used to build linear models is crucial, as unreliable attributes can lead to an unreliable model and its susceptibility to outliers.', 'The impact of linear boundaries on classification models is discussed, highlighting the limitations of using linear models for classification tasks compared to non-linear models.', 'Data preprocessing steps for a linear regression model using the CARP dataset are demonstrated, including handling missing values, converting string data types to numbers, and applying a low variance filter to remove irrelevant columns.']}, {'end': 9835.98, 'segs': [{'end': 8877.226, 'src': 'embed', 'start': 8848.777, 'weight': 1, 'content': [{'end': 8855.881, 'text': "Let's get on to this code but do this analysis you will find that these columns are having low variance filters they are fit to be kicked out.", 'start': 8848.777, 'duration': 7.104}, {'end': 8865.82, 'text': 'right So, then what I do is I saw the car data types, the column data types, most of the data types are object string types.', 'start': 8857.315, 'duration': 8.505}, {'end': 8873.484, 'text': 'I drop my useless columns.', 'start': 8869.261, 'duration': 4.223}, {'end': 8876.385, 'text': 'number of doors most of them have four doors.', 'start': 8873.484, 'duration': 2.901}, {'end': 8877.226, 'text': 'then fuel type.', 'start': 8876.385, 'duration': 0.841}], 'summary': 'Identified low variance columns to drop, most cars have four doors, and discussed fuel type.', 'duration': 28.449, 'max_score': 8848.777, 
'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM8848777.jpg'}, {'end': 8955.333, 'src': 'embed', 'start': 8914.245, 'weight': 0, 'content': [{'end': 8920.787, 'text': 'Single value there we are dropping it, which means the current price of the car is not a function of that its function of other variables.', 'start': 8914.245, 'duration': 6.542}, {'end': 8924.398, 'text': 'Ok, look at this.', 'start': 8922.236, 'duration': 2.162}, {'end': 8926.74, 'text': 'hey guys, do you understand why we drop low variance filter?', 'start': 8924.398, 'duration': 2.342}, {'end': 8935.168, 'text': 'The reason why we drop low variance filter is the price is not influenced by this particular column, which is low variance.', 'start': 8927.521, 'duration': 7.647}, {'end': 8943.175, 'text': 'If the price was impacted by this column, which is low variance, then all the prices should be fixed right,', 'start': 8936.769, 'duration': 6.406}, {'end': 8947.059, 'text': 'which means the variance in the prices is because of the other columns, not this column.', 'start': 8943.175, 'duration': 3.884}, {'end': 8950.608, 'text': 'So, that is why we drop it.', 'start': 8949.167, 'duration': 1.441}, {'end': 8955.333, 'text': 'Now, the number of cylinders what I am doing here is, I am converting all these into numbers.', 'start': 8951.029, 'duration': 4.304}], 'summary': 'Dropping low variance filter as price is not influenced by it.', 'duration': 41.088, 'max_score': 8914.245, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM8914245.jpg'}, {'end': 9056.532, 'src': 'embed', 'start': 9029.097, 'weight': 4, 'content': [{'end': 9035.101, 'text': 'So if your column is an ordinal data type You can go and introduce order in your numerical values.', 'start': 9029.097, 'duration': 6.004}, {'end': 9043.525, 'text': 'If the column is not ordinal, gender column, then you cannot blindly convert them 
into 1 and 2, you have to resort to one-hot coding.', 'start': 9035.781, 'duration': 7.744}, {'end': 9049.688, 'text': 'In scikit-learn there is a facility function called label encoder.', 'start': 9045.646, 'duration': 4.042}, {'end': 9054.07, 'text': 'Label encoder introduces order in your data.', 'start': 9051.069, 'duration': 3.001}, {'end': 9056.532, 'text': 'So be careful when you are using that.', 'start': 9055.131, 'duration': 1.401}], 'summary': 'Use label encoder for ordinal data to introduce order, one-hot coding for non-ordinal data. beware of unintended consequences in using label encoder.', 'duration': 27.435, 'max_score': 9029.097, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM9029097.jpg'}, {'end': 9158.898, 'src': 'embed', 'start': 9120.805, 'weight': 5, 'content': [{'end': 9122.726, 'text': 'Okay, I will tell you what one-hot coding is.', 'start': 9120.805, 'duration': 1.921}, {'end': 9123.426, 'text': 'then we will move on.', 'start': 9122.726, 'duration': 0.7}, {'end': 9132.798, 'text': 'Sorry, you have a column gender and the gender column has male and female, okay.', 'start': 9126.307, 'duration': 6.491}, {'end': 9137.021, 'text': 'This I cannot use as it is in my modeling.', 'start': 9134.019, 'duration': 3.002}, {'end': 9140.084, 'text': 'So, what I do is I convert this into numerical data types.', 'start': 9137.101, 'duration': 2.983}, {'end': 9142.545, 'text': 'All data types should be numerical.', 'start': 9141.104, 'duration': 1.441}, {'end': 9147.709, 'text': 'So, when I convert into numerical, I use one-hot coding.', 'start': 9144.687, 'duration': 3.022}, {'end': 9158.898, 'text': 'What will happen is, it will replace this gender column with two columns, gender underscore m, gender underscore f.', 'start': 9149.391, 'duration': 9.507}], 'summary': 'One-hot coding converts gender column into two numerical columns.', 'duration': 38.093, 'max_score': 9120.805, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM9120805.jpg'}, {'end': 9426.684, 'src': 'embed', 'start': 9396.972, 'weight': 6, 'content': [{'end': 9400.014, 'text': 'Rest of the columns which are object data types I am changing into float types.', 'start': 9396.972, 'duration': 3.042}, {'end': 9402.516, 'text': 'Till this point.', 'start': 9401.655, 'duration': 0.861}, {'end': 9402.996, 'text': 'is it ok??', 'start': 9402.516, 'duration': 0.48}, {'end': 9411.343, 'text': 'Once I have changed the data types to numbers numerical, I know there are many missing values.', 'start': 9405.218, 'duration': 6.125}, {'end': 9423.523, 'text': 'So the strategy I am using here for missing values is replace the missing values of price column with the median of the price column.', 'start': 9413.1, 'duration': 10.423}, {'end': 9426.684, 'text': 'I want to explain this step to you.', 'start': 9425.444, 'duration': 1.24}], 'summary': 'Converting object data types into float, replacing missing values with median.', 'duration': 29.712, 'max_score': 9396.972, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM9396972.jpg'}, {'end': 9603.248, 'src': 'embed', 'start': 9571.007, 'weight': 7, 'content': [{'end': 9574.368, 'text': 'in R, I will have to first convert in numerical value, even to find median.', 'start': 9571.007, 'duration': 3.361}, {'end': 9593.146, 'text': 'Now how do you address missing values? 
How do you address missing values depends on your analysis of the data and why the values are missing.', 'start': 9584.724, 'duration': 8.422}, {'end': 9596.926, 'text': 'The values can be missing because of some random reasons.', 'start': 9594.266, 'duration': 2.66}, {'end': 9598.887, 'text': 'We call it randomly missing values.', 'start': 9597.287, 'duration': 1.6}, {'end': 9603.248, 'text': "Values which are randomly, you don't see any patterns in them.", 'start': 9600.467, 'duration': 2.781}], 'summary': 'In r, converting to numerical values is necessary to find the median. addressing missing values depends on data analysis and reasons for their absence.', 'duration': 32.241, 'max_score': 9571.007, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM9571007.jpg'}, {'end': 9817.261, 'src': 'embed', 'start': 9785.739, 'weight': 9, 'content': [{'end': 9790.462, 'text': "So one column's missing value might be imputed multiple times from multiple columns.", 'start': 9785.739, 'duration': 4.723}, {'end': 9798.811, 'text': 'Please go and read about this, it is a very powerful package, when something comes with power you have to be very careful in using it.', 'start': 9792.568, 'duration': 6.243}, {'end': 9804.314, 'text': 'So read about this MICE, I am not going to cover it here but it is a very beautiful package, you should read about this.', 'start': 9799.912, 'duration': 4.402}, {'end': 9817.261, 'text': 'And just to add to the confusion, apparently the MICE implementation in Python is slightly different from MICE implementation in R,', 'start': 9808.016, 'duration': 9.245}], 'summary': 'Mice package can impute missing values from multiple columns, with different implementations in python and r.', 'duration': 31.522, 'max_score': 9785.739, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM9785739.jpg'}], 'start': 8848.777, 'title': 'Data preprocessing 
techniques', 'summary': 'Delves into techniques for analyzing and optimizing data, including dropping low variance columns, converting categorical data, and handling missing values, to enhance efficiency and streamline datasets for analysis.', 'chapters': [{'end': 8902.242, 'start': 8848.777, 'title': 'Data analysis and data type optimization', 'summary': 'Highlights the process of analyzing data, identifying low variance columns, dropping irrelevant columns, and optimizing data types to improve efficiency and reduce unnecessary data, ultimately resulting in a more streamlined dataset for analysis.', 'duration': 53.465, 'highlights': ['By analyzing the data, low variance columns were identified and deemed fit for removal, contributing to the streamlining of the dataset.', "The process involved dropping irrelevant columns such as 'number of doors' and 'fuel type' which had majority values, leading to a more concise dataset for further analysis.", "Optimization of data types was performed, particularly targeting object data types such as 'number of cylinders', to enhance efficiency and streamline the dataset for improved analysis."]}, {'end': 9178.823, 'start': 8914.245, 'title': 'Data preprocessing techniques', 'summary': 'Discusses the importance of dropping low variance columns, converting categorical data into numerical values using techniques like label encoding and one-hot encoding, and the implications of introducing order in non-ordinal columns, emphasizing the caution required when preprocessing data.', 'duration': 264.578, 'highlights': ['The importance of dropping low variance columns The reason for dropping low variance columns is that the price is not influenced by these columns, as indicated by the fact that the variance in prices is due to other columns, not the low variance column.', 'Converting categorical data into numerical values using label encoding and one-hot encoding The chapter explains the careful handling required when converting 
categorical data into numerical values, highlighting the use of label encoding for introducing order in the data and one-hot encoding for non-ordinal columns, and the caution needed to avoid introducing non-existent order.', 'The process of one-hot coding for categorical data The concept of one-hot coding is explained, where categorical data, such as gender, is replaced with two columns representing the categories, with 1s and 0s indicating the presence of each category, ultimately facilitating the conversion of non-numerical data into a suitable format for modeling.']}, {'end': 9835.98, 'start': 9178.843, 'title': 'Data type conversion and handling missing values', 'summary': 'Covers the manual conversion of data types, replacing missing values with the median, and strategies for handling missing data, including the use of the mice package for multiple imputations through chained equations.', 'duration': 657.137, 'highlights': ['The strategy for handling missing values involves replacing them with the median of the price column to mitigate the impact of outliers and utilizes the fillna function to achieve this. Replacing missing values with the median of the price column; Utilizing the fillna function for the replacement process', 'The discussion emphasizes the need to carefully address missing values based on the nature of the data, such as distinguishing between randomly missing values and values missing due to specific reasons, and suggests different strategies for each case, including building sub-models for prediction. Distinguishing between randomly missing values and specific reasons for missing values; Suggesting different strategies for addressing each case, such as building sub-models for prediction', 'Introducing the MICE package as a powerful tool for multiple imputations through chained equations and advising further reading on its usage and the differences between its implementations in Python and R. 
Introducing the MICE package for multiple imputations through chained equations; Advising further reading on its usage and differences between its implementations in Python and R']}], 'duration': 987.203, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM8848777.jpg', 'highlights': ['Dropping low variance columns streamlined the dataset.', "Irrelevant columns like 'number of doors' and 'fuel type' were removed.", 'Optimization of data types enhanced efficiency for improved analysis.', 'Price is not influenced by low variance columns.', 'Converting categorical data using label and one-hot encoding.', 'One-hot coding facilitates conversion of non-numerical data for modeling.', 'Replacing missing values with the median of the price column mitigates outliers.', 'Careful handling of missing values based on the nature of the data is emphasized.', 'Different strategies suggested for addressing randomly missing values and specific reasons.', 'Introduction of the MICE package for multiple imputations through chained equations.']}, {'end': 10665.823, 'segs': [{'end': 9957.737, 'src': 'embed', 'start': 9897.112, 'weight': 0, 'content': [{'end': 9902.215, 'text': 'The distribution is likely to be a symmetric bell curve for height alright.', 'start': 9897.112, 'duration': 5.103}, {'end': 9904.676, 'text': 'I do not see any skew to worry about.', 'start': 9902.655, 'duration': 2.021}, {'end': 9906.958, 'text': 'Look at the car weight.', 'start': 9905.377, 'duration': 1.581}, {'end': 9911.545, 'text': 'same story repeats here, 2555, 2014.', 'start': 9908.803, 'duration': 2.742}, {'end': 9923.736, 'text': 'If there is any column where the mean and median are very different, then you might be having a skewed data set.', 'start': 9911.545, 'duration': 12.191}, {'end': 9927.84, 'text': 'Look at this one price, now price is the target, forget the price.', 'start': 9924.477, 'duration': 3.363}, {'end': 9933.471, 'text': 'let us see 
if our analysis stands, let us see whether it stands, ok.', 'start': 9929.83, 'duration': 3.641}, {'end': 9940.033, 'text': 'So, this is numerical way, statistical way of analyzing data, but instead you can do a pair plot.', 'start': 9933.971, 'duration': 6.062}, {'end': 9947.854, 'text': 'In pair plot, I always prefer to have the diagonals is in form of density graphs.', 'start': 9942.733, 'duration': 5.121}, {'end': 9957.737, 'text': 'How do you get that? When you call the pair panel, you give diagonal kind is KDE, Kernel Density Estimates.', 'start': 9949.655, 'duration': 8.082}], 'summary': 'Analyzing data for symmetric bell curve, skew, and skewed data sets using statistical and visual methods.', 'duration': 60.625, 'max_score': 9897.112, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM9897112.jpg'}, {'end': 10282.263, 'src': 'embed', 'start': 10254.669, 'weight': 4, 'content': [{'end': 10257.57, 'text': 'Look at the length of the car, again almost a normal distribution.', 'start': 10254.669, 'duration': 2.901}, {'end': 10268.36, 'text': 'But we see some kind of overlap here between the you see, multiple gaussians here, one behind the other may not be of concern right now,', 'start': 10259.158, 'duration': 9.202}, {'end': 10269.9, 'text': 'because they are all kind of overlapping.', 'start': 10268.36, 'duration': 1.54}, {'end': 10277.282, 'text': 'Remember which was the column where we saw perfect match, the central values, what is the column? 
Bore.', 'start': 10271.641, 'duration': 5.641}, {'end': 10280.683, 'text': 'Height Height, height was the column.', 'start': 10277.302, 'duration': 3.381}, {'end': 10282.263, 'text': 'This is the distribution of height.', 'start': 10281.083, 'duration': 1.18}], 'summary': 'The car length follows a normal distribution, with overlapping gaussians, and the height distribution is being discussed.', 'duration': 27.594, 'max_score': 10254.669, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM10254669.jpg'}, {'end': 10426.816, 'src': 'embed', 'start': 10395.888, 'weight': 5, 'content': [{'end': 10398.448, 'text': 'What do you mean by Gaussian? Gaussian means distributions.', 'start': 10395.888, 'duration': 2.56}, {'end': 10410.251, 'text': 'So, if on a particular dimension you have two different distributions, you are talking about two different processes which are generated this data.', 'start': 10399.709, 'duration': 10.542}, {'end': 10417.173, 'text': 'Remember in statistics you would have done a t test or a z normal test.', 'start': 10412.512, 'duration': 4.661}, {'end': 10421.214, 'text': 'When do we say the two datasets belong to the same process?', 'start': 10418.393, 'duration': 2.821}, {'end': 10426.816, 'text': 'when they are significantly overlapping each other, the central values are not very far away from each other.', 'start': 10422.352, 'duration': 4.464}], 'summary': 'Gaussian refers to distributions; for same process, datasets must significantly overlap.', 'duration': 30.928, 'max_score': 10395.888, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM10395888.jpg'}], 'start': 9836.74, 'title': 'Car data analysis', 'summary': 'Covers statistical analysis of car data, including interpreting descriptive statistics, analyzing symmetrical bell curves, identifying skewed data sets, and using pair plots to compare variables. 
it also explains the importance of pair plots in understanding data distributions, highlighting the use of density curves instead of scatter plots, and the impact of outliers on standard deviation and model predictions.', 'chapters': [{'end': 10133.186, 'start': 9836.74, 'title': 'Statistical analysis of car data', 'summary': 'Covers statistical analysis of car data, including interpreting descriptive statistics, analyzing symmetrical bell curves, identifying skewed data sets, and using pair plots to compare variables in a dataset.', 'duration': 296.446, 'highlights': ['The chapter emphasizes the importance of interpreting descriptive statistics, such as comparing mean and median to identify symmetrical bell curves and skewed data sets.', 'The speaker discusses the use of pair plots to compare variables in a dataset, highlighting the significance of using diagonal density graphs to analyze the square matrix output.', 'The transcript also includes practical demonstrations of statistical analysis techniques, such as using code to perform pair plot visualizations and ensure inclusion of specific columns in the analysis.']}, {'end': 10665.823, 'start': 10133.186, 'title': 'Understanding pair plots and data distributions', 'summary': 'Explains the importance of pair plots in understanding data distributions and highlights the use of density curves instead of scatter plots, the analysis of column distributions, identification of gaussian mixtures, and the impact of outliers on standard deviation and model predictions.', 'duration': 532.637, 'highlights': ['The importance of understanding pair plots and data distributions in exploratory data analysis is emphasized, including the use of density curves instead of scatter plots for comparisons of a column with itself.', 'The analysis of column distributions, including the identification of normal distributions, Gaussian mixtures, and the impact of outliers on standard deviation, is highlighted.', 'The significance of 
identifying Gaussian mixtures in data, the potential impact on linear models, and the need to handle outliers effectively to prevent distortion of standard deviation and model predictions is explained.']}], 'duration': 829.083, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM9836740.jpg', 'highlights': ['The chapter emphasizes interpreting descriptive statistics to identify symmetrical bell curves and skewed data sets.', 'The use of pair plots to compare variables is highlighted, emphasizing the significance of using diagonal density graphs for analysis.', 'Practical demonstrations of statistical analysis techniques, such as using code for pair plot visualizations, are included.', 'Understanding pair plots and data distributions in exploratory data analysis is emphasized, including the use of density curves instead of scatter plots.', 'The analysis of column distributions, including the identification of normal distributions and Gaussian mixtures, is highlighted.', 'The significance of identifying Gaussian mixtures in data and the need to handle outliers effectively to prevent distortion of standard deviation and model predictions is explained.']}, {'end': 11576.364, 'segs': [{'end': 10801.286, 'src': 'embed', 'start': 10713.639, 'weight': 0, 'content': [{'end': 10718.102, 'text': 'So, if you start dropping the rows for all columns where you have outliers, your data set might shrink.', 'start': 10713.639, 'duration': 4.463}, {'end': 10732.521, 'text': 'So, now every column except one has outliers.', 'start': 10729.4, 'duration': 3.121}, {'end': 10738.284, 'text': 'So, when we are removing the outlier for every column 569 records comes down to 30, 000.', 'start': 10732.982, 'duration': 5.302}, {'end': 10740.285, 'text': 'So, that is not good.', 'start': 10738.284, 'duration': 2.001}, {'end': 10742.006, 'text': 'That is not good.', 'start': 10740.765, 'duration': 1.241}, {'end': 10746.267, 'text': 
'So, dropping records is always the last option when you have plenty of data.', 'start': 10742.126, 'duration': 4.141}, {'end': 10751.25, 'text': 'When data size itself is restricted dropping records is not a good option.', 'start': 10748.088, 'duration': 3.162}, {'end': 10756.167, 'text': 'Shall we move on? All of you? Okay.', 'start': 10753.411, 'duration': 2.756}, {'end': 10757.168, 'text': "Let's move.", 'start': 10756.747, 'duration': 0.421}, {'end': 10761.511, 'text': 'So now, always start by analyzing the diagonals first.', 'start': 10757.868, 'duration': 3.643}, {'end': 10764.212, 'text': 'How the data is distributed on each column.', 'start': 10762.191, 'duration': 2.021}, {'end': 10766.314, 'text': 'This is my univariate analysis.', 'start': 10764.553, 'duration': 1.761}, {'end': 10770.136, 'text': 'This is what the output of DF describes.', 'start': 10768.195, 'duration': 1.941}, {'end': 10773.699, 'text': 'How your data is distributed on that particular column.', 'start': 10771.457, 'duration': 2.242}, {'end': 10775.02, 'text': "It's a basic statistic.", 'start': 10773.739, 'duration': 1.281}, {'end': 10776.181, 'text': "It's a descriptive statistic.", 'start': 10775.04, 'duration': 1.141}, {'end': 10779.123, 'text': 'Next we do is bivariate analysis.', 'start': 10777.241, 'duration': 1.882}, {'end': 10783.926, 'text': 'In bivariate analysis, how these columns interact with each other.', 'start': 10780.043, 'duration': 3.883}, {'end': 10793.262, 'text': 'For example, if you look at this one, this is telling you interaction between the symbolizing, symboling and this one wheel base.', 'start': 10784.957, 'duration': 8.305}, {'end': 10801.286, 'text': 'By the way, in this square matrix, above the diagonal and below the diagonal, it is mirror image.', 'start': 10794.823, 'duration': 6.463}], 'summary': 'Removing outliers reduced data from 569 to 30,000 records. 
analysis starts with univariate and bivariate approaches.', 'duration': 87.647, 'max_score': 10713.639, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM10713639.jpg'}, {'end': 11097.917, 'src': 'embed', 'start': 11072.404, 'weight': 5, 'content': [{'end': 11082.711, 'text': 'If you look at the cars with number of cylinders, in your data set most of the cars have 4 cylinders, 5 cylinders, 6 cylinder and 8 cylinder.', 'start': 11072.404, 'duration': 10.307}, {'end': 11090.797, 'text': 'Most of the cars look at this have 4 cylinders, by the way these data points might be sitting on top of one another.', 'start': 11083.112, 'duration': 7.685}, {'end': 11094.9, 'text': 'So, that does not mean your data set has only 1, 2, 3, 4, 5, 6 records for 4 cylinders.', 'start': 11091.537, 'duration': 3.363}, {'end': 11097.917, 'text': 'do that mistake.', 'start': 11097.076, 'duration': 0.841}], 'summary': 'Most cars in the data set have 4, 5, 6, or 8 cylinders.', 'duration': 25.513, 'max_score': 11072.404, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM11072404.jpg'}, {'end': 11178.143, 'src': 'embed', 'start': 11149.554, 'weight': 6, 'content': [{'end': 11154.215, 'text': 'This diagram, which I plotted here, is called KDE Kernel Density Estimates.', 'start': 11149.554, 'duration': 4.661}, {'end': 11155.836, 'text': 'So there are three words in this.', 'start': 11154.456, 'duration': 1.38}, {'end': 11158.597, 'text': "First word is, it's an estimate.", 'start': 11156.916, 'duration': 1.681}, {'end': 11168.28, 'text': "It's an estimation of the possible distribution, density distribution, density estimate in the population.", 'start': 11159.977, 'duration': 8.303}, {'end': 11173.822, 'text': 'In the population, how the cars are distributed around this value in cylinder.', 'start': 11169.18, 'duration': 4.642}, {'end': 11178.143, 'text': 'It is a density estimate based 
on a mathematical function.', 'start': 11174.262, 'duration': 3.881}], 'summary': 'Kde is a density estimate of car distribution in cylinder population.', 'duration': 28.589, 'max_score': 11149.554, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM11149554.jpg'}, {'end': 11357.152, 'src': 'embed', 'start': 11319.028, 'weight': 7, 'content': [{'end': 11324.153, 'text': 'it may not be 100, 0, it might be some distribution, but is that distribution good enough for you to use it?', 'start': 11319.028, 'duration': 5.125}, {'end': 11330.498, 'text': 'you have to take a call and that call you will be able to take both based on domain knowledge.', 'start': 11324.153, 'duration': 6.345}, {'end': 11338.786, 'text': 'if i know the drivers in the in the, the people who use the car, is 50, 50 amongst these, gender in the market.', 'start': 11330.498, 'duration': 8.288}, {'end': 11339.406, 'text': 'i know that.', 'start': 11338.786, 'duration': 0.62}, {'end': 11342.169, 'text': 'but for some reason the data which i have does not reflect that.', 'start': 11339.406, 'duration': 2.763}, {'end': 11357.152, 'text': 'This analysis is done on the original data set.', 'start': 11354.871, 'duration': 2.281}], 'summary': 'Decisions based on domain knowledge, not 50-50 gender distribution in data.', 'duration': 38.124, 'max_score': 11319.028, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM11319028.jpg'}], 'start': 10673.344, 'title': 'Data analysis techniques', 'summary': 'Discusses the importance of analyzing data distribution and relationships between columns, highlighting the presence of linear, non-linear, and no relationships, along with the potential approaches for handling dimensions with interactions.', 'chapters': [{'end': 10746.267, 'start': 10673.344, 'title': 'Outlier management in data analysis', 'summary': 'Discusses the challenges of managing outliers in data 
analysis, highlighting the potential loss of data and the significant reduction in records when removing outliers from every column.', 'duration': 72.923, 'highlights': ['The speaker notes that every column except one contains outliers, so removing outliers from every column sharply reduces the 569 records.', 'Dropping records is presented as a last resort because of the data lost when removing outliers, and is not a good option when the dataset is already small.', 'The chapter highlights the challenge of removing outliers, as the dataset might shrink when dropping rows for all columns with outliers.']}, {'end': 11041.941, 'start': 10748.088, 'title': 'Data analysis techniques', 'summary': 'Discusses the importance of analyzing data distribution and relationships between columns, highlighting the presence of linear, non-linear, and no relationships, along with the potential approaches for handling dimensions with interactions.', 'duration': 293.853, 'highlights': ['The importance of starting with univariate analysis to understand the distribution of data on each column, providing a basic and descriptive statistic.', 'The demonstration of bivariate analysis to understand the interaction between columns, highlighting instances of negative, positive, and no relationships between dimensions.', 'The suggestion for handling dimensions with interactions, such as employing domain expertise to determine measurement errors or utilizing techniques like principal component analysis and singular value decomposition to create synthetic dimensions.']}, {'end': 11576.364, 'start': 11072.404, 'title': 'Cylinder distribution analysis', 'summary': 'Discusses the distribution of cylinders in the dataset, using kernel density estimates (KDE) to estimate the density distribution of the cylinder column, highlighting the importance of domain knowledge in data analysis.', 'duration': 503.96, 'highlights': ['The chapter emphasizes
the distribution of cylinders in the dataset, with most cars having 4 cylinders, followed by 5, 6, and 8 cylinders, and few records with 12 and 2 cylinders.', 'The use of kernel density estimates (KDE) to estimate the density distribution of the cylinder column is explained, emphasizing its role in understanding the likely distribution in the population based on available data.', 'The importance of domain knowledge in data analysis is highlighted, particularly in understanding variables like origin of the car and gender distribution, and justifying data handling strategies based on domain-specific insights.']}], 'duration': 903.02, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM10673344.jpg', 'highlights': ['Removing outliers from every column sharply reduces the 569 records.', 'Importance of preserving as much data as possible when removing outliers.', 'Dataset might shrink when dropping rows for all columns with outliers.', 'Starting with univariate analysis to understand the distribution of data on each column.', 'Demonstration of bivariate analysis to understand the interaction between columns.', 'Emphasizes the distribution of cylinders in the dataset, with most cars having 4 cylinders.', 'Use of kernel density estimates (KDE) to estimate the density distribution of the cylinder column.', 'Importance of domain knowledge in data analysis, particularly in understanding variables like origin of the car and gender distribution.']}, {'end': 13921.269, 'segs': [{'end': 11603.75, 'src': 'embed', 'start': 11576.364, 'weight': 0, 'content': [{'end': 11579.646, 'text': 'by this high end foreign brands they all come with embedded chips.', 'start': 11576.364, 'duration': 3.282}, {'end': 11583.248, 'text': 'Those embedded chips in real time.', 'start': 11581.287, 'duration': 1.961}, {'end': 11592.152, 'text': 'they capture the data about your driving style and pass it on to a central server where
they sit down and analyze, and the risk factor is adjusted,', 'start': 11583.248, 'duration': 8.904}, {'end': 11597.655, 'text': 'recalculated, recalibrated, based on how the car is being driven.', 'start': 11592.152, 'duration': 5.503}, {'end': 11600.056, 'text': 'The symboling reflects that.', 'start': 11598.935, 'duration': 1.121}, {'end': 11603.75, 'text': "right. ok, let's move now.", 'start': 11601.308, 'duration': 2.442}], 'summary': 'High-end foreign cars have embedded chips that capture driving data for real-time analysis and risk adjustment.', 'duration': 27.386, 'max_score': 11576.364, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM11576364.jpg'}, {'end': 11741.262, 'src': 'embed', 'start': 11709.665, 'weight': 1, 'content': [{'end': 11716.813, 'text': 'Then I am making use of the random function which generates the training set and test we call it train test split.', 'start': 11709.665, 'duration': 7.148}, {'end': 11729.238, 'text': 'When I make use of this train test split on X and Y, I am asking it to split the data in the ratio of 75-25, 25 for test, 75 for training.', 'start': 11718.494, 'duration': 10.744}, {'end': 11735.02, 'text': 'The output of this will be 4 data sets.', 'start': 11732.179, 'duration': 2.841}, {'end': 11741.262, 'text': 'So, let me explain this to you because some people get confused here.', 'start': 11737.381, 'duration': 3.881}], 'summary': 'Using the random function to split data into 75-25 ratio for training and testing, resulting in 4 datasets.', 'duration': 31.597, 'max_score': 11709.665, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM11709665.jpg'}, {'end': 12123.06, 'src': 'embed', 'start': 12099.776, 'weight': 2, 'content': [{'end': 12109.279, 'text': 'Then how come one has a positive relation, other has a negative relation? 
But both of them should be positive.', 'start': 12099.776, 'duration': 9.503}, {'end': 12113.44, 'text': 'Wheelbase and length are positively related.', 'start': 12111.059, 'duration': 2.381}, {'end': 12119.537, 'text': 'but one of them is having a positive impact on the target the other one is having a negative impact on the target.', 'start': 12114.773, 'duration': 4.764}, {'end': 12123.06, 'text': 'How can that be?', 'start': 12122.44, 'duration': 0.62}], 'summary': 'Wheelbase and length are positively related, impacting the target differently.', 'duration': 23.284, 'max_score': 12099.776, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM12099776.jpg'}, {'end': 12352.101, 'src': 'embed', 'start': 12302.754, 'weight': 3, 'content': [{'end': 12305.475, 'text': 'The coefficients will not change when you normalize.', 'start': 12302.754, 'duration': 2.721}, {'end': 12306.536, 'text': 'it does not matter.', 'start': 12305.475, 'duration': 1.061}, {'end': 12334.727, 'text': 'By normalizing you are just centering the data to 0.', 'start': 12330.444, 'duration': 4.283}, {'end': 12339.811, 'text': 'The relationship between y and x will remain same whether it is a normalized data or a non-normalized data.', 'start': 12334.727, 'duration': 5.084}, {'end': 12341.673, 'text': 'Coefficients will not change.', 'start': 12340.632, 'duration': 1.041}, {'end': 12348.178, 'text': 'Linear regressions are not impacted by normalizations.', 'start': 12345.876, 'duration': 2.302}, {'end': 12352.101, 'text': 'The accuracy, score, the coefficient everything remains same.', 'start': 12350.039, 'duration': 2.062}], 'summary': 'Normalization does not affect linear regression coefficients or accuracy.', 'duration': 49.347, 'max_score': 12302.754, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM12302754.jpg'}, {'end': 12543.785, 'src': 'embed', 'start': 12505.925, 'weight': 4, 
'content': [{'end': 12508.067, 'text': 'There are multiple 16 dimensions over here.', 'start': 12505.925, 'duration': 2.142}, {'end': 12509.828, 'text': 'This is the lowest point.', 'start': 12508.767, 'duration': 1.061}, {'end': 12514.351, 'text': 'This combination of coefficients gives the lowest point in the error graph.', 'start': 12510.308, 'duration': 4.043}, {'end': 12516.312, 'text': 'The best fit.', 'start': 12515.832, 'duration': 0.48}, {'end': 12524.137, 'text': 'Right? So, let us look at the performance of the model on test data.', 'start': 12519.874, 'duration': 4.263}, {'end': 12530.154, 'text': 'and as you can see performance on the model and test it as 83.6 percent.', 'start': 12525.931, 'duration': 4.223}, {'end': 12538.701, 'text': 'not bad, not good either.', 'start': 12530.154, 'duration': 8.547}, {'end': 12543.785, 'text': 'the comparison between this and this.', 'start': 12538.701, 'duration': 5.084}], 'summary': "Model's performance on test data is 83.6 percent, with room for improvement.", 'duration': 37.86, 'max_score': 12505.925, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM12505925.jpg'}, {'end': 12691.342, 'src': 'embed', 'start': 12658.503, 'weight': 5, 'content': [{'end': 12666.268, 'text': 'So there are various things you can do but before you do that I would like to introduce you to a library called StatsModel.', 'start': 12658.503, 'duration': 7.765}, {'end': 12671.511, 'text': 'StatsModel, the formula API, SMF.', 'start': 12669.49, 'duration': 2.021}, {'end': 12674.152, 'text': 'StatsModel functions.', 'start': 12672.872, 'duration': 1.28}, {'end': 12685.821, 'text': 'What happens is When you build this linear models in R, R gives you a lot of statistical information about your attributes and your models.', 'start': 12675.813, 'duration': 10.008}, {'end': 12691.342, 'text': 'scikit-learn linear regression does not give you those information.', 'start': 12687.741, 
'duration': 3.601}], 'summary': 'Introduces statsmodels for detailed statistical information in linear regression.', 'duration': 32.839, 'max_score': 12658.503, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM12658503.jpg'}, {'end': 13014.035, 'src': 'embed', 'start': 12980.567, 'weight': 6, 'content': [{'end': 12982.829, 'text': 'let us go straight down.', 'start': 12980.567, 'duration': 2.262}, {'end': 12988.373, 'text': 'most important of these is your adjusted R square.', 'start': 12982.829, 'duration': 5.544}, {'end': 12991.835, 'text': 'adjusted R square is lower than R square 80 percent.', 'start': 12988.373, 'duration': 3.462}, {'end': 12992.816, 'text': 'is it ok with you?', 'start': 12991.835, 'duration': 0.981}, {'end': 12993.176, 'text': 'you have to.', 'start': 12992.816, 'duration': 0.36}, {'end': 12996.158, 'text': 'you have to decide.', 'start': 12993.176, 'duration': 2.982}, {'end': 12998.42, 'text': 'but before I go there, I will come down to this.', 'start': 12996.158, 'duration': 2.262}, {'end': 13004.249, 'text': 'I am going to stop at this point.', 'start': 13002.168, 'duration': 2.081}, {'end': 13014.035, 'text': 'What is this table telling you? I will just take one example and you have to use that example on others.', 'start': 13009.593, 'duration': 4.442}], 'summary': 'Adjusted R-squared, at about 80 percent, is lower than R-squared.', 'duration': 33.468, 'max_score': 12980.567, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM12980567.jpg'}, {'end': 13327.296, 'src': 'embed', 'start': 13299.285, 'weight': 7, 'content': [{'end': 13307.069, 'text': 'What does this p-value tell you?
What the p-value is telling you is, I am going to talk only for length, but the same thing applies for all.', 'start': 13299.285, 'duration': 7.784}, {'end': 13313.566, 'text': 'What this p-value is telling you is probability is 0.2.', 'start': 13308.902, 'duration': 4.664}, {'end': 13321.992, 'text': 'What is the cut-off probability for your confidence level, 95%? 0.05, 5%.', 'start': 13313.566, 'duration': 8.426}, {'end': 13324.994, 'text': '0.2 is much higher than 0.05.', 'start': 13321.992, 'duration': 3.002}, {'end': 13327.296, 'text': 'What this is telling you is,', 'start': 13324.994, 'duration': 2.302}], 'summary': 'The p-value is 0.2, higher than the 5% cut-off for a 95% confidence level.', 'duration': 28.011, 'max_score': 13299.285, 'thumbnail':
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM13750179.jpg'}, {'end': 13919.548, 'src': 'heatmap', 'start': 13783.663, 'weight': 0.712, 'content': [{'end': 13787.666, 'text': 'we have other ways of establishing the reliability of the dimensions and the models.', 'start': 13783.663, 'duration': 4.003}, {'end': 13790.287, 'text': "let's use that.", 'start': 13787.666, 'duration': 2.621}, {'end': 13796.871, 'text': 'that is why scikit-learn does not give you this facility, but then subsequently, under pressure, they came out with stats model libraries,', 'start': 13790.287, 'duration': 6.584}, {'end': 13800.214, 'text': 'which gives you this okay.', 'start': 13796.871, 'duration': 3.343}, {'end': 13809.82, 'text': 'now, just to end up the show, what is telling is p-value and what it is telling you, the 95% confidence within which these values lie.', 'start': 13800.214, 'duration': 9.606}, {'end': 13810.041, 'text': 'ok,', 'start': 13809.82, 'duration': 0.221}, {'end': 13817.311, 'text': 'So, if you guys remember 95% confidence level area of rejection area of you will see all you will be able to connect to those using this.', 'start': 13810.421, 'duration': 6.89}, {'end': 13820.035, 'text': 'I am going to jump this I am going to move forward.', 'start': 13818.433, 'duration': 1.602}, {'end': 13826.772, 'text': 'Last one, overall probability value of the model itself.', 'start': 13822.389, 'duration': 4.383}, {'end': 13833.136, 'text': 'Can such a relationship between the price and the other good attributes?', 'start': 13827.272, 'duration': 5.864}, {'end': 13834.617, 'text': 'does such a relationship exist?', 'start': 13833.136, 'duration': 1.481}, {'end': 13837.899, 'text': 'P-value is very low, which means null.', 'start': 13835.517, 'duration': 2.382}, {'end': 13839.72, 'text': 'hypothesis overall is rejected.', 'start': 13837.899, 'duration': 1.821}, {'end': 13842.822, 'text': 'Model level P-value is very low.', 
'start': 13841.341, 'duration': 1.481}, {'end': 13847.985, 'text': 'So after removing all the unwanted dimensions, hopefully you will have a reliable model.', 'start': 13843.622, 'duration': 4.363}, {'end': 13857.163, 'text': 'P value is very less here.', 'start': 13854.702, 'duration': 2.461}, {'end': 13866.886, 'text': 'If you are working for people who believe in statistics and I mean not that I do not believe in statistics those people expect P values from you then you have to use stats model.', 'start': 13858.463, 'duration': 8.423}, {'end': 13870.527, 'text': 'If they do not care about P values, do not come here.', 'start': 13868.566, 'duration': 1.961}, {'end': 13877.589, 'text': 'there are other ways of designing whether dimension is reliable or not ensemble techniques and all these techniques are available with us,', 'start': 13870.527, 'duration': 7.062}, {'end': 13878.509, 'text': 'so we use that technique.', 'start': 13877.589, 'duration': 0.92}, {'end': 13882.767, 'text': 'most influencing factor for the price.', 'start': 13879.505, 'duration': 3.262}, {'end': 13888.929, 'text': 'Use decision tree regression that will list out for you the dimension, as you will see now after lunch.', 'start': 13883.147, 'duration': 5.782}, {'end': 13893.551, 'text': 'it will list out for you dimensions in terms of importance, which dimensions are more important?', 'start': 13888.929, 'duration': 4.622}, {'end': 13898.874, 'text': "But don't they again use the p values? 
They use Bayesian statistics, they don't use p values.", 'start': 13893.651, 'duration': 5.223}, {'end': 13907.258, 'text': "You remember Naive Bay, you won't have done Naive Bay, you remember Bayesian this thing distributions, they use that for this.", 'start': 13900.915, 'duration': 6.343}, {'end': 13911.082, 'text': 'some statisticians say that Bayesian distributions are more reliable than this.', 'start': 13908.16, 'duration': 2.922}, {'end': 13917.647, 'text': "All right, sir, I'm going to call it an end of the day for today.", 'start': 13914.224, 'duration': 3.423}, {'end': 13919.548, 'text': "for me, for you guys, it's going to continue.", 'start': 13917.647, 'duration': 1.901}], 'summary': 'Scikit-learn lacks reliability facility, stats model gives p-value for model validation and dimension importance.', 'duration': 135.885, 'max_score': 13783.663, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM13783663.jpg'}], 'start': 11576.364, 'title': 'Embedded chip data analysis', 'summary': "Discusses the use of embedded chips in high-end foreign cars to capture and analyze driving data in real-time, leading to the adjustment of risk factors and the utilization of linear regression to evaluate the model's performance with a 75-25 data split. 
it also identifies issues in the data set, discusses the impact of normalizing data on linear regression coefficients, introduces the statsmodel library in python for building a linear regression model, and explains the interpretation of p-values and statistical coefficients.", 'chapters': [{'end': 12019.632, 'start': 11576.364, 'title': 'Embedded chip data analysis', 'summary': "Discusses the use of embedded chips in high-end foreign cars to capture and analyze driving data in real-time, leading to the adjustment of risk factors and the utilization of linear regression to evaluate the model's performance with a 75-25 data split.", 'duration': 443.268, 'highlights': ["The use of embedded chips in high-end foreign cars captures real-time driving data and analyzes it to adjust risk factors. The embedded chips capture driving style data and adjust risk factors based on the car's performance.", "Utilizing linear regression with a 75-25 data split to evaluate the model's performance. The model's performance is evaluated using linear regression and a 75-25 data split for training and testing.", 'Explaining the process of breaking down the data into training and test sets using the train test split function. The train test split function is used to divide the data into training and test sets for model evaluation.']}, {'end': 12299.405, 'start': 12021.613, 'title': 'Identifying data set issues', 'summary': "Identifies issues in the data set, such as conflicting coefficients and unaddressed outliers, impacting the model's accuracy and performance.", 'duration': 277.792, 'highlights': ['Conflicting coefficients indicate issues in the data set, such as the positive relation between wheelbase and length having opposite impacts on the target variable. 
Conflicting coefficients reveal problems in the data set: wheelbase and length are positively related to each other yet have opposite-signed effects on the target variable.', 'The presence of outliers is identified as a potential cause for discrepancies in the coefficients and their impacts on the target variable. Outliers are recognized as a potential cause for discrepancies in the coefficients and their impacts on the target variable.', 'The lack of handling multicollinearity and correlation between independent variables is identified as a root cause for model-level problems. Not addressing multicollinearity and correlation between independent variables is recognized as a root cause for model-level problems.']}, {'end': 12656.702, 'start': 12302.754, 'title': 'Linear regression coefficients and model performance', 'summary': 'Discusses the impact of normalizing data on linear regression coefficients, the process of obtaining intercept and coefficients, and the interpretation of model performance with an 83.6% score on test data.', 'duration': 353.948, 'highlights': ['The relationship between y and x will remain same whether it is a normalized data or a non-normalized data. Coefficients will not change. The speaker states that the relationship between y and x is the same with or without normalization; strictly, centering shifts only the intercept, while rescaling a feature rescales its coefficient.', 'Linear regressions are not impacted by normalizations. The accuracy, score, the coefficient everything remains same. According to the speaker, normalization does not affect linear regression; predictions and the score are indeed unchanged, although coefficients change when features are rescaled.', "The model's performance on test data is 83.6 percent. The model explains 83.6% of the variance in the test data (its R-squared score).", 'Error is a function of m and c, right? So c is one of the dimensions. Error is determined by the values of m and c, where c serves as one of the dimensions influencing the error.', 'Your first cut model will never give you the results that you are expecting, very rare.
Initial models rarely yield expected results, prompting the need to focus on improving model performance.']}, {'end': 13114.884, 'start': 12658.503, 'title': 'Introduction to statsmodel and linear regression', 'summary': "Introduces the statsmodel library in python, which replicates r's statistical analysis, and demonstrates the process of building a linear regression model to predict car prices, highlighting the importance of adjusted r square and providing insights into the statistical analysis obtained from the model.", 'duration': 456.381, 'highlights': ["Building linear regression model using StatsModel to predict car prices Demonstrates the process of building a linear regression model using StatsModel in Python to predict car prices, showcasing the replication of R's statistical analysis in Python.", 'Importance of adjusted R square in model evaluation Emphasizes the significance of adjusted R square, highlighting that it is lower than R square and serves as an important factor in model evaluation, with a reference value of 80 percent.', 'Insights into statistical analysis obtained from the model Provides insights into the statistical analysis obtained from the model, including terms such as F statistics, AIC, BIC, standard error, t values, and probability values, emphasizing their relevance in model evaluation and hypothesis testing.']}, {'end': 13921.269, 'start': 13116.044, 'title': 'Interpreting p-values and statistical coefficients', 'summary': 'Explains the interpretation of p-values and statistical coefficients, emphasizing the significance of p-values in evaluating the reliability of dimensions and models, and the impact of collinearity on the reliability of coefficients and p-values.', 'duration': 805.225, 'highlights': ['The p-value indicates the probability of finding a relationship between two variables in the sample data, even if there is no relationship in the population, with a p-value of 0.2 signifying a higher likelihood of a fluke 
relationship compared to the 0.05 cutoff for a 94% confidence level. Explanation of p-value significance, comparison to cutoff value, and interpretation of likelihood of relationship in sample data.', 'The reliability of coefficients and p-values is impacted by collinearity, leading to changes in coefficients and p-values when dimensions are dropped, causing a split in the statistics community regarding reliance on p-values for model evaluation. Impact of collinearity on coefficient and p-value reliability, resulting changes in coefficients and p-values when dimensions are dropped, and the split in the statistics community regarding reliance on p-values.', 'The importance of p-values in evaluating the reliability of dimensions and models, with p-values greater than 0.05 indicating statistical flukes and p-values less than 0.05 signifying good attributes, along with the availability of alternative techniques for establishing dimension reliability. Significance of p-values in evaluating dimension and model reliability, differentiation between p-values greater and less than 0.05, and availability of alternative techniques for dimension reliability assessment.']}], 'duration': 2344.905, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tFi4Y_y-GNM/pics/tFi4Y_y-GNM11576364.jpg', 'highlights': ['The use of embedded chips in high-end foreign cars captures real-time driving data and analyzes it to adjust risk factors.', "Utilizing linear regression with a 75-25 data split to evaluate the model's performance.", 'Conflicting coefficients reveal problems in the data set, as the positive relation between wheelbase and length has opposite impacts on the target variable.', 'The relationship between y and x remains the same regardless of data normalization, ensuring that coefficients remain unchanged.', "The model's performance on test data is 83.6 percent, indicating its performance.", 'Building linear regression model using StatsModel in Python to 
predict car prices.', 'Emphasizes the significance of adjusted R square, highlighting that it is lower than R square and serves as an important factor in model evaluation, with a reference value of 80 percent.', 'Explanation of p-value significance, comparison to cutoff value, and interpretation of likelihood of relationship in sample data.', 'Impact of collinearity on coefficient and p-value reliability, the resulting changes in coefficients and p-values when dimensions are dropped, and the split in the statistics community regarding reliance on p-values.']}], 'highlights': ['The tutorial promises a complete understanding of Linear Regression by the end of the session.', 'Linear Regression is highlighted as one of the simplest and most widely used algorithms in machine learning.', 'Linear models require linear relationships between y and the independent variables; non-linear relationships call for non-linear models.', 'Principal component analysis (PCA) converts variables into synthetic dimensions, enabling analysis of influential dimensions and dimensionality reduction.', 'The algorithm utilizes gradient descent and partial derivatives to iteratively move from a random initial line to the global minimum, ensuring the least sum of squared errors and determining the best fit line for the given data points.', "The adjusted R square is preferred over R square for model evaluation as it accounts for the impact of useless variables and provides a more accurate measure of a model's goodness of fit, unlike R square which can be easily influenced by spurious relationships and useless variables.", 'An r value close to 0 indicates no linear relationship, but not necessarily no relationship at all.', 'Ensemble techniques involve using multiple models grouped together for production.', 'Linear regression allows for physical interpretation of the model, providing insight into the relationship between variables and the impact of their changes.', 'Data preprocessing steps for a 
linear regression model using the CARP dataset are demonstrated, including handling missing values, converting string data types to numbers, and applying a low variance filter to remove irrelevant columns.', 'The chapter emphasizes interpreting descriptive statistics to identify symmetrical bell curves and skewed data sets.', 'The use of embedded chips in high-end foreign cars captures real-time driving data and analyzes it to adjust risk factors.', "Utilizing linear regression with a 75-25 data split to evaluate the model's performance."]}