title
Machine Learning Interview Questions and Answers | Machine Learning Interview Preparation | Edureka

description
🔥 Machine Learning Training with Python (Use Code "YOUTUBE20"): https://www.edureka.co/data-science-python-certification-course This Machine Learning Interview Questions and Answers video will help you prepare for Data Science / Machine Learning interviews. This video is ideal for both beginners and professionals who want to learn or brush up their concepts in Machine Learning core concepts, Machine Learning using Python and Machine Learning Scenarios. Below are the topics covered in this tutorial: 1. Machine Learning Core Interview Question 2. Machine Learning using Python Interview Question 3. Machine Learning Scenario based Interview Question Check out our playlist for more videos: http://bit.ly/2taym8X Subscribe to our channel to get video updates. Hit the subscribe button above. PG in Artificial Intelligence and Machine Learning with NIT Warangal : https://www.edureka.co/post-graduate/machine-learning-and-ai Post Graduate Certification in Data Science with IIT Guwahati - https://www.edureka.co/post-graduate/data-science-program (450+ Hrs || 9 Months || 20+ Projects & 100+ Case studies) #MachineLearningInterviewQuestions #MachineLearningUsingPython #MachineLearningTraning How it Works? 1. This is a 5 Week Instructor led Online Course, 40 hours of assignment and 20 hours of project work 2. We have a 24x7 One-on-One LIVE Technical Support to help you with any problems you might face or any clarifications you may require during the course. 3. At the end of the training you will be working on a real time project for which we will provide you a Grade and a Verifiable Certificate! - - - - - - - - - - - - - - - - - About the Course Edureka's Machine Learning Course using Python is designed to help you grasp the concepts of Machine Learning. The Machine Learning training will provide deep understanding of Machine Learning and its mechanism. 
As a Data Scientist, you will be learning the importance of Machine Learning and its implementation in the Python programming language. Furthermore, you will be taught Reinforcement Learning, which in turn is an important aspect of Artificial Intelligence. You will be able to automate real life scenarios using Machine Learning Algorithms. Towards the end of the course, we will be discussing various practical use cases of Machine Learning in the Python programming language to enhance your learning experience. After completing this Machine Learning Certification Training using Python, you should be able to: Gain insight into the 'Roles' played by a Machine Learning Engineer Automate data analysis using Python Describe Machine Learning Work with real-time data Learn tools and techniques for predictive modeling Discuss Machine Learning algorithms and their implementation Validate Machine Learning algorithms Explain Time Series and its related concepts Gain expertise to handle business in future, living the present - - - - - - - - - - - - - - - - - - - Why learn Machine Learning with Python? Data Science is a set of techniques that enables computers to learn the desired behavior from data without being explicitly programmed. It employs techniques and theories drawn from many fields within the broad areas of mathematics, statistics, information science, and computer science. This course exposes you to different classes of machine learning algorithms like supervised, unsupervised and reinforcement algorithms. This course imparts the necessary skills like data pre-processing, dimensional reduction, model evaluation and also exposes you to different machine learning algorithms like regression, clustering, decision trees, random forest, Naive Bayes and Q-Learning. For more information, please write back to us at sales@edureka.co or call us at IND: 9606058406 / US: 18338555775 (toll free). 
Instagram: https://www.instagram.com/edureka_learning/ Facebook: https://www.facebook.com/edurekaIN/ Twitter: https://twitter.com/edurekain LinkedIn: https://www.linkedin.com/company/edureka

detail
{'title': 'Machine Learning Interview Questions and Answers | Machine Learning Interview Preparation | Edureka', 'heatmap': [{'end': 691.825, 'start': 559.519, 'weight': 0.936}, {'end': 815.1, 'start': 745.556, 'weight': 0.859}, {'end': 2006.923, 'start': 1878.767, 'weight': 1}], 'summary': 'Provides insights on machine learning interviews, covering reinforcement learning, model evaluation, model accuracy, a/b testing, python libraries, coding for accuracy metrics, predictive modeling, and data handling, offering comprehensive preparation for machine learning interviews.', 'chapters': [{'end': 116.929, 'segs': [{'end': 35.417, 'src': 'embed', 'start': 7.401, 'weight': 1, 'content': [{'end': 9.402, 'text': 'Welcome to the machine learning interview questions.', 'start': 7.401, 'duration': 2.001}, {'end': 14.303, 'text': 'So this session is organized by Edureka and let me introduce myself.', 'start': 9.882, 'duration': 4.421}, {'end': 19.845, 'text': 'My name is Rushikesh Mehrawade and I have an overall experience of around six years,', 'start': 14.563, 'duration': 5.282}, {'end': 28.935, 'text': 'and in the machine learning field I have an experience of over three years and We are trying to leverage machine learning to understand,', 'start': 19.845, 'duration': 9.09}, {'end': 35.417, 'text': 'like how we can use machine learning and artificial intelligence to create the test cases, understand the data centers,', 'start': 28.935, 'duration': 6.482}], 'summary': 'Rushikesh mehrawade has 6 years of experience, with 3 years in machine learning, focusing on leveraging machine learning and artificial intelligence for creating test cases and understanding data centers.', 'duration': 28.016, 'max_score': 7.401, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks7401.jpg'}, {'end': 95.995, 'src': 'embed', 'start': 62.843, 'weight': 0, 'content': [{'end': 68.528, 'text': 'So every company is trying to use machine learning 
to understand and get gems from their data.', 'start': 62.843, 'duration': 5.685}, {'end': 71.27, 'text': "They have a data but they don't know what to do with their data.", 'start': 68.548, 'duration': 2.722}, {'end': 78.749, 'text': 'They want machine learning engineers to have them understand the data so that their company revenues can be increased.', 'start': 71.806, 'duration': 6.943}, {'end': 85.631, 'text': 'So, on top of that, I wanted to also let you know that over the openings, you will see that there are opening for data scientists,', 'start': 79.169, 'duration': 6.462}, {'end': 89.092, 'text': 'machine learning engineers, deep learning engineers, data analyst.', 'start': 85.631, 'duration': 3.461}, {'end': 94.174, 'text': 'many people are confused with the terms, as which one should we choose like, should we go with the data scientist?', 'start': 89.092, 'duration': 5.082}, {'end': 95.995, 'text': 'should we go with the machine learning engineering?', 'start': 94.174, 'duration': 1.821}], 'summary': 'Companies seek ml to boost revenue. 
confusion between data scientist and ml engineer roles.', 'duration': 33.152, 'max_score': 62.843, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks62843.jpg'}], 'start': 7.401, 'title': 'Machine learning interview insights', 'summary': 'Presents insights from rushikesh mehrawade, a machine learning expert, emphasizing the increasing demand for machine learning professionals and the importance of understanding job descriptions in this field.', 'chapters': [{'end': 116.929, 'start': 7.401, 'title': 'Machine learning interview insights', 'summary': 'Presents insights from a machine learning expert, rushikesh mehrawade, highlighting the growing demand for machine learning professionals and the need for understanding the job descriptions when pursuing a career in this field.', 'duration': 109.528, 'highlights': ['Rushikesh Mehrawade has over six years of experience, with three years specifically in machine learning, emphasizing the practical application of machine learning in improving performance predictions and functional testing.', 'The market currently has a high demand for machine learning professionals, as companies aim to leverage machine learning to analyze and optimize their data, leading to increased company revenues.', 'Various job titles such as data scientists, machine learning engineers, deep learning engineers, and data analysts are used by companies to attract talent, but the key focus should be on understanding the job descriptions and being satisfied with the roles during the interview process.']}], 'duration': 109.528, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks7401.jpg', 'highlights': ['The market currently has a high demand for machine learning professionals, as companies aim to leverage machine learning to analyze and optimize their data, leading to increased company revenues.', 'Rushikesh Mehrawade has over six years of 
experience, with three years specifically in machine learning, emphasizing the practical application of machine learning in improving performance predictions and functional testing.', 'Various job titles such as data scientists, machine learning engineers, deep learning engineers, and data analysts are used by companies to attract talent, but the key focus should be on understanding the job descriptions and being satisfied with the roles during the interview process.']}, {'end': 802.432, 'segs': [{'end': 161.273, 'src': 'embed', 'start': 134.646, 'weight': 0, 'content': [{'end': 139.491, 'text': 'So first thing is this session is divided into three components of three broad components.', 'start': 134.646, 'duration': 4.845}, {'end': 143.134, 'text': 'So first thing is machine learning core interview questions.', 'start': 139.511, 'duration': 3.623}, {'end': 149.408, 'text': 'So, within this core interview questions, we are more interested with the theoretical aspects of the machine learning,', 'start': 143.666, 'duration': 5.742}, {'end': 154.57, 'text': "like how we're going to ask you the theoretical questions and you can explain those in an efficient manner.", 'start': 149.408, 'duration': 5.162}, {'end': 161.273, 'text': "Then second is the technical part where we're going to see the interview questions related to the Python.", 'start': 154.91, 'duration': 6.363}], 'summary': 'Session covers 3 components: machine learning core interview questions, theoretical aspects, and python technical questions.', 'duration': 26.627, 'max_score': 134.646, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks134646.jpg'}, {'end': 299.241, 'src': 'embed', 'start': 264.327, 'weight': 1, 'content': [{'end': 266.328, 'text': 'What are various types of machine learning?', 'start': 264.327, 'duration': 2.001}, {'end': 274.252, 'text': 'So, first of all, you have to say that the machine learning is categorized into majorly three 
components: first is the supervised learning,', 'start': 266.968, 'duration': 7.284}, {'end': 278.854, 'text': 'Second is the unsupervised learning and third is the reinforcement learning.', 'start': 274.853, 'duration': 4.001}, {'end': 283.196, 'text': 'So in the supervised learning it is like learning with a teacher.', 'start': 279.274, 'duration': 3.922}, {'end': 288.097, 'text': 'So the training data set is like a teacher training the machine.', 'start': 283.436, 'duration': 4.661}, {'end': 291.518, 'text': 'So the teacher is trying to train based on whatever the teacher knows.', 'start': 288.117, 'duration': 3.401}, {'end': 299.241, 'text': 'So the model is trained on the predefined data which you have and it starts to make decisions based on the,', 'start': 292.159, 'duration': 7.082}], 'summary': 'Machine learning has 3 main types: supervised, unsupervised, and reinforcement learning.', 'duration': 34.914, 'max_score': 264.327, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks264327.jpg'}, {'end': 474.547, 'src': 'embed', 'start': 446.414, 'weight': 2, 'content': [{'end': 453.617, 'text': 'So assume that we are playing a Mario game and in the Mario game, the player is called the agent in reinforcement learning.', 'start': 446.414, 'duration': 7.203}, {'end': 456.499, 'text': 'We have an environment which is nothing but the game.', 'start': 453.998, 'duration': 2.501}, {'end': 461.101, 'text': 'There will be some predefined actions which the agent can take.', 'start': 456.899, 'duration': 4.202}, {'end': 468.724, 'text': 'For example, in the game of Mario, he can move forward, he can jump, he can go into the tunnels, he can fire the bombs.', 'start': 461.381, 'duration': 7.343}, {'end': 474.547, 'text': 'So those are the steps and based on the steps, your environment will try to reward you.', 'start': 469.265, 'duration': 5.282}], 'summary': 'In a mario game, the player (agent) 
can take predefined actions and the environment rewards based on those actions.', 'duration': 28.133, 'max_score': 446.414, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks446414.jpg'}, {'end': 549.374, 'src': 'embed', 'start': 523.803, 'weight': 3, 'content': [{'end': 528.99, 'text': 'So in this case what happens is you have a data set of which half is labeled and half is not labeled.', 'start': 523.803, 'duration': 5.187}, {'end': 534.733, 'text': 'So in this case both supervised and unsupervised techniques are used to create your models.', 'start': 529.37, 'duration': 5.363}, {'end': 538.835, 'text': 'So you can also mention this so that the interviewer will be more confident in you.', 'start': 535.073, 'duration': 3.762}, {'end': 541.076, 'text': "Okay, so let's move on to the next thing.", 'start': 539.335, 'duration': 1.741}, {'end': 545.918, 'text': "What's your favorite algorithm and can you explain it in a minute?", 'start': 541.516, 'duration': 4.402}, {'end': 549.374, 'text': 'In this case, the interviewer is trying to understand,', 'start': 546.553, 'duration': 2.821}], 'summary': 'In semi-supervised learning, with labeled and unlabeled data, both supervised and unsupervised methods are employed to create models.', 'duration': 25.571, 'max_score': 523.803, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks523803.jpg'}, {'end': 691.825, 'src': 'heatmap', 'start': 559.519, 'weight': 0.936, 'content': [{'end': 565.062, 'text': 'So make sure you have some choice and you can explain different algorithms in a simple manner.', 'start': 559.519, 'duration': 5.543}, {'end': 566.723, 'text': 'For every algorithm,', 'start': 565.062, 'duration': 1.661}, {'end': 572.366, 'text': 'try to keep some simple examples so that you can explain it easily and effectively,', 'start': 566.723, 'duration': 5.643}, {'end': 577.25, 'text': 'so that even a small kid 
can understand as how you are explaining those things.', 'start': 572.366, 'duration': 4.884}, {'end': 582.914, 'text': 'So how deep learning is different from machine learning in this case?', 'start': 578.451, 'duration': 4.463}, {'end': 587.257, 'text': 'first thing is you have to know that deep learning is not completely different from machine learning.', 'start': 582.914, 'duration': 4.343}, {'end': 594.922, 'text': 'You have to say that, okay, deep learning is a small part of machine learning and in the core machine learning based on the input,', 'start': 587.597, 'duration': 7.325}, {'end': 597.044, 'text': 'first you have to do the feature extractions.', 'start': 594.922, 'duration': 2.122}, {'end': 602.54, 'text': 'You have to classify the features, which are the good, which are trying to explain your model better,', 'start': 597.444, 'duration': 5.096}, {'end': 607.503, 'text': 'based on some exploratory data analysis and based on that, you will feed it to the algorithms,', 'start': 602.54, 'duration': 4.963}, {'end': 612.727, 'text': 'and those algorithms will try to identify the patterns within those and give you the output.', 'start': 607.503, 'duration': 5.224}, {'end': 615.128, 'text': 'In deep learning it is bit simplified.', 'start': 613.147, 'duration': 1.981}, {'end': 624.375, 'text': 'What happens is based on the input the model will try to extract the features on itself and will try to create a model based on those.', 'start': 615.389, 'duration': 8.986}, {'end': 633.251, 'text': 'So it is trying to combine both the things of feature extraction and classification into a single thing and it will give you the output.', 'start': 624.795, 'duration': 8.456}, {'end': 636.355, 'text': "So that's why it is recently being very popular.", 'start': 633.351, 'duration': 3.004}, {'end': 644.483, 'text': 'So that is one thing and deep learning is basically if you know the deep learning is constructed most of the neural networks.', 'start': 636.615, 
'duration': 7.868}, {'end': 650.29, 'text': 'So a neural networks are basically borrowed from the idea of the human brain.', 'start': 645.024, 'duration': 5.266}, {'end': 657.969, 'text': 'So how the human brain works and how the neurons within the brain really works in a very complicated and effective manner.', 'start': 650.31, 'duration': 7.659}, {'end': 666.18, 'text': 'so those are the things which have been taken and being implemented in the deep learning, and machine learning is about the algorithms.', 'start': 657.969, 'duration': 8.211}, {'end': 673.33, 'text': 'that basically parses and learns the data and tries to identify the patterns within the data, apply,', 'start': 666.704, 'duration': 6.626}, {'end': 676.812, 'text': 'whatever pattern learned from the data to the new data whichever is coming.', 'start': 673.33, 'duration': 3.482}, {'end': 680.355, 'text': 'So that is the most of the thing which was related with the machine learning.', 'start': 677.073, 'duration': 3.282}, {'end': 683.618, 'text': 'But deep learning has come a bit more advanced to it.', 'start': 680.375, 'duration': 3.243}, {'end': 687.321, 'text': 'Okay, so explain classification and regression.', 'start': 684.219, 'duration': 3.102}, {'end': 691.825, 'text': 'So classification and regression is part of supervised learning.', 'start': 687.681, 'duration': 4.144}], 'summary': 'Deep learning simplifies feature extraction and classification, making it popular. it is a part of machine learning and uses neural networks borrowed from the human brain. 
classification and regression are part of supervised learning.', 'duration': 132.306, 'max_score': 559.519, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks559519.jpg'}, {'end': 612.727, 'src': 'embed', 'start': 587.597, 'weight': 4, 'content': [{'end': 594.922, 'text': 'You have to say that, okay, deep learning is a small part of machine learning and in the core machine learning based on the input,', 'start': 587.597, 'duration': 7.325}, {'end': 597.044, 'text': 'first you have to do the feature extractions.', 'start': 594.922, 'duration': 2.122}, {'end': 602.54, 'text': 'You have to classify the features, which are the good, which are trying to explain your model better,', 'start': 597.444, 'duration': 5.096}, {'end': 607.503, 'text': 'based on some exploratory data analysis and based on that, you will feed it to the algorithms,', 'start': 602.54, 'duration': 4.963}, {'end': 612.727, 'text': 'and those algorithms will try to identify the patterns within those and give you the output.', 'start': 607.503, 'duration': 5.224}], 'summary': 'In machine learning, deep learning is a subset. feature extraction and algorithmic analysis are vital for pattern identification.', 'duration': 25.13, 'max_score': 587.597, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks587597.jpg'}], 'start': 117.289, 'title': 'Machine learning interview overview and reinforcement learning', 'summary': 'Provides an overview of machine learning interview questions, covering theoretical, technical, and scenario-based components. 
it also explains reinforcement learning using examples such as mario game, alphago, and chess, and discusses semi-supervised algorithms, deep learning, and machine learning comparison, classification, regression, and selection bias.', 'chapters': [{'end': 445.782, 'start': 117.289, 'title': 'Machine learning interview overview', 'summary': 'Covers an overview of machine learning interview questions, divided into theoretical, technical, and scenario-based components, including explanations of machine learning to a school going kid, various types of machine learning, and examples of supervised, unsupervised, and reinforcement learning.', 'duration': 328.493, 'highlights': ['The chapter covers an overview of machine learning interview questions, divided into theoretical, technical, and scenario-based components. The interview session is divided into three broad components: machine learning core interview questions, technical questions related to Python in machine learning, and scenario-based questions testing the ability to solve real-world problems using machine learning.', 'Explanation of machine learning to a school going kid, including examples of supervised and unsupervised learning. The interviewer is interested in seeing how the concepts of machine learning can be explained in a simple manner, such as using the example of classifying strangers at a party as an example of unsupervised learning, and distinguishing it from supervised learning by explaining the concept of having prior knowledge.', 'Explanation of various types of machine learning, including supervised, unsupervised, and reinforcement learning. 
Machine learning is categorized into three major components: supervised learning, unsupervised learning, and reinforcement learning, each with distinct characteristics such as learning with a teacher, learning without a teacher, and learning through hit and trial method, respectively.']}, {'end': 802.432, 'start': 446.414, 'title': 'Reinforcement learning and semi-supervised algorithms', 'summary': 'Explains reinforcement learning using examples such as mario game, alphago, and chess, along with the concept of semi-supervised algorithms, and then discusses the ability to simplify and explain complex algorithms, followed by a comparison between deep learning and machine learning, and concludes with an explanation of classification, regression, and selection bias.', 'duration': 356.018, 'highlights': ['Reinforcement Learning in Mario Game Reinforcement learning is exemplified through the Mario game, where the agent takes predefined actions and is rewarded or penalized based on the outcomes, contributing to the creation of a model that learns the best actions for maximizing scores.', 'Semi-Supervised Algorithms The discussion includes semi-supervised algorithms, which combine labeled and unlabeled data to create models, offering a balanced blend of supervised and unsupervised techniques to enhance model confidence.', 'Simplifying and Explaining Algorithms The importance of simplifying complex algorithms for effective communication, especially to non-technical stakeholders, is emphasized for interviews, highlighting the need for clear and simple examples to ensure understanding.', 'Comparison of Deep Learning and Machine Learning Deep learning is explained as a subset of machine learning, with a focus on its ability to automatically extract features and classify data, contrasting it with the conventional machine learning process of feature extraction, classification, and pattern identification.', 'Explanation of Classification, Regression, and Selection Bias The 
chapter concludes with explanations of classification and regression as parts of supervised learning, along with the concept of selection bias in statistical sampling, providing practical examples for better understanding.']}], 'duration': 685.143, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks117289.jpg', 'highlights': ['The interview session is divided into three broad components: machine learning core interview questions, technical questions related to Python in machine learning, and scenario-based questions testing the ability to solve real-world problems using machine learning.', 'Machine learning is categorized into three major components: supervised learning, unsupervised learning, and reinforcement learning, each with distinct characteristics such as learning with a teacher, learning without a teacher, and learning through hit and trial method, respectively.', 'Reinforcement learning is exemplified through the Mario game, where the agent takes predefined actions and is rewarded or penalized based on the outcomes, contributing to the creation of a model that learns the best actions for maximizing scores.', 'The discussion includes semi-supervised algorithms, which combine labeled and unlabeled data to create models, offering a balanced blend of supervised and unsupervised techniques to enhance model confidence.', 'Deep learning is explained as a subset of machine learning, with a focus on its ability to automatically extract features and classify data, contrasting it with the conventional machine learning process of feature extraction, classification, and pattern identification.']}, {'end': 1837.389, 'segs': [{'end': 834.927, 'src': 'embed', 'start': 802.672, 'weight': 2, 'content': [{'end': 808.516, 'text': "So in this case, you're getting bias to those conclusions of your inaccurate conclusions of yours,", 'start': 802.672, 'duration': 5.844}, {'end': 812.458, 'text': "and you're not making the 
accurate decisions for the population.", 'start': 808.516, 'duration': 3.942}, {'end': 815.1, 'text': 'Okay, so that is the selection bias.', 'start': 812.838, 'duration': 2.262}, {'end': 821.621, 'text': "So what do you understand by precision and recall? So in this case, let's go on with the example.", 'start': 815.82, 'duration': 5.801}, {'end': 834.927, 'text': "So let's imagine that your girlfriend has kept giving you surprises on your birthdays for the last 10 years, and one day she comes to you and asks, do you remember all the birthday surprises from me?", 'start': 821.982, 'duration': 12.945}], 'summary': 'Avoid bias for accurate decisions. understand precision and recall with a birthday example.', 'duration': 32.255, 'max_score': 802.672, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks802672.jpg'}, {'end': 933.499, 'src': 'embed', 'start': 910.734, 'weight': 0, 'content': [{'end': 919.401, 'text': 'So in this case, the precision is the ratio of the number of events you recall correctly to the total number of events you recall.', 'start': 910.734, 'duration': 8.667}, {'end': 924.024, 'text': 'So out of 15 answers, you recalled 10 correctly.', 'start': 919.441, 'duration': 4.583}, {'end': 926.046, 'text': 'So that is the precision which we have.', 'start': 924.044, 'duration': 2.002}, {'end': 933.499, 'text': 'so 10 real events and 15 answers, so the ratio is about 67 percent in this case, which is 10 divided by 15.', 'start': 926.636, 'duration': 6.863}], 'summary': 'Precision for 10 correct recalls out of 15 answers is about 67%.', 'duration': 22.765, 'max_score': 910.734, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks910734.jpg'}, {'end': 1052.694, 'src': 'embed', 'start': 1021.458, 'weight': 1, 'content': [{'end': 1029.525, 'text': 'Okay. So the next question is: explain false negative, false positive, true 
negative and true positive with a simple example.', 'start': 1021.458, 'duration': 8.067}, {'end': 1034.249, 'text': 'So previously we just saw the example where we had an example of the birthday surprises.', 'start': 1029.825, 'duration': 4.424}, {'end': 1040.765, 'text': 'So you can also go with the example where it will be more realistic as It will give some impact to it.', 'start': 1034.589, 'duration': 6.176}, {'end': 1043.267, 'text': 'So for example the true positive.', 'start': 1040.925, 'duration': 2.342}, {'end': 1052.694, 'text': 'So in this case we can take some real example where what the model that we are trying to build what it does is based on the fire is there or not.', 'start': 1043.607, 'duration': 9.087}], 'summary': 'Explaining false negative, false positive, true negative, and true positive with a fire detection model as an example.', 'duration': 31.236, 'max_score': 1021.458, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks1021458.jpg'}, {'end': 1272.043, 'src': 'embed', 'start': 1240.811, 'weight': 3, 'content': [{'end': 1244.373, 'text': 'based on the predicted, the model output and the actual data which you have.', 'start': 1240.811, 'duration': 3.562}, {'end': 1248.134, 'text': 'So what is the difference between inductive and deductive learning?', 'start': 1245.173, 'duration': 2.961}, {'end': 1257.578, 'text': 'So, to help this concept, you can give some good examples, such as a father want to explain his son how the fire can burn him.', 'start': 1248.815, 'duration': 8.763}, {'end': 1262.58, 'text': 'So there are two ways that he can teach his kid as how he can get impacted from the fire.', 'start': 1257.798, 'duration': 4.782}, {'end': 1272.043, 'text': 'So first thing is he will show him some examples like he will show him some videos or he will show some demo as how the fire will get him burned.', 'start': 1262.6, 'duration': 9.443}], 'summary': "Explains difference between 
inductive and deductive learning with a father's example of teaching fire safety to his son.", 'duration': 31.232, 'max_score': 1240.811, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks1240811.jpg'}, {'end': 1384.112, 'src': 'embed', 'start': 1359.078, 'weight': 4, 'content': [{'end': 1365.679, 'text': 'as K-means is an unsupervised algorithm and KNN is a supervised technique.', 'start': 1359.078, 'duration': 6.601}, {'end': 1369.541, 'text': 'And KNN is used as a supervised algorithm.', 'start': 1366.179, 'duration': 3.362}, {'end': 1374.705, 'text': 'KNN is used for classification and regression, and K-means is used for clustering.', 'start': 1369.601, 'duration': 5.104}, {'end': 1379.389, 'text': "As it's a clustering algorithm, it is used to create the clusters of your data.", 'start': 1374.765, 'duration': 4.624}, {'end': 1384.112, 'text': 'The K in KNN basically means it looks at the K nearest neighbors.', 'start': 1379.969, 'duration': 4.143}], 'summary': 'KNN is a supervised algorithm used for classification and regression, while K-means is an unsupervised algorithm used for clustering.', 'duration': 25.034, 'max_score': 1359.078, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks1359078.jpg'}, {'end': 1655.589, 'src': 'embed', 'start': 1622.722, 'weight': 5, 'content': [{'end': 1629.008, 'text': 'So do you understand how type 1 and type 2 errors impact the performance of your model?', 'start': 1622.722, 'duration': 6.286}, {'end': 1631.47, 'text': 'So the interviewer is trying to understand those things.', 'start': 1629.408, 'duration': 2.062}, {'end': 1635.913, 'text': 'So you can give some good examples to say how these are going to impact.', 'start': 1631.51, 'duration': 4.403}, {'end': 1640.237, 'text': 'So first thing is type 1 error is false positives.', 'start': 1636.033, 'duration': 4.204}, {'end': 1646.362, 'text': 'So when 
something is not true in actuality, but your model is saying it is true.', 'start': 1641.018, 'duration': 5.344}, {'end': 1655.589, 'text': 'So, as in the example, if a doctor tells a man that he is pregnant, something is not true, but you are saying it is true.', 'start': 1646.803, 'duration': 8.786}], 'summary': "Type 1 and type 2 errors impact model performance; a type 1 error is a false positive, like a man being told he's pregnant.", 'duration': 32.867, 'max_score': 1622.722, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks1622722.jpg'}, {'end': 1717.95, 'src': 'embed', 'start': 1689.862, 'weight': 6, 'content': [{'end': 1694.384, 'text': 'is it better to have too many false positives or too many false negatives?', 'start': 1689.862, 'duration': 4.522}, {'end': 1701.307, 'text': 'this really depends on the domain and the type of problem that you are solving.', 'start': 1694.384, 'duration': 6.923}, {'end': 1708.704, 'text': 'so, based on the different problems and the different business requirements, you have to decide on what to have.', 'start': 1701.307, 'duration': 7.397}, {'end': 1710.325, 'text': 'so there is always a trade-off.', 'start': 1708.704, 'duration': 1.621}, {'end': 1717.95, 'text': 'you have to maintain the trade-off between false positives and false negatives and you have to decide which one we can keep as a more,', 'start': 1710.325, 'duration': 7.625}], 'summary': 'Balancing false positives and false negatives depends on domain and business requirements.', 'duration': 28.088, 'max_score': 1689.862, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks1689862.jpg'}, {'end': 1801.594, 'src': 'embed', 'start': 1773.9, 'weight': 7, 'content': [{'end': 1779.024, 'text': 'first, try to be clear about what false positives and false negatives are.', 'start': 1773.9, 
'duration': 5.124}, {'end': 1782.287, 'text': 'after that, select specific domains and give examples.', 'start': 1779.024, 'duration': 3.263}, {'end': 1791.686, 'text': 'So, for example, in medical testing, false negatives may provide a falsely reassuring message to the patients and physicians that some disease is absent,', 'start': 1782.367, 'duration': 9.319}, {'end': 1793.868, 'text': 'when it is actually present.', 'start': 1791.686, 'duration': 2.182}, {'end': 1801.594, 'text': 'so what will happen in this case is that both the patient and the disease receive inappropriate or inadequate treatment,', 'start': 1793.868, 'duration': 7.726}], 'summary': 'False negatives in medical testing lead to inadequate treatments for present diseases.', 'duration': 27.694, 'max_score': 1773.9, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks1773900.jpg'}], 'start': 802.672, 'title': 'Model evaluation and learning concepts', 'summary': 'Covers understanding bias, precision, and recall with examples, inductive and deductive learning, knn vs k-means clustering, and type 1 and type 2 errors in models, emphasizing the importance of these concepts in evaluating model performance and decision-making, with real-world examples and trade-offs highlighted.', 'chapters': [{'end': 1240.811, 'start': 802.672, 'title': 'Understanding bias, precision, and recall', 'summary': 'Explains the concepts of selection bias, precision, and recall using examples, emphasizing the importance of these concepts in evaluating model performance and decision-making, with precision defined as the ratio of correctly recalled events to all events recalled and recall as the ratio of correctly recalled events to all correct events, and real-world examples illustrating true positive, false positive, false negative, and true negative scenarios.', 'duration': 438.139, 'highlights': ['Precision and recall are defined as the ratios of 
correctly recalled events to the total number of events recalled and to the total number of correct events, respectively: that is, precision is the number of correctly recalled events over the total number of events recalled, and recall is the number of events correctly recalled over the total number of correct events.', 'The chapter provides a real-world example illustrating true positive, false positive, false negative, and true negative scenarios, emphasizing the importance of these concepts in evaluating model performance.', 'The chapter explains the concept of selection bias and its impact on decision-making, emphasizing the importance of making accurate decisions based on the population.']}, {'end': 1598.419, 'start': 1240.811, 'title': 'Inductive vs deductive learning and knn vs k-means clustering', 'summary': 'Discusses the concepts of inductive and deductive learning with an example, and then explains the difference between knn and k-means clustering, highlighting their applications and distinctions, followed by a detailed explanation of the roc curve and its interpretation for model performance evaluation.', 'duration': 357.608, 'highlights': ['The chapter explains inductive and deductive learning using a father-son example, where inductive learning involves observations and conclusions, while deductive learning involves drawing conclusions and then making observations, with a parallel drawn to machine learning (most relevant).', "It details the difference between KNN and K-means clustering, emphasizing that KNN is a supervised algorithm for classification and regression, while K-means clustering is an unsupervised technique used to create data clusters, and further highlights the role of 'K' in both methods (relevant).", 'The chapter provides a comprehensive explanation of the ROC curve, highlighting its origin, application in machine learning for model performance evaluation in binary classification, and the interpretation of the plot in terms of true positive rates and the trade-off between sensitivity and specificity (less relevant).']}, {'end': 1837.389, 'start': 1598.419, 'title': 'Type 1 and type 2 errors in models', 'summary': 'Discusses the concepts of type 1 and type 2 errors, their impact on model performance, trade-offs, and domain-specific examples, emphasizing the need to manage false positives and false negatives based on specific requirements.', 'duration': 238.97, 'highlights': ['The distinction between type 1 and type 2 errors in model performance is crucial for 
interviews, involving false positives and false negatives.', 'Trade-offs between false positives and false negatives depend on the domain and specific business requirements, impacting model decision-making.', 'Examples of domain-specific implications of false positives and false negatives, such as medical testing and spam filtering, highlight the need to manage these errors based on specific needs and potential consequences.']}], 'duration': 1034.717, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks802672.jpg', 'highlights': ['Precision is the number of correctly recalled events over the total number of events recalled, and recall is the number of events correctly recalled over the total number of correct events.', 'The chapter provides a real-world example illustrating true positive, false positive, false negative, and true negative scenarios, emphasizing the importance of these concepts in evaluating model performance.', 'The chapter explains the concept of selection bias and its impact on decision-making, emphasizing the importance of making accurate decisions based on the population.', 'The chapter explains inductive and deductive learning using a father-son example, where inductive learning involves observations and conclusions, while deductive learning involves drawing conclusions and then making observations, with a parallel drawn to machine learning.', "It details the difference between KNN and K-means clustering, emphasizing that KNN is a supervised algorithm for classification and regression, while K-means clustering is an unsupervised technique used to create data clusters, and further highlights the role of 'K' in both methods.", 'The distinction between type 1 and type 2 errors in model performance is crucial for 
false positives and false negatives.', 'Trade-offs between false positives and false negatives depend on the domain and specific business requirements, impacting model decision-making.', 'Examples of domain-specific implications of false positives and false negatives, such as medical testing and spam filtering, highlight the need to manage these errors based on specific needs and potential consequences.']}, {'end': 3090.067, 'segs': [{'end': 2006.923, 'src': 'heatmap', 'start': 1837.745, 'weight': 0, 'content': [{'end': 1843.429, 'text': 'So based on the domain it differs and you have to give the example specific to those domains.', 'start': 1837.745, 'duration': 5.684}, {'end': 1848.633, 'text': 'Okay, so, which is more important to you the model accuracy or the model performance?', 'start': 1844.089, 'duration': 4.544}, {'end': 1853.956, 'text': 'This question, the interview is trying to understand how better you know these terms.', 'start': 1849.273, 'duration': 4.683}, {'end': 1857.299, 'text': 'So do you really know the differences between these terms?', 'start': 1854.457, 'duration': 2.842}, {'end': 1862.823, 'text': 'So first thing, try to understand what these terms are in actually.', 'start': 1857.639, 'duration': 5.184}, {'end': 1866.225, 'text': 'the model accuracy is part of model performance.', 'start': 1862.823, 'duration': 3.402}, {'end': 1868.981, 'text': 'It is subset of the model performance.', 'start': 1866.84, 'duration': 2.141}, {'end': 1874.685, 'text': 'There are different model performance measures and model accuracy is one of them.', 'start': 1869.021, 'duration': 5.664}, {'end': 1878.407, 'text': "So for example, let's consider the case of fraud detection.", 'start': 1875.265, 'duration': 3.142}, {'end': 1886.632, 'text': 'So in this case what will happen is you will have millions of rows and within those only very less percentage of rows will have actual frauds.', 'start': 1878.767, 'duration': 7.865}, {'end': 1890.415, 'text': 'So in 
those cases, if you look for the model accuracy,', 'start': 1887.072, 'duration': 3.343}, {'end': 1895.898, 'text': "or model accuracy will mostly be higher and it won't give you a complete picture of your model performance.", 'start': 1890.415, 'duration': 5.483}, {'end': 1904.719, 'text': 'So model accuracy is just a subset of the model performance and there are more metrics that you have to look to understand the model performance.', 'start': 1896.417, 'duration': 8.302}, {'end': 1910.381, 'text': 'Next question is what is the difference between Gini impurity and entropy in decision tree?', 'start': 1905.82, 'duration': 4.561}, {'end': 1911.621, 'text': 'So both.', 'start': 1910.821, 'duration': 0.8}, {'end': 1917.143, 'text': 'first thing this both the things are used as a impurity measure in decision tree.', 'start': 1911.621, 'duration': 5.522}, {'end': 1924.365, 'text': 'What do we mean by impurity measure? So first you have to tell the interviewer what is really impurity in the decision tree.', 'start': 1917.543, 'duration': 6.822}, {'end': 1932.65, 'text': 'So impurity is something as how this classified your classes are within the tree as when you make a splits, how your classes are getting split.', 'start': 1924.724, 'duration': 7.926}, {'end': 1937.333, 'text': 'So Gini is one thing where what it does is it tries to see.', 'start': 1932.97, 'duration': 4.363}, {'end': 1942.457, 'text': 'as when you pick a random sample out of the different labels, what is the chance of it getting picked?', 'start': 1937.333, 'duration': 5.124}, {'end': 1945.799, 'text': 'It tries to add those probabilities and create an impurity.', 'start': 1942.797, 'duration': 3.002}, {'end': 1948.866, 'text': 'So it is basically one minus those probabilities.', 'start': 1946.246, 'duration': 2.62}, {'end': 1956.488, 'text': "The lesser it is, you're more confident that your labels are getting clustered in a different groups and different nodes.", 'start': 1949.187, 'duration': 
7.301}, {'end': 1960.028, 'text': 'Entropy is a measurement of lack of information.', 'start': 1957.168, 'duration': 2.86}, {'end': 1967.85, 'text': "So when you're making a split within your data, it tries to identify as how disorganized this data is.", 'start': 1960.409, 'duration': 7.441}, {'end': 1974.291, 'text': 'Basically both are trying to do a similar thing, but they are just doing it in a different manner with different mathematics.', 'start': 1968.61, 'duration': 5.681}, {'end': 1983.095, 'text': 'What happens is performance-wise, they both are same, but mostly people would go with Gini, as it is less computationally overheading,', 'start': 1975.21, 'duration': 7.885}, {'end': 1988.959, 'text': 'as entropy uses a log function within the calculations and that is bit computationally expensive.', 'start': 1983.095, 'duration': 5.864}, {'end': 1993.902, 'text': 'So using Gini it can be reduced so mostly people would go with the Gini.', 'start': 1989.38, 'duration': 4.522}, {'end': 1997.745, 'text': 'So what is difference between entropy and information gain?', 'start': 1994.543, 'duration': 3.202}, {'end': 2006.923, 'text': "So when we're trying to make a split within the data, Entropy is an indicator like how messy is your data within the nodes which you have.", 'start': 1998.165, 'duration': 8.758}], 'summary': 'Understanding model accuracy and performance, impurity measures in decision trees, and differences between gini and entropy measures.', 'duration': 48.887, 'max_score': 1837.745, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks1837745.jpg'}, {'end': 1948.866, 'src': 'embed', 'start': 1924.724, 'weight': 2, 'content': [{'end': 1932.65, 'text': 'So impurity is something as how this classified your classes are within the tree as when you make a splits, how your classes are getting split.', 'start': 1924.724, 'duration': 7.926}, {'end': 1937.333, 'text': 'So Gini is one thing where what it does 
is it tries to see.', 'start': 1932.97, 'duration': 4.363}, {'end': 1942.457, 'text': 'as when you pick a random sample out of the different labels, what is the chance of it getting picked?', 'start': 1937.333, 'duration': 5.124}, {'end': 1945.799, 'text': 'It tries to add those probabilities and create an impurity.', 'start': 1942.797, 'duration': 3.002}, {'end': 1948.866, 'text': 'So it is basically one minus those probabilities.', 'start': 1946.246, 'duration': 2.62}], 'summary': 'Gini measures impurity by calculating probabilities of label picking.', 'duration': 24.142, 'max_score': 1924.724, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks1924724.jpg'}, {'end': 1993.902, 'src': 'embed', 'start': 1968.61, 'weight': 3, 'content': [{'end': 1974.291, 'text': 'Basically both are trying to do a similar thing, but they are just doing it in a different manner with different mathematics.', 'start': 1968.61, 'duration': 5.681}, {'end': 1983.095, 'text': 'What happens is performance-wise, they both are same, but mostly people would go with Gini, as it is less computationally overheading,', 'start': 1975.21, 'duration': 7.885}, {'end': 1988.959, 'text': 'as entropy uses a log function within the calculations and that is bit computationally expensive.', 'start': 1983.095, 'duration': 5.864}, {'end': 1993.902, 'text': 'So using Gini it can be reduced so mostly people would go with the Gini.', 'start': 1989.38, 'duration': 4.522}], 'summary': 'Both gini and entropy offer similar performance, but gini is preferred due to lower computational overhead.', 'duration': 25.292, 'max_score': 1968.61, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks1968610.jpg'}, {'end': 2057.242, 'src': 'embed', 'start': 2029.232, 'weight': 4, 'content': [{'end': 2035.414, 'text': 'So as your entropy keeps on decreasing your information gain keeps on increasing.', 'start': 2029.232, 
'duration': 6.182}, {'end': 2043.097, 'text': 'So both are related with each other and as your entropy is decreasing your information gain will keep on increasing.', 'start': 2035.894, 'duration': 7.203}, {'end': 2049.579, 'text': 'Your information gain will keep on increasing as your nodes are getting pure and pure.', 'start': 2044.137, 'duration': 5.442}, {'end': 2057.242, 'text': "So node purity basically says as you're getting specific classes within the nodes those nodes are getting purer.", 'start': 2049.859, 'duration': 7.383}], 'summary': 'Entropy decrease leads to increasing information gain and node purity.', 'duration': 28.01, 'max_score': 2029.232, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks2029232.jpg'}, {'end': 2178.315, 'src': 'embed', 'start': 2147.335, 'weight': 5, 'content': [{'end': 2154.418, 'text': 'overfitting is something which is like very much closely getting fitted with the data which is there within the training set.', 'start': 2147.335, 'duration': 7.083}, {'end': 2157.439, 'text': 'So it is getting very much closer.', 'start': 2154.998, 'duration': 2.441}, {'end': 2162.782, 'text': "It's creating a curve as you see in the diagram very much closer to the data which is there within the training.", 'start': 2157.52, 'duration': 5.262}, {'end': 2167.004, 'text': "So when you give it a new testing data, It won't generalize it very well.", 'start': 2163.122, 'duration': 3.882}, {'end': 2169.766, 'text': 'So you have to be sure in overfitting.', 'start': 2167.344, 'duration': 2.422}, {'end': 2178.315, 'text': "It's trying to create a model which is very much learning all the parameters in an exact manner from the training data, which shouldn't be the case.", 'start': 2169.847, 'duration': 8.468}], 'summary': 'Overfitting occurs when the model closely fits the training data, hindering generalization to new data.', 'duration': 30.98, 'max_score': 2147.335, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks2147335.jpg'}, {'end': 2376.542, 'src': 'embed', 'start': 2346.851, 'weight': 6, 'content': [{'end': 2350.772, 'text': 'So ensemble learning is basically learning from a committee or crowd.', 'start': 2346.851, 'duration': 3.921}, {'end': 2358.416, 'text': 'So you basically train a large number of models and then try to combine their predictions and create a single conclusion out of it.', 'start': 2350.892, 'duration': 7.524}, {'end': 2367.538, 'text': 'So for example, we split our data set into different samples and each of those samples is fed into a similar kind of algorithm.', 'start': 2359.175, 'duration': 8.363}, {'end': 2368.819, 'text': 'For example, the decision tree.', 'start': 2367.558, 'duration': 1.261}, {'end': 2376.542, 'text': 'So we create 100 decision trees on 100 samples of our data and each model is trying to capture different patterns of your data.', 'start': 2369.139, 'duration': 7.403}], 'summary': 'Ensemble learning involves training and combining multiple models to capture diverse data patterns.', 'duration': 29.691, 'max_score': 2346.851, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks2346851.jpg'}, {'end': 2423.043, 'src': 'embed', 'start': 2391.026, 'weight': 7, 'content': [{'end': 2393.727, 'text': 'There are two types of ensemble model.', 'start': 2391.026, 'duration': 2.701}, {'end': 2400.929, 'text': 'One is bagging and the other is boosting, so you can just give the theory about how both of them work.', 'start': 2394.087, 'duration': 6.842}, {'end': 2406.07, 'text': 'So bagging is something where what we are doing is we are trying to sample the data.', 'start': 2401.229, 'duration': 4.841}, {'end': 2413.536, 'text': 'different samples are trained on similar kind of algorithms, such as the decision trees, the logistic regression or the SVM.', 'start': 
2406.07, 'duration': 7.466}, {'end': 2423.043, 'text': 'on all samples, a single algorithm is used and at the end you combine all the outputs of these models, and that is the output of the bagging.', 'start': 2413.536, 'duration': 9.507}], 'summary': 'Ensemble models come in two types: bagging and boosting. Bagging involves sampling the data, training a single algorithm (such as decision trees, logistic regression, or SVM) on each sample, and combining the outputs.', 'duration': 32.017, 'max_score': 2391.026, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks2391026.jpg'}, {'end': 2856.945, 'src': 'embed', 'start': 2833.731, 'weight': 9, 'content': [{'end': 2840.676, 'text': 'outliers to your data, and when you have a very large amount of data and you can risk dropping those outliers,', 'start': 2833.731, 'duration': 6.945}, {'end': 2845.36, 'text': 'then you can go on with that or you can cap your data using the percentile.', 'start': 2840.676, 'duration': 4.684}, {'end': 2849.942, 'text': 'So mostly what people do is they use the 99th percentile or 95th percentile.', 'start': 2845.38, 'duration': 4.562}, {'end': 2854.564, 'text': 'So whatever the values are above those percentiles, your outliers are capped to those.', 'start': 2850.262, 'duration': 4.302}, {'end': 2856.945, 'text': 'So your outliers will get reduced.', 'start': 2854.864, 'duration': 2.081}], 'summary': 'Capping values at the 99th or 95th percentile reduces outliers in the data, aiding in analysis.', 'duration': 23.214, 'max_score': 2833.731, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks2833731.jpg'}, {'end': 2903.023, 'src': 'embed', 'start': 2879.035, 'weight': 10, 'content': [{'end': 2887.437, 'text': "So collinearity occurs when you're trying to do a regression on multiple features and you see that two of your predictors are correlated with each other.", 'start': 2879.035, 'duration': 8.402}, {'end': 2891.699, 'text': "For example, let's 
assume you have a date of birth and the age.", 'start': 2887.618, 'duration': 4.081}, {'end': 2898.181, 'text': 'So when you have both of this, your date of birth and age is always going to be correlated with each other.', 'start': 2891.999, 'duration': 6.182}, {'end': 2903.023, 'text': 'So in those cases, these two features are part of collinearity within the regression.', 'start': 2898.201, 'duration': 4.822}], 'summary': 'Collinearity happens in regression when predictors are correlated, such as date of birth and age.', 'duration': 23.988, 'max_score': 2879.035, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks2879035.jpg'}, {'end': 3078.445, 'src': 'embed', 'start': 3051.074, 'weight': 11, 'content': [{'end': 3055.556, 'text': 'and these algorithms really helps you to reduce the dimensions which is there within your data.', 'start': 3051.074, 'duration': 4.482}, {'end': 3058.238, 'text': 'So that is from the point of data analysis.', 'start': 3055.897, 'duration': 2.341}, {'end': 3064.902, 'text': 'The other applications are eigenvectors are the directions along which a particular linear transformation acts.', 'start': 3058.318, 'duration': 6.584}, {'end': 3068.062, 'text': 'So when you apply some linear transformation,', 'start': 3065.302, 'duration': 2.76}, {'end': 3073.944, 'text': 'this eigenvectors are actually giving you like in which direction those transformations are being applied.', 'start': 3068.062, 'duration': 5.882}, {'end': 3078.445, 'text': 'So this is like when you apply this transformation, what is the direction which is being there?', 'start': 3074.184, 'duration': 4.261}], 'summary': 'Algorithms reduce data dimensions, eigenvectors show transformation directions.', 'duration': 27.371, 'max_score': 3051.074, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks3051074.jpg'}], 'start': 1837.745, 'title': 'Model accuracy vs model 
performance', 'summary': 'Emphasizes the differences between model accuracy and model performance, highlighting that model accuracy is a subset of model performance. it also discusses gini impurity and entropy in decision trees, ensemble learning in machine learning, and outliers screening and handling.', 'chapters': [{'end': 1904.719, 'start': 1837.745, 'title': 'Model accuracy vs model performance', 'summary': 'Discusses the importance of understanding the differences between model accuracy and model performance, emphasizing that model accuracy is just a subset of model performance, and in scenarios like fraud detection where only a small percentage of cases involve fraud, model accuracy may not provide a complete picture of model performance.', 'duration': 66.974, 'highlights': ['Model accuracy is a subset of model performance, with various performance measures beyond accuracy that should be considered to understand the complete model performance.', 'In scenarios like fraud detection with a small percentage of actual fraud cases, model accuracy may be higher but may not provide a comprehensive assessment of model performance.', 'Understanding the differences between model accuracy and model performance is crucial for demonstrating knowledge in an interview setting.']}, {'end': 2281.823, 'start': 1905.82, 'title': 'Gini impurity vs. entropy in decision tree', 'summary': 'Explains the difference between gini impurity and entropy in a decision tree, their roles as impurity measures, and how they affect model performance, with a focus on computational efficiency and information gain. 
It also outlines the concept of overfitting in machine learning models and methods to prevent it, including collecting more data, using ensemble methods, choosing simpler models, and incorporating regularization.', 'duration': 376.003, 'highlights': ['Gini impurity measures the probability of misclassifying a random sample, computed as 1 minus the sum of the squared label probabilities in a node, while entropy measures the lack of information in data splits, identifying how disorganized the data is when making a split. Both aim to separate labels into different nodes.', 'Gini impurity is often preferred over entropy due to its lower computational overhead, as entropy uses a log function within the calculations, making it more computationally expensive.', 'Information gain increases as entropy decreases, indicating better separation of labels into different nodes and increasing node purity, leading to better model performance.', 'Overfitting occurs when a model closely fits the training data, leading to poor generalization to new testing data, and can be prevented by collecting more data, using ensemble methods, choosing simpler models, and adding regularization. 
Overfitting occurs when a model closely fits the training data, leading to poor generalization to new testing data, and can be prevented by collecting more data, using ensemble methods, choosing simpler models, and adding regularization such as L1 and L2, which penalize the model for overfitting or underfitting.', 'Ensemble learning involves using multiple models to capture different patterns in the data and averaging their outputs to reduce overfitting. Ensemble learning involves using multiple models to capture different patterns in the data and averaging their outputs to reduce overfitting, providing a more robust and accurate prediction.']}, {'end': 2687.914, 'start': 2281.823, 'title': 'Ensemble learning in machine learning', 'summary': 'Explains how ensemble learning combines weak learners to create a better predictive model, with bagging and boosting as two types of ensemble models, helping to reduce overfitting and variance in machine learning by aggregating the outputs of different models and learning from previous mistakes.', 'duration': 406.091, 'highlights': ['Ensemble learning combines weak learners to create a better predictive model, with bagging and boosting as two types of ensemble models. Ensemble learning combines different models to create a better predictive model, with bagging and boosting as two types of ensemble models, which help reduce overfitting and variance in machine learning.', 'Ensemble learning helps to reduce overfitting by combining the outputs of different models and learning from previous mistakes. Ensemble learning helps reduce overfitting by combining the outputs of different models and learning from previous mistakes, ultimately leading to a better predictive model.', 'Bagging and boosting are two types of ensemble models that help reduce overfitting and variance in machine learning. 
Bagging and boosting are two types of ensemble models that help reduce overfitting and variance in machine learning by aggregating the outputs of different models and learning from previous mistakes.', "Boosting learns from mistakes of previous models to create improved versions, while bagging gives equal weightage to each model's output. Boosting learns from mistakes of previous models to create improved versions, while bagging gives equal weightage to each model's output, and both types help to reduce overfitting and variance in machine learning.", 'Ensemble models reduce overfitting and variance by aggregating outputs of different models and learning from previous mistakes. Ensemble models reduce overfitting and variance by aggregating outputs of different models and learning from previous mistakes, ultimately leading to a better predictive model.']}, {'end': 3090.067, 'start': 2687.914, 'title': 'Outliers screening and handling', 'summary': 'Discusses how to screen for outliers using box plots, probabilistic and statistical models, linear models, and proximity based models, as well as how to handle outliers by capping data using percentiles and imputing based on rules. 
it also explains collinearity, multicollinearity, and the concepts of eigenvectors and eigenvalues, including their applications in data analysis and linear transformations.', 'duration': 402.153, 'highlights': ['Screening outliers using box plots, probabilistic and statistical models, linear models, and proximity based models The chapter explains different methods for screening outliers, including box plots, probabilistic and statistical models, linear models, and proximity based models, to identify and understand outliers within the data.', 'Handling outliers by capping data using percentiles and imputing based on rules It details the methods for handling outliers, such as capping data using percentiles (e.g., 99th or 95th percentile) and imputing outliers based on business rules or data exploration to facilitate model creation.', 'Explaining collinearity and multicollinearity in regression with examples The chapter provides a clear explanation of collinearity and multicollinearity in regression, including examples such as the correlation between date of birth and age, and the correlation between multiple variables like age, year of birth, and class.', 'Defining eigenvectors and eigenvalues and their applications in data analysis and linear transformations It defines eigenvectors and eigenvalues, and their applications in data analysis, including understanding linear transformations, reducing dimensions using PCA and factor analysis, and their use in compressing images.']}], 'duration': 1252.322, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks1837745.jpg', 'highlights': ['Understanding the differences between model accuracy and model performance is crucial for demonstrating knowledge in an interview setting.', 'Model accuracy may be higher in scenarios like fraud detection with a small percentage of actual fraud cases, but may not provide a comprehensive assessment of model performance.', 'Gini impurity 
measures the probability of misclassifying a random sample, while entropy measures the lack of information in data splits, both aiming to cluster labels in different nodes.', 'Gini impurity is often preferred over entropy due to its lower computational overhead, as entropy uses a log function in its calculations.', 'Information gain increases as entropy decreases, indicating better separation of labels into different nodes and increasing node purity, leading to better model performance.', 'Overfitting occurs when a model closely fits the training data, leading to poor generalization to new testing data, and can be prevented by collecting more data, using ensemble methods, choosing simpler models, and adding regularization.', 'Ensemble learning involves using multiple models to capture different patterns in the data and averaging their outputs to reduce overfitting, providing a more robust and accurate prediction.', 'Bagging and boosting are two types of ensemble models that help reduce overfitting and variance in machine learning by aggregating the outputs of different models and learning from previous mistakes.', 'Ensemble learning helps reduce overfitting by combining the outputs of different models and learning from previous mistakes, ultimately leading to a better predictive model.', 'Handling outliers by capping data using percentiles and imputing based on rules facilitates model creation.', 'The chapter provides a clear explanation of collinearity and multicollinearity in regression, including examples such as the correlation between date of birth and age, and the correlation between multiple variables like age, year of birth, and class.', 'It defines eigenvectors and eigenvalues, and their applications in data analysis, including understanding linear transformations, reducing dimensions using PCA and factor analysis, and their use in compressing images.']}, {'end': 3790.68, 'segs': [{'end': 3114.579, 'src': 'embed', 'start': 3091.428, 'weight': 0, 'content': 
[{'end': 3098.812, 'text': 'What is A/B testing? So A/B testing is a statistical hypothesis test which compares different cases.', 'start': 3091.428, 'duration': 7.384}, {'end': 3104.894, 'text': 'So in our case, we want to measure how different models perform compared to each other.', 'start': 3098.852, 'duration': 6.042}, {'end': 3114.579, 'text': "So assume that in production you have a model which is already running and tries to see how your users are clicking through your products and they're buying those products.", 'start': 3105.235, 'duration': 9.344}], 'summary': "A/B testing compares models' performance to measure user behavior, like product clicks and purchases.", 'duration': 23.151, 'max_score': 3091.428, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks3091428.jpg'}, {'end': 3222.95, 'src': 'embed', 'start': 3195.294, 'weight': 1, 'content': [{'end': 3199.236, 'text': 'These are the different clusters which we have within which there are different data scientists.', 'start': 3195.294, 'duration': 3.942}, {'end': 3206.136, 'text': 'When we try to randomly select those clusters for our analysis, that is called cluster sampling.', 'start': 3199.716, 'duration': 6.42}, {'end': 3213.002, 'text': 'So in this case, the sample is nothing but different clusters and we are trying to select those samples.', 'start': 3207.037, 'duration': 5.965}, {'end': 3222.95, 'text': 'So for example, if managers are your samples, then companies are basically clusters and we do the clustering of these different companies.', 'start': 3213.803, 'duration': 9.147}], 'summary': 'Cluster sampling involves randomly selecting clusters for analysis, such as managers as samples within different companies.', 'duration': 27.656, 'max_score': 3195.294, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks3195294.jpg'}, {'end': 3256.41, 'src': 'embed', 'start': 
3231.26, 'weight': 2, 'content': [{'end': 3238.303, 'text': 'But do you know how the tree decides which variable to split on at the root node and its succeeding child nodes?', 'start': 3231.26, 'duration': 7.043}, {'end': 3244.165, 'text': 'So you can explain the things we have discussed already about the Gini and the entropy parameters.', 'start': 3238.563, 'duration': 5.602}, {'end': 3247.626, 'text': 'So what Gini does is it calculates for sub nodes.', 'start': 3244.605, 'duration': 3.021}, {'end': 3256.41, 'text': 'What is the probability of success for each class, and that is done by summing the squares of the probabilities of success and failure, like p squared plus q squared.', 'start': 3248.047, 'duration': 8.363}], 'summary': 'Tree nodes are split based on Gini and entropy; Gini sums the squared class probabilities in each sub-node.', 'duration': 25.15, 'max_score': 3231.26, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks3231260.jpg'}, {'end': 3459.277, 'src': 'embed', 'start': 3418.166, 'weight': 3, 'content': [{'end': 3423.33, 'text': 'So those are the core storage format for all the data in data analysis.', 'start': 3418.166, 'duration': 5.164}, {'end': 3426.733, 'text': 'So SciPy, its full form is basically scientific Python.', 'start': 3423.77, 'duration': 2.963}, {'end': 3431.654, 'text': 'It basically gives you a library to deal with all the different kinds of mathematical functions.', 'start': 3427.532, 'duration': 4.122}, {'end': 3435.456, 'text': 'So you can do Fourier transform for the audio related data.', 'start': 3432.095, 'duration': 3.361}, {'end': 3438.618, 'text': 'You can go with the optimizations, you can go with the interpolation.', 'start': 3435.476, 'duration': 3.142}, {'end': 3442.48, 'text': 'So these are just a few examples, but all the mathematical related tasks.', 'start': 3438.958, 'duration': 3.522}, {'end': 3448.744, 'text': 'you can go with SciPy, and SciPy uses, as we already discussed, NumPy at its core to 
store the data.', 'start': 3442.48, 'duration': 6.264}, {'end': 3456.636, 'text': 'So pandas is a data analysis library, again where you can store the data in a tabular manner, which is called a data frame,', 'start': 3449.404, 'duration': 7.232}, {'end': 3459.277, 'text': 'and data frames are a very powerful structure within pandas.', 'start': 3456.636, 'duration': 2.641}], 'summary': 'SciPy provides a library for mathematical functions, including Fourier transform and optimization, utilizing core NumPy for data storage. Pandas is a data analysis library for tabular data storage.', 'duration': 41.111, 'max_score': 3418.166, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks3418166.jpg'}, {'end': 3616.322, 'src': 'embed', 'start': 3592.455, 'weight': 4, 'content': [{'end': 3601.357, 'text': 'So, for example, when you want to have distribution charts with more detail, Seaborn provides you charts such as the violin plots.', 'start': 3592.455, 'duration': 8.902}, {'end': 3602.658, 'text': 'it gives you KDE plots.', 'start': 3601.357, 'duration': 1.301}, {'end': 3606.699, 'text': 'So for the detailed explanation of your data, you can go with Seaborn.', 'start': 3602.978, 'duration': 3.721}, {'end': 3609.9, 'text': 'Bokeh is something which is an interactive visualization.', 'start': 3607.339, 'duration': 2.561}, {'end': 3616.322, 'text': 'Bokeh you go with when you want to present your data to the outside world, when you want to publish it on the web.', 'start': 3610.36, 'duration': 5.962}], 'summary': 'Seaborn offers detailed distribution charts like violin plots and KDE plots, while Bokeh provides interactive visualization for presenting data externally.', 'duration': 23.867, 'max_score': 3592.455, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks3592455.jpg'}, {'end': 3768.746, 'src': 'embed', 'start': 3731.335, 'weight': 6, 
'content': [{'end': 3734.936, 'text': 'and data frame has additional functions on top of series.', 'start': 3731.335, 'duration': 3.601}, {'end': 3738.917, 'text': 'so, for example, loc is a function which is there on top of this.', 'start': 3734.936, 'duration': 3.981}, {'end': 3744.618, 'text': 'so data frame is the upper level layer on top of series and gives you more control to the data.', 'start': 3738.917, 'duration': 5.701}, {'end': 3751.99, 'text': 'But the basic difference is with the series you just have a single column, but with the data frame you still can add more columns to the data.', 'start': 3745.118, 'duration': 6.872}, {'end': 3756.379, 'text': 'How can you handle duplicate values and data set for a variable in python?', 'start': 3752.677, 'duration': 3.702}, {'end': 3762.462, 'text': 'So in this case, you may have to write a code and show to the interviewer that how you can really achieve this thing.', 'start': 3756.699, 'duration': 5.763}, {'end': 3768.746, 'text': 'So you can just import the pandas library, show that you are just reading a random file using the PD dot,', 'start': 3762.702, 'duration': 6.044}], 'summary': 'Data frame provides additional functions and allows adding multiple columns to the data in python using pandas library.', 'duration': 37.411, 'max_score': 3731.335, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks3731335.jpg'}], 'start': 3091.428, 'title': 'A/b testing, cluster sampling, and python libraries for data analysis', 'summary': "Covers a/b testing for model performance comparison, cluster sampling for selecting samples, and decision-making in binary classification trees. 
it also discusses python libraries like numpy, scipy, pandas, scikit, matplotlib, seaborn, and bokeh for data manipulation, visualization, and machine learning, emphasizing numpy's role as a core storage format and the distinctions between matplotlib, seaborn, and bokeh for data visualization.", 'chapters': [{'end': 3358.031, 'start': 3091.428, 'title': 'A/b testing and cluster sampling', 'summary': "Discusses a/b testing to compare different models' performance and cluster sampling for selecting samples from different clusters within a population, also covering the decision-making process of a binary classification tree algorithm through gini and entropy parameters.", 'duration': 266.603, 'highlights': ["A/B testing compares different models' performance, such as predicting product recommendations and measuring user engagement, using statistical hypothesis testing like t-test to identify the better model. Comparison of models' performance, user engagement measurement, statistical hypothesis testing usage", 'Cluster sampling involves randomly selecting clusters from a population for analysis, with examples like selecting different company clusters as samples for analysis. Random selection of clusters for analysis, examples of cluster sampling', 'Explanation of the decision-making process in a binary classification tree algorithm using Gini and entropy parameters, where Gini calculates weighted Gini scores for node splitting and entropy measures impurity or randomness within the data. 
Decision-making process in binary classification tree algorithm, Gini and entropy parameters explanation']}, {'end': 3790.68, 'start': 3358.571, 'title': 'Python libraries for data analysis', 'summary': "Discusses the core libraries in python for data analysis, including numpy, scipy, pandas, scikit, matplotlib, seaborn, and bokeh, and their applications for data manipulation, visualization, and machine learning, with a focus on numpy's role as a core storage format and the differences between matplotlib, seaborn, and bokeh for data visualization.", 'duration': 432.109, 'highlights': ['NumPy, SciPy, Pandas, and Scikit are core libraries in Python for data analysis and scientific computations, with NumPy serving as the core storage format for all the data in data analysis. Highlights the core libraries in Python for data analysis and their significance as the foundation for scientific computations and data manipulation.', 'Matplotlib provides quick access to basic chart types for quick analysis and data exploration, Seaborn is used for in-depth analysis and statistical examination of data, while Bokeh is an interactive visualization tool for presenting data to the outside world. Explains the differences and applications of Matplotlib, Seaborn, and Bokeh for data visualization, catering to different levels of analysis and interaction with data.', 'NumPy is a numerical library for dealing with data, while SciPy provides mathematical functions for tasks such as Fourier transform and optimization, and Pandas offers data manipulation and analysis capabilities through data frames. 
Details the specific functions and purposes of NumPy, SciPy, and Pandas in data analysis, emphasizing their respective roles in numerical operations, mathematical functions, and data manipulation.', 'The main difference between a Pandas series and a single column data frame lies in the capability to store multiple columns, with data frames providing additional functions and control over the data compared to series. Clarifies the distinction between Pandas series and data frames, highlighting the differences in their capabilities and functions for handling data.', "Duplicate values in a data set can be handled by identifying them using the 'duplicated' function and subsequently removing them with the 'drop_duplicates' function in the Pandas library. Describes the process of handling duplicate values in a data set using specific functions in the Pandas library, demonstrating practical data manipulation techniques."]}], 'duration': 699.252, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks3091428.jpg', 'highlights': ["A/B testing compares different models' performance using statistical hypothesis testing like t-test.", 'Cluster sampling involves randomly selecting clusters from a population for analysis.', 'Explanation of the decision-making process in a binary classification tree algorithm using Gini and entropy parameters.', 'NumPy serves as the core storage format for all the data in data analysis.', 'Explains the differences and applications of Matplotlib, Seaborn, and Bokeh for data visualization.', 'Details the specific functions and purposes of NumPy, SciPy, and Pandas in data analysis.', 'Clarifies the distinction between Pandas series and data frames.', 'Describes the process of handling duplicate values in a data set using specific functions in the Pandas library.']}, {'end': 4310.908, 'segs': [{'end': 3832.614, 'src': 'embed', 'start': 3790.68, 'weight': 0, 'content': [{'end': 3793.763, 'text': 'you 
can try to show and explain them what they do.', 'start': 3790.68, 'duration': 3.083}, {'end': 3799.507, 'text': 'that would help you in that write a basic machine learning program to check the accuracy of the data set.', 'start': 3793.763, 'duration': 5.744}, {'end': 3803.304, 'text': 'inputting any data set using any classifier.', 'start': 3800.203, 'duration': 3.101}, {'end': 3808.446, 'text': 'So in this case, you are not limited with what data set you are loading what data classifier.', 'start': 3803.744, 'duration': 4.702}, {'end': 3814.468, 'text': 'the important thing which we have to show is how you are loading your performance parameters,', 'start': 3808.446, 'duration': 6.022}, {'end': 3819.81, 'text': 'and things that the interviewer will gauge is how you are trying to use the performance metrics.', 'start': 3814.468, 'duration': 5.342}, {'end': 3823.091, 'text': 'Are you trying to use it on the test set or the training set?', 'start': 3820.23, 'duration': 2.861}, {'end': 3827.652, 'text': "or you're trying to use it on the train data or just a test data?", 'start': 3823.67, 'duration': 3.982}, {'end': 3832.614, 'text': "so you have to very sure, as you don't have the computer to do the output, you just want to write it.", 'start': 3827.652, 'duration': 4.962}], 'summary': 'Write a basic ml program to check dataset accuracy using any classifier.', 'duration': 41.934, 'max_score': 3790.68, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks3790680.jpg'}, {'end': 3876.601, 'src': 'embed', 'start': 3852.806, 'weight': 1, 'content': [{'end': 3862.012, 'text': 'You can just show that you are reading some data and try to separate your data into the X and Y, the target data and the predictor data,', 'start': 3852.806, 'duration': 9.206}, {'end': 3865.854, 'text': 'and try to create a split within the data of train and test validations.', 'start': 3862.012, 'duration': 3.842}, {'end': 3869.076, 'text': 'You 
can use whatever ratio you want to use.', 'start': 3866.334, 'duration': 2.742}, {'end': 3876.601, 'text': 'You can use 80%, 70%, 50% as you like, but you need to give some justification to that also if your interviewer asks for it.', 'start': 3869.096, 'duration': 7.505}], 'summary': 'The process involves separating data into target and predictor, and creating a train-test split, with flexibility on the percentage used, while providing justifications if asked.', 'duration': 23.795, 'max_score': 3852.806, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks3852806.jpg'}, {'end': 3982.252, 'src': 'embed', 'start': 3956.606, 'weight': 2, 'content': [{'end': 3966.508, 'text': 'the next thing the interviewer may ask you is how can you improve the accuracy score so you can go for three simple steps where you can try to increase the accuracy of your model.', 'start': 3956.606, 'duration': 9.902}, {'end': 3972.249, 'text': 'first simple step would be try to see if you can make some tweaking into your probability cutoff.', 'start': 3966.508, 'duration': 5.741}, {'end': 3975.77, 'text': 'So the default probability cutoff is 50%.', 'start': 3972.49, 'duration': 3.28}, {'end': 3982.252, 'text': 'so the ones which are above the 50% are tagged as one and the ones below the 50% probability are tagged as zero.', 'start': 3975.77, 'duration': 6.482}], 'summary': 'To improve accuracy score, consider adjusting probability cutoff. 
default is 50%.', 'duration': 25.646, 'max_score': 3956.606, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks3956606.jpg'}, {'end': 4139.542, 'src': 'embed', 'start': 4110.328, 'weight': 4, 'content': [{'end': 4117.911, 'text': "It's possible that this thing is trying to give you some idea about the data and there will be some hidden pattern within your data because of this,", 'start': 4110.328, 'duration': 7.583}, {'end': 4118.671, 'text': 'some missing values.', 'start': 4117.911, 'duration': 0.76}, {'end': 4125.974, 'text': "So it's possible that creating this new feature will help your model to give better accuracy and better performance.", 'start': 4119.091, 'duration': 6.883}, {'end': 4130.636, 'text': 'Second thing we can remove that completely if you have a high count of data.', 'start': 4126.554, 'duration': 4.082}, {'end': 4139.542, 'text': 'So, for example, if you have a very good amount of data and you can say that okay, I can live with removing this 30% of the data,', 'start': 4130.937, 'duration': 8.605}], 'summary': 'Creating a new feature that flags missing values may improve model accuracy, and rows with missing values can be removed if the dataset is large.', 'duration': 29.214, 'max_score': 4110.328, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks4110328.jpg'}, {'end': 4203.025, 'src': 'embed', 'start': 4174.529, 'weight': 5, 'content': [{'end': 4179.032, 'text': 'So if they are continuous in nature, we can use the distribution to assign some data to them.', 'start': 4174.529, 'duration': 4.503}, {'end': 4184.712, 'text': 'So write an SQL query that makes recommendations using pages that your friends like.', 'start': 4179.886, 'duration': 4.826}, {'end': 4186.453, 'text': 'So assume you have two tables.', 'start': 4185.152, 'duration': 1.301}, {'end': 4190.678, 'text': 'First table is a two-column table of users and their friends.', 
'start': 4186.834, 'duration': 3.844}, {'end': 4196.443, 'text': 'Second table is a two-column table of users and the pages they like.', 'start': 4191.457, 'duration': 4.986}, {'end': 4199.246, 'text': 'It should not recommend pages you already like.', 'start': 4196.743, 'duration': 2.503}, {'end': 4203.025, 'text': 'Why would an interviewer ask an SQL question in this case?', 'start': 4199.961, 'duration': 3.064}], 'summary': "Use SQL to recommend pages based on friends' likes, avoiding already liked pages.", 'duration': 28.496, 'max_score': 4174.529, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks4174529.jpg'}], 'start': 3790.68, 'title': 'Machine learning interview: coding and accuracy metrics', 'summary': 'Provides guidance on writing a basic machine learning program to check the accuracy of a dataset using any classifier, emphasizing the importance of performance parameters, data splitting, model selection, accuracy measurement, and potential steps for improving accuracy.', 'chapters': [{'end': 4051.42, 'start': 3790.68, 'title': 'Machine learning interview: coding and accuracy metrics', 'summary': 'Provides guidance on writing a basic machine learning program to check the accuracy of a dataset using any classifier, emphasizing the importance of performance parameters, data splitting, model selection, accuracy measurement, and potential steps for improving accuracy.', 'duration': 260.74, 'highlights': ['Guidance on writing a basic machine learning program to check the accuracy of a dataset using any classifier. The chapter emphasizes the importance of writing a basic machine learning program to check the accuracy of a dataset using any classifier, showcasing the ability to load performance parameters and use performance metrics on the test set.', "Emphasis on the importance of performance parameters and using performance metrics on the test set. 
The interviewer will gauge the candidate's ability to use performance metrics on the test set or the training set and the importance of using both training y and test y to show the accuracy parameter.", 'Importance of data splitting, model selection, and accuracy measurement. The chapter highlights the significance of importing data, separating it into predictor and target data, creating a split for train and test validations, using classifiers to train the data, creating a model, and measuring accuracy through predictions.', "Potential steps for improving accuracy through probability cutoff tweaking, feature importance algorithms, and creating new features. The chapter outlines potential steps for improving accuracy, including tweaking the probability cutoff, utilizing feature importance algorithms, and creating new features or adding more data to enhance the model's accuracy."]}, {'end': 4310.908, 'start': 4051.42, 'title': 'Dealing with missing values and sql recommendations', 'summary': "Covers strategies for handling missing values in a dataset with over 30% missing values, suggesting creating a new feature to identify missing values, removing irrelevant data, and using clustering or distribution to assign values to missing data. 
additionally, it explains the process of making sql recommendations based on friends' likes, focusing on table merging and excluding already liked pages.", 'duration': 259.488, 'highlights': ['Handling missing values in a dataset Suggests creating a new feature to identify missing values, removing irrelevant data, and using clustering or distribution to assign values to missing data, aiming to improve model accuracy and performance.', "Making SQL recommendations based on friends' likes Explains the process of making SQL recommendations based on friends' likes, focusing on table merging and excluding already liked pages to provide effective recommendations from a database point of view."]}], 'duration': 520.228, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks3790680.jpg', 'highlights': ['Guidance on writing a basic machine learning program to check the accuracy of a dataset using any classifier.', 'Importance of data splitting, model selection, and accuracy measurement.', 'Potential steps for improving accuracy through probability cutoff tweaking, feature importance algorithms, and creating new features.', 'Emphasis on the importance of performance parameters and using performance metrics on the test set.', 'Handling missing values in a dataset to improve model accuracy and performance.', "Making SQL recommendations based on friends' likes to provide effective recommendations from a database point of view."]}, {'end': 4992.553, 'segs': [{'end': 4360.64, 'src': 'embed', 'start': 4333.997, 'weight': 0, 'content': [{'end': 4340.399, 'text': "and, in follow-up, what is the probability of making money from this game if you're playing it for six times?", 'start': 4333.997, 'duration': 6.402}, {'end': 4353.695, 'text': 'so the first condition says if the sum of the values on the dice equals 7, then you win 21, but for all other cases you have to pay 5.', 'start': 4340.399, 'duration': 13.296}, {'end': 4360.64, 
'text': 'So in this case, if we first assume all the possible cases, as we have fair six-sided dice and we have two of them', 'start': 4353.695, 'duration': 6.945}], 'summary': 'The probability of making money from the game over 6 plays is discussed: you win 21 when the sum of the dice equals 7 and lose 5 in all other cases.', 'duration': 26.643, 'max_score': 4333.997, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks4333997.jpg'}, {'end': 4562.498, 'src': 'embed', 'start': 4535.811, 'weight': 1, 'content': [{'end': 4545.503, 'text': 'So in this question, if we look at the first part, which says for each option, what is the expected number of ads shown in hundred new stories?', 'start': 4535.811, 'duration': 9.692}, {'end': 4547.946, 'text': "So let's check it for the first one.", 'start': 4546.023, 'duration': 1.923}, {'end': 4553.753, 'text': 'So for the first one out of every 25 stories one will be an ad.', 'start': 4548.066, 'duration': 5.687}, {'end': 4557.694, 'text': 'So this is like one out of 25.', 'start': 4554.312, 'duration': 3.382}, {'end': 4562.498, 'text': 'In the second case, every story has a 4% chance of being an ad.', 'start': 4557.694, 'duration': 4.804}], 'summary': 'Expected number of ads: 1 in 25 stories, 4% chance for each story.', 'duration': 26.687, 'max_score': 4535.811, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks4535811.jpg'}, {'end': 4610.712, 'src': 'embed', 'start': 4585.489, 'weight': 2, 'content': [{'end': 4594.602, 'text': 'for the second question, the question asks what is the chance a user will be shown only a single ad in 100 stories?', 'start': 4585.489, 'duration': 9.113}, {'end': 4600.168, 'text': 'If you see this question, it is an example of binomial distribution.', 'start': 4595.202, 'duration': 4.966}, {'end': 4605.274, 'text': 'So as we just saw in the question three, binomial 
distribution takes three parameters.', 'start': 4600.708, 'duration': 4.566}, {'end': 4610.712, 'text': 'First is the probability of success and failure, which in our case is 4%.', 'start': 4605.654, 'duration': 5.058}], 'summary': 'The chance of a user being shown exactly one ad in 100 stories follows a binomial distribution with a 4% per-story ad probability.', 'duration': 25.223, 'max_score': 4585.489, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks4585489.jpg'}, {'end': 4813.455, 'src': 'embed', 'start': 4788.951, 'weight': 3, 'content': [{'end': 4796.632, 'text': 'All these variables would help you to understand if in the next month again that customer is going to buy or not.', 'start': 4788.951, 'duration': 7.681}, {'end': 4801.953, 'text': 'So if there are kids in the home, there are chances that because of kids, whenever there are kids,', 'start': 4796.892, 'duration': 5.061}, {'end': 4807.594, 'text': 'they will look for the TV and kids will always help you in understanding if there will be subscription or not.', 'start': 4801.953, 'duration': 5.641}, {'end': 4813.455, 'text': 'And if the subscription has either increased or decreased from the previous month,', 'start': 4807.994, 'duration': 5.461}], 'summary': 'Variables help predict customer purchases. Kids in the home influence TV subscription.', 'duration': 24.504, 'max_score': 4788.951, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks4788951.jpg'}, {'end': 4859.478, 'src': 'embed', 'start': 4831.566, 'weight': 4, 'content': [{'end': 4834.109, 'text': 'Okay, this customer is going to subscribe next month.', 'start': 4831.566, 'duration': 2.543}, {'end': 4836.471, 'text': 'This customer is not going to subscribe.', 'start': 4834.529, 'duration': 1.942}, {'end': 4839.595, 'text': 'So this is a pure classification related problem.', 'start': 4836.912, 'duration': 2.683}, {'end': 4842.827, 'text': 'Would you build predictive models? 
Yes.', 'start': 4840.805, 'duration': 2.022}, {'end': 4846.949, 'text': 'So before the classification related problem, we would like to build a predictive model.', 'start': 4842.887, 'duration': 4.062}, {'end': 4849.491, 'text': 'So we would already have historical data.', 'start': 4847.21, 'duration': 2.281}, {'end': 4856.316, 'text': 'We will gather the data which is already existing and use that to train our model so we can use any algorithm.', 'start': 4849.591, 'duration': 6.725}, {'end': 4859.478, 'text': 'So we can use any classification related algorithms in this case.', 'start': 4856.356, 'duration': 3.122}], 'summary': 'A predictive model will be built for customer subscription using existing historical data and classification algorithms.', 'duration': 27.912, 'max_score': 4831.566, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks4831566.jpg'}, {'end': 4915.501, 'src': 'embed', 'start': 4891.087, 'weight': 5, 'content': [{'end': 4898.474, 'text': "it will even go that in production and within production you will also validate again and see in real life how it's working,", 'start': 4891.087, 'duration': 7.387}, {'end': 4902.839, 'text': 'make again changes to the model, so those things may change again,', 'start': 4898.474, 'duration': 4.365}, {'end': 4907.764, 'text': 'as the interviewer may keep on asking you the further things and you have to be ready for those things.', 'start': 4902.839, 'duration': 4.925}, {'end': 4915.501, 'text': "It didn't start with save questions, but as you keep on solving the problem, the interviewer will keep on digging further, asking what you would do next.", 'start': 4908.357, 'duration': 7.144}], 'summary': 'In production, validate and make changes to the model based on real-life performance and interviewer feedback.', 'duration': 24.414, 'max_score': 4891.087, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks4891087.jpg'}, {'end': 4988.109, 'src': 'embed', 'start': 4956.668, 'weight': 6, 'content': [{'end': 4960.81, 'text': 'first you try to gather as to which components these things are related.', 'start': 4956.668, 'duration': 4.142}, {'end': 4969.335, 'text': 'So, for example, if you are trying to understand this nicknames from a Twitter tweets, try to see who is trying to refer to who and, based on this,', 'start': 4961.11, 'duration': 8.225}, {'end': 4976.999, 'text': 'nicknames, see the relation as which person is trying to talk with which person and you can try to identify, based on the NLP algorithms,', 'start': 4969.335, 'duration': 7.664}, {'end': 4979.18, 'text': 'as what is the real names for those people.', 'start': 4976.999, 'duration': 2.181}, {'end': 4980.462, 'text': 'So. similarly,', 'start': 4979.801, 'duration': 0.661}, {'end': 4988.109, 'text': 'you can try to identify in the Facebook or if you have some customer feedbacks and within those customer feedbacks you want to understand this.', 'start': 4980.462, 'duration': 7.647}], 'summary': 'Use nlp algorithms to analyze social media mentions for identifying real names and relationships.', 'duration': 31.441, 'max_score': 4956.668, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks4956668.jpg'}], 'start': 4311.548, 'title': 'Probability, distribution, and predictive modeling', 'summary': 'Covers probability analysis for a dice game, binomial distribution application, and predictive modeling for dish tv subscription renewal based on household activity, age demographics, channel usage, and trend analysis, with a focus on algorithm validation and nlp algorithm usage.', 'chapters': [{'end': 4747.727, 'start': 4311.548, 'title': 'Probability and distribution analysis', 'summary': 'Discusses the probability of winning in a dice game, the expected number of ads 
in stories, and the use of binomial distribution in probability calculations, with key points being the unfavorable odds of the dice game and the application of binomial distribution to calculate the probability of events.', 'duration': 436.179, 'highlights': ['The probability of winning in the dice game is one in six, resulting in a loss when playing for six games, making it unfavorable to play.', 'Both options of serving ads have an expected number of ads in 100 stories of 1 in 25 or 4%, indicating equivalent outcomes for the two options.', 'The binomial distribution is used to calculate the probability of events such as the chance of a user being shown only a single ad in 100 stories, demonstrating its application in probability calculations.']}, {'end': 4992.553, 'start': 4748.267, 'title': 'Predicting dish tv subscription renewal', 'summary': 'Discusses predicting dish tv subscription renewal based on household activity, number of kids and adults, channel usage, and trend analysis. 
It outlines using classification analysis and building predictive models using historical data, validating the algorithm, and mapping nicknames to real names using NLP algorithms.', 'duration': 244.286, 'highlights': ['Predicting subscription renewal based on household activity, number of kids and adults, channel usage, and trend analysis: The variables for predicting subscription renewal include household activity, number of kids and adults, channel usage, and trend analysis.', 'Using classification analysis and building predictive models using historical data: The approach involves using classification analysis and building predictive models using historical data.', 'Validating the algorithm and making real-life adjustments: The process includes validating the algorithm and making real-life adjustments based on production performance.', 'Mapping nicknames to real names using NLP algorithms: The approach for mapping nicknames to real names involves using NLP algorithms to analyze social media or customer feedback data.']}], 'duration': 681.005, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks4311548.jpg', 'highlights': ['The probability of winning in the dice game is one in six, resulting in a loss when playing for six games, making it unfavorable to play.', 'Both ad-serving options give the same expected number of ads, 4 per 100 stories (1 in 25, or 4%), so the two options are equivalent in expectation.', 'The binomial distribution is used to calculate the probability of events such as the chance of a user being shown only a single ad in 100 stories, demonstrating its application in probability calculations.', 'Predicting subscription renewal based on household activity, number of kids and adults, channel usage, and trend analysis: The variables for predicting subscription renewal include household activity, number of kids and adults, channel usage, and trend analysis.', 'Using classification analysis and building 
predictive models using historical data: The approach involves using classification analysis and building predictive models using historical data.', 'Validating the algorithm and making real-life adjustments: The process includes validating the algorithm and making real-life adjustments based on production performance.', 'Mapping nicknames to real names using NLP algorithms: The approach for mapping nicknames to real names involves using NLP algorithms to analyze social media or customer feedback data.']}, {'end': 6263.452, 'segs': [{'end': 5245.887, 'src': 'embed', 'start': 5215.341, 'weight': 0, 'content': [{'end': 5227.138, 'text': 'we have to create a combined probability which will include the chances of selecting the coin from the fair coins x 0.5 plus the chances of selecting the coin,', 'start': 5215.341, 'duration': 11.797}, {'end': 5237.423, 'text': 'given it is the double-headed x 1, and this would create a combined probability and give you the output as 0.7531.', 'start': 5227.138, 'duration': 10.285}, {'end': 5245.887, 'text': 'So for this case, we have to go step by step, divide the problem into smaller parts, and get to the answer.', 'start': 5237.423, 'duration': 8.464}], 'summary': 'Create combined probability using fair and double-headed coins to get 0.7531.', 'duration': 30.546, 'max_score': 5215.341, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks5215341.jpg'}, {'end': 5613.678, 'src': 'embed', 'start': 5584.352, 'weight': 1, 'content': [{'end': 5590.656, 'text': "So that's why time series regressions would be more accurate than the decision trees,", 'start': 5584.352, 'duration': 6.304}, {'end': 5594.82, 'text': 'because we have a linearity available within the time series data.', 'start': 5590.656, 'duration': 4.164}, {'end': 5599.925, 'text': 'Suppose you found that your model is suffering from low bias and high variance.', 'start': 5594.82, 'duration': 5.105}, {'end': 5603.808, 
'text': 'Which algorithm do you think could tackle this situation, and why?', 'start': 5599.925, 'duration': 3.883}, {'end': 5605.95, 'text': 'So first thing, low bias and high variance.', 'start': 5603.808, 'duration': 2.142}, {'end': 5613.678, 'text': "So bias and variance are the terms mostly used when... so bias basically means you're getting biased to a specific set.", 'start': 5605.95, 'duration': 7.728}], 'summary': 'Time series regressions more accurate than decision trees due to linearity. Low bias and high variance can be addressed by algorithm selection.', 'duration': 29.326, 'max_score': 5584.352, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks5584352.jpg'}, {'end': 5654.111, 'src': 'embed', 'start': 5625.828, 'weight': 2, 'content': [{'end': 5629.389, 'text': 'So in our case, it says there is low bias and there is a high variance.', 'start': 5625.828, 'duration': 3.561}, {'end': 5632.491, 'text': 'So our model is overfitting.', 'start': 5630.09, 'duration': 2.401}, {'end': 5636.843, 'text': 'So first thing is we can use some kind of bagging algorithm.', 'start': 5633.422, 'duration': 3.421}, {'end': 5642.626, 'text': 'So what these bagging algorithms do is they try to divide your data into multiple samples.', 'start': 5637.244, 'duration': 5.382}, {'end': 5647.728, 'text': 'So those are samples drawn with replacement, and a decision tree is trained on each sample.', 'start': 5642.966, 'duration': 4.762}, {'end': 5648.748, 'text': 'So now what happens?', 'start': 5647.948, 'duration': 0.8}, {'end': 5654.111, 'text': 'We are creating multiple trees which are trying to understand different patterns within the data set.', 'start': 5648.748, 'duration': 5.363}], 'summary': 'Model has low bias and high variance, suggesting overfitting. 
Bagging algorithms divide data into multiple samples for training.', 'duration': 28.283, 'max_score': 5625.828, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks5625828.jpg'}, {'end': 5710.95, 'src': 'embed', 'start': 5682.276, 'weight': 3, 'content': [{'end': 5685.998, 'text': 'The other technique that we have is regularization.', 'start': 5682.276, 'duration': 3.722}, {'end': 5686.939, 'text': 'So regularization.', 'start': 5686.018, 'duration': 0.921}, {'end': 5692.483, 'text': 'There are two types of regularization, that is, L1 regularization and L2 regularization.', 'start': 5686.939, 'duration': 5.544}, {'end': 5695.364, 'text': 'We can use those techniques to penalize the model.', 'start': 5692.483, 'duration': 2.881}, {'end': 5704.03, 'text': 'So what happens is, whenever your model goes for a higher variance, they penalize it by some parameters that we provide.', 'start': 5695.404, 'duration': 8.626}, {'end': 5710.95, 'text': 'So based on those parameters, they restrict your model from going beyond the threshold of overfitting.', 'start': 5704.366, 'duration': 6.584}], 'summary': 'Regularization techniques like L1 and L2 penalize the model for overfitting.', 'duration': 28.674, 'max_score': 5682.276, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks5682276.jpg'}, {'end': 6203.495, 'src': 'embed', 'start': 6178.39, 'weight': 4, 'content': [{'end': 6185.275, 'text': 'So even Netflix uses content-based filtering for recommending products based on the history of what the user has done,', 'start': 6178.39, 'duration': 6.885}, {'end': 6193.862, 'text': "but recommendations are also made based on collaborative filtering, that is, what other similar users have done and what additional things they have done.", 'start': 6185.275, 'duration': 8.587}, {'end': 6196.55, 'text': 
"So that's how recommendation systems work.", 'start': 6194.389, 'duration': 2.161}, {'end': 6199.552, 'text': 'Just to summarize what we have done today.', 'start': 6197.771, 'duration': 1.781}, {'end': 6203.495, 'text': 'So basically we have started with the three types of interview questions.', 'start': 6199.733, 'duration': 3.762}], 'summary': 'Netflix uses content-based and collaborative filtering for recommendations, based on user history and similar user actions.', 'duration': 25.105, 'max_score': 6178.39, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks6178390.jpg'}], 'start': 4992.953, 'title': 'Probability, data handling, and model accuracy', 'summary': "Covers coin toss probability with a combined probability of 0.7531, handling missing data's impact on model accuracy, the superiority of time series regression, techniques to reduce variance, and an overview of recommendation systems used by major e-commerce companies and Netflix.", 'chapters': [{'end': 5245.887, 'start': 4992.953, 'title': 'Coin toss probability', 'summary': 'Discusses the probability of getting 10 heads in a row when tossing a coin, selecting between fair and double-headed coins, and calculating the probability of the next toss being a head, resulting in a combined probability of 0.7531.', 'duration': 252.934, 'highlights': ['The combined probability of selecting a fair coin and getting 10 heads is 0.999 x 0.5^10, that is, 0.999 x (1/1024). This highlights the calculation of the probability of selecting a fair coin and getting 10 heads in a row, resulting in the combined probability.', 'The probability of selecting a double-headed coin and getting 10 heads is 0.001 x 1 = 0.001. This emphasizes the probability of selecting a double-headed coin and getting 10 heads in a row.', 'The probability that the next toss is a head, combining the fair-coin and double-headed-coin cases, is 0.7531. 
This showcases the calculation of the combined probability of selecting a fair coin or a double-headed coin, resulting in the final combined probability.']}, {'end': 5502.165, 'start': 5246.474, 'title': 'Handling missing data and model accuracy', 'summary': 'Explains how to calculate the percentage of unaffected data when missing values are spread along one standard deviation, and discusses the limitations of using accuracy as a performance metric for cancer detection models due to imbalanced data distributions.', 'duration': 255.691, 'highlights': ['Calculating Unaffected Data Percentage In the scenario where missing values are spread along one standard deviation, assuming a normal distribution, 32% of the data would remain unaffected, as 68% of the values are missing.', 'Limitations of Model Accuracy for Cancer Detection In cancer detection models, achieving high accuracy, such as 96%, may not reflect actual performance due to imbalanced data distributions, where the positive class (cancer cases) is very small, making accuracy an unreliable metric.']}, {'end': 5680.651, 'start': 5502.245, 'title': 'Accuracy in time series models', 'summary': 'Discusses the importance of accuracy in time series models, highlighting the superiority of time series regression over decision tree models due to linearity, and suggests using bagging algorithms to tackle low bias and high variance in the model.', 'duration': 178.406, 'highlights': ['Time series regression is more accurate than decision tree models due to linearity in time series data. Time series regression models can achieve higher accuracy than decision tree models in time series data due to the linear correlation with previous values, enabling regression based on historic values.', 'Bagging algorithms can tackle low bias and high variance in the model. 
Bagging algorithms can address low bias and high variance by dividing the data into multiple samples, training each sample with decision trees to create a strong learner that reduces overfitting and balances variance.', "Explanation of bias and variance in models. The terms 'bias' and 'variance' are explained, where bias indicates underfitting and variance indicates overfitting in the model, and the significance of addressing low bias and high variance is highlighted."]}, {'end': 6050.952, 'start': 5682.276, 'title': 'Techniques to reduce variance', 'summary': 'Discusses techniques to reduce variance including regularization, feature importance, and pca, emphasizing the significance of reducing overfitting and the impact of model changes on r square and overfitting in random forest models.', 'duration': 368.676, 'highlights': ['The chapter discusses the use of regularization techniques, including L1 and L2 regularization, to penalize the model and restrict it from overfitting. Regularization techniques penalize the model for higher variance and restrict it from overfitting by using L1 and L2 regularization.', 'The importance of feature selection using techniques like random forest to identify and utilize important features for creating a well-generalized model is emphasized. Emphasizes the use of feature importance techniques to select top features for creating a well-generalized model.', 'The significance of PCA in not only reducing multicollinearity but also in reducing the number of features based on variance, even when there is no multicollinearity, is explained. Explains that PCA not only reduces multicollinearity but also reduces the number of features based on variance, even in the absence of multicollinearity.', 'The impact of removing the intercept term on the R square value, resulting in a significant increase in R square, is discussed, highlighting the difference in nature between models with and without the intercept. 
Discusses the impact of removing the intercept term on the R square value, emphasizing the difference in nature between models with and without the intercept.', 'The concept of overfitting in random forest models and its impact on training and validation error, explaining the phenomenon of achieving perfect accuracy in training data but high error in validation data, is elaborated. Elaborates on the phenomenon of achieving perfect accuracy in training data but high error in validation data, indicating the presence of overfitting.']}, {'end': 6263.452, 'start': 6051.952, 'title': 'Recommendation systems overview', 'summary': 'Explains the two types of recommendation systems, collaborative filtering and content-based filtering, and how major e-commerce companies and netflix use these techniques to recommend products to users, along with a summary of the three types of interview questions covered in the session.', 'duration': 211.5, 'highlights': ['Major e-commerce companies and Netflix use collaborative filtering and content-based filtering to recommend products to users. Major e-commerce companies and Netflix use collaborative filtering and content-based filtering for product recommendations.', 'Collaborative filtering involves recommending products based on the activities of users with similar behavior and recommending additional items that the user has not interacted with. Collaborative filtering recommends products based on similar user activities and suggests additional items the user has not interacted with.', "Content-based filtering recommends products based on the user's past ratings and viewing history, creating new recommendations accordingly. Content-based filtering recommends products based on user's past ratings and viewing history to create new recommendations.", 'The session covered three types of interview questions: theoretical machine learning components, Python-related questions on machine learning, and scenario-based questions. 
The session covered three types of interview questions: theoretical machine learning components, Python-related questions, and scenario-based questions.']}], 'duration': 1270.499, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t6gOpFLt-Ks/pics/t6gOpFLt-Ks4992953.jpg', 'highlights': ['The combined probability of selecting a fair coin or a double-headed coin is 0.7531, showcasing the calculation of the combined probability.', 'Time series regression is more accurate than decision tree models due to linearity in time series data, achieving higher accuracy.', 'Bagging algorithms can address low bias and high variance by creating a strong learner that reduces overfitting and balances variance.', 'Regularization techniques penalize the model for higher variance and restrict it from overfitting using L1 and L2 regularization.', 'Major e-commerce companies and Netflix use collaborative filtering and content-based filtering for product recommendations.']}], 'highlights': ['Various job titles such as data scientists, machine learning engineers, deep learning engineers, and data analysts are used by companies to attract talent, but the key focus should be on understanding the job descriptions and being satisfied with the roles during the interview process.', 'The interview session is divided into three broad components: machine learning core interview questions, technical questions related to Python in machine learning, and scenario-based questions testing the ability to solve real-world problems using machine learning.', 'Reinforcement learning is exemplified through the Mario game, where the agent takes predefined actions and is rewarded or penalized based on the outcomes, contributing to the creation of a model that learns the best actions for maximizing scores.', 'The chapter provides a real-world example illustrating true positive, false positive, false negative, and true negative scenarios, emphasizing the importance of these concepts in 
evaluating model performance.', 'Understanding the differences between model accuracy and model performance is crucial for demonstrating knowledge in an interview setting.', 'Overfitting occurs when a model closely fits the training data, leading to poor generalization to new testing data, and can be prevented by collecting more data, using ensemble methods, choosing simpler models, and adding regularization.', "A/B testing compares different models' performance using statistical hypothesis testing like t-test.", 'Guidance on writing a basic machine learning program to check the accuracy of a dataset using any classifier.', 'The probability of winning in the dice game is one in six, resulting in a loss when playing for six games, making it unfavorable to play.', 'Predicting subscription renewal based on household activity, number of kids and adults, channel usage, and trend analysis The variables for predicting subscription renewal include household activity, number of kids and adults, channel usage, and trend analysis.', 'The combined probability of selecting a fair coin or a double-headed coin is 0.7531, showcasing the calculation of the combined probability.', 'Time series regression is more accurate than decision tree models due to linearity in time series data, achieving higher accuracy.']}