title

Data Science Interview Questions | Data Science Tutorial | Data Science Interviews | Edureka

description

Data Science Training: https://www.edureka.co/data-science-r-programming-certification-course
This Data Science Interview Questions and Answers video will help you prepare for Data Science and Big Data Analytics interviews. It is ideal for both beginners and professionals who want to learn or brush up on concepts in Data Science, Big Data Analytics and Machine Learning. Below are the topics covered in this tutorial:
1. Data Science Job Trends
2. Data Science Interview Questions
A. Statistics Questions
B. Data Analytics Questions
C. Machine Learning Questions
D. Probability Questions
3. Conclusion
Subscribe to our channel to get video updates. Hit the subscribe button above.
Check our complete Data Science playlist here: https://goo.gl/60NJJS
#DataScienceInterviewQuestions #BigDataAnalytics #DataScienceTutorial #DataScienceTraining #Datascience #Edureka
How it Works?
1. There will be 30 hours of instructor-led interactive online classes, 40 hours of assignments and 20 hours of project work.
2. We have 24x7 One-on-One LIVE Technical Support to help you with any problems you might face or any clarifications you may require during the course.
3. You will get Lifetime Access to the recordings in the LMS.
4. At the end of the training you will complete a project, based on which we will provide you with a Verifiable Certificate!
- - - - - - - - - - - - - -
About the Course
Edureka's Data Science course covers the whole data life cycle: Data Acquisition and Data Storage using R-Hadoop concepts, modelling through R programming using Machine Learning algorithms, and Data Visualization using R's capabilities.
- - - - - - - - - - - - - -
Why Learn Data Science?
Data Science training certifies you in "in demand" Big Data Technologies to help you grab a top-paying Data Science job title with Big Data skills and expertise in R programming, Machine Learning and the Hadoop framework.
After the completion of the Data Science course, you should be able to:
1. Gain insight into the 'Roles' played by a Data Scientist
2. Analyse Big Data using R, Hadoop and Machine Learning
3. Understand the Data Analysis Life Cycle
4. Work with different data formats such as XML, CSV, SAS and SPSS
5. Learn tools and techniques for data transformation
6. Understand Data Mining techniques and their implementation
7. Analyse data using machine learning algorithms in R
8. Work with Hadoop Mappers and Reducers to analyze data
9. Implement various Machine Learning Algorithms in Apache Mahout
10. Gain insight into data visualization and optimization techniques
11. Explore the parallel processing feature in R
- - - - - - - - - - - - - -
Who should go for this course?
The course is designed for all those who want to learn machine learning techniques with implementation in the R language, and wish to apply these techniques to Big Data. The following professionals can go for this course:
1. Developers aspiring to be a 'Data Scientist'
2. Analytics Managers who are leading a team of analysts
3. SAS/SPSS Professionals looking to gain understanding in Big Data Analytics
4. Business Analysts who want to understand Machine Learning (ML) Techniques
5. Information Architects who want to gain expertise in Predictive Analytics
6. 'R' professionals who want to capture and analyze Big Data
7. Hadoop Professionals who want to learn R and ML techniques
8. Analysts wanting to understand Data Science methodologies
For more information, please write to us at sales@edureka.co or call us at IND: 9606058406 / US: 18338555775 (toll free).
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Customer Reviews:
Gnana Sekhar Vangara, Technology Lead at WellsFargo.com, says, "Edureka Data science course provided me a very good mixture of theoretical and practical training. The training course helped me in all areas that I was previously unclear about, especially concepts like Machine learning and Mahout. The training was very informative and practical. LMS pre-recorded sessions and assignments were very good as there is a lot of information in them that will help me in my job. The trainer was able to explain difficult-to-understand subjects in simple terms. Edureka is my teaching GURU now...Thanks EDUREKA and all the best."

detail

{'title': 'Data Science Interview Questions | Data Science Tutorial | Data Science Interviews | Edureka', 'heatmap': [{'end': 1098.5, 'start': 1032.858, 'weight': 0.789}, {'end': 1350.066, 'start': 1290.636, 'weight': 1}, {'end': 1839.2, 'start': 1782.227, 'weight': 0.705}, {'end': 1988.407, 'start': 1928.353, 'weight': 0.817}, {'end': 4031.686, 'start': 3974.537, 'weight': 0.816}], 'summary': 'This tutorial provides insights on data science interview questions, fundamentals, and best practices, emphasizing the significance of python, handling selection bias, and the impact of false positives and negatives, with real-world applications in industries, and covering probability principles and applications.', 'chapters': [{'end': 60.568, 'segs': [{'end': 60.568, 'src': 'embed', 'start': 41.74, 'weight': 0, 'content': [{'end': 53.107, 'text': "we'll make the questions a bit more complex as we move on and we will try to answer which kind of skill sets people normally look at when you appear for a data science interview in any company.", 'start': 41.74, 'duration': 11.367}, {'end': 55.528, 'text': 'So there are, like, various roles.', 'start': 53.887, 'duration': 1.641}, {'end': 60.568, 'text': 'depending on that, the complexity and the variety in the questions might change,', 'start': 55.528, 'duration': 5.04}], 'summary': 'Data science interview questions become more complex with different roles and skill sets.', 'duration': 18.828, 'max_score': 41.74, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY41740.jpg'}], 'start': 0.029, 'title': 'Data science interview prep', 'summary': "Provides insights on fundamental data science interview questions and skill sets, based on the speaker's 7-8 years of experience in the industry, including work at rekit, benkiser, snapdeal, hike messenger, and probito.", 'chapters': [{'end': 60.568, 'start': 0.029, 'title': 'Data science interview prep', 'summary': "Provides insights on 
fundamental data science interview questions and skill sets, based on the speaker's 7-8 years of experience in the industry, including work at rekit, benkiser, snapdeal, hike messenger, and probito, where r&d driven data science solutions are developed for complex industry problems.", 'duration': 60.539, 'highlights': ['The speaker has close to seven, eight years of experience in data science, working at companies like Rekit, Benkiser, Snapdeal, Hike Messenger, and Probito, where R&D driven data science solutions are developed for complex industry problems.', 'The session will focus on fundamental questions and thought processes for data science interviews, gradually increasing the complexity of questions as it progresses.', 'Insights will be provided on the skill sets typically sought after for data science roles in various companies.']}], 'duration': 60.539, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY29.jpg', 'highlights': ['Insights on fundamental data science interview questions and skill sets based on 7-8 years of industry experience.', 'Focus on fundamental questions and thought processes for data science interviews, increasing complexity gradually.', 'Insights on skill sets sought after for data science roles in various companies.']}, {'end': 486.245, 'segs': [{'end': 172.875, 'src': 'embed', 'start': 149.423, 'weight': 5, 'content': [{'end': 158.63, 'text': 'like this is more than sufficient motivation that getting into data science would definitely land you in some really good professional career.', 'start': 149.423, 'duration': 9.207}, {'end': 161.632, 'text': "okay, so let's start with the questions here directly.", 'start': 158.63, 'duration': 3.002}, {'end': 166.092, 'text': "We'll, in the beginning, start to focus on some of the fundamental questions,", 'start': 162.25, 'duration': 3.842}, {'end': 172.875, 'text': 'which is more for you to understand what is data science than like a 
particular interviewer asking you that question.', 'start': 166.092, 'duration': 6.783}], 'summary': 'Data science offers promising career prospects.', 'duration': 23.452, 'max_score': 149.423, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY149423.jpg'}, {'end': 301.269, 'src': 'embed', 'start': 270.358, 'weight': 0, 'content': [{'end': 275.02, 'text': 'if you have worked for retail, you know how the business process in retail works, right.', 'start': 270.358, 'duration': 4.662}, {'end': 282.82, 'text': 'so people often also ask From the technology end, do we need any sort of experiences in Python language, right?', 'start': 275.02, 'duration': 7.8}, {'end': 286.141, 'text': 'Python or like, for example, R programming as well.', 'start': 283.4, 'duration': 2.741}, {'end': 295.646, 'text': 'So Python is one of the most looked out for kind of programming skills, particularly when you want to build solutions in data science domain.', 'start': 287.062, 'duration': 8.584}, {'end': 301.269, 'text': 'And with availability of libraries like NumPy Pandas,', 'start': 296.507, 'duration': 4.762}], 'summary': 'Python is a highly sought-after programming skill, especially in data science. 
numpy and pandas libraries are widely used.', 'duration': 30.911, 'max_score': 270.358, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY270358.jpg'}], 'start': 60.568, 'title': 'Data science fundamentals and opportunities', 'summary': 'Covers fundamental aspects of data science, including statistics, data analytics, machine learning, and probability, and explores the growing opportunities in data science careers, emphasizing the significance of python and addressing the challenge of selection bias in data analysis.', 'chapters': [{'end': 99.472, 'start': 60.568, 'title': 'Fundamentals of data science interview', 'summary': 'Discusses the fundamental aspects of data science, covering statistics, data analytics, machine learning, and probability to provide a common ground for understanding the requirements of data science.', 'duration': 38.904, 'highlights': ['The session aims to provide a fundamental understanding of data science regardless of the specific role being applied for.', 'The discussion will cover statistics, data analytics, machine learning, and probability, providing a structured approach to addressing questions in these areas.', 'The focus will be on preparing individuals with a common ground of what data science requires, irrespective of the specific role they are applying for.']}, {'end': 486.245, 'start': 100.013, 'title': 'Data science: opportunities and requirements', 'summary': 'Explores the growing opportunities in data science, showcasing the potential for professional careers and the essential skills and technologies required, emphasizing the significance of python in data science solutions and addressing the challenge of selection bias in data analysis.', 'duration': 386.232, 'highlights': ['Data science presents growing career opportunities with numerous job openings worldwide, leveraging the enormous volume, velocity, and variety of data assets in various industries, with potential 
for significant return on investment. Data science offers abundant career prospects with global job openings, harnessing vast and diverse data assets for substantial return on investment.', "Python is a highly sought-after programming skill in data science, reinforced by libraries like NumPy and Pandas, enabling robust framework for designing data science solutions and deployment of production-grade solutions, providing a competitive edge for modeling tasks. Python's prominence in data science is bolstered by libraries like NumPy and Pandas, facilitating robust solution design and deployment of production-grade solutions, offering a competitive advantage in modeling tasks.", 'Selection bias is a challenge in data analysis, particularly when working with large datasets, requiring strategic filtering and randomized selection to ensure representative samples for comprehensive analysis. Dealing with selection bias in data analysis involves strategic filtering and randomized selection to ensure representative samples from large datasets for comprehensive analysis.']}], 'duration': 425.677, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY60568.jpg', 'highlights': ["Python's prominence in data science is bolstered by libraries like NumPy and Pandas, facilitating robust solution design and deployment of production-grade solutions, offering a competitive advantage in modeling tasks.", 'Data science presents growing career opportunities with numerous job openings worldwide, leveraging the enormous volume, velocity, and variety of data assets in various industries, with potential for significant return on investment.', 'The focus will be on preparing individuals with a common ground of what data science requires, irrespective of the specific role they are applying for.', 'The session aims to provide a fundamental understanding of data science regardless of the specific role being applied for.', 'The discussion will 
cover statistics, data analytics, machine learning, and probability, providing a structured approach to addressing questions in these areas.', 'Selection bias is a challenge in data analysis, particularly when working with large datasets, requiring strategic filtering and randomized selection to ensure representative samples for comprehensive analysis.']}, {'end': 1533.753, 'segs': [{'end': 626.362, 'src': 'embed', 'start': 600.034, 'weight': 2, 'content': [{'end': 607.696, 'text': 'right by that I will bring in these two columns as one column by calling that as an attribute and put the values in one column.', 'start': 600.034, 'duration': 7.662}, {'end': 610.857, 'text': 'So this sort of format is called the long one.', 'start': 607.996, 'duration': 2.861}, {'end': 619.7, 'text': 'So what really happens is, instead of having two separate columns for two of your attributes, you put both of those into one column,', 'start': 610.877, 'duration': 8.823}, {'end': 626.362, 'text': 'and by doing that, there are a lot of benefits with respect to the task you have in your hand, particularly in data visualization.', 'start': 619.7, 'duration': 6.662}], 'summary': 'Combining two columns into one improves data visualization.', 'duration': 26.328, 'max_score': 600.034, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY600034.jpg'}, {'end': 851.256, 'src': 'embed', 'start': 825.351, 'weight': 3, 'content': [{'end': 831.214, 'text': 'And the moment we understand that something follows a normal distribution, all these properties of that distribution is like revealed.', 'start': 825.351, 'duration': 5.863}, {'end': 836.324, 'text': "So that's sort of the importance of doing any analysis around the distribution of data.", 'start': 831.82, 'duration': 4.504}, {'end': 838.665, 'text': 'And normal distribution like very common one.', 'start': 836.944, 'duration': 1.721}, {'end': 845.811, 'text': 'And in many statistical techniques 
and even model building exercises, if you have anything in a normal distribution,', 'start': 839.066, 'duration': 6.745}, {'end': 851.256, 'text': 'many other possibilities of applying certain modeling techniques comes out evidently there.', 'start': 845.811, 'duration': 5.445}], 'summary': 'Understanding normal distribution is important for data analysis and modeling techniques.', 'duration': 25.905, 'max_score': 825.351, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY825351.jpg'}, {'end': 1098.5, 'src': 'heatmap', 'start': 1032.858, 'weight': 0.789, 'content': [{'end': 1043.952, 'text': 'people normally come out with sort of saying what should be the sample size of my users whom I should be getting to participate in my AB testing framework?', 'start': 1032.858, 'duration': 11.094}, {'end': 1045.655, 'text': 'or also, when you are building some models,', 'start': 1043.952, 'duration': 1.703}, {'end': 1054.648, 'text': 'you might see that there are certain statistical measures which has to be evaluated by the end of model building exercise.', 'start': 1046.383, 'duration': 8.265}, {'end': 1060.992, 'text': "Like if you're building a machine learning model, let's say, and you want to see if those metrics on which I'm evaluating are really good or not.", 'start': 1054.888, 'duration': 6.104}, {'end': 1070.978, 'text': "So, in that sensitivity is one of those methods, or like metrics, which we normally evaluate, and I'm going to show you some,", 'start': 1062.594, 'duration': 8.384}, {'end': 1075.26, 'text': 'something that we normally refer by the name confusion matrix.', 'start': 1070.978, 'duration': 4.282}, {'end': 1080.162, 'text': "So I'll spend some time explaining this and then come to what we mean by like sensitivity.", 'start': 1075.98, 'duration': 4.182}, {'end': 1083.403, 'text': "So let's say you are building a model right,", 'start': 1080.882, 'duration': 2.521}, {'end': 1089.946, 'text': 'a model 
for predicting whether a particular customer is going to purchase from my platform within one month or not.', 'start': 1083.403, 'duration': 6.543}, {'end': 1098.5, 'text': 'So very simple problem statement, which might include many variables that we would bring in, and then finally build this model, saying okay,', 'start': 1090.735, 'duration': 7.765}], 'summary': 'The transcript discusses evaluating sample size for ab testing and statistical measures in model building, including sensitivity and confusion matrix.', 'duration': 65.642, 'max_score': 1032.858, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY1032858.jpg'}, {'end': 1350.066, 'src': 'heatmap', 'start': 1290.636, 'weight': 1, 'content': [{'end': 1296.741, 'text': 'so sensitivity, help us to find that out, which is, in simple terms, is a ratio between the true positive.', 'start': 1290.636, 'duration': 6.105}, {'end': 1300.904, 'text': 'in the denominator we have all the cases of positive predictions.', 'start': 1296.741, 'duration': 4.163}, {'end': 1306.308, 'text': 'so imagine now, if the type 1 error is going to grow, my sensitivity is going to come down.', 'start': 1300.904, 'duration': 5.404}, {'end': 1311.012, 'text': 'so if my true positives are like very high, the sensitivity will also be high.', 'start': 1306.308, 'duration': 4.704}, {'end': 1314.094, 'text': 'so this is sort of what we call the statistical power.', 'start': 1311.012, 'duration': 3.082}, {'end': 1321.739, 'text': 'if this sensitivity is really good, I would say that my positive cases are predicted well,', 'start': 1314.094, 'duration': 7.645}, {'end': 1326.083, 'text': 'and the exact opposite of the sensitivity is what we know by specificity.', 'start': 1321.739, 'duration': 4.344}, {'end': 1331.947, 'text': 'So we need to make sure that in a very good machine modeling model sensitivity and specificity both are balanced.', 'start': 1326.903, 'duration': 5.044}, {'end': 
1341.424, 'text': 'right. so in very simple terms, this is what we like mean here the ratio of true positive by the total positive events there.', 'start': 1332.762, 'duration': 8.662}, {'end': 1350.066, 'text': 'and, as I mentioned, both of these, sensitivity and specificity, play in good role when you want to evaluate a models output right.', 'start': 1341.424, 'duration': 8.642}], 'summary': 'Sensitivity and specificity are crucial for evaluating model output, with a balance required for good machine modeling.', 'duration': 59.43, 'max_score': 1290.636, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY1290636.jpg'}, {'end': 1397.464, 'src': 'embed', 'start': 1365.512, 'weight': 0, 'content': [{'end': 1373.535, 'text': 'so in those cases we also come across some kind of issues like over fitting and under fitting given machine learning model right.', 'start': 1365.512, 'duration': 8.023}, {'end': 1379.118, 'text': 'so these words are very common and the idea is depending on the complexity of your model.', 'start': 1373.535, 'duration': 5.583}, {'end': 1386.901, 'text': 'you might see that you want to adapt very sort of exactly to your data points or you might want to do a generalization.', 'start': 1379.118, 'duration': 7.783}, {'end': 1397.464, 'text': 'so, for instance here, if I have these red and blue dots here right and if I draw a curve like this which separates the red from the blue,', 'start': 1386.901, 'duration': 10.563}], 'summary': "The issues of overfitting and underfitting in machine learning models are common, depending on the model's complexity and the need for adaptation to data points or generalization.", 'duration': 31.952, 'max_score': 1365.512, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY1365512.jpg'}, {'end': 1517.503, 'src': 'embed', 'start': 1494.976, 'weight': 1, 'content': [{'end': 1503.679, 'text': 'as you would be like 
very aware of, like things like standard deviation averages, how to interpret median, how to interpret quartiles right,', 'start': 1494.976, 'duration': 8.703}, {'end': 1507.119, 'text': 'the first quartile, second quartile and so and so on.', 'start': 1503.679, 'duration': 3.44}, {'end': 1509.92, 'text': 'and how, what do you mean by percentiles right?', 'start': 1507.119, 'duration': 2.801}, {'end': 1511.66, 'text': 'these are some basic questions.', 'start': 1509.92, 'duration': 1.74}, {'end': 1517.503, 'text': 'a bit more complex in nature might be discussions around sensitivity over fitting under fitting.', 'start': 1511.66, 'duration': 5.843}], 'summary': 'Discusses basic statistics concepts like standard deviation, averages, median, quartiles, and percentiles, as well as more complex topics like sensitivity, overfitting, and underfitting.', 'duration': 22.527, 'max_score': 1494.976, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY1494976.jpg'}], 'start': 486.245, 'title': 'Handling selection bias, normal distribution, ab testing, and model evaluation', 'summary': 'Discusses the importance of handling selection bias in data analysis, the significance of normal distribution in data analysis, and the importance of ab testing in data analysis. 
it also covers the use of confusion matrix and sensitivity in model evaluation and concepts of overfitting and underfitting in machine learning models.', 'chapters': [{'end': 663.827, 'start': 486.245, 'title': 'Selection bias and data formats in data analysis', 'summary': 'Discusses the importance of handling selection bias in data analysis, using examples and techniques to minimize it, and also explains the significance of different data formats, emphasizing the benefits and common usage of long and wide formats in data visualization.', 'duration': 177.582, 'highlights': ['The chapter emphasizes the importance of handling selection bias in data analysis and discusses techniques like randomized selection and stratified sampling to minimize it. Selection bias is a crucial characteristic in sampling on a large population of data, and techniques like randomized selection and stratified sampling are employed to minimize this bias.', 'The chapter illustrates the significance of different data formats, particularly long and wide formats, and highlights the benefits of using the long format in data visualization, emphasizing its impact on building legends. The long and wide data formats are explained, with a focus on the benefits of using the long format for data visualization, particularly in building legends.', 'An example of the long format is provided, showcasing the transformation of separate columns for attributes into a single column, and the benefits of this format for certain data visualizations. 
An example of the long format is demonstrated, showing the transformation of separate attribute columns into a single column and the benefits it provides for specific data visualizations.']}, {'end': 983.779, 'start': 663.827, 'title': 'Understanding normal distribution', 'summary': "Highlights the significance of normal distribution in data analysis, explaining its symmetrical bell-shaped curve, properties like mean and standard deviation, its relevance in statistical techniques and machine learning, and its application in a/b testing for evaluating changes in a website's features.", 'duration': 319.952, 'highlights': ['Normal distribution is significant in data analysis, known for its symmetrical bell-shaped curve, properties like mean and standard deviation, and its relevance in statistical techniques and machine learning. Normal distribution provides a symmetrical bell-shaped curve, with properties such as mean and standard deviation, and is crucial in statistical techniques and machine learning applications.', "The concept of normal distribution is widely applied in A/B testing for evaluating changes in a website's features, helping in identifying risks and establishing confidence in new feature implementations. Normal distribution plays a key role in A/B testing, enabling the identification of risks and confidence in implementing new website features through randomized exposure of user groups to different versions.", 'Understanding normal distribution aids in analyzing variables like employee salaries, business sales, and customer interactions, revealing important properties of the distribution. 
The understanding of normal distribution facilitates the analysis of variables such as employee salaries, business sales, and customer interactions, unveiling crucial distribution properties.']}, {'end': 1533.753, 'start': 983.779, 'title': 'Ab testing and model evaluation', 'summary': 'Covers the importance of ab testing in data analysis, the use of confusion matrix and sensitivity in model evaluation, and the concepts of overfitting and underfitting in machine learning models.', 'duration': 549.974, 'highlights': ['The chapter emphasizes the importance of AB testing in data analysis, particularly for data analysts, and the significance of understanding the AB testing framework. AB testing is crucial for data analysis, especially for data analysts, as it helps in evaluating the impact of new features or changes. Understanding the AB testing framework is essential for roles involving data analysis.', 'The use of confusion matrix is explained, highlighting the significance of true positives and true negatives in evaluating machine learning models. The explanation of confusion matrix emphasizes the importance of true positives and true negatives in evaluating the accuracy and reliability of machine learning models.', 'The concept of sensitivity in model evaluation is introduced, emphasizing its role in assessing the predictive power of machine learning models. Sensitivity is highlighted as a crucial metric for assessing the predictive power of machine learning models, particularly in evaluating the accuracy of positive predictions.', 'The chapter discusses the concepts of overfitting and underfitting in machine learning models, emphasizing the need for a balanced approach to model complexity. 
The discussion on overfitting and underfitting stresses the importance of finding a balanced approach to model complexity, particularly in regression models, to avoid issues of overfitting or underfitting.']}], 'duration': 1047.508, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY486245.jpg', 'highlights': ['The chapter emphasizes the importance of handling selection bias in data analysis and discusses techniques like randomized selection and stratified sampling to minimize it.', 'Normal distribution is significant in data analysis, known for its symmetrical bell-shaped curve, properties like mean and standard deviation, and its relevance in statistical techniques and machine learning.', 'The chapter emphasizes the importance of AB testing in data analysis, particularly for data analysts, and the significance of understanding the AB testing framework.', 'The use of confusion matrix is explained, highlighting the significance of true positives and true negatives in evaluating machine learning models.', 'The concept of sensitivity in model evaluation is introduced, emphasizing its role in assessing the predictive power of machine learning models.', 'The chapter discusses the concepts of overfitting and underfitting in machine learning models, emphasizing the need for a balanced approach to model complexity.']}, {'end': 2251.13, 'segs': [{'end': 1695.463, 'src': 'embed', 'start': 1659.464, 'weight': 1, 'content': [{'end': 1666.126, 'text': 'so if this is sort of the data, this which is given to you, you might want to first look at the transactional data which is present in the system.', 'start': 1659.464, 'duration': 6.662}, {'end': 1670.248, 'text': 'then you might also want to go to outside of your network.', 'start': 1666.126, 'duration': 4.122}, {'end': 1675.73, 'text': 'maybe you might get the sentiments of your customers from social media platforms and so on.', 'start': 1670.248, 'duration': 5.482}, 
{'end': 1678.851, 'text': 'so there will be different sources of data that you will collect.', 'start': 1675.73, 'duration': 3.121}, {'end': 1688.875, 'text': 'but oftentimes collecting the data is not only the task right and not like only building a model or doing statistical analysis might like come very later in the stage.', 'start': 1678.851, 'duration': 10.024}, {'end': 1695.463, 'text': 'but what comes before that, after you have collected the data, is to make sure that the integrity of the data is maintained,', 'start': 1688.875, 'duration': 6.588}], 'summary': 'Analyze transactional data, gather customer sentiments from various sources, and ensure data integrity.', 'duration': 35.999, 'max_score': 1659.464, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY1659464.jpg'}, {'end': 1839.2, 'src': 'heatmap', 'start': 1782.227, 'weight': 0.705, 'content': [{'end': 1790.073, 'text': 'so often times this question comes up where you like ask to distinguish between this univariate, bivariate and multivariate analysis.', 'start': 1782.227, 'duration': 7.846}, {'end': 1791.854, 'text': 'and the idea is very simple.', 'start': 1790.073, 'duration': 1.781}, {'end': 1801.3, 'text': 'in any sort of analysis, it is not only one variable which kind of decides the end output of your analysis, but there are multiple factors involved.', 'start': 1791.854, 'duration': 9.446}, {'end': 1807.035, 'text': 'so when there are multiple factors involved, you might also want to look at things like correlation.', 'start': 1801.69, 'duration': 5.345}, {'end': 1808.156, 'text': 'there are multiple variables.', 'start': 1807.035, 'duration': 1.121}, {'end': 1810.979, 'text': 'you want to see if there is any correlation between these things.', 'start': 1808.156, 'duration': 2.823}, {'end': 1813.521, 'text': 'sales are going down, but because of what?', 'start': 1810.979, 'duration': 2.542}, {'end': 1816.964, 'text': 'is it because my 
sales representatives are not going to the market?', 'start': 1813.521, 'duration': 3.443}, {'end': 1818.526, 'text': 'or is it my products are bad?', 'start': 1816.964, 'duration': 1.562}, {'end': 1820.648, 'text': 'or is there some other reasons?', 'start': 1819.187, 'duration': 1.461}, {'end': 1828.493, 'text': 'So, with all the variables in one place, you might want to go and dig deeper to see if there is any relationships coming in the variables or not.', 'start': 1821.048, 'duration': 7.445}, {'end': 1835.238, 'text': 'And when we collectively get all these variables together and do some sort of coherent analysis around the problem,', 'start': 1828.893, 'duration': 6.345}, {'end': 1839.2, 'text': 'you come out with a really crisp answers to what you are trying to analyze.', 'start': 1835.238, 'duration': 3.962}], 'summary': 'Analyzing multiple variables reveals correlations and provides insightful answers.', 'duration': 56.973, 'max_score': 1782.227, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY1782227.jpg'}, {'end': 1915.891, 'src': 'embed', 'start': 1879.371, 'weight': 3, 'content': [{'end': 1881.852, 'text': 'and with the five regions I am going to form different clusters.', 'start': 1879.371, 'duration': 2.481}, {'end': 1892.957, 'text': 'or in the systematic sampling you might also want to say that with the five regions that I have got I might want to analyze only for one product right,', 'start': 1882.492, 'duration': 10.465}, {'end': 1895.858, 'text': 'which is not doing that good in the sales.', 'start': 1892.957, 'duration': 2.901}, {'end': 1897.98, 'text': 'so these kind of sampling techniques,', 'start': 1895.858, 'duration': 2.122}, {'end': 1910.883, 'text': 'like the cluster based one or the systematic sampling techniques and there are different names for this people might be able to give a very good interpretation of what really went wrong in whichever sort of analysis you are doing.', 
'start': 1897.98, 'duration': 12.903}, {'end': 1915.891, 'text': 'so one example is like sales going down, but you can adapt this to other analysis as well.', 'start': 1910.883, 'duration': 5.008}], 'summary': 'Using five regions, different sampling techniques help analyze sales and identify issues for specific products.', 'duration': 36.52, 'max_score': 1879.371, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY1879371.jpg'}, {'end': 1988.407, 'src': 'heatmap', 'start': 1928.353, 'weight': 0.817, 'content': [{'end': 1942.665, 'text': 'you know exactly which clusters or which sort of regions in this example you are like analyzing and in your end of the analysis you will be very able to say this is not like a randomized sample that I have taken,', 'start': 1928.353, 'duration': 14.312}, {'end': 1944.086, 'text': 'but from these five regions.', 'start': 1942.665, 'duration': 1.421}, {'end': 1951.411, 'text': 'So there are many different ways of doing clustering, cluster orders or sort of the systematic sampling,', 'start': 1944.485, 'duration': 6.926}, {'end': 1955.755, 'text': 'which kind of helps in this particular final end results of your,', 'start': 1951.411, 'duration': 4.344}, {'end': 1960.58, 'text': 'to put your end results in the right perspectives instead of doing a randomized sampling.', 'start': 1955.755, 'duration': 4.825}, {'end': 1969.328, 'text': 'One more quite a useful sort of an idea kind of widely borrowed from the field of linear algebra.', 'start': 1961.741, 'duration': 7.587}, {'end': 1979.819, 'text': 'and this is a bit related to what we earlier saw between moving from one variable to multiple variables right and eigenvalue and eigenvectors,', 'start': 1969.95, 'duration': 9.869}, {'end': 1988.407, 'text': 'kind of a concept borrowed from linear algebra, helps us to bring in some in some way, a linear combination of different variables together.', 'start': 1979.819, 'duration': 8.588}], 
'summary': 'Clustering and systematic sampling aid in accurate analysis, while eigenvalue and eigenvectors help combine variables effectively.', 'duration': 60.054, 'max_score': 1928.353, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY1928353.jpg'}, {'end': 2135.065, 'src': 'embed', 'start': 2094.255, 'weight': 0, 'content': [{'end': 2099.279, 'text': "And that's because one eigenvector can represent a hundred variables together.", 'start': 2094.255, 'duration': 5.024}, {'end': 2111.875, 'text': "So that's sort of how it works: quite a powerful idea, and commonly used methods for reducing the dimensions of a large dataset, like PCA,", 'start': 2102.188, 'duration': 9.687}, {'end': 2115.518, 'text': 'principal component analysis, are actually based on eigenvalues and eigenvectors.', 'start': 2111.875, 'duration': 3.643}, {'end': 2123.143, 'text': 'So, if somebody asks you about eigenvalues and eigenvectors in an interview, also talk about PCA, principal component analysis,', 'start': 2115.918, 'duration': 7.225}, {'end': 2124.945, 'text': 'which is actually based on these two concepts.', 'start': 2123.143, 'duration': 1.802}, {'end': 2128.227, 'text': 'So that gives the interviewer a good idea that', 'start': 2125.685, 'duration': 2.542}, {'end': 2135.065, 'text': 'you know about eigenvalues and eigenvectors and you are also able to think of their application, like in PCA.', 'start': 2129.681, 'duration': 5.384}], 'summary': 'One eigenvector can represent 100 variables; used in PCA for dataset dimensionality reduction.', 'duration': 40.81, 'max_score': 2094.255, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY2094255.jpg'}], 'start': 1533.753, 'title': 'Data analysis best practices', 'summary': 'Discusses statistical concepts for interviews, common data analysis questions, significance of text analytics, and the time-consuming nature of data 
cleaning, which takes 70 to 80% of the total data analysis time.', 'chapters': [{'end': 1739.466, 'start': 1533.753, 'title': 'Data analysis best practices', 'summary': 'Discusses the importance of understanding statistical concepts for interviews, common data analysis questions, the significance of text analytics, and the time-consuming nature of data cleaning, which takes 70 to 80% of the total data analysis time.', 'duration': 205.713, 'highlights': ['The time-consuming nature of data cleaning, which takes 70 to 80% of the total data analysis time Data cleaning and understanding the data, doing explorations with plot, takes close to 70 to 80% of your time in any data analysis task.', 'The significance of text analytics and related libraries in Python and R Text analytics, including sentiment analysis, is a large domain, with Python and R offering libraries like pandas, NumPy, NLTK, and TM for natural language processing and text mining.', 'The importance of understanding statistical concepts for interviews Understanding statistical concepts is crucial for interviews, as anything less might lead to difficulties in the interview process.', 'The significance of well-structured data in reducing data analysis time Maintaining well-structured data can reduce the heavy time spent on data analysis or data cleaning, which is essential for any new project without pre-existing data pipelines.']}, {'end': 2251.13, 'start': 1740.126, 'title': 'Data cleaning, multivariate analysis, and sampling techniques', 'summary': 'Emphasizes the importance of data cleaning, multivariate analysis, and sampling techniques in improving the performance of analysis models, with a focus on reducing dimensions and understanding the significance of false positive and false negative cases, using examples and practical applications.', 'duration': 511.004, 'highlights': ['Importance of Data Cleaning for Analysis Performance Data cleaning is highlighted as crucial, occupying 80% of the time spent on 
tasks, and is emphasized for improving analysis model performance.', 'Significance of Multivariate Analysis and Correlation The importance of moving beyond one variable and conducting multivariate analysis to understand the correlation between multiple factors for a more comprehensive analysis is stressed.', 'Role of Sampling Techniques in Analysis Interpretation The significance of systematic and cluster-based sampling techniques is explained, emphasizing their ability to provide a better interpretation of analysis results compared to randomized sampling.', 'Utilizing Eigenvalue and Eigenvectors for Dimensionality Reduction The concept of using eigenvalue and eigenvectors to reduce the dimensionality of large datasets is discussed, highlighting its impact on analysis time and data representability.', 'Application of Eigenvalue and Eigenvectors in Principal Component Analysis (PCA) The application of eigenvalue and eigenvectors in PCA is emphasized as a commonly used method for reducing the dimensions of a large dataset.', 'Understanding the Significance of False Positive and False Negative Cases The practical significance of false positive and false negative cases, particularly in scenarios such as medical domain applications, is illustrated, emphasizing the critical implications of these cases on decision-making processes.']}], 'duration': 717.377, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY1533753.jpg', 'highlights': ['The time-consuming nature of data cleaning, which takes 70 to 80% of the total data analysis time', 'The significance of well-structured data in reducing data analysis time', 'The importance of understanding statistical concepts for interviews', 'The significance of text analytics and related libraries in Python and R', 'Importance of Data Cleaning for Analysis Performance', 'Significance of Multivariate Analysis and Correlation', 'Role of Sampling Techniques in Analysis 
Interpretation', 'Utilizing Eigenvalue and Eigenvectors for Dimensionality Reduction', 'Application of Eigenvalue and Eigenvectors in Principal Component Analysis (PCA)', 'Understanding the Significance of False Positive and False Negative Cases']}, {'end': 2547.433, 'segs': [{'end': 2318.833, 'src': 'embed', 'start': 2291.718, 'weight': 1, 'content': [{'end': 2298.084, 'text': "with a treatment like this chemotherapy, like treatment then, it is like much better than saying you don't have cancer.", 'start': 2291.718, 'duration': 6.366}, {'end': 2302.989, 'text': 'okay, and a very similar example in some other context might also come up.', 'start': 2298.084, 'duration': 4.905}, {'end': 2307.287, 'text': 'so if you would like to think of some other examples in the same context, okay.', 'start': 2302.989, 'duration': 4.298}, {'end': 2311.069, 'text': 'so where is the other case now, which is the false negative right?', 'start': 2307.287, 'duration': 3.782}, {'end': 2313.25, 'text': 'so we talked about the false positives importance.', 'start': 2311.069, 'duration': 2.181}, {'end': 2318.833, 'text': 'but there might be also cases where false negative might become a bit more important there.', 'start': 2313.25, 'duration': 5.583}], 'summary': 'Chemotherapy is much better than no cancer; false negatives can be important too.', 'duration': 27.115, 'max_score': 2291.718, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY2291718.jpg'}, {'end': 2552.375, 'src': 'embed', 'start': 2526.725, 'weight': 0, 'content': [{'end': 2534.348, 'text': 'which will make sure that during the process there is one part dedicatedly given for the validation of the model.', 'start': 2526.725, 'duration': 7.623}, {'end': 2541.251, 'text': 'and when the model is done you might see that the final model is well trained on the data, at the same time validated.', 'start': 2534.348, 'duration': 6.903}, {'end': 2547.433, 'text': 'but when the model 
is completely done, then only you get into a process that we call testing right.', 'start': 2541.251, 'duration': 6.182}, {'end': 2552.375, 'text': 'so you can imagine like this you have a data of thousand records.', 'start': 2547.433, 'duration': 4.942}], 'summary': 'Model validation ensures well-trained and validated model before testing. data of thousand records used.', 'duration': 25.65, 'max_score': 2526.725, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY2526725.jpg'}], 'start': 2251.13, 'title': 'Importance of false positives and false negatives', 'summary': 'Emphasizes the impact of false positives and false negatives in machine learning models, discussing their significance in medical diagnosis, criminal justice, and banking industry, and the necessity to consider both cases in model building processes.', 'chapters': [{'end': 2547.433, 'start': 2251.13, 'title': 'Importance of false positives and false negatives', 'summary': 'Discusses the importance of false positives and false negatives in machine learning models, emphasizing their impact in medical diagnosis, criminal justice, and banking industry, and the need for clear understanding and consideration of both cases in model building processes.', 'duration': 296.303, 'highlights': ['In medical diagnosis, false positives can lead to harmful treatments while false negatives may miss detecting a disease, emphasizing the need to carefully consider both cases to avoid potential harm to patients. medical diagnosis', 'In criminal justice, false negatives can lead to letting a criminal walk free, posing a greater risk than convicting an innocent person, highlighting the significance of considering false negatives in building a model for convicting criminals. 
criminal justice', 'In the banking industry, false positives can lead to loss of business opportunities while false negatives can result in potential financial risks, underlining the equal importance of both cases in decision-making for loan approvals. banking industry', "The chapter also emphasizes the need for a clear understanding and consideration of both false positives and false negatives in the model building process, particularly in dividing the dataset into training, test, and validation data to ensure the model's accuracy and effectiveness. model building process"]}], 'duration': 296.303, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY2251130.jpg', 'highlights': ['In medical diagnosis, false positives can lead to harmful treatments while false negatives may miss detecting a disease, emphasizing the need to carefully consider both cases to avoid potential harm to patients.', 'In criminal justice, false negatives can lead to letting a criminal walk free, posing a greater risk than convicting an innocent person, highlighting the significance of considering false negatives in building a model for convicting criminals.', 'In the banking industry, false positives can lead to loss of business opportunities while false negatives can result in potential financial risks, underlining the equal importance of both cases in decision-making for loan approvals.', "The chapter also emphasizes the need for a clear understanding and consideration of both false positives and false negatives in the model building process, particularly in dividing the dataset into training, test, and validation data to ensure the model's accuracy and effectiveness."]}, {'end': 3307.985, 'segs': [{'end': 2637.267, 'src': 'embed', 'start': 2594.439, 'weight': 1, 'content': [{'end': 2604.503, 'text': 'so what you see here test set can be like replaced with a validation set right and you can see that this is a rolling sort of subset and you 
keep changing it in each fold.', 'start': 2594.439, 'duration': 10.064}, {'end': 2607.245, 'text': 'so you go for the first fold of the iteration.', 'start': 2604.503, 'duration': 2.742}, {'end': 2617.589, 'text': 'you keep one validation set and the rest of it is the training set, and in the next fold you move this window to another subset and the rest is training data, and so on.', 'start': 2607.245, 'duration': 10.344}, {'end': 2621.333, 'text': 'and when this model is done, using this k-fold cross validation approach,', 'start': 2617.589, 'duration': 3.744}, {'end': 2626.88, 'text': 'in the end you will get a model which you can then use on the testing data to see if the accuracies are good or not.', 'start': 2621.333, 'duration': 5.547}, {'end': 2630.605, 'text': 'So this kind of brings in a lot of performance improvements.', 'start': 2627.844, 'duration': 2.761}, {'end': 2637.267, 'text': 'People have also found the validation set to be a really good way of tuning the parameters.', 'start': 2630.945, 'duration': 6.322}], 'summary': 'K-fold cross validation improves model performance and parameter tuning using validation set.', 'duration': 42.828, 'max_score': 2594.439, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY2594439.jpg'}, {'end': 3010.768, 'src': 'embed', 'start': 2983.579, 'weight': 0, 'content': [{'end': 2990.541, 'text': 'same is true for when you want to build a classification algorithm for detecting cancer, whether the patient has a cancer or not,', 'start': 2983.579, 'duration': 6.962}, {'end': 2995.343, 'text': "has a cancer right, or if you want to detect, let's say, a malicious content or a malicious file,", 'start': 2990.541, 'duration': 4.802}, {'end': 2999.904, 'text': 'which might be a virus, Trojan, or worm, or something else right.', 'start': 2995.343, 'duration': 4.561}, {'end': 3004.266, 'text': 'so in that case the classes are now many, so more than one class can also be there.', 
'start': 2999.904, 'duration': 4.362}, {'end': 3010.768, 'text': 'but the fundamental idea is we are following a supervised learning algorithm, but the type of problem we are solving is a classification problem.', 'start': 3004.266, 'duration': 6.502}], 'summary': 'Supervised learning for classification with multiple classes, e.g., cancer detection and malware identification.', 'duration': 27.189, 'max_score': 2983.579, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY2983579.jpg'}, {'end': 3261.308, 'src': 'embed', 'start': 3233.411, 'weight': 3, 'content': [{'end': 3238.054, 'text': 'you might want to show to each other that you have another friend whom you might want to connect.', 'start': 3233.411, 'duration': 4.643}, {'end': 3244.67, 'text': 'So there are like many such use cases which comes out the moment you get into the deeper understanding of recommender system.', 'start': 3238.704, 'duration': 5.966}, {'end': 3252.258, 'text': 'But the fundamental idea is how do I compare two items? The items might be product, people, movies and so on.', 'start': 3244.69, 'duration': 7.568}, {'end': 3257.744, 'text': 'And how do we compare two users in simple terms? 
So this is what a recommendation system works on.', 'start': 3252.659, 'duration': 5.085}, {'end': 3261.308, 'text': 'So there are quite famous examples like the collaborative filtering approaches.', 'start': 3258.024, 'duration': 3.284}], 'summary': 'Recommender systems compare items and users, with use cases for connecting friends and collaborative filtering.', 'duration': 27.897, 'max_score': 3233.411, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY3233411.jpg'}, {'end': 3317.295, 'src': 'embed', 'start': 3288.186, 'weight': 2, 'content': [{'end': 3294.912, 'text': "So what if I don't want to have a class of a particular user or a patient or something like that, right?", 'start': 3288.186, 'duration': 6.726}, {'end': 3299.116, 'text': 'But instead, if I would ask you, can you give me a crisp value instead of a class?', 'start': 3295.292, 'duration': 3.824}, {'end': 3307.985, 'text': "So when I'm classifying a given file into good or bad, in which bad can be viruses, worms, or Trojans, these are the classes.", 'start': 3299.857, 'duration': 8.128}, {'end': 3317.295, 'text': "But if I don't want class but rather, for example, if I want to know the exact value of a house in a particular locality in my city,", 'start': 3308.505, 'duration': 8.79}], 'summary': 'Exploring the idea of predicting crisp values instead of classes.', 'duration': 29.109, 'max_score': 3288.186, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY3288186.jpg'}], 'start': 2547.433, 'title': 'Machine learning basics and supervised classification', 'summary': 'Covers k-fold cross validation, supervised and unsupervised learning basics, and explores classification algorithms like logistic regression and recommender systems, with real-world applications in banking and e-commerce.', 'chapters': [{'end': 2922.183, 'start': 2547.433, 'title': 'Cross validation and machine learning 
basics', 'summary': 'Explains the process of k-fold cross validation, its significance in model building and parameter tuning, and the basics of supervised and unsupervised learning in machine learning, with specific examples and use cases.', 'duration': 374.75, 'highlights': ['k-fold cross validation process explained with the example of splitting data into training, validation, and testing sets, with the significance of using k-fold cross validation to improve model performance. ', 'Importance of validation set in tuning model parameters and its role in preventing overfitting and underfitting in machine learning models. ', 'Explanation of supervised and unsupervised learning in machine learning, with examples of algorithms such as support vector machine regression and decision trees for supervised learning, and clustering for unsupervised learning. ']}, {'end': 3307.985, 'start': 2922.603, 'title': 'Supervised learning and classification algorithms', 'summary': 'Discusses the fundamental concepts of supervised learning, focusing on classification algorithms such as logistic regression and recommender systems, and their applications in various industries, including banking and e-commerce.', 'duration': 385.382, 'highlights': ["Logistic regression is commonly used in banking for predicting customer defaulters, and other binary classification problems, providing robust implementations for decision-making. Logistic regression's common usage in banking and robustness in predicting customer defaulters.", 'Recommender systems, such as those used by Amazon, YouTube, and Netflix, are widely applied for personalized product recommendations and content suggestions, benefiting businesses through increased user engagement and sales. 
Wide application of recommender systems in platforms like Amazon, YouTube, and Netflix, and their business benefits.', 'Types of classification algorithms include logistic regression, decision tree, and support vector machines, each suited for different types of classification problems. Overview of various types of classification algorithms and their suitability for different problems.']}], 'duration': 760.552, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY2547433.jpg', 'highlights': ['Importance of validation set in tuning model parameters and its role in preventing overfitting and underfitting in machine learning models.', 'k-fold cross validation process explained with the example of splitting data into training, validation, and testing sets, with the significance of using k-fold cross validation to improve model performance.', 'Explanation of supervised and unsupervised learning in machine learning, with examples of algorithms such as support vector machine regression and decision trees for supervised learning, and clustering for unsupervised learning.', 'Recommender systems, such as those used by Amazon, YouTube, and Netflix, are widely applied for personalized product recommendations and content suggestions, benefiting businesses through increased user engagement and sales.', 'Logistic regression is commonly used in banking for predicting customer defaulters, and other binary classification problems, providing robust implementations for decision-making.', 'Types of classification algorithms include logistic regression, decision tree, and support vector machines, each suited for different types of classification problems.']}, {'end': 4155.087, 'segs': [{'end': 3393.519, 'src': 'embed', 'start': 3363.719, 'weight': 7, 'content': [{'end': 3370.563, 'text': 'I can use the model and predict exactly, because these features are somewhere similar in that locality, the prices might be in a particular range.', 
'start': 3363.719, 'duration': 6.844}, {'end': 3376.685, 'text': 'So a model like linear regression will learn those patterns in the given input attributes and try to predict the price of the house.', 'start': 3370.643, 'duration': 6.042}, {'end': 3386.688, 'text': 'And the idea is if I have a set of data points, I want to build a very generic model like drawing this line which is as close as possible to all the points.', 'start': 3377.925, 'duration': 8.763}, {'end': 3393.519, 'text': 'So you can like draw infinitely many lines if you are given a set of data points like this in a two dimensional space.', 'start': 3387.734, 'duration': 5.785}], 'summary': 'Linear regression model predicts house prices based on similar features in a locality.', 'duration': 29.8, 'max_score': 3363.719, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY3363719.jpg'}, {'end': 3451.061, 'src': 'embed', 'start': 3420.677, 'weight': 8, 'content': [{'end': 3422.638, 'text': 'which is nothing but the sum of all the distances.', 'start': 3420.677, 'duration': 1.961}, {'end': 3433.543, 'text': 'So there are simple linear regression ideas, and the fundamental idea that we are following here is your variables are having a linear relationship,', 'start': 3424.412, 'duration': 9.131}, {'end': 3437.648, 'text': 'which means with the increase of one variable, the other also increases.', 'start': 3433.543, 'duration': 4.105}, {'end': 3441.313, 'text': 'But if you have a pattern which is not so linear in shape,', 'start': 3437.969, 'duration': 3.344}, {'end': 3451.061, 'text': 'like if you are like not able to draw a generic line but the representation or the sort of points are aligned in a way that you can only create a model which is polynomial.', 'start': 3441.313, 'duration': 9.748}], 'summary': 'The fundamental idea is to model non-linear relationships using polynomial regression.', 'duration': 30.384, 'max_score': 3420.677, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY3420677.jpg'}, {'end': 3509.175, 'src': 'embed', 'start': 3480.861, 'weight': 5, 'content': [{'end': 3488.104, 'text': 'So if somebody is asking you around linear regression, it would be to start with saying how you build a linear regression model,', 'start': 3480.861, 'duration': 7.243}, {'end': 3489.845, 'text': 'and then you might give some examples of it.', 'start': 3488.104, 'duration': 1.741}, {'end': 3495.528, 'text': "So, at best, if you're not comfortable with these ideas like p-value or hypothesis testing,", 'start': 3490.525, 'duration': 5.003}, {'end': 3502.772, 'text': 'you might want to refresh that before you go for any interview, because if linear regression comes up, these concepts need to be a bit more explained.', 'start': 3495.528, 'duration': 7.244}, {'end': 3509.175, 'text': 'So, when I talked about the recommendation algorithm, I mentioned something about collaborative filters.', 'start': 3503.972, 'duration': 5.203}], 'summary': 'Linear regression, p-value, hypothesis testing, recommendation algorithm, collaborative filters mentioned in interview preparation.', 'duration': 28.314, 'max_score': 3480.861, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY3480861.jpg'}, {'end': 3676.474, 'src': 'embed', 'start': 3643.255, 'weight': 1, 'content': [{'end': 3645.935, 'text': 'like nowadays e-commerce companies do this a lot,', 'start': 3643.255, 'duration': 2.68}, {'end': 3651.677, 'text': 'but in those sales days you would obviously expect that the products purchase is going to go very high right.', 'start': 3645.935, 'duration': 5.742}, {'end': 3654.939, 'text': "but does that mean that it's an outlier to me?", 'start': 3652.177, 'duration': 2.762}, {'end': 3662.465, 'text': 'not, because if I am able to explain an outlier by saying that this was a discount day, I might be able to handle it separately 
either.', 'start': 3654.939, 'duration': 7.526}, {'end': 3669.43, 'text': 'I can like simply take all those points which are for the discounts day or sales day and keep it separately.', 'start': 3662.465, 'duration': 6.965}, {'end': 3676.474, 'text': 'or if I would like to have the variable like saying whether the given day is a sales day and keep the outlier as well,', 'start': 3669.43, 'duration': 7.044}], 'summary': 'E-commerce companies see high product purchases on sales days, but outliers can be explained by discount days.', 'duration': 33.219, 'max_score': 3643.255, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY3643255.jpg'}, {'end': 3735.776, 'src': 'embed', 'start': 3707.119, 'weight': 2, 'content': [{'end': 3711.282, 'text': 'any point which is greater than the 99th percentile can be like removed.', 'start': 3707.119, 'duration': 4.163}, {'end': 3719.129, 'text': "so these are like you are removing the toppers from the data points of yours of, let's say, a SAT examination or a CAT examination.", 'start': 3711.282, 'duration': 7.847}, {'end': 3724.272, 'text': 'So the outliers are sometimes can cause certain issues in explaining the model.', 'start': 3719.67, 'duration': 4.602}, {'end': 3727.473, 'text': 'So you can obviously imagine this in very intuitive terms also.', 'start': 3724.532, 'duration': 2.941}, {'end': 3735.776, 'text': 'If you have a set of scores for candidates who appeared for an examination and there was one outlier candidate who scored really, really high.', 'start': 3727.933, 'duration': 7.843}], 'summary': 'Removing outliers above 99th percentile improves model accuracy.', 'duration': 28.657, 'max_score': 3707.119, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY3707119.jpg'}, {'end': 4031.686, 'src': 'heatmap', 'start': 3974.537, 'weight': 0.816, 'content': [{'end': 3980.281, 'text': "so that's why people normally do 
this at a sort of mean, minimum, or maximum kind of a value,", 'start': 3974.537, 'duration': 5.744}, {'end': 3982.903, 'text': 'or they also calculate the average and impute the value there.', 'start': 3980.281, 'duration': 2.622}, {'end': 3986.805, 'text': 'so there can be some other pattern based imputation also possible.', 'start': 3983.343, 'duration': 3.462}, {'end': 3988.726, 'text': "but I'll just give you an example.", 'start': 3986.805, 'duration': 1.921}, {'end': 3996.05, 'text': 'and in other cases, if nothing is possible by putting a value, if everything is going to be misleading, then better is to like remove that.', 'start': 3988.726, 'duration': 7.324}, {'end': 3999.692, 'text': 'but that can only be done if you have a surplus of data with you.', 'start': 3996.05, 'duration': 3.642}, {'end': 4004.374, 'text': 'if not, then be cautious of removing any values, particularly the missing ones.', 'start': 3999.692, 'duration': 4.682}, {'end': 4010.878, 'text': 'okay, so this question particularly pertains to a machine learning algorithm called k-means right.', 'start': 4004.374, 'duration': 6.504}, {'end': 4016.139, 'text': 'every time you run the algorithm you have to define what should be the value of k right.', 'start': 4011.337, 'duration': 4.802}, {'end': 4025.323, 'text': 'so there are approaches like the elbow curve, which is a kind of plot between the x-axis, which is the number of clusters,', 'start': 4016.139, 'duration': 9.184}, {'end': 4031.686, 'text': 'versus y-axis, which is the WSS, or within-cluster sum of squares, which is also known by the name distortion.', 'start': 4025.323, 'duration': 6.363}], 'summary': 'Imputing missing values for machine learning; defining k for k-means algorithm.', 'duration': 57.149, 'max_score': 3974.537, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY3974537.jpg'}, {'end': 4155.087, 'src': 'embed', 'start': 4109.473, 'weight': 0, 'content': 
[{'end': 4117.736, 'text': 'so I hope you are now getting comfortable with questions around data analysis, statistics and even the machine learning part.', 'start': 4109.473, 'duration': 8.263}, {'end': 4123.817, 'text': 'So the next pillar which might very closely be associated with statistics as well is around probability.', 'start': 4118.196, 'duration': 5.621}, {'end': 4130.339, 'text': 'So in most of the standard literature, probability and statistics come together.', 'start': 4124.598, 'duration': 5.741}, {'end': 4132.96, 'text': 'They are inseparable.', 'start': 4130.698, 'duration': 2.262}, {'end': 4139.691, 'text': 'In the naive Bayes algorithm, which is one of the machine learning algorithms based on Bayes theorem,', 'start': 4134.205, 'duration': 5.486}, {'end': 4149.081, 'text': 'probability ideas are used quite a lot, and there are some really niche probability concepts, like probabilistic graphical models,', 'start': 4140.292, 'duration': 8.789}, {'end': 4155.087, 'text': 'which are actually based on the basics coming from Bayes theorem and the fundamental properties of probability.', 'start': 4149.081, 'duration': 6.006}], 'summary': 'Probability is closely associated with statistics and underpins machine learning algorithms such as naive Bayes, which is based on Bayes theorem and the fundamental properties of probability.', 'duration': 45.614, 'max_score': 4109.473, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY4109473.jpg'}], 'start': 3308.505, 'title': 'Linear regression in machine learning and outlier handling', 'summary': 'Covers the application of linear regression models in machine learning for predicting house prices based on factors like number of bedrooms and area, emphasizing the significance of outlier handling in data analysis and the determination of k value in k-means clustering.', 'chapters': [{'end': 3363.719, 'start': 3308.505, 'title': 'Linear regression in machine 
learning', 'summary': 'Explains how linear regression models in machine learning can be used to predict the value of a house based on input data such as number of bedrooms and area in square feet, using past training data to build a model for future predictions.', 'duration': 55.214, 'highlights': ['Linear regression models in machine learning can predict the value of a house based on input data such as the number of bedrooms and area in square feet, providing a crisp value in terms of currency.', 'Training data with labeled attributes of houses is used to build the model for future predictions.', 'The technique can be used for any similar pattern in future data for new properties, enabling predictions for houses in any location.']}, {'end': 4155.087, 'start': 3363.719, 'title': 'Linear regression and outlier handling in data analysis', 'summary': 'Discusses the concept of linear regression for predicting house prices, the importance of handling outliers in data analysis, and the process of determining the appropriate value of k in k-means clustering algorithm.', 'duration': 791.368, 'highlights': ['The chapter discusses the concept of linear regression for predicting house prices. It explains how linear regression learns patterns in input attributes to predict house prices, and mentions the need for polynomial regression in cases of non-linear relationships.', 'The importance of handling outliers in data analysis is emphasized. It explains the impact of outliers on model prediction, the need to handle outliers before analysis, and various approaches such as removing data points based on standard deviation or percentile.', 'The process of determining the appropriate value of k in k-means clustering algorithm is explained. 
It describes the elbow curve method to identify the suitable value of k and the relationship between cluster distances and distortion values.']}], 'duration': 846.582, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY3308505.jpg', 'highlights': ['Linear regression models predict house value based on input data like bedrooms and area.', 'Training data with labeled attributes is used to build the model for future predictions.', 'The technique can be applied to any similar pattern in future data for new properties.', 'Linear regression learns patterns in input attributes to predict house prices.', 'Polynomial regression is needed for non-linear relationships in house price prediction.', 'Handling outliers in data analysis is crucial for accurate model prediction.', 'Various approaches such as removing data points based on standard deviation or percentile are used to handle outliers.', 'The process of determining the appropriate value of k in the k-means clustering algorithm is explained.', 'The elbow curve method is used to identify the suitable value of k in k-means clustering.', 'The relationship between cluster distances and distortion values is described in determining k value.']}, {'end': 4963.245, 'segs': [{'end': 4963.245, 'src': 'embed', 'start': 4935.427, 'weight': 0, 'content': [{'end': 4940.79, 'text': 'Also, spend some time on linear algebra kind of ideas like eigenvalues and eigenvectors.', 'start': 4935.427, 'duration': 5.363}, {'end': 4943.332, 'text': 'It might also be helpful on certain questions.', 'start': 4940.851, 'duration': 2.481}, {'end': 4949.916, 'text': 'So all the very best for your interview and hope you all get a really really successful career in data science.', 'start': 4943.953, 'duration': 5.963}, {'end': 4950.797, 'text': 'Thank you.', 'start': 4950.457, 'duration': 0.34}, {'end': 4953.178, 'text': 'I hope you enjoyed listening to this video.', 'start': 4951.537, 'duration': 
1.641}, {'end': 4958.922, 'text': 'Please be kind enough to like it and you can comment any of your doubts and queries and we will reply to them at the earliest.', 'start': 4953.559, 'duration': 5.363}, {'end': 4962.805, 'text': 'Do look out for more videos in our playlist and subscribe to our Edureka channel to learn more.', 'start': 4958.922, 'duration': 3.883}, {'end': 4963.245, 'text': 'Happy learning.', 'start': 4962.805, 'duration': 0.44}], 'summary': 'Linear algebra concepts like eigenvalues and eigenvectors are useful in certain questions. Best wishes for a successful career in data science!', 'duration': 27.818, 'max_score': 4935.427, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY4935427.jpg'}], 'start': 4155.631, 'title': 'Probability principles and applications', 'summary': 'Covers probability principles, such as independent events and equal likelihood, with examples including the probability of seeing a shooting star and generating random numbers. It also discusses the application of probability and statistics in data science, with examples such as calculating the probability of having two girls and determining the probability of selecting a fair coin after observing 10 consecutive heads.', 'chapters': [{'end': 4590.107, 'start': 4155.631, 'title': 'Probability problems and generating random numbers', 'summary': 'Discusses probability problems, including calculating the probability of seeing a shooting star in an hour and generating a random number between 1 and 7 with a die, emphasizing the principles of independent events and equal likelihood in random selection.', 'duration': 434.476, 'highlights': ['The probability of not seeing a shooting star in an hour is about 0.41, and the probability of seeing it is about 0.59, based on the 20% probability in a 15-minute interval. 
The calculation of the probabilities for seeing and not seeing a shooting star in an hour, based on the given 20% probability in a 15-minute interval, demonstrates the use of independent events and the complement rule, since the two probabilities must sum to one.', 'The approach to generating a random number between 1 and 7 with a die involves rolling the die twice to increase the possibilities to 36 outcomes, and then ensuring equal likelihood of each number by excluding one outcome and dividing the remaining 35 into seven parts, each containing five possibilities. Because 36 is not divisible by 7, one of the 36 outcomes has to be excluded; the remaining 35 split evenly into seven groups of five, which avoids bias in the random selection.']}, {'end': 4963.245, 'start': 4590.107, 'title': 'Probability and statistics in data science', 'summary': 'The chapter discusses the application of probability and statistics in data science, including examples such as calculating the probability of having two girls given at least one is a girl, and determining the probability of selecting a fair coin after observing 10 consecutive heads, with detailed explanations and calculations.', 'duration': 373.138, 'highlights': ['Calculating the probability of having two girls given at least one is a girl. Explains the process of calculating the probability of having two girls when a couple has two children, at least one of which is a girl, resulting in a probability of 1/3.', 'Determining the probability of selecting a fair coin after observing 10 consecutive heads. Discusses the probability of selecting a fair coin and getting 10 heads, as well as the probability of selecting an unfair coin with double heads, leading to an overall probability of 0.75 that the next toss is also a head.']}], 'duration': 807.614, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/tTAieUcNHdY/pics/tTAieUcNHdY4155631.jpg', 'highlights': ['The probability of not seeing a shooting star in an hour is about 0.41, and the probability of seeing it is about 0.59, based on the 20% probability in a 15-minute interval.', 'The approach to generating a random number between 1 and 7 with a die involves rolling the die twice to increase the possibilities to 36 outcomes, and then ensuring equal likelihood of each number by excluding one outcome and dividing the remaining 35 into seven parts, each containing five possibilities.', 'Calculating the probability of having two girls given at least one is a girl. Explains the process of calculating the probability of having two girls when a couple has two children, at least one of which is a girl, resulting in a probability of 1/3.', 'Determining the probability of selecting a fair coin after observing 10 consecutive heads. Discusses the probability of selecting a fair coin and getting 10 heads, as well as the probability of selecting an unfair coin with double heads, leading to an overall probability of 0.75 that the next toss is also a head.']}], 'highlights': ["Python's prominence in data science is bolstered by libraries like NumPy and Pandas, facilitating robust solution design and deployment of production-grade solutions, offering a competitive advantage in modeling tasks.", 'The chapter emphasizes the importance of handling selection bias in data analysis and discusses techniques like randomized selection and stratified sampling to minimize it.', 'In medical diagnosis, false positives can lead to harmful treatments while false negatives may miss detecting a disease, emphasizing the need to carefully consider both cases to avoid potential harm to patients.', 'Importance of validation set in tuning model parameters and its role in preventing overfitting and underfitting in machine learning models.', 'Linear regression models predict house value based on input data like bedrooms and 
area.', 'The probability of not seeing a shooting star in an hour is about 0.41, and the probability of seeing it is about 0.59, based on the 20% probability in a 15-minute interval.']}
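
The shooting-star highlight can be checked with a short calculation. This is a minimal sketch, assuming the four 15-minute intervals in an hour are independent and each has a 20% chance of showing at least one shooting star; the function name `prob_at_least_one` is hypothetical and does not come from the video.

```python
def prob_at_least_one(p_interval: float, n_intervals: int) -> float:
    """Probability of at least one event across n independent intervals.

    Assumption: intervals are independent and share the same probability.
    """
    # Probability that the event happens in none of the intervals.
    p_none = (1.0 - p_interval) ** n_intervals
    # Complement rule: P(at least one) = 1 - P(none).
    return 1.0 - p_none


p_interval = 0.20        # chance of a shooting star in one 15-minute interval
intervals_per_hour = 4   # four 15-minute intervals in an hour

p_seen = prob_at_least_one(p_interval, intervals_per_hour)
p_not_seen = 1.0 - p_seen

print(f"P(no shooting star in an hour)  = {p_not_seen:.4f}")
print(f"P(at least one in an hour)      = {p_seen:.4f}")
```

Under the independence assumption, the probability of seeing nothing is 0.8 raised to the fourth power, and the two results sum to one by construction.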