title
Data Science Full Course for Beginners 2023 (11 Hours Data Science Tutorial)
description
Data science is the domain of study that is a blend of Mathematics, Analytics, Algorithms, and Machine learning techniques. It deals with vast volumes of data using various tools and techniques to find unseen patterns, derives meaningful information, and make business decisions. Data science uses complex machine learning algorithms to build predictive models.
So, learn like a pro by understanding the concepts of big data, machine learning, BI, and model processing in data science. This full course on Data Science from Great Learning is ideal for both beginners and professionals who want to upskill in one of the highly trending domains of the industry. It covers topics as fundamental as statistics and data to Logistic and Linear Regression. Our data science for beginners course will help you master data science algorithms and more advanced data science topics such as:
🏁 Topics Covered:
• Introduction - 00:00:00
• Statistics vs Machine Learning - 00:02:15
• Types of Statistics - 00:08:55
• Types of Data - 01:50:35
• Correlation - 02:45:50
• Covariance - 02:52:23
• Basics of Python - 04:24:36
• Python Data Structures - 04:43:58
• Flow Control Statements in Python - 04:55:58
• Numpy - 05:32:48
• Pandas - 05:51:30
• Matplolib - 06:14:28
• Linear Regression - 06:38:14
• Logistic Regression - 09:54:34
🔥1000+ Free Courses With Free Certificates: https://www.mygreatlearning.com/academy?ambassador_code=GLYT_DES_u2zsY-2uZiE&utm_source=GLYT&utm_campaign=GLYT_DES_u2zsY-2uZiE
🔥Build a career in Data Science & Business Analytics: https://www.mygreatlearning.com/pg-program-data-science-and-business-analytics-course?ambassador_code=GLYT_DES_Middle_SEP22&utm_source=GLYT&utm_campaign=GLYT_DES_Middle_SEP53
🔥Check Our Free Courses with free certificate:
📌Data Science with Python course, Register Now: https://glacad.me/3EXsbI5
📌Data Science Foundations: https://www.mygreatlearning.com/academy/learn-for-free/courses/data-science-foundations?ambassador_code=GLYT_DES_Middle_SEP22&utm_source=GLYT&utm_campaign=GLYT_DES_Middle_SEP22
📌Career in Data Science: https://www.mygreatlearning.com/academy/learn-for-free/courses/career-in-data-science?ambassador_code=GLYT_DES_Middle_SEP22&utm_source=GLYT&utm_campaign=GLYT_DES_Middle_SEP22
📌R for Data Science: https://www.mygreatlearning.com/academy/learn-for-free/courses/r-for-data-science?ambassador_code=GLYT_DES_Middle_SEP22&utm_source=GLYT&utm_campaign=GLYT_DES_Middle_SEP22
📌Data Science Mathematics: https://www.mygreatlearning.com/academy/learn-for-free/courses/data-science-mathematics?ambassador_code=GLYT_DES_Middle_SEP22&utm_source=GLYT&utm_campaign=GLYT_DES_Middle_SEP22
Get the free Great Learning App for a seamless experience, enroll for free courses and watch them offline by downloading them. https://glacad.me/3cSKlNl
⚡ About Great Learning:
With more than 5.4 Million+ learners in 170+ countries, Great Learning, a part of the BYJU'S group, is a leading global edtech company for professional and higher education offering industry-relevant programs in the blended, classroom, and purely online modes across technology, data and business domains. These programs are developed in collaboration with the top institutions like Stanford Executive Education, MIT Professional Education, The University of Texas at Austin, NUS, IIT Madras, IIT Bombay & more.
⚡ About Great Learning Academy:
Visit Great Learning Academy to get access to 1000+ free courses with a free certificates on Data Science, Data Analytics, Digital Marketing, Artificial Intelligence, Big Data, Cloud, Management, Cybersecurity, Software Development, and many more. These are supplemented with free projects, assignments, datasets, quizzes. You can earn a certificate of completion at the end of the course for free.
SOCIAL MEDIA LINKS:
🔹 For more interesting tutorials, don't forget to subscribe to our channel: https://glacad.me/YTsubscribe
🔹 For more updates on courses and tips follow us on:
✅ Telegram: https://t.me/GreatLearningAcademy
✅ Facebook: https://www.facebook.com/GreatLearningOfficial/
✅ LinkedIn: https://www.linkedin.com/school/great-learning/mycompany/verification/
✅ Follow our Blog: https://glacad.me/GL_Blog
detail
{'title': 'Data Science Full Course for Beginners 2023 (11 Hours Data Science Tutorial)', 'heatmap': [{'end': 4020.433, 'start': 3610.253, 'weight': 0.924}, {'end': 16079.29, 'start': 15659.931, 'weight': 0.748}, {'end': 40154.077, 'start': 39763.487, 'weight': 1}], 'summary': 'A comprehensive 11-hour data science course covers statistics, python, data analysis, visualization, machine learning, and regression, with real-world applications, emphasizing practical examples, and hands-on python basics, data manipulation, and visualization tutorials.', 'chapters': [{'end': 761.16, 'segs': [{'end': 36.248, 'src': 'embed', 'start': 0.369, 'weight': 2, 'content': [{'end': 3.47, 'text': 'Without a shred of doubt, our worlds revolve around data.', 'start': 0.369, 'duration': 3.101}, {'end': 6.551, 'text': "You draw money from the ATM, you've generated data.", 'start': 3.79, 'duration': 2.761}, {'end': 10.152, 'text': 'You purchase juice at the supermarket, further data is created.', 'start': 7.051, 'duration': 3.101}, {'end': 13.933, 'text': 'You clicked on this video, you get the gist.', 'start': 11.092, 'duration': 2.841}, {'end': 21.256, 'text': 'With all these resources in hand, we have realized that we can use this data to work smarter and innovate to make life so much easier.', 'start': 14.674, 'duration': 6.582}, {'end': 28.598, 'text': 'We have data with which we can come up with problem statements, but what methods can we employ to make inferences and act upon it?', 'start': 21.876, 'duration': 6.722}, {'end': 36.248, 'text': 'A subject like statistics is perfect for handling such problems and coming to good, useful conclusions from the information at hand.', 'start': 29.438, 'duration': 6.81}], 'summary': 'Data is everywhere, and statistics helps draw useful conclusions from it.', 'duration': 35.879, 'max_score': 0.369, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE369.jpg'}, {'end': 151.048, 'src': 'embed', 'start': 98.47, 'weight': 0, 'content': [{'end': 104.131, 'text': 'This course will be taught by Dr. Abhinanda Sarkar, who has his PhD in statistics from Stanford University.', 'start': 98.47, 'duration': 5.661}, {'end': 108.672, 'text': 'He has taught applied mathematics at the Massachusetts Institute of Technology.', 'start': 104.731, 'duration': 3.941}, {'end': 116.254, 'text': 'He has also been on the research staff at IBM and led the quality engineering development and analytics function at GE.', 'start': 109.812, 'duration': 6.442}, {'end': 119.737, 'text': 'He has also co-founded Omics Labs.', 'start': 117.537, 'duration': 2.2}, {'end': 124.359, 'text': "Without any further ado, let's get into the statistics for data science module.", 'start': 120.618, 'duration': 3.741}, {'end': 125.999, 'text': "It's time for some great learning.", 'start': 124.739, 'duration': 1.26}, {'end': 140.002, 'text': 'What you now need to do is you now need to be able to get the data to solve this problem.', 'start': 134.921, 'duration': 5.081}, {'end': 151.048, 'text': 'so therefore the statistical way of thinking typically says you formulate a problem and then you get the data to solve that problem.', 'start': 143.021, 'duration': 8.027}], 'summary': 'Dr. abhinanda sarkar, a stanford phd, teaches statistics for data science and has industry experience at ibm and ge.', 'duration': 52.578, 'max_score': 98.47, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE98470.jpg'}, {'end': 205.412, 'src': 'embed', 'start': 176.998, 'weight': 4, 'content': [{'end': 181.602, 'text': 'and I reach an interesting conclusion to this entire discussion that sometimes,', 'start': 176.998, 'duration': 4.604}, {'end': 191.352, 'text': "around the way the interviewer who's interviewing the statisticians for a data scientist job ask the question here is my data, what can you say?", 'start': 181.602, 'duration': 9.75}, {'end': 196.369, 'text': 'and the statistician answers with something like what do you want to know?', 'start': 192.728, 'duration': 3.641}, {'end': 200.03, 'text': "and the business guy says but that's why i want to hire you.", 'start': 196.369, 'duration': 3.661}, {'end': 205.412, 'text': "and the statistician says but if you don't tell me what you want to know, how do i know what to tell you?", 'start': 200.03, 'duration': 5.382}], 'summary': 'Interviewers expect statisticians to provide insights without specifying the desired information, leading to misunderstandings.', 'duration': 28.414, 'max_score': 176.998, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE176998.jpg'}], 'start': 0.369, 'title': 'Data science & statistics', 'summary': 'Emphasizes the importance of data in daily life and announces a comprehensive data science course focusing on statistics and python, taught by a stanford phd holder, with plans for additional high-quality tutorials. it also discusses the statistical approach in data science, challenges in communication between statisticians and business professionals, and the importance of asking the right questions to solve business problems through data analysis and decision-making.', 'chapters': [{'end': 119.737, 'start': 0.369, 'title': 'Data science & statistics course', 'summary': 'Emphasizes the importance of data in daily life, and announces a comprehensive data science course focusing on statistics and python, taught by a stanford phd holder, with plans for additional high-quality tutorials.', 'duration': 119.368, 'highlights': ['The course covers the basic foundation of problem solving using statistics and implementing it in Python, and is taught by Dr. Abhinanda Sarkar, a Stanford PhD holder and experienced professional.', 'The chapter emphasizes the importance of data in daily life, mentioning examples such as ATM transactions and supermarket purchases, to highlight the generation of data in various scenarios.', 'Plans for additional tutorials on computer vision, cybersecurity, and cloud computing are announced, showcasing the breadth of content that will be available on the channel.', 'The chapter outlines the topics to be covered in the course, including the contrast between statistics and machine learning, different types of statistics, types of data, correlation and covariance of data.', 'The importance of statistics in drawing conclusions from the available data is highlighted, emphasizing its role in problem solving and decision making.']}, {'end': 761.16, 'start': 120.618, 'title': 'Data science statistics and decision making', 'summary': 'Discusses the statistical approach in data science, the challenges in communication between statisticians and business professionals, and the importance of asking the right questions to solve business problems through data analysis and decision-making.', 'duration': 640.542, 'highlights': ['The statistical approach in data science emphasizes formulating a problem and then obtaining data to solve it, whereas the machine learning approach focuses on analyzing available data to derive insights.', 'The challenges in communication between statisticians and business professionals are highlighted, where the interviewer often asks for insights from the data without specifying the desired information, leading to a communication gap.', 'The importance of asking the right questions and framing the problem effectively in the context of data analysis and decision-making is emphasized, with a focus on the value of asking specific questions to solve business problems.']}], 'duration': 760.791, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE369.jpg', 'highlights': ['The course covers the basic foundation of problem solving using statistics and implementing it in Python, taught by Dr. Abhinanda Sarkar, a Stanford PhD holder and experienced professional.', 'The statistical approach in data science emphasizes formulating a problem and then obtaining data to solve it, whereas the machine learning approach focuses on analyzing available data to derive insights.', 'The chapter emphasizes the importance of data in daily life, mentioning examples such as ATM transactions and supermarket purchases, to highlight the generation of data in various scenarios.', 'The importance of statistics in drawing conclusions from the available data is highlighted, emphasizing its role in problem solving and decision making.', 'The challenges in communication between statisticians and business professionals are highlighted, where the interviewer often asks for insights from the data without specifying the desired information, leading to a communication gap.']}, {'end': 3253.89, 'segs': [{'end': 988.846, 'src': 'embed', 'start': 957.882, 'weight': 0, 'content': [{'end': 961.425, 'text': 'we all are right now in that big data.', 'start': 957.882, 'duration': 3.543}, {'end': 970.332, 'text': "what little data does the doctor know to see? that's a descriptive analytics problem.", 'start': 962.446, 'duration': 7.886}, {'end': 972.193, 'text': 'the doctor is not doing any inference on it.', 'start': 970.352, 'duration': 1.841}, {'end': 978.058, 'text': "the doctor is not building a conclusion and the doctor is not building an ai system on it, but it's still a hard problem.", 'start': 972.233, 'duration': 5.825}, {'end': 983.924, 'text': 'was given the vast amount of data that the doctor could potentially see.', 'start': 979.002, 'duration': 4.922}, {'end': 988.846, 'text': 'the doctor needs to know that i this is interesting to me, and this is interesting to me, and this is interesting to me,', 'start': 983.924, 'duration': 4.922}], 'summary': 'Descriptive analytics problem in healthcare with vast amount of data.', 'duration': 30.964, 'max_score': 957.882, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE957882.jpg'}, {'end': 1302.195, 'src': 'embed', 'start': 1268.215, 'weight': 1, 'content': [{'end': 1270.016, 'text': "the second question is you were saying it's a question of time.", 'start': 1268.215, 'duration': 1.801}, {'end': 1273.235, 'text': 'so you can average over time.', 'start': 1271.374, 'duration': 1.861}, {'end': 1275.357, 'text': 'if you average over time, this is a little easier.', 'start': 1273.695, 'duration': 1.662}, {'end': 1277.718, 'text': "you can say i'm going to do this maybe before eating.", 'start': 1275.397, 'duration': 2.321}, {'end': 1281.36, 'text': 'after eating little after eating.', 'start': 1279.299, 'duration': 2.061}, {'end': 1286.284, 'text': 'so those of you have a blood pressure test, for example of sorry blood sugar test.', 'start': 1282.001, 'duration': 4.283}, {'end': 1292.427, 'text': 'once they ask you to do it fasting and then they ask you to do it some two hours after eating.', 'start': 1286.284, 'duration': 6.143}, {'end': 1302.195, 'text': "do they tell you what to eat? sometimes with glucose sometimes they don't they sort of say that based on what you naturally eat.", 'start': 1293.668, 'duration': 8.527}], 'summary': 'Managing blood sugar levels involves timing and diet, with tests before and after eating.', 'duration': 33.98, 'max_score': 1268.215, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE1268215.jpg'}, {'end': 1830.351, 'src': 'embed', 'start': 1804.568, 'weight': 2, 'content': [{'end': 1809.272, 'text': 'the market research team decides to investigate whether there are differences across product line with respect to customer characteristics.', 'start': 1804.568, 'duration': 4.704}, {'end': 1815.257, 'text': 'exactly what you guys were suggesting that i should do with respect to the watch.', 'start': 1811.034, 'duration': 4.223}, {'end': 1818.4, 'text': 'understand who does what?', 'start': 1815.257, 'duration': 3.143}, {'end': 1826.287, 'text': 'entirely logical, the team decides to collect data on individuals who purchase a treadmill at a particular store during the past three months,', 'start': 1818.4, 'duration': 7.887}, {'end': 1830.351, 'text': 'like watches, and now click looking at data for treadmills.', 'start': 1826.287, 'duration': 4.064}], 'summary': 'Market research team investigates customer characteristics across product lines like watches and treadmills.', 'duration': 25.783, 'max_score': 1804.568, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE1804568.jpg'}, {'end': 2050.998, 'src': 'embed', 'start': 2018.151, 'weight': 3, 'content': [{'end': 2022.253, 'text': 'numpy is something that was built more for mathematical problems than anything else.', 'start': 2018.151, 'duration': 4.102}, {'end': 2025.594, 'text': 'so some of the mathematical algorithms that are needed are there.', 'start': 2023.193, 'duration': 2.401}, {'end': 2027.515, 'text': 'there are other stats.', 'start': 2026.755, 'duration': 0.76}, {'end': 2033.298, 'text': "i plots in metallo, plot life or seaborne and many other things that you've seen already.", 'start': 2027.515, 'duration': 5.783}, {'end': 2038.32, 'text': 'python is still figuring out how to arrange these libraries well enough.', 'start': 2033.298, 'duration': 5.022}, {'end': 2044.023, 'text': 'the, shall we say, that the programming bias is sometimes shows through in the libraries.', 'start': 2038.32, 'duration': 5.703}, {'end': 2050.998, 'text': 'so i for one do not remotely know this well enough to know what to import up front.', 'start': 2045.376, 'duration': 5.622}], 'summary': 'Numpy is designed for mathematical problems, with various algorithms and libraries for stats and plotting available, but python is still working on organizing these effectively.', 'duration': 32.847, 'max_score': 2018.151, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE2018151.jpg'}, {'end': 2620.35, 'src': 'embed', 'start': 2576.973, 'weight': 4, 'content': [{'end': 2586.478, 'text': 'who do i engineer for? and so therefore people have different ranges of what i mean to represent it.', 'start': 2576.973, 'duration': 9.505}, {'end': 2589.16, 'text': "so here's one version of it.", 'start': 2587.819, 'duration': 1.341}, {'end': 2591.937, 'text': 'this is what is called a five-point summary.', 'start': 2590.336, 'duration': 1.601}, {'end': 2603.161, 'text': 'i report out the minimum the 25% point the 50% point the 75% point and the maximum variable by variable.', 'start': 2593.197, 'duration': 9.964}, {'end': 2604.702, 'text': 'i report five numbers.', 'start': 2603.622, 'duration': 1.08}, {'end': 2620.35, 'text': 'i report the lowest what is 25% mean? 25% of my data set or the people are younger than 24.', 'start': 2607.203, 'duration': 13.147}], 'summary': 'Engineers create a five-point summary, reporting minimum, 25th, 50th, 75th percentiles, and maximum, for data analysis.', 'duration': 43.377, 'max_score': 2576.973, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE2576973.jpg'}], 'start': 761.54, 'title': 'Data analysis and descriptive analytics', 'summary': 'Covers challenges in creating prescriptions, analyzing blood sugar tests, treadmill customer data, data analysis and visualization, and five-point summary and distribution statistics, addressing various data interpretation challenges and providing statistical insights.', 'chapters': [{'end': 1265.473, 'start': 761.54, 'title': 'Prescription challenges and descriptive analytics', 'summary': 'Discusses the challenges of creating prescriptions to meet various requirements, such as autonomous vehicle rules, and the complexity of descriptive analytics in healthcare, addressing the difficulty of interpreting random bodily variables like blood content.', 'duration': 503.933, 'highlights': ['The challenges of creating prescriptions to meet various requirements, such as autonomous vehicle rules, and the complexity of descriptive analytics in healthcare.', 'The difficulty of interpreting random bodily variables like blood content and the challenges in reaching conclusions based on random quantities.', 'The complexities of taking blood samples and the challenges in ensuring consistency and accuracy in the samples.']}, {'end': 1694.453, 'start': 1268.215, 'title': 'Analyzing blood sugar tests', 'summary': 'Discusses the process of averaging blood sugar tests over time, the implications of averaging on neutralizing data, and the need for descriptive analytics and probability language in medical tests.', 'duration': 426.238, 'highlights': ["The process of averaging blood sugar tests over time is discussed, emphasizing the importance of normal eating and the implications for understanding the body's response.", 'The discussion on averaging highlights the neutralizing effect it has on data, providing contextual understanding for doctors in interpreting test results.', 'The need for descriptive analytics and probability language in medical tests is emphasized to quantify uncertainties and variations in test results.']}, {'end': 1992.618, 'start': 1696.738, 'title': 'Analyzing treadmill customer data', 'summary': 'Provides context for understanding a set of data related to customer characteristics for treadmill products, collected over the past three months, to identify product market fit and customer profiles in a statistical manner.', 'duration': 295.88, 'highlights': ['The market research team is tasked with identifying the typical customer profile for each treadmill product and investigating differences across product lines, collecting data on individuals who purchased treadmills at a particular store over the past three months.', 'The data includes information such as gender, age, education years, relationship status, annual household income, average number of times a customer plans to use a treadmill each week, and self-rated fitness scale, providing valuable insights into customer characteristics and behavior.', 'The chapter emphasizes the importance of understanding customer characteristics and product market fit to align product offerings with customer preferences and behaviors, highlighting the significance of statistical analysis in identifying product-market fit.']}, {'end': 2576.853, 'start': 1994.899, 'title': 'Data analysis and visualization', 'summary': 'Discusses the use of python libraries like pandas, numpy, and seaborn for data analysis and visualization, addressing challenges in data representation, and the need for representative data in product design.', 'duration': 581.954, 'highlights': ['Python libraries like pandas, numpy, and seaborn are used for data analysis and visualization, with pandas offering a fair amount of statistics built-in and numpy being more suitable for mathematical problems.', 'Challenges exist in representing data, such as the need to distinguish between numerical and categorical variables, and the storage granularity of data in the data frame.', 'The importance of obtaining a representative data point, such as a single average age, is emphasized for making informed product design decisions, considering factors like over-engineering and user variability.']}, {'end': 3253.89, 'start': 2576.973, 'title': 'Five-point summary and distribution statistics', 'summary': 'Explains the concept of a five-point summary and distribution statistics, including the median and mean, to represent data variability, with examples of age distribution and insights into the differences between median and mean, and their implications for data analysis.', 'duration': 676.917, 'highlights': ['Explaining the five-point summary and distribution statistics', 'Difference between median and mean representation', 'Implications of median and mean on data analysis']}], 'duration': 2492.35, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE761540.jpg', 'highlights': ['The challenges of creating prescriptions to meet various requirements, such as autonomous vehicle rules, and the complexity of descriptive analytics in healthcare.', "The process of averaging blood sugar tests over time is discussed, emphasizing the importance of normal eating and the implications for understanding the body's response.", 'The market research team is tasked with identifying the typical customer profile for each treadmill product and investigating differences across product lines, collecting data on individuals who purchased treadmills at a particular store over the past three months.', 'Python libraries like pandas, numpy, and seaborn are used for data analysis and visualization, with pandas offering a fair amount of statistics built-in and numpy being more suitable for mathematical problems.', 'Explaining the five-point summary and distribution statistics']}, {'end': 6562.193, 'segs': [{'end': 3314.887, 'src': 'embed', 'start': 3286.109, 'weight': 0, 'content': [{'end': 3287.87, 'text': "no, that's it's the same number of observations.", 'start': 3286.109, 'duration': 1.761}, {'end': 3290.031, 'text': "i'll say the data is pushed to the right.", 'start': 3288.57, 'duration': 1.461}, {'end': 3297.694, 'text': 'more variation on the right side is probably a safer way of putting it.', 'start': 3292.112, 'duration': 5.582}, {'end': 3301.455, 'text': 'yes so skewness is often measured in various things.', 'start': 3297.734, 'duration': 3.721}, {'end': 3306.858, 'text': 'one measure of skewness is typically, for example, mean minus median, mean minus median.', 'start': 3301.916, 'duration': 4.942}, {'end': 3311.946, 'text': 'if it is positive, it usually correspond try skewness mean minus median.', 'start': 3306.858, 'duration': 5.088}, {'end': 3314.887, 'text': 'negative usually corresponds to left skewness.', 'start': 3311.946, 'duration': 2.941}], 'summary': 'Skewness is measured by mean minus median. positive is right skewness, negative is left skewness.', 'duration': 28.778, 'max_score': 3286.109, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE3286109.jpg'}, {'end': 3397.355, 'src': 'embed', 'start': 3367.67, 'weight': 1, 'content': [{'end': 3368.591, 'text': "that doesn't make it a bad book.", 'start': 3367.67, 'duration': 0.921}, {'end': 3375.296, 'text': "so if you're looking for help on how to code things up, this is not the right book.", 'start': 3370.732, 'duration': 4.564}, {'end': 3378.819, 'text': 'get a book like king stats or something like that.', 'start': 3376.997, 'duration': 1.822}, {'end': 3384.363, 'text': "but if you want to understand the statistics side to it, it's an excellent book.", 'start': 3380.46, 'duration': 3.903}, {'end': 3386.505, 'text': "so everything that i'm talking about is going to be here.", 'start': 3384.624, 'duration': 1.881}, {'end': 3389.868, 'text': 'i might talk about which chapters and things like that at some point.', 'start': 3387.386, 'duration': 2.482}, {'end': 3393.754, 'text': 'and i might talk about how to use this in the book.', 'start': 3391.653, 'duration': 2.101}, {'end': 3397.355, 'text': 'so for example at the back of this book, there are lots in their tables.', 'start': 3393.774, 'duration': 3.581}], 'summary': 'Book is not for coding, but excellent for understanding statistics.', 'duration': 29.685, 'max_score': 3367.67, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE3367670.jpg'}, {'end': 4020.433, 'src': 'heatmap', 'start': 3610.253, 'weight': 0.924, 'content': [{'end': 3615.335, 'text': "it's it's a little easier from a computational perspective, although harder from a conceptual perspective.", 'start': 3610.253, 'duration': 5.082}, {'end': 3618.456, 'text': 'so we begin at this way, but hold on to that idea.', 'start': 3615.395, 'duration': 3.061}, {'end': 3625.778, 'text': "and then as you keep going see if this is something that you want to learn more on and if you can you're welcome just write to us.", 'start': 3619.016, 'duration': 6.762}, {'end': 3630.8, 'text': "so let us know already anyone know that with us just come in let her know and we'll get the references to you.", 'start': 3625.798, 'duration': 5.002}, {'end': 3635.762, 'text': 'but if you want to for say for the first residencies, please read the book and see what happens if there are doubts.', 'start': 3630.82, 'duration': 4.942}, {'end': 3638.063, 'text': "yes, but it's a it's a well-written book.", 'start': 3636.522, 'duration': 1.541}, {'end': 3644.266, 'text': "it's in its instructor is one of our colleagues here, you know, if you want to give you can also help explain things.", 'start': 3638.863, 'duration': 5.403}, {'end': 3649.148, 'text': 'so this is the summary.', 'start': 3648.308, 'duration': 0.84}, {'end': 3660.914, 'text': "what did the summary tell you? this summary gave you what's called the five numbers five numbers that help you describe the data minimum 2550 75 max.", 'start': 3650.089, 'duration': 10.825}, {'end': 3662.915, 'text': "we'll see another graphical description of this.", 'start': 3661.234, 'duration': 1.681}, {'end': 3672.858, 'text': 'it also described for you a mean there is also another number here and this is this number is indicated by the letters std.', 'start': 3663.515, 'duration': 9.343}, {'end': 3679.359, 'text': 'std refers to standard deviation yesterday refers to standard.', 'start': 3674.719, 'duration': 4.64}, {'end': 3690.541, 'text': 'deviation and what is the formula for a standard deviation? std is equal to.', 'start': 3681.62, 'duration': 8.921}, {'end': 3693.582, 'text': 'the square root of.', 'start': 3692.642, 'duration': 0.94}, {'end': 3715.936, 'text': 'little bit of a mess.', 'start': 3715.236, 'duration': 0.7}, {'end': 3722.34, 'text': 'but two steps step 1 calculate the average.', 'start': 3718.138, 'duration': 4.202}, {'end': 3729.044, 'text': 'step 2 take the distance from the average for every observation.', 'start': 3724.502, 'duration': 4.542}, {'end': 3731.826, 'text': 'ask the question.', 'start': 3731.006, 'duration': 0.82}, {'end': 3734.728, 'text': 'how far is every data point from the middle?', 'start': 3732.286, 'duration': 2.442}, {'end': 3740.751, 'text': 'if it is very far from the middle, say that the deviation is more.', 'start': 3737.109, 'duration': 3.642}, {'end': 3748.273, 'text': 'if it is not far from the middle, say the deviation is less, deviation being used as a synonym for variation, and talk about variation.', 'start': 3740.751, 'duration': 7.522}, {'end': 3750.173, 'text': 'variation can be more or variation can be less.', 'start': 3748.273, 'duration': 1.9}, {'end': 3754.234, 'text': 'more than the average, less than the average.', 'start': 3751.593, 'duration': 2.641}, {'end': 3757.974, 'text': "if someone is much older than average, there's variation.", 'start': 3754.234, 'duration': 3.74}, {'end': 3760.715, 'text': 'if someone is much younger than average, there is variation.', 'start': 3757.974, 'duration': 2.741}, {'end': 3763.775, 'text': 'so therefore both of these are variation.', 'start': 3762.155, 'duration': 1.62}, {'end': 3772.357, 'text': 'so what i do is when i take the difference from the average i square it so more than x bar becomes positive less than x bar also becomes positive.', 'start': 3764.196, 'duration': 8.161}, {'end': 3776.563, 'text': 'then i add it up and i average it.', 'start': 3774.022, 'duration': 2.541}, {'end': 3779.845, 'text': "there's a small question as to why it is n minus 1,", 'start': 3776.883, 'duration': 2.962}, {'end': 3784.747, 'text': "and that is because i'm taking a difference from an observation that is already taken from the data.", 'start': 3779.845, 'duration': 4.902}, {'end': 3791.57, 'text': 'now ever squared when i have squared my original unit was in age.', 'start': 3786.608, 'duration': 4.962}, {'end': 3795.332, 'text': 'when i have squared this has become a squared.', 'start': 3793.091, 'duration': 2.241}, {'end': 3799.781, 'text': 'so i take the square root in order to get my measure back into the scale of years.', 'start': 3796.499, 'duration': 3.282}, {'end': 3806.604, 'text': 'so the standard deviation is a measure of how spread a typical observation is from the average.', 'start': 3800.161, 'duration': 6.443}, {'end': 3813.047, 'text': 'it is a standard deviation where a deviation is how far from the average you are.', 'start': 3808.385, 'duration': 4.662}, {'end': 3819.67, 'text': 'and because of the squaring you need to work with a square root.', 'start': 3815.368, 'duration': 4.302}, {'end': 3829.835, 'text': 'in sort of modern machine learning people sometimes use something called a mean absolute deviation mad mad very optimistically called.', 'start': 3821.232, 'duration': 8.603}, {'end': 3836.677, 'text': "so mad is is you don't take a square you take an absolute value.", 'start': 3831.576, 'duration': 5.101}, {'end': 3840.799, 'text': 'and then you do not have a square root outside it.', 'start': 3838.998, 'duration': 1.801}, {'end': 3847.141, 'text': 'and that is sometimes used as a measure of how much variability there is.', 'start': 3842.619, 'duration': 4.522}, {'end': 3857.144, 'text': 'so why it is perfect? why is it we square it because we want to look at both positive and negative deviations.', 'start': 3849.082, 'duration': 8.062}, {'end': 3860.427, 'text': "if i didn't square it what would happen is it would cancel out.", 'start': 3858.044, 'duration': 2.383}, {'end': 3865.111, 'text': 'what was the word that one of you used neutralize right? i love that term.', 'start': 3861.447, 'duration': 3.664}, {'end': 3869.354, 'text': 'your positive deviations would neutralize your negative deviations.', 'start': 3866.412, 'duration': 2.942}, {'end': 3874.719, 'text': 'yes yes.', 'start': 3873.098, 'duration': 1.621}, {'end': 3886.593, 'text': 'this number is going to be positive if say x1.', 'start': 3882.81, 'duration': 3.783}, {'end': 3888.075, 'text': "so let's look at the first number here.", 'start': 3886.633, 'duration': 1.442}, {'end': 3892.559, 'text': 'so if i look at the head command here, when i did the head command here, what did the head??', 'start': 3888.495, 'duration': 4.064}, {'end': 3893.439, 'text': 'what did the head command??', 'start': 3892.599, 'duration': 0.84}, {'end': 3895.541, 'text': 'give me the first few observations.', 'start': 3893.479, 'duration': 2.062}, {'end': 3896.803, 'text': 'and now this is an 18 year old.', 'start': 3895.561, 'duration': 1.242}, {'end': 3898.824, 'text': 'this probably sorted by age.', 'start': 3896.823, 'duration': 2.001}, {'end': 3899.885, 'text': 'this is an 18 year old correct.', 'start': 3898.844, 'duration': 1.041}, {'end': 3905.686, 'text': "now i'm trying to explain the variability of this data with respect to this 18 year old.", 'start': 3901.024, 'duration': 4.662}, {'end': 3916.09, 'text': 'what is the what is the what why is a variation this 18 number is not the same as 28 and 18 is less than 28.', 'start': 3906.226, 'duration': 9.864}, {'end': 3921.152, 'text': 'so what i want to do is i want to go 18 minus.', 'start': 3916.09, 'duration': 5.062}, {'end': 3925.714, 'text': "28.7 what i'm interested in is this 10.", 'start': 3921.172, 'duration': 4.542}, {'end': 3927.175, 'text': 'this 10 year difference between the two.', 'start': 3925.714, 'duration': 1.461}, {'end': 3939.233, 'text': 'now the person the oldest person in this data set is how old 50 when i get to that rule this 50 will also differ from this 28 by 22 years.', 'start': 3929.349, 'duration': 9.884}, {'end': 3945.136, 'text': "so i'm interested in that 10 and i'm interested in the 22.", 'start': 3941.174, 'duration': 3.962}, {'end': 3950.413, 'text': "i'm not interested in the minus 10 or a minus 22.", 'start': 3945.136, 'duration': 5.277}, {'end': 3951.013, 'text': 'i can do that.', 'start': 3950.413, 'duration': 0.6}, {'end': 3952.294, 'text': 'i can do that.', 'start': 3951.213, 'duration': 1.081}, {'end': 3952.954, 'text': 'you know.', 'start': 3952.614, 'duration': 0.34}, {'end': 3964.481, 'text': 'what i can do is i can look at i can represent 18 minus 28 as 10 and i can represent 28 minus 50 as 22, and that is this, as i said 1 over n minus 1.', 'start': 3952.954, 'duration': 11.527}, {'end': 3965.521, 'text': 'absolute x.', 'start': 3964.481, 'duration': 1.04}, {'end': 3969.203, 'text': '1. minus x bar plus plus absolute xn.', 'start': 3965.521, 'duration': 3.682}, {'end': 3969.864, 'text': 'minus x bar.', 'start': 3969.203, 'duration': 0.661}, {'end': 3978.889, 'text': "that is this with n minus 1 and this is done as i'm saying this is what is called mean absolute deviation.", 'start': 3969.904, 'duration': 8.985}, {'end': 3982.051, 'text': 'and many machine learning algorithms use.', 'start': 3980.031, 'duration': 2.02}, {'end': 3987.152, 'text': "this you are correct in today's world.", 'start': 3983.492, 'duration': 3.66}, {'end': 3988.212, 'text': 'this is simpler.', 'start': 3987.512, 'duration': 0.7}, {'end': 3997.614, 'text': 'now when standard deviations came up first, this was actually harder but people did argue about this.', 'start': 3990.653, 'duration': 6.961}, {'end': 4003.095, 'text': 'i think well 150 maybe more about i forget my history that much.', 'start': 3999.935, 'duration': 3.16}, {'end': 4007.916, 'text': 'there are two famous mathematicians one named gauss and one named laplace.', 'start': 4003.475, 'duration': 4.441}, {'end': 4020.433, 'text': 'who argued as to whether to use this or whether to use this laplace said you should use this, and gauss said you should use now.', 'start': 4009.428, 'duration': 11.005}], 'summary': 'Summary of data analysis methods, including mean, standard deviation, and mean absolute deviation for variability assessment.', 'duration': 410.18, 'max_score': 3610.253, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE3610253.jpg'}, {'end': 3829.835, 'src': 'embed', 'start': 3800.161, 'weight': 2, 'content': [{'end': 3806.604, 'text': 'so the standard deviation is a measure of how spread a typical observation is from the average.', 'start': 3800.161, 'duration': 6.443}, {'end': 3813.047, 'text': 'it is a standard deviation where a deviation is how far from the average you are.', 'start': 3808.385, 'duration': 4.662}, {'end': 3819.67, 'text': 'and because of the squaring you need to work with a square root.', 'start': 3815.368, 'duration': 4.302}, {'end': 3829.835, 'text': 'in sort of modern machine learning people sometimes use something called a mean absolute deviation mad mad very optimistically called.', 'start': 3821.232, 'duration': 8.603}], 'summary': 'Standard deviation measures spread from average, used in modern machine learning.', 'duration': 29.674, 'max_score': 3800.161, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE3800161.jpg'}, {'end': 4420.896, 'src': 'embed', 'start': 4392.307, 'weight': 3, 'content': [{'end': 4399.012, 'text': 'interquartile ranges upper quartile minus lower quartile and these measures are used.', 'start': 4392.307, 'duration': 6.705}, {'end': 4403.755, 'text': 'they do see certain uses based on certain applications.', 'start': 4400.393, 'duration': 3.362}, {'end': 4407.317, 'text': 'you can see certain advantages to this.', 'start': 4405.596, 'duration': 1.721}, {'end': 4413.301, 'text': "for example, let's suppose that i calculate my five point summary with my five point summary.", 'start': 4408.138, 'duration': 5.163}, {'end': 4420.896, 'text': 'i can now give you a measure of location, which is my median, and i can give you two measures of dispersion,', 'start': 4413.381, 'duration': 7.515}], 'summary': 'Interquartile range has advantages in location and dispersion measures.', 'duration': 28.589, 'max_score': 4392.307, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE4392307.jpg'}, {'end': 5366.341, 'src': 'embed', 'start': 5336.748, 'weight': 4, 'content': [{'end': 5342.291, 'text': "it's also what makes it interesting and it's sort of interesting and exciting.", 'start': 5336.748, 'duration': 5.543}, {'end': 5344.352, 'text': "it's not all bad.", 'start': 5342.932, 'duration': 1.42}, {'end': 5351.036, 'text': 'okay. so the histogram command summaries of what these histograms are, and each gives you a sense of what the distribution is.', 'start': 5345.033, 'duration': 6.003}, {'end': 5357.539, 'text': 'and, as you can see from most of these pictures, most of these variables, when they do have a skew, tend to have a right skew.', 'start': 5351.036, 'duration': 6.503}, {'end': 5359.6, 'text': 'maybe education has a little bit of a left skew.', 'start': 5357.579, 'duration': 2.021}, {'end': 5366.341, 'text': 'maybe education a little bit of a left skew that a few people are educated and most people are here, but even so.', 'start': 5361.718, 'duration': 4.623}], 'summary': 'Histograms provide distribution insights, mostly right skewed, some left skewed.', 'duration': 29.593, 'max_score': 5336.748, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE5336748.jpg'}], 'start': 3253.89, 'title': 'Data skewness and visualization', 'summary': 'Discusses skewness in data, reviews a business statistics book, explains standard deviation, explores measures of variability, and covers data visualization using histograms and box plots, emphasizing practical applications and limitations in modern contexts.', 'chapters': [{'end': 3338.705, 'start': 3253.89, 'title': 'Understanding skewed data', 'summary': 'Discusses the concept of skewness in data, highlighting that right skewed data has more variation on the right side and is often measured using mean minus median, which gives a positive value for right skewness and a negative value for left skewness.', 'duration': 84.815, 'highlights': ['Right skewed data means more variation on the right side, often measured using mean minus median, giving a positive value for right skewness.', 'Mean minus median is a measure used to determine skewness, with a positive result indicating right skewness and a negative result indicating left skewness.', 'Skewed data can cause difficulties in analysis as the idea of variation changes, with variation on one side meaning something different than variation on the other side.']}, {'end': 3635.762, 'start': 3339.305, 'title': 'Business statistics book review', 'summary': 'Reviews a business statistics book, highlighting its strengths in explaining statistics concepts, while lacking in coding guidance. it emphasizes the importance of understanding the statistics side but warns against trying to delve into equal depths on every topic due to the overwhelming material to be covered in the program.', 'duration': 296.457, 'highlights': ['The book is excellent for understanding the statistics side, providing comprehensive coverage of the subject matter.', 'It does not provide guidance on coding and is not suitable for learning how to code things up.', 'The next nine months of the program will involve a significant amount of material, cautioning against trying to delve into equal depths on every topic.', 'Emphasizes the importance of understanding the statistics side and encourages delving into the depth of it.', 'Provides a cautionary note on the overwhelming amount of material to be covered in the program and advises to pick battles wisely.']}, {'end': 4184.999, 'start': 3636.522, 'title': 'Understanding standard deviation', 'summary': 'Explains the concept of standard deviation, highlighting its formula, purpose, historical background, and its comparison with mean absolute deviation, emphasizing its relevance and criticism in modern machine learning and finance literature.', 'duration': 548.477, 'highlights': ['The standard deviation formula is explained as the square root of the average of the squared differences from the mean, with the rationale for using n-1 in the formula to restore the measure to the scale of years.', "The comparison between standard deviation and mean absolute deviation is discussed, emphasizing the former's consideration of both positive and negative deviations, while the latter is critiqued for being less sensitive to outliers.", "The historical debate between mathematicians Gauss and Laplace on the use of standard deviation is outlined, highlighting Gauss' victory due to the ease of calculations using calculus, but acknowledging the increasing relevance of Laplace's approach in modern analytics.", "The criticism of standard deviation as a measure of variability in finance literature is mentioned, with Nassim Taleb's books 'The Black Swan' and 'Fooled by Randomness' cited as examples of the critique.", "The evolution of analytics and computing is discussed, emphasizing the diminishing relevance of the historical debate and the increasing use of Laplace's approach due to its resistance to outliers and its practicality in modern applications."]}, {'end': 5336.708, 'start': 4192.548, 'title': 'Understanding measures of variability', 'summary': 'Explores measures of variability such as range and interquartile range, along with the challenges of managing data in different formats and the practical considerations of deploying analytical solutions in companies.', 'duration': 1144.16, 'highlights': ['Measures of variability such as range and interquartile range are used to understand the dispersion of data, providing insights into the spread of values and the variability between observations.', 'Managing data in different formats poses challenges, including the need to store data along with a data dictionary to describe the variables and their types, leading to practical complexities in data curation and maintenance.', 'Deploying analytical solutions in companies involves balancing the trade-off between doing the right thing and the practical challenges of ensuring continuity and simplicity, often influenced by the company culture and regulatory environment.']}, {'end': 6562.193, 'start': 5336.748, 'title': 'Exploring data with histograms and box plots', 'summary': 'Covers the use of histograms and box plots to visualize data distributions, with emphasis on right skewness and the identification of outliers, as well as the application of pair plots for univariate analysis and the pivot table equivalent for categorical data.', 'duration': 1225.445, 'highlights': ['Histograms and box plots are used to visualize data distributions, with emphasis on detecting right skewness and identifying outliers based on the interquartile range.', 'Explanation of box plot components including whiskers, quartiles, and outliers, as well as the ability to modify the whisker length and visualize the five-point summary.', 'Introduction to pair plots for univariate analysis and the pivot table equivalent for categorical data, enabling visualization and comparison of variables within the dataset.']}], 'duration': 3308.303, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE3253890.jpg', 'highlights': ['Right skewed data means more variation on the right side, often measured using mean minus median, giving a positive value for right skewness.', 'The book is excellent for understanding the statistics side, providing comprehensive coverage of the subject matter.', 'The standard deviation formula is explained as the square root of the average of the squared differences from the mean, with the rationale for using n-1 in the formula to restore the measure to the scale of years.', 'Measures of variability such as range and interquartile range are used to understand the dispersion of data, providing insights into the spread of values and the variability between observations.', 'Histograms and box plots are used to visualize data distributions, with emphasis on detecting right skewness and identifying outliers based on the interquartile range.']}, {'end': 7755.627, 'segs': [{'end': 7129.171, 'src': 'embed', 'start': 7103.975, 'weight': 4, 'content': [{'end': 7115.827, 'text': 'so what is common between my observed set of data and the data for my new customers? that commonality is what you can think of as a distribution.', 'start': 7103.975, 'duration': 11.852}, {'end': 7129.171, 'text': 'so he says that from this can you give me a sense of what this distribution is and from this distribution i can think of other people.', 'start': 7120.108, 'duration': 9.063}], 'summary': 'Identifying commonalities to establish a distribution for new customers.', 'duration': 25.196, 'max_score': 7103.975, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE7103975.jpg'}, {'end': 7332.325, 'src': 'embed', 'start': 7301.639, 'weight': 1, 'content': [{'end': 7316.057, 'text': 'how do i know that because i have this reading this month? so the idea of a distribution is to be able to abstract away from the data.', 'start': 7301.639, 'duration': 14.418}, {'end': 7324.662, 'text': 'the random part and the systematic part and the systematic part is what remains as the distribution around it.', 'start': 7317.398, 'duration': 7.264}, {'end': 7326.162, 'text': "there's going to be a random variation.", 'start': 7324.682, 'duration': 1.48}, {'end': 7332.325, 'text': 'and the random variation is going to exist from data set to data set, like this month and next month,', 'start': 7327.543, 'duration': 4.782}], 'summary': 'Understanding distribution helps abstract data and identify random variations.', 'duration': 30.686, 'max_score': 7301.639, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE7301639.jpg'}, {'end': 7405.174, 'src': 'embed', 'start': 7371.81, 'weight': 2, 'content': [{'end': 7374.512, 'text': 'how do i extend your blood pressure readings to the next blood pressure readings?', 'start': 7371.81, 'duration': 2.702}, {'end': 7377.134, 'text': 'how do i figure this out?', 'start': 7376.214, 'duration': 0.92}, {'end': 7379.316, 'text': 'that is the heart of statistics.', 'start': 7377.975, 'duration': 1.341}, {'end': 7380.497, 'text': "that's called statistical inference.", 'start': 7379.336, 'duration': 1.161}, {'end': 7387.383, 'text': 'to abstract away from the data certain things that remain the same and certain things that do not.', 'start': 7381.938, 'duration': 5.445}, {'end': 7392.487, 'text': 'so a distribution is an estimate of that underlying true distribution of age.', 'start': 7387.383, 'duration': 5.104}, {'end': 7395.409, 'text': "and so it's not as rough.", 'start': 7394.489, 'duration': 0.92}, {'end': 7397.151, 'text': "it's smoother.", 'start': 7396.55, 'duration': 0.601}, {'end': 7405.174, 'text': 'how smooth is something that the plot changes that the plot figures out on its own like histogram, but you are free to change it.', 'start': 7398.607, 'duration': 6.567}], 'summary': 'Statistical inference smoothens data distribution for accurate estimates.', 'duration': 33.364, 'max_score': 7371.81, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE7371810.jpg'}, {'end': 7682.023, 'src': 'embed', 'start': 7642.48, 'weight': 0, 'content': [{'end': 7646.561, 'text': 'so in order to do that therefore you your experimentation should cover all of that.', 'start': 7642.48, 'duration': 4.081}, {'end': 7652.622, 'text': "what does that mean? for example in business terms, let's say that you're looking at sales data and you want to understand your sales distribution.", 'start': 7646.901, 'duration': 5.721}, {'end': 7655.143, 'text': "well, don't focus on certain salespeople.", 'start': 7653.062, 'duration': 2.081}, {'end': 7656.523, 'text': 'look at your bad salespeople.', 'start': 7655.403, 'duration': 1.12}, {'end': 7658.404, 'text': 'look at your good salespeople.', 'start': 7656.543, 'duration': 1.861}, {'end': 7660.064, 'text': 'look at your high selling products.', 'start': 7658.804, 'duration': 1.26}, {'end': 7663.525, 'text': 'look at your low selling products cover the range of possibilities.', 'start': 7660.084, 'duration': 3.441}, {'end': 7667.995, 'text': 'if you do not cover the range of possibilities, you will not see the distribution.', 'start': 7665.393, 'duration': 2.602}, {'end': 7674.339, 'text': "if you do not see the distribution, you will not know what, where the future data will come from, and if you don't know that,", 'start': 7668.655, 'duration': 5.684}, {'end': 7676.34, 'text': "you'll not be able to do any prediction or prescription for that.", 'start': 7674.339, 'duration': 2.001}, {'end': 7682.023, 'text': 'the histogram is just the summary.', 'start': 7680.302, 'duration': 1.721}], 'summary': 'Experimentation should cover a range of possibilities to understand data distribution and make predictions.', 'duration': 39.543, 'max_score': 7642.48, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE7642480.jpg'}], 'start': 6564.134, 'title': 'Analyzing data distribution', 'summary': 'Covers the identification and treatment of ordinal and categorical variables, importance of understanding data distribution, and statistical inference with examples of blood pressure readings and sales data, emphasizing challenges and solutions in data analysis and the significance of distribution plots for accurate prediction and prescription.', 'chapters': [{'end': 7189.797, 'start': 6564.134, 'title': 'Understanding categorical and ordinal variables', 'summary': 'Discusses the identification and treatment of ordinal and categorical variables, emphasizing the challenges and solutions in analyzing and representing such data, and concludes with insights on distribution plots and their significance in understanding data distribution and making predictions.', 'duration': 625.663, 'highlights': ['The fitness variable has few numbers (1 to 5) and is considered an ordinal categorical variable, posing challenges in data analysis and representation.', 'The discussion on handling variables like zip codes, which are categorical but often recognized as numbers by databases, and the challenges in analyzing and representing them.', 'Insights into the use of distribution plots to understand data distribution and make predictions based on samples from the population distribution.']}, {'end': 7371.53, 'start': 7190.638, 'title': 'Understanding data distribution', 'summary': 'Discusses the concept of data distribution using the example of blood pressure readings, emphasizing the distinction between representing the distribution and being the distribution itself, and highlighting the importance of understanding systematic and random variations in data for predictive analysis.', 'duration': 180.892, 'highlights': ['The distinction between representing the distribution and being the distribution itself is emphasized, with the example of blood pressure readings, to illustrate the concept of data distribution.', 'Importance of understanding systematic and random variations in data for predictive analysis is highlighted using the example of blood pressure readings, and its relevance to making future predictions based on current data.', 'Analogous examples are used to explain the application of understanding data distribution in scenarios such as analyzing different customer sets or branches of stores to generalize results and make predictions.']}, {'end': 7755.627, 'start': 7371.81, 'title': 'Statistical inference and distribution plotting', 'summary': 'Discusses statistical inference, distribution plotting, and the importance of covering a range of possibilities in experimentation for accurate prediction and prescription, using blood pressure readings and sales data as examples.', 'duration': 383.817, 'highlights': ['The importance of covering a range of possibilities in experimentation for accurate prediction and prescription.', 'Explanation of statistical inference and its application in estimating the underlying true distribution of age.', 'Explanation of density function and distribution function in plotting, including the relationship between the two.']}], 'duration': 1191.493, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE6564134.jpg', 'highlights': ['Insights into the use of distribution plots for understanding data distribution and making predictions.', 'Importance of understanding systematic and random variations in data for predictive analysis.', 'Explanation of statistical inference and its application in estimating the underlying true distribution of age.', 'The importance of covering a range of possibilities in experimentation for accurate prediction and prescription.', 'The distinction between representing the distribution and being the distribution itself is emphasized.']}, {'end': 8764.47, 'segs': [{'end': 7850.641, 'src': 'embed', 'start': 7824.174, 'weight': 0, 'content': [{'end': 7832.177, 'text': "nobody's interested in your data now, but you still have to analyze the data that is in front of you and reach a conclusion that makes sense to them.", 'start': 7824.174, 'duration': 8.003}, {'end': 7838.559, 'text': 'the bank has to look at its order at its, you know portfolio and figure out what is risk strategy should be.', 'start': 7833.337, 'duration': 5.222}, {'end': 7844.097, 'text': 'the clothing store needs to figure out look at its sales and figure out what loads it should make.', 'start': 7839.755, 'duration': 4.342}, {'end': 7850.641, 'text': 'great learning has to figure out its course reviews and figure out which faculty members to keep.', 'start': 7845.538, 'duration': 5.103}], 'summary': 'Businesses need to analyze data for risk strategy, sales, and course reviews.', 'duration': 26.467, 'max_score': 7824.174, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE7824174.jpg'}, {'end': 7930.538, 'src': 'embed', 'start': 7906.485, 'weight': 2, 'content': [{'end': 7913.65, 'text': 'how do you decide? experience you got past data and that data is telling you please cross the road.', 'start': 7906.485, 'duration': 7.165}, {'end': 7922.495, 'text': 'how that data is not seen that car ah k 53 3 6 1 9 with his driver has not been seen by your data set.', 'start': 7914.45, 'duration': 8.045}, {'end': 7930.538, 'text': "how are you crossing? because you're making the assumption that while i have not seen him i've seen many others like him.", 'start': 7923.455, 'duration': 7.083}], 'summary': 'Using past data, make assumptions to cross road safely.', 'duration': 24.053, 'max_score': 7906.485, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE7906485.jpg'}, {'end': 8065.235, 'src': 'embed', 'start': 8034.502, 'weight': 3, 'content': [{'end': 8040.286, 'text': 'and you could say that this is an estimate of the mean of the distribution.', 'start': 8034.502, 'duration': 5.784}, {'end': 8047.432, 'text': 'this is not this is the mean of the data, but you are not interested in the mean of the data.', 'start': 8043.209, 'duration': 4.223}, {'end': 8053.677, 'text': "why you're not interested in the mean of the data because you're not interested in this particular set of 180 people.", 'start': 8048.253, 'duration': 5.424}, {'end': 8057.6, 'text': 'but you are interested in the average age of my customers.', 'start': 8055.218, 'duration': 2.382}, {'end': 8065.235, 'text': 'so now the question becomes what does the age of my new customers have to do with the number 28?', 'start': 8058.629, 'duration': 6.606}], 'summary': "Estimate the average age of new customers based on 180 people's data.", 'duration': 30.733, 'max_score': 8034.502, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE8034502.jpg'}, {'end': 8216.049, 'src': 'embed', 'start': 8161.53, 'weight': 4, 'content': [{'end': 8166.893, 'text': 'if the data is standard deviation, if the data is very variable, this plus minus will be large and the nature of the data.', 'start': 8161.53, 'duration': 5.363}, {'end': 8168.674, 'text': 'same here.', 'start': 8168.094, 'duration': 0.58}, {'end': 8175.973, 'text': "so make sure that you depend upon how many things i'm averaging over if this was 180.", 'start': 8168.694, 'duration': 7.279}, {'end': 8180.777, 'text': "i'm so sure if this was 18, 000, i'll be even more sure.", 'start': 8175.973, 'duration': 4.804}, {'end': 8184.781, 'text': "if this was 18, i'd be less sure.", 'start': 8182.179, 'duration': 2.602}, {'end': 8189.105, 'text': 'so it depends on how much data is being averaged over the more data.', 'start': 8186.162, 'duration': 2.943}, {'end': 8193.208, 'text': 'i have the surer i am about the repeatability of it.', 'start': 8189.125, 'duration': 4.083}, {'end': 8195.75, 'text': 'the surer i am that i will see something similar again.', 'start': 8193.227, 'duration': 2.523}, {'end': 8201.035, 'text': 'it depends upon how sure i want to be.', 'start': 8198.692, 'duration': 2.343}, {'end': 8212.788, 'text': 'if i want to be 95% sure, if i want to be 99% sure, if i want to be 99 point, the more sure i want to be, the bigger the tolerance i must have on my,', 'start': 8201.035, 'duration': 11.753}, {'end': 8213.909, 'text': 'on my interval.', 'start': 8212.788, 'duration': 1.121}, {'end': 8216.049, 'text': "and those are things we'll get to.", 'start': 8213.909, 'duration': 2.14}], 'summary': 'Standard deviation affects tolerance, more data increases repeatability and certainty.', 'duration': 54.519, 'max_score': 8161.53, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE8161530.jpg'}, {'end': 8403.841, 'src': 'embed', 'start': 8377.013, 'weight': 7, 'content': [{'end': 8382.956, 'text': 'and then tell me what happens, but you need to do this often because i expect your blood sugar to be highly variable.', 'start': 8377.013, 'duration': 5.943}, {'end': 8389.217, 'text': 'simply because your body is being put through a enormous amount of variation.', 'start': 8385.536, 'duration': 3.681}, {'end': 8394.759, 'text': "in a business situation, let's suppose that you've introduced a new product.", 'start': 8391.618, 'duration': 3.141}, {'end': 8398.98, 'text': 'you do not know if this new product is going to sell or not.', 'start': 8395.859, 'duration': 3.121}, {'end': 8403.841, 'text': "what will you what will you do? i mean, how will you measure it? you've just introduced it.", 'start': 8400.28, 'duration': 3.561}], 'summary': 'Frequent blood sugar monitoring expected due to high variability, likened to introducing a new product in business.', 'duration': 26.828, 'max_score': 8377.013, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE8377013.jpg'}, {'end': 8470.518, 'src': 'embed', 'start': 8442.164, 'weight': 1, 'content': [{'end': 8445.505, 'text': 'my point is not that there are many things to look at, which you should.', 'start': 8442.164, 'duration': 3.341}, {'end': 8448.927, 'text': 'my point is that when there is a change in the distribution, when there is,', 'start': 8445.505, 'duration': 3.422}, {'end': 8455.411, 'text': 'when there is an unknown distribution coming in front of you whose variation you do not know, you tend to get more data.', 'start': 8448.927, 'duration': 6.484}, {'end': 8460.072, 'text': 'you sample more frequent here.', 'start': 8458.851, 'duration': 1.221}, {'end': 8460.892, 'text': 'you get more data.', 'start': 8460.092, 'duration': 0.8}, {'end': 8463.234, 'text': 'you figure this out.', 'start': 8461.793, 'duration': 1.441}, {'end': 8465.675, 'text': 'we do this all the time.', 'start': 8464.734, 'duration': 0.941}, {'end': 8468.437, 'text': "for example, let's suppose for those of you who have kids.", 'start': 8465.715, 'duration': 2.722}, {'end': 8470.518, 'text': "let's say for your kid is going to a new school.", 'start': 8468.817, 'duration': 1.701}], 'summary': 'Adapting to unknown distribution leads to increased data sampling and understanding, a common practice in various scenarios.', 'duration': 28.354, 'max_score': 8442.164, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE8442164.jpg'}, {'end': 8600.427, 'src': 'embed', 'start': 8569.425, 'weight': 9, 'content': [{'end': 8571.986, 'text': 'more of what solve another problem for me.', 'start': 8569.425, 'duration': 2.561}, {'end': 8575.542, 'text': 'give me new customers that i can go after and things like that.', 'start': 8573.001, 'duration': 2.541}, {'end': 8586.927, 'text': "so. therefore, that problem is a problem that statistician big data people often and it's not an easy problem to go after that,", 'start': 8576.362, 'duration': 10.565}, {'end': 8588.828, 'text': 'as you have more and more data coming in.', 'start': 8586.927, 'duration': 1.901}, {'end': 8590.668, 'text': 'how do you utilize it??', 'start': 8589.528, 'duration': 1.14}, {'end': 8595.35, 'text': 'how do you, how do you, how do you make efficient use of this information?', 'start': 8591.149, 'duration': 4.201}, {'end': 8600.427, 'text': "do you get tighter estimates of what you're going after?", 'start': 8595.991, 'duration': 4.436}], 'summary': 'Utilize big data to acquire new customers efficiently and improve estimation accuracy.', 'duration': 31.002, 'max_score': 8569.425, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE8569425.jpg'}], 'start': 7756.108, 'title': 'Analyzing data distribution and decision making', 'summary': 'Discusses inferring distributions from limited data, translating logic into algorithms, understanding data variation, and the significance of descriptive analytics, emphasizing the importance of estimating population mean, measuring variability in blood sugar and product sales, and the challenges of utilizing big data efficiently.', 'chapters': [{'end': 8005.643, 'start': 7756.108, 'title': 'Understanding distribution and decision making', 'summary': 'Discusses the challenge of inferring distributions from limited data to make decisions, illustrating with examples from salary negotiation, business strategies, and road crossing based on past experiences and assumptions.', 'duration': 249.535, 'highlights': ['The challenge of inferring distributions from limited data to make decisions is a critical aspect of statistics, as it involves drawing conclusions outside the available data, impacting various scenarios such as risk strategy in banking, sales analysis in retail, and salary negotiation.', 'The concept of using past experiences and assumptions to make decisions, as illustrated by the example of crossing the road based on the assumption that unseen cars would behave similarly to those encountered in the past, demonstrates the practical application of statistical thinking in everyday decision making.', 'The anecdote about the taxi driver and the logic behind his decision-making process, based on the assumption that others would behave similarly to those he had encountered before, serves as a relatable example of how people often make decisions based on past experiences and assumptions, similar to statistical reasoning.']}, {'end': 8216.049, 'start': 8006.164, 'title': "Analytics professional's objective", 'summary': 'Discusses the challenge of translating logic into algorithms for analytics professionals, highlighting the importance of estimating population mean and the impact of data variation on certainty levels.', 'duration': 209.885, 'highlights': ['Estimating population mean is crucial for analytics professionals, as it determines the average age of customers and the level of certainty in predictions.', 'Data variation, measured by standard deviation, directly impacts the level of certainty in predictions, with larger variations leading to wider tolerance intervals.', 'The amount of data being averaged over influences the level of certainty in predictions, with larger datasets providing greater confidence in the repeatability of results.', 'The chapter emphasizes the need to factor in the level of certainty desired when determining tolerance intervals, with higher certainty requiring larger intervals.']}, {'end': 8764.47, 'start': 8216.049, 'title': 'Descriptive analytics and data variation', 'summary': 'Discusses the importance of descriptive analytics in understanding data variation and the need for more data when facing unknown distributions, emphasizing the significance of measuring variability in blood sugar and product sales, as well as the challenges of utilizing big data efficiently.', 'duration': 548.421, 'highlights': ['The significance of measuring variability in blood sugar and product sales', 'Importance of more data when facing unknown distributions', 'Challenges of utilizing big data efficiently']}], 'duration': 1008.362, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE7756108.jpg', 'highlights': ['The challenge of inferring distributions from limited data impacts risk strategy in banking, sales analysis in retail, and salary negotiation.', 'The concept of using past experiences and assumptions to make decisions demonstrates the practical application of statistical thinking in everyday decision making.', 'The anecdote about the taxi driver serves as a relatable example of how people often make decisions based on past experiences and assumptions, similar to statistical reasoning.', 'Estimating population mean determines the average age of customers and the level of certainty in predictions.', 'Data variation, measured by standard deviation, directly impacts the level of certainty in predictions.', 'The amount of data being averaged over influences the level of certainty in predictions, with larger datasets providing greater confidence in the repeatability of results.', 'The chapter emphasizes the need to factor in the level of certainty desired when determining tolerance intervals, with higher certainty requiring larger intervals.', 'The significance of measuring variability in blood sugar and product sales.', 'Importance of more data when facing unknown distributions.', 'Challenges of utilizing big data efficiently.']}, {'end': 11548.88, 'segs': [{'end': 8854.65, 'src': 'embed', 'start': 8824.58, 'weight': 2, 'content': [{'end': 8827.743, 'text': "it says that i'm going to assume that the data has a normal distribution is an assumption.", 'start': 8824.58, 'duration': 3.163}, {'end': 8830.966, 'text': 'now, why do statisticians make assumptions like that?', 'start': 8828.964, 'duration': 2.002}, {'end': 8835.369, 'text': 'one reason they make assumptions like that is because they make it the calculation becomes easier.', 'start': 8831.506, 'duration': 3.863}, {'end': 8839.373, 'text': "now just because the calculation becomes easier doesn't mean the calculation is correct.", 'start': 8836.45, 'duration': 2.923}, {'end': 8841.695, 'text': 'because of the assumption is wrong.', 'start': 8840.654, 'duration': 1.041}, {'end': 8843.016, 'text': 'the calculation is also going to be wrong.', 'start': 8841.715, 'duration': 1.301}, {'end': 8850.609, 'text': "but because of the assumption, you can do many of these calculations, and if you don't make those assumptions,", 'start': 8844.687, 'duration': 5.922}, {'end': 8854.65, 'text': 'these calculations now become difficult or even impossible, given the data at hand.', 'start': 8850.609, 'duration': 4.041}], 'summary': "Statisticians assume normal distribution for easier calculations, but it doesn't guarantee correctness.", 'duration': 30.07, 'max_score': 8824.58, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE8824580.jpg'}, {'end': 8957.255, 'src': 'embed', 'start': 8931.53, 'weight': 5, 'content': [{'end': 8938.776, 'text': 'every industry has its own favorite distribution because every industry has its own generic data form.', 'start': 8931.53, 'duration': 7.246}, {'end': 8945.021, 'text': 'now even within the industry a particular data set could violate that rule.', 'start': 8941.658, 'duration': 3.363}, {'end': 8954.028, 'text': 'and then it becomes interesting that, as a statistician, do you now use a higher power tool set, a more powerful tool set, to solve that?', 'start': 8946.262, 'duration': 7.766}, {'end': 8957.255, 'text': 'this leads to certain complexities.', 'start': 8955.834, 'duration': 1.421}], 'summary': 'Industries have specific data distributions, requiring powerful tools for complex cases.', 'duration': 25.725, 'max_score': 8931.53, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE8931530.jpg'}, {'end': 9161.488, 'src': 'embed', 'start': 9124.78, 'weight': 3, 'content': [{'end': 9127.482, 'text': 'and the probabilistic part comes from statistical thinking.', 'start': 9124.78, 'duration': 2.702}, {'end': 9131.405, 'text': "the approximately part comes from machine learning thinking and it's a.", 'start': 9127.482, 'duration': 3.923}, {'end': 9132.506, 'text': "it's a, it's a deep field.", 'start': 9131.405, 'duration': 1.101}, {'end': 9138.27, 'text': "it's a serious field, but it puts a probabilistic statement or an approximation.", 'start': 9132.526, 'duration': 5.744}, {'end': 9145.435, 'text': 'so therefore at the end of the day, whatever method you use there has to be a sense of how generalizable it is.', 'start': 9139.111, 'duration': 6.324}, {'end': 9147.477, 'text': 'you will do that.', 'start': 9146.916, 'duration': 0.561}, {'end': 9153.182, 'text': "you'll do that fairly soon in a couple of months that you'll do your first hackathon.", 'start': 9148.939, 'duration': 4.243}, {'end': 9156.245, 'text': 'in your first hackathon.', 'start': 9154.543, 'duration': 1.702}, {'end': 9158.006, 'text': 'all your hackathons will have a certain feel to them.', 'start': 9156.245, 'duration': 1.761}, {'end': 9161.488, 'text': "a common feeling for a hackathon is i'm going to give you a data set,", 'start': 9158.006, 'duration': 3.482}], 'summary': 'Probabilistic and approximate methods in machine learning have to be generalizable, as emphasized in the discussion about hackathons.', 'duration': 36.708, 'max_score': 9124.78, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE9124780.jpg'}, {'end': 9305.841, 'src': 'embed', 'start': 9262.082, 'weight': 6, 'content': [{'end': 9263.223, 'text': 'it exists, but it is unknown.', 'start': 9262.082, 'duration': 1.141}, {'end': 9266.885, 'text': 'now, if the distribution is nice and symmetric like this,', 'start': 9263.883, 'duration': 3.002}, {'end': 9272.789, 'text': 'then this unknown thing in the middle can be estimated using a mean or it can be estimated using a medium.', 'start': 9266.885, 'duration': 5.904}, {'end': 9276.872, 'text': 'now the question becomes which is better?', 'start': 9273.51, 'duration': 3.362}, {'end': 9284.397, 'text': 'and the answer to that, roughly speaking, is this that if there are many outliers, if this distribution tends to sort of spread out to the tails,', 'start': 9276.872, 'duration': 7.525}, {'end': 9285.137, 'text': 'then use a medium.', 'start': 9284.397, 'duration': 0.74}, {'end': 9290.801, 'text': 'because of the reason that i said the median becomes stable to outliers.', 'start': 9287.018, 'duration': 3.783}, {'end': 9299.097, 'text': 'if this distribution has the more bell-shaped curve of this particular kind, the mean is more efficient at this.', 'start': 9292.794, 'duration': 6.303}, {'end': 9305.841, 'text': 'a better answer is what if the destination distribution is not that, but it is like this, then the median may be here.', 'start': 9299.097, 'duration': 6.744}], 'summary': 'The choice between mean and median depends on the distribution shape and presence of outliers.', 'duration': 43.759, 'max_score': 9262.082, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE9262082.jpg'}, {'end': 9572.815, 'src': 'embed', 'start': 9519.515, 'weight': 12, 'content': [{'end': 9525.297, 'text': 'some kind of heavy tail distribution and network traffic is a is an example of a typical example of heavy tail.', 'start': 9519.515, 'duration': 5.782}, {'end': 9533.381, 'text': 'so now here is what happens people in this particular case the mean and the median are carrying very different kinds of information.', 'start': 9526.158, 'duration': 7.223}, {'end': 9540.884, 'text': 'the medium is essentially saying that for a typical website that you go to.', 'start': 9535.522, 'duration': 5.362}, {'end': 9543.665, 'text': 'how much time do you spend on a typical website??', 'start': 9540.884, 'duration': 2.781}, {'end': 9551.199, 'text': 'now, if that number is low, that is an indication that most of the time,', 'start': 9546.015, 'duration': 5.184}, {'end': 9556.203, 'text': 'you are, shall we say, cruising or browsing?', 'start': 9552.74, 'duration': 3.463}, {'end': 9560.366, 'text': 'on the other hand, if you,', 'start': 9559.345, 'duration': 1.021}, {'end': 9568.211, 'text': "if you're looking at the mean and that number is high then you know that you're spending a lot of time on certain very specific websites.", 'start': 9560.366, 'duration': 7.845}, {'end': 9572.815, 'text': 'and this points to two very different kinds of people.', 'start': 9570.633, 'duration': 2.182}], 'summary': 'Heavy tail distribution in network traffic reveals differing user behaviors based on mean and median times spent on websites.', 'duration': 53.3, 'max_score': 9519.515, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE9519515.jpg'}, {'end': 9737.439, 'src': 'embed', 'start': 9709.834, 'weight': 13, 'content': [{'end': 9713.335, 'text': 'the idea of a numeric is that it is sort of continuous is not chunked that way.', 'start': 9709.834, 'duration': 3.501}, {'end': 9720.928, 'text': 'so for data that is chunked up or categorical, you can easily calculate the mode for something that is not,', 'start': 9714.724, 'duration': 6.204}, {'end': 9727.333, 'text': "and so the mode therefore has become less fashionable, because it's not a very easy thing to go after when we were in college.", 'start': 9720.928, 'duration': 6.405}, {'end': 9729.374, 'text': 'the mode was something actually quite easy to calculate.', 'start': 9727.373, 'duration': 2.001}, {'end': 9730.675, 'text': "here's the way we would calculate the mode.", 'start': 9729.394, 'duration': 1.281}, {'end': 9732.096, 'text': 'here is the histogram.', 'start': 9731.215, 'duration': 0.881}, {'end': 9737.439, 'text': 'and the way we would calculate the mode is this we draw a line from here to here.', 'start': 9734.437, 'duration': 3.002}], 'summary': 'The mode has become less fashionable for continuous data due to its difficulty in calculation.', 'duration': 27.605, 'max_score': 9709.834, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE9709834.jpg'}, {'end': 9826.96, 'src': 'embed', 'start': 9782.358, 'weight': 8, 'content': [{'end': 9787.422, 'text': 'therefore, his estimation of mode and his estimation of mode will be different from the same data set.', 'start': 9782.358, 'duration': 5.064}, {'end': 9789.104, 'text': 'that is not going to for the mean or for the median.', 'start': 9787.422, 'duration': 1.682}, {'end': 9796.93, 'text': 'and as soon as two different people find the same answer to the same different answers to the same question.', 'start': 9791.666, 'duration': 5.264}, {'end': 9798.352, 'text': "you know there's a problem with the statistic.", 'start': 9796.93, 'duration': 1.422}, {'end': 9806.36, 'text': "so therefore this is so the mode isn't done as much these days.", 'start': 9800.553, 'duration': 5.807}, {'end': 9809.983, 'text': 'these are the histograms sort of my data histogram.', 'start': 9807.401, 'duration': 2.582}, {'end': 9812.225, 'text': "it's this is a way of separating out the histograms.", 'start': 9810.103, 'duration': 2.122}, {'end': 9819.913, 'text': 'in other words, looking at the histograms by different column, equal to income essentially means that which variable the by says which gender.', 'start': 9812.966, 'duration': 6.947}, {'end': 9825.158, 'text': "so it's, and they go side by side because they essentially tell you as to what the difference in the distributions is.", 'start': 9819.913, 'duration': 5.245}, {'end': 9826.96, 'text': 'so what does this tell you?', 'start': 9825.999, 'duration': 0.961}], 'summary': 'Different estimations of mode, mean, and median can indicate problems with statistics; histograms help visualize data distributions.', 'duration': 44.602, 'max_score': 9782.358, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE9782358.jpg'}, {'end': 9973.041, 'src': 'embed', 'start': 9946.029, 'weight': 4, 'content': [{'end': 9953.751, 'text': 'there is a relationship between the two variables that i want to want to be able to capture a sense of relation or a sense of correlation.', 'start': 9946.029, 'duration': 7.722}, {'end': 9959.253, 'text': 'that how do i measure whether one variable is related to the other variable or not?', 'start': 9953.751, 'duration': 5.502}, {'end': 9960.514, 'text': "remember, i'm still describing.", 'start': 9959.253, 'duration': 1.261}, {'end': 9966.515, 'text': "i'm still trying to find a number like a mean, like a standard deviation.", 'start': 9962.332, 'duration': 4.183}, {'end': 9973.041, 'text': "i'm trying to simply describe a number if that number is this correlation is high if that number is this correlation is low.", 'start': 9966.836, 'duration': 6.205}], 'summary': 'Describing the relationship between variables using statistical measures like mean and standard deviation to determine correlation.', 'duration': 27.012, 'max_score': 9946.029, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE9946029.jpg'}, {'end': 10481.551, 'src': 'embed', 'start': 10456.157, 'weight': 0, 'content': [{'end': 10465.616, 'text': "so the covariance is a measure of the nature of the relationship between x and y if the covariance is positive they're moving in the same direction.", 'start': 10456.157, 'duration': 9.459}, {'end': 10469.86, 'text': "if the covariance is negative they're moving in opposite directions.", 'start': 10466.617, 'duration': 3.243}, {'end': 10471.882, 'text': 'if the covariance is zero.', 'start': 10470.421, 'duration': 1.461}, {'end': 10477.867, 'text': 'then many things can happen either the data looks like this.', 'start': 10474.444, 'duration': 3.423}, {'end': 10481.551, 'text': "there's no relation or maybe the data looks like this.", 'start': 10478.408, 'duration': 3.143}], 'summary': 'Covariance measures relationship between x and y, + is same direction, - is opposite direction, 0 is no relation.', 'duration': 25.394, 'max_score': 10456.157, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE10456157.jpg'}, {'end': 10576.956, 'src': 'embed', 'start': 10552.052, 'weight': 10, 'content': [{'end': 10558.417, 'text': 'there was a model that we were trying to for some reason trying to find out the relationship between or trying to understand where people stay.', 'start': 10552.052, 'duration': 6.365}, {'end': 10564.682, 'text': 'do they stay close to the office or do they stay far away from office?', 'start': 10560.599, 'duration': 4.083}, {'end': 10571.708, 'text': 'and now, what do you think is relationship between, say, experience? and distance to home?', 'start': 10565.783, 'duration': 5.925}, {'end': 10573.913, 'text': 'Tenor or experience.', 'start': 10572.851, 'duration': 1.062}, {'end': 10576.956, 'text': 'Okay We had normalized for that.', 'start': 10573.933, 'duration': 3.023}], 'summary': 'Model aimed to understand the relationship between experience and distance to home.', 'duration': 24.904, 'max_score': 10552.052, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE10552052.jpg'}, {'end': 10900.983, 'src': 'embed', 'start': 10871.36, 'weight': 1, 'content': [{'end': 10873.14, 'text': 'it is not just a measure of the relationship.', 'start': 10871.36, 'duration': 1.78}, {'end': 10879.139, 'text': 'it is a measure of what i would say the linear relationship between x and y.', 'start': 10873.2, 'duration': 5.939}, {'end': 10885.54, 'text': 'a nonlinear relationship or a strange relationship could cancel out positive and negative and end up with zero or a low number.', 'start': 10879.139, 'duration': 6.401}, {'end': 10893.862, 'text': 'so if the correlation is close to plus one, there is a strong positive relationship between the two.', 'start': 10887.901, 'duration': 5.961}, {'end': 10895.682, 'text': 'strong positive relation means what?', 'start': 10893.862, 'duration': 1.82}, {'end': 10900.983, 'text': 'if one of the variables is above average, then the other is also very likely to be above average.', 'start': 10895.682, 'duration': 5.301}], 'summary': 'Correlation close to plus one indicates a strong positive relationship between variables, with above-average values likely to co-occur.', 'duration': 29.623, 'max_score': 10871.36, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE10871360.jpg'}, {'end': 11076.043, 'src': 'embed', 'start': 11048.754, 'weight': 17, 'content': [{'end': 11056.596, 'text': 'there is no sense that if x goes up then y goes up because correlation of x and y is the same as the correlation of y and x.', 'start': 11048.754, 'duration': 7.842}, {'end': 11062.088, 'text': 'definition this is a symmetry concept.', 'start': 11058.885, 'duration': 3.203}, {'end': 11064.831, 'text': 'it makes no attempt at causation.', 'start': 11063.089, 'duration': 1.742}, {'end': 11066.913, 'text': "that's a different thing altogether.", 'start': 11065.552, 'duration': 1.361}, {'end': 11071.398, 'text': 'so this is a positive.', 'start': 11070.297, 'duration': 1.101}, {'end': 11073.601, 'text': 'this is these this is a positive relationship.', 'start': 11071.438, 'duration': 2.163}, {'end': 11076.043, 'text': "it's a weakly positive relationship.", 'start': 11074.161, 'duration': 1.882}], 'summary': 'Correlation between x and y is symmetric, indicating a weak positive relationship, devoid of causation.', 'duration': 27.289, 'max_score': 11048.754, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE11048754.jpg'}], 'start': 8764.47, 'title': 'Statistical analysis in data', 'summary': 'Covers statistical assumptions, model validity, distributions, pack learning, mean, median, mode, bivariate analysis, covariance, correlation, and their real-world applications in various industries.', 'chapters': [{'end': 9305.841, 'start': 8764.47, 'title': 'Statistical assumptions and model validity', 'summary': 'Discusses the importance of statistical assumptions in making calculations easier, the need for different distributions in various industries, and the concept of pack learning, emphasizing the significance of model generalizability in data analysis.', 'duration': 541.371, 'highlights': ['Statistical assumptions are made to simplify calculations, though incorrect assumptions can lead to flawed results, impacting model validity.', 'Different industries favor specific distributions due to the unique nature of their data, with the need for adapting tools to handle non-standard data sets leading to increased complexity.', 'The concept of pack learning, which combines statistical and machine learning thinking, emphasizes the importance of probabilistic statements and approximations in data analysis.', 'The importance of model generalizability is highlighted, especially in the context of hackathons, where success is not solely based on performance on a given data set but also on unseen data.', 'The influence of statistical distribution on the choice between mean and median for estimation, with the impact of outliers on stability and efficiency, is explained.']}, {'end': 9782.358, 'start': 9307.462, 'title': 'Mean, median, and mode in data analysis', 'summary': 'Discusses the concepts of mean, median, and mode in data analysis, highlighting the difference between them and their relevance in understanding data patterns and distributions.', 'duration': 474.896, 'highlights': ['The difference between mean and median indicates different browsing habits, with mean suggesting spending more time on specific websites.', 'The concept of mode is discussed, highlighting its difficulty in calculation and its relevance in categorical data analysis.', 'The distribution of network traffic is used as an example of a heavy-tailed distribution, with the majority of sessions being short and a few being longer.']}, {'end': 10321.894, 'start': 9782.358, 'title': 'Statistics: bivariate analysis', 'summary': 'Covers the limitations of mode in statistics, the significance of histograms in comparing distributions, and the introduction of bivariate analysis to measure the relationship between two variables using examples and mathematical explanations.', 'duration': 539.536, 'highlights': ["The mode's limitations in statistics are highlighted, revealing its decreasing significance in current analysis.", 'The significance of histograms in comparing distributions is explained, emphasizing their use in identifying differences in distributions and variables such as gender and product usage.', 'Introduction of bivariate analysis to measure the relationship between two variables is demonstrated using mathematical explanations and examples, emphasizing the need to capture the sense of correlation between variables.']}, {'end': 10706.289, 'start': 10321.894, 'title': 'Understanding covariance and relationship', 'summary': 'Explains the concept of covariance, its application in understanding the nature of the relationship between variables, and provides examples of its usage in finance and employee attrition analysis.', 'duration': 384.395, 'highlights': ['Covariance measures the nature of the relationship between variables, with positive covariance indicating variables moving in the same direction, negative indicating opposite directions, and zero indicating a complex or no relationship.', 'Covariance is heavily used in areas such as dimension reduction, principal components, finance, portfolio management, and employee attrition analysis.', "An example of covariance's practical application is in analyzing the relationship between employee experience, distance to home, and attrition, where a story was built around the observed data."]}, {'end': 11548.88, 'start': 10706.329, 'title': 'Understanding correlation in statistics', 'summary': 'Discusses the concept of correlation in statistics, emphasizing how to interpret the correlation value between two variables and its relevance to real-world relationships, illustrated with examples and mathematical explanations.', 'duration': 842.551, 'highlights': ['The correlation coefficient is a measure of the linear relationship between two variables, with a value between -1 and 1 indicating the strength and nature of the relationship.', 'Normalization of data using standard deviation removes the dependency on units of measurement, allowing for a meaningful comparison of the correlation between variables.', 'The correlation value does not imply causation between the variables and is a symmetric concept, providing insights into the strength of the relationship but not the direction of influence.']}], 'duration': 2784.41, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE8764470.jpg', 'highlights': ['Covariance measures relationship between variables, used in finance and employee attrition.', 'Correlation coefficient measures linear relationship strength between variables.', 'Statistical assumptions impact model validity, emphasizing importance of correct assumptions.', 'Pack learning combines statistical and machine learning thinking, emphasizing probabilistic statements.', 'Bivariate analysis measures relationship between two variables, capturing correlation sense.', 'Different industries favor specific distributions due to unique data nature, increasing complexity.', 'Influence of statistical distribution on choice between mean and median for estimation.', 'Model generalizability importance highlighted, especially in hackathons for unseen data success.', "Mode's limitations in statistics revealed, decreasing significance in current analysis.", 'Histograms used to compare distributions, identify differences in variables such as gender.', 'Covariance practical application in analyzing employee experience, distance to home, and attrition.', 'Normalization of data using standard deviation allows meaningful comparison of correlation.', 'Mean and median differences indicate browsing habits, with mean suggesting more time on specific websites.', "Mode's relevance in categorical data analysis and difficulty in calculation explained.", 'Heavy-tailed distribution example using network traffic, with majority short sessions and few longer.', 'Impact of outliers on stability and efficiency when choosing between mean and median for estimation.', 'Covariance indicating variables moving in same, opposite, or no direction, based on value.', 'Correlation value does not imply causation, provides insights into relationship strength but not direction.']}, {'end': 12829.866, 'segs': [{'end': 11576.417, 'src': 'embed', 'start': 11550, 'weight': 2, 'content': [{'end': 11556.203, 'text': 'so. therefore, this relationship depends on the empirical relationship between height and weight, for the data that is available,', 'start': 11550, 'duration': 6.203}, {'end': 11557.264, 'text': 'which is of humans growing.', 'start': 11556.203, 'duration': 1.061}, {'end': 11562.686, 'text': 'and so empirically people have discovered that this is the object that should be invariant.', 'start': 11558.804, 'duration': 3.882}, {'end': 11567.094, 'text': "this is an example of what's called dimension reduction.", 'start': 11564.994, 'duration': 2.1}, {'end': 11576.417, 'text': 'two variables are being combined into one, which is carrying information for you, but it relies on a nonlinear relationship between the two.', 'start': 11568.735, 'duration': 7.682}], 'summary': 'Empirical relationship between height and weight used for dimension reduction.', 'duration': 26.417, 'max_score': 11550, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE11550000.jpg'}, {'end': 11714.462, 'src': 'embed', 'start': 11687.551, 'weight': 1, 'content': [{'end': 11695.716, 'text': "there are thousands of genes and you look at the expression levels of each of these genes and you say these are the genes that have been expressed and these are the genes that haven't been.", 'start': 11687.551, 'duration': 8.165}, {'end': 11699.917, 'text': 'so if you are doing correlations of thousands of variables or hundreds of variables,', 'start': 11695.716, 'duration': 4.201}, {'end': 11705.499, 'text': 'often a night and nicely arranged set of variables with a heat map gives you a good picture of the data.', 'start': 11699.917, 'duration': 5.582}, {'end': 11714.462, 'text': "so heat map in this form is exactly the same as a correlation, except that it adds colors to the numbers, so that you're not looking at the numbers,", 'start': 11706.559, 'duration': 7.903}], 'summary': "Analyzing thousands of genes' expression levels with heat maps allows for visualizing correlations and patterns in the data.", 'duration': 26.911, 'max_score': 11687.551, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE11687551.jpg'}, {'end': 12023.224, 'src': 'embed', 'start': 11993.803, 'weight': 0, 'content': [{'end': 12003.51, 'text': 'now does that mean that in reality as you might say that there is a relationship between these three things? no correctly.', 'start': 11993.803, 'duration': 9.707}, {'end': 12010.335, 'text': "so receiving precise maybe i don't know not necessarily correct not necessarily.", 'start': 12003.55, 'duration': 6.785}, {'end': 12017.7, 'text': 'so when you do linear regression in the future a regression for that matter.', 'start': 12011.215, 'duration': 6.485}, {'end': 12021.062, 'text': 'there will be three uses of it use one.', 'start': 12018.1, 'duration': 2.962}, {'end': 12023.224, 'text': 'it is simply descriptive.', 'start': 12022.003, 'duration': 1.221}], 'summary': 'No clear relationship between three things in linear regression.', 'duration': 29.421, 'max_score': 11993.803, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE11993803.jpg'}, {'end': 12166.862, 'src': 'embed', 'start': 12140.316, 'weight': 3, 'content': [{'end': 12144.717, 'text': "getting it the way to do it is something that i won't talk about too much.", 'start': 12140.316, 'duration': 4.401}, {'end': 12149.738, 'text': "so there's a, there's.", 'start': 12146.717, 'duration': 3.021}, {'end': 12155.759, 'text': "there's this sk learn, which is, you know, one of the one of the learning modules that they learn in the sense of supervised learning.", 'start': 12149.738, 'duration': 6.021}, {'end': 12163.681, 'text': 'import, linear model, regression, linear model as a function and the slightly irritating big function here called linear model,', 'start': 12155.759, 'duration': 7.922}, {'end': 12165.201, 'text': 'which is inherited from linear model.', 'start': 12163.681, 'duration': 1.52}, {'end': 12166.862, 'text': "you're giving it a y.", 'start': 12166.322, 'duration': 0.54}], 'summary': 'The transcript discusses the use of sklearn for supervised learning.', 'duration': 26.546, 'max_score': 12140.316, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE12140316.jpg'}, {'end': 12513.95, 'src': 'embed', 'start': 12476.561, 'weight': 4, 'content': [{'end': 12477.741, 'text': 'the quality of the model.', 'start': 12476.561, 'duration': 1.18}, {'end': 12485.303, 'text': 'how accurate is my mean? how good is my prediction? these are things that are going to be inference and prediction will come.', 'start': 12478.181, 'duration': 7.122}, {'end': 12489.584, 'text': "we can't answer those questions before we get to probably the middle here sense of language on it.", 'start': 12485.583, 'duration': 4.001}, {'end': 12504.428, 'text': 'yes fitness and usage as huh.', 'start': 12489.604, 'duration': 14.824}, {'end': 12508.185, 'text': "that's true.", 'start': 12507.725, 'duration': 0.46}, {'end': 12513.95, 'text': "so you're saying that it doesn't make sense for certain values, which is true, which may well be.", 'start': 12508.886, 'duration': 5.064}], 'summary': 'Assessing model accuracy and prediction quality is important for inference and prediction.', 'duration': 37.389, 'max_score': 12476.561, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE12476561.jpg'}], 'start': 11550, 'title': 'Linear regression and data analysis', 'summary': 'Covers dimension reduction in height and weight, heat map visualization for correlations, trivariate analysis, linear regression in sklearn, and model inference and interpretation, emphasizing the limitations of correlation analysis and the use of linear regression for descriptive, predictive, and prescriptive purposes.', 'chapters': [{'end': 11617.191, 'start': 11550, 'title': 'Dimension reduction in height and weight', 'summary': 'Discusses the empirical relationship between height and weight, highlighting the concept of dimension reduction and the limitations of correlation analysis in determining causation.', 'duration': 67.191, 'highlights': ['The relationship between height and weight is an example of dimension reduction, where two variables are combined into one carrying information, relying on a nonlinear relationship between the two.', 'Correlation analysis has limitations as it only goes so far in capturing the relationship between variables, and it may not provide analytically useful insights.', 'The chapter emphasizes the importance of testing hypotheses to determine if a correlation is real or spurious, serving as a basis for establishing causation.']}, {'end': 11944.34, 'start': 11617.251, 'title': 'Heat map and predictive models', 'summary': 'Discusses the use of heat maps in visualizing correlations between variables, particularly in scenarios with numerous variables such as product catalog analysis and gene expression levels. it also introduces the concept of linear regression as a predictive tool to estimate usage of an engineered instrument based on fitness variables.', 'duration': 327.089, 'highlights': ['Heat map visualization is useful for analyzing correlations between a large number of variables, such as in product catalog sales and gene expression levels.', 'Introduction of a linear regression model as a predictive tool to estimate the usage of an engineered instrument based on fitness variables.']}, {'end': 12138.142, 'start': 11944.36, 'title': 'Understanding trivariate relationships in data analysis', 'summary': 'Discusses the use of trivariate analysis to describe relationships between three variables, emphasizing that linear regression can be used for descriptive, predictive, and prescriptive purposes in data analysis.', 'duration': 193.782, 'highlights': ['Linear regression can be used for descriptive, predictive, and prescriptive purposes in data analysis, providing insights into the relationships between variables and their impact (e.g., predicting outcomes and identifying behavioral changes).', 'The approach involves analyzing trivariate or multivariate relationships, such as using a three by three correlation matrix to understand the interplay between variables.', 'The chapter emphasizes that linear regression does not imply causation, but rather focuses on describing the nature of relationships and making predictions based on new input values.']}, {'end': 12446.961, 'start': 12140.316, 'title': 'Understanding linear regression in sklearn', 'summary': 'Explains the process of linear regression using sklearn, interpreting regression coefficients, and the descriptive use of linear regression to analyze the relationship between variables, with coefficients of 20 and 27 indicating the impact on miles based on usage and fitness, while the intercept is not treated as a coefficient in the model.', 'duration': 306.645, 'highlights': ['The regression coefficients are 20 and 27, representing the impact on miles when usage and fitness change by one unit.', 'The intercept in the model is -56, indicating the miles run when there is zero usage and zero fitness, although it may not make sense in practical terms.', 'Describes how linear regression in Sklearn looks at the data and estimates parameters to minimize the difference between the predicted and actual values, emphasizing the importance of supervised learning in prediction mode.', 'Explains the positive sign in the regression coefficients, indicating that as fitness or usage increases, miles also increase, providing a descriptive use of linear regression to analyze the relationship between variables.']}, {'end': 12829.866, 'start': 12448.177, 'title': 'Model inference and interpretation', 'summary': 'Discusses the importance of assessing the quality of a model, including making inferences and predictions, and the significance of regression coefficients in determining the relationship between variables and their predictive power.', 'duration': 381.689, 'highlights': ['The importance of assessing the quality of a model, including making inferences and predictions.', 'Significance of regression coefficients in determining the relationship between variables and their predictive power.', 'Challenges of expressing the relationship between multiple variables and the use of arbitrary equations.']}], 'duration': 1279.866, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE11550000.jpg', 'highlights': ['Linear regression can be used for descriptive, predictive, and prescriptive purposes in data analysis, providing insights into the relationships between variables and their impact (e.g., predicting outcomes and identifying behavioral changes).', 'Heat map visualization is useful for analyzing correlations between a large number of variables, such as in product catalog sales and gene expression levels.', 'The relationship between height and weight is an example of dimension reduction, where two variables are combined into one carrying information, relying on a nonlinear relationship between the two.', 'Describes how linear regression in Sklearn looks at the data and estimates parameters to minimize the difference between the predicted and actual values, emphasizing the importance of supervised learning in prediction mode.', 'The importance of assessing the quality of a model, including making inferences and predictions.']}, {'end': 14093.21, 'segs': [{'end': 12943.629, 'src': 'embed', 'start': 12900.688, 'weight': 0, 'content': [{'end': 12903.57, 'text': 'so just like a mean is a way to do analytics.', 'start': 12900.688, 'duration': 2.882}, {'end': 12908.008, 'text': 'correlation is a way to do analytics and deviation is similarly.', 'start': 12905.247, 'duration': 2.761}, {'end': 12927.072, 'text': 'so what we had done yesterday is we had spoken essentially about descriptive statistics,', 'start': 12912.549, 'duration': 14.523}, {'end': 12943.629, 'text': 'and descriptive statistics is the taking of data and to simply describe it, with the later purpose of either visualizing it or rewriting a report,', 'start': 12927.072, 'duration': 16.557}], 'summary': 'Descriptive statistics involves describing and visualizing data for analysis and reporting purposes.', 'duration': 42.941, 'max_score': 12900.688, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE12900688.jpg'}, {'end': 13002.776, 'src': 'embed', 'start': 12977.02, 'weight': 1, 'content': [{'end': 12982.683, 'text': 'professionals requirement to say that if something changes, then what happens?', 'start': 12977.02, 'duration': 5.663}, {'end': 12994.351, 'text': 'i should have made a comment that there are two english language words that mean more or less the same thing one is forecasting and one is prediction in the machine learning world.', 'start': 12984.345, 'duration': 10.006}, {'end': 13000.175, 'text': 'these words are used a little differently forecasting is usually in the context of time.', 'start': 12994.391, 'duration': 5.784}, {'end': 13002.776, 'text': 'so something has happened in the past.', 'start': 13000.915, 'duration': 1.861}], 'summary': 'Professionals need to distinguish between forecasting and prediction in machine learning, with forecasting usually tied to time and past events.', 'duration': 25.756, 'max_score': 12977.02, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE12977020.jpg'}, {'end': 13087.152, 'src': 'embed', 'start': 13053.628, 'weight': 2, 'content': [{'end': 13056.21, 'text': 'we had looked at what is called univariate data.', 'start': 13053.628, 'duration': 2.582}, {'end': 13064.096, 'text': 'univariate means one variable for the univariate distributions.', 'start': 13058.111, 'duration': 5.985}, {'end': 13068.219, 'text': 'we have seen certain kinds of descriptive statistics some of them.', 'start': 13064.196, 'duration': 4.023}, {'end': 13077.589, 'text': 'what about shall we say location? location meant where is the distribution and we had seen for example things like means and medians.', 'start': 13068.299, 'duration': 9.29}, {'end': 13087.152, 'text': 'which talked about where is the distribution located? we talked about things on variation.', 'start': 13081.49, 'duration': 5.662}], 'summary': 'Univariate data analysis covers descriptive statistics, including means and medians for location and variation.', 'duration': 33.524, 'max_score': 13053.628, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE13053628.jpg'}, {'end': 13146.759, 'src': 'embed', 'start': 13122.276, 'weight': 3, 'content': [{'end': 13129.958, 'text': 'these are parameters that are used in order to convey a message to someone saying that what is the data about?', 'start': 13122.276, 'duration': 7.682}, {'end': 13133.339, 'text': 'so, for example, a five-point summary talks about the minimum.', 'start': 13129.958, 'duration': 3.381}, {'end': 13142.834, 'text': 'the 25% point the 50% point the 75% point in the maximum irrespective of the number of data points.', 'start': 13134.364, 'duration': 8.47}, {'end': 13144.576, 'text': 'you could have 10 of them.', 'start': 13143.575, 'duration': 1.001}, {'end': 13145.597, 'text': 'you could have a hundred of them.', 'start': 13144.616, 'duration': 0.981}, {'end': 13146.759, 'text': 'you could have a million of them.', 'start': 13145.637, 'duration': 1.122}], 'summary': 'A five-point summary includes minimum, 25th, 50th, 75th percentile, and maximum, applicable to any number of data points.', 'duration': 24.483, 'max_score': 13122.276, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE13122276.jpg'}, {'end': 13278.582, 'src': 'embed', 'start': 13244.158, 'weight': 4, 'content': [{'end': 13251.661, 'text': 'we are simply describing then we are taken an even brief and perhaps even more confusing.', 'start': 13244.158, 'duration': 7.503}, {'end': 13258.164, 'text': 'look at multivariate our first multivariate summary where we looked at the idea for linear regression.', 'start': 13253.242, 'duration': 4.922}, {'end': 13267.278, 'text': 'a linear regression is an equation of the form.', 'start': 13264.417, 'duration': 2.861}, {'end': 13278.582, 'text': 'y is equal to say beta naught plus beta 1 x 1 plus beta p x p, where one variable is written as an equation of the others.', 'start': 13267.278, 'duration': 11.304}], 'summary': 'Describing multivariate summary for linear regression equation.', 'duration': 34.424, 'max_score': 13244.158, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE13244158.jpg'}, {'end': 13373.717, 'src': 'embed', 'start': 13339.012, 'weight': 5, 'content': [{'end': 13340.072, 'text': 'we looked at box plots.', 'start': 13339.012, 'duration': 1.06}, {'end': 13343.473, 'text': 'sort of pairs.', 'start': 13342.693, 'duration': 0.78}, {'end': 13347.695, 'text': 'which were essentially scattered what are called scattered plots.', 'start': 13345.454, 'duration': 2.241}, {'end': 13353.082, 'text': 'so these are for the human eye.', 'start': 13350.66, 'duration': 2.422}, {'end': 13356.945, 'text': 'these are things for the human eye to see data.', 'start': 13354.363, 'duration': 2.582}, {'end': 13361.648, 'text': 'and they have the limitations because we can only see data in a certain way.', 'start': 13358.366, 'duration': 3.282}, {'end': 13365.271, 'text': "we can't see very high dimensional data visually.", 'start': 13362.969, 'duration': 2.302}, {'end': 13366.792, 'text': 'we can see up to three dimensions.', 'start': 13365.311, 'duration': 1.481}, {'end': 13373.717, 'text': 'maybe for those of you who are interested about such things or any of you are in the graphics world etc.', 'start': 13366.832, 'duration': 6.885}], 'summary': 'Explored box plots and scatter plots for visualizing data, with limitations on human perception.', 'duration': 34.705, 'max_score': 13339.012, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE13339012.jpg'}, {'end': 14063.249, 'src': 'embed', 'start': 14030.589, 'weight': 6, 'content': [{'end': 14035.532, 'text': 'to do that you need typically to do linear algebra and in courses such as this and in machine learning books.', 'start': 14030.589, 'duration': 4.943}, {'end': 14037.694, 'text': 'you will see at the beginning of the book.', 'start': 14036.233, 'duration': 1.461}, {'end': 14046.56, 'text': 'you will often find chapters and optimization and linear algebra because of this or something similar to this that you represent a problem.', 'start': 14037.734, 'duration': 8.826}, {'end': 14052.724, 'text': 'you often need a matrix representation and to get a good learned solution.', 'start': 14047.08, 'duration': 5.644}, {'end': 14054.325, 'text': 'you need an optimization.', 'start': 14053.264, 'duration': 1.061}, {'end': 14063.249, 'text': "so most machine learning algorithms are built that way because for its, as i was saying yesterday, that you're going to tell someone to do something,", 'start': 14055.865, 'duration': 7.384}], 'summary': 'Machine learning algorithms rely on linear algebra and optimization for matrix representation and learning solutions.', 'duration': 32.66, 'max_score': 14030.589, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE14030589.jpg'}], 'start': 12829.866, 'title': 'Statistics and predictive analytics', 'summary': 'Discusses descriptive statistics, bivariate data, multivariate summary, and linear regression techniques, emphasizing the use of visualizations and optimization in machine learning algorithms.', 'chapters': [{'end': 13053.568, 'start': 12829.866, 'title': 'Descriptive statistics and predictive analytics', 'summary': 'Discusses the use of descriptive statistics to summarize multiple variables, the distinction between descriptive and predictive statistics, and the different contexts of prediction and forecasting in the machine learning world.', 'duration': 223.702, 'highlights': ['The chapter discusses the use of descriptive statistics to summarize multiple variables, emphasizing the purpose of describing data for visualization, reporting, and later applications.', 'The distinction between prediction and forecasting is explained, with prediction being described as a task often associated with machine learning and data mining, while forecasting is usually in the context of time.', "The chapter highlights that predictive analytics doesn't necessarily forecast anything, despite the fact that prediction itself is more forecasting, clarifying the slight differences in usage of the words 'prediction' and 'forecasting' in the machine learning context.", "The discussion includes the analogy of 'price' and 'worth' to illustrate how the words 'prediction' and 'forecasting' are used in slightly different contexts, providing a clear understanding of their distinctions.", 'The purpose of descriptive statistics in summarizing data for visualization, reporting, and later applications is emphasized, setting the stage for further exploration of visualization techniques and predictive analytics in later sessions.']}, {'end': 13244.098, 'start': 13053.628, 'title': 'Descriptive statistics and bivariate data', 'summary': 'Covered univariate distributions, including measures of location and spread such as means, medians, standard deviation, quartiles, and the five-point summary, before briefly discussing bivariate data and the concepts of covariance and correlation.', 'duration': 190.47, 'highlights': ['The chapter emphasized the importance of univariate distributions and measures of location and spread, such as means, medians, standard deviation, quartiles, and the five-point summary, in conveying information about the data.', 'It discussed the significance of the five-point summary, which includes the minimum, 25th percentile, median, 75th percentile, and maximum, in conveying key information about the data distribution.', 'The chapter briefly introduced the concept of bivariate data, covering covariance and correlation and highlighting the relationship between variables, including the distinction between positive and negative correlations.']}, {'end': 13415.831, 'start': 13244.158, 'title': 'Multivariate summary and visualization techniques', 'summary': 'Covers the concept of multivariate summary and visualization techniques, focusing on linear regression and various plot types to represent and understand high-dimensional data, emphasizing the limitations of human visual perception in processing data.', 'duration': 171.673, 'highlights': ['Linear regression is an equation of the form y = β0 + β1x1 + βpxp, used to describe the relationship between variables and for prediction.', 'Visualization techniques like histograms, box plots, and scatter plots are used to represent high-dimensional data for human eye perception.', 'Human visual perception has limitations in processing high-dimensional data, as it can only perceive up to three dimensions.']}, {'end': 14093.21, 'start': 13417.292, 'title': 'Linear regression and optimization', 'summary': "Explains the concept of linear regression, highlighting the process of minimizing the sum of squared distances to obtain the best values for 'beta naught' and 'beta 1'. it also delves into the mathematical optimization involved, emphasizing its significance in machine learning algorithms and real-world applications.", 'duration': 675.918, 'highlights': ["The process of linear regression involves minimizing the sum of squared distances to obtain the best values for 'beta naught' and 'beta 1', representing the closest fit to the data.", "The optimization process in linear regression involves differentiating the equation with respect to 'beta naught' and 'beta 1', solving the resulting equations, and finding the unique solution, which is crucial in machine learning algorithms and real-world applications.", 'Machine learning algorithms and real-world applications often require linear algebra and optimization techniques, which are essential for obtaining optimal solutions and representing problems in a matrix form.']}], 'duration': 1263.344, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE12829866.jpg', 'highlights': ['Descriptive statistics summarize variables for visualization and reporting.', 'Prediction in machine learning differs from forecasting in the context of time.', 'Univariate distributions and measures like means and standard deviation are important.', 'The five-point summary conveys key information about data distribution.', 'Linear regression equation y = β0 + β1x1 + βpxp describes variable relationships.', 'Visualization techniques like histograms and scatter plots aid in data representation.', 'Optimization in linear regression is crucial for obtaining optimal solutions.']}, {'end': 15792.821, 'segs': [{'end': 14284.155, 'src': 'embed', 'start': 14207.603, 'weight': 1, 'content': [{'end': 14209.344, 'text': 'this has to work for thousands of words.', 'start': 14207.603, 'duration': 1.741}, {'end': 14213.387, 'text': 'so i must be close to thousands of words at the same time.', 'start': 14210.845, 'duration': 2.542}, {'end': 14220.272, 'text': 'therefore i need to measure the distance from my prediction and my actuality over many many data points.', 'start': 14214.848, 'duration': 5.424}, {'end': 14226.736, 'text': 'so all these algorithms what they do is they take your prediction and they compare with the actuality.', 'start': 14222.873, 'duration': 3.863}, {'end': 14233.803, 'text': 'and they find a distance between them and they minimize the totality of the distance between the prediction, the actual,', 'start': 14227.54, 'duration': 6.263}, {'end': 14236.604, 'text': 'and algorithm that minimizes that distance is a good algorithm.', 'start': 14233.803, 'duration': 2.801}, {'end': 14237.665, 'text': 'it has learned well.', 'start': 14237.024, 'duration': 0.641}, {'end': 14250.61, 'text': 'so they all do something like this with this is the prediction and this is the actual and a and b are the parameters in the in the prediction.', 'start': 14240.266, 'duration': 10.344}, {'end': 14256.433, 'text': 'it was find a prediction such that it is closest to the actual.', 'start': 14251.551, 'duration': 4.882}, {'end': 14265.112, 'text': 'so this algorithm has become very popular is probably the single most popular fitting algorithm out there.', 'start': 14258.447, 'duration': 6.665}, {'end': 14266.673, 'text': 'this is called least squares.', 'start': 14265.632, 'duration': 1.041}, {'end': 14268.914, 'text': 'least squares.', 'start': 14268.174, 'duration': 0.74}, {'end': 14272.897, 'text': 'this is called least squares.', 'start': 14268.934, 'duration': 3.963}, {'end': 14284.155, 'text': "squares. here's a square least because you are minimizing this.", 'start': 14279.392, 'duration': 4.763}], 'summary': 'Algorithms minimize distance between prediction and actuality over many data points, popularly using least squares method.', 'duration': 76.552, 'max_score': 14207.603, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE14207603.jpg'}, {'end': 14420.826, 'src': 'embed', 'start': 14369.329, 'weight': 0, 'content': [{'end': 14372.99, 'text': 'it will not have seen that data before, but it will need to know what to do.', 'start': 14369.329, 'duration': 3.661}, {'end': 14378.873, 'text': 'yes so what do you do? so what you do is you train the algorithm.', 'start': 14376.052, 'duration': 2.821}, {'end': 14386.994, 'text': 'what is training the algorithm mean training the algorithm means you give it data for which the car is told what to do.', 'start': 14380.131, 'duration': 6.863}, {'end': 14389.455, 'text': 'in other words, you give it what they call ground truth.', 'start': 14387.614, 'duration': 1.841}, {'end': 14394.297, 'text': 'so you give it the y and you say here is the data or here is the situation.', 'start': 14389.815, 'duration': 4.482}, {'end': 14395.598, 'text': 'please do the right thing.', 'start': 14394.758, 'duration': 0.84}, {'end': 14399.26, 'text': 'so please do it such as this.', 'start': 14398.059, 'duration': 1.201}, {'end': 14400.901, 'text': "so here's a person who's crossing the road.", 'start': 14399.28, 'duration': 1.621}, {'end': 14403.482, 'text': 'please stop.', 'start': 14402.981, 'duration': 0.501}, {'end': 14408.224, 'text': "here is another person crossing the road, but he's very far away.", 'start': 14404.782, 'duration': 3.442}, {'end': 14414.585, 'text': 'calculate the distance compare with your speed and decide what to do.', 'start': 14409.684, 'duration': 4.901}, {'end': 14420.826, 'text': 'he may be far enough for you to be able to see but you may not stop if you are driving a car.', 'start': 14415.725, 'duration': 5.101}], 'summary': 'Training the algorithm involves providing data and ground truth for the car to make decisions, such as stopping for pedestrians.', 'duration': 51.497, 'max_score': 14369.329, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE14369329.jpg'}, {'end': 14567.953, 'src': 'embed', 'start': 14530.228, 'weight': 4, 'content': [{'end': 14531.349, 'text': 'this called validation data.', 'start': 14530.228, 'duration': 1.121}, {'end': 14538.235, 'text': 'and now, if your algorithm works on your own held out data, that data, that your algorithm is not seen,', 'start': 14532.81, 'duration': 5.425}, {'end': 14541.837, 'text': "you're more hopeful that it will work on somebody else's new data.", 'start': 14538.235, 'duration': 3.602}, {'end': 14544.479, 'text': "it's called validation.", 'start': 14543.719, 'duration': 0.76}, {'end': 14549.804, 'text': 'and this entire cycle is often called test validate train or train validate test etc.', 'start': 14545.52, 'duration': 4.284}, {'end': 14552.417, 'text': 'and you will do this in your hackathons.', 'start': 14551.055, 'duration': 1.362}, {'end': 14559.484, 'text': 'but to do this the algorithm needs to know as to how good it is and it means measures like this.', 'start': 14553.898, 'duration': 5.586}, {'end': 14560.926, 'text': 'there are other measures.', 'start': 14560.205, 'duration': 0.721}, {'end': 14567.953, 'text': "so for example, if you're classifying an algorithm good or bad positive sentiment on tweet or negative sentiment on tweet no numbers.", 'start': 14560.946, 'duration': 7.007}], 'summary': "Validation data ensures algorithm's performance on new data, key in hackathons.", 'duration': 37.725, 'max_score': 14530.228, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE14530228.jpg'}, {'end': 14648.835, 'src': 'embed', 'start': 14621.12, 'weight': 10, 'content': [{'end': 14625.082, 'text': 'but as we said as we discussed yesterday, even things like height and weight are not that simple.', 'start': 14621.12, 'duration': 3.962}, {'end': 14629.111, 'text': 'there are complexities to that.', 'start': 14627.931, 'duration': 1.18}, {'end': 14635.252, 'text': "so for example, you can have theories if we say for example, you could say let's say a savings rate.", 'start': 14629.651, 'duration': 5.601}, {'end': 14636.733, 'text': "what's the savings rate?", 'start': 14635.952, 'duration': 0.781}, {'end': 14640.573, 'text': 'of savings rate is the proportion of money that you save.', 'start': 14636.733, 'duration': 3.84}, {'end': 14648.835, 'text': 'now, if there is a saving state, what that would mean is that you, if i take your income data and i take your consumption data,', 'start': 14640.573, 'duration': 8.262}], 'summary': 'Understanding complexities of factors like savings rate in personal finance.', 'duration': 27.715, 'max_score': 14621.12, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE14621120.jpg'}, {'end': 14820.484, 'src': 'embed', 'start': 14765.267, 'weight': 8, 'content': [{'end': 14769.608, 'text': "and the laws of physics do not apply to critic cricket at least not in this way that i'm describing.", 'start': 14765.267, 'duration': 4.341}, {'end': 14777.191, 'text': 'so therefore these laws will get you somewhere like a straight line, etc, etc, but they are approximations.', 'start': 14772.309, 'duration': 4.882}, {'end': 14785.993, 'text': 'and so what you will do is you will build better versions of this when you use different actual prediction,', 'start': 14778.171, 'duration': 7.822}, {'end': 14789.434, 'text': 'but the same argument holds for things that mean standard deviations in many such things.', 'start': 14785.993, 'duration': 3.441}, {'end': 14794.216, 'text': "if there's a specific problem you need to solve you may or may get a better estimate for doing it.", 'start': 14789.494, 'duration': 4.722}, {'end': 14810.499, 'text': 'yes, someone had was asking a question.', 'start': 14795.011, 'duration': 15.488}, {'end': 14812.48, 'text': 'so so there are many ways to do that.', 'start': 14811.159, 'duration': 1.321}, {'end': 14820.484, 'text': 'one is you just put it in you find for different values of a and b you find what that number is and then you solve it.', 'start': 14812.78, 'duration': 7.704}], 'summary': "Physics laws don't apply to cricket as described, approximations are used for predictions and estimates.", 'duration': 55.217, 'max_score': 14765.267, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE14765267.jpg'}, {'end': 14977.194, 'src': 'embed', 'start': 14949.855, 'weight': 11, 'content': [{'end': 14958.939, 'text': 'this can also be written as the covariance of x and y divided by the variance of x.', 'start': 14949.855, 'duration': 9.084}, {'end': 14967.483, 'text': 'so if you want to calculate it for two variables, what you need to do is you need to calculate the covariance and divide by the variance and here.', 'start': 14958.939, 'duration': 8.544}, {'end': 14970.37, 'text': 'y bar minus b x bar.', 'start': 14968.929, 'duration': 1.441}, {'end': 14977.194, 'text': 'this means that the that the that the line passes through x bar y bar the line passes through the middle of the data.', 'start': 14970.77, 'duration': 6.424}], 'summary': 'The covariance of x and y can be calculated by dividing their covariance by the variance of x, showing the line passes through the middle of the data.', 'duration': 27.339, 'max_score': 14949.855, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE14949855.jpg'}, {'end': 15145.033, 'src': 'embed', 'start': 15114.881, 'weight': 6, 'content': [{'end': 15117.602, 'text': 'one particular measure is what is called the elasticity of demand.', 'start': 15114.881, 'duration': 2.721}, {'end': 15129.016, 'text': 'elasticity of demand means this if my price changes by 1%, by what percentage does my sales change??', 'start': 15119.523, 'duration': 9.493}, {'end': 15138.106, 'text': 'well, if my price goes down, i would expect my demand to go up, but by how much now?', 'start': 15133.08, 'duration': 5.026}, {'end': 15139.267, 'text': 'there are certain assumptions to this.', 'start': 15138.106, 'duration': 1.161}, {'end': 15145.033, 'text': "for example, it's it's assumed that the same number works if it increased price as well as you decrease price.", 'start': 15139.867, 'duration': 5.166}], 'summary': 'Elasticity of demand measures sales change in response to price change.', 'duration': 30.152, 'max_score': 15114.881, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE15114881.jpg'}, {'end': 15203.812, 'src': 'embed', 'start': 15178.521, 'weight': 5, 'content': [{'end': 15186.845, 'text': 'so the slope of a linear regression between log sales and log price is the elasticity of demand for that product.', 'start': 15178.521, 'duration': 8.324}, {'end': 15190.406, 'text': 'i mentioned log sales and not log price,', 'start': 15188.065, 'duration': 2.341}, {'end': 15197.409, 'text': 'because elasticity is done in terms of percentages a percentage increase in price and a percentage decrease in sales.', 'start': 15190.406, 'duration': 7.003}, {'end': 15199.57, 'text': "if i don't do it as a percentage.", 'start': 15198.049, 'duration': 1.521}, {'end': 15203.812, 'text': "there's a problem now my measure depends on my units.", 'start': 15199.59, 'duration': 4.222}], 'summary': 'Elasticity of demand is the slope of linear regression between log sales and log price, measured in percentages.', 'duration': 25.291, 'max_score': 15178.521, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE15178521.jpg'}], 'start': 14094.591, 'title': 'Machine learning and regression in various fields', 'summary': 'Covers the fundamental concepts of machine learning, including algorithm training, validation, and application in minimizing distance between predictions and actual data. it also explores the importance of road safety in algorithm training, discusses modeling savings rate and cricket performance, and delves into regression uses in marketing analytics for determining price sensitivity and demand elasticity.', 'chapters': [{'end': 14394.297, 'start': 14094.591, 'title': 'Machine learning fundamentals', 'summary': 'Discusses the fundamental concepts of machine learning, including the process of training algorithms, the measure of learning success, and the application of the least squares algorithm to minimize the distance between predictions and actual data points.', 'duration': 299.706, 'highlights': ["The process of training machine learning algorithms involves providing 'training data' to teach the algorithm the correct answers, known as 'ground truth'.", 'Machine learning algorithms measure their learning success by comparing their predictions with actual data points and minimizing the distance between them.', 'The least squares algorithm is widely used in machine learning to find a prediction that is closest to the actual data points, serving as a popular fitting algorithm.']}, {'end': 14619.699, 'start': 14394.758, 'title': 'Road safety and algorithm training', 'summary': 'Highlights the importance of road safety in decision-making while driving and explains the process of training an algorithm using validation data to ensure generalizability and accuracy, emphasizing the need for different evaluation measures based on the type of data being analyzed.', 'duration': 224.941, 'highlights': ['The importance of calculating distance and speed to make informed decisions while driving is emphasized, as individuals often make similar calculations when crossing roads, illustrating the need to educate algorithms on such decision-making processes.', 'The concept of validation data is explained, emphasizing its role in training algorithms to ensure their generalizability and effectiveness on new data, with the process often referred to as test validate train or train validate test.', 'Different evaluation measures are highlighted based on the type of data being analyzed, such as using a binary correct/incorrect classification for sentiment analysis and a measure of closeness for estimating numerical values, demonstrating the need for varied approaches in algorithm training and evaluation.']}, {'end': 15055.358, 'start': 14621.12, 'title': 'Modeling savings rate and cricket performance', 'summary': 'Discusses the complexities of modeling savings rate and cricket performance using physical laws as approximations, and explains the process of minimizing to estimate parameters in a model.', 'duration': 434.238, 'highlights': ['The process of minimizing to estimate parameters in a model is explained, with the formulas for the estimators provided.', 'Using physical laws as approximations for modeling cricket performance is discussed, highlighting the use of a physical law to predict non-physical outcomes.', 'The complexities of modeling savings rate are explained, emphasizing that the proportion of money saved does not always form a straight line.', 'The use of different values of a and b to find the best estimate for a specific problem is mentioned.']}, {'end': 15792.821, 'start': 15055.478, 'title': 'Uses of regression in marketing analytics', 'summary': 'Delves into the uses of regression in marketing, particularly in determining price sensitivity and understanding demand elasticity, with examples and manual calculations included.', 'duration': 737.343, 'highlights': ['Regression is used to understand price sensitivity and demand elasticity in marketing analytics.', 'Elasticity of demand is a measure of how much sales change in response to a 1% change in price.', 'Manual calculations for covariance and regression coefficients are demonstrated.']}], 'duration': 1698.23, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE14094591.jpg', 'highlights': ["The process of training machine learning algorithms involves providing 'training data' to teach the algorithm the correct answers, known as 'ground truth'.", 'Machine learning algorithms measure their learning success by comparing their predictions with actual data points and minimizing the distance between them.', 'The least squares algorithm is widely used in machine learning to find a prediction that is closest to the actual data points, serving as a popular fitting algorithm.', 'The importance of calculating distance and speed to make informed decisions while driving is emphasized, as individuals often make similar calculations when crossing roads, illustrating the need to educate algorithms on such decision-making processes.', 'The concept of validation data is explained, emphasizing its role in training algorithms to ensure their generalizability and effectiveness on new data, with the process often referred to as test validate train or train validate test.', 'Regression is used to understand price sensitivity and demand elasticity in marketing analytics.', 'Elasticity of demand is a measure of how much sales change in response to a 1% change in price.', 'Different evaluation measures are highlighted based on the type of data being analyzed, such as using a binary correct/incorrect classification for sentiment analysis and a measure of closeness for estimating numerical values, demonstrating the need for varied approaches in algorithm training and evaluation.', 'Using physical laws as approximations for modeling cricket performance is discussed, highlighting the use of a physical law to predict non-physical outcomes.', 'The process of minimizing to estimate parameters in a model is explained, with the formulas for the estimators provided.', 'The complexities of modeling savings rate are explained, emphasizing that the proportion of money saved does not always form a straight line.', 'Manual calculations for covariance and regression coefficients are demonstrated.', 'The use of different values of a and b to find the best estimate for a specific problem is mentioned.']}, {'end': 17008.767, 'segs': [{'end': 15861.688, 'src': 'embed', 'start': 15795.407, 'weight': 0, 'content': [{'end': 15801.369, 'text': "So now that you understand the foundation of data science, let's look at how we can implement it in Python.", 'start': 15795.407, 'duration': 5.962}, {'end': 15804.33, 'text': 'Here are some of the topics we will cover.', 'start': 15802.67, 'duration': 1.66}, {'end': 15808.292, 'text': 'First, we shall discuss the basics of Python.', 'start': 15805.931, 'duration': 2.361}, {'end': 15814.514, 'text': 'Understanding data structures is vital for data science, so we will go into how they are implemented in Python.', 'start': 15809.072, 'duration': 5.442}, {'end': 15819.316, 'text': 'Then, we will understand the flow of control statements in Python.', 'start': 15815.675, 'duration': 3.641}, {'end': 15825.807, 'text': 'We will also dive into some object-oriented programming, a core characteristic of Python.', 'start': 15820.882, 'duration': 4.925}, {'end': 15831.132, 'text': "From there, we'll delve into numerical computing with NumPy libraries in Python.", 'start': 15826.608, 'duration': 4.524}, {'end': 15835.797, 'text': 'We will also cover data manipulation with the Pandas library in Python.', 'start': 15831.633, 'duration': 4.164}, {'end': 15840.602, 'text': "Then, we'll show you how data can be visualized in Python using Matplotlib.", 'start': 15836.458, 'duration': 4.144}, {'end': 15847.441, 'text': 'We will then get into some learning algorithms like linear regression algorithm and logistic regression algorithm.', 'start': 15841.718, 'duration': 5.723}, {'end': 15850.682, 'text': 'The second half of the course will be taught by Mr. Mukesh Rao.', 'start': 15847.881, 'duration': 2.801}, {'end': 15855.185, 'text': "He's an adjunct faculty at Great Lakes for big data and machine learning.", 'start': 15851.243, 'duration': 3.942}, {'end': 15861.688, 'text': 'Mukesh has over 20 years of industry experience in market research, project management, and data science.', 'start': 15855.885, 'duration': 5.803}], 'summary': 'Python data science basics, python libraries, and industry expert instruction.', 'duration': 66.281, 'max_score': 15795.407, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE15795407.jpg'}, {'end': 15992.338, 'src': 'embed', 'start': 15967.398, 'weight': 6, 'content': [{'end': 15972.662, 'text': 'Now, Anaconda is a Python distribution which basically provides you all of the packages inbuilt.', 'start': 15967.398, 'duration': 5.264}, {'end': 15979.748, 'text': 'So you have packages such as matplotlib for visualization, pandas for data manipulation and numpy for numerical computing.', 'start': 15973.123, 'duration': 6.625}, {'end': 15982.39, 'text': "So you don't have to manually install all of these packages.", 'start': 15980.048, 'duration': 2.342}, {'end': 15988.415, 'text': 'So when you actually install Anaconda, all of these packages are actually pre-installed in Anaconda.', 'start': 15982.71, 'duration': 5.705}, {'end': 15992.338, 'text': 'And for doing all of your coding, you have something known as the Jupyter notebook.', 'start': 15988.815, 'duration': 3.523}], 'summary': 'Anaconda is a python distribution with pre-installed packages like matplotlib, pandas, and numpy, along with the jupyter notebook for coding.', 'duration': 24.94, 'max_score': 15967.398, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE15967398.jpg'}, {'end': 16048.419, 'src': 'embed', 'start': 16006.965, 'weight': 7, 'content': [{'end': 16012.806, 'text': "so we've got windows, mac os and linux, and this is the latest version of python 3.7.", 'start': 16006.965, 'duration': 5.841}, {'end': 16014.826, 'text': 'so that is what we had downloaded earlier.', 'start': 16012.806, 'duration': 2.02}, {'end': 16017.287, 'text': 'so this was the older version of python 2.7.', 'start': 16014.826, 'duration': 2.461}, {'end': 16025.828, 'text': "but everyone has moved on to python 3.7 now, and that is why we'll also be downloading anaconda for the latest python version, which is python 3.7.", 'start': 16017.287, 'duration': 8.541}, {'end': 16029.87, 'text': "so it'll automatically start the download for 64-bit operating system.", 'start': 16025.828, 'duration': 4.042}, {'end': 16035.693, 'text': "so once the download for anaconda is done, we need something as the jupyter notebook, which i've already told you.", 'start': 16029.87, 'duration': 5.823}, {'end': 16042.737, 'text': 'so as it is written over here, jupyter notebook is basically a browser-based interpreter that allows us to interactively work with python,', 'start': 16035.693, 'duration': 7.044}, {'end': 16045.158, 'text': 'and this is how your jupyter notebook looks like.', 'start': 16042.737, 'duration': 2.421}, {'end': 16048.419, 'text': 'so let me go ahead and open up jupyter notebook right.', 'start': 16045.158, 'duration': 3.261}], 'summary': 'Latest version of python 3.7 and anaconda for 64-bit os will be downloaded, along with jupyter notebook for interactive python work.', 'duration': 41.454, 'max_score': 16006.965, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE16006965.jpg'}, {'end': 16190.167, 'src': 'embed', 'start': 16160.511, 'weight': 9, 'content': [{'end': 16166.534, 'text': 'So simply put variables are basically temporary storage spaces where you can store data or values.', 'start': 16160.511, 'duration': 6.023}, {'end': 16168.836, 'text': "Now let's take this example over here.", 'start': 16167.095, 'duration': 1.741}, {'end': 16171.157, 'text': 'So consider this folder to be a variable.', 'start': 16169.116, 'duration': 2.041}, {'end': 16176.319, 'text': 'So what you do is you store this value temporarily inside this folder.', 'start': 16171.517, 'duration': 4.802}, {'end': 16178.861, 'text': 'So this value is John, which is the name of the student.', 'start': 16176.619, 'duration': 2.242}, {'end': 16183.243, 'text': "So you've taken this value and place inside this variable for certain amount of time.", 'start': 16179.161, 'duration': 4.082}, {'end': 16190.167, 'text': 'Now after that, since this is only a temporary storage piece, you can take out this variable and replace it with Sam.', 'start': 16183.803, 'duration': 6.364}], 'summary': 'Variables are temporary storage for data/values; can be replaced with different values.', 'duration': 29.656, 'max_score': 16160.511, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE16160511.jpg'}, {'end': 16305.623, 'src': 'embed', 'start': 16278.092, 'weight': 10, 'content': [{'end': 16281.735, 'text': 'so there are different types of data which your variable can hold.', 'start': 16278.092, 'duration': 3.643}, {'end': 16283.237, 'text': 'it can be of integer type.', 'start': 16281.735, 'duration': 1.502}, {'end': 16285.059, 'text': 'it can be floating or decimal type.', 'start': 16283.237, 'duration': 1.822}, {'end': 16286.56, 'text': 'it can be a boolean, it can be a string.', 'start': 16285.059, 'duration': 1.501}, {'end': 16291.565, 'text': 'so when it comes to integers, so you basically have numbers such as 10, 500, 1000, minus 10, minus 27 and so on,', 'start': 16286.56, 'duration': 5.005}, {'end': 16296.751, 'text': 'and floating are basically decimal point numbers.', 'start': 16294.548, 'duration': 2.203}, {'end': 16301.978, 'text': 'so 3.14, 15.97 or minus 2.1678.', 'start': 16296.751, 'duration': 5.227}, {'end': 16305.623, 'text': 'so any such decimal point number would be a floating point number.', 'start': 16301.978, 'duration': 3.645}], 'summary': 'Different data types include integer, floating point, boolean, and string. integers can include numbers like 10, 500, 1000, -10, -27. floating point numbers can include values like 3.14, 15.97, -2.1678.', 'duration': 27.531, 'max_score': 16278.092, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE16278092.jpg'}, {'end': 16378.319, 'src': 'embed', 'start': 16354.612, 'weight': 11, 'content': [{'end': 16361.513, 'text': 'So what you can basically notice is the type of the variable would depend on the type of the data, which are actually storing inside the variable.', 'start': 16354.612, 'duration': 6.901}, {'end': 16367.494, 'text': 'So if you store an integer type data inside the variable, then the type of the variable would be integer.', 'start': 16361.833, 'duration': 5.661}, {'end': 16374.597, 'text': 'and if you store a floating point or a decimal type data inside the variable and the type of the variable would be floating type.', 'start': 16367.854, 'duration': 6.743}, {'end': 16378.319, 'text': 'now, similarly, let me go ahead, then store a boolean value inside this.', 'start': 16374.597, 'duration': 3.722}], 'summary': 'Variable type depends on stored data: integer for int, floating point for decimal, boolean for boolean.', 'duration': 23.707, 'max_score': 16354.612, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE16354612.jpg'}, {'end': 16468.064, 'src': 'embed', 'start': 16401.839, 'weight': 12, 'content': [{'end': 16405.104, 'text': 'so finally, we have one more type of data which is basically complex number.', 'start': 16401.839, 'duration': 3.265}, {'end': 16407.205, 'text': 'So complex numbers.', 'start': 16405.904, 'duration': 1.301}, {'end': 16410.288, 'text': 'you basically know that it has a real part and an imaginary part.', 'start': 16407.205, 'duration': 3.083}, {'end': 16411.97, 'text': 'So let me just type it out.', 'start': 16410.848, 'duration': 1.122}, {'end': 16419.235, 'text': "So four plus let's see eight G now, normally in mathematics, it is actually represented as four plus eight.", 'start': 16412.05, 'duration': 7.185}, {'end': 16423.818, 'text': 'I know when it comes to Python programming, we represent the imaginary part with G.', 'start': 16419.415, 'duration': 4.403}, {'end': 16425.56, 'text': 'So four is the real part over here.', 'start': 16423.818, 'duration': 1.742}, {'end': 16428.342, 'text': 'And eight G is the imaginary part over here.', 'start': 16425.86, 'duration': 2.482}, {'end': 16431.403, 'text': "So just to confirm it, I'll again, check the type.", 'start': 16429.383, 'duration': 2.02}, {'end': 16432.525, 'text': 'So type of eight.', 'start': 16431.484, 'duration': 1.041}, {'end': 16436.106, 'text': 'the type of ES complex.', 'start': 16434.625, 'duration': 1.481}, {'end': 16440.969, 'text': 'So all of these are the different types of data which we can work with in Python.', 'start': 16436.565, 'duration': 4.404}, {'end': 16444.31, 'text': "So going ahead, we'll work with operators in Python.", 'start': 16441.929, 'duration': 2.381}, {'end': 16448.273, 'text': "So we've got arithmetic operators, relational operators and logical operators.", 'start': 16444.471, 'duration': 3.802}, {'end': 16450.313, 'text': "Let's start off with arithmetic operators.", 'start': 16448.713, 'duration': 1.6}, {'end': 16457.677, 'text': "I'll head on to a Jupyter notebook over here and arithmetic operators are basically the normal mathematical operations which you do.", 'start': 16451.693, 'duration': 5.984}, {'end': 16464.322, 'text': "So that would be addition, subtraction, and then you've got division and then you've got multiplication.", 'start': 16458.098, 'duration': 6.224}, {'end': 16468.064, 'text': "So these are the different symbols which you'll be using to perform these operations.", 'start': 16464.741, 'duration': 3.323}], 'summary': 'In python, complex numbers and arithmetic operators are key data types for mathematical operations.', 'duration': 66.225, 'max_score': 16401.839, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE16401839.jpg'}, {'end': 16575.887, 'src': 'embed', 'start': 16549.083, 'weight': 15, 'content': [{'end': 16557.035, 'text': 'so relational operators And relational operators basically help you to understand the relation between two variables.', 'start': 16549.083, 'duration': 7.952}, {'end': 16561.898, 'text': 'Now, what do I mean when I say it helps us to understand the relationship between two variables?', 'start': 16557.535, 'duration': 4.363}, {'end': 16569.023, 'text': "So let's say, when you have two variables, the value in one variable could either be greater than the value in the second variable,", 'start': 16562.539, 'duration': 6.484}, {'end': 16575.887, 'text': 'or it could be less than the value in the second variable, or it could be equal, or there might be no relation at all between these two variables.', 'start': 16569.023, 'duration': 6.864}], 'summary': 'Relational operators help understand relationship between variables.', 'duration': 26.804, 'max_score': 16549.083, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE16549082.jpg'}, {'end': 16696.277, 'src': 'embed', 'start': 16662.346, 'weight': 16, 'content': [{'end': 16664.968, 'text': "Now we'll actually start with the and operator.", 'start': 16662.346, 'duration': 2.622}, {'end': 16670.011, 'text': 'Now and operator is used to check the condition or the logic between two variables.', 'start': 16665.588, 'duration': 4.423}, {'end': 16673.113, 'text': 'and this is how the and operator works.', 'start': 16670.671, 'duration': 2.442}, {'end': 16679.473, 'text': 'so if both of the operands are true, only then the final result will evaluate to true,', 'start': 16673.113, 'duration': 6.36}, {'end': 16685.555, 'text': 'and if either of the operands is false or both of the operands is false, then the final result will be false.', 'start': 16679.473, 'duration': 6.082}, {'end': 16696.277, 'text': "so i'll just create two variables over here, and this time i'll be storing the value true inside a and i'll be storing false inside b.", 'start': 16685.555, 'duration': 10.722}], 'summary': "The 'and' operator checks logic between variables; result is true if both operands are true, else false.", 'duration': 33.931, 'max_score': 16662.346, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE16662346.jpg'}], 'start': 15795.407, 'title': 'Python for data science', 'summary': 'Covers python basics, data structures, flow of control, object-oriented programming, numerical computing with numpy, data manipulation with pandas, data visualization with matplotlib, and learning algorithms. it also includes installation of python, pycharm, anaconda, and jupyter notebook, focusing on python 3.7 and usage of jupyter notebook. additionally, it introduces variables, data types, and operators in python, highlighting the ability to change values and the association between data types and variables.', 'chapters': [{'end': 15861.688, 'start': 15795.407, 'title': 'Implementing data science in python', 'summary': 'Covers the basics of python, data structures, flow of control statements, object-oriented programming, numerical computing with numpy, data manipulation with pandas, data visualization with matplotlib, and learning algorithms like linear and logistic regression. mr. mukesh rao, with over 20 years of industry experience, will teach the second half of the course.', 'duration': 66.281, 'highlights': ['Mr. Mukesh Rao, with over 20 years of industry experience, will teach the second half of the course.', 'Data can be visualized in Python using Matplotlib.', 'Learning algorithms like linear regression algorithm and logistic regression algorithm will be covered.', 'Numerical computing with NumPy libraries in Python will be discussed.', 'Data manipulation with the Pandas library in Python will be covered.', 'Understanding data structures is vital for data science, and it will be explained how they are implemented in Python.', 'Object-oriented programming, a core characteristic of Python, will be delved into.', 'The basics of Python will be discussed.']}, {'end': 16119.567, 'start': 15862.588, 'title': 'Python installation and setup', 'summary': 'Covers the installation of python, pycharm, anaconda, and jupyter notebook, with a focus on python 3.7 and the usage of jupyter notebook for interactive python programming.', 'duration': 256.979, 'highlights': ['Python 3.7 is the latest version and recommended for download', 'Anaconda provides pre-installed packages such as matplotlib, pandas, and numpy', 'Jupyter notebook is a browser-based interpreter for interactive Python work']}, {'end': 16401.839, 'start': 16121.365, 'title': 'Introduction to variables and data types', 'summary': 'Introduces the concept of variables as temporary storage spaces for data, and demonstrates the use of different data types including integer, floating point, boolean, and string in python, highlighting the ability to change values and the association between data types and variables.', 'duration': 280.474, 'highlights': ['Variables are temporary storage spaces for data, allowing values to be changed.', 'Different data types in Python include integer, floating point, boolean, and string.', 'The type of a variable depends on the type of data stored within it.']}, {'end': 17008.767, 'start': 16401.839, 'title': 'Python data types and operators', 'summary': 'Covers python data types including complex numbers, arithmetic, relational, and logical operators, and working with python strings, with detailed explanations and examples.', 'duration': 606.928, 'highlights': ['Python data types include complex numbers, arithmetic, relational, and logical operators, and working with strings.', 'Explanation of complex numbers, real and imaginary parts, and their representation in Python programming.', 'Detailed demonstration of arithmetic operations in Python, including addition, subtraction, multiplication, and division.', 'Explanation of relational operators and their usage to compare variables and determine their relationship.', "In-depth explanation of logical operators 'and' and 'or' and their functionality in Python.", 'Comprehensive overview and examples of working with Python strings, including indexing, extracting characters, and using inbuilt functions like len, upper, and lower.']}], 'duration': 1213.36, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE15795407.jpg', 'highlights': ['Data manipulation with the Pandas library in Python will be covered.', 'Numerical computing with NumPy libraries in Python will be discussed.', 'Learning algorithms like linear regression algorithm and logistic regression algorithm will be covered.', 'Object-oriented programming, a core characteristic of Python, will be delved into.', 'Understanding data structures is vital for data science, and it will be explained how they are implemented in Python.', 'Data can be visualized in Python using Matplotlib.', 'Anaconda provides pre-installed packages such as matplotlib, pandas, and numpy', 'Python 3.7 is the latest version and recommended for download', 'Jupyter notebook is a browser-based interpreter for interactive Python work', 'Variables are temporary storage spaces for data, allowing values to be changed.', 'Different data types in Python include integer, floating point, boolean, and string.', 'The type of a variable depends on the type of data stored within it.', 'Python data types include complex numbers, arithmetic, relational, and logical operators, and working with strings.', 'Explanation of complex numbers, real and imaginary parts, and their representation in Python programming.', 'Detailed demonstration of arithmetic operations in Python, including addition, subtraction, multiplication, and division.', 'Explanation of relational operators and their usage to compare variables and determine their relationship.', "In-depth explanation of logical operators 'and' and 'or' and their functionality in Python.", 'Comprehensive overview and examples of working with Python strings, including indexing, extracting characters, and using inbuilt functions like len, upper, and lower.', 'The basics of Python will be discussed.', 'Understanding data structures is vital for data science, and it will be explained how they are implemented in Python.', 'Mr. Mukesh Rao, with over 20 years of industry experience, will teach the second half of the course.']}, {'end': 19071.84, 'segs': [{'end': 17158.099, 'src': 'embed', 'start': 17130.312, 'weight': 3, 'content': [{'end': 17134.253, 'text': "or you can't add another element inside the tuple which you have already created.", 'start': 17130.312, 'duration': 3.941}, {'end': 17139.654, 'text': 'So this is basically the immutability nature of tuples, and this is how you can create a tuple.', 'start': 17134.673, 'duration': 4.981}, {'end': 17144.335, 'text': 'So you will use these round braces over here and inside these you will given the values right?', 'start': 17140.214, 'duration': 4.121}, {'end': 17147.696, 'text': 'So, as you see, tuple is actually your heterogeneous data structure.', 'start': 17144.915, 'duration': 2.781}, {'end': 17152.317, 'text': 'So over here, you are storing a numerical value, a character value and a Boolean value.', 'start': 17148.036, 'duration': 4.281}, {'end': 17155.158, 'text': "Right? So let's go to Jupyter notebook and work with tuples.", 'start': 17152.937, 'duration': 2.221}, {'end': 17158.099, 'text': 'So let me create a tuple.', 'start': 17156.978, 'duration': 1.121}], 'summary': 'Tuples are immutable, store heterogeneous data, and created using round braces.', 'duration': 27.787, 'max_score': 17130.312, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE17130312.jpg'}, {'end': 17344.071, 'src': 'embed', 'start': 17300.482, 'weight': 0, 'content': [{'end': 17302.504, 'text': 'so there are five elements in a list.', 'start': 17300.482, 'duration': 2.022}, {'end': 17305.785, 'text': 'so the second element or the third element can be modified.', 'start': 17302.504, 'duration': 3.281}, {'end': 17310.768, 'text': "and again, since i've said that there are five elements, so another five elements can be added inside this list.", 'start': 17305.785, 'duration': 4.983}, {'end': 17316.272, 'text': 'so this is the basic difference between lists and tuples, and this is how you can create a list.', 'start': 17311.368, 'duration': 4.904}, {'end': 17321.316, 'text': "so you'll basically give in square braces and you'll give in all of the values inside the square braces.", 'start': 17316.272, 'duration': 5.044}, {'end': 17326.9, 'text': "so let's head on to jupyter notebook and work with lists.", 'start': 17321.316, 'duration': 5.584}, {'end': 17344.071, 'text': "so i'll type in l1 and i'll give in some values over here and then I'll print it out right.", 'start': 17326.9, 'duration': 17.171}], 'summary': 'Lists and tuples can be modified and extended in python, with lists denoted by square braces and capable of adding more elements than tuples.', 'duration': 43.589, 'max_score': 17300.482, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE17300482.jpg'}, {'end': 17717.651, 'src': 'embed', 'start': 17689.472, 'weight': 2, 'content': [{'end': 17692.514, 'text': 'And A, I have given 2 times, but then again, it comes only once.', 'start': 17689.472, 'duration': 3.042}, {'end': 17696.376, 'text': 'So, obviously, set does not allow any duplicates inside it.', 'start': 17693.074, 'duration': 3.302}, {'end': 17699.879, 'text': 'Now, let me go ahead and add some new values inside the set.', 'start': 17696.897, 'duration': 2.982}, {'end': 17705.863, 'text': "So, I'll type in s1.add hello world.", 'start': 17700.439, 'duration': 5.424}, {'end': 17711.207, 'text': "Right? So, I've added this new element.", 'start': 17709.185, 'duration': 2.022}, {'end': 17717.651, 'text': "if i want to add more than one element at a single time, then we've got this update method.", 'start': 17712.509, 'duration': 5.142}], 'summary': "Set does not allow duplicates. added 'hello world' using s1.add. update method for adding multiple elements.", 'duration': 28.179, 'max_score': 17689.472, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE17689472.jpg'}, {'end': 17790.066, 'src': 'embed', 'start': 17763.784, 'weight': 1, 'content': [{'end': 17770.711, 'text': "So in real world, you will come across a lot of situations where you'd have to make a decision on the basis of a condition.", 'start': 17763.784, 'duration': 6.927}, {'end': 17777.838, 'text': "So let's see if this happens, then you'd have to perform a set of actions else you'd have to perform a different set of actions.", 'start': 17771.211, 'duration': 6.627}, {'end': 17780.42, 'text': "So let's take this example to understand this better.", 'start': 17778.398, 'duration': 2.022}, {'end': 17784.483, 'text': "so let's say it's raining and you'd want to play football.", 'start': 17781.101, 'duration': 3.382}, {'end': 17790.066, 'text': "so if it's raining, then you can't do anything and you just have to sit inside else.", 'start': 17784.483, 'duration': 5.583}], 'summary': 'Real-world decisions based on conditions and actions. example: deciding to play football during rain.', 'duration': 26.282, 'max_score': 17763.784, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE17763784.jpg'}, {'end': 18952.105, 'src': 'embed', 'start': 18922.125, 'weight': 4, 'content': [{'end': 18925.246, 'text': 'so hello world.', 'start': 18922.125, 'duration': 3.121}, {'end': 18932.488, 'text': 'right now, if i have to invoke this function, i just have to type in hello, with this parenthesis all right.', 'start': 18925.246, 'duration': 7.242}, {'end': 18937.45, 'text': 'so if i want to print out hello world, all i have to do is copy this, paste it over here.', 'start': 18932.488, 'duration': 4.962}, {'end': 18943.612, 'text': 'so all i have to do is invoke this function, and then i can happily print hello world how many times i want.', 'start': 18937.45, 'duration': 6.162}, {'end': 18952.105, 'text': "Now, after this, what I'll do is I will create a function where I'm taking an input value and adding 10 more to it.", 'start': 18944.542, 'duration': 7.563}], 'summary': "Demonstration of invoking a function to print 'hello world' and creating a function to add 10 to an input value", 'duration': 29.98, 'max_score': 18922.125, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE18922125.jpg'}], 'start': 17008.928, 'title': 'Python programming fundamentals', 'summary': 'Covers non-primitive data structures, like tuples, lists, dictionaries, and sets, if-else statements, while loops, and functions in python, emphasizing their characteristics, operations, and practical examples.', 'chapters': [{'end': 17226.681, 'start': 17008.928, 'title': 'Working with non-primitive data structures', 'summary': 'Covers the basics of working with non-primitive data structures in python, focusing on tuples as an ordered collection of elements, immutable nature, and accessing individual elements.', 'duration': 217.753, 'highlights': ['Tuples are immutable, meaning once created, their elements cannot be changed or modified, and new elements cannot be added.', 'Tuples serve as heterogeneous data structures, allowing storage of different types of elements within a single data structure.', 'Accessing elements from a tuple is similar to accessing elements from a string, using index notation to extract individual or sequential elements.']}, {'end': 17763.124, 'start': 17226.681, 'title': 'Python data structures and basic operations', 'summary': 'Covers the concepts of tuples, lists, dictionaries, and sets in python, highlighting the differences between them, their mutability, and methods for adding, removing, and modifying elements, as well as their key characteristics such as immutability for tuples, mutability for lists, key-value pairs for dictionaries, and unordered, unindexed nature for sets.', 'duration': 536.443, 'highlights': ['Tuples are immutable and do not support item assignment, while lists are mutable and allow adding, removing, and modifying elements.', 'The creation and modification of lists, including accessing elements, changing values, adding new elements, adding a list inside a list, and removing elements.', 'Dictionary creation, accessing keys and values, and modifying elements by changing the value for a specific key.', 'Sets as an unordered and unindexed collection of elements that do not allow duplicate values, with examples of set creation, addition of elements, and the update method for adding multiple elements.']}, {'end': 18238.217, 'start': 17763.784, 'title': 'Using if else statements in programming', 'summary': 'Explains the use of if else statements in programming, showcasing examples of conditional decision-making, evaluating expressions, and working with tuples, lists, and dictionaries, as well as introducing looping statements for repeating tasks until a condition is met.', 'duration': 474.433, 'highlights': ['Explaining the concept of if else statements and demonstrating their use in programming with real-world examples, such as making decisions based on conditions and representing them in programming.', 'Demonstrating the evaluation of expressions using if else statements, with examples of comparing values and executing specific actions based on the evaluation of conditions.', 'Illustrating the use of if else statements with tuples, lists, and dictionaries, including checking for the presence of elements and modifying values based on conditions.', 'Introducing looping statements and explaining their purpose in repeating tasks until a specific condition is met, using the analogy of filling a bucket with water until it is full.']}, {'end': 18624.829, 'start': 18238.577, 'title': 'Understanding while loop in python', 'summary': 'Explains the working of a while loop in python, demonstrating the iteration process with the 2 table and a list, incrementing values and ensuring termination to avoid infinite loops.', 'duration': 386.252, 'highlights': ['Demonstrates the iteration process with the 2 table', 'Illustrates iterating over a list and incrementing values', 'Explanation of the working of a while loop in Python']}, {'end': 19071.84, 'start': 18626.147, 'title': 'Python loops and functions', 'summary': 'Covers the basics of creating and iterating through lists using for loops, demonstrating nested loops, and then delves into creating and invoking functions, simplifying code execution by demonstrating a hello world function, an add10 function, and an odd even function.', 'duration': 445.693, 'highlights': ['Demonstrating nested loops and iterating through lists using for loops.', 'Explanation of creating and invoking functions, including the hello world function, add10 function, and odd even function.']}], 'duration': 2062.912, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE17008928.jpg', 'highlights': ['Lists are mutable, allowing adding, removing, and modifying elements.', 'If else statements demonstrate decision-making based on conditions.', 'Sets do not allow duplicate values and support addition of elements.', 'Tuples are immutable and serve as heterogeneous data structures.', 'Functions are created and invoked, including hello world and add10 functions.']}, {'end': 21150.626, 'segs': [{'end': 19233.832, 'src': 'embed', 'start': 19204.315, 'weight': 1, 'content': [{'end': 19206.756, 'text': "Now let's understand what exactly is a object.", 'start': 19204.315, 'duration': 2.441}, {'end': 19209.938, 'text': 'So objects are basically specific instances of a class.', 'start': 19207.116, 'duration': 2.822}, {'end': 19213.8, 'text': "So as I've told you, a class is a general template.", 'start': 19210.518, 'duration': 3.282}, {'end': 19219.883, 'text': 'And when you want a specific instance of that general template, that is when you have an object.', 'start': 19214.16, 'duration': 5.723}, {'end': 19223.525, 'text': 'So when I say phone, that phone would be your class.', 'start': 19220.343, 'duration': 3.182}, {'end': 19228.428, 'text': 'And the objects of that phone would be Apple, Motorola, and Samsung.', 'start': 19223.905, 'duration': 4.523}, {'end': 19233.832, 'text': 'Right So Apple, Motorola and Samsung would have the general template of a phone.', 'start': 19228.948, 'duration': 4.884}], 'summary': 'Objects are specific instances of a class, e.g., apple, motorola, and samsung are objects of the class phone.', 'duration': 29.517, 'max_score': 19204.315, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE19204315.jpg'}, {'end': 19319.186, 'src': 'embed', 'start': 19291.079, 'weight': 3, 'content': [{'end': 19295.502, 'text': "So I'll type in capital P and then I'll follow it up with H-O-N-E, so it is phone.", 'start': 19291.079, 'duration': 4.423}, {'end': 19300.628, 'text': 'and then inside this I will give in the properties and the behavior of the class.', 'start': 19296.383, 'duration': 4.245}, {'end': 19301.99, 'text': 'so a phone.', 'start': 19300.628, 'duration': 1.362}, {'end': 19306.275, 'text': 'basically, with the help of a phone, you can make phone calls and you can play games.', 'start': 19301.99, 'duration': 4.285}, {'end': 19308.477, 'text': "so I'll start off by creating methods.", 'start': 19306.275, 'duration': 2.202}, {'end': 19311.18, 'text': 'so the first method is to make a phone call.', 'start': 19308.477, 'duration': 2.703}, {'end': 19319.186, 'text': 'so I will name this method as make call and then this will have something known as self.', 'start': 19311.18, 'duration': 8.006}], 'summary': 'Creating a class for a phone with methods to make calls and play games.', 'duration': 28.107, 'max_score': 19291.079, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE19291079.jpg'}, {'end': 19817.488, 'src': 'embed', 'start': 19792.156, 'weight': 2, 'content': [{'end': 19797.284, 'text': "now let's head on to an important concept in object-oriented programming, which is inheritance.", 'start': 19792.156, 'duration': 5.128}, {'end': 19799.948, 'text': 'so what comes to your mind when you hear the word inheritance?', 'start': 19797.284, 'duration': 2.664}, {'end': 19808.141, 'text': "So, simply put, inheritance basically means that when something acquires the properties of someone else's or something else's.", 'start': 19800.836, 'duration': 7.305}, {'end': 19817.488, 'text': "Right So let's say you are inheriting your features from your father or you are inheriting the land or the property from your ancestors.", 'start': 19808.362, 'duration': 9.126}], 'summary': 'Inheritance in object-oriented programming allows acquiring properties from another entity.', 'duration': 25.332, 'max_score': 19792.156, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE19792156.jpg'}, {'end': 20016.302, 'src': 'embed', 'start': 19984.455, 'weight': 4, 'content': [{'end': 19986.196, 'text': "so that's too much of information.", 'start': 19984.455, 'duration': 1.741}, {'end': 19990.2, 'text': 'so simply put, numpy basically has a multi-dimensional array.', 'start': 19986.196, 'duration': 4.004}, {'end': 19998.427, 'text': 'Now, to process those multi-dimensional arrays, you have certain functions and you have certain operations pre-built in the NumPy package.', 'start': 19990.7, 'duration': 7.727}, {'end': 20002.23, 'text': 'And that is how you can work with these NumPy multi-dimensional arrays.', 'start': 19998.987, 'duration': 3.243}, {'end': 20006.894, 'text': 'So you can perform all sorts of numerical and scientific operations on this NumPy array.', 'start': 20002.71, 'duration': 4.184}, {'end': 20012.038, 'text': "So let's go to Jupyter Notebook and work with this very famous package called as NumPy.", 'start': 20007.394, 'duration': 4.644}, {'end': 20016.302, 'text': "So to start working with the NumPy library, you'd have to first import it.", 'start': 20012.739, 'duration': 3.563}], 'summary': 'Numpy enables processing multi-dimensional arrays for numerical and scientific operations.', 'duration': 31.847, 'max_score': 19984.455, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE19984455.jpg'}, {'end': 21118.948, 'src': 'embed', 'start': 21089.909, 'weight': 0, 'content': [{'end': 21095.452, 'text': "now let's go ahead and work with another interesting package in python, which is pandas.", 'start': 21089.909, 'duration': 5.543}, {'end': 21102.236, 'text': 'so panda stands for panel data and it is the core library for data manipulation and data analysis.', 'start': 21095.452, 'duration': 6.784}, {'end': 21106.439, 'text': 'So as NumPy provides us a multidimensional array,', 'start': 21102.816, 'duration': 3.623}, {'end': 21113.063, 'text': 'similarly Pandas provides us a multidimensional data structure for performing various data manipulation operations.', 'start': 21106.439, 'duration': 6.624}, {'end': 21118.948, 'text': 'So it provides us both a single dimensional data structure and a multidimensional data structure.', 'start': 21113.544, 'duration': 5.404}], 'summary': 'Pandas is a core library for data manipulation and analysis in python, offering a multidimensional data structure.', 'duration': 29.039, 'max_score': 21089.909, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE21089909.jpg'}], 'start': 19071.84, 'title': 'Introduction to object oriented programming and numpy arrays', 'summary': 'Introduces object-oriented programming in python with classes, objects, and methods, and discusses inheritance. it also covers initializing and performing operations on numpy arrays, including creating and manipulating arrays with specific values and ranges, along with an introduction to pandas for data analysis.', 'chapters': [{'end': 19526.804, 'start': 19071.84, 'title': 'Introduction to object oriented programming', 'summary': 'Introduces object-oriented programming in python, explaining the concepts of classes and objects, and demonstrates the creation and invocation of methods in a class, using a phone class as an example.', 'duration': 454.964, 'highlights': ['Objects are specific instances of a class, representing real-world entities, such as Apple, Motorola, and Samsung being instances of the phone class, each having general properties and behavior in common.', 'Creation and invocation of methods in a class, using the example of defining methods like make call and play game in a phone class, and creating an instance of the class to invoke these methods.', 'Explanation of the self attribute in methods, clarifying its role in indicating that the method belongs to the instance which invokes it and its usage in accessing instance attributes.']}, {'end': 19792.156, 'start': 19527.545, 'title': 'Creating and using methods in python', 'summary': "Explains the process of creating methods in a python class, with examples of adding parameters, assigning values, and invoking methods through instances, showcasing the use of 'self' attribute and its relevance.", 'duration': 264.611, 'highlights': ['Explaining the process of creating methods in a Python class', "Demonstrating the use of 'self' attribute and its relevance", 'Examples of adding parameters and assigning values']}, {'end': 20240.55, 'start': 19792.156, 'title': 'Inheritance in object-oriented programming and numpy package', 'summary': "Discusses inheritance in object-oriented programming, where one class acquires the properties of another class, demonstrated through the creation of a subclass 'iphone' inheriting from a base class 'phone' and the utilization of the numpy package for creating single and multi-dimensional arrays.", 'duration': 448.394, 'highlights': ["Inheritance in object-oriented programming involves one class acquiring the features or properties of another class, demonstrated through the creation of a subclass 'iPhone' inheriting from a base class 'phone'.", "Demonstration of inheritance through the addition of a method 'cure cancer' to the subclass 'iPhone', showcasing the concept of adding new functionality to the inherited class.", 'Utilization of the NumPy package for creating single and multi-dimensional arrays, showcasing the process of importing, installing, and creating NumPy arrays.']}, {'end': 20459.742, 'start': 20240.55, 'title': 'Initializing numpy arrays in python', 'summary': 'Covers the initialization of numpy arrays in python, including creating arrays with zeros, filling arrays with a specific value, and generating arrays with a particular range, demonstrating the use of np.zeros, np.full, and np.arange methods.', 'duration': 219.192, 'highlights': ['Creating NumPy arrays with zeros using np.zeros method', 'Initializing NumPy arrays with specific values using np.full method', 'Initializing NumPy arrays with a particular range using np.arange method']}, {'end': 21150.626, 'start': 20459.742, 'title': 'Numpy arrays and operations', 'summary': 'Covers the initialization of numpy arrays with specific ranges and random numbers, checking the shape of arrays, performing simple mathematics, adding numpy arrays, and joining arrays using vstack, hstack, and column stack methods, leading to the introduction to pandas for data manipulation and analysis.', 'duration': 690.884, 'highlights': ['Pandas provides a multidimensional data structure for data manipulation, including series and data frames, essential for data science operations like machine learning algorithms.', 'Performing basic addition operations on NumPy arrays using the sum function and setting the axis for vertical and horizontal additions.', 'Initializing NumPy arrays with specific ranges and random numbers, checking their shape, and reshaping them using the shape method.', 'Joining different NumPy arrays using vstack, hstack, and column stack methods to combine arrays row-wise, column-wise, and create two-dimensional arrays.']}], 'duration': 2078.786, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE19071840.jpg', 'highlights': ['Pandas provides a multidimensional data structure for data manipulation, essential for data science operations.', 'Objects are specific instances of a class, representing real-world entities, such as Apple, Motorola, and Samsung being instances of the phone class.', 'Inheritance in object-oriented programming involves one class acquiring the features or properties of another class.', 'Creating and invocation of methods in a class, using the example of defining methods like make call and play game in a phone class.', 'Utilization of the NumPy package for creating single and multi-dimensional arrays, showcasing the process of importing, installing, and creating NumPy arrays.']}, {'end': 22468.917, 'segs': [{'end': 21203.93, 'src': 'embed', 'start': 21174.154, 'weight': 0, 'content': [{'end': 21180.817, 'text': 'but the series object is a one-dimensional labeled array and this is how you can create a series object.', 'start': 21174.154, 'duration': 6.663}, {'end': 21186.88, 'text': "so before we go ahead and create a series object, first we'd have to invoke the pandas library.", 'start': 21181.417, 'duration': 5.463}, {'end': 21193.144, 'text': "so we'll type in import pandas as pd, and pd again over here is just an alias for pandas.", 'start': 21186.88, 'duration': 6.264}, {'end': 21196.666, 'text': "so after we invoke pandas we'll type in pd dot series.", 'start': 21193.144, 'duration': 3.522}, {'end': 21203.93, 'text': "so over here you'd have to keep in mind that s is capital, and inside this well personal list, one, two, three, four and five,", 'start': 21196.666, 'duration': 7.264}], 'summary': "To create a series object in pandas, invoke the pandas library using 'import pandas as pd', then use 'pd.series' with a list of values like [1, 2, 3, 4, 5].", 'duration': 29.776, 'max_score': 21174.154, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE21174154.jpg'}, {'end': 21252.603, 'src': 'embed', 'start': 21222.013, 'weight': 1, 'content': [{'end': 21225.455, 'text': 'So if you have to manually install pandas, this is what you have to do.', 'start': 21222.013, 'duration': 3.442}, {'end': 21233.04, 'text': "Open up Anaconda prompt and then you'd have to type in pip install pandas.", 'start': 21226.095, 'duration': 6.945}, {'end': 21236.022, 'text': 'So once you do that, pandas will be installed.', 'start': 21233.68, 'duration': 2.342}, {'end': 21240.41, 'text': 'all right, so i have imported the pandas library.', 'start': 21236.946, 'duration': 3.464}, {'end': 21243.733, 'text': 'now let me go ahead and create my first series object.', 'start': 21240.41, 'duration': 3.323}, {'end': 21252.603, 'text': "so i'll name this object as s1 and i'll type in pd dot series and inside this i'll basically pass in a list.", 'start': 21243.733, 'duration': 8.87}], 'summary': "Manually install pandas using 'pip install pandas', then import and create a series object.", 'duration': 30.59, 'max_score': 21222.013, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE21222013.jpg'}, {'end': 21295.64, 'src': 'embed', 'start': 21269.321, 'weight': 2, 'content': [{'end': 21275.526, 'text': 'So integer 64 bit, and these are the labels associated with each of these values.', 'start': 21269.321, 'duration': 6.205}, {'end': 21279.088, 'text': 'So zero is the label associated with 10.', 'start': 21276.126, 'duration': 2.962}, {'end': 21283.772, 'text': 'One is the label associated with 22 is the label associated with 30.', 'start': 21279.088, 'duration': 4.684}, {'end': 21289.396, 'text': 'so you can either call these index values or the labels associated with these numbers.', 'start': 21283.772, 'duration': 5.624}, {'end': 21295.64, 'text': 'so now you must be thinking that is there a way that we can actually change the index of a panda series?', 'start': 21289.396, 'duration': 6.244}], 'summary': 'Explaining 64-bit integer labels and how to change index in a pandas series', 'duration': 26.319, 'max_score': 21269.321, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE21269321.jpg'}, {'end': 21465.558, 'src': 'embed', 'start': 21436.215, 'weight': 3, 'content': [{'end': 21439.959, 'text': 'right. so that was all about the single dimensional data structure series.', 'start': 21436.215, 'duration': 3.744}, {'end': 21444.504, 'text': "now we'll head on to the most important data structure in python, which is a data frame.", 'start': 21439.959, 'duration': 4.545}, {'end': 21452.752, 'text': 'so, as it has stated over here, a data frame is basically a two-dimensional labeled data structure and it comprises of rows and columns.', 'start': 21444.504, 'duration': 8.248}, {'end': 21454.232, 'text': 'so this what you see.', 'start': 21452.752, 'duration': 1.48}, {'end': 21460.575, 'text': "so you've got three rows over here and two columns over here, and normally, when it comes to a data frame,", 'start': 21454.232, 'duration': 6.343}, {'end': 21465.558, 'text': 'all of the elements inside one column would have the same type.', 'start': 21460.575, 'duration': 4.983}], 'summary': 'Introduction to data frame, a two-dimensional labeled data structure with 3 rows and 2 columns.', 'duration': 29.343, 'max_score': 21436.215, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE21436215.jpg'}, {'end': 21633.503, 'src': 'embed', 'start': 21606.985, 'weight': 4, 'content': [{'end': 21612.67, 'text': 'Similarly, 87, 13, 99 and 67 are the row values for this key student marks.', 'start': 21606.985, 'duration': 5.685}, {'end': 21619.675, 'text': "So, now that we know how to create a data frame, let's go ahead and perform some inbuilt functions on top of this data frame.", 'start': 21613.39, 'duration': 6.285}, {'end': 21624.399, 'text': "So, we'll be working with this basic function such as tail, head, shape and describe.", 'start': 21620.216, 'duration': 4.183}, {'end': 21630.96, 'text': "So to work with all of those inbuilt functions, we'll start off by reading our CSV file as our data frame first.", 'start': 21625.814, 'duration': 5.146}, {'end': 21633.503, 'text': 'So again, let me perform the basic steps.', 'start': 21631.781, 'duration': 1.722}], 'summary': 'Learning to work with inbuilt functions on a data frame, using csv file.', 'duration': 26.518, 'max_score': 21606.985, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE21606985.jpg'}, {'end': 21790.419, 'src': 'embed', 'start': 21761.048, 'weight': 5, 'content': [{'end': 21763.69, 'text': 'so there is also this describe function.', 'start': 21761.048, 'duration': 2.642}, {'end': 21770.333, 'text': "so i'll just type in describe over here and let's just see what it does over here.", 'start': 21763.69, 'duration': 6.643}, {'end': 21776.616, 'text': 'so this would basically describe this entire data frame in terms of all of these different measures.', 'start': 21770.333, 'duration': 6.283}, {'end': 21778.536, 'text': "we've got count.", 'start': 21777.316, 'duration': 1.22}, {'end': 21783.718, 'text': 'so this count is basically the number of records present for each of these different columns.', 'start': 21778.536, 'duration': 5.182}, {'end': 21785.638, 'text': 'so there are 150 records for sepal length.', 'start': 21783.718, 'duration': 1.92}, {'end': 21790.419, 'text': 'similarly, 150 for sepal width, petal length and petal width all right.', 'start': 21785.638, 'duration': 4.781}], 'summary': "The 'describe' function provides measures for 150 records in a data frame.", 'duration': 29.371, 'max_score': 21761.048, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE21761048.jpg'}, {'end': 22021.926, 'src': 'embed', 'start': 21995.976, 'weight': 6, 'content': [{'end': 22002.62, 'text': 'so this is my subsection of the entire data frame, where I have row numbers, starting from index number 10, going on till index number 20,', 'start': 21995.976, 'duration': 6.644}, {'end': 22004.601, 'text': "and these are the two columns which I've extracted.", 'start': 22002.62, 'duration': 1.981}, {'end': 22006.322, 'text': 'so sepal length and species.', 'start': 22004.601, 'duration': 1.721}, {'end': 22008.343, 'text': 'so this is how I can work with the iloc method.', 'start': 22006.322, 'duration': 2.021}, {'end': 22015.943, 'text': 'then, analogous to the ilock method, we also have the lock method to extract individual rows and columns.', 'start': 22009.6, 'duration': 6.343}, {'end': 22019.265, 'text': 'so the only difference is instead of giving the index values.', 'start': 22015.943, 'duration': 3.322}, {'end': 22021.926, 'text': 'over here we given the labels of the columns.', 'start': 22019.265, 'duration': 2.661}], 'summary': 'Using iloc to extract rows 10-20 with sepal length and species columns, and introducing lock method.', 'duration': 25.95, 'max_score': 21995.976, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE21995976.jpg'}, {'end': 22121.644, 'src': 'embed', 'start': 22088.831, 'weight': 7, 'content': [{'end': 22095.253, 'text': "so these are all the records, starting from row number 33, going on till row number 44, and those are the two columns which i've extracted,", 'start': 22088.831, 'duration': 6.422}, {'end': 22097.27, 'text': 'sepal width and petal width right.', 'start': 22095.253, 'duration': 2.017}, {'end': 22103.973, 'text': 'so till now what we did is we extracted some individual rows and columns with the help of dot lock and dot iloc.', 'start': 22097.27, 'duration': 6.703}, {'end': 22110.977, 'text': "but now we'll actually perform data manipulation operations where we'll be extracting records on the basis of a condition.", 'start': 22103.973, 'duration': 7.004}, {'end': 22112.938, 'text': "so let's go ahead and do that now.", 'start': 22110.977, 'duration': 1.961}, {'end': 22119.121, 'text': 'the condition which i want to specify is i want only those records from this data set where the sepal length is greater than five.', 'start': 22112.938, 'duration': 6.183}, {'end': 22121.644, 'text': "so let's see how can we do that.", 'start': 22119.841, 'duration': 1.803}], 'summary': 'Extracted records 33-44, performing data manipulation based on condition sepal length > 5.', 'duration': 32.813, 'max_score': 22088.831, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE22088831.jpg'}, {'end': 22468.917, 'src': 'embed', 'start': 22446.196, 'weight': 9, 'content': [{'end': 22453.9, 'text': 'so see that there are only three records out of these 150 records where these three conditions are satisfied, that is,', 'start': 22446.196, 'duration': 7.704}, {'end': 22460.403, 'text': 'the sepal length is greater than six, the petal width is greater than three and the petal length is greater than six.', 'start': 22453.9, 'duration': 6.503}, {'end': 22465.525, 'text': 'right. so these were some different data manipulation operations which we could perform on the iris data set.', 'start': 22460.403, 'duration': 5.122}, {'end': 22468.917, 'text': 'so that was data manipulation.', 'start': 22467.096, 'duration': 1.821}], 'summary': 'Only 3 out of 150 records meet specific conditions in iris dataset.', 'duration': 22.721, 'max_score': 22446.196, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE22446196.jpg'}], 'start': 21150.626, 'title': 'Data manipulation in python', 'summary': 'Covers pandas series object, data frame in python, data frame functions and extraction methods, and data manipulation in python with examples. it includes creating series objects, data frame structure, extraction and manipulation methods, and data manipulation using the iris data set with specific conditions. the chapter provides detailed explanations and practical examples to enhance understanding.', 'chapters': [{'end': 21435.609, 'start': 21150.626, 'title': 'Pandas series object', 'summary': 'Introduces the creation of a pandas series object, including creating it from a list or a dictionary, using labeled arrays, and changing index values, and explains the installation of the pandas library and creation of series objects with relevant examples.', 'duration': 284.983, 'highlights': ['Creation of Pandas series object from a list and dictionary', 'Introduction to Pandas library and installation process', 'Explanation of labeled arrays and changing index values']}, {'end': 21696.977, 'start': 21436.215, 'title': 'Data frame in python', 'summary': 'Discusses the creation of a data frame in python, highlighting its structure, creation from a dictionary, and performing inbuilt functions like tail, head, shape, and describe on a sample data frame.', 'duration': 260.762, 'highlights': ['A data frame in Python is a two-dimensional labeled data structure comprising rows and columns, where elements in a column have the same type.', 'To create a data frame, one can use the PD.dataframe method and create it from a dictionary with keys becoming column names and values becoming row values.', 'Inbuilt functions like head, tail, shape, and describe can be applied to a data frame in Python for data analysis.']}, {'end': 22139.778, 'start': 21697.617, 'title': 'Data frame functions and extraction methods', 'summary': 'Covers the usage of head, tail, describe, iloc, and loc functions to extract and manipulate data from a data frame in python, with examples of extracting specific rows and columns, applying conditions, and displaying relevant statistical measures.', 'duration': 442.161, 'highlights': ['The describe function provides statistical measures such as count, mean, minimum, maximum, and percentiles for each column in the data frame, with 150 records for sepal length, sepal width, petal length, and petal width.', 'The iloc method is used to extract specific rows and columns from the data frame, such as extracting the first three records and first two columns, or a subsection comprising row numbers from 99 to 126, and the petal length and petal width columns.', 'The loc method is used to extract individual rows and columns based on labels, for example, extracting all records from row number 33 to 44 and the sepal width and petal width columns.', 'Data manipulation operations involve extracting records based on a condition, such as extracting records where the sepal length is greater than five.']}, {'end': 22468.917, 'start': 22140.278, 'title': 'Data manipulation in python', 'summary': 'Demonstrates data manipulation in python using the iris data set, where conditions are applied to extract specific records, such as sepal length greater than 5, petal length greater than 2 and species equal to virginica, and multiple conditions including sepal length, sepal width, and petal length greater than specific values, resulting in only three records satisfying all conditions out of 150.', 'duration': 328.639, 'highlights': ['The chapter demonstrates applying conditions to extract specific records from the iris data set, such as sepal length greater than 5, petal length greater than 2 and species equal to Virginica, and applying multiple conditions including sepal length, sepal width, and petal length greater than specific values.', 'The chapter illustrates using conditions to extract specific records from the iris data set, resulting in only three records out of 150 satisfying multiple conditions including sepal length, sepal width, and petal length greater than specific values.']}], 'duration': 1318.291, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE21150626.jpg', 'highlights': ['Creation of Pandas series object from a list and dictionary', 'Introduction to Pandas library and installation process', 'Explanation of labeled arrays and changing index values', 'A data frame in Python is a two-dimensional labeled data structure comprising rows and columns, where elements in a column have the same type', 'Inbuilt functions like head, tail, shape, and describe can be applied to a data frame in Python for data analysis', 'The describe function provides statistical measures such as count, mean, minimum, maximum, and percentiles for each column in the data frame, with 150 records for sepal length, sepal width, petal length, and petal width', 'The iloc method is used to extract specific rows and columns from the data frame, such as extracting the first three records and first two columns, or a subsection comprising row numbers from 99 to 126, and the petal length and petal width columns', 'The loc method is used to extract individual rows and columns based on labels, for example, extracting all records from row number 33 to 44 and the sepal width and petal width columns', 'Data manipulation operations involve extracting records based on a condition, such as extracting records where the sepal length is greater than five', 'The chapter demonstrates applying conditions to extract specific records from the iris data set, such as sepal length greater than 5, petal length greater than 2 and species equal to Virginica, and applying multiple conditions including sepal length, sepal width, and petal length greater than specific values', 'The chapter illustrates using conditions to extract specific records from the iris data set, resulting in only three records out of 150 satisfying multiple conditions including sepal length, sepal width, and petal length greater than specific values']}, {'end': 24028.14, 'segs': [{'end': 22492.573, 'src': 'embed', 'start': 22468.917, 'weight': 0, 'content': [{'end': 22476.023, 'text': "now we'll head on to data visualization and to perform data visualization, python provides us a package called as matplotlib,", 'start': 22468.917, 'duration': 7.106}, {'end': 22483.229, 'text': 'and with the help of matplotlib you can create beautiful graphs such as bar plots, scatter plots, histograms and a lot more.', 'start': 22476.023, 'duration': 7.206}, {'end': 22485.971, 'text': "so let's go to jupyter notebook and work with these graphs.", 'start': 22483.229, 'duration': 2.742}, {'end': 22492.573, 'text': "So we'll start off by importing the pyplot sub module from the matplotlib library.", 'start': 22488.271, 'duration': 4.302}], 'summary': "Python's matplotlib enables creation of various graphs for data visualization.", 'duration': 23.656, 'max_score': 22468.917, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE22468917.jpg'}, {'end': 22668.544, 'src': 'embed', 'start': 22634.645, 'weight': 4, 'content': [{'end': 22644.871, 'text': "so to add the title, you have to type in plt dot title and let's say the title which i'll be giving this is line plot.", 'start': 22634.645, 'duration': 10.226}, {'end': 22648.794, 'text': 'now let me give it the x axis label.', 'start': 22644.871, 'duration': 3.923}, {'end': 22656.641, 'text': "so this would be plt dot x label and i'll just simply put in x axis over here.", 'start': 22648.794, 'duration': 7.847}, {'end': 22658.342, 'text': 'now let me also put in the y label.', 'start': 22656.641, 'duration': 1.701}, {'end': 22668.544, 'text': "so i'll type in plt dot y label and inside this i'll type in y axis, right.", 'start': 22658.342, 'duration': 10.202}], 'summary': 'Creating a line plot with x and y labels using plt.title, plt.xlabel, and plt.ylabel.', 'duration': 33.899, 'max_score': 22634.645, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE22634645.jpg'}, {'end': 23474.041, 'src': 'embed', 'start': 23446.282, 'weight': 3, 'content': [{'end': 23448.862, 'text': 'so the maximum petal width of 2.5.', 'start': 23446.282, 'duration': 2.58}, {'end': 23450.263, 'text': 'there are around 5 records.', 'start': 23448.862, 'duration': 1.401}, {'end': 23456.006, 'text': "right now let's again understand the distribution of sepal width.", 'start': 23451.923, 'duration': 4.083}, {'end': 23461.631, 'text': "so i'll change this and i'll just put in sepal over here.", 'start': 23456.006, 'duration': 5.625}, {'end': 23464.053, 'text': 'right, so we have a big peak over here.', 'start': 23461.631, 'duration': 2.422}, {'end': 23473.02, 'text': 'so there are 25 records where the sepal width is three and there is less, and there is just one record whose sepal width would be around 4.3 or 4.4.', 'start': 23464.053, 'duration': 8.967}, {'end': 23474.041, 'text': 'right, so this is histogram.', 'start': 23473.02, 'duration': 1.021}], 'summary': 'Data analysis reveals a maximum petal width of 2.5 and approximately 5 records, with a predominant sepal width peak at 3 and only one record at around 4.3 or 4.4.', 'duration': 27.759, 'max_score': 23446.282, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE23446282.jpg'}, {'end': 23629.496, 'src': 'embed', 'start': 23601.721, 'weight': 2, 'content': [{'end': 23605.524, 'text': 'So if the species of the flower is versicolor, then the median petal width would be 1.2.', 'start': 23601.721, 'duration': 3.803}, {'end': 23615.113, 'text': 'So the basic inference is virginica over here would have the maximum petal width and setosa would have the minimum petal width.', 'start': 23605.524, 'duration': 9.589}, {'end': 23618.934, 'text': 'so that was the same case which we saw with sepal length as well.', 'start': 23615.453, 'duration': 3.481}, {'end': 23624.435, 'text': 'so virginica had the maximum sepal length and setosa had the minimum sepal length.', 'start': 23618.934, 'duration': 5.501}, {'end': 23629.496, 'text': 'if we want to make this plot even more beautiful, we can use the seaborn library.', 'start': 23624.435, 'duration': 5.061}], 'summary': 'For the flower species versicolor, the median petal width is 1.2. virginica has the maximum petal width, while setosa has the minimum. similarly, virginica has the maximum sepal length, and setosa has the minimum. the seaborn library can enhance the plot.', 'duration': 27.775, 'max_score': 23601.721, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE23601721.jpg'}, {'end': 23850.878, 'src': 'embed', 'start': 23824.5, 'weight': 1, 'content': [{'end': 23829.843, 'text': "let's see, if i just put in 1 over here, then you've got one decimal value over here right now.", 'start': 23824.5, 'duration': 5.343}, {'end': 23832.245, 'text': 'i can also add a shadow to this.', 'start': 23829.843, 'duration': 2.402}, {'end': 23840.29, 'text': "so i'll just put in shadow equals, true, and this is my pie chart over here.", 'start': 23832.245, 'duration': 8.045}, {'end': 23844.133, 'text': 'so this shows us that the maximum percentage of the fruits belongs to orange.', 'start': 23840.29, 'duration': 3.843}, {'end': 23850.878, 'text': "we've got 30.4 percent and then we've got banana, 28.7, followed by apple, and then the least is mango.", 'start': 23844.133, 'duration': 6.745}], 'summary': 'Pie chart shows fruits distribution: orange 30.4%, banana 28.7%, apple, and mango.', 'duration': 26.378, 'max_score': 23824.5, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE23824500.jpg'}], 'start': 22468.917, 'title': 'Python data visualization', 'summary': "Covers data visualization using python's matplotlib package, demonstrating the creation of line plots, addition of titles and labels, and customization of the plot's appearance, with a brief introduction to creating bar plots. it also covers creating bar plots, scatter plots, and histograms using python's matplotlib library, explaining the difference between bar plots and histograms, and the distribution of continuous variables. additionally, it includes the distribution of petal and sepal width in the iris dataset, highlighting key statistics and the use of box plots and a pie chart for visualization. the chapter also explains the process of creating a pie chart in python to visualize the distribution of fruit costs, with specific percentages for each fruit.", 'chapters': [{'end': 22874.919, 'start': 22468.917, 'title': 'Python data visualization', 'summary': "Covers data visualization using python's matplotlib package, demonstrating the creation of line plots, addition of titles and labels, and customization of the plot's appearance, with a brief introduction to creating bar plots.", 'duration': 406.002, 'highlights': ["The chapter covers data visualization using Python's matplotlib package, demonstrating the creation of line plots, addition of titles and labels, and customization of the plot's appearance, with a brief introduction to creating bar plots.", "Demonstrates creating line plots with x and y values, and adding titles, x and y labels, and customizing the plot's appearance.", 'Introduction to creating bar plots is briefly discussed.']}, {'end': 23418.37, 'start': 22874.919, 'title': 'Data visualization in python', 'summary': "Covers creating bar plots, scatter plots, and histograms using python's matplotlib library, demonstrating the process with examples and actionable insights. it also explains the difference between bar plots and histograms, and the distribution of continuous variables.", 'duration': 543.451, 'highlights': ["The chapter covers creating bar plots, scatter plots, and histograms using Python's matplotlib library.", 'The bar plot demonstrates the marks of three students: Sam (30), Bob (50), and Julia (70).', 'The scatter plot showcases points denoted by specific x and y coordinates, with the addition of a second theme using a different color.', 'The histogram demonstrates the distribution of the sepal length column from the iris dataset, with an explanation of the bins and the inferences drawn from the plot.', 'Explains the difference between a bar plot and a histogram, detailing their respective uses in understanding the distribution of categorical and continuous variables.']}, {'end': 23698.107, 'start': 23421.414, 'title': 'Distribution of petal and sepal width', 'summary': 'Covers the distribution of petal and sepal width in the iris dataset, highlighting key statistics such as the maximum petal width of 2.5 and the median sepal length for different species, along with the use of box plots and a pie chart for visualization.', 'duration': 276.693, 'highlights': ['The maximum petal width in the dataset is 2.5, with around 5 records, and the median sepal length for different species is approximately 6.5 for virginica, 5.9 for versicolor, and 5 for setosa.', 'The distribution of sepal width shows a peak with 25 records at a width of three, and just one record with a width of around 4.3 or 4.4.', 'The use of box plots to depict the distribution of sepal length and petal width for different species provides insights into the median values for each species, with virginica exhibiting the maximum width and length, and setosa the minimum.']}, {'end': 24028.14, 'start': 23698.107, 'title': 'Creating pie chart in python', 'summary': 'Explains the process of creating a pie chart in python to visualize the distribution of fruit costs, with orange and banana having the maximum percentage of 30.4% and 28.7% respectively, followed by apple and mango, and also discusses the importance of linear models in data science and machine learning.', 'duration': 330.033, 'highlights': ['The pie chart shows that the maximum distribution or the maximum percentage is mostly of orange and banana, with 30.4% and 28.7% respectively.', 'Linear regression, one of the most popular algorithms in production, is used for classification and regression and is one of the core algorithms used by many other algorithms.', 'The importance of linear models in the industry and their association with various popular algorithms, such as logistic regression, support vector machine, and decision trees.']}], 'duration': 1559.223, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE22468917.jpg', 'highlights': ["The chapter covers creating bar plots, scatter plots, and histograms using Python's matplotlib library.", 'The pie chart shows that the maximum distribution or the maximum percentage is mostly of orange and banana, with 30.4% and 28.7% respectively.', 'The maximum petal width in the dataset is 2.5, with around 5 records, and the median sepal length for different species is approximately 6.5 for virginica, 5.9 for versicolor, and 5 for setosa.', 'The distribution of sepal width shows a peak with 25 records at a width of three, and just one record with a width of around 4.3 or 4.4.', "Demonstrates creating line plots with x and y values, and adding titles, x and y labels, and customizing the plot's appearance."]}, {'end': 27746.874, 'segs': [{'end': 24061.807, 'src': 'embed', 'start': 24029.662, 'weight': 0, 'content': [{'end': 24032.604, 'text': 'As the name suggests, this algorithm is called linear regression.', 'start': 24029.662, 'duration': 2.942}, {'end': 24042.892, 'text': 'And the reason why it is called linear regression, as the name suggests, is it is based on the The concept of a line, a line, a plane,', 'start': 24035.386, 'duration': 7.506}, {'end': 24045.054, 'text': 'a hyperplane we call it by these names.', 'start': 24042.892, 'duration': 2.162}, {'end': 24051.419, 'text': 'So to begin with I need to ask you whether you know how a line is represented.', 'start': 24045.994, 'duration': 5.425}, {'end': 24054.921, 'text': 'Line, line in mathematics.', 'start': 24053.56, 'duration': 1.361}, {'end': 24061.807, 'text': "Are you all comfortable with this? Alright, so let's start.", 'start': 24056.823, 'duration': 4.984}], 'summary': 'Linear regression is based on the concept of representing data with lines and planes in mathematics.', 'duration': 32.145, 'max_score': 24029.662, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE24029662.jpg'}, {'end': 24298.874, 'src': 'embed', 'start': 24264.232, 'weight': 1, 'content': [{'end': 24270.634, 'text': 'So in linear regression, this line which represents the relationship in x and y, this line is my model.', 'start': 24264.232, 'duration': 6.402}, {'end': 24277.617, 'text': 'So I want to predict the value of y given the value of x.', 'start': 24273.115, 'duration': 4.502}, {'end': 24283.499, 'text': 'What the line is saying is y is equal to x.', 'start': 24277.617, 'duration': 5.882}, {'end': 24287.02, 'text': 'So the prediction is whenever x is some value, y will also be the same value.', 'start': 24283.499, 'duration': 3.521}, {'end': 24298.874, 'text': 'Can I write this expression as y equal to 1x Does it make any difference? 1 into x.', 'start': 24288.941, 'duration': 9.933}], 'summary': 'Linear regression predicts y based on x with a model y=1x.', 'duration': 34.642, 'max_score': 24264.232, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE24264232.jpg'}, {'end': 25537.709, 'src': 'embed', 'start': 25504.183, 'weight': 2, 'content': [{'end': 25512.889, 'text': 'So, before we start building our models, before we start building our models, we actually need to analyze our data set,', 'start': 25504.183, 'duration': 8.706}, {'end': 25528.982, 'text': 'and what I mean by this is suppose this is my data set I will go and analyze between this IV1 and target IV2 and target IV3 and target which one of these is strongly related with the target.', 'start': 25512.889, 'duration': 16.093}, {'end': 25537.709, 'text': 'If suppose IV2 and the target, the distribution is scattered all over the place.', 'start': 25530.603, 'duration': 7.106}], 'summary': 'Before building models, analyze dataset to find ivs strongly related to the target.', 'duration': 33.526, 'max_score': 25504.183, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE25504183.jpg'}, {'end': 25742.16, 'src': 'embed', 'start': 25708.089, 'weight': 5, 'content': [{'end': 25716.352, 'text': 'how these two variables, the target and the independent variable, how they co-relate, how they relate to each other,', 'start': 25708.089, 'duration': 8.263}, {'end': 25719.753, 'text': 'which we express using a character called R.', 'start': 25716.352, 'duration': 3.401}, {'end': 25723.274, 'text': 'I will tell you what this R is.', 'start': 25719.753, 'duration': 3.521}, {'end': 25731.617, 'text': 'This R metric can have a value of plus 1, minus 1 to plus 1 in this range.', 'start': 25724.114, 'duration': 7.503}, {'end': 25740.859, 'text': 'the least it can have is minus 1, the max it can have a value is plus 1.', 'start': 25734.898, 'duration': 5.961}, {'end': 25742.16, 'text': 'It cannot be beyond this range.', 'start': 25740.859, 'duration': 1.301}], 'summary': 'The correlation coefficient r can range from -1 to +1.', 'duration': 34.071, 'max_score': 25708.089, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE25708089.jpg'}, {'end': 26069.737, 'src': 'embed', 'start': 26012.359, 'weight': 3, 'content': [{'end': 26016.762, 'text': 'how reliable your central values is that reliability is given by the measure of variance.', 'start': 26012.359, 'duration': 4.403}, {'end': 26024.996, 'text': 'if the variance is too large, the central value is not reliable, right.', 'start': 26018.649, 'duration': 6.347}, {'end': 26030.642, 'text': 'So, the variance gives you the reliability of the central values, how reliable the central values are.', 'start': 26025.756, 'duration': 4.886}, {'end': 26036.207, 'text': 'So, formula for variance is this, right.', 'start': 26033.224, 'duration': 2.983}, {'end': 26040.352, 'text': 'Just look at the numerator, just look at the numerator.', 'start': 26036.968, 'duration': 3.384}, {'end': 26044.918, 'text': 'This is the case when you have only one variable x.', 'start': 26041.496, 'duration': 3.422}, {'end': 26047.42, 'text': 'Now I have two variables x and y.', 'start': 26044.918, 'duration': 2.502}, {'end': 26052.744, 'text': 'I want to see how they vary together, how they influence each other, vary together.', 'start': 26047.42, 'duration': 5.324}, {'end': 26063.691, 'text': 'That is called covariance, which is given by xi minus x bar, we do not square it, into yi minus y bar.', 'start': 26053.104, 'duration': 10.587}, {'end': 26069.737, 'text': 'So we find out x, i, minus x bar, multiplied by y, i, minus y bar.', 'start': 26065.216, 'duration': 4.521}], 'summary': 'Variance measures reliability of central values. covariance shows how variables influence each other.', 'duration': 57.378, 'max_score': 26012.359, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE26012359.jpg'}, {'end': 26986.053, 'src': 'embed', 'start': 26925.177, 'weight': 6, 'content': [{'end': 26927.038, 'text': 'It is like looking for a needle in a haystack.', 'start': 26925.177, 'duration': 1.861}, {'end': 26935.284, 'text': 'And to do this it makes use of a process which is called the gradient descent.', 'start': 26929.88, 'duration': 5.404}, {'end': 26942.33, 'text': 'All algorithms use under the hood a learning process.', 'start': 26937.506, 'duration': 4.824}, {'end': 26946.994, 'text': 'A process which they make use of to find the best model for you.', 'start': 26943.671, 'duration': 3.323}, {'end': 26951.805, 'text': 'in the given data set, that process in this case is called gradient descent.', 'start': 26947.802, 'duration': 4.003}, {'end': 26986.053, 'text': 'You are talking about type 1, type 2 error that that comes in classification this is regression.', 'start': 26974.307, 'duration': 11.746}], 'summary': 'Algorithms use gradient descent to find best model in given dataset.', 'duration': 60.876, 'max_score': 26925.177, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE26925177.jpg'}, {'end': 27305.167, 'src': 'embed', 'start': 27275.384, 'weight': 7, 'content': [{'end': 27279.407, 'text': 'the lesser the variance of the points across the model, the better the model is.', 'start': 27275.384, 'duration': 4.023}, {'end': 27283.15, 'text': 'Same concept comes to you in a different way.', 'start': 27281.068, 'duration': 2.082}, {'end': 27290.876, 'text': 'So, sum of squared errors is nothing but variance, variance of the data points across the model.', 'start': 27286.092, 'duration': 4.784}, {'end': 27298.142, 'text': 'So, I want to find that model where the variance of the data points across the model is the least.', 'start': 27292.337, 'duration': 5.805}, {'end': 27305.167, 'text': 'It is slightly confusing when you say covariance and you said like we have more value of covariance,', 'start': 27299.262, 'duration': 5.905}], 'summary': 'Minimize variance across model for better performance', 'duration': 29.783, 'max_score': 27275.384, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE27275384.jpg'}], 'start': 24029.662, 'title': 'Linear regression fundamentals', 'summary': 'Covers the basics of linear regression, including representation of a line, relationship between variables, models in data science, multiple dimensions, collinearity, variance, covariance, and correlation, aiming to provide a comprehensive understanding of fundamental concepts in linear regression.', 'chapters': [{'end': 24261.456, 'start': 24029.662, 'title': 'Linear regression basics', 'summary': "Introduces the concept of linear regression, explaining the representation of a line in mathematics, the relationship between independent and dependent variables, and the concept of models in data science, culminating in the explanation of k nearest neighbors' division of mathematical space into regions.", 'duration': 231.794, 'highlights': ['Linear regression is based on representing the relationship between independent and dependent variables as a straight line, serving as a model in the feature space.', 'Explains the relationship between independent and dependent variables using the example of miles per gallon and car weight, illustrating how the weight of the car affects the miles per gallon.', 'Introduces the concept of models in data science as lines, surfaces, and hyper surfaces in the feature space, highlighting the exploration of the interaction between independent and dependent attributes.', 'Explains how K nearest neighbors breaks a mathematical space into regions, known as Voronoi regions, to represent the model in K nearest neighbors.']}, {'end': 25416.556, 'start': 24264.232, 'title': 'Understanding linear regression in depth', 'summary': 'Explains the concept of linear regression, including the mathematical representation of the model y=mx+c, the impact of angles and trigonometry in defining the relationship between x and y, the extension to multiple dimensions, and the challenges of collinearity and dimensionality reduction in model building.', 'duration': 1152.324, 'highlights': ['The mathematical representation of the model y=mx+c, where y is the target variable, x is the independent variable, m is the slope, and c is the intercept, forms the basis of linear regression.', 'The impact of angles and trigonometry in defining the relationship between x and y, where the angle of the line with the x-axis determines the value of the slope (m) and the tan of that angle represents the slope into x.', 'The extension to multiple dimensions, where the model can express the relationship between multiple independent variables x1, x2, etc., and the target variable y as y= m1x1 + m2x2 + ... + c.', 'The challenges of collinearity and the need for dimensionality reduction in model building to address the problem of independent dimensions interacting with each other and not being truly independent as assumed by the algorithms.']}, {'end': 26012.359, 'start': 25417.036, 'title': 'Introduction to linear regression', 'summary': 'Introduces the concept of linear regression, emphasizing the importance of analyzing data to identify strong predictors, the use of linear models for predicting values, and the measurement of the strength of the relationship between variables using the coefficient of correlation.', 'duration': 595.323, 'highlights': ['The importance of analyzing data to identify strong predictors before building models.', 'Explanation of the relationship between independent variables and the target.', 'Introduction to the coefficient of correlation for measuring the strength of relationships.']}, {'end': 26951.805, 'start': 26012.359, 'title': 'Variance, covariance, and correlation in linear models', 'summary': 'Explains the concepts of variance, covariance, and correlation in linear models, highlighting the significance of these metrics in determining the reliability and strength of relationships between variables. it also delves into the process of finding the best fit line using the gradient descent algorithm.', 'duration': 939.446, 'highlights': ['The variance determines the reliability of central values, with a high variance indicating lower reliability.', 'Covariance measures the variation between two variables and signifies how they influence each other, with independent variables ideally having no covariance.', 'The correlation coefficient (R) is a unitless quantity representing the relationship between two variables, with values close to 1 indicating a strong positive correlation, close to -1 indicating a strong negative correlation, and close to 0 indicating little to no correlation.', 'In linear models, the selection of independent dimensions is based on their correlation with the target variable, with values close to 0 indicating a lack of usefulness.', 'The gradient descent algorithm is utilized to find the best fit line in linear models, aiming to minimize the errors between predicted and actual values by evaluating multiple lines and selecting the one that minimizes the sum of errors.']}, {'end': 27746.874, 'start': 26974.307, 'title': 'Regression and variance in linear models', 'summary': 'Discusses the concept of linear regression and the minimization of variance in the context of finding the best-fit line, emphasizing the importance of understanding deterministic and stochastic variances as well as the impact of collinearity on noise.', 'duration': 772.567, 'highlights': ['The importance of minimizing variance in linear models is emphasized, with the aim of finding the best-fit line with the least sum of squared errors (variance) across the model.', 'Discussion on the distinction between deterministic and stochastic variances, with a focus on addressing the impact of stochastic noise within the data points.', 'Exploration of the impact of collinearity in the dataset, highlighting its potential to cancel out or magnify noise within the data points.']}], 'duration': 3717.212, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE24029662.jpg', 'highlights': ['Linear regression represents the relationship between variables as a straight line', 'The model y=mx+c forms the basis of linear regression', 'Importance of analyzing data to identify strong predictors before building models', 'Variance determines the reliability of central values', 'Covariance measures the variation between two variables and their influence', 'Correlation coefficient (R) represents the relationship between variables', 'Gradient descent algorithm is utilized to find the best fit line', 'Importance of minimizing variance in linear models']}, {'end': 29347.634, 'segs': [{'end': 27776.506, 'src': 'embed', 'start': 27746.874, 'weight': 0, 'content': [{'end': 27750.235, 'text': "so for some combination of M and C I'll get the least error.", 'start': 27746.874, 'duration': 3.361}, {'end': 27756.688, 'text': 'combination of M and C, which gives me the least error, is my best fit line.', 'start': 27752.864, 'duration': 3.824}, {'end': 27763.033, 'text': 'So the algorithm will start from some random M and C.', 'start': 27759.33, 'duration': 3.703}, {'end': 27764.955, 'text': 'So maybe this is the random M and C.', 'start': 27763.033, 'duration': 1.922}, {'end': 27767.638, 'text': 'This is my random M and this is my random C.', 'start': 27764.955, 'duration': 2.683}, {'end': 27772.062, 'text': 'When I use this I get a line but the sum of squared errors across the line is very high.', 'start': 27767.638, 'duration': 4.424}, {'end': 27776.506, 'text': 'So each point here is one line in the original mathematical space.', 'start': 27772.622, 'duration': 3.884}], 'summary': 'The algorithm seeks the best fit line by minimizing error with random m and c.', 'duration': 29.632, 'max_score': 27746.874, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE27746874.jpg'}, {'end': 28019.486, 'src': 'embed', 'start': 27993.614, 'weight': 1, 'content': [{'end': 28007.746, 'text': 'Linear regression, the error function being quadratic, it guarantees you will reach absolute minimum.', 'start': 27993.614, 'duration': 14.132}, {'end': 28012.45, 'text': 'But in the process of jumping, there is something called learning step.', 'start': 28008.227, 'duration': 4.223}, {'end': 28019.486, 'text': "We are too far away from this, so I'll just tell you the concept, which is used along with this partial derivatives.", 'start': 28014.482, 'duration': 5.004}], 'summary': 'Linear regression guarantees reaching absolute minimum with quadratic error function.', 'duration': 25.872, 'max_score': 27993.614, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE27993614.jpg'}, {'end': 28187.069, 'src': 'embed', 'start': 28163.339, 'weight': 2, 'content': [{'end': 28170.862, 'text': 'The distance between the actual which is shown by the black dot and the line, that distance is shown by yellow dashed lines.', 'start': 28163.339, 'duration': 7.523}, {'end': 28174.243, 'text': 'This distance is a measurement of error.', 'start': 28171.482, 'duration': 2.761}, {'end': 28184.068, 'text': 'The objective is square of these errors, all these errors squared, that sum should be minimized,', 'start': 28174.764, 'duration': 9.304}, {'end': 28187.069, 'text': 'which is not very difficult to conceptualize once again.', 'start': 28184.068, 'duration': 3.001}], 'summary': 'Minimize sum of squared errors to reduce distance from line.', 'duration': 23.73, 'max_score': 28163.339, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE28163339.jpg'}, {'end': 28509.051, 'src': 'embed', 'start': 28482.545, 'weight': 3, 'content': [{'end': 28489.407, 'text': 'Keep in mind regression error and residual errors, these two together is a total error.', 'start': 28482.545, 'duration': 6.862}, {'end': 28494.048, 'text': "And don't mistake the word error, error means variance.", 'start': 28490.827, 'duration': 3.221}, {'end': 28496.748, 'text': 'error means variance.', 'start': 28495.888, 'duration': 0.86}, {'end': 28498.949, 'text': 'Of the total variance in your data set.', 'start': 28496.948, 'duration': 2.001}, {'end': 28503.149, 'text': 'I told you one part of the variance is caused by the deterministic.', 'start': 28498.949, 'duration': 4.2}, {'end': 28509.051, 'text': 'the other part of the variance is caused by the stochastic, non-deterministic same thing.', 'start': 28503.149, 'duration': 5.902}], 'summary': 'Residual and regression errors together constitute the total variance in the dataset.', 'duration': 26.506, 'max_score': 28482.545, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE28482545.jpg'}], 'start': 27746.874, 'title': 'Linear regression concepts', 'summary': 'Covers finding the best fit line with m and c, gradient descent in linear regression, sum of squared errors, and understanding regression error and model evaluation, emphasizing minimizing errors and evaluating models using various metrics.', 'chapters': [{'end': 27796.019, 'start': 27746.874, 'title': 'Finding best fit line with m and c', 'summary': 'Explains the process of finding the best fit line by determining the combination of m and c that minimizes the sum of squared errors, starting from random values and iterating to minimize the error.', 'duration': 49.145, 'highlights': ['The algorithm iterates to find the combination of M and C that minimizes the sum of squared errors, resulting in the best fit line.', 'The initial step involves starting with random values for M and C and generating a line, but the sum of squared errors across the line is very high.', 'Each point in the original mathematical space represents a line with specific values of M and C, and the goal is to minimize the sum of squared errors for the best fit line.']}, {'end': 28089.565, 'start': 27799.662, 'title': 'Gradient descent in linear regression', 'summary': 'Explains the gradient descent algorithm in linear regression, using partial derivatives to minimize error and reach the global minima, ensuring the absolute minimum in quadratic functions, and the importance of adjusting learning steps to prevent oscillation.', 'duration': 289.903, 'highlights': ['The algorithm uses gradient descent and partial derivatives to minimize error and reach the global minima, ensuring the absolute minimum in quadratic functions.', 'The learning step is adjusted to prevent oscillation, where one variant of gradient descent is the bold driver algorithm.', 'The relevance of understanding partial derivatives for neural networks and back propagation of errors in deep learning is mentioned.']}, {'end': 28481.745, 'start': 28090.265, 'title': 'Sum of squared errors in regression', 'summary': 'Discusses the concept of sum of squared errors in linear regression modeling, emphasizing the importance of minimizing the variance and understanding the different types of errors, including total error, regression error, and residual errors.', 'duration': 391.48, 'highlights': ['The objective is to minimize the sum of squared errors, where the variance needs to be minimized for a better model, and the distance between the actual and predicted values of y is measured to achieve this.', 'The chapter explains the concept of total error, which is the difference between the actual and expected values of y, and introduces regression error as the difference between the predicted and expected values of y.', 'Residual errors, also known as sum of squared errors, are discussed as unexplained errors in the model, and the concept of total error is illustrated using an example of predicting height based on characteristics.']}, {'end': 29347.634, 'start': 28482.545, 'title': 'Understanding regression error and model evaluation', 'summary': 'Covers the concept of regression error, minimizing unexplained error, and evaluating models using metrics like coefficient of determination and adjusted r square, with emphasis on understanding the variance explained and the impact of including useless variables.', 'duration': 865.089, 'highlights': ['The concept of regression error and residual errors being the total error is explained, with emphasis on minimizing unexplained error.', "Explanation of coefficient of determination (R square) as a metric to evaluate the model's performance and the impact of including useless variables on the evaluation.", "The impact of sampling on the introduction of fake R's and the need for further investigation when the R value is close to 0.5 or 0.6 is discussed."]}], 'duration': 1600.76, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE27746874.jpg', 'highlights': ['The algorithm iterates to find the combination of M and C that minimizes the sum of squared errors, resulting in the best fit line.', 'The algorithm uses gradient descent and partial derivatives to minimize error and reach the global minima, ensuring the absolute minimum in quadratic functions.', 'The objective is to minimize the sum of squared errors, where the variance needs to be minimized for a better model, and the distance between the actual and predicted values of y is measured to achieve this.', 'The concept of regression error and residual errors being the total error is explained, with emphasis on minimizing unexplained error.']}, {'end': 30593.317, 'segs': [{'end': 29386.556, 'src': 'embed', 'start': 29350.876, 'weight': 3, 'content': [{'end': 29355.178, 'text': 'Some assumptions that this linear regression models make, we already discussed this.', 'start': 29350.876, 'duration': 4.302}, {'end': 29361.841, 'text': 'Assumption of linearity, the relationship between the target and the independent variables are expected to be linear.', 'start': 29355.758, 'duration': 6.083}, {'end': 29364.482, 'text': 'That brings me to another point.', 'start': 29363.342, 'duration': 1.14}, {'end': 29366.203, 'text': 'Please be careful.', 'start': 29365.503, 'duration': 0.7}, {'end': 29381.693, 'text': "When your r value is close to 0, what does it mean? I think that's what I ended up saying.", 'start': 29366.783, 'duration': 14.91}, {'end': 29386.556, 'text': 'There is no linear relationship between x and y.', 'start': 29382.314, 'duration': 4.242}], 'summary': 'Linear regression assumes a linear relationship between target and independent variables; an r value close to 0 indicates no linear relationship.', 'duration': 35.68, 'max_score': 29350.876, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE29350876.jpg'}, {'end': 29505.159, 'src': 'embed', 'start': 29473.649, 'weight': 1, 'content': [{'end': 29477.472, 'text': 'So is it always suggestible to draw pair plot for each? 100 percent?', 'start': 29473.649, 'duration': 3.823}, {'end': 29485.56, 'text': 'if you ask me, I will always say pair plot is the most important tool you have in your toolbox, which you should use to understand your data.', 'start': 29477.472, 'duration': 8.088}, {'end': 29499.315, 'text': 'Try a different model, may be a non-linear model.', 'start': 29495.132, 'duration': 4.183}, {'end': 29503.758, 'text': 'So now that brings me to a question what is linear and what is non-linear?', 'start': 29499.715, 'duration': 4.043}, {'end': 29505.159, 'text': 'Now, look at this.', 'start': 29504.278, 'duration': 0.881}], 'summary': 'Pair plot is a crucial tool for understanding data and should always be utilized; consider trying non-linear models.', 'duration': 31.51, 'max_score': 29473.649, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE29473649.jpg'}, {'end': 29729.156, 'src': 'embed', 'start': 29703.997, 'weight': 0, 'content': [{'end': 29708.84, 'text': 'So what we do is have you seen this Kohn-Binaga-Karotapathy?', 'start': 29703.997, 'duration': 4.843}, {'end': 29712.763, 'text': 'Have you noticed that the audience never go wrong?', 'start': 29710.382, 'duration': 2.381}, {'end': 29718.408, 'text': 'Now the audience individually may not have very high IQ, but put together they rarely go wrong.', 'start': 29714.245, 'duration': 4.163}, {'end': 29721.13, 'text': 'This concept is called wisdom of the crowd.', 'start': 29719.428, 'duration': 1.702}, {'end': 29724.192, 'text': 'The same concept is used in data science.', 'start': 29721.95, 'duration': 2.242}, {'end': 29727.294, 'text': "also, when we productionize our models, we don't put one single model into play.", 'start': 29724.192, 'duration': 3.102}, {'end': 29729.156, 'text': 'we always put a collection of models into play.', 'start': 29727.294, 'duration': 1.862}], 'summary': 'Concept of wisdom of the crowd in data science with use of multiple models for productionization.', 'duration': 25.159, 'max_score': 29703.997, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE29703997.jpg'}, {'end': 29864.288, 'src': 'embed', 'start': 29836.443, 'weight': 2, 'content': [{'end': 29845.551, 'text': 'okay, so when you build up models, linear models, any model, the first thing we have to do is handle the outliers first.', 'start': 29836.443, 'duration': 9.108}, {'end': 29851.899, 'text': 'okay, There are various ways of testing this out, whether this is happening or not.', 'start': 29845.551, 'duration': 6.348}, {'end': 29860.465, 'text': 'One of the ways of testing this is you do a scatter plot between the actual values of y and predicted values of y ok.', 'start': 29853.34, 'duration': 7.125}, {'end': 29864.288, 'text': 'Between actual values of y and predicted values of y.', 'start': 29860.805, 'duration': 3.483}], 'summary': 'Handle outliers first when building models, test for outliers with scatter plot.', 'duration': 27.845, 'max_score': 29836.443, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE29836443.jpg'}, {'end': 30158.963, 'src': 'embed', 'start': 30128.12, 'weight': 9, 'content': [{'end': 30129.301, 'text': 'You cannot interpret those models.', 'start': 30128.12, 'duration': 1.181}, {'end': 30130.402, 'text': 'You do not know what they are.', 'start': 30129.682, 'duration': 0.72}, {'end': 30131.843, 'text': 'They are all black box.', 'start': 30131.002, 'duration': 0.841}, {'end': 30140.489, 'text': 'Sir, what do we mean by physical definition? I can convert this, whatever this formula is telling me, I can map it to English, right.', 'start': 30132.544, 'duration': 7.945}, {'end': 30143.071, 'text': 'I can tell you in day to day language what it is telling you.', 'start': 30140.609, 'duration': 2.462}, {'end': 30149.679, 'text': "I won't be able to do that if I use say random forest, you're going to do ensemble where you'll do random forest.", 'start': 30145.017, 'duration': 4.662}, {'end': 30153.861, 'text': "Random forest is a black box model, we don't know what it's actually doing inside.", 'start': 30150.199, 'duration': 3.662}, {'end': 30156.222, 'text': 'Support vector machine is a black forest.', 'start': 30154.661, 'duration': 1.561}, {'end': 30158.963, 'text': "we don't know what black forest.", 'start': 30156.222, 'duration': 2.741}], 'summary': 'Complex models like random forest and support vector machines are black box and lack physical definition, making interpretation challenging.', 'duration': 30.843, 'max_score': 30128.12, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE30128120.jpg'}, {'end': 30202.839, 'src': 'embed', 'start': 30171.471, 'weight': 6, 'content': [{'end': 30173.551, 'text': 'Outliers can bring it on to its knees.', 'start': 30171.471, 'duration': 2.08}, {'end': 30175.632, 'text': 'This particular model is prone to outliers.', 'start': 30173.972, 'duration': 1.66}, {'end': 30177.232, 'text': 'All linear models are prone to outliers.', 'start': 30175.712, 'duration': 1.52}, {'end': 30179.973, 'text': 'Decision tree is not prone to outliers.', 'start': 30178.252, 'duration': 1.721}, {'end': 30185.974, 'text': 'And one, this is a very important point.', 'start': 30184.014, 'duration': 1.96}, {'end': 30193.716, 'text': 'Since the linear model goes through the point where x bar and y bar meet the best fit lines,', 'start': 30187.794, 'duration': 5.922}, {'end': 30196.896, 'text': 'the lines evaluated for you go through where the x bar and y bar meet.', 'start': 30193.716, 'duration': 3.18}, {'end': 30202.839, 'text': 'If your x bar and y bar are not reliable, then your model itself will not be reliable.', 'start': 30197.876, 'duration': 4.963}], 'summary': 'Outliers can significantly impact linear model reliability and accuracy.', 'duration': 31.368, 'max_score': 30171.471, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE30171471.jpg'}, {'end': 30259.182, 'src': 'embed', 'start': 30225.346, 'weight': 5, 'content': [{'end': 30229.367, 'text': 'So the more time you spend in selecting your attributes to build the models,', 'start': 30225.346, 'duration': 4.021}, {'end': 30234.743, 'text': 'the higher the quality of the attributes that you use to build the models, the better your models will be.', 'start': 30230.54, 'duration': 4.203}, {'end': 30239.747, 'text': 'And the last, boundaries are linear.', 'start': 30237.645, 'duration': 2.102}, {'end': 30242.829, 'text': 'This is a problem not in regression, this is a problem in classification.', 'start': 30239.907, 'duration': 2.922}, {'end': 30247.613, 'text': 'So if we can use linear model for classification also.', 'start': 30244.491, 'duration': 3.122}, {'end': 30259.182, 'text': 'So if you are having a distribution of classes like this and then this, and you are trying to build linear classifiers,', 'start': 30248.754, 'duration': 10.428}], 'summary': 'Quality attributes lead to better models; linear boundaries for classification.', 'duration': 33.836, 'max_score': 30225.346, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE30225346.jpg'}, {'end': 30306.931, 'src': 'embed', 'start': 30277.243, 'weight': 4, 'content': [{'end': 30278.904, 'text': 'Maybe you would have gotten something like this.', 'start': 30277.243, 'duration': 1.661}, {'end': 30288.13, 'text': 'So, if you use non-linear models, it might do a better classification than linear models.', 'start': 30283.807, 'duration': 4.323}, {'end': 30292.961, 'text': 'So, this limitation is not in regression it is in classification.', 'start': 30290.199, 'duration': 2.762}, {'end': 30298.545, 'text': 'Let us do one hands on ok.', 'start': 30296.543, 'duration': 2.002}, {'end': 30305.49, 'text': 'The hands on is not auto MPG data set as it is given here it is the other one which is simplistic.', 'start': 30299.045, 'duration': 6.445}, {'end': 30306.931, 'text': 'So, we will start with simpler ones.', 'start': 30305.83, 'duration': 1.101}], 'summary': 'Using non-linear models may improve classification accuracy over linear models, demonstrated through hands-on experience with a simplistic dataset.', 'duration': 29.688, 'max_score': 30277.243, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE30277243.jpg'}, {'end': 30456.123, 'src': 'embed', 'start': 30420.207, 'weight': 7, 'content': [{'end': 30424.25, 'text': 'Given all these independent variables, can we predict the price of the car??', 'start': 30420.207, 'duration': 4.043}, {'end': 30427.192, 'text': 'Is there a relationship in price and all these variables?', 'start': 30424.41, 'duration': 2.782}, {'end': 30428.333, 'text': 'That is what we want to find out.', 'start': 30427.312, 'duration': 1.021}, {'end': 30434.737, 'text': 'What is that relationship? But before we do that, look at the data types.', 'start': 30429.334, 'duration': 5.403}, {'end': 30438.5, 'text': 'Many of the data types are object.', 'start': 30436.999, 'duration': 1.501}, {'end': 30439.521, 'text': 'Object means string.', 'start': 30438.6, 'duration': 0.921}, {'end': 30443.564, 'text': 'Machine learning algorithms cannot handle string data types.', 'start': 30441.302, 'duration': 2.262}, {'end': 30445.58, 'text': 'they have to be converted in numbers.', 'start': 30444.38, 'duration': 1.2}, {'end': 30456.123, 'text': 'So what I am doing here is, I am going to convert into numbers, but before I convert numbers, I am doing, I am dropping some of these columns.', 'start': 30448.561, 'duration': 7.562}], 'summary': 'Analyzing car price prediction by converting data types to numbers.', 'duration': 35.916, 'max_score': 30420.207, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE30420207.jpg'}], 'start': 29350.876, 'title': 'Linear models and ensemble techniques', 'summary': 'Discusses linear regression assumptions, non-linear relationships, differences between linear and non-linear models, ensemble techniques in data science, interpreting linear models, and data preprocessing for machine learning. it emphasizes the importance of checking for non-linear relationships, handling outliers, and preparing data for machine learning algorithms.', 'chapters': [{'end': 29505.159, 'start': 29350.876, 'title': 'Linear regression assumptions and non-linear relationships', 'summary': 'Discusses the assumptions of linearity in linear regression models, emphasizing the importance of checking for non-linear relationships when the correlation coefficient (r value) is close to 0, as it indicates a lack of linear relationship but not necessarily no relationship, highlighting the need to utilize pair plots to identify non-linear distributions and consider non-linear models for such attributes in the dataset.', 'duration': 154.283, 'highlights': ['The correlation coefficient (r value) being close to 0 does not necessarily mean no relationship, but indicates a lack of linear relationship, emphasizing the importance of checking for non-linear relationships and utilizing pair plots to identify such distributions.', 'When dealing with non-linear relationships and building a linear model on that, it is considered a fundamental mistake, underlining the need to consider non-linear models for such attributes in the dataset.', 'The pair plot is highlighted as the most important tool to understand the data and identify non-linear relationships, with a strong recommendation to always utilize it when working with datasets.']}, {'end': 29672.493, 'start': 29505.159, 'title': 'Linear vs non-linear models', 'summary': 'Discusses the distinction between linear and non-linear models in data science, emphasizing the possibility of using linear modeling on non-linear functions and the importance of minimal bias variance errors in model selection.', 'duration': 167.334, 'highlights': ['Linear vs non-linear models in data science', 'Possibility of using linear modeling on non-linear functions', 'Importance of minimal bias variance errors in model selection']}, {'end': 30100.249, 'start': 29674.615, 'title': 'Ensemble techniques in data science', 'summary': 'Explains the concept of ensemble techniques in data science, emphasizing the use of multiple models grouped together, the assumptions and implications of linear models, and the importance of handling outliers and testing for homoscedasticity and independence of errors.', 'duration': 425.634, 'highlights': ['Ensemble techniques in data science involve putting together multiple models in production, leveraging the concept of wisdom of the crowd.', 'Linear models in data science assume a linear relationship between independent variables and the target variable, as well as the absence of a relationship between the independent variables.', 'Handling outliers is crucial in model building, as linear models are prone to being influenced by outliers, which can lead to poor model performance.', 'Testing for homoscedasticity and independence of errors in model building is essential, with methods such as scatter plots and analyzing trends in residuals being utilized for this purpose.']}, {'end': 30359.183, 'start': 30101.385, 'title': 'Interpreting linear models and limitations in machine learning', 'summary': 'Discusses the physical interpretation of linear models, the limitations in interpreting black box models like neural networks and random forests, the vulnerability of linear models to outliers, the importance of reliable attributes in model building, and the limitations of linear models in classification compared to non-linear models.', 'duration': 257.798, 'highlights': ['The vulnerability of linear models to outliers is emphasized, as they can bring the model to its knees, making it prone to unreliable predictions.', 'The chapter highlights the difficulty in interpreting black box models like neural networks and random forests, contrasting them with the physical interpretation possible with linear models.', 'The importance of reliable attributes in model building is emphasized, as the reliability of the model is dependent on the quality of the attributes used.', 'The limitations of linear models in classification are discussed, pointing out that non-linear models may provide better classification results in certain scenarios.']}, {'end': 30593.317, 'start': 30361.584, 'title': 'Data preprocessing and analysis', 'summary': 'Covers the process of loading and analyzing a comma separated file, identifying missing and string data types, dropping low variance columns, and preparing the data for machine learning algorithms by converting string data types into numbers.', 'duration': 231.733, 'highlights': ["The data frame 'car_df' is created from a comma separated file with column names from the UCI data set, containing missing and string data types.", 'The objective is to predict the price of the car by identifying the relationship between independent variables and the price, and converting string data types into numbers for machine learning algorithms.', 'Dropping low variance columns, such as those containing only one value, to remove useless information from the analysis.']}], 'duration': 1242.441, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE29350876.jpg', 'highlights': ['Ensemble techniques leverage wisdom of the crowd', 'Pair plot is crucial for identifying non-linear relationships', 'Handling outliers is crucial in model building', 'Linear models assume linear relationship between variables', 'Non-linear models may provide better classification results', 'Importance of reliable attributes in model building', 'Linear models vulnerable to outliers, impacting predictions', "Creating 'car_df' to predict car price and preprocess data", 'Linear models have limitations in classification scenarios', 'Interpreting black box models vs. physical interpretation']}, {'end': 33323.322, 'segs': [{'end': 30803.517, 'src': 'embed', 'start': 30776.064, 'weight': 4, 'content': [{'end': 30782.066, 'text': 'So if a column is an ordinal data type You can go and introduce order in your numerical values.', 'start': 30776.064, 'duration': 6.002}, {'end': 30790.49, 'text': 'If the column is not ordinal, gender column, then you cannot blindly convert them into 1 and 2, you have to resort to one-hot coding.', 'start': 30782.746, 'duration': 7.744}, {'end': 30796.653, 'text': 'In scikit-learn there is a facility function called label encoder.', 'start': 30792.631, 'duration': 4.022}, {'end': 30801.035, 'text': 'Label encoder introduces order in your data.', 'start': 30798.034, 'duration': 3.001}, {'end': 30803.517, 'text': 'So, be careful when you are using that.', 'start': 30802.096, 'duration': 1.421}], 'summary': 'Ordinal data can be ordered numerically, non-ordinal data requires one-hot encoding. label encoder introduces order in data.', 'duration': 27.453, 'max_score': 30776.064, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE30776064.jpg'}, {'end': 31704.717, 'src': 'embed', 'start': 31676.809, 'weight': 2, 'content': [{'end': 31680.45, 'text': 'let us see if our analysis stands, let us see whether it stands, ok.', 'start': 31676.809, 'duration': 3.641}, {'end': 31686.992, 'text': 'So, this is numerical way, statistical way of analyzing data, but instead you can do a pair plot.', 'start': 31680.95, 'duration': 6.042}, {'end': 31694.814, 'text': 'In pair plot, I always prefer to have the diagonals in form of density graphs.', 'start': 31689.713, 'duration': 5.101}, {'end': 31704.717, 'text': 'How do you get that? When you call the pair panel, you give diagonal kind is KDE, Kernel Density Estimates.', 'start': 31696.635, 'duration': 8.082}], 'summary': 'Analyzing data using statistical methods and pair plots with kde for diagonals.', 'duration': 27.908, 'max_score': 31676.809, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE31676809.jpg'}, {'end': 32493.211, 'src': 'embed', 'start': 32460.62, 'weight': 0, 'content': [{'end': 32465.062, 'text': 'So, if you start dropping the rows for all columns where you have out last, your data set might shrink.', 'start': 32460.62, 'duration': 4.442}, {'end': 32479.468, 'text': 'So, now every column except one every column has outlier.', 'start': 32476.367, 'duration': 3.101}, {'end': 32485.249, 'text': 'So, when we are removing the outlier for every column 569 records come to 230, 000.', 'start': 32479.488, 'duration': 5.761}, {'end': 32487.25, 'text': 'So, that is not good.', 'start': 32485.249, 'duration': 2.001}, {'end': 32488.89, 'text': 'That is not good.', 'start': 32487.73, 'duration': 1.16}, {'end': 32493.211, 'text': 'So, dropping records is always the last option when you have plenty of data.', 'start': 32489.09, 'duration': 4.121}], 'summary': 'Dropping outliers reduced records from 569 to 230,000, not ideal. dropping records should be last resort with ample data.', 'duration': 32.591, 'max_score': 32460.62, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE32460620.jpg'}, {'end': 32553.056, 'src': 'embed', 'start': 32522.003, 'weight': 3, 'content': [{'end': 32524.204, 'text': "it's a descriptive statistics.", 'start': 32522.003, 'duration': 2.201}, {'end': 32527.026, 'text': 'next we do is bivariate analysis.', 'start': 32524.204, 'duration': 2.822}, {'end': 32530.889, 'text': 'in bivariate analysis, how these columns interact with each other.', 'start': 32527.026, 'duration': 3.863}, {'end': 32540.225, 'text': 'For example, if you look at this one, this is telling you interaction between the symbolizing symboling and this one wheel base.', 'start': 32531.937, 'duration': 8.288}, {'end': 32548.272, 'text': "By the way, in this square matrix, above the diagonal and below the diagonal, it's mirror image.", 'start': 32541.806, 'duration': 6.466}, {'end': 32553.056, 'text': 'Either you focus below the diagonal or above the diagonal.', 'start': 32549.533, 'duration': 3.523}], 'summary': 'Descriptive statistics followed by bivariate analysis of column interactions.', 'duration': 31.053, 'max_score': 32522.003, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE32522003.jpg'}, {'end': 32844.896, 'src': 'embed', 'start': 32819.36, 'weight': 1, 'content': [{'end': 32827.909, 'text': 'If you look at the cars with number of cylinders, in your data set most of the cars have 4 cylinders, 5 cylinders, 6 cylinder and 8 cylinder.', 'start': 32819.36, 'duration': 8.549}, {'end': 32837.778, 'text': 'Most of the cars look at this have 4 cylinders, by the way these data points might be sitting on top of one another.', 'start': 32830.071, 'duration': 7.707}, {'end': 32841.86, 'text': 'So, that does not mean your data set has only 1, 2, 3, 4, 5, 6 records of 4 cylinders.', 'start': 32838.517, 'duration': 3.343}, {'end': 32844.896, 'text': 'do that mistake.', 'start': 32844.055, 'duration': 0.841}], 'summary': 'Most cars in the dataset have 4, 5, 6, or 8 cylinders, with 4 cylinders being the most common. data points may overlap, so the count of cylinders may not be accurately represented.', 'duration': 25.536, 'max_score': 32819.36, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE32819360.jpg'}], 'start': 30595.759, 'title': 'Car data analysis', 'summary': 'Covers car data analysis, data preprocessing, statistical interpretation, and outlier handling, emphasizing the analysis of cylinder distribution and the importance of domain knowledge. it includes processes like handling missing values, identifying skewness, and understanding column interactions.', 'chapters': [{'end': 30649.203, 'start': 30595.759, 'title': 'Car data analysis', 'summary': 'Describes the process of analyzing car data, identifying and removing low variance columns, and dropping useless object data type columns, leaving only the necessary object data type columns such as number of doors and fuel type.', 'duration': 53.444, 'highlights': ['The data analysis involves identifying low variance columns and removing them, resulting in a more streamlined dataset.', 'The process includes dropping useless object data type columns, such as number of doors, fuel type, and engine location, to clean the dataset for further analysis.', "The necessary object data type columns that are retained after the cleaning process include 'number of cylinders' and any other relevant object data types."]}, {'end': 31582.943, 'start': 30661.225, 'title': 'Data preprocessing and handling missing values', 'summary': 'Covers data preprocessing, including dropping low variance columns, converting categorical variables to numerical using label encoder or one-hot encoding, and handling missing values by replacing with median or using mice package for imputing missing values.', 'duration': 921.718, 'highlights': ['Converting categorical variables to numerical using label encoder or one-hot encoding', 'Handling missing values by replacing with median or using MICE package for imputing missing values', 'Dropping low variance columns']}, {'end': 32412.8, 'start': 31583.723, 'title': 'Data analysis and statistical interpretation', 'summary': 'Discusses the process of statistical analysis for a dataset, including interpreting descriptive statistics, identifying skewness, using pair plots with kernel density estimates, and handling outliers, emphasizing the importance of data distribution and its impact on model building.', 'duration': 829.077, 'highlights': ['The chapter emphasizes the importance of data distribution and its impact on model building, including the identification of skewness and the use of pair plots with kernel density estimates.', 'The discussion includes the process of handling outliers and the impact of removing outliers on the data distribution and standard deviation.', 'The chapter provides insights into the significance of interpreting descriptive statistics, such as mean, median, and quartiles, to assess the distribution and symmetry of the data.']}, {'end': 32788.921, 'start': 32420.302, 'title': 'Data analysis and outlier handling', 'summary': 'Discusses the challenges of handling outliers, the impact of removing records on data size, and the importance of univariate and bivariate analysis in understanding column interactions and relationships, highlighting the need for synthetic dimensions in the presence of non-linearly independent dimensions.', 'duration': 368.619, 'highlights': ['Univariate and bivariate analysis are crucial in understanding column interactions and relationships in the dataset.', 'The impact of removing outliers on data size is significant, reducing the dataset from 569 to 230,000 records.', 'Challenges arise when dealing with non-linearly independent dimensions, necessitating the creation of synthetic dimensions through techniques such as principal component analysis or singular value decomposition.']}, {'end': 33323.322, 'start': 32819.36, 'title': 'Cylinder distribution analysis', 'summary': 'Discusses the analysis of cylinder distribution in a dataset, revealing the prevalence of 4-cylinder cars and the use of kernel density estimates to understand the spread of cylinder distribution in the population, emphasizing the importance of domain knowledge in data analysis.', 'duration': 503.962, 'highlights': ['The majority of cars in the dataset have 4 cylinders, with some having 5, 6, and 8 cylinders, while very few have 12 or 2 cylinders.', 'Kernel Density Estimates is used to estimate the density distribution of cylinder values in the population, providing insights on the spread of the distribution based on the available data.', "The importance of domain knowledge in model building and data analysis is emphasized, particularly in understanding variables like 'symboling' and the impact of categorical variables such as the origin of the car on mileage."]}], 'duration': 2727.563, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE30595759.jpg', 'highlights': ['The impact of removing outliers on data size is significant, reducing the dataset from 569 to 230,000 records.', 'The majority of cars in the dataset have 4 cylinders, with some having 5, 6, and 8 cylinders, while very few have 12 or 2 cylinders.', 'The chapter emphasizes the importance of data distribution and its impact on model building, including the identification of skewness and the use of pair plots with kernel density estimates.', 'Univariate and bivariate analysis are crucial in understanding column interactions and relationships in the dataset.', 'Converting categorical variables to numerical using label encoder or one-hot encoding']}, {'end': 36390.756, 'segs': [{'end': 33350.713, 'src': 'embed', 'start': 33323.322, 'weight': 0, 'content': [{'end': 33326.605, 'text': 'by this high end foreign brands they all come with embedded chips.', 'start': 33323.322, 'duration': 3.283}, {'end': 33330.228, 'text': 'Those embedded chips in real time.', 'start': 33328.266, 'duration': 1.962}, {'end': 33339.134, 'text': 'they capture the data about your driving style and pass it on to a central server where they sit down and analyze, and the risk factor is adjusted,', 'start': 33330.228, 'duration': 8.906}, {'end': 33344.617, 'text': 'recalculated, recalibrated, based on how the car is being driven.', 'start': 33339.134, 'duration': 5.483}, {'end': 33347.038, 'text': 'The symboling reflects that.', 'start': 33345.917, 'duration': 1.121}, {'end': 33350.713, 'text': "right. ok, let's move now.", 'start': 33348.271, 'duration': 2.442}], 'summary': 'High-end foreign cars have embedded chips that capture driving data in real-time for risk analysis and adjustment.', 'duration': 27.391, 'max_score': 33323.322, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE33323322.jpg'}, {'end': 33774.26, 'src': 'embed', 'start': 33747.462, 'weight': 6, 'content': [{'end': 33751.344, 'text': 'this is the point where the best fit line using gradient descent is found for you.', 'start': 33747.462, 'duration': 3.882}, {'end': 33754.887, 'text': 'Now look at this.', 'start': 33754.266, 'duration': 0.621}, {'end': 33760.951, 'text': 'I am printing out the coefficients of all those, the best fit line, the best fit plane, hyperplane rather.', 'start': 33754.907, 'duration': 6.044}, {'end': 33764.333, 'text': 'So when I print the coefficients, these are the coefficients.', 'start': 33762.052, 'duration': 2.281}, {'end': 33766.615, 'text': 'Now we need to understand what this is.', 'start': 33764.493, 'duration': 2.122}, {'end': 33769.917, 'text': 'You said there are 16 columns.', 'start': 33768.576, 'duration': 1.341}, {'end': 33774.26, 'text': 'So the 16 columns have 16 different coefficients, m1, m2, m16.', 'start': 33771.138, 'duration': 3.122}], 'summary': 'Gradient descent finds best fit line for 16 columns.', 'duration': 26.798, 'max_score': 33747.462, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE33747462.jpg'}, {'end': 34046.365, 'src': 'embed', 'start': 34017.169, 'weight': 3, 'content': [{'end': 34019.65, 'text': 'we have not done anything about that multicollinearity.', 'start': 34017.169, 'duration': 2.481}, {'end': 34029.519, 'text': 'So, those are the core reasons which will come across in all data set, which lead to overall model level problems.', 'start': 34022.336, 'duration': 7.183}, {'end': 34038.642, 'text': 'Now, I am going to take you through further down into slightly more deeper stuff.', 'start': 34031.639, 'duration': 7.003}, {'end': 34046.365, 'text': 'Let me see if I can cover it today, if not we will start with fresh minds tomorrow, because I need that particular thing needs lot of brain energy.', 'start': 34039.282, 'duration': 7.083}], 'summary': 'Multicollinearity issues persist, causing model problems across all datasets.', 'duration': 29.196, 'max_score': 34017.169, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE34017169.jpg'}, {'end': 34465.34, 'src': 'embed', 'start': 34386.063, 'weight': 1, 'content': [{'end': 34392.385, 'text': 'Shall we move on? Now, we need to improve the performance of the model.', 'start': 34386.063, 'duration': 6.322}, {'end': 34395.767, 'text': 'Your first cut model will never give you the results that you are expecting, very rare.', 'start': 34392.505, 'duration': 3.262}, {'end': 34403.681, 'text': "So you'll always end up in another cycle where your objective will be how do you improve the performance of this model, how do you notch it up.", 'start': 34397.537, 'duration': 6.144}, {'end': 34413.247, 'text': 'So there are various things you can do but before you do that I would like to introduce you to a library called StatsModel.', 'start': 34405.462, 'duration': 7.785}, {'end': 34421.893, 'text': 'StatsModel. the formula API, SMF StatsModel functions.', 'start': 34416.469, 'duration': 5.424}, {'end': 34432.8, 'text': 'okay, what happens is When you build this linear models in R, R gives you a lot of statistical information about your attributes and your models.', 'start': 34421.893, 'duration': 10.907}, {'end': 34438.321, 'text': 'scikit-learn linear regression does not give you those information.', 'start': 34434.701, 'duration': 3.62}, {'end': 34444.162, 'text': 'But there was a lot of demand for that kind of R kind of analysis in Python.', 'start': 34439.601, 'duration': 4.561}, {'end': 34449.323, 'text': 'So those people who support Python, scikit-learn, they came out with this library called stats model.', 'start': 34444.882, 'duration': 4.441}, {'end': 34453.544, 'text': 'Stats model tries to replicate the behavior of R.', 'start': 34450.703, 'duration': 2.841}, {'end': 34463.34, 'text': 'okay and gives you similar kind of statistical analysis as you would have received or gotten obtained in R.', 'start': 34455.256, 'duration': 8.084}, {'end': 34465.34, 'text': 'So, let us see what it does and why is it required.', 'start': 34463.34, 'duration': 2}], 'summary': 'Improving model performance using statsmodel library for statistical analysis in python.', 'duration': 79.277, 'max_score': 34386.063, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE34386063.jpg'}, {'end': 35460.048, 'src': 'embed', 'start': 35428.062, 'weight': 5, 'content': [{'end': 35428.982, 'text': 'Statistical fluke.', 'start': 35428.062, 'duration': 0.92}, {'end': 35434.527, 'text': 'Those attributes where p-values is less than 0.05, good attributes.', 'start': 35430.924, 'duration': 3.603}, {'end': 35438.531, 'text': 'Let me be a killjoy, okay.', 'start': 35436.129, 'duration': 2.402}, {'end': 35445.985, 'text': 'In the statistics community in the real world, there is a vertical split between the statisticians.', 'start': 35441.044, 'duration': 4.941}, {'end': 35448.846, 'text': 'One school of thought says p-values is not reliable.', 'start': 35446.485, 'duration': 2.361}, {'end': 35455.347, 'text': 'The other school of thought, which comes from the conventional statistics point of view, they say p-value is reliable.', 'start': 35450.446, 'duration': 4.901}, {'end': 35460.048, 'text': 'The reason why they say p-value is not reliable is you can go and check it yourselves.', 'start': 35456.767, 'duration': 3.281}], 'summary': "P-values < 0.05 are considered good attributes, but there's debate over their reliability in the statistics community.", 'duration': 31.986, 'max_score': 35428.062, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE35428062.jpg'}, {'end': 35735.397, 'src': 'embed', 'start': 35703.876, 'weight': 2, 'content': [{'end': 35707.678, 'text': 'It is a classification method based on linear regression.', 'start': 35703.876, 'duration': 3.802}, {'end': 35718.344, 'text': 'The response variable, that is, the target variable, can be binary class, default or non-default, or diabetic, non-diabetic,', 'start': 35710.578, 'duration': 7.766}, {'end': 35720.526, 'text': 'or it can be multi-class classification also.', 'start': 35718.344, 'duration': 2.182}, {'end': 35726.57, 'text': 'I can use logistic regression to for optical character recognition, I can do that.', 'start': 35721.166, 'duration': 5.404}, {'end': 35735.397, 'text': 'And in my personal experience I have seen and I have also read some papers about it when you compare models,', 'start': 35727.791, 'duration': 7.606}], 'summary': 'Logistic regression is used for binary and multi-class classification, including optical character recognition.', 'duration': 31.521, 'max_score': 35703.876, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE35703876.jpg'}, {'end': 36181.492, 'src': 'embed', 'start': 36148.934, 'weight': 9, 'content': [{'end': 36158.639, 'text': 'This S curve is called a sigmoid and it is very easy to achieve this.', 'start': 36148.934, 'duration': 9.705}, {'end': 36169.643, 'text': 'The sigmoid is nothing but 1 by 1 plus e, e is a Euler constant we use in mathematics, minus mx plus c.', 'start': 36159.755, 'duration': 9.888}, {'end': 36178.689, 'text': 'So this best fit line, the best fit line which is found for you, this bed face line is fed into this transformation, this mathematical formula.', 'start': 36169.643, 'duration': 9.046}, {'end': 36181.492, 'text': 'The result of this transformation is this curve.', 'start': 36179.41, 'duration': 2.082}], 'summary': 'The sigmoid curve transformation is achieved using the mathematical formula 1/(1+e^(-mx+c)).', 'duration': 32.558, 'max_score': 36148.934, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE36148934.jpg'}], 'start': 33323.322, 'title': 'Embedded chips, data analysis, and regression', 'summary': 'Discusses embedded chips capturing real-time driving data, linear regression coefficients analysis, statsmodel for linear regression in python, understanding coefficients and p-values in regression analysis, and logistic regression for classification, with a focus on machine learning, statistical analysis, and model application.', 'chapters': [{'end': 33938.784, 'start': 33323.322, 'title': 'Embedded chips and data analysis in driving', 'summary': 'Discusses how high-end foreign cars come with embedded chips that capture real-time driving data, which is analyzed to adjust the risk factor, followed by an explanation of data segregation, model training, and evaluation in machine learning with quantitative examples such as coefficients and their interpretations.', 'duration': 615.462, 'highlights': ['High-end foreign cars come with embedded chips that capture real-time driving data, which is analyzed to adjust the risk factor.', 'Explanation of data segregation and model training in machine learning with a quantitative example of splitting data in the ratio of 75-25 for test and training sets.', 'Interpretation of coefficients in the context of machine learning, with examples such as an 88.57-unit increase in price for every one-unit increase in symboling.']}, {'end': 34403.681, 'start': 33939.164, 'title': 'Linear regression coefficients analysis', 'summary': 'Discusses the issues related to coefficients in linear regression, such as outliers, multicollinearity, and model performance, with a focus on normalizing data, intercept, and coefficient interpretation, and achieving model improvement.', 'duration': 464.517, 'highlights': ['The chapter highlights the issues related to coefficients in linear regression, including outliers and multicollinearity, impacting model performance, with a focus on normalizing data and intercept interpretation.', 'The discussion emphasizes the importance of addressing outliers and multicollinearity in linear regression to avoid model performance issues.', 'The chapter explains the impact of normalizing data on linear regression coefficients, stating that the relationship between the dependent and independent variables remains unchanged.', 'The discussion elaborates on interpreting the intercept and coefficients in linear regression, emphasizing the functions model.intercept and model.coef_, and the significance of these values in the regression model.', "The chapter concludes with the importance of continuously improving the model's performance, highlighting the rarity of achieving desired results with the first-cut model and the need for iterative improvement."]}, {'end': 34740.135, 'start': 34405.462, 'title': 'Statsmodel for linear regression in python', 'summary': 'Introduces the statsmodel library for linear regression in python, highlighting its ability to provide r-like statistical analysis and additional results such as adjusted r square, aic, and bic.', 'duration': 334.673, 'highlights': ["StatsModel replicates R's statistical analysis and provides additional results such as adjusted R square, AIC, and BIC.", 'The library allows combining independent and dependent variables into a single data frame for R-like input, addressing the limitations of scikit-learn.', 'The chapter explains the process of building a linear model using the library, highlighting the use of ordinary least squares (ols) and the representation of the best fit line.']}, {'end': 35668.278, 'start': 34740.135, 'title': 'Understanding coefficients and p-values in regression analysis', 'summary': 'Explains the importance of coefficients and p-values in regression analysis, emphasizing the significance of p-values in determining the reliability of the relationship between attributes, with a critical view on their reliability due to collinearity and the contrasting opinions within the statistics community. it also highlights the significance of p-values in rejecting or accepting the null hypothesis, and the different methodologies for establishing the reliability of dimensions and models.', 'duration': 928.143, 'highlights': ['The chapter emphasizes the significance of p-values in determining the reliability of the relationship between attributes, with a critical view on their reliability due to collinearity and the contrasting opinions within the statistics community.', 'It also highlights the significance of p-values in rejecting or accepting the null hypothesis, and the different methodologies for establishing the reliability of dimensions and models.', 'The discussion explains the process of using coefficients and p-values to determine the reliability of dimensions and models, and the impact of collinearity on the reliability of coefficients and p-values.']}, {'end': 36390.756, 'start': 35673.262, 'title': 'Logistic regression for classification', 'summary': 'Introduces logistic regression as a classification method based on linear regression, applicable to binary and multi-class classification, with a focus on the probability-based model and its transformation into a sigmoid curve. the model is exemplified using a hypothetical scenario of predicting loan defaulters based on demographic data.', 'duration': 717.494, 'highlights': ['Logistic regression is a classification method based on linear regression, applicable to binary and multi-class classification, often performing well in model comparisons.', 'The model is exemplified using a hypothetical scenario of predicting loan defaulters based on demographic data, showcasing the process of assigning numerical values to classes and utilizing linear models to predict probabilities.', 'The transformation of the linear model into an S curve, known as a sigmoid, is explained as a technique to ensure the probability values are within the 0 to 1 range, aligning with the characteristics of a probability distribution.']}], 'duration': 3067.434, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE33323322.jpg', 'highlights': ['High-end cars use embedded chips for real-time driving data analysis.', "StatsModel replicates R's statistical analysis and provides additional results.", 'Logistic regression is a classification method based on linear regression.', 'Explanation of data segregation and model training in machine learning.', 'Importance of addressing outliers and multicollinearity in linear regression.', 'Significance of p-values in determining the reliability of the relationship between attributes.', 'Interpretation of coefficients in the context of machine learning.', "Importance of continuously improving the model's performance.", 'Process of building a linear model using the StatsModel library.', 'Transformation of the linear model into an S curve, known as a sigmoid.']}, {'end': 38290.32, 'segs': [{'end': 36601.449, 'src': 'embed', 'start': 36570.629, 'weight': 2, 'content': [{'end': 36571.25, 'text': 'we do not want that.', 'start': 36570.629, 'duration': 0.621}, {'end': 36574.632, 'text': 'probabilities have to be between 0 and 1, that is why they will convert into sigmoid.', 'start': 36571.25, 'duration': 3.382}, {'end': 36580.276, 'text': 'Sigmoid curve has that property that will always remain between 0 and 1.', 'start': 36575.112, 'duration': 5.164}, {'end': 36585.18, 'text': 'So, we are able to map the numerical values to probability function.', 'start': 36580.276, 'duration': 4.904}, {'end': 36592.405, 'text': 'So, coming back to the same question, now you explain the binary thing, so same thing Yes.', 'start': 36586.481, 'duration': 5.924}, {'end': 36596.407, 'text': 'In multi-class classification, it will be 1 versus rest.', 'start': 36592.965, 'duration': 3.442}, {'end': 36601.449, 'text': 'Suppose I used this logistic regression for OCR, optical character recognition.', 'start': 36596.547, 'duration': 4.902}], 'summary': 'Logistic regression maps numerical values to probabilities for ocr.', 'duration': 30.82, 'max_score': 36570.629, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE36570629.jpg'}, {'end': 36724.342, 'src': 'embed', 'start': 36700.61, 'weight': 3, 'content': [{'end': 36707.095, 'text': 'So what you do is, you again find a column here in this data set and break this into two, that means you are drawing a vertical boundary.', 'start': 36700.61, 'duration': 6.485}, {'end': 36713.016, 'text': 'So this and this is a child node here and here.', 'start': 36709.454, 'duration': 3.562}, {'end': 36724.342, 'text': 'So decision tree also breaks your mathematical space into pockets such that each pocket becomes homogeneous at leaf level.', 'start': 36714.857, 'duration': 9.485}], 'summary': 'Decision tree breaks mathematical space into homogeneous pockets for classification.', 'duration': 23.732, 'max_score': 36700.61, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE36700610.jpg'}, {'end': 36930.59, 'src': 'embed', 'start': 36902.874, 'weight': 1, 'content': [{'end': 36909.9, 'text': 'So if I take two dimensions, m and c, and plot the error, this will be a bowl shape.', 'start': 36902.874, 'duration': 7.026}, {'end': 36920.5, 'text': "Now in this ball shape we have to reach the global minima, it's guaranteed you will have only one global minima, you won't have any other.", 'start': 36913.633, 'duration': 6.867}, {'end': 36925.545, 'text': 'For quadratic error expressions you will always have one global minima guaranteed, okay.', 'start': 36921.12, 'duration': 4.425}, {'end': 36930.59, 'text': 'So we start with some random MNC, that random MNC gives us some error.', 'start': 36926.025, 'duration': 4.565}], 'summary': 'Quadratic error expressions have one global minima, and we start with random mnc to reach it.', 'duration': 27.716, 'max_score': 36902.874, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE36902874.jpg'}, {'end': 36993.475, 'src': 'embed', 'start': 36959.115, 'weight': 0, 'content': [{'end': 36960.016, 'text': 'we will take it up in detail.', 'start': 36959.115, 'duration': 0.901}, {'end': 36966.314, 'text': 'same concept will apply today in logistic because logistic uses linear model under the hood.', 'start': 36961.249, 'duration': 5.065}, {'end': 36974.581, 'text': 'So it finds for you the best fit line given the spread of the data points in mathematics space or best fit plane or a hyperplane.', 'start': 36967.215, 'duration': 7.366}, {'end': 36985.932, 'text': 'Once that plane is found, that plane is sent through the sigmoid transformation and it is converted to S curve, right.', 'start': 36975.542, 'duration': 10.39}, {'end': 36993.475, 'text': 'Now the driving force behind this, here in this case it was quadratic.', 'start': 36988.874, 'duration': 4.601}], 'summary': 'Logistic regression applies linear model and sigmoid transformation to find best fit plane in mathematics space.', 'duration': 34.36, 'max_score': 36959.115, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE36959115.jpg'}, {'end': 37199.809, 'src': 'embed', 'start': 37164.634, 'weight': 5, 'content': [{'end': 37171.639, 'text': 'the model is saying that probability belongs to blue class is very high and is actually blue class.', 'start': 37164.634, 'duration': 7.005}, {'end': 37174.241, 'text': 'no error 0 error.', 'start': 37171.639, 'duration': 2.602}, {'end': 37175.022, 'text': 'right 0 error.', 'start': 37174.241, 'duration': 0.781}, {'end': 37176.123, 'text': 'Come to this.', 'start': 37175.622, 'duration': 0.501}, {'end': 37180.806, 'text': 'What is the y value for blue class? 1.', 'start': 37177.184, 'duration': 3.622}, {'end': 37192.206, 'text': '1 into, what is the probability very high, almost close to 1? Log of a very high number, when log comes close to 1, this is 0.', 'start': 37180.806, 'duration': 11.4}, {'end': 37197.308, 'text': 'Log of a very high number is 0, close to 0.', 'start': 37192.206, 'duration': 5.102}, {'end': 37199.809, 'text': '2 raised to power 0 is 1.', 'start': 37197.308, 'duration': 2.501}], 'summary': 'The model predicts a high probability for the blue class with no errors.', 'duration': 35.175, 'max_score': 37164.634, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE37164634.jpg'}, {'end': 37536.987, 'src': 'embed', 'start': 37504.332, 'weight': 7, 'content': [{'end': 37511.398, 'text': 'So you sum up the errors done across all the data points those which are correctly classified, those which are not correctly classified.', 'start': 37504.332, 'duration': 7.066}, {'end': 37514.781, 'text': 'sum it up, you will get the total error across the entire model.', 'start': 37511.398, 'duration': 3.383}, {'end': 37518.624, 'text': 'Sum of squared errors same way this.', 'start': 37516.602, 'duration': 2.022}, {'end': 37528.813, 'text': 'So the objective is to minimize the sum of squared errors by finding the right logistic surface given the classes.', 'start': 37522.107, 'duration': 6.706}, {'end': 37536.987, 'text': 'So, the gradient descent will be the same.', 'start': 37534.466, 'duration': 2.521}], 'summary': 'Minimize sum of squared errors to find right logistic surface for classes.', 'duration': 32.655, 'max_score': 37504.332, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE37504332.jpg'}, {'end': 37634.408, 'src': 'embed', 'start': 37603.51, 'weight': 4, 'content': [{'end': 37611.895, 'text': 'The beauty of logistic regression is, it makes no assumption about the distribution of classes in the feature space.', 'start': 37603.51, 'duration': 8.385}, {'end': 37618.639, 'text': 'Many of these algorithms, linear model especially, if you are building linear classifiers or linear regression, they expect Gaussian distributions.', 'start': 37612.235, 'duration': 6.404}, {'end': 37624.164, 'text': 'you understand the term Gaussian distribution all of you, no ok.', 'start': 37619.983, 'duration': 4.181}, {'end': 37634.408, 'text': 'When you build any model, expectation is, if you take this attribute, the data will be spread around the central value of this attribute,', 'start': 37624.805, 'duration': 9.603}], 'summary': 'Logistic regression makes no class distribution assumption, unlike linear models which expect gaussian distributions.', 'duration': 30.898, 'max_score': 37603.51, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE37603510.jpg'}, {'end': 37738.734, 'src': 'embed', 'start': 37711.512, 'weight': 9, 'content': [{'end': 37721.218, 'text': 'Okay You can do multi-class classification using either binomial distribution in this case or multinomial distribution also is possible.', 'start': 37711.512, 'duration': 9.706}, {'end': 37728.222, 'text': 'You can also print out the probability values if you are not interested in the class but I want the probability values.', 'start': 37723.54, 'duration': 4.682}, {'end': 37730.584, 'text': 'What is the probability that belongs to this class or that class?', 'start': 37728.583, 'duration': 2.001}, {'end': 37738.734, 'text': 'quick to learn, because it is based on linear model and you have gradient descent to help us out to find the best fit line quickly.', 'start': 37733.032, 'duration': 5.702}], 'summary': 'Multi-class classification can use binomial or multinomial distribution with quick learning based on linear model and gradient descent.', 'duration': 27.222, 'max_score': 37711.512, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE37711512.jpg'}, {'end': 37787.762, 'src': 'embed', 'start': 37761.397, 'weight': 6, 'content': [{'end': 37769.618, 'text': 'because as a simple linear model, linear models cannot be very complex shapes, it is resistant to overfitting, okay.', 'start': 37761.397, 'duration': 8.221}, {'end': 37776.12, 'text': 'Resistant to overfitting does not mean resistant to bias errors.', 'start': 37771.699, 'duration': 4.421}, {'end': 37778.54, 'text': 'it means resistant to variance errors.', 'start': 37776.12, 'duration': 2.42}, {'end': 37787.762, 'text': 'right, but we have another one to deal with and you can print out the coefficients, the probability values, and interpret the probability values.', 'start': 37778.54, 'duration': 9.222}], 'summary': 'Linear models are resistant to overfitting and can handle complex shapes, but are still susceptible to bias and variance errors.', 'duration': 26.365, 'max_score': 37761.397, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE37761397.jpg'}, {'end': 37938.242, 'src': 'embed', 'start': 37911.509, 'weight': 8, 'content': [{'end': 37914.93, 'text': 'you have to know your data before you start building the models.', 'start': 37911.509, 'duration': 3.421}, {'end': 37922.154, 'text': 'You have to know your data, which means you have to know every attribute, how the data is distributed if you are in classification,', 'start': 37915.791, 'duration': 6.363}, {'end': 37929.117, 'text': 'how the classes are distributed, which dimensions or which attributes are able to linearly separate the two classes.', 'start': 37922.154, 'duration': 6.963}, {'end': 37931.398, 'text': 'use only those for your classification models.', 'start': 37929.117, 'duration': 2.281}, {'end': 37938.242, 'text': 'You have some attributes where both the classes are significantly overlapping including the central values.', 'start': 37933.68, 'duration': 4.562}], 'summary': 'Know your data attributes before building models, use only linearly separable ones for classification.', 'duration': 26.733, 'max_score': 37911.509, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE37911509.jpg'}], 'start': 36395.022, 'title': 'Machine learning model errors and logistic regression', 'summary': 'Discusses machine learning model errors, logistic regression, and its concepts such as sigmoid function, error minimization, gaussian distribution, advantages, disadvantages, and model evaluation with a focus on pima tribes dataset. it includes concepts like m and c dimensions, log loss function, gradient descent, non-reliance on gaussian distribution assumptions, and resistance to overfitting.', 'chapters': [{'end': 36901.973, 'start': 36395.022, 'title': 'Machine learning model errors', 'summary': 'Discusses the errors in machine learning models, including the misclassification of data points due to overlapping datasets and the use of sigmoid curve to ensure probabilities remain between 0 and 1. it also covers the concept of breaking mathematical space into pockets for decision trees and the use of ovr (one versus rest) approach in multiclass classification.', 'duration': 506.951, 'highlights': ['The concept of breaking mathematical space into pockets for decision trees is explained, where the algorithm aims to achieve homogeneity at the leaf level by drawing vertical boundaries, creating pockets for different classes.', 'The use of sigmoid curve in machine learning models is discussed, emphasizing its property to keep probabilities between 0 and 1, thus preventing overfitting and ensuring numerical values are mapped to a probability function.', 'The chapter also touches upon the concept of OVR (One Versus Rest) approach in multiclass classification, where most algorithms provide OVR by default, and some offer the option to replace it with Bayesian probabilities.']}, {'end': 37164.634, 'start': 36902.874, 'title': 'Logistic regression and sigmoid function', 'summary': 'Explains the concept of reaching global minima in logistic regression using m and c dimensions to find the best sigmoid surface through the log loss function, which is crucial for deep learning and neural networks.', 'duration': 261.76, 'highlights': ['Logistic regression aims to find the best fit line or plane in the mathematics space using the linear model, followed by the sigmoid transformation to form an S curve.', 'The log loss function, crucial for finding the best sigmoid surface, is explained as a simple expression involving the target variable and the predicted probability.', 'The concept of reaching global minima in logistic regression is explained using the dimensions M and C to plot the error in a bowl shape, ensuring a single global minima for quadratic error expressions.']}, {'end': 37601.609, 'start': 37164.634, 'title': 'Logistic regression and error minimization', 'summary': 'Explains the concept of logistic regression by demonstrating how the model computes probabilities for classification, evaluates the errors in classifying blue and red points, and aims to minimize the total error across the entire model using gradient descent.', 'duration': 436.975, 'highlights': ["The model's probability prediction for belonging to the blue class is very high, resulting in 0 error due to correct classification, with the log to the base 2 of a very high number being close to 0 and 2 raised to power 0 being 1.", 'Misclassification of blue points results in a very large error, particularly when the model predicts a very low probability for belonging to the blue class, leading to a log of a very small number, which translates to a very large error.', 'The chapter emphasizes the objective of minimizing the sum of squared errors by finding the right logistic surface given the classes, driving the gradient descent to reduce the total loss and achieve error minimization.']}, {'end': 37730.584, 'start': 37603.51, 'title': 'Logistic regression and gaussian distribution', 'summary': 'Introduces logistic regression, highlighting its non-reliance on gaussian distribution assumptions, the impact of outliers on the algorithm, and the options for multi-class classification and probability value printing.', 'duration': 127.074, 'highlights': ['Logistic regression makes no assumption about the distribution of classes in the feature space, in contrast to linear models which expect Gaussian distributions.', 'Outliers in the dataset can significantly impact the performance of logistic regression, leading to more severe errors as the outliers become more extreme.', 'Multi-class classification can be performed using binomial or multinomial distribution, and the algorithm allows for the printing of probability values for each class.']}, {'end': 37998.658, 'start': 37733.032, 'title': 'Logistic regression advantages and disadvantages', 'summary': 'Discusses the advantages and disadvantages of logistic regression, highlighting its resistance to overfitting and linear boundaries, while cautioning about its limitations in handling non-linearly separable distributions, with a focus on the importance of understanding data prior to model building.', 'duration': 265.626, 'highlights': ['Logistic regression stands out among the top few in many situations', 'Resistance to overfitting and linear boundaries', 'Importance of understanding data before model building']}, {'end': 38290.32, 'start': 37999.158, 'title': 'Data analysis and model evaluation', 'summary': 'Explores the change in density of the plot, analysis of switching axes, application of random forest model, and hands-on data set loading and model evaluation using confusion matrix, focusing on predicting type 2 diabetes among pima tribes.', 'duration': 291.162, 'highlights': ['The density of the plot changes from very low to very high and then starts falling, forming a normal distribution when converted into 3D.', 'Discussion on the application of random forest model, which constructs every node using randomly selected features from the dataset, and its suitability in separating classes based on attributes.', 'Explanation of loading the Pima tribe dataset, including the column names and the significance of the captured characteristics in predicting type 2 diabetes among the tribe members.']}], 'duration': 1895.298, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE36395022.jpg', 'highlights': ['Logistic regression aims to find the best fit line or plane in the mathematics space using the linear model, followed by the sigmoid transformation to form an S curve.', 'The concept of reaching global minima in logistic regression is explained using the dimensions M and C to plot the error in a bowl shape, ensuring a single global minima for quadratic error expressions.', 'The use of sigmoid curve in machine learning models is discussed, emphasizing its property to keep probabilities between 0 and 1, thus preventing overfitting and ensuring numerical values are mapped to a probability function.', 'The concept of breaking mathematical space into pockets for decision trees is explained, where the algorithm aims to achieve homogeneity at the leaf level by drawing vertical boundaries, creating pockets for different classes.', 'Logistic regression makes no assumption about the distribution of classes in the feature space, in contrast to linear models which expect Gaussian distributions.', "The model's probability prediction for belonging to the blue class is very high, resulting in 0 error due to correct classification, with the log to the base 2 of a very high number being close to 0 and 2 raised to power 0 being 1.", 'Resistance to overfitting and linear boundaries', 'The chapter emphasizes the objective of minimizing the sum of squared errors by finding the right logistic surface given the classes, driving the gradient descent to reduce the total loss and achieve error minimization.', 'Importance of understanding data before model building', 'Multi-class classification can be performed using binomial or multinomial distribution, and the algorithm allows for the printing of probability values for each class.']}, {'end': 40155.598, 'segs': [{'end': 38341.52, 'src': 'embed', 'start': 38314.643, 'weight': 7, 'content': [{'end': 38319.906, 'text': 'we are trying to find out is there a relationship between the independent variables and the target variables?', 'start': 38314.643, 'duration': 5.263}, {'end': 38323.875, 'text': 'what is that relationship?', 'start': 38322.895, 'duration': 0.98}, {'end': 38325.976, 'text': 'we are trying to find out that relationship.', 'start': 38323.875, 'duration': 2.101}, {'end': 38329.317, 'text': 'we assume exists in the universe, in the real world.', 'start': 38325.976, 'duration': 3.341}, {'end': 38332.098, 'text': 'we are trying to find out whether we can discover that relationship.', 'start': 38329.317, 'duration': 2.781}, {'end': 38341.52, 'text': 'So for that we are building this model and I always do this based on the experience on car mpg dataset.', 'start': 38332.338, 'duration': 9.182}], 'summary': 'Discovering relationship between variables using car mpg dataset', 'duration': 26.877, 'max_score': 38314.643, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE38314643.jpg'}, {'end': 38398.107, 'src': 'embed', 'start': 38360.926, 'weight': 8, 'content': [{'end': 38367.151, 'text': 'To identify that, we can use this method.', 'start': 38360.926, 'duration': 6.225}, {'end': 38375.377, 'text': 'What I am doing is, I am using a numerical Python function isReal.', 'start': 38371.134, 'duration': 4.243}, {'end': 38380.896, 'text': 'This function is a binary function, it is a Boolean function.', 'start': 38377.253, 'duration': 3.643}, {'end': 38382.977, 'text': 'It will give true or false.', 'start': 38381.756, 'duration': 1.221}, {'end': 38387.64, 'text': 'This function I am applying it on to, look at this apply.', 'start': 38384.378, 'duration': 3.262}, {'end': 38392.423, 'text': 'I am applying it on to all the rows, all the columns of this.', 'start': 38387.66, 'duration': 4.763}, {'end': 38398.107, 'text': 'This is the beauty of Python where you have to do minimal coding.', 'start': 38394.425, 'duration': 3.682}], 'summary': "Using python's 'isreal' function to apply a binary boolean function on all rows and columns for minimal coding.", 'duration': 37.181, 'max_score': 38360.926, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE38360926.jpg'}, {'end': 38509.828, 'src': 'embed', 'start': 38480.743, 'weight': 9, 'content': [{'end': 38483.784, 'text': 'But we have nothing to be happy about because we are seeing lot of zeros.', 'start': 38480.743, 'duration': 3.041}, {'end': 38488.365, 'text': 'And zeros in blood pressure is surprising, it cannot be.', 'start': 38485.404, 'duration': 2.961}, {'end': 38489.752, 'text': 'not possible.', 'start': 38489.232, 'duration': 0.52}, {'end': 38494.957, 'text': 'Zero in plasma is not possible, but you have zero values in blood pressure, you have zero values in plasma.', 'start': 38490.133, 'duration': 4.824}, {'end': 38502.322, 'text': 'So, missing values are there, the missing values will cause problems, we need to address the missing values, ok.', 'start': 38496.658, 'duration': 5.664}, {'end': 38503.643, 'text': 'Let us move on.', 'start': 38503.123, 'duration': 0.52}, {'end': 38509.828, 'text': 'Do a div, describe how the data is distributed on the various dimensions.', 'start': 38504.664, 'duration': 5.164}], 'summary': 'Address missing values in blood pressure and plasma data to avoid problems.', 'duration': 29.085, 'max_score': 38480.743, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE38480743.jpg'}, {'end': 38622.622, 'src': 'embed', 'start': 38593.341, 'weight': 5, 'content': [{'end': 38594.261, 'text': 'Maybe this is a typo.', 'start': 38593.341, 'duration': 0.92}, {'end': 38595.302, 'text': 'We do not know.', 'start': 38594.902, 'duration': 0.4}, {'end': 38597.924, 'text': 'Okay Right.', 'start': 38596.803, 'duration': 1.121}, {'end': 38598.645, 'text': 'I am going to move on.', 'start': 38597.944, 'duration': 0.701}, {'end': 38605.115, 'text': 'the next thing that you have to do is first understand how the data is distributed on your various dimensions.', 'start': 38599.773, 'duration': 5.342}, {'end': 38608.756, 'text': 'if you think there are outliers, we need to address outliers.', 'start': 38605.115, 'duration': 3.641}, {'end': 38610.277, 'text': 'we already know we have missing values.', 'start': 38608.756, 'duration': 1.521}, {'end': 38612.158, 'text': 'we need to address missing values.', 'start': 38610.277, 'duration': 1.881}, {'end': 38620.461, 'text': 'the next thing you need to do is, since you are in classification, how many records are available for each class in the data set?', 'start': 38612.158, 'duration': 8.303}, {'end': 38622.622, 'text': 'look at that.', 'start': 38620.461, 'duration': 2.161}], 'summary': 'Identify outliers and missing values, analyze data distribution, and evaluate class imbalance.', 'duration': 29.281, 'max_score': 38593.341, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE38593341.jpg'}, {'end': 38720.19, 'src': 'embed', 'start': 38692.932, 'weight': 6, 'content': [{'end': 38703.156, 'text': 'And keep in mind whenever you are in classification and the classes are skewed like this all algorithms are biased towards the higher represented class.', 'start': 38692.932, 'duration': 10.224}, {'end': 38710.324, 'text': 'All models, the objective is to minimize the overall misclassification.', 'start': 38705.558, 'duration': 4.766}, {'end': 38713.266, 'text': 'So their focus will be on the higher class.', 'start': 38711.524, 'duration': 1.742}, {'end': 38720.19, 'text': 'So this is another source of bias which comes into the model where the algorithms themselves are biased.', 'start': 38715.247, 'duration': 4.943}], 'summary': 'Skewed classes bias algorithms, focusing on higher represented class.', 'duration': 27.258, 'max_score': 38692.932, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE38692932.jpg'}, {'end': 39368.727, 'src': 'embed', 'start': 39324.82, 'weight': 1, 'content': [{'end': 39330.362, 'text': 'These are those values of red lying on the extreme end, values of blue lying on the extreme end.', 'start': 39324.82, 'duration': 5.542}, {'end': 39335.145, 'text': 'Look at the diabetic class.', 'start': 39334.104, 'duration': 1.041}, {'end': 39350.801, 'text': '84, there are 84 diabetic cases in your test data, of this 84, 46 have been correctly classified as diabetic.', 'start': 39338.716, 'duration': 12.085}, {'end': 39358.824, 'text': '84 minus 46 is 38, 38 have been misclassified, ok.', 'start': 39350.821, 'duration': 8.003}, {'end': 39362.285, 'text': 'Now look at this, now look at this.', 'start': 39359.504, 'duration': 2.781}, {'end': 39366.987, 'text': 'What is the overall accuracy? 77%.', 'start': 39363.885, 'duration': 3.102}, {'end': 39368.727, 'text': 'Looks decent.', 'start': 39366.987, 'duration': 1.74}], 'summary': 'Out of 84 diabetic cases, 46 were correctly classified, resulting in a 77% overall accuracy.', 'duration': 43.907, 'max_score': 39324.82, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE39324820.jpg'}, {'end': 39471.309, 'src': 'embed', 'start': 39440.533, 'weight': 0, 'content': [{'end': 39446.899, 'text': 'what is important is this but this is what we expected, because this is represented very poorly the diabetic class.', 'start': 39440.533, 'duration': 6.366}, {'end': 39451.832, 'text': 'This is important for us.', 'start': 39450.931, 'duration': 0.901}, {'end': 39455.135, 'text': 'Look at the recall for the non-diabetic class.', 'start': 39452.573, 'duration': 2.562}, {'end': 39461.3, 'text': 'Non-diabetic class, the recall is 132 by 147.', 'start': 39456.916, 'duration': 4.384}, {'end': 39466.104, 'text': 'This will come very high, 90 percent, 80 percent, 90 percent roughly round.', 'start': 39461.3, 'duration': 4.804}, {'end': 39471.309, 'text': 'So the recall for the non-diabetic class is very high.', 'start': 39469.027, 'duration': 2.282}], 'summary': 'The recall for the non-diabetic class is 90-95%.', 'duration': 30.776, 'max_score': 39440.533, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE39440533.jpg'}, {'end': 40154.077, 'src': 'heatmap', 'start': 39763.487, 'weight': 1, 'content': [{'end': 39765.428, 'text': 'It creates synthetic data in these regions.', 'start': 39763.487, 'duration': 1.941}, {'end': 39773.011, 'text': 'Yes But by doing the upsampling, are we not introducing the bias error? 100%.', 'start': 39767.989, 'duration': 5.022}, {'end': 39781.834, 'text': 'Guaranteed So we are, by introducing the bias error, we are taking away the bias in the algorithm which is towards the higher class.', 'start': 39773.011, 'duration': 8.823}, {'end': 39784.836, 'text': 'We are trying to cancel that bias error by upsampling.', 'start': 39782.274, 'duration': 2.562}, {'end': 39791.098, 'text': 'So actually upsampling and downsampling is adjusting the bias.', 'start': 39787.236, 'duration': 3.862}, {'end': 39797.682, 'text': 'We are trading off the bias which exists naturally in the algorithm towards higher case.', 'start': 39792.818, 'duration': 4.864}, {'end': 39802.906, 'text': 'Are they exact duplicates or? No, no they are not duplicate.', 'start': 39799.903, 'duration': 3.003}, {'end': 39805.348, 'text': 'That synthetic data generated using k nearest neighbors.', 'start': 39802.926, 'duration': 2.422}, {'end': 39808.05, 'text': 'So I take a random point here.', 'start': 39806.329, 'duration': 1.721}, {'end': 39809.071, 'text': 'then k nearest neighbors.', 'start': 39808.05, 'duration': 1.021}, {'end': 39812.914, 'text': 'if you use k nearest regressor, for example, it will sum up these three and give you the average.', 'start': 39809.071, 'duration': 3.843}, {'end': 39823.631, 'text': 'No, all are stars, but what it will do is, it is k nearest regressor.', 'start': 39819.73, 'duration': 3.901}, {'end': 39828.712, 'text': 'So, it is going to take the blood pressure of all three, find out the average, give the average to this new point.', 'start': 39824.471, 'duration': 4.241}, {'end': 39831.652, 'text': 'So, that can be That can be.', 'start': 39830.352, 'duration': 1.3}, {'end': 39833.773, 'text': 'That can be.', 'start': 39831.992, 'duration': 1.781}, {'end': 39837.533, 'text': 'So, there are two class modes, star and circle.', 'start': 39833.813, 'duration': 3.72}, {'end': 39839.794, 'text': 'It works only on the underrepresented class.', 'start': 39837.853, 'duration': 1.941}, {'end': 39843.374, 'text': 'It generates synthetic data only for the lower case.', 'start': 39840.974, 'duration': 2.4}, {'end': 39847.095, 'text': 'So, in this case, it will generate synthetic data only for the diabetic class.', 'start': 39844.214, 'duration': 2.881}, {'end': 39863.276, 'text': 'It will bring both the samples to the same level.', 'start': 39860.875, 'duration': 2.401}, {'end': 39867.259, 'text': '50-50 It will bring it to 50-50.', 'start': 39863.296, 'duration': 3.963}, {'end': 39876.766, 'text': 'Can you give us some pointers? IMLN, go and search for IMLN on Google, you will find that package IMLN, you will see lot of reading materials there.', 'start': 39867.7, 'duration': 9.066}, {'end': 39879.348, 'text': 'But we will cover in detail in FMT.', 'start': 39877.667, 'duration': 1.681}, {'end': 39892.894, 'text': "Don't do your manual approach of deleting the data, don't do that.", 'start': 39889.391, 'duration': 3.503}, {'end': 39898.319, 'text': 'That is that will be the worst case, worst thing to do.', 'start': 39896.137, 'duration': 2.182}, {'end': 39905.866, 'text': 'Anyway coming back to this, so what the next thing which you are going to try is since all data is numeric.', 'start': 39898.739, 'duration': 7.127}, {'end': 39914.831, 'text': "all data is numeric, let's convert all this data into what is called standard scale, x i minus x bar by standard deviation, right.", 'start': 39907.287, 'duration': 7.544}, {'end': 39920.473, 'text': "And that's what I am doing here from pre-processing library, I am calling the scalar function.", 'start': 39915.751, 'duration': 4.722}, {'end': 39926.176, 'text': 'So it converts every record, every column of this data frame into scaled data.', 'start': 39921.974, 'duration': 4.202}, {'end': 39929.638, 'text': 'Same with test record also convert into scaled.', 'start': 39927.417, 'duration': 2.221}, {'end': 39935.341, 'text': 'On that I rebuild my model and relook at my scores.', 'start': 39931.299, 'duration': 4.042}, {'end': 39939.511, 'text': 'As I said sometimes we expect miracles to happen, it never happens.', 'start': 39936.71, 'duration': 2.801}, {'end': 39945.054, 'text': 'Which type of scaling does it use? Which type of scaling does it use? This is z-scores.', 'start': 39939.951, 'duration': 5.103}, {'end': 39951.736, 'text': 'Z-scores Yesterday I told you scaling does not impact linear models.', 'start': 39945.414, 'duration': 6.322}, {'end': 39953.957, 'text': 'So, you will not see any impact.', 'start': 39952.937, 'duration': 1.02}, {'end': 39958.379, 'text': 'There is nothing else we can do.', 'start': 39957.259, 'duration': 1.12}, {'end': 39964.442, 'text': 'Converting z-scores, x i minus x bar by standard deviation for every column.', 'start': 39960.96, 'duration': 3.482}, {'end': 39967.159, 'text': 'centralize the data to zero.', 'start': 39966.178, 'duration': 0.981}, {'end': 39998.445, 'text': 'That depends on how your data is distributed.', 'start': 39994.601, 'duration': 3.844}, {'end': 40001.728, 'text': 'It depends on that.', 'start': 39998.545, 'duration': 3.183}, {'end': 40002.929, 'text': 'It depends on that.', 'start': 40002.248, 'duration': 0.681}, {'end': 40012.117, 'text': "Okay It's a very subtle relationship with the data distributions in the way it's done.", 'start': 40003.009, 'duration': 9.108}, {'end': 40015.64, 'text': 'And the threshold management, changing the thresholds.', 'start': 40012.578, 'duration': 3.062}, {'end': 40018.283, 'text': 'Fifty percent is the threshold.', 'start': 40017.222, 'duration': 1.061}, {'end': 40025.259, 'text': 'So I can say whenever probability is less than 30 percent it belongs to red class, above 30 percent it belongs to blue class.', 'start': 40020.097, 'duration': 5.162}, {'end': 40029.16, 'text': "I can change the threshold, right now by default it's 50 percent.", 'start': 40026.079, 'duration': 3.081}, {'end': 40035.842, 'text': "If probability of belonging to red class is less than 50 percent that means he belongs to blue class, right now it's that.", 'start': 40029.88, 'duration': 5.962}, {'end': 40040.283, 'text': 'So that 50 percent can be changed to 30 percent by building wrappers around this model.', 'start': 40036.522, 'duration': 3.761}, {'end': 40045.185, 'text': 'Number of records is what you have in your data.', 'start': 40043.464, 'duration': 1.721}, {'end': 40062.634, 'text': 'Threshold is in your sigmoid curves, sigmoid or any probability based algorithm, this is my probability dimension 0 to 1, 0.5.', 'start': 40047.79, 'duration': 14.844}, {'end': 40064.636, 'text': 'Whenever function.', 'start': 40062.635, 'duration': 2.001}, {'end': 40074.379, 'text': 'this sigmoid function says the probability is greater than 0.5, 0.55, 0.56, then it will automatically say that this record belongs to the blue.', 'start': 40064.636, 'duration': 9.743}, {'end': 40078.445, 'text': 'If it is below 0.5, it will say it belongs to the red.', 'start': 40076.144, 'duration': 2.301}, {'end': 40085.83, 'text': 'So I can bring down this threshold and I can put it over here, 0.3.', 'start': 40079.906, 'duration': 5.924}, {'end': 40089.092, 'text': 'Anything over 0.3 belongs to blue, anything below 0.3 belongs to red.', 'start': 40085.83, 'duration': 3.262}, {'end': 40096.396, 'text': 'So by changing the threshold, I can control the misclassification of the diabetic cases.', 'start': 40089.892, 'duration': 6.504}, {'end': 40099.758, 'text': 'No, no, it is not a parameter.', 'start': 40098.698, 'duration': 1.06}, {'end': 40102.58, 'text': 'You have to build a wrapper around your model.', 'start': 40099.778, 'duration': 2.802}, {'end': 40105.502, 'text': 'There is no automatic way of doing this.', 'start': 40104.061, 'duration': 1.441}, {'end': 40109.357, 'text': "so I'll use a function in scikit-learn.", 'start': 40106.755, 'duration': 2.602}, {'end': 40111.498, 'text': "it's called binarize.", 'start': 40109.357, 'duration': 2.141}, {'end': 40116.04, 'text': 'in case you are very impatient and you want to know what it is, please go and explore this function.', 'start': 40111.498, 'duration': 4.542}, {'end': 40123.064, 'text': 'binarize, b-i-n-a-r-i-z-e, how, using binarize, we can control the threshold.', 'start': 40116.04, 'duration': 7.024}, {'end': 40123.565, 'text': 'explore that.', 'start': 40123.064, 'duration': 0.501}, {'end': 40128.129, 'text': "If you're not able to find, I'll tell you how to do that in FMT.", 'start': 40125.188, 'duration': 2.941}, {'end': 40131.63, 'text': 'We hope you liked this complete tutorial on data science.', 'start': 40128.849, 'duration': 2.781}, {'end': 40138.312, 'text': 'Great Learning offers high quality, impactful, and industrially relevant programs to working professionals like you.', 'start': 40132.49, 'duration': 5.822}, {'end': 40144.714, 'text': 'Our faculty pool comprises of leading teachers and industry practitioners in the field of data analytics.', 'start': 40139.072, 'duration': 5.642}, {'end': 40148.275, 'text': 'For more information, check the links in the description down below.', 'start': 40145.274, 'duration': 3.001}, {'end': 40150.696, 'text': "Don't forget to like, share, and subscribe.", 'start': 40148.615, 'duration': 2.081}, {'end': 40154.077, 'text': 'Remember, the only learning that matters is great learning.', 'start': 40151.116, 'duration': 2.961}], 'summary': 'Synthetic data is created for underrepresented class, adjusting bias error through upsampling and downsampling, using k nearest neighbors for generation, and scaling data for model rebuilding.', 'duration': 390.59, 'max_score': 39763.487, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE39763487.jpg'}, {'end': 39797.682, 'src': 'embed', 'start': 39773.011, 'weight': 2, 'content': [{'end': 39781.834, 'text': 'Guaranteed So we are, by introducing the bias error, we are taking away the bias in the algorithm which is towards the higher class.', 'start': 39773.011, 'duration': 8.823}, {'end': 39784.836, 'text': 'We are trying to cancel that bias error by upsampling.', 'start': 39782.274, 'duration': 2.562}, {'end': 39791.098, 'text': 'So actually upsampling and downsampling is adjusting the bias.', 'start': 39787.236, 'duration': 3.862}, {'end': 39797.682, 'text': 'We are trading off the bias which exists naturally in the algorithm towards higher case.', 'start': 39792.818, 'duration': 4.864}], 'summary': 'Introducing bias error to counter algorithmic bias towards higher class by upsampling and downsampling.', 'duration': 24.671, 'max_score': 39773.011, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE39773011.jpg'}, {'end': 40002.929, 'src': 'embed', 'start': 39945.414, 'weight': 4, 'content': [{'end': 39951.736, 'text': 'Z-scores Yesterday I told you scaling does not impact linear models.', 'start': 39945.414, 'duration': 6.322}, {'end': 39953.957, 'text': 'So, you will not see any impact.', 'start': 39952.937, 'duration': 1.02}, {'end': 39958.379, 'text': 'There is nothing else we can do.', 'start': 39957.259, 'duration': 1.12}, {'end': 39964.442, 'text': 'Converting z-scores, x i minus x bar by standard deviation for every column.', 'start': 39960.96, 'duration': 3.482}, {'end': 39967.159, 'text': 'centralize the data to zero.', 'start': 39966.178, 'duration': 0.981}, {'end': 39998.445, 'text': 'That depends on how your data is distributed.', 'start': 39994.601, 'duration': 3.844}, {'end': 40001.728, 'text': 'It depends on that.', 'start': 39998.545, 'duration': 3.183}, {'end': 40002.929, 'text': 'It depends on that.', 'start': 40002.248, 'duration': 0.681}], 'summary': 'Z-scores centralize data to zero, no impact on linear models.', 'duration': 57.515, 'max_score': 39945.414, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE39945414.jpg'}], 'start': 38290.32, 'title': 'Diabetes classification analysis', 'summary': 'Covers data analysis, distribution, handling skewed class distributions, logistic regression model evaluation, upsampling, downsampling, data scaling, and threshold management for diabetes classification, with a focus on addressing missing values and skewed class distributions, and utilizing logistic regression model evaluation for performance assessment.', 'chapters': [{'end': 38503.643, 'start': 38290.32, 'title': 'Data analysis and missing values', 'summary': 'Discusses the objective of finding the relationship between independent variables and the target variables, with a focus on identifying non-numerical values in the dataset and addressing missing values in blood pressure and plasma.', 'duration': 213.323, 'highlights': ['The objective is to find the relationship between independent and target variables.', "Using Python's isReal function to identify non-numerical values in the dataset.", 'Addressing missing values in blood pressure and plasma.']}, {'end': 38666.993, 'start': 38504.664, 'title': 'Data distribution analysis in classification', 'summary': 'Discusses the analysis of data distribution on various dimensions, addressing outliers and missing values, and emphasizes the importance of understanding the distribution of records for each class in a dataset, with examples of a diabetic classification dataset and a comparison to studying trigonometry for exams.', 'duration': 162.329, 'highlights': ['The number of cases for non-diabetic zero is 500, whereas the number of cases for diabetic is half of that almost half, emphasizing the importance of understanding the distribution of records for each class in a dataset.', 'The mean is drastically shifted away from the median on the higher side, indicating the presence of long tails on the right side and the impact of outliers on the mean.', 'The pair plot visually provides the same univariate analysis as univariate analysis, offering an alternative approach for data distribution assessment.', 'The need to address outliers and missing values in the dataset is emphasized as a crucial step in data analysis.', 'Analogizing the importance of studying all possible patterns in trigonometry questions to understanding the distribution of records for each class in a dataset, highlighting the potential consequences of focusing only on a subset of patterns.']}, {'end': 39124.884, 'start': 38668.314, 'title': 'Handling skewed class distributions in classification', 'summary': 'Discusses the challenge of skewed class distributions and provides strategies for handling it, including up sampling, down sampling, and modifying thresholds to improve accuracy for the underrepresented class in classification, as well as the limitations of using various dimensions for predicting diabetic cases.', 'duration': 456.57, 'highlights': ['Strategies for handling skewed class distributions', 'Impact of skewed class distributions on model performance', 'Limitations of using various dimensions for predicting diabetic cases']}, {'end': 39721.807, 'start': 39126.925, 'title': 'Logistic regression model evaluation', 'summary': 'Covers the evaluation of a logistic regression model for diabetes classification, including confusion matrix analysis, overall accuracy, recall for diabetic and non-diabetic classes, and the importance of data preparation in model performance, with a focus on the recall for the diabetic class and the reliance on data preparation over algorithm running.', 'duration': 594.882, 'highlights': ["The overall accuracy of the logistic regression model is 77%, with 46 out of 84 diabetic cases and 132 out of 147 non-diabetic cases being correctly classified, indicating the model's decent performance.", "The recall for the diabetic class is only 55%, implying that the model's performance in identifying diabetic cases is not significantly better than random guessing, highlighting the need for improvement in classifying diabetic cases.", "The recall for the non-diabetic class is very high at around 90%, emphasizing the model's strong performance in identifying non-diabetic cases, indicating the need for specific focus on improving the model's performance in classifying diabetic cases.", 'The chapter emphasizes the significance of data preparation, stating that around 80% of the project effort in data science goes into preparing the data, highlighting the critical role of data preparation in model performance.']}, {'end': 39898.319, 'start': 39721.868, 'title': 'Upsampling and downsampling in ml', 'summary': 'Explains how upsampling uses k nearest neighbors to generate synthetic data for the underrepresented class, aiming to address bias error and bring both classes to the same level, typically 50-50.', 'duration': 176.451, 'highlights': ['Upsampling generates synthetic data using k nearest neighbors to address bias error and bring both classes to the same level, typically 50-50.', 'The synthetic data generated does not disturb the overlap region and aims to cancel the bias error by adjusting the bias in the algorithm.', 'IMLN is a package to explore for detailed materials on upsampling and downsampling, emphasizing the importance of not manually deleting data as a worst-case approach.']}, {'end': 40155.598, 'start': 39898.739, 'title': 'Data scaling and threshold management', 'summary': 'Discusses the process of converting data into standard scale using z-scores and the management of thresholds to control misclassification of diabetic cases, emphasizing the impact on model scores and the flexibility to adjust the threshold.', 'duration': 256.859, 'highlights': ['Data scaling using z-scores to standardize the data by calculating xi minus x bar by standard deviation for every column and its impact on model scores.', "Management of thresholds to control misclassification of diabetic cases, ability to adjust the threshold to improve accuracy, and the use of the 'binarize' function in scikit-learn to control the threshold.", 'Explanation of the subtle relationship with the data distributions in the process of scaling and the use of wrappers around the model to change the threshold.']}], 'duration': 1865.278, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/u2zsY-2uZiE/pics/u2zsY-2uZiE38290320.jpg', 'highlights': ["The recall for the non-diabetic class is very high at around 90%, emphasizing the model's strong performance in identifying non-diabetic cases, indicating the need for specific focus on improving the model's performance in classifying diabetic cases.", "The overall accuracy of the logistic regression model is 77%, with 46 out of 84 diabetic cases and 132 out of 147 non-diabetic cases being correctly classified, indicating the model's decent performance.", 'Upsampling generates synthetic data using k nearest neighbors to address bias error and bring both classes to the same level, typically 50-50.', "The recall for the diabetic class is only 55%, implying that the model's performance in identifying diabetic cases is not significantly better than random guessing, highlighting the need for improvement in classifying diabetic cases.", 'Data scaling using z-scores to standardize the data by calculating xi minus x bar by standard deviation for every column and its impact on model scores.', 'The need to address outliers and missing values in the dataset is emphasized as a crucial step in data analysis.', 'Strategies for handling skewed class distributions and their impact on model performance.', 'The objective is to find the relationship between independent and target variables.', "Using Python's isReal function to identify non-numerical values in the dataset.", 'Addressing missing values in blood pressure and plasma.']}], 'highlights': ['The course covers the basic foundation of problem solving using statistics and implementing it in Python, taught by Dr. Abhinanda Sarkar, a Stanford PhD holder and experienced professional.', 'Linear regression can be used for descriptive, predictive, and prescriptive purposes in data analysis, providing insights into the relationships between variables and their impact (e.g., predicting outcomes and identifying behavioral changes).', 'The statistical approach in data science emphasizes formulating a problem and then obtaining data to solve it, whereas the machine learning approach focuses on analyzing available data to derive insights.', 'Python libraries like pandas, numpy, and seaborn are used for data analysis and visualization, with pandas offering a fair amount of statistics built-in and numpy being more suitable for mathematical problems.', 'The challenges of creating prescriptions to meet various requirements, such as autonomous vehicle rules, and the complexity of descriptive analytics in healthcare.', 'Right skewed data means more variation on the right side, often measured using mean minus median, giving a positive value for right skewness.', "The process of training machine learning algorithms involves providing 'training data' to teach the algorithm the correct answers, known as 'ground truth'.", 'Covariance measures relationship between variables, used in finance and employee attrition.', 'Descriptive statistics summarize variables for visualization and reporting.', 'Lists are mutable, allowing adding, removing, and modifying elements.', 'Pandas provides a multidimensional data structure for data manipulation, essential for data science operations.', 'Creation of Pandas series object from a list and dictionary', "The chapter covers creating bar plots, scatter plots, and histograms using Python's matplotlib library.", 'Linear regression represents the relationship between variables as a straight line', 'Ensemble techniques leverage wisdom of the crowd', 'High-end cars use embedded chips for real-time driving data analysis.', 'Logistic regression aims to find the best fit line or plane in the mathematics space using the linear model, followed by the sigmoid transformation to form an S curve.', "The recall for the non-diabetic class is very high at around 90%, emphasizing the model's strong performance in identifying non-diabetic cases, indicating the need for specific focus on improving the model's performance in classifying diabetic cases.", "The overall accuracy of the logistic regression model is 77%, with 46 out of 84 diabetic cases and 132 out of 147 non-diabetic cases being correctly classified, indicating the model's decent performance.", 'Upsampling generates synthetic data using k nearest neighbors to address bias error and bring both classes to the same level, typically 50-50.']}