title

Python for Data Science | Data Science with Python | Python for Data Analysis | 11 Hours Full Course

description

π₯1000+ Free Courses With Free Certificates: https://www.mygreatlearning.com/academy?ambassador_code=GLYT_DES_Top_SEP22&utm_source=GLYT&utm_campaign=GLYT_DES_Top_SEP22
π₯Build a career in Data Science & Business Analytics: https://www.mygreatlearning.com/pg-program-data-science-and-business-analytics-course?ambassador_code=GLYT_DES_Middle_SEP22&utm_source=GLYT&utm_campaign=GLYT_DES_Middle_SEP53
Hey Folks! Watch this 10-hour tutorial on Python For Data Science! Python is one of the most famous programming languages globally and is favored in data science because of its simple and flexible syntax, its ability to work with multiple paradigms, and the ease with which it combines with other software components. These factors have made Python the most popular programming with data science, which is why so many people want to learn it today. Great Learning brings you Python for data science full course to help you understand everything you need to know about this topic and get started on the journey to learn about it well.
Data science with python starts focuses on an introduction to Python and Python IDEs, followed by looking at data structures in Python. Then we look at object-oriented programming.
Following this, in python for data science tutorial will get to know more about Python libraries:
β NumPy
β Pandas
β Matplotlib
β Seaborn
β Linear and Logistic regression
β Time-series
This video teaches Python For Data Science and its key functions and concepts with a variety of demonstrations & examples to help you get started on the right foot.
π Topics Covered:
Introduction - 0:00
1. Basics of Python - 5:55
2. Python Data Structures - 21:11
3. Flow Control Statements in Python - 33:18
4. Object-Oriented Programming in Python - 55:20
5. Numerical Computing with Numpy - 1:10:03
6. Data Manipulation with Pandas - 1:28:48
7. Data Visualization with Matplotlib - 1:51:45
8. Linear Regression Algorithm - 2:15:28
9. Logistic Regression Algorithm - 5:31:45
10. Naive Bayes Algorithm - 6:55:05
11. K-means clustering - 8:29:25
12. Hierarchical Clustering - 10:34:55
π₯Check Our Free Courses with free certificate:
πIntroduction to Data Science: https://www.mygreatlearning.com/academy/learn-for-free/courses/introduction-to-data-science?ambassador_code=GLYT_DES_Middle_SEP22&utm_source=GLYT&utm_campaign=GLYT_DES_Middle_SEP22
π Data Science Foundations: https://www.mygreatlearning.com/academy/learn-for-free/courses/data-science-foundations?ambassador_code=GLYT_DES_Middle_SEP22&utm_source=GLYT&utm_campaign=GLYT_DES_Middle_SEP22
π Career in Data Science: https://www.mygreatlearning.com/academy/learn-for-free/courses/career-in-data-science?ambassador_code=GLYT_DES_Middle_SEP22&utm_source=GLYT&utm_campaign=GLYT_DES_Middle_SEP22
π R for Data Science: https://www.mygreatlearning.com/academy/learn-for-free/courses/r-for-data-science?ambassador_code=GLYT_DES_Middle_SEP22&utm_source=GLYT&utm_campaign=GLYT_DES_Middle_SEP22
π Data Science Mathematics: https://www.mygreatlearning.com/academy/learn-for-free/courses/data-science-mathematics?ambassador_code=GLYT_DES_Middle_SEP22&utm_source=GLYT&utm_campaign=GLYT_DES_Middle_SEP22
Here is a list of our other full course videos, you can check out:
β Data Science Tutorial: https://www.youtube.com/watch?v=u2zsY-2uZiE&t=680s
β Python for Data Science: https://www.youtube.com/watch?v=edvg4eHi_Mw&t=15700s
β Machine Learning with Python: https://www.youtube.com/watch?v=RnFGwxJwx-0&t=8732s
β Statistics for Data Science: https://www.youtube.com/watch?v=Vfo5le26IhY&t=189s
β‘ About Great Learning Academy:
Visit Great Learning Academy to get access to 1000+ free courses with free certificate on Data Science, Data Analytics, Digital Marketing, Artificial Intelligence, Big Data, Cloud, Management, Cybersecurity, Software Development, and many more. These are supplemented with free projects, assignments, datasets, quizzes. You can earn a certificate of completion at the end of the course for free.
β‘ About Great Learning:
With more than 5.4 Million+ learners in 170+ countries, Great Learning, a part of the BYJU'S group, is a leading global edtech company for professional and higher education offering industry-relevant programs in the blended, classroom, and purely online modes across technology, data and business domains. These programs are developed in collaboration with the top institutions like Stanford Executive Education, MIT Professional Education, The University of Texas at Austin, NUS, IIT Madras, IIT Bombay & more.
SOCIAL MEDIA LINKS:
πΉ For more interesting tutorials, don't forget to subscribe to our channel: https://glacad.me/YTsubscribe
πΉ For more updates on courses and tips follow us on:
β
Telegram: https://t.me/GreatLearningAcademy
β
Facebook: https://www.facebook.com/GreatLearningOfficial/
β
LinkedIn: https://www.linkedin.com/school/great-learning/mycompany/verification/
β
Follow our Blog: https://glacad.me/GL_Blog
#Python #DataScience #GreatLearning

detail

{'title': 'Python for Data Science | Data Science with Python | Python for Data Analysis | 11 Hours Full Course', 'heatmap': [{'end': 6298.392, 'start': 5903.028, 'weight': 0.712}, {'end': 7874.758, 'start': 7471.198, 'weight': 1}], 'summary': 'This 11-hour python for data science course covers python fundamentals, data structures, control flow, object-oriented programming, numpy, pandas, data visualization, linear regression, statistical models, regression model evaluation, linear and non-linear relationships, data analysis, regression and classification models, logistic regression, class imbalance, bayesian probability, flight delay analysis, k nearest neighbor, model tuning, wine classification, clustering for unsupervised learning, clustering considerations, customer segment clustering, machine learning for car models, interpreting clusters, and linkage methods.', 'chapters': [{'end': 101.177, 'segs': [{'end': 64.357, 'src': 'embed', 'start': 0.549, 'weight': 0, 'content': [{'end': 7.172, 'text': 'Data science is the hottest job of the 21st century and Python is the hottest programming language of the 21st century.', 'start': 0.549, 'duration': 6.623}, {'end': 17.316, 'text': 'And that is why the average salary of a data scientist is around $120,000 and the average salary of a Python developer is $100,000.', 'start': 7.652, 'duration': 9.664}, {'end': 24.88, 'text': "So it's pretty obvious that anyone who has skills in both data science and Python would be in great demand in the industry.", 'start': 17.317, 'duration': 7.563}, {'end': 29.102, 'text': 'And that is why we have come up with this full course on Python for data science.', 'start': 25.66, 'duration': 3.442}, {'end': 31.446, 'text': 'Now, before we go ahead,', 'start': 30.185, 'duration': 1.261}, {'end': 38.671, 'text': "I'd like to inform you that we'll be coming up with a series of high quality tutorials on artificial intelligence and computer vision.", 'start': 31.446, 'duration': 7.225}, {'end': 45.836, 'text': "So please do subscribe to Create Learning's YouTube channel and click on the bell icon so that you have a notification of our upcoming videos.", 'start': 39.112, 'duration': 6.724}, {'end': 48.238, 'text': "Now let's have a quick glance at the agenda.", 'start': 46.537, 'duration': 1.701}, {'end': 54.423, 'text': "We'll start off by working with the basics of Python, such as variables, data types and operators.", 'start': 48.859, 'duration': 5.564}, {'end': 60.114, 'text': "Then we'll go through Python's data structures, which are tuple, list, dictionary, and set.", 'start': 55.41, 'duration': 4.704}, {'end': 63.456, 'text': "After that, we'll learn how to do numerical computing with NumPy.", 'start': 60.714, 'duration': 2.742}, {'end': 64.357, 'text': 'Going ahead.', 'start': 63.917, 'duration': 0.44}], 'summary': 'Data science and python are high-demand skills with average salaries of $120,000 and $100,000 respectively. course covers python basics, data structures, and numerical computing with numpy.', 'duration': 63.808, 'max_score': 0.549, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw549.jpg'}], 'start': 0.549, 'title': 'Python for data science course', 'summary': 'Discusses the high demand for data science and python skills, with average salaries of $120,000 for data scientists and $100,000 for python developers. it introduces a full course on python for data science and upcoming tutorials on artificial intelligence and computer vision.', 'chapters': [{'end': 101.177, 'start': 0.549, 'title': 'Python for data science course', 'summary': 'Discusses the high demand for skills in data science and python, with average salaries of $120,000 for data scientists and $100,000 for python developers, introducing a full course on python for data science and upcoming tutorials on artificial intelligence and computer vision.', 'duration': 100.628, 'highlights': ['The average salary of a data scientist is around $120,000, and the average salary of a Python developer is $100,000, indicating the high demand for skills in data science and Python.', "Upcoming tutorials on artificial intelligence and computer vision will be released, encouraging viewers to subscribe to Create Learning's YouTube channel and click on the bell icon for notifications.", 'The agenda for the Python for data science course includes working with the basics of Python, data structures, numerical computing with NumPy, data manipulation with the Pandas library, data visualization with the Matplotlib library, and machine learning algorithms such as linear regression, logistic regression, and Naive Bayes, as well as understanding unsupervised learning and clustering.', 'The chapter emphasizes the importance of having skills in both data science and Python due to the high demand in the industry.']}], 'duration': 100.628, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw549.jpg', 'highlights': ['The average salary of a data scientist is around $120,000, and the average salary of a Python developer is $100,000, indicating the high demand for skills in data science and Python.', 'The agenda for the Python for data science course includes working with the basics of Python, data structures, numerical computing with NumPy, data manipulation with the Pandas library, data visualization with the Matplotlib library, and machine learning algorithms such as linear regression, logistic regression, and Naive Bayes, as well as understanding unsupervised learning and clustering.', 'The chapter emphasizes the importance of having skills in both data science and Python due to the high demand in the industry.', "Upcoming tutorials on artificial intelligence and computer vision will be released, encouraging viewers to subscribe to Create Learning's YouTube channel and click on the bell icon for notifications."]}, {'end': 1264.927, 'segs': [{'end': 221.555, 'src': 'embed', 'start': 194.771, 'weight': 0, 'content': [{'end': 198.753, 'text': "And again, since I'm using a window system, so this is for the window system over here.", 'start': 194.771, 'duration': 3.982}, {'end': 199.593, 'text': 'All right.', 'start': 199.313, 'duration': 0.28}, {'end': 205.802, 'text': 'Now Anaconda is a Python distribution, which basically provides you all of the packages inbuilt.', 'start': 200.538, 'duration': 5.264}, {'end': 212.868, 'text': 'So you have packages such as Matplotlib for visualization, Pandas for data manipulation and NumPy for numerical computing.', 'start': 206.263, 'duration': 6.605}, {'end': 215.53, 'text': "So you don't have to manually install all of these packages.", 'start': 212.888, 'duration': 2.642}, {'end': 221.555, 'text': 'So when you actually install Anaconda, all of these packages are actually pre-installed in Anaconda.', 'start': 215.85, 'duration': 5.705}], 'summary': 'Anaconda is a python distribution with pre-installed packages for visualization, data manipulation, and numerical computing.', 'duration': 26.784, 'max_score': 194.771, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw194771.jpg'}, {'end': 762.167, 'src': 'embed', 'start': 733.683, 'weight': 1, 'content': [{'end': 737.447, 'text': "So I'll just type in A plus B and you get the value of 30.", 'start': 733.683, 'duration': 3.764}, {'end': 738.708, 'text': 'Well, this is quite simple.', 'start': 737.447, 'duration': 1.261}, {'end': 745.453, 'text': "So this is the basic addition operation which you've been doing since your kindergarten, right? And also we'll be doing some similar operations.", 'start': 738.728, 'duration': 6.725}, {'end': 748.556, 'text': "And then I'll go ahead and perform A minus B.", 'start': 745.934, 'duration': 2.622}, {'end': 754.761, 'text': "So what do you think is the answer? Well, it's minus 10 because when you subtract 20 from 10, you'll get a value of minus 10.", 'start': 748.556, 'duration': 6.205}, {'end': 756.343, 'text': "Well, let's also multiply these two numbers.", 'start': 754.761, 'duration': 1.582}, {'end': 758.184, 'text': "So I'll just type in A cross B.", 'start': 756.463, 'duration': 1.721}, {'end': 762.167, 'text': 'so that would be 10 into 20, which will give you 200.', 'start': 758.965, 'duration': 3.202}], 'summary': 'Basic arithmetic operations: a + b = 30, a - b = -10, a x b = 200.', 'duration': 28.484, 'max_score': 733.683, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw733683.jpg'}, {'end': 834.887, 'src': 'embed', 'start': 809.408, 'weight': 2, 'content': [{'end': 815.252, 'text': "So this is where we'll have relational operators to find out what is the exact relation between two operands.", 'start': 809.408, 'duration': 5.844}, {'end': 820.056, 'text': "right. so again i'll just type in a equals 10 and b equals 20.", 'start': 815.772, 'duration': 4.284}, {'end': 822.698, 'text': "over here i'll click on run right.", 'start': 820.056, 'duration': 2.642}, {'end': 824.919, 'text': "so i'll start with the less than operator.", 'start': 822.698, 'duration': 2.221}, {'end': 827.482, 'text': "so i'll type in a less than b.", 'start': 824.919, 'duration': 2.563}, {'end': 832.526, 'text': "so over here i'm just checking if the value of a is actually less than the value of b.", 'start': 827.482, 'duration': 5.044}, {'end': 834.887, 'text': 'so we get a value of true.', 'start': 832.526, 'duration': 2.361}], 'summary': 'Demonstrating relational operators to compare values, a < b yields true.', 'duration': 25.479, 'max_score': 809.408, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw809408.jpg'}, {'end': 1120.716, 'src': 'embed', 'start': 1093.179, 'weight': 3, 'content': [{'end': 1098.641, 'text': "so let's say i'd want to extract the first character, which is j, from this entire string.", 'start': 1093.179, 'duration': 5.462}, {'end': 1101.703, 'text': 'Now, when it comes to Python, the indexing starts from 0..', 'start': 1099.161, 'duration': 2.542}, {'end': 1105.605, 'text': 'So j is basically present at index number 0.', 'start': 1101.703, 'duration': 3.902}, {'end': 1113.01, 'text': "So if I want to extract j, I'll type in a1, I'll put in square braces and then I'll type in 0.", 'start': 1105.605, 'duration': 7.405}, {'end': 1115.452, 'text': 'And I have successfully extracted the first letter.', 'start': 1113.01, 'duration': 2.442}, {'end': 1120.716, 'text': 'And similarly, if I want to extract the last character from this entire string.', 'start': 1115.872, 'duration': 4.844}], 'summary': 'Extract first and last characters from a string using python indexing.', 'duration': 27.537, 'max_score': 1093.179, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw1093179.jpg'}], 'start': 110.458, 'title': 'Python fundamentals', 'summary': 'Covers python installation, programming basics, variables, data types, operations, and key tools for python on windows, including demonstrations and examples. it also includes the concept of variables, different data types, arithmetic operators, and inbuilt functions for strings.', 'chapters': [{'end': 362.908, 'start': 110.458, 'title': 'Python installation and programming basics', 'summary': 'Covers the installation of python, pycharm, anaconda, and jupyter notebook on windows, including the process and key tools for python programming, as well as the demonstration of running a simple python program in jupyter notebook.', 'duration': 252.45, 'highlights': ['The chapter covers the installation of Python, PyCharm, Anaconda, and Jupyter Notebook on Windows, including the process and key tools for Python programming. It details the installation process of Python, PyCharm, Anaconda, and Jupyter Notebook, including the relevance to Python programming and the specific tools provided by each platform.', 'Demonstration of running a simple Python program in Jupyter Notebook. It demonstrates running a simple Python program using the print statement in Jupyter Notebook, showcasing the ease of running Python code and the initial step towards programming in Python.', "Anaconda provides pre-installed packages such as Matplotlib, Pandas, and NumPy for data manipulation and visualization, eliminating the need for manual installation. Anaconda is highlighted for providing pre-installed packages like Matplotlib, Pandas, and NumPy for data manipulation and visualization, streamlining the installation process and enhancing Python's capabilities."]}, {'end': 771.755, 'start': 363.248, 'title': 'Variables and data types in python', 'summary': 'Explains the concept of variables in python as temporary storage spaces for data, and the different data types including integers, floating point numbers, boolean values, and strings. it also covers working with arithmetic operators for addition, subtraction, multiplication, and division with examples.', 'duration': 408.507, 'highlights': ['Variables are temporary storage spaces where you can store data or values, allowing for changeable values inside it. Explains the concept of variables as temporary storage spaces for data, allowing for changeable values inside it.', 'Different data types in Python include integers, floating point numbers, boolean values, and strings, each with specific examples and explanations. Provides examples and explanations for different data types in Python, including integers, floating point numbers, boolean values, and strings.', 'Demonstrates working with arithmetic operators for addition, subtraction, multiplication, and division with specific examples and results. Demonstrates working with arithmetic operators for addition, subtraction, multiplication, and division with specific examples and results.']}, {'end': 1264.927, 'start': 771.755, 'title': 'Python data types & operations', 'summary': 'Covers basic arithmetic, relational, and logical operators in python, as well as working with strings, including examples and inbuilt functions, highlighting key concepts and quantifiable data.', 'duration': 493.172, 'highlights': ["Explaining relational operators including less than, greater than, equal to, and not equal to, with examples of their usage and results. Covered different relational operators and their outcomes with specific values, such as '10 < 20' resulting in 'True' and '10 == 20' resulting in 'False'.", "Demonstrating the functionality of logical operators 'AND' and 'OR' with examples and their corresponding results. Provided examples for 'AND' and 'OR' logical operators with specific values, showing the behavior of 'AND' and 'OR' operations on true and false values.", "Detailed explanation of Python strings, including indexing, extracting characters, and utilizing inbuilt functions such as 'LEN', 'upper', 'lower', and 'title'. Explained the concept of indexing, character extraction, and utilization of inbuilt string functions like 'LEN', 'upper', 'lower', and 'title', while demonstrating practical examples."]}], 'duration': 1154.469, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw110458.jpg', 'highlights': ["Anaconda provides pre-installed packages like Matplotlib, Pandas, and NumPy for data manipulation and visualization, streamlining the installation process and enhancing Python's capabilities.", 'Demonstrates working with arithmetic operators for addition, subtraction, multiplication, and division with specific examples and results.', 'Explaining relational operators including less than, greater than, equal to, and not equal to, with examples of their usage and results.', "Detailed explanation of Python strings, including indexing, extracting characters, and utilizing inbuilt functions such as 'LEN', 'upper', 'lower', and 'title'."]}, {'end': 3304.977, 'segs': [{'end': 1293.88, 'src': 'embed', 'start': 1265.888, 'weight': 0, 'content': [{'end': 1269.191, 'text': 'Right So these were some basic functions when it came to strings.', 'start': 1265.888, 'duration': 3.303}, {'end': 1274.343, 'text': "now let's go ahead and work with some non-primitive data structures in python.", 'start': 1270.219, 'duration': 4.124}, {'end': 1279.427, 'text': 'so the main non-primitive data structures in python are tuple list, dictionary and set.', 'start': 1274.343, 'duration': 5.084}, {'end': 1286.713, 'text': 'so when it comes to all of these data structures, so you can store multiple elements inside one data structure.', 'start': 1279.427, 'duration': 7.286}, {'end': 1293.88, 'text': "so till now we've worked with only variables and when it came to variables we could store only one value in one variable.", 'start': 1286.713, 'duration': 7.167}], 'summary': 'Basic functions for strings in python, followed by non-primitive data structures such as tuple, list, dictionary, and set.', 'duration': 27.992, 'max_score': 1265.888, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw1265888.jpg'}, {'end': 3129.816, 'src': 'embed', 'start': 3089.144, 'weight': 2, 'content': [{'end': 3098.377, 'text': "and let's say, if you want to deposit money around 100 times, then you'd have to write this 100 lines of code 100 times now instead of this.", 'start': 3089.144, 'duration': 9.233}, {'end': 3105.141, 'text': 'it would be wonderful if we just had one deposit function and all we did was invoke this deposit function and we could deposit our money.', 'start': 3098.377, 'duration': 6.764}, {'end': 3106.302, 'text': 'happily right.', 'start': 3105.141, 'duration': 1.161}, {'end': 3108.243, 'text': 'so this is where functions come in.', 'start': 3106.302, 'duration': 1.941}, {'end': 3110.685, 'text': 'similarly, when it comes to the withdraw function,', 'start': 3108.243, 'duration': 2.442}, {'end': 3117.77, 'text': "we'll just have to add one function where we'll write in all the bunch of code only once and whenever we want to withdraw money,", 'start': 3110.685, 'duration': 7.085}, {'end': 3119.711, 'text': "we'll just invoke this function.", 'start': 3117.77, 'duration': 1.941}, {'end': 3125.475, 'text': "similarly, whenever we want to check the balance, i'll just invoke the balance function and i can happily check the balance.", 'start': 3119.711, 'duration': 5.764}, {'end': 3129.816, 'text': "so let's go to jupyter notebook and work with functions.", 'start': 3126.155, 'duration': 3.661}], 'summary': 'Functions simplify coding for repetitive tasks, like depositing, withdrawing, and checking balances, improving efficiency.', 'duration': 40.672, 'max_score': 3089.144, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw3089144.jpg'}, {'end': 3185.248, 'src': 'embed', 'start': 3155.265, 'weight': 4, 'content': [{'end': 3158.386, 'text': 'so hello world.', 'start': 3155.265, 'duration': 3.121}, {'end': 3165.63, 'text': 'right now, if i have to invoke this function, i just have to type in hello, with this parenthesis all right.', 'start': 3158.386, 'duration': 7.244}, {'end': 3170.592, 'text': 'so if i want to print out hello world, all i have to do is copy this, paste it over here.', 'start': 3165.63, 'duration': 4.962}, {'end': 3176.755, 'text': 'so all i have to do is invoke this function, and then i can happily print hello world how many times i want.', 'start': 3170.592, 'duration': 6.163}, {'end': 3185.248, 'text': "Now, after this, what I'll do is I will create a function where I'm taking an input value and adding 10 more to it.", 'start': 3177.685, 'duration': 7.563}], 'summary': "Demonstrating how to invoke a function to print 'hello world' and create a function to add 10 to an input value.", 'duration': 29.983, 'max_score': 3155.265, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw3155265.jpg'}], 'start': 1265.888, 'title': 'Python data structures and control flow', 'summary': 'Introduces non-primitive data structures in python, covering tuples, lists, dictionaries, and sets, and provides examples including storing names of 10,000 music concert attendees. it also delves into working with python data structures, flow control statements, if-else statements, and loops, with detailed examples and explanations. additionally, it discusses the concept of functions in python, showcasing their role in simplifying code structure and demonstrating their creation and invocation using various examples.', 'chapters': [{'end': 1547.431, 'start': 1265.888, 'title': 'Non-primitive data structures in python', 'summary': 'Introduces non-primitive data structures in python, including tuple, list, dictionary, and set, highlighting the ability to store multiple elements in a single data structure to avoid the limitation of storing only one value in a variable, with an example of storing names of 10,000 music concert attendees. it explains the immutability of tuples, demonstrates accessing and modifying elements in a tuple, and contrasts the mutability of lists with tuples.', 'duration': 281.543, 'highlights': ['The chapter introduces non-primitive data structures in Python, including tuple, list, dictionary, and set, highlighting the ability to store multiple elements in a single data structure to avoid the limitation of storing only one value in a variable, with an example of storing names of 10,000 music concert attendees.', 'It explains the immutability of tuples, demonstrating that once created, the elements within a tuple cannot be changed or modified, and new elements cannot be added.', 'The chapter demonstrates accessing individual elements from a tuple and the use of square braces and index numbers to extract specific elements.', 'It contrasts the mutability of lists with the immutability of tuples, explaining that lists can be modified and new elements can be added once they are created.']}, {'end': 2035.291, 'start': 1547.632, 'title': 'Working with python data structures', 'summary': 'Covers working with python data structures, including lists, tuples, dictionaries, and sets, providing examples and operations for each data structure, and then delves into flow control statements, specifically the if-else statement for decision-making based on conditions.', 'duration': 487.659, 'highlights': ['The chapter covers working with Python data structures, including lists, tuples, dictionaries, and sets It provides examples and operations for each data structure, showcasing how to create, access, modify, and work with lists, tuples, dictionaries, and sets.', 'The chapter delves into flow control statements, specifically the if-else statement for decision-making based on conditions It explains the if-else statement using a real-world example of making a decision based on a condition, such as playing football based on the weather, and how it can be represented in programming.']}, {'end': 2430.255, 'start': 2035.291, 'title': 'Working with if else in python', 'summary': 'Covers the usage of if-else statements in python, including examples of comparing variables, checking multiple conditions, working with tuples, lists, and dictionaries, with a focus on evaluating conditions and executing corresponding statements.', 'duration': 394.964, 'highlights': ['Explained the usage of if-else statements in Python with examples of comparing variables, checking multiple conditions, working with tuples, lists, and dictionaries. Examples covering the usage of if-else statements in Python, including comparing variables, checking multiple conditions, and working with tuples, lists, and dictionaries.', 'Demonstrated the comparison of variables using if-else statements, with examples of evaluating conditions and executing corresponding statements. Demonstration of comparing variables using if-else statements, including evaluating conditions and executing corresponding statements.', 'Illustrated the usage of if statements with tuples, lists, and dictionaries, including examples of evaluating conditions and executing corresponding statements. Illustration of using if statements with tuples, lists, and dictionaries, including examples of evaluating conditions and executing corresponding statements.']}, {'end': 3049.204, 'start': 2430.255, 'title': 'Working with loops in python', 'summary': 'Covers the usage of if statements to increment values, while loop to print the 2 table, and a nested for loop to iterate through two separate lists with detailed explanations and examples.', 'duration': 618.949, 'highlights': ['Usage of nested for loops to iterate through two separate lists Understand the implementation of nested for loops to iterate through two separate lists, with a detailed explanation and example.', 'Explanation of the while loop to print the 2 table A detailed explanation of using a while loop to print the 2 table, with step-by-step evaluation and resulting values.', 'Demonstration of using if statements to increment values Demonstration of using if statements to increment values, with a detailed example and explanation.']}, {'end': 3304.977, 'start': 3049.764, 'title': 'Working with functions in python', 'summary': 'Discusses the concept of functions in python, highlighting their role in simplifying code structure by encapsulating tasks, and demonstrates the creation and invocation of functions using examples of deposit, withdrawal, checking balance, and mathematical operations, showcasing the simplification and reusability of code.', 'duration': 255.213, 'highlights': ['Functions encapsulate code to simplify and reuse tasks, such as depositing money, withdrawing money, and checking balance in an ATM, reducing redundancy and improving code structure and reusability. Functions in Python encapsulate tasks like depositing money, withdrawing money, and checking balance, simplifying code structure and reducing redundancy.', 'Demonstrates the creation and invocation of functions using examples of deposit, withdrawal, checking balance, and mathematical operations, showcasing the simplification and reusability of code. The transcript demonstrates the creation and invocation of functions using examples of deposit, withdrawal, checking balance, and mathematical operations, showcasing the simplification and reusability of code.', "Illustrates the creation of a simple function 'hello' that prints 'hello world' and demonstrates its invocation, showcasing the simplicity of function usage. The transcript illustrates the creation of a simple function 'hello' that prints 'hello world' and demonstrates its invocation, showcasing the simplicity of function usage.", "Shows the creation of a function 'add10' to add 10 to an input value, and demonstrates its invocation with different input values, showcasing the parameterized behavior of functions. The transcript shows the creation of a function 'add10' to add 10 to an input value, and demonstrates its invocation with different input values, showcasing the parameterized behavior of functions.", "Demonstrates the creation of a function 'odd even' to determine if a value is even or odd, and showcases its invocation with different input values, illustrating the conditional behavior of functions. The transcript demonstrates the creation of a function 'odd even' to determine if a value is even or odd, and showcases its invocation with different input values, illustrating the conditional behavior of functions."]}], 'duration': 2039.089, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw1265888.jpg', 'highlights': ['Introduces non-primitive data structures in Python, including tuple, list, dictionary, and set, highlighting the ability to store multiple elements in a single data structure to avoid the limitation of storing only one value in a variable, with an example of storing names of 10,000 music concert attendees.', 'The chapter covers working with Python data structures, including lists, tuples, dictionaries, and sets. It provides examples and operations for each data structure, showcasing how to create, access, modify, and work with lists, tuples, dictionaries, and sets.', 'Functions in Python encapsulate tasks like depositing money, withdrawing money, and checking balance, simplifying code structure and reducing redundancy.', 'Demonstrates the creation and invocation of functions using examples of deposit, withdrawal, checking balance, and mathematical operations, showcasing the simplification and reusability of code.', "Illustrates the creation of a simple function 'hello' that prints 'hello world' and demonstrates its invocation, showcasing the simplicity of function usage."]}, {'end': 5383.755, 'segs': [{'end': 4094.897, 'src': 'embed', 'start': 4064.292, 'weight': 0, 'content': [{'end': 4068.658, 'text': 'and when it comes to object oriented programming, it pretty much means the same thing.', 'start': 4064.292, 'duration': 4.366}, {'end': 4072.242, 'text': 'so over here you will have two classes, or more than two classes.', 'start': 4068.658, 'duration': 3.584}, {'end': 4076.188, 'text': 'one class inherits the features or the properties of another class,', 'start': 4072.242, 'duration': 3.946}, {'end': 4079.812, 'text': 'and that is pretty much the concept of inheritance and object oriented programming.', 'start': 4076.188, 'duration': 3.624}, {'end': 4083.414, 'text': "So let's again head back to Jupyter Notebook and start with inheritance.", 'start': 4080.433, 'duration': 2.981}, {'end': 4088.435, 'text': "So to show inheritance over here, I'll just create an extension of the original class.", 'start': 4084.054, 'duration': 4.381}, {'end': 4094.897, 'text': "So what I'll do is I will create a new class and I'll name this class as iPhone.", 'start': 4088.455, 'duration': 6.442}], 'summary': "In object-oriented programming, inheritance involves classes inheriting features from one another, illustrated by creating a new class named 'iphone'.", 'duration': 30.605, 'max_score': 4064.292, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw4064292.jpg'}, {'end': 4137.148, 'src': 'embed', 'start': 4108.098, 'weight': 2, 'content': [{'end': 4114.04, 'text': 'now this would have one extra method, which is cure cancer.', 'start': 4108.098, 'duration': 5.942}, {'end': 4119.441, 'text': "now, as you know, iphone is very famous, and it's famous because it can actually cure cancer.", 'start': 4114.04, 'duration': 5.401}, {'end': 4121.282, 'text': 'so that is why it costs so much.', 'start': 4119.441, 'duration': 1.841}, {'end': 4123.082, 'text': "so that is amazing, isn't it?", 'start': 4121.282, 'duration': 1.8}, {'end': 4124.363, 'text': 'so cure cancer.', 'start': 4123.082, 'duration': 1.281}, {'end': 4132.425, 'text': "i'll give in the self attribute and then i'll just print out i can cure cancer.", 'start': 4124.363, 'duration': 8.062}, {'end': 4135.644, 'text': 'So I have created this new class.', 'start': 4134.099, 'duration': 1.545}, {'end': 4137.148, 'text': 'Now let me create an instance of this.', 'start': 4135.724, 'duration': 1.424}], 'summary': 'The iphone is famous for curing cancer, making it expensive.', 'duration': 29.05, 'max_score': 4108.098, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw4108098.jpg'}, {'end': 4249.424, 'src': 'embed', 'start': 4217.565, 'weight': 1, 'content': [{'end': 4219.325, 'text': "so that's too much of information.", 'start': 4217.565, 'duration': 1.76}, {'end': 4223.326, 'text': 'so simply put, numpy basically has a multi-dimensional array.', 'start': 4219.325, 'duration': 4.001}, {'end': 4232.132, 'text': 'now to process those multi-dimensional arrays, you have certain functions and you have certain operations pre-built in the numpy package.', 'start': 4223.846, 'duration': 8.286}, {'end': 4235.855, 'text': 'and that is how you can work with these numpy multi-dimensional arrays.', 'start': 4232.132, 'duration': 3.723}, {'end': 4240.538, 'text': 'so you can perform all sorts of numerical and scientific operations on this numpy array.', 'start': 4235.855, 'duration': 4.683}, {'end': 4245.162, 'text': "so let's go to jupyter notebook and work with this very famous package called as numpy.", 'start': 4240.538, 'duration': 4.624}, {'end': 4249.424, 'text': "So to start working with the NumPy library, you'd have to first import it.", 'start': 4245.882, 'duration': 3.542}], 'summary': 'Numpy allows working with multi-dimensional arrays, enabling various numerical and scientific operations.', 'duration': 31.859, 'max_score': 4217.565, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw4217565.jpg'}, {'end': 4801.307, 'src': 'embed', 'start': 4775.372, 'weight': 3, 'content': [{'end': 4779.839, 'text': 'and since 300 is not inclusive, that is why you will not have that number over there.', 'start': 4775.372, 'duration': 4.467}, {'end': 4784.181, 'text': "now let's see how can we initialize our numpy array with random numbers.", 'start': 4780.299, 'duration': 3.882}, {'end': 4793.104, 'text': "so if you want to initialize our numpy array with random integers, then we've got this rand int sub method from the random method.", 'start': 4784.181, 'duration': 8.923}, {'end': 4798.706, 'text': 'so again you will set the range of the numbers from which you want the random numbers over here right.', 'start': 4793.104, 'duration': 5.602}, {'end': 4801.307, 'text': 'the range of the numbers is 1 and 100.', 'start': 4798.706, 'duration': 2.601}], 'summary': 'Initialize numpy array with random integers within the range of 1 to 100.', 'duration': 25.935, 'max_score': 4775.372, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw4775372.jpg'}, {'end': 4999.097, 'src': 'embed', 'start': 4970.625, 'weight': 4, 'content': [{'end': 4975.449, 'text': 'once we know how to initialize our numpy arrays and how to check the shape of our numpy arrays.', 'start': 4970.625, 'duration': 4.824}, {'end': 4979.572, 'text': "let's go ahead and perform some simple mathematics on top of these numpy arrays.", 'start': 4975.449, 'duration': 4.123}, {'end': 4983.514, 'text': "so we'll look at some basic addition operations.", 'start': 4980.353, 'duration': 3.161}, {'end': 4989.015, 'text': 'so to add to numpy arrays we will just use the sum function from the numpy package.', 'start': 4983.514, 'duration': 5.501}, {'end': 4999.097, 'text': "so over here i am creating two numpy arrays, n1 and n2, and when i use the sum method over here i'll pass in n1 and n2 as a list right.", 'start': 4989.015, 'duration': 10.082}], 'summary': 'Learn to initialize, check shape, and perform addition on numpy arrays.', 'duration': 28.472, 'max_score': 4970.625, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw4970625.jpg'}], 'start': 3304.977, 'title': 'Object oriented programming and numpy in python', 'summary': 'Introduces object oriented programming in python, covering classes, objects, and inheritance with examples. it also introduces numpy library for numeric and scientific computing, covering array creation, initialization, manipulation, and basic mathematical operations with demonstrations.', 'chapters': [{'end': 3684.942, 'start': 3304.977, 'title': 'Introduction to object oriented programming', 'summary': 'Introduces the concept of object oriented programming in python, explaining the concepts of classes and objects, with an example of creating a class for a phone, and creating and invoking its methods using instances.', 'duration': 379.965, 'highlights': ['The chapter introduces the concept of object oriented programming, explaining the concepts of classes and objects in Python.', 'The properties and behavior of real world objects are represented using classes in object oriented programming.', 'Objects are specific instances of a class, inheriting the properties and behavior defined in the class template.', 'The example demonstrates creating a class for a phone, defining methods for making phone calls and playing games, and creating an object of the class to invoke the methods.']}, {'end': 4192.738, 'start': 3684.942, 'title': 'Python class inheritance & methods', 'summary': "Discusses creating a python class to define methods for adding color and cost to a phone, demonstrating the use of 'self' attribute, and then extends to inheritance with an iphone class inheriting from the base class 'phone' and introducing a new method 'cure cancer' with a demonstration of method invocation and inheritance.", 'duration': 507.796, 'highlights': ["Demonstrating the use of 'self' attribute and adding methods to a Python class for adding color and cost to a phone The transcript explains the process of adding methods to a Python class for adding color and cost to a phone, emphasizing the use of the 'self' attribute and demonstrating how the 'self' attribute differs from the parameter value passed when invoking the method.", "Demonstrating method invocation and inheritance with an iPhone class inheriting from the base class 'phone' and introducing a new method 'cure cancer' The transcript demonstrates method invocation and inheritance with an iPhone class inheriting from the base class 'phone', showcasing how the iPhone class inherits the methods 'addColor' and 'addCost' from the base class and introduces a new method 'cure cancer' specific to the iPhone class.", 'Explanation of inheritance in object-oriented programming and the concept of acquiring properties from another class The transcript provides an explanation of inheritance in object-oriented programming, likening it to the concept of acquiring properties from another person or thing, and emphasizes the concept of one class inheriting the features or properties of another class.']}, {'end': 4468.856, 'start': 4192.738, 'title': 'Introduction to numpy in python', 'summary': 'Introduces numpy, the core library for numeric and scientific computing in python, with a focus on creating single and multi-dimensional arrays and its pre-built functions and operations.', 'duration': 276.118, 'highlights': ['The chapter explains NumPy as the core library for numeric and scientific computing in Python and introduces its fundamental feature of multi-dimensional arrays, enabling various numerical and scientific operations.', "It highlights the process of importing NumPy by using an alias 'NP' and emphasizes the requirement of installing NumPy if not pre-installed in the Python IDE, with a specific mention of Anaconda's pre-installed packages.", 'The chapter demonstrates the creation of a single-dimensional array using np.array and a list of values, and explains the creation of a multi-dimensional array using a list of lists, displaying the values and structure of the arrays.', 'It also provides an example of creating a single-dimensional NumPy array using a list of values and checking its type to verify as a numpy nd array, followed by creating a multi-dimensional array with two rows and three columns and displaying its structure.']}, {'end': 4970.625, 'start': 4468.856, 'title': 'Initializing numpy arrays and manipulating shapes', 'summary': 'Covers initializing numpy arrays with zeros, specific values, range, and random integers, and demonstrates how to check and change the shape of a numpy array, showcasing its flexibility and utility.', 'duration': 501.769, 'highlights': ['Initializing NumPy arrays with zeros using the np.zeros method for different dimensions, such as 1x2 and 6x6, showcasing the flexibility and utility of NumPy arrays 1x2, 6x6', 'Initializing NumPy arrays with specific values using the np.full method for dimensions like 2x2 and 3x3, displaying the ability to fill arrays with specific values 2x2, 3x3', 'Demonstrating the use of np.arange to create NumPy arrays with a specific range and demonstrating the ability to specify a skip value, showcasing the flexibility of array initialization and range specification Range from 10 to 20, range from 10 to 50 with a skip value of 5', 'Illustrating the initialization of NumPy arrays with random integers using np.random.randint, demonstrating the capability to generate random numbers within specified ranges 10 random values between 100 and 200', 'Showing how to check the shape of a NumPy array using the shape method and changing the shape of the array using the np.reshape method, highlighting the flexibility and utility of the NumPy library Original shape of 2x4, reshaped to 4x2']}, {'end': 5157.456, 'start': 4970.625, 'title': 'Performing basic mathematics on numpy arrays', 'summary': 'Covers the basic addition operations on numpy arrays, using the sum function to add elements, and demonstrates vertical and horizontal addition with axis set to 0 and 1, resulting in 100 as the complete value, vertical addition of 40 and 60, and horizontal addition of 30 and 70.', 'duration': 186.831, 'highlights': ['The chapter covers the basic addition operations on numpy arrays. It explains the use of the sum function to add elements from two numpy arrays, resulting in a complete value of 100.', 'Demonstrates vertical and horizontal addition with axis set to 0 and 1. It showcases the vertical addition of 40 and 60 when axis is set to 0 and the horizontal addition of 30 and 70 when axis is set to 1.', 'The sum function adds elements from two numpy arrays resulting in a complete value of 100. It showcases the use of the sum function to add elements from two numpy arrays, resulting in a complete value of 100.']}, {'end': 5383.755, 'start': 5157.456, 'title': 'Numpy arrays operations and pandas data structures', 'summary': 'Covers numpy array operations including vstack, headstack, and column stack for joining arrays, and introduces pandas as a core library for data manipulation and analysis, providing single and multi-dimensional data structures crucial for machine learning algorithms.', 'duration': 226.299, 'highlights': ['The chapter covers Numpy array operations including vstack, headstack, and column stack for joining arrays.', 'Pandas is introduced as a core library for data manipulation and analysis, providing single and multi-dimensional data structures crucial for machine learning algorithms.', 'Numpy provides multi-dimensional arrays, while Pandas provides multi-dimensional data structures including series and data frames, essential for data manipulation operations.', 'Machine learning algorithms such as linear regression, logistic regression, and decision trees are applied on Pandas data frames, making them extremely important for data science operations.', 'The chapter demonstrates the usage of vstack, headstack, and column stack functions with examples, showcasing the process of joining Numpy arrays row-wise, column-wise, and creating two-dimensional arrays.']}], 'duration': 2078.778, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw3304977.jpg', 'highlights': ['Introduces object oriented programming in Python, covering classes, objects, and inheritance with examples.', 'Explains NumPy as the core library for numeric and scientific computing in Python and introduces its fundamental feature of multi-dimensional arrays, enabling various numerical and scientific operations.', "Demonstrates method invocation and inheritance with an iPhone class inheriting from the base class 'phone' and introducing a new method 'cure cancer'.", 'Illustrates the initialization of NumPy arrays with random integers using np.random.randint, demonstrating the capability to generate random numbers within specified ranges.', 'The chapter covers the basic addition operations on NumPy arrays. It explains the use of the sum function to add elements from two NumPy arrays, resulting in a complete value of 100.']}, {'end': 6702.057, 'segs': [{'end': 5437.063, 'src': 'embed', 'start': 5407.277, 'weight': 4, 'content': [{'end': 5413.962, 'text': 'but the series object is a one dimensional labeled array and this is how you can create a series object.', 'start': 5407.277, 'duration': 6.685}, {'end': 5420.007, 'text': "so before we go ahead and create a series object, first we'd have to invoke the pandas library.", 'start': 5414.562, 'duration': 5.445}, {'end': 5426.293, 'text': "so we'll type in import pandas as pd, and pd again over here is just an alias for pandas.", 'start': 5420.007, 'duration': 6.286}, {'end': 5429.796, 'text': "so after we invoke pandas, we'll type in pd dot series.", 'start': 5426.293, 'duration': 3.503}, {'end': 5437.063, 'text': "so over here you'd have to keep in mind that s is capital and inside this well personal list, one, two, three, four and five,", 'start': 5429.796, 'duration': 7.267}], 'summary': 'To create a series object using pandas library, use pd.series with a list of values [1, 2, 3, 4, 5].', 'duration': 29.786, 'max_score': 5407.277, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw5407277.jpg'}, {'end': 6298.392, 'src': 'heatmap', 'start': 5903.028, 'weight': 0.712, 'content': [{'end': 5906.389, 'text': "so i've got this first five records of the data frame.", 'start': 5903.028, 'duration': 3.361}, {'end': 5908.53, 'text': "let's understand this data frame properly.", 'start': 5906.389, 'duration': 2.141}, {'end': 5914.972, 'text': 'so this has got five columns which are basically sepal length, sepal width, petal length, petal width and species.', 'start': 5908.53, 'duration': 6.442}, {'end': 5920.454, 'text': 'so these are basically the different features of the species of the iris flower.', 'start': 5914.972, 'duration': 5.482}, {'end': 5924.995, 'text': "so you've got three different species setosa, versicolor and virginica.", 'start': 5920.454, 'duration': 4.541}, {'end': 5930.097, 'text': "and for all of these three different species we've got the sepal length, sepal width, petal length and petal width.", 'start': 5924.995, 'duration': 5.102}, {'end': 5937.222, 'text': 'so, as we already saw over here, the head function helps us to have a glance at the first five records of this data frame.', 'start': 5930.757, 'duration': 6.465}, {'end': 5943.687, 'text': 'now if i want to have a glance at the first 10 records, then all i have to do is type in 10 inside this head function.', 'start': 5937.222, 'duration': 6.465}, {'end': 5946.85, 'text': "then we'll have a glance at the first 10 records of this data frame.", 'start': 5943.687, 'duration': 3.163}, {'end': 5954.977, 'text': "right, and if i want to have a glance at, let's say, the first 100 records, then i'll just type in iris dot head 100 right.", 'start': 5946.85, 'duration': 8.127}, {'end': 5962.805, 'text': 'so these are the first 100 records where the index value starts at 0 and that goes on till 99.', 'start': 5954.977, 'duration': 7.828}, {'end': 5966.728, 'text': 'now, analogous to the head function, we also have the tail function.', 'start': 5962.805, 'duration': 3.923}, {'end': 5971.472, 'text': "so i'll type in iris dot tail.", 'start': 5966.728, 'duration': 4.744}, {'end': 5976.016, 'text': 'now this tail function would give me the last five records of this data frame.', 'start': 5971.472, 'duration': 4.544}, {'end': 5981.58, 'text': 'so this starts at index number 145 and goes on till index number 149.', 'start': 5976.016, 'duration': 5.564}, {'end': 5987.345, 'text': "similarly, if i want to have a glance at the last 10 records, then i'll just pass in 10 over here, right.", 'start': 5981.58, 'duration': 5.765}, {'end': 5991.387, 'text': 'so these are the last 10 records of this iris data frame.', 'start': 5987.345, 'duration': 4.042}, {'end': 5994.188, 'text': 'now these are the head and tail functions.', 'start': 5991.387, 'duration': 2.801}, {'end': 5996.83, 'text': 'so there is also this describe function.', 'start': 5994.188, 'duration': 2.642}, {'end': 6003.473, 'text': "so i'll just type in describe over here and let's just see what it does over here.", 'start': 5996.83, 'duration': 6.643}, {'end': 6009.756, 'text': 'so this would basically describe this entire data frame in terms of all of these different measures.', 'start': 6003.473, 'duration': 6.283}, {'end': 6011.676, 'text': "we've got count.", 'start': 6010.456, 'duration': 1.22}, {'end': 6016.858, 'text': 'so this count is basically the number of records present for each of these different columns.', 'start': 6011.676, 'duration': 5.182}, {'end': 6018.778, 'text': 'so there are 150 records for sepal length.', 'start': 6016.858, 'duration': 1.92}, {'end': 6023.559, 'text': 'similarly, 150 for sepal width, petal length and petal width all right.', 'start': 6018.778, 'duration': 4.781}, {'end': 6027.22, 'text': 'so these are basically the different count values of all of these columns.', 'start': 6023.559, 'duration': 3.661}, {'end': 6029.28, 'text': "and then we've got the mean value.", 'start': 6027.22, 'duration': 2.06}, {'end': 6036.582, 'text': "so this is the mean value of sepal length, sepal width, petal length and petal width, and similarly we've got the minimum value, standard value,", 'start': 6029.28, 'duration': 7.302}, {'end': 6039.343, 'text': '25 percentile maximum value, and so on.', 'start': 6036.582, 'duration': 2.761}, {'end': 6044.421, 'text': 'So these are some basic functions which can be implemented on top of any data frame.', 'start': 6040.003, 'duration': 4.418}, {'end': 6050.672, 'text': "so now, going ahead, we'll see how to access individual rows and columns from our data frame.", 'start': 6045.208, 'duration': 5.464}, {'end': 6054.514, 'text': "so to do that we've got the iloc and the loc methods.", 'start': 6050.672, 'duration': 3.842}, {'end': 6056.856, 'text': "so let's start with the iloc method.", 'start': 6054.514, 'duration': 2.342}, {'end': 6062.74, 'text': "so first we'll give in the name of the data frame and then we'll just type in dot iloc.", 'start': 6056.856, 'duration': 5.884}, {'end': 6071.726, 'text': 'so this dot iloc basically stands for index location, and this is how we can extract some specific rows and columns from this entire data frame.', 'start': 6062.74, 'duration': 8.986}, {'end': 6080.59, 'text': 'so over here what we are doing is we are extracting the first three records from this data frame and the first two columns right.', 'start': 6071.726, 'duration': 8.864}, {'end': 6087.773, 'text': 'so the index values go from 0 to 3, so 0, 1, 2, and then the column values are 0 to 2, right.', 'start': 6080.59, 'duration': 7.183}, {'end': 6091.214, 'text': 'so 0 and 1, sepal length and sepal width.', 'start': 6087.773, 'duration': 3.441}, {'end': 6097.377, 'text': 'so this is how we can extract a subset of the data frame from this entire data frame.', 'start': 6091.214, 'duration': 6.163}, {'end': 6100.158, 'text': 'so now let me again print out the head of this.', 'start': 6097.377, 'duration': 2.781}, {'end': 6104.229, 'text': 'so iris dot head, this is my data frame.', 'start': 6100.158, 'duration': 4.071}, {'end': 6114.301, 'text': "over here now let's say, i want a subsection which comprises of the row numbers, starting from, let's say, 99 and going on till 126,", 'start': 6104.229, 'duration': 10.072}, {'end': 6118.085, 'text': 'and i want only the petal length and the petal width column.', 'start': 6114.301, 'duration': 3.784}, {'end': 6126.428, 'text': "so to do that i'd have to type in iris and then i'll type in loc, i'll give in this square braces over here.", 'start': 6118.866, 'duration': 7.562}, {'end': 6132.869, 'text': "i'll put in a comma now whatever you're given on the left side of the comma, that would denote all of the rows which you'd want to extract.", 'start': 6126.428, 'duration': 6.441}, {'end': 6138.15, 'text': "and whatever you put in on the right side of the comma, that would denote all of the columns which you'd want to extract.", 'start': 6132.869, 'duration': 5.281}, {'end': 6141.291, 'text': "so, as i've said, the row numbers which i want to extract are from 99 to 126.", 'start': 6138.15, 'duration': 3.141}, {'end': 6149.171, 'text': 'so indexing will start from 99 and go till 1, 2, 7.', 'start': 6141.291, 'duration': 7.88}, {'end': 6151.753, 'text': 'then the column value should be 0, 1, 2 & 3.', 'start': 6149.172, 'duration': 2.581}, {'end': 6155.514, 'text': 'so this will go from 2 to 4.', 'start': 6151.753, 'duration': 3.761}, {'end': 6164.397, 'text': "so I'll just put in 2 to 4 over here and I'll store this in a new data frame and I'll name that to be iris 1.", 'start': 6155.514, 'duration': 8.883}, {'end': 6168.633, 'text': 'so let me print out iris 1 over here, right.', 'start': 6164.397, 'duration': 4.236}, {'end': 6170.715, 'text': 'so this is our data frame, iris one,', 'start': 6168.633, 'duration': 2.082}, {'end': 6180.844, 'text': 'which comprise of only the columns petal length and petal width right and the row numbers start from 99 and go on till 126..', 'start': 6170.715, 'duration': 10.129}, {'end': 6187.09, 'text': "now. similarly, let's say, if i want to extract all of the records, starting from row number 10 and going on till row number 20,", 'start': 6180.844, 'duration': 6.246}, {'end': 6192.255, 'text': 'and i want only these two columns, the sepal length column and the species column.', 'start': 6187.09, 'duration': 5.165}, {'end': 6194.177, 'text': "so let me show you guys how it's done.", 'start': 6192.255, 'duration': 1.922}, {'end': 6201.848, 'text': "so this time i'll name the data frame as iris 2 and i'll type in iris dot.", 'start': 6195.606, 'duration': 6.242}, {'end': 6211.692, 'text': 'i will type in iloc over here and this would go from 10 to 21 and after this i will just give in the column value.', 'start': 6201.848, 'duration': 9.844}, {'end': 6219.315, 'text': 'so again i will take in a list over here and inside this i will give in the index of the first column, which is basically 0,', 'start': 6211.692, 'duration': 7.623}, {'end': 6222.376, 'text': "and then i'll give in the index of the final column, which is 4.", 'start': 6219.315, 'duration': 3.061}, {'end': 6222.996, 'text': 'over here.', 'start': 6222.376, 'duration': 0.62}, {'end': 6227.335, 'text': "I'll click on run and let me print out iris 2 over here.", 'start': 6224.173, 'duration': 3.162}, {'end': 6235.76, 'text': 'Right. so this is my subsection of the entire data frame, where I have row numbers starting from index number 10, going on till index number 20,', 'start': 6228.816, 'duration': 6.944}, {'end': 6237.441, 'text': "and these are the two columns which I've extracted.", 'start': 6235.76, 'duration': 1.681}, {'end': 6239.021, 'text': 'So sepal length and species.', 'start': 6237.721, 'duration': 1.3}, {'end': 6241.483, 'text': 'So this is how I can work with the iloc method.', 'start': 6239.442, 'duration': 2.041}, {'end': 6249.081, 'text': 'then, analogous to the ilock method, we also have the lock method to extract individual rows and columns.', 'start': 6242.739, 'duration': 6.342}, {'end': 6255.063, 'text': 'so the only difference is, instead of giving the index values over here, we given the labels of the columns.', 'start': 6249.081, 'duration': 5.982}, {'end': 6256.563, 'text': 'over here right.', 'start': 6255.063, 'duration': 1.5}, {'end': 6260.764, 'text': "so over here, when it comes to rows, we'll similarly given the index values of the rows.", 'start': 6256.563, 'duration': 4.201}, {'end': 6265.666, 'text': 'and over here, when it comes to columns, i will given the names of the columns.', 'start': 6260.764, 'duration': 4.902}, {'end': 6267.646, 'text': "so let's work with ilock.", 'start': 6265.666, 'duration': 1.98}, {'end': 6271.916, 'text': 'now let me print out the head again, iris dot, head right.', 'start': 6267.646, 'duration': 4.27}, {'end': 6280.301, 'text': "so this is my data frame over here and this time i want all of the records starting from row number 33 and going on till, let's say, 44,", 'start': 6271.916, 'duration': 8.385}, {'end': 6282.963, 'text': "and i'll store this in iris 3.", 'start': 6280.301, 'duration': 2.662}, {'end': 6293.009, 'text': 'so this will be iris dot loc and inside this this has to go from 33 to 44,', 'start': 6282.963, 'duration': 10.046}, {'end': 6298.392, 'text': 'and then the names of the columns which i want to extract would be sepal width and petal width.', 'start': 6293.009, 'duration': 5.383}], 'summary': 'Analyzed iris data frame with head, tail, and describe functions, then extracted subsets using iloc and loc methods.', 'duration': 395.364, 'max_score': 5903.028, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw5903028.jpg'}, {'end': 6087.773, 'src': 'embed', 'start': 6062.74, 'weight': 2, 'content': [{'end': 6071.726, 'text': 'so this dot iloc basically stands for index location, and this is how we can extract some specific rows and columns from this entire data frame.', 'start': 6062.74, 'duration': 8.986}, {'end': 6080.59, 'text': 'so over here what we are doing is we are extracting the first three records from this data frame and the first two columns right.', 'start': 6071.726, 'duration': 8.864}, {'end': 6087.773, 'text': 'so the index values go from 0 to 3, so 0, 1, 2, and then the column values are 0 to 2, right.', 'start': 6080.59, 'duration': 7.183}], 'summary': 'Using dot iloc to extract 3 rows and 2 columns from a dataframe.', 'duration': 25.033, 'max_score': 6062.74, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw6062740.jpg'}, {'end': 6581.483, 'src': 'embed', 'start': 6554.649, 'weight': 1, 'content': [{'end': 6558.931, 'text': 'so here you see, petal length is greater than two and the species is equal to virginica.', 'start': 6554.649, 'duration': 4.282}, {'end': 6562.473, 'text': 'so we have given multiple conditions to extract these records.', 'start': 6558.931, 'duration': 3.542}, {'end': 6564.934, 'text': 'now, similarly, we can give as many conditions as possible.', 'start': 6562.473, 'duration': 2.461}, {'end': 6568.616, 'text': 'you can give two, three, four or five as many conditions as you want.', 'start': 6564.934, 'duration': 3.682}, {'end': 6569.977, 'text': "so let's complicate this a bit.", 'start': 6568.616, 'duration': 1.361}, {'end': 6571.698, 'text': "and we'll give in three conditions over here.", 'start': 6569.977, 'duration': 1.721}, {'end': 6581.483, 'text': 'so this time i want the sepal length to be greater than six, sepal width to be greater than three and petal length to be also greater than three.', 'start': 6572.378, 'duration': 9.105}], 'summary': 'Using multiple conditions, data is filtered with petal length > 2 and species = virginica. more conditions can be added, such as sepal length > 6, sepal width > 3, and petal length > 3.', 'duration': 26.834, 'max_score': 6554.649, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw6554649.jpg'}, {'end': 6702.057, 'src': 'embed', 'start': 6679.336, 'weight': 0, 'content': [{'end': 6687.04, 'text': 'so see that there are only three records out of these 150 records where these three conditions are satisfied, that is,', 'start': 6679.336, 'duration': 7.704}, {'end': 6693.523, 'text': 'the sepal length is greater than six, the petal width is greater than three and the petal length is greater than six.', 'start': 6687.04, 'duration': 6.483}, {'end': 6698.645, 'text': 'right. so these were some different data manipulation operations which we could perform on the iris data set.', 'start': 6693.523, 'duration': 5.122}, {'end': 6702.057, 'text': 'so that was data manipulation.', 'start': 6700.236, 'duration': 1.821}], 'summary': 'Out of 150 records, only 3 satisfy conditions: sepal length > 6, petal width > 3, petal length > 6.', 'duration': 22.721, 'max_score': 6679.336, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw6679336.jpg'}], 'start': 5383.755, 'title': 'Introduction to pandas and data manipulation', 'summary': 'Introduces the pandas series object, covering its creation and manipulation, and demonstrates data manipulation using the iris dataset, filtering records based on multiple conditions, resulting in only three out of 150 records satisfying the criteria, and using inbuilt functions like head, tail, and describe to analyze data.', 'chapters': [{'end': 5476.868, 'start': 5383.755, 'title': 'Introduction to pandas series object', 'summary': 'Introduces the pandas series object, a one dimensional labeled array in python, and explains how to create and manipulate it using pandas library, with a mention of installing pandas using pip in anaconda.', 'duration': 93.113, 'highlights': ['The series object in pandas is a one dimensional labeled array, distinct from a numpy array, and can be created using the pandas library in Python.', "To create a series object, the pandas library needs to be invoked using the command 'import pandas as pd', and then the series can be created by typing 'pd.series' followed by the data elements.", "Pandas library comes pre-installed in Anaconda, but if manually installing, the command 'pip install pandas' should be used in Anaconda prompt."]}, {'end': 6044.421, 'start': 5476.868, 'title': 'Pandas data structures and functions', 'summary': 'Covers creating and manipulating pandas series and data frames, demonstrating how to change index labels, create series from lists and dictionaries, create a data frame from a dictionary, and use inbuilt functions like head, tail, and describe to analyze data.', 'duration': 567.553, 'highlights': ['Creating a data frame from a dictionary The speaker demonstrates creating a data frame from a dictionary with student names and marks, with Bob scoring 87, Sam scoring 13, Julia scoring 99, and Charles scoring 67.', 'Creating a series object from a dictionary The process of creating a series object from a dictionary is explained, where the keys become the index values and the values become the actual values of the series object.', 'Changing the index of a Panda series The method of changing the index of a Panda series is illustrated, showing how to change index labels from 0, 1, 2, 3, and 4 to A, B, C, D, and E.']}, {'end': 6372.897, 'start': 6045.208, 'title': 'Accessing rows and columns in data frames', 'summary': 'Demonstrates the use of iloc and loc methods to extract specific rows and columns from a data frame, including examples of extracting subsets and performing data manipulation operations based on conditions.', 'duration': 327.689, 'highlights': ['The iloc method is used to extract specific rows and columns from a data frame by providing index values, demonstrated by extracting the first three records and the first two columns, resulting in a subset of data. First three records, first two columns', 'The loc method is used to extract a subsection of the data frame by providing row numbers and column indices, resulting in the creation of a new data frame comprising specific rows and columns. Row numbers 99 to 126, columns 2 to 4', 'Demonstration of using iloc to extract records from row number 10 to 20 and specific columns, resulting in a subsection of the data frame with the specified rows and columns. Row numbers 10 to 20, specific columns extracted', 'Utilizing the loc method to extract records based on a condition, specifically extracting records where sepal length is greater than five, showcasing data manipulation operations. Condition: Sepal length > 5']}, {'end': 6702.057, 'start': 6373.417, 'title': 'Data manipulation in python', 'summary': 'Demonstrates data manipulation in python using the iris dataset, including filtering records based on multiple conditions and extracting specific data points, with an example of extracting records based on multiple conditions resulting in only three out of 150 records satisfying the criteria.', 'duration': 328.64, 'highlights': ['The chapter demonstrates filtering records based on multiple conditions, with an example of extracting records based on three conditions resulting in only three out of 150 records satisfying the criteria.', 'The process of extracting records based on multiple conditions is illustrated using the iris dataset, showing the extraction of records where the sepal length is greater than six, the sepal width is greater than three, and the petal length is greater than three, resulting in only three out of 150 records satisfying the criteria.', 'The demonstration includes the extraction of records where the petal length is greater than two and the species is equal to virginica, showcasing the ability to apply multiple conditions to extract specific data points from the dataset.']}], 'duration': 1318.302, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw5383755.jpg', 'highlights': ['The process of extracting records based on multiple conditions is illustrated using the iris dataset, showing the extraction of records where the sepal length is greater than six, the sepal width is greater than three, and the petal length is greater than three, resulting in only three out of 150 records satisfying the criteria.', 'The demonstration includes the extraction of records where the petal length is greater than two and the species is equal to virginica, showcasing the ability to apply multiple conditions to extract specific data points from the dataset.', 'The iloc method is used to extract specific rows and columns from a data frame by providing index values, demonstrated by extracting the first three records and the first two columns, resulting in a subset of data.', 'The loc method is used to extract a subsection of the data frame by providing row numbers and column indices, resulting in the creation of a new data frame comprising specific rows and columns.', 'The series object in pandas is a one dimensional labeled array, distinct from a numpy array, and can be created using the pandas library in Python.']}, {'end': 8263.816, 'segs': [{'end': 6725.713, 'src': 'embed', 'start': 6702.057, 'weight': 0, 'content': [{'end': 6709.163, 'text': "now we'll head on to data visualization and to perform data visualization, python provides us a package called as matplotlib,", 'start': 6702.057, 'duration': 7.106}, {'end': 6716.369, 'text': 'and with the help of matplotlib you can create beautiful graphs such as bar plots, scatter plots, histograms and a lot more.', 'start': 6709.163, 'duration': 7.206}, {'end': 6719.111, 'text': "so let's go to jupyter notebook and work with these graphs.", 'start': 6716.369, 'duration': 2.742}, {'end': 6725.713, 'text': "So we'll start off by importing the pyplot sub module from the matplotlib library.", 'start': 6721.411, 'duration': 4.302}], 'summary': "Python's matplotlib enables creation of various graphs like bar plots, scatter plots, and histograms for data visualization.", 'duration': 23.656, 'max_score': 6702.057, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw6702057.jpg'}, {'end': 7139.165, 'src': 'embed', 'start': 7104.139, 'weight': 1, 'content': [{'end': 7106.541, 'text': "now let's go ahead and create a bar plot.", 'start': 7104.139, 'duration': 2.402}, {'end': 7108.063, 'text': "so i'll type in.", 'start': 7106.541, 'duration': 1.522}, {'end': 7109.283, 'text': "i'll just give it a comment over here.", 'start': 7108.063, 'duration': 1.22}, {'end': 7110.665, 'text': 'bar plot.', 'start': 7109.283, 'duration': 1.382}, {'end': 7114.207, 'text': 'now, to create this bar plot i would need a dictionary first.', 'start': 7110.665, 'duration': 3.542}, {'end': 7123.555, 'text': "so let me again name this dictionary to be student over here and i'll have some names of students and the marks of the students.", 'start': 7114.207, 'duration': 9.348}, {'end': 7130.24, 'text': "so let's say the first student is sam again and he has scored 30..", 'start': 7123.555, 'duration': 6.685}, {'end': 7139.165, 'text': "then we've got bob who has scored 50 and finally we've got julia who has scored 70.", 'start': 7130.24, 'duration': 8.925}], 'summary': "Creating a bar plot using a dictionary with students' names and their scores: sam (30), bob (50), julia (70).", 'duration': 35.026, 'max_score': 7104.139, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw7104139.jpg'}, {'end': 7404.387, 'src': 'embed', 'start': 7373.156, 'weight': 2, 'content': [{'end': 7378.161, 'text': 'so one, two, three, four, five and six.', 'start': 7373.156, 'duration': 5.005}, {'end': 7381.336, 'text': "now, similarly, i'll take another list.", 'start': 7379.235, 'duration': 2.101}, {'end': 7383.477, 'text': "i'll name it to be y1.", 'start': 7381.336, 'duration': 2.141}, {'end': 7387.519, 'text': 'now you have to keep in mind that you take the same number of values in x and y,', 'start': 7383.477, 'duration': 4.042}, {'end': 7396.143, 'text': 'because a scatter plot it basically gives you a point for a specific position of x and y coordinates, right.', 'start': 7387.519, 'duration': 8.624}, {'end': 7399.024, 'text': "so again i'll give in some random six values.", 'start': 7396.143, 'duration': 2.881}, {'end': 7404.387, 'text': 'so nine, one, two, six and four and nine again.', 'start': 7399.024, 'duration': 5.363}], 'summary': 'Two sets of six values, x and y, for scatter plot.', 'duration': 31.231, 'max_score': 7373.156, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw7373156.jpg'}, {'end': 7874.758, 'src': 'heatmap', 'start': 7471.198, 'weight': 1, 'content': [{'end': 7478.962, 'text': "So 4, 7, 8, 1, 2, 6, right? So I've got my bunch of six values over here.", 'start': 7471.198, 'duration': 7.764}, {'end': 7483.225, 'text': 'let me create another scatter plot over here.', 'start': 7479.781, 'duration': 3.444}, {'end': 7488.512, 'text': 'so this time it will be with respect to x and y 2, and i will give it a different color.', 'start': 7483.225, 'duration': 5.287}, {'end': 7493.278, 'text': 'so the color would be red all right.', 'start': 7488.512, 'duration': 4.766}, {'end': 7497.163, 'text': 'so i have added two different theme points on the same graph over here.', 'start': 7493.278, 'duration': 3.885}, {'end': 7502.412, 'text': "now let's head on to the next plotting, which is a histogram.", 'start': 7498.911, 'duration': 3.501}, {'end': 7507.414, 'text': 'so histogram basically helps us to understand the distribution of a continuous column.', 'start': 7502.412, 'duration': 5.002}, {'end': 7514.556, 'text': "so for this we'll be actually loading the iris data set again and understanding the distribution of some of the continuous columns of that data set.", 'start': 7507.414, 'duration': 7.142}, {'end': 7515.997, 'text': "so again i'll load up the data set.", 'start': 7514.556, 'duration': 1.441}, {'end': 7529.281, 'text': "so i will type in pd dot, read csv, and now i'll give in the name of the file, which is iris dot, csv, and I'll store this in iris again.", 'start': 7515.997, 'duration': 13.284}, {'end': 7534.026, 'text': 'alright. so let me have a glance at the head of this again.', 'start': 7529.281, 'duration': 4.745}, {'end': 7539.625, 'text': 'so iris.head, right, so this is our data set.', 'start': 7534.026, 'duration': 5.599}, {'end': 7543.769, 'text': 'now I want to understand the distribution of this sepal length column.', 'start': 7539.625, 'duration': 4.144}, {'end': 7546.371, 'text': "so for that what I'll do is plt.", 'start': 7543.769, 'duration': 2.602}, {'end': 7548.073, 'text': 'to make a histogram.', 'start': 7546.471, 'duration': 1.602}, {'end': 7556.821, 'text': "we've got this hist function and inside this I'll just give in the name of the column, so iris, and I'll give in the name of the column,", 'start': 7548.073, 'duration': 8.748}, {'end': 7562.285, 'text': "which is sepal length, and then I'll just put in plt.show.", 'start': 7556.821, 'duration': 5.464}, {'end': 7569.284, 'text': 'so this is the histogram and this shows us the distribution of the sepal length column.', 'start': 7563.962, 'duration': 5.322}, {'end': 7575.005, 'text': "now to increase the distribution of these bins over here we've got the bins attribute.", 'start': 7569.284, 'duration': 5.721}, {'end': 7576.786, 'text': "so i'll just type in bins.", 'start': 7575.005, 'duration': 1.781}, {'end': 7579.567, 'text': "so let's say i want 20 bins.", 'start': 7576.786, 'duration': 2.781}, {'end': 7581.067, 'text': "so i'll just type in bins.", 'start': 7579.567, 'duration': 1.5}, {'end': 7584.688, 'text': "so i'll just set the bins value to be equal to 20..", 'start': 7581.067, 'duration': 3.621}, {'end': 7588.49, 'text': "so the inference which we can draw from this plot is so let's see.", 'start': 7584.688, 'duration': 3.802}, {'end': 7590.871, 'text': 'if we take this particular bin over here,', 'start': 7588.49, 'duration': 2.381}, {'end': 7599.955, 'text': 'then this tells us that there are around 16 records where the sepal length of the flower is exactly around 5..', 'start': 7590.871, 'duration': 9.084}, {'end': 7601.656, 'text': "then we've got this over here.", 'start': 7599.955, 'duration': 1.701}, {'end': 7607.117, 'text': 'so this bin tells us that there are again around 16 records where the sepal length is 6.5.', 'start': 7601.656, 'duration': 5.461}, {'end': 7610.658, 'text': 'and if we take this pin over here, so there are very, very few records.', 'start': 7607.117, 'duration': 3.541}, {'end': 7617.301, 'text': "so this would be, let's say, there's just one record where the sepal length is somewhere between 7.5 and 8..", 'start': 7610.658, 'duration': 6.643}, {'end': 7625.083, 'text': 'so again, if we take this bin, so there would be around four records, or there are just four flowers whose sepal length is 4.5.', 'start': 7617.301, 'duration': 7.782}, {'end': 7627.984, 'text': 'So this is the sort of inference which we can draw from a histogram.', 'start': 7625.083, 'duration': 2.901}, {'end': 7636.547, 'text': 'And the basic difference between a histogram and a bar plot is so a bar plot is used to understand the distribution of a categorical variable,', 'start': 7628.504, 'duration': 8.043}, {'end': 7640.229, 'text': 'while a histogram is used to understand the distribution of a continuous variable.', 'start': 7636.547, 'duration': 3.682}, {'end': 7651.513, 'text': "Now, similarly, if I want to understand the distribution of let's say petal width, so I'll change this over here and I'll just put in petal width.", 'start': 7640.869, 'duration': 10.644}, {'end': 7657.315, 'text': 'right. so this is the distribution of petal width.', 'start': 7654.554, 'duration': 2.761}, {'end': 7664.816, 'text': 'so there are around 34 records whose petal width would be somewhere around 0.2 or 0.3.', 'start': 7657.315, 'duration': 7.501}, {'end': 7667.377, 'text': 'so those constitute a lot of flowers.', 'start': 7664.816, 'duration': 2.561}, {'end': 7669.677, 'text': 'right, and if we take this bin over here,', 'start': 7667.377, 'duration': 2.3}, {'end': 7677.739, 'text': "so there would be just one record or there's just one flower where the petal width is 0.5 and this seems to be the maximum petal width,", 'start': 7669.677, 'duration': 8.062}, {'end': 7679.419, 'text': 'which is around 2.5.', 'start': 7677.739, 'duration': 1.68}, {'end': 7681.4, 'text': 'so the maximum petal width of 2.5.', 'start': 7679.419, 'duration': 1.981}, {'end': 7683.4, 'text': 'there are around 5 records.', 'start': 7681.4, 'duration': 2}, {'end': 7689.146, 'text': "right now let's again understand the distribution of sepal width.", 'start': 7685.063, 'duration': 4.083}, {'end': 7694.771, 'text': "so i'll change this and i'll just put in sepal over here.", 'start': 7689.146, 'duration': 5.625}, {'end': 7697.173, 'text': 'right, so we have a big peak over here.', 'start': 7694.771, 'duration': 2.402}, {'end': 7706.14, 'text': 'so there are 25 records where the sepal width is three and there is less, and there is just one record whose sepal width would be around 4.3 or 4.4.', 'start': 7697.173, 'duration': 8.967}, {'end': 7707.181, 'text': 'right, so this is histogram.', 'start': 7706.14, 'duration': 1.041}, {'end': 7713.928, 'text': "so now we'll go ahead and make some box plots.", 'start': 7711.104, 'duration': 2.824}, {'end': 7721.998, 'text': 'so box plots are used to understand the distribution of how does one continuous variable change with respect to a categorical value?', 'start': 7713.928, 'duration': 8.07}, {'end': 7724.381, 'text': "and again we'll be building box plots on this iris data set.", 'start': 7721.998, 'duration': 2.383}, {'end': 7727.325, 'text': "so i'll type in iris dot box plot.", 'start': 7724.381, 'duration': 2.944}, {'end': 7730.806, 'text': 'and this basically has two parameters.', 'start': 7728.804, 'duration': 2.002}, {'end': 7741.033, 'text': "so the first parameter is the column where we'll assign the y values or the continuous value and the continuous value which i'd want to given a sepal length over here.", 'start': 7730.806, 'duration': 10.227}, {'end': 7746.416, 'text': "so i'll type in sepal length and then i'll map the categorical value onto the x-axis.", 'start': 7741.033, 'duration': 5.383}, {'end': 7752.06, 'text': "so for this i've got the by attribute and i've only got one categorical variable over here.", 'start': 7746.416, 'duration': 5.644}, {'end': 7755.683, 'text': "right, so the species column is the only categorical variable which you've got.", 'start': 7752.06, 'duration': 3.623}, {'end': 7758.124, 'text': "So I'll type in species over here.", 'start': 7756.423, 'duration': 1.701}, {'end': 7761.326, 'text': "I'll hit on run and this is the box plot which you get.", 'start': 7758.624, 'duration': 2.702}, {'end': 7768.49, 'text': 'So what you see over here, so this line which you see in the middle of this box, this is known as the median line.', 'start': 7762.146, 'duration': 6.344}, {'end': 7772.812, 'text': 'This is the 25 percentile line and this is the 75 percentile line.', 'start': 7768.59, 'duration': 4.222}, {'end': 7775.833, 'text': "Let's just focus on this median line over here.", 'start': 7773.332, 'duration': 2.501}, {'end': 7784.454, 'text': 'So this tells us that if the species of the flower is setosa, then the median sepal length of the flower would be around 5.', 'start': 7776.674, 'duration': 7.78}, {'end': 7786.455, 'text': "again, let's take this box over here.", 'start': 7784.454, 'duration': 2.001}, {'end': 7791.157, 'text': 'so if the species of the flower is versicolor, then the median sepal length would be around 5.9.', 'start': 7786.455, 'duration': 4.702}, {'end': 7798.7, 'text': "and let's take this so if the species of the flower is virginica, then the median sepal length over here would be 6.5.", 'start': 7791.157, 'duration': 7.543}, {'end': 7803.702, 'text': 'so this is the inference which you can draw from a box plot now.', 'start': 7798.7, 'duration': 5.002}, {'end': 7814.505, 'text': 'similarly, if i want to understand how does petal width vary with species, let me just put in petal width over here, right.', 'start': 7803.702, 'duration': 10.803}, {'end': 7819.595, 'text': 'so again, we have three boxes over here, one box each for the different species of the iris flower.', 'start': 7814.505, 'duration': 5.09}, {'end': 7823.784, 'text': 'so over here it seems that the median line would be somewhere over here.', 'start': 7819.595, 'duration': 4.189}, {'end': 7824.144, 'text': 'so the.', 'start': 7823.784, 'duration': 0.36}, {'end': 7831.297, 'text': 'So if the species of the flower is setosa, then the median petal width would be around 0.3 or 0.4..', 'start': 7825.632, 'duration': 5.665}, {'end': 7834.84, 'text': "Then we've got versicolor over here.", 'start': 7831.297, 'duration': 3.543}, {'end': 7838.664, 'text': 'So if the species of the flower is versicolor, then the median petal width would be 1.2.', 'start': 7834.861, 'duration': 3.803}, {'end': 7848.253, 'text': 'So the basic inference is virginica over here would have the maximum petal width and setosa would have the minimum petal width.', 'start': 7838.664, 'duration': 9.589}, {'end': 7852.074, 'text': 'so that was the same case which we saw with sepal length as well.', 'start': 7848.593, 'duration': 3.481}, {'end': 7857.575, 'text': 'so virginica had the maximum sepal length and setosa had the minimum sepal length.', 'start': 7852.074, 'duration': 5.501}, {'end': 7862.636, 'text': 'if we want to make this plot even more beautiful, we can use the seaborn library.', 'start': 7857.575, 'duration': 5.061}, {'end': 7865.157, 'text': 'so let me just import the seaborn library.', 'start': 7862.636, 'duration': 2.521}, {'end': 7869.158, 'text': "i'll type in import seaborn as sns.", 'start': 7865.157, 'duration': 4.001}, {'end': 7874.758, 'text': "then we've got the box plot method from this.", 'start': 7871.195, 'duration': 3.563}], 'summary': 'Data analyzed: sepal length, petal width, sepal width. plotted: scatter plot, histogram, box plot.', 'duration': 403.56, 'max_score': 7471.198, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw7471198.jpg'}, {'end': 7529.281, 'src': 'embed', 'start': 7493.278, 'weight': 3, 'content': [{'end': 7497.163, 'text': 'so i have added two different theme points on the same graph over here.', 'start': 7493.278, 'duration': 3.885}, {'end': 7502.412, 'text': "now let's head on to the next plotting, which is a histogram.", 'start': 7498.911, 'duration': 3.501}, {'end': 7507.414, 'text': 'so histogram basically helps us to understand the distribution of a continuous column.', 'start': 7502.412, 'duration': 5.002}, {'end': 7514.556, 'text': "so for this we'll be actually loading the iris data set again and understanding the distribution of some of the continuous columns of that data set.", 'start': 7507.414, 'duration': 7.142}, {'end': 7515.997, 'text': "so again i'll load up the data set.", 'start': 7514.556, 'duration': 1.441}, {'end': 7529.281, 'text': "so i will type in pd dot, read csv, and now i'll give in the name of the file, which is iris dot, csv, and I'll store this in iris again.", 'start': 7515.997, 'duration': 13.284}], 'summary': 'Data visualization includes graphing themes and histogram distribution of continuous columns from the iris dataset.', 'duration': 36.003, 'max_score': 7493.278, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw7493278.jpg'}, {'end': 7741.033, 'src': 'embed', 'start': 7713.928, 'weight': 4, 'content': [{'end': 7721.998, 'text': 'so box plots are used to understand the distribution of how does one continuous variable change with respect to a categorical value?', 'start': 7713.928, 'duration': 8.07}, {'end': 7724.381, 'text': "and again we'll be building box plots on this iris data set.", 'start': 7721.998, 'duration': 2.383}, {'end': 7727.325, 'text': "so i'll type in iris dot box plot.", 'start': 7724.381, 'duration': 2.944}, {'end': 7730.806, 'text': 'and this basically has two parameters.', 'start': 7728.804, 'duration': 2.002}, {'end': 7741.033, 'text': "so the first parameter is the column where we'll assign the y values or the continuous value and the continuous value which i'd want to given a sepal length over here.", 'start': 7730.806, 'duration': 10.227}], 'summary': 'Box plots show distribution of continuous variable with respect to a categorical value in the iris dataset.', 'duration': 27.105, 'max_score': 7713.928, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw7713928.jpg'}, {'end': 7990.159, 'src': 'embed', 'start': 7961.57, 'weight': 5, 'content': [{'end': 7968.533, 'text': "So 76, 45, 90 and let's say I'll give in 85 as well.", 'start': 7961.57, 'duration': 6.963}, {'end': 7973.275, 'text': "Right, so I've got the fruits list ready and I've also got the cost list ready.", 'start': 7969.353, 'duration': 3.922}, {'end': 7976.936, 'text': 'Now all I have to do is map these two into a pie chart.', 'start': 7973.955, 'duration': 2.981}, {'end': 7979.077, 'text': "So I'll type in plt.pie.", 'start': 7977.397, 'duration': 1.68}, {'end': 7990.159, 'text': "and inside this I will first map the numerical values, so I'll type in cost and then I'll map the labels.", 'start': 7982.231, 'duration': 7.928}], 'summary': 'Mapping numerical values and labels to create a pie chart.', 'duration': 28.589, 'max_score': 7961.57, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw7961570.jpg'}, {'end': 8202.423, 'src': 'embed', 'start': 8173.481, 'weight': 6, 'content': [{'end': 8180.287, 'text': 'So many of the other algorithms which are very popular in the industry, internally they are linked to linear models.', 'start': 8173.481, 'duration': 6.806}, {'end': 8185.391, 'text': 'You can spend your entire lifetime studying linear models.', 'start': 8182.468, 'duration': 2.923}, {'end': 8193.277, 'text': 'It has so much of depth and breadth that it takes lot of time to actually get used to all these things.', 'start': 8186.772, 'duration': 6.505}, {'end': 8202.423, 'text': 'So today I am going to start by introducing you to simple linear models, the core concepts,', 'start': 8195.237, 'duration': 7.186}], 'summary': 'Linear models are fundamental and widely used in the industry.', 'duration': 28.942, 'max_score': 8173.481, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw8173481.jpg'}], 'start': 6702.057, 'title': 'Data visualization techniques', 'summary': 'Covers data visualization techniques using matplotlib in python, including creating bar plots, scatter plots, histograms, and linear models, with examples such as bar plots, scatterplots, and various graph types. it also demonstrates the creation and interpretation of different plots and their applications in machine learning.', 'chapters': [{'end': 7104.139, 'start': 6702.057, 'title': 'Data visualization with matplotlib', 'summary': 'Covers the process of creating beautiful graphs such as bar plots, scatter plots, and histograms using the matplotlib package in python, including the creation of a line plot with x and y values, addition of titles, x and y labels, changing color, line width, style, adding grid, and creating multiple lines within a single plot.', 'duration': 402.082, 'highlights': ["Creating a line plot with x and y values The process of creating a line plot involves the creation of x and y values using numpy arrays, and plotting them using the 'plt.plot' method.", "Adding title, x label, and y label to the line plot The title, x label, and y label can be added to the line plot using 'plt.title', 'plt.xlabel', and 'plt.ylabel' methods, respectively.", "Changing color, line width, and line style of the plot The color, line width, and line style of the plot can be modified using the 'color', 'linewidth', and 'linestyle' attributes within the 'plt.plot' method.", "Creating multiple lines within a single plot Multiple lines can be added within a single plot by creating separate y values and plotting them using the 'plt.plot' method, with different colors to distinguish between the lines.", "Adding grid to the plot A grid can be added to the plot using the 'plt.grid' method with the parameter set to 'True'."]}, {'end': 7493.278, 'start': 7104.139, 'title': 'Creating bar plots and scatterplots', 'summary': 'Explains how to create a bar plot using a dictionary of student names and marks and then demonstrates the creation of a scatterplot using x and y coordinate lists, with examples of horizontal bar plots and depicting two different datasets on the same scatterplot.', 'duration': 389.139, 'highlights': ['Creation of bar plot using a dictionary of student names and marks The chapter demonstrates the creation of a bar plot using a dictionary containing the names and marks of three students: Sam (30), Bob (50), and Julia (70).', 'Explanation of horizontal bar plot using plt.barh function The chapter explains the usage of plt.barh function to create a horizontal bar plot, providing a visual comparison of the same data presented in a horizontal orientation.', 'Demonstration of scatterplot creation using x and y coordinate lists The chapter demonstrates the creation of a scatterplot using two lists of x and y coordinates, portraying the plotted points based on the given coordinate values.', 'Depiction of two different datasets on the same scatterplot with different colors The chapter illustrates the depiction of two different datasets on the same scatterplot using different colors, allowing for the visualization of multiple datasets within a single scatterplot.']}, {'end': 8263.816, 'start': 7493.278, 'title': 'Visualizing data with graphs and plots', 'summary': 'Covers the creation and interpretation of histograms, box plots, and pie charts, using the iris dataset. it also introduces the importance of linear models in machine learning and their applications in various algorithms.', 'duration': 770.538, 'highlights': ['The chapter demonstrates the creation of histograms to understand the distribution of continuous columns, such as sepal length, petal width, and sepal width, in the iris dataset. The histogram creation process is demonstrated using the iris dataset to understand the distribution of continuous columns, such as sepal length, petal width, and sepal width. It also explains the interpretation of the histogram, such as the distribution of records within specific bins and the difference between histogram and bar plot.', 'Box plots are used to understand the distribution of a continuous variable with respect to a categorical value, as demonstrated with the sepal length and petal width columns in the iris dataset. The chapter illustrates the use of box plots to analyze the distribution of continuous variables, such as sepal length and petal width, with respect to a categorical value, such as species, in the iris dataset. It explains the interpretation of the median lines and the differences in distribution among different categories.', 'The creation and customization of a pie chart to represent the distribution of fruits and their costs are demonstrated, highlighting the use of numerical values and labels to create the chart. The process of creating a pie chart to represent the distribution of fruits and their costs is detailed, including the mapping of numerical values and labels to create the chart. It also covers the customization options, such as displaying percentage values and adding a shadow to the chart.', 'The chapter introduces the importance of linear models in machine learning, highlighting their versatility for classification, regression, and their linkage to other popular algorithms. The importance of linear models in machine learning is emphasized, highlighting their versatility for classification, regression, and their linkage to other popular algorithms. It also acknowledges the depth and breadth of linear models, indicating further discussions on advanced concepts in subsequent visits.']}], 'duration': 1561.759, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw6702057.jpg', 'highlights': ['Demonstrates creation of different plots using matplotlib in python', 'Illustrates creation of bar plot using a dictionary of student names and marks', 'Explains creation of scatterplot using x and y coordinate lists', 'Demonstrates creation of histograms to understand distribution of continuous columns', 'Illustrates use of box plots to analyze distribution of continuous variables', 'Details creation and customization of a pie chart to represent distribution of fruits and their costs', 'Emphasizes importance of linear models in machine learning for classification and regression']}, {'end': 10216.025, 'segs': [{'end': 8439.035, 'src': 'embed', 'start': 8404.392, 'weight': 1, 'content': [{'end': 8407.413, 'text': 'We in data science call these things as models.', 'start': 8404.392, 'duration': 3.021}, {'end': 8412.374, 'text': 'Models are nothing but surfaces.', 'start': 8409.793, 'duration': 2.581}, {'end': 8415.326, 'text': 'in your feature space.', 'start': 8413.965, 'duration': 1.361}, {'end': 8420.728, 'text': 'What is a feature space? Feature space is collection of your independent attributes and the dependent attribute.', 'start': 8415.386, 'duration': 5.342}, {'end': 8430.311, 'text': 'You are trying to explore how this y and this x interact with each other, that interaction is what we represented as models.', 'start': 8421.528, 'duration': 8.783}, {'end': 8439.035, 'text': 'So models are nothing but lines, surfaces, hyper surfaces in your feature space.', 'start': 8432.452, 'duration': 6.583}], 'summary': 'Data science models are representations of interactions between independent and dependent attributes in a feature space.', 'duration': 34.643, 'max_score': 8404.392, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw8404392.jpg'}, {'end': 8532.014, 'src': 'embed', 'start': 8468.148, 'weight': 0, 'content': [{'end': 8475.651, 'text': 'So it says this region belongs to the circular point, this region belongs to the triangular point and you have things on the wrong side.', 'start': 8468.148, 'duration': 7.503}, {'end': 8483.214, 'text': 'K nearest neighbors splits your mathematical space into regions.', 'start': 8477.912, 'duration': 5.302}, {'end': 8486.375, 'text': 'Those regions are called Voronoi regions.', 'start': 8484.554, 'duration': 1.821}, {'end': 8494.578, 'text': 'So this imaginary surface is the model in K nearest neighbor.', 'start': 8490.997, 'duration': 3.581}, {'end': 8503.774, 'text': 'So in linear regression, this line which represents the relationship in x and y, this line is my model.', 'start': 8497.372, 'duration': 6.402}, {'end': 8510.757, 'text': 'So I want to predict the value of y given the value of x.', 'start': 8506.255, 'duration': 4.502}, {'end': 8516.639, 'text': 'What the line is saying is y is equal to x.', 'start': 8510.757, 'duration': 5.882}, {'end': 8520.16, 'text': 'So the prediction is whenever x is some value, y will also be the same value.', 'start': 8516.639, 'duration': 3.521}, {'end': 8532.014, 'text': 'Can I write this expression as y equal to 1x Does it make any difference? 1 into x.', 'start': 8522.061, 'duration': 9.953}], 'summary': 'K-nearest neighbors uses voronoi regions to model mathematical space and predict values.', 'duration': 63.866, 'max_score': 8468.148, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw8468148.jpg'}, {'end': 8610.258, 'src': 'embed', 'start': 8577.533, 'weight': 6, 'content': [{'end': 8578.053, 'text': 'Look at this.', 'start': 8577.533, 'duration': 0.52}, {'end': 8584.309, 'text': 'When we studied in school, we came across trigonometry.', 'start': 8581.288, 'duration': 3.021}, {'end': 8587.59, 'text': 'In trigonometry, we came across sine theta, cos theta, tan theta.', 'start': 8584.389, 'duration': 3.201}, {'end': 8595.273, 'text': 'What are these sine theta, cos theta, tan theta? What are these things? These are names given to certain ratios.', 'start': 8588.751, 'duration': 6.522}, {'end': 8605.116, 'text': 'So tan theta is, tan means the ratio between, we show it as dy by dx.', 'start': 8597.934, 'duration': 7.182}, {'end': 8610.258, 'text': 'How much y changes whenever x changes by 1 unit? That ratio is called tan.', 'start': 8605.176, 'duration': 5.082}], 'summary': 'Trigonometry involves studying sine, cos, and tan ratios, such as tan as the ratio of dy by dx.', 'duration': 32.725, 'max_score': 8577.533, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw8577533.jpg'}, {'end': 8906.331, 'src': 'embed', 'start': 8765.844, 'weight': 3, 'content': [{'end': 8770.046, 'text': 'Where in your mathematical space on the y dimension?', 'start': 8765.844, 'duration': 4.202}, {'end': 8776.109, 'text': 'where on the y dimension does the line intercept or cut the y axis?', 'start': 8770.046, 'duration': 6.063}, {'end': 8782.692, 'text': 'So this is the generic equation of a line.', 'start': 8779.33, 'duration': 3.362}, {'end': 8786.773, 'text': 'y equal to mx plus c.', 'start': 8784.532, 'duration': 2.241}, {'end': 8790.214, 'text': 'It tells you how y and x are related to each other.', 'start': 8786.773, 'duration': 3.441}, {'end': 8802.619, 'text': 'If you look at this line l1, if you look at l1 then this becomes 0.', 'start': 8793.916, 'duration': 8.703}, {'end': 8803.599, 'text': 'So y equal to mx.', 'start': 8802.619, 'duration': 0.98}, {'end': 8810.842, 'text': 'If you look at l2, l2 this becomes not 0 but 2.', 'start': 8804.4, 'duration': 6.442}, {'end': 8812.983, 'text': 'So this relation continues to remain same.', 'start': 8810.842, 'duration': 2.141}, {'end': 8818.178, 'text': 'In both the cases, the M is same.', 'start': 8815.276, 'duration': 2.902}, {'end': 8821.701, 'text': 'M is nothing but tan of the 45 degrees.', 'start': 8818.439, 'duration': 3.262}, {'end': 8823.883, 'text': 'We call it slope.', 'start': 8822.802, 'duration': 1.081}, {'end': 8841.857, 'text': 'Right? So this is the generic equation, one way of representing a line, y equal to mx plus c.', 'start': 8828.206, 'duration': 13.651}, {'end': 8849.018, 'text': 'So in this expression, Y is given in your data set, it is a target column.', 'start': 8841.857, 'duration': 7.161}, {'end': 8856.422, 'text': 'X is given in your data set, it is the independent columns, the independent attributes.', 'start': 8850.979, 'duration': 5.443}, {'end': 8865.367, 'text': 'From this data that you have given to the algorithm, the algorithm has found out for you the M and the C.', 'start': 8858.003, 'duration': 7.364}, {'end': 8871.151, 'text': 'We call these coefficients, coefficients of the model.', 'start': 8865.367, 'duration': 5.784}, {'end': 8879.377, 'text': 'This M and the C They reflect the relationship between Y and X in your data set.', 'start': 8873.352, 'duration': 6.025}, {'end': 8884.798, 'text': 'So the M and the C is what forms the model for you.', 'start': 8881.598, 'duration': 3.2}, {'end': 8896.842, 'text': 'Shall I complicate a bit more? We will complicate a bit more.', 'start': 8891.78, 'duration': 5.062}, {'end': 8906.331, 'text': 'Suppose instead of having 1.', 'start': 8902.803, 'duration': 3.528}], 'summary': 'Generic equation of a line y=mx+c helps in understanding the relationship between y and x. algorithm finds m and c from given data.', 'duration': 140.487, 'max_score': 8765.844, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw8765844.jpg'}, {'end': 9020.614, 'src': 'embed', 'start': 8986.003, 'weight': 7, 'content': [{'end': 8993.368, 'text': 'Suppose instead of only one variable x, now you have x1 and you have another variable x2.', 'start': 8986.003, 'duration': 7.365}, {'end': 8998.011, 'text': 'And you have only one dependent variable y.', 'start': 8996.37, 'duration': 1.641}, {'end': 9010.826, 'text': 'In such case the algorithm will find out the relationship between x1 and y, x2 and y.', 'start': 9000.959, 'duration': 9.867}, {'end': 9020.614, 'text': 'So it will express that relationship as y, equal to m1 x1 plus m2 x2, and there will be only one constant term.', 'start': 9010.826, 'duration': 9.788}], 'summary': 'Algorithm finds relationship between x1, x2, and y in a single equation.', 'duration': 34.611, 'max_score': 8986.003, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw8986003.jpg'}, {'end': 9117.261, 'src': 'embed', 'start': 9084.633, 'weight': 9, 'content': [{'end': 9087.795, 'text': 'Yeah, so human minds can imagine only three dimensions.', 'start': 9084.633, 'duration': 3.162}, {'end': 9092.337, 'text': 'But suppose they go beyond three dimensions, there also the linear models work.', 'start': 9088.815, 'duration': 3.522}, {'end': 9095.499, 'text': 'Those planes are called hyperplanes.', 'start': 9093.738, 'duration': 1.761}, {'end': 9101.242, 'text': "Hyperplanes, how do they look? We can't imagine, so we don't know that, but they will also be one single plane.", 'start': 9095.899, 'duration': 5.343}, {'end': 9105.664, 'text': 'And there will be no ups and downs, no curves, nothing.', 'start': 9103.063, 'duration': 2.601}, {'end': 9106.725, 'text': 'It will be straight plane.', 'start': 9105.684, 'duration': 1.041}, {'end': 9110.887, 'text': 'What do you mean by straight plane in four dimensions? No idea.', 'start': 9107.245, 'duration': 3.642}, {'end': 9117.261, 'text': 'it will have the same properties as the properties of a plane in three dimensions.', 'start': 9112.799, 'duration': 4.462}], 'summary': 'Human minds can only imagine three dimensions, but linear models work in higher dimensions with hyperplanes having the same properties as in three dimensions.', 'duration': 32.628, 'max_score': 9084.633, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw9084633.jpg'}, {'end': 9241.987, 'src': 'embed', 'start': 9215.63, 'weight': 8, 'content': [{'end': 9222.135, 'text': 'then this angle, this angle is my m 1, m 2, the point at which the plane cuts the y axis is my c.', 'start': 9215.63, 'duration': 6.505}, {'end': 9230.12, 'text': 'We cannot imagine beyond three dimensions how the planes will look,', 'start': 9226.178, 'duration': 3.942}, {'end': 9233.942, 'text': 'but they will have the same mathematical properties as a plane in three dimensions.', 'start': 9230.12, 'duration': 3.822}, {'end': 9241.987, 'text': 'A point that you should probably keep in mind, it will become useful to you down the line when you do deep learning and so on and so forth.', 'start': 9236.324, 'duration': 5.663}], 'summary': 'Mathematical planes have properties in 3d, useful for deep learning.', 'duration': 26.357, 'max_score': 9215.63, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw9215630.jpg'}, {'end': 9455.182, 'src': 'embed', 'start': 9425.966, 'weight': 10, 'content': [{'end': 9431.489, 'text': 'this collinearity can lead you to a problem when you productionize your models.', 'start': 9425.966, 'duration': 5.523}, {'end': 9437.692, 'text': 'The models may be less effective, less what you call predictive power.', 'start': 9432.529, 'duration': 5.163}, {'end': 9441.619, 'text': 'than required to be because of the collinearity.', 'start': 9439.178, 'duration': 2.441}, {'end': 9449.681, 'text': 'So one of the things that we do when building models is we see what the collinearity is between the dimensions.', 'start': 9443.439, 'duration': 6.242}, {'end': 9455.182, 'text': 'If the collinearity is very strong they are very strongly related to each other,', 'start': 9450.401, 'duration': 4.781}], 'summary': 'Collinearity can reduce model effectiveness and predictive power, requiring assessment during model building.', 'duration': 29.216, 'max_score': 9425.966, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw9425966.jpg'}, {'end': 9517.908, 'src': 'embed', 'start': 9488.294, 'weight': 11, 'content': [{'end': 9492.517, 'text': 'All of you know what is R value? Coefficient of correlation, R value.', 'start': 9488.294, 'duration': 4.223}, {'end': 9496.359, 'text': 'Statistics. you have done statistics right?', 'start': 9494.298, 'duration': 2.061}, {'end': 9507.261, 'text': 'R value of measurement of how these two variables relate to each other, how strongly.', 'start': 9502.498, 'duration': 4.763}, {'end': 9512.224, 'text': 'you will see that R value comes very close to positive 1 in this case.', 'start': 9507.261, 'duration': 4.963}, {'end': 9513.765, 'text': 'The maximum it can be is 1.', 'start': 9512.585, 'duration': 1.18}, {'end': 9517.908, 'text': 'So it will be very close to positive 1.', 'start': 9513.765, 'duration': 4.143}], 'summary': 'R value measures strong positive correlation between two variables.', 'duration': 29.614, 'max_score': 9488.294, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw9488294.jpg'}, {'end': 9627.257, 'src': 'embed', 'start': 9592.377, 'weight': 12, 'content': [{'end': 9597.58, 'text': "That depends on which dimension is more prone to errors when you calculate, when you're capturing the data.", 'start': 9592.377, 'duration': 5.203}, {'end': 9606.085, 'text': 'So instead of doing all that analysis, I might convert into a synthetic dimension using a technique called PCA, principal component analysis.', 'start': 9599.301, 'duration': 6.784}, {'end': 9611.708, 'text': 'So in the previous example, we took two variables.', 'start': 9606.105, 'duration': 5.603}, {'end': 9613.81, 'text': 'So they are separated by an entity.', 'start': 9611.789, 'duration': 2.021}, {'end': 9616.832, 'text': 'So in that case, we can have only four independent variables.', 'start': 9613.83, 'duration': 3.002}, {'end': 9627.257, 'text': 'No, no that is what I am saying, that is our brain, our brain can see only 3 dimensions, but the algorithm can work in 300 dimensions.', 'start': 9620.034, 'duration': 7.223}], 'summary': 'Pca technique can convert data into synthetic dimension for analysis, allowing work in 300 dimensions.', 'duration': 34.88, 'max_score': 9592.377, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw9592377.jpg'}, {'end': 9921.887, 'src': 'embed', 'start': 9891.47, 'weight': 14, 'content': [{'end': 9894.411, 'text': 'Linear models are built only when you see linear relationships.', 'start': 9891.47, 'duration': 2.941}, {'end': 9899.713, 'text': 'If you see non-linear relationships, then you might want to resort to non-linear models.', 'start': 9895.491, 'duration': 4.222}, {'end': 9908.336, 'text': 'Linear models, they expect linear relationship between y and the independent variables.', 'start': 9901.894, 'duration': 6.442}, {'end': 9912.118, 'text': "Then your model's predictive power will be very high.", 'start': 9909.497, 'duration': 2.621}, {'end': 9916.804, 'text': 'However, as you will see now, those are all very perfect conditions.', 'start': 9913.442, 'duration': 3.362}, {'end': 9921.887, 'text': 'In the real world, we hardly have such perfect conditions, right.', 'start': 9917.945, 'duration': 3.942}], 'summary': 'Linear models expect linear relationships but real-world conditions are rarely perfect.', 'duration': 30.417, 'max_score': 9891.47, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw9891470.jpg'}], 'start': 8263.816, 'title': 'Linear regression and modeling', 'summary': "Covers linear regression, multidimensional relationships, collinearity's impact, and dimensionality reduction, emphasizing the concept of linear models, equations, and their influence on predictive power.", 'chapters': [{'end': 8494.578, 'start': 8263.816, 'title': 'Linear regression and data modeling', 'summary': 'Explains the concept of linear regression as a model representing the relationship between independent and dependent variables, and also delves into the idea of models as surfaces in feature space, with a comparison to k nearest neighbors.', 'duration': 230.762, 'highlights': ['Linear regression is based on the concept of a line, and it represents the relationship between independent and dependent variables. Linear regression is based on the concept of a line, representing the relationship between independent and dependent variables.', 'Models in data science are surfaces in the feature space, representing the interaction between independent and dependent attributes. Models in data science are surfaces in the feature space, representing the interaction between independent and dependent attributes.', 'K nearest neighbors breaks the mathematical space into regions, called Voronoi regions, as a way to represent the model. K nearest neighbors breaks the mathematical space into regions, called Voronoi regions, to represent the model.']}, {'end': 8983.861, 'start': 8497.372, 'title': 'Understanding linear regression', 'summary': 'Explains the concept of linear regression using a line equation y=mx+c, with m representing the slope and c the intercept, and emphasizes the relationship between target variable y and independent variable x, with tan(theta) being a key factor in determining the equation of the line.', 'duration': 486.489, 'highlights': ['The mathematical expression of the line is represented as y=mx, where m represents the slope and c the intercept, forming the generic equation y=mx+c. The mathematical expression of the line is represented as y=mx, where m represents the slope and c the intercept, forming the generic equation y=mx+c.', 'The coefficients M and C reflect the relationship between Y and X in the data set, forming the model for linear regression. The coefficients M and C reflect the relationship between Y and X in the data set, forming the model for linear regression.', 'The angle represented by M in the line equation y=mx+c determines the slope, with tan(theta) being a key factor in the equation of the line. The angle represented by M in the line equation y=mx+c determines the slope, with tan(theta) being a key factor in the equation of the line.', 'The understanding of tan(theta) as the ratio of y change to x change in trigonometry forms the basis for determining the equation of the line in linear regression. The understanding of tan(theta) as the ratio of y change to x change in trigonometry forms the basis for determining the equation of the line in linear regression.']}, {'end': 9273.377, 'start': 8986.003, 'title': 'Multidimensional linear relationships', 'summary': 'Discusses the concept of multidimensional linear relationships, explaining how the algorithm expresses the relationship between multiple independent variables and a dependent variable using equations, and the implications of planes in higher dimensions on linear models.', 'duration': 287.374, 'highlights': ['The algorithm expresses the relationship between multiple independent variables and a dependent variable as y = m1x1 + m2x2, with only one constant term, indicating the impact of each independent variable on the dependent variable.', 'Planes in higher dimensions represent the relationship between the variables, with the plane cutting the y-axis at the intercept point, and the slopes of the edges, m1 and m2, reflecting the angles with the x-axis.', 'Linear models extend to hyperplanes in dimensions beyond three, with the same properties as in three dimensions, illustrating the applicability of linear models in higher dimensions and the concept of orthogonal relationships between dimensions.', 'The number of dimensions in a model is always one less than the number of dimensions in the feature space, demonstrating the relationship between the dimensions of the model and the feature space.']}, {'end': 9517.908, 'start': 9276.92, 'title': 'Impact of collinearity on model predictive power', 'summary': 'Discusses the impact of collinearity on model dimensions and predictive power, highlighting the assumption of independence among variables, the influence of collinearity on model effectiveness, and the use of techniques to address strong collinearity.', 'duration': 240.988, 'highlights': ['Collinearity can lead to less effective models with reduced predictive power, necessitating the assessment of collinearity between dimensions during model building. Collinearity can lead to less effective models, resulting in reduced predictive power, requiring the assessment of collinearity between dimensions during model building.', 'The algorithm assumes that independent variables are independent of each other, but in reality, they influence each other, leading to the problem of collinearity. The algorithm assumes that independent variables are independent of each other, but in reality, they influence each other, leading to the problem of collinearity.', 'R value, a measurement of the strength of the relationship between variables, can come very close to positive 1 in cases of strong positive correlation between variables. R value, a measurement of the strength of the relationship between variables, can come very close to positive 1 in cases of strong positive correlation between variables.']}, {'end': 10216.025, 'start': 9517.908, 'title': 'Dimensionality reduction and linear models', 'summary': 'Discusses the concept of dimensionality reduction using techniques like pca and the importance of analyzing the relationship between independent variables and the target in linear models, with a focus on coefficient of correlation and the impact of linearity on model predictive power.', 'duration': 698.117, 'highlights': ['Dimensionality reduction involves techniques like PCA to create synthetic dimensions and improve model building by capturing data in fewer dimensions, allowing algorithms to work in higher dimensions, with all dimensions being orthogonal to each other. PCA is used to convert correlated dimensions into a composite dimension, allowing algorithms to work in higher dimensions and improve model performance.', 'In linear models, analyzing the strength of the relationship between independent variables and the target is crucial, and this is measured using the coefficient of correlation (R value) which ranges from -1 to +1, indicating the strength and direction of the relationship. The coefficient of correlation (R value) is used to measure the strength and direction of the relationship between independent variables and the target, with a range from -1 to +1, indicating the strength of the relationship.', 'The chapter emphasizes the need for linear relationships in linear models and discusses the impact of non-linear relationships, suggesting the resort to non-linear models in such cases for improved predictive power. Linear models are built based on linear relationships between independent variables and the target, while non-linear relationships may require the use of non-linear models for improved predictive power.']}], 'duration': 1952.209, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw8263816.jpg', 'highlights': ['Linear regression represents the relationship between independent and dependent variables.', 'Models in data science are surfaces in the feature space, representing the interaction between independent and dependent attributes.', 'K nearest neighbors breaks the mathematical space into regions, called Voronoi regions, to represent the model.', 'The mathematical expression of the line is represented as y=mx, forming the generic equation y=mx+c.', 'The coefficients M and C reflect the relationship between Y and X in the data set, forming the model for linear regression.', 'The angle represented by M in the line equation y=mx+c determines the slope, with tan(theta) being a key factor in the equation of the line.', 'The understanding of tan(theta) as the ratio of y change to x change in trigonometry forms the basis for determining the equation of the line in linear regression.', 'The algorithm expresses the relationship between multiple independent variables and a dependent variable as y = m1x1 + m2x2, with only one constant term, indicating the impact of each independent variable on the dependent variable.', 'Planes in higher dimensions represent the relationship between the variables, with the plane cutting the y-axis at the intercept point, and the slopes of the edges, m1 and m2, reflecting the angles with the x-axis.', 'Linear models extend to hyperplanes in dimensions beyond three, illustrating the applicability of linear models in higher dimensions and the concept of orthogonal relationships between dimensions.', 'Collinearity can lead to less effective models, resulting in reduced predictive power, requiring the assessment of collinearity between dimensions during model building.', 'R value, a measurement of the strength of the relationship between variables, can come very close to positive 1 in cases of strong positive correlation between variables.', 'Dimensionality reduction involves techniques like PCA to create synthetic dimensions and improve model building by capturing data in fewer dimensions, allowing algorithms to work in higher dimensions, with all dimensions being orthogonal to each other.', 'The coefficient of correlation (R value) is used to measure the strength and direction of the relationship between independent variables and the target, with a range from -1 to +1, indicating the strength of the relationship.', 'Linear models are built based on linear relationships between independent variables and the target, while non-linear relationships may require the use of non-linear models for improved predictive power.']}, {'end': 11971.352, 'segs': [{'end': 10273.492, 'src': 'embed', 'start': 10245.499, 'weight': 2, 'content': [{'end': 10249.902, 'text': 'how reliable your central values is that reliability is given by the measure of variance?', 'start': 10245.499, 'duration': 4.403}, {'end': 10258.136, 'text': 'if the variance is too large, the central value is not reliable, right.', 'start': 10251.789, 'duration': 6.347}, {'end': 10263.782, 'text': 'So, the variance gives you the reliability of the central values, how reliable the central values are.', 'start': 10258.896, 'duration': 4.886}, {'end': 10269.347, 'text': 'So, formula for variance is this, right.', 'start': 10266.364, 'duration': 2.983}, {'end': 10273.492, 'text': 'Just look at the numerator, just look at the numerator.', 'start': 10270.108, 'duration': 3.384}], 'summary': 'Variance measures reliability of central values, too large variance indicates unreliability.', 'duration': 27.993, 'max_score': 10245.499, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw10245499.jpg'}, {'end': 10362.659, 'src': 'embed', 'start': 10329.623, 'weight': 1, 'content': [{'end': 10339.427, 'text': 'When you are building linear models, the target variable, the Y, and the independent variables, IVs, should have very strong covariance,', 'start': 10329.623, 'duration': 9.804}, {'end': 10344.009, 'text': 'but within the independent variables the covariance should be 0.', 'start': 10339.427, 'duration': 4.582}, {'end': 10346.67, 'text': 'that is an ideal situation.', 'start': 10344.009, 'duration': 2.661}, {'end': 10350.391, 'text': 'practically it never happens.', 'start': 10346.67, 'duration': 3.721}, {'end': 10353.873, 'text': 'alright now, given this covariance.', 'start': 10350.391, 'duration': 3.482}, {'end': 10356.594, 'text': 'now look at the formula for R.', 'start': 10353.873, 'duration': 2.721}, {'end': 10362.659, 'text': 'okay, on the top in the numerator, you are seeing covariance right.', 'start': 10356.594, 'duration': 6.065}], 'summary': 'Linear models require strong covariance between y and ivs, but 0 covariance within ivs. the formula for r involves covariance.', 'duration': 33.036, 'max_score': 10329.623, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw10329623.jpg'}, {'end': 11181.858, 'src': 'embed', 'start': 11148.572, 'weight': 0, 'content': [{'end': 11150.412, 'text': 'That line is called the best fit line.', 'start': 11148.572, 'duration': 1.84}, {'end': 11158.235, 'text': 'The algorithm will find out for you from infinite number of possibilities the best fit line for you.', 'start': 11152.333, 'duration': 5.902}, {'end': 11160.096, 'text': 'It is like looking for a needle in a haystack.', 'start': 11158.315, 'duration': 1.781}, {'end': 11168.414, 'text': 'and to do this it makes use of a process which is called the gradient descent.', 'start': 11163.012, 'duration': 5.402}, {'end': 11181.858, 'text': 'All algorithms use under the hood a learning process, a process which they make use of to find the best model for you in the given data set.', 'start': 11170.634, 'duration': 11.224}], 'summary': 'Algorithm finds best fit line using gradient descent process.', 'duration': 33.286, 'max_score': 11148.572, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw11148572.jpg'}, {'end': 11524.019, 'src': 'embed', 'start': 11494.874, 'weight': 3, 'content': [{'end': 11499.596, 'text': "Isn't this formula for variance? So error is nothing but variance.", 'start': 11494.874, 'duration': 4.722}, {'end': 11507.359, 'text': 'How data points vary, scatter across the best fit line? The lesser the variance, more reliable the central value is.', 'start': 11500.536, 'duration': 6.823}, {'end': 11512.53, 'text': 'the lesser the variance of the points across the model, the better the model is.', 'start': 11508.527, 'duration': 4.003}, {'end': 11516.293, 'text': 'Same concept comes to you in a different way.', 'start': 11514.191, 'duration': 2.102}, {'end': 11524.019, 'text': 'So, sum of squared errors is nothing but variance, variance of the data points across the model.', 'start': 11519.235, 'duration': 4.784}], 'summary': 'Variance measures how data points scatter around the best fit line, indicating model reliability.', 'duration': 29.145, 'max_score': 11494.874, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw11494874.jpg'}, {'end': 11720.651, 'src': 'embed', 'start': 11685.615, 'weight': 4, 'content': [{'end': 11687.516, 'text': 'One is called stochastic variance.', 'start': 11685.615, 'duration': 1.901}, {'end': 11693.859, 'text': 'which means nothing but probabilistic variance.', 'start': 11689.977, 'duration': 3.882}, {'end': 11706.343, 'text': 'the other one is called deterministic variance, which means variance that happens for reasons that I know.', 'start': 11693.859, 'duration': 12.484}, {'end': 11707.864, 'text': 'I know why that variance happens.', 'start': 11706.343, 'duration': 1.521}, {'end': 11709.825, 'text': 'okay, that is called deterministic.', 'start': 11707.864, 'duration': 1.961}, {'end': 11714.606, 'text': "but variance also happens in your data set for reasons that I don't know.", 'start': 11709.825, 'duration': 4.781}, {'end': 11715.427, 'text': 'we call them noise.', 'start': 11714.606, 'duration': 0.821}, {'end': 11720.651, 'text': 'okay, that is called noise, alright.', 'start': 11719.031, 'duration': 1.62}], 'summary': 'Stochastic variance is probabilistic, while deterministic variance occurs for known reasons. variance happening for unknown reasons is termed noise.', 'duration': 35.036, 'max_score': 11685.615, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw11685615.jpg'}], 'start': 10216.465, 'title': 'Statistical models', 'summary': 'Discusses the reliability of central values, variance measures, covariance, correlation, linear models, linear regression algorithm, gradient descent, model variance, and the impact of collinearity, emphasizing key mathematical concepts and methodologies in statistical modeling.', 'chapters': [{'end': 10273.492, 'start': 10216.465, 'title': 'Reliability of central values', 'summary': 'Discusses the reliability of central values, emphasizing that the variance measures the reliability of the central values, with a formula for variance described.', 'duration': 57.027, 'highlights': ['Variance measures the reliability of central values, with a larger variance indicating lower reliability.', 'The formula for variance is crucial for understanding the reliability of central values.']}, {'end': 10914.321, 'start': 10274.639, 'title': 'Covariance, correlation, and linear models', 'summary': "Explains the concepts of covariance, correlation, and how they are used in linear models, emphasizing the importance of strong covariance between the target variable and independent variables, and how the value of 'r' determines the relationship between variables.", 'duration': 639.682, 'highlights': ['The importance of strong covariance between the target variable and independent variables in linear models. The target variable (Y) and the independent variables (IVs) should have very strong covariance, while within the independent variables, the covariance should ideally be 0. A high covariance between the target variable and independent variables is crucial for building effective linear models.', "Explanation of the 'r' value and its significance in determining the relationship between variables. The 'r' value, or the correlation coefficient, indicates the strength and direction of the relationship between variables. A value close to 0 signifies a weak or non-existent relationship, while values close to +1 or -1 indicate strong positive or negative relationships, respectively.", "Discussion on the impact of sampling on the 'r' value, emphasizing the possibility of statistical flukes. In cases where the 'r' value is close to 0, it may indicate a statistical fluke caused by sampling errors, leading to a potentially misleading perception of a relationship between variables. Further investigation is necessary to determine the authenticity of the relationship."]}, {'end': 11224.036, 'start': 10914.341, 'title': 'Linear regression algorithm', 'summary': 'Discusses the linear regression algorithm, emphasizing its process of finding the best fit line using gradient descent to minimize errors and accurately represent the relationship between independent and target variables.', 'duration': 309.695, 'highlights': ['The algorithm evaluates different possible lines to find the best fit line that goes through the maximum number of data points and minimizes the distance between other points and the line, termed as error. The algorithm aims to minimize errors by finding the best fit line that maximizes the number of data points it goes through and minimizes the distance between other points and the line.', 'The process involves using gradient descent to find the best model for the given data set, similar to searching for a needle in a haystack from infinite possibilities. The algorithm utilizes gradient descent as a learning process to find the best model for the given data set, akin to searching for the best fit line from infinite possibilities.', 'The algorithm uses the process of gradient descent to minimize the sum of errors across all data points, resulting in the identification of the best fit line. By minimizing the sum of errors across all data points, the algorithm identifies the best fit line through the process of gradient descent.']}, {'end': 11971.352, 'start': 11224.056, 'title': 'Linear regression and model variance', 'summary': 'Discusses the concept of linear regression, model variance, and the importance of minimizing error in finding the best fit line, emphasizing the need to address both deterministic and stochastic variance in the data points and the impact of collinearity on noise.', 'duration': 747.296, 'highlights': ['The chapter emphasizes the need to find the best fit line by minimizing the error, where the error is calculated as the sum of squared errors, representing the variance across the line, and discusses the importance of addressing both deterministic and stochastic variance in the data points. Minimizing error by finding the best fit line, calculation of error as sum of squared errors, importance of addressing deterministic and stochastic variance.', 'It explains the impact of collinearity on noise in the data set, stating that when different variables interact with each other, there can be trouble with noise, which can either get cancelled out or magnified. Impact of collinearity on noise, potential trouble with noise getting cancelled out or magnified.', 'The chapter also mentions the concept of convex functions in quadratic equations, pointing out that all quadratic equations have the property of being convex functions, acquiring a bowl shape in three dimensions when plotted against their independent variables. Introduction to convex functions in quadratic equations, property of acquiring a bowl shape in three dimensions.']}], 'duration': 1754.887, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw10216465.jpg', 'highlights': ['The algorithm uses the process of gradient descent to minimize the sum of errors across all data points, resulting in the identification of the best fit line.', 'The importance of strong covariance between the target variable and independent variables in linear models.', 'The formula for variance is crucial for understanding the reliability of central values.', 'The chapter emphasizes the need to find the best fit line by minimizing the error, where the error is calculated as the sum of squared errors, representing the variance across the line, and discusses the importance of addressing both deterministic and stochastic variance in the data points.', 'The importance of addressing deterministic and stochastic variance.', 'The target variable (Y) and the independent variables (IVs) should have very strong covariance, while within the independent variables, the covariance should ideally be 0.']}, {'end': 13580.773, 'segs': [{'end': 12000.761, 'src': 'embed', 'start': 11971.352, 'weight': 0, 'content': [{'end': 11979.994, 'text': 'they are guaranteed to have one absolute minima parabolic structure, one absolute minima.', 'start': 11971.352, 'duration': 8.642}, {'end': 11983.375, 'text': "so for some combination of M and C I'll get the least error.", 'start': 11979.994, 'duration': 3.381}, {'end': 11989.831, 'text': 'combination of M and C, which gives me the least error, is my best fit line.', 'start': 11986.007, 'duration': 3.824}, {'end': 11996.176, 'text': 'So the algorithm will start from some random M and C.', 'start': 11992.473, 'duration': 3.703}, {'end': 11998.078, 'text': 'So maybe this is the random M and C.', 'start': 11996.176, 'duration': 1.902}, {'end': 12000.761, 'text': 'This is my random M and this is my random C.', 'start': 11998.078, 'duration': 2.683}], 'summary': 'The algorithm aims to find the combination of m and c that yields the least error, starting from random values.', 'duration': 29.409, 'max_score': 11971.352, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw11971352.jpg'}, {'end': 12269.157, 'src': 'embed', 'start': 12241.364, 'weight': 1, 'content': [{'end': 12245.587, 'text': 'But in the process of jumping, there is something called learning step.', 'start': 12241.364, 'duration': 4.223}, {'end': 12252.646, 'text': "We are too far away from this, so I'll just tell you the concept, which is used along with this partial derivatives.", 'start': 12247.622, 'duration': 5.024}, {'end': 12260.991, 'text': 'If the learning step, the amount of change that you do in dy and dx, if the learning step is too high, you might jump and oscillate.', 'start': 12253.306, 'duration': 7.685}, {'end': 12269.157, 'text': 'You might jump the global minimum, go on the wrong side, and keep oscillating backward and forward, infinite loops.', 'start': 12262.933, 'duration': 6.224}], 'summary': 'Learning step affects jumping behavior, high step leads to oscillation and missing global minimum.', 'duration': 27.793, 'max_score': 12241.364, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw12241364.jpg'}, {'end': 12473.158, 'src': 'embed', 'start': 12443.166, 'weight': 2, 'content': [{'end': 12446.508, 'text': 'which is nothing but xi minus x bar, x bar here is your predicted lines.', 'start': 12443.166, 'duration': 3.342}, {'end': 12449.21, 'text': 'So formula remains same.', 'start': 12448.19, 'duration': 1.02}, {'end': 12453.994, 'text': 'That variance is called sum of squared errors, that variance we have to minimize.', 'start': 12449.931, 'duration': 4.063}, {'end': 12458.237, 'text': 'The minimal the variance is, the better your model is.', 'start': 12455.495, 'duration': 2.742}, {'end': 12471.017, 'text': 'Alright, so let me explain this to you Because you are going to come across these terms down the line.', 'start': 12464.061, 'duration': 6.956}, {'end': 12473.158, 'text': 'So let me explain this to you in slightly more detail.', 'start': 12471.077, 'duration': 2.081}], 'summary': 'Minimize sum of squared errors to improve model performance.', 'duration': 29.992, 'max_score': 12443.166, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw12443166.jpg'}, {'end': 12615.671, 'src': 'embed', 'start': 12584.226, 'weight': 3, 'content': [{'end': 12586.507, 'text': 'that distance is called total error.', 'start': 12584.226, 'duration': 2.281}, {'end': 12596.169, 'text': 'Of the total error, your model has captured this much.', 'start': 12590.528, 'duration': 5.641}, {'end': 12601.51, 'text': 'Your model is a regression model, hence the name given to this is regression error.', 'start': 12597.669, 'duration': 3.841}, {'end': 12605.491, 'text': 'Of the total error, your model is predicted this much error.', 'start': 12602.451, 'duration': 3.04}, {'end': 12611.793, 'text': 'Of the total error in your dataset, your model is predicted this much.', 'start': 12607.512, 'duration': 4.281}, {'end': 12615.671, 'text': 'the difference between predicted and the y bar.', 'start': 12613.971, 'duration': 1.7}], 'summary': 'Regression model captures a certain percentage of total error in the dataset.', 'duration': 31.445, 'max_score': 12584.226, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw12584226.jpg'}, {'end': 12938.016, 'src': 'embed', 'start': 12907.81, 'weight': 4, 'content': [{'end': 12914.014, 'text': 'So, increase the SSR to SST ratio that is what our objective is, yeah.', 'start': 12907.81, 'duration': 6.204}, {'end': 12915.755, 'text': 'Alright, let us move on.', 'start': 12914.814, 'duration': 0.941}, {'end': 12924.921, 'text': 'So, before you build the model you have to evaluate each independent dimension and see what is the R value.', 'start': 12918.637, 'duration': 6.284}, {'end': 12935.047, 'text': 'R value coefficient of correlation comes into play to help you identify good predictors given the target, ok.', 'start': 12926.261, 'duration': 8.786}, {'end': 12938.016, 'text': 'once I built the model.', 'start': 12936.235, 'duration': 1.781}], 'summary': 'Objective: increase ssr to sst ratio for model building and evaluation.', 'duration': 30.206, 'max_score': 12907.81, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw12907810.jpg'}, {'end': 13248.859, 'src': 'embed', 'start': 13217.362, 'weight': 5, 'content': [{'end': 13227.128, 'text': 'How much of this total variance has been explained, captured by your model? R square is the ratio of the dark area to light area.', 'start': 13217.362, 'duration': 9.766}, {'end': 13233.171, 'text': 'What you are seeing on the top, this is the residuals, unexplained variance.', 'start': 13228.909, 'duration': 4.262}, {'end': 13241.116, 'text': 'So, R square is again a ratio, where the ratio is between the total variance in the data points.', 'start': 13235.172, 'duration': 5.944}, {'end': 13245.296, 'text': 'and the variance explained by your model.', 'start': 13243.115, 'duration': 2.181}, {'end': 13248.859, 'text': 'That ratio is called r square.', 'start': 13247.158, 'duration': 1.701}], 'summary': 'Model captures variance with r square ratio.', 'duration': 31.497, 'max_score': 13217.362, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw13217362.jpg'}, {'end': 13431.553, 'src': 'embed', 'start': 13399.003, 'weight': 6, 'content': [{'end': 13400.664, 'text': 'But adjusted R square will go down.', 'start': 13399.003, 'duration': 1.661}, {'end': 13402.966, 'text': 'It will decrease.', 'start': 13400.684, 'duration': 2.282}, {'end': 13410.371, 'text': 'So when you include useless variables in your model, adjusted R square will go down.', 'start': 13404.026, 'duration': 6.345}, {'end': 13411.472, 'text': 'It will reduce.', 'start': 13410.391, 'duration': 1.081}, {'end': 13414.194, 'text': 'Whereas R square is likely to go up.', 'start': 13411.892, 'duration': 2.302}, {'end': 13421.004, 'text': 'Adjusted R square will go up only when you include good variables in your models.', 'start': 13416.621, 'duration': 4.383}, {'end': 13425.748, 'text': 'Good variables are variables whose relationship is significant, strong.', 'start': 13422.466, 'duration': 3.282}, {'end': 13431.553, 'text': 'So we use adjusted R square for evaluating our models.', 'start': 13428.811, 'duration': 2.742}], 'summary': 'Including useless variables decreases adjusted r square, while good variables increase it.', 'duration': 32.55, 'max_score': 13399.003, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw13399003.jpg'}, {'end': 13558.604, 'src': 'embed', 'start': 13527.849, 'weight': 7, 'content': [{'end': 13530.931, 'text': 'your sample will never be a true representation of the real population.', 'start': 13527.849, 'duration': 3.082}, {'end': 13531.591, 'text': 'It will never be.', 'start': 13530.951, 'duration': 0.64}, {'end': 13536.073, 'text': 'So its distributions will be slightly different from the distributions in the population.', 'start': 13532.591, 'duration': 3.482}, {'end': 13541.595, 'text': "That change in distribution is what will introduce fake R's.", 'start': 13538.374, 'duration': 3.221}, {'end': 13545.897, 'text': 'So only sampling which introduces fake? Sampling.', 'start': 13542.736, 'duration': 3.161}, {'end': 13546.978, 'text': "Sampling is the source of R's.", 'start': 13545.937, 'duration': 1.041}, {'end': 13558.604, 'text': 'Did I show you that video where I showed you universe to be under flux continuously changing, it is never static, whereas your sample is a snapshot.', 'start': 13549.738, 'duration': 8.866}], 'summary': 'Sampling introduces fake distributions, leading to inaccurate representation of the population.', 'duration': 30.755, 'max_score': 13527.849, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw13527849.jpg'}], 'start': 11971.352, 'title': 'Regression model evaluation', 'summary': 'Covers topics including gradient descent algorithm for error minimization, regression error concepts, model evaluation metrics, and the limitations of r square with adjusted r square, emphasizing the need for capturing maximal variance in y and the impact of useless variables on model evaluation.', 'chapters': [{'end': 12384.558, 'start': 11971.352, 'title': 'Gradient descent algorithm', 'summary': 'Describes the gradient descent algorithm, which uses partial derivatives to minimize the error in linear regression, guaranteeing the reach of absolute minima and preventing oscillations by adjusting the learning step.', 'duration': 413.206, 'highlights': ['The algorithm starts from random M and C and utilizes gradient descent, employing partial derivatives to minimize error in linear regression, ensuring the reach of absolute minima. Starts from random M and C, uses gradient descent, employs partial derivatives, minimizes error in linear regression, ensures reach of absolute minima', 'The learning step is adjusted to prevent oscillations, where the amount of change in dy and dx decreases as it jumps towards the global minimum, using the bold driver algorithm as one variant of gradient descent. Adjusts learning step, prevents oscillations, decreases change in dy and dx, uses bold driver algorithm as variant of gradient descent', 'The algorithm is crucial in understanding neural networks, back propagation of errors, and deep learning, while also being part of a series of algorithms that have emerged as improvements. Crucial in understanding neural networks, back propagation of errors, part of emerging algorithms']}, {'end': 12844.985, 'start': 12387.728, 'title': 'Understanding regression errors', 'summary': "Discusses the concept of regression errors in modeling, emphasizing the importance of minimizing sum of squared errors to improve the model's accuracy and explaining the distinction between total error, sum of squared errors, and regression error.", 'duration': 457.257, 'highlights': ["The minimal the variance is, the better your model is. Minimizing the variance, measured by the sum of squared errors, improves the model's accuracy.", "Total error is the distance between your actual data point and expected value. Defining total error as the difference between the actual and expected values, emphasizing the importance of minimizing this distance to improve the model's accuracy.", "Regression error is the difference between predicted and the y bar. Explaining regression error as the difference between the predicted value and the expected value, highlighting its significance in assessing the model's predictive accuracy."]}, {'end': 13307.137, 'start': 12845.705, 'title': 'Model evaluation and determination metrics', 'summary': "Discusses the need to minimize unexplained error, increase ssr to sst ratio, and the use of r square as a metric to evaluate the model's performance, emphasizing the importance of capturing maximal variance in y.", 'duration': 461.432, 'highlights': ['The need to minimize unexplained error and maximize SSR to SST ratio The speaker emphasizes the need for a model with minimized unexplained error and highlights the objective of increasing the SSR to SST ratio, indicating the importance of reducing unexplained variance and maximizing the explained variance.', "The use of R square as a metric to evaluate the model's performance R square is discussed as a metric to evaluate the model's performance, measuring how much of the total variance in Y has been explained by the model, with a range between 0 and 1, where a value closer to 1 indicates a better model.", 'The explanation of R square as the ratio of explained variance to total variance The concept of R square as the ratio of the explained variance to the total variance in the data points is explained, with an emphasis on the importance of capturing and explaining maximal variance by the model.']}, {'end': 13580.773, 'start': 13307.137, 'title': 'Adjusted r square for model evaluation', 'summary': "Discusses the limitations of r square as a model evaluation metric and the advantages of using adjusted r square, highlighting how it accounts for the impact of useless variables and the need for further investigation when r values are close to 0.5 or 0.4, while emphasizing the impact of sampling on introducing fake r's.", 'duration': 273.636, 'highlights': ['The beauty of adjusted R square is that it decreases when useless variables are included in the model, while R square keeps increasing, emphasizing the importance of adjusted R square for evaluating models.', 'Attributes in model building always have both good R and fake R components, with the percentage of these components differing between good and poor attributes, stressing the need for further investigation when R values are close to 0.5 or 0.4.', "The chapter highlights the impact of sampling on introducing fake R's in the model, emphasizing that the sample will never be a true representation of the real population, leading to differences in distributions and the introduction of fake R's."]}], 'duration': 1609.421, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw11971352.jpg', 'highlights': ['The algorithm starts from random M and C and utilizes gradient descent, employing partial derivatives to minimize error in linear regression, ensuring the reach of absolute minima.', 'The learning step is adjusted to prevent oscillations, where the amount of change in dy and dx decreases as it jumps towards the global minimum, using the bold driver algorithm as one variant of gradient descent.', "The minimal the variance is, the better your model is. Minimizing the variance, measured by the sum of squared errors, improves the model's accuracy.", "Regression error is the difference between predicted and the y bar. Explaining regression error as the difference between the predicted value and the expected value, highlighting its significance in assessing the model's predictive accuracy.", 'The need to minimize unexplained error and maximize SSR to SST ratio The speaker emphasizes the need for a model with minimized unexplained error and highlights the objective of increasing the SSR to SST ratio, indicating the importance of reducing unexplained variance and maximizing the explained variance.', 'The explanation of R square as the ratio of explained variance to total variance The concept of R square as the ratio of the explained variance to the total variance in the data points is explained, with an emphasis on the importance of capturing and explaining maximal variance by the model.', 'The beauty of adjusted R square is that it decreases when useless variables are included in the model, while R square keeps increasing, emphasizing the importance of adjusted R square for evaluating models.', "The chapter highlights the impact of sampling on introducing fake R's in the model, emphasizing that the sample will never be a true representation of the real population, leading to differences in distributions and the introduction of fake R's."]}, {'end': 15816.081, 'segs': [{'end': 13732.469, 'src': 'embed', 'start': 13699.502, 'weight': 0, 'content': [{'end': 13706.988, 'text': 'So, unless we draw this pair plot, we will not come to know that we have this kind of non-linear distribution.', 'start': 13699.502, 'duration': 7.486}, {'end': 13710.611, 'text': 'So is it always suggestible to draw a pair for each? 100 percent?', 'start': 13707.008, 'duration': 3.603}, {'end': 13718.698, 'text': 'if you ask me, I will always say pair plot is the most important tool you have in your toolbox, which you should use to understand your data.', 'start': 13710.611, 'duration': 8.087}, {'end': 13732.469, 'text': 'Try a different model, may be a non-linear model.', 'start': 13728.266, 'duration': 4.203}], 'summary': 'Pair plot is a crucial tool for understanding data; recommended for every analysis.', 'duration': 32.967, 'max_score': 13699.502, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw13699502.jpg'}, {'end': 13825.328, 'src': 'embed', 'start': 13794.845, 'weight': 2, 'content': [{'end': 13796.725, 'text': 'They are linear in terms of coefficients.', 'start': 13794.845, 'duration': 1.88}, {'end': 13805.63, 'text': 'If I have a model like this, y equal to m x square plus c, for us, it is a linear model.', 'start': 13798.466, 'duration': 7.164}, {'end': 13810.172, 'text': 'So, do not get confused with this.', 'start': 13808.911, 'duration': 1.261}, {'end': 13819.867, 'text': 'A lot of mathematicians will say this is not linear, but for us, for us in data science scikit-learn or R, this is a linear model.', 'start': 13812.705, 'duration': 7.162}, {'end': 13823.868, 'text': 'It will show us a trend.', 'start': 13823.067, 'duration': 0.801}, {'end': 13825.328, 'text': 'You will see.', 'start': 13824.688, 'duration': 0.64}], 'summary': 'In data science, a model like y = mx^2 + c is considered linear and shows a trend.', 'duration': 30.483, 'max_score': 13794.845, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw13794845.jpg'}, {'end': 13963.617, 'src': 'embed', 'start': 13937.132, 'weight': 1, 'content': [{'end': 13941.973, 'text': 'So what we do is have you seen this Kohn-Binaga-Karotapathy?', 'start': 13937.132, 'duration': 4.841}, {'end': 13945.894, 'text': 'Have you noticed that the audience never go wrong?', 'start': 13943.513, 'duration': 2.381}, {'end': 13951.535, 'text': 'Now the audience individually may not have very high IQ, but put together they rarely go wrong.', 'start': 13947.374, 'duration': 4.161}, {'end': 13954.255, 'text': 'This concept is called wisdom of the crowd.', 'start': 13952.555, 'duration': 1.7}, {'end': 13957.316, 'text': 'The same concept is used in data science.', 'start': 13955.075, 'duration': 2.241}, {'end': 13960.416, 'text': "also, when we productionize our models, we don't put one single model into play.", 'start': 13957.316, 'duration': 3.1}, {'end': 13962.257, 'text': 'we always put a collection of models into play.', 'start': 13960.416, 'duration': 1.841}, {'end': 13963.617, 'text': 'That is called ensemble.', 'start': 13962.757, 'duration': 0.86}], 'summary': 'Wisdom of the crowd concept: collective audience rarely goes wrong; applied in data science through ensemble of models.', 'duration': 26.485, 'max_score': 13937.132, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw13937132.jpg'}, {'end': 14699.539, 'src': 'embed', 'start': 14670.139, 'weight': 3, 'content': [{'end': 14671.64, 'text': 'Many of the data types are object.', 'start': 14670.139, 'duration': 1.501}, {'end': 14672.661, 'text': 'Object means string.', 'start': 14671.72, 'duration': 0.941}, {'end': 14676.684, 'text': 'Machine learning algorithms cannot handle string data types.', 'start': 14674.442, 'duration': 2.242}, {'end': 14678.721, 'text': 'they have to be converted in numbers.', 'start': 14677.52, 'duration': 1.201}, {'end': 14689.27, 'text': 'So what I am doing here is, I am going to convert into numbers, but before I convert numbers, I am doing, I am dropping some of these columns.', 'start': 14681.704, 'duration': 7.566}, {'end': 14699.539, 'text': 'The reason I am dropping these columns is if you go and take a frequency, count on these columns fuel type, for example, or engine location,', 'start': 14691.452, 'duration': 8.087}], 'summary': 'Converting string data to numbers for machine learning. dropping columns due to frequency count.', 'duration': 29.4, 'max_score': 14670.139, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw14670139.jpg'}, {'end': 15036.646, 'src': 'embed', 'start': 15009.215, 'weight': 4, 'content': [{'end': 15015.202, 'text': 'So if a column is an ordinal data type You can go and introduce order in your numerical values.', 'start': 15009.215, 'duration': 5.987}, {'end': 15023.623, 'text': 'If the column is not ordinal, gender column, then you cannot blindly convert them into 1 and 2, you have to resort to one-hot coding.', 'start': 15015.882, 'duration': 7.741}, {'end': 15029.785, 'text': 'In scikit-learn there is a facility function called label encoder.', 'start': 15025.764, 'duration': 4.021}, {'end': 15034.185, 'text': 'Label encoder introduces order in your data.', 'start': 15031.165, 'duration': 3.02}, {'end': 15036.646, 'text': 'So, be careful when you are using that.', 'start': 15035.226, 'duration': 1.42}], 'summary': 'In ordinal data, introduce order with label encoder; for non-ordinal data, use one-hot coding.', 'duration': 27.431, 'max_score': 15009.215, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw15009215.jpg'}, {'end': 15406.787, 'src': 'embed', 'start': 15377.073, 'weight': 5, 'content': [{'end': 15380.115, 'text': 'Rest of the columns which are object data types I am changing into float types.', 'start': 15377.073, 'duration': 3.042}, {'end': 15382.638, 'text': 'Till this point.', 'start': 15381.757, 'duration': 0.881}, {'end': 15383.118, 'text': 'is it ok??', 'start': 15382.638, 'duration': 0.48}, {'end': 15391.466, 'text': 'Once I have changed the data types to numbers numerical, I know there are many missing values.', 'start': 15385.32, 'duration': 6.146}, {'end': 15403.626, 'text': 'So the strategy I am using here for missing values is replace the missing values of price column with the median of the price column.', 'start': 15393.203, 'duration': 10.423}, {'end': 15406.787, 'text': 'I want to explain this step to you.', 'start': 15405.547, 'duration': 1.24}], 'summary': 'Converting object data types to float, replacing missing price values with median.', 'duration': 29.714, 'max_score': 15377.073, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw15377073.jpg'}], 'start': 13584.014, 'title': 'Linear models and data analysis techniques', 'summary': 'Covers understanding linear and non-linear relationships, highlighting the significance of pair plots, discusses ensemble techniques, linear model assumptions, advantages of linear models, and data analysis and preprocessing techniques including data type conversion, encoding, and handling missing values in the auto mpg dataset.', 'chapters': [{'end': 13905.67, 'start': 13584.014, 'title': 'Understanding linear and non-linear relationships', 'summary': 'Emphasizes the importance of recognizing non-linear relationships in data analysis, highlighting that an r value close to 0 does not necessarily indicate no relationship, but rather a lack of linear relationship, and stresses the significance of pair plots in identifying non-linear distributions.', 'duration': 321.656, 'highlights': ['The significance of recognizing non-linear relationships in data analysis, emphasizing that an r value close to 0 does not indicate no relationship, but rather a lack of linear relationship, which can be identified through pair plots (Quantity: N/A)', 'The importance of utilizing pair plots as a tool to understand data and to identify non-linear distributions, suggesting it as a fundamental practice in data analysis (Quantity: N/A)', 'The distinction between linear and non-linear functions in mathematics and data science, highlighting the interpretation of x^2 as a linear function in data science, despite being considered non-linear in mathematics (Quantity: N/A)', 'The recommendation to try both linear and non-linear models and determine which minimizes bias variance errors for effective data analysis (Quantity: N/A)']}, {'end': 14312.408, 'start': 13907.732, 'title': 'Ensemble techniques and linear model assumptions', 'summary': 'Discusses the concept of ensemble techniques, particularly the wisdom of the crowd, in data science, and highlights the assumptions and challenges related to linear models, including the impact of outliers and the concept of homoscedasticity and heteroscedasticity.', 'duration': 404.676, 'highlights': ['The concept of ensemble techniques, particularly the wisdom of the crowd, is applied in data science, where a collection of models is put into play, leveraging the idea that the audience, when grouped together, rarely go wrong.', 'Linear models make assumptions about the relationship between independent variables and the target variable, as well as the distribution of errors, and are prone to performance issues when dealing with outliers, as the best fit line may get influenced and gravitate towards outliers, leading to suboptimal model performance.', 'The concept of homoscedasticity and heteroscedasticity in linear models is discussed, emphasizing the expectation of uniform spread of errors across all ranges of independent variables, with heteroscedasticity indicating varying error magnitudes for different ranges, which can impact model validity and performance.']}, {'end': 14550.178, 'start': 14315.603, 'title': 'Advantages of linear models', 'summary': 'Discusses the advantages of linear models, such as ease of interpretation, ability to provide physical meaning to equations, susceptibility to outliers, and limitations in classification, emphasizing the importance of selecting high-quality attributes for model building.', 'duration': 234.575, 'highlights': ['Linear models provide physical meaning to equations, allowing for easy interpretation and understanding of the relationship between variables. Linear models, such as y = m1x1 + m2x2 + c, allow for the physical interpretation of the relationship between variables, providing insight into how a one unit increase in x1 affects y (given by m1).', 'The susceptibility of linear models to outliers and the impact of unreliable x bar and y bar on model reliability. Linear models are prone to outliers, and their reliability is affected when x bar and y bar are not reliable, emphasizing the importance of attribute selection for building high-quality models.', 'Limitations of linear models in classification compared to non-linear models, and the potential for better classification with non-linear models. Linear models exhibit limitations in classification, as demonstrated by the example of better classification results with non-linear models, highlighting the importance of model selection based on the nature of the data.']}, {'end': 14930.71, 'start': 14554.079, 'title': 'Data analysis and preprocessing', 'summary': 'Covers loading a csv file, inspecting data, dropping columns with low variance, and converting string data types to numbers in preparation for machine learning analysis.', 'duration': 376.631, 'highlights': ['The chapter covers loading a CSV file, inspecting data, dropping columns with low variance, and converting string data types to numbers in preparation for machine learning analysis. The chapter involves loading a CSV file, inspecting the data, dropping columns with low variance, and converting string data types to numbers in preparation for machine learning analysis.', "The data frame 'car_df' is created from the CSV file, containing records with missing data and string data types. The data frame 'car_df' is created from the CSV file, containing records with missing data and string data types.", 'Columns with low variance are dropped to remove irrelevant data that does not influence the car price prediction. Columns with low variance are dropped to remove irrelevant data that does not influence the car price prediction.', 'String data types are converted into numbers to make them compatible with machine learning algorithms. String data types are converted into numbers to make them compatible with machine learning algorithms.']}, {'end': 15252.798, 'start': 14931.131, 'title': 'Data type conversion and encoding', 'summary': 'Discusses the conversion of string data into numerical values, emphasizing the importance of careful handling to avoid introducing erroneous order, and introduces techniques like one-hot encoding and label encoder. it also explains the concept of one-hot coding and its application in converting categorical data into numerical values.', 'duration': 321.667, 'highlights': ['The importance of careful handling in converting string data into numerical values to avoid introducing erroneous order is emphasized, with examples like gender and income group data. It is crucial to handle the conversion of string data into numerical values with care, illustrated through examples of gender and income group data, where introducing an order could create misleading relationships.', 'Introduction of techniques like one-hot encoding for non-ordinal data and label encoder for introducing order in data is explained, with a cautionary note on using label encoder and a recommendation to use one-hot coders instead. The explanation of techniques such as one-hot encoding for non-ordinal data and label encoder for introducing order in data is provided, with a caution against using label encoder and a suggestion to use one-hot coders.', 'Explanation of one-hot coding concept and its application in converting categorical gender data into two numerical columns is detailed. The concept of one-hot coding is explained, along with its practical application in converting categorical gender data into two numerical columns.']}, {'end': 15816.081, 'start': 15253.683, 'title': 'Handling missing values in auto mpg dataset', 'summary': "Covers the process of identifying and replacing missing values in the auto mpg dataset, including the use of 'nan' for null values, converting data types to float, and replacing missing values with the median of the price column to address outliers and randomness in the data.", 'duration': 562.398, 'highlights': ["Replacing missing values with 'nan' and converting object data types to float to enable numerical Python operations. The process involves replacing missing values with 'nan' and converting object data types to float to enable numerical Python operations, ensuring the dataset is ready for analysis.", 'Strategy for handling missing values involves replacing the missing values of the price column with the median of the price column to address outliers and randomness in the data. A strategy for handling missing values is described, involving the replacement of missing values in the price column with the median of the column to address outliers and randomness in the data.', 'Discussion of different strategies to address missing values, including the use of Multiple Imputations Through Chained Equations (MICE) for predicting values based on other columns. The chapter covers different strategies for addressing missing values, including the use of MICE for predicting values based on other columns, providing a powerful tool for handling missing data.']}], 'duration': 2232.067, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw13584014.jpg', 'highlights': ['The importance of utilizing pair plots as a tool to understand data and to identify non-linear distributions, suggesting it as a fundamental practice in data analysis', 'The concept of ensemble techniques, particularly the wisdom of the crowd, is applied in data science, where a collection of models is put into play, leveraging the idea that the audience, when grouped together, rarely go wrong.', 'Linear models provide physical meaning to equations, allowing for easy interpretation and understanding of the relationship between variables.', 'The chapter covers loading a CSV file, inspecting data, dropping columns with low variance, and converting string data types to numbers in preparation for machine learning analysis.', 'Introduction of techniques like one-hot encoding for non-ordinal data and label encoder for introducing order in data is explained, with a cautionary note on using label encoder and a recommendation to use one-hot coders instead.', 'Strategy for handling missing values involves replacing the missing values of the price column with the median of the price column to address outliers and randomness in the data.']}, {'end': 17556.466, 'segs': [{'end': 15937.857, 'src': 'embed', 'start': 15877.235, 'weight': 0, 'content': [{'end': 15882.338, 'text': 'The distribution is likely to be a symmetric bell curve for height alright.', 'start': 15877.235, 'duration': 5.103}, {'end': 15884.799, 'text': 'I do not see any skew to worry about.', 'start': 15882.758, 'duration': 2.041}, {'end': 15887.081, 'text': 'Look at the car weight.', 'start': 15885.5, 'duration': 1.581}, {'end': 15895.19, 'text': 'same story repeats here 2555, 2014.', 'start': 15888.905, 'duration': 6.285}, {'end': 15903.839, 'text': 'If there is any column where the mean and median are very different, then you might be having a skewed data set.', 'start': 15895.191, 'duration': 8.648}, {'end': 15907.943, 'text': 'Look at this one price, now price is the target, forget the price.', 'start': 15904.58, 'duration': 3.363}, {'end': 15913.57, 'text': 'let us see if our analysis stands, let us see whether it stands, ok.', 'start': 15909.949, 'duration': 3.621}, {'end': 15920.132, 'text': 'So, this is numerical way, statistical way of analyzing data, but instead you can do a pair plot.', 'start': 15914.07, 'duration': 6.062}, {'end': 15927.954, 'text': 'In pair plot, I always prefer to have the diagonals is in form of density graphs.', 'start': 15922.853, 'duration': 5.101}, {'end': 15937.857, 'text': 'How do you get that? When you call the pair panel, you give diagonal kind is KDE, Kernel Density Estimates.', 'start': 15929.775, 'duration': 8.082}], 'summary': 'Analyzing data distribution and identifying skewness with statistical and visual methods.', 'duration': 60.622, 'max_score': 15877.235, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw15877235.jpg'}, {'end': 16192.886, 'src': 'embed', 'start': 16165.458, 'weight': 2, 'content': [{'end': 16168.801, 'text': 'instead of giving you a scatter plot, they have given you a density curve.', 'start': 16165.458, 'duration': 3.343}, {'end': 16174.105, 'text': 'By default they give you histograms we have changed it to density curve.', 'start': 16170.823, 'duration': 3.282}, {'end': 16181.051, 'text': 'The advantage of doing that is in R if you do this pair plot it will leave out the diagonals as blank.', 'start': 16175.106, 'duration': 5.945}, {'end': 16185.804, 'text': 'It will leave out the diagonals as blank, it will not show you anything there.', 'start': 16183.203, 'duration': 2.601}, {'end': 16189.185, 'text': 'Whereas what happens in scikit-learn is they give you a distribution.', 'start': 16186.364, 'duration': 2.821}, {'end': 16192.886, 'text': 'Why leave it blank? Use it to show the distribution on that particular column.', 'start': 16189.765, 'duration': 3.121}], 'summary': 'In r, using a density curve instead of a scatter plot in a pair plot leaves out the diagonals as blank and shows the distribution on that particular column.', 'duration': 27.428, 'max_score': 16165.458, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw16165458.jpg'}, {'end': 16262.385, 'src': 'embed', 'start': 16234.789, 'weight': 3, 'content': [{'end': 16237.67, 'text': 'Look at the length of the car, again almost a normal distribution.', 'start': 16234.789, 'duration': 2.881}, {'end': 16248.47, 'text': 'But we see some kind of overlap here between the you see, multiple gaussians here, one behind the other may not be of concern right now,', 'start': 16239.261, 'duration': 9.209}, {'end': 16250.012, 'text': 'because they are all kind of overlapping.', 'start': 16248.47, 'duration': 1.542}, {'end': 16257.4, 'text': 'Remember which was the column where we saw perfect match, the central values, what is the column? Bore.', 'start': 16251.754, 'duration': 5.646}, {'end': 16260.803, 'text': 'Height Height, height was the column.', 'start': 16257.42, 'duration': 3.383}, {'end': 16262.385, 'text': 'This is the distribution of height.', 'start': 16261.204, 'duration': 1.181}], 'summary': 'Car length follows a normal distribution, with overlapping gaussians. perfect match in bore column for height distribution.', 'duration': 27.596, 'max_score': 16234.789, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw16234789.jpg'}, {'end': 16755.121, 'src': 'embed', 'start': 16693.758, 'weight': 4, 'content': [{'end': 16698.2, 'text': 'So, if you start dropping the rows for all columns where you have out last, your data set might shrink.', 'start': 16693.758, 'duration': 4.442}, {'end': 16712.606, 'text': 'So, now every column except one every column has outlier.', 'start': 16709.504, 'duration': 3.102}, {'end': 16718.388, 'text': 'So, when we are removing the outlier for every column 569 records comes down to 230,000.', 'start': 16712.626, 'duration': 5.762}, {'end': 16720.411, 'text': 'So, that is not good.', 'start': 16718.389, 'duration': 2.022}, {'end': 16722.032, 'text': 'That is not good.', 'start': 16720.871, 'duration': 1.161}, {'end': 16726.354, 'text': 'So, dropping records is always the last option when you have plenty of data.', 'start': 16722.232, 'duration': 4.122}, {'end': 16731.357, 'text': 'When data size itself is restricted dropping records is not a good option.', 'start': 16728.215, 'duration': 3.142}, {'end': 16736.27, 'text': 'Shall we move on? All of you? Okay.', 'start': 16733.518, 'duration': 2.752}, {'end': 16737.29, 'text': "Let's move.", 'start': 16736.849, 'duration': 0.441}, {'end': 16741.613, 'text': 'So now, always start by analyzing the diagonals first.', 'start': 16737.991, 'duration': 3.622}, {'end': 16744.333, 'text': 'How the data is distributed on each column.', 'start': 16742.293, 'duration': 2.04}, {'end': 16746.436, 'text': 'This is my univariate analysis.', 'start': 16744.654, 'duration': 1.782}, {'end': 16750.238, 'text': 'This is what the output of DF describes.', 'start': 16748.317, 'duration': 1.921}, {'end': 16753.8, 'text': 'How your data is distributed on that particular column.', 'start': 16751.559, 'duration': 2.241}, {'end': 16755.121, 'text': "It's a basic statistic.", 'start': 16753.84, 'duration': 1.281}], 'summary': 'Removing outliers reduced data from 569 to 230,000 records, which is not ideal. dropping records is not recommended when data size is restricted.', 'duration': 61.363, 'max_score': 16693.758, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw16693758.jpg'}, {'end': 16981.196, 'src': 'embed', 'start': 16955.828, 'weight': 6, 'content': [{'end': 16960.95, 'text': 'If calculating the length of the curve is more prone to measurement errors, drop the length of the curve.', 'start': 16955.828, 'duration': 5.122}, {'end': 16964.231, 'text': 'Keep the other one.', 'start': 16963.59, 'duration': 0.641}, {'end': 16971.773, 'text': 'However, if you are not able to take that call, you might want to run a principal component analysis,', 'start': 16965.551, 'duration': 6.222}, {'end': 16977.255, 'text': 'a mathematical technique or singular value decomposition, another technique.', 'start': 16971.773, 'duration': 5.482}, {'end': 16981.196, 'text': 'Using those techniques, we can create synthetic dimension out of this.', 'start': 16977.895, 'duration': 3.301}], 'summary': 'Consider dropping length measurement if prone to errors. use pca or svd for synthetic dimensions.', 'duration': 25.368, 'max_score': 16955.828, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw16955828.jpg'}, {'end': 17078.02, 'src': 'embed', 'start': 17052.351, 'weight': 7, 'content': [{'end': 17061.055, 'text': 'If you look at the cars with number of cylinders in your data set most of the cars have 4 cylinders, 5 cylinders, 6 cylinder and 8 cylinder.', 'start': 17052.351, 'duration': 8.704}, {'end': 17070.9, 'text': 'Most of the cars look at this have 4 cylinders, by the way these data points might be sitting on top of one another.', 'start': 17063.216, 'duration': 7.684}, {'end': 17075.002, 'text': 'So, that does not mean your data set has only 1, 2, 3, 4, 5, 6 records of 4 cylinders.', 'start': 17071.64, 'duration': 3.362}, {'end': 17078.02, 'text': 'do that mistake.', 'start': 17077.179, 'duration': 0.841}], 'summary': 'Most cars in the dataset have 4, 5, 6, or 8 cylinders.', 'duration': 25.669, 'max_score': 17052.351, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw17052351.jpg'}, {'end': 17158.266, 'src': 'embed', 'start': 17129.653, 'weight': 8, 'content': [{'end': 17134.335, 'text': 'This diagram, which I plotted here, is called KDE Kernel Density Estimates.', 'start': 17129.653, 'duration': 4.682}, {'end': 17135.936, 'text': 'So there are three words in this.', 'start': 17134.555, 'duration': 1.381}, {'end': 17138.717, 'text': "First word is, it's an estimate.", 'start': 17137.016, 'duration': 1.701}, {'end': 17148.401, 'text': "It's an estimation of the possible distribution, density distribution, density estimate in the population.", 'start': 17140.078, 'duration': 8.323}, {'end': 17153.924, 'text': 'In the population, how the cars are distributed around this value in cylinder.', 'start': 17149.302, 'duration': 4.622}, {'end': 17158.266, 'text': 'It is a density estimate based on a mathematical function.', 'start': 17154.384, 'duration': 3.882}], 'summary': 'Kde is a density estimate based on a mathematical function.', 'duration': 28.613, 'max_score': 17129.653, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw17129653.jpg'}, {'end': 17277.476, 'src': 'embed', 'start': 17235.764, 'weight': 9, 'content': [{'end': 17240.769, 'text': 'If a car coming from Japan versus car coming from US, the mileage, miles per gallon is going to be different.', 'start': 17235.764, 'duration': 5.005}, {'end': 17246.221, 'text': 'So it is a categorical variable, origin of the car is categorical variable, it has an impact on mileage.', 'start': 17241.98, 'duration': 4.241}, {'end': 17251.022, 'text': 'Look at the ratio of the distribution.', 'start': 17246.241, 'duration': 4.781}, {'end': 17277.476, 'text': 'in a column there are eight values, but majority of the records is that value one very few percentage of value two.', 'start': 17270.475, 'duration': 7.001}], 'summary': 'Origin of the car impacts mileage based on categorical variables and distribution ratio.', 'duration': 41.712, 'max_score': 17235.764, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw17235764.jpg'}, {'end': 17516.04, 'src': 'embed', 'start': 17482.239, 'weight': 10, 'content': [{'end': 17484.74, 'text': 'So you have to do that analysis, ok.', 'start': 17482.239, 'duration': 2.501}, {'end': 17490.944, 'text': 'So when you do a real life project, when you get into capstone project, you have to reflect all these things.', 'start': 17485.381, 'duration': 5.563}, {'end': 17498.189, 'text': 'How did you analyze your columns? How did you handle your outliers? How did you handle your missing values? And you have to justify your strategy.', 'start': 17491.404, 'duration': 6.785}, {'end': 17502.972, 'text': 'Why you did that? You cannot blindly replace something with a median.', 'start': 17498.269, 'duration': 4.703}, {'end': 17505.113, 'text': 'That may not be the optimal strategy.', 'start': 17503.332, 'duration': 1.781}, {'end': 17516.04, 'text': 'Shall we move on? Yes, for symbolic this is, symbolic means cars are initially assigned the risk factors.', 'start': 17508.649, 'duration': 7.391}], 'summary': 'In capstone projects, justifying analysis and handling outliers and missing values is crucial.', 'duration': 33.801, 'max_score': 17482.239, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw17482239.jpg'}], 'start': 15816.842, 'title': 'Analyzing car data and distributions', 'summary': 'Covers using descriptive statistics and pair plots to analyze car data, understanding data distributions, detecting outliers, and exploring car cylinder distributions, including prevalence and kde kernel density estimates.', 'chapters': [{'end': 16113.305, 'start': 15816.842, 'title': 'Analyzing car data with descriptive statistics', 'summary': 'Discusses the use of basic statistics and pair plots in analyzing car data, including comparing mean and median values to identify skewed data, and using pair plots with kernel density estimates to visualize relationships between variables.', 'duration': 296.463, 'highlights': ['The distribution of car height and weight is likely to be symmetrical, with minimal skew. The mean and median values for car height and weight show minimal difference, indicating a symmetrical bell curve distribution.', 'Using pair plots with kernel density estimates to visualize relationships between variables in the car dataset. The pair plot with kernel density estimates helps visualize the relationships between variables in the car dataset, with a focus on creating a square matrix for comparison.', 'Identifying the importance of comparing mean and median values to detect skewed data sets. Emphasizing the significance of comparing mean and median values to detect skewed data sets, especially when analyzing the car dataset.']}, {'end': 16645.944, 'start': 16113.305, 'title': 'Understanding pair plot and data distributions', 'summary': 'Discusses the pair plot analysis, showing the limitations of comparing a column with itself, understanding the distribution of data, identifying mixtures of gaussians, and handling outliers in the dataset.', 'duration': 532.639, 'highlights': ['Explaining the limitations of comparing a column with itself in pair plot analysis, showing that it results in a scatter plot with no useful information. Comparison of a column with itself in pair plot results in a scatter plot with no spread or useful information.', 'Identifying the use of density curves instead of scatter plots for column versus itself, and the advantage of showing distributions on the diagonals of the pair plot. Using density curves instead of scatter plots for column versus itself, and the advantage of showing distributions on the diagonals of the pair plot in R.', 'Discussing the distribution of various columns in the dataset, identifying normal distributions, overlapping Gaussian curves, and the significance of identifying mixtures of Gaussians for building linear models. Identifying normal distributions, overlapping Gaussian curves, and the significance of identifying mixtures of Gaussians for building linear models.', 'Highlighting the impact of outliers on standard deviation, the process of handling outliers, and the implications of sharp curves on standard deviation and outlier identification. Impact of outliers on standard deviation, handling outliers, and the implications of sharp curves on standard deviation and outlier identification.']}, {'end': 17022.061, 'start': 16653.447, 'title': 'Data analysis and outlier detection', 'summary': 'Discusses the challenges of removing outliers in a dataset, the impact on data size, the importance of analyzing data distribution, and identifying relationships between columns for decision making in data analysis.', 'duration': 368.614, 'highlights': ['Removing outlier records can significantly reduce the dataset size, from 569 records to 230,000, impacting the data analysis process. When removing outlier records for every column, the dataset size reduced from 569 records to 230,000, emphasizing the significant impact of removing outliers on the dataset size.', 'Analyzing the distribution of data on each column and identifying relationships between columns is crucial for data analysis and decision making. The chapter emphasizes the importance of analyzing data distribution and identifying relationships between columns through univariate and bivariate analysis, highlighting its significance in the data analysis and decision-making process.', 'The interdependence of dimensions in the dataset challenges the assumption of linear independence, requiring techniques such as principal component analysis or singular value decomposition to handle potential errors and create synthetic dimensions. The interdependence of dimensions challenges the assumption of linear independence, prompting the need for techniques like principal component analysis or singular value decomposition to handle potential errors and create synthetic dimensions, addressing the challenges posed by interdependent dimensions in the dataset.']}, {'end': 17556.466, 'start': 17052.351, 'title': 'Understanding car cylinder distribution', 'summary': 'Explains the distribution of car cylinders, highlighting the prevalence of 4, 5, 6, and 8 cylinders, and the use of kde kernel density estimates to estimate the density distribution of the cylinder column in the population based on available data.', 'duration': 504.115, 'highlights': ['The majority of cars in the dataset have 4 cylinders, followed by 5, 6, and 8 cylinders, with few records for 12 and 2 cylinders. Quantifiable data: Majority of cars have 4 cylinders.', 'The use of KDE Kernel Density Estimates to estimate the density distribution of the cylinder column in the population based on available data. Quantifiable data: KDE Kernel Density Estimates used to estimate density distribution.', 'The impact of car origin on mileage as a categorical variable and the importance of domain knowledge in model building and data analysis. Quantifiable data: Mention of categorical variable and the importance of domain knowledge.', 'The necessity to analyze columns, handle outliers and missing values, and justify strategies based on domain knowledge in real-life projects. Quantifiable data: Emphasis on the necessity of justifying strategies based on domain knowledge in real-life projects.']}], 'duration': 1739.624, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw15816842.jpg', 'highlights': ['Using pair plots with kernel density estimates to visualize relationships between variables in the car dataset.', 'Identifying the importance of comparing mean and median values to detect skewed data sets.', 'Identifying the use of density curves instead of scatter plots for column versus itself, and the advantage of showing distributions on the diagonals of the pair plot.', 'Discussing the distribution of various columns in the dataset, identifying normal distributions, overlapping Gaussian curves, and the significance of identifying mixtures of Gaussians for building linear models.', 'Removing outlier records can significantly reduce the dataset size, from 569 records to 230,000, impacting the data analysis process.', 'Analyzing the distribution of data on each column and identifying relationships between columns is crucial for data analysis and decision making.', 'The interdependence of dimensions in the dataset challenges the assumption of linear independence, requiring techniques such as principal component analysis or singular value decomposition to handle potential errors and create synthetic dimensions.', 'Quantifiable data: Majority of cars have 4 cylinders.', 'Quantifiable data: KDE Kernel Density Estimates used to estimate density distribution.', 'Quantifiable data: Mention of categorical variable and the importance of domain knowledge.', 'Quantifiable data: Emphasis on the necessity of justifying strategies based on domain knowledge in real-life projects.']}, {'end': 20654.641, 'segs': [{'end': 17583.853, 'src': 'embed', 'start': 17556.466, 'weight': 0, 'content': [{'end': 17559.747, 'text': 'by this high end foreign brands they all come with embedded chips.', 'start': 17556.466, 'duration': 3.281}, {'end': 17563.369, 'text': 'Those embedded chips in real time.', 'start': 17561.388, 'duration': 1.981}, {'end': 17572.253, 'text': 'they capture the data about your driving style and pass it on to a central server where they sit down and analyze, and the risk factor is adjusted,', 'start': 17563.369, 'duration': 8.884}, {'end': 17577.755, 'text': 'recalculated, recalibrated, based on how the car is being driven.', 'start': 17572.253, 'duration': 5.502}, {'end': 17580.176, 'text': 'The symboling reflects that.', 'start': 17579.055, 'duration': 1.121}, {'end': 17583.853, 'text': "right. ok, let's move now.", 'start': 17581.411, 'duration': 2.442}], 'summary': 'High-end foreign cars have embedded chips that capture and analyze driving data in real time, adjusting risk factors based on driving style.', 'duration': 27.387, 'max_score': 17556.466, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw17556466.jpg'}, {'end': 17696.935, 'src': 'embed', 'start': 17664.883, 'weight': 1, 'content': [{'end': 17670.203, 'text': 'Are you all ok with this? so have segregated the independent and the dependent variable.', 'start': 17664.883, 'duration': 5.32}, {'end': 17680.61, 'text': 'If this was R, I do not need to do this because R takes in as input only one single data frame and along with that a formula.', 'start': 17672.104, 'duration': 8.506}, {'end': 17686.733, 'text': 'Whereas in scikit-learn I have to separate this data into independent and dependent variables.', 'start': 17682.311, 'duration': 4.422}, {'end': 17696.935, 'text': 'Then I am making use of the random function which generates the training set and test we call it train test split.', 'start': 17689.775, 'duration': 7.16}], 'summary': 'In scikit-learn, data is segregated into independent and dependent variables for training using train test split.', 'duration': 32.052, 'max_score': 17664.883, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw17664883.jpg'}, {'end': 18271.792, 'src': 'embed', 'start': 18244.679, 'weight': 2, 'content': [{'end': 18250.322, 'text': 'two we have some kind of a what you call correlation between the independent variables, which you noticed.', 'start': 18244.679, 'duration': 5.643}, {'end': 18252.804, 'text': 'we have not done anything about that multicollinearity.', 'start': 18250.322, 'duration': 2.482}, {'end': 18262.664, 'text': 'So, those are the core reasons which will come across in all data set which lead to overall model level problems.', 'start': 18255.479, 'duration': 7.185}, {'end': 18271.792, 'text': 'Now, I am going to take you through further down into slightly more deeper stuff.', 'start': 18264.786, 'duration': 7.006}], 'summary': 'High multicollinearity observed in independent variables leading to model problems.', 'duration': 27.113, 'max_score': 18244.679, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw18244679.jpg'}, {'end': 19649.871, 'src': 'embed', 'start': 19614.675, 'weight': 4, 'content': [{'end': 19615.616, 'text': 'Null hypothesis is nothing doing.', 'start': 19614.675, 'duration': 0.941}, {'end': 19616.356, 'text': 'There is no relationship.', 'start': 19615.636, 'duration': 0.72}, {'end': 19619.027, 'text': 'No relationship means this is the way.', 'start': 19617.165, 'duration': 1.862}, {'end': 19622.209, 'text': 'Now the question is from this you have drawn the sample.', 'start': 19620.328, 'duration': 1.881}, {'end': 19622.71, 'text': 'in the sample.', 'start': 19622.209, 'duration': 0.501}, {'end': 19626.333, 'text': "you are saying this what is the probability of seeing this kind of distribution if it's coming from this?", 'start': 19622.71, 'duration': 3.623}, {'end': 19631.918, 'text': 'If the p-value is less than 0.05, we reject the null hypothesis.', 'start': 19626.913, 'duration': 5.005}, {'end': 19637.823, 'text': "If p-value is 0.05 or greater than equal to, we say we don't have sufficient evidence to reject the null hypothesis.", 'start': 19632.478, 'duration': 5.345}, {'end': 19642.527, 'text': 'So we accept the null hypothesis likely to be true, we reject this.', 'start': 19639.084, 'duration': 3.443}, {'end': 19649.871, 'text': "This analysis R provides, Python doesn't.", 'start': 19646.889, 'duration': 2.982}], 'summary': 'Null hypothesis tested by p-value, if < 0.05, reject it.', 'duration': 35.196, 'max_score': 19614.675, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw19614675.jpg'}, {'end': 19789.922, 'src': 'embed', 'start': 19763.783, 'weight': 3, 'content': [{'end': 19767.705, 'text': 'We have other ways of establishing the reliability of the dimensions and the models.', 'start': 19763.783, 'duration': 3.922}, {'end': 19768.426, 'text': "Let's use that.", 'start': 19767.765, 'duration': 0.661}, {'end': 19772.548, 'text': 'That is why scikit-learn does not give you this facility.', 'start': 19770.407, 'duration': 2.141}, {'end': 19777.591, 'text': 'But then subsequently under pressure they came out with stats model libraries which gives you this.', 'start': 19773.969, 'duration': 3.622}, {'end': 19783.695, 'text': 'Now, just to end up the show, what is telling us p-value?', 'start': 19780.313, 'duration': 3.382}, {'end': 19789.922, 'text': 'and what it is telling you, the ninety five percent confidence within which these values lie.', 'start': 19784.955, 'duration': 4.967}], 'summary': 'Scikit-learn lacks reliability check, stats models provides p-values and confidence intervals.', 'duration': 26.139, 'max_score': 19763.783, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw19763783.jpg'}, {'end': 19968.534, 'src': 'embed', 'start': 19937.034, 'weight': 5, 'content': [{'end': 19940.838, 'text': 'It is a classification method based on linear regression.', 'start': 19937.034, 'duration': 3.804}, {'end': 19951.482, 'text': 'The response variable, that is, the target variable, can be binary class, default or non-default, or diabetic, non-diabetic,', 'start': 19943.717, 'duration': 7.765}, {'end': 19953.664, 'text': 'or it can be multi-class classification also.', 'start': 19951.482, 'duration': 2.182}, {'end': 19959.728, 'text': 'I can use logistic regression to for optical character recognition, I can do that.', 'start': 19954.304, 'duration': 5.424}, {'end': 19968.534, 'text': 'And in my personal experience I have seen and I have also read some papers about it when you compare models,', 'start': 19960.929, 'duration': 7.605}], 'summary': 'Logistic regression is used for binary or multi-class classification, including optical character recognition.', 'duration': 31.5, 'max_score': 19937.034, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw19937034.jpg'}, {'end': 20414.615, 'src': 'embed', 'start': 20382.074, 'weight': 6, 'content': [{'end': 20391.779, 'text': 'This S curve is called a sigmoid and it is very easy to achieve this.', 'start': 20382.074, 'duration': 9.705}, {'end': 20402.786, 'text': 'The sigmoid is nothing but 1 by 1 plus e, e is Euler constant we use in mathematics, minus mx plus c.', 'start': 20392.878, 'duration': 9.908}, {'end': 20411.833, 'text': 'So this best fit line, the best fit line which is found for you, this bed face line is fed into this transformation, this mathematical formula.', 'start': 20402.786, 'duration': 9.047}, {'end': 20414.615, 'text': 'The result of this transformation is this curve.', 'start': 20412.553, 'duration': 2.062}], 'summary': 'Achieve the sigmoid curve using the mathematical formula 1/(1+e^(-mx+c))', 'duration': 32.541, 'max_score': 20382.074, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw20382074.jpg'}], 'start': 17556.466, 'title': 'Regression and classification models', 'summary': 'Covers data analysis, model evaluation, linear regression coefficients analysis, model improvement, understanding p-values, logistic regression, binary classification, and the sigmoid curve transformation, providing insights into real-time driving data analysis and the significance of coefficients and p-values in statistical and model interpretation.', 'chapters': [{'end': 17953.615, 'start': 17556.466, 'title': 'Data analysis and model evaluation', 'summary': 'Discusses how embedded chips in high-end foreign cars capture driving data in real-time, which is then analyzed to adjust the risk factor. it then explains the process of segregating independent and dependent variables, performing train-test split, building and evaluating the model using predicted and actual values.', 'duration': 397.149, 'highlights': ['High-end foreign cars come with embedded chips that capture driving data in real-time and analyze it to adjust the risk factor. The embedded chips in high-end foreign cars capture real-time driving data and adjust the risk factor based on the driving style.', 'The process of segregating independent and dependent variables, performing train-test split, building and evaluating the model using predicted and actual values is explained in detail. The transcript covers the process of segregating independent and dependent variables, performing train-test split, building the model using fit function, and evaluating the model using predicted and actual values to determine accuracy.', 'The train-test split is performed with a data ratio of 75-25, where 25% is for testing and 75% for training. The train-test split is performed with a data ratio of 75-25, with 25% allocated for testing and 75% for training.']}, {'end': 18172.307, 'start': 17953.615, 'title': 'Linear regression coefficients analysis', 'summary': 'Demonstrates the instantiation of a linear regression model, with the coefficients of the best fit line analyzed to reveal insights into the impact of various features on the car price, highlighting issues of multicollinearity and unexpected impact of certain features.', 'duration': 218.692, 'highlights': ['The coefficients of the best fit line are analyzed, revealing insights into the impact of various features on the car price, such as a $88.57 increase for every one unit increase in symboling and a $71.82 increase for every one unit increase in wheel base, while highlighting unexpected impacts like the negative impact of the length of the car on the price. Analysis of coefficients reveals the impact of various features on car price, e.g., $88.57 increase for one unit increase in symboling, $71.82 increase for one unit increase in wheel base, and a negative impact of car length on price.', 'Issues of multicollinearity are identified, indicated by coefficients not aligning with the expected relationships between features, such as the opposite directions of impact for the length and wheel base of the car, signifying potential data set anomalies. The opposite impacts of length and wheel base indicate potential multicollinearity issues in the dataset, causing coefficients to not align with the expected relationships between features.', "The unexpected impact of features, such as the negative effect of an increase in horsepower on the car's price, is highlighted, pointing towards potential data anomalies affecting the accuracy of the model's predictions. An unexpected negative impact of horsepower on car price indicates potential data anomalies affecting the accuracy of the model's predictions."]}, {'end': 19259.138, 'start': 18172.307, 'title': 'Linear regression analysis and model improvement', 'summary': 'Discusses issues with coefficients, handling outliers, and multicollinearity in linear regression analysis, introduces statsmodel for statistical analysis, and explains the significance of coefficients, p-values, and t-scores in model interpretation.', 'duration': 1086.831, 'highlights': ['Linear regression issues with coefficients, handling outliers, and multicollinearity The chapter discusses the problems with coefficients not aligning with domain knowledge, the impact of outliers on data, and the need to address multicollinearity issues, which are key factors contributing to model level problems.', "Introduction of StatsModel for statistical analysis StatsModel is introduced as a tool for replicating R's statistical analysis capabilities in Python, providing additional statistical information such as R square, adjusted R square, F statistics, AIC, and BIC, which are essential for model evaluation and improvement.", 'Significance of coefficients, p-values, and t-scores in model interpretation The chapter explains the importance of coefficients in determining the best fit line, the use of p-values for hypothesis testing, and the interpretation of t-scores as z-scores in assessing the distribution of coefficients, providing insights into model interpretation and evaluation.']}, {'end': 19900.91, 'start': 19259.838, 'title': 'Understanding p-values and statistical analysis', 'summary': 'Explains the significance of p-values in statistical analysis, emphasizing the probability of finding a relationship between variables in data, and the implications for rejecting or accepting the null hypothesis based on p-values.', 'duration': 641.072, 'highlights': ['The p-value indicates the probability of finding a relationship between variables, with a p-value greater than 0.05 suggesting a likely fluke relationship and a p-value less than 0.05 indicating a reliable relationship. Significance of p-values in determining the reliability of relationships between variables in data.', 'The concept of null hypothesis is explained, where a high p-value leads to acceptance of null hypothesis, while a low p-value results in rejection of the null hypothesis. Explanation of null hypothesis acceptance or rejection based on p-values.', 'The discussion covers the impact of collinearity on the reliability of coefficients and p-values, leading to a split in the statistics community regarding the reliability of p-values. Impact of collinearity on the reliability of coefficients and p-values, leading to a split in the statistics community.']}, {'end': 20187.556, 'start': 19906.393, 'title': 'Logistic regression for classification', 'summary': 'Discusses logistic regression, a classification method based on linear regression, and its versatility in binary and multi-class classification, with a focus on its application in comparing models and the use of probability-based predictions.', 'duration': 281.163, 'highlights': ['Logistic regression is often top-ranked in model comparisons for classifications. In personal experience and research, logistic regression is frequently among the top models for classifications.', 'Logistic regression can be used for both binary and multi-class classification. The method is versatile and can handle binary class or multi-class classification, such as default or non-default, or diabetic and non-diabetic.', "Probability-based predictions can be obtained using logistic regression. The function 'predict_proba' in logistic regression allows for the output of probability values for classification predictions, offering a deeper insight into the model's predictions.", 'Application in building a model to predict default likelihood based on age and other factors. Illustration of a scenario where logistic regression can be used to build a model predicting the likelihood of default based on inputs like age, gender, income, past loans, and property ownership.', 'Observation of age pattern in defaulters and non-defaulters. Noting a pattern where a significant proportion of defaulters are in the lower age bracket while non-defaulters are more concentrated in the higher age bracket, illustrating the potential use of age as a predictor in logistic regression.']}, {'end': 20654.641, 'start': 20189.698, 'title': 'Binary classification and sigmoid curve', 'summary': 'Discusses the process of assigning numerical values to classes, building a linear model, and transforming it into an s curve using the sigmoid function to ensure probabilities are within 0 and 1 and matching model predictions with actual labels.', 'duration': 464.943, 'highlights': ['The process of assigning numerical values to classes, green assigned 1 and red assigned 0, to represent patterns in the data based on income levels. The numerical assignment of 1 to green and 0 to red classes is based on income patterns, showing more greens as income increases and more reds as income decreases.', 'Discussion on building a linear model for probability of belonging to a class and the need to transform it into the S curve using the sigmoid function to ensure probabilities are within 0 and 1. The linear model for probability of belonging to a class is transformed into an S curve using the sigmoid function to ensure probabilities are within 0 and 1, addressing the limitation of linear models going to infinity and the need for probabilities to be within 0 and 1.', "Matching model predictions with actual labels, demonstrating the model's ability to predict the probability of belonging to a class based on data patterns. The model's predictions are shown to match the actual labels, indicating its ability to predict the probability of belonging to a particular class based on data patterns."]}], 'duration': 3098.175, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw17556466.jpg', 'highlights': ['Real-time driving data analysis for risk adjustment in high-end foreign cars.', 'Process of segregating independent and dependent variables, train-test split, and model evaluation.', "Analysis of linear regression coefficients' impact on car price and identification of multicollinearity issues.", 'Introduction of StatsModel for statistical analysis and significance of coefficients, p-values, and t-scores in model interpretation.', 'Significance of p-values in determining the reliability of relationships between variables in data and the concept of null hypothesis.', 'Versatility of logistic regression for binary and multi-class classification, probability-based predictions, and age pattern observation in defaulters and non-defaulters.', 'Numerical assignment of classes based on income levels and transformation of linear model into the S curve using the sigmoid function for probability constraints.']}, {'end': 22522.54, 'segs': [{'end': 20855.275, 'src': 'embed', 'start': 20825.285, 'weight': 2, 'content': [{'end': 20829.547, 'text': 'Yes, in multiclass classification it will be 1 versus rest.', 'start': 20825.285, 'duration': 4.262}, {'end': 20834.589, 'text': 'Suppose I used this logistic regression for OCR, optical character recognition.', 'start': 20829.687, 'duration': 4.902}, {'end': 20843.273, 'text': 'So you will have one S curve for A versus others, one S curve for B versus others, so on so forth.', 'start': 20835.169, 'duration': 8.104}, {'end': 20848.475, 'text': 'So in your mathematical space you will have multiple S curves cutting each other.', 'start': 20843.673, 'duration': 4.802}, {'end': 20855.275, 'text': 'One is for A versus others, the other one is B versus others, C versus others, so on so forth.', 'start': 20851.112, 'duration': 4.163}], 'summary': 'Logistic regression used for ocr with multiclass classification creates multiple s curves for each class.', 'duration': 29.99, 'max_score': 20825.285, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw20825285.jpg'}, {'end': 21322.469, 'src': 'embed', 'start': 21294.766, 'weight': 1, 'content': [{'end': 21302.088, 'text': 'and to find out which is the best sigmoid surface of all the infinite possibilities, it uses a function.', 'start': 21294.766, 'duration': 7.322}, {'end': 21303.568, 'text': 'that function is called log loss.', 'start': 21302.088, 'duration': 1.48}, {'end': 21309.579, 'text': 'this is just like a gradient descent ok.', 'start': 21307.557, 'duration': 2.022}, {'end': 21314.663, 'text': "Don't worry about this function is actually very easy looks dangerous, but actually it is very easy.", 'start': 21310.079, 'duration': 4.584}, {'end': 21322.469, 'text': 'Now let us look at the log loss function looks very dangerous, but it is actually simple very simple ok.', 'start': 21315.603, 'duration': 6.866}], 'summary': 'Using the log loss function to find the best sigmoid surface among infinite possibilities in a simple manner.', 'duration': 27.703, 'max_score': 21294.766, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw21294766.jpg'}, {'end': 21783.241, 'src': 'embed', 'start': 21755.249, 'weight': 0, 'content': [{'end': 21761.956, 'text': 'So the objective is to minimize the sum of squared errors by finding the right logistic surface given the classes.', 'start': 21755.249, 'duration': 6.707}, {'end': 21770.132, 'text': 'So, the gradient descent will be the same.', 'start': 21767.59, 'duration': 2.542}, {'end': 21776.076, 'text': 'So, suppose it is a high loss and it will go in a direction where it wants to reduce the total loss.', 'start': 21770.552, 'duration': 5.524}, {'end': 21783.241, 'text': 'Yesterday the sum of square errors was driving the gradient descent, here this will drive the gradient descent.', 'start': 21778.618, 'duration': 4.623}], 'summary': 'Objective: minimize sum of squared errors to find right logistic surface for classes.', 'duration': 27.992, 'max_score': 21755.249, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw21755249.jpg'}, {'end': 21857.305, 'src': 'embed', 'start': 21824.858, 'weight': 3, 'content': [{'end': 21827.06, 'text': 'Yeah, for classification always even.', 'start': 21824.858, 'duration': 2.202}, {'end': 21834.747, 'text': 'Alright, shall we move on? Extension of linear model that we saw yesterday.', 'start': 21829.902, 'duration': 4.845}, {'end': 21845.036, 'text': 'The beauty of logistic regression is it makes no assumption about the distribution of classes in the feature space.', 'start': 21836.648, 'duration': 8.388}, {'end': 21851.782, 'text': 'Many of these algorithms, linear model especially, if you are building linear classifiers or linear regression they expect Gaussian distributions.', 'start': 21845.376, 'duration': 6.406}, {'end': 21857.305, 'text': 'You understand the term Gaussian distribution, all of you? No? Ok.', 'start': 21853.103, 'duration': 4.202}], 'summary': 'Logistic regression has no assumption about class distribution in feature space, unlike linear models.', 'duration': 32.447, 'max_score': 21824.858, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw21824858.jpg'}, {'end': 22074.002, 'src': 'embed', 'start': 22046.91, 'weight': 4, 'content': [{'end': 22058.278, 'text': 'Right now we took a linearly separable this thing but what if the distribution is not linearly separable then logistic regression will suffer.', 'start': 22046.91, 'duration': 11.368}, {'end': 22068.481, 'text': 'If the distribution was like this with overlap, but it was like this, then it will look well.', 'start': 22060.059, 'duration': 8.422}, {'end': 22074.002, 'text': 'But if the distribution like this on the attribute you have decided, you see the difference?', 'start': 22068.801, 'duration': 5.201}], 'summary': 'Logistic regression suffers if the distribution is not linearly separable.', 'duration': 27.092, 'max_score': 22046.91, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw22046910.jpg'}, {'end': 22162.276, 'src': 'embed', 'start': 22136.161, 'weight': 5, 'content': [{'end': 22143.22, 'text': 'When the data distribution is like this, none of the algorithms will be able to help you 100 percent.', 'start': 22136.161, 'duration': 7.059}, {'end': 22148.084, 'text': 'you have to know your data before you start building the models.', 'start': 22143.22, 'duration': 4.864}, {'end': 22155.29, 'text': 'You have to know your data, which means you have to know every attribute, how the data is distributed if you are in classification,', 'start': 22148.925, 'duration': 6.365}, {'end': 22162.276, 'text': 'how the classes are distributed, which dimensions or which attributes are able to linearly separate the two classes.', 'start': 22155.29, 'duration': 6.986}], 'summary': 'Understanding data distribution is crucial for model building and accuracy.', 'duration': 26.115, 'max_score': 22136.161, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw22136161.jpg'}], 'start': 20657.281, 'title': 'Logistic regression analysis', 'summary': 'Discusses logistic regression model analysis, classification errors, and multiclass classification using logistic regression. it covers the use of s curve, 1 versus rest approach, gradient descent algorithm, and log loss function. it emphasizes minimizing errors through gradient descent and highlights the advantages, limitations, and considerations for effective classification model building.', 'chapters': [{'end': 20745.478, 'start': 20657.281, 'title': 'Model classification errors', 'summary': "Discusses the model's classification errors based on the probability of belonging to different classes with examples, highlighting the occurrence of errors in training data.", 'duration': 88.197, 'highlights': ['The model predicts a very high probability of belonging to the green class based on income, leading to a misclassification of a red point as green.', 'The model predicts a very low probability of belonging to the green class for a green point, resulting in a misclassification.', 'In the presence of overlapping data sets, despite the best fit line, classification errors are inevitable, termed as training errors.']}, {'end': 21397.774, 'start': 20745.478, 'title': 'Logistic regression and multiclass classification', 'summary': 'Discusses the use of s curve in logistic regression to map numerical values to probability functions, multiclass classification using 1 versus rest approach, and the gradient descent algorithm for finding the best fit line, with a focus on the log loss function for evaluating the sigmoid surface.', 'duration': 652.296, 'highlights': ['The use of S curve in logistic regression to map numerical values to probability functions The S curve is used in logistic regression to ensure that probabilities remain between 0 and 1, allowing the mapping of numerical values to a probability function.', 'Multiclass classification using 1 versus rest approach In multiclass classification, the 1 versus rest approach is used, where separate S curves are created for each class versus the rest, breaking the mathematical space into compartments.', 'The gradient descent algorithm for finding the best fit line The linear model uses the gradient descent algorithm to find the best fit line, ensuring that the error function reaches the global minima in the mathematical space.', 'The log loss function for evaluating the sigmoid surface The log loss function is used to evaluate the best sigmoid surface by considering the target variable, and it is crucial for logistic regression and deep learning applications.']}, {'end': 21896.534, 'start': 21397.774, 'title': 'Logistic regression model analysis', 'summary': 'Discusses the analysis of a logistic regression model, emphasizing the calculation of errors for correct and incorrect classifications and the objective of minimizing the sum of squared errors through gradient descent.', 'duration': 498.76, 'highlights': ['The model accurately classifies a blue point, resulting in 0 error due to the high probability (close to 1) of belonging to the blue class. The model accurately classifies a blue point with a high probability close to 1, resulting in 0 error.', 'Misclassification of a blue point leads to a very large error due to the very low probability (almost 0) predicted by the model for the blue class. Misclassification of a blue point leads to a very large error due to the very low probability (almost 0) predicted by the model for the blue class.', 'Misclassification of a red point also results in a large error due to the very high probability (close to 1) predicted by the model for the blue class. Misclassification of a red point also results in a large error due to the very high probability (close to 1) predicted by the model for the blue class.', 'Logistic regression does not assume Gaussian distributions in the feature space, unlike linear models which expect such distributions. Logistic regression does not assume Gaussian distributions in the feature space, unlike linear models which expect such distributions.']}, {'end': 22522.54, 'start': 21897.635, 'title': 'Logistic regression overview', 'summary': 'Provides an overview of logistic regression, highlighting its advantages such as resistance to overfitting, the impact of outliers, and limitations including linear boundaries and the requirement for linearly separable data. it also emphasizes the importance of understanding data distribution and attribute selection for effective classification model building.', 'duration': 624.905, 'highlights': ['The algorithm of logistic regression is resistant to overfitting due to its simplicity as a linear model.', 'Outliers in the dataset can significantly impact the performance of the logistic regression model, especially if they are extreme, leading to increased errors.', "Logistic regression's disadvantage lies in its use of linear boundaries, causing limitations when dealing with non-linearly separable data distributions.", 'The importance of understanding the data distribution and attribute selection is emphasized for effective classification model building, as demonstrated through the example of the PMAT dataset and the Pima Indians dataset.', "Logistic regression's ability to handle multi-class classification using binomial or multinomial distribution is highlighted, along with the option to print out probability values for class prediction."]}], 'duration': 1865.259, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw20657281.jpg', 'highlights': ['The gradient descent algorithm ensures that the error function reaches the global minima in the mathematical space.', 'The log loss function is crucial for logistic regression and deep learning applications.', 'The 1 versus rest approach is used in multiclass classification, creating separate S curves for each class versus the rest.', 'Logistic regression does not assume Gaussian distributions in the feature space, unlike linear models.', "Logistic regression's disadvantage lies in its use of linear boundaries, causing limitations when dealing with non-linearly separable data distributions.", 'The importance of understanding the data distribution and attribute selection is emphasized for effective classification model building.']}, {'end': 26058.57, 'segs': [{'end': 22876.198, 'src': 'embed', 'start': 22845.299, 'weight': 0, 'content': [{'end': 22853.603, 'text': 'the next thing you need to do is, since you are in classification, how many records are available for each class in the data set?', 'start': 22845.299, 'duration': 8.304}, {'end': 22855.764, 'text': 'look at that.', 'start': 22853.603, 'duration': 2.161}, {'end': 22864.954, 'text': 'the number of cases for non-diabetic zero is 500, whereas the number of cases for diabetic is half of that almost half.', 'start': 22855.764, 'duration': 9.19}, {'end': 22876.198, 'text': 'You are preparing for your board exams and most of the questions that you solve, practice on is calculus.', 'start': 22868.315, 'duration': 7.883}], 'summary': 'In the classification data set, there are 500 non-diabetic cases and almost half as many diabetic cases.', 'duration': 30.899, 'max_score': 22845.299, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw22845299.jpg'}, {'end': 22925.555, 'src': 'embed', 'start': 22901.435, 'weight': 1, 'content': [{'end': 22909.882, 'text': 'To the model I have given only 268 of the diabetic cases, maybe this 268 does not reflect all possible permutations that lead to diabetes.', 'start': 22901.435, 'duration': 8.447}, {'end': 22916.208, 'text': 'So the model will perform poorly in predicting the diabetic cases.', 'start': 22913.086, 'duration': 3.122}, {'end': 22920.572, 'text': 'It will perform relatively well in predicting the non-diabetic cases, but I want the reverse.', 'start': 22916.649, 'duration': 3.923}, {'end': 22925.555, 'text': 'The objective is the reverse, ok.', 'start': 22923.194, 'duration': 2.361}], 'summary': 'Model performs poorly in predicting diabetic cases with only 268 cases, while relatively well in predicting non-diabetic cases.', 'duration': 24.12, 'max_score': 22901.435, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw22901435.jpg'}, {'end': 23760.916, 'src': 'embed', 'start': 23731.975, 'weight': 2, 'content': [{'end': 23735.676, 'text': 'How do you prepare your data for analytics? That is where the magic is.', 'start': 23731.975, 'duration': 3.701}, {'end': 23740.258, 'text': 'It is not in the algorithms, okay.', 'start': 23736.776, 'duration': 3.482}, {'end': 23743.019, 'text': 'It depends on data preparation.', 'start': 23742.078, 'duration': 0.941}, {'end': 23750.268, 'text': '80 percent of our project estimated effort in data science goes into preparing the data.', 'start': 23744.123, 'duration': 6.145}, {'end': 23760.916, 'text': 'Running the algorithm is not, then rest around 20, 15 to 20 percent goes into fine tuning a model, it goes into that.', 'start': 23752.189, 'duration': 8.727}], 'summary': 'Data preparation is crucial, accounting for 80% of project effort in data science.', 'duration': 28.941, 'max_score': 23731.975, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw23731975.jpg'}, {'end': 23873.97, 'src': 'embed', 'start': 23837.5, 'weight': 3, 'content': [{'end': 23840.042, 'text': 'You need to learn the tricks of identifying good attributes.', 'start': 23837.5, 'duration': 2.542}, {'end': 23846.087, 'text': 'Feature engineering, feature selection and that is the that is where the core lies.', 'start': 23840.843, 'duration': 5.244}, {'end': 23853.634, 'text': '100 percent domain.', 'start': 23846.107, 'duration': 7.527}, {'end': 23863.159, 'text': 'if I want to predict the time taken to travel from A to B, then I should have the domain experience of what are the factors that can impact the time.', 'start': 23855.171, 'duration': 7.988}, {'end': 23873.97, 'text': 'That is what the attributes they collected are useless attributes, it is not good attribute.', 'start': 23867.563, 'duration': 6.407}], 'summary': 'Learn feature engineering and selection for better predictions. domain knowledge is crucial for accurate predictions.', 'duration': 36.47, 'max_score': 23837.5, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw23837500.jpg'}, {'end': 25330.478, 'src': 'embed', 'start': 25301.082, 'weight': 4, 'content': [{'end': 25308.407, 'text': "That concept of recalibrating your probabilities based on the information that you are gathering, that concept is called Bayes' theorem.", 'start': 25301.082, 'duration': 7.325}, {'end': 25314.632, 'text': 'So what Mr. Bayesian said is start with some probability.', 'start': 25311.329, 'duration': 3.303}, {'end': 25322.012, 'text': 'the default probability values, but keep on recalibrating those probabilities the moment more and more information comes to you.', 'start': 25315.848, 'duration': 6.164}, {'end': 25330.478, 'text': "Howsoever the information may be, howsoever weak the information may be, don't ignore any information and recalibrate your probabilities.", 'start': 25322.633, 'duration': 7.845}], 'summary': "Bayes' theorem advises updating probabilities with new information.", 'duration': 29.396, 'max_score': 25301.082, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw25301082.jpg'}], 'start': 22523.46, 'title': 'Analyzing data and handling class imbalance', 'summary': 'Covers analyzing blood sugar test data, revealing a notable class imbalance with almost twice as many non-diabetic cases, and discusses strategies for handling skewed classes in classification, emphasizing the importance of data preparation over algorithmic performance.', 'chapters': [{'end': 22900.134, 'start': 22523.46, 'title': 'Blood sugar test analysis', 'summary': 'Discusses the analysis of a blood sugar test dataset, exploring relationships between independent variables and the target variable, identifying non-numerical values, addressing missing values and outliers, and analyzing class distribution, with a notable finding that the number of non-diabetic cases is almost double that of diabetic cases.', 'duration': 376.674, 'highlights': ['The number of cases for non-diabetic zero is 500, whereas the number of cases for diabetic is almost half, indicating a significant class imbalance.', 'The mean and median for the test attribute show a drastic difference, signifying potential long tails on the higher side and the impact of outliers on the mean.', 'The chapter discusses the analysis of a blood sugar test dataset, exploring relationships between independent variables and the target variable, identifying non-numerical values, addressing missing values and outliers, and analyzing class distribution.']}, {'end': 23274.368, 'start': 22901.435, 'title': 'Handling skewed classes in classification', 'summary': 'Explains the challenges of skewed classes in classification, the biases in algorithms, and strategies such as up sampling, down sampling, and modifying thresholds to improve accuracy of underrepresented classes in predictive models.', 'duration': 372.933, 'highlights': ["The model's poor performance in predicting diabetic cases due to skewed classes and algorithms biased towards the higher represented class.", 'Explanation of strategies like up sampling, down sampling, and modifying thresholds to improve accuracy of underrepresented classes in predictive models.', 'Discussion on the challenges of skewed distributions in achieving high accuracy for underrepresented class in classification.']}, {'end': 23829.574, 'start': 23276.488, 'title': 'Model evaluation and data preparation', 'summary': 'Discusses the challenges of model evaluation and emphasizes the importance of data preparation, highlighting an example of a logistic regression model with an accuracy of 77% but poor recall for the diabetic class, emphasizing the need for data preparation over algorithmic performance.', 'duration': 553.086, 'highlights': ['Importance of data preparation over algorithms The chapter emphasizes that 80% of the project estimated effort in data science goes into preparing the data, while the rest (15-20%) is for fine-tuning a model.', 'Model accuracy and poor recall for diabetic class An example of a logistic regression model with an accuracy of 77% but poor recall (55-54%) for the diabetic class is highlighted, indicating the need for improved data representation for diabetic cases.', 'Use of confusion matrix for model evaluation The practice of utilizing a confusion matrix to evaluate model performance and the significance of recall for both diabetic and non-diabetic classes are discussed.', 'Impact of data representation on model performance The chapter explains how poor data representation for attributes and underrepresented classes can lead to challenges in building a good model.', 'Data conversion and model instantiation The process of converting data frames into arrays, splitting independent and dependent attributes, and instantiating a logistic regression model are covered in the chapter.']}, {'end': 25123.224, 'start': 23837.5, 'title': 'Understanding feature engineering and upsampling in data science', 'summary': 'Delves into the importance of feature engineering and selection in predicting travel time, the use of upsampling to create synthetic data, and the concept of adjusting bias error through upsampling and downsampling.', 'duration': 1285.724, 'highlights': ['The chapter emphasizes the importance of feature engineering and selection in predicting the time taken to travel from A to B, stressing the need for domain experience to identify factors impacting travel time. Importance of feature engineering and selection in predicting travel time, emphasis on domain experience for identifying factors impacting travel time.', 'Upsampling is discussed as a method for creating synthetic data for the underrepresented class, with a focus on adjusting bias error and achieving balance between classes. Explanation of upsampling to create synthetic data for the underrepresented class, focus on adjusting bias error and achieving class balance.', 'The concept of adjusting bias error through upsampling and downsampling is explained, highlighting the trade-off and the use of k-nearest neighbors for generating synthetic data. Explanation of adjusting bias error through upsampling and downsampling, use of k-nearest neighbors for generating synthetic data.']}, {'end': 26058.57, 'start': 25123.724, 'title': 'Understanding probability and bayesian models', 'summary': "Discusses probability concepts including frequencies approach to probability, poisson distribution, joint probability, conditional probability, and the application of bayes' theorem in building a model, emphasizing the importance of recalibrating probabilities based on new information and the implications of using naive bayes algorithm.", 'duration': 934.846, 'highlights': ["Bayes' theorem emphasizes recalibrating probabilities based on new information, regardless of the strength of the information, in building models. The concept of recalibrating probabilities based on the information gathered, as highlighted by Bayes' theorem, is a fundamental principle in building models, emphasizing the importance of incorporating all information, regardless of its strength.", 'The chapter explains the concept of conditional probability, illustrating the shift in probability calculation after obtaining new information. The explanation of conditional probability demonstrates the adjustment in probability calculation after receiving new information, highlighting the impact of updated data on probability outcomes.', 'The discussion emphasizes the use of Naive Bayes algorithm and the assumption of independence between events, cautioning against its use in cases with strong relationships between variables. The discussion on the Naive Bayes algorithm highlights the assumption of independence between events and advises against its use in scenarios with strong relationships between variables, emphasizing the importance of understanding the limitations of the algorithm.', "The transcript showcases the application of Bayes' theorem in real-life scenarios, such as predicting the gender associated with a name, demonstrating the practical use of Bayesian probability in everyday situations. The real-life application of Bayes' theorem in predicting the gender associated with a name illustrates the practical use of Bayesian probability in everyday decision-making, emphasizing its relevance beyond theoretical concepts."]}], 'duration': 3535.11, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw22523460.jpg', 'highlights': ['The number of cases for non-diabetic zero is 500, whereas the number of cases for diabetic is almost half, indicating a significant class imbalance.', "The model's poor performance in predicting diabetic cases due to skewed classes and algorithms biased towards the higher represented class.", 'Importance of data preparation over algorithms The chapter emphasizes that 80% of the project estimated effort in data science goes into preparing the data, while the rest (15-20%) is for fine-tuning a model.', 'The chapter emphasizes the importance of feature engineering and selection in predicting the time taken to travel from A to B, stressing the need for domain experience to identify factors impacting travel time.', "Bayes' theorem emphasizes recalibrating probabilities based on new information, regardless of the strength of the information, in building models."]}, {'end': 27846.473, 'segs': [{'end': 26220.885, 'src': 'embed', 'start': 26188.815, 'weight': 9, 'content': [{'end': 26194.7, 'text': 'If this crosses 0.5, he is going to belong to not class A.', 'start': 26188.815, 'duration': 5.885}, {'end': 26198.463, 'text': 'Get my point? So what decides your class is the numerator.', 'start': 26194.7, 'duration': 3.763}, {'end': 26201.052, 'text': 'not the denominator.', 'start': 26200.151, 'duration': 0.901}, {'end': 26203.273, 'text': 'denominator is same in both.', 'start': 26201.052, 'duration': 2.221}, {'end': 26207.456, 'text': "so many of the authors they don't reflect this.", 'start': 26203.273, 'duration': 4.183}, {'end': 26214.781, 'text': 'they say that this is the Bayesian model and you will keep wondering where did the Pb go?', 'start': 26207.456, 'duration': 7.325}, {'end': 26220.885, 'text': "so the reason why they don't show Pb is it's a numerator which decide the class.", 'start': 26214.781, 'duration': 6.104}], 'summary': 'The numerator, not the denominator, determines the class in the bayesian model.', 'duration': 32.07, 'max_score': 26188.815, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw26188815.jpg'}, {'end': 26327.047, 'src': 'embed', 'start': 26289.232, 'weight': 0, 'content': [{'end': 26290.713, 'text': 'In the population, how many are diabetic?', 'start': 26289.232, 'duration': 1.481}, {'end': 26292.373, 'text': 'What percentage?', 'start': 26291.713, 'duration': 0.66}, {'end': 26293.154, 'text': 'That is this PA.', 'start': 26292.454, 'duration': 0.7}, {'end': 26298.436, 'text': 'How many are diabetic with this symptoms? Likelihood ratios.', 'start': 26294.895, 'duration': 3.541}, {'end': 26315.124, 'text': 'Okay Now I will go back to the slides which are jumped or should I go forward just once again let me check.', 'start': 26303.819, 'duration': 11.305}, {'end': 26317.405, 'text': 'Yeah, I will go forward and then go backward.', 'start': 26315.464, 'duration': 1.941}, {'end': 26321.083, 'text': 'So I have my data set.', 'start': 26318.581, 'duration': 2.502}, {'end': 26324.225, 'text': 'this data set is 100 records.', 'start': 26321.083, 'duration': 3.142}, {'end': 26327.047, 'text': 'you see 100 in the bottom right.', 'start': 26324.225, 'duration': 2.822}], 'summary': 'Population: 100 records, diabetic count and percentage, symptoms likelihood ratios.', 'duration': 37.815, 'max_score': 26289.232, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw26289232.jpg'}, {'end': 26628.974, 'src': 'embed', 'start': 26598.24, 'weight': 2, 'content': [{'end': 26601.323, 'text': 'So you can keep on recalibrating your probabilities.', 'start': 26598.24, 'duration': 3.083}, {'end': 26602.744, 'text': 'forget flight delay.', 'start': 26601.323, 'duration': 1.421}, {'end': 26604.426, 'text': 'what is the probability of an intrusion?', 'start': 26602.744, 'duration': 1.682}, {'end': 26605.387, 'text': 'network intrusion.', 'start': 26604.426, 'duration': 0.961}, {'end': 26611.172, 'text': 'given that I am seeing this IP, I am seeing this login ID from this particular geographical location.', 'start': 26605.387, 'duration': 5.785}, {'end': 26614.695, 'text': 'So you keep on recalibrating your probability values.', 'start': 26612.333, 'duration': 2.362}, {'end': 26624.144, 'text': 'the moment you see such events in the log files, the moment the probability value crosses the threshold you send your alarm right.', 'start': 26614.695, 'duration': 9.449}, {'end': 26628.974, 'text': 'but there is a problem.', 'start': 26628.174, 'duration': 0.8}], 'summary': 'Recalibrate probabilities based on network events to detect intrusion and send alarm.', 'duration': 30.734, 'max_score': 26598.24, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw26598240.jpg'}, {'end': 26730.123, 'src': 'embed', 'start': 26700.288, 'weight': 1, 'content': [{'end': 26711.092, 'text': "This very small number prevents the numerator from becoming 0, when those rare events occur for which you don't have data, okay.", 'start': 26700.288, 'duration': 10.804}, {'end': 26715.034, 'text': 'Since the event is very rare, we give it a very low value.', 'start': 26712.013, 'duration': 3.021}, {'end': 26724.339, 'text': 'this is in there is lot of technical discussions behind this, but this is called Laplace smoothing factor.', 'start': 26719.315, 'duration': 5.024}, {'end': 26730.123, 'text': 'By default it is a very small value of 0.00005.', 'start': 26725.84, 'duration': 4.283}], 'summary': 'Laplace smoothing factor of 0.00005 prevents numerator from becoming 0 for rare events.', 'duration': 29.835, 'max_score': 26700.288, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw26700288.jpg'}, {'end': 27107.042, 'src': 'embed', 'start': 27061.92, 'weight': 7, 'content': [{'end': 27065.622, 'text': 'Just like your logistic, it is very fast both in training and testing.', 'start': 27061.92, 'duration': 3.702}, {'end': 27070.306, 'text': 'K nearest neighbor is very fast in training but very slow in testing.', 'start': 27067.364, 'duration': 2.942}, {'end': 27075.107, 'text': 'because it has to find so many distances to find the k nearest neighbors.', 'start': 27071.445, 'duration': 3.662}, {'end': 27077.749, 'text': 'Does well with.', 'start': 27077.149, 'duration': 0.6}, {'end': 27081.311, 'text': 'noisy data is what you will see in many of these books and articles.', 'start': 27077.749, 'duration': 3.562}, {'end': 27088.676, 'text': 'but noisy data means there is a lot of jiggle in the universe from where you are taking the snapshots,', 'start': 27081.311, 'duration': 7.365}, {'end': 27091.378, 'text': 'which means two consecutive snapshots will have different distributions.', 'start': 27088.676, 'duration': 2.702}, {'end': 27099.623, 'text': 'So, you have to take this with a pinch of salt, any algorithm will find it difficult with noisy datasets.', 'start': 27093.439, 'duration': 6.184}, {'end': 27107.042, 'text': 'requires few examples for training, again something which we need to actually take up the pinch of salt,', 'start': 27101.96, 'duration': 5.082}], 'summary': 'K nearest neighbor: fast training, slow testing, struggles with noisy datasets', 'duration': 45.122, 'max_score': 27061.92, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw27061920.jpg'}, {'end': 27193.665, 'src': 'embed', 'start': 27160.828, 'weight': 6, 'content': [{'end': 27170.374, 'text': 'People have said that the estimated probabilities, conditional probability is usually less reliable, it is not reliable.', 'start': 27160.828, 'duration': 9.546}, {'end': 27178.399, 'text': 'That is true because these probabilities, assumption is the attributes are independent of each other.', 'start': 27170.874, 'duration': 7.525}, {'end': 27184.123, 'text': 'But because attributes are not independent, these probabilities have to be further jacked up.', 'start': 27179.46, 'duration': 4.663}, {'end': 27189.947, 'text': 'That is the same problem that you saw in p values.', 'start': 27187.825, 'duration': 2.122}, {'end': 27193.665, 'text': 'between MPG and horse power.', 'start': 27191.584, 'duration': 2.081}], 'summary': 'Estimated conditional probabilities are unreliable due to attribute independence assumptions, similar to issues with p values in the case of mpg and horsepower.', 'duration': 32.837, 'max_score': 27160.828, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw27160828.jpg'}, {'end': 27340.167, 'src': 'embed', 'start': 27309.936, 'weight': 3, 'content': [{'end': 27311.237, 'text': "Don't worry about the accuracy and all.", 'start': 27309.936, 'duration': 1.301}, {'end': 27316.741, 'text': 'How do you tweak the accuracy to a further higher up level? That will deal in feature engine model tuning.', 'start': 27311.458, 'duration': 5.283}, {'end': 27322.486, 'text': 'But first thing that you should do is get used to the libraries that you need to call.', 'start': 27318.042, 'duration': 4.444}, {'end': 27328.19, 'text': 'Second thing you need to learn is the basic syntax of scikit-learn.', 'start': 27324.587, 'duration': 3.603}, {'end': 27335.435, 'text': 'Third thing, by the time you become comfortable with all these things, then we can go into model tuning.', 'start': 27330.491, 'duration': 4.944}, {'end': 27340.167, 'text': 'Okay, so that will bring in continuity in your.', 'start': 27337.765, 'duration': 2.402}], 'summary': 'To improve accuracy, focus on feature engine model tuning and become comfortable with necessary libraries and scikit-learn syntax.', 'duration': 30.231, 'max_score': 27309.936, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw27309936.jpg'}, {'end': 27436.4, 'src': 'embed', 'start': 27387.609, 'weight': 4, 'content': [{'end': 27390.911, 'text': 'This was gen, this algorithm was 200 years ago.', 'start': 27387.609, 'duration': 3.302}, {'end': 27396.596, 'text': 'Performance Performance.', 'start': 27394.796, 'duration': 1.8}, {'end': 27404.538, 'text': 'They are all under the hood, they are all related.', 'start': 27397.757, 'duration': 6.781}, {'end': 27408.68, 'text': 'Decision tree also creates based on entropy and Gini.', 'start': 27405.739, 'duration': 2.941}, {'end': 27427.077, 'text': 'That is another way of maximizing basic properties.', 'start': 27411.12, 'duration': 15.957}, {'end': 27429.658, 'text': 'They are conditional probabilities, absolutely.', 'start': 27427.077, 'duration': 2.581}, {'end': 27436.4, 'text': 'What is the probability of class A given income is this, age is this and this is that, that is what the path is.', 'start': 27430.378, 'duration': 6.022}], 'summary': 'Algorithm developed 200 years ago, focuses on performance, uses decision tree based on entropy and gini to maximize properties and conditional probabilities.', 'duration': 48.791, 'max_score': 27387.609, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw27387609.jpg'}, {'end': 27745.302, 'src': 'embed', 'start': 27716.357, 'weight': 5, 'content': [{'end': 27723.965, 'text': 'So, shall we get in the hands on? It is the same data set.', 'start': 27716.357, 'duration': 7.608}, {'end': 27727.408, 'text': 'So, all of you know what the problem is ok.', 'start': 27724.505, 'duration': 2.903}, {'end': 27730.95, 'text': 'We have the wine data sets this thing, but we can do this on.', 'start': 27727.888, 'duration': 3.062}, {'end': 27738.036, 'text': 'Would you like to take this as a homework? Apply this Naive Bayes algorithm on Pima data sets and see what is the result.', 'start': 27731.431, 'duration': 6.605}, {'end': 27745.302, 'text': 'You have to give me overall accuracy and class level accuracy right all of you ok.', 'start': 27739.077, 'duration': 6.225}], 'summary': 'Apply naive bayes algorithm on pima dataset for homework, provide overall and class level accuracy.', 'duration': 28.945, 'max_score': 27716.357, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw27716357.jpg'}], 'start': 26058.57, 'title': 'Bayesian probability, flight delay analysis, k nearest neighbor, and model tuning', 'summary': 'Covers the application of bayesian probability in classification, flight delay probability analysis, k nearest neighbor algorithm, and model tuning for classification in data science. it emphasizes the significance of likelihood ratios, laplace smoothing factor, conditional probabilities, and the challenges and advantages of k nearest neighbor algorithm.', 'chapters': [{'end': 26327.047, 'start': 26058.57, 'title': 'Bayesian probability in classifications', 'summary': 'Discusses the application of bayesian probability in classification algorithms, emphasizing the importance of the numerator in decision-making and highlighting the oversight in considering the denominator as the key determinant, with an emphasis on the significance of pb in classification and the concept of likelihood ratios in determining the probability of events.', 'duration': 268.477, 'highlights': ['The numerator is the key determinant in deciding the class in Bayesian probability, not the denominator, with the importance of PB in classification. The numerator, rather than the denominator, is the crucial factor in determining the class in Bayesian probability, with PB playing a significant role in classification.', 'The oversight in considering the denominator as the key determinant in Bayesian probability models, leading to confusion in understanding the importance of PB. Many authors fail to reflect the importance of the denominator in Bayesian models, causing confusion regarding the significance of PB in decision-making.', "The concept of likelihood ratios in determining the probability of events, particularly in relation to the population's diabetic percentage and the presence of symptoms. The discussion includes the importance of likelihood ratios in determining the probability of events, exemplified through the consideration of the population's diabetic percentage and the presence of symptoms."]}, {'end': 27022.037, 'start': 26327.047, 'title': 'Flight delay probability analysis', 'summary': 'Discusses the analysis of flight delay probabilities based on observed events and likelihood ratios, including the consideration of multiple parameters and the application of laplace smoothing factor to prevent zero probabilities. it also emphasizes the importance of assessing conditional probabilities in recalibrating likelihood values.', 'duration': 694.99, 'highlights': ['The chapter provides a detailed analysis of flight delay probabilities based on observed events and likelihood ratios, aiming to extend the argument to include multiple parameters. It discusses the analysis of flight delay probabilities based on observed events and likelihood ratios, emphasizing the extension of the argument to include multiple parameters.', 'The application of Laplace smoothing factor is explained as a method to prevent zero probabilities for rare events lacking training data. It explains the application of Laplace smoothing factor as a method to prevent zero probabilities for rare events lacking training data.', 'The importance of assessing conditional probabilities in recalibrating likelihood values is emphasized, particularly in the context of network intrusion detection. It emphasizes the importance of assessing conditional probabilities in recalibrating likelihood values, particularly in the context of network intrusion detection.']}, {'end': 27309.136, 'start': 27025.34, 'title': 'K nearest neighbor algorithm', 'summary': 'Covers the k nearest neighbor algorithm, highlighting its advantages including fast training, but slow testing, and its challenges with noisy data and assumptions of attribute independence, along with the recommendation to be aware of the curse of dimensionality.', 'duration': 283.796, 'highlights': ['K nearest neighbor is very fast in training but very slow in testing, as it has to find many distances to determine the k nearest neighbors.', 'Challenges with noisy data, as algorithms find it difficult to handle datasets with significant variations between consecutive snapshots.', 'The assumption of attribute independence leads to unreliable estimated probabilities, and the curse of dimensionality can significantly impact performance when dealing with a large number of attributes.']}, {'end': 27846.473, 'start': 27309.936, 'title': 'Model tuning and classification in data science', 'summary': 'Discusses the process of model tuning, basic syntax of scikit-learn, decision tree classification, pattern recognition, model building prerequisites, and applying naive bayes algorithm on the pima data set.', 'duration': 536.537, 'highlights': ['The process of model tuning and the importance of getting accustomed to the required libraries and scikit-learn syntax are emphasized as prerequisites before delving into model tuning, ensuring continuity for individuals on jobs.', 'The concept of decision tree classification, including the creation of equidistant bins for numerical variables, and its relevance in modern computing due to performance enhancement, is explained, highlighting the use of entropy and Gini for maximizing basic properties.', 'The significance of recognizing patterns in data to build models, the necessity of well-defined processes with identifiable outputs for model representation, and the inability to model random processes are discussed, emphasizing the need for predictable outputs to build effective models.', 'The practical application of the Naive Bayes algorithm on the Pima data set for classification tasks is proposed as a hands-on exercise, highlighting the importance of obtaining overall accuracy and class level accuracy in model evaluation.']}], 'duration': 1787.903, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw26058570.jpg', 'highlights': ["The concept of likelihood ratios in determining the probability of events, particularly in relation to the population's diabetic percentage and the presence of symptoms.", 'The application of Laplace smoothing factor as a method to prevent zero probabilities for rare events lacking training data.', 'The importance of assessing conditional probabilities in recalibrating likelihood values, particularly in the context of network intrusion detection.', 'The process of model tuning and the importance of getting accustomed to the required libraries and scikit-learn syntax are emphasized as prerequisites before delving into model tuning, ensuring continuity for individuals on jobs.', 'The concept of decision tree classification, including the creation of equidistant bins for numerical variables, and its relevance in modern computing due to performance enhancement, is explained, highlighting the use of entropy and Gini for maximizing basic properties.', 'The practical application of the Naive Bayes algorithm on the Pima data set for classification tasks is proposed as a hands-on exercise, highlighting the importance of obtaining overall accuracy and class level accuracy in model evaluation.', 'The assumption of attribute independence leads to unreliable estimated probabilities, and the curse of dimensionality can significantly impact performance when dealing with a large number of attributes.', 'Challenges with noisy data, as algorithms find it difficult to handle datasets with significant variations between consecutive snapshots.', 'K nearest neighbor is very fast in training but very slow in testing, as it has to find many distances to determine the k nearest neighbors.', 'The numerator is the key determinant in deciding the class in Bayesian probability, not the denominator, with the importance of PB in classification.']}, {'end': 30543.802, 'segs': [{'end': 28050.586, 'src': 'embed', 'start': 28015.324, 'weight': 2, 'content': [{'end': 28026.774, 'text': 'So this is when your central values are overlapping 98 versus your 99, central values are overlapping very small shift.', 'start': 28015.324, 'duration': 11.45}, {'end': 28032.098, 'text': 'that means you have only a few outliers, very few outliers on the right side.', 'start': 28026.774, 'duration': 5.324}, {'end': 28036.682, 'text': 'Otherwise the mean would have been drastically pulled to the right side.', 'start': 28034.1, 'duration': 2.582}, {'end': 28046.006, 'text': 'So one or two records seem to have outliers on that one particular dimension.', 'start': 28037.324, 'duration': 8.682}, {'end': 28050.586, 'text': 'All outliers have to be handled.', 'start': 28047.246, 'duration': 3.34}], 'summary': '98% and 99% central values overlap with very few outliers to be handled.', 'duration': 35.262, 'max_score': 28015.324, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw28015324.jpg'}, {'end': 28395.049, 'src': 'embed', 'start': 28367.879, 'weight': 0, 'content': [{'end': 28372.901, 'text': 'This is where our Naive Bayes likelihood ratios will be calculated.', 'start': 28367.879, 'duration': 5.022}, {'end': 28377.182, 'text': 'And then I look at this.', 'start': 28375.922, 'duration': 1.26}, {'end': 28380.022, 'text': 'the training set itself.', 'start': 28378.721, 'duration': 1.301}, {'end': 28381.182, 'text': 'I am doing a testing.', 'start': 28380.022, 'duration': 1.16}, {'end': 28383.003, 'text': "it's giving you 97% accuracy.", 'start': 28381.182, 'duration': 1.821}, {'end': 28383.663, 'text': 'why am I doing this?', 'start': 28383.003, 'duration': 0.66}, {'end': 28391.647, 'text': "I'll tell you in a minute, while I can also do this model on the test data.", 'start': 28383.663, 'duration': 7.984}, {'end': 28395.049, 'text': 'here I am running the model on the test data.', 'start': 28391.647, 'duration': 3.402}], 'summary': 'Naive bayes model achieves 97% accuracy on test data.', 'duration': 27.17, 'max_score': 28367.879, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw28367879.jpg'}, {'end': 28668.461, 'src': 'embed', 'start': 28634.726, 'weight': 5, 'content': [{'end': 28636.968, 'text': 'the next challenge will be where will you get the data from?', 'start': 28634.726, 'duration': 2.242}, {'end': 28641.044, 'text': 'Some data will be available within the organization.', 'start': 28638.382, 'duration': 2.662}, {'end': 28643.185, 'text': 'some data will be available outside the organization.', 'start': 28641.044, 'duration': 2.141}, {'end': 28644.706, 'text': 'some data will be available with the customer.', 'start': 28643.185, 'duration': 1.521}, {'end': 28651.73, 'text': 'So getting these stakeholders to give the data to us will be such a challenge.', 'start': 28647.667, 'duration': 4.063}, {'end': 28655.832, 'text': 'So all your soft skills will come into play, right.', 'start': 28652.07, 'duration': 3.762}, {'end': 28660.735, 'text': 'So once the data comes in, you have to first establish the reliability of the data.', 'start': 28656.352, 'duration': 4.383}, {'end': 28668.461, 'text': 'What if the tech support department has given you data where customer is very happy? They have not shared with you the dirty data.', 'start': 28660.775, 'duration': 7.686}], 'summary': 'Challenges in obtaining data from various sources and ensuring its reliability.', 'duration': 33.735, 'max_score': 28634.726, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw28634726.jpg'}, {'end': 29197.221, 'src': 'embed', 'start': 29122.469, 'weight': 4, 'content': [{'end': 29131.012, 'text': 'So we make use of this technique to find out overall how the performance is going to be in production.', 'start': 29122.469, 'duration': 8.543}, {'end': 29135.013, 'text': '96% average accuracy score I can expect from this model on this dataset.', 'start': 29131.032, 'duration': 3.981}, {'end': 29162.853, 'text': 'No, no I will tell you why, why we did that, no we did subsequently testing also we did, look at the line, we have done on testing also.', 'start': 29143.307, 'duration': 19.546}, {'end': 29169.655, 'text': 'Okay, why did I do the evaluation on training and testing, let me tell you this, okay with that we will wind up.', 'start': 29163.313, 'duration': 6.342}, {'end': 29180.968, 'text': 'This is very important point and you should learn this trick now itself and if you reflect all these things in your projects, it will be good.', 'start': 29170.68, 'duration': 10.288}, {'end': 29182.649, 'text': 'Oh, there it is.', 'start': 29182.129, 'duration': 0.52}, {'end': 29192.457, 'text': 'Okay Remember we talked about overfit, underfit? Okay.', 'start': 29183.09, 'duration': 9.367}, {'end': 29193.698, 'text': 'What is underfit??', 'start': 29192.797, 'duration': 0.901}, {'end': 29197.221, 'text': 'How do you define underfit??', 'start': 29196.32, 'duration': 0.901}], 'summary': 'Model predicts 96% accuracy, emphasizes importance of testing and evaluation.', 'duration': 74.752, 'max_score': 29122.469, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw29122469.jpg'}, {'end': 29612.036, 'src': 'embed', 'start': 29582.673, 'weight': 1, 'content': [{'end': 29585.535, 'text': 'we always build multiple models.', 'start': 29582.673, 'duration': 2.862}, {'end': 29589.718, 'text': 'each model might be different algorithm and then we average out.', 'start': 29585.535, 'duration': 4.183}, {'end': 29592.86, 'text': 'that concept is called stacking.', 'start': 29589.718, 'duration': 3.142}, {'end': 29596.623, 'text': 'if you use the same algorithm multiple instances that we call ensemble.', 'start': 29592.86, 'duration': 3.763}, {'end': 29603.569, 'text': 'So in production we always put an ensemble or a stack,', 'start': 29599.446, 'duration': 4.123}, {'end': 29608.213, 'text': 'and the ensemble or a stack will give you much more reliable result than individual instances will.', 'start': 29603.569, 'duration': 4.644}, {'end': 29612.036, 'text': "Ok Let's move on.", 'start': 29610.155, 'duration': 1.881}], 'summary': 'Ensemble or stacking models in production for more reliable results.', 'duration': 29.363, 'max_score': 29582.673, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw29582673.jpg'}, {'end': 30169.555, 'src': 'embed', 'start': 30142.863, 'weight': 8, 'content': [{'end': 30146.564, 'text': 'But the test data in no way should have been influenced by the training set.', 'start': 30142.863, 'duration': 3.701}, {'end': 30149.325, 'text': 'They are not independent of each other.', 'start': 30148.005, 'duration': 1.32}, {'end': 30152.411, 'text': 'This is a very common source of data leaks.', 'start': 30150.35, 'duration': 2.061}, {'end': 30156.512, 'text': 'In the classroom it is ok, but you cannot do this in your projects.', 'start': 30153.071, 'duration': 3.441}, {'end': 30164.234, 'text': 'Then what will happen is your model will perform well on test, but it will bomb in production.', 'start': 30158.272, 'duration': 5.962}, {'end': 30169.555, 'text': 'That is one very common way of data leaks.', 'start': 30167.114, 'duration': 2.441}], 'summary': 'Test data should not be influenced by training set to avoid data leaks and ensure model performance in production.', 'duration': 26.692, 'max_score': 30142.863, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw30142863.jpg'}], 'start': 27846.473, 'title': 'Wine classification and model training', 'summary': 'Discusses building a wine classification model, achieving 97% accuracy on the test set with naive bayes model, and emphasizes the importance of using multiple models to improve reliability and predictive power of the analysis. it also addresses challenges in data science projects and the significance of avoiding data leaks and modeling pitfalls.', 'chapters': [{'end': 28256.077, 'start': 27846.473, 'title': 'Wine classification model', 'summary': 'Discusses building a wine classification model using various parameters, analyzing the data distribution, and identifying linear separability to distinguish between wine classes.', 'duration': 409.604, 'highlights': ['The model aims to replace human judgment with an automated system using various input parameters to classify the wine. The model is designed to replace human judgment with an automated system, using various input parameters to classify the wine, ensuring objectivity and efficiency.', 'Analysis of data set attributes reveals the mean and median convergence in most dimensions, indicating minimal skewness and long tails. The analysis of data set attributes shows convergence of mean and median in most dimensions, indicating minimal skewness and long tails, ensuring data reliability and consistency.', "The discussion includes comparing the distribution and separability of wine classes using different dimensions, highlighting the potential for linear separability among classes. The discussion involves comparing the distribution and separability of wine classes using different dimensions, emphasizing the potential for linear separability among classes, enhancing the model's accuracy in distinguishing between classes."]}, {'end': 28634.726, 'start': 28256.077, 'title': 'Naive bayes model training and evaluation', 'summary': 'Discusses the training and evaluation of a naive bayes model using a wine dataset, achieving 97% accuracy on the test set and high class-level recall scores, demonstrating the significance of combining weak predictors for strong prediction.', 'duration': 378.649, 'highlights': ['The Naive Bayes model achieved 97% accuracy on the test set, demonstrating its effectiveness in classifying the wine dataset. The model demonstrated a 97% accuracy on the test set, showcasing its proficiency in accurately classifying the wine dataset.', "The class-level recall scores for the model were almost 100%, indicating the model's high accuracy in classifying different classes within the dataset. The class-level recall scores were almost 100%, showcasing the model's precision in accurately classifying different classes within the dataset.", 'The significance of combining weak predictors for strong prediction is emphasized, highlighting the effectiveness of utilizing Naive Bayes for accurate predictions. The chapter emphasizes the significance of combining weak predictors for strong prediction, showcasing the effectiveness of utilizing Naive Bayes for accurate predictions.']}, {'end': 29555.101, 'start': 28634.726, 'title': 'Challenges in data science projects', 'summary': 'Discusses the challenges in data science projects, emphasizing the importance of obtaining reliable data, dealing with soft skills, and the significance of understanding model performance, highlighting the use of techniques such as cross-validation and addressing overfitting.', 'duration': 920.375, 'highlights': ['The significance of obtaining reliable data and the challenges in acquiring data from different stakeholders are emphasized. The importance of obtaining data from various sources within and outside the organization, and the challenges of acquiring reliable data are highlighted.', "The importance of understanding the reliability of data and its impact on the data scientist's work is discussed. The impact of unreliable data on a data scientist's model and the challenges this poses are highlighted.", 'The significance of soft skills in acquiring data from stakeholders is emphasized. The importance of soft skills in obtaining data from stakeholders is emphasized.', 'The process of model performance evaluation and the usage of techniques such as cross-validation to estimate model accuracy are explained. The process of evaluating model performance and using techniques like cross-validation to estimate accuracy are explained.', 'The concept of overfitting and its impact on model performance are discussed, emphasizing the need to recognize underfitting and overfitting in models. The concept of overfitting and its impact on model performance, along with the need to recognize underfitting and overfitting in models, is discussed.']}, {'end': 30032.893, 'start': 29555.101, 'title': 'Importance of multiple models in data analysis', 'summary': 'Emphasizes the importance of using multiple models, such as ensemble or stacking, to improve reliability and predictive power of the analysis, illustrated through the example of identifying weak predictors in a dataset for classification.', 'duration': 477.792, 'highlights': ['The chapter emphasizes the importance of using multiple models, such as ensemble or stacking, to improve reliability and predictive power of the analysis. Using multiple models, such as ensemble or stacking, leads to more reliable results and improved predictive power.', "Identifying weak predictors in a dataset for classification. The analysis identifies weak predictors in the dataset, such as the dimension 'ash,' and discusses how individual dimensions may be weak predictors but can collectively segregate the three classes properly, impacting prediction accuracy.", 'The concept of outliers and their impact on model performance. Handling outliers is discussed as a method to improve classification accuracy, with emphasis on how outliers affect the performance of logistic and Naive Bayes models, and the importance of domain expertise in understanding the relationships between variables.']}, {'end': 30543.802, 'start': 30038.915, 'title': 'Data leaks and modeling pitfalls', 'summary': 'Emphasizes the importance of being cautious about data leaks and modeling pitfalls, such as test data leakage and one-hot coding issues, which can lead to poor model performance and runtime problems in production. it also highlights the impact of creating high-dimensional spaces through one-hot coding, leading to overfitting and the need to address multicollinearity.', 'duration': 504.887, 'highlights': ['The importance of being cautious about data leaks and modeling pitfalls, such as test data leakage and one-hot coding issues, which can lead to poor model performance and runtime problems in production.', 'The impact of creating high-dimensional spaces through one-hot coding, leading to overfitting and the need to address multicollinearity.', 'The common source of data leaks in modeling where the test data can be influenced by the training set, resulting in model performance discrepancies between testing and production environments.', 'The need to break the data into training set and test set before performing transformations, especially in the case of z-score and one-hot coding, to avoid runtime problems and ensure data independence.', 'The introduction of multicollinearity when creating one-hot coding, leading to the necessity of potentially dropping some one-hot coding columns to address this issue.']}], 'duration': 2697.329, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw27846473.jpg', 'highlights': ['The Naive Bayes model achieved 97% accuracy on the test set, showcasing its proficiency in accurately classifying the wine dataset.', 'Using multiple models, such as ensemble or stacking, leads to more reliable results and improved predictive power.', 'The analysis of data set attributes shows convergence of mean and median in most dimensions, ensuring data reliability and consistency.', 'The chapter emphasizes the importance of using multiple models to improve reliability and predictive power of the analysis.', 'The process of evaluating model performance and using techniques like cross-validation to estimate accuracy are explained.', 'The importance of obtaining data from various sources within and outside the organization, and the challenges of acquiring reliable data are highlighted.', 'The importance of soft skills in obtaining data from stakeholders is emphasized.', 'The concept of overfitting and its impact on model performance, along with the need to recognize underfitting and overfitting in models, is discussed.', 'The importance of being cautious about data leaks and modeling pitfalls, such as test data leakage and one-hot coding issues, which can lead to poor model performance and runtime problems in production.']}, {'end': 32129.12, 'segs': [{'end': 30613.44, 'src': 'embed', 'start': 30577.718, 'weight': 1, 'content': [{'end': 30583.901, 'text': 'Unsupervised techniques are all those methods that form an integral part of your exploratory data analytics.', 'start': 30577.718, 'duration': 6.183}, {'end': 30586.94, 'text': "You don't actually build any models here.", 'start': 30585.199, 'duration': 1.741}, {'end': 30590.603, 'text': 'You make use of these techniques to understand what the data is telling you.', 'start': 30587.561, 'duration': 3.042}, {'end': 30594.726, 'text': 'And based on what the data is telling you,', 'start': 30591.784, 'duration': 2.942}, {'end': 30601.391, 'text': "that'll give you a head start or that'll give you the direction in which you should go while you're building your supervised models.", 'start': 30594.726, 'duration': 6.665}, {'end': 30612.019, 'text': 'So the value of unsupervised learning methods is in exploring your data and trying to find out are there some interesting stories hidden in it? Stories,', 'start': 30602.792, 'duration': 9.227}, {'end': 30612.779, 'text': 'what do you mean by stories?', 'start': 30612.019, 'duration': 0.76}, {'end': 30613.44, 'text': "I'll tell you in a minute.", 'start': 30612.839, 'duration': 0.601}], 'summary': 'Unsupervised techniques aid in exploring data for insights without building models; valuable for finding hidden stories.', 'duration': 35.722, 'max_score': 30577.718, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw30577718.jpg'}, {'end': 30723.401, 'src': 'embed', 'start': 30696.36, 'weight': 0, 'content': [{'end': 30703.345, 'text': 'when you do the exploratory analytics on this using unsupervised methods and you discover the hidden structures inside,', 'start': 30696.36, 'duration': 6.985}, {'end': 30715.315, 'text': 'then you will realize why you are getting such a poor performance in these models and what you could have done to increase the performance beyond 80%.', 'start': 30703.345, 'duration': 11.97}, {'end': 30723.401, 'text': 'that information will jump out at you when you apply clustering techniques to that particular dataset, and the same thing can be done on any dataset.', 'start': 30715.315, 'duration': 8.086}], 'summary': 'Unsupervised analytics reveals hidden patterns, improving model performance to over 80% through clustering techniques.', 'duration': 27.041, 'max_score': 30696.36, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw30696360.jpg'}, {'end': 30881.359, 'src': 'embed', 'start': 30854.992, 'weight': 3, 'content': [{'end': 30861.995, 'text': 'overall accuracy score was very high default or non-default but within the defaulted class the accuracy was very low.', 'start': 30854.992, 'duration': 7.003}, {'end': 30862.675, 'text': 'it was around 55 56 percent.', 'start': 30861.995, 'duration': 0.68}, {'end': 30873.629, 'text': "When you're building classification models, that particular class of interest to me was the defaulter class,", 'start': 30866.82, 'duration': 6.809}, {'end': 30881.359, 'text': 'and on that particular class the model was giving me 56% accuracy overall was 76%.', 'start': 30873.629, 'duration': 7.73}], 'summary': 'Model achieved 76% overall accuracy, but only 56% for defaulters.', 'duration': 26.367, 'max_score': 30854.992, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw30854992.jpg'}, {'end': 30974.582, 'src': 'embed', 'start': 30947.982, 'weight': 4, 'content': [{'end': 30955.604, 'text': 'So, instead of going down the path, building all the model and struggling with accuracy, and then coming back, use clustering first,', 'start': 30947.982, 'duration': 7.622}, {'end': 30957.144, 'text': 'find out what is there in the data.', 'start': 30955.604, 'duration': 1.54}, {'end': 30961.265, 'text': 'If need be, build separate models for separate clusters.', 'start': 30958.445, 'duration': 2.82}, {'end': 30963.706, 'text': 'That is the object of clustering.', 'start': 30962.666, 'duration': 1.04}, {'end': 30969.24, 'text': 'So that prevents you from wasting so much of time, effort, and energy.', 'start': 30965.919, 'duration': 3.321}, {'end': 30974.582, 'text': 'So we always use clustering as a lead into classification, as a lead into modeling.', 'start': 30969.58, 'duration': 5.002}], 'summary': 'Use clustering to save time and effort in model building and improve accuracy by building separate models for separate clusters.', 'duration': 26.6, 'max_score': 30947.982, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw30947982.jpg'}, {'end': 31058.901, 'src': 'embed', 'start': 31028.462, 'weight': 5, 'content': [{'end': 31032.504, 'text': 'Most of the implementation of hierarchical clustering you will see will be agglomerative.', 'start': 31028.462, 'duration': 4.042}, {'end': 31034.164, 'text': 'Bottom up.', 'start': 31033.784, 'duration': 0.38}, {'end': 31040.427, 'text': 'I have not come across any implementation except in research papers of divisive.', 'start': 31036.145, 'duration': 4.282}, {'end': 31044.709, 'text': 'Divisive method is similar to your decision tree.', 'start': 31042.268, 'duration': 2.441}, {'end': 31050.892, 'text': 'In decision tree, the driving force is to reduce entropy or gene.', 'start': 31045.849, 'duration': 5.043}, {'end': 31053.653, 'text': 'Here it is to reduce the variance.', 'start': 31052.172, 'duration': 1.481}, {'end': 31058.901, 'text': 'okay, so methodology is same, approach is different.', 'start': 31055.319, 'duration': 3.582}], 'summary': 'Hierarchical clustering mostly agglomerative, divisive rare. divisive method similar to decision tree, focusing on reducing variance.', 'duration': 30.439, 'max_score': 31028.462, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw31028462.jpg'}, {'end': 31218.398, 'src': 'embed', 'start': 31164.954, 'weight': 6, 'content': [{'end': 31165.994, 'text': 'why am I telling you all these things?', 'start': 31164.954, 'duration': 1.04}, {'end': 31172.387, 'text': "The reason I'm telling you all these things is when you do k-means clustering, you have to be very careful.", 'start': 31167.383, 'duration': 5.004}, {'end': 31177.351, 'text': "It's actually very easy to implement, but very difficult to get results from.", 'start': 31172.487, 'duration': 4.864}, {'end': 31182.375, 'text': 'Very difficult to get the results out of k-means clustering.', 'start': 31179.192, 'duration': 3.183}, {'end': 31185.137, 'text': 'It requires a lot of expertise to use this.', 'start': 31183.035, 'duration': 2.102}, {'end': 31196.566, 'text': "The reason I'm telling you this is our objective is whatever clusters are found in our mathematical space feature space.", 'start': 31187.679, 'duration': 8.887}, {'end': 31204.575, 'text': 'our objective is we should minimize within cluster variance.', 'start': 31197.594, 'duration': 6.981}, {'end': 31210.696, 'text': 'so my clusters should be as compact and as tight as possible.', 'start': 31204.575, 'duration': 6.121}, {'end': 31215.777, 'text': 'that is what will reduce the within w stands for within within cluster variance.', 'start': 31210.696, 'duration': 5.081}, {'end': 31218.398, 'text': 'i want to minimize this.', 'start': 31215.777, 'duration': 2.621}], 'summary': 'K-means clustering requires expertise to achieve compact clusters and minimize within-cluster variance.', 'duration': 53.444, 'max_score': 31164.954, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw31164954.jpg'}, {'end': 31500.915, 'src': 'embed', 'start': 31471.962, 'weight': 8, 'content': [{'end': 31477.974, 'text': 'so the optimization problem for us is This is variance within a cluster.', 'start': 31471.962, 'duration': 6.012}, {'end': 31479.916, 'text': 'I have k clusters.', 'start': 31477.994, 'duration': 1.922}, {'end': 31488.544, 'text': 'So for k is equal to 1 to k, I want to minimize the variance within the clusters.', 'start': 31480.417, 'duration': 8.127}, {'end': 31495.39, 'text': 'I want to form my clusters in such a way that the variances, sum of all the variances across the clusters, they minimize it.', 'start': 31489.385, 'duration': 6.005}, {'end': 31497.192, 'text': 'That is optimization problem.', 'start': 31495.99, 'duration': 1.202}, {'end': 31500.915, 'text': 'This is the driving force behind clustering.', 'start': 31498.993, 'duration': 1.922}], 'summary': 'Optimize k clusters to minimize variance within them.', 'duration': 28.953, 'max_score': 31471.962, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw31471962.jpg'}, {'end': 31816.034, 'src': 'embed', 'start': 31788.488, 'weight': 9, 'content': [{'end': 31792.63, 'text': "let's assume there are supposed to be three clusters in my data set based on domain knowledge.", 'start': 31788.488, 'duration': 4.142}, {'end': 31800.315, 'text': 'but if i want to find out the three best clusters in my data set there is no algorithm which will find that best three clusters.', 'start': 31792.63, 'duration': 7.685}, {'end': 31806.919, 'text': "for me, k-means clustering belongs to np hard family of problems which don't have a solution.", 'start': 31800.315, 'duration': 6.604}, {'end': 31812.753, 'text': 'So k-means clusterings they usually end up in local minimas.', 'start': 31808.772, 'duration': 3.981}, {'end': 31816.034, 'text': "And that's what causes problems.", 'start': 31814.213, 'duration': 1.821}], 'summary': 'K-means clustering struggles to find the best three clusters due to np-hard nature and local minimas.', 'duration': 27.546, 'max_score': 31788.488, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw31788488.jpg'}, {'end': 31923.234, 'src': 'embed', 'start': 31892.475, 'weight': 10, 'content': [{'end': 31895.397, 'text': 'The most common application of clustering is anomaly detection.', 'start': 31892.475, 'duration': 2.922}, {'end': 31897.818, 'text': 'Anomaly detection is basically outlier detections.', 'start': 31895.477, 'duration': 2.341}, {'end': 31905.823, 'text': "You swipe your credit cards, many times you'd have got a message, did you make this transaction? There is a clustering running for you.", 'start': 31898.519, 'duration': 7.304}, {'end': 31908.325, 'text': 'All your past behavior.', 'start': 31906.684, 'duration': 1.641}, {'end': 31917.911, 'text': 'transactions done on the credit card have been accumulated by the service provider and it has been mapped to your clusters which cluster you belong to.', 'start': 31908.325, 'duration': 9.586}, {'end': 31923.234, 'text': 'And now you make a transaction where the data point somehow is not falling within the cluster.', 'start': 31919.053, 'duration': 4.181}], 'summary': 'Clustering is used for anomaly detection in credit card transactions to identify outliers and prevent fraud.', 'duration': 30.759, 'max_score': 31892.475, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw31892475.jpg'}], 'start': 30544.262, 'title': 'Clustering for unsupervised learning', 'summary': 'Delves into the value of unsupervised learning, emphasizing the use of clustering to find hidden structures, discussing challenges in building classification models, explaining k-means clustering and variance formula, and highlighting the challenges and applications of k-means clustering, providing insights from the car-mpg dataset and unbalanced data scenarios.', 'chapters': [{'end': 30828.866, 'start': 30544.262, 'title': 'Unsupervised learning methods', 'summary': 'Explores the value of unsupervised learning in exploring and understanding data, including the use of clustering to find hidden structures, with an example of improving model performance in the car-mpg dataset.', 'duration': 284.604, 'highlights': ['Unsupervised techniques are integral for exploratory data analytics, providing direction for building supervised models, and can help increase model performance by discovering hidden structures. Unsupervised techniques provide direction for building supervised models and can help increase model performance by discovering hidden structures.', 'The value of unsupervised learning methods lies in exploring data to find interesting stories and hidden structures, such as using clustering to identify tight clusters within the data. Unsupervised learning methods offer value in exploring data to find interesting stories and using clustering to identify tight clusters within the data.', 'Applying unsupervised methods to the car-mpg dataset revealed hidden structures and insights that could improve model performance beyond 80% accuracy. Applying unsupervised methods to the car-mpg dataset revealed hidden structures and insights that could improve model performance beyond 80% accuracy.']}, {'end': 31058.901, 'start': 30828.866, 'title': 'Clustering as lead into modeling', 'summary': 'Discusses the challenges faced in building classification models for unbalanced data, emphasizing the importance of using clustering as a lead into modeling, citing the example of k-means clustering and the benefits of preventing time and effort wastage.', 'duration': 230.035, 'highlights': ['The accuracy score for the defaulted class was very low at around 55-56%, despite an overall accuracy of 76%, due to unbalanced data with a class imbalance of half the number of defaulters compared to non-defaulters. The accuracy score for the defaulted class was only around 55-56%, indicating the impact of unbalanced data with half the number of defaulters compared to non-defaulters, despite the overall accuracy being 76%.', 'Clustering is suggested as a lead into classification and modeling, with k-means clustering being the recommended technique, to prevent wasting time and effort in struggling with accuracy when building separate models for separate clusters may be more appropriate. Clustering is recommended as a lead into classification and modeling, particularly using k-means clustering, to avoid wasting time and effort in struggling with accuracy, instead focusing on building separate models for separate clusters when necessary.', 'The chapter also covers the types of clustering techniques, including k-means, hierarchical, agglomerative, and divisive methods, with a focus on the implementation and benefits of agglomerative techniques. The chapter provides insights into various clustering techniques such as k-means, hierarchical, agglomerative, and divisive methods, highlighting the implementation and benefits of agglomerative techniques over divisive methods.']}, {'end': 31390.666, 'start': 31058.901, 'title': 'Understanding k-means clustering', 'summary': 'Explains the concept of k-means clustering, its objective of minimizing within-cluster variance, and the challenges in determining the number of clusters, emphasizing the need for careful implementation and expertise to derive meaningful information from the clustering process.', 'duration': 331.765, 'highlights': ['The objective of k-means clustering is to minimize within-cluster variance, aiming for tight and compact clusters that are farthest from each other, enhancing the meaningful information derived from the clustering process.', 'The challenge in k-means clustering lies in determining the number of clusters to look for, as it may require multiple iterations and expertise due to the risk of misleading or irrelevant information if not implemented carefully.', 'When performing k-means clustering, it is crucial to ensure that data points are assigned to only one cluster, with no overlapping or shared assignments between clusters, maintaining the distinctiveness of each cluster.', 'Implementing k-means clustering requires careful consideration, as it is relatively easy to implement but challenging to derive meaningful results from, emphasizing the importance of expertise and diligence in the process.']}, {'end': 31757.679, 'start': 31390.666, 'title': 'Variance formula and k-means clustering', 'summary': 'Explains the variance formula for clusters, the concept of within-cluster and between-cluster variance, and the optimization problem of minimizing variance within clusters in k-means clustering, with a focus on maximizing the distance between clusters to achieve the same goal.', 'duration': 367.013, 'highlights': ['The optimization problem is to minimize the variance within clusters for k clusters, forming clusters in such a way that the sum of all variances across the clusters is minimized.', 'The concept of within-cluster variance (WC) involves finding the variance within clusters when breaking the data into mathematical space, while between-cluster variance (BC) is the variance between the centroids of different clusters.', 'Maximizing the distance between clusters or tightening the clusters will achieve the same goal of minimizing within-cluster variance, with T representing the total variance within the data set and being a constant in the optimization process.']}, {'end': 32129.12, 'start': 31757.859, 'title': 'Challenges of k-means clustering', 'summary': 'Discusses the challenges of k-means clustering, belonging to np-hard problems, potentially leading to misleading results, and its application in anomaly detection.', 'duration': 371.261, 'highlights': ['K-means clustering belongs to NP-hard problems and may lead to local minimas, potentially causing misleading results in clustering. K-means clustering is a part of NP-hard problems, lacking a well-defined algorithm and may result in local minimas, leading to misleading clustering outcomes.', "Anomaly detection, a common application of clustering, involves identifying outliers by measuring their distance from the centroid, typically 1.96 standard deviations away from the cluster's center. Anomaly detection is widely used in clustering, where outliers are identified as data points lying 1.96 standard deviations away from the centroid, aiding in identifying potentially fraudulent activities such as credit card transactions.", 'Handling outliers in clustering is crucial, as past outliers may reoccur and need to be treated as suspects, rather than immediately fraudulent. Handling outliers in clustering is essential, as past outliers may reoccur and need to be treated as suspects, requiring careful consideration rather than immediate labeling as fraudulent activities.']}], 'duration': 1584.858, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw30544262.jpg', 'highlights': ['Applying unsupervised methods to the car-mpg dataset revealed hidden structures and insights that could improve model performance beyond 80% accuracy.', 'Unsupervised techniques are integral for exploratory data analytics, providing direction for building supervised models and can help increase model performance by discovering hidden structures.', 'The value of unsupervised learning methods lies in exploring data to find interesting stories and using clustering to identify tight clusters within the data.', 'The accuracy score for the defaulted class was only around 55-56%, indicating the impact of unbalanced data with half the number of defaulters compared to non-defaulters, despite the overall accuracy being 76%.', 'Clustering is recommended as a lead into classification and modeling, particularly using k-means clustering, to avoid wasting time and effort in struggling with accuracy, instead focusing on building separate models for separate clusters when necessary.', 'The chapter provides insights into various clustering techniques such as k-means, hierarchical, agglomerative, and divisive methods, highlighting the implementation and benefits of agglomerative techniques over divisive methods.', 'The objective of k-means clustering is to minimize within-cluster variance, aiming for tight and compact clusters that are farthest from each other, enhancing the meaningful information derived from the clustering process.', 'The challenge in k-means clustering lies in determining the number of clusters to look for, as it may require multiple iterations and expertise due to the risk of misleading or irrelevant information if not implemented carefully.', 'The optimization problem is to minimize the variance within clusters for k clusters, forming clusters in such a way that the sum of all variances across the clusters is minimized.', 'K-means clustering is a part of NP-hard problems, lacking a well-defined algorithm and may result in local minimas, leading to misleading clustering outcomes.', 'Anomaly detection is widely used in clustering, where outliers are identified as data points lying 1.96 standard deviations away from the centroid, aiding in identifying potentially fraudulent activities such as credit card transactions.', 'Handling outliers in clustering is essential, as past outliers may reoccur and need to be treated as suspects, requiring careful consideration rather than immediate labeling as fraudulent activities.']}, {'end': 33200.197, 'segs': [{'end': 32270.332, 'src': 'embed', 'start': 32236.062, 'weight': 2, 'content': [{'end': 32241.645, 'text': 'So, many times you will be advised to convert all your dimension to z-scores or some kind of scales.', 'start': 32236.062, 'duration': 5.583}, {'end': 32243.254, 'text': 'to remove the units.', 'start': 32242.554, 'duration': 0.7}, {'end': 32250.159, 'text': 'The technique is a safe technique, no doubt, but sometimes it can have an adverse effect.', 'start': 32244.956, 'duration': 5.203}, {'end': 32254.282, 'text': 'Instead of giving you good clusters, it can end up giving you bad clusters.', 'start': 32251.5, 'duration': 2.782}, {'end': 32260.045, 'text': 'Scaling So when do you scale, when do you not scale is a very important thing that you need to be aware of.', 'start': 32255.422, 'duration': 4.623}, {'end': 32270.332, 'text': 'The reason why we ask to scale is those dimensions which are larger scale, they tend to overwhelm the calculations of the distance.', 'start': 32261.806, 'duration': 8.526}], 'summary': 'Scaling data to z-scores can affect clustering outcomes. consider when to scale.', 'duration': 34.27, 'max_score': 32236.062, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw32236062.jpg'}, {'end': 32424.703, 'src': 'embed', 'start': 32392.996, 'weight': 3, 'content': [{'end': 32398.757, 'text': 'Mathematically it will be different, but it is a property of shift of origin actually.', 'start': 32392.996, 'duration': 5.761}, {'end': 32401.817, 'text': 'So, it will just shift.', 'start': 32398.797, 'duration': 3.02}, {'end': 32410.559, 'text': 'So, when you convert into z-scores, this one and this one, the formula is xi minus x bar by standard deviation.', 'start': 32402.738, 'duration': 7.821}, {'end': 32413.435, 'text': 'this is a formula.', 'start': 32412.014, 'duration': 1.421}, {'end': 32419.059, 'text': 'so this is formula for this y, i, minus y bar by standard deviation of y.', 'start': 32413.435, 'duration': 5.624}, {'end': 32424.703, 'text': 'this is the formula for this x right.', 'start': 32419.059, 'duration': 5.644}], 'summary': 'Explaining the shift of origin and z-score conversion formulas.', 'duration': 31.707, 'max_score': 32392.996, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw32392996.jpg'}, {'end': 32526.19, 'src': 'embed', 'start': 32500.584, 'weight': 5, 'content': [{'end': 32508.212, 'text': 'Your question? My question is, so for example I have got one transaction which is that I want to find out the anomaly.', 'start': 32500.584, 'duration': 7.628}, {'end': 32512.977, 'text': 'So for that also I will have to cheat the same way and then only I can do that.', 'start': 32508.873, 'duration': 4.104}, {'end': 32517.922, 'text': 'transformed data and do the analysis.', 'start': 32515.58, 'duration': 2.342}, {'end': 32526.19, 'text': 'So, but when I look at the data for the you know analysis purpose or EDA.', 'start': 32518.443, 'duration': 7.747}], 'summary': 'Analyzing transaction data to detect anomalies and perform eda.', 'duration': 25.606, 'max_score': 32500.584, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw32500584.jpg'}, {'end': 32621.307, 'src': 'embed', 'start': 32592.483, 'weight': 4, 'content': [{'end': 32594.324, 'text': "you can't do it blindly.", 'start': 32592.483, 'duration': 1.841}, {'end': 32597.687, 'text': 'many times, when you scale your data, your clusters might go wrong.', 'start': 32594.324, 'duration': 3.363}, {'end': 32606.041, 'text': 'lot of times in the raw data, we have attributes which naturally lead to clustering.', 'start': 32599.798, 'duration': 6.243}, {'end': 32614.304, 'text': 'They naturally bring in some kind of clustering on them and by scaling those dimensions we dilute the clusters there.', 'start': 32607.741, 'duration': 6.563}, {'end': 32619.386, 'text': 'The natural tendency for clustering on the dimension is diluted when you actually scale it.', 'start': 32614.424, 'duration': 4.962}, {'end': 32621.307, 'text': "So let's see what happens.", 'start': 32620.447, 'duration': 0.86}], 'summary': 'Scaling data can dilute natural clustering tendencies.', 'duration': 28.824, 'max_score': 32592.483, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw32592483.jpg'}, {'end': 32722.125, 'src': 'embed', 'start': 32693.644, 'weight': 8, 'content': [{'end': 32701.61, 'text': 'on these two clusters, when I run my k-means clustering without doing any normalization on the raw data,', 'start': 32693.644, 'duration': 7.966}, {'end': 32707.434, 'text': 'the two clusters I get are the red and the blue.', 'start': 32701.61, 'duration': 5.824}, {'end': 32709.396, 'text': 'as you can see, that is misleading clusters.', 'start': 32707.434, 'duration': 1.962}, {'end': 32713.703, 'text': 'When I normalize my data, look at the scales.', 'start': 32711.363, 'duration': 2.34}, {'end': 32715.164, 'text': 'This is Z-scored.', 'start': 32714.304, 'duration': 0.86}, {'end': 32716.884, 'text': 'Look at the difference here.', 'start': 32715.964, 'duration': 0.92}, {'end': 32717.884, 'text': 'Look at the difference here.', 'start': 32717.064, 'duration': 0.82}, {'end': 32718.704, 'text': 'This is Z-scored.', 'start': 32717.924, 'duration': 0.78}, {'end': 32722.125, 'text': 'It gives me the right clusters.', 'start': 32721.085, 'duration': 1.04}], 'summary': 'K-means clustering without normalization yields misleading clusters, but normalization with z-score gives the right clusters.', 'duration': 28.481, 'max_score': 32693.644, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw32693644.jpg'}, {'end': 32811.186, 'src': 'embed', 'start': 32784.947, 'weight': 0, 'content': [{'end': 32791.43, 'text': 'not take that thing that normalization will lead you to good clustering not necessary.', 'start': 32784.947, 'duration': 6.483}, {'end': 32793.269, 'text': 'it might lead you to poor clustering.', 'start': 32791.43, 'duration': 1.839}, {'end': 32799.682, 'text': 'No, no, no.', 'start': 32798.822, 'duration': 0.86}, {'end': 32802.384, 'text': 'It is how the data is distributed on the various dimensions.', 'start': 32799.843, 'duration': 2.541}, {'end': 32808.345, 'text': 'For example here the data distribution on this dimension, this dimension.', 'start': 32802.424, 'duration': 5.921}, {'end': 32811.186, 'text': 'what is the difference of this dimension and this dimension on this clustering?', 'start': 32808.345, 'duration': 2.841}], 'summary': 'Normalization may not always lead to good clustering, as data distribution across dimensions plays a crucial role.', 'duration': 26.239, 'max_score': 32784.947, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw32784947.jpg'}, {'end': 33000.343, 'src': 'embed', 'start': 32966.316, 'weight': 10, 'content': [{'end': 32968.879, 'text': 'The reason why this happens is what is in the data?', 'start': 32966.316, 'duration': 2.563}, {'end': 32975.165, 'text': 'The reason why this happens is there are probably some outliers in the data sets which are not handled.', 'start': 32971.081, 'duration': 4.084}, {'end': 32984.178, 'text': 'In the clustering where you included the target variables, the target variables influenced the cluster creation, so everything was fine.', 'start': 32977.495, 'duration': 6.683}, {'end': 32989.04, 'text': 'But when you removed it, the outliers were pushed into another cluster, now they belong to this cluster.', 'start': 32984.578, 'duration': 4.462}, {'end': 33000.343, 'text': "But in the same question, what happened that the k value, we don't know, I got it, but in that question, it was told that there are two cars.", 'start': 32992.321, 'duration': 8.022}], 'summary': 'Outliers in data affected cluster creation, k value unknown, 2 cars mentioned.', 'duration': 34.027, 'max_score': 32966.316, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw32966316.jpg'}, {'end': 33119.082, 'src': 'embed', 'start': 33064.496, 'weight': 12, 'content': [{'end': 33091.258, 'text': 'But they told that in car, there are two more groups, That means there are some subcategories hidden inside one of the larger categories.', 'start': 33064.496, 'duration': 26.762}, {'end': 33114.538, 'text': 'Though your business says there are three categories, what it means is inside one of the three at least there is some subcategories over there.', 'start': 33107.792, 'duration': 6.746}, {'end': 33119.082, 'text': 'So that is where you should apply your hierarchical clustering and see what are the subcategories.', 'start': 33115.078, 'duration': 4.004}], 'summary': 'Hierarchical clustering to identify subcategories within larger categories.', 'duration': 54.586, 'max_score': 33064.496, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw33064496.jpg'}, {'end': 33182.194, 'src': 'embed', 'start': 33151.756, 'weight': 13, 'content': [{'end': 33152.317, 'text': 'that is what it means.', 'start': 33151.756, 'duration': 0.561}, {'end': 33160.237, 'text': 'This is a clear indication.', 'start': 33159.036, 'duration': 1.201}, {'end': 33165.502, 'text': 'k-means plus in the ELBO method is a very good indicator of how many clusters are likely in your,', 'start': 33160.237, 'duration': 5.265}, {'end': 33168.744, 'text': 'because scikit-learn implements what is called k-means plus plus.', 'start': 33165.502, 'duration': 3.242}, {'end': 33174.667, 'text': 'So k-means plus plus tries to overcome this problem of getting stuck in local minima.', 'start': 33170.305, 'duration': 4.362}, {'end': 33182.194, 'text': 'Now it is not guaranteed that will not get stuck it tries to, so that ELBO method is quite reliable.', 'start': 33177.17, 'duration': 5.024}], 'summary': 'K-means plus in elbo method indicates likely clusters. scikit-learn implements k-means plus plus to overcome local minima.', 'duration': 30.438, 'max_score': 33151.756, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw33151756.jpg'}], 'start': 32129.661, 'title': 'Data clustering considerations', 'summary': 'Covers anomaly detection, k-means clustering, data transformation, and normalization, emphasizing the impact of scaling, the importance of understanding data distributions and dimensions, and the challenges in clustering data analysis with a focus on the influence of target variables and outliers.', 'chapters': [{'end': 32499.423, 'start': 32129.661, 'title': 'K-means clustering and euclidean distance', 'summary': 'Discusses anomaly detection, k-means clustering using only euclidean distance, the impact of scaling on distance calculations, and the property of shift of origin when converting data to z-scores.', 'duration': 369.762, 'highlights': ['The chapter explains that k-means clustering only supports euclidean distance and discusses the impact of scaling on distance calculations, cautioning about adverse effects if not done properly.', 'The chapter details the property of shift of origin when converting data to z-scores, highlighting that while mathematically the centroids may be different, their position in the mathematical space remains the same.', 'The chapter emphasizes the need to be aware of when to scale data, as larger scale dimensions can overwhelm distance calculations, potentially leading to adverse effects on cluster quality.']}, {'end': 32670.604, 'start': 32500.584, 'title': 'Data transformation and clustering', 'summary': 'Discusses the importance of data transformation in anomaly detection and the impact of scaling on clustering, emphasizing the need to be cautious when applying normalization to raw data for clustering purposes.', 'duration': 170.02, 'highlights': ['Data transformation is crucial for anomaly detection, as it allows for analysis and modeling, but it should be reversed for exploratory data analysis (EDA) to avoid misleading findings.', 'When transforming data, the distributions remain unchanged, only the measurement scales differ, emphasizing the importance of maintaining the natural clustering tendencies present in raw data.', 'Scaling raw data before clustering can lead to diluted clusters, as the natural tendency for clustering on the dimensions is affected by scaling, cautioning against blindly normalizing data.', 'In clustering, the measurement of similarity or dissimilarity between points is determined by the distance calculation method, with Euclidean distance being the sole option for hierarchical and k-means clustering.']}, {'end': 32863.133, 'start': 32675.605, 'title': 'Data clustering and normalization', 'summary': 'Discusses the impact of data normalization on clustering, highlighting the need to understand data distributions and dimensions for effective clustering. it emphasizes the significance of normalization by showcasing the misleading clusters obtained without normalization and the improvement in cluster quality post normalization, while cautioning against blindly normalizing data.', 'duration': 187.528, 'highlights': ['The misleading clusters obtained without data normalization demonstrate the importance of normalization for accurate clustering results.', 'The improvement in cluster quality after normalization, as evidenced by the correct clustering obtained post Z-score normalization, emphasizes the significance of understanding data distributions and dimensions for effective clustering.', 'The caution against blindly normalizing data, highlighting the need to analyze data distributions and dimensions to determine whether normalization is necessary for accurate clustering results.']}, {'end': 33200.197, 'start': 32863.133, 'title': 'Clustering data analysis', 'summary': 'Discusses the challenges and considerations in clustering data analysis, including the influence of target variables on clusters, the impact of outliers, and the identification of subcategories within larger clusters.', 'duration': 337.064, 'highlights': ['The clustering analysis was impacted by the presence of categorical data which influenced the target variable, causing discrepancies between the clusters and the target variable values.', 'The presence of outliers in the data set led to the shifting of data points between clusters when the target variable was removed during clustering.', 'The application of K-means clustering indicated the existence of subcategories within the business-defined categories, highlighting the need for hierarchical clustering to explore and identify these subclusters.', 'The ELBO method in K-means clustering served as a reliable indicator of the likely number of clusters, with a clear indication of subcategories hidden within larger categories.']}], 'duration': 1070.536, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw32129661.jpg', 'highlights': ['The importance of understanding data distributions and dimensions for effective clustering.', 'The caution against blindly normalizing data, emphasizing the need to analyze data distributions and dimensions.', 'The impact of scaling on distance calculations, cautioning about adverse effects if not done properly.', 'The property of shift of origin when converting data to z-scores, highlighting that centroids may be different mathematically but their position in the mathematical space remains the same.', 'The need to be aware of when to scale data, as larger scale dimensions can overwhelm distance calculations, potentially leading to adverse effects on cluster quality.', 'Data transformation is crucial for anomaly detection, as it allows for analysis and modeling.', 'The importance of maintaining the natural clustering tendencies present in raw data when transforming data.', 'The caution against blindly normalizing data, as scaling raw data before clustering can lead to diluted clusters.', 'The misleading clusters obtained without data normalization demonstrate the importance of normalization for accurate clustering results.', 'The improvement in cluster quality after normalization, emphasizing the significance of understanding data distributions and dimensions for effective clustering.', 'The impact of categorical data on clustering analysis, causing discrepancies between the clusters and the target variable values.', 'The presence of outliers in the data set led to the shifting of data points between clusters when the target variable was removed during clustering.', 'The application of K-means clustering indicated the existence of subcategories within the business-defined categories, highlighting the need for hierarchical clustering to explore and identify these subclusters.', 'The ELBO method in K-means clustering served as a reliable indicator of the likely number of clusters, with a clear indication of subcategories hidden within larger categories.']}, {'end': 34342.176, 'segs': [{'end': 33270.199, 'src': 'embed', 'start': 33202.619, 'weight': 0, 'content': [{'end': 33207.171, 'text': 'Shall we move on? So keep in mind, I have kept the code here.', 'start': 33202.619, 'duration': 4.552}, {'end': 33208.833, 'text': 'The code is there on the shared on this.', 'start': 33207.211, 'duration': 1.622}, {'end': 33210.314, 'text': 'He will share it with Olympus.', 'start': 33208.953, 'duration': 1.361}, {'end': 33212.135, 'text': 'You try to play around with this code.', 'start': 33210.593, 'duration': 1.542}, {'end': 33219.121, 'text': 'Generate your own random sets and see how scaling can have negative impact on your clusters.', 'start': 33212.576, 'duration': 6.545}, {'end': 33220.161, 'text': "Let's move on.", 'start': 33219.72, 'duration': 0.441}, {'end': 33224.124, 'text': 'When you are doing hierarchical clustering, the choice of distance matters.', 'start': 33221.322, 'duration': 2.802}, {'end': 33226.006, 'text': 'We will see this in hierarchical clustering.', 'start': 33224.485, 'duration': 1.521}, {'end': 33229.569, 'text': 'The choice of the distance matters in the definition of clustering.', 'start': 33227.047, 'duration': 2.522}, {'end': 33233.031, 'text': 'In flat clustering, k means there is no such option.', 'start': 33230.069, 'duration': 2.962}, {'end': 33235.133, 'text': 'We have only Euclidean distance.', 'start': 33234.091, 'duration': 1.042}, {'end': 33239.756, 'text': 'Knowledge of the distribution of the data is very important.', 'start': 33237.014, 'duration': 2.742}, {'end': 33246.218, 'text': 'the pair panel is very important for us, at least on the dimensions that we are going to include in our k-means clustering.', 'start': 33239.756, 'duration': 6.462}, {'end': 33250.962, 'text': 'We should know how the data is distributed, whether we should apply z-scores or not.', 'start': 33246.699, 'duration': 4.263}, {'end': 33257.566, 'text': 'All the attributes that we are going to use.', 'start': 33256.085, 'duration': 1.481}, {'end': 33261.375, 'text': "unfortunately, then, we don't have that.", 'start': 33259.795, 'duration': 1.58}, {'end': 33264.357, 'text': "we call them independent attributes, but they're not really independent attributes.", 'start': 33261.375, 'duration': 2.982}, {'end': 33270.199, 'text': "they're going to influence each other so that impacts clustering also quality of clustering.", 'start': 33264.357, 'duration': 5.842}], 'summary': 'Exploring impact of scaling on clusters, importance of distance in clustering, and influence of attribute interdependence on clustering quality.', 'duration': 67.58, 'max_score': 33202.619, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw33202619.jpg'}, {'end': 33444.557, 'src': 'embed', 'start': 33382.154, 'weight': 3, 'content': [{'end': 33391.184, 'text': 'inertia is nothing but the variance inside the cluster, within cluster, variance w right.', 'start': 33382.154, 'duration': 9.03}, {'end': 33401.934, 'text': 'so every time you do clustering for every cluster, you will get three things one is the centroids, one is the label and the third one is the inertia.', 'start': 33391.184, 'duration': 10.75}, {'end': 33417.303, 'text': 'using these three, we have to extract information computationally, extremely intensive K-means.', 'start': 33401.934, 'duration': 15.369}, {'end': 33423.745, 'text': "clustering is very intensive computationally, but once you've clusters we don't need the data points.", 'start': 33417.303, 'duration': 6.442}, {'end': 33433.027, 'text': "So K-means clustering, once I find my clusters, suppose these are the two clusters, I don't need any of the data points in my mathematical space.", 'start': 33424.345, 'duration': 8.682}, {'end': 33434.788, 'text': 'I just need these two points.', 'start': 33433.447, 'duration': 1.341}, {'end': 33438.989, 'text': 'So I store these two points in an array.', 'start': 33436.968, 'duration': 2.021}, {'end': 33442.617, 'text': 'along with the standard deviation.', 'start': 33440.415, 'duration': 2.202}, {'end': 33444.557, 'text': 'so i create an array.', 'start': 33442.617, 'duration': 1.94}], 'summary': 'K-means clustering extracts centroids, labels, and inertia for each cluster, reducing computational intensity and storage requirements.', 'duration': 62.403, 'max_score': 33382.154, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw33382154.jpg'}, {'end': 33627.201, 'src': 'embed', 'start': 33600.488, 'weight': 5, 'content': [{'end': 33606.772, 'text': 'so when you are doing clustering, one of the things that we need to do is, as we will see in hierarchical clustering also,', 'start': 33600.488, 'duration': 6.284}, {'end': 33609.253, 'text': 'the cluster should be more or less of similar size.', 'start': 33606.772, 'duration': 2.481}, {'end': 33614.056, 'text': 'you should not have large cluster and other satellite clusters.', 'start': 33609.253, 'duration': 4.803}, {'end': 33615.237, 'text': 'we call them satellite clusters.', 'start': 33614.056, 'duration': 1.181}, {'end': 33621.035, 'text': 'When you have such clustering, you will fall into this kind of problems.', 'start': 33615.97, 'duration': 5.065}, {'end': 33623.858, 'text': 'Maybe the dimensions you used for clustering is not right.', 'start': 33621.075, 'duration': 2.783}, {'end': 33627.201, 'text': 'I am not saying it cannot be applied.', 'start': 33623.878, 'duration': 3.323}], 'summary': 'In clustering, aim for similar-sized clusters to avoid satellite clusters and potential issues with dimensions used.', 'duration': 26.713, 'max_score': 33600.488, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw33600488.jpg'}, {'end': 33744.804, 'src': 'embed', 'start': 33716.371, 'weight': 6, 'content': [{'end': 33723.796, 'text': 'This will happen every time the magnitude of the axis is significantly different.', 'start': 33716.371, 'duration': 7.425}, {'end': 33724.897, 'text': 'Yeah, exactly.', 'start': 33724.217, 'duration': 0.68}, {'end': 33727.399, 'text': "When you don't scale it, that's where I was coming from earlier.", 'start': 33724.917, 'duration': 2.482}, {'end': 33736.222, 'text': "Absolutely So when you don't do scaling, you're guaranteed you will come across this.", 'start': 33731.24, 'duration': 4.982}, {'end': 33739.202, 'text': "I'm talking even with scale.", 'start': 33738.222, 'duration': 0.98}, {'end': 33740.863, 'text': 'Because I had earlier.', 'start': 33739.923, 'duration': 0.94}, {'end': 33744.804, 'text': 'Because at the end of the day, the magnitude will affect.', 'start': 33741.003, 'duration': 3.801}], 'summary': 'Scaling the axis significantly impacts magnitude, affecting data analysis.', 'duration': 28.433, 'max_score': 33716.371, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw33716371.jpg'}, {'end': 33863.247, 'src': 'embed', 'start': 33840.325, 'weight': 7, 'content': [{'end': 33850.195, 'text': 'all the data points, distance from these three centroids will be calculated and they will be associated with the centroids which are closest to them.', 'start': 33840.325, 'duration': 9.87}, {'end': 33852.977, 'text': 'but as you can see, the centroids are not really centers.', 'start': 33850.195, 'duration': 2.782}, {'end': 33854.559, 'text': 'they are somewhere on the edge.', 'start': 33852.977, 'duration': 1.582}, {'end': 33856.821, 'text': 'so we need to recalculate the centroids.', 'start': 33854.559, 'duration': 2.262}, {'end': 33858.042, 'text': 'how do you recalculate centroids?', 'start': 33856.821, 'duration': 1.221}, {'end': 33863.247, 'text': 'simple, you take the x bar of this red points, y bar of the red points wherever they meet.', 'start': 33858.042, 'duration': 5.205}], 'summary': 'Data points are associated with centroids based on closest distance. recalculating centroids involves finding the average x and y coordinates of the data points.', 'duration': 22.922, 'max_score': 33840.325, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw33840325.jpg'}, {'end': 34147.713, 'src': 'embed', 'start': 34112.347, 'weight': 8, 'content': [{'end': 34119.211, 'text': 'So when you do the pair plot, in the pair plot, you remember the diagonal? The diagonal can be KDE.', 'start': 34112.347, 'duration': 6.864}, {'end': 34122.212, 'text': 'KDE is Kernel Density Estimates.', 'start': 34120.511, 'duration': 1.701}, {'end': 34125.094, 'text': 'So I do a pair plot.', 'start': 34123.873, 'duration': 1.221}, {'end': 34128.495, 'text': 'In the pair plot, look at the diagonals.', 'start': 34125.634, 'duration': 2.861}, {'end': 34134.018, 'text': 'So if I have four attributes.', 'start': 34130.596, 'duration': 3.422}, {'end': 34140.025, 'text': 'look at the diagonal part.', 'start': 34138.564, 'duration': 1.461}, {'end': 34147.713, 'text': 'in the diagonal part, if you are seeing bulges like this somewhere, you are seeing two bulges somewhere.', 'start': 34140.025, 'duration': 7.688}], 'summary': 'Pair plot with kde diagonal shows two bulges if present.', 'duration': 35.366, 'max_score': 34112.347, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw34112347.jpg'}], 'start': 33202.619, 'title': 'Clustering methods and their impact', 'summary': 'Covers the importance of distance choice in hierarchical clustering and the impact of data distribution and scaling on cluster quality. it also discusses the challenges and properties of k-means clustering, emphasizing computational intensity and the need for tight clusters to avoid outlier situations. additionally, it provides an overview of k-means clustering, focusing on scaling, centroid calculation, interpretation of clusters, and variance minimization.', 'chapters': [{'end': 33270.199, 'start': 33202.619, 'title': 'Hierarchical clustering and distance impact', 'summary': 'Covers the importance of distance choice in hierarchical clustering and the impact of data distribution and scaling on cluster quality.', 'duration': 67.58, 'highlights': ['The choice of distance matters in hierarchical clustering and has a significant impact on clustering definition.', 'Understanding the data distribution and applying z-scores to the attributes can influence the quality of clustering.', 'Scaling can have a negative impact on clusters, and experimenting with different random sets can demonstrate this effect.']}, {'end': 33716.331, 'start': 33270.199, 'title': 'Understanding k-means clustering', 'summary': 'Discusses the challenges and properties of k-means clustering, along with its application in real-time analytics, emphasizing the computational intensity and the need for tight clusters to avoid outlier situations.', 'duration': 446.132, 'highlights': ['K-means clustering outputs three important properties for every cluster: centroid position, cluster label, and inertia. The output of K-means clustering includes the position of the centroid on various dimensions, cluster label, and inertia, providing essential properties for each cluster.', 'In production, K-means clustering requires only the centroids and associated properties, making it computationally efficient for real-time analytics. For production and real-time analytics, K-means clustering only needs the centroids and their associated properties, reducing the computational burden and making it suitable for real-time applications.', 'Tight clusters and avoiding satellite clusters are essential in K-means clustering to prevent outlier situations, ensuring the clusters are more or less of similar size. To prevent outlier situations, it is crucial to achieve tight clusters and avoid satellite clusters in K-means clustering, ensuring that the clusters are of similar size and distribution.']}, {'end': 34342.176, 'start': 33716.371, 'title': 'K-means clustering overview', 'summary': 'Provides an overview of k-means clustering, emphasizing the importance of scaling, centroid calculation, and interpretation of clusters, with a focus on minimizing variance and the visual estimation of clusters. the process involves initial centroid generation, data point association, centroid recalculation, and iterative variance reduction.', 'duration': 625.805, 'highlights': ['Importance of Scaling The lack of scaling can lead to significant differences in axis magnitude, impacting clustering accuracy and interpretation.', 'Centroid Calculation and Recalculation The process involves initial generation, data point association, recalculation, and iterative variance reduction, aiming to minimize intra-cluster variance and determine cluster interpretations.', 'Visual Estimation of Clusters The visual estimation technique involves examining bulges and gaussians in a pair plot diagonal to estimate the likely number of clusters, aiding in determining the appropriate k value for k-means clustering.']}], 'duration': 1139.557, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw33202619.jpg', 'highlights': ['The choice of distance matters in hierarchical clustering and has a significant impact on clustering definition.', 'Understanding the data distribution and applying z-scores to the attributes can influence the quality of clustering.', 'Scaling can have a negative impact on clusters, and experimenting with different random sets can demonstrate this effect.', 'K-means clustering outputs three important properties for every cluster: centroid position, cluster label, and inertia.', 'In production, K-means clustering requires only the centroids and associated properties, making it computationally efficient for real-time analytics.', 'Tight clusters and avoiding satellite clusters are essential in K-means clustering to prevent outlier situations, ensuring the clusters are more or less of similar size.', 'Importance of Scaling The lack of scaling can lead to significant differences in axis magnitude, impacting clustering accuracy and interpretation.', 'Centroid Calculation and Recalculation The process involves initial generation, data point association, recalculation, and iterative variance reduction, aiming to minimize intra-cluster variance and determine cluster interpretations.', 'Visual Estimation of Clusters The visual estimation technique involves examining bulges and gaussians in a pair plot diagonal to estimate the likely number of clusters, aiding in determining the appropriate k value for k-means clustering.']}, {'end': 35423.632, 'segs': [{'end': 34368.86, 'src': 'embed', 'start': 34342.176, 'weight': 2, 'content': [{'end': 34348.379, 'text': 'we can also use clustering to keep an eye on say, for example, customer segments how the clusters change over time.', 'start': 34342.176, 'duration': 6.203}, {'end': 34350.199, 'text': 'Clusters will never be fixed.', 'start': 34348.819, 'duration': 1.38}, {'end': 34351.96, 'text': 'They will change with time.', 'start': 34350.9, 'duration': 1.06}, {'end': 34356.982, 'text': "How frequently they change, that depends on your domain, what problem statement you're in.", 'start': 34352.32, 'duration': 4.662}, {'end': 34361.076, 'text': 'What we do is, Data points in three dimensions.', 'start': 34357.823, 'duration': 3.253}, {'end': 34364.718, 'text': "Depending on the scales you're looking at, you'll get different number of clusters.", 'start': 34361.736, 'duration': 2.982}, {'end': 34368.86, 'text': "Suppose I'm looking at this scale, I have clusters, well-defined cluster here.", 'start': 34365.658, 'duration': 3.202}], 'summary': 'Clustering tracks changing customer segments over time with varying cluster numbers.', 'duration': 26.684, 'max_score': 34342.176, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw34342176.jpg'}, {'end': 34472.86, 'src': 'embed', 'start': 34435.849, 'weight': 0, 'content': [{'end': 34438.291, 'text': "In retail, it's used for customer segmentation and analysis.", 'start': 34435.849, 'duration': 2.442}, {'end': 34440.129, 'text': "It's used for all these things.", 'start': 34439.209, 'duration': 0.92}, {'end': 34442.41, 'text': 'But this is not part of our delivery.', 'start': 34440.93, 'duration': 1.48}, {'end': 34443.691, 'text': "So we'll move on.", 'start': 34442.43, 'duration': 1.261}, {'end': 34448.933, 'text': "The reason I'm showing you these things is when you're doing a capstone project, think about these things.", 'start': 34444.011, 'duration': 4.922}, {'end': 34452.075, 'text': 'Maybe you can bring all these things into your project.', 'start': 34448.973, 'duration': 3.102}, {'end': 34455.036, 'text': 'This is what I call dynamic clustering.', 'start': 34453.495, 'duration': 1.541}, {'end': 34459.718, 'text': 'You can use clustering to understand how data is shifting from one point to another point over a period of time.', 'start': 34455.536, 'duration': 4.182}, {'end': 34472.86, 'text': 'The biggest problem with clustering is k means clustering is how do you find the right number of k and where are those k data points?', 'start': 34465.896, 'duration': 6.964}], 'summary': 'Retail uses dynamic clustering for customer analysis and segmentation.', 'duration': 37.011, 'max_score': 34435.849, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw34435849.jpg'}, {'end': 34510.556, 'src': 'embed', 'start': 34479.584, 'weight': 1, 'content': [{'end': 34487.048, 'text': "So I'll have to iterate through many times before we find the most understandable clusters.", 'start': 34479.584, 'duration': 7.464}, {'end': 34490.17, 'text': 'Any questions on this?', 'start': 34489.389, 'duration': 0.781}, {'end': 34492.311, 'text': 'Shall I move on?', 'start': 34490.19, 'duration': 2.121}, {'end': 34496.13, 'text': 'Auto-MPG dataset.', 'start': 34494.369, 'duration': 1.761}, {'end': 34497.13, 'text': 'this is what you are talking about?', 'start': 34496.13, 'duration': 1}, {'end': 34501.572, 'text': 'Must be, that must be same, because this is what I have used.', 'start': 34499.331, 'duration': 2.241}, {'end': 34504.654, 'text': 'Shall go through this? Okay.', 'start': 34502.032, 'duration': 2.622}, {'end': 34510.556, 'text': 'How I used auto-MPG, clustering on auto-MPG dataset? Let us see that.', 'start': 34506.535, 'duration': 4.021}], 'summary': 'Iterating to find understandable clusters on the auto-mpg dataset.', 'duration': 30.972, 'max_score': 34479.584, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw34479584.jpg'}, {'end': 34797.698, 'src': 'embed', 'start': 34765.148, 'weight': 3, 'content': [{'end': 34766.849, 'text': 'The Elbow Plot is telling you something else.', 'start': 34765.148, 'duration': 1.701}, {'end': 34776.495, 'text': 'The Elbow Plot is telling you the drop in error in 4 clusters is much severe, you cannot ignore this, than in 3 clusters.', 'start': 34767.469, 'duration': 9.026}, {'end': 34780.977, 'text': 'I think you drawn x, y.', 'start': 34777.435, 'duration': 3.542}, {'end': 34784.059, 'text': 'Oh this is my errors? No, I think it is.', 'start': 34780.977, 'duration': 3.082}, {'end': 34790.775, 'text': 'No, no, at least 7000.', 'start': 34784.319, 'duration': 6.456}, {'end': 34792.336, 'text': 'Scale Scale, okay.', 'start': 34790.775, 'duration': 1.561}, {'end': 34796.257, 'text': 'Sorry, this is my error, inertia.', 'start': 34793.156, 'duration': 3.101}, {'end': 34797.698, 'text': 'This is number of clusters.', 'start': 34796.638, 'duration': 1.06}], 'summary': 'The elbow plot reveals a significant drop in error from 3 to 4 clusters, with an inertia of at least 7000.', 'duration': 32.55, 'max_score': 34765.148, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw34765148.jpg'}, {'end': 35162.664, 'src': 'embed', 'start': 35134.003, 'weight': 4, 'content': [{'end': 35138.726, 'text': 'Even though K-means cluster is telling you four is the right number of clusters.', 'start': 35134.003, 'duration': 4.723}, {'end': 35139.346, 'text': "Let's see.", 'start': 35139.006, 'duration': 0.34}, {'end': 35139.846, 'text': "We don't know.", 'start': 35139.426, 'duration': 0.42}, {'end': 35142.728, 'text': 'But there are many outliers here.', 'start': 35141.007, 'duration': 1.721}, {'end': 35146.11, 'text': 'Outliers indicate loose clusters.', 'start': 35144.229, 'duration': 1.881}, {'end': 35152.794, 'text': 'Outliers indicate the data points lying on the edge of the clusters.', 'start': 35148.292, 'duration': 4.502}, {'end': 35154.876, 'text': "They're making your clusters very large and loose.", 'start': 35152.834, 'duration': 2.042}, {'end': 35161.043, 'text': 'so we handle the outliers.', 'start': 35156.641, 'duration': 4.402}, {'end': 35162.664, 'text': 'i removed the outliers here.', 'start': 35161.043, 'duration': 1.621}], 'summary': 'K-means cluster suggests 4 clusters, but handling outliers improved cluster tightness.', 'duration': 28.661, 'max_score': 35134.003, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw35134003.jpg'}, {'end': 35304.097, 'src': 'embed', 'start': 35277.489, 'weight': 5, 'content': [{'end': 35283.871, 'text': 'What I did is, I took one of the dimensions, say for example, horsepower.', 'start': 35277.489, 'duration': 6.382}, {'end': 35289.092, 'text': 'The target variable which I have to predict in the supervised method is mpg.', 'start': 35285.571, 'duration': 3.521}, {'end': 35298.275, 'text': "So let's do a plot between mpg and horsepower and see what the scatter plot looks like in these four clusters.", 'start': 35290.073, 'duration': 8.202}, {'end': 35301.236, 'text': 'Now this is where the information will come out.', 'start': 35299.416, 'duration': 1.82}, {'end': 35304.097, 'text': 'Look at this particular plot.', 'start': 35302.636, 'duration': 1.461}], 'summary': 'Analyzed relationship between mpg and horsepower in four clusters.', 'duration': 26.608, 'max_score': 35277.489, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw35277489.jpg'}], 'start': 34342.176, 'title': 'Customer segment clustering and dynamic analysis', 'summary': 'Discusses using clustering to monitor changes in customer segments over time, emphasizing dynamic clustering for customer analysis in retail, with a focus on preventing customer attrition through prevention strategies and challenges of finding the right number of clusters using k-means clustering.', 'chapters': [{'end': 34391.616, 'start': 34342.176, 'title': 'Customer segment clustering', 'summary': 'Discusses using clustering to monitor changes in customer segments over time, emphasizing that clusters are not fixed and can change depending on the problem statement and domain, and that clean clusters are rare, with data points often scattered in loose clusters.', 'duration': 49.44, 'highlights': ['Clusters are not fixed and can change over time, depending on the problem statement and domain.', 'Clean clusters are rare, and data points are often scattered in loose clusters.', 'Clustering is used to monitor changes in customer segments over time.']}, {'end': 34886.45, 'start': 34392.556, 'title': 'Dynamic clustering for customer analysis', 'summary': 'Discusses the application of dynamic clustering for customer analysis in retail, with a focus on tracking shifts in key customer clusters over time and preventing customer attrition through prevention strategies. the chapter also explores the challenges of finding the right number of clusters using k-means clustering on an auto-mpg dataset.', 'duration': 493.894, 'highlights': ['Dynamic clustering for customer analysis in retail involves tracking shifts in key customer clusters over time and preventing customer attrition through prevention strategies. This approach allows businesses to monitor the movement of key customers between clusters and take preventive measures if they are shifting away towards other clusters, such as underdogs.', 'Challenges of finding the right number of clusters using k-means clustering on an auto-MPG dataset. The speaker discusses the NP-hard problem of determining the right number of clusters (k) and the iterative process required to find the most understandable clusters in the auto-MPG dataset.', 'Application of dynamic clustering in customer segmentation and analysis in retail. The speaker mentions the use of clustering for customer segmentation and analysis in the retail industry, providing insights into customer behavior and preferences.']}, {'end': 35423.632, 'start': 34887.451, 'title': 'K-means clustering and analysis', 'summary': "Covers the process of assigning cluster ids to records using k-means clustering, identifying the likely number of clusters using the elbow curve, handling outliers, and analyzing the clusters' attributes and relationships for prediction purposes.", 'duration': 536.181, 'highlights': ['Identifying likely number of clusters using the Elbow curve The instructor explains the concept of the Elbow curve for identifying the likely number of clusters based on the inertia and the number of clusters, ultimately determining four as the likely number of clusters.', 'Handling outliers and their impact on clusters The process of handling outliers within clusters by replacing data points beyond two standard deviations with the median, resulting in reduced outliers and a sharper distribution, impacting the separation of clusters.', 'Analysis using scatterplot for predicting miles per gallon (mpg) The analysis of the relationship between attributes, such as horsepower and displacement, and miles per gallon (mpg) within different clusters, indicating the varying predictive strength of attributes for different types of cars.']}], 'duration': 1081.456, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw34342176.jpg', 'highlights': ['Dynamic clustering for customer analysis in retail involves tracking shifts in key customer clusters over time and preventing customer attrition through prevention strategies.', 'Challenges of finding the right number of clusters using k-means clustering on an auto-MPG dataset.', 'Clustering is used to monitor changes in customer segments over time.', 'Identifying likely number of clusters using the Elbow curve.', 'Handling outliers and their impact on clusters.', 'Analysis using scatterplot for predicting miles per gallon (mpg).']}, {'end': 37305.77, 'segs': [{'end': 35571.623, 'src': 'embed', 'start': 35546.462, 'weight': 0, 'content': [{'end': 35555.97, 'text': 'but looking at the output of this clustering on the various dimensions, i took a call that if i build a linear model,', 'start': 35546.462, 'duration': 9.508}, {'end': 35558.472, 'text': 'i have to build separate linear models.', 'start': 35555.97, 'duration': 2.502}, {'end': 35563.456, 'text': 'but if you want one single model, decision tree regressor came out to be the right fit.', 'start': 35558.472, 'duration': 4.984}, {'end': 35568.841, 'text': 'it gave me an accuracy of ninety percent almost on test data.', 'start': 35563.456, 'duration': 5.385}, {'end': 35569.621, 'text': 'you want to explore this.', 'start': 35568.841, 'duration': 0.78}, {'end': 35571.623, 'text': 'you want to try this out.', 'start': 35569.621, 'duration': 2.002}], 'summary': 'Decision tree regressor achieved 90% accuracy on test data.', 'duration': 25.161, 'max_score': 35546.462, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw35546462.jpg'}, {'end': 36037.571, 'src': 'embed', 'start': 36007.898, 'weight': 1, 'content': [{'end': 36012.6, 'text': 'So in this case, the boundary points are those two and these three are the boundary points in these two clusters.', 'start': 36007.898, 'duration': 4.702}, {'end': 36015.402, 'text': 'When you merge, they still remain the boundary points in one large cluster.', 'start': 36012.64, 'duration': 2.762}, {'end': 36021.985, 'text': 'So all the data points put together become one big cluster, super cluster.', 'start': 36018.403, 'duration': 3.582}, {'end': 36026.783, 'text': "It doesn't stop.", 'start': 36026.243, 'duration': 0.54}, {'end': 36028.124, 'text': "Hierarchical clustering doesn't stop.", 'start': 36026.843, 'duration': 1.281}, {'end': 36031.147, 'text': 'It stops at that point where you have no more clustering left.', 'start': 36028.204, 'duration': 2.943}, {'end': 36037.571, 'text': 'Just like in decision tree, the root node where the entire data frame is a root node.', 'start': 36032.367, 'duration': 5.204}], 'summary': 'Hierarchical clustering merges boundary points into a super cluster, continuing until no more clustering is possible.', 'duration': 29.673, 'max_score': 36007.898, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw36007898.jpg'}, {'end': 36961.893, 'src': 'embed', 'start': 36926.795, 'weight': 2, 'content': [{'end': 36932.118, 'text': 'dendrogramic distance between the clusters is very large and they are all balanced clusters.', 'start': 36926.795, 'duration': 5.323}, {'end': 36942.762, 'text': 'So dendrogram tells you visually how many logical clusters are likely to be there.', 'start': 36936.638, 'duration': 6.124}, {'end': 36948.845, 'text': 'Logical clusters are those clusters which meet at a very high dendrogram existence.', 'start': 36943.802, 'duration': 5.043}, {'end': 36956.39, 'text': 'Looking at this visual stuff, I can go back and redo my k-means.', 'start': 36952.687, 'duration': 3.703}, {'end': 36961.893, 'text': 'So the purpose of hierarchical clustering is this.', 'start': 36958.211, 'duration': 3.682}], 'summary': 'Dendrogramic distance between clusters is large, indicating balanced clusters, guiding a switch to k-means for improved clustering.', 'duration': 35.098, 'max_score': 36926.795, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw36926795.jpg'}], 'start': 35424.853, 'title': 'Machine learning for car models and data analysis', 'summary': 'Covers differentiating car models using machine learning, achieving 87% accuracy with support vector machine, decision tree regressor for prediction with ninety percent accuracy, challenges of k-means clustering, and hierarchical clustering in data analysis with insights into decision trees and distance calculation methods.', 'chapters': [{'end': 35523.87, 'start': 35424.853, 'title': 'Differentiating car models using machine learning', 'summary': 'Discusses the use of different attributes for building linear models for large and small cars, emphasizing the uselessness of the acceleration column and the application of k-means clustering for tech support analysis, achieving 87% accuracy with support vector machine.', 'duration': 99.017, 'highlights': ['The attributes used to build linear models for large cars differ from those used for small cars, highlighting the uselessness of the acceleration column and the application of k-means clustering for tech support analysis.', 'The uselessness of the acceleration column is emphasized, as it is found that all cars, including large and small, have overlapping clusters, rendering it a useless dimension for analysis.', 'The application of k-means clustering for tech support analysis is mentioned, with a brief insight into the achieved accuracy of 87% using support vector machine.']}, {'end': 35959.144, 'start': 35523.87, 'title': 'Decision tree for prediction', 'summary': 'Explores the use of decision tree regressor for prediction, achieving an accuracy of ninety percent on test data and the challenges of k-means clustering in determining the number of clusters, and the different methods of hierarchical clustering such as single linkage, max linkage, and average distance.', 'duration': 435.274, 'highlights': ['Decision tree regressor achieved an accuracy of ninety percent on test data. The decision tree regressor was found to be the right fit for prediction, providing an accuracy of ninety percent on the test data.', 'Challenges of determining the number of clusters in K-Means clustering. The challenges of K-Means clustering were highlighted, particularly the difficulty in determining the appropriate number of clusters to look for.', 'Different methods of hierarchical clustering, including single linkage, max linkage, and average distance. The different methods of hierarchical clustering were discussed, such as single linkage, max linkage, and average distance, each with its unique approach to combining clusters based on distance.']}, {'end': 36273.036, 'start': 35959.144, 'title': 'Hierarchical clustering in data analysis', 'summary': 'Explains hierarchical clustering in data analysis, emphasizing the process of merging clusters, the role of boundary points, and the stopping criteria, with insights into the formation of decision trees and distance calculation methods.', 'duration': 313.892, 'highlights': ["Hierarchical clustering doesn't stop until all data points become part of one single large supercluster, and there are no more data points left to cluster, similar to the root node in a decision tree (Relevance: 5)", 'The method of merging clusters in hierarchical clustering involves taking the whole group and merging it, considering the variance in the dataset to decide which clusters to merge, potentially leading to one big cluster where all the data points are put together, with the process stopping at the root node (Relevance: 4)', 'The boundary points in hierarchical clustering remain as boundary points even after merging clusters, contributing to the formation of a single large supercluster, and the clustering stops at the point where no more clustering is left, similar to the root node in a decision tree (Relevance: 3)', 'The distance calculation method used in hierarchical clustering influences the average distance of data points and centroid distance, with the requirement to satisfy the triangle of inequality and meet specific properties such as distance between points and the non-negativity of distances (Relevance: 2)', 'The method of merging clusters in hierarchical clustering involves taking the whole group and merging it, considering the variance in the dataset to decide which clusters to merge, potentially leading to one big cluster where all the data points are put together, with the process stopping at the root node (Relevance: 1)']}, {'end': 36844.852, 'start': 36274.082, 'title': 'Hierarchical clustering and dendrogram', 'summary': 'Covers the concept of hierarchical clustering using complete linkage and the formation of dendrogram to visualize the clustering process, with an emphasis on the importance of dendrogramic distance and the objective of clustering.', 'duration': 570.77, 'highlights': ['The process of hierarchical clustering using complete linkage and the formation of dendrogram to visualize the clustering process is explained, emphasizing the importance of dendrogramic distance (e.g., BF and A at a distance of 0.62) and the objective of clustering (i.e., tightest and farthest clusters).', 'The concept of dendrogramic distance is highlighted, showcasing how clusters coalesce and merge at different levels in the dendrogram, with a focus on identifying farthest and tightest clusters based on the dendrogramic scale.', 'Explanation of the distance matrix and the use of complete linkage in hierarchical clustering, demonstrating the iterative process of combining clusters based on the least max distance between points and the visualization of the clustering structure resembling a decision tree.']}, {'end': 37305.77, 'start': 36845.372, 'title': 'Hierarchical clustering overview', 'summary': 'Discusses the concept of hierarchical clustering, visually illustrating how logical clusters are determined from dendrogramic distance, emphasizing the significance of visual interpretation in guiding k-means clustering, and highlighting the importance of labeling clusters for meaningful interpretation.', 'duration': 460.398, 'highlights': ['The chapter explains how logical clusters are visually determined from dendrogramic distance, guiding the process of redoing k-means clustering for improved accuracy.', 'It emphasizes the significance of balanced and populated clusters, cautioning against clusters with significant size discrepancies, and advises on the importance of drawing horizontal lines to ensure clusters are well-separated.', 'The algorithm merges clusters based on dendrogramic distance, recalculating the distance matrix iteratively until all data points are part of a large supercluster, providing insights into the hierarchical clustering process.', 'It highlights the need for meaningful labeling of clusters, emphasizing that the interpretation and labeling of clusters require manual effort to derive actionable insights from the analysis.']}], 'duration': 1880.917, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw35424853.jpg', 'highlights': ['Decision tree regressor achieved 90% accuracy on test data.', "Hierarchical clustering doesn't stop until all data points become part of one single large supercluster.", 'The process of hierarchical clustering using complete linkage and the formation of dendrogram to visualize the clustering process is explained.', 'The chapter explains how logical clusters are visually determined from dendrogramic distance, guiding the process of redoing k-means clustering for improved accuracy.']}, {'end': 39350.006, 'segs': [{'end': 38324.221, 'src': 'embed', 'start': 38291.804, 'weight': 0, 'content': [{'end': 38301.092, 'text': 'hierarchical clustering are making use of co phonetic distance calculation, coefficient calculation dendrograms and the linkage methods.', 'start': 38291.804, 'duration': 9.288}, {'end': 38306.797, 'text': 'Now look at this.', 'start': 38303.794, 'duration': 3.003}, {'end': 38312.621, 'text': 'I am going to use on this wine data set average linkage method, ok,', 'start': 38306.797, 'duration': 5.824}, {'end': 38316.865, 'text': 'and that average linkage method I am going to feed it to this co phonetic coefficient calculator.', 'start': 38312.621, 'duration': 4.244}, {'end': 38324.221, 'text': 'This calculator is giving me a coefficient of 83%,', 'start': 38319.98, 'duration': 4.241}], 'summary': 'Hierarchical clustering using co-phonetic distance calculation and average linkage method achieved a coefficient of 83% on wine dataset.', 'duration': 32.417, 'max_score': 38291.804, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw38291804.jpg'}, {'end': 38455.575, 'src': 'embed', 'start': 38418.694, 'weight': 3, 'content': [{'end': 38420.156, 'text': 'Yeah OK.', 'start': 38418.694, 'duration': 1.462}, {'end': 38422.317, 'text': 'This is the one we had originally.', 'start': 38420.176, 'duration': 2.141}, {'end': 38441.024, 'text': "OK?, If I change this linkage method from 40 to say 80 and redraw this, It's a very computationally intensive algorithm, k-means and hierarchical.", 'start': 38422.317, 'duration': 18.707}, {'end': 38443.266, 'text': 'So many distance calculations have to happen.', 'start': 38441.365, 'duration': 1.901}, {'end': 38445.067, 'text': 'Now look at this.', 'start': 38444.467, 'duration': 0.6}, {'end': 38455.575, 'text': "At a dendrogramic distance of 80, which means you are somewhere here, when you draw a horizontal line, you're going to cut only one vertical axis.", 'start': 38447.069, 'duration': 8.506}], 'summary': 'Changing linkage method to 80 reduces computational intensity and distance calculations in k-means and hierarchical algorithms.', 'duration': 36.881, 'max_score': 38418.694, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw38418694.jpg'}, {'end': 38545.374, 'src': 'embed', 'start': 38518.108, 'weight': 2, 'content': [{'end': 38521.089, 'text': 'When I redo this analysis I get different clusters.', 'start': 38518.108, 'duration': 2.981}, {'end': 38530.745, 'text': 'So every time you change the distance of finding the nearest clusters, distance methodology of finding nearest, your decision trees will change.', 'start': 38523.001, 'duration': 7.744}, {'end': 38535.388, 'text': 'Once again, look at this.', 'start': 38534.427, 'duration': 0.961}, {'end': 38542.092, 'text': "The reason why I'm not happy with this is, if I want equally balanced clusters, I'll have to give a very low threshold.", 'start': 38536.309, 'duration': 5.783}, {'end': 38545.374, 'text': 'If I give a very low threshold, it becomes meaningless clusters.', 'start': 38542.592, 'duration': 2.782}], 'summary': 'Changing distance methodology alters clusters and balance, affecting decision trees.', 'duration': 27.266, 'max_score': 38518.108, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw38518108.jpg'}, {'end': 38626.16, 'src': 'embed', 'start': 38596.056, 'weight': 4, 'content': [{'end': 38600.958, 'text': 'four clusters and the cluster size are similar, almost similar cluster size.', 'start': 38596.056, 'duration': 4.902}, {'end': 38604.3, 'text': 'Let us look at the co-finitic coefficient.', 'start': 38602.539, 'duration': 1.761}, {'end': 38608.142, 'text': 'The co-finitic coefficient for WARD is only 66 percent.', 'start': 38605.48, 'duration': 2.662}, {'end': 38616.914, 'text': 'Only 66% of the original distance has been maintained by your dendrogramic distance.', 'start': 38609.989, 'duration': 6.925}, {'end': 38623.818, 'text': 'Many times, this 66% is actually good.', 'start': 38618.775, 'duration': 5.043}, {'end': 38626.16, 'text': 'We consider this a good coefficient.', 'start': 38624.159, 'duration': 2.001}], 'summary': 'Ward cluster has a 66% co-finitic coefficient, considered good.', 'duration': 30.104, 'max_score': 38596.056, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw38596056.jpg'}, {'end': 38872.141, 'src': 'embed', 'start': 38843.932, 'weight': 1, 'content': [{'end': 38849.959, 'text': 'the algorithm, the k-means algorithm, has given you two clusters In these two clusters.', 'start': 38843.932, 'duration': 6.027}, {'end': 38851.68, 'text': 'are these good clusters or not??', 'start': 38849.959, 'duration': 1.721}, {'end': 38856.765, 'text': 'We can find that out using what is called the Silote coefficients or Silote plots.', 'start': 38852.221, 'duration': 4.544}, {'end': 38861.01, 'text': 'What the Silote coefficient is, it is a ratio.', 'start': 38858.547, 'duration': 2.463}, {'end': 38867.816, 'text': 'Ratio of, take any data point, take any data point.', 'start': 38863.853, 'duration': 3.963}, {'end': 38872.141, 'text': 'Take for example the green data point here, the green one.', 'start': 38869.618, 'duration': 2.523}], 'summary': 'The k-means algorithm has created two clusters, which can be evaluated using silote coefficients or silote plots.', 'duration': 28.209, 'max_score': 38843.932, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw38843932.jpg'}], 'start': 37307.892, 'title': 'Interpreting clusters and linkage methods', 'summary': 'Discusses interpreting clusters using box plots and hierarchical clustering on wine data, emphasizing the use of silhouette coefficients and the impact of linkage methods on cluster quality and interpretability, with an achievement of 83% coefficient calculation in agglomerative clustering.', 'chapters': [{'end': 37466.069, 'start': 37307.892, 'title': 'Interpreting clusters with box plots', 'summary': 'Discusses interpreting clusters using box plots, with a focus on using centroids to interpret clusters based on dimensions, and provides a fictitious example of interpreting clusters based on interest in sports and music.', 'duration': 158.177, 'highlights': ['Using box plots to interpret clusters based on centroids and dimensions, with a fictitious example of interpreting clusters based on interest in sports and music.', "Explaining how to interpret clusters using centroids and dimensions, as well as the example of people's interests in sports and music.", 'Emphasizing the interpretation of clusters based on centroids and dimensions, with a focus on the position of centroids in the various dimensions.']}, {'end': 37994.512, 'start': 37467.37, 'title': 'Interpreting clusters with box plots', 'summary': 'Discusses the use of box plots to interpret clusters, the importance of interpreting clusters meaningfully, and the significance of co-phonetic correlation coefficient in hierarchical clustering, with emphasis on preserving original euclidean distance.', 'duration': 527.142, 'highlights': ['Box plots are used to interpret clusters visually by assessing the separation of box pillars on multiple dimensions, indicating good or overlapping clusters.', 'Interpreting clusters meaningfully is crucial, and methods such as ELBO, Silaute analysis, and convergence of k values across different methods are essential for accurate interpretation.', 'The co-phonetic correlation coefficient in hierarchical clustering measures the preservation of original Euclidean distance, guiding the selection of the right linkage methods for forming clusters.']}, {'end': 38417.694, 'start': 37995.113, 'title': 'Hierarchical clustering on wine data', 'summary': 'Discusses the application of hierarchical clustering on a wine dataset to identify clusters and outliers, determining the number of clusters through pair plot analysis, and using agglomerative clustering with average linkage method to achieve 83% coefficient calculation.', 'duration': 422.581, 'highlights': ['Using agglomerative clustering with average linkage method to obtain 83% coefficient calculation The chapter demonstrates the use of average linkage method in agglomerative clustering to achieve an 83% coefficient calculation, indicating the maintenance of 83% original distance between data points.', 'Determining number of clusters through pair plot analysis The speaker discusses the process of determining the number of clusters through pair plot analysis, identifying multiple clusters in different dimensions based on the distribution of data points.', 'Exploring the identification of outliers and clusters through hierarchical clustering The discussion revolves around the identification of outliers and clusters using hierarchical clustering, including the possibility of outliers forming their own cluster and the impact on the overall clustering process.']}, {'end': 38722.463, 'start': 38418.694, 'title': 'Clustering analysis and linkage methods', 'summary': 'Discusses the impact of changing linkage methods and thresholds in clustering analysis, highlighting the challenges of achieving balanced and interpretable clusters, with a focus on the trade-offs between different methods and their impact on the quality and interpretability of clusters.', 'duration': 303.769, 'highlights': ['The impact of changing linkage methods and thresholds in clustering analysis is discussed, emphasizing the challenges of achieving balanced and interpretable clusters, with a focus on the trade-offs between different methods and their impact on the quality and interpretability of clusters.', 'The computational intensity of the algorithm, k-means, and hierarchical clustering is highlighted, with an emphasis on the significant number of distance calculations required for these methods.', 'The effect of altering the dendrogramic distance threshold on the resulting clusters is explained, showcasing the trade-offs between creating super clusters versus smaller, imbalanced clusters.', 'The influence of changing the distance calculation method, including examples of transitioning from average to complete linkage and then to WARD, on the resulting clusters is demonstrated, emphasizing the impact on cluster size, balance, and interpretability.', 'The concept of the co-finitic coefficient for WARD is introduced, with a focus on how it reflects the maintenance of original distance within the dendrogramic distance, and the consideration of 66% as a good coefficient.', 'The discussion of subjective quality in cluster analysis is presented, highlighting the manual interpretation process and the use of tools to assess the interpretability and quality of clusters, with a final emphasis on the subjective nature of cluster quality assessment.']}, {'end': 39350.006, 'start': 38722.744, 'title': 'Clustering and silhouette coefficients', 'summary': 'Discusses the importance of clusters being interpretable and the use of silhouette coefficients to determine the right number of clusters, with a focus on maximizing the coefficient for optimal clustering.', 'duration': 627.262, 'highlights': ['The relevance of interpretable clusters in clustering mechanisms It emphasizes the importance of clusters being interpretable for reliable clustering mechanisms.', 'Explanation of Silhouette coefficients and their significance in determining the right number of clusters It provides a detailed explanation of Silhouette coefficients and their significance in determining the optimal number of clusters based on maximizing the coefficient.', 'The method for finding the best Silhouette coefficient for clustering It explains the process of finding the best Silhouette coefficient for clustering by identifying the maximum coefficient as the ideal clustering solution.']}], 'duration': 2042.114, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/edvg4eHi_Mw/pics/edvg4eHi_Mw37307892.jpg', 'highlights': ['Using agglomerative clustering with average linkage method to obtain 83% coefficient calculation', 'Explanation of Silhouette coefficients and their significance in determining the right number of clusters', 'The impact of changing linkage methods and thresholds in clustering analysis is discussed, emphasizing the challenges of achieving balanced and interpretable clusters', 'The computational intensity of the algorithm, k-means, and hierarchical clustering is highlighted', 'The concept of the co-finitic coefficient for WARD is introduced, with a focus on how it reflects the maintenance of original distance within the dendrogramic distance']}], 'highlights': ['The average salary of a data scientist is around $120,000, and the average salary of a Python developer is $100,000, indicating the high demand for skills in data science and Python.', 'The agenda for the Python for data science course includes working with the basics of Python, data structures, numerical computing with NumPy, data manipulation with the Pandas library, data visualization with the Matplotlib library, and machine learning algorithms such as linear regression, logistic regression, and Naive Bayes, as well as understanding unsupervised learning and clustering.', "Anaconda provides pre-installed packages like Matplotlib, Pandas, and NumPy for data manipulation and visualization, streamlining the installation process and enhancing Python's capabilities.", 'Introduces non-primitive data structures in Python, including tuple, list, dictionary, and set, highlighting the ability to store multiple elements in a single data structure to avoid the limitation of storing only one value in a variable, with an example of storing names of 10,000 music concert attendees.', 'Functions in Python encapsulate tasks like depositing money, withdrawing money, and checking balance, simplifying code structure and reducing redundancy.', 'Introduces object oriented programming in Python, covering classes, objects, and inheritance with examples.', 'The process of extracting records based on multiple conditions is illustrated using the iris dataset, showing the extraction of records where the sepal length is greater than six, the sepal width is greater than three, and the petal length is greater than three, resulting in only three out of 150 records satisfying the criteria.', 'Demonstrates creation of different plots using matplotlib in python', 'Linear regression represents the relationship between independent and dependent variables.', 'The algorithm uses the process of gradient descent to minimize the sum of errors across all data points, resulting in the identification of the best fit line.', 'The algorithm starts from random M and C and utilizes gradient descent, employing partial derivatives to minimize error in linear regression, ensuring the reach of absolute minima.', 'The importance of utilizing pair plots as a tool to understand data and to identify non-linear distributions, suggesting it as a fundamental practice in data analysis', 'Real-time driving data analysis for risk adjustment in high-end foreign cars.', 'The gradient descent algorithm ensures that the error function reaches the global minima in the mathematical space.', 'The number of cases for non-diabetic zero is 500, whereas the number of cases for diabetic is almost half, indicating a significant class imbalance.', "The model's poor performance in predicting diabetic cases due to skewed classes and algorithms biased towards the higher represented class.", "The concept of likelihood ratios in determining the probability of events, particularly in relation to the population's diabetic percentage and the presence of symptoms.", 'Applying unsupervised methods to the car-mpg dataset revealed hidden structures and insights that could improve model performance beyond 80% accuracy.', 'Dynamic clustering for customer analysis in retail involves tracking shifts in key customer clusters over time and preventing customer attrition through prevention strategies.', 'Decision tree regressor achieved 90% accuracy on test data.', 'Using agglomerative clustering with average linkage method to obtain 83% coefficient calculation', 'The process of hierarchical clustering using complete linkage and the formation of dendrogram to visualize the clustering process is explained.']}