title
Data Science Full Course 2020 | Data Science For Beginners | Data Science from Scratch | Simplilearn

description
🔥 Data Science Bootcamp (US Only): https://www.simplilearn.com/post-graduate-program-data-science?utm_campaign=DataScienceSep182019&utm_medium=DescriptionFirstFold&utm_source=youtube 🔥IIT Kanpur Professional Certificate Course In Data Science (India Only): https://www.simplilearn.com/iitk-professional-certificate-course-data-science 🔥 Post Graduate Program In Data Science: https://www.simplilearn.com/pgp-data-science-certification-bootcamp-program?utm_campaign=DataScienceSep182019&utm_medium=DescriptionFirstFold&utm_source=youtube 🔥Data Scientist Masters Program (Discount Code - YTBE15): https://www.simplilearn.com/big-data-and-analytics/senior-data-scientist-masters-program-training?utm_campaign=DataScienceSep182019&utm_medium=DescriptionFirstFold&utm_source=youtube 🟡 Caltech AI & Machine Learning Bootcamp (For US Learners Only) - https://www.simplilearn.com/ai-machine-learning-bootcamp?utm_campaign=DataScienceSep182019&utm_medium=DescriptionFirstFold&utm_source=youtube This video will cover all the below-given topics required for a complete Data Science Tutorial: 0. Introduction (0:00) 1. Data Science basics (01:28) 2. What is Data Science (05:51) 3. Need for Data Science (06:38) 4. Business intelligence vs Data Science (17:30) 5. Prerequisites for Data Science (22:31) 6. What does a Data Scientist do? (30:23) 7. Demand for Data Scientists (53:03) 8. Linear regression (2:30:10) 9. Decision trees (2:53:39) 10. Logistic regression in R (3:09:12) 11. What is a decision tree? (3:27:04) 12. What is clustering? (4:35:40) 13. Divisive clustering (4:51:14) 14. Support vector machine (5:17:21) 15. K-means clustering (6:44:13) 16. Time series analysis (7:33:05) 17. How to Become a Data Scientist (8:26:54) 18. Job roles in Data Science (8:30:59) 19. Simplilearn certifications in Data Science (8:33:50) 20. Who is a Data Science engineer? (8:34:34) 21. Data Science engineer resume (9:00:04) 22. Data Science interview questions and answers (9:04:42) 🔥 Enroll for FREE Data Science Course & Get your Completion Certificate: https://www.simplilearn.com/getting-started-data-science-with-python-skillup?utm_campaign=Skillup-DataScienceSep182019&utm_medium=DescriptionFirstFold&utm_source=youtube To learn more about Data Science, subscribe to our YouTube channel: https://www.youtube.com/user/Simplilearn?sub_confirmation=1 Download Datasets: https://drive.google.com/drive/folders/1oa_hIIb4dnRpzjmeaDTwFtHp7n22yc21 Download the Data Science career guide to explore and step into the exciting world of data, and follow the path towards your dream career: https://www.simplilearn.com/data-science-career-guide-pdf?utm_campaign=DataScienceSep182019&utm_medium=Description&utm_source=youtube Read the full article here: https://www.simplilearn.com/career-in-data-science-ultimate-guide-article?utm_campaign=DataScienceSep182019&utm_medium=Description&utm_source=youtube Watch more videos on Data Science: https://www.youtube.com/watch?v=0gf5iLTbiQM&list=PLEiEAq2VkUUIEQ7ENKU5Gv0HpRDtOphC6 #DataScienceWithPython #DataScienceWithR #DataScienceCourse #DataScience #DataScientist #BusinessAnalytics #machinelearning ➡️ Advanced Certificate Program In Data Science This Advanced Certificate Program in Data Science is designed for working professionals & covers job-critical topics like R, Python, Machine Learning techniques, NLP concepts, & Data Visualization with Tableau, using an active learning model that includes live sessions from global professionals, practical labs, IBM Hackathons, and corporate-ready projects.
👉Learn More at: https://www.simplilearn.com/pgp-data-science-certification-bootcamp-program?utm_campaign=DataScienceSep182019&utm_medium=Description&utm_source=youtube ✅ Key features: - Advanced Data Science certificate and Purdue alumni association membership - 3 Capstone and 25+ Projects with industry data sets from Amazon, Uber, Walmart, Comcast, and many more - Masterclasses delivered by Purdue faculty and IBM experts - Exclusive hackathons and Ask Me Anything sessions by IBM - Simplilearn's JobAssist helps you get noticed by top hiring companies - Resume preparation and LinkedIn profile building - 1:1 mock interview - Career accelerator webinars - 8X higher engagement in live online classes by seasoned academics and industry professionals ✅ Skills covered:- - Exploratory Data Analysis - Descriptive Statistics - Inferential Statistics - Model building and fine tuning - Supervised and unsupervised learning - Natural Language Processing - Ensemble Learning 👉Learn More at: https://www.simplilearn.com/pgp-data-science-certification-bootcamp-program?utm_campaign=DataScienceSep182019&utm_medium=Description&utm_source=youtube 🔥🔥 Interested in Attending Live Classes? Call Us: IN - 18002127688 / US - +18445327688

detail
{'title': 'Data Science Full Course 2020 | Data Science For Beginners | Data Science from Scratch | Simplilearn', 'heatmap': [{'end': 3554.216, 'start': 3194.001, 'weight': 1}], 'summary': 'Course covers essential data science traits, machine learning algorithms, data analysis techniques, and applications of various models, achieving high accuracies in classification and prediction tasks, and emphasizes the demand for data scientists, essential skills, and job roles in the industry.', 'chapters': [{'end': 2032.855, 'segs': [{'end': 663.66, 'src': 'embed', 'start': 635.142, 'weight': 2, 'content': [{'end': 640.664, 'text': 'And last but not least, they determine what is the best mode of transport for this delivery as well.', 'start': 635.142, 'duration': 5.522}, {'end': 648.97, 'text': 'So what is data science used for? These are some of the main areas where data science is used for better decision making.', 'start': 641.084, 'duration': 7.886}, {'end': 652.292, 'text': 'There are always tricky decisions to be made.', 'start': 649.41, 'duration': 2.882}, {'end': 655.534, 'text': 'So which is the right decision, which way to go? So that is one area.', 'start': 652.452, 'duration': 3.082}, {'end': 663.66, 'text': 'Then for predicting, for performing predictive analysis like, for example can we predict delays, like in the case of airlines?', 'start': 656.055, 'duration': 7.605}], 'summary': 'Data science used for better decision-making, predictive analysis in areas like transport and airlines', 'duration': 28.518, 'max_score': 635.142, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI635142.jpg'}, {'end': 1014.826, 'src': 'embed', 'start': 943.664, 'weight': 0, 'content': [{'end': 954.548, 'text': 'So what is the process or what are the various steps in data science? 
The first step is asking the right question and exploring the data.', 'start': 943.664, 'duration': 10.884}, {'end': 958.489, 'text': "Basically, you want to know what exactly is the problem you're trying to solve.", 'start': 954.628, 'duration': 3.861}, {'end': 960.51, 'text': 'that is asking the right question.', 'start': 958.909, 'duration': 1.601}, {'end': 963.572, 'text': 'So that is the, this circle out here.', 'start': 960.81, 'duration': 2.762}, {'end': 968.495, 'text': 'Then the next step is after exploring the data.', 'start': 964.132, 'duration': 4.363}, {'end': 971.357, 'text': 'So as a first step, you will ask some questions.', 'start': 968.555, 'duration': 2.802}, {'end': 973.598, 'text': "What exactly is the problem you're trying to solve?", 'start': 971.377, 'duration': 2.221}, {'end': 980.943, 'text': 'And then obviously you will have some data for that as input and you perform some exploratory analysis on the data.', 'start': 973.678, 'duration': 7.265}, {'end': 985.446, 'text': 'For example, you need to clean the data to make sure everything is fine and so on and so forth.', 'start': 981.303, 'duration': 4.143}, {'end': 988.868, 'text': 'So all that is a part of exploratory analysis.', 'start': 985.466, 'duration': 3.402}, {'end': 991.19, 'text': 'And then you need to do the modeling.', 'start': 989.108, 'duration': 2.082}, {'end': 998.635, 'text': "Let's say, if you have to perform machine learning, you need to decide which algorithm to use and which model to use,", 'start': 991.55, 'duration': 7.085}, {'end': 1002.237, 'text': 'and Then you need to train the model, and so on and so forth.', 'start': 998.635, 'duration': 3.602}, {'end': 1005.079, 'text': "So that's all part of the modeling process.", 'start': 1002.257, 'duration': 2.822}, {'end': 1010.463, 'text': 'And then you run your data through this model and then through this process.', 'start': 1005.48, 'duration': 4.983}, {'end': 1014.826, 'text': 'And then you come out with the final results of this exercise,', 'start': 1010.863, 'duration': 3.963}], 'summary': 'Data science process involves asking the right question, exploring and cleaning data, modeling, and obtaining final results.', 'duration': 71.162, 'max_score': 943.664, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI943664.jpg'}, {'end': 1065.533, 'src': 'embed', 'start': 1039.344, 'weight': 1, 'content': [{'end': 1050.132, 'text': 'that has to be communicated in a proper way, in an easy to understand way, which is again a key part of this whole exercise communicating the results.', 'start': 1039.344, 'duration': 10.788}, {'end': 1056.009, 'text': "so let's now talk about the difference between business intelligence and data science.", 'start': 1050.666, 'duration': 5.343}, {'end': 1065.533, 'text': 'now, business intelligence was one of the initial phases where people started making or wanted to make some sense out of data.', 'start': 1056.009, 'duration': 9.524}], 'summary': 'Differentiate between business intelligence and data science, key part of communication exercise', 'duration': 26.189, 'max_score': 1039.344, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI1039344.jpg'}], 'start': 5.071, 'title': 'Data science essentials', 'summary': 'Covers essential traits, prerequisites, tools, skills, and day-to-day activities of a data scientist, emphasizing curiosity, common sense, and communication, and delving into machine learning, 
modeling, statistics, programming, data warehousing, and data visualization. it also outlines various machine learning algorithms, including regression and clustering.', 'chapters': [{'end': 238.3, 'start': 5.071, 'title': 'Data science course overview', 'summary': 'Covers the comprehensive data science course by simply learn, including topics like data acquisition, preparation, modeling, and the role of a data scientist, with a focus on practical implementation and important interview questions.', 'duration': 233.229, 'highlights': ['The chapter covers the comprehensive data science course by Simply Learn, including topics like data acquisition, preparation, modeling, and the role of a data scientist. The course covers topics such as data acquisition, preparation, mining, modeling, and maintenance, as well as practical implementation using Python and RStudio.', 'The chapter highlights the practical implementation of various algorithms such as linear regression, logistic regression, decision tree, and SVM. The course includes practical implementation of various algorithms like linear regression, logistic regression, decision tree, and SVM.', 'The chapter emphasizes the importance of exploratory data analysis in refining the selection of feature variables for model development. The importance of exploratory data analysis in refining the selection of feature variables for model development is emphasized.', 'The chapter discusses the role of a data scientist, including tasks such as data acquisition, data preparation, data modeling, and visualization and communication with stakeholders. The chapter discusses the role of a data scientist, covering tasks such as data acquisition, data preparation, data modeling, and visualization and communication with stakeholders.']}, {'end': 538.235, 'start': 238.58, 'title': 'Data science impact & applications', 'summary': 'Discusses the daily activities of a data scientist, the application of data science in various industries, and the growing demand for data scientists, including insights on employee attrition and median base salaries.', 'duration': 299.655, 'highlights': ['Self-driving cars are expected to minimize over 2 million deaths caused by car accidents annually, with most cars projected to be autonomous in 10 to 15 years. According to a study, self-driving cars are predicted to significantly reduce car accident-related deaths, with most cars expected to be autonomous in 10 to 15 years.', 'Data science contributes to the optimization of airline operations by predicting flight delays, efficiently managing routes, and ensuring proper equipment selection. Data science plays a crucial role in the aviation industry by predicting flight delays, optimizing routes, and ensuring the proper selection of equipment, ultimately improving operational efficiency.', 'Insights from data science enable the prediction of employee attrition and identification of key variables influencing turnover, providing valuable HR insights. Data science enables the prediction of employee attrition and the identification of key variables influencing turnover, offering valuable insights for human resources management.', 'The median base salaries for data scientists range from $95,000 to $165,000, reflecting the high demand for skilled professionals in the field. 
Data scientists can expect median base salaries ranging from $95,000 to $165,000, highlighting the significant demand for skilled professionals in the field.']}, {'end': 971.357, 'start': 538.735, 'title': 'Applications of data science in industries', 'summary': 'Illustrates the applications of data science in industries such as airlines, logistics, e-commerce, and entertainment, highlighting how it can reduce cancellations, predict delays, optimize routes, and provide personalized recommendations.', 'duration': 432.622, 'highlights': ['Data science applications in industries like airlines, logistics, e-commerce, and entertainment are highlighted, including reducing cancellations, predicting delays, optimizing routes, and providing personalized recommendations.', 'Examples of data science applications in the airline industry, such as route planning, predictive analytics for delay prediction, and the optimization of promotional offers, are described.', 'The use of data science in logistics, particularly by companies like FedEx, to increase efficiency, optimize routes, determine delivery times, and select the best mode of transport, is emphasized.', 'The various areas where data science is used for better decision-making, predictive analysis, and pattern analysis to discover buying patterns and seasonality in sales data are explained.', 'The process and steps involved in data science, including asking the right questions, exploring the data, and performing analysis, are outlined.']}, {'end': 1296.006, 'start': 971.377, 'title': 'Data science vs business intelligence', 'summary': 'Explains the processes involved in data science and business intelligence, comparing their data sources, methods, skills, and focus, highlighting the differences in handling structured and unstructured data, depth of analysis, and skill requirements, with data science being more comprehensive and involving a deeper scientific approach.', 'duration': 324.629, 'highlights': ['Data science involves analysis of structured and unstructured data, including web logs and customer feedback, while business intelligence primarily uses structured data from enterprise applications like ERP and CRM. Data science encompasses both structured and unstructured data, expanding its scope beyond the structured data used in business intelligence.', 'Data science goes beyond presenting historical data and aims to understand why certain behaviors occur, involving deeper statistical analysis and insights, while business intelligence focuses on presenting the truth and historical data. Data science delves into deeper statistical analysis and insights, going beyond historical data, compared to the focus on historical data in business intelligence.', 'Data science requires a broader skill set involving statistics, visualization, correlation analysis, regression, and machine learning, while business intelligence mainly focuses on visualization and requires some statistics. Data science demands a more extensive skill set involving statistics, correlation analysis, regression, and machine learning, compared to the emphasis on visualization and basic statistics in business intelligence.']}, {'end': 2032.855, 'start': 1296.366, 'title': 'Data science essentials', 'summary': 'Covers the essential traits, prerequisites, tools, skills, and the day-to-day activities of a data scientist, emphasizing the importance of curiosity, common sense, and communication. 
it also delves into machine learning, modeling, statistics, programming, data warehousing, and data visualization, while highlighting key components and popular tools and skills. moreover, it touches upon the life of a data scientist and outlines various machine learning algorithms, including regression and clustering.', 'duration': 736.489, 'highlights': ['The essential traits for a data scientist are curiosity, common sense, and communication skills, which are crucial for asking the right questions, being creative in problem-solving, and effectively communicating results. Curiosity, common sense, and communication skills are essential traits for a data scientist, crucial for problem-solving and effective communication.', 'Data science involves machine learning, modeling, statistics, programming, and understanding databases, with Python being a popular language due to its ease of learning and multiple libraries supporting data science and machine learning. Data science involves machine learning, modeling, statistics, programming, and understanding databases, with Python being popular for its ease of learning and library support.', 'Tools and skills used in data analysis include programming languages like Python and R, along with popular tools such as SAS, Jupyter Notebooks, RStudio, and Excel, emphasizing the importance of understanding statistics and data visualization capabilities. Data analysis involves programming languages like Python and R, along with tools such as SAS, Jupyter Notebooks, RStudio, and Excel, emphasizing the importance of statistics and data visualization.', 'Skills required for data warehousing encompass ETL, SQL, Hadoop, Spark, and tools like Informatica, Data Stage, Talent, and AWS Redshift, especially for handling large amounts of structured and unstructured data in a distributed mode. Data warehousing skills include ETL, SQL, Hadoop, Spark, and tools like Informatica, Data Stage, Talent, and AWS Redshift, for handling large amounts of structured and unstructured data.', 'Various machine learning algorithms such as regression for continuous value prediction and clustering for unsupervised learning are essential skills for a data scientist, emphasizing the need to understand and apply these techniques effectively. 
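As a rough sketch of the two algorithm families named above (supervised regression for predicting continuous values, unsupervised clustering for grouping unlabeled data), here is a minimal scikit-learn example; the toy arrays are invented purely for illustration and are not the course's datasets:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Supervised: regression learns a continuous mapping from labeled examples.
X = np.array([[1.0], [2.0], [3.0], [4.0]])  # single feature
y = np.array([2.1, 3.9, 6.2, 8.1])          # continuous target
reg = LinearRegression().fit(X, y)
print(reg.predict([[5.0]]))                 # predict for an unseen input

# Unsupervised: clustering groups unlabeled points, e.g. players described
# by (runs scored, wickets taken), as in the cricket example later on.
players = np.array([[5200, 8], [4800, 12], [150, 310], [90, 280], [2400, 150]])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(players)
print(labels)                               # one cluster label per player
```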
Understanding machine learning algorithms like regression for continuous value prediction and clustering for unsupervised learning is essential for a data scientist.']}], 'duration': 2027.784, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI5071.jpg', 'highlights': ['The course covers comprehensive data science topics including data acquisition, preparation, mining, modeling, and maintenance, with practical implementation using Python and RStudio.', 'The chapter emphasizes the importance of exploratory data analysis in refining the selection of feature variables for model development.', 'Insights from data science enable the prediction of employee attrition and identification of key variables influencing turnover, providing valuable HR insights.', 'Data scientists can expect median base salaries ranging from $95,000 to $165,000, highlighting the significant demand for skilled professionals in the field.', 'Data science plays a crucial role in the aviation industry by predicting flight delays, optimizing routes, and ensuring the proper selection of equipment, ultimately improving operational efficiency.', 'Data science encompasses both structured and unstructured data, expanding its scope beyond the structured data used in business intelligence.', 'Curiosity, common sense, and communication skills are essential traits for a data scientist, crucial for problem-solving and effective communication.', 'Data science involves machine learning, modeling, statistics, programming, and understanding databases, with Python being popular for its ease of learning and library support.', 'Data analysis involves programming languages like Python and R, along with tools such as SAS, Jupyter Notebooks, RStudio, and Excel, emphasizing the importance of statistics and data visualization.', 'Data warehousing skills include ETL, SQL, Hadoop, Spark, and tools like Informatica, Data Stage, Talent, and AWS Redshift, for handling large amounts of structured and unstructured data.']}, {'end': 4131.555, 'segs': [{'end': 2380.317, 'src': 'embed', 'start': 2352.784, 'weight': 14, 'content': [{'end': 2360.307, 'text': 'if the data size is too big, you may have to come up with ways to reduce it meaningfully without losing information.', 'start': 2352.784, 'duration': 7.523}, {'end': 2361.627, 'text': 'then data cleaning.', 'start': 2360.307, 'duration': 1.32}, {'end': 2366.589, 'text': 'so there will be either wrong values or null values, or there are missing values.', 'start': 2361.627, 'duration': 4.962}, {'end': 2368.25, 'text': 'so how do you handle all of that?', 'start': 2366.589, 'duration': 1.661}, {'end': 2371.872, 'text': 'A few examples of very specific stuff.', 'start': 2368.67, 'duration': 3.202}, {'end': 2373.673, 'text': 'So there are missing values.', 'start': 2371.892, 'duration': 1.781}, {'end': 2380.317, 'text': 'How do you handle missing values or null values? 
Here in this particular slide, we are seeing three types of issues.', 'start': 2373.753, 'duration': 6.564}], 'summary': 'Data cleaning involves handling wrong values, null values, and missing values.', 'duration': 27.533, 'max_score': 2352.784, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI2352784.jpg'}, {'end': 2840.148, 'src': 'embed', 'start': 2818.374, 'weight': 11, 'content': [{'end': 2828.541, 'text': 'It has a very good integrated visualization or plot mechanism which can be used for doing exploratory data analysis and then later on to do analysis,', 'start': 2818.374, 'duration': 10.167}, {'end': 2831.323, 'text': 'detailed analysis and machine learning, and so on and so forth.', 'start': 2828.541, 'duration': 2.782}, {'end': 2833.744, 'text': 'Then, of course, you can write Python programs.', 'start': 2831.443, 'duration': 2.301}, {'end': 2840.148, 'text': 'Python offers a rich library for performing data analysis and machine learning and so on.', 'start': 2834.125, 'duration': 6.023}], 'summary': 'Python provides integrated visualization for exploratory data analysis, detailed analysis, and machine learning.', 'duration': 21.774, 'max_score': 2818.374, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI2818374.jpg'}, {'end': 3013.133, 'src': 'embed', 'start': 2965.104, 'weight': 3, 'content': [{'end': 2975.733, 'text': 'At the end of the training process, you have a certain value of M and C and that is used for predicting the values of any new data that comes.', 'start': 2965.104, 'duration': 10.629}, {'end': 2976.554, 'text': 'All right.', 'start': 2976.194, 'duration': 0.36}, {'end': 2989.262, 'text': 'so the way it works is we use the training and the test data set to train the model and then validate whether the model is working fine or not using test data.', 'start': 2976.554, 'duration': 12.708}, {'end': 2996.084, 'text': 'and if it is working fine, then it is taken to the next level which is put in production.', 'start': 2989.262, 'duration': 6.822}, {'end': 2998.345, 'text': 'if not, the model has to be retrained.', 'start': 2996.084, 'duration': 2.261}, {'end': 3003.667, 'text': 'if the accuracy is not good enough, then the model is retrained, maybe with more data,', 'start': 2998.345, 'duration': 5.322}, {'end': 3008.049, 'text': 'or you come up with a newer model or algorithm and then repeat that process.', 'start': 3003.667, 'duration': 4.382}, {'end': 3009.89, 'text': 'so it is an iterative process.', 'start': 3008.049, 'duration': 1.841}, {'end': 3013.133, 'text': 'once the training is completed, training and test,', 'start': 3009.89, 'duration': 3.243}], 'summary': 'After training and testing, model is validated for accuracy. 
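The train, validate, and retrain loop described here can be sketched in a few lines of scikit-learn; this is a minimal illustration assuming a generic regression dataset, not the diamond-price data from the video, and the accuracy threshold is arbitrary:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Generic stand-in data; the video's own example is a diamond-price model.
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

# Hold back unseen data so the test actually measures generalization.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)  # training learns M and C
r2 = model.score(X_test, y_test)                  # validate on the test set

if r2 >= 0.8:  # threshold chosen purely for illustration
    print("accuracy acceptable: promote the model to production")
else:
    print("accuracy too low: retrain with more data or try another algorithm")
```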
if not good enough, retrain with more data or new algorithm.', 'duration': 48.029, 'max_score': 2965.104, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI2965104.jpg'}, {'end': 3554.216, 'src': 'heatmap', 'start': 3194.001, 'weight': 1, 'content': [{'end': 3195.521, 'text': 'So there is a huge gap.', 'start': 3194.001, 'duration': 1.52}, {'end': 3199.343, 'text': 'So what are some of the industries with high demand for data scientists?', 'start': 3195.661, 'duration': 3.682}, {'end': 3212.25, 'text': "I think gaming is definitely one area where it's an industry which is consumer facing industry and a lot of people play games and growing industry and it requires a lot of data science.", 'start': 3199.523, 'duration': 12.727}, {'end': 3215.232, 'text': 'so that is an area where data scientists are in demand.', 'start': 3212.25, 'duration': 2.982}, {'end': 3217.354, 'text': 'then we have healthcare, for example.', 'start': 3215.232, 'duration': 2.122}, {'end': 3225.56, 'text': 'data science is used for diagnosis and several other activities within healthcare, predicting, for example, a disease.', 'start': 3217.354, 'duration': 8.206}, {'end': 3231.485, 'text': 'so healthcare is definitely finance, definitely banks, insurance companies all of these.', 'start': 3225.56, 'duration': 5.925}, {'end': 3233.707, 'text': 'there is a huge demand for data scientists.', 'start': 3231.485, 'duration': 2.222}, {'end': 3237.53, 'text': 'marketing is like a horizontal functionality across all industries.', 'start': 3233.707, 'duration': 3.823}, {'end': 3239.912, 'text': "There's a demand for data scientists there.", 'start': 3237.87, 'duration': 2.042}, {'end': 3241.714, 'text': 'Then of course in technology area.', 'start': 3239.972, 'duration': 1.742}, {'end': 3245.697, 'text': 'So pretty much all of these areas there is a lot of demand.', 'start': 3241.794, 'duration': 3.903}, {'end': 3247.699, 'text': 'Globally there is a huge demand.', 'start': 3245.998, 'duration': 1.701}, {'end': 3253.905, 'text': 'So this is a very very critical skill that would be required currently as well as in the future.', 'start': 3247.719, 'duration': 6.186}, {'end': 3258.371, 'text': "Let's take a look at what are the various techniques that are used for data cleaning.", 'start': 3254.005, 'duration': 4.366}, {'end': 3265.882, 'text': 'So we need to ensure that the data is valid and it is consistent and uniform and accurate.', 'start': 3258.872, 'duration': 7.01}, {'end': 3271.049, 'text': 'So these are the various parameters that we need to ensure as a part of the data cleaning process.', 'start': 3265.922, 'duration': 5.127}, {'end': 3279.634, 'text': 'Now, what are the techniques that that are used for data cleaning are so we will see what each of these are in this particular case.', 'start': 3271.229, 'duration': 8.405}, {'end': 3287.138, 'text': 'And so what is the data set that we have, we have data about a bank and its customer details.', 'start': 3280.254, 'duration': 6.884}, {'end': 3291.101, 'text': "So let's take an example and see how we go about cleaning the data.", 'start': 3287.218, 'duration': 3.883}, {'end': 3295.263, 'text': 'And in this particular example, we are assuming we are using Python.', 'start': 3291.401, 'duration': 3.862}, {'end': 3300.567, 'text': "So let's Assume we loaded this data, which is the raw file.csv.", 'start': 3295.523, 'duration': 5.044}, {'end': 3303.369, 'text': 'This is how the customer data looks like.', 'start': 3300.707, 
'duration': 2.662}, {'end': 3309.514, 'text': 'And we will see, for example, we take a closer look at the geography column.', 'start': 3303.75, 'duration': 5.764}, {'end': 3312.817, 'text': 'We will see that there are quite a few blank spaces.', 'start': 3309.774, 'duration': 3.043}, {'end': 3319.942, 'text': 'So how do we go about when we have some blank spaces or if it is a string value,', 'start': 3313.277, 'duration': 6.665}, {'end': 3325.907, 'text': 'then we put an empty string here or we just use a space or empty string?', 'start': 3319.942, 'duration': 5.965}, {'end': 3330.649, 'text': 'If they are numerical values, then we need to come up with a strategy.', 'start': 3326.327, 'duration': 4.322}, {'end': 3334.19, 'text': 'For example, we put the mean value.', 'start': 3331.209, 'duration': 2.981}, {'end': 3339.753, 'text': 'So wherever it is missing, we find the mean for that particular column.', 'start': 3334.27, 'duration': 5.483}, {'end': 3346.576, 'text': "So in this case, let's assume we have credit score and we see that quite a few of these values are missing.", 'start': 3339.813, 'duration': 6.763}, {'end': 3356.124, 'text': 'So what do we do here? We find the mean for this column for all the existing values and we found that the mean is equal to 638.6.', 'start': 3346.656, 'duration': 9.468}, {'end': 3361.569, 'text': 'So we kind of write a piece of code to replace wherever there are blank values.', 'start': 3356.124, 'duration': 5.445}, {'end': 3366.714, 'text': 'NAN is basically like null and we just go ahead and say fill it with the mean value.', 'start': 3361.689, 'duration': 5.025}, {'end': 3368.876, 'text': 'So this is the piece of code we are writing to fill it.', 'start': 3366.814, 'duration': 2.062}, {'end': 3373.6, 'text': 'So all the blanks or all the null values get replaced with the mean value.', 'start': 3368.996, 'duration': 4.604}, {'end': 3381.803, 'text': 'Now one of the reasons for doing this is that very often if you have some such situation many of your statistical functions may not even work.', 'start': 3373.74, 'duration': 8.063}, {'end': 3389.346, 'text': "So that's the reason you need to fill up these values or either get rid of these records or fill up these values with something meaningful.", 'start': 3381.903, 'duration': 7.443}, {'end': 3393.107, 'text': 'So this is one mechanism which is basically using a mean.', 'start': 3389.426, 'duration': 3.681}, {'end': 3395.728, 'text': 'There are a few others as we move forward we can see.', 'start': 3393.267, 'duration': 2.461}, {'end': 3396.669, 'text': 'What are the other ways?', 'start': 3395.868, 'duration': 0.801}, {'end': 3403.996, 'text': 'For example, we can also say that any missing value in a particular row, if even one column value is missing,', 'start': 3396.749, 'duration': 7.247}, {'end': 3409.962, 'text': 'you just drop that particular row or delete all rows where even a single column has missing values.', 'start': 3403.996, 'duration': 5.966}, {'end': 3411.603, 'text': 'So that is one way of dealing.', 'start': 3410.042, 'duration': 1.561}, {'end': 3421.912, 'text': "Now, the problem here can be that if a lot of data has, let's say, one or two columns missing and we drop many such rows,", 'start': 3411.724, 'duration': 10.188}, {'end': 3429.195, 'text': "then overall you may lose out on, let's say, sixty percent of the data as some value, or the other missing sixty percent of the rows,", 'start': 3421.912, 'duration': 7.283}, {'end': 3436.418, 'text': "then it may not be a good idea to 
delete all the rows like in that manner, because then you're losing pretty much sixty percent of your data.", 'start': 3429.195, 'duration': 7.223}, {'end': 3442.46, 'text': "therefore your analysis won't be accurate, But if it is only 5 or 10%, then this will work.", 'start': 3436.418, 'duration': 6.042}, {'end': 3449.943, 'text': 'Another way is only to drop values where, or rather drop rows where all the columns are empty,', 'start': 3442.6, 'duration': 7.343}, {'end': 3455.265, 'text': 'which makes sense because that means that record is of really no use because it has no information in it.', 'start': 3449.943, 'duration': 5.322}, {'end': 3457.667, 'text': 'So there can be some situations like that.', 'start': 3455.445, 'duration': 2.222}, {'end': 3464.594, 'text': 'So we can provide a condition saying that drop the records where all the columns are blank or not applicable.', 'start': 3457.727, 'duration': 6.867}, {'end': 3467.317, 'text': 'We can also specify some kind of a threshold.', 'start': 3464.734, 'duration': 2.583}, {'end': 3469.979, 'text': "Let's say you have 10 or 20 columns in a row.", 'start': 3467.377, 'duration': 2.602}, {'end': 3474.183, 'text': 'You can specify that maybe five columns are blank or null.', 'start': 3470.099, 'duration': 4.084}, {'end': 3475.785, 'text': 'Then you drop that record.', 'start': 3474.444, 'duration': 1.341}, {'end': 3483.573, 'text': 'so again, we need to take care that such a condition, such a situation, the amount of data that has been removed or excluded, is not large.', 'start': 3475.945, 'duration': 7.628}, {'end': 3491.761, 'text': "if it is like maybe five percent maximum ten percent, then it's okay, but by doing this, if you're losing out on a large chunk of data,", 'start': 3483.573, 'duration': 8.188}, {'end': 3493.343, 'text': 'then it may not be a good idea.', 'start': 3491.761, 'duration': 1.582}, {'end': 3495.244, 'text': 'you need to come up with something better.', 'start': 3493.343, 'duration': 1.901}, {'end': 3499.486, 'text': 'what else we need to do next is the data preparation part is done.', 'start': 3495.244, 'duration': 4.242}, {'end': 3501.987, 'text': 'so now we get into the data mining part.', 'start': 3499.486, 'duration': 2.501}, {'end': 3504.848, 'text': 'so what exactly we do in data mining?', 'start': 3501.987, 'duration': 2.861}, {'end': 3509.349, 'text': 'primarily, we come up with ways to take meaningful decisions.', 'start': 3504.848, 'duration': 4.501}, {'end': 3516.752, 'text': 'so data mining will give us insights into the data, what is existing there, and then we can do additional stuff,', 'start': 3509.349, 'duration': 7.403}, {'end': 3521.314, 'text': 'like maybe machine learning and so on to get perform advanced analytics and so on.', 'start': 3516.752, 'duration': 4.562}, {'end': 3529.919, 'text': 'so one of the first steps we do is what is known as data discovery and which is basically like exploratory analysis.', 'start': 3521.474, 'duration': 8.445}, {'end': 3533.741, 'text': 'so we can use tools like tableau for doing some of this.', 'start': 3529.919, 'duration': 3.822}, {'end': 3536.602, 'text': "so let's just take a quick look at how we go about that.", 'start': 3533.741, 'duration': 2.861}, {'end': 3549.131, 'text': 'so tableau is excellent data mining, or actually more of a reporting or a bi tool, and you can download a trial version of tableau at tableau.com.', 'start': 3536.602, 'duration': 12.529}, {'end': 3554.216, 'text': 'or there is also tableau public, which is free and you can 
actually use and play around.', 'start': 3549.131, 'duration': 5.085}], 'summary': 'Industries with high demand for data scientists include gaming, healthcare, finance, and marketing. data cleaning techniques involve ensuring data validity, consistency, and uniformity, and using strategies like filling empty values with means or dropping rows based on specific conditions. data mining involves data discovery, exploratory analysis, and tools like tableau for reporting and visualization.', 'duration': 360.215, 'max_score': 3194.001, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI3194001.jpg'}, {'end': 3334.19, 'src': 'embed', 'start': 3303.75, 'weight': 1, 'content': [{'end': 3309.514, 'text': 'And we will see, for example, we take a closer look at the geography column.', 'start': 3303.75, 'duration': 5.764}, {'end': 3312.817, 'text': 'We will see that there are quite a few blank spaces.', 'start': 3309.774, 'duration': 3.043}, {'end': 3319.942, 'text': 'So how do we go about when we have some blank spaces or if it is a string value,', 'start': 3313.277, 'duration': 6.665}, {'end': 3325.907, 'text': 'then we put an empty string here or we just use a space or empty string?', 'start': 3319.942, 'duration': 5.965}, {'end': 3330.649, 'text': 'If they are numerical values, then we need to come up with a strategy.', 'start': 3326.327, 'duration': 4.322}, {'end': 3334.19, 'text': 'For example, we put the mean value.', 'start': 3331.209, 'duration': 2.981}], 'summary': 'Analyzing the geography column reveals numerous blank spaces. strategy for handling numerical values includes using mean value.', 'duration': 30.44, 'max_score': 3303.75, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI3303750.jpg'}, {'end': 3585.16, 'src': 'embed', 'start': 3559.721, 'weight': 10, 'content': [{'end': 3565.766, 'text': 'so you need to purchase a license and you can then run some of the data mining activities.', 'start': 3559.721, 'duration': 6.045}, {'end': 3567.067, 'text': 'say your data source.', 'start': 3565.766, 'duration': 1.301}, {'end': 3573.993, 'text': 'your data is in some excel sheet, so you can select the source as Microsoft Excel or any other format,', 'start': 3567.067, 'duration': 6.926}, {'end': 3582.398, 'text': 'and the data will be brought into the tableau environment and then it will show you what is known as dimensions and measures.', 'start': 3573.993, 'duration': 8.405}, {'end': 3585.16, 'text': 'so dimensions are all the descriptive columns.', 'start': 3582.398, 'duration': 2.762}], 'summary': 'To use tableau, purchase a license, connect data from excel, and explore dimensions and measures.', 'duration': 25.439, 'max_score': 3559.721, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI3559721.jpg'}, {'end': 3747.421, 'src': 'embed', 'start': 3717.282, 'weight': 16, 'content': [{'end': 3717.582, 'text': 'all right.', 'start': 3717.282, 'duration': 0.3}, {'end': 3723.068, 'text': 'so now, if we pull the data here, create like bar graphs, this is how it would look.', 'start': 3717.582, 'duration': 5.486}, {'end': 3723.808, 'text': 'so what is yellow?', 'start': 3723.068, 'duration': 0.74}, {'end': 3724.429, 'text': "let's go back.", 'start': 3723.808, 'duration': 0.621}, {'end': 3727.312, 'text': 'so yellow is, uh, who exited?', 'start': 3724.429, 'duration': 2.883}, {'end': 3738.198, 'text': 'and for the male, only 16.45 percent 
have exited. And we can also draw a reference line that will help us or even provide aliases.', 'start': 3727.312, 'duration': 10.886}, {'end': 3741.759, 'text': 'So these are a lot of fancy stuff that is provided by Tableau.', 'start': 3738.218, 'duration': 3.541}, {'end': 3747.421, 'text': 'You can create aliases so that it looks good rather than basic labels.', 'start': 3741.939, 'duration': 5.482}], 'summary': 'Data analysis shows 16.45% male exits; Tableau provides advanced visualization features.', 'duration': 30.139, 'max_score': 3717.282, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI3717282.jpg'}, {'end': 3807.044, 'src': 'embed', 'start': 3765.485, 'weight': 4, 'content': [{'end': 3769.686, 'text': 'we do see that there is some difference in the male and female behavior.', 'start': 3765.485, 'duration': 4.201}, {'end': 3772.186, 'text': "now let's take the next criteria, which is the credit card.", 'start': 3769.686, 'duration': 2.5}, {'end': 3777.107, 'text': "so let's see if having a credit card has any impact on the customer exit behavior.", 'start': 3772.186, 'duration': 4.921}, {'end': 3781.968, 'text': 'so just like before, we drag and drop the has-credit-card column.', 'start': 3777.107, 'duration': 4.861}, {'end': 3789.47, 'text': 'if we drag and drop here and then we will see that there is pretty much no difference between people having credit card and not having credit card.', 'start': 3781.968, 'duration': 7.502}, {'end': 3800.759, 'text': '20.81 percent of people who have no credit card have exited and similarly, 20.18 percent of people who have credit card have also exited.', 'start': 3789.47, 'duration': 11.289}, {'end': 3804.402, 'text': 'so the credit card is not having much of an impact.', 'start': 3800.759, 'duration': 3.643}, {'end': 3807.044, 'text': "that's what this piece of analysis shows.", 'start': 3804.402, 'duration': 2.642}], 'summary': 'Male and female behavior differs. 
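For readers who prefer code to Tableau, the same exit-rate breakdown by gender and by credit-card ownership can be sketched in pandas; the file name (from the transcript's "raw file.csv") and the column names (Gender, HasCreditCard, Exited) are assumptions standing in for the bank churn data used in the video:

```python
import pandas as pd

# Assumed file name based on the transcript; adjust to the actual dataset.
df = pd.read_csv("raw_file.csv")

# Percentage of customers who exited, split by gender (Exited is 0/1).
print(df.groupby("Gender")["Exited"].mean().mul(100).round(2))

# The same breakdown for credit-card holders versus non-holders.
print(df.groupby("HasCreditCard")["Exited"].mean().mul(100).round(2))
```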
having a credit card does not significantly impact customer exit behavior, as 20.8% without credit cards and 20.18% with credit cards have exited.', 'duration': 41.559, 'max_score': 3765.485, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI3765485.jpg'}, {'end': 4063.679, 'src': 'embed', 'start': 4040.351, 'weight': 0, 'content': [{'end': 4048.558, 'text': "Now let's take our example here, where we want to perform supervised learning, which is basically we want to do a multilinear regression,", 'start': 4040.351, 'duration': 8.207}, {'end': 4050.861, 'text': 'which means there are multiple independent variables.', 'start': 4048.558, 'duration': 2.303}, {'end': 4054.624, 'text': 'And then you want to perform a linear regression to predict certain values.', 'start': 4051.121, 'duration': 3.503}, {'end': 4059.577, 'text': 'So in this particular example, we have world happiness data.', 'start': 4054.795, 'duration': 4.782}, {'end': 4063.679, 'text': 'So this is the data about the happiness quotient of people from various countries.', 'start': 4059.737, 'duration': 3.942}], 'summary': 'Supervised learning with multilinear regression to predict world happiness data.', 'duration': 23.328, 'max_score': 4040.351, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI4040351.jpg'}], 'start': 2032.895, 'title': 'Data science processes and algorithms', 'summary': 'Discusses the use of clustering and decision tree algorithms to classify cricket players, various classification algorithms, data preparation, model planning, the data science model life cycle, data cleaning, data mining, and customer behavior analysis, with examples and tools such as r, python, matlab, sas, and tableau mentioned.', 'chapters': [{'end': 2116.324, 'start': 2032.895, 'title': 'Cricket player classification', 'summary': 'Discusses the use of clustering and decision tree algorithms to classify cricket players based on their performance, identifying batsmen, bowlers, and all-rounders, showcasing the logical classification process and the ease of understanding provided by decision tree.', 'duration': 83.429, 'highlights': ['The system uses clustering mechanism to group players based on their performance, such as runs scored and wickets taken, and identifies batsmen, bowlers, and all-rounders (quantifiable: categorization of players based on performance).', 'Decision tree is primarily used for classification and provides a logical way to classify inputs, with one of its biggest advantages being its ease of understanding (quantifiable: primary use for classification and ease of understanding).']}, {'end': 2450.269, 'start': 2116.324, 'title': 'Data science algorithms and project lifecycle', 'summary': 'Discusses various classification algorithms like dictionary, support vector machines, and naive bayes, and then delves into the lifecycle of a data science project, covering concept study, data preparation, and data cleaning.', 'duration': 333.945, 'highlights': ['The chapter discusses various classification algorithms like Dictionary, support vector machines, and Naive Bayes The chapter explains the advantages and popularity of Dictionary algorithm, support vector machines for classification purpose, and Naive Bayes as a statistical probability based classification algorithm.', 'The lifecycle of a data science project is covered, including concept study, data preparation, and data cleaning The concept study involves 
understanding the business problem, gathering relevant data, and asking questions to meet the end goal, while data preparation includes data munging, exploration of raw data, and dealing with issues like gaps, irrelevant columns, and data integration. Data cleaning involves handling missing values, null values, and improper data.']}, {'end': 2913.167, 'start': 2450.269, 'title': 'Data preparation and model planning', 'summary': 'Covers the importance of handling missing data and splitting data for training and testing, followed by the process of model planning and exploratory data analysis, with tools like r, python, matlab, and sas mentioned as useful for model building.', 'duration': 462.898, 'highlights': ['The chapter emphasizes the importance of handling missing data, especially in large datasets, and suggests methods like filling missing values with mean or median to ensure accurate analysis. Handling missing data in large datasets like having 50,000 records with missing values is emphasized, with suggestions to fill missing values with mean or median to ensure accurate analysis.', 'The process of splitting data into training and test datasets is described, with variations in the ratio such as 50:50, 63.33:33.3, or 80:20 highlighted, emphasizing the importance of using unseen data for testing to measure model accuracy. The process of splitting data into training and test datasets is detailed, with variations in the ratio like 50:50, 63.33:33.3, or 80:20 emphasized, highlighting the importance of using unseen data for testing to measure model accuracy.', 'The chapter explains the process of model planning, where the choice of regression or classification algorithms like linear regression, logistic regression, decision tree, or SVM is based on the problem being addressed, emphasizing the need for exploratory data analysis to understand the data and determine the appropriate model. The process of model planning is explained, highlighting the need for exploratory data analysis to understand the data and determine the appropriate model, with the choice of regression or classification algorithms based on the problem being addressed.', 'Tools like R, Python, MATLAB, and SAS are mentioned as useful for model planning and building due to their capabilities in data analysis, visualization, and machine learning. Tools like R, Python, MATLAB, and SAS are mentioned as useful for model planning and building due to their capabilities in data analysis, visualization, and machine learning.']}, {'end': 3429.195, 'start': 2913.167, 'title': 'Data science model life cycle', 'summary': 'Discusses the process of building a linear regression model to predict the price of a 1.35 carat diamond, detailing the training, testing, retraining, and deployment stages, and emphasizes the demand for data scientists in various industries.', 'duration': 516.028, 'highlights': ['The demand for data scientists is currently huge and the supply is very low, creating a significant gap in the industry. The demand for data scientists is high while the supply is low, leading to a significant gap in the industry.', 'Data science is in high demand in industries such as gaming, healthcare, finance, marketing, and technology, globally. 
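A minimal pandas sketch of the cleaning strategies covered above (fill numeric gaps with the column mean, fill string gaps with an empty string, drop fully empty rows, or drop rows below a completeness threshold); the file and column names are illustrative assumptions, not the exact course files:

```python
import pandas as pd

df = pd.read_csv("raw_file.csv")  # assumed customer file from the video

# Fill missing numeric values with the column mean (e.g. CreditScore).
df["CreditScore"] = df["CreditScore"].fillna(df["CreditScore"].mean())

# Fill missing string values with an empty string (e.g. Geography).
df["Geography"] = df["Geography"].fillna("")

# Drop only the rows where every column is missing; they carry no information.
df = df.dropna(how="all")

# Or keep rows that still have at least, say, 15 non-null values out of 20.
df = df.dropna(thresh=15)

# Sanity check: the transcript suggests keeping the data loss under roughly 5-10%.
print(len(df))
```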
Industries like gaming, healthcare, finance, marketing, and technology have a high demand for data scientists globally.', 'The chapter explains the process of data cleaning, including techniques such as filling missing values with mean, dropping rows with missing values, and ensuring data validity, consistency, uniformity, and accuracy. The process of data cleaning is detailed, covering techniques like filling missing values with mean, dropping rows with missing values, and ensuring data validity, consistency, uniformity, and accuracy.', 'The process of building a linear regression model to predict the price of a 1.35 carat diamond is elaborated, including the training, testing, retraining, and deployment stages. The process of building a linear regression model to predict the price of a 1.35 carat diamond is elaborated, including the training, testing, retraining, and deployment stages.']}, {'end': 3741.759, 'start': 3429.195, 'title': 'Data cleaning and data mining', 'summary': 'Discusses the importance of careful data deletion to ensure data accuracy, with a recommendation to drop records with all empty columns and to set a threshold of 5-10% for data exclusion. it then introduces the process of data mining, focusing on using tableau for exploratory analysis, and offers a brief demonstration of using tableau for analyzing customer exit behavior based on gender.', 'duration': 312.564, 'highlights': ["It's important to be cautious when deleting data, as removing a large chunk, such as 60% of the data, can lead to inaccurate analysis. Deleting a large chunk of data, such as 60%, can lead to inaccurate analysis.", 'The recommendation is to drop records where all the columns are empty, as they provide no useful information. The recommendation is to drop records where all the columns are empty, as they provide no useful information.', 'A threshold of 5-10% for data exclusion is suggested to ensure that a large amount of data is not removed. A threshold of 5-10% for data exclusion is suggested to ensure that a large amount of data is not removed.', 'Data mining involves obtaining insights from existing data and can lead to advanced analytics, such as machine learning. Data mining involves obtaining insights from existing data and can lead to advanced analytics, such as machine learning.', 'Tableau is recommended for exploratory analysis and data mining, offering easy drag-and-drop mechanisms and advanced visualization features. Tableau is recommended for exploratory analysis and data mining, offering easy drag-and-drop mechanisms and advanced visualization features.', 'A brief demonstration of using Tableau for analyzing customer exit behavior based on gender is provided, showcasing the ease of performing such analysis. A brief demonstration of using Tableau for analyzing customer exit behavior based on gender is provided, showcasing the ease of performing such analysis.']}, {'end': 4131.555, 'start': 3741.939, 'title': 'Customer behavior analysis and data mining', 'summary': 'Discusses customer behavior analysis, showing that on average female customers exit more than male customers, credit card has no significant impact on customer exit behavior, and the impact of geography. it also explains the advantages of data mining, the next steps after data preparation, and the process of model building, including supervised and unsupervised learning algorithms.', 'duration': 389.616, 'highlights': ['On average, female customers exit more than male customers. 
The analysis reveals that on average, female customers have a higher exit rate than male customers, indicating a gender-based difference in behavior.', 'Credit card has no significant impact on customer exit behavior. The data shows that there is minimal difference in the exit rates between customers with and without credit cards, suggesting that the presence of a credit card does not significantly affect customer behavior.', 'Advantages of data mining include predicting future trends, identifying customer behavior patterns, and quickly identifying fraudulent activity. Data mining offers the advantages of predicting future trends, identifying customer behavior patterns, and swiftly detecting fraudulent activity, enabling informed decision-making and selecting appropriate algorithms for advanced data mining activities.', 'Model building involves selecting algorithms for supervised and unsupervised learning, such as regression for continuous values and classification for categorical values. Model building encompasses selecting algorithms for supervised and unsupervised learning, including regression for continuous values and classification for categorical values, with examples of specific algorithms for each type of learning.']}], 'duration': 2098.66, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI2032895.jpg', 'highlights': ['The system uses clustering mechanism to group players based on their performance, such as runs scored and wickets taken, and identifies batsmen, bowlers, and all-rounders.', 'Decision tree is primarily used for classification and provides a logical way to classify inputs, with one of its biggest advantages being its ease of understanding.', 'The chapter discusses various classification algorithms like Dictionary, support vector machines, and Naive Bayes.', 'The lifecycle of a data science project is covered, including concept study, data preparation, and data cleaning.', 'The chapter emphasizes the importance of handling missing data, especially in large datasets, and suggests methods like filling missing values with mean or median to ensure accurate analysis.', 'The process of splitting data into training and test datasets is described, with variations in the ratio such as 50:50, 63.33:33.3, or 80:20 highlighted, emphasizing the importance of using unseen data for testing to measure model accuracy.', 'The demand for data scientists is currently huge and the supply is very low, creating a significant gap in the industry.', 'Data science is in high demand in industries such as gaming, healthcare, finance, marketing, and technology, globally.', 'The process of data cleaning, including techniques such as filling missing values with mean, dropping rows with missing values, and ensuring data validity, consistency, uniformity, and accuracy.', 'The process of building a linear regression model to predict the price of a 1.35 carat diamond is elaborated, including the training, testing, retraining, and deployment stages.', "It's important to be cautious when deleting data, as removing a large chunk, such as 60% of the data, can lead to inaccurate analysis.", 'The recommendation is to drop records where all the columns are empty, as they provide no useful information.', 'A threshold of 5-10% for data exclusion is suggested to ensure that a large amount of data is not removed.', 'Data mining involves obtaining insights from existing data and can lead to advanced analytics, such as machine learning.', 'Tableau is 
recommended for exploratory analysis and data mining, offering easy drag-and-drop mechanisms and advanced visualization features.', 'A brief demonstration of using Tableau for analyzing customer exit behavior based on gender is provided, showcasing the ease of performing such analysis.', 'On average, female customers exit more than male customers.', 'Credit card has no significant impact on customer exit behavior.', 'Advantages of data mining include predicting future trends, identifying customer behavior patterns, and quickly identifying fraudulent activity.', 'Model building involves selecting algorithms for supervised and unsupervised learning, such as regression for continuous values and classification for categorical values.']}, {'end': 5764.864, 'segs': [{'end': 4201.517, 'src': 'embed', 'start': 4176.953, 'weight': 4, 'content': [{'end': 4184.68, 'text': 'and then scikit-learn or sklearn is the library which we will use actually for this particular machine learning activity which is linear regression.', 'start': 4176.953, 'duration': 7.727}, {'end': 4189.024, 'text': 'So we have NumPy, we have pandas, and so on and so forth.', 'start': 4184.921, 'duration': 4.103}, {'end': 4199.035, 'text': 'So all these libraries are imported and then we load our data and the data is in the form of a CSV file and there are different files for each year.', 'start': 4189.045, 'duration': 9.99}, {'end': 4201.517, 'text': 'So we have data for 2015, 16 and 17.', 'start': 4199.095, 'duration': 2.422}], 'summary': 'Using scikit-learn for linear regression with data from csv files for 2015, 16, and 17.', 'duration': 24.564, 'max_score': 4176.953, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI4176953.jpg'}, {'end': 5156.431, 'src': 'embed', 'start': 5131.592, 'weight': 1, 'content': [{'end': 5139.939, 'text': 'so this was a good and quick example of the code to perform data science activity or machine learning or data mining activity.', 'start': 5131.592, 'duration': 8.347}, {'end': 5142.881, 'text': 'in this case we did what is known as linear regression.', 'start': 5139.939, 'duration': 2.942}, {'end': 5147.024, 'text': "so let's go back to our slides and see what else is there.", 'start': 5142.881, 'duration': 4.143}, {'end': 5148.124, 'text': 'so we saw this.', 'start': 5147.024, 'duration': 1.1}, {'end': 5156.431, 'text': 'these are the coefficients of each of the features in our code, and we have seen the root mean square error as well,', 'start': 5148.124, 'duration': 8.307}], 'summary': 'Performed linear regression with quick code, coefficients, and root mean square error.', 'duration': 24.839, 'max_score': 5131.592, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI5131592.jpg'}, {'end': 5206.348, 'src': 'embed', 'start': 5178.13, 'weight': 0, 'content': [{'end': 5181.951, 'text': 'But in this particular case, we have got a pretty good model, which is very good.', 'start': 5178.13, 'duration': 3.821}, {'end': 5187.753, 'text': 'Also, subsequently, we can assume that this is how the equation in linear regression.', 'start': 5182.092, 'duration': 5.661}, {'end': 5195.035, 'text': 'the model is nothing but an equation like y is equal to beta 0 plus beta 1, x1 plus beta 2, x2 plus beta 3, x3, and so on.', 'start': 5187.753, 'duration': 7.282}, {'end': 5199.539, 'text': 'so this is what we are showing here.', 'start': 5197.156, 'duration': 2.383}, {'end': 5206.348, 'text': 'so this is our 
intercept, which is beta 0, and then we have beta 1 into economy value, beta 2 into the family value,', 'start': 5199.539, 'duration': 6.809}], 'summary': 'A good linear regression model with intercept and coefficients.', 'duration': 28.218, 'max_score': 5178.13, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI5178130.jpg'}, {'end': 5541.615, 'src': 'embed', 'start': 5516.683, 'weight': 2, 'content': [{'end': 5522.486, 'text': 'it has some special functions, are for integration and for ordinary differential equations.', 'start': 5516.683, 'duration': 5.803}, {'end': 5527.208, 'text': 'so, as you can see, these are mathematical operations or mathematical functions.', 'start': 5522.486, 'duration': 4.722}, {'end': 5535.312, 'text': 'so these are readily available in this library, and it has linear algebra modules and it is built on top of numpy.', 'start': 5527.208, 'duration': 8.104}, {'end': 5537.193, 'text': 'so we will see what is there in numpy.', 'start': 5535.312, 'duration': 1.881}, {'end': 5541.615, 'text': 'So this is again as the name suggests, the num comes from numbers.', 'start': 5537.333, 'duration': 4.282}], 'summary': 'A library with special functions, linear algebra modules, and built on numpy for mathematical operations and integration.', 'duration': 24.932, 'max_score': 5516.683, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI5516683.jpg'}], 'start': 4131.555, 'title': 'Machine learning and data analysis', 'summary': 'Covers building and training machine learning models using python for linear regression analysis, including data loading, visualization, and data preparation for achieving low error and high accuracy. it also discusses communicating machine learning results and provides an overview of various python libraries.', 'chapters': [{'end': 4201.517, 'start': 4131.555, 'title': 'Machine learning with python: linear regression', 'summary': 'Covers the process of building and training a machine learning model using python, including the importation of libraries such as numpy and pandas, and the use of scikit-learn for linear regression analysis on data from csv files for the years 2015, 16, and 17.', 'duration': 69.962, 'highlights': ['The process involves importing libraries in Python such as NumPy and pandas for data manipulation and scikit-learn for linear regression analysis on data from CSV files for the years 2015, 16, and 17.', 'The data is prepared and manipulated using libraries in Python, including NumPy and pandas, to facilitate the machine learning analysis.']}, {'end': 4635.339, 'start': 4201.517, 'title': 'Data analysis and visualization', 'summary': "Covers data loading, concatenation, visualization of happiness scores across countries, identifying correlation between happiness score and rank, dropping columns based on correlation, splitting data into training and test sets, and performing linear regression to evaluate the model's accuracy and coefficients using the scikit-learn functionality.", 'duration': 433.822, 'highlights': ["The chapter covers data loading, concatenation, visualization of happiness scores across countries, identifying correlation between happiness score and rank, dropping columns based on correlation, splitting data into training and test sets, and performing linear regression to evaluate the model's accuracy and coefficients using the scikit-learn functionality. 
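To make the linear regression walkthrough above concrete, here is a minimal, self-contained sketch of the same flow in Python. The file and column names (happiness_2015.csv, 'economy', 'family', 'freedom', 'happiness_score') are assumptions standing in for the per-year CSVs and features described in the transcript; the intercept corresponds to beta 0 and the coefficients to beta 1, beta 2, and so on.

```python
# A minimal sketch of the workflow described above. File and column names
# are assumptions standing in for the per-year happiness CSVs in the video.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load one CSV per year (2015, 2016, 2017) and stack them into one frame
frames = [pd.read_csv(f"happiness_{year}.csv") for year in (2015, 2016, 2017)]
happiness = pd.concat(frames, ignore_index=True)

X = happiness[["economy", "family", "freedom"]]   # feature columns (assumed names)
y = happiness["happiness_score"]                  # target column (assumed name)

# 80/20 train/test split, matching the ratio used in the demo
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# beta 0 (intercept), beta 1..n (one per feature), and the root mean square error
print("intercept:", model.intercept_)
print("coefficients:", model.coef_)
print("RMSE:", mean_squared_error(y_test, y_pred) ** 0.5)
```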
It includes steps such as data loading, concatenation, visualization of happiness scores across countries, identifying correlation between happiness score and rank, dropping columns based on correlation, splitting data into training and test sets, and performing linear regression to evaluate the model's accuracy and coefficients using the scikit-learn functionality.", 'The darker colored countries have a lower (better) happiness rank. The visualization shows that darker colors indicate happier countries, such as Australia and the US.', 'The scatter plot demonstrates an inversely proportional relationship between happiness score and rank, indicating a strong correlation between the two.', 'The correlation graph indicates strong correlations between happiness and economy, family, and freedom.', "The chapter involves splitting the data into training and test sets, using an 80-20 split, and performing linear regression to evaluate the model's accuracy and coefficients."]}, {'end': 5178.07, 'start': 4635.339, 'title': 'Data preparation for machine learning', 'summary': 'Covers data preparation for machine learning, including removing unwanted columns, concatenating data, exploratory data analysis, and implementing linear regression using scikit-learn, achieving a low root mean square error and high accuracy.', 'duration': 542.731, 'highlights': ['The chapter covers data preparation for machine learning, including removing unwanted columns, concatenating data, exploratory data analysis, and implementing linear regression using scikit-learn, achieving a low root mean square error and high accuracy.', 'Unwanted columns like region and standard error are removed using the drop functionality, and data for 2016 and 2017 is concatenated to create a dataframe called happiness.', 'Exploratory data analysis is performed, including the creation of visualizations using Plotly to depict the correlation between happiness rank and happiness score, leading to the decision to drop the happiness rank.', 'The data is split into training and test data in an 80/20 ratio, and a linear regression model is trained using the training data, resulting in the prediction of values for the test data with a low root mean square error and high accuracy.']}, {'end': 5764.864, 'start': 5178.13, 'title': 'Communicating machine learning results and library overview', 'summary': 'Discusses the process of communicating machine learning results to stakeholders, emphasizing the need to present actionable insights in the context of the problem statement and methodology, followed by the overview of various python libraries such as pandas, scipy, numpy, matplotlib, scikit-learn, tensorflow, beautiful soup, and os, highlighting their specific functions and applications.', 'duration': 586.734, 'highlights': ['Communicating Machine Learning Results The process of communicating machine learning results to stakeholders involves presenting actionable insights in the context of the problem statement and 
methodology, effectively targeting the appropriate audience and ensuring clear communication in business terms.', "Maintenance of Machine Learning Model The maintenance of machine learning models involves checking the model's performance, testing accuracy, making necessary tweaks or changes, retraining the model with the latest data if required, and deploying the updated model.", 'Overview of Python Libraries The overview of Python libraries includes pandas for structured data operations, SciPy for scientific capabilities, NumPy for n-dimensional arrays and mathematical functions, Matplotlib for visualization, scikit-learn for machine learning activities, TensorFlow for deep learning and AI, Beautiful Soup for web scraping, and OS for operating system-related tasks.']}], 'duration': 1633.309, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI4131555.jpg', 'highlights': ['The process involves importing libraries in Python such as NumPy and pandas for data manipulation and scikit-learn for linear regression analysis on data from CSV files for the years 2015, 16, and 17.', "The chapter involves splitting the data into training and test sets, using an 80-20 split, and performing linear regression to evaluate the model's accuracy and coefficients.", 'Exploratory data analysis is performed, including the creation of visualizations using Plotly to depict the correlation between happiness rank and happiness score, leading to the decision to drop the happiness rank.', 'The data is split into training and test data in an 80/20 ratio, and a linear regression model is trained using the training data, resulting in the prediction of values for the test data with a low root mean square error and high accuracy.', 'Communicating Machine Learning Results The process of communicating machine learning results to stakeholders involves presenting actionable insights in the context of the problem statement and methodology, effectively targeting the appropriate audience and ensuring clear communication in business terms.']}, {'end': 6800.796, 'segs': [{'end': 5858.009, 'src': 'embed', 'start': 5829.25, 'weight': 2, 'content': [{'end': 5834.775, 'text': 'so it will display the default index and then the actual values in each of these rows and columns.', 'start': 5829.25, 'duration': 5.525}, {'end': 5837.137, 'text': 'so this is the way you create a data frame.', 'start': 5834.775, 'duration': 2.362}, {'end': 5844.041, 'text': "so now that we have learned some of the basics of pandas, let's take a quick look at how we use this in real life.", 'start': 5837.137, 'duration': 6.904}, {'end': 5853.706, 'text': "so let's assume we have a situation where we have some customer data and we want to kind of predict whether a customer's loan will be approved or not.", 'start': 5844.041, 'duration': 9.665}, {'end': 5858.009, 'text': 'so we have some historical data about the loans and about the customers,', 'start': 5853.706, 'duration': 4.303}], 'summary': 'Introduction to creating data frames in pandas and using real-life customer loan prediction example.', 'duration': 28.759, 'max_score': 5829.25, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI5829250.jpg'}, {'end': 5893.21, 'src': 'embed', 'start': 5865.213, 'weight': 10, 'content': [{'end': 5868.155, 'text': 'So this is a part of exploratory analysis.', 'start': 5865.213, 'duration': 2.942}, {'end': 5870.917, 'text': 'So we will first start with exploratory 
analysis.', 'start': 5868.415, 'duration': 2.502}, {'end': 5873.338, 'text': 'We will try to see how the data is looking.', 'start': 5870.957, 'duration': 2.381}, {'end': 5874.419, 'text': 'So what kind of data.', 'start': 5873.498, 'duration': 0.921}, {'end': 5880.023, 'text': "So we will of course I'll take you into the Jupyter notebook and give you a quick live demo.", 'start': 5874.739, 'duration': 5.284}, {'end': 5885.866, 'text': "But before that, let's quickly walk through some of the pieces of this program and slides,", 'start': 5880.103, 'duration': 5.763}, {'end': 5889.849, 'text': 'and then I will take you actually into the actual code and do a demo of that.', 'start': 5885.866, 'duration': 3.983}, {'end': 5893.21, 'text': 'So the Python program structure looks somewhat like this.', 'start': 5890.109, 'duration': 3.101}], 'summary': 'Exploratory analysis of data will be demonstrated using python program structure and a live demo in jupyter notebook.', 'duration': 27.997, 'max_score': 5865.213, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI5865213.jpg'}, {'end': 6297.974, 'src': 'embed', 'start': 6271.372, 'weight': 7, 'content': [{'end': 6277.657, 'text': 'Some columns are going from 0 to 100,000 and some columns are just between 10 to 20, and so on.', 'start': 6271.372, 'duration': 6.285}, {'end': 6280.9, 'text': 'These will affect the accuracy of the analysis.', 'start': 6277.817, 'duration': 3.083}, {'end': 6285.143, 'text': 'So we need to do some kind of unifying the data and so on.', 'start': 6280.98, 'duration': 4.163}, {'end': 6289.127, 'text': 'So that is what data wrangling is all about.', 'start': 6285.183, 'duration': 3.944}, {'end': 6297.974, 'text': 'So, before we actually perform any analysis, we need to bring the data to some kind of a shape so that we can perform additional analysis,', 'start': 6289.187, 'duration': 8.787}], 'summary': 'Data wrangling is necessary to unify and shape data for accurate analysis.', 'duration': 26.602, 'max_score': 6271.372, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI6271372.jpg'}, {'end': 6529.2, 'src': 'embed', 'start': 6490.649, 'weight': 0, 'content': [{'end': 6494.452, 'text': 'So, you can also perform some basic mathematical observations.', 'start': 6490.649, 'duration': 3.803}, {'end': 6497.275, 'text': 'We have already seen that mean we found out.', 'start': 6494.853, 'duration': 2.422}, {'end': 6501.959, 'text': 'So, similarly, if you do call the mean method for the data frame object,', 'start': 6497.335, 'duration': 4.624}, {'end': 6508.985, 'text': 'it will actually perform or display or calculate the mean for pretty much all the numerical columns that are available in there.', 'start': 6501.959, 'duration': 7.026}, {'end': 6514.089, 'text': 'So, for example, here applicant income, co-applicant income and all these are numerical values.', 'start': 6509.485, 'duration': 4.604}, {'end': 6517.351, 'text': 'So, it will display the mean values of all of those.', 'start': 6514.229, 'duration': 3.122}, {'end': 6522.395, 'text': 'Now, another thing that you can do is you can actually also combine data frames.', 'start': 6517.591, 'duration': 4.804}, {'end': 6529.2, 'text': "So, let's say you import data from one CSV file into one data frame and another CSV file into another data frame.", 'start': 6522.475, 'duration': 6.725}], 'summary': 'Perform mean calculations and combine data frames in python.', 'duration': 
38.551, 'max_score': 6490.649, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI6490649.jpg'}], 'start': 5764.864, 'title': 'Pandas data analysis', 'summary': 'Covers creating data frames, importing data, and using describe function in a 4x3 matrix, univariate analysis for visualizing data distribution, data preparation, handling missing values, and introduces data wrangling techniques and scikit-learn library for machine learning in python.', 'chapters': [{'end': 6086.528, 'start': 5764.864, 'title': 'Pandas data frame basics', 'summary': 'Explains the basics of creating data frames and importing data using pandas, with a focus on creating a 4x3 matrix and importing data from a csv file for exploratory analysis in python, also covering the use of describe function.', 'duration': 321.664, 'highlights': ['Explaining creation of a 4x3 data frame using numpy random number generator The chapter provides a detailed explanation of creating a 4x3 matrix data frame using np.random to generate random numbers, with 4 rows and 3 columns.', 'Importing data from a csv file using pandas read_csv method and displaying the first five rows using head method It explains the process of importing data from a csv file using pd.read_csv and displaying the first five rows of the data frame using df.head method for initial data exploration.', 'Discussing the use of describe function to get a summary of the data The chapter covers the use of the describe function to obtain a summary of the data, including statistical information for each column in the data frame.']}, {'end': 6426.714, 'start': 6086.648, 'title': 'Pandas data analysis', 'summary': 'Introduces the process of univariate analysis using pandas to visualize and understand the distribution of data in columns, identify extreme values, and the need for data preparation, data wrangling, and handling missing values before performing further analysis, such as machine learning.', 'duration': 340.066, 'highlights': ['The process of univariate analysis involves visualizing and understanding the distribution of data in each column, such as creating histograms to identify extreme values. Visualizing data distribution, creating histograms, identifying extreme values.', 'The need for data preparation is emphasized as the presence of extreme values and haphazard data distribution can make analysis difficult. Emphasizing data preparation due to extreme values and haphazard data distribution.', 'Data wrangling involves cleaning the data, handling missing values, and unifying data to bring it to a shape suitable for analysis, especially in cases where missing values are detected. Data wrangling involves cleaning, handling missing values, unifying data for analysis.', 'The process of identifying missing values in columns is discussed, highlighting the use of code to find null values and the number of observations with missing values for each column. Discussing the process of identifying missing values and the use of code to find null values.', 'Various methods for handling missing values are mentioned, including the option to exclude records with missing values if the proportion is small compared to the total observations. 
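The pandas operations mentioned in these segments fit into a few lines. A sketch, assuming a hypothetical loan_data.csv with a LoanAmount column, along the lines of the loan-prediction example in the transcript:

```python
import numpy as np
import pandas as pd

# A 4x3 data frame of random numbers, as in the demo
df = pd.DataFrame(np.random.randn(4, 3), columns=["a", "b", "c"])

# Load a (hypothetical) loan dataset and take a first look
loans = pd.read_csv("loan_data.csv")   # file and column names are assumptions
print(loans.head())                    # first five rows
print(loans.describe())                # summary statistics for numeric columns

# Count missing values per column, then fill a numeric column with its mean
print(loans.isnull().sum())
loans["LoanAmount"] = loans["LoanAmount"].fillna(loans["LoanAmount"].mean())

# Mean of every numeric column in one call
print(loans.mean(numeric_only=True))

# Combine two identically structured frames, as described for concat
combined = pd.concat([loans, loans], ignore_index=True)
```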
Mentioning various methods for handling missing values, including the option to exclude records.']}, {'end': 6800.796, 'start': 6426.874, 'title': 'Data wrangling and data analysis techniques', 'summary': 'Discusses data wrangling techniques including handling missing values by filling them with mean, checking data types, performing mathematical observations, and combining data frames, and also introduces the scikit-learn library for machine learning with easily usable apis and a variety of algorithms.', 'duration': 373.922, 'highlights': ['The chapter discusses data wrangling techniques including handling missing values by filling them with mean. Excluding observations with missing values may lead to lower accuracy; filling missing values with mean ensures data fits within the range of observations.', 'Introduces the scikit-learn library for machine learning with easily usable APIs and a variety of algorithms. Scikit-learn provides easily usable APIs for linear regression, logistic regression, and a variety of algorithms for machine learning activities.', 'Describes the method of combining data frames using the concat method, ensuring identical structure for successful merging. Demonstrates combining data frames using the concat method, requiring the same structure for successful merging.', 'Describes the process of checking data types and performing mathematical observations to calculate mean values for numerical columns. Demonstrates checking data types and calculating mean values for numerical columns using the mean method of the data frame object.']}], 'duration': 1035.932, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI5764864.jpg', 'highlights': ['Explaining creation of a 4x3 data frame using numpy random number generator with 4 rows and 3 columns', 'Importing data from a csv file using pandas read_csv method and displaying the first five rows using head method for initial data exploration', 'Discussing the use of describe function to get a summary of the data', 'The process of univariate analysis involves visualizing and understanding the distribution of data in each column, such as creating histograms to identify extreme values', 'The need for data preparation is emphasized due to extreme values and haphazard data distribution', 'Data wrangling involves cleaning, handling missing values, unifying data for analysis', 'The process of identifying missing values in columns is discussed, highlighting the use of code to find null values', 'Various methods for handling missing values are mentioned, including the option to exclude records with missing values if the proportion is small compared to the total observations', 'The chapter discusses data wrangling techniques including handling missing values by filling them with mean', 'Introduces the scikit-learn library for machine learning with easily usable APIs and a variety of algorithms', 'Describes the method of combining data frames using the concat method, ensuring identical structure for successful merging', 'Describes the process of checking data types and performing mathematical observations to calculate mean values for numerical columns']}, {'end': 8741.275, 'segs': [{'end': 6844.504, 'src': 'embed', 'start': 6820.119, 'weight': 4, 'content': [{'end': 6828.709, 'text': 'So those algorithms are available and if you want to use some of them you need to import them and from the scikit-learn library.', 'start': 6820.119, 'duration': 8.59}, {'end': 6833.635, 'text': 'So 
scikit-learn is the top level library, which is basically sklearn, right.', 'start': 6828.769, 'duration': 4.866}, {'end': 6836.337, 'text': 'and then it has a kind of sub parts in it.', 'start': 6833.795, 'duration': 2.542}, {'end': 6841.461, 'text': 'You need to import those based on what exactly, or which algorithm, you will be using.', 'start': 6836.477, 'duration': 4.984}, {'end': 6844.504, 'text': "So let's take an example as we move and we will see that.", 'start': 6841.501, 'duration': 3.003}], 'summary': 'Import algorithms from scikit-learn library for use.', 'duration': 24.385, 'max_score': 6820.119, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI6820119.jpg'}, {'end': 6883.729, 'src': 'embed', 'start': 6859.676, 'weight': 2, 'content': [{'end': 6869.221, 'text': 'How do we split it? Some people do it like 50-50, some people do it 80-20, which is training is 80 and test is 20, and so on.', 'start': 6859.676, 'duration': 9.545}, {'end': 6870.662, 'text': 'So it is individual preference.', 'start': 6869.241, 'duration': 1.421}, {'end': 6872.263, 'text': 'There are no hard and fast rules.', 'start': 6870.682, 'duration': 1.581}, {'end': 6876.845, 'text': 'By and large, we have seen that training data set is larger than the test data set.', 'start': 6872.503, 'duration': 4.342}, {'end': 6883.729, 'text': "And again, we will probably not go into details of why do we do this at this point, but that's one of the steps in machine learning.", 'start': 6876.885, 'duration': 6.844}], 'summary': 'In machine learning, training data is typically larger than test data, with varying preferences for the split.', 'duration': 24.053, 'max_score': 6859.676, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI6859676.jpg'}, {'end': 8556.126, 'src': 'embed', 'start': 8528.209, 'weight': 0, 'content': [{'end': 8537.532, 'text': "And let's say I do y equals 3 plus 4, x equals y plus 2, and then I'm going to do just x.", 'start': 8528.209, 'duration': 9.323}, {'end': 8539.113, 'text': "And let's do a plot.", 'start': 8537.532, 'duration': 1.581}, {'end': 8540.633, 'text': "We'll throw a plot in there.", 'start': 8539.473, 'duration': 1.16}, {'end': 8545.075, 'text': 'c is the combine notation, used here to group these values into vectors of Cartesian points.', 'start': 8541.273, 'duration': 3.802}, {'end': 8554.084, 'text': "So we got 1, comma 2, comma 3, which would be like your X, and then Y is 3, comma 4, comma 5, for just a standard scatter plot call.", 'start': 8545.175, 'duration': 8.909}, {'end': 8556.126, 'text': "you don't have to memorize this.", 'start': 8554.084, 'duration': 2.042}], 'summary': 'Mathematical operations performed and scatter plot created with cartesian points', 'duration': 27.917, 'max_score': 8528.209, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI8528209.jpg'}, {'end': 8646.356, 'src': 'embed', 'start': 8616.942, 'weight': 1, 'content': [{'end': 8621.105, 'text': 'Not all packages are loaded by default, but they can be installed on demand.', 'start': 8616.942, 'duration': 4.163}, {'end': 8627.97, 'text': "Remember earlier we were talking about all the different packages available? 
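As the transcript notes, you import only the scikit-learn sub-modules you need, and the train/test ratio is a free choice. A small sketch on toy data (the arrays here are made up purely for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X = np.arange(20).reshape(-1, 1)     # toy feature column
y = (X.ravel() > 9).astype(int)      # toy binary labels

# test_size sets the ratio: 0.5 gives 50-50, 0.2 gives 80-20, and so on
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```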
You don't want to install everything from R.", 'start': 8621.345, 'duration': 6.625}, {'end': 8630.051, 'text': 'It would just be a huge waste of space.', 'start': 8627.97, 'duration': 2.081}, {'end': 8631.953, 'text': 'You want to just install those packages you need.', 'start': 8630.091, 'duration': 1.862}, {'end': 8637.054, 'text': 'So, to install packages in RStudio, you go under Tools and Install Packages.', 'start': 8632.413, 'duration': 4.641}, {'end': 8640.175, 'text': "When you click on the Install Packages, you'll get a dialog box.", 'start': 8637.314, 'duration': 2.861}, {'end': 8646.356, 'text': "You'll see where it has a repository CRAN, because there's other repositories, and you can even download and install your own packages you can build.", 'start': 8640.515, 'duration': 5.841}], 'summary': 'Not all r packages are loaded by default; install on demand via rstudio tools.', 'duration': 29.414, 'max_score': 8616.942, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI8616942.jpg'}], 'start': 6800.936, 'title': 'Machine learning fundamentals', 'summary': 'Covers machine learning basics including algorithms like linear regression, logistic regression, and random forest classification, as well as the process of splitting labeled data. it explains confusion matrix, accuracy, and precision with 80% accuracy using a 2x2 matrix. additionally, it discusses data analysis techniques, data visualization, r, and rstudio functionality and package installation.', 'chapters': [{'end': 7298.22, 'start': 6800.936, 'title': 'Introduction to machine learning basics', 'summary': "Introduces the basics of machine learning, covering algorithms like linear regression, logistic regression, and random forest classification, as well as the process of splitting labeled data into training and test sets. it also demonstrates the implementation of logistic regression using scikit-learn, emphasizing the training and testing process, including the prediction and evaluation of the model's accuracy.", 'duration': 497.284, 'highlights': ['The chapter introduces the basics of machine learning, covering algorithms like linear regression, logistic regression, and random forest classification. It mentions the availability of algorithms like linear regression, logistic regression, and random forest classification for machine learning.', "It also demonstrates the implementation of logistic regression using scikit-learn, emphasizing the training and testing process, including the prediction and evaluation of the model's accuracy. It provides a detailed demonstration of the implementation of logistic regression using scikit-learn, emphasizing the training and testing process, including the prediction and evaluation of the model's accuracy.", 'The process of splitting labeled data into training and test sets is explained, along with the different ways of splitting the data. 
It explains the process of splitting labeled data into training and test sets, highlighting the different ways of splitting the data.']}, {'end': 7857.08, 'start': 7298.22, 'title': 'Understanding confusion matrix and calculating accuracy and precision', 'summary': 'Discusses the concept of confusion matrix in machine learning, explaining the calculation of accuracy and precision using a 2x2 matrix with 150 observations and achieving an 80% accuracy, while also covering the manual and library-based methods for calculating accuracy.', 'duration': 558.86, 'highlights': ['The chapter explains the concept of confusion matrix in machine learning, using a 2x2 matrix with 150 observations and demonstrates the calculation of accuracy and precision, achieving an 80% accuracy.', 'The accuracy is calculated manually as 80% by summing the true positive and true negative values and dividing by the total number of observations.', 'The precision is calculated as 80% using the formula true positives divided by the total predicted positives, illustrating the ratio of correctly predicted positive values to the total predicted positive values.', 'The scikit-learn library provides a method, accuracy_score, for calculating accuracy, which confirms the 80% accuracy achieved through the manual calculation.']}, {'end': 8208.34, 'start': 7857.22, 'title': 'Data analysis and visualization techniques', 'summary': 'Discusses data analysis techniques including data summary, visualization with histograms, handling missing values, and machine learning activities using logistic regression, with a focus on numerical columns and visualizations.', 'duration': 351.12, 'highlights': ['The chapter covers data analysis techniques including data summary, visualization with histograms, and handling missing values, with a focus on numerical columns and visualizations The chapter covers techniques such as data summary, visualization with histograms, and handling missing values, focusing on numerical columns and visualizations.', 'Visualization technique using histograms provides insights into the distribution of loan amounts and applicant incomes, highlighting the presence of extreme values in certain ranges The visualization technique using histograms provides insights into the distribution of loan amounts and applicant incomes, highlighting the presence of extreme values in certain ranges.', 'Handling missing values involves checking for missing entries in different columns and filling them with the mean value, resulting in the removal of missing values for the loan amount Handling missing values involves checking for missing entries in different columns and filling them with the mean value, resulting in the removal of missing values for the loan amount.', 'The chapter discusses the process of machine learning activities using logistic regression, including importing libraries, data separation, splitting data into training and test datasets, and scaling the data for normalization The chapter discusses the process of machine learning activities using logistic regression, including importing libraries, data separation, splitting data into training and test datasets, and scaling the data for normalization.']}, {'end': 8477.317, 'start': 8208.34, 'title': 'Introduction to r and classification', 'summary': "Covers the process of classification using the method 'predict' to measure accuracy, achieving an 80% accuracy, and introduces r as an open-source platform with extensive libraries, easy integration, and a worldwide repository 
system.", 'duration': 268.977, 'highlights': ["The method 'predict' is used to calculate the values of y for test data and measure accuracy, achieving an 80% accuracy. The 'predict' method is used to calculate the values of y for test data, which is then compared with the known value of y to measure accuracy. The achieved accuracy is 80%.", 'R is introduced as an open-source platform with extensive libraries, easy integration, and a worldwide repository system called the Comprehensive R Archive Network (CRAN) with around 10,000 packages. R is an open-source platform with extensive libraries, easy integration with popular software, and a worldwide repository system called the Comprehensive R Archive Network (CRAN) hosting around 10,000 packages.', 'The installation process of R and RStudio is explained, including downloading the executable file from the CRAN website and the RStudio Desktop Open Source License. The installation process of R is explained, including downloading the executable file from the CRAN website and the availability of RStudio Desktop Open Source License for download.']}, {'end': 8741.275, 'start': 8477.517, 'title': 'Introduction to rstudio and packages', 'summary': 'Introduces rstudio for debian distribution users, explains the layout and functionality of rstudio, and provides instructions for installing packages in rstudio, emphasizing the importance of selective package installation and showcasing the installation process for the forecast package.', 'duration': 263.758, 'highlights': ['R can be easily installed on Debian distribution, including Ubuntu, using regular package management tools, ensuring proper registration on the system setup. Debian distribution, including Ubuntu, users can install R using regular package management tools, ensuring proper registration on the system setup.', 'The layout of RStudio is explained, with emphasis on the console, environmental information, and plots, providing a comprehensive overview of the main workspace and functionality. The layout of RStudio is explained, emphasizing the console, environmental information, and plots, providing a comprehensive overview of the main workspace and functionality.', 'Instructions for installing packages in RStudio, including the process of selecting and installing the Forecast package, are provided, highlighting the importance of selective package installation and showcasing the installation process. 
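The 80% accuracy figure discussed in this block comes from a 2x2 confusion matrix over 150 observations, and the arithmetic can be checked directly. The toy labels below are constructed only to reproduce those proportions:

```python
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score

# 150 toy labels arranged to give 120 correct predictions out of 150
y_true = [1] * 60 + [0] * 60 + [1] * 15 + [0] * 15
y_pred = [1] * 60 + [0] * 60 + [0] * 15 + [1] * 15

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # manual: 120 / 150 = 0.8
precision = tp / (tp + fp)                   # manual: 60 / 75 = 0.8
print(accuracy, precision)

# The library methods confirm the manual calculation
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred))
```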
Instructions for installing packages in RStudio, including the process of selecting and installing the Forecast package, are provided, emphasizing the importance of selective package installation and showcasing the installation process."]}], 'duration': 1940.339, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI6800936.jpg', 'highlights': ['The scikit-learn library provides a method, accuracy_score, for calculating accuracy, which confirms the 80% accuracy achieved through the manual calculation.', 'The chapter explains the concept of confusion matrix in machine learning, using a 2x2 matrix with 150 observations and demonstrates the calculation of accuracy and precision, achieving an 80% accuracy.', 'R is an open-source platform with extensive libraries, easy integration with popular software, and a worldwide repository system called the Comprehensive R Archive Network (CRAN) hosting around 10,000 packages.', 'The installation process of R is explained, including downloading the executable file from the CRAN website and the availability of RStudio Desktop Open Source License for download.', 'The chapter covers techniques such as data summary, visualization with histograms, and handling missing values, focusing on numerical columns and visualizations.']}, {'end': 9724.53, 'segs': [{'end': 8768.361, 'src': 'embed', 'start': 8741.696, 'weight': 1, 'content': [{'end': 8745.838, 'text': 'Data frames have labels on them, which makes them easier to use.', 'start': 8741.696, 'duration': 4.142}, {'end': 8752.284, 'text': 'easy. you can have a column and you can have a row.', 'start': 8750.061, 'duration': 2.223}, {'end': 8754.726, 'text': 'so think of rows and columns when you see the term data frames.', 'start': 8752.284, 'duration': 2.442}, {'end': 8763.836, 'text': "and then lists are usually homogeneous groups, so in R you're usually looking at similar data that's connected in the list.", 'start': 8754.726, 'duration': 9.11}, {'end': 8768.361, 'text': 'so the first thing we do, before we even import the data, is make sure you have the data ready.', 'start': 8763.836, 'duration': 4.525}], 'summary': 'Data frames in r have labeled rows and columns, making data easier to manipulate and use.', 'duration': 26.665, 'max_score': 8741.696, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI8741696.jpg'}, {'end': 8877.348, 'src': 'embed', 'start': 8837.521, 'weight': 4, 'content': [{'end': 8839.582, 'text': "So it's the same file saved as a CSV file.", 'start': 8837.521, 'duration': 2.061}, {'end': 8841.243, 'text': "And there's also Excel.", 'start': 8839.702, 'duration': 1.541}, {'end': 8847.086, 'text': 'Excel has its own issues as far as making sure you know what the tables are and the headers are.', 'start': 8842.363, 'duration': 4.723}, {'end': 8849.527, 'text': 'But you can see that each one of these is easy to import.', 'start': 8847.386, 'duration': 2.141}, {'end': 8855.211, 'text': "So, just like it's easy to import the data, you can also export tables in R.", 'start': 8849.907, 'duration': 5.304}, {'end': 8859.894, 'text': 'So you can see here write.table(myfile, "myfile.txt").', 'start': 8855.211, 'duration': 4.683}, {'end': 8863.557, 'text': "comma separated, and the sep='\t' just means it's tab separated.", 'start': 8859.894, 'duration': 3.663}, {'end': 8865.458, 'text': "so if you're using tabbed files on there,", 'start': 8863.557, 'duration': 1.901}, {'end': 8866.899, 'text': 
'Example, Excel.', 'start': 8865.698, 'duration': 1.201}, {'end': 8870.462, 'text': 'So you can write a .xls to my file.', 'start': 8867.359, 'duration': 3.103}, {'end': 8871.963, 'text': 'In this case they did a text.', 'start': 8870.702, 'duration': 1.261}, {'end': 8875.766, 'text': "sep equals '\t', so it's tab separated for an Excel file.", 'start': 8872.163, 'duration': 3.603}, {'end': 8877.348, 'text': 'CSV, same thing.', 'start': 8876.087, 'duration': 1.261}], 'summary': 'R allows easy import and export of csv and excel files, with options for tab and comma separation.', 'duration': 39.827, 'max_score': 8837.521, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI8837521.jpg'}, {'end': 8978.295, 'src': 'embed', 'start': 8944.024, 'weight': 2, 'content': [{'end': 8945.404, 'text': 'And histogram is very popular.', 'start': 8944.024, 'duration': 1.38}, {'end': 8947.305, 'text': 'All of these are very widely used.', 'start': 8945.544, 'duration': 1.761}, {'end': 8951.671, 'text': "Let's look at the box plots, also known as whisker diagrams.", 'start': 8947.605, 'duration': 4.066}, {'end': 8959.342, 'text': 'Box plots display the distribution of data based on minimum, first quartile, median, third quartile, and maximum.', 'start': 8951.851, 'duration': 7.491}, {'end': 8964.049, 'text': 'So right off the bat, we can use a box plot to explore our data with very little work.', 'start': 8959.582, 'duration': 4.467}, {'end': 8965.309, 'text': 'to create a box plot.', 'start': 8964.409, 'duration': 0.9}, {'end': 8968.591, 'text': 'we simply give a box plot and the data very straightforward.', 'start': 8965.309, 'duration': 3.282}, {'end': 8971.572, 'text': 'so we might have passenger numbers in the thousands.', 'start': 8968.591, 'duration': 2.981}, {'end': 8978.295, 'text': 'I guess this is exploring data dealing with airplanes and you can see here they just have a simple plot.', 'start': 8971.572, 'duration': 6.723}], 'summary': 'Box plots are widely used and can display distributions, such as passenger numbers in the thousands, with very little work.', 'duration': 34.271, 'max_score': 8944.024, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI8944024.jpg'}, {'end': 9336.571, 'src': 'embed', 'start': 9302.38, 'weight': 7, 'content': [{'end': 9306.582, 'text': 'The predicted y-value, residuals of errors, the actual y-values.', 'start': 9302.38, 'duration': 4.202}, {'end': 9316.706, 'text': "So one of the things we're looking at when we put this information together is we want these distances to be minimized so it has the smallest amount of error possible.", 'start': 9307.162, 'duration': 9.544}, {'end': 9324.329, 'text': "Let's find out the predicted values of y for corresponding values of x using the linear equation where m equals 1.3 and c equals 1.1.", 'start': 9316.986, 'duration': 7.343}, {'end': 9327.25, 'text': "And here you can see we've done the same kind of chart.", 'start': 9324.329, 'duration': 2.921}, {'end': 9336.571, 'text': 'We have our y predicted, and then we have the actual y minus the y predicted, because we are looking for the error squared, or the e squared values.', 'start': 9327.59, 'duration': 8.981}], 'summary': 'Analyzing error minimization in linear regression with m=1.3 and c=1.1.', 'duration': 34.191, 'max_score': 9302.38, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI9302380.jpg'}, 
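The prediction-and-squared-error arithmetic in this passage is easy to verify in a few lines. A minimal sketch using the m = 1.3 and c = 1.1 from the text, with x = 3 and x = 6 as in the example; the actual y observations are assumed purely for illustration:

```python
# Predicted y for given x with y = m*x + c, where m = 1.3 and c = 1.1 (from the text)
m, c = 1.3, 1.1

xs = [3, 6]        # the x values used in the example
ys = [5.4, 8.4]    # assumed actual y observations, for illustration only

for x, y in zip(xs, ys):
    y_hat = m * x + c                  # predicted y: 3 -> 5.0, 6 -> 8.9
    print(x, y_hat, (y - y_hat) ** 2)  # x, prediction, squared error (e^2)

# The best-fit line is the one that minimizes this sum of squared errors
sse = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
print("sum of squared errors:", sse)
```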
{'end': 9656.688, 'src': 'embed', 'start': 9627.321, 'weight': 3, 'content': [{'end': 9630.044, 'text': 'But when you look at speed versus distance, we get 0.806849.', 'start': 9627.321, 'duration': 2.723}, {'end': 9636.631, 'text': "And if you do distance to distance, you get also 1 since they're identical variables.", 'start': 9630.044, 'duration': 6.587}, {'end': 9642.458, 'text': "So now it's time to build our linear regression model on the entire data set to build the coefficients.", 'start': 9636.972, 'duration': 5.486}, {'end': 9644.6, 'text': "Let's just take a look and see what that looks like.", 'start': 9642.718, 'duration': 1.882}, {'end': 9648.585, 'text': "Back in our console, I'm going to type in linear mod.", 'start': 9645.081, 'duration': 3.504}, {'end': 9656.688, 'text': 'And in R, we want to do the arrow, kind of like an arrow on a line, or in this case, the less than minus sign.', 'start': 9649.205, 'duration': 7.483}], 'summary': 'Linear regression model built on the entire dataset, with a speed-distance correlation of 0.806849', 'duration': 29.367, 'max_score': 9627.321, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI9627321.jpg'}, {'end': 9698.366, 'src': 'embed', 'start': 9669.892, 'weight': 0, 'content': [{'end': 9675.274, 'text': "This lets R know we're going to deal with these two columns, comma, data equals cars.", 'start': 9669.892, 'duration': 5.382}, {'end': 9678.475, 'text': "And when I hit enter on here, we've generated our linear mod.", 'start': 9675.734, 'duration': 2.741}, {'end': 9680.336, 'text': 'Now we want to go ahead and summarize it.', 'start': 9678.815, 'duration': 1.521}, {'end': 9688.16, 'text': 'And we simply do a summary, and then brackets, linear, mod, and it generates all kinds of information on our model.', 'start': 9680.756, 'duration': 7.404}, {'end': 9691.963, 'text': 'So we can explore just how well this model is fitting the data we have right now.', 'start': 9688.22, 'duration': 3.743}, {'end': 9698.366, 'text': "Now if you've done linear regression in other packages and scripts, you're going to see that this is so easy to explore data in R.", 'start': 9692.103, 'duration': 6.263}], 'summary': 'Generated linear model for data analysis in r.', 'duration': 28.474, 'max_score': 9669.892, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI9669892.jpg'}], 'start': 8741.696, 'title': 'Linear regression analysis', 'summary': 'Covers data import, visualization, and linear regression in r, including importing data from csv and excel files, visualization using various graphics, and the theory of linear regression to estimate and predict the relationship between variables.', 'chapters': [{'end': 8815.27, 'start': 8741.696, 'title': 'Importing data in r', 'summary': 'Explains the concept of data frames and lists, the process of importing data from various sources like excel, minitab, and csv, and provides a simple example of importing a table file using read.table with specific parameters.', 'duration': 73.574, 'highlights': ['Data frames have labels on them, making them easier to use, with columns and rows representing the data.', 'Lists in R are usually homogeneous groups of similar connected data.', 
'Importing data from various sources including Excel, Minitab, CSV, and text files is possible with different options available.', 'The process of importing a table file is demonstrated using read.table with specific parameters like file name and header.']}, {'end': 9238.296, 'start': 8815.65, 'title': 'Data import, visualization, and linear regression in r', 'summary': 'Covers data import from csv and excel files, visualization in r using various types of graphics, and the theory and types of linear regression, aiming to estimate and predict the relationship between variables.', 'duration': 422.646, 'highlights': ['R allows easy import of data from CSV and Excel files using read.csv and write.table functions, and visualization using various types of graphics including bar charts, pie charts, histograms, and box plots. R allows easy import of data from CSV and Excel files using read.csv and write.table functions. Visualization in R includes various types of graphics such as bar charts, pie charts, histograms, and box plots.', 'Visualization in R is powerful and quick, offering various formats for saving graphics and customization according to varied graphic needs. Visualization in R is powerful and quick, offering various formats for saving graphics such as PDF, PNG, JPEG, WMF, and PS. Additionally, the graphics can be customized according to varied graphic needs.', 'The chapter explains the theory and types of linear regression, including simple linear regression and multiple linear regression, aiming to estimate and predict the relationship between variables. The chapter explains the theory and types of linear regression, including simple linear regression and multiple linear regression. The goal is to estimate and predict the relationship between variables.']}, {'end': 9724.53, 'start': 9238.797, 'title': 'Linear regression analysis and model building', 'summary': 'Introduces linear regression analysis with a focus on predicting y-values using a linear equation, emphasizing the concept of minimizing errors, and demonstrates the correlation and building of a linear regression model using the default cars dataset in r.', 'duration': 485.733, 'highlights': ['The chapter discusses the concept of predicting y-values using a linear equation, emphasizing the calculation for y values using the formula y = mx + c, providing an example of x=3 and x=6 with corresponding predicted y-values. Calculation of y values using the linear equation, example of x=3 and x=6, corresponding predicted y-values', 'The concept of minimizing errors in the regression line is emphasized, with a focus on minimizing the sum of squared errors (e squared values) for the best fit line to ensure the least amount of error possible. Emphasis on minimizing errors in the regression line, focus on minimizing sum of squared errors for best fit line', 'The process of correlating variables using the default cars dataset in R is demonstrated, including visualizing the data using a scatterplot, conducting correlation analysis, and building a linear regression model to predict coefficients. 
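The same correlate-then-fit sequence the video performs in R (cor(), then lm(dist ~ speed, data = cars)) can be sketched with NumPy for comparison. The numbers below are a made-up stand-in for the cars data, so the printed correlation will not be exactly 0.806849:

```python
import numpy as np

# Stand-in for R's cars data: speed vs stopping distance (values assumed)
speed = np.array([4, 7, 8, 9, 10, 12, 13, 14, 15, 17, 18, 20, 22, 24])
dist = np.array([2, 4, 16, 10, 18, 20, 26, 26, 32, 40, 42, 48, 66, 70])

# Correlation matrix, like R's cor(cars); diagonal entries are 1 by definition
print(np.corrcoef(speed, dist))

# Least-squares fit, playing the role of lm(dist ~ speed, data = cars)
slope, intercept = np.polyfit(speed, dist, 1)   # degree-1 polynomial: a line
print("intercept:", intercept, "slope:", slope)
```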
Demonstration of correlating variables using default cars dataset, visualization using scatterplot, correlation analysis, building linear regression model']}], 'duration': 982.834, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI8741696.jpg', 'highlights': ['Importing data from various sources including Excel, Minitab, CSV, and text files is possible with different options available.', 'Visualization in R includes various types of graphics such as bar charts, pie charts, histograms, and box plots.', 'The chapter explains the theory and types of linear regression, including simple linear regression and multiple linear regression. The goal is to estimate and predict the relationship between variables.', 'Data frames have labels on them, making them easier to use, with columns and rows representing the data.', 'R allows easy import of data from CSV and Excel files using read.csv and write.table functions.', 'The process of importing a table file is demonstrated using read.table with specific parameters like file name and header.', 'Visualization in R is powerful and quick, offering various formats for saving graphics such as PDF, PNG, JPEG, WMF, and PS. Additionally, the graphics can be customized according to varied graphic needs.', 'The concept of minimizing errors in the regression line is emphasized, with a focus on minimizing the sum of squared errors for the best fit line to ensure the least amount of error possible.', 'The process of correlating variables using the default cars dataset in R is demonstrated, including visualizing the data using a scatterplot, conducting correlation analysis, and building a linear regression model to predict coefficients.', 'Lists in R are usually homogeneous groups of similar connected data.']}, {'end': 11209.785, 'segs': [{'end': 9845.615, 'src': 'embed', 'start': 9813.046, 'weight': 2, 'content': [{'end': 9814.886, 'text': "to kind of fit it and guess where it's going to go.", 'start': 9813.046, 'duration': 1.84}, {'end': 9819.087, 'text': 'And if you want everything to match, you want to start with the same seed in that randomizer.', 'start': 9814.906, 'duration': 4.181}, {'end': 9823.388, 'text': 'That way it recreates it identical every time you run it on a different computer.', 'start': 9819.187, 'duration': 4.201}, {'end': 9832.31, 'text': "And, just like working with data in any package, we're going to create indices for the training data and we're going to model the training data,", 'start': 9823.748, 'duration': 8.562}, {'end': 9836.391, 'text': "and then we're also going to do the test data and build our model on the training data.", 'start': 9832.31, 'duration': 4.081}, {'end': 9838.652, 'text': "And let's walk through that and see what that looks like in R.", 'start': 9836.511, 'duration': 2.141}, {'end': 9841.013, 'text': 'And I actually mistyped that one.', 'start': 9839.412, 'duration': 1.601}, {'end': 9845.615, 'text': "Let's type in set dot seed to set our seed for the 100.", 'start': 9841.093, 'duration': 4.522}], 'summary': 'Creating reproducible randomization in r using set.seed for consistent results.', 'duration': 32.569, 'max_score': 9813.046, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI9813046.jpg'}, {'end': 10227.905, 'src': 'embed', 'start': 10197.811, 'weight': 3, 'content': [{'end': 10200.573, 'text': "In this case, you'll notice that the rows have a different count.", 'start': 10197.811, 'duration': 2.762}, 
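The seeding-and-sampling routine described here uses R's set.seed and sample; an equivalent sketch in Python, with the data frame contents standing in for the cars data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(100)   # fixed seed, so the split is identical on every run

# Stand-in frame; the video uses R's built-in cars data (speed vs stopping distance)
df = pd.DataFrame({"speed": np.arange(50), "dist": np.arange(50) * 2})

train_idx = rng.choice(df.index, size=int(0.8 * len(df)), replace=False)
train = df.loc[train_idx]           # 80% of rows for training
test = df.drop(index=train_idx)     # remaining 20% held out as unseen test data
print(len(train), len(test))
```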
{'end': 10202.613, 'text': "They're not 1, 2, 3, 4, 5, 6.", 'start': 10200.593, 'duration': 2.02}, {'end': 10210.953, 'text': 'Well, we randomly picked 20% of the data, so this is the first six rows of that random selection, which comes out as 1, 4, 8, 20, 26, 31.', 'start': 10202.614, 'duration': 8.339}, {'end': 10217.841, 'text': "And we have the actuals, so the actual value is 2, and the predicted value on this first one is minus 5, so they're way off.", 'start': 10210.958, 'duration': 6.883}, {'end': 10227.905, 'text': '22-7, 26-20, 26-37, 54-42, 50-50, so one of these is actually pretty right on where a lot of them are really off at the beginning.', 'start': 10218.461, 'duration': 9.444}], 'summary': 'Randomly selected 20% of data, with actual vs. predicted values differing significantly.', 'duration': 30.094, 'max_score': 10197.811, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI10197811.jpg'}, {'end': 10546.8, 'src': 'embed', 'start': 10518.265, 'weight': 4, 'content': [{'end': 10520.688, 'text': "So we have an entropy of one on there since there's two of everything.", 'start': 10518.265, 'duration': 2.423}, {'end': 10521.889, 'text': 'Information gain.', 'start': 10521.008, 'duration': 0.881}, {'end': 10528.136, 'text': 'It is a measure of decrease in entropy after the data set is split, also known as entropy reduction.', 'start': 10522.069, 'duration': 6.067}, {'end': 10533.302, 'text': 'So we look over here and we have an entropy equals E1, and the information gain.', 'start': 10528.496, 'duration': 4.806}, {'end': 10540.452, 'text': "as we split the bananas out, the size becomes smaller and you'll see that E1 is going to be greater than E2,", 'start': 10533.302, 'duration': 7.15}, {'end': 10544.176, 'text': 'or we measure E2 of the apples and oranges in this case.', 'start': 10540.452, 'duration': 3.724}, {'end': 10546.8, 'text': "If you love your fruit, it's probably getting you hungry right about now.", 'start': 10544.537, 'duration': 2.263}], 'summary': 'Entropy reduction and information gain measured in data split, e1>e2 in fruit example.', 'duration': 28.535, 'max_score': 10518.265, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI10518265.jpg'}, {'end': 10604.089, 'src': 'embed', 'start': 10575.8, 'weight': 0, 'content': [{'end': 10581.082, 'text': 'And to predict the class of flower based on the petal length and width using R.', 'start': 10575.8, 'duration': 5.282}, {'end': 10588.604, 'text': "And you'll see here we have these beautiful irises, probably the most popular data set for beginning data analysis and statistics.", 'start': 10581.082, 'duration': 7.522}, {'end': 10592.325, 'text': 'We have the setosa, the virginica, and the versicolor.', 'start': 10589.104, 'duration': 3.221}, {'end': 10595.826, 'text': "Let's install the packages that will help us in the use case.", 'start': 10592.485, 'duration': 3.341}, {'end': 10604.089, 'text': "So because we're doing decision tree, we have rpart, rpart.plot, and then we have the library rpart and the library rpart.plot.", 'start': 10596.627, 'duration': 7.462}], 'summary': 'Using r to predict flower class based on petal dimensions using popular iris dataset.', 'duration': 28.289, 'max_score': 10575.8, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI10575800.jpg'}, {'end': 10750.903, 'src': 'embed', 'start': 10722.144, 'weight': 1, 'content': [{'end': 10726.048, 
'text': "And you can remember that's a cytosa, the versicolor, the virginica,", 'start': 10722.144, 'duration': 3.904}, {'end': 10730.01, 'text': 'the three different categories we saw earlier in the beautiful pictures of the flower.', 'start': 10726.048, 'duration': 3.962}, {'end': 10737.915, 'text': "And so we're going to go ahead and use the set seed to decide the starting point used in the generation of sequence of random numbers.", 'start': 10730.35, 'duration': 7.565}, {'end': 10745.019, 'text': "Remember, we set the seed so that if we ever want to reproduce what we're doing, it will reproduce the same thing each time,", 'start': 10738.215, 'duration': 6.804}, {'end': 10747.08, 'text': "because it's using the same randomizer seed.", 'start': 10745.019, 'duration': 2.061}, {'end': 10750.903, 'text': 'It brings slightly different results, but depending on what you need it for.', 'start': 10747.1, 'duration': 3.803}], 'summary': 'Using set seed for reproducible random number generation in data analysis.', 'duration': 28.759, 'max_score': 10722.144, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI10722144.jpg'}], 'start': 9724.83, 'title': 'Regression and decision tree models in r', 'summary': 'Explains creating regression models, training and testing in r, including computing distance, selecting a random sample, and discusses analyzing data with linear regression, decision trees, and evaluating model accuracy through correlation, entropy, and information gain. it also covers implementing decision trees for flower classification and randomizing data to build models for classifying flowers with model accuracy evaluation.', 'chapters': [{'end': 9946.173, 'start': 9724.83, 'title': 'Regression model training and testing', 'summary': 'Explains the process of creating a regression model in r, including computing distance, checking model significance, setting seed for data randomization, creating training and test data sets, and selecting a random sample of the data.', 'duration': 221.343, 'highlights': ['The value of p should be less than 0.05 for the model to be statistically significant. Emphasizes the significance level of p-value for model significance.', 'Setting the seed ensures the sample can be recreated for future use, maintaining consistency across different computers. Explains the importance of setting the seed for reproducibility of the random sample.', 'Creating a training and test data set involves selecting 80% of the rows for training data. 
Describes the process of creating training data by selecting 80% of the rows from the dataset.']}, {'end': 10155.051, 'start': 9946.513, 'title': 'Creating linear model and evaluating accuracy', 'summary': "Discusses the process of creating a linear model using 80% training data and 20% test data, making predictions, and evaluating the model's accuracy through model diagnostic measures and correlation between actuals and predicted values.", 'duration': 208.538, 'highlights': ['Creating the linear model using 80% training data and 20% test data The process involves splitting the data into 80% training data and 20% test data to create a linear model for prediction.', "Evaluating the model's accuracy through model diagnostic measures Utilizing model diagnostic measures to assess the accuracy of the linear model and its predictions.", 'Correlation between actuals and predicted values as a form of accuracy measurement Utilizing correlation between actuals and predicted values as a means of measuring the accuracy of the linear model.']}, {'end': 10558.766, 'start': 10155.852, 'title': 'Analyzing data with decision trees', 'summary': 'Covers analyzing data with linear regression and decision trees, including comparing actual and predicted values, calculating correlation accuracy, min-max accuracy, and mean absolute percentage error, and explaining the concepts of decision trees, entropy, and information gain.', 'duration': 402.914, 'highlights': ['The chapter covers analyzing data with linear regression and decision trees, including comparing actual and predicted values, calculating correlation accuracy, min-max accuracy, and mean absolute percentage error.', 'Linear regression analysis involves comparing actual and predicted values, with examples showing discrepancies and correlation accuracy.', 'The concept of decision trees is explained, including basic terminologies like root node, splitting, decision node, terminal node, entropy, and information gain.']}, {'end': 10754.666, 'start': 10559.046, 'title': 'Decision tree for flower classification', 'summary': 'Discusses the implementation of a decision tree to predict the class of flowers based on petal dimensions using r, using the iris dataset and installing necessary packages like rpart and rpart.plot.', 'duration': 195.62, 'highlights': ['The chapter discusses the implementation of a decision tree to predict the class of flowers based on petal dimensions using R. It mentions using a decision tree to predict the class of flowers based on the petal length and width, as well as the installation of necessary packages like rpart and rpart.plot.', "The iris dataset is used for the analysis, containing 150 objects and 5 variables, one of which is the target variable 'species' with three categories: setosa, versicolor, and virginica.", "The use of 'set seed' to ensure reproducibility in generating random numbers is explained. 
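Entropy and information gain, as used for the decision-tree splits above, can be computed directly. A small sketch using the fruit example from the transcript (the exact counts are assumed):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

# Two classes in equal proportion give an entropy of exactly 1, as stated in the text
print(entropy(["apple", "banana", "apple", "banana"]))   # 1.0

# Information gain = entropy before the split minus weighted entropy after it
parent = ["apple", "apple", "orange", "orange", "banana", "banana"]
left, right = ["banana", "banana"], ["apple", "apple", "orange", "orange"]
gain = entropy(parent) - (len(left) / len(parent)) * entropy(left) \
                       - (len(right) / len(parent)) * entropy(right)
print("information gain from splitting the bananas out:", gain)
```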
The chapter explains the use of 'set.seed' to ensure reproducibility in generating random numbers, which is useful for reproducing the same results each time."]}, {'end': 11209.785, 'start': 10755.007, 'title': 'Randomizing data for model building', 'summary': 'Demonstrates the process of randomizing the data using RStudio to eliminate bias and then using the randomized data to build a model to classify different types of flowers, achieving a model accuracy evaluation using a confusion matrix.', 'duration': 454.778, 'highlights': ['The process of randomizing the data using RStudio to eliminate bias The speaker explains the importance of randomizing data to eliminate bias and demonstrates the process of randomizing data using RStudio to ensure that the order of the data does not affect the output.', 'Building a model to classify different types of flowers The chapter illustrates the process of building a model to classify different types of flowers, specifically using the rpart method for classification between the three different types of flowers.', 'Model accuracy evaluation using a confusion matrix The chapter covers the utilization of a confusion matrix to evaluate the accuracy of the model, involving the installation and loading of packages such as caret and e1071 for this purpose.']}], 'duration': 1484.955, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI9724830.jpg', 'highlights': ['The chapter covers analyzing data with linear regression and decision trees, including comparing actual and predicted values, calculating correlation accuracy, min-max accuracy, and mean absolute percentage error.', 'The chapter discusses the implementation of a decision tree to predict the class of flowers based on petal dimensions using R.', 'Creating a training and test data set involves selecting 80% of the rows for training data.', 'The process of randomizing the data using RStudio to eliminate bias.', "Evaluating the model's accuracy through model diagnostic measures."]}, {'end': 12423.535, 'segs': [{'end': 11404.384, 'src': 'embed', 'start': 11374.583, 'weight': 4, 'content': [{'end': 11382.286, 'text': "as I've already mentioned, some regression algorithms are classification algorithms and some are continuous-variable algorithms.", 'start': 11374.583, 'duration': 7.703}, {'end': 11392.577, 'text': "Why use logistic regression? What is logistic regression?
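Before the logistic-regression material picks up, the randomize/train/confusion-matrix loop just summarized can be sketched in a few lines; this is an illustrative outline (rpart, caret, and e1071 are assumed installed), not the video's verbatim script:

```r
library(rpart)
library(caret)   # confusionMatrix(); caret uses e1071 under the hood

set.seed(42)
shuffled <- iris[sample(nrow(iris)), ]   # randomize row order to remove sorting bias
train <- shuffled[1:120, ]
test  <- shuffled[121:150, ]

# Classification tree over the petal/sepal measurements
model <- rpart(Species ~ ., data = train, method = "class")
pred  <- predict(model, test, type = "class")

confusionMatrix(pred, test$Species)      # accuracy plus a per-class breakdown
```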
And then we'll look at a use case, a college admission using logistic regression.", 'start': 11382.679, 'duration': 9.898}, {'end': 11395.483, 'text': 'So why would we use regression??', 'start': 11392.898, 'duration': 2.585}, {'end': 11404.384, 'text': "Well, let's say we had a website and our revenue was based on the traffic that we could drive to that website, whether through R&D or marketing,", 'start': 11395.796, 'duration': 8.588}], 'summary': 'Logistic regression is used for classification, such as in college admission predictions.', 'duration': 29.801, 'max_score': 11374.583, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI11374583.jpg'}, {'end': 11441.772, 'src': 'embed', 'start': 11411.591, 'weight': 3, 'content': [{'end': 11417.797, 'text': "The more traffic driven to our website, then the higher our revenue, or at least that's what we would intuitively assume.", 'start': 11411.591, 'duration': 6.206}, {'end': 11421.3, 'text': 'And so we need to predict the revenue based on our website traffic.', 'start': 11418.137, 'duration': 3.163}, {'end': 11425.368, 'text': 'Here we have the plot of revenue versus website traffic.', 'start': 11421.647, 'duration': 3.721}, {'end': 11430.709, 'text': 'Traffic would be considered the independent variable and revenue would be the dependent variable.', 'start': 11425.608, 'duration': 5.101}, {'end': 11433.89, 'text': 'Often the independent variable or variables,', 'start': 11431.029, 'duration': 2.861}, {'end': 11441.772, 'text': 'if we had more than one could be called the explanatory variables and the dependent variable would be called the response variable.', 'start': 11433.89, 'duration': 7.882}], 'summary': 'Predict revenue based on website traffic to drive higher revenue.', 'duration': 30.181, 'max_score': 11411.591, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI11411591.jpg'}, {'end': 11705.764, 'src': 'embed', 'start': 11673.588, 'weight': 1, 'content': [{'end': 11677.01, 'text': 'Sometimes we say an nth degree polynomial of x.', 'start': 11673.588, 'duration': 3.422}, {'end': 11680.531, 'text': 'In this picture, you can see that the relationship is not linear.', 'start': 11677.01, 'duration': 3.521}, {'end': 11683.632, 'text': "There's a curve to that best fit trend line.", 'start': 11680.751, 'duration': 2.881}, {'end': 11693.076, 'text': 'So why would we use logistic regression? And we need to understand why we would use logistic regression and not linear regression.', 'start': 11683.972, 'duration': 9.104}, {'end': 11699.62, 'text': 'Picking the machine learning algorithm for your problem is no small task.', 'start': 11693.296, 'duration': 6.324}, {'end': 11705.764, 'text': 'And it really behooves us to understand the difference between these machine learning algorithms.', 'start': 11700.02, 'duration': 5.744}], 'summary': 'Logistic regression vs. 
linear regression for non-linear relationships in machine learning.', 'duration': 32.176, 'max_score': 11673.588, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI11673588.jpg'}, {'end': 12384.095, 'src': 'embed', 'start': 12353.988, 'weight': 0, 'content': [{'end': 12359.47, 'text': 'You can see that there is some statistical significance in GPA and in rank.', 'start': 12353.988, 'duration': 5.482}, {'end': 12362.802, 'text': 'by the coefficients and output of the model.', 'start': 12359.915, 'duration': 2.887}, {'end': 12366.089, 'text': "So next, let's run the test data through the model.", 'start': 12363.102, 'duration': 2.987}, {'end': 12378.314, 'text': 'And once we have done all that, we can now set up a confusion matrix and look at our predictions versus the actual values.', 'start': 12369.452, 'duration': 8.862}, {'end': 12380.134, 'text': 'Again, this is important.', 'start': 12378.654, 'duration': 1.48}, {'end': 12384.095, 'text': 'We had the answers, and now we took and we predicted some answers.', 'start': 12380.275, 'duration': 3.82}], 'summary': 'Statistical significance in gpa and rank, test data run through model, confusion matrix for predictions vs. actual values.', 'duration': 30.107, 'max_score': 12353.988, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI12353988.jpg'}], 'start': 11210.225, 'title': 'Logistic regression applications', 'summary': 'Covers various applications of logistic regression, including understanding confusion matrix with an example achieving 92% accuracy, predicting revenue from website traffic, binary predictions for startup profitability and plant infestations, using logistic regression for college admissions prediction, and achieving a 72% accuracy rate in a specific model.', 'chapters': [{'end': 11395.483, 'start': 11210.225, 'title': 'Understanding confusion matrix in logistic regression', 'summary': 'Explains the concept of confusion matrix in logistic regression, analyzing a specific example of the iris dataset, achieving an accuracy of 92% and highlighting the importance of accurately representing model accuracy in data analysis.', 'duration': 185.258, 'highlights': ['The chapter explains the concept of confusion matrix in logistic regression, analyzing a specific example of the iris dataset, achieving an accuracy of 92%.', 'It highlights the importance of accurately representing model accuracy in data analysis, emphasizing the impact of bad data on outcomes.', 'Logistic regression is a classification algorithm, not a continuous variable prediction algorithm, and is also sometimes called logit regression.', 'The video covers the reasons for using regression as a predictive algorithm, the different types of regression, and a use case involving college admission using logistic regression.']}, {'end': 11740.912, 'start': 11395.796, 'title': 'Predicting revenue from website traffic', 'summary': 'Explains how to predict revenue based on website traffic using regression analysis, highlighting the correlation between website traffic and revenue, and the distinction between linear and logistic regression for different types of variables.', 'duration': 345.116, 'highlights': ['The chapter explains how to predict revenue based on website traffic using regression analysis. 
The main focus of the chapter is on using regression analysis to predict revenue based on website traffic, emphasizing the relationship between website traffic and revenue.', 'Highlighting the correlation between website traffic and revenue. It emphasizes the correlation between website traffic and revenue, showing how an increase in website traffic leads to higher revenue, and uses the example of drawing a line to predict revenue based on traffic hits.', 'The distinction between linear and logistic regression for different types of variables. Explains the difference between linear and logistic regression, where linear regression is used for continuous variables like website traffic and revenue, and logistic regression is used for categorical variables with two outcomes, such as yes or no.']}, {'end': 12045.211, 'start': 11741.152, 'title': 'Logistic regression for binary predictions', 'summary': 'Explains the use of logistic regression for binary predictions, illustrating how it can be applied to predict the profitability of a startup based on funding, and also demonstrating its application in predicting the likelihood of plant infestations and college admissions.', 'duration': 304.059, 'highlights': ['Logistic regression is used to predict profitability based on funding for a startup, with the probability of profit being calculated using a sigmoid curve, enabling binary classification. The chapter highlights the application of logistic regression in predicting the profitability of a startup based on funding, utilizing a sigmoid curve to calculate the probability of profit, thereby facilitating binary classification.', 'Explanation of the difference between linear regression and logistic regression, with emphasis on the sigmoid function for binary predictions. The chapter explains the distinction between linear regression and logistic regression, focusing on the utilization of the sigmoid function for binary predictions, highlighting the limitations of linear regression in handling binary outcomes.', "Demonstration of logistic regression's application in predicting the likelihood of plant infestations, showcasing its ability to provide binary classification for healthy versus not healthy plants. The chapter illustrates the use of logistic regression in predicting the likelihood of plant infestations, emphasizing its capability to offer binary classification for healthy and not healthy plants, thereby demonstrating its versatility in solving real-world problems."]}, {'end': 12219.819, 'start': 12045.371, 'title': 'Predicting college admissions with data science', 'summary': 'Discusses the process of predicting college admissions using logistic regression in r, emphasizing the importance of defining the problem, importing necessary libraries, splitting the data set, training the model, and validating its performance.', 'duration': 174.448, 'highlights': ['The chapter emphasizes the importance of defining the problem before proceeding with data science, stating that a clear problem definition is essential for any data science task. Emphasis on problem definition', 'The process involves importing the necessary libraries, acquiring the data, setting the working directory, exploring and preparing the data, scaling the data if necessary, and splitting it into training and test data sets. 
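The difference flagged here, a straight line for continuous outcomes versus an S-shaped curve for yes/no outcomes, comes down to the sigmoid (logistic) function. A quick way to see its shape in R; the intercept and slope below are made up purely for illustration:

```r
# Sigmoid squeezes any real number into the 0..1 range,
# so the output can be read as a probability of the "yes" class
sigmoid <- function(z) 1 / (1 + exp(-z))

b0 <- -4; b1 <- 0.8           # illustrative coefficients, not fitted values
x  <- seq(0, 10, by = 0.1)    # e.g., funding in millions
p  <- sigmoid(b0 + b1 * x)    # predicted probability of being profitable

plot(x, p, type = "l", xlab = "funding", ylab = "P(profitable)")
abline(h = 0.5, lty = 2)      # classify as "yes" above the 0.5 cutoff
```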
Importing libraries, data acquisition, data preparation, splitting data into train and test sets', 'The model is trained using the training data, and then the test data is run through the model for validation, accuracy, and precision assessment. Training model, validating accuracy and precision']}, {'end': 12423.535, 'start': 12220.099, 'title': 'Logistic regression model for admissions', 'summary': 'Introduces the process of splitting data, data munging, and training a logistic regression model to predict admissions, achieving a 72% accuracy rate.', 'duration': 203.436, 'highlights': ['The data is split into a training set and a test set using an 80-20 ratio, which could be adjusted based on the size of the data.', 'The data munging process involves converting the admission column and the rank column to categorical variables and ensuring data cleanliness, with no missing values or outliers.', 'The logistic regression model trained using the GLM function shows statistical significance in GPA and rank, achieving a 72% accuracy rate in predicting admissions.']}], 'duration': 1213.31, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI11210225.jpg', 'highlights': ['Logistic regression achieves 92% accuracy in confusion matrix example', 'Regression predicts revenue from website traffic, emphasizing correlation', 'Logistic regression predicts startup profitability using sigmoid curve', 'Emphasis on problem definition and data preparation in logistic regression', 'Logistic regression achieves 72% accuracy in predicting college admissions']}, {'end': 14693.307, 'segs': [{'end': 12503.598, 'src': 'embed', 'start': 12474.938, 'weight': 4, 'content': [{'end': 12481.126, 'text': "This is kind of a nice example because we can see where some of the colors and shapes overlap where others don't.", 'start': 12474.938, 'duration': 6.188}, {'end': 12487.633, 'text': "And so you can create a tree out of this very easily and say, is it colored orange? If it's not, well, it goes into one stack.", 'start': 12481.666, 'duration': 5.967}, {'end': 12489.075, 'text': 'Well, that happens to be all their broccoli.', 'start': 12487.773, 'duration': 1.302}, {'end': 12491.816, 'text': 'And if it is colored orange, then it goes into another stack.', 'start': 12489.315, 'duration': 2.501}, {'end': 12493.696, 'text': "And you're like, well, that's still kind of chaotic.", 'start': 12491.836, 'duration': 1.86}, {'end': 12498.037, 'text': 'People are looking at carrots and oranges, a very strange combination to put in a box.', 'start': 12494.096, 'duration': 3.941}, {'end': 12503.598, 'text': 'So the next question might be, is it round? 
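Pulling the admissions steps together, a compact sketch of that pipeline might look like the following; the file name binary.csv and the columns admit, gre, gpa, and rank follow the common UCLA admissions example and are assumptions here:

```r
library(caTools)                      # sample.split()

admissions <- read.csv("binary.csv")  # assumed file with admit, gre, gpa, rank
admissions$admit <- as.factor(admissions$admit)  # data munging: outcome and
admissions$rank  <- as.factor(admissions$rank)   # rank become categorical

set.seed(123)
split <- sample.split(admissions$admit, SplitRatio = 0.8)   # 80-20 split
train <- subset(admissions, split == TRUE)
test  <- subset(admissions, split == FALSE)

# Logistic regression: the binomial family gives the logit/sigmoid link
model <- glm(admit ~ gre + gpa + rank, data = train, family = "binomial")
summary(model)                        # look for significant coefficients (GPA, rank)

prob <- predict(model, test, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)
table(predicted = pred, actual = test$admit)   # confusion matrix
mean(pred == test$admit)              # overall accuracy (about 0.72 in the video)
```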
And if it is, then yes, you put the oranges into one box.', 'start': 12498.457, 'duration': 5.141}], 'summary': 'Using colors and shapes to categorize items, such as oranges and broccoli, into different stacks and boxes.', 'duration': 28.66, 'max_score': 12474.938, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI12474938.jpg'}, {'end': 13041.922, 'src': 'embed', 'start': 13010.465, 'weight': 3, 'content': [{'end': 13011.707, 'text': 'There I am behind my desk.', 'start': 13010.465, 'duration': 1.242}, {'end': 13012.949, 'text': "That's definitely not me.", 'start': 13011.887, 'duration': 1.062}, {'end': 13014.19, 'text': "I don't have that big of a chin.", 'start': 13012.989, 'duration': 1.201}, {'end': 13018.416, 'text': 'But I do usually have a cup of coffee, and it looks kind of like that.', 'start': 13014.411, 'duration': 4.005}, {'end': 13019.638, 'text': "It's sitting next to me on my desk.", 'start': 13018.456, 'duration': 1.182}, {'end': 13023.008, 'text': 'So the use case is going to be survival prediction in R.', 'start': 13020.126, 'duration': 2.882}, {'end': 13027.532, 'text': "And let's implement a classification of a data set based on information gain.", 'start': 13023.008, 'duration': 4.524}, {'end': 13031.555, 'text': 'This is going to use the ID3 algorithm.', 'start': 13028.112, 'duration': 3.443}, {'end': 13032.695, 'text': "Don't let that scare you.", 'start': 13031.695, 'duration': 1}, {'end': 13037.099, 'text': "That's the most common algorithm they use to calculate the decision tree.", 'start': 13032.736, 'duration': 4.363}, {'end': 13041.922, 'text': "There's a couple different ways they can calculate the entropy, but the formula is all the same behind it.", 'start': 13037.299, 'duration': 4.623}], 'summary': 'Implement survival prediction in R using classification with the ID3 algorithm.', 'duration': 31.457, 'max_score': 13010.465, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI13010465.jpg'}, {'end': 13118.54, 'src': 'embed', 'start': 13087.972, 'weight': 0, 'content': [{'end': 13092.095, 'text': 'The lifeboats were distributed based on the class, gender, and age of the passengers.', 'start': 13087.972, 'duration': 4.123}, {'end': 13099.32, 'text': 'We will develop a model that recognizes the relationship between these factors and predicts the survival of a passenger accordingly.', 'start': 13092.535, 'duration': 6.785}, {'end': 13104.323, 'text': 'So we want to predict whether the person is going to survive or die in a shipwreck.', 'start': 13099.58, 'duration': 4.743}, {'end': 13109.735, 'text': "And we'll be using a data set which specifies if a passenger on a ship survived its wreck or not.", 'start': 13104.793, 'duration': 4.942}, {'end': 13111.656, 'text': "So we're going to look at this data.", 'start': 13110.636, 'duration': 1.02}, {'end': 13118.54, 'text': "And if you open it up into a spreadsheet, and don't forget you can always put a note on the YouTube video down below.", 'start': 13112.077, 'duration': 6.463}], 'summary': 'Developing a model to predict passenger survival based on class, gender, and age using a data set.', 'duration': 30.568, 'max_score': 13087.972, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI13087972.jpg'}, {'end': 13958.311, 'src': 'embed', 'start': 13925.671, 'weight': 2, 'content': [{'end': 13926.651, 'text': 'So this is now loaded.', 'start': 13925.671,
'duration': 0.98}, {'end': 13930.094, 'text': 'And DF now is a mutated data frame that we started with.', 'start': 13926.772, 'duration': 3.322}, {'end': 13934.378, 'text': "And we've set it up with our specific columns we want to work with.", 'start': 13930.535, 'duration': 3.843}, {'end': 13936.84, 'text': 'In this case, survived, class, sex, and age.', 'start': 13934.678, 'duration': 2.162}, {'end': 13940.763, 'text': "And we've formatted three of those columns to a new setup.", 'start': 13937.42, 'duration': 3.343}, {'end': 13941.924, 'text': 'So one of them is factor.', 'start': 13940.803, 'duration': 1.121}, {'end': 13943.585, 'text': "That's what we use for survived.", 'start': 13942.244, 'duration': 1.341}, {'end': 13945.627, 'text': "So it knows it's a 0 and 1.", 'start': 13943.645, 'duration': 1.982}, {'end': 13949.47, 'text': 'And then we switched class and age to make sure that it knows that those are numeric values.', 'start': 13945.627, 'duration': 3.843}, {'end': 13958.311, 'text': 'Now that we have our data formatted, we need to go ahead and split the data into a training and testing data set.', 'start': 13951.966, 'duration': 6.345}], 'summary': 'Mutated data frame formatted for training and testing data set creation.', 'duration': 32.64, 'max_score': 13925.671, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI13925671.jpg'}, {'end': 14055.596, 'src': 'embed', 'start': 14033.886, 'weight': 1, 'content': [{'end': 14045.714, 'text': "so we're going to take 70% of the data of those that have survived and we're going to just randomly select which one of those we're going to use for training and which one we're going to use for testing it out later,", 'start': 14033.886, 'duration': 11.828}, {'end': 14053.915, 'text': "and we'll hit our enter and load that sample in there. And then we'll go ahead and create our train and our test data sets.", 'start': 14045.714, 'duration': 8.201}, {'end': 14055.596, 'text': "I'm going to set train equal to.", 'start': 14053.935, 'duration': 1.661}], 'summary': '70% of the survived data will be used for training and testing.', 'duration': 21.71, 'max_score': 14033.886, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI14033886.jpg'}], 'start': 12423.775, 'title': 'Decision trees and modeling', 'summary': 'Covers understanding decision trees and their applications, explaining entropy, information gain, and classification, predicting shipwreck survival, setting up the R environment, and evaluating decision tree predictions with a focus on achieving high accuracy above 79%.', 'chapters': [{'end': 12573.17, 'start': 12423.775, 'title': 'Understanding decision trees', 'summary': 'Explains decision trees as a tree-shaped algorithm used to determine a course of action, with examples of application in organizing goods and solving classification and regression problems.', 'duration': 149.395, 'highlights': ['Decision tree is a tree-shaped algorithm used to determine a course of action. Explains the fundamental concept of a decision tree as an algorithm for determining a course of action.', 'Application in organizing goods and solving classification and regression problems. Provides examples of using decision trees for organizing goods and solving classification and regression problems, demonstrating real-world applications.', 'Example of using decision trees for classification (identifying to which set an object belongs).
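Since entropy and information gain carry the whole ID3 story, the arithmetic is worth seeing once. These are hand-rolled helper functions for illustration, not functions from the video's packages:

```r
# Entropy of a label vector: 0 when pure, 1 bit for a 50/50 binary mix
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

# Information gain = parent entropy minus the weighted entropy of the children
info_gain <- function(parent, children) {
  weights <- sapply(children, length) / length(parent)
  entropy(parent) - sum(weights * sapply(children, entropy))
}

# Toy example: splitting ten passengers on sex
survived <- c(1, 1, 1, 0, 0, 0, 0, 1, 1, 0)
sex      <- c("f", "f", "f", "m", "m", "m", "m", "f", "m", "f")
info_gain(survived, split(survived, sex))   # about 0.28 bits gained by this split
```

ID3 simply evaluates this gain for every candidate attribute and splits on the largest value.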
Illustrates the use of decision trees for solving classification problems, such as identifying to which set an object belongs.', 'Explanation of regression problems and the application of decision trees for solving them. Describes regression problems and how decision trees can be applied to solve problems with continuous or numerical valued output variables.']}, {'end': 13072.475, 'start': 12573.371, 'title': 'Decision tree: entropy, information gain, and classification', 'summary': 'Explains the working of a decision tree, including key terms such as nodes, root node, leaf node, and entropy, and demonstrates the calculation of information gain for classification using the ID3 algorithm in RStudio.', 'duration': 499.104, 'highlights': ['The chapter explains the working of a decision tree, including key terms such as nodes, root node, leaf node, and entropy. It describes the structure of a decision tree, defining nodes as test points that split objects into different categories, with the root node at the top and leaf nodes as the output. It also introduces entropy as a measure of data messiness.', 'It demonstrates the calculation of information gain for classification using the ID3 algorithm in RStudio. The transcript provides a detailed example of using the ID3 algorithm in RStudio to classify a dataset based on information gain, involving the calculation of entropy for each attribute and selecting the best split for maximum information gain.']}, {'end': 13379.134, 'start': 13072.495, 'title': 'Predicting shipwreck survival', 'summary': 'Discusses developing a model to predict shipwreck survival based on factors such as class, gender, age, and family size, using a dataset indicating survival outcomes and utilizing RStudio and various packages for decision tree modeling and data manipulation.', 'duration': 306.639, 'highlights': ['Developing a model to predict shipwreck survival based on factors such as class, gender, age, and family size The model aims to recognize the relationship between class, gender, age, and family size and predict the survival of passengers accordingly.', "Using a dataset specifying passenger survival outcomes The data set specifies whether a passenger survived a shipwreck or not, with a '1' indicating survival and '0' indicating non-survival.", 'Utilizing RStudio and various packages for decision tree modeling and data manipulation The chapter involves the use of RStudio and packages such as FSelector, rpart, caret, dplyr, rpart.plot, xlsx, and caTools for decision tree modeling and data manipulation.']}, {'end': 14243.584, 'start': 13379.494, 'title': 'Setting up the R environment and data processing', 'summary': 'Covers setting up the R environment, including resolving Java setup issues, importing data, formatting, and splitting data for training and testing a decision tree classifier.', 'duration': 864.09, 'highlights': ['Resolving Java setup issues and setting the Java home environment Resolving Java setup issues and setting the Java home environment variable to the installed JDK (jdk1.8.0_25 in the demo).', 'Importing and formatting data for training and testing a decision tree classifier Importing data, formatting, and splitting data for training and testing a decision tree classifier.', 'Creating a decision tree classifier in R Creating a decision tree classifier in R by training the decision tree as a classifier.']}, {'end': 14693.307, 'start': 14243.604, 'title': 'Decision tree predictions and model evaluation', 'summary': 'Explains the process of building a decision tree model, making predictions,
and evaluating the model using a confusion matrix to achieve a high accuracy of 79% and above, with implications for life-and-death scenarios and decision-making.', 'duration': 449.703, 'highlights': ['The process of building a decision tree model, making predictions, and evaluating the model using a confusion matrix to achieve a high accuracy of 79% and above The chapter details the steps involved in building a decision tree model, making predictions, and evaluating the model using a confusion matrix, resulting in a high accuracy of 79% and above.', 'Implications for life-and-death scenarios and decision-making The discussion includes implications for life-and-death scenarios, where a high accuracy rate of the model is crucial for decision-making, such as in betting or determining lifeboat placement.', 'Visualization of the decision tree and its insights The chapter provides insights into the decision tree visualization, highlighting the key factors influencing survival probabilities, such as gender, class, and age, with implications for the likelihood of survival.']}], 'duration': 2269.532, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI12423775.jpg', 'highlights': ['The chapter details the steps involved in building a decision tree model, making predictions, and evaluating the model using a confusion matrix, resulting in a high accuracy of 79% and above.', 'The model aims to recognize the relationship between class, gender, age, and family size and predict the survival of passengers accordingly.', 'The chapter provides insights into the decision tree visualization, highlighting the key factors influencing survival probabilities, such as gender, class, and age, with implications for the likelihood of survival.', 'The transcript provides a detailed example of using the ID3 algorithm in RStudio to classify a dataset based on information gain, involving the calculation of entropy for each attribute and selecting the best split for maximum information gain.', 'Illustrates the use of decision trees for solving classification problems, such as identifying to which set an object belongs.']}, {'end': 15564.966, 'segs': [{'end': 14800.827, 'src': 'embed', 'start': 14761.858, 'weight': 3, 'content': [{'end': 14764.56, 'text': 'As you dig deeper into data, that becomes very important.', 'start': 14761.858, 'duration': 2.702}, {'end': 14770.485, 'text': 'It might not be something so simple where you can look at, in this case, one, two, three, four nodes.', 'start': 14764.74, 'duration': 5.745}, {'end': 14773.147, 'text': 'So they have four major nodes that they split the data in.', 'start': 14770.805, 'duration': 2.342}, {'end': 14774.968, 'text': "So we're looking at this.", 'start': 14774.288, 'duration': 0.68}, {'end': 14779.993, 'text': "We're going to go ahead and create a set test equal to test comma C241.", 'start': 14774.988, 'duration': 5.005}, {'end': 14785.857, 'text': "we're just assigning color column to go with the data.", 'start': 14782.355, 'duration': 3.502}, {'end': 14788.879, 'text': "okay, that's what that is.", 'start': 14785.857, 'duration': 3.022}, {'end': 14800.827, 'text': "and then we're going to take set and have it equal to test and we'll do X, 1 and we're going to set that equal to minimum set of 1 minus 1.", 'start': 14788.879, 'duration': 11.948}], 'summary': 'Data analysis involves splitting data into four major nodes and assigning color column to the data.', 'duration': 38.969, 'max_score': 14761.858, 
'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI14761858.jpg'}, {'end': 15400.992, 'src': 'embed', 'start': 15364.743, 'weight': 0, 'content': [{'end': 15371.346, 'text': 'They use five different models to predict the weather and, depending on what the models come up with and what's going on,', 'start': 15364.743, 'duration': 6.603}, {'end': 15372.927, 'text': 'some of the models do better than others.', 'start': 15371.346, 'duration': 1.581}, {'end': 15374.628, 'text': 'I thought that was really interesting.', 'start': 15373.327, 'duration': 1.301}, {'end': 15375.688, 'text': "It's important to note.", 'start': 15374.748, 'duration': 0.94}, {'end': 15381.472, 'text': "so when you see us repeat the same things over and over and you're looking at the different machine learning tools,", 'start': 15375.688, 'duration': 5.784}, {'end': 15389.038, 'text': 'you realize that we use multiple tools to solve the same problem and we can find the best one that fits for that particular situation.', 'start': 15381.472, 'duration': 7.566}, {'end': 15392.921, 'text': "So let's see how that fits back in our wine production.", 'start': 15389.478, 'duration': 3.443}, {'end': 15397.764, 'text': 'To help speed up the process of wine production we will automate the prediction of wine quality.', 'start': 15393.121, 'duration': 4.643}, {'end': 15400.992, 'text': 'Suppose our random forest builds three decision trees.', 'start': 15398.211, 'duration': 2.781}], 'summary': 'Five models used for weather prediction; multiple tools for wine quality prediction, including random forest with three decision trees.', 'duration': 36.249, 'max_score': 15364.743, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI15364743.jpg'}, {'end': 15544.452, 'src': 'embed', 'start': 15519.091, 'weight': 2, 'content': [{'end': 15526.742, 'text': "The first 11 are our features, our fixed acidity, volatile acid, citric acid, things you always want to look at when you're looking at your data.", 'start': 15519.091, 'duration': 7.651}, {'end': 15529.664, 'text': "And we've done this before in the part one.", 'start': 15527.202, 'duration': 2.462}, {'end': 15538.329, 'text': 'but just a reminder we want to note that we have decimal places here, float values in a lot of these things, especially the first number of columns.', 'start': 15529.664, 'duration': 8.665}, {'end': 15544.452, 'text': 'A free sulfur dioxide looks to be an integer, but it varies a lot.', 'start': 15538.969, 'duration': 5.483}], 'summary': 'Features include fixed acidity, volatile acid, citric acid; some values are floats, free sulfur dioxide varies.', 'duration': 25.361, 'max_score': 15519.091, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI15519091.jpg'}], 'start': 14693.967, 'title': 'Data visualization and predictive modeling', 'summary': 'Discusses the importance of displaying data with decision trees, achieving 0.79 accuracy in the confusion matrix and visualization of 2D decision boundaries.
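One way to produce both the tree picture and the 0.79-style confusion matrix is sketched below; the data frame df with survived, pclass, sex, and age columns is an assumed layout matching the earlier formatting step, not code lifted from the video:

```r
library(rpart)
library(rpart.plot)
library(caTools)

set.seed(123)
split <- sample.split(df$survived, SplitRatio = 0.7)  # df: cleaned survival data (assumed)
train <- subset(df, split == TRUE)
test  <- subset(df, split == FALSE)

tree <- rpart(survived ~ pclass + sex + age, data = train, method = "class")
rpart.plot(tree)                       # draw the tree: splits on sex, class, age

pred <- predict(tree, test, type = "class")
cm   <- table(predicted = pred, actual = test$survived)
sum(diag(cm)) / sum(cm)                # overall accuracy, around 0.79 in the video
```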
It also explores the application of random forest in predicting wine quality, highlighting its accuracy through multiple decision trees and relevance in various industries such as banking, healthcare, e-commerce, and weather prediction.', 'chapters': [{'end': 15044.591, 'start': 14693.967, 'title': 'Displaying data with decision tree', 'summary': 'Discusses the importance of displaying data, focusing on the confusion matrix accuracy of 0.79 and the visualization of 2D decision boundaries, emphasizing the need for effective data representation when communicating with shareholders and data scientists.', 'duration': 350.624, 'highlights': ['Data visualization with a focus on confusion matrix accuracy of 0.79 and 2D decision boundaries', 'Importance of displaying data effectively when communicating with shareholders and data scientists', 'Creating a decision tree classification test set for effective data representation', 'Utilizing bucketing for age and class split for data analysis']}, {'end': 15564.966, 'start': 15044.591, 'title': 'Predicting wine quality with random forest', 'summary': 'Explores the concept of random forest and its application in predicting wine quality, with examples and use cases, emphasizing its accuracy through multiple decision trees and its relevance in various industries such as banking, healthcare, e-commerce, and weather prediction.', 'duration': 520.375, 'highlights': ['Random forest operates by building multiple decision trees, which increases the accuracy of predictions as the number of decision trees increases. The more decision trees there are, the more accurate the prediction will be.', 'Random forests have a number of applications in various industries, including banking (fraud detection), healthcare (disease detection), e-commerce (recommendation systems), and weather prediction. Random forests have applications in predicting fraudulent customers in banking, analyzing symptoms of patients in healthcare, predicting recommendations in e-commerce, and weather prediction.', 'The chapter discusses the process of building decision trees to predict wine quality, emphasizing the use of three decision trees to determine the quality of the wine.
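For the wine use case, the random-forest version of the same train/evaluate pattern might be sketched as follows; the file layout and the ntree/mtry values are assumptions for illustration:

```r
library(randomForest)

wine <- read.csv("winequality-red.csv", sep = ";")  # assumed file layout
wine$quality <- as.factor(wine$quality)   # factor target => classification forest

set.seed(123)
train_rows <- sample(1:nrow(wine), floor(0.8 * nrow(wine)))
train <- wine[train_rows, ]
test  <- wine[-train_rows, ]

# ntree = number of trees in the forest, mtry = variables tried at each split
rf <- randomForest(quality ~ ., data = train, ntree = 300, mtry = 4, importance = TRUE)
print(rf)      # includes the out-of-bag confusion matrix
plot(rf)       # error rate versus number of trees

pred <- predict(rf, test)
mean(pred == test$quality)   # hold-out accuracy (about 0.71 in the video)
```

Each tree votes on a quality class and the forest reports the majority, which is why accuracy tends to improve as trees are added.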
The chapter discusses building three decision trees to predict wine quality based on attributes such as chlorides, alcohol, sulfates, pH, and sugar content.']}], 'duration': 870.999, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI14693967.jpg', 'highlights': ['Random forest increases prediction accuracy with more decision trees', 'Random forest has applications in banking, healthcare, e-commerce, and weather prediction', 'Importance of effectively displaying data for communication with stakeholders and data scientists', 'Utilizing bucketing for age and class split for data analysis', 'Building decision trees to predict wine quality based on attributes such as chlorides, alcohol, sulfates, pH, and sugar content', 'Creating a decision tree classification test set for effective data representation', 'Confusion matrix accuracy of 0.79 and visualization of 2D decision boundaries']}, {'end': 16572.059, 'segs': [{'end': 16308.255, 'src': 'embed', 'start': 16279.031, 'weight': 3, 'content': [{'end': 16282.173, 'text': 'And then with all of our data, we can start looking at the answers.', 'start': 16279.031, 'duration': 3.142}, {'end': 16283.394, 'text': "We'll just do RF.", 'start': 16282.293, 'duration': 1.101}, {'end': 16284.795, 'text': "We'll execute that.", 'start': 16283.694, 'duration': 1.101}, {'end': 16289.638, 'text': "And you'll see that the random forest does a nice job.", 'start': 16285.976, 'duration': 3.662}, {'end': 16291.82, 'text': 'You can see right here it has our data equals training.', 'start': 16289.658, 'duration': 2.162}, {'end': 16293.801, 'text': 'It basically goes over everything we talked about.', 'start': 16291.84, 'duration': 1.961}, {'end': 16297.604, 'text': 'And you come down here and it has a confusion matrix, which is nice.', 'start': 16294.822, 'duration': 2.782}, {'end': 16305.189, 'text': "We always like our confusion matrix when we're talking to our shareholders and explaining what the data comes out with and how it trains.", 'start': 16297.824, 'duration': 7.365}, {'end': 16308.255, 'text': 'And we can go ahead and plot our RF.', 'start': 16306.314, 'duration': 1.941}], 'summary': 'Using random forest (rf) analysis, the data shows a strong performance with a clear confusion matrix and accurate training results.', 'duration': 29.224, 'max_score': 16279.031, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI16279031.jpg'}, {'end': 16492.124, 'src': 'embed', 'start': 16448.163, 'weight': 0, 'content': [{'end': 16458.149, 'text': 'and this tells you that a lot of the eight and the eight scale was pretty probably gave us about fifty percent came up there at eight,', 'start': 16448.163, 'duration': 9.986}, {'end': 16459.237, 'text': 'really Maybe 50%.', 'start': 16458.149, 'duration': 1.088}, {'end': 16467.419, 'text': "it didn't guess right, but it did come up there and say that it was at least a 4, 5, or 6, even where it was considered an 8, and so on.", 'start': 16459.237, 'duration': 8.182}, {'end': 16469.48, 'text': 'You can see these different blocks coming in.', 'start': 16467.64, 'duration': 1.84}, {'end': 16477.162, 'text': 'So that if you line up the 3 with the 3, 4 with the 4, and you kind of cross-index them, a heat map would probably do better.', 'start': 16469.619, 'duration': 7.543}, {'end': 16482.464, 'text': "But for this example, we'll just do a quick plot of the data as far as how it works.", 'start': 16477.462, 'duration': 5.002}, 
{'end': 16486.305, 'text': 'And you can, again, see the results that we predicted over here, and you can look those over.', 'start': 16482.504, 'duration': 3.801}, {'end': 16492.124, 'text': "Of course, back in the wine cellar, we're talking to our vineyard owner.", 'start': 16488.055, 'duration': 4.069}], 'summary': 'Around 50% prediction accuracy for 8-scale data, suggesting need for heat map analysis.', 'duration': 43.961, 'max_score': 16448.163, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI16448163.jpg'}, {'end': 16583.504, 'src': 'embed', 'start': 16554.125, 'weight': 1, 'content': [{'end': 16555.205, 'text': "I guess he's going traveling.", 'start': 16554.125, 'duration': 1.08}, {'end': 16555.907, 'text': 'That sounds fun.', 'start': 16555.246, 'duration': 0.661}, {'end': 16562.712, 'text': 'One of those breeze through Costa Rica or visiting the U.S. or going to Europe to go visit all the highlights.', 'start': 16556.247, 'duration': 6.465}, {'end': 16566.175, 'text': "So he's got 20 places he wants to go and he wants to hit them in four days.", 'start': 16563.213, 'duration': 2.962}, {'end': 16567.356, 'text': 'Very ambitious.', 'start': 16566.515, 'duration': 0.841}, {'end': 16572.059, 'text': "And how will I manage to cover all of them? That's the question that he's coming up.", 'start': 16568.076, 'duration': 3.983}, {'end': 16579.044, 'text': "How am I going to get to all these different places in the short time I have? Maybe he's a sales team, so he has 20 places he's got to do demos for.", 'start': 16572.119, 'duration': 6.925}, {'end': 16583.504, 'text': 'You can make use of clustering by grouping the places into four clusters.', 'start': 16579.424, 'duration': 4.08}], 'summary': 'A person plans to visit 20 places in 4 days for work, considering clustering for efficient travel.', 'duration': 29.379, 'max_score': 16554.125, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI16554125.jpg'}], 'start': 15565.307, 'title': 'Data processing, r studio, and wine quality prediction', 'summary': 'Covers converting numerical data, data import and manipulation in r studio, and wine quality prediction achieving 70.95% accuracy, with dataset containing 1,687 rows and includes troubleshooting tips.', 'chapters': [{'end': 15612.871, 'start': 15565.307, 'title': 'Data processing and analysis', 'summary': 'Covers the process of converting numerical data to categorical data and the importance of checking for missing data, with the dataset containing 1,687 rows.', 'duration': 47.564, 'highlights': ['The dataset contains 1,687 rows, considered a small data set, making it feasible to manually check for missing data or anomalies.', 'The process involves converting numerical data to categorical data, with the option of representing the values as integers, float values, or categories 3, 4, 5, 6, and 7.', 'When working with big data spread across numerous computers and millions of rows, manually checking for missing data becomes impractical.']}, {'end': 15892.302, 'start': 15613.531, 'title': 'R studio data import and analysis', 'summary': 'Covers the process of importing data into r studio, using commands to manipulate and explore the data, and setting up the environment for data analysis, including troubleshooting tips.', 'duration': 278.771, 'highlights': ['Importing random forest package and troubleshooting Java environment The process of importing the random forest package is 
demonstrated, and troubleshooting tips for Java environment conflicts are provided.', 'Importing XLSX library and setting up file path for data import Demonstrating the import process for the XLSX library and setting up the file path for data import.', 'Reading data from an Excel spreadsheet into RStudio and specifying the sheet index The process of reading data from an Excel spreadsheet into RStudio and specifying the sheet index is explained.', 'Executing commands to manipulate and explore imported data The procedure for executing commands to manipulate and explore the imported data is detailed.']}, {'end': 16572.059, 'start': 15894.389, 'title': 'Predicting wine quality with random forest', 'summary': 'The chapter discusses the process of predicting wine quality using a random forest model, achieving 70.95% accuracy and automating the prediction process.', 'duration': 677.67, 'highlights': ['Executing a random forest model to predict wine quality resulted in 70.95% accuracy, providing a significant improvement in automating the prediction process. The random forest model achieved 70.95% accuracy in predicting wine quality, showcasing substantial progress in automating the prediction process.', 'The process involved splitting the data into a training set (80%) and a testing set (20%) for training the random forest model and ensuring its accuracy. The data was split into an 80% training set and a 20% testing set to train the random forest model, ensuring its accuracy.', 'Variables such as mtry, ntree, and importance were defined and adjusted to optimize the random forest model for predicting wine quality. Variables like mtry, ntree, and importance were adjusted to optimize the random forest model for accurately predicting wine quality.', "The random forest model's performance was visualized through a confusion matrix and a plot, providing insights into the accuracy and predictions of wine quality. The performance of the random forest model was visualized through a confusion matrix and a plot, offering insights into the accuracy and predictions of wine quality."]}], 'duration': 1006.752, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI15565307.jpg', 'highlights': ['The random forest model achieved 70.95% accuracy in predicting wine quality, showcasing substantial progress in automating the prediction process.', 'The dataset contains 1,687 rows, considered a small data set, making it feasible to manually check for missing data or anomalies.', 'The process involves converting numerical data to categorical data, with the option of representing the values as integers, float values, or categories 3, 4, 5, 6, and 7.', 'The process involved splitting the data into a training set (80%) and a testing set (20%) for training the random forest model and ensuring its accuracy.', 'Importing random forest package and troubleshooting Java environment The process of importing the random forest package is demonstrated, and troubleshooting tips for Java environment conflicts are provided.']}, {'end': 19040.476, 'segs': [{'end': 16596.447, 'src': 'embed', 'start': 16572.119, 'weight': 0, 'content': [{'end': 16579.044, 'text': "How am I going to get to all these different places in the short time I have?
Maybe he's a sales team, so he has 20 places he's got to do demos for.", 'start': 16572.119, 'duration': 6.925}, {'end': 16583.504, 'text': 'You can make use of clustering by grouping the places into four clusters.', 'start': 16579.424, 'duration': 4.08}, {'end': 16587.025, 'text': 'Each of these clusters will have places which are close by.', 'start': 16584.224, 'duration': 2.801}, {'end': 16590.405, 'text': "So we're going to cluster them together by what is closest to the other one.", 'start': 16587.445, 'duration': 2.96}, {'end': 16594.346, 'text': 'Then each day you can visit one cluster and cover all the places in the cluster.', 'start': 16590.866, 'duration': 3.48}, {'end': 16596.447, 'text': "Great! That's a great idea.", 'start': 16594.885, 'duration': 1.562}], 'summary': 'Sales team has 20 demos, clusters places to visit efficiently.', 'duration': 24.328, 'max_score': 16572.119, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI16572118.jpg'}, {'end': 16765.989, 'src': 'embed', 'start': 16735.987, 'weight': 4, 'content': [{'end': 16738.049, 'text': 'City planning is big with clustering.', 'start': 16735.987, 'duration': 2.062}, {'end': 16740.51, 'text': 'We want to cluster things together so that they work.', 'start': 16738.069, 'duration': 2.441}, {'end': 16746.317, 'text': "We don't want to put an industrial zone in the middle of somebody's neighborhood where they're not going to enjoy it,", 'start': 16740.632, 'duration': 5.685}, {'end': 16753.264, 'text': "or have a commercial zone right in the middle of the industrial zone where no one's going to want to go next to a factory to go eat a high-end meal or dinner.", 'start': 16746.317, 'duration': 6.947}, {'end': 16755.105, 'text': "So it's very big in city planning.", 'start': 16753.743, 'duration': 1.362}, {'end': 16759.827, 'text': "It's also very big in just pre-processing data into other models.", 'start': 16755.325, 'duration': 4.502}, {'end': 16765.989, 'text': "So when we're exploring data, being able to cluster things together reveal things in the data we never thought about.", 'start': 16760.287, 'duration': 5.702}], 'summary': 'City planning emphasizes clustering for efficient zoning and data exploration.', 'duration': 30.002, 'max_score': 16735.987, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI16735987.jpg'}, {'end': 17088.838, 'src': 'embed', 'start': 17062.981, 'weight': 1, 'content': [{'end': 17068.907, 'text': "It certainly gives us the exact distance, but as far as doing calculations as to which one's bigger or smaller than the other one,", 'start': 17062.981, 'duration': 5.926}, {'end': 17069.847, 'text': "it won't make a difference.", 'start': 17068.907, 'duration': 0.94}, {'end': 17070.828, 'text': "So we'll just go with it.", 'start': 17070.048, 'duration': 0.78}, {'end': 17073.591, 'text': 'So we just get rid of that final square root.', 'start': 17070.989, 'duration': 2.602}, {'end': 17078.276, 'text': 'It computes faster, and it gives us pretty much the Euclidean squared distance on there.', 'start': 17073.731, 'duration': 4.545}, {'end': 17085.175, 'text': 'Now, the Manhattan distance measurement is a simple sum of horizontal and vertical components,', 'start': 17078.789, 'duration': 6.386}, {'end': 17088.838, 'text': 'or the distance between two points measured along axes at right angles.', 'start': 17085.175, 'duration': 3.663}], 'summary': 'Euclidean squared distance computes faster and 
Manhattan distance is a simple sum of horizontal and vertical components', 'duration': 25.857, 'max_score': 17062.981, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI17062981.jpg'}, {'end': 17537.954, 'src': 'embed', 'start': 17512.292, 'weight': 2, 'content': [{'end': 17517.456, 'text': 'So we have, in this case, we just gave it a letter value, A, B, C, D, E, F.', 'start': 17512.292, 'duration': 5.164}, {'end': 17524.302, 'text': 'Since we follow a top-down approach in divisive clustering, we obtain all possible splits into two clusters.', 'start': 17517.456, 'duration': 6.846}, {'end': 17526.624, 'text': 'So we want to know where you could split it here.', 'start': 17524.682, 'duration': 1.942}, {'end': 17530.908, 'text': 'And we could do like an A, B split and a C, D, E, F split.', 'start': 17526.884, 'duration': 4.024}, {'end': 17533.53, 'text': 'We could do B, C, E, A, D, F.', 'start': 17531.388, 'duration': 2.142}, {'end': 17536.352, 'text': 'And you can see this starts generating a huge amount of data.', 'start': 17533.53, 'duration': 2.822}, {'end': 17537.954, 'text': 'A, B, C, D, E, F.', 'start': 17536.713, 'duration': 1.241}], 'summary': 'Using divisive clustering, explore all possible splits for given values A, B, C, D, E, F.', 'duration': 25.662, 'max_score': 17512.292, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI17512292.jpg'}, {'end': 18413.12, 'src': 'embed', 'start': 18385.133, 'weight': 6, 'content': [{'end': 18387.874, 'text': 'And in R, this is so easy.', 'start': 18385.133, 'duration': 2.741}, {'end': 18391.335, 'text': "Once you've gotten to here, we've done all that pre-data processing.", 'start': 18388.074, 'duration': 3.261}, {'end': 18393.235, 'text': "We'll call it distance.", 'start': 18392.255, 'duration': 0.98}, {'end': 18399.177, 'text': "And we'll assign this to dist.", 'start': 18394.516, 'duration': 4.661}, {'end': 18405.339, 'text': 'So dist is the computation for getting the Euclidean distance.', 'start': 18399.777, 'duration': 5.562}, {'end': 18411.019, 'text': "And we can just put Z in there, because we've already reformatted and scaled Z to fit what we want.", 'start': 18406.138, 'duration': 4.881}, {'end': 18413.12, 'text': 'Let me go ahead and just hit Enter on that.', 'start': 18411.679, 'duration': 1.441}], 'summary': "In R, the computation for Euclidean distance is assigned to 'dist'.", 'duration': 27.987, 'max_score': 18385.133, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI18385133.jpg'}, {'end': 18962.333, 'src': 'embed', 'start': 18937.184, 'weight': 5, 'content': [{'end': 18942.626, 'text': "if you looked up here, it's all between 0 and 1, and when we look down here,", 'start': 18937.184, 'duration': 5.442}, {'end': 18946.647, 'text': 'we now have some actual connections and how far apart this different data is.', 'start': 18942.626, 'duration': 4.021}, {'end': 18953.23, 'text': "Again, it's more of a domain issue, understanding the oil company and what these different values mean,", 'start': 18947.688, 'duration': 5.542}, {'end': 18957.151, 'text': 'and you can look at these as being the distances between different items.', 'start': 18953.23, 'duration': 3.921}, {'end': 18962.333, 'text': 'So a little bit different view and you have to really dig deep into this data.', 'start': 18958.131, 'duration': 4.202}], 'summary': 'Data analysis reveals connections and
distances, with values between 0 and 1, requiring deep domain understanding.', 'duration': 25.149, 'max_score': 18937.184, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI18937184.jpg'}], 'start': 16572.119, 'title': 'Clustering methods and applications', 'summary': "Introduces clustering methods such as hierarchical clustering and partitional clustering, emphasizing the applications of clustering in customer segmentation, social network analysis, sentiment analysis, city planning, and data pre-processing. It also covers the process of agglomerative and divisive clustering in data analysis, including the determination of cluster nearness, termination point, and possible challenges, with a focus on applying these techniques to a U.S. oil organization's sales data.", 'chapters': [{'end': 16603.828, 'start': 16572.119, 'title': 'Optimizing travel with clustering', 'summary': 'Discusses the use of clustering to group locations for efficient travel, with an example of grouping 20 places into four clusters to minimize travel time.', 'duration': 31.709, 'highlights': ['Grouping 20 places into four clusters for efficient travel, reducing time spent on traveling.', 'Using clustering to group locations into clusters based on proximity, enabling efficient visitation of all places in each cluster.']}, {'end': 16933.449, 'start': 16604.229, 'title': 'Clustering methods and applications', 'summary': 'Introduces clustering methods such as hierarchical clustering and partitional clustering, emphasizing the applications of clustering in customer segmentation, social network analysis, sentiment analysis, city planning, and data pre-processing.', 'duration': 329.22, 'highlights': ['Clustering applications Clustering is widely used in customer segmentation, social network analysis, sentiment analysis, city planning, and data pre-processing, offering insights into group preferences and aiding in decision-making.', 'Hierarchical clustering explanation Hierarchical clustering involves separating data into different groups based on similarity, using agglomerative and divisive approaches to form clusters and creating a dendrogram structure to visualize the clustering process.', 'Agglomerative clustering process Agglomerative clustering follows a bottom-up approach, where individual data points form their own clusters, and the algorithm progressively merges the closest clusters until only one cluster remains.']}, {'end': 17140.54, 'start': 16933.849, 'title': 'Distance measures in data analysis', 'summary': 'Covers distance measures in data analysis, including Euclidean, squared Euclidean, Manhattan, and cosine distances, which influence the shape of clusters and determine the similarity between elements.', 'duration': 206.691, 'highlights': ['Euclidean distance measure The Euclidean distance measure is the most common method used to determine the distance between two points in Euclidean space, involving summing all the points and taking the square root, impacting the shape of clusters and influencing similarity between elements.', 'Squared Euclidean distance measure The squared Euclidean distance measure is identical to the Euclidean measurement but excludes the final square root computation, resulting in faster computation while preserving the ordering of distances between points.', 'Manhattan distance measurement The Manhattan distance measurement involves summing the horizontal and vertical components or the distance
between two points measured along axes at right angles, offering a different approach to measuring distances and considering individual distances along axes.', 'Cosine distance measure The cosine distance measure calculates the angle between two vectors, with larger angles representing greater distances, providing another method to measure distances between elements in data analysis.']}, {'end': 17797.427, 'start': 17140.98, 'title': 'Hierarchical clustering in data analysis', 'summary': "Covers the process of agglomerative and divisive clustering, including the determination of cluster nearness, termination point, and possible challenges, with a focus on applying these techniques to a U.S. oil organization's sales data and clustering the states based on sales.", 'duration': 656.447, 'highlights': ['Agglomerative clustering begins with each element as a separate cluster and then merges them into a larger cluster. Agglomerative clustering starts with individual clusters and then merges them into larger clusters, demonstrating the process of grouping data points together.', 'The process involves determining the nearness of clusters and when to stop combining clusters to prevent an endless loop. The method involves determining the proximity of clusters and establishing a termination point to avoid indefinite clustering, ensuring an efficient and practical process.', 'The approach to divisive clustering starts with a whole set and proceeds to divide it into smaller clusters using monothetic divisive methods. Divisive clustering involves starting with a single cluster of all data points and then dividing it into smaller clusters using specific methods, illustrating a top-down approach to clustering.', "The steps involved in setting up the problem for the U.S. oil organization's sales data include importing the dataset, creating a scatter plot, normalizing the data, calculating the Euclidean distance, and creating a dendrogram. The process for the U.S. oil organization's sales data involves steps such as importing the dataset, visualizing it with a scatter plot, normalizing the data, calculating the Euclidean distance, and creating a dendrogram to represent the clusters visually."]}, {'end': 18159.756, 'start': 17797.567, 'title': 'Data analysis with R: visualizing energy data', 'summary': 'Demonstrates importing and visualizing energy data using R, including converting data to string, plotting pairs and scatter plots, and adding labels to the graph.', 'duration': 362.189, 'highlights': ['The chapter demonstrates importing and visualizing energy data using R, including converting data to string, plotting pairs and scatter plots, and adding labels to the graph. The chapter discusses the process of importing and visualizing energy data using R, including converting data to string, plotting pairs and scatter plots, and adding labels to the graph.', 'The author explains the process of converting data to a string and using the head function to display the first few rows of the data. The author explains the process of converting data to a string and using the head function to display the first few rows of the data, which is a common practice in R.', 'The chapter demonstrates creating a pairs graph to visualize the relationships between different data points, indicating a potential use for clustering analysis.
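To make the four measures concrete, here are small hand-written helpers mirroring the formulas described above (illustrative, not taken from the video):

```r
euclidean   <- function(a, b) sqrt(sum((a - b)^2))
sq_euclid   <- function(a, b) sum((a - b)^2)      # same ordering, skips the square root
manhattan   <- function(a, b) sum(abs(a - b))     # horizontal + vertical components
cosine_dist <- function(a, b) 1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

p <- c(1, 2); q <- c(4, 6)
euclidean(p, q)    # 5
sq_euclid(p, q)    # 25
manhattan(p, q)    # 3 + 4 = 7
cosine_dist(p, q)  # small angle between the vectors => value near 0
```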
'The author discusses creating a scatter plot to compare two values, fuel cost and sales, and interprets the clusters observed in the plot, emphasizing the need to analyze all data points for a comprehensive understanding.', 'The chapter explains the process of adding labels to the graph to identify the cities in connection with specific data points, enhancing the visual representation of the data for better interpretation and analysis.']}, {'end': 19040.476, 'start': 18160.056, 'title': 'Hierarchical clustering for data analysis', 'summary': 'discusses the importance of normalization in data preprocessing, demonstrating the process of normalization and hierarchical clustering to identify regional similarities and groupings based on sales data, resulting in three clusters of regions with the highest, average, and lowest sales.', 'duration': 880.42, 'highlights': ['The chapter emphasizes the importance of normalization in data preprocessing, especially for machine learning, to ensure unbiased results and level the playing field.', 'The process of normalization involves reshaping the data based on means and standard deviation, ensuring that common values become the center point and standard deviation is equal among all variables.', 'The chapter demonstrates the use of hierarchical clustering to identify regional similarities and groupings based on sales data, resulting in three clusters of regions with the highest, average, and lowest sales.']},
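For the scatter-plot-with-city-labels step, a matplotlib sketch with invented city names and values (the video builds the equivalent picture in R):

```python
import matplotlib.pyplot as plt

# Invented fuel-cost and sales figures for a few cities, for illustration only.
cities = ['Austin', 'Boston', 'Denver', 'Fresno']
fuel_cost = [1.2, 2.1, 1.7, 1.4]
sales = [9.0, 5.5, 7.2, 8.1]

plt.scatter(fuel_cost, sales)
for name, x, y in zip(cities, fuel_cost, sales):
    plt.annotate(name, (x, y))   # label each point with its city
plt.xlabel('Fuel cost')
plt.ylabel('Sales')
plt.show()
```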
'duration': 2468.357, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI16572118.jpg', 'highlights': ['Clustering applications in customer segmentation, social network analysis, sentiment analysis, city planning, and data pre-processing.', 'Hierarchical clustering involves separating data into different groups based on similarity, using agglomerative and divisive approaches.', 'The Euclidean distance measure is the most common method used to determine the distance between two points in Euclidean space.', 'Agglomerative clustering begins with each element as a separate cluster and then merges them into a larger cluster.', 'The process involves determining the nearness of clusters and when to stop combining clusters to prevent an endless loop.', 'The chapter demonstrates importing and visualizing energy data using R, including converting data to string, plotting pairs and scatter plots, and adding labels to the graph.', 'The chapter emphasizes the importance of normalization in data preprocessing, especially for machine learning, to ensure unbiased results and level the playing field.']}, {'end': 20832.395, 'segs': [{'end': 19409.264, 'src': 'embed', 'start': 19376.48, 'weight': 0, 'content': [{'end': 19380.745, 'text': 'But in this example, you can see if we pick yellow, the new data point would be a bowler.', 'start': 19376.48, 'duration': 4.265}, {'end': 19385.069, 'text': 'But if we picked green or blue, the new data point would be a batsman.', 'start': 19380.925, 'duration': 4.144}, {'end': 19387.471, 'text': 'So we need the one that best separates the data.', 'start': 19385.149, 'duration': 2.322}, {'end': 19390.955, 'text': "What line best separates the data? 
We'll find the best line.", 'start': 19387.652, 'duration': 3.303}, {'end': 19396.097, 'text': 'by computing the maximum margin from equidistant support vectors.', 'start': 19391.335, 'duration': 4.762}, {'end': 19401.78, 'text': 'Now support vectors in this context simply means the two points,', 'start': 19396.498, 'duration': 5.282}, {'end': 19409.264, 'text': 'one from each class that are closest together but that maximize the distance between them or the margin.', 'start': 19401.78, 'duration': 7.484}], 'summary': 'Choosing the best line to separate data by computing maximum margin from equidistant support vectors.', 'duration': 32.784, 'max_score': 19376.48, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI19376480.jpg'}, {'end': 19540.959, 'src': 'embed', 'start': 19509.945, 'weight': 6, 'content': [{'end': 19515.069, 'text': "The margins don't appear to be maximum though, so maybe we can come up with a better line.", 'start': 19509.945, 'duration': 5.124}, {'end': 19521.233, 'text': "So let's take two other support vectors and we'll draw the decision boundary between those,", 'start': 19515.209, 'duration': 6.024}, {'end': 19529.596, 'text': 'and then we will calculate the margin and notice now that the unknown data point, the new value, would be considered a batsman.', 'start': 19521.233, 'duration': 8.363}, {'end': 19531.056, 'text': 'we would continue doing this,', 'start': 19529.596, 'duration': 1.46}, {'end': 19540.959, 'text': 'and obviously a computer does it much quicker than a human being over and over and over again until we found the correct decision boundary with the greatest margin.', 'start': 19531.056, 'duration': 9.903}], 'summary': 'Using support vectors, find decision boundary with maximum margin.', 'duration': 31.014, 'max_score': 19509.945, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI19509945.jpg'}, {'end': 19648.061, 'src': 'embed', 'start': 19618.588, 'weight': 10, 'content': [{'end': 19622.73, 'text': "So if you calculate those two distances and add them up, that's the distance margin.", 'start': 19618.588, 'duration': 4.142}, {'end': 19624.611, 'text': 'And we always want to maximize that.', 'start': 19623.03, 'duration': 1.581}, {'end': 19627.852, 'text': "If we don't maximize it, we can have a misclassification.", 'start': 19624.851, 'duration': 3.001}, {'end': 19631.987, 'text': 'And you can see the yellow margin is much smaller than the green margin.', 'start': 19628.103, 'duration': 3.884}, {'end': 19636.631, 'text': 'So this problem set is two dimensional because the classification is only between two classes.', 'start': 19632.147, 'duration': 4.484}, {'end': 19639.274, 'text': 'And so we would call this a linear SVM.', 'start': 19636.951, 'duration': 2.323}, {'end': 19642.316, 'text': "Now we're going to take a look at kernel SVM.", 'start': 19639.454, 'duration': 2.862}, {'end': 19648.061, 'text': 'And if you notice in this picture, this is a great depiction of a plane, not a line.', 'start': 19642.517, 'duration': 5.544}], 'summary': "The distance margin is crucial in svm, with green margin being larger than yellow. 
it's a 2d linear svm, and we'll explore kernel svm.", 'duration': 29.473, 'max_score': 19618.588, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI19618588.jpg'}, {'end': 19803.649, 'src': 'embed', 'start': 19771.463, 'weight': 4, 'content': [{'end': 19773.825, 'text': "Let's take a look at horses and mules.", 'start': 19771.463, 'duration': 2.362}, {'end': 19777.933, 'text': 'and see if we can use SVM to classify some new data.', 'start': 19774.292, 'duration': 3.641}, {'end': 19783.855, 'text': "So the problem statement is classifying horses and mules, and we're going to use height and weight as the two features.", 'start': 19778.013, 'duration': 5.842}, {'end': 19792.218, 'text': 'And obviously horses and mules typically in general tend to weigh differently and tend to stand taller.', 'start': 19784.295, 'duration': 7.923}, {'end': 19796.8, 'text': "So we'll take a data set, we'll import the data set, we'll make sure we have our libraries.", 'start': 19792.475, 'duration': 4.325}, {'end': 19803.649, 'text': 'The E1071 library has support vector machine algorithms built in.', 'start': 19797.081, 'duration': 6.568}], 'summary': 'Using svm to classify horses and mules based on height and weight.', 'duration': 32.186, 'max_score': 19771.463, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI19771463.jpg'}, {'end': 19846.437, 'src': 'embed', 'start': 19818.459, 'weight': 2, 'content': [{'end': 19820.94, 'text': "Remember, if it's higher dimensional, it's tough to plot those.", 'start': 19818.459, 'duration': 2.481}, {'end': 19826.102, 'text': "And then we'll use the new model, the trained model, to classify new values.", 'start': 19821.3, 'duration': 4.802}, {'end': 19831.523, 'text': 'In general, we would have a training set, a test set, and then ingest the new data.', 'start': 19826.482, 'duration': 5.041}, {'end': 19836.905, 'text': "But for our example, we're just going to use the whole data set to train the algorithm and then see how it performs.", 'start': 19831.563, 'duration': 5.342}, {'end': 19839.246, 'text': "And once we see how it performs, we'll see.", 'start': 19837.242, 'duration': 2.004}, {'end': 19840.007, 'text': 'did we get a horse??', 'start': 19839.246, 'duration': 0.761}, {'end': 19842.051, 'text': 'Did we predict a horse when we had a horse??', 'start': 19840.408, 'duration': 1.643}, {'end': 19844.275, 'text': 'Did we predict a mule when we had a mule?', 'start': 19842.111, 'duration': 2.164}, {'end': 19846.437, 'text': "So here's the R code.", 'start': 19844.736, 'duration': 1.701}], 'summary': 'Using trained model to classify new data, evaluating its performance with r code.', 'duration': 27.978, 'max_score': 19818.459, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI19818459.jpg'}, {'end': 20484.373, 'src': 'embed', 'start': 20454.861, 'weight': 9, 'content': [{'end': 20461.463, 'text': "So usually take the square root of n, and if it's even, you add one to it or subtract one from it, and that's where you get the k value from.", 'start': 20454.861, 'duration': 6.602}, {'end': 20463.643, 'text': "That is the most common use, and it's pretty solid.", 'start': 20461.543, 'duration': 2.1}, {'end': 20464.764, 'text': 'It works very well.', 'start': 20463.884, 'duration': 0.88}, {'end': 20469.611, 'text': 'When do we use KNN? 
We can use KNN when data is labeled.', 'start': 20465.184, 'duration': 4.427}, {'end': 20470.893, 'text': 'So you need a label on it.', 'start': 20469.971, 'duration': 0.922}, {'end': 20474.058, 'text': 'We know we have a group of pictures with dogs, dogs, cats, cats.', 'start': 20470.933, 'duration': 3.125}, {'end': 20476.362, 'text': 'Data is noise-free.', 'start': 20474.319, 'duration': 2.043}, {'end': 20484.373, 'text': "And so you can see here, when we have a class and we have like underweight, 140, 23, Hello Kitty, normal, that's pretty confusing.", 'start': 20476.849, 'duration': 7.524}], 'summary': 'Knn is commonly used for labeled, noise-free data and works well in those cases.', 'duration': 29.512, 'max_score': 20454.861, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI20454861.jpg'}], 'start': 19040.996, 'title': 'Support vector machine and knn algorithm in classification', 'summary': 'Covers the concepts of support vector machine (svm) and knn algorithm in classification, including their roles in supervised learning, application in real-life scenarios, and practical examples such as classifying cricket players, horses, mules, cats, and dogs with quantifiable insights into algorithm selection and dataset sizes.', 'chapters': [{'end': 19195.043, 'start': 19040.996, 'title': 'Support vector machine in classification', 'summary': 'Explains the concept of supervised learning and classification, highlighting the role of support vector machine as a binary classifier for predicting categories, with a focus on the importance of features in the algorithm selection and the application of classification algorithms in real-life scenarios.', 'duration': 154.047, 'highlights': ['SVM is a binary classifier used for true-false, yes-no types of classification problems, making it suitable for scenarios like bug detection, customer churn, and stock price prediction.', 'Supervised learning involves known outcomes in the dataset, and classification algorithms aim to predict categories based on example inputs and their desired outputs.', 'The importance of features in supervised learning is emphasized, with SVM being considered a better choice when dealing with datasets containing a large number of features.']}, {'end': 19771.182, 'start': 19195.263, 'title': 'Understanding support vector machine', 'summary': 'Explains support vector machine (svm), a type of classification algorithm, using the example of cricket players being classified into batsmen or bowlers based on runs to wicket ratio, and discusses linear svm and kernel svm in detail.', 'duration': 575.919, 'highlights': ['Support Vector Machine (SVM) is a type of classification algorithm that classifies data based on its features, and it will classify any new element into one of those two classes. SVM is a classification algorithm that categorizes data based on its features and can classify new data into distinct classes.', 'In the example of cricket players, SVM is used to classify players into batsmen or bowlers based on the runs to wicket ratio, creating a clear separation between the two groups. The example of using SVM to classify cricket players based on runs to wicket ratio demonstrates how SVM can create a clear separation between different classes of data.', 'SVM employs support vectors to draw a decision boundary, selecting the line that best separates the data by computing the maximum margin from equidistant support vectors. 
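To make the maximum-margin idea concrete, here is a minimal scikit-learn sketch on invented two-feature data; the video's own SVM demo is written in R with the e1071 package, so this Python version is purely illustrative:

```python
import numpy as np
from sklearn.svm import SVC

# Invented runs/wickets-style features: class 0 ~ batsman, class 1 ~ bowler.
X = np.array([[1.0, 8.0], [2.0, 9.0], [1.5, 8.5],
              [7.0, 2.0], [8.0, 1.0], [7.5, 2.5]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear SVM picks the separating line with the maximum margin.
clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)       # the closest points that define the margin
print(clf.predict([[2.0, 7.5]]))  # classify a new, unknown data point
```

Swapping kernel='linear' for kernel='rbf' or kernel='poly' gives the kernel-SVM behaviour described in the same chapter, where the data is implicitly mapped into a higher dimension before a separating plane is drawn.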
SVM utilizes support vectors to find the best decision boundary that maximizes the margin between classes, ensuring accurate classification of new data points.', 'Kernel SVM is utilized when the data is not linearly separable, transforming the data into a higher dimension where the classes can be separated using a plane, and different types of kernel functions, such as Gaussian RBF, sigmoid, and polynomial, can be applied for this transformation. Kernel SVM is employed to handle non-linearly separable data by transforming it into a higher dimension where classes can be separated, using various kernel functions like Gaussian RBF and polynomial.']}, {'end': 20196.68, 'start': 19771.463, 'title': 'Svm for horse and mule classification', 'summary': 'Discusses using svm to classify horses and mules based on height and weight, visualizing the support vectors, hyperplane, and validating the predictions, with a practical example and insights into k nearest neighbors.', 'duration': 425.217, 'highlights': ['The chapter discusses using SVM to classify horses and mules based on height and weight. SVM is utilized to classify horses and mules based on their height and weight.', 'Visualizing the support vectors, hyperplane, and validating the predictions, with a practical example. The author visualizes the support vectors, hyperplane, and validates the predictions using a practical example.', 'Insights into K nearest neighbors and its practical use case in predicting diabetes. The chapter provides insights into K nearest neighbors and its practical use case in predicting diabetes.']}, {'end': 20832.395, 'start': 20197.141, 'title': 'Knn algorithm and use cases', 'summary': 'Discusses the knn algorithm, its application in classifying cats and dogs based on characteristics, the process of choosing the right value of k, and how knn works in a diabetes prediction use case with a small dataset of 768 people, emphasizing its suitability for small datasets and its simplicity as a supervised machine learning algorithm.', 'duration': 635.254, 'highlights': ['KNN algorithm is used to classify cats and dogs based on characteristics like sharpness of claws and length of ears, with cats having sharper claws and smaller ears than dogs, enabling feature similarity-based classification. Classification of cats and dogs based on sharpness of claws and length of ears, demonstrating feature similarity-based classification.', 'The process of choosing the right value of K in KNN algorithm involves parameter tuning, and using the square root of N as a common method for determining K, with the need to ensure it is an odd number for better selection. Parameter tuning for KNN involves using the square root of N to determine K, emphasizing the need for it to be an odd number for better selection.', 'The KNN algorithm is applied to a diabetes prediction use case with a small dataset of 768 people, highlighting its suitability for small datasets and its simplicity as a supervised machine learning algorithm. 
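The square-root-of-n rule for picking k described in these highlights is easy to capture as a helper; a small sketch, where 154 is a hypothetical test-set size matching the sqrt result of roughly 12.4 quoted later in the walkthrough:

```python
import math

def choose_k(n_samples):
    # Rule of thumb: k is about sqrt(n), nudged to an odd number
    # so the neighbors' vote can never end in a tie.
    k = int(math.sqrt(n_samples))
    return k if k % 2 == 1 else k - 1

print(choose_k(154))  # sqrt(154) is ~12.4; 12 is even, so use k = 11
```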
Application of KNN in a diabetes prediction use case with a small dataset, highlighting its suitability for small datasets and its simplicity as a supervised machine learning algorithm.'}], 'duration': 1791.399, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI19040996.jpg', 'highlights': ['SVM is a binary classifier suitable for bug detection, customer churn, and stock price prediction.', 'Supervised learning involves known outcomes, and classification algorithms predict categories based on example inputs.', 'SVM is considered a better choice for datasets with a large number of features.', 'SVM classifies data based on features and can classify new data into distinct classes.', 'SVM creates a clear separation between different classes of data using support vectors and a decision boundary.', 'Kernel SVM handles non-linearly separable data by transforming it into a higher dimension using various kernel functions.', 'SVM is utilized to classify horses and mules based on their height and weight.', 'The author visualizes support vectors, hyperplane, and validates predictions using a practical example.', 'Insights into K nearest neighbors and its practical use case in predicting diabetes.', 'KNN algorithm classifies cats and dogs based on characteristics like sharpness of claws and length of ears.', 'Parameter tuning for KNN involves using the square root of N to determine K, emphasizing the need for it to be an odd number.', 'KNN is suitable for small datasets and is simple as a supervised machine learning algorithm.']}, {'end': 22126.722, 'segs': [{'end': 21086.993, 'src': 'embed', 'start': 21057.789, 'weight': 5, 'content': [{'end': 21058.67, 'text': "Any of those, you'd be dead.", 'start': 21057.789, 'duration': 0.881}, {'end': 21062.814, 'text': "So not a really good factor if they have a zero in there because they didn't have the data.", 'start': 21058.81, 'duration': 4.004}, {'end': 21068.258, 'text': "And we'll take a look at that because we're going to start replacing that information with a couple of different things.", 'start': 21063.074, 'duration': 5.184}, {'end': 21069.64, 'text': "And let's see what that looks like.", 'start': 21068.399, 'duration': 1.241}, {'end': 21071.922, 'text': 'So first we create a nice list.', 'start': 21070.08, 'duration': 1.842}, {'end': 21073.903, 'text': 'As you can see, we have the values.', 'start': 21072.242, 'duration': 1.661}, {'end': 21076.205, 'text': 'We talked about glucose, blood pressure, skin thickness.', 'start': 21073.923, 'duration': 2.282}, {'end': 21082.071, 'text': "And this is a nice way when you're working with columns is to list the columns you need to do some kind of transformation on.", 'start': 21076.826, 'duration': 5.245}, {'end': 21083.472, 'text': 'A very common thing to do.', 'start': 21082.291, 'duration': 1.181}, {'end': 21086.993, 'text': 'And then for this particular setup we certainly could use the.', 'start': 21083.792, 'duration': 3.201}], 'summary': 'Replacing zero values in data with alternative information for glucose, blood pressure, and skin thickness.', 'duration': 29.204, 'max_score': 21057.789, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI21057789.jpg'},
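The zero-replacement step narrated here looks roughly like the following in pandas. The column names and the diabetes.csv filename are assumptions based on the commonly used Pima diabetes dataset, not taken from the video:

```python
import numpy as np
import pandas as pd

df = pd.read_csv('diabetes.csv')  # assumed local copy of the dataset

# Columns where a zero can only mean "no measurement was recorded".
zero_not_allowed = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Replace the impossible zeros with NaN, then fill each NaN with the column mean.
df[zero_not_allowed] = df[zero_not_allowed].replace(0, np.nan)
df[zero_not_allowed] = df[zero_not_allowed].fillna(df[zero_not_allowed].mean())
```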
{'end': 21503.153, 'src': 'embed', 'start': 21478.195, 'weight': 2, 'content': [{'end': 21484.517, 'text': "Before we do this, let's go ahead and do import math and do math square root length of y test.", 'start': 21478.195, 'duration': 6.322}, {'end': 21488.083, 'text': 'And when I run that, we get 12.409.', 'start': 21484.877, 'duration': 3.206}, {'end': 21490.665, 'text': "I want to show you where this number comes from we're about to use.", 'start': 21488.083, 'duration': 2.582}, {'end': 21492.346, 'text': '12 is an even number.', 'start': 21491.325, 'duration': 1.021}, {'end': 21498.21, 'text': "So if you know, if you're ever voting on things, remember the neighbors all vote, don't want to have an even number of neighbors voting.", 'start': 21492.486, 'duration': 5.724}, {'end': 21499.651, 'text': 'So we want to do something odd.', 'start': 21498.25, 'duration': 1.401}, {'end': 21501.472, 'text': "And let's just take one away and we'll make it 11.", 'start': 21499.891, 'duration': 1.581}, {'end': 21503.153, 'text': 'Let me delete this out of here.', 'start': 21501.472, 'duration': 1.681}], 'summary': 'Using math, we find the square root of the y test length is 12.409, an even number, so we adjust it to 11.', 'duration': 24.958, 'max_score': 21478.195, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI21478195.jpg'}, {'end': 21631.104, 'src': 'embed', 'start': 21603.516, 'weight': 4, 'content': [{'end': 21606.857, 'text': "if you're looking at this across three different variables instead of just two,", 'start': 21603.516, 'duration': 3.341}, {'end': 21610.197, 'text': "you'd end up with the third row down here and the column going down the middle.", 'start': 21606.857, 'duration': 3.34}, {'end': 21616.659, 'text': "So in the first case, we have the zero class, the 94 people who don't have diabetes.", 'start': 21610.437, 'duration': 6.222}, {'end': 21620.96, 'text': 'The prediction said that 13 of those people did have diabetes and were at high risk.', 'start': 21617.099, 'duration': 3.861}, {'end': 21624.021, 'text': 'And the 32 that had diabetes, it got correct.', 'start': 21621.38, 'duration': 2.641}, {'end': 21631.104, 'text': 'But our prediction classified another 15, who did have diabetes, incorrectly.', 'start': 21624.659, 'duration': 6.445}], 'summary': 'The model correctly identified 94 non-diabetic and 32 diabetic individuals, while 13 non-diabetics were incorrectly flagged and 15 diabetics were missed.', 'duration': 27.588, 'max_score': 21603.516, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI21603516.jpg'}, {'end': 21705.494, 'src': 'embed', 'start': 21683.873, 'weight': 0, 'content': [{'end': 21693.956, 'text': "But 82%, not too bad for a quick flash look at people's different statistics and running an SKLearn and running the KNN, the K nearest neighbor on it.", 'start': 21683.873, 'duration': 10.083}, {'end': 21696.517, 'text': 'So we have created a model using KNN.', 'start': 21694.416, 'duration': 2.101}, {'end': 21701.552, 'text': 'which can predict whether a person will have diabetes or not, or, at the very least,', 'start': 21696.991, 'duration': 4.561}, {'end': 21705.494, 'text': 'whether they should go get a checkup and have their glucose checked regularly or not.', 'start': 21701.552, 'duration': 3.942}], 'summary': 'A knn model achieved 82% accuracy in predicting diabetes, aiding in timely checkups.', 'duration': 21.621, 'max_score': 21683.873, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI21683873.jpg'},
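Put together, the training and evaluation steps narrated here come out to something like the sketch below, continuing from the imputation sketch above. The 'Outcome' label column and the 0.2 test split are assumptions (154 test rows is about 20% of 768), and the matrix in the comment just echoes the numbers quoted in the walkthrough:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score

X = df.drop('Outcome', axis=1)   # features; 'Outcome' assumed as the label column
y = df['Outcome']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Scale the features so large-magnitude columns such as insulin don't dominate.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# k = 11: sqrt of the test length is ~12.4, stepped down to an odd number.
knn = KNeighborsClassifier(n_neighbors=11, metric='euclidean')
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print(confusion_matrix(y_test, y_pred))  # e.g. [[94 13] [15 32]] in the video
print(f1_score(y_test, y_pred))
print(accuracy_score(y_test, y_pred))    # roughly 0.82 in the walkthrough
```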
{'end': 21776.896, 'src': 'embed', 'start': 21746.976, 'weight': 1, 'content': [{'end': 21753.922, 'text': 'why do we need Naive Bayes and understanding the Naive Bayes Classifier much more in depth of how the math works in the background?', 'start': 21746.976, 'duration': 6.946}, {'end': 21761.987, 'text': "Finally we'll get into the advantages of the naive Bayes classifier in the machine learning setup and then we'll roll up our sleeves and do my favorite part.", 'start': 21754.222, 'duration': 7.765}, {'end': 21767.03, 'text': "We'll actually do some Python coding and do some text classification using the naive Bayes.", 'start': 21762.267, 'duration': 4.763}, {'end': 21768.911, 'text': 'What is naive Bayes?', 'start': 21767.51, 'duration': 1.401}, {'end': 21776.896, 'text': "Let's start with a basic introduction to the Bayes theorem, named after Thomas Bayes from the 1700s, who first coined this in the Western literature.", 'start': 21769.251, 'duration': 7.645}], 'summary': 'Exploring the need for naive bayes, its advantages, and coding text classification using python.', 'duration': 29.92, 'max_score': 21746.976, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI21746976.jpg'}, {'end': 21851.607, 'src': 'embed', 'start': 21823.339, 'weight': 9, 'content': [{'end': 21831.401, 'text': 'And then the probability of the second coin being a head, given the first coin is tails, is one-half, and the probability of getting two heads,', 'start': 21823.339, 'duration': 8.062}, {'end': 21834.002, 'text': 'given the first coin is a head, is one-half.', 'start': 21831.401, 'duration': 2.601}, {'end': 21836.983, 'text': "We'll demonstrate that in just a minute and show you how that math works.", 'start': 21834.062, 'duration': 2.921}, {'end': 21841.664, 'text': "Now, when we're doing it with two coins, it's easy to see, but when you have something more complex,", 'start': 21837.343, 'duration': 4.321}, {'end': 21844.725, 'text': 'you can see where these formulas really come in and work.', 'start': 21841.664, 'duration': 3.061}, {'end': 21851.607, 'text': 'So the Bayes Theorem gives us the conditional probability of an event A given another event B has occurred.', 'start': 21845.045, 'duration': 6.562}], 'summary': 'Bayes theorem calculates conditional probabilities, demonstrated with two coins.', 'duration': 28.268, 'max_score': 21823.339, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI21823339.jpg'}, {'end': 22073.956, 'src': 'embed', 'start': 22046.727, 'weight': 10, 'content': [{'end': 22050.768, 'text': "Let's go ahead and glance into where Naive Bayes is used.", 'start': 22046.727, 'duration': 4.041}, {'end': 22052.689, 'text': "Let's look at some of the use scenarios for it.", 'start': 22050.848, 'duration': 1.841}, {'end': 22055.61, 'text': 'As a classifier, we use it in face recognition.', 'start': 22052.989, 'duration': 2.621}, {'end': 22058.171, 'text': 'Is this Cindy, or is it not Cindy or whoever?', 'start': 22055.83, 'duration': 2.341}, {'end': 22064.613, 'text': 'Or it might be used to identify parts of the face that they then feed into another part of the face recognition program.', 'start': 22058.711, 'duration': 5.902}, {'end': 22067.154, 'text': 'This is the eye, this is the nose, this is the mouth.', 'start': 22064.713, 'duration': 2.441}, {'end': 22068.195, 'text': 'Weather prediction.', 'start': 22067.494, 'duration': 0.701}, {'end': 22071.016, 'text': 'Is it going to be rainy or sunny? 
Medical recognition.', 'start': 22068.435, 'duration': 2.581}, {'end': 22071.916, 'text': 'News prediction.', 'start': 22071.196, 'duration': 0.72}, {'end': 22073.956, 'text': "It's also used in medical diagnosis.", 'start': 22072.076, 'duration': 1.88}], 'summary': 'Naive bayes used in face recognition, weather prediction, medical diagnosis, and news prediction.', 'duration': 27.229, 'max_score': 22046.727, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI22046727.jpg'}], 'start': 20832.715, 'title': 'Diabetes prediction and naive bayes', 'summary': 'Covers using knn for diabetes prediction with data preprocessing, model testing, achieving 80% accuracy. it also discusses bayes theorem, naive bayes, and their applications in machine learning.', 'chapters': [{'end': 20915.59, 'start': 20832.715, 'title': 'Knn diabetes prediction', 'summary': 'Focuses on using knn to predict diabetes, including details on data preprocessing, imports, and train-test split.', 'duration': 82.875, 'highlights': ['The chapter covers the process of importing pandas and numpy for data manipulation and analysis, which are essential tools in this context.', "It emphasizes the importance of using train-test split to evaluate the model's performance, a crucial step in machine learning.", 'The need for pre-processing using a standard scalar pre-processor to scale the data and avoid bias due to varying magnitude, with the specific example of normalizing pregnancy and insulin data.']}, {'end': 21188.891, 'start': 20915.59, 'title': 'Data preprocessing and model testing', 'summary': 'Covers importing python modules, loading dataset using pandas, examining dataset characteristics, and performing data preprocessing by replacing zero values and calculating mean for missing data, to prepare for model testing using k-neighbors classifier and evaluating with confusion matrix, f1 score, and accuracy.', 'duration': 273.301, 'highlights': ['Importing Python modules and specific modules from sklearn setup The transcript discusses importing general Python modules and six specific modules from the sklearn setup, to prepare for data preprocessing and model testing.', 'Replacing zero values with numpy none and calculating mean for missing data The process involves replacing zero values with numpy none to signify no data, and calculating the mean for missing data in the dataset columns, to handle missing or incomplete information.', "Preparing for model testing using k-neighbors classifier and evaluating with confusion matrix, F1 score, and accuracy The chapter focuses on preparing for model testing using the k-neighbors classifier and evaluating the model's performance with metrics such as confusion matrix, F1 score, and accuracy.", 'Loading dataset using Pandas and examining dataset characteristics The transcript demonstrates loading the dataset using Pandas and examining the dataset characteristics such as length and data content, to gain insights into the dataset.']}, {'end': 21787.403, 'start': 21189.275, 'title': 'Preparing and training data for machine learning', 'summary': 'Covers the process of preparing and training data for machine learning, including steps such as data exploration, data splitting, data scaling, and model evaluation, resulting in an 80% accuracy score for predicting diabetes using the k nearest neighbor (knn) algorithm.', 'duration': 598.128, 'highlights': ['The chapter covers the process of preparing and training data for machine learning, including 
steps such as data exploration, data splitting, data scaling, and model evaluation. The transcript details the various steps involved in preparing and training data for machine learning, including data exploration, data splitting, data scaling, and model evaluation.', 'The process resulted in an 80% accuracy score for predicting diabetes using the K nearest neighbor (KNN) algorithm. The KNN algorithm achieved an 80% accuracy score in predicting diabetes, indicating the effectiveness of the model in identifying potential diabetes cases.', 'The chapter introduces the concept of Naive Bayes Classifier and its application in text classification using Python coding. The chapter delves into the concept of the Naive Bayes Classifier and its application in text classification through Python coding, providing a practical example of its usage.']}, {'end': 22126.722, 'start': 21787.624, 'title': 'Bayes theorem and naive bayes', 'summary': 'Discusses the application of bayes theorem in probability calculation, demonstrating the conditional probability of events, with examples of coin tossing and its relevance in machine learning, specifically naive bayes classifier, highlighting its use cases in face recognition, weather prediction, medical diagnosis, and news classification.', 'duration': 339.098, 'highlights': ['The probability of getting two heads equals one-fourth, and the probability of at least one tail occurs three-quarters of the time. The probability of getting two heads and at least one tail is calculated from the data set, with two heads occurring once out of four possibilities, and at least one tail occurring in three out of four possibilities.', 'The probability of the second coin being a head, given the first coin is tail, is one-half, and the probability of getting two heads, given the first coin is a head, is one-half. The conditional probabilities of the second coin being a head given the first coin is a tail, and getting two heads given the first coin is a head are both calculated as one-half.', 'Naive Bayes is used in face recognition, weather prediction, medical diagnosis, and news classification. 
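The two-coin probabilities quoted above are easy to verify by enumerating the sample space; a quick, purely illustrative Python check:

```python
from itertools import product

# All four equally likely outcomes of tossing two fair coins.
space = list(product('HT', repeat=2))   # HH, HT, TH, TT

p_two_heads = sum(o == ('H', 'H') for o in space) / len(space)   # 1/4
p_one_tail = sum('T' in o for o in space) / len(space)           # 3/4
# Conditional: P(second is heads | first is tails) = P(TH) / P(first is tails)
p_cond = sum(o == ('T', 'H') for o in space) / sum(o[0] == 'T' for o in space)
print(p_two_heads, p_one_tail, p_cond)  # 0.25 0.75 0.5
```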
Naive Bayes classifier is applied in various use cases such as face recognition, weather prediction, medical diagnosis, and news classification.']}], 'duration': 1294.007, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI20832715.jpg', 'highlights': ['The KNN algorithm achieved an 80% accuracy score in predicting diabetes, indicating the effectiveness of the model in identifying potential diabetes cases.', 'The chapter delves into the concept of the Naive Bayes Classifier and its application in text classification through Python coding, providing a practical example of its usage.', 'The chapter covers the process of preparing and training data for machine learning, including steps such as data exploration, data splitting, data scaling, and model evaluation.', "The chapter focuses on preparing for model testing using the k-neighbors classifier and evaluating the model's performance with metrics such as confusion matrix, F1 score, and accuracy.", 'The need for pre-processing using a standard scalar pre-processor to scale the data and avoid bias due to varying magnitude, with the specific example of normalizing pregnancy and insulin data.', 'The chapter covers the process of importing pandas and numpy for data manipulation and analysis, which are essential tools in this context.', 'The transcript discusses importing general Python modules and six specific modules from the sklearn setup, to prepare for data preprocessing and model testing.', 'The process involves replacing zero values with numpy none to signify no data, and calculating the mean for missing data in the dataset columns, to handle missing or incomplete information.', 'The transcript demonstrates loading the dataset using Pandas and examining the dataset characteristics such as length and data content, to gain insights into the dataset.', 'The conditional probabilities of the second coin being a head given the first coin is a tail, and getting two heads given the first coin is a head are both calculated as one-half.', 'Naive Bayes classifier is applied in various use cases such as face recognition, weather prediction, medical diagnosis, and news classification.']}, {'end': 23043.981, 'segs': [{'end': 22436.911, 'src': 'embed', 'start': 22406.281, 'weight': 3, 'content': [{'end': 22412.243, 'text': 'So when we look at that, probability of the weekday without a purchase is going to be .33 or 33%.', 'start': 22406.281, 'duration': 5.962}, {'end': 22417.886, 'text': "Let's take a look at this, at different probabilities, and, based on this likelihood table,", 'start': 22412.243, 'duration': 5.643}, {'end': 22421.01, 'text': "let's go ahead and calculate conditional probabilities as below.", 'start': 22417.886, 'duration': 3.124}, {'end': 22426.538, 'text': 'The first three we just did, the probability of making a purchase on the weekday is 11 out of 30, or roughly 36 or 37%, .367.', 'start': 22421.29, 'duration': 5.248}, {'end': 22436.911, 'text': "The probability of not making a purchase at all, doesn't matter what day of the week, is roughly 0.2 or 20%.", 'start': 22426.538, 'duration': 10.373}], 'summary': 'Probability of weekday purchase is 37%, while no purchase probability is 20%.', 'duration': 30.63, 'max_score': 22406.281, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI22406281.jpg'}, {'end': 22556.861, 'src': 'embed', 'start': 22533.43, 'weight': 2, 'content': [{'end': 22542.256, 'text': 'So now we have our 
probabilities for a discount and whether the discount leads to a purchase or not, and the probability for free delivery.', 'start': 22533.43, 'duration': 8.826}, {'end': 22546.517, 'text': 'Does that lead to a purchase or not? And this is where it starts getting really exciting.', 'start': 22542.416, 'duration': 4.101}, {'end': 22554, 'text': 'Let us use these three likelihood tables to calculate whether a customer will purchase a product on a specific combination of day,', 'start': 22546.678, 'duration': 7.322}, {'end': 22556.861, 'text': 'discount and free delivery or not purchase.', 'start': 22554, 'duration': 2.861}], 'summary': 'Analyzing probabilities for discount, free delivery, and purchase likelihood.', 'duration': 23.431, 'max_score': 22533.43, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI22533430.jpg'}, {'end': 22892.173, 'src': 'embed', 'start': 22870.055, 'weight': 0, 'content': [{'end': 22880.664, 'text': 'And so the likelihood of a purchase is 84.71%, And the likelihood of no purchase is 15.29% given these three different variables.', 'start': 22870.055, 'duration': 10.609}, {'end': 22884.967, 'text': "So if it's on a holiday, if it's with a discount and has free delivery,", 'start': 22880.904, 'duration': 4.063}, {'end': 22889.391, 'text': "then there's an 84.71% chance that the customer is going to come in and make a purchase.", 'start': 22884.967, 'duration': 4.424}, {'end': 22891.392, 'text': 'Hooray! They purchased our stuff.', 'start': 22889.631, 'duration': 1.761}, {'end': 22892.173, 'text': "We're making money.", 'start': 22891.492, 'duration': 0.681}], 'summary': '84.71% likelihood of purchase with holiday, discount, and free delivery.', 'duration': 22.118, 'max_score': 22870.055, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI22870055.jpg'}, {'end': 23023.994, 'src': 'embed', 'start': 22947.69, 'weight': 5, 'content': [{'end': 22951.251, 'text': "When you put it into Python, it's really nice because you don't have to worry about any of that.", 'start': 22947.69, 'duration': 3.561}, {'end': 22953.632, 'text': 'You let the Python handle that, the Python module.', 'start': 22951.291, 'duration': 2.341}, {'end': 22958.855, 'text': "But understanding it, you can put it on a table and you can easily see how it works and it's a simple algebraic function.", 'start': 22953.752, 'duration': 5.103}, {'end': 22964.299, 'text': 'It needs less training data, so if you have smaller amounts of data, this is a great powerful tool for that.', 'start': 22959.055, 'duration': 5.244}, {'end': 22967.141, 'text': 'Handles both continuous and discrete data.', 'start': 22964.619, 'duration': 2.522}, {'end': 22970.843, 'text': "It's highly scalable with number of predictors and data points.", 'start': 22967.321, 'duration': 3.522}, {'end': 22977.607, 'text': 'So, as you can see, you just keep multiplying different probabilities in there, and you can cover not just three different variables or sets.', 'start': 22971.163, 'duration': 6.444}, {'end': 22979.729, 'text': 'you can now expand this to even more categories.', 'start': 22977.607, 'duration': 2.122}, {'end': 22981.51, 'text': "Number five, it's fast.", 'start': 22980.149, 'duration': 1.361}, {'end': 22984.051, 'text': 'It can be used in real-time predictions.', 'start': 22981.77, 'duration': 2.281}, {'end': 22985.392, 'text': 'This is so important.', 'start': 22984.291, 'duration': 1.101}, {'end': 22994.016, 'text': 
"This is why it's used in a lot of our predictions on online shopping carts, referrals, spam filters is because there's no time delay,", 'start': 22985.572, 'duration': 8.444}, {'end': 23000.44, 'text': "as it has to go through and figure out a neural network or one of the other mini setups where you're doing classification.", 'start': 22994.016, 'duration': 6.424}, {'end': 23004.402, 'text': "And certainly there's a lot of other tools out there in the machine learning that can handle these,", 'start': 23000.56, 'duration': 3.842}, {'end': 23007.904, 'text': 'but most of them are not as fast as the Naive Bayes.', 'start': 23005.042, 'duration': 2.862}, {'end': 23011.366, 'text': "And then finally, it's not sensitive to irrelevant features.", 'start': 23008.224, 'duration': 3.142}, {'end': 23018.791, 'text': "So it picks up on your different probabilities, and if you're short on data on one probability, it automatically adjusts for that.", 'start': 23011.847, 'duration': 6.944}, {'end': 23023.994, 'text': 'Those formulas are very automatic, and so you can still get a very solid predictability,', 'start': 23018.911, 'duration': 5.083}], 'summary': 'Naive bayes in python is fast, scalable, and handles both continuous and discrete data, making it a powerful tool, especially with smaller amounts of data.', 'duration': 76.304, 'max_score': 22947.69, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI22947690.jpg'}], 'start': 22127.402, 'title': 'Bayes theorem and data analysis', 'summary': 'Explores the application of bayes theorem and data analysis to predict customer purchase behavior based on variables like day, discount, and free delivery, using a small data set of 30 rows and calculating conditional probabilities for different scenarios, with key probabilities being 37% to purchase on a weekday, 20% not to purchase at all, and 18% not to purchase on a weekday. it also discusses the construction of likelihood tables for three independent variables and the use of these tables to calculate the probability of a customer making a purchase based on specific combinations, yielding a probability of .178 for a no purchase on a holiday, a discount, and a free delivery. additionally, it covers the calculation of the probability of purchase based on the conditions of holiday, discount, and free delivery, resulting in a 84.71% likelihood of purchase, along with the advantages of the naive bayes classifier.', 'chapters': [{'end': 22500.144, 'start': 22127.402, 'title': 'Bayes theorem and data analysis', 'summary': 'Explores the application of bayes theorem and data analysis to predict customer purchase behavior based on variables like day, discount, and free delivery, using a small data set of 30 rows and calculating conditional probabilities for different scenarios, with key probabilities being 37% to purchase on a weekday, 20% not to purchase at all, and 18% not to purchase on a weekday.', 'duration': 372.742, 'highlights': ['The chapter explores the application of Bayes Theorem and data analysis to predict customer purchase behavior based on variables like day, discount, and free delivery. 
It discusses the practical application of Bayes Theorem and data analysis in predicting customer purchase behavior based on variables like day, discount, and free delivery.', 'Using a small data set of 30 rows, the chapter calculates conditional probabilities for different scenarios, with key probabilities being 37% to purchase on a weekday, 20% not to purchase at all, and 18% not to purchase on a weekday. It illustrates the calculation of conditional probabilities for different scenarios using a small data set of 30 rows, with key probabilities including 37% to purchase on a weekday, 20% not to purchase at all, and 18% not to purchase on a weekday.']}, {'end': 22744.114, 'start': 22500.549, 'title': 'Probability calculation for purchase decision', 'summary': 'Discusses the construction of likelihood tables for three independent variables and the use of these tables to calculate the probability of a customer making a purchase based on specific combinations of day, discount, and free delivery, yielding a probability of .178 for a no purchase on a holiday, a discount, and a free delivery.', 'duration': 243.565, 'highlights': ['The construction of likelihood tables for three independent variables and their use in calculating the probability of a customer making a purchase based on specific combinations of day, discount, and free delivery. The chapter discusses constructing likelihood tables for all three variables and using these tables to calculate the probability of a customer making a purchase based on specific combinations of day, discount, and free delivery.', 'Yielding a probability of .178 for a no purchase on a holiday, a discount, and a free delivery. The probability calculation results in a probability of .178 for a no purchase on a holiday, a discount, and a free delivery.']}, {'end': 22892.173, 'start': 22744.314, 'title': 'Probability of purchase calculation', 'summary': 'Discusses the calculation of the probability of purchase based on the conditions of holiday, discount, and free delivery, resulting in a 84.71% likelihood of purchase, along with the normalization process.', 'duration': 147.859, 'highlights': ['The probability of purchase equals 0.986. The calculated probability of purchase based on the given conditions, indicating a high likelihood of a customer making a purchase.', 'The likelihood of a purchase is 84.71%. After normalization, the percentage representing the probability of a customer making a purchase given the specified conditions.', 'The likelihood of no purchase is 15.29% given these three different variables. After normalization, the percentage representing the probability of a customer not making a purchase given the specified conditions.']}, {'end': 23043.981, 'start': 22892.273, 'title': 'Advantages of naive bayes classifier', 'summary': 'Introduces the six advantages of the naive bayes classifier, including its simplicity, need for less training data, ability to handle both continuous and discrete data, scalability, speed for real-time predictions, and insensitivity to irrelevant features.', 'duration': 151.708, 'highlights': ['It needs less training data, making it a powerful tool for smaller amounts of data. Naive Bayes classifier is suitable for smaller datasets, reducing the requirement for extensive training data.', "It can be used in real-time predictions, making it suitable for online shopping carts, referrals, and spam filters. 
The classifier's speed and real-time prediction capability make it ideal for applications such as online shopping carts, referrals, and spam filters.", "It's fast and not sensitive to irrelevant features, allowing for solid predictability even with missing or overlapping data. Naive Bayes classifier's ability to handle irrelevant features and adjust for missing or overlapping data contributes to its solid predictability.", "It handles both continuous and discrete data, making it highly scalable with a large number of predictors and data points. The classifier's capability to handle both continuous and discrete data contributes to its scalability with a large number of predictors and data points.", "It is simple and easy to implement, either manually or using Python, making it a powerful tool for classification. Naive Bayes classifier's simplicity and ease of implementation, either manually or using Python, make it a powerful tool for classification tasks."]}], 'duration': 916.579, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI22127402.jpg', 'highlights': ['The likelihood of a purchase is 84.71% after normalization, given the specified conditions.', 'The probability of purchase equals 0.986, indicating a high likelihood of a customer making a purchase.', 'The chapter explores the application of Bayes Theorem and data analysis to predict customer purchase behavior based on variables like day, discount, and free delivery.', 'Using a small data set of 30 rows, the chapter calculates conditional probabilities for different scenarios, with key probabilities being 37% to purchase on a weekday, 20% not to purchase at all, and 18% not to purchase on a weekday.', 'The construction of likelihood tables for three independent variables and their use in calculating the probability of a customer making a purchase based on specific combinations of day, discount, and free delivery.', 'Naive Bayes classifier is suitable for smaller datasets, reducing the requirement for extensive training data.', 'It can be used in real-time predictions, making it suitable for online shopping carts, referrals, and spam filters.', 'It handles both continuous and discrete data, making it highly scalable with a large number of predictors and data points.', 'It is simple and easy to implement, either manually or using Python, making it a powerful tool for classification.']}, {'end': 24184.084, 'segs': [{'end': 23076.08, 'src': 'embed', 'start': 23044.201, 'weight': 5, 'content': [{'end': 23048.044, 'text': "So it's very powerful in that it is not sensitive to the irrelevant features.", 'start': 23044.201, 'duration': 3.843}, {'end': 23052.167, 'text': "And in fact, you can use it to help predict features that aren't even in there.", 'start': 23048.404, 'duration': 3.763}, {'end': 23054.468, 'text': "So now we're down to my favorite part.", 'start': 23052.427, 'duration': 2.041}, {'end': 23057.691, 'text': "We're going to roll up our sleeves and do some actual programming.", 'start': 23054.488, 'duration': 3.203}, {'end': 23060.773, 'text': "We're going to do the use case text classification.", 'start': 23057.751, 'duration': 3.022}, {'end': 23069.317, 'text': 'Now, I would challenge you to go back and send us a note on the notes below underneath the video and request the data for the shopping cart.', 'start': 23060.973, 'duration': 8.344}, {'end': 23076.08, 'text': 'So you can plug that into Python code and do that on your own time so you can walk through it since we 
walked through all the information on it.', 'start': 23069.477, 'duration': 6.603}], 'summary': 'Powerful feature prediction, programming for text classification, request shopping cart data for python code.', 'duration': 31.879, 'max_score': 23044.201, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI23044201.jpg'}, {'end': 23130.306, 'src': 'embed', 'start': 23101.659, 'weight': 3, 'content': [{'end': 23105.443, 'text': 'But certainly this can be used on any of our news headlines and classification.', 'start': 23101.659, 'duration': 3.784}, {'end': 23109.308, 'text': "So let's see how it can be done using the Naive Bayes classifier.", 'start': 23105.543, 'duration': 3.765}, {'end': 23111.109, 'text': "Now we're at my favorite part.", 'start': 23109.668, 'duration': 1.441}, {'end': 23113.291, 'text': "We're actually going to write some Python script.", 'start': 23111.229, 'duration': 2.062}, {'end': 23114.673, 'text': 'Roll up our sleeves.', 'start': 23113.612, 'duration': 1.061}, {'end': 23116.995, 'text': "And we're going to start by doing our imports.", 'start': 23114.933, 'duration': 2.062}, {'end': 23119.937, 'text': 'These are very basic imports including our news group.', 'start': 23117.255, 'duration': 2.682}, {'end': 23122.179, 'text': "And we'll take a quick glance at the target names.", 'start': 23120.057, 'duration': 2.122}, {'end': 23126.103, 'text': "Then we're going to go ahead and start training our data set and putting it together.", 'start': 23122.58, 'duration': 3.523}, {'end': 23130.306, 'text': "We'll put together a nice graph because it's always good to have a graph to show what's going on.", 'start': 23126.323, 'duration': 3.983}], 'summary': 'Using naive bayes classifier to train news headlines data and create a graph.', 'duration': 28.647, 'max_score': 23101.659, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI23101659.jpg'}, {'end': 23560.986, 'src': 'embed', 'start': 23530.805, 'weight': 4, 'content': [{'end': 23532.627, 'text': "I told you we're going to throw a module at you.", 'start': 23530.805, 'duration': 1.822}, {'end': 23536.391, 'text': "We can't go too much into the math behind this or how it works.", 'start': 23532.927, 'duration': 3.464}, {'end': 23537.112, 'text': 'You can look it up.', 'start': 23536.471, 'duration': 0.641}, {'end': 23540.015, 'text': 'The notation for the math is usually tf.idf.', 'start': 23537.272, 'duration': 2.743}, {'end': 23543.478, 'text': "And that's just a way of weighing the words.", 'start': 23541.437, 'duration': 2.041}, {'end': 23551.041, 'text': "And it weighs the words based on how many times they're used in a document, how many times or how many documents they're used in.", 'start': 23543.738, 'duration': 7.303}, {'end': 23552.382, 'text': "And it's a well-used formula.", 'start': 23551.181, 'duration': 1.201}, {'end': 23553.423, 'text': "It's been around for a while.", 'start': 23552.402, 'duration': 1.021}, {'end': 23555.604, 'text': "It's a little confusing to put this in here.", 'start': 23553.723, 'duration': 1.881}, {'end': 23560.986, 'text': "But let's let her know that it just goes in there and weights the different words in the document for us.", 'start': 23556.344, 'duration': 4.642}], 'summary': 'Tf.idf notation weighs words based on their usage, a well-used formula.', 'duration': 30.181, 'max_score': 23530.805, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI23530805.jpg'}, {'end': 23625.435, 'src': 'embed', 'start': 23589.45, 'weight': 2, 'content': [{'end': 23594.372, 'text': "Well, how do you figure out all those weights in the different articles? That's what this module does.", 'start': 23589.45, 'duration': 4.922}, {'end': 23597.573, 'text': "That's what the TF-IDF vectorizer is going to do for us.", 'start': 23594.512, 'duration': 3.061}, {'end': 23603.436, 'text': "And then we're going to import our sklearn.naive_bayes, and that's our multinomial NB.", 'start': 23597.953, 'duration': 5.483}, {'end': 23608.378, 'text': "multinomial naive Bayes, pretty easy to understand where that comes from.", 'start': 23604.156, 'duration': 4.222}, {'end': 23613.54, 'text': 'and then finally we have the sklearn pipeline, import make_pipeline.', 'start': 23608.378, 'duration': 5.162}, {'end': 23625.435, 'text': "now the make_pipeline is just a cool piece of code because we're gonna take the information we get from the TF-IDF vectorizer and we're gonna pump that into the multinomial NB.", 'start': 23613.54, 'duration': 11.895}], 'summary': 'Module calculates weights using tf-idf vectorizer and multinomial nb in an sklearn pipeline', 'duration': 35.985, 'max_score': 23589.45, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI23589450.jpg'}, {'end': 23719.795, 'src': 'embed', 'start': 23695.025, 'weight': 0, 'content': [{'end': 23700.61, 'text': 'I remember once running a model on this and I literally had 2.4 million tokens go into this.', 'start': 23695.025, 'duration': 5.585}, {'end': 23705.892, 'text': "So when you're dealing with large document bases, you can have a huge number of different words.", 'start': 23701.391, 'duration': 4.501}, {'end': 23712.473, 'text': 'It then takes those words, gives them a weight and then, based on that weight, based on the words and the weights,', 'start': 23706.352, 'duration': 6.121}, {'end': 23715.234, 'text': 'it then puts that into the multinomial NB.', 'start': 23712.473, 'duration': 2.761}, {'end': 23719.795, 'text': 'And once we go into our naive Bayes, we want to put the train target in there.', 'start': 23715.514, 'duration': 4.281}], 'summary': 'A model processed 2.4 million tokens for large document bases, assigning weights and using them in a multinomial nb.', 'duration': 24.77, 'max_score': 23695.025, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI23695025.jpg'}, {'end': 24041.737, 'src': 'embed', 'start': 24012.432, 'weight': 1, 'content': [{'end': 24016.656, 'text': "To do this, let's go ahead and create a definition, a function to run.", 'start': 24012.432, 'duration': 4.224}, {'end': 24020.06, 'text': "And we're going to call this function, let me just expand that just a notch here.", 'start': 24016.676, 'duration': 3.384}, {'end': 24020.62, 'text': 'There we go.', 'start': 24020.12, 'duration': 0.5}, {'end': 24022.362, 'text': 'I like mine in big letters.', 'start': 24020.64, 'duration': 1.722}, {'end': 24023.643, 'text': 'Predict categories.', 'start': 24022.602, 'duration': 1.041}, {'end': 24025.005, 'text': 'We want to predict the category.', 'start': 24023.683, 'duration': 1.322}, {'end': 24030.649, 'text': "We're going to send it s, a string, and then we're sending it train equals train.", 'start': 24025.365, 'duration': 5.284}, {'end': 24034.452, 'text': 'We have our training model, and then 
{'end': 23719.795, 'src': 'embed', 'start': 23695.025, 'weight': 0, 'content': [{'end': 23700.61, 'text': 'I remember once running a model on this and I literally had 2.4 million tokens go into this.', 'start': 23695.025, 'duration': 5.585}, {'end': 23705.892, 'text': "So when you're dealing with large document bases, you can have a huge number of different words.", 'start': 23701.391, 'duration': 4.501}, {'end': 23712.473, 'text': 'It then takes those words and gives them a weight, and then, based on the words and the weights,', 'start': 23706.352, 'duration': 6.121}, {'end': 23715.234, 'text': 'it puts that into the multinomial NB.', 'start': 23712.473, 'duration': 2.761}, {'end': 23719.795, 'text': 'And once we go into our naive Bayes, we want to put the train target in there.', 'start': 23715.514, 'duration': 4.281}], 'summary': 'A model processed 2.4 million tokens for large document bases, assigning weights and using them in a multinomial NB.', 'duration': 24.77, 'max_score': 23695.025, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI23695025.jpg'}, {'end': 24041.737, 'src': 'embed', 'start': 24012.432, 'weight': 1, 'content': [{'end': 24016.656, 'text': "To do this, let's go ahead and create a definition, a function to run.", 'start': 24012.432, 'duration': 4.224}, {'end': 24020.06, 'text': "And we're going to call this function, let me just expand that a notch here.", 'start': 24016.676, 'duration': 3.384}, {'end': 24020.62, 'text': 'There we go.', 'start': 24020.12, 'duration': 0.5}, {'end': 24022.362, 'text': 'I like mine in big letters.', 'start': 24020.64, 'duration': 1.722}, {'end': 24023.643, 'text': 'Predict categories.', 'start': 24022.602, 'duration': 1.041}, {'end': 24025.005, 'text': 'We want to predict the category.', 'start': 24023.683, 'duration': 1.322}, {'end': 24030.649, 'text': "We're going to send it s, a string, and then we're sending it train equals train.", 'start': 24025.365, 'duration': 5.284}, {'end': 24034.452, 'text': 'We have our training model, and then we have our pipeline, model equals model.', 'start': 24030.729, 'duration': 3.723}, {'end': 24037.173, 'text': "This way we don't have to resend these variables each time.", 'start': 24034.832, 'duration': 2.341}, {'end': 24041.737, 'text': 'The definition knows that because I said train equals train, and I put the equals for model.', 'start': 24037.814, 'duration': 3.923}], 'summary': 'Creating a function to predict categories using training data and a pipeline model.', 'duration': 29.305, 'max_score': 24012.432, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI24012432.jpg'}],
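That helper, as described, is only a few lines. A sketch under the same assumptions as the earlier block; the default arguments bind the training data and the fitted pipeline once, so they don't have to be resent on every call.

# Sketch of the predict-categories helper described above; the names train
# and model come from the earlier sketch and are illustrative.
def predict_category(s, train=train, model=model):
    pred = model.predict([s])            # pipeline vectorizes, then classifies
    return train.target_names[pred[0]]   # map the numeric label to its name

# For example, predict_category('sending a payload to the International
# Space Station') would be expected to come back as the sci.space category.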
'start': 23044.201, 'title': 'text classification and model evaluation', 'summary': 'Covers text classification using the TF-IDF vectorizer and multinomial NB in Python, achieving good accuracy with model evaluation through a confusion matrix and heat map.', 'chapters': [{'end': 23136.632, 'start': 23044.201, 'title': 'text classification with naive bayes', 'summary': 'Discusses the use of Naive Bayes for text classification in Python, demonstrating the process of training the model and using it to classify news headlines into different topics, along with a suggestion to request the shopping cart data for independent practice.', 'duration': 92.431, 'highlights': ['The chapter demonstrates the process of training the model and using it to classify news headlines into different topics, showing how to use the Naive Bayes classifier for text classification in Python.', 'A suggestion is made to request the shopping cart data to plug into the Python code and practice text classification independently.', 'The chapter emphasizes the power of Naive Bayes for text classification: it can still make predictions when some features are absent from the data, and it is insensitive to irrelevant features.']}, {'end': 23588.969, 'start': 23136.792, 'title': 'python data analysis and visualization', 'summary': 'Covers setting up a Python environment for data analysis, importing necessary modules like numpy and sklearn, and analyzing a dataset using fetch_20newsgroups, with a focus on tokenizing words and categorizing documents, followed by the implementation of TfidfVectorizer for weighing words in documents.', 'duration': 452.177, 'highlights': ['The chapter explains setting up the Python environment for data analysis using an Anaconda Jupyter notebook and importing necessary modules like numpy and seaborn for graphing, with a focus on visualizing data for analysis.', 'The chapter demonstrates importing necessary modules like numpy and sklearn, and using fetch_20newsgroups to analyze and categorize documents.', 'The chapter discusses the implementation of TfidfVectorizer for weighing words in documents, explaining how it assigns weights to words based on their usage in documents and why that matters for document analysis.']}, {'end': 24184.084, 'start': 23589.45, 'title': 'text classification and model evaluation', 'summary': 'Discusses the process of text classification using the TF-IDF vectorizer, multinomial NB, and the sklearn pipeline, along with model evaluation using a confusion matrix and heat map, achieving overall good accuracy in categorizing diverse topics.', 'duration': 594.634, 'highlights': ['The chapter explains text classification using the TF-IDF vectorizer, multinomial NB, and the sklearn pipeline, covering how these tools weigh words in different articles and organize the flow of information through the pipeline.', 'The model is trained on roughly 2.4 million tokens and evaluated with a confusion matrix and heat map, showing overall good accuracy in categorizing diverse topics, with some mislabeling between similar categories.', "A function 'predict categories' is created to predict the category of input strings, such as 'International Space Station' being categorized as science/space and 'BMW is better than an Audi' as recreational autos, showcasing the model's ability to categorize diverse topics accurately."]}], 'duration': 1139.883, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI23044201.jpg', 'highlights': ['The model processes a large number of words with different weights, and evaluation through a confusion matrix and heat map reveals overall good accuracy in categorizing diverse topics, with some mislabeling between similar categories.', "The function 'predict categories' uses the trained model to predict the category of input strings, such as 'International Space Station' (science/space) and 'BMW is better than an Audi' (recreational autos).", 'The chapter explains text classification using the TF-IDF vectorizer, multinomial NB, and the sklearn pipeline, and how information flows through the pipeline.', 'The chapter demonstrates training the model and using it to classify news headlines into different topics with the Naive Bayes classifier in Python.', 'The chapter discusses TfidfVectorizer for weighing words in documents based on their usage.', 'The chapter emphasizes that Naive Bayes can make predictions even when some features are absent from the data and is insensitive to irrelevant features.']}
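The confusion-matrix heat map used for the evaluation above can be sketched as follows, assuming the fitted pipeline, train, test, and labels from the earlier sketches.

# Sketch of the confusion-matrix heat-map evaluation described above.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

mat = confusion_matrix(test.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=train.target_names, yticklabels=train.target_names)
plt.xlabel('true label')       # a strong diagonal means good accuracy;
plt.ylabel('predicted label')  # off-diagonal cells are the mislabeled,
plt.show()                     # usually similar, categories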
, {'end': 27143.355, 'segs': [{'end': 24279.565, 'src': 'embed', 'start': 24232.148, 'weight': 0, 'content': [{'end': 24237.83, 'text': "You don't have to know those to understand naive Bayes, but they certainly help for understanding the industry and data science.", 'start': 24232.148, 'duration': 5.682}, {'end': 24242.311, 'text': 'And we can see our categorizer, our naive Bayes classifier.', 'start': 24238.11, 'duration': 4.201}, {'end': 24250.114, 'text': 'We were able to predict the categories religion, space, motorcycles, autos, politics, and properly classify all these different things', 'start': 24242.531, 'duration': 7.583}, {'end': 24252.795, 'text': 'we pushed into our prediction and our trained model.', 'start': 24250.114, 'duration': 2.681}, {'end': 24256.708, 'text': 'Hello and welcome to the session on K-means clustering.', 'start': 24253.203, 'duration': 3.505}, {'end': 24259.332, 'text': "I'm Mohan Kumar from Simplilearn.", 'start': 24257.089, 'duration': 2.243}, {'end': 24261.956, 'text': 'So what is K-means clustering?', 'start': 24259.813, 'duration': 2.143}, {'end': 24267.339, 'text': 'K-means clustering is an unsupervised learning algorithm.', 'start': 24262.597, 'duration': 4.742}, {'end': 24272.282, 'text': "In this case, you don't have labeled data, unlike in supervised learning.", 'start': 24267.339, 'duration': 4.943}, {'end': 24279.565, 'text': 'So you have a set of data and you want to group them and, as the name suggests, you want to put them into clusters,', 'start': 24272.282, 'duration': 7.283}], 'summary': 'Naive Bayes accurately classified religion, space, autos, politics. K-means clustering is unsupervised learning.', 'duration': 47.417, 'max_score': 24232.148, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI24232148.jpg'},
{'end': 24699.384, 'src': 'embed', 'start': 24680.088, 'weight': 3, 'content': [{'end': 24695.28, 'text': 'There are primarily two categories of clustering: hierarchical clustering and partitional clustering. Each category is further subdivided, hierarchical into agglomerative and divisive clustering, and partitional into k-means and fuzzy c-means clustering.', 'start': 24680.088, 'duration': 15.192}, {'end': 24699.384, 'text': "Let's take a quick look at what each of these types of clustering is.", 'start': 24695.821, 'duration': 3.563}], 'summary': 'Two main categories of clustering: hierarchical and partitional, subdivided into agglomerative, divisive, k-means, and fuzzy c-means.', 'duration': 19.296, 'max_score': 24680.088, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI24680088.jpg'}, {'end': 25088.435, 'src': 'embed', 'start': 25064.836, 'weight': 2, 'content': [{'end': 25075.262, 'text': 'So each data point will be assigned to the centroid which is closest to it, and thereby we have k initial clusters.', 'start': 25064.836, 'duration': 10.426}, {'end': 25078.124, 'text': 'However, these are not the final clusters.', 'start': 25075.822, 'duration': 2.302}, {'end': 25088.435, 'text': 'The next step is that for the new groups, for the clusters that have been formed, it calculates the mean position.', 'start': 25078.505, 'duration': 9.93}], 'summary': 'Data points are assigned to the closest centroid, forming k initial clusters; the mean position is then calculated for each new group.', 'duration': 23.599, 'max_score': 25064.836, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI25064836.jpg'}, {'end': 26022.315, 'src': 'embed', 'start': 25998.03, 'weight': 5, 'content': [{'end': 26008.761, 'text': "So that's the way we determine the best locations for the stores, and that's how we can help Walmart find the best locations for their stores in Florida.", 'start': 25998.03, 'duration': 10.731}, {'end': 26012.705, 'text': "So now let's take this into a Python notebook.", 'start': 26008.981, 'duration': 3.724}, {'end': 26016.549, 'text': "Let's see how this looks when we are running the code live.", 'start': 26012.745, 'duration': 3.804}, {'end': 26017.37, 'text': 'All right.', 'start': 26016.929, 'duration': 0.441}, {'end': 26022.315, 'text': 'So this is the code for k-means clustering in a Jupyter notebook.', 'start': 26017.37, 'duration': 4.945}], 'summary': 'Using k-means clustering to determine the best store locations for Walmart in Florida.', 'duration': 24.285, 'max_score': 25998.03, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI25998030.jpg'}, {'end': 26378.24, 'src': 'embed', 'start': 26350.387, 'weight': 4, 'content': [{'end': 26361.196, 'text': 'So, as explained in the slides, the first step in k-means clustering is to randomly assign some centroids.', 'start': 26350.387, 'duration': 10.809}, {'end': 26370.385, 'text': 'As a first step, we randomly allocate a couple of centroids, which here we are calling centers,', 'start': 26361.776, 'duration': 8.609}, {'end': 26378.24, 'text': 'and then we put this in a loop and take it through an iterative process for each of the data points.', 'start': 26370.385, 'duration': 7.855}], 'summary': 'In k-means clustering, centroids are randomly assigned and iteratively updated for the data points.', 'duration': 27.853, 'max_score': 26350.387, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI26350387.jpg'},
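The loop being described can be written down compactly. A sketch, assuming X is an (n, 2) NumPy array of points; the function name and random seed are illustrative.

# Sketch of the iterative k-means loop described above: random centers,
# assign each point to its closest center, recompute the means, and
# repeat until the centers stop moving.
import numpy as np
from sklearn.metrics import pairwise_distances_argmin

def find_clusters(X, n_clusters, rseed=2):
    rng = np.random.RandomState(rseed)
    i = rng.permutation(X.shape[0])[:n_clusters]
    centers = X[i]                                      # 1. random initial centroids
    while True:
        labels = pairwise_distances_argmin(X, centers)  # 2. closest centroid wins
        new_centers = np.array([X[labels == k].mean(0)  # 3. mean position per group
                                for k in range(n_clusters)])
        if np.all(centers == new_centers):              # 4. converged: centers fixed
            break
        centers = new_centers
    return centers, labels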
{'end': 26586.441, 'src': 'embed', 'start': 26557.058, 'weight': 6, 'content': [{'end': 26568.086, 'text': 'And next we will move on to see a couple of examples of how k-means clustering is used in some real-life scenarios or use cases.', 'start': 26557.058, 'duration': 11.028}, {'end': 26576.693, 'text': 'In the next example or demo, we are going to see how we can use k-means clustering to perform color compression.', 'start': 26568.406, 'duration': 8.287}, {'end': 26578.774, 'text': 'We will take a couple of images.', 'start': 26577.273, 'duration': 1.501}, {'end': 26586.441, 'text': 'So there will be two examples, and we will try to use k-means clustering to compress the colors.', 'start': 26579.155, 'duration': 7.286}], 'summary': 'Demonstrates using k-means clustering for color compression in real-life scenarios.', 'duration': 29.383, 'max_score': 26557.058, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI26557058.jpg'}],
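The color-compression demo reduces each image's palette to 16 representative colors. A minimal sketch, assuming an RGB image with values scaled to [0, 1]; MiniBatchKMeans is a speed-oriented stand-in for plain KMeans when clustering millions of pixels, and the sample image is just one that ships with scikit-learn.

# Sketch of 16-color compression with k-means, as described above.
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import load_sample_image

image = load_sample_image('china.jpg') / 255.0  # sample image shipped with sklearn
pixels = image.reshape(-1, 3)                   # one row per pixel: (R, G, B)

kmeans = MiniBatchKMeans(n_clusters=16).fit(pixels)
new_colors = kmeans.cluster_centers_[kmeans.predict(pixels)]
compressed = new_colors.reshape(image.shape)    # millions of colors down to 16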
'start': 24184.624, 'title': 'text classification and k-means clustering', 'summary': 'Demonstrates the accurate classification of texts into various groups using Naive Bayes, the TF-IDF vectorizer, and a pipeline. It also explains k-means clustering, its application, types, and iterative process, including the use of distance measures and its application in real-world scenarios like store location optimization and image processing.', 'chapters': [{'end': 24252.795, 'start': 24184.624, 'title': 'text classification with naive bayes', 'summary': 'Demonstrated how the Naive Bayes classifier, along with the TF-IDF vectorizer and pipeline, accurately classified texts into different groups such as religion, space, motorcycles, autos, and politics.', 'duration': 68.171, 'highlights': ['The Naive Bayes classifier accurately classified texts into different categories such as religion, space, motorcycles, autos, and politics.', 'The use of the TF-IDF vectorizer and pipeline facilitated the classification process, making it easy and fast.', 'The chapter emphasized the importance of understanding the TF-IDF vectorizer and pipeline for industry and data science applications.']}, {'end': 25284.089, 'start': 24253.203, 'title': 'introduction to k-means clustering', 'summary': 'Explains the concept of k-means clustering, an unsupervised learning algorithm used for grouping data into clusters based on similarities, with a key focus on the process, application, and types of clustering (hierarchical, partitional, k-means, and fuzzy c-means), as well as the use of distance measures such as Euclidean, squared Euclidean, Manhattan, and cosine distances. It also delves into determining the optimal number of clusters through the elbow method and the convergence of the clustering process.', 'duration': 1030.886, 'highlights': ['K-means clustering is an unsupervised learning algorithm used for grouping data into clusters based on similarities, with the number of clusters, denoted by K, being a critical factor in the process.', 'The chapter provides an example of using K-means clustering in the context of cricket, where players are categorized into batsmen and bowlers based on runs scored and wickets taken, illustrating a real-world application of the algorithm.', 'The iterative process of K-means clustering involves the allocation of centroids, calculation of distances of data points from centroids, repositioning of centroids based on mean position, and reallocation of data points until convergence is achieved.', 'The chapter covers the types of clustering, including hierarchical, partitional, k-means, and fuzzy c-means clustering, giving an overview of the different clustering methods.', 'Distance measures such as Euclidean, squared Euclidean, Manhattan, and cosine distances are explained, emphasizing their significance in evaluating similarities between data points, and the elbow method is used to determine the optimal number of clusters.']},
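The elbow method mentioned above can be sketched in a few lines: fit k-means for a range of k values and plot the within-cluster sum of squares (inertia), looking for the bend in the curve. Assumes X is the data array from the earlier sketches; the range of k is illustrative.

# Sketch of the elbow method for choosing k, as described above.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 10)
inertias = [KMeans(n_clusters=k).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker='o')  # the 'elbow' in this curve is a good k
plt.xlabel('number of clusters k')
plt.ylabel('within-cluster sum of squares')
plt.show()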
{'end': 25869.683, 'start': 25284.089, 'title': 'understanding k-means clustering', 'summary': 'Explains the iterative process of k-means clustering: randomly assigning centroids, calculating the distance of each point from the centroids, reassigning points to the closest centroid, recalculating centroids, and repeating the process until convergence to form the final clusters.', 'duration': 585.594, 'highlights': ['In the k-means clustering algorithm, the process includes randomly picking k points and calling them centroids, calculating the distance of each input point from each centroid, assigning each point to the closest centroid, calculating the actual centroids for each group, and repeating until convergence.', 'In the store-placement example, if the stores are too far apart, they will not have enough sales.']}, {'end': 26497.256, 'start': 25869.683, 'title': 'k-means clustering for walmart optimization', 'summary': 'Explores using k-means clustering to determine the optimal store locations for Walmart in Florida, identifying four distinct clusters and the iterative process of assigning data points to centroids until convergence.', 'duration': 627.573, 'highlights': ['K-means clustering process for Walmart optimization: the chapter demonstrates using k-means clustering to determine the optimal store locations for Walmart in Florida, identifying four distinct clusters.', 'Importing libraries and loading data: the code begins by importing required libraries like numpy and matplotlib, then loads the data in the form of addresses and produces a scatter plot to visualize the relationship between data points.', 'Creating test data clusters with make_blobs: the code demonstrates the use of make_blobs, a utility in scikit-learn, to create clusters of data and visually identifies four distinct clusters within the data set.', 'Standard k-means functionality: the standard k-means implementation is used as-is, creating an instance of KMeans, specifying the number of clusters to be created, fitting the model, and predicting and assigning cluster numbers to observations.', 'Implementation of the k-means algorithm: a from-scratch implementation is explained, involving the random allocation of centroids, iterative assignment of data points to the closest centroid, calculation of new centroids, and checking for convergence until centroid positions no longer change significantly.']}, {'end': 27143.355, 'start': 26497.276, 'title': 'k-means clustering in image processing', 'summary': 'Discusses the implementation of k-means clustering, including a rough implementation using 4 clusters and a more practical application involving color compression in images, using examples to illustrate the process and the impact on image quality.', 'duration': 646.079, 'highlights': ['Color compression using k-means clustering: the color palette is reduced from millions of colors to 16, with visual comparisons of the original and compressed images.', 'Implementation of k-means clustering with 4 clusters: a rough implementation of k-means using 4 clusters, with visualization of the cluster centroids.', 'Application in image processing: k-means is used to compress colors in images so they can be rendered on devices with limited memory, with examples showcasing the impact on image quality.']}], 'duration': 2958.731, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI24184624.jpg', 'highlights': ['The Naive Bayes classifier accurately classified texts into categories such as religion, space, motorcycles, autos, and politics.', 'K-means clustering is an unsupervised learning algorithm used for grouping data into clusters based on similarities, with the number of clusters, denoted by K, being a critical factor in the process.', 'The iterative process of k-means involves allocating centroids, calculating the distance of each data point from the centroids, repositioning centroids at the mean position, and reallocating data points until convergence is achieved.', 'The chapter covers the types of clustering, including hierarchical, partitional, k-means, and fuzzy c-means clustering.', 'The chapter demonstrates using k-means clustering to determine the optimal store locations for Walmart in Florida, identifying four distinct clusters.', 'Color compression using k-means clustering reduces the color palette from millions of colors to 16, with visual comparisons of the original and compressed images.', 'K-means is applied in image processing to compress colors so images can be rendered on devices with limited memory.']}
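The 'standard k-means functionality' highlighted above corresponds to just a few scikit-learn calls. A sketch with illustrative parameters (300 points, four blobs), not the video's exact code:

# Sketch of the make_blobs + standard KMeans demo described above.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

kmeans = KMeans(n_clusters=4)   # specify the number of clusters to create
kmeans.fit(X)                   # fit the model
y_kmeans = kmeans.predict(X)    # assign a cluster number to each observation

plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=30)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', s=150, marker='x')  # the four fitted centroids
plt.show()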
, {'end': 30499.68, 'segs': [{'end': 28148.997, 'src': 'embed', 'start': 28121.469, 'weight': 4, 'content': [{'end': 28127.411, 'text': 'A moving average means that if you have a series of data, you keep taking three values, then the next three values,', 'start': 28121.469, 'duration': 5.942}, {'end': 28131.032, 'text': 'and you take the average of each set, and then the next three values, and so on and so forth.', 'start': 28127.411, 'duration': 3.621}, {'end': 28133.653, 'text': 'So that is how you take the moving average.', 'start': 28131.272, 'duration': 2.381}, {'end': 28136.473, 'text': "So let's take a little more detailed example of car sales.", 'start': 28133.653, 'duration': 2.82}, {'end': 28142.275, 'text': "So this is how we have the car sales data, let's say for four years.", 'start': 28136.473, 'duration': 5.802}, {'end': 28148.997, 'text': 'For year one we have each quarter, quarters one, two, three, four, and then year two, quarters one, two, three, four, and so on and so forth.', 'start': 28142.275, 'duration': 6.722}], 'summary': 'A moving average is calculated by averaging every three values in a series of data, illustrated with car sales data over four years.', 'duration': 27.528, 'max_score': 28121.469, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI28121469.jpg'},
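In pandas, the rolling averages being described are one-liners. A sketch with illustrative quarterly numbers (not the video's figures):

# Sketch of simple and centered moving averages over quarterly car sales.
import pandas as pd

# Illustrative quarterly sales for four years (16 quarters).
sales = pd.Series([4.8, 4.1, 6.0, 6.5, 5.8, 5.2, 6.8, 7.4,
                   6.0, 5.6, 7.5, 7.8, 6.3, 5.9, 8.0, 8.4])

ma3 = sales.rolling(window=3).mean()                # 3-point moving average
ma4c = sales.rolling(window=4, center=True).mean()  # 4-quarter centered moving
                                                    # average, smoothing out the
                                                    # quarterly seasonality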
{'end': 29119.679, 'src': 'embed', 'start': 29091.567, 'weight': 0, 'content': [{'end': 29096.392, 'text': 'And then we will end with the model validation using the Ljung-Box test.', 'start': 29091.567, 'duration': 4.825}, {'end': 29099.033, 'text': "OK, so with that, let's get started.", 'start': 29096.832, 'duration': 2.201}, {'end': 29106.415, 'text': 'First of all, as I mentioned, we will be using the ARIMA model to forecast this time series data.', 'start': 29099.313, 'duration': 7.102}, {'end': 29109.636, 'text': 'So let us try to understand what ARIMA is.', 'start': 29106.595, 'duration': 3.041}, {'end': 29112.717, 'text': 'ARIMA is actually an acronym.', 'start': 29109.996, 'duration': 2.721}, {'end': 29117.218, 'text': 'It stands for Autoregressive Integrated Moving Average.', 'start': 29112.897, 'duration': 4.321}, {'end': 29119.679, 'text': 'So that is what the ARIMA model is.', 'start': 29117.418, 'duration': 2.261}], 'summary': 'Using the ARIMA model for time series forecasting, validated with the Ljung-Box test.', 'duration': 28.112, 'max_score': 29091.567, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI29091567.jpg'}, {'end': 29502.12, 'src': 'embed', 'start': 29474.662, 'weight': 3, 'content': [{'end': 29477.904, 'text': 'And you can use that to plot your autocorrelation function.', 'start': 29474.662, 'duration': 3.242}, {'end': 29480.106, 'text': 'So that is the ACF.', 'start': 29478.544, 'duration': 1.562}, {'end': 29483.368, 'text': 'And we will see that in RStudio in a little bit.', 'start': 29480.506, 'duration': 2.862}, {'end': 29486.55, 'text': 'And similarly, you have the partial autocorrelation function.', 'start': 29483.528, 'duration': 3.022}, {'end': 29494.896, 'text': 'The partial autocorrelation function is the degree of association between two variables while adjusting for the effect of one or more additional variables.', 'start': 29486.65, 'duration': 8.246}, {'end': 29502.12, 'text': 'This again can be measured and plotted, and its value once again can go from minus 1 to 1,', 'start': 29495.176, 'duration': 6.944}], 'summary': 'Autocorrelation and partial autocorrelation functions can be measured and plotted in RStudio.', 'duration': 27.458, 'max_score': 29474.662, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI29474662.jpg'}, {'end': 29745.749, 'src': 'embed', 'start': 29719.212, 'weight': 2, 'content': [{'end': 29725.155, 'text': 'So the auto.arima function basically tells us what the parameters should be.', 'start': 29719.212, 'duration': 5.943}, {'end': 29728.317, 'text': 'These parameters are the p, d, and q that we talked about.', 'start': 29725.155, 'duration': 3.162}, {'end': 29729.858, 'text': "That's what is being shown here.", 'start': 29728.317, 'duration': 1.541}, {'end': 29736.803, 'text': 'If we use auto.arima, it will try all possible values of these p, d, q parameters,', 'start': 29729.858, 'duration': 6.945}, {'end': 29742.327, 'text': 'find out which values are best, and then recommend them.', 'start': 29736.803, 'duration': 5.524}, {'end': 29745.749, 'text': 'So that is the advantage of using auto.arima.', 'start': 29742.327, 'duration': 3.422}], 'summary': 'The auto.arima function finds the best values for the parameters p, d, and q and recommends them.', 'duration': 26.537, 'max_score': 29719.212, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI29719212.jpg'}],
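The video walks through this workflow in R (acf/pacf plots, auto.arima, forecast, and Box.test). For readers following along in Python, a rough statsmodels parallel is sketched below; it assumes series is your time series as a pandas Series, and the (p, d, q) order shown is an illustrative choice rather than an auto-selected one.

# Rough Python parallel of the R ARIMA workflow described above.
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.stats.diagnostic import acorr_ljungbox

plot_acf(series)    # ACF: correlation with lagged copies, values in [-1, 1]
plot_pacf(series)   # PACF: correlation at each lag, adjusting for shorter lags

fit = ARIMA(series, order=(1, 1, 1)).fit()  # (p, d, q): illustrative order
print(fit.summary())                        # the AIC here is what auto.arima minimizes
forecast = fit.forecast(steps=12)           # forecast future time periods

# Ljung-Box test on the residuals: large p-values suggest the residuals are
# uncorrelated, i.e. the model has captured the structure in the data.
print(acorr_ljungbox(fit.resid, lags=[10]))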
'start': 27143.355, 'title': 'time series forecasting and analysis', 'summary': 'Covers the manual calculation of moving averages, centered moving averages, and trend analysis for time series data, culminating in a successful prediction model, and introduces the ARIMA model for time series forecasting in R, emphasizing the parameters p, d, and q.', 'chapters': [{'end': 27183.586, 'start': 27143.355, 'title': 'color compression using k-means clustering', 'summary': 'Discusses the process of color compression using k-means clustering, presenting two examples and demonstrating the implementation of k-means clustering with sample data.', 'duration': 40.231, 'highlights': ['Two examples of color compression using k-means clustering were presented, demonstrating the preservation of information despite differences in richness and rendering capabilities.', 'The implementation of k-means clustering was demonstrated, along with the code for its execution and the use of make_blobs sample data for the clustering process.']}, {'end': 27716.732, 'start': 27183.786, 'title': 'time series forecasting basics', 'summary': 'Covers the basics of time series forecasting, including the components of time series data such as trend, seasonality, cyclicity, and irregularity, the significance of time series analysis, and examples of time series data like daily stock prices and interest rates.', 'duration': 532.946, 'highlights': ['Time series data consists of four components: trend, seasonality, cyclicity, and irregularity, each contributing to the overall pattern and variation in the data.', 'Examples of time series data include daily stock prices, interest rates, and sales figures, which serve as historical data for building predictive models of future trends.', 'The significance of time series analysis lies in its ability to predict future events from past data, particularly for stock prices, sales, and other time-dependent data where forecasting is essential for decision-making.']}, {'end': 28245.761, 'start': 27716.732, 'title': 'time series analysis overview', 'summary': 'Covers the components of time series data, conditions for time series analysis, and the concept of stationary data, including the non-stationary nature of raw time series data and the need for data to be stationary for time series forecasting.', 'duration': 529.029, 'highlights': ['Non-stationary time series data consists of trend, seasonality, cyclicity, and irregularity components, which typically require transformation to stationary data for analysis; non-stationarity must be identified and addressed before applying forecasting models.', 'The mean, variance, and covariance of stationary data remain constant over time, which is what differentiates it from non-stationary data.', 'The concept of a moving average involves calculating the average of a set of consecutive values in a time series, a simple method of forecasting that needs no complex algorithms.']},
{'end': 29383.964, 'start': 28245.761, 'title': 'time series forecasting and analysis', 'summary': 'Covers the manual calculation of moving averages, centered moving averages, and trend analysis for time series data, culminating in a successful prediction model, and introduces the ARIMA model for time series forecasting in R, emphasizing the parameters p, d, and q.', 'duration': 1138.203, 'highlights': ['The manual calculation of moving averages, centered moving averages, and trend analysis, including predicting values for the fifth year, results in accurate predictions and captures the trend of the time series data.', 'The ARIMA model is introduced for time series forecasting in R, emphasizing the parameters p, d, and q, with regression analysis used to calculate the intercept and slope of the data for trend analysis.', 'The autoregressive, integrated, and moving average components of the ARIMA model are explained, with a focus on their significance and interpretation in time series forecasting.']}, {'end': 30499.68, 'start': 29383.964, 'title': 'time series forecasting with arima model', 'summary': 'Explains the process of using the ARIMA model for time series forecasting, including the steps of data exploration, decomposing time series components, model parameter selection, forecasting, and model validation with a Ljung-Box test, using RStudio.', 'duration': 1115.716, 'highlights': ['The ARIMA model works on the assumption that the data is stationary, with trend and seasonality removed; the ACF and PACF are used to test for stationarity.', 'The ACF and PACF measure autocorrelation and partial autocorrelation of time series data, providing insight into the correlation between data points separated by a time lag, with values ranging from -1 to 1.', 'The auto.arima function in RStudio automates the selection of the model parameters (p, d, q) by testing all possible combinations and recommending the model with the lowest AIC value.', 'Forecasting is conducted for future time periods using the ARIMA model, and the model is validated using the Ljung-Box test to assess the accuracy of the predictions.']}],
'duration': 3356.325, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI27143355.jpg', 'highlights': ['The ARIMA model is introduced for time series forecasting in R, emphasizing the parameters p, d, and q, with regression analysis used to calculate the intercept and slope of the data for trend analysis.', 'Forecasting is conducted for future time periods using the ARIMA model, and the model is validated using the Ljung-Box test to assess the accuracy of the predictions.', 'The auto.arima function in RStudio automates the selection of the model parameters (p, d, q) by testing all possible combinations and recommending the model with the lowest AIC value.', 'The ACF and PACF measure autocorrelation and partial autocorrelation, providing insight into the correlation between data points separated by a time lag, with values ranging from -1 to 1.', 'A moving average is calculated by averaging sets of consecutive values in a time series, which is a simple method of forecasting.']}, {'end': 32661.072, 'segs': [{'end': 30557.71, 'src': 'embed', 'start': 30517.389, 'weight': 0, 'content': [{'end': 30521.03, 'text': 'With this, they were able to create what they believed was a perfectly engrossing show.', 'start': 30517.389, 'duration': 3.641}, {'end': 30526.712, 'text': 'Data science is the area of study which involves extracting knowledge from all the data that you can gather.', 'start': 30521.851, 'duration': 4.861}, {'end': 30530.554, 'text': "Now that you understand what data science is, let's talk about what a data scientist does.", 'start': 30526.912, 'duration': 3.642}, {'end': 30533.936, 'text': "We'll go through all the skills that are required of a data scientist.", 'start': 30530.834, 'duration': 3.102}, {'end': 30537.138, 'text': 'A data scientist needs to have the following 7 skills:', 'start': 30534.216, 'duration': 2.922}, {'end': 30544.422, 'text': 'database knowledge, statistics, programming tools, data wrangling, machine learning, big data, and data visualization.', 'start': 30537.458, 'duration': 6.964}, {'end': 30545.523, 'text': "Let's go one by one.", 'start': 30544.662, 'duration': 0.861}, {'end': 30546.523, 'text': 'Skill 1.', 'start': 30545.763, 'duration': 0.76}, {'end': 30550.786, 'text': 'Database knowledge. Gaining database knowledge is required to store and analyze data.', 'start': 30546.523, 'duration': 4.263}, {'end': 30555.929, 'text': 'Oracle Database, SQL Server, MySQL, and Teradata are some of the tools used for this.', 'start': 30550.986, 'duration': 4.943}, {'end': 30557.71, 'text': 'Skill 2.', 'start': 30556.149, 'duration': 1.561}], 'summary': 'Data science requires 7 skills: database knowledge, statistics, programming tools, data wrangling, machine learning, big data, and data visualization.', 'duration': 40.321, 'max_score': 30517.389, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI30517389.jpg'},
{'end': 30680.404, 'src': 'embed', 'start': 30647.799, 'weight': 2, 'content': [{'end': 30654.385, 'text': 'Data visualization involves integrating different data sets, analyzing models, and visualizing them in the form of diagrams, charts, and graphs.', 'start': 30647.799, 'duration': 6.586}, {'end': 30659.489, 'text': 'Some of the tools or software packages that are used are Tableau, QlikView, Power BI, and Google Data Studio.', 'start': 30654.605, 'duration': 4.884}, {'end': 30661.891, 'text': "Let's talk about some of the job roles in data science.", 'start': 30659.829, 'duration': 2.062}, {'end': 30667.515, 'text': 'We have Data Scientist, Data Engineer, Data Architect, Data Analyst, Business Analyst, and Data Administrator.', 'start': 30662.171, 'duration': 5.344}, {'end': 30668.676, 'text': "Let's go one by one.", 'start': 30667.735, 'duration': 0.941}, {'end': 30669.697, 'text': 'Data Scientist.', 'start': 30668.996, 'duration': 0.701}, {'end': 30674.86, 'text': 'A data scientist earns around $120,000 per year, and these are their responsibilities:', 'start': 30670.097, 'duration': 4.763}, {'end': 30680.404, 'text': 'to create data-driven business solutions and analytics, and to drive optimization and improvement of product development.', 'start': 30674.98, 'duration': 5.424}], 'summary': 'Data visualization integrates data sets, using tools like Tableau, QlikView, Power BI, and Google Data Studio. A data scientist earns around $120,000 per year and creates data-driven business solutions.', 'duration': 32.605, 'max_score': 30647.799, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI30647799.jpg'}, {'end': 30862.093, 'src': 'embed', 'start': 30837.081, 'weight': 4, 'content': [{'end': 30842.445, 'text': 'Under this program we have courses in Data Science with SAS training, Data Science Certification Training with R programming,', 'start': 30837.081, 'duration': 5.364}, {'end': 30844.406, 'text': 'Big Data Hadoop and Spark Developer,', 'start': 30842.665, 'duration': 1.741}, {'end': 30845.646, 'text': 'Data Science with Python,', 'start': 30844.406, 'duration': 1.24}, {'end': 30847.227, 'text': 'Business Analytics with Excel,', 'start': 30845.646, 'duration': 1.581}, {'end': 30849.128, 'text': 'and Machine Learning and Deep Learning with TensorFlow.', 'start': 30847.227, 'duration': 1.901}, {'end': 30853.77, 'text': "We also have an integrated program in Big Data and Data Science, which is also a master's program.", 'start': 30849.428, 'duration': 4.342}, {'end': 30859.932, 'text': 'The courses covered are Data Science Certification Training with R Programming, Big Data Hadoop and Spark Developer, and Tableau', 'start': 30853.91, 'duration': 6.022}, {'end': 30862.093, 'text': 'Desktop Qualified Associate Training.', 'start': 30859.932, 'duration': 2.161}], 'summary': 'Offering courses in data science, including SAS, R, Python, Hadoop, Spark, Tableau, Excel, and TensorFlow, as well as an integrated program in Big Data and Data Science.', 'duration': 25.012, 'max_score': 30837.081, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI30837081.jpg'}, {'end': 30904.112, 'src': 'embed', 'start': 30873.725, 'weight': 5, 'content': [{'end': 30876.608, 'text': 'Who is a data science engineer?', 'start': 30873.725, 'duration': 2.883}, {'end': 30880.151, 'text': 'Are you a data science engineer, or are you going to be looking at a different field?', 'start': 30876.848, 'duration': 3.303}, {'end': 30882.733, 'text': 'What exactly is a data science engineer?', 'start': 30880.471, 'duration': 2.262}, {'end': 30892.562, 'text': 'Well, a data science engineer is someone who has programming experience, expert-level knowledge of Python and R, and the ability to write proficient code.', 'start': 30882.993, 'duration': 9.569}, {'end': 30897.766, 'text': "And where we have Python and R, I'm going to say Python or R.", 'start': 30892.942, 'duration': 4.824},
{'end': 30904.112, 'text': 'Once you become really proficient at one language, transferring those skills to another one is usually fairly easy.', 'start': 30897.766, 'duration': 6.346}], 'summary': 'A data science engineer is proficient in Python or R and capable of writing proficient code.', 'duration': 30.387, 'max_score': 30873.725, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI30873725.jpg'}, {'end': 30968.03, 'src': 'embed', 'start': 30939.287, 'weight': 6, 'content': [{'end': 30943.671, 'text': 'So you have to have strong coding skills with hands-on big data experience.', 'start': 30939.287, 'duration': 4.384}, {'end': 30951.577, 'text': "And, of course, we're showing SQL here, the most commonly used, whether you're using a Microsoft SQL Server or a MySQL server.", 'start': 30944.011, 'duration': 7.566}, {'end': 30956.46, 'text': 'You can also start thinking about Hadoop and Spark and big data access.', 'start': 30951.597, 'duration': 4.863}, {'end': 30959.443, 'text': 'The Hadoop file system is not a huge jump', 'start': 30956.941, 'duration': 2.502}, {'end': 30963.686, 'text': "if you've already learned your SQL and your basics in coding.", 'start': 30959.643, 'duration': 4.043}, {'end': 30968.03, 'text': 'Hadoop sits on top of all that and does a wonderful job, creating huge clusters of data.', 'start': 30963.686, 'duration': 4.344}], 'summary': 'Strong coding skills and hands-on big data experience are essential, including proficiency in SQL, Hadoop, and Spark for managing large data clusters.', 'duration': 28.743, 'max_score': 30939.287, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI30939287.jpg'}, {'end': 31037.377, 'src': 'embed', 'start': 31004.614, 'weight': 7, 'content': [{'end': 31009.238, 'text': 'A self-starter with a strong sense of personal responsibility and technical orientation.', 'start': 31004.614, 'duration': 4.624}, {'end': 31017.546, 'text': "This is an interesting one, because data science engineering is such a new field that most companies don't have it well defined.", 'start': 31009.479, 'duration': 8.067}, {'end': 31018.647, 'text': "They don't know what they're looking for.", 'start': 31017.566, 'duration': 1.081}, {'end': 31020.789, 'text': "You might not even know what you're looking for.", 'start': 31018.888, 'duration': 1.901}, {'end': 31024.573, 'text': "You might have an idea and you're looking for patterns, but where do those patterns lead you?", 'start': 31020.829, 'duration': 3.744}, {'end': 31031.155, 'text': 'So you really need that self-starter side to jump in there, figure out where to go, and be able to communicate that back to the team.', 'start': 31024.933, 'duration': 6.222}, {'end': 31031.835, 'text': 'And there we go.', 'start': 31031.355, 'duration': 0.48}, {'end': 31037.377, 'text': 'We have strong product intuition, data analysis skills, and business presentation skills.', 'start': 31031.855, 'duration': 5.522}], 'summary': 'A data science engineer needs to be a self-starter with strong technical skills and product intuition.', 'duration': 32.763, 'max_score': 31004.614, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI31004614.jpg'}, {'end': 31132.785, 'src': 'embed', 'start': 31105.743, 'weight': 9, 'content': [
{'end': 31109.466, 'text': "It's an essential language for extracting large amounts of data from data sets.", 'start': 31105.743, 'duration': 3.723}, {'end': 31114.01, 'text': 'So knowledge of SQL is mandatory for data science engineers.', 'start': 31109.726, 'duration': 4.284}, {'end': 31115.731, 'text': 'And you can see there are tools required.', 'start': 31114.37, 'duration': 1.361}, {'end': 31117.413, 'text': "There's the Oracle database.", 'start': 31115.771, 'duration': 1.642}, {'end': 31121.756, 'text': 'I mentioned MySQL Server, Microsoft SQL Server, Teradata.', 'start': 31117.753, 'duration': 4.003}, {'end': 31124.438, 'text': 'There are so many different forms of SQL.', 'start': 31121.956, 'duration': 2.482}, {'end': 31126.84, 'text': "And it'll just keep coming back and coming back.", 'start': 31124.759, 'duration': 2.081}, {'end': 31130.743, 'text': "So if you don't have a solid basic understanding of SQL, go get it.", 'start': 31126.88, 'duration': 3.863}, {'end': 31132.785, 'text': 'Very important, because it will come up.', 'start': 31131.024, 'duration': 1.761}], 'summary': 'SQL is essential for data extraction in data science, with tools and databases like Oracle, MySQL, and Microsoft SQL Server being mandatory knowledge.', 'duration': 27.042, 'max_score': 31105.743, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI31105743.jpg'}, {'end': 31182.079, 'src': 'embed', 'start': 31141.48, 'weight': 10, 'content': [{'end': 31146.183, 'text': 'Statistics is a subset of mathematics that deals with collecting, analyzing, and interpreting data.', 'start': 31141.48, 'duration': 4.703}, {'end': 31149.446, 'text': 'Therefore, a data scientist needs to know statistics.', 'start': 31146.404, 'duration': 3.042}, {'end': 31156.712, 'text': 'So you need to understand your probabilities and what they mean, and what the p-value means, and the F-score, and the mean, mode, and median.', 'start': 31149.686, 'duration': 7.026}, {'end': 31160.915, 'text': 'All that information, standard deviation, all of those you need to be aware of.', 'start': 31156.712, 'duration': 4.203}, {'end': 31163.098, 'text': 'And then we get into the programming tools.', 'start': 31161.295, 'duration': 1.803}, {'end': 31168.746, 'text': 'I mentioned this earlier a little bit, but you need to master any one of these specific programming languages.', 'start': 31163.338, 'duration': 5.408}, {'end': 31174.855, 'text': 'Programming tools such as R, Python, and SAS are essential to perform analytics on data.', 'start': 31168.987, 'duration': 5.868}, {'end': 31179.938, 'text': 'And again, you can move in there and be an expert in just R,', 'start': 31175.135, 'duration': 4.803}, {'end': 31182.079, 'text': 'or you can be an expert in just SAS.', 'start': 31179.938, 'duration': 2.141}], 'summary': 'Statistics is crucial for data scientists; mastery of a programming language like R, Python, or SAS is essential for performing analytics on data.', 'duration': 40.599, 'max_score': 31141.48, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI31141480.jpg'}, {'end': 31320.19, 'src': 'embed', 'start': 31294.735, 'weight': 12, 'content': [{'end': 31302.881, 'text': 'So when we talk about data wrangling, it is the process of transforming raw data into an appropriate format to make it useful for analytics.', 'start': 31294.735, 'duration': 8.146}, {'end': 31309.106, 'text': 'And it involves cleaning raw data, structuring raw data, and enriching raw data.', 'start': 31303.402, 'duration': 5.704},
{'end': 31313.708, 'text': "And this gets interesting, because you'll get stuck on something, like in Python", 'start': 31309.346, 'duration': 4.362}, {'end': 31320.19, 'text': 'you might get stuck on a datetime stored as a 64-bit integer in a NumPy array.', 'start': 31313.708, 'duration': 6.482}], 'summary': 'Data wrangling transforms raw data for analytics, including cleaning, structuring, and enriching data.', 'duration': 25.455, 'max_score': 31294.735, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI31294735.jpg'}, {'end': 31502.421, 'src': 'embed', 'start': 31472.167, 'weight': 14, 'content': [{'end': 31478.514, 'text': 'Big data has various benefits; for example, access to social data can enable organizations to tune their business strategies.', 'start': 31472.167, 'duration': 6.347}, {'end': 31480.936, 'text': 'Big data can improve customer experience.', 'start': 31478.834, 'duration': 2.102}, {'end': 31487.843, 'text': "And usually, when you say big data, we're almost always talking about Hadoop and Apache Spark.", 'start': 31481.797, 'duration': 6.046}, {'end': 31491.027, 'text': 'Hadoop is your file system.', 'start': 31488.724, 'duration': 2.303}, {'end': 31495.674, 'text': "That's how you store all your data going across the nodes, and there certainly are other ways to store it.", 'start': 31491.027, 'duration': 4.647}, {'end': 31502.421, 'text': "When we talk about Hadoop, we're usually talking about at least 10 terabytes of data in a Hadoop file structure.", 'start': 31495.994, 'duration': 6.427}], 'summary': 'Big data enables business strategy tuning and improves customer experience, and typically involves at least 10 terabytes of data in Hadoop.', 'duration': 30.254, 'max_score': 31472.167, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI31472167.jpg'}, {'end': 31576.391, 'src': 'embed', 'start': 31548.537, 'weight': 15, 'content': [{'end': 31554.96, 'text': 'And you can be talking about big data across five servers, pulling them into, say, Apache Spark to do your high-end processing.', 'start': 31548.537, 'duration': 6.423}, {'end': 31557.081, 'text': "And then there are the non-technical skills.", 'start': 31555.16, 'duration': 1.921}, {'end': 31564.725, 'text': 'Probably the most important one in data science, and in any of our data analytics, is intellectual curiosity:', 'start': 31557.521, 'duration': 7.204}, {'end': 31569.708, 'text': 'updating your knowledge by reading content and relevant books on trends in data science.', 'start': 31565.025, 'duration': 4.683}, {'end': 31576.391, 'text': "There's so much going on in this field, and it's exploding now, so it's hard to keep track of it all.", 'start': 31570.128, 'duration': 6.263}], 'summary': 'Use big data across servers with Apache Spark for high-end processing; intellectual curiosity and continuous learning are emphasized in data science.', 'duration': 27.854, 'max_score': 31548.537, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI31548537.jpg'}],
'start': 30500.04, 'title': 'data science skills and tools', 'summary': 'Details the skills required for a data scientist, including database knowledge, statistics, programming tools, data wrangling, machine learning, big data, and data visualization, along with job roles, responsibilities, and average salaries. It emphasizes the demand for data scientists, essential coding skills, and the importance of non-technical skills in data science.', 'chapters': [{'end': 30771.854, 'start': 30500.04, 'title': 'data science skills and job roles', 'summary': 'Explains how data science and big data were used to create the show House of Cards, and details the 7 skills required for a data scientist (database knowledge, statistics, programming tools, data wrangling, machine learning, big data, and data visualization), along with their specific tools, plus the job roles in data science with their associated responsibilities and average salaries.', 'duration': 271.814, 'highlights': ['Netflix used data science and big data to create House of Cards by analyzing user data from The West Wing, considering which scenes were rewound, fast-forwarded, and stopped; data science involves extracting knowledge from data.', 'A data scientist needs 7 skills: database knowledge, statistics, programming tools, data wrangling, machine learning, big data, and data visualization, each with specific tools and software.', 'The job roles in data science include Data Scientist ($120,000/yr), Data Engineer ($130,000/yr), Data Architect ($112,000/yr), Data Analyst ($65,000/yr), Business Analyst ($70,000/yr), and Data Administrator ($54,000/yr), along with their respective responsibilities.']},
the chapter also highlights the demand for data scientists, the certifications available, and the essential coding skills, particularly in python and r, with a strong emphasis on sql and big data experience.', 'duration': 282.855, 'highlights': ['The demand for data scientists, one of the most lucrative jobs in data science, necessitates sound knowledge in programming tools, data visualization, communication, statistics, mathematics, and linear algebra.', 'Simply Learn offers various data science certifications, covering programs in data science with SAS, R programming, Python, Big Data Hadoop, Spark Developer, Business Analytics with Excel, Machine Learning, Deep Learning with TensorFlow, and integrated programs in Big Data and Data Science.', 'A data science engineer requires strong programming experience, expert level knowledge in Python and R, the ability to write proficient codes, and a solid foundation in SQL and big data experience.', 'Hadoop and Spark, along with strong SQL skills, are essential for data science engineers to handle big data and access data from large companies.', 'Being a versatile problem solver equipped with strong analytical and quantitative skills, along with a self-starter attitude, is crucial for success in the field of data science engineering.', 'Having a strong product intuition, data analysis skills, and business presentation skills is essential for data scientists to effectively communicate their analysis and insights within the team and to stakeholders.']}, {'end': 31333.256, 'start': 31054.709, 'title': 'Essential data science skills', 'summary': 'Discusses the essential skills required for a data science engineer, including database knowledge (sql), statistics, programming tools (r, python, sas), and data wrangling, emphasizing the importance of each skill and its relevance in the field of data science.', 'duration': 278.547, 'highlights': ['Database knowledge, particularly SQL, is essential for data science engineers, as it is crucial for extracting a large amount of data from datasets. SQL is mandatory for data science engineers, and knowledge of SQL is important as it is the essential language for extracting data from datasets.', 'Understanding statistics, including probabilities, p-score, f-score, mode, median, and standard deviation, is necessary for a data scientist to collect, analyze, and interpret data. Data scientists need to understand statistics, including probabilities, p-score, f-score, mode, median, and standard deviation, to effectively collect, analyze, and interpret data.', 'Proficiency in programming tools such as R, Python, and SAS is essential for performing analytics in data science, with Python being the main language currently used. Proficiency in programming tools like R, Python, and SAS is necessary for performing analytics in data science, with Python being the main language currently used in the field.', 'Data wrangling, which involves transforming raw data into a useful format for analytics, is a crucial aspect of data science, despite being considered one of the least favorite tasks. 
Data wrangling is a crucial aspect of data science, involving the transformation of raw data into a suitable format for analytics, despite being one of the least favorite tasks.']}, {'end': 31705.615, 'start': 31333.456, 'title': 'Data science skills and tools', 'summary': 'Discusses the importance of machine learning techniques, data visualization, big data, and non-technical skills in data science, emphasizing the significance of mastering these skills for effective communication and decision-making.', 'duration': 372.159, 'highlights': ["Machine learning techniques and data visualization are crucial skills for data scientists, with examples including decision trees, linear regression, and the use of visualization tools like Tableau and Python's PyKit and Seaborn. These skills are essential for data scientists and can be exemplified through decision trees, linear regression, and visualization tools like Tableau, PyKit, and Seaborn.", 'Big data, primarily handled through Hadoop and Apache Spark, offers various benefits such as improved customer experience and business strategy tuning, with Hadoop typically used for storing at least 10 terabytes of data and Spark for high-intensive data processing. Big data, managed through Hadoop and Apache Spark, provides benefits like enhanced customer experience and business strategy tuning, with Hadoop used for large data storage and Spark for intensive data processing.', 'Non-technical skills like intellectual curiosity, business acumen, and effective communication are emphasized as crucial for data scientists to understand the impact on the business, continuously update knowledge, and communicate technical findings to non-technical teams. Non-technical skills such as intellectual curiosity, business acumen, and effective communication are essential for data scientists to understand business impact, update knowledge, and communicate findings to non-technical teams.']}], 'duration': 1205.575, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI30500040.jpg', 'highlights': ['Data Science involves extracting knowledge from data.', 'A data scientist needs 7 skills: Database Knowledge, Statistics, Programming Tools, Data Wrangling, Machine Learning, Big Data, Data Visualization.', 'The job roles in data science include Data Scientist ($120,000/yr), Data Engineer ($130,000/yr), Data Architect ($112,000/yr), Data Analyst ($65,000/yr), Business Analyst (US$70,000/yr), and Data Administrator ($54,000/yr) along with their respective responsibilities.', 'The demand for data scientists necessitates sound knowledge in programming tools, data visualization, communication, statistics, mathematics, and linear algebra.', 'Simply Learn offers various data science certifications, covering programs in data science with SAS, R programming, Python, Big Data Hadoop, Spark Developer, Business Analytics with Excel, Machine Learning, Deep Learning with TensorFlow, and integrated programs in Big Data and Data Science.', 'A data science engineer requires strong programming experience, expert level knowledge in Python and R, the ability to write proficient codes, and a solid foundation in SQL and big data experience.', 'Hadoop and Spark, along with strong SQL skills, are essential for data science engineers to handle big data and access data from large companies.', 'Being a versatile problem solver equipped with strong analytical and quantitative skills, along with a self-starter attitude, is crucial for success in the field 
of data science engineering.', 'Having a strong product intuition, data analysis skills, and business presentation skills is essential for data scientists to effectively communicate their analysis and insights within the team and to stakeholders.', 'Database knowledge, particularly SQL, is essential for data science engineers, as it is crucial for extracting a large amount of data from datasets.', 'Understanding statistics, including probabilities, p-score, f-score, mode, median, and standard deviation, is necessary for a data scientist to collect, analyze, and interpret data.', 'Proficiency in programming tools such as R, Python, and SAS is essential for performing analytics in data science, with Python being the main language currently used.', 'Data wrangling, which involves transforming raw data into a useful format for analytics, is a crucial aspect of data science, despite being considered one of the least favorite tasks.', "Machine learning techniques and data visualization are crucial skills for data scientists, with examples including decision trees, linear regression, and the use of visualization tools like Tableau and Python's PyKit and Seaborn.", 'Big data, primarily handled through Hadoop and Apache Spark, offers various benefits such as improved customer experience and business strategy tuning, with Hadoop typically used for storing at least 10 terabytes of data and Spark for high-intensive data processing.', 'Non-technical skills like intellectual curiosity, business acumen, and effective communication are emphasized as crucial for data scientists to understand the impact on the business, continuously update knowledge, and communicate technical findings to non-technical teams.']}, {'end': 32661.072, 'segs': [{'end': 31739.362, 'src': 'embed', 'start': 31705.775, 'weight': 0, 'content': [{'end': 31713.782, 'text': 'So working with everybody, including the customer, is very important with this kind of setup and non-technical skills for a data scientist.', 'start': 31705.775, 'duration': 8.007}, {'end': 31718.587, 'text': "So let's take a look at some of the roles data science plays in the job market.", 'start': 31714.143, 'duration': 4.444}, {'end': 31720.429, 'text': "Let's drill in there just a little bit here.", 'start': 31718.767, 'duration': 1.662}, {'end': 31722.11, 'text': 'So, when you have a data scientist,', 'start': 31720.709, 'duration': 1.401}, {'end': 31728.114, 'text': "they're going to perform predictive analysis and identify trend and patterns that can help in better decision making.", 'start': 31722.11, 'duration': 6.004}, {'end': 31732.977, 'text': 'Companies hiring data scientists include Apple, Adobe, Google, Microsoft.', 'start': 31728.314, 'duration': 4.663}, {'end': 31739.362, 'text': "I would say that's even expanding down to smaller companies where you're talking about only 100 employees or something like that.", 'start': 31733.538, 'duration': 5.824}], 'summary': 'Data scientists perform predictive analysis and are in high demand, with companies like apple, adobe, google, and microsoft hiring them.', 'duration': 33.587, 'max_score': 31705.775, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI31705775.jpg'}, {'end': 31970.457, 'src': 'embed', 'start': 31946.954, 'weight': 5, 'content': [{'end': 31954.781, 'text': 'And here we have our languages, our SQL, R, MATLAB, SAS, SPSS, Python, Java, Ruby, C++, Perl, Hive, Pig.', 'start': 31946.954, 'duration': 7.827}, {'end': 31960.807, 'text': "You pretty 
much as a data engineer at this level, when you're talking about admin and updating all these databases,", 'start': 31954.941, 'duration': 5.866}, {'end': 31964.01, 'text': "need to know all the different stuff that's being used in the company.", 'start': 31960.807, 'duration': 3.203}, {'end': 31967.053, 'text': "And you're testing out these structures to make sure they work.", 'start': 31964.23, 'duration': 2.823}, {'end': 31970.457, 'text': 'And so you really are talking admin level kind of setups.', 'start': 31967.153, 'duration': 3.304}], 'summary': 'A data engineer needs to know SQL, R, MATLAB, SAS, Python, Java, Ruby, C++, Perl, Hive, and Pig for admin work and testing database structures.', 'duration': 23.503, 'max_score': 31946.954, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI31946954.jpg'}, {'end': 32036.503, 'src': 'embed', 'start': 32006.098, 'weight': 4, 'content': [{'end': 32009.479, 'text': "Companies hiring a data engineer, or we're talking about statisticians?", 'start': 32006.098, 'duration': 3.381}, {'end': 32016.64, 'text': "we're talking about LinkedIn, PepsiCo, Johnson & Johnson and, of course, these companies probably hire a little bit of everything,", 'start': 32009.479, 'duration': 7.161}, {'end': 32018.66, 'text': 'but they have a lot of statisticians working for them.', 'start': 32016.64, 'duration': 2.02}, {'end': 32026.101, 'text': 'And so we look at the role: extract and offer valuable reports from the data clusters through statistical theories and data organization.', 'start': 32019.12, 'duration': 6.981}, {'end': 32027.922, 'text': 'And the languages across the board.', 'start': 32026.401, 'duration': 1.521}, {'end': 32031.642, 'text': "we're still seeing SQL, R, MATLAB, SAS, SPSS -", 'start': 32027.922, 'duration': 3.72}, {'end': 32036.503, 'text': "that's a new one that we haven't seen in some of the other roles - Python, and Perl, an older version.", 'start': 32031.642, 'duration': 4.861}], 'summary': 'Companies like LinkedIn, PepsiCo, and Johnson & Johnson hire statisticians and data engineers with skills in SQL, R, MATLAB, SAS, SPSS, Python, and Perl.', 'duration': 30.405, 'max_score': 32006.098, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI32006098.jpg'}, {'end': 32321.098, 'src': 'embed', 'start': 32288.502, 'weight': 1, 'content': [{'end': 32290.363, 'text': "That's where we're pulling a lot of our data from.", 'start': 32288.502, 'duration': 1.861}, {'end': 32295.165, 'text': 'Average base pay is around $117,000 in the US, the average salary.', 'start': 32290.643, 'duration': 4.522}, {'end': 32299.407, 'text': "And when we look at India, we're talking about ₹950,000 a year.", 'start': 32295.545, 'duration': 3.862}, {'end': 32301.848, 'text': 'And those are based on some of them.', 'start': 32299.627, 'duration': 2.221}, {'end': 32306.051, 'text': 'you need to go ahead and dig deeper to find out what education level versus entry level,', 'start': 32301.848, 'duration': 4.203}, {'end': 32310.713, 'text': "but those are a pretty solid base once you're in the industry and once you've created your career in there.", 'start': 32306.051, 'duration': 4.662}, {'end': 32315.235, 'text': 'When we look at the job titles, the most common job title is data scientist.', 'start': 32310.733, 'duration': 4.502}, {'end': 32319.257, 'text': 'Business intelligence manager is the second greatest one.', 'start': 32315.635, 'duration': 3.622}, {'end': 32321.098, 'text': "That's kind of good to note.", 'start': 32319.957, 'duration': 1.141}], 'summary': 'Data science industry average base pay: US $117,000, India ₹950,000. Common job titles: data scientist, business intelligence manager.', 'duration': 32.596, 'max_score': 32288.502, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI32288502.jpg'}, {'end': 32402.641, 'src': 'embed', 'start': 32377.13, 'weight': 3, 'content': [{'end': 32384.034, 'text': 'if you specialize in retail and marketing and you can see here on the data science salary trends and the growth in data science job listings,', 'start': 32377.13, 'duration': 6.904}, {'end': 32385.694, 'text': "It's continually going up.", 'start': 32384.534, 'duration': 1.16}, {'end': 32391.657, 'text': "Since 2014, we've gone from 400 to 600.", 'start': 32385.855, 'duration': 5.802}, {'end': 32393.197, 'text': "That's a pretty big increase.", 'start': 32391.657, 'duration': 1.54}, {'end': 32395.378, 'text': 'So, you know, a huge growth in this market.', 'start': 32393.577, 'duration': 1.801}, {'end': 32399.08, 'text': "It's one of the biggest growing markets right now for jobs and careers.", 'start': 32395.398, 'duration': 3.682}, {'end': 32402.641, 'text': "So let's go ahead and take a look at building a resume.", 'start': 32399.42, 'duration': 3.221}], 'summary': 'Data science job listings grew from 400 to 600 since 2014, showing a significant increase in the retail and marketing sector.', 'duration': 25.511, 'max_score': 32377.13, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI32377130.jpg'}], 'start': 31705.775, 'title': 'Data science job roles and trends', 'summary': 'Outlines key job roles, top hiring companies, essential skills, and average salaries in data science, highlighting the increasing demand for data scientists and the essential skills and responsibilities of the role.', 'chapters': [{'end': 31815.621, 'start': 31705.775, 'title': 'Roles of a data scientist in the job market', 'summary': 'Discusses the importance of non-technical skills for data scientists, the expanding demand for data scientists in companies of all sizes, and the essential skills and responsibilities of a data scientist, including performing predictive analysis, identifying trends and patterns, and utilizing various programming languages and tools like R, SAS, Python, MATLAB, SQL, Hive, Pig, and Spark.', 'duration': 109.846, 'highlights': ['The expanding demand for data scientists includes smaller companies with around 100 employees, in addition to tech giants like Apple, Adobe, Google, and Microsoft.', 'The primary goal of a data scientist is to understand the challenges of a system and offer the best solutions, contributing to better decision-making and enhancing the company or business.', 'Data scientists should have knowledge of various programming languages such as R, SAS, Python, MATLAB, SQL, Hive, Pig, and Spark, and be capable of creating and modifying algorithms to extract information from large databases.', 'Data scientists are responsible for performing predictive analysis and identifying trends and patterns that can help in better decision making.', 'Non-technical skills are crucial for data scientists as they need to collaborate with various stakeholders, including customers, to contribute effectively to the team and understand the broader goals of the company.']}, {'end': 32399.08, 'start': 31815.821, 'title': 'Data science job roles and trends', 'summary': 'Outlines the key job roles in data science, including data analyst, data architect, data engineer, statistician, database administrator, data and analytics manager, and business analytics. It highlights the top hiring companies, essential skills and languages required for each role, and provides insights into the average salaries and job title trends in the field.', 'duration': 583.259, 'highlights': ['The average base pay for data science engineers in the US is around $117,000, while in India it is approximately ₹950,000 (INR) per year.', 'The most common job title in data science is data scientist, followed by business intelligence manager.', 'The growth in data science job listings has seen a substantial increase since 2014, indicating a thriving job market in the field.', 'Key languages and skills required for various data science roles include SQL, R, MATLAB, SAS, Python, Java, C++, Ruby, Hive, Pig, and Spark.', 'Companies hiring for data science roles include Amazon, Spotify, Facebook, LinkedIn, PepsiCo, Johnson & Johnson, Visa, Logitech, Coca-Cola, IBM, DHL, and HP.']}, {'end': 32661.072, 'start': 32399.42, 'title': 'Building an effective resume', 'summary': "Discusses the evolution of resumes, emphasizing the importance of including references, a professional summary, and relevant links, while highlighting the significance of tailoring the resume to the specific company's needs.", 'duration': 261.652, 'highlights': ['The top part of resumes has evolved, with references and a picture becoming crucial in standing out.', "The professional summary should focus on how the candidate can serve the company, and tailoring it to the specific company's needs is crucial.", "Including links to professional profiles like LinkedIn, GitHub, and even a personal website is important in showcasing one's skills and work.", "Tailoring the resume to the specific company's requirements by emphasizing relevant experiences and skills is crucial for making an impact.", "Resumes should be concise and tailored to grab the reader's interest within 30 seconds to a minute, with the option for further exploration through provided links."]}], 'duration': 955.297, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI31705775.jpg', 'highlights': ['The expanding demand for data scientists includes smaller companies with around 100 employees, in addition to tech giants like Apple, Adobe, Google, and Microsoft.', 'The average base pay for data science engineers in the US is around $117,000, while in India it is approximately ₹950,000 (INR) per year.', 'The most common job title in data science is data scientist, followed by business intelligence manager.', 'The growth in data science job listings has seen a substantial increase since 2014, indicating a thriving job market in the field.', 'Companies hiring for data science roles include Amazon, Spotify, Facebook, LinkedIn, PepsiCo, Johnson & Johnson, Visa, Logitech, Coca-Cola, IBM, DHL, and HP.', 'Data scientists should have knowledge of various programming languages such as R, SAS, Python, MATLAB, SQL, Hive, Pig, and Spark, and be capable of creating and modifying algorithms to extract information from large databases.', 'Data scientists are responsible for performing predictive analysis and identifying trends and patterns that can help in better decision making.', 'Non-technical skills are crucial for data scientists as they need to collaborate with various stakeholders, including customers, to contribute effectively to the team and understand the broader goals of the company.']}, {'end': 33410.21, 'segs': [{'end': 32719.136, 'src': 'embed', 'start': 32661.072, 'weight': 0, 'content': [{'end': 32669.397, 'text': 'So, quick overview: this is your sell sheet, selling you to the company, so always tie it to the company so that you have that.', 'start': 32661.072, 'duration': 8.325}, {'end': 32670.878, 'text': 'what am I going to give this company?', 'start': 32669.397, 'duration': 1.481}, {'end': 32672.019, 'text': 'what are they going to get from me?', 'start': 32670.878, 'duration': 1.141}, {'end': 32677.174, 'text': "We've covered a lot on data science, engineering in general, and we've gone over a basic resume.", 'start': 32672.25, 'duration': 4.924}, {'end': 32680.157, 'text': 'Remember, keep it simple, short, and direct.', 'start': 32677.615, 'duration': 2.542}, {'end': 32682.239, 'text': 'That is so important with that resume.', 'start': 32680.417, 'duration': 1.822}, {'end': 32684.941, 'text': 'Welcome to Data Science Interview Questions.', 'start': 32682.399, 'duration': 2.542}, {'end': 32687.723, 'text': 'My name is Richard Kirshner with the Simplilearn team.', 'start': 32685.281, 'duration': 2.442}, {'end': 32690.165, 'text': "That's www.simplilearn.com.", 'start': 32687.763, 'duration': 2.402}, {'end': 32691.186, 'text': 'Get certified.', 'start': 32690.305, 'duration': 0.881}, {'end': 32691.967, 'text': 'Get ahead.', 'start': 32691.406, 'duration': 0.561}, {'end': 32695.828, 'text': 'Before we dive in and start going through the questions one at a time,', 'start': 32692.367, 'duration': 3.461}, {'end': 32700.17, 'text': "we're going to start with some of the logical kinds of concepts that enter into a lot of interviews.", 'start': 32695.828, 'duration': 4.342}, {'end': 32704.452, 'text': 'In this one you have two buckets, one of three liters and the other of five liters.', 'start': 32700.35, 'duration': 4.102}, {'end': 32706.993, 'text': "You're expected to measure exactly four liters.", 'start': 32704.612, 'duration': 2.381}, {'end': 32711.034, 'text': 'How will you complete the task? 
And note, you only have the two buckets.', 'start': 32707.373, 'duration': 3.661}, {'end': 32713.695, 'text': "You don't have a third bucket or anything like that, just the two buckets.", 'start': 32711.134, 'duration': 2.561}, {'end': 32719.136, 'text': 'And the object of the question like this is to see how well you are thinking outside the box.', 'start': 32713.955, 'duration': 5.181}], 'summary': 'Prepare sell sheet for company, keep resume short, logical thinking is crucial in interviews.', 'duration': 58.064, 'max_score': 32661.072, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI32661072.jpg'}, {'end': 32872.352, 'src': 'embed', 'start': 32845.245, 'weight': 4, 'content': [{'end': 32850.97, 'text': 'Most commonly used supervised learning algorithms are decision tree, logistic regression, support vector machine.', 'start': 32845.245, 'duration': 5.725}, {'end': 32855.295, 'text': 'And you should know that those are probably the most common used right now and there certainly are so many coming out.', 'start': 32851.11, 'duration': 4.185}, {'end': 32861.04, 'text': "So that's a very evolving thing and be aware of a lot of the different algorithms that are out there outside of the deep learning.", 'start': 32855.394, 'duration': 5.646}, {'end': 32866.165, 'text': 'because a lot of these work faster on raw data and numbers than they do, than a deep neural network would.', 'start': 32861.101, 'duration': 5.064}, {'end': 32869.328, 'text': 'Unsupervised learning uses unlabeled data as input.', 'start': 32866.366, 'duration': 2.962}, {'end': 32872.352, 'text': 'Unsupervised learning has no feedback mechanism.', 'start': 32869.629, 'duration': 2.723}], 'summary': 'Common supervised learning algorithms are decision tree, logistic regression, and support vector machine, while unsupervised learning uses unlabeled data.', 'duration': 27.107, 'max_score': 32845.245, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI32845245.jpg'}], 'start': 32661.072, 'title': 'Crafting a sell sheet and measuring 4 liters', 'summary': 'Emphasizes crafting a sell sheet for data science interviews and solving the problem of measuring 4 liters with 3 and 5-liter buckets. 
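A quick runnable sketch of the classic pouring sequence for the 3 L / 5 L puzzle described above. The helper function and state tracking are our illustration, not code from the video:

    # Measure exactly 4 liters using only a 3 L and a 5 L bucket.
    def pour(src, dst, dst_cap):
        moved = min(src, dst_cap - dst)        # pour until src is empty or dst is full
        return src - moved, dst + moved

    three, five = 0, 0
    five = 5                            # fill the 5 L bucket      -> three=0, five=5
    five, three = pour(five, three, 3)  # pour 5 L into 3 L        -> three=3, five=2
    three = 0                           # empty the 3 L bucket     -> three=0, five=2
    five, three = pour(five, three, 3)  # move the 2 L across      -> three=2, five=0
    five = 5                            # refill the 5 L bucket    -> three=2, five=5
    five, three = pour(five, three, 3)  # top up the 3 L bucket    -> three=3, five=4
    assert five == 4                    # the 5 L bucket now holds exactly 4 L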
It covers concepts like supervised and unsupervised learning algorithms, logistic regression, decision tree, random forest, overfitting, and univariate, bivariate, and multivariate analysis.', 'chapters': [{'end': 32700.17, 'start': 32661.072, 'title': 'Crafting a sell sheet for data science interviews', 'summary': 'Emphasizes the importance of crafting a sell sheet for data science interviews, highlighting the need to tie the sell sheet to the company, keep the resume simple, short, and direct, and covering data science and engineering concepts.', 'duration': 39.098, 'highlights': ['Crafting a sell sheet is crucial for data science interviews, emphasizing the need to tie it to the company and focus on what the candidate can offer the company.', 'Emphasizes the importance of simplicity, brevity, and directness when creating a resume for data science interviews.', 'Covered a lot on data science, engineering, and provided guidance on creating a basic resume for data science interviews.', 'Starting with logical concepts that are commonly encountered in data science interviews.']}, {'end': 33410.21, 'start': 32700.35, 'title': 'Measuring 4 liters with 3 and 5 liter buckets', 'summary': 'Discusses the problem of measuring exactly 4 liters using 3 and 5-liter buckets, and also covers topics like supervised and unsupervised learning algorithms, logistic regression, decision tree, random forest, overfitting, and univariate, bivariate, and multivariate analysis.', 'duration': 709.86, 'highlights': ['The chapter discusses the problem of measuring exactly 4 liters using 3 and 5-liter buckets, demonstrating the process of pouring water between the two buckets to achieve the desired measurement.', 'The chapter covers topics like supervised and unsupervised learning algorithms, logistic regression, decision tree, random forest, overfitting, and univariate, bivariate, and multivariate analysis, providing a broad overview of important concepts in the field.']}], 'duration': 749.138, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI32661072.jpg', 'highlights': ['Crafting a sell sheet is crucial for data science interviews, emphasizing the need to tie it to the company and focus on what the candidate can offer the company.', 'Emphasizes the importance of simplicity, brevity, and directness when creating a resume for data science interviews.', 'Starting with logical concepts that are commonly encountered in data science interviews.', 'The chapter discusses the problem of measuring exactly 4 liters using 3 and 5-liter buckets, demonstrating the process of pouring water between the two buckets to achieve the desired measurement.', 'The chapter covers topics like supervised and unsupervised learning algorithms, logistic regression, decision tree, random forest, overfitting, and univariate, bivariate, and multivariate analysis, providing a broad overview of important concepts in the field.', 'Covered a lot on data science, engineering, and provided guidance on creating a basic resume for data science interviews.']}, {'end': 34289.237, 'segs': [{'end': 33447.555, 'src': 'embed', 'start': 33410.23, 'weight': 0, 'content': [{'end': 33415.273, 'text': 'You know, a little vendor on the corner selling 2,000 ice cream cones a day and 3,100 the next day.', 'start': 33410.23, 'duration': 5.043}, {'end': 33421.436, 'text': 'Here, the relationship is visible from the table: temperature and sales are directly proportional to each other.', 'start': 33415.533, 'duration': 5.903}, {'end': 33424.678, 'text': 'So the hotter the temperature, we can predict an increase in sales.', 'start': 33421.657, 'duration': 3.021}, {'end': 33426.48, 'text': 'So the word prediction should come up.', 'start': 33425.078, 'duration': 1.402}, {'end': 33428.742, 'text': 'So we have description and prediction.', 'start': 33426.5, 'duration': 2.242}, {'end': 33433.606, 'text': 'When the data involves three or more variables, it is categorized under multivariate.', 'start': 33428.962, 'duration': 4.644}, {'end': 33437.068, 'text': 'It is similar to bivariate, but contains more than one dependent variable.', 'start': 33433.845, 'duration': 3.223}, {'end': 33440.61, 'text': 'In this example, another really common one, the data is for house price prediction.', 'start': 33437.167, 'duration': 3.443}, {'end': 33447.555, 'text': 'The patterns can be studied by drawing conclusions using mean, median, and mode; dispersion or range; minimum, maximum, etc.', 'start': 33440.61, 'duration': 6.945}], 'summary': "Temperature and ice cream sales are directly proportional in the vendor example; multivariate data includes house price prediction.", 'duration': 37.325, 'max_score': 33410.23, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI33410230.jpg'}, 
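The "directly proportional" claim in the segment above is easy to check numerically. A small pandas illustration with made-up temperature/sales figures (the numbers are ours, not the video's table):

    import pandas as pd

    df = pd.DataFrame({
        'temperature': [20, 24, 28, 31, 35],
        'sales':       [2000, 2400, 2900, 3100, 3600],
    })
    # A correlation close to +1 backs up "the hotter the temperature, the higher the sales".
    print(df['temperature'].corr(df['sales']))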
{'end': 33516.188, 'src': 'embed', 'start': 33486.069, 'weight': 3, 'content': [{'end': 33489.992, 'text': "But that usually doesn't show up unless you're dealing with some really hardcore data science groups.", 'start': 33486.069, 'duration': 3.923}, {'end': 33497.638, 'text': 'What are the feature selection methods to select the right variables? There are two main methods for feature selection.', 'start': 33490.292, 'duration': 7.346}, {'end': 33500.68, 'text': "There's filter methods and wrapper methods.", 'start': 33497.718, 'duration': 2.962}, {'end': 33510.866, 'text': 'Before we discuss the two methods real quick, the best analogy for selecting features is bad data in, bad answer out.', 'start': 33501.201, 'duration': 9.665}, {'end': 33516.188, 'text': "So when we're limiting or selecting our features, it's all about cleaning up the data coming in.", 'start': 33511.226, 'duration': 4.962}], 'summary': 'Feature selection in data science involves filter and wrapper methods to clean and select variables, crucial for accurate results.', 'duration': 30.119, 'max_score': 33486.069, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI33486069.jpg'}, {'end': 33805.412, 'src': 'embed', 'start': 33773.171, 'weight': 4, 'content': [{'end': 33776.153, 'text': 'So with smaller data, you start running into problems because you lose a lot of data.', 'start': 33773.171, 'duration': 2.982}, {'end': 33783.118, 'text': 'And so we can substitute missing values with the mean or average of the rest of the data using a pandas DataFrame in Python.', 'start': 33776.273, 'duration': 6.845}, {'end': 33785.5, 'text': "There's different ways to do this, obviously, in different languages.", 'start': 33783.218, 'duration': 2.282}, {'end': 33787.741, 'text': "And even in Python, there's different ways to do this.", 'start': 33785.56, 'duration': 2.181}, {'end': 33789.383, 'text': "But in Python, it's real easy.", 'start': 33787.922, 'duration': 1.461}, {'end': 33790.944, 'text': 'You can do the df.mean.', 'start': 33789.403, 'duration': 1.541}, {'end': 33792.185, 'text': 'So you get the mean value.', 'start': 33791.104, 'duration': 1.081}, {'end': 33796.908, 'text': 'So if you set mean equal to that, then you can do a df.fillna with the mean value.', 'start': 33792.425, 'duration': 4.483}, {'end': 33799.65, 'text': 'Very easy to do in a Python pandas script.', 'start': 33797.068, 'duration': 2.582}, {'end': 33805.412, 'text': "And if you're using Python, you should really know pandas and NumPy (numerical Python) and pandas DataFrames.", 'start': 33799.87, 'duration': 5.542}], 'summary': 'Substitute missing data with the mean using a pandas DataFrame in Python.', 'duration': 32.241, 'max_score': 33773.171, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI33773171.jpg'}, {'end': 34019.918, 'src': 'embed', 'start': 33993.619, 'weight': 6, 'content': [{'end': 33999.242, 'text': 'How will you calculate eigenvalues and eigenvectors of a 3x3 matrix?', 'start': 33993.619, 'duration': 5.623}, {'end': 34005.686, 'text': "And what they're really looking for here, when you write it out for the eigen question, is that you know you're going to use the lambda.", 'start': 34000.043, 'duration': 5.643}, {'end': 34006.766, 'text': "That's the most common one.", 'start': 34005.806, 'duration': 0.96}, {'end': 34009.308, 'text': 'Obviously you can use any symbol you want, but lambda is usually what they use.', 'start': 34006.887, 'duration': 2.421}, {'end': 34017.696, 'text': 'You do it down the middle diagonal, and so when you take that matrix and you take the characteristic equation, you end up with the determinant,', 'start': 34009.628, 'duration': 8.068}, {'end': 34019.918, 'text': "and that's the minus 2, minus lambda.", 'start': 34017.696, 'duration': 2.222}], 'summary': 'To calculate the eigenvalues and eigenvectors of a 3x3 matrix, set the characteristic determinant det(A - lambda*I) to zero; diagonal terms such as (-2 - lambda) appear in it.', 'duration': 26.299, 'max_score': 33993.619, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI33993619.jpg'}, {'end': 34233.476, 'src': 'embed', 'start': 34203.627, 'weight': 5, 'content': [{'end': 34206.571, 'text': "where you've changed something and you want to figure out how your changes are going to affect things.", 'start': 34203.627, 'duration': 2.944}, {'end': 34208.974, 'text': "we need to monitor it and make sure it's doing what it's supposed to do.", 'start': 34206.571, 'duration': 2.403}, {'end': 34214.442, 'text': 'Evaluation metrics of the current model are calculated to determine if a new algorithm is needed.', 'start': 34209.195, 'duration': 5.247}, {'end': 34215.743, 'text': 'And then we compare it.', 'start': 34214.822, 'duration': 0.921}, {'end': 34220.049, 'text': 'The new models are compared against each other to determine which model performs the best.', 'start': 34215.884, 'duration': 4.165}, {'end': 34221.231, 'text': 'And then we do a rebuild.', 'start': 34220.31, 'duration': 0.921}, {'end': 34224.672, 'text': 'The best performing model is rebuilt on the current state of data.', 'start': 34221.511, 'duration': 3.161}, {'end': 34225.413, 'text': 'And this is interesting.', 'start': 34224.752, 'duration': 0.661}, {'end': 34226.633, 'text': 'I found this out just recently.', 'start': 34225.453, 'duration': 1.18}, {'end': 34233.476, 'text': "If you're in weather prediction, the really big weather areas have about seven or eight different models depending on what's going on.", 'start': 34226.733, 'duration': 6.743}], 'summary': 'Evaluate model performance, compare new models, and rebuild the best performing model on current data; big weather operations keep multiple prediction models.', 'duration': 29.849, 'max_score': 34203.627, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI34203627.jpg'}], 
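The df.mean / df.fillna recipe from the missing-values segment above, written out as a runnable sketch (the column name and values are hypothetical):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'age': [25.0, 30.0, np.nan, 40.0]})  # toy data with one gap
    mean = df['age'].mean()              # the df.mean step; NaNs are ignored by default
    df['age'] = df['age'].fillna(mean)   # the df.fillna step: substitute the mean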
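And for the eigenvalue question, NumPy can do the characteristic-equation work numerically. A sketch using a commonly cited textbook 3x3 matrix (an assumption on our part; the video's exact matrix is not shown here beyond the -2 on the diagonal):

    import numpy as np

    A = np.array([[-2, -4,  2],
                  [-2,  1,  2],
                  [ 4,  2,  5]])
    # np.linalg.eig solves det(A - lambda*I) = 0 and returns the eigenvalues
    # along with the matching eigenvectors as the columns of the second result.
    values, vectors = np.linalg.eig(A)
    print(values)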
'start': 33410.23, 'title': 'Data analysis and science methods', 'summary': 'Discusses data analysis techniques such as the temperature and ice cream sales relationship, bivariate and multivariate data analysis, and data description for house price prediction, as well as data science methods including feature selection, handling missing data, and model deployment maintenance.', 'chapters': [{'end': 33467.453, 'start': 33410.23, 'title': 'Data analysis techniques', 'summary': 'Discusses the relationship between temperature and ice cream sales, the distinction between bivariate and multivariate data analysis, and the use of data description for house price prediction.', 'duration': 57.223, 'highlights': ['The relationship between temperature and ice cream sales is directly proportional, leading to an increase in sales as temperature rises.', 'Multivariate data analysis involves three or more variables and is used in scenarios such as house price prediction.', 'Data description is utilized to make predictions, such as estimating the price of a house based on its characteristics and the local market trends.']}, {'end': 34289.237, 'start': 33467.813, 'title': 'Data science methods and applications', 'summary': 'Covers data science methods, including feature selection, programming logic, handling missing data, distance calculation, clock angle calculation, dimensionality reduction, eigenvalues and eigenvectors calculation, and model deployment maintenance.', 'duration': 821.424, 'highlights': ['The chapter covers data science methods, including feature selection, programming logic, handling missing data, distance calculation, clock angle calculation, dimensionality reduction, eigenvalues and eigenvectors calculation, and model deployment maintenance.', 'The most common feature selection methods are filter methods (e.g., linear discriminant analysis, ANOVA, chi-squared) and wrapper methods (e.g., forward selection, backward selection, recursive feature elimination).', 'Methods for handling missing data include removing rows with missing values for large datasets, and substituting missing values with the mean or average of the rest of the data using a pandas DataFrame in Python for smaller datasets.', 'The calculation of eigenvalues and eigenvectors of a 3x3 matrix involves utilizing the characteristic equation and determinant to determine the eigenvalues, followed by calculating the eigenvectors for each eigenvalue.', 'The maintenance of deployed models involves constant monitoring, evaluation of current models, comparison of new models, and rebuilding the best performing model based on the current state of data.']}], 'duration': 879.007, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI33410230.jpg', 'highlights': ['Multivariate data analysis involves three or more variables and is used in scenarios such as house price prediction.', 'The relationship between temperature and ice cream sales is directly proportional, leading to an increase in sales as temperature rises.', 'Data description is utilized to make predictions, such as estimating the price of a house based on its characteristics and the local market trends.', 'The most common feature selection methods are filter methods and wrapper methods, where filter methods involve linear discriminant analysis, ANOVA, and chi-squared, while wrapper methods are forward selection, backward selection, and recursive feature elimination.', 'Methods for handling missing data include removing rows with missing values for large datasets, and substituting missing values with the mean or average of the rest of the data using a pandas DataFrame in Python for smaller datasets.', 'The maintenance of deployed models involves constant monitoring, evaluation of current models, comparison of new models, and rebuilding the best performing model based on the current state of data.', 'The calculation of eigenvalues and eigenvectors of a 3x3 matrix involves utilizing the characteristic equation and determinant to determine the eigenvalues, followed by calculating the eigenvectors for each eigenvalue.']}, {'end': 35486.759, 'segs': [{'end': 34339.499, 'src': 'embed', 'start': 34309.672, 'weight': 0, 'content': [{'end': 34314.996, 'text': 'The RMSE and the MSE are two of the most common measures of accuracy for a linear regression model.', 'start': 34309.672, 'duration': 5.324}, {'end': 34319.38, 'text': 'And you can see here we have the root mean square error, RMSE equals,', 'start': 34315.076, 'duration': 4.304}, {'end': 34327.788, 'text': 'and this is the square root of the sum of the predicted minus the actual squared over the total number.', 'start': 34319.38, 'duration': 8.408}, {'end': 34330.27, 'text': "So we're just looking for the average mean.", 'start': 34328.008, 'duration': 2.262}, {'end': 34332.372, 'text': "So we're looking for the average over the n.", 'start': 34330.45, 'duration': 1.922}, {'end': 34339.499, 'text': "And the reason you need to know about the difference between RMSE versus MSE is when you're doing a lot of these models and you're building your own model,", 'start': 34332.372, 'duration': 7.127}], 'summary': 'RMSE and MSE are common accuracy measures for linear regression models; RMSE is the square root of the sum of (predicted minus actual) squared over the total number.', 'duration': 29.827, 'max_score': 34309.672, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI34309672.jpg'}, 
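The formula the RMSE segment above describes verbally, RMSE = sqrt( sum((predicted - actual)^2) / n ), as a short NumPy sketch with toy numbers of our own:

    import numpy as np

    actual    = np.array([3.0, 5.0, 2.5, 7.0])
    predicted = np.array([2.8, 5.4, 2.9, 6.6])

    mse  = np.mean((predicted - actual) ** 2)  # mean squared error
    rmse = np.sqrt(mse)                        # root mean squared error
    print(mse, rmse)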
{'end': 34449.526, 'src': 'embed', 'start': 34405.998, 'weight': 1, 'content': [{'end': 34411.62, 'text': 'we end up with 0.68 or a 68% chance that it will rain on the weekend.', 'start': 34405.998, 'duration': 5.622}, {'end': 34416.683, 'text': 'And there are a couple other ways to solve this, but this is probably the most traditional way of doing that.', 'start': 34412.101, 'duration': 4.582}, {'end': 34425.97, 'text': 'How can you select k for k-means? So first you better understand what k-means is and that k is the number of different groupings.', 'start': 34417.143, 'duration': 8.827}, {'end': 34430.193, 'text': 'And most commonly we use the elbow method to select k for k-means.', 'start': 34426.25, 'duration': 3.943}, {'end': 34435.657, 'text': 'The idea of the elbow method is to run k-means clustering on the data set where k is the number of clusters.', 'start': 34430.513, 'duration': 5.144}, {'end': 34442.963, 'text': 'The within sum of squares, WSS, is defined as the sum of the squared distance between each member of the cluster and its centroid.', 'start': 34435.937, 'duration': 7.026}, {'end': 34446.104, 'text': 'And you should know all the terms for your k-means on there.', 'start': 34443.383, 'duration': 2.721}, {'end': 34449.526, 'text': "And with the elbow point, and again, here's our iteration in our code.", 'start': 34446.284, 'duration': 3.242}], 'summary': 'There is a 68% chance of rain on the weekend. The elbow method is commonly used to select k for k-means clustering.', 'duration': 43.528, 'max_score': 34405.998, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI34405998.jpg'}, {'end': 34498.316, 'src': 'embed', 'start': 34468.994, 'weight': 3, 'content': [{'end': 34471.776, 'text': 'What is the significance of p-value? Oh, good one.', 'start': 34468.994, 'duration': 2.782}, {'end': 34474.638, 'text': "Especially if you're dealing with R, because that's the first thing that pops up.", 'start': 34471.956, 'duration': 2.682}, {'end': 34483.345, 'text': 'A p-value typically less than or equal to 0.05 indicates strong evidence against the null hypothesis.', 'start': 34474.899, 'duration': 8.446}, {'end': 34487.668, 'text': 'And you should know why we use the null hypothesis instead of the hypothesis.', 'start': 34483.545, 'duration': 4.123}, {'end': 34490.03, 'text': 'So you reject the null hypothesis.', 'start': 34487.888, 'duration': 2.142}, {'end': 34496.174, 'text': 'Very important, that term null hypothesis, in any scientific setup and also in data science.', 'start': 34490.45, 'duration': 5.724}, {'end': 34498.316, 'text': "It doesn't mean that it's true.", 'start': 34496.574, 'duration': 1.742}], 'summary': 'A p-value of 0.05 or less indicates strong evidence against the null hypothesis in scientific and data science setups.', 'duration': 29.322, 'max_score': 34468.994, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI34468994.jpg'}, {'end': 34611.261, 'src': 'embed', 'start': 34581.804, 'weight': 4, 'content': [{'end': 34585.906, 'text': 'Data detected as outliers by a linear model can be fit by a nonlinear model.', 'start': 34581.804, 'duration': 4.102}, {'end': 34589.108, 'text': 'So be sure you are choosing the right model.', 'start': 34586.126, 'duration': 2.982}, {'end': 34595.251, 'text': 'So if it has more of a curved look to it instead of a straight line, you might need to use something other than just a straight-line linear model.', 'start': 34589.368, 'duration': 5.883}, {'end': 34596.652, 'text': 'Try normalizing the data.', 'start': 34595.491, 'duration': 1.161}, {'end': 34599.774, 'text': 'This way the extreme data points are pulled to a similar range.', 'start': 34596.992, 'duration': 2.782}, {'end': 34605.397, 'text': 'You can also use algorithms which are less affected by outliers, for example random forest.', 'start': 34600.114, 'duration': 5.283}, {'end': 34611.261, 'text': 'So another solution is to go with a random forest, which a lot of times completely bypasses your outliers.', 'start': 34605.597, 'duration': 5.664}], 'summary': 'Consider using a nonlinear model or normalization to handle extreme data points; a random forest algorithm can also be effective in bypassing outliers.', 'duration': 29.457, 'max_score': 34581.804, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI34581804.jpg'}, 
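A minimal sketch of the elbow-method loop described in the k-means segment above, assuming scikit-learn and some 2-D data X (random filler here):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(200, 2)                 # placeholder data

    wss = []                                   # within sum of squares for each k
    for k in range(1, 11):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        wss.append(km.inertia_)                # sum of squared distances to centroids
    # Plot k against wss and pick the "elbow" where the curve flattens out.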
{'end': 34647.455, 'src': 'embed', 'start': 34619.286, 'weight': 6, 'content': [{'end': 34625.367, 'text': 'We can say that a time series is stationary when the variance and mean of the series are constant with time.', 'start': 34619.286, 'duration': 6.081}, {'end': 34628.888, 'text': 'And this graphic example is very easy to see.', 'start': 34625.528, 'duration': 3.36}, {'end': 34635.87, 'text': 'The variance is constant with time, so we have our first variable y and x, with x being the time factor and y being the variable.', 'start': 34629.168, 'duration': 6.702}, {'end': 34638.751, 'text': 'As you can see, it goes through the same values all the time.', 'start': 34635.93, 'duration': 2.821}, {'end': 34641.552, 'text': "It's not changing over the long period of time.", 'start': 34638.931, 'duration': 2.621}, {'end': 34642.512, 'text': "So that's stationary.", 'start': 34641.632, 'duration': 0.88}, {'end': 34647.455, 'text': "And then you can see in the second example the waves get bigger and bigger, so that's non-stationary.", 'start': 34642.712, 'duration': 4.743}], 'summary': 'A time series is stationary if its variance and mean are constant with time, and non-stationary if they change.', 'duration': 28.169, 'max_score': 34619.286, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI34619286.jpg'}, {'end': 34794.455, 'src': 'embed', 'start': 34760.734, 'weight': 5, 'content': [{'end': 34764.237, 'text': 'Write the equation and calculate precision and recall rate.', 'start': 34760.734, 'duration': 3.503}, {'end': 34769.42, 'text': 'And so continuing with our confusion matrix, I was just talking about the different domains.', 'start': 34764.597, 'duration': 4.823}, {'end': 34773.022, 'text': 'We have the precision equals 262 over 277.', 'start': 34769.6, 'duration': 3.422}, {'end': 34777.925, 'text': 'So your precision is the true positive over the true positive plus false positive.', 'start': 34773.022, 'duration': 4.903}, {'end': 34782.328, 'text': 'And the recall rate is your true positive over the true positive plus false negative.', 'start': 34778.105, 'duration': 4.223}, {'end': 34783.909, 'text': 'And you can see here we have that 262 over 277 equals about 94.6%.', 'start': 34782.468, 'duration': 1.441}, {'end': 34794.455, 'text': 'And the recall over here is the 262 over 280, which equals about 0.936, or 93.6%.', 'start': 34783.909, 'duration': 10.546}], 'summary': 'Calculated precision is about 94.6% and the recall rate is about 93.6%.', 'duration': 33.721, 'max_score': 34760.734, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI34760734.jpg'}, {'end': 34894.786, 'src': 'embed', 'start': 34866.767, 'weight': 7, 'content': [{'end': 34870.309, 'text': 'A recommendation engine is done with collaborative filtering.', 'start': 34866.767, 'duration': 3.542}, {'end': 34876.673, 'text': 'Collaborative filtering exploits the behavior of other users and their purchase history in terms of ratings, selection, etc.', 'start': 34870.569, 'duration': 6.104}, {'end': 34881.517, 'text': 'It makes predictions on what might interest a person based on the preferences of many other users.', 'start': 34876.793, 'duration': 4.724}, {'end': 34884.679, 'text': 'In this algorithm, features of the items are not known.', 'start': 34881.777, 'duration': 2.902}, {'end': 34888.381, 'text': 'And we have a nice example here where they took a snapshot of a sales page.', 'start': 34884.799, 'duration': 3.582}, {'end': 34894.786, 'text': 'It says, for example, suppose X number of people buy a new phone and then also buy tempered glass with it.', 'start': 34888.422, 'duration': 6.364}], 'summary': 'A collaborative filtering recommendation engine uses user behavior to predict interests, illustrated with an example of phone and tempered glass purchases.', 'duration': 28.019, 'max_score': 34866.767, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI34866767.jpg'}, {'end': 34952.607, 'src': 'embed', 'start': 34924.252, 'weight': 8, 'content': [{'end': 34930.214, 'text': 'I remember back in the 90s it was so important to know SQL query and only a few people got it.', 'start': 34924.252, 'duration': 5.962}, {'end': 34932.114, 'text': "Nowadays it's just part of your kit.", 'start': 34930.474, 'duration': 1.64}, {'end': 34934.115, 'text': 'You have to know some basic SQL.', 'start': 34932.174, 'duration': 1.941}, {'end': 34938.956, 'text': 'So write a basic SQL query to list all orders with customer information.', 'start': 34934.215, 'duration': 4.741}, {'end': 34941.236, 'text': 'And you can kind of make up your own name for the database.', 'start': 34939.156, 'duration': 2.08}, {'end': 34943.577, 'text': 'And you can pause it here if you want to write that down on a paper.', 'start': 34941.496, 'duration': 2.081}, {'end': 34944.278, 'text': "and let's go ahead.", 'start': 34943.757, 'duration': 0.521}, {'end': 34944.718, 'text': 'look at this.', 'start': 34944.278, 'duration': 0.44}, {'end': 34952.607, 'text': 'we have to list all orders with customer information, and so usually you have an order table and a customer table and you have an order ID,', 'start': 34944.718, 'duration': 7.889}], 'summary': "In the 90s, SQL was rare; now it's essential. A basic SQL query lists all orders with customer info.", 'duration': 28.355, 'max_score': 34924.252, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI34924252.jpg'}, {'end': 35027.414, 'src': 'embed', 'start': 34997.201, 'weight': 9, 'content': [{'end': 35000.343, 'text': 'One of the standard data sets on there is for cancer detection.', 'start': 34997.201, 'duration': 3.142}, {'end': 35003.384, 'text': 'Cancer detection results in imbalanced data.', 'start': 35000.523, 'duration': 2.861}, {'end': 35005.225, 'text': 'In an imbalanced data set,', 'start': 35003.704, 'duration': 1.521}, {'end': 35011.668, 'text': 'accuracy should not be used as a measure of performance, because it is important to focus on the remaining 4%,', 'start': 35005.225, 'duration': 6.443}, {'end': 35014.149, 'text': 'which are the people who were wrongly diagnosed.', 'start': 35011.668, 'duration': 2.481}, {'end': 35015.809, 'text': 'We talked a little bit about this earlier.', 'start': 35014.469, 'duration': 1.34}, {'end': 35017.05, 'text': 'You have to know your domain.', 'start': 35015.849, 'duration': 1.201}, {'end': 35020.991, 'text': 'This is the medical cancer domain versus the weather domain.', 'start': 35017.61, 'duration': 3.381}, {'end': 35023.492, 'text': 'The Weather Channel, they can get by with 50% wrong.', 'start': 35021.271, 'duration': 2.221}, {'end': 35027.414, 'text': "In cancer, you don't want 4% of the people being wrongly diagnosed.", 'start': 35023.712, 'duration': 3.702}], 'summary': 'Imbalanced cancer detection data: accuracy is not a reliable measure; focus on the 4% misdiagnosed.', 'duration': 30.213, 'max_score': 34997.201, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI34997201.jpg'}, {'end': 35061.722, 'src': 'embed', 'start': 35038.447, 'weight': 10, 'content': [{'end': 35046.276, 'text': 'Which of the following machine learning algorithms can be used for imputing missing values of both categorical and continuous variables?', 'start': 35038.447, 'duration': 7.829}, {'end': 35048.017, 'text': 'And so we have a couple choices here.', 'start': 35046.736, 'duration': 1.281}, {'end': 35054.199, 'text': 'We have k-means clustering, we have linear regression, we have the k-NN, nearest neighbor, and decision tree.', 'start': 35048.057, 'duration': 6.142}, {'end': 35061.722, 'text': 'And which of the following machine learning algorithms can be used for imputing missing values of both categorical and continuous variables?', 'start': 35054.379, 'duration': 7.343}], 'summary': 'The question asks which algorithm can impute missing values of both categorical and continuous variables; the answer is k-NN, from choices including k-means clustering, linear regression, and decision tree.', 'duration': 23.275, 'max_score': 35038.447, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI35038447.jpg'}], 
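The orders-with-customer-information query from the SQL segment above, run end to end with sqlite3; the table and column names are made up, as the video invites you to do:

    import sqlite3

    conn = sqlite3.connect(':memory:')
    cur = conn.cursor()
    cur.execute('CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT)')
    cur.execute('CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)')
    cur.execute("INSERT INTO customer VALUES (1, 'Ada')")
    cur.execute('INSERT INTO orders VALUES (100, 1, 59.99)')

    # List all orders with customer information: join on the shared customer_id key.
    rows = cur.execute('SELECT o.order_id, o.amount, c.name '
                       'FROM orders AS o JOIN customer AS c '
                       'ON c.customer_id = o.customer_id').fetchall()
    print(rows)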
'start': 34289.537, 'title': 'Accuracy measures, probability, p-value, outliers, and model evaluation', 'summary': 'Covers calculating RMSE, MSE, and probability in linear regression, selecting k for k-means clustering, understanding p-value, treating outliers, and evaluating models with accuracy, precision, and recall rate. It also includes topics on time series data, collaborative filtering, SQL queries, and addressing imbalanced data in cancer detection.', 'chapters': [{'end': 34468.614, 'start': 34289.537, 'title': 'Measures of accuracy in linear regression & probability calculation', 'summary': 'Covers calculating RMSE and MSE in linear regression, probability calculation, and selecting k for k-means clustering.', 'duration': 179.077, 'highlights': ['RMSE and MSE are common measures of accuracy for a linear regression model, with RMSE being the square root of the mean of the squared differences between predicted and actual values.', 'Probability calculation reveals a 68% chance of rain on the weekend: with a 0.4 probability of no rain on Saturday and 0.8 of no rain on Sunday, the chance of rain is 1 - (0.4 x 0.8) = 0.68.', 'The elbow method is commonly used to select the number of clusters (k) for k-means clustering, by identifying the elbow point where the within sum of squares (WSS) drops and flattens out.']}, {'end': 35086.872, 'start': 34468.994, 'title': 'Understanding p-value, outlier treatment, and model evaluation', 'summary': 'Discusses the significance of p-value in hypothesis testing, treatment of outliers in data science models, and evaluation metrics like accuracy, precision, and recall rate, emphasizing the importance of domain knowledge. It also covers time series data stationarity, collaborative filtering for recommendation algorithms, basic SQL queries, and addressing imbalanced data in cancer detection models.', 'duration': 617.878, 'highlights': ['Explaining the significance of p-value in hypothesis testing: a p-value less than or equal to 0.05 indicates strong evidence against the null hypothesis, while a p-value greater than 0.05 indicates weak evidence against it.', 'Treatment of outliers in data science models: options include dropping outliers if they are garbage values, trying different models, using algorithms less affected by outliers, and normalizing the data.', 'Evaluating model performance using accuracy, precision, and recall rate: the calculation of accuracy using a confusion matrix is explained, along with the equations for precision and recall, highlighting the importance of considering domain-specific implications when interpreting these metrics.', 'Identifying stationary time series data: a stationary series exhibits constant variance and mean over time, in contrast with non-stationary data where the variance changes over time.', 'Utilizing collaborative filtering for recommendation algorithms: collaborative filtering relies on the behavior and preferences of other users to make predictions on what might interest a person.', 'Basic SQL query for listing orders with customer information: the query selects specific columns and uses a join operation between the order and customer tables.', 'Challenges of imbalanced data in cancer detection models: accuracy is a poor performance measure here, and the focus should be on the wrongly diagnosed cases in the minority class.', 'Choosing the right machine learning algorithm for imputing missing values: the k-nearest neighbor algorithm suits both categorical and continuous variables because it computes the nearest neighbors based on all other features.']}, {'end': 35486.759, 'start': 35087.032, 'title': 'Measuring time with matches and ropes', 'summary': 'Demonstrates different solutions for measuring a period of 45 minutes using a box of matches and two ropes, explores the entropy of the target variable, and discusses the appropriate algorithms for predicting death from heart disease and identifying similar user groups.', 'duration': 399.727, 'highlights': ['The chapter demonstrates different solutions for measuring a period of 45 minutes using a box of matches and two ropes, such as lighting one rope from both ends and the other rope from one end, and then lighting the remaining part of the second rope.', 'The chapter explores the entropy of the target variable. 
It discusses the calculation of entropy based on the values of the target variable, providing options for calculating entropy and explaining the key concepts of entropy in relation to the target and non-target variables (see the entropy sketch below).', 'The chapter discusses the appropriate algorithms for predicting death from heart disease and identifying similar user groups. It explains the most suitable algorithm, logistic regression, for predicting death from heart disease based on risk factors and highlights the use of k-means clustering for identifying similar user groups.']}], 'duration': 1197.222, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7WRlYJFG7YI/pics/7WRlYJFG7YI34289537.jpg', 'highlights': ['RMSE and MSE are common measures of accuracy for linear regression models.', 'Probability calculation reveals a 68% chance of rain over the weekend.', 'The elbow method is commonly used to select the number of clusters (k) for k-means clustering.', 'The significance of p-value in hypothesis testing is explained.', 'Methods for treating outliers in data science models are discussed.', 'Model performance is evaluated using accuracy, precision, and recall rate.', 'The concept of stationary time series data is defined.', 'Collaborative filtering is described for recommendation algorithms.', 'A basic SQL query for listing orders with customer information is provided.', 'Challenges of imbalanced data in cancer detection models are outlined.', 'The suitability of the k-nearest neighbor algorithm for imputing missing values is discussed.']}], 'highlights': ['The demand for data scientists is currently huge and the supply is very low, creating a significant gap in the industry.', 'Data scientists can expect median base salaries ranging from $95,000 to $165,000, highlighting the significant demand for skilled professionals in the field.', 'Data science is in high demand in industries such as gaming, healthcare, finance, marketing, and technology, globally.', 'Data warehousing skills include ETL, SQL, Hadoop, Spark, and tools like Informatica, DataStage, Talend, and AWS Redshift, for handling large amounts of structured and unstructured data.', 'The course covers comprehensive data science topics including data acquisition, preparation, mining, modeling, and maintenance, with practical implementation using Python and RStudio.', 'The chapter emphasizes the importance of exploratory data analysis in refining the selection of feature variables for model development.', 'Insights from data science enable the prediction of employee attrition and identification of key variables influencing turnover, providing valuable HR insights.', 'Data science plays a crucial role in the aviation industry by predicting flight delays, optimizing routes, and ensuring the proper selection of equipment, ultimately improving operational efficiency.', 'Curiosity, common sense, and communication skills are essential traits for a data scientist, crucial for problem-solving and for conveying results clearly.', 'The process involves importing libraries in Python such as NumPy and pandas for data manipulation and scikit-learn for linear regression analysis on data from CSV files for the years 2015, 2016, and 2017.', 'Exploratory data analysis is performed, including the creation of visualizations using Plotly to depict the correlation between happiness rank and happiness score, leading to the decision to drop the happiness rank.', 'The scikit-learn library provides a method, accuracy_score, for calculating accuracy, which confirms the 80% 
accuracy achieved through the manual calculation.', 'The system uses a clustering mechanism to group players based on their performance, such as runs scored and wickets taken, and identifies batsmen, bowlers, and all-rounders.', 'A decision tree is primarily used for classification and provides a logical way to classify inputs, with one of its biggest advantages being its ease of understanding.', 'The process of splitting data into training and test datasets is described, with variations in the ratio such as 50:50, 66.6:33.3, or 80:20 highlighted, emphasizing the importance of using unseen data for testing to measure model accuracy.', 'The process of building a linear regression model to predict the price of a 1.35 carat diamond is elaborated, including the training, testing, retraining, and deployment stages.', 'The chapter covers techniques such as data summary, visualization with histograms, and handling missing values, focusing on numerical columns.', 'The process involves importing data from a CSV file using the pandas read_csv method and displaying the first five rows using the head method for initial data exploration.', 'Univariate analysis involves visualizing and understanding the distribution of data in each column, such as creating histograms to identify extreme values.', 'The data is split into training and test sets in an 80/20 ratio, and a linear regression model trained on the training data predicts values for the test data with a low root mean square error and high accuracy.', 'The chapter covers the process of preparing and training data for machine learning, including steps such as data exploration, data splitting, data scaling, and model evaluation.', 'The chapter delves into the concept of the Naive Bayes Classifier and its application in text classification through Python coding, providing a practical example of its usage.', 'The likelihood of a purchase is 84.71% after normalization, given the specified conditions.', 'The chapter discusses the problem of measuring exactly 4 liters using 3- and 5-liter buckets, demonstrating the process of pouring water between the two buckets to achieve the desired measurement.', 'The expanding demand for data scientists includes smaller companies with around 100 employees, in addition to tech giants like Apple, Adobe, Google, and Microsoft.', 'The average base pay for data science engineers in the US is around $117,000, while in India it is approximately ₹950,000 per year.', 'The most common job title in data science is data scientist, followed by business intelligence manager.', 'The growth in data science job listings has seen a substantial increase from 2012 to 2014, indicating a thriving job market in the field.', 'The process of communicating machine learning results to stakeholders involves presenting actionable insights in the context of the problem statement and methodology, targeting the appropriate audience and ensuring clear communication in business terms.']}
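The interview answers summarized above translate directly into short worked examples. The sketches that follow are illustrative only: they use synthetic or hypothetical data rather than the course's own datasets. First, RMSE and MSE as defined in the accuracy highlight, computed with NumPy:

import numpy as np

# Hypothetical actual and predicted values, for illustration only.
actual = np.array([3.0, 5.0, 7.5, 10.0])
predicted = np.array([2.8, 5.4, 7.0, 9.6])

mse = np.mean((predicted - actual) ** 2)  # mean of the squared errors
rmse = np.sqrt(mse)                       # square root of the MSE
print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")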
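The weekend-rain answer is a one-line complement calculation; written out as an equation:

P(\text{rain over the weekend}) = 1 - P(\text{no rain Sat}) \cdot P(\text{no rain Sun}) = 1 - (0.4)(0.8) = 1 - 0.32 = 0.68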
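A minimal sketch of the elbow method on synthetic blob data; in scikit-learn, KMeans exposes the within sum of squares as the inertia_ attribute:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four true clusters, for illustration only.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit k-means for a range of k and record the within sum of squares (WSS).
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, round(km.inertia_, 1))  # WSS drops sharply until k=4, then flattens out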
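A minimal sketch of the p-value rule of thumb, using a two-sample t-test from SciPy on synthetic samples (the group means and sizes are made up):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=5, size=30)  # synthetic control group
group_b = rng.normal(loc=53, scale=5, size=30)  # synthetic treatment group

t_stat, p_value = stats.ttest_ind(group_a, group_b)
if p_value <= 0.05:
    print(f"p = {p_value:.4f}: strong evidence against the null hypothesis")
else:
    print(f"p = {p_value:.4f}: weak evidence against the null hypothesis")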
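Accuracy, precision, and recall from a confusion matrix, with hypothetical counts. Recall is the metric to watch in imbalanced problems such as cancer detection, where a missed positive in the minority class is the costly error:

# Hypothetical confusion-matrix counts for a binary classifier.
tp, fp, fn, tn = 50, 10, 5, 100

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are real
recall = tp / (tp + fn)                     # share of real positives that are caught
print(f"accuracy={accuracy:.2%}  precision={precision:.2%}  recall={recall:.2%}")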
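A sketch contrasting stationary and non-stationary series. The course segment defines stationarity informally (constant mean and variance over time); the augmented Dickey-Fuller test from statsmodels, used here, is one standard way to check it and is an addition of this sketch, not part of the original answer:

import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(1)
stationary = rng.normal(size=200)       # constant mean and variance over time
random_walk = np.cumsum(stationary)     # non-stationary: variance grows over time

for name, series in [("stationary", stationary), ("random walk", random_walk)]:
    p_value = adfuller(series)[1]  # second element of the result is the p-value
    print(f"{name}: ADF p-value = {p_value:.4f}")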
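A toy user-based collaborative-filtering sketch. The rating matrix, user indices, and the cosine-similarity choice are all hypothetical, but the sketch shows the core idea named in the highlight: predicting what might interest a person from the preferences of similar users:

import numpy as np

# Hypothetical user-item rating matrix (rows: users, columns: items; 0 = unrated).
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine(u, v):
    # Cosine similarity between two rating vectors (zeros included, for simplicity).
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Predict user 0's rating for item 2 as a similarity-weighted average
# over the other users who have rated that item.
target, item = 0, 2
raters = [u for u in range(len(R)) if u != target and R[u, item] > 0]
sims = np.array([cosine(R[target], R[u]) for u in raters])
ratings = np.array([R[u, item] for u in raters])
print(f"predicted rating: {(sims @ ratings) / sims.sum():.2f}")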
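The orders-with-customers query, run end to end against an in-memory SQLite database. The table and column names here are hypothetical stand-ins for the ones used in the course:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'Asha'), (2, 'Ben');
    INSERT INTO orders VALUES (10, 1, 99.0), (11, 2, 45.5);
""")

# List all orders together with customer information via a join.
rows = conn.execute("""
    SELECT o.order_id, c.name, o.amount
    FROM orders o
    JOIN customer c ON o.customer_id = c.customer_id
""").fetchall()
print(rows)  # [(10, 'Asha', 99.0), (11, 'Ben', 45.5)]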
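A minimal sketch of k-nearest-neighbor imputation with scikit-learn's KNNImputer, on a hypothetical numeric matrix (categorical variables would need encoding first):

import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical feature matrix with missing entries marked as np.nan.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry is filled in from the k nearest rows,
# measured on the features observed in both rows.
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))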
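Finally, the entropy of a target variable, computed from its class proportions; the example labels are made up:

import numpy as np

def entropy(labels):
    # Shannon entropy in bits: -sum(p * log2(p)) over the class proportions.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# A 50/50 split of the target variable gives the maximum entropy of 1 bit.
print(entropy(["yes"] * 8 + ["no"] * 8))  # 1.0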