title
Machine Learning with R | Machine Learning with caret

description
Learn how R and the caret package can help implement some of the most common tasks of the data science project lifecycle. The R programming language is experiencing rapid increases in popularity and wide adoption across industries. This popularity is due, in part, to R's huge collection of open-source machine-learning algorithms. If you are a data scientist working with R, the caret package (short for Classification And Regression Training) is a must-have tool in your tool belt. The caret package provides capabilities that are ubiquitous in all stages of the data science project lifecycle. Most important of all, caret provides a common interface for training, tuning, and evaluating more than 200 machine learning algorithms. Not surprisingly, caret is a surefire way to accelerate your velocity as a data scientist!

In this presentation, we provide an introduction to the caret package. The focus is on using caret to implement some of the most common tasks of the data science project lifecycle and to illustrate how to incorporate caret into your daily work. Attendees will learn how to:
• Create stratified random samples of data useful for training machine learning models.
• Train machine learning models using caret's common interface.
• Leverage caret's powerful features for cross-validation and hyperparameter tuning.
• Scale caret via the use of multi-core, parallel training.
• Increase their knowledge of caret's many features.

R code and accompanying dataset: https://code.datasciencedojo.com/datasciencedojo/tutorials/tree/master/Introduction%20to%20Machine%20Learning%20with%20R%20and%20Caret
caret website: http://topepo.github.io/caret/index.html

Table of Contents:
0:00 – Intro
3:24 – Motivation
5:07 – Expectation setting
9:23 – The data
11:57 – Caret
1:18:46 – Resources
--
For more captivating community talks featuring renowned speakers, check out this playlist: https://youtube.com/playlist?list=PL8eNk_zTBST-EBv2LDSW9Wx_V4Gy5OPFT
--
At Data Science Dojo, we believe data science is for everyone. Our data science trainings have been attended by more than 10,000 employees from over 2,500 companies globally, including many leaders in tech like Microsoft, Google, and Facebook. For more information, please visit: https://hubs.la/Q01Z-13k0
💼 Learn to build LLM-powered apps in just 40 hours with our Large Language Models bootcamp: https://hubs.la/Q01ZZGL-0
💼 Get started in the world of data with our top-rated data science bootcamp: https://hubs.la/Q01ZZDpt0
💼 Master Python for data science, analytics, machine learning, and data engineering: https://hubs.la/Q01ZZD-s0
💼 Explore, analyze, and visualize your data with Power BI desktop: https://hubs.la/Q01ZZF8B0
--
Unleash your data science potential for FREE! Dive into our tutorials, events & courses today!
📚 Learn the essentials of data science and analytics with our data science tutorials: https://hubs.la/Q01ZZJJK0
📚 Stay ahead of the curve with the latest data science content, subscribe to our newsletter now: https://hubs.la/Q01ZZBy10
📚 Connect with other data scientists and AI professionals at our community events: https://hubs.la/Q01ZZLd80
📚 Check out our free data science courses: https://hubs.la/Q01ZZMcm0
📚 Get your daily dose of data science with our trending blogs: https://hubs.la/Q01ZZMWl0
--
📱 Social media links
Connect with us: https://www.linkedin.com/company/data-science-dojo
Follow us: https://twitter.com/DataScienceDojo
Keep up with us: https://www.instagram.com/data_science_dojo/
Like us: https://www.facebook.com/datasciencedojo
Find us: https://www.threads.net/@data_science_dojo
--
Also, join our communities:
LinkedIn: https://www.linkedin.com/groups/13601597/
Twitter: https://twitter.com/i/communities/1677363761399865344
Facebook: https://www.facebook.com/groups/AIandMachineLearningforEveryone/
Vimeo: https://vimeo.com/datasciencedojo
Discord: https://discord.com/invite/tj8ken4Err
--
Want to share your data science knowledge? Boost your profile and share your knowledge with our community: https://hubs.la/Q01ZZNCn0

#machinelearning #rprogramming #caret
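
To make the learning goals above concrete, the following is a minimal, illustrative R sketch of the workflow the talk walks through: a stratified split with createDataPartition, cross-validated training and tuning through caret's common train() interface, and multi-core parallel training with doSNOW. It assumes the Kaggle Titanic training data has already been loaded into a data frame named `titanic` with a factor label column `Survived`; the object names, seed, split ratio, and cluster size are placeholder choices rather than the exact code from the linked repository.

```r
# Minimal caret workflow sketch (assumes a data frame `titanic` with a factor column `Survived`).
library(caret)
library(doSNOW)

set.seed(54321)

# Stratified 70/30 split that preserves the Survived class proportions.
index <- createDataPartition(titanic$Survived, times = 1, p = 0.7, list = FALSE)
titanic.train <- titanic[index, ]
titanic.test  <- titanic[-index, ]

# 10-fold cross-validation, repeated 3 times, with a grid search over tuning parameters.
train.control <- trainControl(method = "repeatedcv", number = 10,
                              repeats = 3, search = "grid")

# Register a doSNOW cluster so caret trains the resampling folds in parallel
# (adjust the number of workers to your machine).
cl <- makeCluster(4, type = "SOCK")
registerDoSNOW(cl)

# caret's common interface: swapping `method` (e.g., "rf", "glm") trains a
# different algorithm without changing any other code.
model <- train(Survived ~ ., data = titanic.train,
               method = "xgbTree", trControl = train.control)

stopCluster(cl)

# Evaluate on the held-out 30%.
preds <- predict(model, titanic.test)
confusionMatrix(preds, titanic.test$Survived)
```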

detail
{'title': 'Machine Learning with R | Machine Learning with caret', 'heatmap': [{'end': 1194.78, 'start': 1125.886, 'weight': 0.779}, {'end': 2672.324, 'start': 2492.174, 'weight': 0.707}, {'end': 4044.277, 'start': 3916.002, 'weight': 0.924}, {'end': 4217.79, 'start': 4094.835, 'weight': 0.872}, {'end': 4453.966, 'start': 4334.63, 'weight': 0.729}], 'summary': "Covers machine learning with R and caret: the Titanic dataset's relevance in data science, the caret package's functionality, data wrangling, imputation with caret, model enhancement, feature engineering, data visualization, and model interpretation. It achieves around 85% accuracy in predicting survival and discusses regression metrics and XGBoost's impact on machine learning models.", 'chapters': [{'end': 499.444, 'segs': [{'end': 60.278, 'src': 'embed', 'start': 38.727, 'weight': 0, 'content': [{'end': 49.432, 'text': "where I managed a team of technical program managers that had responsibility for all of the data platforms used to run Microsoft's $10 billion plus supply chain operation.", 'start': 38.727, 'duration': 10.705}, {'end': 56.576, 'text': "Like probably a lot of you, I'm not formally trained in what now is called data science.", 'start': 51.753, 'duration': 4.823}, {'end': 60.278, 'text': 'I have a background in software engineering and computer science,', 'start': 56.956, 'duration': 3.322}], 'summary': 'Led team managing data platforms for $10b+ supply chain operation, with background in software engineering.', 'duration': 21.551, 'max_score': 38.727, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/z8PRU46I3NY/pics/z8PRU46I3NY38727.jpg'}, {'end': 146.898, 'src': 'embed', 'start': 118.743, 'weight': 2, 'content': [{'end': 122.164, 'text': "So what am I going to do? I love data science, but I'm not going to get the formal education.", 'start': 118.743, 'duration': 3.421}, {'end': 124.365, 'text': 'So my philosophy was well.', 'start': 122.404, 'duration': 1.961}, {'end': 130.467, 'text': "I will learn what I can and I will apply it to my daily job and I'll derive business value even though I don't have those credentials.", 'start': 124.365, 'duration': 6.102}, {'end': 133.547, 'text': 'Data Science Dojo has exactly the same philosophy.', 'start': 131.087, 'duration': 2.46}, {'end': 137.268, 'text': 'This is a very consulting term, democratize.', 'start': 134.888, 'duration': 2.38}, {'end': 139.809, 'text': 'But down here is probably more important.', 'start': 138.029, 'duration': 1.78}, {'end': 142.87, 'text': 'Mission statement of Data Science Dojo is data science for everyone.', 'start': 140.149, 'duration': 2.721}, {'end': 146.898, 'text': 'So it was a good meshing of the two philosophies, which is why I came here.', 'start': 143.995, 'duration': 2.903}], 'summary': 'Data science dojo shares philosophy of democratizing data science for everyone.', 'duration': 28.155, 'max_score': 118.743, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/z8PRU46I3NY/pics/z8PRU46I3NY118743.jpg'}, {'end': 217.524, 'src': 'embed', 'start': 187.561, 'weight': 1, 'content': [{'end': 191.805, 'text': 'R has experienced rapid year-over-year increases in popularity.', 'start': 187.561, 'duration': 4.244}, {'end': 199.656, 'text': "Now, some of you may say, fifth place, Dave? You know, wow, who cares? 
But this is what's interesting.", 'start': 193.346, 'duration': 6.31}, {'end': 207.52, 'text': "If you don't know a lot about programming languages, just let me tell you that of these top six languages, this one is very different than the others.", 'start': 199.737, 'duration': 7.783}, {'end': 217.524, 'text': 'This is remarkable, I would argue, because R was originally built by statisticians for statisticians to do data analysis.', 'start': 210.181, 'duration': 7.343}], 'summary': 'R has seen rapid year-over-year growth, reaching fifth place among programming languages, notable for its unique focus on data analysis.', 'duration': 29.963, 'max_score': 187.561, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/z8PRU46I3NY/pics/z8PRU46I3NY187561.jpg'}], 'start': 0.674, 'title': 'Machine learning with r and care', 'summary': "Introduces machine learning with r and care, emphasizing the speaker's extensive experience in technology, the growing popularity of r in data science, its rapid increase in popularity, and its unique focus on data analysis.", 'chapters': [{'end': 159.829, 'start': 0.674, 'title': 'Introduction to machine learning with r and care', 'summary': "Introduces machine learning with r and care, emphasizing the speaker's extensive experience in technology and data, his journey into data science, and the mission of data science dojo to make data science accessible to everyone.", 'duration': 159.155, 'highlights': ["The speaker, Dave Langer, has over 20 years of experience in technology, with roles in software engineering, data warehousing, analytics, and management, including managing a team responsible for Microsoft's $10 billion plus supply chain operation.", "The speaker's interest in data science was sparked five years ago during his master's degree, when he discovered the potential of using existing data assets for predictive analysis, leading him to spend extensive time learning about data science.", "The philosophy of Data Science Dojo aligns with the speaker's approach to democratize data science and make it accessible to everyone, emphasizing the mission of 'data science for everyone' and providing tutorials on YouTube channels to support this mission."]}, {'end': 499.444, 'start': 160.249, 'title': "R's popularity in data science", 'summary': 'Discusses the growing popularity of r in data science, highlighting its rapid increase in popularity, its unique focus on data analysis, and its surpassing of c sharp in popularity, as well as setting expectations for the audience and introducing the carrot package for machine learning in r.', 'duration': 339.195, 'highlights': ['R has experienced rapid year-over-year increases in popularity, surpassing C sharp in popularity, and is now more popular than C sharp. R has seen rapid growth in popularity, surpassing C sharp and becoming more popular than a strategic language like C sharp.', 'R is a language specifically designed for data analysis, unlike the other top programming languages, which are general purpose programming languages. R stands out as a language designed solely for data analysis, unlike the other top programming languages that are primarily general-purpose.', 'Setting expectations for the audience, assuming a certain level of R and machine learning knowledge, and introducing the Carrot package for machine learning in R. 
The presenter sets expectations for the audience, assuming a certain level of R and machine learning expertise, and introduces the Carrot package for machine learning in R.']}], 'duration': 498.77, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/z8PRU46I3NY/pics/z8PRU46I3NY674.jpg', 'highlights': ["Dave Langer has over 20 years of experience in technology, including managing a team responsible for Microsoft's $10 billion plus supply chain operation.", 'R has experienced rapid year-over-year increases in popularity, surpassing C sharp and becoming more popular than a strategic language like C sharp.', "The philosophy of Data Science Dojo aligns with the speaker's approach to democratize data science and make it accessible to everyone."]}, {'end': 975.205, 'segs': [{'end': 534.354, 'src': 'embed', 'start': 502.267, 'weight': 0, 'content': [{'end': 504.429, 'text': 'And as I mentioned, the GitHub repo has all the stuff.', 'start': 502.267, 'duration': 2.162}, {'end': 514.376, 'text': "Data So for tonight, we're going to use the Titanic dataset from Kaggle's website.", 'start': 508.172, 'duration': 6.204}, {'end': 517.609, 'text': 'And we use this data set a lot at Data Science Dojo.', 'start': 515.168, 'duration': 2.441}, {'end': 520.89, 'text': 'And the reason why we use it is actually pretty simple.', 'start': 518.389, 'duration': 2.501}, {'end': 526.671, 'text': "First and foremost, it's a safe bet that everyone is familiar with the problem domain.", 'start': 521.99, 'duration': 4.681}, {'end': 529.712, 'text': "And I've done this before, but I'm going to do it again.", 'start': 528.312, 'duration': 1.4}, {'end': 534.354, 'text': "Raise your hand if you're not familiar with the Titanic and what happened on the Titanic.", 'start': 530.433, 'duration': 3.921}], 'summary': 'Using the titanic dataset from kaggle for a familiar problem domain at data science dojo.', 'duration': 32.087, 'max_score': 502.267, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/z8PRU46I3NY/pics/z8PRU46I3NY502267.jpg'}, {'end': 594.533, 'src': 'embed', 'start': 568.012, 'weight': 1, 'content': [{'end': 572.676, 'text': "A lot of the characteristics that you would see in that data, you're going to see in the Titanic data set.", 'start': 568.012, 'duration': 4.664}, {'end': 578.965, 'text': "Nice thing about Titanic data set is it's actually relatively small, but it exhibits a lot of the same problems, a lot of the same characteristics.", 'start': 573.823, 'duration': 5.142}, {'end': 581.967, 'text': "That's why it's a very, very useful data set.", 'start': 579.286, 'duration': 2.681}, {'end': 589.17, 'text': 'Also, as it turns out, predicting Titanic survival correctly is actually quite a difficult problem just based on the data set that you have.', 'start': 582.827, 'duration': 6.343}, {'end': 591.431, 'text': "So it's also good for that reason as well.", 'start': 589.871, 'duration': 1.56}, {'end': 594.533, 'text': "Okay So here's the data.", 'start': 593.212, 'duration': 1.321}], 'summary': 'The titanic data set exhibits similar problems and characteristics, making it very useful and difficult for predicting survival.', 'duration': 26.521, 'max_score': 568.012, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/z8PRU46I3NY/pics/z8PRU46I3NY568012.jpg'}, {'end': 840.898, 'src': 'embed', 'start': 811.635, 'weight': 2, 'content': [{'end': 815.119, 'text': 'They have different naming conventions and it gets kind of problematic.', 'start': 
811.635, 'duration': 3.484}, {'end': 819.283, 'text': "There's a lot of inertia to understand how to work with all these packages.", 'start': 815.199, 'duration': 4.084}, {'end': 826.811, 'text': 'So Carrot actually wraps more than 200 different machine learning algorithms and provides a common interface.', 'start': 819.844, 'duration': 6.967}, {'end': 830.335, 'text': 'Literally, you can write your code.', 'start': 828.533, 'duration': 1.802}, {'end': 837.035, 'text': "in a nice, neat way and change one little thing, which you can actually parameterize if you'd like from the command line or whatever.", 'start': 830.99, 'duration': 6.045}, {'end': 840.898, 'text': "And all of a sudden you're not making a logistic regression model anymore.", 'start': 837.415, 'duration': 3.483}], 'summary': 'Carrot wraps 200+ ml algorithms, allowing easy code modification and parameterization.', 'duration': 29.263, 'max_score': 811.635, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/z8PRU46I3NY/pics/z8PRU46I3NY811635.jpg'}], 'start': 502.267, 'title': 'Titanic dataset and carat for machine learning', 'summary': 'Discusses the significance of the titanic dataset in data science education and its relevance as a proxy for common business data. it also highlights the carat package for machine learning in r, which encompasses over 200 machine learning algorithms and provides functionality for data splitting, sampling, feature selection, and model tuning.', 'chapters': [{'end': 548.477, 'start': 502.267, 'title': 'Titanic dataset for data analysis', 'summary': "Discusses the use of the titanic dataset from kaggle's website and its significance in data science education, emphasizing its familiarity and widespread use.", 'duration': 46.21, 'highlights': ["The Titanic dataset from Kaggle's website is frequently used at Data Science Dojo, as it is a safe bet that everyone is familiar with the problem domain.", 'The familiarity with the Titanic domain aids in the understanding of the dataset, making it a popular choice for educational purposes.', 'The use of the Titanic dataset is emphasized due to its widespread recognition and understanding, contributing to its significance in data science education.']}, {'end': 975.205, 'start': 548.497, 'title': 'Titanic data and carat for machine learning in r', 'summary': 'Introduces the titanic dataset, highlighting its relevance as a good proxy for common business data and the challenges of predicting titanic survival. it also discusses the carat package for machine learning in r, emphasizing its capability to accelerate machine learning work, wrapping more than 200 machine learning algorithms, providing functionality for data splitting and sampling, wrapper functions for feature selection, and model tuning.', 'duration': 426.708, 'highlights': ['The Titanic dataset is a good proxy for common business data and exhibits similar problems and characteristics, making it very useful for analytics and machine learning perspectives. Relevance of Titanic dataset as a proxy for common business data, exhibiting similar problems and characteristics.', 'Predicting Titanic survival correctly is a difficult problem based on the dataset. Challenges in predicting Titanic survival based on the dataset.', "CARAT is a package designed to accelerate machine learning work in R, becoming the de facto standard package with extensive features and capability, wrapping over 200 different machine learning algorithms, and providing a common interface for code consistency. 
CARAT's role in accelerating machine learning work in R, becoming the de facto standard package, wrapping over 200 machine learning algorithms, and providing a common interface for code consistency.", "CARAT offers functionality for data splitting and sampling, feature selection, and model tuning, providing capabilities for stratified random samples, downsampling, upsampling, and synthesizing data for class imbalance problems. CARAT's functionality for data splitting, sampling, feature selection, and model tuning."]}], 'duration': 472.938, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/z8PRU46I3NY/pics/z8PRU46I3NY502267.jpg', 'highlights': ['The Titanic dataset is frequently used at Data Science Dojo due to its widespread recognition and understanding, making it a popular choice for educational purposes.', 'The Titanic dataset is a good proxy for common business data, exhibiting similar problems and characteristics, making it very useful for analytics and machine learning perspectives.', 'CARAT is a package designed to accelerate machine learning work in R, wrapping over 200 different machine learning algorithms and providing a common interface for code consistency.']}, {'end': 1394.635, 'segs': [{'end': 1032.978, 'src': 'embed', 'start': 977.415, 'weight': 0, 'content': [{'end': 980.377, 'text': "First, I'm going to shut down my Outlook because nobody cares.", 'start': 977.415, 'duration': 2.962}, {'end': 986.102, 'text': 'Okay So here we have the code file.', 'start': 984.061, 'duration': 2.041}, {'end': 992.147, 'text': "So first up, I already have all the packages installed, so I don't need to run this line of code here.", 'start': 987.864, 'duration': 4.283}, {'end': 996.291, 'text': "So I'll just load up carrot and do snow.", 'start': 993.108, 'duration': 3.183}, {'end': 1000.815, 'text': 'As I mentioned earlier, do snow will allow us to do training in parallel, which is super awesome.', 'start': 996.451, 'duration': 4.364}, {'end': 1005.759, 'text': "Next up, I'm just going to read in the Titanic data.", 'start': 1003.497, 'duration': 2.262}, {'end': 1009.15, 'text': 'And it opened up in R in the spreadsheet view.', 'start': 1006.229, 'duration': 2.921}, {'end': 1014.732, 'text': "Now there's a couple things that you're going to notice first and foremost.", 'start': 1012.531, 'duration': 2.201}, {'end': 1021.934, 'text': 'One is that passenger ID and name were not listed in the data dictionary that we talked about in the slide deck.', 'start': 1016.092, 'duration': 5.842}, {'end': 1023.635, 'text': "There's a couple reasons for that.", 'start': 1022.575, 'duration': 1.06}, {'end': 1029.156, 'text': 'One is passenger ID is useless.', 'start': 1025.315, 'duration': 3.841}, {'end': 1032.978, 'text': 'In fact, passenger ID is actually worse than useless.', 'start': 1030.877, 'duration': 2.101}], 'summary': 'Data analysis process using r, including package installation and parallel training using do snow.', 'duration': 55.563, 'max_score': 977.415, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/z8PRU46I3NY/pics/z8PRU46I3NY977415.jpg'}, {'end': 1106.11, 'src': 'embed', 'start': 1078.872, 'weight': 3, 'content': [{'end': 1084.276, 'text': 'I know now that anytime I got a model that was that high, I should be scared.', 'start': 1078.872, 'duration': 5.404}, {'end': 1088.78, 'text': 'My data science spidey sense should be going, boop, boop, boop, boop, boop.', 'start': 1084.837, 'duration': 3.943}, {'end': 1093.324, 'text': "Dave, 
what'd you do wrong? I fat-fingered the keyboard and left passenger ID in.", 'start': 1088.82, 'duration': 4.504}, {'end': 1097.862, 'text': 'I thought it was all my feature engineering work that got me the a hundred percent, but no,', 'start': 1093.979, 'duration': 3.883}, {'end': 1102.386, 'text': "I left this in and the model said I don't need any other data, Dave.", 'start': 1097.862, 'duration': 4.524}, {'end': 1106.11, 'text': 'I know that passenger ID four survived.', 'start': 1103.067, 'duration': 3.043}], 'summary': 'High model accuracy led to realization of unintended input inclusion.', 'duration': 27.238, 'max_score': 1078.872, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/z8PRU46I3NY/pics/z8PRU46I3NY1078872.jpg'}, {'end': 1194.78, 'src': 'heatmap', 'start': 1125.886, 'weight': 0.779, 'content': [{'end': 1130.388, 'text': 'Terrible model because obviously these IDs are unique in time and space.', 'start': 1125.886, 'duration': 4.502}, {'end': 1132.508, 'text': "So new data won't have these IDs.", 'start': 1130.628, 'duration': 1.88}, {'end': 1134.129, 'text': "So my model wouldn't know what to do with them.", 'start': 1132.748, 'duration': 1.381}, {'end': 1135.869, 'text': 'So it sucked basically.', 'start': 1134.529, 'duration': 1.34}, {'end': 1144.612, 'text': "Yeah Now the other reason why we don't include name is because name is complicated.", 'start': 1137.01, 'duration': 7.602}, {'end': 1146.713, 'text': 'Name is complicated.', 'start': 1145.913, 'duration': 0.8}, {'end': 1154.965, 'text': 'If I want to get any sort of information out of this name field that my machine learning model is going to need and use,', 'start': 1147.473, 'duration': 7.492}, {'end': 1156.105, 'text': "I'm going to need text analytics.", 'start': 1154.965, 'duration': 1.14}, {'end': 1161.326, 'text': 'Because every name in the data set is unique.', 'start': 1157.185, 'duration': 4.141}, {'end': 1163.607, 'text': "So it's essentially the same thing as passenger ID.", 'start': 1161.546, 'duration': 2.061}, {'end': 1167.647, 'text': 'Name as it currently is in the data right now is useless.', 'start': 1164.267, 'duration': 3.38}, {'end': 1169.108, 'text': "It's worse than useless.", 'start': 1167.727, 'duration': 1.381}, {'end': 1170.868, 'text': "So that's why we left them out.", 'start': 1170.048, 'duration': 0.82}, {'end': 1175.469, 'text': "So if you're interested in text analytics, check out the Data Science Dojo YouTube channel.", 'start': 1171.688, 'duration': 3.781}, {'end': 1176.849, 'text': 'We got a new free tutorial on that.', 'start': 1175.489, 'duration': 1.36}, {'end': 1179.99, 'text': 'OK Looking at the rest of the data.', 'start': 1178.27, 'duration': 1.72}, {'end': 1183.577, 'text': 'You can see sex is essentially a string, male and female.', 'start': 1180.816, 'duration': 2.761}, {'end': 1184.637, 'text': "We'll need to take care of that.", 'start': 1183.597, 'duration': 1.04}, {'end': 1186.238, 'text': 'But check out age.', 'start': 1185.317, 'duration': 0.921}, {'end': 1190.619, 'text': "Notice we're missing values in age.", 'start': 1189.179, 'duration': 1.44}, {'end': 1194.78, 'text': "That's going to be a problem, right? 
We're missing values in age.", 'start': 1191.879, 'duration': 2.901}], 'summary': 'Model is flawed due to unique ids and complicated name field; missing values in age pose a problem.', 'duration': 68.894, 'max_score': 1125.886, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/z8PRU46I3NY/pics/z8PRU46I3NY1125886.jpg'}, {'end': 1316.867, 'src': 'embed', 'start': 1253.948, 'weight': 4, 'content': [{'end': 1256.69, 'text': "For our purposes tonight, we're going to use the rest of the data only.", 'start': 1253.948, 'duration': 2.742}, {'end': 1260.332, 'text': 'Name is too complicated for us to use.', 'start': 1258.771, 'duration': 1.561}, {'end': 1262.113, 'text': 'Passenger ID is worse than useless.', 'start': 1260.392, 'duration': 1.721}, {'end': 1263.714, 'text': "Ticket's too complicated.", 'start': 1262.614, 'duration': 1.1}, {'end': 1265.015, 'text': "Cabin doesn't have a lot of data in it.", 'start': 1263.794, 'duration': 1.221}, {'end': 1265.856, 'text': "So we'll just throw them out.", 'start': 1265.035, 'duration': 0.821}, {'end': 1269.738, 'text': "We'll just throw them out.", 'start': 1267.096, 'duration': 2.642}, {'end': 1274.222, 'text': 'Cool All right, moving on.', 'start': 1269.758, 'duration': 4.464}, {'end': 1281.367, 'text': "Okay, so first up, we're going to do some light data wrangling.", 'start': 1278.225, 'duration': 3.142}, {'end': 1287.49, 'text': 'So if I run the table on the embark column, I get the result down here.', 'start': 1282.347, 'duration': 5.143}, {'end': 1288.891, 'text': "Oh, that's not good.", 'start': 1288.111, 'duration': 0.78}, {'end': 1291.533, 'text': 'My apologies.', 'start': 1290.872, 'duration': 0.661}, {'end': 1299.718, 'text': "Let's go ahead and spread this out so everybody can actually see it a little better.", 'start': 1291.613, 'duration': 8.105}, {'end': 1307.965, 'text': 'Notice that I actually have two blanks in my data.', 'start': 1302.443, 'duration': 5.522}, {'end': 1309.565, 'text': 'Two blanks in my data.', 'start': 1308.825, 'duration': 0.74}, {'end': 1316.867, 'text': 'So this is going to be the first example of what is known in machine learning as imputation.', 'start': 1311.366, 'duration': 5.501}], 'summary': 'Data wrangling and imputation in machine learning.', 'duration': 62.919, 'max_score': 1253.948, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/z8PRU46I3NY/pics/z8PRU46I3NY1253948.jpg'}], 'start': 977.415, 'title': 'Parallelized training and data wrangling', 'summary': "Covers the process of parallelized training using 'do snow' and loading titanic data in r, highlighting the uselessness of passenger id and its exclusion from the data dictionary. 
it also highlights the mistakes made by a young data scientist in building a model for the titanic competition, the importance of excluding certain features, and the process of data wrangling including imputation with the embark column.", 'chapters': [{'end': 1032.978, 'start': 977.415, 'title': 'Parallelized training with titanic data', 'summary': "Covers the process of parallelized training using 'do snow' and loading titanic data in r, highlighting the uselessness of passenger id and its exclusion from the data dictionary.", 'duration': 55.563, 'highlights': ["The 'do snow' package enables parallel training, enhancing efficiency and speed.", 'Excluding passenger ID from the data dictionary is justified due to its uselessness.', 'Loading the Titanic data in R provides an initial view for analysis and manipulation.']}, {'end': 1394.635, 'start': 1034.278, 'title': 'Data wrangling and model mistakes', 'summary': 'Highlights the mistakes made by a young data scientist in building a model for the titanic competition, the importance of excluding certain features, and the process of data wrangling including imputation with the embark column.', 'duration': 360.357, 'highlights': ["The young data scientist's initial model for the Titanic competition was 100% accurate due to a mistake of leaving passenger ID in, highlighting the importance of being cautious with high-performing models. Initial model for the Titanic competition was 100% accurate due to leaving passenger ID in, stressing the importance of being cautious with high-performing models.", 'Excluding features like name, ticket, and cabin due to complexity and lack of useful data, emphasizing the need to carefully select relevant features for model building. Excluded features like name, ticket, and cabin due to complexity and lack of useful data, emphasizing the need to carefully select relevant features for model building.', 'Introduction to the concept of imputation in data wrangling, with a specific example of replacing missing values in the embark column based on the majority of values, highlighting the importance of handling missing data. 
Introduction to the concept of imputation in data wrangling, with a specific example of replacing missing values in the embark column based on the majority of values, highlighting the importance of handling missing data.']}], 'duration': 417.22, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/z8PRU46I3NY/pics/z8PRU46I3NY977415.jpg', 'highlights': ["The 'do snow' package enables parallel training, enhancing efficiency and speed.", 'Excluding passenger ID from the data dictionary is justified due to its uselessness.', 'Loading the Titanic data in R provides an initial view for analysis and manipulation.', "The young data scientist's initial model for the Titanic competition was 100% accurate due to leaving passenger ID in, stressing the importance of being cautious with high-performing models.", 'Excluding features like name, ticket, and cabin due to complexity and lack of useful data, emphasizing the need to carefully select relevant features for model building.', 'Introduction to the concept of imputation in data wrangling, with a specific example of replacing missing values in the embark column based on the majority of values, highlighting the importance of handling missing data.']}, {'end': 1773.455, 'segs': [{'end': 1546.772, 'src': 'embed', 'start': 1475.51, 'weight': 0, 'content': [{'end': 1477.17, 'text': 'R is telling me I am missing 177 age values.', 'start': 1475.51, 'duration': 1.66}, {'end': 1485.85, 'text': 'Now notice up here, I only have 891 rows in my dataset to begin with.', 'start': 1477.19, 'duration': 8.66}, {'end': 1490.673, 'text': "So I'm missing a little more than 20% of my age data.", 'start': 1486.691, 'duration': 3.982}, {'end': 1493.335, 'text': 'One in five, a little more than one in five.', 'start': 1491.674, 'duration': 1.661}, {'end': 1500.7, 'text': 'So now you can see why if I use the mode or something like the median, for example, probably not a good idea.', 'start': 1494.376, 'duration': 6.324}, {'end': 1509.465, 'text': 'In fact, the global median for the non missing age values in this dataset is 28.', 'start': 1501.48, 'duration': 7.985}, {'end': 1516.463, 'text': 'So hypothetically, if I used a median model, impute the missing ages, anybody that with an age missing would all of a sudden be 28.', 'start': 1509.465, 'duration': 6.998}, {'end': 1522.687, 'text': 'No matter whether they were male, whether they were female, whether they were traveling in first class, second class, or third class, would matter.', 'start': 1516.463, 'duration': 6.224}, {'end': 1523.968, 'text': "Everybody's all of a sudden 28.", 'start': 1522.767, 'duration': 1.201}, {'end': 1530.352, 'text': "Remember that spidey sense? 
That's what should be going on right now.", 'start': 1523.968, 'duration': 6.384}, {'end': 1531.693, 'text': "Dave, I don't want to.", 'start': 1530.992, 'duration': 0.701}, {'end': 1532.494, 'text': "No, that's crazy, Dave.", 'start': 1531.793, 'duration': 0.701}, {'end': 1533.454, 'text': "No, we're not going to do that.", 'start': 1532.514, 'duration': 0.94}, {'end': 1535.676, 'text': "No, we're not going to do that.", 'start': 1535.075, 'duration': 0.601}, {'end': 1536.516, 'text': "And in fact, we're not.", 'start': 1535.816, 'duration': 0.7}, {'end': 1540.909, 'text': "We're going to impute missing ages And we'll use carrot to help us with that.", 'start': 1537.397, 'duration': 3.512}, {'end': 1544.731, 'text': 'But we use a far more sophisticated technique than just using the mode or the median.', 'start': 1541.029, 'duration': 3.702}, {'end': 1546.772, 'text': "But here's the problem.", 'start': 1546.051, 'duration': 0.721}], 'summary': 'About 20% of age data missing, suggesting need for sophisticated imputation technique.', 'duration': 71.262, 'max_score': 1475.51, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/z8PRU46I3NY/pics/z8PRU46I3NY1475510.jpg'}, {'end': 1676.932, 'src': 'embed', 'start': 1640.981, 'weight': 3, 'content': [{'end': 1643.982, 'text': "It's a possible benefit of doing this as well.", 'start': 1640.981, 'duration': 3.001}, {'end': 1646.162, 'text': "So we'll feed this into our model as well.", 'start': 1644.362, 'duration': 1.8}, {'end': 1655.605, 'text': "Okay So we're going to use XGBoost.", 'start': 1650.444, 'duration': 5.161}, {'end': 1660.702, 'text': 'And XGBoost is a algorithm that creates boosted decision trees,', 'start': 1656.94, 'duration': 3.762}, {'end': 1666.026, 'text': 'a collection of decision trees working together to make a more powerful predictive model.', 'start': 1660.702, 'duration': 5.324}, {'end': 1676.932, 'text': 'Now, one of the things that you need to know about decision trees is they tend to prefer fewer, more powerful features, fewer, more powerful features.', 'start': 1667.507, 'duration': 9.425}], 'summary': 'Using xgboost algorithm to create powerful predictive model with fewer features.', 'duration': 35.951, 'max_score': 1640.981, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/z8PRU46I3NY/pics/z8PRU46I3NY1640981.jpg'}, {'end': 1734.68, 'src': 'embed', 'start': 1712.185, 'weight': 4, 'content': [{'end': 1719.85, 'text': 'So one technique that we could potentially try out that may help out our decision tree model is to combine that information from those two columns into one.', 'start': 1712.185, 'duration': 7.665}, {'end': 1722.652, 'text': 'It may make for a stronger feature.', 'start': 1720.871, 'duration': 1.781}, {'end': 1727.795, 'text': 'So we can create a family size feature, which, essentially,', 'start': 1724.413, 'duration': 3.382}, {'end': 1734.68, 'text': "is add up all my spouses and siblings that I'm traveling with to all the parents and children that I'm traveling with.", 'start': 1727.795, 'duration': 6.885}], 'summary': 'Combine spouse, siblings, parents, and children into a family size feature to strengthen decision tree model.', 'duration': 22.495, 'max_score': 1712.185, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/z8PRU46I3NY/pics/z8PRU46I3NY1712185.jpg'}], 'start': 1394.635, 'title': 'Data imputation and model enhancement', 'summary': 'Discusses imputing missing data in the titanic dataset, with 177 missing age 
values, and improving the decision tree model using xgboost and a family size feature.', 'chapters': [{'end': 1640.421, 'start': 1394.635, 'title': 'Imputing missing data in titanic dataset', 'summary': 'Discusses the use of mode for imputing missing data, the implications of using a naive method for imputing age values, and the implementation of a tracking feature to maintain visibility into the original missing data, with 177 missing age values, representing over 20% of the dataset, and the global median for non-missing age values being 28.', 'duration': 245.786, 'highlights': ['There are 177 missing age values, representing a little more than 20% of the dataset. The chapter reveals that there are 177 missing age values, which account for a little more than 20% of the dataset.', 'The global median for the non-missing age values in this dataset is 28. It is mentioned that the global median for the non-missing age values in the dataset is 28, which highlights the potential impact of using a simplistic method like median imputation.', 'The implications of using a naive method for imputing age values are discussed, with the recommendation to use a far more sophisticated technique than just using the mode or the median. The chapter emphasizes the limitations of using a naive method for imputing age values and recommends the use of a far more sophisticated technique than just using the mode or the median.']}, {'end': 1773.455, 'start': 1640.981, 'title': 'Improving decision tree model with family size feature', 'summary': "Discusses using xgboost, a decision tree algorithm, and the potential benefit of creating a family size feature by combining demographic information from the 'sibspa' and 'parche' variables to improve the decision tree model.", 'duration': 132.474, 'highlights': ['XGBoost is an algorithm that creates boosted decision trees, which tend to prefer fewer, more powerful features, in contrast to support vector machines designed for a large number of relatively weak features.', "Combining information from 'sibspa' and 'parche' variables into a family size feature could potentially strengthen the decision tree model by adding up all spouses, siblings, parents, children, and the individual themselves to determine the total family size."]}], 'duration': 378.82, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/z8PRU46I3NY/pics/z8PRU46I3NY1394635.jpg', 'highlights': ['There are 177 missing age values, representing a little more than 20% of the dataset.', 'The global median for the non-missing age values in this dataset is 28.', 'The implications of using a naive method for imputing age values are discussed, with the recommendation to use a far more sophisticated technique than just using the mode or the median.', 'XGBoost is an algorithm that creates boosted decision trees, which tend to prefer fewer, more powerful features, in contrast to support vector machines designed for a large number of relatively weak features.', "Combining information from 'sibspa' and 'parche' variables into a family size feature could potentially strengthen the decision tree model by adding up all spouses, siblings, parents, children, and the individual themselves to determine the total family size."]}, {'end': 2446.002, 'segs': [{'end': 2090.024, 'src': 'embed', 'start': 2063.431, 'weight': 1, 'content': [{'end': 2067.574, 'text': 'Because you know what you spend most of your time on? 
Just getting the data in the first place.', 'start': 2063.431, 'duration': 4.143}, {'end': 2070.835, 'text': 'Cleaning the data, understanding the data, getting more data.', 'start': 2068.695, 'duration': 2.14}, {'end': 2076.08, 'text': 'Generally speaking, you want to spend a lot of time on feature engineering, very little time on modeling.', 'start': 2071.917, 'duration': 4.163}, {'end': 2078.542, 'text': 'Feature engineering is where you make your money.', 'start': 2076.9, 'duration': 1.642}, {'end': 2080.382, 'text': "It's where you add your value.", 'start': 2079.422, 'duration': 0.96}, {'end': 2081.944, 'text': "Because here's the thing.", 'start': 2080.402, 'duration': 1.542}, {'end': 2085.887, 'text': "If there's no point in doing feature engineering, there's no point having a data scientist.", 'start': 2082.864, 'duration': 3.023}, {'end': 2090.024, 'text': 'Just take the data, throw in a model.', 'start': 2088.803, 'duration': 1.221}], 'summary': 'Focus on feature engineering over modeling for added value and efficiency.', 'duration': 26.593, 'max_score': 2063.431, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/z8PRU46I3NY/pics/z8PRU46I3NY2063431.jpg'}, {'end': 2128.895, 'src': 'embed', 'start': 2100.291, 'weight': 0, 'content': [{'end': 2103.834, 'text': 'Unfortunately, you spend 60 to 80% of your time on data management.', 'start': 2100.291, 'duration': 3.543}, {'end': 2109.798, 'text': 'Getting access to the data, loading the data, understanding the data, getting more data, cleaning the data, that sort of thing.', 'start': 2104.874, 'duration': 4.924}, {'end': 2111.939, 'text': 'And not as much as you would like on feature engineering.', 'start': 2110.378, 'duration': 1.561}, {'end': 2115.482, 'text': 'Feature engineering is actually the fun part of being a data scientist.', 'start': 2111.999, 'duration': 3.483}, {'end': 2117.283, 'text': "It's the fun part.", 'start': 2116.743, 'duration': 0.54}, {'end': 2128.895, 'text': 'Dave, does adding features affect the accuracy of the prediction? Well, if you do your job right, absolutely.', 'start': 2123.953, 'duration': 4.942}], 'summary': 'Data scientists spend 60-80% of time on data management, less on feature engineering. adding features can significantly impact prediction accuracy.', 'duration': 28.604, 'max_score': 2100.291, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/z8PRU46I3NY/pics/z8PRU46I3NY2100291.jpg'}, {'end': 2177.89, 'src': 'embed', 'start': 2146.562, 'weight': 2, 'content': [{'end': 2154.175, 'text': 'But yeah, so generally speaking, if you engineer good features, your accuracy, your measurement of accuracy, specificity, whatever, will go up.', 'start': 2146.562, 'duration': 7.613}, {'end': 2156.517, 'text': "That's how you know you're doing your job.", 'start': 2155.356, 'duration': 1.161}, {'end': 2159.058, 'text': "Most of the time it'll do nothing or it'll get worse.", 'start': 2156.977, 'duration': 2.081}, {'end': 2160.359, 'text': "And that's okay.", 'start': 2159.839, 'duration': 0.52}, {'end': 2161.3, 'text': "That's part of the job.", 'start': 2160.599, 'duration': 0.701}, {'end': 2177.89, 'text': 'Any other questions? Yeah? Is there a way to quantify the predictive power of a feature? 
The answer is yes.', 'start': 2164.302, 'duration': 13.588}], 'summary': 'Good features improve accuracy, specificity, and predictive power in data engineering.', 'duration': 31.328, 'max_score': 2146.562, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/z8PRU46I3NY/pics/z8PRU46I3NY2146562.jpg'}], 'start': 1775.357, 'title': 'Feature engineering and data science insights', 'summary': 'Discusses the significance of feature engineering for different models, emphasizing its impact on accurate predictions and job security, while also highlighting the importance of data management in data science, underscoring its time-consuming nature.', 'chapters': [{'end': 1887.283, 'start': 1775.357, 'title': 'Feature engineering for machine learning', 'summary': 'Discusses the importance of creating effective features for different models, emphasizing that good features may not necessarily work well for all models, and that engineering features that work well for one model but not for another is possible, highlighting the significance of job security in the field of machine learning.', 'duration': 111.926, 'highlights': ['The importance of creating effective features for different models is emphasized, with the understanding that good features may not necessarily work well for all models, showcasing the dynamic nature of feature engineering.', 'Engineering features that work well for one model but not for another is highlighted, emphasizing the possibility of this scenario and the impact on job security in the field of machine learning.', 'The discussion highlights the addition of new features while retaining the old ones, with the removal of certain irrelevant features such as passenger ID, name, ticket number, and cabin, demonstrating the practical application of feature engineering in data preprocessing.', "The mention of adding specific new features such as 'aged missing' and 'family size' showcases the practical implementation of feature engineering in data preprocessing for machine learning models."]}, {'end': 2446.002, 'start': 1888.965, 'title': 'Data science insights and best practices', 'summary': "Discusses the importance of feature engineering, the impact of algorithms on different tasks, and the significance of data management in data science, highlighting that feature engineering is crucial for accurate predictions and job security, while data management consumes a significant portion of a data scientist's time.", 'duration': 557.037, 'highlights': ["Feature engineering is crucial for accurate predictions and job security, with 60 to 80% of a data scientist's time spent on data management.", 'Deep neural networks are essential for image recognition, while univariate time series forecasting typically starts with parametric methods like REMA or exponential smoothing.', 'The significance of feature engineering in adding value as a data scientist, emphasizing that it is where the money is made and the value is added.', 'The impact of adding features on prediction accuracy, with the recognition that good feature engineering leads to improved accuracy.', 'The transformation of data into categorical variables for machine learning algorithms, highlighting the importance of interpreting data correctly and the impact on the classification problem.']}], 'duration': 670.645, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/z8PRU46I3NY/pics/z8PRU46I3NY1775357.jpg', 'highlights': ["Feature engineering is crucial for accurate predictions and job 
security, with 60 to 80% of a data scientist's time spent on data management.", 'The importance of creating effective features for different models is emphasized, showcasing the dynamic nature of feature engineering.', 'The impact of adding features on prediction accuracy, with the recognition that good feature engineering leads to improved accuracy.', 'The significance of feature engineering in adding value as a data scientist, emphasizing that it is where the money is made and the value is added.', 'The discussion highlights the addition of new features while retaining the old ones, demonstrating the practical application of feature engineering in data preprocessing.']}, {'end': 3134.005, 'segs': [{'end': 2672.324, 'src': 'heatmap', 'start': 2446.242, 'weight': 0, 'content': [{'end': 2449.803, 'text': 'So this very indicative of the problem space in general.', 'start': 2446.242, 'duration': 3.561}, {'end': 2452.872, 'text': 'All right.', 'start': 2452.512, 'duration': 0.36}, {'end': 2458.793, 'text': "Now you'll notice so far, we haven't really done anything with carrot yet.", 'start': 2453.992, 'duration': 4.801}, {'end': 2462.814, 'text': "That's going to change starting right now.", 'start': 2461.173, 'duration': 1.641}, {'end': 2469.855, 'text': "We're going to use carrot to automatically impute the missing agents.", 'start': 2465.174, 'duration': 4.681}, {'end': 2472.695, 'text': 'This is awesome.', 'start': 2472.195, 'duration': 0.5}, {'end': 2476.456, 'text': 'Carrot actually supports multiple methods of imputation.', 'start': 2473.196, 'duration': 3.26}, {'end': 2481.657, 'text': "For example, it supports just using the median, which we know we don't want.", 'start': 2477.096, 'duration': 4.561}, {'end': 2484.689, 'text': "Remember, don't want that one.", 'start': 2483.608, 'duration': 1.081}, {'end': 2487.091, 'text': 'It can also use k-nearest neighbors.', 'start': 2485.69, 'duration': 1.401}, {'end': 2490.813, 'text': 'And it also can use bagged decision trees.', 'start': 2488.792, 'duration': 2.021}, {'end': 2498.759, 'text': 'Now, out of those three, bagged decision trees have the potential of being the most accurate because they have the most predictive power.', 'start': 2492.174, 'duration': 6.585}, {'end': 2502.241, 'text': 'Unfortunately, they are the most computationally intensive as well.', 'start': 2499.519, 'duration': 2.722}, {'end': 2507.625, 'text': "Good news is, on this data set, small, so it's not going to be that big of a deal.", 'start': 2502.722, 'duration': 4.903}, {'end': 2511.168, 'text': "But it's something to keep in the back of your mind as your data sets get bigger.", 'start': 2508.126, 'duration': 3.042}, {'end': 2513.022, 'text': "You're going to have to worry about that.", 'start': 2511.701, 'duration': 1.321}, {'end': 2515.604, 'text': "Now, here's one gotcha, though.", 'start': 2514.423, 'duration': 1.181}, {'end': 2520.367, 'text': 'The imputation methods in caret only work on numeric data.', 'start': 2516.865, 'duration': 3.502}, {'end': 2522.149, 'text': 'They will not work on factors.', 'start': 2520.808, 'duration': 1.341}, {'end': 2528.373, 'text': "So we need to transform our data frame so that it doesn't have any factors in it anymore.", 'start': 2523.53, 'duration': 4.843}, {'end': 2531.115, 'text': "This is what's known as dummy variables.", 'start': 2529.394, 'duration': 1.721}, {'end': 2535.459, 'text': "Or if you're from Python, this is often called one hot encoding in Python.", 'start': 2531.796, 'duration': 3.663}, {'end': 2536.62, 
'text': 'Same idea.', 'start': 2536.159, 'duration': 0.461}, {'end': 2540.222, 'text': 'How do I take my factor variables and transform them into numerics?', 'start': 2537.1, 'duration': 3.122}, {'end': 2547.011, 'text': 'The function that we use in care to do that is called, not surprisingly, dummy bars.', 'start': 2541.667, 'duration': 5.344}, {'end': 2560.863, 'text': 'So if we, uh, pull up the help system here and pull up the help file for dummy bars, you see here, create a full set of dummy variables.', 'start': 2548.493, 'duration': 12.37}, {'end': 2562.764, 'text': 'This is awesome.', 'start': 2562.284, 'duration': 0.48}, {'end': 2564.025, 'text': 'This is handy.', 'start': 2563.244, 'duration': 0.781}, {'end': 2565.526, 'text': "You're going to use this all the time.", 'start': 2564.045, 'duration': 1.481}, {'end': 2571.151, 'text': "For example, if you're using support vector machines, you're going to want to use dummy bars to transform your.", 'start': 2566.107, 'duration': 5.044}, {'end': 2572.982, 'text': 'factor variables and numerics.', 'start': 2571.701, 'duration': 1.281}, {'end': 2579.726, 'text': "Okay, before we run the code, though, there's something that we should talk about, and that's a common pattern,", 'start': 2574.242, 'duration': 5.484}, {'end': 2583.288, 'text': 'a common idiom in machine learning in R, which is this', 'start': 2579.726, 'duration': 3.562}, {'end': 2589.271, 'text': 'Typically, the way I write my code in R is I invoke a function to train my model.', 'start': 2584.228, 'duration': 5.043}, {'end': 2593.053, 'text': 'That function returns me back a trained model.', 'start': 2590.212, 'duration': 2.841}, {'end': 2598.036, 'text': 'I then use a predict function on that trained model to create my predictions.', 'start': 2594.094, 'duration': 3.942}, {'end': 2601.038, 'text': 'Train, predict, train, predict, train, predict.', 'start': 2598.957, 'duration': 2.081}, {'end': 2605.322, 'text': 'Carrot is set up exactly the same way, exactly the same way.', 'start': 2601.84, 'duration': 3.482}, {'end': 2611.926, 'text': 'So this line of code here, if you will, trains a dummy variable model for me.', 'start': 2606.763, 'duration': 5.163}, {'end': 2615.248, 'text': 'Dummy variable model for me.', 'start': 2614.187, 'duration': 1.061}, {'end': 2625.393, 'text': 'And how you read this code is hey Carrot, I want you to create a dummy variable model for me and I want you to create it on all of my columns for me.', 'start': 2615.488, 'duration': 9.905}, {'end': 2630.116, 'text': "All of the columns that I'm going to give you, transform them into dummy variables.", 'start': 2626.134, 'duration': 3.982}, {'end': 2637.149, 'text': "Now, the good news is the function is smart enough that it says, look, if the column is already numeric, I won't do anything to it.", 'start': 2630.626, 'duration': 6.523}, {'end': 2640.15, 'text': 'I will only work on the factors that you pass me.', 'start': 2637.769, 'duration': 2.381}, {'end': 2645.732, 'text': "And then we can say, okay, here's the data I want you to work with.", 'start': 2643.211, 'duration': 2.521}, {'end': 2647.953, 'text': 'I want you to work on my training data set.', 'start': 2646.432, 'duration': 1.521}, {'end': 2654.216, 'text': "And notice I'm scrapping off the class label because I'm worried about my features.", 'start': 2648.493, 'duration': 5.723}, {'end': 2655.316, 'text': "I'm not worried about the class label.", 'start': 2654.256, 'duration': 1.06}, {'end': 2657.577, 'text': 'Everything but the class 
label.', 'start': 2656.757, 'duration': 0.82}, {'end': 2666.84, 'text': 'So if I run this line of code, then I will get a dummy variable object.', 'start': 2661.377, 'duration': 5.463}, {'end': 2672.324, 'text': "Next up, I can then say, okay, look, I've trained my dummy variable model.", 'start': 2668.541, 'duration': 3.783}], 'summary': 'Using carrot to impute missing agents and transform factor variables into numerics for machine learning in r.', 'duration': 55.999, 'max_score': 2446.242, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/z8PRU46I3NY/pics/z8PRU46I3NY2446242.jpg'}, {'end': 2572.982, 'src': 'embed', 'start': 2548.493, 'weight': 3, 'content': [{'end': 2560.863, 'text': 'So if we, uh, pull up the help system here and pull up the help file for dummy bars, you see here, create a full set of dummy variables.', 'start': 2548.493, 'duration': 12.37}, {'end': 2562.764, 'text': 'This is awesome.', 'start': 2562.284, 'duration': 0.48}, {'end': 2564.025, 'text': 'This is handy.', 'start': 2563.244, 'duration': 0.781}, {'end': 2565.526, 'text': "You're going to use this all the time.", 'start': 2564.045, 'duration': 1.481}, {'end': 2571.151, 'text': "For example, if you're using support vector machines, you're going to want to use dummy bars to transform your.", 'start': 2566.107, 'duration': 5.044}, {'end': 2572.982, 'text': 'factor variables and numerics.', 'start': 2571.701, 'duration': 1.281}], 'summary': 'The help file for dummy bars creates a full set of dummy variables for transforming factor variables and numerics.', 'duration': 24.489, 'max_score': 2548.493, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/z8PRU46I3NY/pics/z8PRU46I3NY2548493.jpg'}, {'end': 2630.116, 'src': 'embed', 'start': 2598.957, 'weight': 5, 'content': [{'end': 2601.038, 'text': 'Train, predict, train, predict, train, predict.', 'start': 2598.957, 'duration': 2.081}, {'end': 2605.322, 'text': 'Carrot is set up exactly the same way, exactly the same way.', 'start': 2601.84, 'duration': 3.482}, {'end': 2611.926, 'text': 'So this line of code here, if you will, trains a dummy variable model for me.', 'start': 2606.763, 'duration': 5.163}, {'end': 2615.248, 'text': 'Dummy variable model for me.', 'start': 2614.187, 'duration': 1.061}, {'end': 2625.393, 'text': 'And how you read this code is hey Carrot, I want you to create a dummy variable model for me and I want you to create it on all of my columns for me.', 'start': 2615.488, 'duration': 9.905}, {'end': 2630.116, 'text': "All of the columns that I'm going to give you, transform them into dummy variables.", 'start': 2626.134, 'duration': 3.982}], 'summary': 'Training and predicting a dummy variable model on all columns.', 'duration': 31.159, 'max_score': 2598.957, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/z8PRU46I3NY/pics/z8PRU46I3NY2598957.jpg'}, {'end': 2807.699, 'src': 'embed', 'start': 2769.47, 'weight': 4, 'content': [{'end': 2780.888, 'text': 'Ideally, far more accurate than something simple like just using the global medium of 28.', 'start': 2769.47, 'duration': 11.418}, {'end': 2781.248, 'text': 'All right.', 'start': 2780.888, 'duration': 0.36}, {'end': 2784.129, 'text': 'So now we get to impute.', 'start': 2782.609, 'duration': 1.52}, {'end': 2791.972, 'text': 'So we use a function in carrot called pre-process and pre-process is mighty.', 'start': 2785.07, 'duration': 6.902}, {'end': 2794.974, 'text': 'It is mighty.', 'start': 2794.353, 'duration': 0.621}, 
So now we get to impute. The function we use in caret is called preProcess, and preProcess is mighty. I'm not going to walk through the whole help file, but check out all the stuff it can do: preProcess will center and scale your data, it will do Box-Cox transforms, it will do PCA and ICA, and it will impute missing values. caret supports several imputation methods, including the median, k-nearest neighbors, and bagged decision trees; the bagged trees tend to have the most predictive power, but they are also the most computationally intensive. Ideally, any of these will be far more accurate than something simple like filling in the global median of 28. preProcess follows the same train-then-predict pattern, and under the hood it builds an imputation model for every one of your columns, which is worth keeping in mind when the data set is large.
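A hedged sketch of that imputation step, reusing the dummy-coded matrix from above; the choice of bagged-tree imputation follows the discussion, but the exact arguments used in the session are not reproduced here:

    # Train an imputation model (one per column) on the dummy-coded data.
    pre.process <- preProcess(train.dummy, method = "bagImpute")

    # Predict with it to fill in the missing values, such as the missing ages.
    imputed.data <- predict(pre.process, newdata = train.dummy)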
A fair question at this point: okay Dave, you've got me a little scared now. Is there any way for me to actually evaluate how good my imputation model is? The answer is absolutely, and really the only way to know which strategy is going to work best is to try out several and see which one does. Build your imputation model, use it to predict ages for the rows where you already have ages, and then compare the predictions to the actuals. If you're familiar with regression, these are essentially residuals. Same idea. You can use something like MAE (mean absolute error), RMSE, or whatever metric you like; that's one way to double-check how good the imputation model is on the data you have.

What if there is more than one column with missing data? Think of it at two levels: the general question of what to do, and the specific question of how caret does it. That is exactly why caret calculates imputation models for every one of your columns. Now, once again, the more columns with missing data you have, the more imputation models you are relying on, and the bigger the worry.
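One rough way to run that check, sketched under the assumption that the imputed column is named Age (the session's actual evaluation code is not shown here): hide the ages you do know, re-impute them, and measure the error.

    # Rows where Age is actually known.
    known <- train.dummy[!is.na(train.dummy[, "Age"]), ]

    # Mask those ages and let the imputation model predict them back.
    masked <- known
    masked[, "Age"] <- NA
    age.hat <- predict(pre.process, newdata = masked)[, "Age"]

    # Mean absolute error of the imputation (the residuals, essentially).
    mean(abs(age.hat - known[, "Age"]))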
As I mentioned earlier, we are going to have a webinar later this month on ggplot2. Another question that came up: doesn't R handle factors just fine? What is all this craziness we have to do? The answer is that it's simply how the function is implemented; it was a choice Max Kuhn made, not a limitation of R itself. And once you've imputed the age data, what do you do with it? You feed it back into your original data set. That column goes back into the original data frame and becomes one of the features we use to train our XGBoost model. Incidentally, the old nautical adage of women and children first generally bears out when you look at this data. One more important point: should you exclude the response variable when imputing your predictors? Yes, absolutely. The reason is that you will not have the response variable for your new data.
Why did I use this particular method? Good question. Should you test all of the imputation methods in a real production scenario at your job? Yeah, actually, I would. If your goal is maximum accuracy, why wouldn't you? Well, probably not the median; I think you can rule that one out as too simple. Okay. Notice I'm going to take the sixth column of the imputed data, because the sixth column was the age, and I'm going to overwrite the age variable on my training data. I take all the imputed age values and put them back on my original data frame, where I have my factors and all my original data. And if we pull that up in the spreadsheet view, you can see a value where this age was originally missing.
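In code, that write-back is a one-liner; the column position of 6 comes straight from the talk, while the object names are the illustrative ones used in the sketches above.

    # Overwrite the original Age column with the imputed values
    # (column 6 of the imputed data is age in this data set).
    train$Age <- imputed.data[, 6]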
Next up is splitting the data. If I had a perfect 50-50 class split it wouldn't really matter, but anytime I don't, say I'm doing anomaly detection with severe class imbalance, like fraud detection where maybe only 1% is fraud and 99% is not, I'm going to want stratified sampling, because I want the data I'm working with to be representative of the population. createDataPartition does this for me automagically. Since we're doing random sampling, we go ahead and set the seed. We pass the function a factor variable; that's how it can calculate the relative proportions, so that any random splits it creates follow those same proportions. The function is mighty. I can actually ask it for more than one split at a time: here I only need one, but I could ask for five splits or 25 splits, and they'd all be random. And just to prove it works, we can print the relative proportions of all three data sets: the original, all 891 rows; my training split of 70%; and my test set of 30%. Notice how the relative proportions are maintained. My data is still representative. If you work with any sort of class-imbalance problem, this is awesome.
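A sketch of that stratified 70/30 split; the seed value and the object names are illustrative rather than the exact ones from the session.

    set.seed(54321)

    # Stratified random split: the proportions of the Survived label are
    # preserved in both pieces. times = 1 asks for a single split and
    # list = FALSE returns a plain index vector.
    indexes <- createDataPartition(train$Survived, times = 1,
                                   p = 0.7, list = FALSE)
    titanic.train <- train[indexes, ]
    titanic.test  <- train[-indexes, ]

    # Sanity check that all three data sets are still representative.
    prop.table(table(train$Survived))
    prop.table(table(titanic.train$Survived))
    prop.table(table(titanic.test$Survived))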
So now we've got a training and test split, and it's time to actually build a model. First up, we can tell caret how we would like the model built, using the trainControl function. trainControl illustrates the fact that the process for training a model is independent of the model type itself: it's the same for a random forest, the same for XGBoost, the same for a neural network, the same for logistic regression. It doesn't matter. What I'm asking for here is cross-validation, specifically 10-fold cross-validation repeated three times, and I would also like a grid search: go through a collection of parameters and find which ones are optimal. If you're not familiar with 10-fold cross-validation, here it is at a super high level: split the data into 10 folds, cycle through all the folds, build 10 models, and evaluate how good they are on the data. So I run the algorithm 10 times. When I do repeated CV, I'm saying that whole process needs to be done more than once: do that same 10-fold cross-validation a total of three times, so 30 models, not 10. Why would we do this? Cross-validation is a means of estimating how well our model will perform in production, once it's out in the real world.
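That request translates into a trainControl object along these lines (a sketch, with caret left to apply the grid it is later given):

    # 10-fold cross-validation, repeated 3 times, with a grid search
    # over candidate hyperparameter values.
    train.control <- trainControl(method = "repeatedcv",
                                  number = 10,
                                  repeats = 3,
                                  search = "grid")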
Next comes the tuning grid. I know it takes a bit of faith, but bear with me for a second. The expand.grid function essentially creates all the permutations of the candidate parameter values and gives me back a data frame, one unique row for every combination. If I run this code, it's easier if you just see it: you'll see all the different combinations of values, and in particular, notice that it's 243 distinct combinations. So now I'm asking caret to run 10-fold cross-validation three times for each one of these 243 potential settings, or 10 times 3 times 243. When I mentioned earlier about hitting enter at five o'clock on Friday, turning your lights off, and going home, this is an example of that: this is an exhaustive grid search. And sometimes you actually do this in practice to find the best model. The more complicated the model, the more knobs and dials, the more likely it is you have to do stuff like this.
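For concreteness, here is an illustrative xgboost-style grid; the specific candidate values are assumptions, chosen so that five parameters with three values each give the 243 rows mentioned above.

    tune.grid <- expand.grid(eta = c(0.05, 0.075, 0.1),
                             nrounds = c(50, 75, 100),
                             max_depth = 6:8,
                             min_child_weight = c(2.0, 2.25, 2.5),
                             colsample_bytree = c(0.3, 0.4, 0.5),
                             gamma = 0,
                             subsample = 1)
    nrow(tune.grid)   # 243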
Okay, so next up: because we're going to do 10 times 3 times 243 things, it would be nice if we could do some of this work at the same time, in parallel. This line of code from the doSNOW package, makeCluster, essentially allows us to create a cluster. By the way, if you're using the code at home, unless you have a really powerful computer, change the number I've highlighted and make it smaller; as written it assumes at least a workstation-class machine or a server. The way you interpret it, and this is close enough for our purposes, is: hey doSNOW, start up 10 instances of RStudio behind the scenes and run all 10 of those programs at the same time. If you've ever opened up 30 Chrome tabs, it's the same type of idea: your computer starts to slow down. But this is how you actually train things in parallel with caret, and it allows you to scale up very easily in R. You can also use these kinds of techniques to scale across machines, but that's way more complicated.

On my little laptop, I changed that number from 10 to 3, and here's a picture; notice I titled it "Ouch." That code took six minutes to run, at basically 100% CPU the entire time. So imagine, if you will, a much bigger data set. Not surprisingly, I pre-ran this before you all showed up, so if I run this line of code we can see the results of the process we just described. You'll notice we get a ton of output. caret says: okay, you want to use XGBoost, extreme gradient boosting; you're doing 10-fold cross-validation repeated three times; here are the results for each one of those combinations. There's so much output that eventually it just gets truncated. But it tells you at the bottom that accuracy was used to select the optimal values, and it lists the values that actually produced the most accurate model. And once it has done that, caret says: now that I know the best settings to use, I will use all of your training data to build you one final model with those settings and give it back to you. I just love caret.

So what that allows us to do is make some predictions. Remember: train, predict, train, predict. I can now do a prediction with that cross-validated model on the 30% I held out; remember I created that 70/30 split. Overall, the accuracy comes out around 85%, and the breakdown is where it gets interesting. We're at about 93% for the people who died: based on this particular process, this particular data split, and this particular algorithm, 93% of the time we can correctly predict people who die. Another example is specificity, which is the next column: how well did we do in correctly predicting people who actually lived? That's lower, and this is exactly the kind of breakdown that helps you decide where to focus your feature engineering. You probably don't want to focus on features that help you predict people who died better, because you're already pretty good at that. You'd focus your data analysis efforts on understanding the signal in the data around who actually survived, because that's where you have the most room for improvement. That's where this becomes very, very helpful. Okay, and that is the end of the script.
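A condensed sketch of that whole sequence (cluster, train, predict, evaluate), reusing the illustrative objects defined above; the worker count of 10 assumes a beefy machine, exactly as cautioned in the talk.

    library(doSNOW)

    cl <- makeCluster(10, type = "SOCK")   # drop this to 2 or 3 on a laptop
    registerDoSNOW(cl)

    caret.cv <- train(Survived ~ .,
                      data = titanic.train,
                      method = "xgbTree",
                      tuneGrid = tune.grid,
                      trControl = train.control)

    stopCluster(cl)

    # Score the 30% holdout and inspect accuracy, sensitivity, specificity.
    preds <- predict(caret.cv, titanic.test)
    confusionMatrix(preds, titanic.test$Survived)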
Before the questions, one recommendation. I know it's cliché, but if I were trapped on a desert island and could only have one data science book, this is the one I would pick, without a doubt. You can ask my wife and look at my Amazon account; I buy lots of data science books, and this is by far the best general book on data science I've ever seen.

First question: what was the default cutoff for survived versus died? The answer is 0.5 by default. XGBoost here only uses the simple probability: if it's over 50%, you lived; if it's below, you died. So if you want to mess with thresholds, like for an AUC calculation, you have to write extra code for that. Good question. And with that, this is the end of the presentation, so feel free to pepper me with questions if you have them.
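The extra code for a custom threshold would look roughly like this; the class-level names ("Survived" and "Perished") are assumptions about how the label factor is coded, not names taken from the session.

    # Ask for class probabilities instead of hard class predictions...
    probs <- predict(caret.cv, titanic.test, type = "prob")

    # ...then apply whatever cutoff you like instead of the default 0.5.
    cut.preds <- factor(ifelse(probs[, "Survived"] >= 0.65,
                               "Survived", "Perished"),
                        levels = levels(titanic.test$Survived))
    confusionMatrix(cut.preds, titanic.test$Survived)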
Do people let results like these guide their future feature-engineering efforts? Absolutely, because we're human beings; it's very hard for us not to do that. Would I use my own intuition alone to guide those efforts? No, absolutely not. That said, business subject-matter expertise, for example, is often qualitative rather than quantitative, and it is immensely useful in many, many situations.

As I mentioned earlier, most of the time is actually spent on data management activities: getting the data, understanding the data, cleaning the data, all that stuff. That's usually 60 to 80% of your time, and getting the data in the first place is the first step in the process, which is why, if you're lucky enough to have an enterprise data warehouse, thank whoever paid for it.

On parallelism: the doSNOW package by default only works on the local CPU cores that you have. If you want to distribute across multiple machines, you can do that, but it's more complicated, and what I brought up is that in this day and age it's probably not worth setting up that complexity; go to the cloud, rent a really big VM, and just run everything locally there. Anybody else?
Yeah, so the question is: okay Dave, the confusion matrix is mainly for classification; what would I use for regression problems? That's a more general question, and there are many metrics you can use. The two most popular, and the ones I use personally, take it for what it's worth, are MAE, mean absolute error, and RMSE.

If I were being realistic here, what I would do next is go back and analyze my data. What are some trends, what are some things I can see, what features can I engineer? For example, I made an assumption that family size would be an interesting feature, but I didn't actually do any exploratory data analysis to verify that. Those are the kinds of things I would actually be doing.

Can removing data make a model better? How possible that is depends on a couple of factors: one is the nature of the data and the problem, and the second is the algorithm you're using. Certain algorithms implement strong feature selection, so they'll effectively ignore bad features altogether; other algorithms can't, so removing a bad feature really will improve performance. So it's going to depend.

And remember what the name caret stands for: Classification And Regression Training. There's really not a lot of uplift in it for unsupervised learning; you're not going to want to do a grid search over parameters for something like k-means, for example.

Thank you for that question: how easy is it to change out the type of model I'm using? In caret, if I change that method string to "rf", I'm now training a random forest instead of a boosted decision tree, for example. Using caret, it's extremely easy. Extremely easy.
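That one-argument swap, sketched with the objects from earlier and with caret left to pick its own default tuning grid for the random forest:

    rf.cv <- train(Survived ~ .,
                   data = titanic.train,
                   method = "rf",           # random forest instead of "xgbTree"
                   trControl = train.control)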
What if I want the relative probability the algorithm used to decide whether a row was a one or a zero, rather than just the class? Tree-based algorithms can do that; you just use a different parameter on the predict function and it gives you probabilities.

So, is there an easy way to run multiple methods? Yes. If I understand the question correctly, all you would do is wrap this code in a for loop, and each time the loop executes, switch out the method string that's currently highlighted for a different one. And on that big VM, you would just use all 15 cores all the time.
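A sketch of that wrap-it-in-a-loop suggestion; the particular vector of caret method names is an assumption for illustration.

    methods.to.try <- c("xgbTree", "rf", "glmnet")
    results <- list()
    for (m in methods.to.try) {
      # Same data and resampling scheme each time; only the algorithm changes.
      results[[m]] <- train(Survived ~ .,
                            data = titanic.train,
                            method = m,
                            trControl = train.control)
    }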
Section summary: data science books, model interpretation, regression metrics, and using caret for machine learning. The remaining Q&A covers a recommended data science book, how XGBoost turns probabilities into predictions, the qualitative and quantitative work behind machine learning models, limitations of the doSNOW package, regression metrics such as MAE, the effect of data analysis on model performance, and practical considerations for caret, including its limited usefulness for unsupervised learning and skepticism about claims of 100% accuracy.

Data science books and model interpretation:
• XGBoost uses a simple probability rule with a default cutoff of 50% for survived versus died, which makes its decision process easy to interpret.
• Data management activities usually consume 60 to 80% of project time, which underscores the importance of understanding and cleaning the data.
• Business subject-matter expertise provides qualitative insight and guidance for feature-engineering efforts, and it matters for decision-making and problem-solving.
• The doSNOW package only uses local CPU cores by default; distributing work across multiple machines requires additional complexity.
• The speaker recommends one data science book as the best general-purpose choice, based on extensive experience and comparison with many other books.

Regression metrics and data analysis:
• RMSE and MAE (mean absolute error) are the two most popular metrics for regression problems.
• Spending time analyzing the data, understanding trends, and engineering features through exploratory data analysis is emphasized.
• Removing data can sometimes make a model better, depending on the nature of the data, the problem, and the algorithm being used.
• With caret it is extremely easy to change the type of model being trained, and those changes can even be parameterized and run from a script.

Using caret for machine learning:
• caret is not particularly useful for unsupervised learning; there is little uplift in, say, grid-searching parameters for k-means clustering.
• Dummy variables are only required when using caret's impute functionality (for example in text analytics); tree-based algorithms do not need them. A short imputation sketch follows this list.
• Claims of 100% accuracy on the public Titanic competition are met with skepticism, and the discussion returns to recovering the relative probabilities behind a tree-based model's predictions.
• Multiple methods can be run by wrapping the training call in a for loop, swapping the method string each time, and keeping all available cores busy.
• The speaker takes extra questions past the scheduled time and announces the next talk, on data visualization and storytelling with Power BI, before thanking the audience.
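To illustrate the dummy-variable point, here is a minimal sketch of caret's imputation workflow on made-up data: factors are dummy-encoded with dummyVars() first, because preProcess() imputation operates on numeric data, and bagged-tree imputation relies on the suggested ipred package. The data frame and column names are invented for the example.

library(caret)

# Made-up passenger-style data with a missing Age value and a factor column.
passengers <- data.frame(
  Age      = c(22, NA, 26, 35, 54, 2, 27, 14),
  Fare     = c(7.25, 71.28, 7.92, 53.10, 51.86, 21.07, 11.13, 30.07),
  Embarked = factor(c("S", "C", "S", "S", "S", "Q", "S", "C"))
)

# Step 1: dummy-encode the factor so every column is numeric.
dummy_spec  <- dummyVars(~ ., data = passengers)
numeric_dat <- as.data.frame(predict(dummy_spec, newdata = passengers))

# Step 2: impute the missing value with bagged decision trees
# (medianImpute and knnImpute are the other options; bagImpute is typically the most accurate).
impute_spec <- preProcess(numeric_dat, method = "bagImpute")
imputed_dat <- predict(impute_spec, newdata = numeric_dat)

imputed_dat$Age   # the NA has been replaced with an imputed value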
Highlights from the full talk:
• R's popularity is increasing rapidly, and it has become more popular than a strategically important language like C#.
• The Titanic dataset is used frequently at Data Science Dojo because it is widely recognized and well understood, which makes it a popular choice for teaching.
• caret is a package designed to accelerate machine learning work in R; it wraps more than 200 machine learning algorithms behind a common interface, keeping code consistent.
• Imputation is introduced as part of data wrangling, with the example of filling missing values in the Embarked column with the majority value, highlighting the importance of handling missing data.
• XGBoost builds boosted decision trees, which tend to prefer fewer, more powerful features, in contrast to support vector machines, which are designed for a large number of relatively weak features.
• Feature engineering is crucial for accurate predictions (and for job security); 60 to 80% of a data scientist's time is spent on data management.
• caret supports multiple imputation methods, including median imputation, k-nearest neighbors, and bagged decision trees, with the last being the most accurate.
• Metrics such as MAE and RMSE are used to assess how well an imputation model performs; a short sketch of both metrics follows below.
• The model achieved around 85% accuracy in predicting survival, a reasonably good result.
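For reference, both error metrics are easy to compute by hand; the numbers below are made up purely to show the arithmetic.

# Hypothetical held-out ground truth and model output (for example, known ages vs. imputed ages).
actual    <- c(22, 38, 26, 35, 54)
predicted <- c(25, 36, 30, 33, 50)

# Mean absolute error: the average size of the miss, in the units of the variable.
mae <- mean(abs(predicted - actual))        # 3

# Root mean squared error: similar, but penalizes large misses more heavily.
rmse <- sqrt(mean((predicted - actual)^2))  # about 3.13

c(MAE = mae, RMSE = rmse)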