title
Build your first machine learning model in Python
description
In this video, you will learn how to build your first machine learning model in Python using the scikit-learn library.
🔗 Colab https://colab.research.google.com/
🔗 Code https://github.com/dataprofessor/first-ml
🔗 GitHub https://github.com/dataprofessor
🔗 Blog https://towardsdatascience.com/how-to-build-a-machine-learning-model-439ab8fb3fb1
📖 DiscoverDataScience https://www.discoverdatascience.org
📖 https://www.discoverdatascience.org/articles/journey-through-data-science-with-the-data-professor/
Timestamps
0:00 Introduction
0:15 Getting started with Google Colab
1:30 Load dataset
6:52 Split data into X and y
8:30 Split data into train/test sets
11:45 About DiscoverDataScience
13:00 Model building with linear regression
21:55 Model building with random forest
26:00 Model comparison
27:55 Data visualization
30:32 Conclusion
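The steps listed in the timestamps above can be sketched roughly as follows. This is a minimal sketch and not the exact notebook from the video. To keep it self-contained, it builds a synthetic stand-in for the solubility dataset rather than downloading the real CSV, and the column spellings (MolLogP, MolWt, NumRotatableBonds, AromaticProportion, logS), the `evaluate` helper, and the random forest setting `max_depth=2` are assumptions made for illustration. The 80/20 split with `random_state=100` follows what the video describes.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the solubility data (assumption: the real notebook
# loads a CSV with pd.read_csv instead). 1144 rows matches the video.
rng = np.random.default_rng(0)
n_rows = 1144
df = pd.DataFrame({
    "MolLogP": rng.normal(2.0, 1.5, n_rows),
    "MolWt": rng.normal(200.0, 50.0, n_rows),
    "NumRotatableBonds": rng.integers(0, 10, n_rows).astype(float),
    "AromaticProportion": rng.uniform(0.0, 1.0, n_rows),
})
# Made-up relationship so the models have something to learn.
df["logS"] = -0.8 * df["MolLogP"] - 0.006 * df["MolWt"] + rng.normal(0.0, 0.5, n_rows)

# Split into y (the column to predict) and X (everything else).
y = df["logS"]
X = df.drop("logS", axis=1)  # axis=1 drops a column, axis=0 would drop a row

# 80/20 train/test split; a fixed random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=100
)


def evaluate(name, model):
    """Hypothetical helper: fit on the training set, score on both sets."""
    model.fit(X_train, y_train)
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    return {
        "Method": name,
        "Training MSE": mean_squared_error(y_train, y_train_pred),
        "Training R2": r2_score(y_train, y_train_pred),
        "Test MSE": mean_squared_error(y_test, y_test_pred),
        "Test R2": r2_score(y_test, y_test_pred),
    }


# One row per model, so the results are easy to compare or sort.
results = pd.DataFrame([
    evaluate("Linear regression", LinearRegression()),
    evaluate("Random forest", RandomForestRegressor(max_depth=2, random_state=100)),
])
print(results)
```

Collecting the metrics in one data frame means more models can be added as extra rows and compared side by side. The data visualization step from the video is left out of this sketch.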
Support my work:
👪 Join as Channel Member:
https://www.youtube.com/channel/UCV8e2g4IWQqK71bbzGDEI4Q/join
✉️ Newsletter http://newsletter.dataprofessor.org
📖 Join Medium to Read my Blogs https://data-professor.medium.com/membership
☕ Buy me a coffee https://www.buymeacoffee.com/dataprofessor
Recommended Resources
📚 Books https://kit.co/dataprofessor
😎 Taro (Tech Career Mentorship) https://www.jointaro.com/r/dataprofessor/
📜 Google Data Analytics Professional Certificate https://imp.i384100.net/google-data-analytics
🤔 Interview Query https://www.interviewquery.com/?ref=dataprofessor
🖥️ Stock photos, graphics and videos used on this channel https://1.envato.market/c/2346717/628379/4662
Subscribe:
🌟 Coding Professor https://www.youtube.com/channel/UCJzlfIoF8nmWqJIv_iWQVRw?sub_confirmation=1
🌟 Data Professor https://www.youtube.com/dataprofessor?sub_confirmation=1
Disclaimer:
Recommended books and tools are affiliate links that give me a portion of sales at no cost to you, which will contribute to the improvement of this channel's content.
#datascience #machinelearning #dataprofessor
detail
{'title': 'Build your first machine learning model in Python', 'heatmap': [{'end': 501.682, 'start': 441.631, 'weight': 0.989}, {'end': 596.821, 'start': 520.366, 'weight': 0.718}, {'end': 1053.435, 'start': 997.564, 'weight': 0.896}], 'summary': 'Learn to build your first machine learning model in python using scikit-learn library in google colab, analyze delany dataset for drug candidates, understand csv data format, import and split data in jupyter, perform data splitting for training and testing sets, organize text cells, build linear regression model, evaluate models using mean squared error and r2 score, train and combine regression models including random forest regressor, and compare regression models while using scikit-learn for model selection and data visualization in python.', 'chapters': [{'end': 163.603, 'segs': [{'end': 47.212, 'src': 'embed', 'start': 0.429, 'weight': 1, 'content': [{'end': 5.252, 'text': 'A portion of this video is sponsored by Discover Data Science, powered by Wiley.', 'start': 0.429, 'duration': 4.823}, {'end': 7.413, 'text': 'More on them in just a moment.', 'start': 5.752, 'duration': 1.661}, {'end': 14.757, 'text': "In this video, I'm going to show you how you could build your first machine learning model in Python, and we're starting right now.", 'start': 8.132, 'duration': 6.625}, {'end': 22.371, 'text': "So we're going to build our first machine learning model in Python, and we're going to do that using the scikit-learn library.", 'start': 15.529, 'duration': 6.842}, {'end': 26.713, 'text': "And the coding environment that we're going to use is going to be Google Colab.", 'start': 22.691, 'duration': 4.022}, {'end': 28.993, 'text': "It's free, and it's quite powerful.", 'start': 27.013, 'duration': 1.98}, {'end': 30.634, 'text': "And so let's fire it up.", 'start': 29.314, 'duration': 1.32}, {'end': 42.698, 'text': 'So typically, when I create projects on Google Colab, one of the first things that I would do is I would give 
the notebook a name.', 'start': 34.035, 'duration': 8.663}, {'end': 47.212, 'text': "So we're going to give it a name of first project.", 'start': 44.488, 'duration': 2.724}], 'summary': 'A tutorial on building a machine learning model in python using scikit-learn and google colab.', 'duration': 46.783, 'max_score': 0.429, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/29ZQ3TDGgRQ/pics/29ZQ3TDGgRQ429.jpg'}, {'end': 171.235, 'src': 'embed', 'start': 138.634, 'weight': 0, 'content': [{'end': 139.515, 'text': "And I think it's this one.", 'start': 138.634, 'duration': 0.881}, {'end': 140.355, 'text': 'Let me have a look.', 'start': 139.655, 'duration': 0.7}, {'end': 148.025, 'text': 'Okay, so this is a data set of the solubility of molecules,', 'start': 142.378, 'duration': 5.647}, {'end': 159.498, 'text': 'and they are important in the fact that they are crucial for biologists and chemists in determining whether a molecule is soluble in water or solvent,', 'start': 148.025, 'duration': 11.473}, {'end': 162.021, 'text': 'and whether they will be good drug candidates.', 'start': 159.498, 'duration': 2.523}, {'end': 163.603, 'text': "And so let's have a look here.", 'start': 162.361, 'duration': 1.242}, {'end': 171.235, 'text': 'You can see that the data set here is in the format of a CSV and essentially it is a comma separated value file.', 'start': 163.763, 'duration': 7.472}], 'summary': 'Dataset contains solubility data crucial for biologists and chemists, helping determine molecule suitability for drug development.', 'duration': 32.601, 'max_score': 138.634, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/29ZQ3TDGgRQ/pics/29ZQ3TDGgRQ138634.jpg'}], 'start': 0.429, 'title': 'Building first ml model in python', 'summary': 'Discusses building the first machine learning model in python using the scikit-learn library in google colab, and analyzing the delany dataset to determine good drug candidates.', 
'chapters': [{'end': 163.603, 'start': 0.429, 'title': 'Building first ml model in python', 'summary': 'Discusses building the first machine learning model in python using the scikit-learn library in google colab, loading the delany dataset for analyzing the solubility of molecules, crucial for biologists and chemists in determining good drug candidates.', 'duration': 163.174, 'highlights': ['Loading the Delany dataset for analyzing the solubility of molecules, important for determining good drug candidates, is a key step in the project.', 'The chapter explores using the scikit-learn library in Google Colab to build the first machine learning model in Python.', 'The video is sponsored by Discover Data Science, powered by Wiley.', 'Google Colab is highlighted as a free and powerful coding environment for building machine learning models in Python.', 'The Delany dataset contains data on the solubility of molecules, which is crucial for biologists and chemists in determining good drug candidates.']}], 'duration': 163.174, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/29ZQ3TDGgRQ/pics/29ZQ3TDGgRQ429.jpg', 'highlights': ['Loading the Delany dataset for analyzing molecule solubility is crucial for determining good drug candidates', 'Exploring scikit-learn library in Google Colab to build the first ML model in Python', 'Delany dataset contains data on molecule solubility, crucial for biologists and chemists', 'Google Colab is a free and powerful coding environment for building ML models in Python', 'Video sponsored by Discover Data Science, powered by Wiley']}, {'end': 510.684, 'segs': [{'end': 258.462, 'src': 'embed', 'start': 232.301, 'weight': 0, 'content': [{'end': 241.689, 'text': 'And so they are the variable that you want to predict as a function of the X variables, which are the ones here that are highlighted.', 'start': 232.301, 'duration': 9.388}, {'end': 254.557, 'text': 'So you might be familiar with the equation of Y equals to f 
of x, right? So y is the last column here, the y variable equals to the function of x.', 'start': 241.829, 'duration': 12.728}, {'end': 256.159, 'text': 'So we have several x here.', 'start': 254.557, 'duration': 1.602}, {'end': 258.462, 'text': 'So it is a multivariate analysis.', 'start': 256.339, 'duration': 2.123}], 'summary': 'Multivariate analysis to predict y as a function of x variables.', 'duration': 26.161, 'max_score': 232.301, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/29ZQ3TDGgRQ/pics/29ZQ3TDGgRQ232301.jpg'}, {'end': 314.154, 'src': 'embed', 'start': 283.351, 'weight': 1, 'content': [{'end': 285.313, 'text': 'import pandas as PD.', 'start': 283.351, 'duration': 1.962}, {'end': 290.096, 'text': 'So PD is sort of a alias for the pandas library.', 'start': 285.673, 'duration': 4.423}, {'end': 294.64, 'text': "So from here on, we're going to call pandas as PD, as mentioned here.", 'start': 290.216, 'duration': 4.424}, {'end': 298.823, 'text': "And then we're going to read in the data set in the CSV format.", 'start': 294.8, 'duration': 4.023}, {'end': 302.446, 'text': "And then we're going to assign it to a variable called DF.", 'start': 298.983, 'duration': 3.463}, {'end': 304.928, 'text': 'And DF is an acronym for data frame.', 'start': 302.566, 'duration': 2.362}, {'end': 306.029, 'text': "So let's do it.", 'start': 305.308, 'duration': 0.721}, {'end': 309.011, 'text': "We're going to type in PD because we want to use pandas.", 'start': 306.189, 'duration': 2.822}, {'end': 314.154, 'text': "And then we're going to use the function from pandas library called read CSV.", 'start': 309.131, 'duration': 5.023}], 'summary': 'Using pd alias, read csv data into df variable.', 'duration': 30.803, 'max_score': 283.351, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/29ZQ3TDGgRQ/pics/29ZQ3TDGgRQ283351.jpg'}, {'end': 419.55, 'src': 'embed', 'start': 393.985, 'weight': 2, 'content': [{'end': 401.534, 
'text': 'and so when we build a machine learning model, to predict the y variable or the log s, and therefore log s,', 'start': 393.985, 'duration': 7.549}, {'end': 405.378, 'text': 'is equal to the function of all of the x variables here.', 'start': 401.534, 'duration': 3.844}, {'end': 411.265, 'text': "so in other words, we're going to use the four variables here to make a prediction on the log s variable.", 'start': 405.378, 'duration': 5.887}, {'end': 419.55, 'text': "okay. and so the next thing that we want to do here now is that we want to split the data frame into the x and into the y, and so let's do.", 'start': 411.525, 'duration': 8.025}], 'summary': 'Building a machine learning model to predict log s using four x variables.', 'duration': 25.565, 'max_score': 393.985, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/29ZQ3TDGgRQ/pics/29ZQ3TDGgRQ393985.jpg'}, {'end': 501.682, 'src': 'heatmap', 'start': 441.631, 'weight': 0.989, 'content': [{'end': 447.237, 'text': "And so we're gonna call this data split, data separation, or data separation as X and Y.", 'start': 441.631, 'duration': 5.606}, {'end': 457.485, 'text': "Okay, and so we're going to create the y, and we're gonna type in df, and the name of the last column here is log s.", 'start': 448.677, 'duration': 8.808}, {'end': 459.046, 'text': "So that's how we're gonna get the y.", 'start': 457.485, 'duration': 1.561}, {'end': 462.409, 'text': "And let's see, okay, and these are the y, log s.", 'start': 459.046, 'duration': 3.363}, {'end': 467.034, 'text': 'And now we want to get only the x variables, so we want to remove the log s.', 'start': 462.409, 'duration': 4.625}, {'end': 473.839, 'text': "So we're going to do that type in x equals to df dot drop parenthesis.", 'start': 468.294, 'duration': 5.545}, {'end': 479.485, 'text': "and then we're going to say we want to drop log s and we want to have axis equals to one,", 'start': 473.839, 'duration': 5.646}, {'end': 486.331, 
'text': 'because axis equals to one will allow the drop function to work with the data as column mode.', 'start': 479.485, 'duration': 6.846}, {'end': 490.855, 'text': 'However, if you use axis equal to zero, it will work it in the row mode.', 'start': 486.451, 'duration': 4.404}, {'end': 492.297, 'text': "Let's see if that's correct.", 'start': 490.875, 'duration': 1.422}, {'end': 494.261, 'text': 'it is correct.', 'start': 493.621, 'duration': 0.64}, {'end': 498.742, 'text': 'You see that the log s now gone and that we have four columns here.', 'start': 494.381, 'duration': 4.361}, {'end': 501.682, 'text': 'And prior to that, we have five columns.', 'start': 499.762, 'duration': 1.92}], 'summary': 'Data separated into x and y variables, log s removed, resulting in 4 x variables and 1 y variable.', 'duration': 60.051, 'max_score': 441.631, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/29ZQ3TDGgRQ/pics/29ZQ3TDGgRQ441631.jpg'}, {'end': 486.331, 'src': 'embed', 'start': 459.046, 'weight': 3, 'content': [{'end': 462.409, 'text': "And let's see, okay, and these are the y, log s.", 'start': 459.046, 'duration': 3.363}, {'end': 467.034, 'text': 'And now we want to get only the x variables, so we want to remove the log s.', 'start': 462.409, 'duration': 4.625}, {'end': 473.839, 'text': "So we're going to do that type in x equals to df dot drop parenthesis.", 'start': 468.294, 'duration': 5.545}, {'end': 479.485, 'text': "and then we're going to say we want to drop log s and we want to have axis equals to one,", 'start': 473.839, 'duration': 5.646}, {'end': 486.331, 'text': 'because axis equals to one will allow the drop function to work with the data as column mode.', 'start': 479.485, 'duration': 6.846}], 'summary': 'Data manipulation to remove log s and extract x variables from the dataframe.', 'duration': 27.285, 'max_score': 459.046, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/29ZQ3TDGgRQ/pics/29ZQ3TDGgRQ459046.jpg'}], 'start': 163.763, 'title': 'Understanding csv data format and importing and splitting data in jupyter', 'summary': 'Explains the structure of a csv file with 5 columns, emphasizes the importance of the y variable in multivariate analysis, and covers importing a dataset using pandas in a jupyter notebook, splitting the dataset into x and y variables, with the x variables representing four columns and the y variable being log s.', 'chapters': [{'end': 258.462, 'start': 163.763, 'title': 'Understanding csv data format', 'summary': 'Explains the structure of a csv file containing 5 columns and multiple rows, and highlights the importance of the y variable as the dependent variable in multivariate analysis.', 'duration': 94.699, 'highlights': ['The data set is in the format of a CSV file with 5 columns and multiple rows, representing the X and Y variables in a multivariate analysis.', 'The Y variable, also known as the dependent variable, is crucial for prediction and is represented as the last column in the dataset.', 'Each row in the CSV file represents a data set, with the first row containing the names of the columns and subsequent rows containing the actual data points separated by commas.']}, {'end': 510.684, 'start': 258.482, 'title': 'Importing and splitting data in jupyter', 'summary': 'Covers importing a dataset using pandas in a jupyter notebook, and then splitting the dataset into x and y variables, with the x variables representing four columns and the y variable being log s.', 'duration': 252.202, 'highlights': ["The dataset is imported using the Pandas library in a Jupyter Notebook, assigning it to a variable called DF, which stands for data frame, and then displayed by typing 'DF'.", 'The x variables, including mol log p, mol weight, num rotatable bonds, and aromatic proportion, are utilized to predict the y variable, log s, in a machine learning 
model.', "The dataset is split into the x and y variables, with the y variable obtained from the last column 'log s', and the x variables obtained by dropping the 'log s' column from the dataset, resulting in four columns representing the x variables.", "The process of dropping the 'log s' column from the dataset is demonstrated using the 'drop' function with 'axis=1', effectively separating the x and y variables for further analysis."]}], 'duration': 346.921, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/29ZQ3TDGgRQ/pics/29ZQ3TDGgRQ163763.jpg', 'highlights': ['The Y variable is crucial for prediction and is represented as the last column in the dataset.', 'The dataset is imported using the Pandas library in a Jupyter Notebook, assigning it to a variable called DF.', 'The x variables, including mol log p, mol weight, num rotatable bonds, and aromatic proportion, are utilized to predict the y variable, log s.', "The dataset is split into the x and y variables, with the y variable obtained from the last column 'log s', and the x variables obtained by dropping the 'log s' column from the dataset."]}, {'end': 780.539, 'segs': [{'end': 596.821, 'src': 'heatmap', 'start': 511.024, 'weight': 3, 'content': [{'end': 519.385, 'text': "So the next thing that we want to do is we're going to split the data set, we're going to split it as the training set and the testing set.", 'start': 511.024, 'duration': 8.361}, {'end': 521.006, 'text': "So let's do it.", 'start': 520.366, 'duration': 0.64}, {'end': 526.053, 'text': 'So remember how many we need, we need three hash symbols here.', 'start': 521.846, 'duration': 4.207}, {'end': 531.715, 'text': "So we're going to add text cell, click it three times and then type data splitting.", 'start': 526.833, 'duration': 4.882}, {'end': 534.496, 'text': "And we're going to use the scikit learn package for that.", 'start': 531.995, 'duration': 2.501}, {'end': 539.498, 'text': 'So you want to type in from SK 
learn dot model underscore selection.', 'start': 534.656, 'duration': 4.842}, {'end': 543.14, 'text': 'And then you want to import the train test split train test split.', 'start': 539.759, 'duration': 3.381}, {'end': 557.549, 'text': "and now we're going to type in x train, x test, y train, y test equals to train test splits x and y and we're going to have the test size to be 0.2.", 'start': 543.38, 'duration': 14.169}, {'end': 558.97, 'text': 'and let me see,', 'start': 557.549, 'duration': 1.421}, {'end': 566.714, 'text': 'I want to have the random state to be assigned a specific number so that every time I run the code cell I will get the same data split.', 'start': 559.23, 'duration': 7.484}, {'end': 569.975, 'text': "So we're going to have random state equals to, let's say, 100.", 'start': 566.774, 'duration': 3.201}, {'end': 571.596, 'text': "And now we're going to run it.", 'start': 569.975, 'duration': 1.621}, {'end': 574.337, 'text': 'So we should now have four new variables here.', 'start': 571.736, 'duration': 2.601}, {'end': 577.539, 'text': "And let's have a look at the X train.", 'start': 574.357, 'duration': 3.182}, {'end': 585.683, 'text': 'And we see that we have 915 rows and four columns.', 'start': 577.559, 'duration': 8.124}, {'end': 587.424, 'text': "Let's have a look at X test.", 'start': 585.963, 'duration': 1.461}, {'end': 591.914, 'text': 'we have 229 rows and also four columns.', 'start': 588.53, 'duration': 3.384}, {'end': 596.821, 'text': 'So Xtest or Xtrain will come from the X variable.', 'start': 592.255, 'duration': 4.566}], 'summary': 'Using scikit-learn, the dataset is split into training and testing sets, with x train having 915 rows and x test having 229 rows.', 'duration': 23.472, 'max_score': 511.024, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/29ZQ3TDGgRQ/pics/29ZQ3TDGgRQ511024.jpg'}, {'end': 644.704, 'src': 'embed', 'start': 559.23, 'weight': 0, 'content': [{'end': 566.714, 'text': 'I want to 
have the random state to be assigned a specific number so that every time I run the code cell I will get the same data split.', 'start': 559.23, 'duration': 7.484}, {'end': 569.975, 'text': "So we're going to have random state equals to, let's say, 100.", 'start': 566.774, 'duration': 3.201}, {'end': 571.596, 'text': "And now we're going to run it.", 'start': 569.975, 'duration': 1.621}, {'end': 574.337, 'text': 'So we should now have four new variables here.', 'start': 571.736, 'duration': 2.601}, {'end': 577.539, 'text': "And let's have a look at the X train.", 'start': 574.357, 'duration': 3.182}, {'end': 585.683, 'text': 'And we see that we have 915 rows and four columns.', 'start': 577.559, 'duration': 8.124}, {'end': 587.424, 'text': "Let's have a look at X test.", 'start': 585.963, 'duration': 1.461}, {'end': 591.914, 'text': 'we have 229 rows and also four columns.', 'start': 588.53, 'duration': 3.384}, {'end': 596.821, 'text': 'So Xtest or Xtrain will come from the X variable.', 'start': 592.255, 'duration': 4.566}, {'end': 599.063, 'text': 'So we started out with 1, 144.', 'start': 597.622, 'duration': 1.441}, {'end': 606.478, 'text': 'And so 80% of 1, 144 is 915 and 20 of 1144 is 229.', 'start': 599.064, 'duration': 7.414}, {'end': 612.461, 'text': 'and so the training set here will have 80 of the data and the x test here, or the test set will have 20 of the data.', 'start': 606.478, 'duration': 5.983}, {'end': 631.673, 'text': "And I've actually written a blog post about this particular topic of building your machine learning model in Python using scikit-learn.", 'start': 622.948, 'duration': 8.725}, {'end': 637.217, 'text': "And I've drawn several illustrations explaining about the data split.", 'start': 632.114, 'duration': 5.103}, {'end': 639.018, 'text': 'So let me go and let me show you.', 'start': 637.597, 'duration': 1.421}, {'end': 644.704, 'text': "and it's this article how to build a machine learning model, a visual guide to learning data 
science.", 'start': 639.438, 'duration': 5.266}], 'summary': 'Random state set to 100 for consistent data split, resulting in 915 rows for x train and 229 rows for x test.', 'duration': 85.474, 'max_score': 559.23, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/29ZQ3TDGgRQ/pics/29ZQ3TDGgRQ559230.jpg'}, {'end': 734.563, 'src': 'embed', 'start': 690.248, 'weight': 2, 'content': [{'end': 694.81, 'text': 'You wanna evaluate whether the model that you have built using the training set,', 'start': 690.248, 'duration': 4.562}, {'end': 702.194, 'text': 'whether it performed in a robust manner against an unknown data that you simulate using the testing set.', 'start': 694.81, 'duration': 7.384}, {'end': 708.885, 'text': 'okay?. And so, before continuing further, a quick word from our sponsor.', 'start': 702.194, 'duration': 6.691}, {'end': 719.11, 'text': 'And so a short message from our sponsor, Discover Data Science, powered by Wiley, which is the premier information hub for the field of data science.', 'start': 709.622, 'duration': 9.488}, {'end': 725.655, 'text': 'With in-depth guides on careers, degrees and industry-leading programming languages.', 'start': 719.37, 'duration': 6.285}, {'end': 734.563, 'text': "Discover Data Science's goal is to provide accessible resources and materials for prospective students and professionals.", 'start': 725.655, 'duration': 8.908}], 'summary': 'Evaluate model performance with testing set for robustness.', 'duration': 44.315, 'max_score': 690.248, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/29ZQ3TDGgRQ/pics/29ZQ3TDGgRQ690248.jpg'}], 'start': 511.024, 'title': 'Data splitting and model selection', 'summary': 'Discusses splitting the dataset into training and testing sets, resulting in 915 rows and 4 columns for the training set and 229 rows and 4 columns for the testing set, with a test size of 0.2. 
it also covers building a machine learning model in python using scikit-learn, and emphasizes the data splitting process with 80% of the data going into the training set and 20% into the testing set, while highlighting discover data science as a resource for data science education and career guidance.', 'chapters': [{'end': 612.461, 'start': 511.024, 'title': 'Data splitting and model selection', 'summary': "Discusses splitting the dataset into training and testing sets using scikit-learn's train_test_split, resulting in 915 rows and 4 columns for the training set and 229 rows and 4 columns for the testing set, with a test size of 0.2.", 'duration': 101.437, 'highlights': ['The training set has 915 rows and 4 columns, while the testing set has 229 rows and 4 columns, with a test size of 0.2.', "The code utilizes scikit-learn's train_test_split to split the dataset into training and testing sets.", 'The random state is set to 100 to ensure consistent data split when the code is run.']}, {'end': 780.539, 'start': 622.948, 'title': 'Building machine learning model in python', 'summary': 'Discusses building a machine learning model in python using scikit-learn, emphasizing the data splitting process where 80% of the data goes into the training set and 20% into the testing set, and highlights the sponsor, discover data science, as a resource for data science education and career guidance.', 'duration': 157.591, 'highlights': ['The process of building a machine learning model in Python using scikit-learn is illustrated with a focus on the data splitting, where 80% of the data goes into the training set and 20% into the testing set.', "The chapter emphasizes the importance of evaluating the model's performance using the testing set, which simulates unknown data, against the model built using the training set.", 'Discover Data Science, powered by Wiley, is highlighted as a premier information hub for data science education and career guidance, offering expert-driven 
articles and resources for prospective students and professionals.']}], 'duration': 269.515, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/29ZQ3TDGgRQ/pics/29ZQ3TDGgRQ511024.jpg', 'highlights': ['The training set has 915 rows and 4 columns, while the testing set has 229 rows and 4 columns, with a test size of 0.2.', 'The process of building a machine learning model in Python using scikit-learn is illustrated with a focus on the data splitting, where 80% of the data goes into the training set and 20% into the testing set.', "The chapter emphasizes the importance of evaluating the model's performance using the testing set, which simulates unknown data, against the model built using the training set.", "The code utilizes scikit-learn's train_test_split to split the dataset into training and testing sets.", 'The random state is set to 100 to ensure consistent data split when the code is run.', 'Discover Data Science, powered by Wiley, is highlighted as a premier information hub for data science education and career guidance, offering expert-driven articles and resources for prospective students and professionals.']}, {'end': 1091.704, 'segs': [{'end': 841.779, 'src': 'embed', 'start': 811.883, 'weight': 0, 'content': [{'end': 815.326, 'text': "So if you click here, you're gonna see the table of content of your code.", 'start': 811.883, 'duration': 3.443}, {'end': 825.517, 'text': 'And so the benefit of organizing your text cells in hierarchical form is that you could see the table of contents here and then you could click through the various sections.', 'start': 815.527, 'duration': 9.99}, {'end': 832.895, 'text': "So actually, instead of making load data having two hash symbol, I'm gonna make it into having one.", 'start': 826.772, 'duration': 6.123}, {'end': 834.575, 'text': "So it's gonna be the same as the title.", 'start': 833.055, 'duration': 1.52}, {'end': 837.217, 'text': "And then you're gonna see that this one moved to the left 
a bit.", 'start': 834.776, 'duration': 2.441}, {'end': 841.779, 'text': "And now we're gonna make data preparation to be one as well, one hash.", 'start': 837.397, 'duration': 4.382}], 'summary': 'Organizing text in hierarchical form allows easy navigation and table of contents display.', 'duration': 29.896, 'max_score': 811.883, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/29ZQ3TDGgRQ/pics/29ZQ3TDGgRQ811883.jpg'}, {'end': 889.943, 'src': 'embed', 'start': 858.7, 'weight': 4, 'content': [{'end': 863.845, 'text': "Okay, and now we're going to continue by populating the code cell underneath the linear regression.", 'start': 858.7, 'duration': 5.145}, {'end': 871.433, 'text': "So we're going to use scikit-learn from sklearn.linear model import linear regression.", 'start': 864.746, 'duration': 6.687}, {'end': 877.876, 'text': "So you're going to see here that scikit-learn has several functions that you could use not only to prepare your data set,", 'start': 871.653, 'duration': 6.223}, {'end': 880.458, 'text': 'but also to build a machine learning model.', 'start': 877.876, 'duration': 2.582}, {'end': 883.459, 'text': "And here we're going to build a typical linear regression model.", 'start': 880.558, 'duration': 2.901}, {'end': 889.943, 'text': "And now that we have imported the function, we're going to create a variable called LR to stand for linear regression.", 'start': 883.639, 'duration': 6.304}], 'summary': 'Using scikit-learn to build a linear regression model.', 'duration': 31.243, 'max_score': 858.7, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/29ZQ3TDGgRQ/pics/29ZQ3TDGgRQ858700.jpg'}, {'end': 974.834, 'src': 'embed', 'start': 945.6, 'weight': 2, 'content': [{'end': 957.967, 'text': "So we're going to apply the model to make a prediction on the training set and the prediction to notify that we're going to use pred and then to make note of the algorithm that we're using to train the model.", 
'start': 945.6, 'duration': 12.367}, {'end': 962.789, 'text': "we're going to specify to be lr here and then we're going to start with the y underscore.", 'start': 957.967, 'duration': 4.822}, {'end': 974.834, 'text': 'So this naming convention will be helpful when we have several machine learning algorithms that we want to try out and also whether our prediction is made on the training set or the testing set.', 'start': 962.829, 'duration': 12.005}], 'summary': 'Applying lr model to make predictions on the training set and note algorithm used.', 'duration': 29.234, 'max_score': 945.6, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/29ZQ3TDGgRQ/pics/29ZQ3TDGgRQ945600.jpg'}, {'end': 1057.118, 'src': 'heatmap', 'start': 987.34, 'weight': 1, 'content': [{'end': 992.202, 'text': "It's going to be making prediction on the original data set that it has been trained on.", 'start': 987.34, 'duration': 4.862}, {'end': 997.204, 'text': 'And so that will allow us to evaluate the performance of the algorithm.', 'start': 992.322, 'duration': 4.882}, {'end': 1008.187, 'text': "So here we're going to call it y underscore l r underscore test underscore pred equals l r dot predicts and as you've guessed y underscore test.", 'start': 997.564, 'duration': 10.623}, {'end': 1009.008, 'text': "Let's do it.", 'start': 1008.508, 'duration': 0.5}, {'end': 1017.23, 'text': "Let's print out the results y underscore l r train pred y underscore l r test pred.", 'start': 1009.148, 'duration': 8.082}, {'end': 1022.092, 'text': "Actually, let's just make it like that.", 'start': 1019.851, 'duration': 2.241}, {'end': 1028.659, 'text': 'Okay, so these are all of the predictions of look here.', 'start': 1025.176, 'duration': 3.483}, {'end': 1035.602, 'text': 'So these represents the 80% of the data.', 'start': 1033.161, 'duration': 2.441}, {'end': 1040.744, 'text': 'And there you go, the remainder 20% has been predicted.', 'start': 1037.343, 'duration': 3.401}, {'end': 
1053.435, 'text': "And we have the predicted value And the next part here is we're going to compare the predicted value with the original value or the actual value.", 'start': 1041.204, 'duration': 12.231}, {'end': 1057.118, 'text': "And we're going to call the new section here to be model performance.", 'start': 1053.695, 'duration': 3.423}], 'summary': 'Model predicts 80% of data with 20% remaining, evaluating algorithm performance.', 'duration': 69.778, 'max_score': 987.34, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/29ZQ3TDGgRQ/pics/29ZQ3TDGgRQ987340.jpg'}], 'start': 780.999, 'title': 'About: organizing text cells and building linear regression model', 'summary': 'Demonstrates organizing text cells hierarchically for improved code readability and navigation, with an example. additionally, it covers building a linear regression model using scikit-learn, training the model on 80% of the data, making predictions, and evaluating model performance.', 'chapters': [{'end': 834.575, 'start': 780.999, 'title': 'Building hierarchical text cells', 'summary': 'Demonstrates how to organize text cells in a hierarchical form to create a table of contents, improving code readability and navigation, with a specific example of adjusting hash symbols for section titles.', 'duration': 53.576, 'highlights': ['Organizing text cells in hierarchical form allows for the creation of a table of contents, improving code navigation and readability.', 'Adjusting hash symbols for section titles helps in maintaining a consistent hierarchical structure for better organization and navigation.']}, {'end': 1091.704, 'start': 834.776, 'title': 'Building linear regression model', 'summary': 'Covers building a linear regression model using scikit-learn, training the model on the training set, making predictions, and evaluating model performance on 80% of the data.', 'duration': 256.928, 'highlights': ['The model is trained on the training set using lr.fit, resulting 
in the creation of y_lr_train_pred and y_lr_test_pred to evaluate model performance.', 'The scikit-learn library is used to build a typical linear regression model, providing functions for data preparation and building a machine learning model.', 'The prediction process involves applying the model to make predictions on both the training and testing sets, allowing for the evaluation of algorithm performance.']}], 'duration': 310.705, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/29ZQ3TDGgRQ/pics/29ZQ3TDGgRQ780999.jpg', 'highlights': ['Organizing text cells in hierarchical form allows for the creation of a table of contents, improving code navigation and readability.', 'The prediction process involves applying the model to make predictions on both the training and testing sets, allowing for the evaluation of algorithm performance.', 'The model is trained on the training set using lr.fit, resulting in the creation of y_lr_train_pred and y_lr_test_pred to evaluate model performance.', 'Adjusting hash symbols for section titles helps in maintaining a consistent hierarchical structure for better organization and navigation.', 'The scikit-learn library is used to build a typical linear regression model, providing functions for data preparation and building a machine learning model.']}, {'end': 1450.955, 'segs': [{'end': 1174.447, 'src': 'embed', 'start': 1145.222, 'weight': 1, 'content': [{'end': 1148.184, 'text': 'and so these two blocks are for the training set.', 'start': 1145.222, 'duration': 2.962}, {'end': 1150.445, 'text': "now we're going to do the same for the testing set.", 'start': 1148.184, 'duration': 2.261}, {'end': 1164.032, 'text': 'mean squared error one test r underscore test red lr test two equals r2 score and we have y test and the y lr test underscore red.', 'start': 1150.445, 'duration': 13.587}, {'end': 1164.692, 'text': 'run it.', 'start': 1164.032, 'duration': 0.66}, {'end': 1166.213, 'text': "let's run values 
here.", 'start': 1164.692, 'duration': 1.521}, {'end': 1174.447, 'text': "Okay, they're reasonably similar performance here.", 'start': 1171.266, 'duration': 3.181}], 'summary': 'Training and testing sets show similar performance with mean squared error and r2 score.', 'duration': 29.225, 'max_score': 1145.222, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/29ZQ3TDGgRQ/pics/29ZQ3TDGgRQ1145222.jpg'}, {'end': 1314.4, 'src': 'embed', 'start': 1292.221, 'weight': 0, 'content': [{'end': 1301.088, 'text': 'And so the great thing about having it in a pandas data frame like this is that if you evaluate more and more machine learning models like random forest k,', 'start': 1292.221, 'duration': 8.867}, {'end': 1308.795, 'text': "nearest neighbor support vector machine neural network, then you're going to have a data frame that will allow you to easily compare.", 'start': 1301.088, 'duration': 7.707}, {'end': 1314.4, 'text': 'you could also sort by column, the performance, and that will help you to evaluate which one was the best.', 'start': 1308.795, 'duration': 5.605}], 'summary': 'Using a pandas data frame to compare machine learning models and identify the best performer.', 'duration': 22.179, 'max_score': 1292.221, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/29ZQ3TDGgRQ/pics/29ZQ3TDGgRQ1292221.jpg'}, {'end': 1450.955, 'src': 'embed', 'start': 1406.944, 'weight': 2, 'content': [{'end': 1409.745, 'text': 'then evaluate model performance.', 'start': 1406.944, 'duration': 2.801}, {'end': 1411.545, 'text': 'so we could move this up a bit.', 'start': 1409.745, 'duration': 1.8}, {'end': 1414.766, 'text': "so we're going to train the model using the random forest algorithm.", 'start': 1411.545, 'duration': 3.221}, {'end': 1420.908, 'text': 'So from sklearn dot ensemble, import random forest regressor.', 'start': 1414.946, 'duration': 5.962}, {'end': 1428.95, 'text': "So a point of note here is that this particular 
tutorial video makes use of regressor because we're building regression models.", 'start': 1421.068, 'duration': 7.882}, {'end': 1436.492, 'text': "And it is because the y variable, which is called log s, let me show you log s right here, it's a quantitative value.", 'start': 1429.15, 'duration': 7.342}, {'end': 1440.753, 'text': "So if the y variable is quantitative, we're going to build a regression model.", 'start': 1436.612, 'duration': 4.141}, {'end': 1445.984, 'text': "Whereas if it is categorical, then we're going to build a classification model, okay?", 'start': 1440.973, 'duration': 5.011}, {'end': 1449.231, 'text': 'So in this tutorial, the log S is quantitative.', 'start': 1446.144, 'duration': 3.087}, {'end': 1450.955, 'text': 'therefore we built the regression model.', 'start': 1449.231, 'duration': 1.724}], 'summary': 'Trained regression model using random forest algorithm for quantitative y variable log s.', 'duration': 44.011, 'max_score': 1406.944, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/29ZQ3TDGgRQ/pics/29ZQ3TDGgRQ1406944.jpg'}], 'start': 1091.924, 'title': 'Model evaluation and comparison', 'summary': 'Covers the model evaluation process using mean squared error and r2 score for both training and testing sets, and the benefits of presenting the results in a pandas data frame for easy comparison of machine learning models.', 'chapters': [{'end': 1450.955, 'start': 1091.924, 'title': 'Model evaluation and comparison', 'summary': 'Covers the model evaluation process using mean squared error and r2 score for both training and testing sets, and the benefits of presenting the results in a pandas data frame for easy comparison of machine learning models.', 'duration': 359.031, 'highlights': ['The tutorial demonstrates the calculation of mean squared error and r2 score for both training and testing sets, showcasing the evaluation process of the machine learning model.', 'Utilizing a pandas data frame to present the 
evaluation results allows for easy comparison of multiple machine learning models, aiding in the identification of the best-performing model.', "The importance of selecting 'regressor' for building regression models is emphasized, as it is suitable for quantitative y variables like 'log s' in the tutorial.", 'The tutorial emphasizes the distinction between building regression models for quantitative y variables and classification models for categorical y variables to ensure the appropriate model is chosen based on the nature of the data.']}], 'duration': 359.031, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/29ZQ3TDGgRQ/pics/29ZQ3TDGgRQ1091924.jpg', 'highlights': ['Utilizing a pandas data frame for easy comparison of machine learning models.', 'The tutorial demonstrates the calculation of mean squared error and r2 score for both training and testing sets.', "The importance of selecting 'regressor' for building regression models is emphasized.", 'The tutorial emphasizes the distinction between building regression models for quantitative y variables and classification models.']}, {'end': 1649.168, 'segs': [{'end': 1478.285, 'src': 'embed', 'start': 1452.654, 'weight': 0, 'content': [{'end': 1458.396, 'text': 'because random forest here has two versions, random forest regressor and random forest classifier.', 'start': 1452.654, 'duration': 5.742}, {'end': 1460.517, 'text': "And here we're using the regressor.", 'start': 1458.797, 'duration': 1.72}, {'end': 1464.819, 'text': "So we're going to create a RF variable to house the random forest algorithm.", 'start': 1460.677, 'duration': 4.142}, {'end': 1473.283, 'text': "And we're going to specify some of the parameters for the model here maximum depth of two and the random state of what about 100?", 'start': 1465.099, 'duration': 8.184}, {'end': 1476.644, 'text': 'Because in the prior random state we used 100..', 'start': 1473.283, 'duration': 3.361}, {'end': 1478.285, 'text': "and now 
we're going to train the model.", 'start': 1476.644, 'duration': 1.641}], 'summary': 'Using random forest regressor with maximum depth of 2 and random state of 100 to train the model.', 'duration': 25.631, 'max_score': 1452.654, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/29ZQ3TDGgRQ/pics/29ZQ3TDGgRQ1452654.jpg'}, {'end': 1534.736, 'src': 'embed', 'start': 1504.507, 'weight': 1, 'content': [{'end': 1508.47, 'text': "and now it looks correct to me and we're going to run it.", 'start': 1504.507, 'duration': 3.963}, {'end': 1513.755, 'text': "okay, and now we're going to do the model performance evaluation.", 'start': 1508.47, 'duration': 5.285}, {'end': 1517.982, 'text': "i'm going to copy the code here, paste it.", 'start': 1513.755, 'duration': 4.227}, {'end': 1525.868, 'text': "we're gonna use the mean squared error and we're gonna use the r2 score and here, instead of lr, we're gonna replace that to be rf.", 'start': 1517.982, 'duration': 7.886}, {'end': 1534.736, 'text': 'okay, so replace all of the lr to be rf and be mindful, maybe you might type in wrong, like me just a moment ago to be fr.', 'start': 1525.868, 'duration': 8.868}], 'summary': "Model performance evaluation using mean squared error and r2 score, replacing 'lr' with 'rf'.", 'duration': 30.229, 'max_score': 1504.507, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/29ZQ3TDGgRQ/pics/29ZQ3TDGgRQ1504507.jpg'}, {'end': 1625.35, 'src': 'embed', 'start': 1578.585, 'weight': 2, 'content': [{'end': 1586.909, 'text': "so we're gonna combine the two results table into one and let me see df models equal pd.concat,", 'start': 1578.585, 'duration': 8.324}, {'end': 1593.431, 'text': "and then i'm going to specify the name of lr results and rf results.", 'start': 1586.909, 'duration': 6.522}, {'end': 1598.973, 'text': 'see, do i have axis equals 0 because i want to combine it in a row wise manner.', 'start': 1593.431, 'duration': 5.542}, {'end': 
1600.713, 'text': 'let me try if it works.', 'start': 1598.973, 'duration': 1.74}, {'end': 1601.513, 'text': 'all right, it worked.', 'start': 1600.713, 'duration': 0.8}, {'end': 1610.658, 'text': 'yeah. so axis is equal to 0 if you want to combine in a row wise manner, whereas if you use axis one, it will be in a column wise manner.', 'start': 1601.513, 'duration': 9.145}, {'end': 1613.2, 'text': "So here we're stacking them on top of one another.", 'start': 1610.718, 'duration': 2.482}, {'end': 1619.465, 'text': 'Okay, so you can see now that the two are in the same table, but then the index number is a bit off.', 'start': 1613.22, 'duration': 6.245}, {'end': 1621.647, 'text': 'So we need to reindex that.', 'start': 1619.605, 'duration': 2.042}, {'end': 1625.35, 'text': "So let me see if it's as simple as doing this reindex.", 'start': 1621.687, 'duration': 3.663}], 'summary': 'Combined two results tables using pd.concat to stack them row-wise.', 'duration': 46.765, 'max_score': 1578.585, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/29ZQ3TDGgRQ/pics/29ZQ3TDGgRQ1578585.jpg'}], 'start': 1452.654, 'title': 'Training and combining regression models', 'summary': "Covers the training of a random forest regressor model with a max depth of 2 and random state of 100. 
It also discusses combining the results of linear regression and random forest tables using Python's pandas library.", 'chapters': [{'end': 1552.187, 'start': 1452.654, 'title': 'Random forest regressor training', 'summary': 'Covers the training and evaluation of a random forest regressor model with a maximum depth of two and a random state of 100, followed by performance evaluation using mean squared error and r2 score.', 'duration': 99.533, 'highlights': ['The model is trained using the Random Forest Regressor algorithm with a specified maximum depth of two and a random state of 100.', 'The performance of the model is evaluated using mean squared error and r2 score.']}, {'end': 1649.168, 'start': 1552.487, 'title': 'Combining linear regression and random forest tables', 'summary': "Discusses combining the results of linear regression and random forest tables using Python's pandas library, ensuring alignment and reindexing for tidy presentation.", 'duration': 96.681, 'highlights': ['The process involves combining the results of linear regression and random forest tables using pd.concat in a row-wise manner.', 'The method of reindexing the combined table ensures correct alignment of the indices for a tidy presentation.', 'The concept of stacking the tables on top of one another is explained, with the option to use axis one for column-wise combination.', 'The speaker explains that axis equals 0 gives a row-wise combination and mentions the alternative option of using axis one for column-wise combination.']}], 'duration': 196.514, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/29ZQ3TDGgRQ/pics/29ZQ3TDGgRQ1452654.jpg', 'highlights': ['The model is trained using the Random Forest Regressor algorithm with a max depth of 2 and random state of 100.', 'The performance is evaluated using mean squared error and r2 score.', 'The process 
involves combining the results of linear regression and random forest tables using pd.concat in a row-wise manner.', 'The method of reindexing the combined table ensures correct alignment of the indices for a tidy presentation.', 'The concept of stacking the tables on top of one another is explained, with the option to use axis one for column-wise combination.']}, {'end': 1856.353, 'segs': [{'end': 1692.609, 'src': 'embed', 'start': 1649.688, 'weight': 0, 'content': [{'end': 1655.712, 'text': 'So here you can see that we have already compared linear regression model and the random forest model.', 'start': 1649.688, 'duration': 6.024}, {'end': 1657.733, 'text': "Let's have a look at the scikit-learn.", 'start': 1655.952, 'duration': 1.781}, {'end': 1669.001, 'text': 'Okay, and if you click here regression and so here you could find other regression models that you like and you could use it to build your own in the Colab notebook here,', 'start': 1658.054, 'duration': 10.947}, {'end': 1675.405, 'text': 'and then you could then add the resulting performance into the data frame here to make your comparison.', 'start': 1669.001, 'duration': 6.404}, {'end': 1684.407, 'text': "and so now we're going to perform data visualization to take the predicted value and the actual value and make a scatter plot.", 'start': 1675.865, 'duration': 8.542}, {'end': 1685.127, 'text': "let's do it.", 'start': 1684.407, 'duration': 0.72}, {'end': 1692.609, 'text': "let's say data visualization of prediction results and we're going to make use of the matplotlib library.", 'start': 1685.127, 'duration': 7.482}], 'summary': 'Comparing linear regression and random forest models, using scikit-learn to build and visualize regression models, and plotting predicted vs actual values using matplotlib.', 'duration': 42.921, 'max_score': 1649.688, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/29ZQ3TDGgRQ/pics/29ZQ3TDGgRQ1649688.jpg'}, {'end': 1856.353, 'src': 'embed', 
'start': 1783.988, 'weight': 2, 'content': [{'end': 1786.468, 'text': 'so we added this red line as trend line.', 'start': 1783.988, 'duration': 2.48}, {'end': 1788.269, 'text': 'that are fitted with the data here.', 'start': 1786.468, 'duration': 1.801}, {'end': 1790.733, 'text': 'So congratulations.', 'start': 1789.412, 'duration': 1.321}, {'end': 1796.255, 'text': 'you built your first machine learning model in Python using the scikit-learn library.', 'start': 1790.733, 'duration': 5.522}, {'end': 1803.378, 'text': 'So you can see how easy it is now to build models in Python, particularly for your tabular data sets.', 'start': 1796.375, 'duration': 7.003}, {'end': 1809.041, 'text': 'And so please feel free to build more models and you could tweak the learning parameters and,', 'start': 1803.538, 'duration': 5.503}, {'end': 1813.484, 'text': 'as I have shown you this API documentation from scikit-learn.', 'start': 1809.041, 'duration': 4.443}, {'end': 1816.187, 'text': 'You could go through the documentation.', 'start': 1813.765, 'duration': 2.422}, {'end': 1824.595, 'text': "You could click on an algorithm that you're interested in, read about it, and then look at some of the parameters that it allows you to adjust.", 'start': 1816.427, 'duration': 8.168}, {'end': 1826.316, 'text': 'So give it some try.', 'start': 1824.995, 'duration': 1.321}, {'end': 1831.461, 'text': 'Let me know in the comments down below what models that you are building and have fun.', 'start': 1826.436, 'duration': 5.025}, {'end': 1834.868, 'text': 'Thank you for watching until the end of the video.', 'start': 1832.708, 'duration': 2.16}, {'end': 1839.649, 'text': "If you reach this far, drop a snake emoji so that I know that you're the real one.", 'start': 1835.068, 'duration': 4.581}, {'end': 1843.23, 'text': "And while you're at it, please smash the like button.", 'start': 1839.93, 'duration': 3.3}, {'end': 1844.991, 'text': "subscribe if you haven't already.", 'start': 1843.23, 
'duration': 1.761}, {'end': 1849.251, 'text': 'make sure to turn on notifications to be notified of the next video.', 'start': 1844.991, 'duration': 4.26}, {'end': 1854.413, 'text': 'And as always, the best way to learn data science is to do data science.', 'start': 1849.632, 'duration': 4.781}, {'end': 1856.353, 'text': 'And please enjoy the journey.', 'start': 1854.693, 'duration': 1.66}], 'summary': 'Built first machine learning model in python using scikit-learn library for tabular data sets.', 'duration': 72.365, 'max_score': 1783.988, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/29ZQ3TDGgRQ/pics/29ZQ3TDGgRQ1783988.jpg'}], 'start': 1649.688, 'title': 'Machine learning models in python', 'summary': 'Discusses comparing regression models, utilizing scikit-learn for model selection, adding performance results to a data frame, and performing data visualization using matplotlib. It also demonstrates building a machine learning model in Python using the scikit-learn library and encourages further exploration of API documentation and model building.', 'chapters': [{'end': 1750.896, 'start': 1649.688, 'title': 'Comparing regression models and data visualization', 'summary': 'Discusses comparing regression models, utilizing scikit-learn for model selection, adding performance results to a data frame, and performing data visualization using matplotlib to create a scatter plot of predicted and actual values.', 'duration': 101.208, 'highlights': ['The chapter demonstrates the process of comparing linear regression and random forest models, followed by utilizing scikit-learn for selecting other regression models for building and evaluating performance, and performing data visualization with matplotlib to create a scatter plot of predicted and actual values.', 'The process involves importing matplotlib.pyplot as plt, creating a scatter plot with y train on the x-axis and y train pred on the y-axis, adjusting the darkness of the samples represented 
by circles using the alpha option, and labeling the X and Y axis as well as setting a fixed size of five by five for the plot.']}, {'end': 1856.353, 'start': 1750.996, 'title': 'Building machine learning models in python', 'summary': "Demonstrates building a machine learning model in python using the scikit-learn library, showcasing the ease of model building for tabular datasets and encouraging further exploration of api documentation and model building. viewers are encouraged to try building models, adjust learning parameters, and engage with scikit-learn's api documentation for algorithm exploration.", 'duration': 105.357, 'highlights': ['The chapter demonstrates building a machine learning model in Python using the scikit-learn library, showcasing the ease of model building for tabular datasets.', "Viewers are encouraged to try building models, adjust learning parameters, and engage with scikit-learn's API documentation for algorithm exploration.", 'The presenter encourages viewers to drop a snake emoji and engage with the content by liking, subscribing, and turning on notifications.', 'The presenter emphasizes the importance of learning by doing and enjoying the journey of data science.']}], 'duration': 206.665, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/29ZQ3TDGgRQ/pics/29ZQ3TDGgRQ1649688.jpg', 'highlights': ['The chapter demonstrates the process of comparing linear regression and random forest models, utilizing scikit-learn for model selection, and performing data visualization with matplotlib.', 'The process involves creating a scatter plot of predicted and actual values using matplotlib, adjusting sample darkness, and labeling the X and Y axis.', 'The chapter demonstrates building a machine learning model in Python using the scikit-learn library for tabular datasets.', "Viewers are encouraged to try building models, adjust learning parameters, and engage with scikit-learn's API documentation for algorithm exploration.", 
'The presenter emphasizes the importance of learning by doing and enjoying the journey of data science.']}], 'highlights': ['Loading Delany dataset for drug solubility analysis is crucial', 'Exploring scikit-learn library in Google Colab for building ML model', 'Importing dataset using Pandas in Jupyter Notebook', 'Splitting dataset into training and testing sets with 80/20 ratio', 'Organizing text cells for creating a table of contents', 'Training model on training set using linear regression', 'Utilizing Pandas data frame for comparison of ML models', 'Training model using Random Forest Regressor with specific parameters', 'Comparing linear regression and random forest models with data visualization']}
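
The evaluation step described in the transcript (mean squared error and R2 score on both the training and testing sets, collected into one results table per model) can be sketched as follows. This is a minimal, dependency-free illustration of what the two metrics compute, not the video's own code, which uses scikit-learn's `mean_squared_error` and `r2_score`. The function names `mse`, `r2`, and `results_row`, the column labels, and the sample numbers are all assumptions made for illustration.

```python
# Dependency-free sketch of the two metrics named in the transcript.
# All names and sample values here are hypothetical, for illustration only.


def _check_inputs(y_true, y_pred):
    """Reject empty or mismatched inputs before computing a metric."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")


def mse(y_true, y_pred):
    """Mean squared error: average squared gap between actual and predicted."""
    _check_inputs(y_true, y_pred)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """R2 score: 1 - (residual sum of squares / total sum of squares)."""
    _check_inputs(y_true, y_pred)
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


def results_row(method, y_train, y_train_pred, y_test, y_test_pred):
    """One row of a comparison table: a method name plus four metrics."""
    return {
        "Method": method,
        "Training MSE": mse(y_train, y_train_pred),
        "Training R2": r2(y_train, y_train_pred),
        "Test MSE": mse(y_test, y_test_pred),
        "Test R2": r2(y_test, y_test_pred),
    }


if __name__ == "__main__":
    # Made-up actual and predicted values standing in for two models.
    y_train, y_test = [1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5]
    rows = [
        results_row("Model A", y_train, [1.1, 1.9, 3.2, 3.8], y_test, [1.4, 2.7, 3.3]),
        results_row("Model B", y_train, [1.5, 2.0, 2.5, 3.5], y_test, [2.0, 2.5, 3.0]),
    ]
    # Stacking the rows in a list mirrors the row-wise combination in the video.
    for row in rows:
        print(row)
```

A lower MSE and an R2 closer to 1 both indicate predictions nearer the actual values, which is why the transcript compares the two side by side. In a notebook, rows like these would typically be held in a pandas DataFrame so they can be sorted by a metric column.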