title

Data science in Python: pandas, seaborn, scikit-learn

description

In this video, we'll cover the data science pipeline from data ingestion (with pandas) to data visualization (with seaborn) to machine learning (with scikit-learn). We'll learn how to train and interpret a linear regression model, and then compare three possible evaluation metrics for regression problems. Finally, we'll apply the train/test split procedure to decide which features to include in our model.
Download the notebook: https://github.com/justmarkham/scikit-learn-videos
pandas installation instructions: http://pandas.pydata.org/pandas-docs/stable/install.html
seaborn installation instructions: http://seaborn.pydata.org/installing.html
Longer linear regression notebook: https://github.com/justmarkham/DAT5/blob/master/notebooks/09_linear_regression.ipynb
Chapter 3 of Introduction to Statistical Learning: http://www-bcf.usc.edu/~gareth/ISL/
Videos related to Chapter 3: https://www.dataschool.io/15-hours-of-expert-machine-learning-videos/
Quick reference guide to linear regression: https://www.dataschool.io/applying-and-interpreting-linear-regression/
Introduction to linear regression: http://people.duke.edu/~rnau/regintro.htm
pandas Q&A video series: https://www.dataschool.io/easier-data-analysis-with-pandas/
pandas 3-part tutorial: http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/
pandas read_csv documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
pandas read_table documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html
seaborn tutorial: http://seaborn.pydata.org/tutorial.html
seaborn example gallery: http://seaborn.pydata.org/examples/index.html
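The pipeline described above (load data with pandas, pick feature columns, train/test split, fit a linear regression, evaluate) can be sketched roughly as follows. This is a minimal sketch and not the notebook's code: the data are synthetic stand-ins generated with NumPy rather than the real advertising dataset, and the column names (TV, Radio, Newspaper, Sales), the coefficients used to generate the data, and the random seeds are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the advertising data (assumption: NOT the real dataset).
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "TV": rng.uniform(0, 300, n),
    "Radio": rng.uniform(0, 50, n),
    "Newspaper": rng.uniform(0, 100, n),
})
# Hypothetical "true" relationship used only to generate a response column.
data["Sales"] = (3.0 + 0.045 * data["TV"] + 0.19 * data["Radio"]
                 + rng.normal(0, 1.5, n))

# Feature matrix X (DataFrame) and response vector y (Series).
feature_cols = ["TV", "Radio", "Newspaper"]
X = data[feature_cols]
y = data["Sales"]

# The default split holds out 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Fit the model on the training set only.
linreg = LinearRegression()
linreg.fit(X_train, y_train)

# Pair each feature name with its learned coefficient.
coefs = dict(zip(feature_cols, linreg.coef_))
print(linreg.intercept_, coefs)

# Predict on the testing set and compute root mean squared error.
y_pred = linreg.predict(X_test)
rmse = float(np.sqrt(np.mean((y_test - y_pred) ** 2)))
print(rmse)
```

To compare feature sets, the same steps can be repeated with a shorter feature_cols list (for example, without Newspaper) and the resulting test-set RMSE values compared.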

detail

{'title': 'Data science in Python: pandas, seaborn, scikit-learn', 'heatmap': [{'end': 649.956, 'start': 620.009, 'weight': 0.728}, {'end': 1252.203, 'start': 1199.884, 'weight': 1}, {'end': 1531.802, 'start': 1480.736, 'weight': 0.916}], 'summary': 'Covers regression in scikit-learn using pandas and seaborn for data manipulation and visualization, data analysis with pandas, seaborn library for visualizing advertising mediums and sales, linear regression modeling with scikit-learn, interpreting linear regression coefficients and machine learning, and regression evaluation metrics with an rmse of about 1.4 for sales predictions.', 'chapters': [{'end': 93.68, 'segs': [{'end': 30.177, 'src': 'embed', 'start': 1.291, 'weight': 0, 'content': [{'end': 5.215, 'text': 'Welcome back to my video series on machine learning in Scikit-Learn.', 'start': 1.291, 'duration': 3.924}, {'end': 12.943, 'text': 'In the previous video, we learned how to properly evaluate a model using the train-test-split procedure.', 'start': 6.316, 'duration': 6.627}, {'end': 20.691, 'text': 'We were focusing on classification models, and our evaluation metric was classification accuracy.', 'start': 14.064, 'duration': 6.627}, {'end': 24.175, 'text': "In this video, I'll be covering the following.", 'start': 21.933, 'duration': 2.242}, {'end': 30.177, 'text': 'How do I use the Pandas library to read data into Python?', 'start': 26.195, 'duration': 3.982}], 'summary': 'Video series on scikit-learn covers model evaluation and pandas data reading.', 'duration': 28.886, 'max_score': 1.291, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/3ZWuPVWq7p4/pics/3ZWuPVWq7p41291.jpg'}], 'start': 1.291, 'title': 'Regression in scikit-learn', 'summary': 'Covers regression in scikit-learn, using pandas and seaborn for data manipulation and visualization, training linear regression models, evaluating regression problems, and feature selection.', 'chapters': [{'end': 93.68, 'start': 1.291, 
'title': 'Regression in scikit-learn', 'summary': 'Covers regression in scikit-learn, including the use of pandas and seaborn libraries for data manipulation and visualization, linear regression model training and interpretation in scikit-learn, evaluation metrics for regression problems, and feature selection.', 'duration': 92.389, 'highlights': ['The chapter covers regression in Scikit-Learn, including the use of Pandas and Seaborn libraries for data manipulation and visualization, linear regression model training and interpretation in Scikit-Learn, evaluation metrics for regression problems, and feature selection. It includes using Pandas and Seaborn libraries for data manipulation and visualization, training and interpreting a linear regression model in Scikit-Learn, discussing evaluation metrics for regression problems, and addressing how to choose features for the model.', 'Regression is a type of supervised learning in which the goal is to predict a continuous response. Regression is a type of supervised learning aimed at predicting continuous responses, as opposed to categorical responses in classification.', 'The focus of the video is on regression problems. The video is primarily centered on regression problems, shifting the focus from the previous classification models.', 'The previous video focused on evaluating a model using the train-test-split procedure with classification accuracy as the metric. 
In the previous video, the emphasis was on evaluating models using the train-test-split procedure, with a focus on classification models and the classification accuracy metric.']}], 'duration': 92.389, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/3ZWuPVWq7p4/pics/3ZWuPVWq7p41291.jpg', 'highlights': ['The chapter covers regression in Scikit-Learn, including the use of Pandas and Seaborn libraries for data manipulation and visualization, linear regression model training and interpretation in Scikit-Learn, evaluation metrics for regression problems, and feature selection.', 'Regression is a type of supervised learning in which the goal is to predict a continuous response.', 'The video is primarily centered on regression problems, shifting the focus from the previous classification models.', 'The previous video focused on evaluating a model using the train-test-split procedure with classification accuracy as the metric.']}, {'end': 469.452, 'segs': [{'end': 120.769, 'src': 'embed', 'start': 93.68, 'weight': 6, 'content': [{'end': 99.682, 'text': 'an extremely popular library for data exploration, manipulation and analysis.', 'start': 93.68, 'duration': 6.002}, {'end': 108.165, 'text': "If you're using the Anaconda distribution of Python, Pandas and its dependencies are already installed.", 'start': 100.903, 'duration': 7.262}, {'end': 112.206, 'text': "Otherwise, I've linked to the installation instructions.", 'start': 109.185, 'duration': 3.021}, {'end': 120.769, 'text': "we'll start by importing pandas in the conventional way, which is to import pandas as PD.", 'start': 114.025, 'duration': 6.744}], 'summary': 'Pandas is a popular data library for python, often included in anaconda distribution.', 'duration': 27.089, 'max_score': 93.68, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/3ZWuPVWq7p4/pics/3ZWuPVWq7p493680.jpg'}, {'end': 182.77, 'src': 'embed', 'start': 153.434, 'weight': 2, 'content': [{'end': 
162.34, 'text': 'Pandas has a function for reading in CSV files called read_csv, and you simply pass in the name of the file.', 'start': 153.434, 'duration': 8.906}, {'end': 172.262, 'text': "You can read files from your local computer or you can actually read files directly from a URL, which is what I'm doing here.", 'start': 163.715, 'duration': 8.547}, {'end': 182.77, 'text': "I'm going to save the results as an object called data and then run the head method on that object to see the first five rows of data.", 'start': 172.282, 'duration': 10.488}], 'summary': "Pandas' read_csv function reads CSV files from a local path or a url; the head method shows the first five rows.", 'duration': 29.336, 'max_score': 153.434, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/3ZWuPVWq7p4/pics/3ZWuPVWq7p4153434.jpg'}, {'end': 246.648, 'src': 'embed', 'start': 216.547, 'weight': 0, 'content': [{'end': 223.208, 'text': "However, it looks like there's an unnamed column that contains sequential numbers starting at one.", 'start': 216.547, 'duration': 6.661}, {'end': 230.61, 'text': 'So those are probably just the ID numbers for those observations.', 'start': 226.869, 'duration': 3.741}, {'end': 238.412, 'text': "I'm gonna take those numbers and use them as the index, which is how Pandas identifies the rows.", 'start': 231.85, 'duration': 6.562}, {'end': 246.648, 'text': 'The default index is sequential numbers starting at zero, shown on the left side in bold.', 'start': 239.946, 'duration': 6.702}], 'summary': 'Unnamed column contains sequential numbers starting at one, used as index for pandas rows.', 'duration': 30.101, 'max_score': 216.547, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/3ZWuPVWq7p4/pics/3ZWuPVWq7p4216547.jpg'}], 'start': 93.68, 'title': 'Data analysis with pandas', 'summary': "Introduces pandas for data analysis, demonstrating reading a dataset from a csv file, displaying the first five rows of data, and setting the index for 
the data frame. it also explains how to set id numbers as the index using the index_col parameter of read_csv and how to confirm the dataframe's shape, with a highest index number of 200. additionally, it covers structuring advertising data for a supervised learning task, with 200 observations representing markets and the goal of predicting sales based on tv, radio, and newspaper ad spending as features.", 'chapters': [{'end': 246.648, 'start': 93.68, 'title': 'Introduction to pandas data analysis', 'summary': 'Introduces pandas, a popular library for data analysis, and demonstrates reading a dataset from a csv file, displaying the first five rows of data, and setting the index for the data frame.', 'duration': 152.968, 'highlights': ['Pandas is an extremely popular library for data exploration, manipulation, and analysis, and it comes pre-installed with the Anaconda distribution of Python.', 'Reading in CSV files using the read_csv function and displaying the first five rows of data using the head method is demonstrated.', 'Setting the index for the data frame to use the unnamed column containing sequential numbers as the identifier for rows is shown.']}, {'end': 348.065, 'start': 250.549, 'title': 'Setting index with read_csv', 'summary': "Explains how to set id numbers as the index using the index_col parameter of read_csv and how to confirm the dataframe's shape to verify the number of rows, with a highest index number of 200.", 'duration': 97.516, 'highlights': ["The read_csv function's index_col parameter is used to set a specific column as the index, demonstrated with index_col=0.", 'The shape attribute of the DataFrame is used to confirm the number of rows, with the highest index number being 200.', 'The tail method of DataFrames is mentioned, showing the last five rows for further context.']}, {'end': 469.452, 'start': 353.018, 'title': 'Supervised learning with advertising data', 'summary': 'Explains how to structure advertising data for a 
supervised learning task, with 200 observations representing markets and the goal of predicting sales based on tv, radio, and newspaper ad spending, which are used as features.', 'duration': 116.434, 'highlights': ['The dataset has 200 observations, and each observation represents a single market, with the goal of predicting sales based on advertising dollars.', 'TV, radio, and newspaper spending are used as features to predict sales, representing a regression problem.', 'In market 200, $232,000 was spent on TV ads, $8,600 on radio ads, $8,700 on newspaper ads, and 13,400 items were sold.']}], 'duration': 375.772, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/3ZWuPVWq7p4/pics/3ZWuPVWq7p493680.jpg', 'highlights': ['Pandas is an extremely popular library for data exploration, manipulation, and analysis.', 'Reading in CSV files and displaying the first five rows of data is demonstrated.', 'Setting the index for the data frame using the unnamed column containing sequential numbers is shown.', "The read CSV function's index call parameter is utilized to set a specific column as the index.", 'The shape attribute of the DataFrame is used to confirm the number of rows, with the highest index number being 200.', 'The dataset has 200 observations, and each observation represents a single market.', 'TV, radio, and newspaper spending are used as features to predict sales, representing a regression problem.']}, {'end': 866.01, 'segs': [{'end': 649.956, 'src': 'heatmap', 'start': 620.009, 'weight': 0.728, 'content': [{'end': 625.817, 'text': 'Seaborn has added a line of best fit as well as a 95% confidence band.', 'start': 620.009, 'duration': 5.808}, {'end': 635.992, 'text': 'Because there appears to be a linear relationship between the features and the response, this is a great candidate for the linear regression method.', 'start': 627.66, 'duration': 8.332}, {'end': 649.956, 'text': "Linear regression is quite a deep topic, but I'm just gonna 
give you a brief introduction before we implement it in scikit-learn.", 'start': 641.693, 'duration': 8.263}], 'summary': 'Seaborn added a 95% confidence band, indicating a linear relationship, making it a good candidate for linear regression.', 'duration': 29.947, 'max_score': 620.009, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/3ZWuPVWq7p4/pics/3ZWuPVWq7p4620009.jpg'}, {'end': 866.01, 'src': 'embed', 'start': 768.823, 'weight': 0, 'content': [{'end': 775.547, 'text': "Let's take a look at the functional form of linear regression in order to gain an understanding of how it works.", 'start': 768.823, 'duration': 6.724}, {'end': 778.589, 'text': 'It can be represented as follows.', 'start': 776.608, 'duration': 1.981}, {'end': 793.97, 'text': 'y equals beta naught plus beta1 x1 plus beta2 x2 all the way to beta n xn, in which n is the number of features.', 'start': 780.082, 'duration': 13.888}, {'end': 799.893, 'text': "Let's briefly discuss each of the model terms.", 'start': 795.391, 'duration': 4.502}, {'end': 801.814, 'text': 'y is simply the response value.', 'start': 799.913, 'duration': 1.901}, {'end': 808.518, 'text': 'Each of the features is represented by an x variable, and each feature has a coefficient.', 'start': 803.095, 'duration': 5.423}, {'end': 821.541, 'text': 'In this case, we have three features, TV, radio, and newspaper, and each feature has a beta value, beta1 and beta2 and beta3.', 'start': 810.211, 'duration': 11.33}, {'end': 835.774, 'text': 'Finally, beta0 is called the intercept, which is the value of y when all of the x values are 0.', 'start': 823.223, 'duration': 12.551}, {'end': 845.879, 'text': 'These beta values, as well as the intercept, are learned during the model fitting process using what is called the least squares criterion.', 'start': 835.774, 'duration': 10.105}, {'end': 857.585, 'text': 'Basically, linear regression seeks to find the line that best fits the observed data, as we can see 
here.', 'start': 846.94, 'duration': 10.645}, {'end': 866.01, 'text': 'It defines the best line as the one that minimizes the sum of squared errors,', 'start': 859.426, 'duration': 6.584}], 'summary': 'Linear regression models the relationship between features and the response value, learning coefficients and intercept using least squares criterion.', 'duration': 97.187, 'max_score': 768.823, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/3ZWuPVWq7p4/pics/3ZWuPVWq7p4768823.jpg'}], 'start': 470.713, 'title': 'Seaborn library and linear regression', 'summary': 'Introduces the seaborn library for data visualization, showcasing its usage in visualizing the relationship between advertising mediums and sales, and discusses the benefits, drawbacks, functional form, and model terms of linear regression.', 'chapters': [{'end': 625.817, 'start': 470.713, 'title': 'Seaborn library for data visualization', 'summary': 'Introduces the seaborn library, a python tool for statistical data visualization, and demonstrates its usage for visualizing the relationship between advertising mediums and sales, revealing a strong linear relationship between tv advertising and sales.', 'duration': 155.104, 'highlights': ["Seaborn library is introduced as a Python tool for statistical data visualization, which can be easily installed using 'conda install seaborn' in Anaconda.", "Demonstrates the usage of Seaborn's pair plot function to visualize the relationship between advertising mediums and sales, revealing a strong linear relationship between TV advertising and sales.", 'Shows the addition of a line of best fit and a 95% confidence band using Seaborn to plot the relationships between advertising mediums and sales.']}, {'end': 866.01, 'start': 627.66, 'title': 'Introduction to linear regression', 'summary': 'Discusses the benefits and drawbacks of linear regression, highlighting its popularity due to speed, ease of use, interpretability, and extensive 
literature, but also its limitation in accurately modeling highly nonlinear relationships, and explains the functional form and model terms of linear regression.', 'duration': 238.35, 'highlights': ['Linear regression is popular for its quick run time, ease of use, interpretability, and extensive literature, but it may not produce the best predictive accuracy compared to other models. Linear regression is popular due to its quick run time, ease of use, interpretability, and extensive literature, but it may not produce the best predictive accuracy compared to other models.', 'Linear regression seeks to find the line that best fits the observed data by minimizing the sum of squared errors using the least squares criterion. Linear regression seeks to find the line that best fits the observed data by minimizing the sum of squared errors using the least squares criterion.', 'The functional form of linear regression is represented as y equals beta naught plus beta1 x1 plus beta2 x2 all the way to beta n xn, in which n is the number of features. 
The functional form of linear regression is represented as y equals beta naught plus beta1 x1 plus beta2 x2 all the way to beta n xn, in which n is the number of features.']}], 'duration': 395.297, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/3ZWuPVWq7p4/pics/3ZWuPVWq7p4470713.jpg', 'highlights': ["Seaborn library is introduced as a Python tool for statistical data visualization, which can be easily installed using 'conda install seaborn' in Anaconda.", "Demonstrates the usage of Seaborn's pair plot function to visualize the relationship between advertising mediums and sales, revealing a strong linear relationship between TV advertising and sales.", 'Shows the addition of a line of best fit and a 95% confidence band using Seaborn to plot the relationships between advertising mediums and sales.', 'Linear regression is popular for its quick run time, ease of use, interpretability, and extensive literature, but it may not produce the best predictive accuracy compared to other models.', 'Linear regression seeks to find the line that best fits the observed data by minimizing the sum of squared errors using the least squares criterion.', 'The functional form of linear regression is represented as y equals beta naught plus beta1 x1 plus beta2 x2 all the way to beta n xn, in which n is the number of features.']}, {'end': 1089.492, 'segs': [{'end': 906.6, 'src': 'embed', 'start': 866.01, 'weight': 1, 'content': [{'end': 878.779, 'text': 'which is really just the sum of the squared vertical distances between each point in the line.', 'start': 866.01, 'duration': 12.769}, {'end': 889.513, 'text': 'Once this line of best fit has been learned, it can be used to make predictions for sales given any set of feature values.', 'start': 880.45, 'duration': 9.063}, {'end': 906.6, 'text': 'Before we start the modeling process with scikit-learn, we first have to define the feature matrix X and the response vector Y.', 'start': 897.697, 'duration': 
8.903}], 'summary': 'Sum of squared vertical distances for best fit line; using scikit-learn for modeling sales predictions.', 'duration': 40.59, 'max_score': 866.01, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/3ZWuPVWq7p4/pics/3ZWuPVWq7p4866010.jpg'}, {'end': 979.102, 'src': 'embed', 'start': 946.038, 'weight': 0, 'content': [{'end': 952.021, 'text': 'What we need to create is a data frame that contains only our three feature columns.', 'start': 946.038, 'duration': 5.983}, {'end': 964.128, 'text': "So first, let's create a Python list called feature_cols that contains the names of our feature columns stored as strings.", 'start': 953.202, 'duration': 10.926}, {'end': 979.102, 'text': 'Then we can say data[feature_cols], which tells Pandas to select a subset of the original DataFrame columns.', 'start': 965.669, 'duration': 13.433}], 'summary': 'Create a data frame with 3 feature columns using python list and pandas.', 'duration': 33.064, 'max_score': 946.038, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/3ZWuPVWq7p4/pics/3ZWuPVWq7p4946038.jpg'}], 'start': 866.01, 'title': 'Linear regression modeling with scikit-learn', 'summary': 'Introduces the process of modeling with scikit-learn, emphasizing the use of pandas data frame for x and pandas series for y, accessing the underlying numpy arrays for linear regression predictions.', 'chapters': [{'end': 944.701, 'start': 866.01, 'title': 'Linear regression modeling with scikit-learn', 'summary': 'Introduces the process of modeling with scikit-learn, emphasizing the use of pandas data frame for x and pandas series for y, in order to access the underlying numpy arrays for linear regression predictions.', 'duration': 78.691, 'highlights': ['The chapter introduces the process of modeling with scikit-learn. It outlines the main topic of the chapter.', 'emphasizing the use of pandas data frame for X and pandas series for Y Explains the 
specific data types required for X and Y in the modeling process.', 'in order to access the underlying NumPy arrays for linear regression predictions Describes the purpose of using pandas data frame and series to access NumPy arrays for prediction.']}, {'end': 1089.492, 'start': 946.038, 'title': 'Creating data frame and selecting columns', 'summary': "Explains how to create a data frame with three feature columns using pandas in python and confirms its dimensions as 200 rows and 3 columns, and then proceeds to select the 'sales' series from the data frame.", 'duration': 143.454, 'highlights': ['Creating a data frame with 3 feature columns using Pandas in Python The process involves creating a Python list of feature column names and using the list to select a subset of the original DataFrame columns.', "Confirming the data frame's dimensions as 200 rows and 3 columns Using Python's type function to confirm the data frame type and printing the shape attribute to verify the dimensions.", "Selecting the 'sales' series from the data frame Utilizing bracket notation with a string containing the name of the column to select the 'sales' series, and then checking the results using the head method on the series."]}], 'duration': 223.482, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/3ZWuPVWq7p4/pics/3ZWuPVWq7p4866010.jpg', 'highlights': ['The chapter introduces the process of modeling with scikit-learn It outlines the main topic of the chapter.', 'emphasizing the use of pandas data frame for X and pandas series for Y Explains the specific data types required for X and Y in the modeling process.', 'in order to access the underlying NumPy arrays for linear regression predictions Describes the purpose of using pandas data frame and series to access NumPy arrays for prediction.', 'Creating a data frame with 3 feature columns using Pandas in Python The process involves creating a Python list of feature column names and using the list to select a 
subset of the original DataFrame columns.', "Confirming the data frame's dimensions as 200 rows and 3 columns Using Python's type function to confirm the data frame type and printing the shape attribute to verify the dimensions.", "Selecting the 'sales' series from the data frame Utilizing bracket notation with a string containing the name of the column to select the 'sales' series, and then checking the results using the head method on the series."]}, {'end': 1297.405, 'segs': [{'end': 1131.228, 'src': 'embed', 'start': 1089.492, 'weight': 2, 'content': [{'end': 1111.283, 'text': "we can use the type function to confirm that it's a series and print the shape attribute to confirm that it's a one-dimensional array with length 200..", 'start': 1089.492, 'duration': 21.791}, {'end': 1120.585, 'text': 'Our final step before using linear regression is to split x and y into training and testing sets for proper model evaluation.', 'start': 1111.283, 'duration': 9.302}, {'end': 1131.228, 'text': 'As we saw in the last video, we use the train test split function to split x and y into two objects each.', 'start': 1122.126, 'duration': 9.102}], 'summary': "Using type function, confirmed it's a series with shape attribute as a 1d array of length 200. 
split x and y for model evaluation.", 'duration': 41.736, 'max_score': 1089.492, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/3ZWuPVWq7p4/pics/3ZWuPVWq7p41089492.jpg'}, {'end': 1264.462, 'src': 'heatmap', 'start': 1199.884, 'weight': 0, 'content': [{'end': 1207.989, 'text': 'In the case of linear regression, the model is learning the intercept and coefficients for the line of best fit.', 'start': 1199.884, 'duration': 8.105}, {'end': 1216.275, 'text': "Then it has an easy formula for making predictions during the predict step, which we'll see below.", 'start': 1209.55, 'duration': 6.725}, {'end': 1227.522, 'text': 'As I mentioned previously, linear regression is a highly interpretable model.', 'start': 1221.978, 'duration': 5.544}, {'end': 1234.819, 'text': "Let's see what that means by printing out the intercept and the coefficients and then interpreting them.", 'start': 1228.817, 'duration': 6.002}, {'end': 1252.203, 'text': 'The intercept and coefficients are stored in separate attributes of the linreg object, which is why I use the dot notation to access them.', 'start': 1241.56, 'duration': 10.643}, {'end': 1258.33, 'text': "You'll notice they both have a trailing underscore in their names,", 'start': 1253.581, 'duration': 4.749}, {'end': 1264.462, 'text': "which is scikit-learn's convention for any attributes that were estimated from the data.", 'start': 1258.33, 'duration': 6.132}], 'summary': 'Linear regression model learns intercept and coefficients for best fit line, providing easy predictions and interpretability.', 'duration': 86.797, 'max_score': 1199.884, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/3ZWuPVWq7p4/pics/3ZWuPVWq7p41199884.jpg'}], 'start': 1089.492, 'title': 'Linear regression model building, model fitting, predictions, coefficients, and intercepts', 'summary': "Covers confirming series type and shape, splitting data into training and testing sets, and building a linear 
regression model using scikit-learn. it also discusses the fitting step for k-neighbor's classifier and linear regression, including memorization of training data and learning of intercept and coefficients, leading to easy predictions. additionally, it delves into accessing intercept and coefficients in the linreg object, their storage in scikit-learn's convention, and their pairing with feature names using python's zip function for use in the linear regression formula.", 'chapters': [{'end': 1171.783, 'start': 1089.492, 'title': 'Linear regression model building', 'summary': 'Covers confirming series type and shape, splitting data into training and testing sets, and building a linear regression model using scikit-learn.', 'duration': 82.291, 'highlights': ["We confirm that it's a one-dimensional array with length 200 using the type function and print the shape attribute.", 'We split x and y into training and testing sets using the train test split function, defaulting to using 25% of the data for testing, and print the shapes of these objects.', 'We build our linear regression model using scikit-learn by importing the model, instantiating the model, and fitting the model to the training data.']}, {'end': 1234.819, 'start': 1177.665, 'title': 'Model fitting and predictions', 'summary': "Discusses the different processes during the fitting step for k-neighbor's classifier and linear regression, highlighting the memorization of training data and learning of intercept and coefficients, leading to easy predictions. it also emphasizes the interpretability of linear regression models through printing and interpreting intercept and coefficients.", 'duration': 57.154, 'highlights': ['Linear regression model learns the intercept and coefficients for the line of best fit. 
This demonstrates the specific action taken by the linear regression model during the fitting step, emphasizing its process of learning and memorization.', "K-Neighbor's classifier memorizes the training data during fitting to calculate the distance between new and existing observations. This highlights the distinct behavior of K-Neighbor's classifier during fitting, focusing on its memorization process and subsequent calculation of distances, showcasing its unique fitting step.", "Linear regression model has an easy formula for making predictions after the fitting step. This emphasizes the practical outcome of the fitting step for linear regression, highlighting the ease of making predictions and the efficiency of the model's predictive capabilities.", 'Linear regression is a highly interpretable model which allows for printing and interpreting the intercept and coefficients. This showcases the interpretability of the linear regression model, emphasizing its unique feature that enables the printing and interpretation of specific model components, enhancing its transparency and understandability.']}, {'end': 1297.405, 'start': 1241.56, 'title': 'Linear regression coefficients and intercepts', 'summary': "Discusses accessing intercept and coefficients in the linreg object, their storage in scikit-learn's convention, and their pairing with feature names using python's zip function for use in the linear regression formula.", 'duration': 55.845, 'highlights': ["The coefficients are stored in the coef attribute in the same order as they were stored in the feature matrix x, and can be paired with feature names using Python's zip function.", "The intercept and coefficients can be accessed in the linreg object using the dot notation and have a trailing underscore in their names as per scikit-learn's convention."]}], 'duration': 207.913, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/3ZWuPVWq7p4/pics/3ZWuPVWq7p41089492.jpg', 'highlights': ['We 
build our linear regression model using scikit-learn by importing the model, instantiating the model, and fitting the model to the training data.', 'Linear regression model learns the intercept and coefficients for the line of best fit. This demonstrates the specific action taken by the linear regression model during the fitting step, emphasizing its process of learning and memorization.', "Linear regression model has an easy formula for making predictions after the fitting step. This emphasizes the practical outcome of the fitting step for linear regression, highlighting the ease of making predictions and the efficiency of the model's predictive capabilities.", "The coefficients are stored in the coef attribute in the same order as they were stored in the feature matrix x, and can be paired with feature names using Python's zip function."]}, {'end': 1491.36, 'segs': [{'end': 1434.732, 'src': 'embed', 'start': 1401.763, 'weight': 0, 'content': [{'end': 1406.566, 'text': 'Before we move on to the prediction step, I want to add two important notes.', 'start': 1401.763, 'duration': 4.803}, {'end': 1418.662, 'text': 'First, I was careful to use the phrase associated with when interpreting the TV coefficient because this is not a claim of causation.', 'start': 1408.075, 'duration': 10.587}, {'end': 1423.665, 'text': "It's a very difficult problem to determine causation,", 'start': 1419.963, 'duration': 3.702}, {'end': 1434.732, 'text': 'because that relies upon having access to every possible factor that could have influenced sales, whereas all we have is data on ad spending.', 'start': 1423.665, 'duration': 11.067}], 'summary': 'Caution is taken to avoid claiming causation due to limited data on ad spending.', 'duration': 32.969, 'max_score': 1401.763, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/3ZWuPVWq7p4/pics/3ZWuPVWq7p41401763.jpg'}, {'end': 1491.36, 'src': 'embed', 'start': 1451.395, 'weight': 1, 'content': [{'end': 1462.579, 
'text': 'For instance, if an increase in TV ad spending was associated with a decrease in sales, the beta one coefficient would have been negative.', 'start': 1451.395, 'duration': 11.184}, {'end': 1476.435, 'text': "Anyway, let's use our fitted model to make predictions on the testing set, which we store in y_pred.", 'start': 1468.072, 'duration': 8.363}, {'end': 1491.36, 'text': "Previously we've used classification accuracy as our evaluation metric, though that metric is not relevant for regression problems,", 'start': 1480.736, 'duration': 10.624}], 'summary': 'Fitting model to make predictions on testing set for regression analysis.', 'duration': 39.965, 'max_score': 1451.395, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/3ZWuPVWq7p4/pics/3ZWuPVWq7p41451395.jpg'}], 'start': 1300.007, 'title': 'Interpreting linear regression coefficients and machine learning', 'summary': 'Discusses interpreting coefficients in linear regression, emphasizing the relationship between ad spending and sales, with a $1000 increase in tv ad spending associated with a 46.6 item sales increase. 
it also explains the focus of machine learning on association over causation, the possibility of negative coefficients in linear regression, and the importance of evaluation metrics in regression problems.', 'chapters': [{'end': 1423.665, 'start': 1300.007, 'title': 'Interpreting linear regression coefficients', 'summary': 'Explains how to interpret the coefficients in a linear regression model, highlighting the relationship between ad spending and sales, where a $1000 increase in tv ad spending is associated with an increase in sales of 46.6 items, emphasizing the importance of interpretability in linear regression.', 'duration': 123.658, 'highlights': ['An additional $1000 spent on TV ads is associated with an increase in sales of 46.6 items, showcasing the relationship between ad spending and sales.', "Linear regression's interpretability is emphasized as a key factor for its popularity in regression problems, underlining the significance of understanding the coefficients' implications.", 'The TV coefficient of 0.0466 signifies that for a given amount of radio and newspaper ad spending, a unit increase in TV ad spending is associated with a 0.0466 unit increase in sales, providing insights into the impact of TV ad spending on sales.', 'The difficulty in determining causation is acknowledged, cautioning against inferring a causal relationship between ad spending and sales.']}, {'end': 1491.36, 'start': 1423.665, 'title': 'Machine learning and linear regression', 'summary': 'Explains the focus of machine learning on association over causation, the possibility of negative coefficients in linear regression, and the relevance of evaluation metrics in regression problems.', 'duration': 67.695, 'highlights': ['The chapter explains the focus of machine learning on association over causation.', 'The possibility of negative coefficients in linear regression is discussed.', 'The relevance of evaluation metrics in regression problems is highlighted.']}], 'duration': 
191.353, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/3ZWuPVWq7p4/pics/3ZWuPVWq7p41300007.jpg', 'highlights': ['An additional $1000 spent on TV ads is associated with an increase in sales of 46.6 items, showcasing the relationship between ad spending and sales.', 'The TV coefficient of 0.0466 signifies that for a given amount of radio and newspaper ad spending, a unit increase in TV ad spending is associated with a 0.0466 unit increase in sales, providing insights into the impact of TV ad spending on sales.', "Linear regression's interpretability is emphasized as a key factor for its popularity in regression problems, underlining the significance of understanding the coefficients' implications.", 'The chapter explains the focus of machine learning on association over causation.', 'The relevance of evaluation metrics in regression problems is highlighted.', 'The difficulty in determining causation is acknowledged, cautioning against inferring a causal relationship between ad spending and sales.', 'The possibility of negative coefficients in linear regression is discussed.']}, {'end': 2068.068, 'segs': [{'end': 1520.804, 'src': 'embed', 'start': 1491.36, 'weight': 4, 'content': [{'end': 1494.461, 'text': 'because regression problems have a continuous response.', 'start': 1491.36, 'duration': 3.101}, {'end': 1504.257, 'text': "Let's take a look at some common evaluation metrics for regression and then choose one to evaluate our predictions.", 'start': 1495.993, 'duration': 8.264}, {'end': 1520.804, 'text': "We'll start by creating some example numeric predictions and then evaluate them using a given metric so that we can get a feel for how those metrics work.", 'start': 1509.959, 'duration': 10.845}], 'summary': 'Exploring common evaluation metrics for regression problems.', 'duration': 29.444, 'max_score': 1491.36, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/3ZWuPVWq7p4/pics/3ZWuPVWq7p41491360.jpg'}, 
{'end': 1802.425, 'src': 'embed', 'start': 1776.871, 'weight': 5, 'content': [{'end': 1784.655, 'text': 'In the last video we saw that train test split could help us to choose between different models and different tuning parameters.', 'start': 1776.871, 'duration': 7.784}, {'end': 1795.181, 'text': "Linear regression doesn't have any tuning parameters, and we haven't yet learned other models for regression, so those options don't apply.", 'start': 1786.816, 'duration': 8.365}, {'end': 1802.425, 'text': 'However, note that train test split can also help us to choose between features.', 'start': 1796.482, 'duration': 5.943}], 'summary': 'Train test split aids in model and feature selection.', 'duration': 25.554, 'max_score': 1776.871, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/3ZWuPVWq7p4/pics/3ZWuPVWq7p41776871.jpg'}, {'end': 1912.191, 'src': 'embed', 'start': 1883.333, 'weight': 1, 'content': [{'end': 1892.418, 'text': 'Thus, our new model that excludes newspaper is performing slightly better than when it was included,', 'start': 1883.333, 'duration': 9.085}, {'end': 1896.441, 'text': 'indicating that the newspaper feature should be left out of the model.', 'start': 1892.418, 'duration': 4.023}, {'end': 1912.191, 'text': 'You could repeat this process with different combinations of features and then select the combination with the lowest RMSE as the best combination to use for this particular problem.', 'start': 1898.482, 'duration': 13.709}], 'summary': 'Excluding newspaper improves model performance, consider feature combinations for lowest rmse.', 'duration': 28.858, 'max_score': 1883.333, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/3ZWuPVWq7p4/pics/3ZWuPVWq7p41883333.jpg'}, {'end': 2068.068, 'src': 'embed', 'start': 2029.884, 'weight': 0, 'content': [{'end': 2040.032, 'text': "If you'd like to learn Seaborn, the official tutorial is quite good, and the example gallery will give you a quick look 
at most of its functionality.", 'start': 2029.884, 'duration': 10.148}, {'end': 2046.537, 'text': 'Congratulations on getting this far in the series.', 'start': 2044.235, 'duration': 2.302}, {'end': 2053.422, 'text': "I've got a lot more content planned, and your comments have been helpful in shaping the series, so thank you.", 'start': 2047.457, 'duration': 5.965}, {'end': 2063.759, 'text': "One question I have for you this week is whether you'd like me to focus more on pandas or instead focus exclusively on scikit-learn.", 'start': 2054.783, 'duration': 8.976}, {'end': 2068.068, 'text': "Let me know in the comments section below, and then I'll see you again soon.", 'start': 2064.701, 'duration': 3.367}], 'summary': 'Seaborn tutorial is recommended. more content planned. seeking input on pandas vs. scikit-learn focus.', 'duration': 38.184, 'max_score': 2029.884, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/3ZWuPVWq7p4/pics/3ZWuPVWq7p42029884.jpg'}], 'start': 1491.36, 'title': 'Regression evaluation metrics and feature selection', 'summary': 'Discusses common evaluation metrics for regression, highlighting rmse with an rmse of about 1.4 for sales predictions. it also explores using train test split for feature selection and model evaluation, resulting in a slight decrease to about 1.39 for the new model. additional resources for further learning are also provided.', 'chapters': [{'end': 1775.59, 'start': 1491.36, 'title': 'Evaluation metrics for regression', 'summary': 'Discusses common evaluation metrics for regression, including mean absolute error (mae), mean squared error (mse), and root mean squared error (rmse), and concludes by choosing rmse as the evaluation metric with an rmse of about 1.4 for sales predictions.', 'duration': 284.23, 'highlights': ['RMSE is around 12, which is interpretable in the y units. 
RMSE is popular as it is interpretable in the y units and easier to put into context.', 'MAE is the simplest metric with a calculated value of 10 for the given predictions. MAE is the mean of the absolute value of the errors and yields a calculated value of 10 for the given predictions.', 'MSE is 150, which is less interpretable compared to RMSE. MSE yields a value of 150, which is less interpretable compared to RMSE, making RMSE more popular.']}, {'end': 2068.068, 'start': 1776.871, 'title': 'Choosing features with train test split', 'summary': 'Explores using train test split to choose between features, removing the newspaper feature from the model, and evaluating the rmse, resulting in a slight decrease to about 1.39, indicating that the new model excluding newspaper is performing slightly better. additionally, the chapter provides resources for further learning on linear regression, pandas, and seaborn.', 'duration': 291.197, 'highlights': ['Using train test split to choose between features, removing the newspaper feature from the model, and evaluating the RMSE, resulting in a slight decrease to about 1.39, indicating that the new model excluding newspaper is performing slightly better. The RMSE after removing the newspaper feature from the model is about 1.39, a slight decrease from the previous model, suggesting that excluding newspaper is beneficial for the model.', 'Providing resources for further learning on linear regression, Pandas, and Seaborn. The chapter offers various resources for delving deeper into linear regression, Pandas, and Seaborn, including longer notebooks, a quick reference guide, and tutorials.', 'Seeking audience input on whether to focus more on pandas or exclusively on scikit-learn. 
The audience is invited to provide input on whether the focus should be more on pandas or exclusively on scikit-learn, demonstrating the interactive nature of the content.']}], 'duration': 576.708, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/3ZWuPVWq7p4/pics/3ZWuPVWq7p41491360.jpg', 'highlights': ['RMSE after removing newspaper feature is about 1.39, indicating slight improvement', 'MAE calculated value is 10 for the given predictions, providing a simple metric', 'RMSE is around 12, interpretable in the y units, popular for its context', 'MSE yields a value of 150, less interpretable compared to RMSE', 'Providing resources for further learning on linear regression, Pandas, and Seaborn', 'Seeking audience input on focusing more on pandas or exclusively on scikit-learn']}], 'highlights': ['The chapter covers regression in Scikit-Learn, including the use of Pandas and Seaborn libraries for data manipulation and visualization, linear regression model training and interpretation in Scikit-Learn, evaluation metrics for regression problems, and feature selection.', 'Covers regression in scikit-learn using pandas and seaborn for data manipulation and visualization, data analysis with pandas, seaborn library for visualizing advertising mediums and sales, linear regression modeling with scikit-learn, interpreting linear regression coefficients and machine learning, and regression evaluation metrics with an rmse of about 1.4 for sales predictions.', 'An additional $1000 spent on TV ads is associated with an increase in sales of 46.6 items, showcasing the relationship between ad spending and sales.', 'RMSE after removing newspaper feature is about 1.39, indicating slight improvement', 'Linear regression model learns the intercept and coefficients for the line of best fit. 
This demonstrates the specific action taken by the linear regression model during the fitting step, emphasizing that it learns model parameters rather than memorizing the training data.', 'Pandas is an extremely popular library for data exploration, manipulation, and analysis.', "The Seaborn library is introduced as a Python tool for statistical data visualization, which can be easily installed using 'conda install seaborn' in Anaconda.", 'We build our linear regression model using scikit-learn by importing the model, instantiating the model, and fitting the model to the training data.', 'The TV coefficient of 0.0466 signifies that for a given amount of radio and newspaper ad spending, a unit increase in TV ad spending is associated with a 0.0466 unit increase in sales, providing insights into the impact of TV ad spending on sales.', 'RMSE is around 12 for the example predictions and is popular because it is interpretable in the y units']}
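The three regression metrics compared above (MAE, MSE, RMSE) can be sketched in plain Python. This is a minimal illustration, not the notebook's own code: the example values in `y_true` and `y_pred` are assumptions chosen only because they reproduce the quoted figures (MAE 10, MSE 150, RMSE about 12), and the helper names `mae`, `mse`, and `rmse` are hypothetical. In practice, scikit-learn's `sklearn.metrics` module offers `mean_absolute_error` and `mean_squared_error` for the same purpose.

```python
import math

# Assumed example values (not taken from the text above); they were picked
# so that the results match the figures quoted there.
y_true = [100, 50, 30, 20]
y_pred = [90, 50, 50, 30]


def mae(true, pred):
    """Mean absolute error: the mean of the absolute values of the errors."""
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)


def mse(true, pred):
    """Mean squared error: the mean of the squared errors."""
    return sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true)


def rmse(true, pred):
    """Root mean squared error: square root of MSE, in the units of y."""
    return math.sqrt(mse(true, pred))


print("MAE: ", mae(y_true, y_pred))
print("MSE: ", mse(y_true, y_pred))
print("RMSE:", round(rmse(y_true, y_pred), 2))
```

Because RMSE takes the square root of MSE, it is expressed in the same units as the response, which is why it is easier to put into context than MSE. Squaring also penalizes large errors more heavily than MAE does.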