title
Gradient Boost Part 2 (of 4): Regression Details
description
Gradient Boost is one of the most popular Machine Learning algorithms in use. And get this, it's not that complicated! This video is the second part in a series that walks through it one step at a time. This video focuses on the original Gradient Boost algorithm used to predict a continuous value, like someone's weight. We call this, "using Gradient Boost for Regression". In part 3, we'll walk through how Gradient Boost classifies samples into two different categories, and in part 4, we'll go through the math again, this time focusing on classification.
This StatQuest assumes that you have already watched Part 1:
https://youtu.be/3CC4N4z3GJc
...it also assumes that you know about Regression Trees:
https://youtu.be/g9c66TUylZ4
...and, while it is not required, it might be useful if you understood Gradient Descent: https://youtu.be/sDv4f4s2SB8
For a complete index of all the StatQuest videos, check out:
https://statquest.org/video-index/
This StatQuest is based on the following sources:
A 1999 manuscript by Jerome Friedman that introduced Stochastic Gradient Boost: https://jerryfriedman.su.domains/ftp/stobst.pdf
The Wikipedia article on Gradient Boosting: https://en.wikipedia.org/wiki/Gradient_boosting
NOTE: The key to understanding how the Wikipedia article relates to this video is to keep reading past the "pseudo algorithm" section. The very next section in the article, called "Gradient Tree Boosting", shows how the algorithm works for trees (which is pretty much the only weak learner people ever use for gradient boost, which is why I focus on it in the video). In that section, you see how the equation is modified so that each leaf from a tree can have a different output value, rather than the entire "weak learner" having a single output value, and this is the exact same equation that I use in the video.
Later in the article, in the section called "Shrinkage", they show how the learning rate can be included. Since this is also pretty much always used with gradient boost, I simply included it in the base algorithm that I describe.
The scikit-learn implementation of Gradient Boosting: https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting
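As a rough guide to how these sources line up with the video, the per-leaf output value and the update with a learning rate can be sketched as below. This is written from the standard formulation rather than copied from any of the sources, so treat the exact notation as an assumption: L is the loss, F_{m-1} is the prediction before tree m, R_{jm} is leaf j of tree m, J_m is the number of leaves, \gamma_{jm} is the leaf's output value and \nu is the learning rate.

```latex
% Loss used for regression in the video
L\bigl(y_i, F(x_i)\bigr) = \tfrac{1}{2}\,\bigl(y_i - F(x_i)\bigr)^2

% Output value for leaf R_{jm} of tree m
\gamma_{jm} = \underset{\gamma}{\arg\min} \sum_{x_i \in R_{jm}} L\bigl(y_i,\; F_{m-1}(x_i) + \gamma\bigr)

% Update the prediction, scaled by the learning rate \nu
F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{jm}\, \mathbf{1}\bigl(x \in R_{jm}\bigr)
```

With this loss, each \gamma_{jm} works out to the average of the residuals in its leaf, which is why the video can compute the leaf outputs by simple averaging.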
If you'd like to support StatQuest, please consider...
Buying The StatQuest Illustrated Guide to Machine Learning!!!
PDF - https://statquest.gumroad.com/l/wvtmc
Paperback - https://www.amazon.com/dp/B09ZCKR4H6
Kindle eBook - https://www.amazon.com/dp/B09ZG79HXC
Patreon: https://www.patreon.com/statquest
...or...
YouTube Membership: https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw/join
...a cool StatQuest t-shirt or sweatshirt:
https://shop.spreadshirt.com/statquest-with-josh-starmer/
...buying one or two of my songs (or go large and get a whole album!)
https://joshuastarmer.bandcamp.com/
...or just donating to StatQuest!
https://www.paypal.me/statquest
Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
https://twitter.com/joshuastarmer
0:00 Awesome song and introduction
0:00 Step 0: The data and the loss function
6:30 Step 1: Initialize the model with a constant value
9:10 Step 2: Build M trees
10:01 Step 2.A: Calculate residuals
12:47 Step 2.B: Fit a regression tree to the residuals
14:50 Step 2.C: Optimize leaf output values
20:38 Step 2.D: Update predictions with the new tree
23:19 Step 2: Summary of step 2
24:59 Step 3: Output the final prediction
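The steps listed above can be sketched in a few lines of Python for one round (M = 1). This is a minimal sketch, not the code behind the video: the three weights are back-computed from the numbers quoted in the video (73.3, 14.7, -17.3 and 8.7), the leaf membership is copied from the video rather than fitted, and the learning rate of 0.1 is an assumed value.

```python
# Observed weights, back-computed from the values quoted in the video
# (mean 73.3, residuals 14.7 and -17.3, leaf average 8.7).
observed = [88.0, 76.0, 56.0]

# Step 1: with the loss 1/2 * (observed - predicted)^2, the constant that
# minimizes the summed loss is the average of the observed values.
f0 = sum(observed) / len(observed)

# Step 2.A: the negative gradient of that loss is the ordinary residual.
residuals = [y - f0 for y in observed]

# Step 2.B: leaf membership copied from the video, not fitted here.
# Sample 3 (index 2) lands in R_1,1; samples 1 and 2 land in R_2,1.
leaves = {"R_1,1": [2], "R_2,1": [0, 1]}

# Step 2.C: for this loss, each leaf's output value is the average of
# the residuals that ended up in it.
gammas = {
    leaf: sum(residuals[i] for i in members) / len(members)
    for leaf, members in leaves.items()
}

# Step 2.D: move each prediction by learning_rate * its leaf's output.
learning_rate = 0.1  # assumed value for illustration
f1 = [f0] * len(observed)
for leaf, members in leaves.items():
    for i in members:
        f1[i] = f0 + learning_rate * gammas[leaf]

# Step 3: with M = 1, f1 is the final prediction for each sample.
print(round(f0, 1), [round(r, 1) for r in residuals])
print({leaf: round(g, 1) for leaf, g in gammas.items()})
print([round(p, 1) for p in f1])
```

With more trees, Steps 2.A through 2.D repeat, each time computing residuals from the latest predictions instead of from f0.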
Corrections:
4:27 The sum on the left hand side should be in parentheses to make it clear that the entire sum is multiplied by 1/2, not just the first term.
15:47 It should be R_jm, not R_ij.
16:18 The leaf in the script is R_1,2 and it should be R_2,1.
21:08 With regression trees, the sample will only go to a single leaf, and this summation simply isolates the one output value of interest from all of the others. However, when I first made this video, I was thinking that, because Gradient Boost is supposed to work with any "weak learner", not just small regression trees, this summation was a way to add flexibility to the algorithm.
24:15 The header for the residual column should be r_i,2.
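To make the 21:08 correction concrete, here is a small Python sketch of the update with the summation written out as an indicator over leaves. The function name `update` and the learning rate of 0.1 are hypothetical choices for illustration; the leaf outputs and starting prediction are the rounded values quoted in the video.

```python
def update(prev_prediction, sample_leaf, leaf_outputs, learning_rate):
    """F_m(x) = F_{m-1}(x) + nu * sum over leaves of gamma_jm * I(x in R_jm)."""
    total = sum(
        gamma * (1 if leaf == sample_leaf else 0)  # indicator I(x in R_jm)
        for leaf, gamma in leaf_outputs.items()
    )
    return prev_prediction + learning_rate * total


# Rounded leaf outputs quoted in the video for the first tree.
leaf_outputs = {"R_1,1": -17.3, "R_2,1": 8.7}

# A sample that lands in R_2,1; learning_rate=0.1 is an assumed value.
new_prediction = update(73.3, "R_2,1", leaf_outputs, learning_rate=0.1)
print(round(new_prediction, 2))
```

Because a sample in a regression tree sits in exactly one leaf, every term in the sum except one is multiplied by zero, so the summation just picks out that leaf's output value.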
#statquest #gradientboost
detail
{'title': 'Gradient Boost Part 2 (of 4): Regression Details', 'heatmap': [{'end': 452.876, 'start': 418.37, 'weight': 0.783}, {'end': 497.863, 'start': 476.951, 'weight': 0.799}, {'end': 667.216, 'start': 587.14, 'weight': 0.848}, {'end': 793.347, 'start': 721.864, 'weight': 0.717}, {'end': 837.489, 'start': 814.588, 'weight': 0.704}, {'end': 919.74, 'start': 882.501, 'weight': 0.828}, {'end': 1285.686, 'start': 1267.9, 'weight': 0.846}, {'end': 1428.895, 'start': 1387.823, 'weight': 0.704}], 'summary': 'Delves into the algorithmic details of using gradient boost for regression, discussing fitting a model, evaluating predictive accuracy, creating predicted values, residuals, and derivatives, and explaining the use of regression tree in the algorithm. it demonstrates the process of making improved predictions using gradient boosting and emphasizes the use of stumps for trees, with an enhancement of 1.7 units for the third sample and the output values for r2,1 being 8.7 and for r1,1 being -17.3.', 'chapters': [{'end': 60.676, 'segs': [{'end': 60.676, 'src': 'embed', 'start': 13.335, 'weight': 0, 'content': [{'end': 16.877, 'text': "Hello! 
I'm Josh Starmer and welcome to StatQuest.", 'start': 13.335, 'duration': 3.542}, {'end': 20.399, 'text': "Today we're going to continue our series on Gradient Boost.", 'start': 17.437, 'duration': 2.962}, {'end': 26.642, 'text': "Specifically, we're going to dive into the algorithmic details of how Gradient Boost is used for regression.", 'start': 20.879, 'duration': 5.763}, {'end': 36.308, 'text': 'Note, this stat quest assumes you have already watched the first video in this series, Gradient Boost Part 1, Regression Main Ideas.', 'start': 27.941, 'duration': 8.367}, {'end': 38.57, 'text': 'If not, check out the quest.', 'start': 36.909, 'duration': 1.661}, {'end': 45.416, 'text': 'Also, although not required, it might be helpful if you understand gradient descent.', 'start': 40.092, 'duration': 5.324}, {'end': 48.158, 'text': "So check out that quest if you haven't already.", 'start': 46.057, 'duration': 2.101}, {'end': 55.585, 'text': 'In this stat quest, we are going to walk through the original gradient boost algorithm step by step.', 'start': 49.319, 'duration': 6.266}, {'end': 60.676, 'text': 'In order to keep the example from getting out of hand,', 'start': 57.614, 'duration': 3.062}], 'summary': 'Josh starmer explains the algorithmic details of using gradient boost for regression in this statquest episode.', 'duration': 47.341, 'max_score': 13.335, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/2xudPOBz-vs/pics/2xudPOBz-vs13335.jpg'}], 'start': 13.335, 'title': 'Gradient boost for regression', 'summary': 'Delves into the algorithmic details of using gradient boost for regression, assuming prior understanding of the first video in the series and potential familiarity with gradient descent.', 'chapters': [{'end': 60.676, 'start': 13.335, 'title': 'Gradient boost: regression algorithm', 'summary': 'Explores the algorithmic details of using gradient boost for regression, assuming prior understanding of the first video in the 
series and potential familiarity with gradient descent.', 'duration': 47.341, 'highlights': ['The chapter focuses on the algorithmic details of using Gradient Boost for regression, assuming prior understanding of the first video in the series and potential familiarity with gradient descent.', 'Josh Starmer presents the continuation of the series on Gradient Boost, specifically delving into the algorithmic details of using Gradient Boost for regression.', 'Prior understanding of the first video in the series on Gradient Boost and potential familiarity with gradient descent is assumed.', "It is recommended to watch the first video in the series, 'Gradient Boost Part 1, Regression Main Ideas', before proceeding with this chapter."]}], 'duration': 47.341, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/2xudPOBz-vs/pics/2xudPOBz-vs13335.jpg', 'highlights': ['Josh Starmer presents the continuation of the series on Gradient Boost, specifically delving into the algorithmic details of using Gradient Boost for regression.', 'The chapter focuses on the algorithmic details of using Gradient Boost for regression, assuming prior understanding of the first video in the series and potential familiarity with gradient descent.', 'Prior understanding of the first video in the series on Gradient Boost and potential familiarity with gradient descent is assumed.', "It is recommended to watch the first video in the series, 'Gradient Boost Part 1, Regression Main Ideas', before proceeding with this chapter."]}, {'end': 508.91, 'segs': [{'end': 86.111, 'src': 'embed', 'start': 60.676, 'weight': 0, 'content': [{'end': 71.442, 'text': 'we are going to use Gradient Boost to fit a model to a super simple training data set which contains height measurements from three people their favorite color,', 'start': 60.676, 'duration': 10.766}, {'end': 73.744, 'text': 'their gender and their weight.', 'start': 71.442, 'duration': 2.302}, {'end': 83.529, 'text': "Great Now 
that we know all about this super simple training data set, let's walk through the original gradient descent algorithm step by step.", 'start': 74.904, 'duration': 8.625}, {'end': 86.111, 'text': "We'll start from the top.", 'start': 84.83, 'duration': 1.281}], 'summary': 'Using gradient boost to fit model to simple training data set.', 'duration': 25.435, 'max_score': 60.676, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/2xudPOBz-vs/pics/2xudPOBz-vs60676.jpg'}, {'end': 238.497, 'src': 'embed', 'start': 212.008, 'weight': 1, 'content': [{'end': 217.87, 'text': 'we can evaluate how well this greenish line fits the data with the sum of the squared residuals.', 'start': 212.008, 'duration': 5.862}, {'end': 222.892, 'text': 'Thus, the loss function is just a squared residual.', 'start': 219.17, 'duration': 3.722}, {'end': 233.455, 'text': 'If we wanted to compare how well this greenish line fit the data to this pink line, then we would calculate the residuals,', 'start': 224.672, 'duration': 8.783}, {'end': 238.497, 'text': 'the difference between the observed and the predicted values for the greenish line.', 'start': 233.455, 'duration': 5.042}], 'summary': 'Evaluating the fit of a greenish line using sum of squared residuals and comparing it to a pink line.', 'duration': 26.489, 'max_score': 212.008, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/2xudPOBz-vs/pics/2xudPOBz-vs212008.jpg'}, {'end': 393.528, 'src': 'embed', 'start': 306.209, 'weight': 2, 'content': [{'end': 325.869, 'text': 'The reason why people choose this loss function for gradient boost is that when we differentiate it with respect to predicted and use the chain rule and bring the square down to the front and multiply it by the derivative of minus predicted,', 'start': 306.209, 'duration': 19.66}, {'end': 336.797, 'text': 'which is negative one, then the two divided by two cancels out and that leaves you with the observed minus, the 
predicted multiplied by negative one.', 'start': 325.869, 'duration': 10.928}, {'end': 346.805, 'text': 'In other words, we are left with the negative residual, and that makes the math easier since gradient boost uses the derivative a lot.', 'start': 338.318, 'duration': 8.487}, {'end': 350.528, 'text': "Okay, we've got a loss function.", 'start': 348.406, 'duration': 2.122}, {'end': 354.377, 'text': "The y sub i's are the observed values.", 'start': 351.616, 'duration': 2.761}, {'end': 359.458, 'text': 'And f of x is a function that gives us the predicted values.', 'start': 355.837, 'duration': 3.621}, {'end': 363.659, 'text': "Note, we'll talk more about f of x later.", 'start': 360.678, 'duration': 2.981}, {'end': 370.66, 'text': 'We also know that the loss function is differentiable since we have already taken the derivative.', 'start': 365.059, 'duration': 5.601}, {'end': 376.501, 'text': 'Bam! We figured out what the input is for the gradient boost algorithm.', 'start': 371.92, 'duration': 4.581}, {'end': 381.501, 'text': "We've got data, And we've got a differentiable loss function.", 'start': 377.462, 'duration': 4.039}, {'end': 388.785, 'text': 'There are other loss functions to choose from, but this is the most popular one for regression.', 'start': 383.482, 'duration': 5.303}, {'end': 393.528, 'text': "Hooray! Now we're ready for step one.", 'start': 390.266, 'duration': 3.262}], 'summary': 'Popular loss function for gradient boost: observed minus predicted multiplied by negative one, makes math easier. 
ready for step one.', 'duration': 87.319, 'max_score': 306.209, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/2xudPOBz-vs/pics/2xudPOBz-vs306209.jpg'}, {'end': 452.876, 'src': 'heatmap', 'start': 418.37, 'weight': 0.783, 'content': [{'end': 423.132, 'text': 'And that funky looking symbol, called gamma, refers to the predicted values.', 'start': 418.37, 'duration': 4.762}, {'end': 429.274, 'text': 'The summation means that we add up one loss function for each observed value.', 'start': 424.412, 'duration': 4.862}, {'end': 437.565, 'text': 'and the argmin over gamma means we need to find a predicted value that minimizes this sum.', 'start': 431.161, 'duration': 6.404}, {'end': 443.81, 'text': 'In other words, if we plot the observed weights on a number line,', 'start': 438.746, 'duration': 5.064}, {'end': 452.876, 'text': 'then we want to find the point on the line that minimizes the sum of the squared residuals divided by 2..', 'start': 443.81, 'duration': 9.066}], 'summary': 'Gamma symbol predicts values and argmin minimizes sum of squared residuals.', 'duration': 34.506, 'max_score': 418.37, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/2xudPOBz-vs/pics/2xudPOBz-vs418370.jpg'}, {'end': 508.91, 'src': 'heatmap', 'start': 476.951, 'weight': 0.799, 'content': [{'end': 482.294, 'text': "And since we've already shown how to take the derivative of our loss function, we can just plug it in.", 'start': 476.951, 'duration': 5.343}, {'end': 487.397, 'text': 'Then we set the sum of the derivatives equal to zero.', 'start': 484.255, 'duration': 3.142}, {'end': 489.898, 'text': 'and solve.', 'start': 488.977, 'duration': 0.921}, {'end': 497.863, 'text': 'And we end up with the average of the observed weights.', 'start': 494.861, 'duration': 3.002}, {'end': 508.91, 'text': 'So, given this loss function, the value for gamma that minimizes this sum is the average of the observed weights.', 'start': 499.304, 'duration': 
9.606}], 'summary': 'Derive loss function, solve for gamma. optimal gamma = average observed weights.', 'duration': 31.959, 'max_score': 476.951, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/2xudPOBz-vs/pics/2xudPOBz-vs476951.jpg'}], 'start': 60.676, 'title': 'Gradient boosting and loss functions', 'summary': 'Discusses fitting a model using gradient boost on a training data set, evaluating model predictive accuracy with a loss function, and explaining the importance of the loss function in gradient boost, with a demonstration using a training data set of height measurements from three people and their weight. it also covers the differentiation process, input requirements for the algorithm, and the initialization of the model with a constant value.', 'chapters': [{'end': 286.556, 'start': 60.676, 'title': 'Gradient boost model and loss function', 'summary': "Explains the process of fitting a model using gradient boost on a training data set of height measurements from three people and their weight, and discusses the use of a loss function to evaluate the model's predictive accuracy, demonstrating the calculation of sum of squared residuals.", 'duration': 225.88, 'highlights': ['The chapter explains the process of fitting a model using Gradient Boost on a training data set of height measurements from three people and their weight. The training data set contains height measurements from three people, their favorite color, gender, and weight.', "Discusses the use of a loss function to evaluate the model's predictive accuracy, demonstrating the calculation of sum of squared residuals. 
A loss function is used to evaluate how well the model can predict weight, with the sum of squared residuals being calculated to compare the fit of different models."]}, {'end': 508.91, 'start': 286.556, 'title': 'Gradient boosting with loss functions', 'summary': 'Explains the importance of the loss function in gradient boost, the differentiation process, the input requirements for the algorithm, and the initialization of the model with a constant value.', 'duration': 222.354, 'highlights': ['The loss function for gradient boost is chosen to simplify differentiation and is differentiable, making it the most popular choice for regression (e.g., observed minus predicted multiplied by negative one).', 'The input requirements for the gradient boost algorithm include having data and a differentiable loss function, with the most popular choice being the one discussed in the transcript.', 'The initialization of the model in gradient boost involves determining a constant value based on the loss function, observed values, and predicted values, with the goal of minimizing the sum of squared residuals.']}], 'duration': 448.234, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/2xudPOBz-vs/pics/2xudPOBz-vs60676.jpg', 'highlights': ['The training data set contains height measurements from three people, their favorite color, gender, and weight.', 'A loss function is used to evaluate how well the model can predict weight, with the sum of squared residuals being calculated to compare the fit of different models.', 'The loss function for gradient boost is chosen to simplify differentiation and is differentiable, making it the most popular choice for regression (e.g., observed minus predicted multiplied by negative one).', 'The input requirements for the gradient boost algorithm include having data and a differentiable loss function, with the most popular choice being the one discussed in the transcript.', 'The initialization of the model in gradient 
boost involves determining a constant value based on the loss function, observed values, and predicted values, with the goal of minimizing the sum of squared residuals.']}, {'end': 770.931, 'segs': [{'end': 550.64, 'src': 'embed', 'start': 510.251, 'weight': 0, 'content': [{'end': 519.758, 'text': 'We have now created the initial predicted value, f sub 0 of x, and it equals 73.3.', 'start': 510.251, 'duration': 9.507}, {'end': 525.5, 'text': 'That means that the initial predicted value, f sub zero of x, is just a leaf.', 'start': 519.758, 'duration': 5.742}, {'end': 532.322, 'text': 'The leaf predicts that all samples will weigh 73.3.', 'start': 526.66, 'duration': 5.662}, {'end': 535.143, 'text': 'Bam! We finished step one.', 'start': 532.322, 'duration': 2.821}, {'end': 543.786, 'text': 'We initialize the model with a constant value, f sub zero of x equals 73.3.', 'start': 536.144, 'duration': 7.642}, {'end': 550.64, 'text': 'In other words, we created a leaf that predicts all samples will weigh 73.3.', 'start': 543.786, 'duration': 6.854}], 'summary': 'Initial predicted value f(0) of x is 73.3, predicting all samples will weigh 73.3.', 'duration': 40.389, 'max_score': 510.251, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/2xudPOBz-vs/pics/2xudPOBz-vs510251.jpg'}, {'end': 667.216, 'src': 'heatmap', 'start': 581.697, 'weight': 1, 'content': [{'end': 585.759, 'text': "When little m equals big M, then we're talking about the last tree.", 'start': 581.697, 'duration': 4.062}, {'end': 595.605, 'text': "And when little m is somewhere between 1 and big M, then we're talking about a tree somewhere between 1 and big M.", 'start': 587.14, 'duration': 8.465}, {'end': 602.079, 'text': 'Since we are just starting step 2, we will set little m equal to 1.', 'start': 595.605, 'duration': 6.474}, {'end': 605.462, 'text': "Part A of Step 2 looks nasty, but it's not.", 'start': 602.079, 'duration': 3.383}, {'end': 612.507, 'text': 'This part is just 
the derivative of the loss function with respect to the predicted value.', 'start': 606.642, 'duration': 5.865}, {'end': 615.769, 'text': "And we've already calculated this.", 'start': 614.108, 'duration': 1.661}, {'end': 623.035, 'text': 'This big minus sign tells us to multiply the derivative by negative 1.', 'start': 617.41, 'duration': 5.625}, {'end': 626.998, 'text': 'And that leaves us with the observed value minus the predicted value.', 'start': 623.035, 'duration': 3.963}, {'end': 633.039, 'text': 'In other words, this nasty-looking thing is just a residual.', 'start': 628.299, 'duration': 4.74}, {'end': 639.005, 'text': 'Now we plug f sub m minus 1 of x in for predicted.', 'start': 634.821, 'duration': 4.184}, {'end': 648.694, 'text': 'And since m equals 1, that means we plug in f sub 0 of x for f sub m minus 1 of x.', 'start': 639.025, 'duration': 9.669}, {'end': 657.586, 'text': 'And since f sub 0 of x is just the leaf set to 73.3, we plug in 73.3.', 'start': 648.694, 'duration': 8.892}, {'end': 667.216, 'text': "Now we can compute r sub i comma m, where r is short for residual, i is the sample number, and m is the tree that we're trying to build.", 'start': 657.586, 'duration': 9.63}], 'summary': 'In step 2, when m=1, the residual is computed using f_0(x)=73.3.', 'duration': 66.997, 'max_score': 581.697, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/2xudPOBz-vs/pics/2xudPOBz-vs581697.jpg'}, {'end': 770.931, 'src': 'embed', 'start': 721.864, 'weight': 3, 'content': [{'end': 728.227, 'text': 'Before we move on, I just want to point out that this derivative is the gradient that Gradient Boost is named after.', 'start': 721.864, 'duration': 6.363}, {'end': 738.197, 'text': 'Little bam! 
I also want to point out that the r sub i comma m values are technically called pseudo-residuals.', 'start': 729.768, 'duration': 8.429}, {'end': 744.5, 'text': 'When we use this loss function, we end up calculating normal residuals.', 'start': 739.758, 'duration': 4.742}, {'end': 755.425, 'text': "But if we used another loss function this time we're not multiplying by one half then we would end up with something similar to a residual,", 'start': 745.981, 'duration': 9.444}, {'end': 756.245, 'text': 'but not quite.', 'start': 755.425, 'duration': 0.82}, {'end': 760.906, 'text': "In other words, we'd end up with a pseudo-residual.", 'start': 757.904, 'duration': 3.002}, {'end': 766.229, 'text': "And that's why the r sub i comma m's are called pseudo-residuals.", 'start': 762.086, 'duration': 4.143}, {'end': 770.931, 'text': "Okay, now let's do part b.", 'start': 767.65, 'duration': 3.281}], 'summary': 'Derivative is the gradient that gradient boost is named after. r sub i comma m values are pseudo-residuals used in loss function calculations.', 'duration': 49.067, 'max_score': 721.864, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/2xudPOBz-vs/pics/2xudPOBz-vs721864.jpg'}], 'start': 510.251, 'title': 'Creating predicted value, residuals, and derivatives in gradient boosting', 'summary': 'Covers creating an initial predicted value of 73.3, making 100 trees, calculating residuals, understanding the derivative in gradient boosting, and the significance of pseudo-residuals in different loss functions.', 'chapters': [{'end': 612.507, 'start': 510.251, 'title': 'Creating initial predicted value and making trees', 'summary': 'Explains the process of creating the initial predicted value, f sub 0 of x, as 73.3, and the iterative process of making m trees, typically set to 100, with part a of step 2 focusing on the derivative of the loss function with respect to the predicted value.', 'duration': 102.256, 'highlights': ['The initial predicted 
value, f sub 0 of x, is 73.3, predicting all samples to weigh 73.3.', 'In the iterative process of making trees, typically set to 100, part A of Step 2 involves the derivative of the loss function with respect to the predicted value.']}, {'end': 719.503, 'start': 614.108, 'title': 'Calculating residuals in data analysis', 'summary': 'Explains how to calculate residuals in data analysis, specifically in the context of building a tree model, where the observed value minus the predicted value is computed for each sample in the dataset.', 'duration': 105.395, 'highlights': ['The big minus sign indicates to multiply the derivative by negative 1, resulting in the observed value minus the predicted value.', 'The calculated residual for the first sample and the first tree is 14.7.', 'The process involves computing residuals for all three samples in the data set, resulting in a residual for each sample.']}, {'end': 770.931, 'start': 721.864, 'title': 'Gradient boosting derivative and pseudo-residuals', 'summary': 'Explains the significance of the derivative in gradient boosting and the concept of pseudo-residuals, highlighting their role in calculating normal residuals and the impact of different loss functions on the nature of residuals.', 'duration': 49.067, 'highlights': ['The derivative in Gradient Boosting is significant, as it is the gradient that Gradient Boost is named after.', 'The r sub i comma m values are technically called pseudo-residuals, which are used in calculating normal residuals.', 'The choice of loss function impacts the nature of residuals, with different loss functions resulting in pseudo-residuals that are not quite similar to normal residuals.']}], 'duration': 260.68, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/2xudPOBz-vs/pics/2xudPOBz-vs510251.jpg', 'highlights': ['The initial predicted value, f sub 0 of x, is 73.3, predicting all samples to weigh 73.3.', 'In the iterative process of making trees, typically set to 
100, part A of Step 2 involves the derivative of the loss function with respect to the predicted value.', 'The big minus sign indicates to multiply the derivative by negative 1, resulting in the observed value minus the predicted value.', 'The derivative in Gradient Boosting is significant, as it is the gradient that Gradient Boost is named after.', 'The r sub i comma m values are technically called pseudo-residuals, which are used in calculating normal residuals.', 'The choice of loss function impacts the nature of residuals, with different loss functions resulting in pseudo-residuals that are not quite similar to normal residuals.']}, {'end': 1305.475, 'segs': [{'end': 837.489, 'src': 'heatmap', 'start': 770.931, 'weight': 0, 'content': [{'end': 777.595, 'text': 'All this is saying is that we will build a regression tree to predict the residuals instead of the weights.', 'start': 770.931, 'duration': 6.664}, {'end': 784.299, 'text': 'So we will use height, favorite color, and gender to predict the residuals.', 'start': 778.896, 'duration': 5.403}, {'end': 786.621, 'text': "Here's the new tree.", 'start': 785.54, 'duration': 1.081}, {'end': 793.347, 'text': 'Yes, I know this is just a stump, and Gradient Boost almost always uses larger trees.', 'start': 787.602, 'duration': 5.745}, {'end': 802.094, 'text': 'However, in order to demonstrate the details of the Gradient Boost algorithm, we need at least one leaf with more than one sample in it.', 'start': 794.107, 'duration': 7.987}, {'end': 807.158, 'text': "And when you only have three samples, then you can't have more than two leaves.", 'start': 803.035, 'duration': 4.123}, {'end': 813.003, 'text': "So we're stuck using stumps, even though they are not typically used with Gradient Boost.", 'start': 808.139, 'duration': 4.864}, {'end': 819.591, 'text': 'The residual for the third sample, X sub 3, goes to the leaf on the left.', 'start': 814.588, 'duration': 5.003}, {'end': 825.994, 'text': 'And the residual for 
samples X sub 1 and X sub 2 go to the leaf on the right.', 'start': 820.832, 'duration': 5.162}, {'end': 830.357, 'text': 'So we have a regression tree fit to the residuals.', 'start': 827.535, 'duration': 2.822}, {'end': 837.489, 'text': 'Now we need to create terminal regions r sub j comma m.', 'start': 831.966, 'duration': 5.523}], 'summary': 'Using regression tree to predict residuals with height, color, and gender. demonstrating gradient boost algorithm with stump due to limited samples.', 'duration': 48.66, 'max_score': 770.931, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/2xudPOBz-vs/pics/2xudPOBz-vs770931.jpg'}, {'end': 919.74, 'src': 'heatmap', 'start': 882.501, 'weight': 0.828, 'content': [{'end': 889.324, 'text': 'Hooray! We finished Part B of Step 2 by fitting a regression tree to the residuals and labeling the leaves.', 'start': 882.501, 'duration': 6.823}, {'end': 893.846, 'text': "Now let's do Part C.", 'start': 890.985, 'duration': 2.861}, {'end': 897.347, 'text': 'In this part, we determine the output values for each leaf.', 'start': 893.846, 'duration': 3.501}, {'end': 905.011, 'text': "Specifically, since two residuals ended up in this leaf, it's unclear what its output value should be.", 'start': 898.768, 'duration': 6.243}, {'end': 914.097, 'text': 'So for each leaf in the new tree, we compute an output value, gamma sub j comma m.', 'start': 906.411, 'duration': 7.686}, {'end': 919.74, 'text': 'The output value for each leaf is the value for gamma that minimizes this summation.', 'start': 914.097, 'duration': 5.643}], 'summary': 'Completed part b, fitting regression tree to residuals. 
moving on to part c to determine output values for each leaf.', 'duration': 37.239, 'max_score': 882.501, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/2xudPOBz-vs/pics/2xudPOBz-vs882501.jpg'}, {'end': 1241.965, 'src': 'embed', 'start': 1176.837, 'weight': 2, 'content': [{'end': 1180.379, 'text': 'So the value for gamma that minimizes this equation is 8.7.', 'start': 1176.837, 'duration': 3.542}, {'end': 1182.701, 'text': 'And that means gamma sub 2,1 equals 8.7.', 'start': 1180.379, 'duration': 2.322}, {'end': 1195.544, 'text': 'And ultimately, the leaf R2,1 has an output value of 8.7.', 'start': 1182.701, 'duration': 12.843}, {'end': 1198.347, 'text': 'One last little observation before we move on.', 'start': 1195.544, 'duration': 2.803}, {'end': 1207.355, 'text': 'We just saw that the output value for this leaf, r sub 2 comma 1, is the average of the residuals that ended up here.', 'start': 1199.828, 'duration': 7.527}, {'end': 1215.542, 'text': 'Given our choice of loss function, the output values are always the average of the residuals that end up in the same leaf.', 'start': 1208.996, 'duration': 6.546}, {'end': 1228.14, 'text': 'Even if only one residual ends up in a leaf, the output value is still the average since negative 17.3 divided by one equals negative 17.3.', 'start': 1217.196, 'duration': 10.944}, {'end': 1237.004, 'text': 'Hooray, we finished part C of step two by computing gamma values or output values for each leaf.', 'start': 1228.14, 'duration': 8.864}, {'end': 1241.965, 'text': "Now let's do part D.", 'start': 1238.864, 'duration': 3.101}], 'summary': 'Gamma value that minimizes equation is 8.7, resulting in average output value of 8.7 for leaf r2,1.', 'duration': 65.128, 'max_score': 1176.837, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/2xudPOBz-vs/pics/2xudPOBz-vs1176837.jpg'}, {'end': 1305.475, 'src': 'heatmap', 'start': 1267.9, 'weight': 4, 'content': [{'end': 1272.983, 
'text': 'Note, this summation is there just in case a single sample ends up in multiple leaves.', 'start': 1267.9, 'duration': 5.083}, {'end': 1285.686, 'text': "The summation says we should add up the output values, gamma sub j comma m's, for all the leaves, r sub j comma m, that a sample, x, can be found in.", 'start': 1274.482, 'duration': 11.204}, {'end': 1291.658, 'text': 'The last thing in this equation is this Greek character, ν.', 'start': 1287.487, 'duration': 4.171}, {'end': 1297.972, 'text': 'ν is the learning rate and is a value between 1 and 0.', 'start': 1291.658, 'duration': 6.314}, {'end': 1305.475, 'text': 'A small learning rate reduces the effect each tree has on the final prediction, and this improves accuracy in the long run.', 'start': 1297.972, 'duration': 7.503}], 'summary': 'Summation of output values for all leaves determines final prediction, with learning rate ν (0 < ν < 1) improving accuracy.', 'duration': 30.993, 'max_score': 1267.9, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/2xudPOBz-vs/pics/2xudPOBz-vs1267900.jpg'}], 'start': 770.931, 'title': 'Gradient boost algorithm details and computing output values', 'summary': 'Explains the use of regression tree in gradient boost algorithm with predictors like height, favorite color, and gender, and demonstrates creation of terminal regions with a stump tree due to small sample size. 
it covers the process of computing output values for each leaf in a regression tree, involving minimizing a summation and solving for gamma values, with the output values for r2,1 being 8.7 and for r1,1 being -17.3.', 'chapters': [{'end': 880.36, 'start': 770.931, 'title': 'Gradient boost algorithm details', 'summary': 'Explains how a regression tree is used to predict residuals in the gradient boost algorithm, using height, favorite color, and gender as predictors, and demonstrates the creation of terminal regions with a stump tree due to having only three samples.', 'duration': 109.429, 'highlights': ['The chapter explains how a regression tree is used to predict residuals in the Gradient Boost algorithm, using height, favorite color, and gender as predictors.', 'The creation of terminal regions is demonstrated with a stump tree due to having only three samples.', 'The stump tree is used to demonstrate the details of the Gradient Boost algorithm, despite it not typically using stumps.', 'The terminal regions r sub j comma m are created using the leaves of the tree, with m=1 and j sub m=2 in this example.']}, {'end': 1305.475, 'start': 882.501, 'title': 'Step 2 part c: computing output values', 'summary': 'Covers the process of computing output values for each leaf in a regression tree, involving minimizing a summation and solving for gamma values, with the output value for r2,1 being 8.7 and for r1,1 being -17.3.', 'duration': 422.974, 'highlights': ['The value for gamma that minimizes the equation for leaf R2,1 is 8.7, and for leaf R1,1 is -17.3.', 'The output values are always the average of the residuals that end up in the same leaf, even if only one residual ends up in a leaf.', 'The learning rate, ν, is a value between 1 and 0, with a small learning rate improving accuracy in the long run.']}], 'duration': 534.544, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/2xudPOBz-vs/pics/2xudPOBz-vs770931.jpg', 'highlights': ['The chapter 
explains how a regression tree is used to predict residuals in the Gradient Boost algorithm, using height, favorite color, and gender as predictors.', 'The creation of terminal regions is demonstrated with a stump tree due to having only three samples.', 'The value for gamma that minimizes the equation for leaf R2,1 is 8.7, and for leaf R1,1 is -17.3.', 'The output values are always the average of the residuals that end up in the same leaf, even if only one residual ends up in a leaf.', 'The learning rate, ν, is a value between 1 and 0, with a small learning rate improving accuracy in the long run.']}, {'end': 1598.293, 'segs': [{'end': 1428.895, 'src': 'heatmap', 'start': 1334.533, 'weight': 0, 'content': [{'end': 1337.094, 'text': "which is 8.7, because x sub 1's height is greater than 1.55..", 'start': 1334.533, 'duration': 2.561}, {'end': 1338.614, 'text': 'Now just do the math.', 'start': 1337.094, 'duration': 1.52}, {'end': 1359.561, 'text': 'The new prediction for the first sample is 74.2, which is slightly closer to the observed weight, 88, than the first prediction, 73.3.', 'start': 1348.131, 'duration': 11.43}, {'end': 1370.575, 'text': "Bam! Now let's make a new prediction for the second sample, X sub 2.", 'start': 1359.561, 'duration': 11.014}, {'end': 1380.24, 'text': 'The new prediction for X sub 2 is also 74.2, which is an improvement over the first prediction, 73.3.', 'start': 1370.575, 'duration': 9.665}, {'end': 1383.701, 'text': "Now let's make a new prediction for the third sample, X sub 3.", 'start': 1380.24, 'duration': 3.461}, {'end': 1387.823, 'text': 'The new prediction is 71.6, which is an improvement over 73.3.', 'start': 1383.701, 'duration': 4.122}, {'end': 1404.744, 'text': 'Double bam! Hooray! 
We made it through one iteration of Step 2.', 'start': 1387.823, 'duration': 16.921}, {'end': 1408.246, 'text': 'We started by setting m to 1.', 'start': 1404.744, 'duration': 3.502}, {'end': 1418.61, 'text': 'Then we solved for the negative gradient, plugged in the observed values, plugged in the latest predictions, and that gave us residuals.', 'start': 1408.246, 'duration': 10.364}, {'end': 1428.895, 'text': 'Then we fit a regression tree to the residuals and computed the output values gamma sub j comma m for each leaf.', 'start': 1420.271, 'duration': 8.624}], 'summary': 'Improved predictions for samples 1, 2, and 3 after one iteration of step 2, with new predictions closer to observed weights.', 'duration': 84.077, 'max_score': 1334.533, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/2xudPOBz-vs/pics/2xudPOBz-vs1334533.jpg'}, {'end': 1580.195, 'src': 'embed', 'start': 1486.84, 'weight': 3, 'content': [{'end': 1495.322, 'text': "Now, in the interest of time, let's assume M equals 2 so that we are done with Step 2.", 'start': 1486.84, 'duration': 8.482}, {'end': 1498.623, 'text': 'In practice, M equals 100 or more.', 'start': 1495.322, 'duration': 3.301}, {'end': 1503.825, 'text': "Now we are ready for Gradient Boost's third and final step.", 'start': 1500.284, 'duration': 3.541}, {'end': 1511.427, 'text': 'If M equals 2, then f sub 2 of x is the output from the Gradient Boost algorithm.', 'start': 1505.065, 'duration': 6.362}, {'end': 1524.277, 'text': "Holy smokes! We made it through this whole thing! That's crazy! 
Now if we receive some new data, we could use f sub 2 of x to predict the weight.", 'start': 1512.546, 'duration': 11.731}, {'end': 1544.798, 'text': 'The predicted weight equals 73.3 plus 0.1 times negative 17.3 plus 0.1 times negative 15.6, which equals 70.', 'start': 1525.738, 'duration': 19.06}, {'end': 1548.76, 'text': 'Gradient Boost predicts this person weighs 70 kilograms.', 'start': 1544.798, 'duration': 3.962}, {'end': 1559.385, 'text': 'Triple bam! Before we go, I want to remind you that Gradient Boost usually uses trees larger than stumps.', 'start': 1550.141, 'duration': 9.244}, {'end': 1565.829, 'text': 'I only used stumps in this tutorial because our training dataset was so darn small.', 'start': 1560.866, 'duration': 4.963}, {'end': 1571.583, 'text': 'Also, be sure to watch Part 3 of this exciting series on Gradient Boost.', 'start': 1567.176, 'duration': 4.407}, {'end': 1574.487, 'text': "Next time, we'll talk about classification.", 'start': 1572.183, 'duration': 2.304}, {'end': 1580.195, 'text': "Hooray! We've made it to the end of another exciting StatQuest.", 'start': 1575.811, 'duration': 4.384}], 'summary': 'Gradient boost predicts a weight of 70 kilograms using m=2 and new data, usually m=100 or more.', 'duration': 93.355, 'max_score': 1486.84, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/2xudPOBz-vs/pics/2xudPOBz-vs1486840.jpg'}], 'start': 1307.056, 'title': 'Gradient boosting algorithm', 'summary': 'Delves into the process of making improved predictions using gradient boosting with the learning rate ν set to 0.1, resulting in an enhancement of 1.7 units for the third sample from the initial prediction of 73.3 to the new prediction of 71.6. It also explains the three steps of the algorithm with m equals 2 and demonstrates predicting a weight of 70 kilograms using f sub 2 of x. 
Additionally, it hints at upcoming content on classification and notes that Gradient Boost usually uses trees larger than stumps.', 'chapters': [{'end': 1418.61, 'start': 1307.056, 'title': 'Gradient boosting prediction iteration', 'summary': 'Discusses the process of making new predictions using gradient boosting, where the learning rate ν is set to 0.1, resulting in improved predictions for each sample, with the third sample showing the most improvement of 1.7 units from the initial prediction of 73.3 to the new prediction of 71.6.', 'duration': 111.554, 'highlights': ['The new prediction for the third sample is 71.6, which is an improvement over the initial prediction of 73.3 by 1.7 units.', 'The new prediction for the second sample is 74.2, an improvement over the initial prediction of 73.3.', 'The new prediction for the first sample is 74.2, slightly closer to the observed weight of 88, compared to the initial prediction of 73.3.']}, {'end': 1598.293, 'start': 1420.271, 'title': 'Gradient boost: three steps and predicting weight', 'summary': 'The chapter explains the three steps of gradient boost algorithm, where m equals 2, and demonstrates predicting a weight of 70 kilograms using f sub 2 of x. 
Additionally, it notes that stumps were used only because the training dataset is so small and hints at upcoming content on classification.', 'duration': 178.022, 'highlights': ['The predicted weight using Gradient Boost is 70 kilograms, calculated using the output from the algorithm and specific values.', 'The algorithm usually uses trees larger than stumps, but the tutorial utilized stumps due to the small training dataset.', 'The chapter discusses the three steps of the Gradient Boost algorithm, where M equals 2, and demonstrates predicting a weight of 70 kilograms using f sub 2 of x.', 'The chapter hints at upcoming content on classification and encourages viewers to subscribe and support StatQuest through various means.']}], 'duration': 291.237, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/2xudPOBz-vs/pics/2xudPOBz-vs1307056.jpg', 'highlights': ['The new prediction for the third sample is 71.6, an improvement over the initial prediction of 73.3 by 1.7 units.', 'The new prediction for the second sample is 74.2, an improvement over the initial prediction of 73.3.', 'The new prediction for the first sample is 74.2, slightly closer to the observed weight of 88, compared to the initial prediction of 73.3.', 'The predicted weight using Gradient Boost is 70 kilograms, calculated using the output from the algorithm and specific values.', 'The algorithm usually uses trees larger than stumps, but the tutorial utilized stumps due to the small training dataset.', 'The chapter discusses the three steps of the Gradient Boost algorithm, where M equals 2, and demonstrates predicting a weight of 70 kilograms using f sub 2 of x.']}], 'highlights': ['The new prediction for the third sample is 71.6, an improvement over the initial prediction of 73.3 by 1.7 units.', 'The value for gamma that minimizes the equation for leaf R2,1 is 8.7, and for leaf R1,1 is -17.3.', 'The training data set contains height measurements from three people, their favorite color, gender, and weight.', 'The 
loss function used for regression with gradient boost is differentiable and is chosen so that its derivative is simple (negative one times observed minus predicted), which makes it the most popular choice for regression.', 'The initial predicted value, f sub 0 of x, is 73.3, predicting all samples to weigh 73.3.', 'The choice of loss function impacts the nature of residuals, with other loss functions resulting in pseudo-residuals that are similar, but not identical, to ordinary residuals.', 'The chapter explains how a regression tree is used to predict residuals in the Gradient Boost algorithm, using height, favorite color, and gender as predictors.', 'The learning rate, ν, is a value between 1 and 0, with a small learning rate improving accuracy in the long run.', 'The big minus sign indicates to multiply the derivative by negative 1, resulting in the observed value minus the predicted value.', 'The output values are always the average of the residuals that end up in the same leaf, even if only one residual ends up in a leaf.']}
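
The arithmetic quoted in the transcript above can be re-traced with a short Python sketch. It is only an illustration under stated assumptions, not the video's own code: the observed weights 76 and 56 are inferred from the quoted residuals and initial prediction (only 88 is stated directly), the leaf names and the `predict_after_first_tree` helper are hypothetical, and the second tree's leaf output (-15.6) is simply copied from the transcript rather than recomputed.

```python
# Observed weights; 88 is quoted in the transcript, 76 and 56 are inferred
# from the quoted residuals and the initial prediction of 73.3 (assumption).
observed = [88.0, 76.0, 56.0]
learning_rate = 0.1  # the value of nu used in the walkthrough

# Step 1: the initial prediction f_0(x) is the average observed weight (about 73.3).
f0 = sum(observed) / len(observed)

# Step 2, part A: with this loss function, residuals are observed minus predicted.
residuals = [y - f0 for y in observed]

# Step 2, parts B and C: the stump puts samples 1 and 2 in one leaf and
# sample 3 in the other; each leaf's output is the mean of its residuals.
leaves = {"R_2_1": [0, 1], "R_1_1": [2]}  # hypothetical names, sample indices
gamma = {
    name: sum(residuals[i] for i in members) / len(members)
    for name, members in leaves.items()
}

def predict_after_first_tree(sample_index):
    """Step 2, part D: previous prediction plus nu times the leaf output."""
    for name, members in leaves.items():
        if sample_index in members:
            return f0 + learning_rate * gamma[name]
    raise ValueError("sample does not fall in any leaf")

f1 = [predict_after_first_tree(i) for i in range(len(observed))]

# Step 3 with M = 2: add the scaled leaf outputs of both trees for a new
# sample that lands in the -17.3 leaf of tree 1 and a -15.6 leaf of tree 2
# (the second value is taken from the transcript, not derived here).
gamma_tree2_leaf = -15.6
new_prediction = f0 + learning_rate * gamma["R_1_1"] + learning_rate * gamma_tree2_leaf

print({name: round(value, 1) for name, value in gamma.items()})
print([round(p, 1) for p in f1])
print(round(new_prediction))
```

Rounded to one decimal place, the leaf outputs and updated predictions from this sketch should agree with the figures quoted in the transcript (8.7 and -17.3; 74.2, 74.2 and 71.6), and the final prediction rounds to 70.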