title

Gradient Boost Part 4 (of 4): Classification Details

description

At last, part 4 in our series of videos on Gradient Boost. This time we dive deep into the details of how it is used for classification, going through the algorithm, and the math behind it, one step at a time. Specifically, we derive the loss function from the log(likelihood) of the data, and we derive the functions used to calculate the output values from the leaves in each tree. This one is long, but well worth it if you want to know how Gradient Boost works.
NOTE: There is a minor error at 7:01. It should just say log(p) - log(1-p) = log(p/(1-p)). And at 19:10 I forgot to put "L" in front of some of the loss functions. However, it should be clear what they are, since I point to them and say, "This is the loss function."
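The two derivations mentioned above (the loss function from the log(likelihood), and its equivalent form in terms of the predicted log(odds)) can be sketched numerically. This is just an illustration with my own function names, not code from the video:

```python
import math

def loss_from_probability(y, p):
    # Negative log(likelihood) for one sample: -[y*log(p) + (1-y)*log(1-p)]
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def loss_from_log_odds(y, log_odds):
    # The same loss rewritten as a function of the predicted log(odds):
    # -y*log(odds) + log(1 + exp(log(odds)))
    return -y * log_odds + math.log(1 + math.exp(log_odds))

p = 0.67  # the example predicted probability used in the video
log_odds = math.log(p / (1 - p))

# Both forms of the loss agree for y = 1 and y = 0
print(abs(loss_from_probability(1, p) - loss_from_log_odds(1, log_odds)) < 1e-12)
print(abs(loss_from_probability(0, p) - loss_from_log_odds(0, log_odds)) < 1e-12)
```

Writing the loss in terms of the log(odds) is what makes the derivatives in the video come out so cleanly.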
This StatQuest assumes that you have already watched Parts 1, 2 and 3 in this series:
Part 1, Regression Main Ideas: https://youtu.be/3CC4N4z3GJc
Part 2, Regression Details: https://youtu.be/2xudPOBz-vs
Part 3, Classification Main Ideas: https://youtu.be/jxuNLH5dXCs
...and it also assumes that you understand odds, the log(odds), and Logistic Regression pretty well. Here are the links for...
The odds: https://youtu.be/ARfXDSkQf1Y
A general overview of Logistic Regression: https://youtu.be/yIYKR4sgzI8
how to interpret the coefficients: https://youtu.be/vN5cNN2-HWE
and how to estimate the coefficients: https://youtu.be/BfKanl1aSG0
Lastly, if you want to learn more about using different probability thresholds for classification, check out the StatQuest on ROC and AUC: https://youtu.be/xugjARegisk
For a complete index of all the StatQuest videos, check out:
https://statquest.org/video-index/
This StatQuest is based on the following sources:
A 1999 manuscript by Jerome Friedman that introduced Stochastic Gradient Boost: https://statweb.stanford.edu/~jhf/ftp/stobst.pdf
The Wikipedia article on Gradient Boosting: https://en.wikipedia.org/wiki/Gradient_boosting
The scikit-learn implementation of Gradient Boosting: https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting
If you'd like to support StatQuest, please consider...
Buying The StatQuest Illustrated Guide to Machine Learning!!!
PDF - https://statquest.gumroad.com/l/wvtmc
Paperback - https://www.amazon.com/dp/B09ZCKR4H6
Kindle eBook - https://www.amazon.com/dp/B09ZG79HXC
Patreon: https://www.patreon.com/statquest
...or...
YouTube Membership: https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw/join
...a cool StatQuest t-shirt or sweatshirt:
https://shop.spreadshirt.com/statquest-with-josh-starmer/
...buying one or two of my songs (or go large and get a whole album!)
https://joshuastarmer.bandcamp.com/
...or just donating to StatQuest!
https://www.paypal.me/statquest
Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
https://twitter.com/joshuastarmer
Corrections:
6:58 log(p) - log(1-p) is not equal to log(p)/log(1-p), but is equal to log(p/(1-p)). In other words, the result at 7:07, log(p) - log(1-p) = log(odds), is correct, and thus the error does not propagate beyond its short but embarrassing moment.
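The corrected identity is easy to check numerically (the variable names here are just for illustration):

```python
import math

p = 0.67  # any probability strictly between 0 and 1 works
odds = p / (1 - p)

# log(p) - log(1-p) equals log(p/(1-p)) = log(odds)...
lhs = math.log(p) - math.log(1 - p)
rhs = math.log(odds)
print(abs(lhs - rhs) < 1e-12)

# ...while log(p)/log(1-p) is a different quantity entirely
wrong = math.log(p) / math.log(1 - p)
print(abs(wrong - rhs) > 0.1)
```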
26:53 My indexing of the variables gets off track. This is unfortunate, but you should still be able to follow the concepts.
#statquest #gradientboost
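For reference, the classification algorithm the video walks through (initial log(odds) prediction, pseudo residuals of observed minus predicted probability, leaf outputs of sum(residuals) / sum(p(1-p)), and a learning-rate-scaled update) can be sketched end to end. This is a simplified illustration under my own assumptions — single-boolean-feature "stumps" stand in for real trees, and all names are mine — not the video's or scikit-learn's implementation:

```python
import math

def sigmoid(log_odds):
    # Convert a log(odds) value back into a probability
    return 1 / (1 + math.exp(-log_odds))

def fit_gradient_boost(X, y, n_trees=10, learning_rate=0.8):
    # Step 1: the initial prediction is the log(odds) of the training labels
    p = sum(y) / len(y)
    f0 = math.log(p / (1 - p))
    log_odds = [f0] * len(y)
    stumps = []
    for _ in range(n_trees):
        probs = [sigmoid(lo) for lo in log_odds]
        # Step 2a: pseudo residuals = observed - predicted probability
        residuals = [yi - pi for yi, pi in zip(y, probs)]
        # Steps 2b-2c: one leaf per value of the boolean feature; each
        # leaf's output value is sum(residuals) / sum(p * (1 - p))
        gammas = {}
        for value in (False, True):
            idx = [i for i, xi in enumerate(X) if xi == value]
            num = sum(residuals[i] for i in idx)
            den = sum(probs[i] * (1 - probs[i]) for i in idx)
            gammas[value] = num / den if den > 0 else 0.0
        stumps.append(gammas)
        # Step 2d: update the predicted log(odds), scaled by the learning rate
        log_odds = [lo + learning_rate * gammas[xi]
                    for lo, xi in zip(log_odds, X)]
    return f0, stumps

def predict_probability(x, f0, stumps, learning_rate=0.8):
    log_odds = f0 + learning_rate * sum(g[x] for g in stumps)
    return sigmoid(log_odds)

# Tiny three-person example in the spirit of the video:
# feature = likes popcorn, label = loves Troll 2
X = [True, False, True]
y = [1, 1, 0]
f0, stumps = fit_gradient_boost(X, y)
print(round(sigmoid(f0), 2))  # initial predicted probability: 2/3 -> 0.67
```

As in the video, the initial prediction is just the overall fraction of positive labels expressed as a log(odds), and each round nudges every sample's log(odds) toward its observed value.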

detail

{'title': 'Gradient Boost Part 4 (of 4): Classification Details', 'heatmap': [{'end': 231.061, 'start': 176.896, 'weight': 1}, {'end': 383.923, 'start': 328.877, 'weight': 0.843}, {'end': 760.526, 'start': 685.098, 'weight': 0.763}, {'end': 915.728, 'start': 884.218, 'weight': 0.736}, {'end': 1117.297, 'start': 1059.072, 'weight': 0.783}], 'summary': 'Explores gradient boost for classification, discussing the importance of log odds and log likelihood in logistic regression, demonstrating the gradient boost algorithm for classification with a small training set, and explaining the optimization process for gamma in the loss function using taylor polynomial approximation and derivatives.', 'chapters': [{'end': 78.443, 'segs': [{'end': 78.443, 'src': 'embed', 'start': 33.286, 'weight': 0, 'content': [{'end': 41.009, 'text': "Also, it's important that you have a pretty good understanding of the roles that the log odds and the log likelihood play in logistic regression.", 'start': 33.286, 'duration': 7.723}, {'end': 44.07, 'text': "So if you haven't already, check out these quests.", 'start': 41.649, 'duration': 2.421}, {'end': 52.032, 'text': 'In this stat quest, we will walk through the original gradient boost algorithm for classification step by step.', 'start': 45.47, 'duration': 6.562}, {'end': 58.534, 'text': 'Just like in Part 2 of this series, we will use an incredibly small training set for the examples.', 'start': 53.311, 'duration': 5.223}, {'end': 65.997, 'text': "The small size will help us focus on the algorithm's details, but it will mean using stumps instead of trees.", 'start': 59.834, 'duration': 6.163}, {'end': 74.781, 'text': 'However, by now you know that in practice, Gradient Boost usually uses trees with between 8 and 32 leaves.', 'start': 67.198, 'duration': 7.583}, {'end': 78.443, 'text': "Now we'll describe our training dataset.", 'start': 76.502, 'duration': 1.941}], 'summary': 'Stat quest walks through gradient boost algorithm for 
classification with small training set.', 'duration': 45.157, 'max_score': 33.286, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/StWY5QWMXCw/pics/StWY5QWMXCw33286.jpg'}], 'start': 11.499, 'title': 'Gradient boost for classification', 'summary': 'Discusses the application of gradient boost for classification, emphasizing the importance of understanding log odds and log likelihood in logistic regression, and provides a step-by-step walkthrough of the original gradient boost algorithm for classification using an incredibly small training set.', 'chapters': [{'end': 78.443, 'start': 11.499, 'title': 'Gradient boost for classification', 'summary': 'Discusses the application of gradient boost for classification, emphasizing the importance of understanding log odds and log likelihood in logistic regression, and provides a step-by-step walkthrough of the original gradient boost algorithm for classification using an incredibly small training set.', 'duration': 66.944, 'highlights': ['The chapter emphasizes the importance of understanding log odds and log likelihood in logistic regression for effectively using Gradient Boost for classification.', "The chapter provides a step-by-step walkthrough of the original gradient boost algorithm for classification using an incredibly small training set to focus on algorithm's details.", 'In practice, Gradient Boost usually uses trees with between 8 and 32 leaves for classification.']}], 'duration': 66.944, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/StWY5QWMXCw/pics/StWY5QWMXCw11499.jpg', 'highlights': ['The chapter emphasizes the importance of understanding log odds and log likelihood in logistic regression for effectively using Gradient Boost for classification.', 'In practice, Gradient Boost usually uses trees with between 8 and 32 leaves for classification.', "The chapter provides a step-by-step walkthrough of the original gradient boost algorithm for classification 
using an incredibly small training set to focus on algorithm's details."]}, {'end': 322.197, 'segs': [{'end': 132.795, 'src': 'embed', 'start': 106.38, 'weight': 1, 'content': [{'end': 114.304, 'text': 'Just to remind you, X sub i refers to a row of measurements that we will use to predict if someone loves Troll 2.', 'start': 106.38, 'duration': 7.924}, {'end': 120.168, 'text': 'And Y sub i refers to whether or not someone loves Troll 2.', 'start': 114.304, 'duration': 5.864}, {'end': 124.25, 'text': 'Now we need a differentiable loss function that will work for classification.', 'start': 120.168, 'duration': 4.082}, {'end': 132.795, 'text': 'I think the easiest way to understand the most commonly used loss function for classification is to show how it works on a graph.', 'start': 125.491, 'duration': 7.304}], 'summary': 'Using x sub i measurements to predict love for troll 2 with a differentiable loss function for classification.', 'duration': 26.415, 'max_score': 106.38, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/StWY5QWMXCw/pics/StWY5QWMXCw106380.jpg'}, {'end': 231.061, 'src': 'heatmap', 'start': 163.447, 'weight': 0, 'content': [{'end': 170.752, 'text': 'And we can draw a dotted line to represent the predicted probability that someone loves Troll 2.', 'start': 163.447, 'duration': 7.305}, {'end': 176.896, 'text': "In this example, I've set the predicted probability to 0.67.", 'start': 170.752, 'duration': 6.144}, {'end': 184.741, 'text': 'Now, just like we do for logistic regression, we can calculate the log likelihood of the data given the predicted probability.', 'start': 176.896, 'duration': 7.845}, {'end': 190.385, 'text': 'The log likelihood of the observed data given the prediction is.', 'start': 186.302, 'duration': 4.083}, {'end': 193.437, 'text': 'this nasty-looking summation.', 'start': 191.616, 'duration': 1.821}, {'end': 200.523, 'text': "The p's refer to the predicted probability, which is 0.67 in this example.", 
'start': 194.678, 'duration': 5.845}, {'end': 207.288, 'text': "And the y sub i's refer to the observed values for Love's Troll 2.", 'start': 201.624, 'duration': 5.664}, {'end': 216.535, 'text': 'For the two people who love Troll 2, y sub i equals 1, which means that this term will be 0, leaving just the log of p.', 'start': 207.288, 'duration': 9.247}, {'end': 231.061, 'text': 'In contrast, for the one person who does not love Troll 2, y sub i equals 0, which means that this term will be 0, leaving just the log of 1 minus p.', 'start': 218.095, 'duration': 12.966}], 'summary': 'Calculating log likelihood for predicting love of troll 2 with 0.67 probability.', 'duration': 29.99, 'max_score': 163.447, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/StWY5QWMXCw/pics/StWY5QWMXCw163447.jpg'}, {'end': 322.197, 'src': 'embed', 'start': 248.84, 'weight': 3, 'content': [{'end': 253.602, 'text': 'Then we plug in 0.67 for the predicted probability, p.', 'start': 248.84, 'duration': 4.762}, {'end': 258.724, 'text': '1 minus 1 equals 0.', 'start': 255.443, 'duration': 3.281}, {'end': 260.565, 'text': 'And now we do the multiplication.', 'start': 258.724, 'duration': 1.841}, {'end': 268.889, 'text': 'The log likelihood for the first person, given the predicted probability, is the log of 0.67.', 'start': 261.946, 'duration': 6.943}, {'end': 273.711, 'text': "Now let's calculate the log likelihood for the second person.", 'start': 268.889, 'duration': 4.822}, {'end': 284.303, 'text': 'We plug in the observed value, 1, for y sub 2, and plug in 0.67 for the predicted probability, p.', 'start': 275.08, 'duration': 9.223}, {'end': 289.284, 'text': '1 minus 1 equals 0.', 'start': 285.743, 'duration': 3.541}, {'end': 290.945, 'text': 'And now we do the multiplication.', 'start': 289.284, 'duration': 1.661}, {'end': 298.347, 'text': 'And we get the log of 0.67, since the predicted probability was the same.', 'start': 293.065, 'duration': 5.282}, {'end': 
302.888, 'text': "Now let's calculate the log likelihood for the third person.", 'start': 299.847, 'duration': 3.041}, {'end': 310.454, 'text': 'We plug in the observed value, 0, for y sub 3, since this person does not love the movie.', 'start': 304.312, 'duration': 6.142}, {'end': 315.575, 'text': 'Plug in 0.67 for the predicted probability, p.', 'start': 311.454, 'duration': 4.121}, {'end': 320.377, 'text': '1 minus 0 equals 1.', 'start': 315.595, 'duration': 4.782}, {'end': 322.197, 'text': 'And now we do the multiplication.', 'start': 320.377, 'duration': 1.82}], 'summary': 'Calculating log likelihoods for predicted probabilities of 0.67.', 'duration': 73.357, 'max_score': 248.84, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/StWY5QWMXCw/pics/StWY5QWMXCw248840.jpg'}], 'start': 79.584, 'title': 'Loss functions and log likelihood', 'summary': "Explains the usage of a differentiable loss function for classification, demonstrated through a graph with probability values and log likelihood calculations based on a training dataset of three people's preferences for troll 2. 
additionally, it covers the process of calculating log likelihood for predicted probabilities of 0.67 on three individuals, resulting in log likelihood values of 0, log(0.67), and log(0.67).", 'chapters': [{'end': 248.84, 'start': 79.584, 'title': 'Classification loss function explained', 'summary': "Explains the process of using a differentiable loss function for classification, demonstrated through a graph with probability values and log likelihood calculations, based on a training dataset of three people's preferences for troll 2.", 'duration': 169.256, 'highlights': ['The log likelihood of the observed data given the prediction is calculated using a summation of terms, with the predicted probability and observed values for loving Troll 2, yielding insights into the classification process.', 'The process of using a differentiable loss function for classification is demonstrated through a graph with the probability of loving Troll 2 on the y-axis, representing observed values and predicted probability, aiding in the understanding of the classification model.', 'Explanation of X sub i and Y sub i as representing measurements and whether someone loves Troll 2, providing essential context for prediction and classification training dataset.']}, {'end': 322.197, 'start': 248.84, 'title': 'Calculating log likelihood', 'summary': 'Explains the process of calculating log likelihood for predicted probabilities of 0.67 on three individuals, resulting in log likelihood values of 0, log(0.67), and log(0.67).', 'duration': 73.357, 'highlights': ['The log likelihood for the third person is calculated by plugging in the observed value of 0 for y sub 3 and the predicted probability of 0.67, resulting in a log likelihood of 0.', 'The log likelihood for the second person is calculated by plugging in the observed value of 1 for y sub 2 and the predicted probability of 0.67, resulting in a log likelihood of log(0.67).', 'The log likelihood for the first person is calculated by 
plugging in the predicted probability of 0.67, resulting in a log likelihood of log(0.67).']}], 'duration': 242.613, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/StWY5QWMXCw/pics/StWY5QWMXCw79584.jpg', 'highlights': ['The log likelihood of the observed data given the prediction is calculated using a summation of terms, aiding in the understanding of the classification process.', 'The process of using a differentiable loss function for classification is demonstrated through a graph with the probability of loving Troll 2 on the y-axis, aiding in the understanding of the classification model.', 'Explanation of X sub i and Y sub i as representing measurements and whether someone loves Troll 2, providing essential context for prediction and classification training dataset.', 'The log likelihood for the third person is calculated by plugging in the observed value of 0 for y sub 3 and the predicted probability of 0.67, resulting in a log likelihood of 0.', 'The log likelihood for the second person is calculated by plugging in the observed value of 1 for y sub 2 and the predicted probability of 0.67, resulting in a log likelihood of log(0.67).', 'The log likelihood for the first person is calculated by plugging in the predicted probability of 0.67, resulting in a log likelihood of log(0.67).']}, {'end': 1072.884, 'segs': [{'end': 383.923, 'src': 'heatmap', 'start': 323.618, 'weight': 0, 'content': [{'end': 328.877, 'text': 'And we get the log of 1 minus 0.67.', 'start': 323.618, 'duration': 5.259}, {'end': 336.443, 'text': 'Note the better the prediction, the larger the log likelihood, and this is why, when doing logistic regression,', 'start': 328.877, 'duration': 7.566}, {'end': 339.405, 'text': 'the goal is to maximize the log likelihood.', 'start': 336.443, 'duration': 2.962}, {'end': 348.552, 'text': 'That means that if we want to use the log likelihood as a loss function where smaller values represent better fitting models,', 'start': 
340.686, 'duration': 7.866}, {'end': 351.995, 'text': 'then we need to multiply the log likelihood by negative one.', 'start': 348.552, 'duration': 3.443}, {'end': 358, 'text': "So we'll put this subtle, but very important minus sign in front of everything.", 'start': 353.737, 'duration': 4.263}, {'end': 365.434, 'text': 'And since a loss function sometimes only deals with one sample at a time, we can get rid of the summation.', 'start': 359.471, 'duration': 5.963}, {'end': 371.137, 'text': "And to make it easier to read, we'll replace y with observed.", 'start': 366.815, 'duration': 4.322}, {'end': 377.1, 'text': 'Now we need to transform this equation, the negative log likelihood,', 'start': 372.538, 'duration': 4.562}, {'end': 383.923, 'text': 'so that it is a function of the predicted log odds instead of the predicted probability p.', 'start': 377.1, 'duration': 6.823}], 'summary': 'Maximize log likelihood in logistic regression by multiplying it with -1.', 'duration': 28.377, 'max_score': 323.618, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/StWY5QWMXCw/pics/StWY5QWMXCw323618.jpg'}, {'end': 594.29, 'src': 'embed', 'start': 562.493, 'weight': 2, 'content': [{'end': 567.294, 'text': "So let's take the derivative of the loss function with respect to the predicted log odds.", 'start': 562.493, 'duration': 4.801}, {'end': 573.594, 'text': 'The derivative of the first part with respect to the predicted log odds is super easy.', 'start': 568.851, 'duration': 4.743}, {'end': 577.437, 'text': "It's just the negative observed value.", 'start': 573.995, 'duration': 3.442}, {'end': 586.004, 'text': 'The derivative of the second part is also super easy if you know how to use the chain rule.', 'start': 578.938, 'duration': 7.066}, {'end': 594.29, 'text': 'The derivative of the log of something is one over that something times the derivative of that something.', 'start': 586.984, 'duration': 7.306}], 'summary': 'Derivative of loss function 
with respect to predicted log odds explained.', 'duration': 31.797, 'max_score': 562.493, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/StWY5QWMXCw/pics/StWY5QWMXCw562493.jpg'}, {'end': 760.526, 'src': 'heatmap', 'start': 685.098, 'weight': 0.763, 'content': [{'end': 689.62, 'text': 'And that funky symbol, called gamma, refers to a log odds value.', 'start': 685.098, 'duration': 4.522}, {'end': 694.743, 'text': 'In theory, we could go ahead and replace the log odds with gamma.', 'start': 691.041, 'duration': 3.702}, {'end': 702.408, 'text': "But it's actually easier to see what's going on if we leave the log odds in and remember that it represents gamma.", 'start': 695.784, 'duration': 6.624}, {'end': 708.552, 'text': 'The summation means that we add up one loss function for each observed value.', 'start': 704.029, 'duration': 4.523}, {'end': 716.376, 'text': 'And the argmin over gamma means we need to find a log odds value that minimizes this sum.', 'start': 710.091, 'duration': 6.285}, {'end': 723.12, 'text': 'The first thing we do is take the derivative of each term with respect to the log odds.', 'start': 718.177, 'duration': 4.943}, {'end': 727.804, 'text': 'Now, to make the next step super easy,', 'start': 724.721, 'duration': 3.083}, {'end': 738.011, 'text': "let's replace the log odds with the predicted probability p and set the sum of the derivatives equal to zero and solve.", 'start': 727.804, 'duration': 10.207}, {'end': 747.921, 'text': 'And we end up with 2 divided by 3 for the initial predicted probability p,', 'start': 742.298, 'duration': 5.623}, {'end': 754.583, 'text': 'because two people love Troll 2 and there are three people in the training data set.', 'start': 747.921, 'duration': 6.662}, {'end': 760.526, 'text': 'Now we can convert the predicted probability into the predicted log odds.', 'start': 756.284, 'duration': 4.242}], 'summary': 'Using calculus, we find initial predicted probability p = 2/3.', 
'duration': 75.428, 'max_score': 685.098, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/StWY5QWMXCw/pics/StWY5QWMXCw685098.jpg'}, {'end': 915.728, 'src': 'heatmap', 'start': 884.218, 'weight': 0.736, 'content': [{'end': 890.221, 'text': 'So we can think of the pseudo residuals as the observed probability minus the predicted probability.', 'start': 884.218, 'duration': 6.003}, {'end': 895.723, 'text': 'And the observed minus the predicted results in a pseudo residual.', 'start': 891.481, 'duration': 4.242}, {'end': 901.145, 'text': 'This part says to plug in the most recent predicted log odds.', 'start': 897.464, 'duration': 3.681}, {'end': 907.448, 'text': 'So we plug in f sub zero of x, the most recent predicted log odds.', 'start': 902.546, 'duration': 4.902}, {'end': 915.728, 'text': 'Then do the math to convert the predicted log odds into the predicted probability, p.', 'start': 908.803, 'duration': 6.925}], 'summary': 'Pseudo residuals are calculated using observed and predicted probabilities, then converted into predicted probabilities.', 'duration': 31.51, 'max_score': 884.218, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/StWY5QWMXCw/pics/StWY5QWMXCw884218.jpg'}, {'end': 978.163, 'src': 'embed', 'start': 946.382, 'weight': 3, 'content': [{'end': 948.543, 'text': "Now we'll calculate the other two residuals.", 'start': 946.382, 'duration': 2.161}, {'end': 957.045, 'text': "Hooray! 
We've finished Part A of Step 2 by calculating a residual for each sample.", 'start': 951.463, 'duration': 5.582}, {'end': 962.166, 'text': "Now we're ready for Part B, where we will build a regression tree.", 'start': 958.725, 'duration': 3.441}, {'end': 971.7, 'text': 'We will build a regression tree using likes popcorn, age, and favorite color to predict the residuals.', 'start': 963.677, 'duration': 8.023}, {'end': 974.181, 'text': "Here's the new tree.", 'start': 973.081, 'duration': 1.1}, {'end': 978.163, 'text': 'So we have a regression tree fit to the residuals.', 'start': 975.222, 'duration': 2.941}], 'summary': 'Completed part a by calculating residuals for each sample. moving to part b to build a regression tree using likes popcorn, age, and favorite color to predict the residuals.', 'duration': 31.781, 'max_score': 946.382, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/StWY5QWMXCw/pics/StWY5QWMXCw946382.jpg'}], 'start': 323.618, 'title': 'Logistic regression and loss function', 'summary': 'Discusses the importance of maximizing the log likelihood in logistic regression and explains the derivation of the negative log likelihood loss function, emphasizing its transformation into a function of the predicted log odds and the process of building a regression tree to predict residuals.', 'chapters': [{'end': 377.1, 'start': 323.618, 'title': 'Logistic regression and log likelihood', 'summary': 'Discusses the importance of maximizing the log likelihood in logistic regression, emphasizing the need to transform it into a loss function by multiplying it by negative one and simplifying it for single-sample calculations.', 'duration': 53.482, 'highlights': ['Maximizing the log likelihood is the goal in logistic regression, with better predictions leading to larger log likelihood.', 'The need to transform the log likelihood into a loss function by multiplying it by negative one for better fitting models.', 'Simplifying the equation for 
single-sample calculations by removing the summation and replacing y with observed.']}, {'end': 1072.884, 'start': 377.1, 'title': 'Logistic regression and loss function', 'summary': 'Explains the derivation of the negative log likelihood loss function and its transformation into a function of the predicted log odds, as well as the process of building a regression tree to predict residuals and calculating output values for the new tree.', 'duration': 695.784, 'highlights': ['The negative log likelihood of the data is converted into a function of the predicted log odds. The chapter emphasizes the conversion of the negative log likelihood of the data into a function of the predicted log odds, providing a clear understanding of the transformation process.', 'Derivative of the loss function with respect to the predicted log odds is derived and explained. The explanation of the derivative of the loss function with respect to the predicted log odds provides insight into the process of differentiability of the loss function.', 'Calculation of pseudo residuals and building a regression tree to predict residuals is demonstrated. 
The chapter illustrates the process of calculating pseudo residuals and building a regression tree to predict residuals, offering a practical understanding of the steps involved.']}], 'duration': 749.266, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/StWY5QWMXCw/pics/StWY5QWMXCw323618.jpg', 'highlights': ['Maximizing log likelihood is the goal in logistic regression for better predictions.', 'The need to transform log likelihood into a loss function by multiplying it by negative one.', 'Derivative of the loss function with respect to the predicted log odds is derived and explained.', 'Calculation of pseudo residuals and building a regression tree to predict residuals is demonstrated.']}, {'end': 1351.352, 'segs': [{'end': 1102.703, 'src': 'embed', 'start': 1074.226, 'weight': 1, 'content': [{'end': 1076.528, 'text': 'Remember, this is just the loss function.', 'start': 1074.226, 'duration': 2.302}, {'end': 1082.776, 'text': 'so we can replace the generic form with the actual loss function that we were using.', 'start': 1077.614, 'duration': 5.162}, {'end': 1092.199, 'text': "Note, to keep the length of the formula from getting out of hand, I'm returning to using y sub i to refer to the observed values.", 'start': 1084.156, 'duration': 8.043}, {'end': 1102.703, 'text': "Since only x sub 1 goes to r sub 1 comma 1, we can remove the big sigma and swap the i's with 1's.", 'start': 1093.54, 'duration': 9.163}], 'summary': 'Using observed values as y sub i, simplifying formula for x sub 1.', 'duration': 28.477, 'max_score': 1074.226, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/StWY5QWMXCw/pics/StWY5QWMXCw1074226.jpg'}, {'end': 1156.395, 'src': 'embed', 'start': 1129.703, 'weight': 2, 'content': [{'end': 1134.766, 'text': "And even though we want to minimize gamma, let's remember that we are working with the loss function.", 'start': 1129.703, 'duration': 5.063}, {'end': 1143.846, 'text': 'Since taking the 
derivative of the loss function with respect to gamma and then solving for gamma is hard,', 'start': 1136.28, 'duration': 7.566}, {'end': 1148.349, 'text': 'we can approximate the loss function with a second-order Taylor polynomial.', 'start': 1143.846, 'duration': 4.503}, {'end': 1156.395, 'text': 'Why this second-order Taylor polynomial is a good approximation is something we can talk about in a future StatQuest.', 'start': 1149.97, 'duration': 6.425}], 'summary': 'Approximate loss function with second-order taylor polynomial for solving gamma.', 'duration': 26.692, 'max_score': 1129.703, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/StWY5QWMXCw/pics/StWY5QWMXCw1129703.jpg'}, {'end': 1205.249, 'src': 'embed', 'start': 1178.513, 'weight': 0, 'content': [{'end': 1184.737, 'text': 'We can treat this, the derivative of the loss function with respect to the predicted log odds, as a constant.', 'start': 1178.513, 'duration': 6.224}, {'end': 1189.6, 'text': 'And the derivative of a constant times gamma is the constant.', 'start': 1185.978, 'duration': 3.622}, {'end': 1195.303, 'text': 'Similarly, the derivative of this with respect to gamma is super easy.', 'start': 1190.98, 'duration': 4.323}, {'end': 1205.249, 'text': 'All this stuff, 1 half times the second derivative of the loss function with respect to the predicted log odds, can be treated like a constant.', 'start': 1196.801, 'duration': 8.448}], 'summary': 'Derivative of loss function is treated as a constant for gamma.', 'duration': 26.736, 'max_score': 1178.513, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/StWY5QWMXCw/pics/StWY5QWMXCw1178513.jpg'}, {'end': 1302.317, 'src': 'embed', 'start': 1271.156, 'weight': 3, 'content': [{'end': 1274.717, 'text': 'the predicted probability p is just the residual.', 'start': 1271.156, 'duration': 3.561}, {'end': 1281.207, 'text': 'Now we need to take the second derivative of the loss function to figure out what 
goes in the denominator.', 'start': 1276.044, 'duration': 5.163}, {'end': 1289.751, 'text': 'The second derivative of the loss function equals the derivative of the first derivative of the loss function.', 'start': 1282.667, 'duration': 7.084}, {'end': 1293.913, 'text': 'So we can plug in the first derivative of the loss function.', 'start': 1290.751, 'duration': 3.162}, {'end': 1302.317, 'text': "Now, to make taking the derivative a little more obvious, let's rewrite this fraction as multiplication.", 'start': 1295.193, 'duration': 7.124}], 'summary': 'Derive the second derivative of the loss function for predicted probability p.', 'duration': 31.161, 'max_score': 1271.156, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/StWY5QWMXCw/pics/StWY5QWMXCw1271156.jpg'}, {'end': 1351.352, 'src': 'embed', 'start': 1323.111, 'weight': 6, 'content': [{'end': 1332.853, 'text': 'The product rule says that the derivative of A times B equals the derivative of A times B plus A times the derivative of B.', 'start': 1323.111, 'duration': 9.742}, {'end': 1340.384, 'text': 'So we start with the derivative of the first part by using the chain rule.', 'start': 1334.44, 'duration': 5.944}, {'end': 1343.086, 'text': 'And that gives us this derivative.', 'start': 1341.425, 'duration': 1.661}, {'end': 1346.128, 'text': 'Then we multiply by the second part.', 'start': 1344.227, 'duration': 1.901}, {'end': 1351.352, 'text': 'Then we add the first part times the derivative of the second part.', 'start': 1347.149, 'duration': 4.203}], 'summary': 'The product rule states: (d/dx)(a*b) = a*(d/dx)b + b*(d/dx)a.', 'duration': 28.241, 'max_score': 1323.111, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/StWY5QWMXCw/pics/StWY5QWMXCw1323111.jpg'}], 'start': 1074.226, 'title': 'Optimizing gamma and loss function derivatives', 'summary': 'Covers the process of optimizing gamma in the loss function using a second-order taylor polynomial, 
simplifying derivative calculation, and solving for gamma in the context of loss function derivatives, highlighting the importance of managing formula length and illustrating derivative calculation using the product rule.', 'chapters': [{'end': 1177.433, 'start': 1074.226, 'title': 'Optimizing gamma in loss function', 'summary': 'Discusses the process of optimizing the value for gamma by using a second-order taylor polynomial to approximate the loss function and simplify the derivative calculation, while emphasizing the importance of managing the length of the formula.', 'duration': 103.207, 'highlights': ["The chapter emphasizes the need to manage the length of the formula to avoid complexity, highlighting the approach of using y sub i to refer to the observed values and simplifying the equation by removing the big sigma and swapping the i's with 1's", 'It discusses the use of a second-order Taylor polynomial to approximate the loss function, making it easier to calculate the derivative with respect to gamma', 'The chapter mentions the difficulty in taking the derivative of the loss function with respect to gamma and suggests the use of a different approach to solve for the optimal value for gamma']}, {'end': 1293.913, 'start': 1178.513, 'title': 'Optimizing loss function derivatives', 'summary': 'Introduces the process of solving for gamma in the context of derivative of the loss function with respect to the predicted log odds, and simplifies the equation to obtain the solution for gamma, which is represented as -1 times the derivative of the loss function divided by the second derivative of the loss function.', 'duration': 115.4, 'highlights': ['The process of solving for gamma in the context of derivative of the loss function with respect to the predicted log odds is introduced, demonstrating the simplification of the equation to obtain the solution for gamma, represented as -1 times the derivative of the loss function divided by the second derivative of the 
loss function.', 'The residual, the observed value minus the predicted probability p, is identified as the numerator, and the need to take the second derivative of the loss function to determine the denominator is emphasized.', 'The derivative of the first derivative of the loss function is mentioned as essential in calculating the second derivative of the loss function.']}, {'end': 1351.352, 'start': 1295.193, 'title': 'Derivative calculation and product rule', 'summary': 'Discusses taking the derivative by rewriting a fraction as multiplication and applying the product rule to find the derivative of a function with respect to the log odds.', 'duration': 56.159, 'highlights': ['Applying the product rule to find the derivative of A times B, where the derivative of A times B equals the derivative of A, times B, plus A times the derivative of B, is explained.', 'The process of taking the derivative is demonstrated by using the chain rule and multiplying the derivatives of the individual parts by each other.']}], 'duration': 277.126, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/StWY5QWMXCw/pics/StWY5QWMXCw1074226.jpg', 'highlights': ['The process of solving for gamma in the context of the derivative of the loss function with respect to the predicted log odds is introduced, demonstrating the simplification of the equation to obtain the solution for gamma, represented as -1 times the derivative of the loss function divided by the second derivative of the loss function.', "The chapter emphasizes the need to manage the length of the formula to avoid complexity, highlighting the approach of using y sub i to refer to the observed values and simplifying the equation by removing the big sigma and swapping the i's with 1's", 'The chapter mentions the difficulty in taking the derivative of the loss function with respect to gamma and suggests the use of a different approach to solve for the optimal value for gamma', 'The residual, the observed value minus the predicted probability p, is identified as the numerator, and the need to take the second derivative of the loss function to determine the denominator is emphasized.', 'The derivative of the first derivative of the loss function is mentioned as essential in calculating the second derivative of the loss function.', 'It discusses the use of a second-order Taylor polynomial to approximate the loss function, making it easier to calculate the derivative with respect to gamma', 'Applying the product rule to find the derivative of A times B, where the derivative of A times B equals the derivative of A, times B, plus A times the derivative of B, is explained.', 'The process of taking the derivative is demonstrated by using the chain rule and multiplying the derivatives of the individual parts by each other.']}, {'end': 1558.307, 'segs': [{'end': 1471.952, 'src': 'embed', 'start': 1382.24, 'weight': 0, 'content': [{'end': 1387.385, 'text': 'Then we multiply the top and bottom of the second term by 1 plus e to the log odds.', 'start': 1382.24, 'duration': 5.145}, {'end': 1391.052, 'text': 'and now we can add these terms together.', 'start': 1389.011, 'duration': 2.041}, {'end': 1395.876, 'text': 'These two parts in the numerator cancel each other out.', 'start': 1392.874, 'duration': 3.002}, {'end': 1398.277, 'text': 'And that leaves us with this.', 'start': 1396.936, 'duration': 1.341}, {'end': 1405.122, 'text': 'Note, I also split the denominator into two terms so that the next steps make more sense.', 'start': 1399.658, 'duration': 5.464}, {'end': 1415.309, 'text': "Now we'll multiply the numerator by one, which seems silly, but is the key to reducing the whole thing into two super simple terms.", 'start': 1406.943, 'duration': 8.366}, {'end': 1425.398, 'text': 'By multiplying the numerator by 1, we can easily see how this single term separates into two terms multiplied together.', 'start': 1416.752, 'duration': 8.646}, {'end': 1430.062, 'text': 'At this point, you may recognize the first term.', 'start': 1427.26, 'duration': 2.802}, {'end': 1438.187, 
'text': 'It converts the predicted log odds to the predicted probability, p.', 'start': 1431.022, 'duration': 7.165}, {'end': 1443.011, 'text': 'The second term should also remind you of something we have seen earlier in this stat quest.', 'start': 1438.187, 'duration': 4.824}, {'end': 1449.724, 'text': 'Earlier, we saw that the log of 1 minus p equals this.', 'start': 1444.562, 'duration': 5.162}, {'end': 1464.309, 'text': 'And that means that 1 minus p equals this, which means the second term is just 1 minus the predicted probability, p.', 'start': 1450.884, 'duration': 13.425}, {'end': 1471.952, 'text': 'So at long last, we see that the second derivative of the loss function is equal to p times 1 minus p.', 'start': 1464.309, 'duration': 7.643}], 'summary': 'Deriving the second derivative of the loss function equals p times 1 minus p.', 'duration': 89.712, 'max_score': 1382.24, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/StWY5QWMXCw/pics/StWY5QWMXCw1382240.jpg'}, {'end': 1558.307, 'src': 'embed', 'start': 1531.348, 'weight': 3, 'content': [{'end': 1539.434, 'text': 'In this case, the predicted probability for this sample is derived from f sub zero of x, the most recent log odds prediction.', 'start': 1531.348, 'duration': 8.086}, {'end': 1549.601, 'text': 'Now we just do the math, and the output value for leaf r sub one comma one is 1.5.', 'start': 1540.955, 'duration': 8.646}, {'end': 1554.144, 'text': "Now let's calculate the output value for the other leaf, r sub two comma one.", 'start': 1549.601, 'duration': 4.543}, {'end': 1558.307, 'text': "That means we're calculating gamma sub two comma one.", 'start': 1555.345, 'duration': 2.962}], 'summary': "Using f sub zero of x, the output value for leaf r sub one comma one is 1.5, and for leaf r sub two comma one, we're calculating gamma sub two comma one.", 'duration': 26.959, 'max_score': 1531.348, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/StWY5QWMXCw/pics/StWY5QWMXCw1531348.jpg'}], 'start': 1352.453, 'title': 'Simplified derivation of loss function', 'summary': "Explains the simplification of the loss function's second derivative resulting in two simple terms and the conversion of predicted log odds to predicted probability. it also demonstrates the relationship between derivatives, probabilities, and output values for specific leaf nodes in a decision tree.", 'chapters': [{'end': 1438.187, 'start': 1352.453, 'title': 'Simplified derivation of loss function', 'summary': 'Explains the simplification of the second derivative of the loss function, resulting in two super simple terms, and demonstrates the conversion of predicted log odds to predicted probability.', 'duration': 85.734, 'highlights': ['The process involves rewriting the second derivative as a fraction and then multiplying the top and bottom of the second term by 1 plus e to the log odds.', 'The cancellation of two parts in the numerator results in a simplified expression, which is further reduced into two super simple terms by multiplying the numerator by 1, leading to the recognition of the first term as the conversion of predicted log odds to predicted probability.']}, {'end': 1558.307, 'start': 1438.187, 'title': 'Derivatives and probabilities in loss function', 'summary': "Explains the relationship between the second term, the loss function's second derivative, and the predicted probability, demonstrating the calculation of output values for specific leaf nodes in a decision tree.", 'duration': 120.12, 'highlights': ["The second derivative of the loss function is equal to p times 1 minus p, revealing the relationship between the second term, the loss function's second derivative, and the predicted probability.", 'The output value for leaf r sub one comma one is calculated as 1.5, demonstrating the application of the derived relationship to compute specific output values 
for leaf nodes.']}], 'duration': 205.854, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/StWY5QWMXCw/pics/StWY5QWMXCw1352453.jpg', 'highlights': ['The process involves rewriting the second derivative as a fraction and then multiplying the top and bottom of the second term by 1 plus e to the log odds.', 'The cancellation of two parts in the numerator results in a simplified expression, which is further reduced into two super simple terms by multiplying the numerator by 1, leading to the recognition of the first term as the conversion of predicted log odds to predicted probability.', "The second derivative of the loss function is equal to p times 1 minus p, revealing the relationship between the second term, the loss function's second derivative, and the predicted probability.", 'The output value for leaf r sub one comma one is calculated as 1.5, demonstrating the application of the derived relationship to compute specific output values for leaf nodes.']}, {'end': 1845.107, 'segs': [{'end': 1596.637, 'src': 'embed', 'start': 1559.645, 'weight': 0, 'content': [{'end': 1572.048, 'text': 'Since samples x2 and x3 go to leaf then we will need a loss function for x2 and a loss function for x3.', 'start': 1559.645, 'duration': 12.403}, {'end': 1579.91, 'text': 'Now, just like before, we can approximate the loss function with second-order Taylor polynomials.', 'start': 1573.708, 'duration': 6.202}, {'end': 1586.832, 'text': "Here's the second-order Taylor polynomial approximation of the loss function for sample x2.", 'start': 1581.41, 'duration': 5.422}, {'end': 1596.637, 'text': "And here's the second-order Taylor polynomial approximation of the loss function for x sub 3.", 'start': 1588.891, 'duration': 7.746}], 'summary': 'Approximated loss functions for x2 and x3 using second-order taylor polynomials.', 'duration': 36.992, 'max_score': 1559.645, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/StWY5QWMXCw/pics/StWY5QWMXCw1559645.jpg'}, {'end': 1663.24, 'src': 'embed', 'start': 1632.288, 'weight': 1, 'content': [{'end': 1637.931, 'text': "Now let's move everything to the top so we have some space to determine the optimal value for gamma.", 'start': 1632.288, 'duration': 5.643}, {'end': 1648.676, 'text': 'The first step in finding the optimal value for gamma is to take the derivative of the sum of the two approximate loss functions with respect to gamma.', 'start': 1639.732, 'duration': 8.944}, {'end': 1655.017, 'text': 'the derivative of this part is zero since gamma is not involved at all.', 'start': 1650.595, 'duration': 4.422}, {'end': 1663.24, 'text': 'For the second term, since everything between the square brackets is like a constant with respect to gamma,', 'start': 1656.637, 'duration': 6.603}], 'summary': 'Determining the optimal value for gamma involves taking derivatives and treating certain terms as constants.', 'duration': 30.952, 'max_score': 1632.288, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/StWY5QWMXCw/pics/StWY5QWMXCw1632288.jpg'}, {'end': 1820.508, 'src': 'embed', 'start': 1791.043, 'weight': 2, 'content': [{'end': 1802.61, 'text': 'we can plug in p sub 2 for the predicted probability for x sub 2, and we can plug in p sub 3, the predicted probability for x sub 3..', 'start': 1791.043, 'duration': 11.567}, {'end': 1804.212, 'text': 'Now we just tidy everything up.', 'start': 1802.61, 'duration': 1.602}, {'end': 1820.508, 'text': 'Double bam! 
At long last, we see that gamma is equal to the sum of the residuals divided by the sum of p times 1 minus p for each sample in the leaf.', 'start': 1806.013, 'duration': 14.495}], 'summary': 'Gamma is the sum of residuals divided by the sum of p times 1 minus p for each sample in the leaf.', 'duration': 29.465, 'max_score': 1791.043, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/StWY5QWMXCw/pics/StWY5QWMXCw1791043.jpg'}], 'start': 1559.645, 'title': 'Optimizing gamma in loss function', 'summary': 'Discusses the process of optimizing the value of gamma in the loss function using second-order taylor polynomial approximation and deriving the optimal value. it also explains the calculation of gamma, which is the sum of residuals divided by the sum of p times 1 minus p for each sample in the leaf.', 'chapters': [{'end': 1733.947, 'start': 1559.645, 'title': 'Optimizing gamma in loss function', 'summary': 'Discusses the process of optimizing the value of gamma in the loss function by using second-order taylor polynomial approximation and deriving the optimal value for gamma.', 'duration': 174.302, 'highlights': ['The chapter explains the process of approximating the loss function for samples x2 and x3 using second-order Taylor polynomials.', 'It describes the steps to determine the optimal value for gamma by taking the derivative of the sum of approximate loss functions with respect to gamma and solving for gamma.', 'The derivative of the sum of the approximate loss functions with respect to gamma is calculated and simplified to derive the solution for gamma.']}, {'end': 1845.107, 'start': 1737.788, 'title': 'Calculation of gamma for output value', 'summary': 'Explains the calculation of gamma, which is the sum of residuals divided by the sum of p times 1 minus p for each sample in the leaf.', 'duration': 107.319, 'highlights': ['The calculation of gamma involves determining the predicted probabilities for x sub 2 and x sub 3 and then 
finding the sum of the residuals in the numerator.', 'Gamma is equal to the sum of the residuals divided by the sum of p times 1 minus p for each sample in the leaf.', 'The second derivative of the loss function equals p times 1 minus p, and by plugging in the predicted probabilities for x sub 2 and x sub 3, we can simplify the equation.']}], 'duration': 285.462, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/StWY5QWMXCw/pics/StWY5QWMXCw1559645.jpg', 'highlights': ['The chapter explains the process of approximating the loss function for samples x2 and x3 using second-order Taylor polynomials.', 'The steps to determine the optimal value for gamma by taking the derivative of the sum of approximate loss functions with respect to gamma and solving for gamma are described.', 'The calculation of gamma involves determining the predicted probabilities for x sub 2 and x sub 3 and then finding the sum of the residuals in the numerator.', 'Gamma is equal to the sum of the residuals divided by the sum of p times 1 minus p for each sample in the leaf.']}, {'end': 2211.154, 'segs': [{'end': 1960.157, 'src': 'embed', 'start': 1846.536, 'weight': 0, 'content': [{'end': 1853.339, 'text': 'Now we need to plug in the most recent predicted probabilities, p2 and p3, for x2 and x3.', 'start': 1846.536, 'duration': 6.803}, {'end': 1858.162, 'text': 'Just like before, since we are building the first tree.', 'start': 1854.82, 'duration': 3.342}, {'end': 1865.005, 'text': 'the predicted probability for these samples is derived from f, sub 0 of x, the most recent log odds prediction.', 'start': 1858.162, 'duration': 6.843}, {'end': 1872.49, 'text': 'Note, since we are just starting out, the predicted probabilities are the same for all of the samples.', 'start': 1866.747, 'duration': 5.743}, {'end': 1876.712, 'text': 'However, after we build the first tree, they can be different.', 'start': 1873.07, 'duration': 3.642}, {'end': 1879.274, 'text': 'Now just do the 
math.', 'start': 1877.993, 'duration': 1.281}, {'end': 1887.458, 'text': 'And the output value for leaf R sub two comma one is negative 0.77.', 'start': 1880.654, 'duration': 6.804}, {'end': 1892.02, 'text': 'Hooray! We made it through step two, part C.', 'start': 1887.458, 'duration': 4.562}, {'end': 1894.862, 'text': 'We calculated output values for each leaf in the tree.', 'start': 1892.02, 'duration': 2.842}, {'end': 1899.135, 'text': "Now let's do Part D.", 'start': 1896.434, 'duration': 2.701}, {'end': 1902.435, 'text': 'In Part D, we make a new prediction for each sample.', 'start': 1899.135, 'duration': 3.3}, {'end': 1912.117, 'text': 'Since this is our first pass through Step 2 and m equals 1, this new prediction will be called f sub 1 of x.', 'start': 1903.495, 'duration': 8.622}, {'end': 1916.538, 'text': 'The new prediction f sub 1 of x is based on the last prediction we made.', 'start': 1912.117, 'duration': 4.421}, {'end': 1925.98, 'text': 'f sub 0 of x, plus the learning rate nu times the output values from the first tree we made.', 'start': 1916.538, 'duration': 9.442}, {'end': 1932.436, 'text': 'Note, this summation is there just in case a single sample ends up in multiple leaves.', 'start': 1927.372, 'duration': 5.064}, {'end': 1939.802, 'text': "Also note, we've set the learning rate, nu, to 0.8, which is relatively large.", 'start': 1933.797, 'duration': 6.005}, {'end': 1944.426, 'text': 'For more details about this, check out the StatQuest on the main ideas.', 'start': 1940.442, 'duration': 3.984}, {'end': 1950.671, 'text': "Hooray! 
We've created f sub 1 of x.", 'start': 1946.007, 'duration': 4.664}, {'end': 1954.734, 'text': 'Now we will use f sub 1 of x to make new predictions for each sample.', 'start': 1950.671, 'duration': 4.063}, {'end': 1960.157, 'text': "We'll start with the first sample, x sub 1.", 'start': 1956.155, 'duration': 4.002}], 'summary': 'Using predicted probabilities and output values, created f1 of x for new predictions.', 'duration': 113.621, 'max_score': 1846.536, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/StWY5QWMXCw/pics/StWY5QWMXCw1846536.jpg'}, {'end': 2076.572, 'src': 'embed', 'start': 2017.82, 'weight': 6, 'content': [{'end': 2023.865, 'text': 'The new log odds prediction for the third sample is 0.07, which is better than before.', 'start': 2017.82, 'duration': 6.045}, {'end': 2028.848, 'text': 'Hooray! We made it through one iteration of step two.', 'start': 2025.125, 'duration': 3.723}, {'end': 2034.173, 'text': 'We started by setting little m equal to 1.', 'start': 2030.169, 'duration': 4.004}, {'end': 2040.838, 'text': 'Then we calculated pseudo residuals by plugging in the observed values and the latest predictions.', 'start': 2034.173, 'duration': 6.665}, {'end': 2043.099, 'text': 'and that gave us residuals.', 'start': 2041.758, 'duration': 1.341}, {'end': 2052.645, 'text': 'Then we fit a regression tree to the residuals and computed the output values, gamma sub j comma m, for each leaf.', 'start': 2044.44, 'duration': 8.205}, {'end': 2064.351, 'text': 'Lastly, we made new predictions for each sample f, sub 1 of x, based on the previous prediction, f, sub 0 of x,', 'start': 2054.246, 'duration': 10.105}, {'end': 2070.476, 'text': 'the learning rate nu and the output values gamma, sub j, comma m from the new tree.', 'start': 2064.351, 'duration': 6.125}, {'end': 2076.572, 'text': 'Now we set little m equal to 2 and do everything over again.', 'start': 2072.371, 'duration': 4.201}], 'summary': 'Improved log odds prediction at 
0.07 after one iteration.', 'duration': 58.752, 'max_score': 2017.82, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/StWY5QWMXCw/pics/StWY5QWMXCw2017820.jpg'}, {'end': 2185.102, 'src': 'embed', 'start': 2145.05, 'weight': 4, 'content': [{'end': 2148.511, 'text': 'The predicted log odds that this person will love Troll 2 equals.', 'start': 2145.05, 'duration': 3.461}, {'end': 2159.059, 'text': '3.4 The predicted probability that this person will love Troll 2 equals 0.97.', 'start': 2154.357, 'duration': 4.702}, {'end': 2168.123, 'text': 'If we use a threshold of 0.5 for deciding if someone loves Troll 2, then since 0.97 is greater than 0.5, this person loves Troll 2.', 'start': 2159.059, 'duration': 9.064}, {'end': 2181.899, 'text': 'Triple bam! Holy freaking smokes.', 'start': 2168.123, 'duration': 13.776}, {'end': 2183.961, 'text': 'We made it through this whole algorithm.', 'start': 2182.159, 'duration': 1.802}, {'end': 2185.102, 'text': "I can't believe it.", 'start': 2184.141, 'duration': 0.961}], 'summary': 'Predicted probability of loving troll 2 is 97%.', 'duration': 40.052, 'max_score': 2145.05, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/StWY5QWMXCw/pics/StWY5QWMXCw2145050.jpg'}], 'start': 1846.536, 'title': 'Gradient boosting algorithm', 'summary': 'Demonstrates the step-by-step process of building a gradient boosting algorithm, including the calculation of predicted probabilities, output values for each leaf in the tree, and the creation of new predictions based on the learning rate and previous predictions. 
Additionally, it explains the process of the Gradient Boost algorithm by iteratively making predictions for movie preferences, resulting in a log odds prediction of 3.4 and a probability of 0.97, showcasing its effectiveness for classification.', 'chapters': [{'end': 1960.157, 'start': 1846.536, 'title': 'Gradient boosting algorithm - step by step', 'summary': 'Demonstrates the step-by-step process of building a gradient boosting algorithm, including the calculation of predicted probabilities, output values for each leaf in the tree, and the creation of new predictions based on the learning rate and previous predictions.', 'duration': 113.621, 'highlights': ['The predicted probabilities for x2 and x3 are derived from the most recent log odds prediction, and they are initially the same for all samples.', 'The output value for leaf R sub two comma one is calculated to be negative 0.77, signifying successful completion of step two, part C.', 'A new prediction f sub 1 of x is created based on the last prediction f sub 0 of x, plus the learning rate nu times the output values from the first tree, with the learning rate set to 0.8.', 'The new prediction f sub 1 of x is used to make new predictions for each sample, starting with x sub 1.']}, {'end': 2211.154, 'start': 1960.157, 'title': 'Gradient boosting for predicting movie preferences', 'summary': 'Explains the process of the Gradient Boost algorithm by iteratively making predictions for movie preferences, resulting in a log odds prediction of 3.4 and a probability of 0.97, showcasing its effectiveness for classification.', 'duration': 250.997, 'highlights': ['The predicted log odds for movie preference is 3.4, indicating a strong prediction outcome.', 'The probability of preference for the movie is 0.97, demonstrating a high likelihood of positive outcome.', 'The algorithm iteratively improves predictions, with the log odds prediction for the first sample being 1.89, signifying a better prediction.', "The log odds prediction 
for the third sample is 0.07, showcasing the algorithm's ability to refine predictions over iterations.", 'The process involves fitting a regression tree, computing output values, and making new predictions, contributing to the iterative improvement of the model.']}], 'duration': 364.618, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/StWY5QWMXCw/pics/StWY5QWMXCw1846536.jpg', 'highlights': ['The predicted probabilities for x2 and x3 are derived from the most recent log odds prediction, and they are initially the same for all samples.', 'The output value for leaf R sub two comma one is calculated to be negative 0.77, signifying successful completion of step two, part C.', 'The new prediction f sub 1 of x is created based on the last prediction f sub 0 of x, plus the learning rate nu times the output values from the first tree, with the learning rate set to 0.8.', 'The new prediction f sub 1 of x is used to make new predictions for each sample, starting with x sub 1.', 'The predicted log odds for movie preference is 3.4, indicating a strong prediction outcome.', 'The probability of preference for the movie is 0.97, demonstrating a high likelihood of positive outcome.', 'The algorithm iteratively improves predictions, with the log odds prediction for the first sample being 1.89, signifying a better prediction.', "The log odds prediction for the third sample is 0.07, showcasing the algorithm's ability to refine predictions over iterations.", 'The process involves fitting a regression tree, computing output values, and making new predictions, contributing to the iterative improvement of the model.']}], 'highlights': ['The chapter emphasizes the importance of understanding log odds and log likelihood in logistic regression for effectively using Gradient Boost for classification.', 'In practice, Gradient Boost usually uses trees with between 8 and 32 leaves for classification.', "The chapter provides a step-by-step walkthrough of the original 
gradient boost algorithm for classification using an incredibly small training set to focus on algorithm's details.", 'Maximizing log likelihood is the goal in logistic regression for better predictions.', 'The need to transform log likelihood into a loss function by multiplying it by negative one.', 'Derivative of the loss function with respect to the predicted log odds is derived and explained.', 'Calculation of pseudo residuals and building a regression tree to predict residuals is demonstrated.', 'The process of solving for gamma in the context of derivative of the loss function with respect to the predicted log odds is introduced, demonstrating the simplification of the equation to obtain the solution for gamma, represented as -1 times the derivative of the loss function divided by the second derivative of the loss function.', "The chapter emphasizes the need to manage the length of the formula to avoid complexity, highlighting the approach of using y sub i to refer to the observed values and simplifying the equation by removing the big sigma and swapping the i's with 1's", 'The process involves rewriting the second derivative as a fraction and then multiplying the top and bottom of the second term by 1 plus e to the log odds.', 'The output value for leaf r sub one comma one is calculated as 1.5, demonstrating the application of the derived relationship to compute specific output values for leaf nodes.', 'The chapter explains the process of approximating the loss function for samples x2 and x3 using second-order Taylor polynomials.', 'The steps to determine the optimal value for gamma by taking the derivative of the sum of approximate loss functions with respect to gamma and solving for gamma are described.', 'The calculation of gamma involves determining the predicted probabilities for x sub 2 and x sub 3 and then finding the sum of the residuals in the numerator.', 'Gamma is equal to the sum of the residuals divided by the sum of p times 1 minus p for each 
sample in the leaf.', 'The predicted probabilities for x2 and x3 are derived from the most recent log odds prediction, and they are initially the same for all samples.', 'The new prediction f sub 1 of x is created based on the last prediction f sub 0 of x, plus the learning rate nu times the output values from the first tree, with the learning rate set to 0.8.', 'The new prediction f sub 1 of x is used to make new predictions for each sample, starting with x sub 1.', 'The predicted log odds for movie preference is 3.4, indicating a strong prediction outcome.', 'The probability of preference for the movie is 0.97, demonstrating a high likelihood of positive outcome.', 'The algorithm iteratively improves predictions, with the log odds prediction for the first sample being 1.89, signifying a better prediction.', "The log odds prediction for the third sample is 0.07, showcasing the algorithm's ability to refine predictions over iterations.", 'The process involves fitting a regression tree, computing output values, and making new predictions, contributing to the iterative improvement of the model.']}
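The derivative results summarized above can be checked numerically: the first derivative of the loss with respect to the predicted log(odds) is p minus the observed value (so the negative gradient is the residual), and the second derivative is p times (1 minus p). A minimal sketch, assuming the single-sample loss written in terms of the log(odds) F as -y*F + log(1 + e^F); the values y = 1 and F = 0.7 are illustrative choices, not numbers from the video.

```python
import math

def loss(y, log_odds):
    # Single-sample loss in terms of the predicted log(odds) F:
    # L = -y*F + log(1 + e^F), i.e. -1 times the log(likelihood).
    return -y * log_odds + math.log(1.0 + math.exp(log_odds))

def p_from_log_odds(f):
    # Convert a log(odds) prediction into a probability.
    return math.exp(f) / (1.0 + math.exp(f))

y, f, h = 1.0, 0.7, 1e-4   # illustrative sample, prediction, and step size
p = p_from_log_odds(f)

# Central finite differences for the first and second derivatives of the
# loss with respect to the log(odds).
d1 = (loss(y, f + h) - loss(y, f - h)) / (2 * h)
d2 = (loss(y, f + h) - 2 * loss(y, f) + loss(y, f - h)) / h ** 2

print(round(d1, 4), round(p - y, 4))        # first derivative matches p - y
print(round(d2, 4), round(p * (1 - p), 4))  # second derivative matches p*(1 - p)
```

Both pairs agree to numerical precision, matching the p times (1 minus p) result derived in the video.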
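The leaf output formula derived above, gamma equals the sum of the residuals divided by the sum of p times (1 minus p) over the samples in the leaf, reproduces the video's leaf values when every sample starts with the rounded predicted probability p of about 0.67 from f0(x). A sketch, assuming leaf R sub 1,1 holds one sample with observed value 1 and leaf R sub 2,1 holds x2 (observed 1) and x3 (observed 0), as in the walkthrough.

```python
def leaf_output(residuals, probs):
    # gamma_jm = sum(residuals) / sum(p * (1 - p)), the leaf output value
    # obtained from the second-order Taylor approximation of the loss.
    return sum(residuals) / sum(p * (1 - p) for p in probs)

p0 = 0.67  # rounded initial predicted probability for every sample

# Leaf R_{1,1}: one sample with observed value 1, residual = 1 - p0.
gamma_11 = leaf_output([1 - p0], [p0])
# Leaf R_{2,1}: x2 (observed 1) and x3 (observed 0).
gamma_21 = leaf_output([1 - p0, 0 - p0], [p0, p0])

print(round(gamma_11, 1))  # 1.5, the output value for leaf R_{1,1}
print(round(gamma_21, 2))  # -0.77, the output value for leaf R_{2,1}
```

These match the 1.5 and negative 0.77 quoted in the summaries above.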
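The Part D update and the final classification step can be sketched the same way. Assuming an initial log(odds) prediction f0 of about 0.69 (consistent with the 1.89 figure quoted above, though not stated in this chunk), the learning rate nu of 0.8 from the video, and the final log(odds) of 3.4 from the summary; the sigmoid converts log(odds) back to a probability, which is compared against the 0.5 threshold.

```python
import math

def sigmoid(log_odds):
    # Convert a log(odds) prediction into a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))

nu = 0.8   # learning rate, as set in the video
f0 = 0.69  # assumed initial log(odds) prediction

# Part D update for sample x1, which landed in the leaf with output 1.5:
# f1(x) = f0(x) + nu * (output values of the leaves containing x).
f1_x1 = f0 + nu * 1.5
print(round(f1_x1, 2))  # 1.89, the improved log(odds) for x1

# Final classification: a log(odds) of 3.4 maps to a probability of about
# 0.97, which exceeds the 0.5 threshold, so this person loves Troll 2.
p = sigmoid(3.4)
print(round(p, 2), p > 0.5)  # 0.97 True
```

Note the summation in the real update rule covers the rare case where a sample appears in more than one leaf; here each sample sits in exactly one leaf, so a single term suffices.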