title
XGBoost Part 3 (of 4): Mathematical Details
description
In this video we dive into the nitty-gritty details of the math behind XGBoost trees. We derive the equations for the Output Values from the leaves as well as the Similarity Score. Then we show how these general equations are customized for Regression or Classification by their respective Loss Functions. If you make it to the end, you will be approximately 22% smarter than you are now! :)
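For quick reference, here are the two results the video derives, written in the notation of the XGBoost manuscript, where g_i and h_i are the first and second derivatives (gradient and Hessian) of the loss for observation i and lambda is the regularization parameter:

Output Value = -\frac{\sum_i g_i}{\sum_i h_i + \lambda}

Similarity Score = \frac{\left(\sum_i g_i\right)^2}{\sum_i h_i + \lambda}

For regression, the loss \frac{1}{2}(y_i - p_i)^2 gives g_i = -(y_i - p_i) and h_i = 1, so the numerators become the sum of the residuals (or its square) and the denominators become the number of residuals plus lambda. For classification, the negative log likelihood gives g_i = -(y_i - p_i) and h_i = p_i(1 - p_i), so the denominators become \sum_i p_i(1 - p_i) + \lambda. (The manuscript's Similarity Score carries an extra factor of 1/2 that the implementation drops, since the score is only used for relative comparisons.)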
NOTE: This StatQuest assumes that you are already familiar with...
XGBoost Part 1: XGBoost Trees for Regression: https://youtu.be/OtD8wVaFm6E
XGBoost Part 2: XGBoost Trees for Classification: https://youtu.be/8b1JEDvenQU
Gradient Boost Part 1: Regression Main Ideas: https://youtu.be/3CC4N4z3GJc
Gradient Boost Part 2: Regression Details: https://youtu.be/2xudPOBz-vs
Gradient Boost Part 3: Classification Main Ideas: https://youtu.be/jxuNLH5dXCs
Gradient Boost Part 4: Classification Details: https://youtu.be/StWY5QWMXCw
...and Ridge Regression: https://youtu.be/Q81RR3yKn30
Also note, this StatQuest is based on the following sources:
The original XGBoost manuscript: https://arxiv.org/pdf/1603.02754.pdf
The original XGBoost presentation: https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf
And the XGBoost Documentation: https://xgboost.readthedocs.io/en/latest/index.html
Last but not least, I want to extend a special thanks to Giuseppe Fasanella and Samuel Judge for thoughtful discussions and for helping me understand the math.
For a complete index of all the StatQuest videos, check out:
https://statquest.org/video-index/
If you'd like to support StatQuest, please consider...
Buying The StatQuest Illustrated Guide to Machine Learning!!!
PDF - https://statquest.gumroad.com/l/wvtmc
Paperback - https://www.amazon.com/dp/B09ZCKR4H6
Kindle eBook - https://www.amazon.com/dp/B09ZG79HXC
Patreon: https://www.patreon.com/statquest
...or...
YouTube Membership: https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw/join
...a cool StatQuest t-shirt or sweatshirt:
https://shop.spreadshirt.com/statquest-with-josh-starmer/
...buying one or two of my songs (or go large and get a whole album!)
https://joshuastarmer.bandcamp.com/
...or just donating to StatQuest!
https://www.paypal.me/statquest
Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on Twitter:
https://twitter.com/joshuastarmer
Corrections:
1:16 The Lambda should be outside of the square brackets.
#statquest #xgboost
detail
{'title': 'XGBoost Part 3 (of 4): Mathematical Details', 'heatmap': [{'end': 315.223, 'start': 280.6, 'weight': 0.899}, {'end': 411.208, 'start': 344.335, 'weight': 0.746}, {'end': 1372.234, 'start': 1235.494, 'weight': 0.835}], 'summary': 'Delves into the mathematical intricacies of xgboost, covering derivation of equations for similarity scores and output values, discussing practical applications in regression and classification, quantifying predictions, applying loss functions, and exploring the effects of increasing lambda and second-order taylor approximation in xgboost for regression and classification tasks, among other topics.', 'chapters': [{'end': 278.659, 'segs': [{'end': 145.559, 'src': 'embed', 'start': 38.943, 'weight': 0, 'content': [{'end': 42.484, 'text': 'Lastly, it assumes that you are familiar with ridge regression.', 'start': 38.943, 'duration': 3.541}, {'end': 45.346, 'text': 'If not, the link is in the description below.', 'start': 42.905, 'duration': 2.441}, {'end': 53.497, 'text': 'In XGBoost Part 1, we saw how XGBoost builds XGBoost trees for regression.', 'start': 47.153, 'duration': 6.344}, {'end': 61.482, 'text': 'And in XGBoost Part 2, we saw how XGBoost builds XGBoost trees for classification.', 'start': 54.738, 'duration': 6.744}, {'end': 69.367, 'text': 'In both cases, we build the trees using similarity scores and then calculated the output values for the leaves.', 'start': 62.723, 'duration': 6.644}, {'end': 81.144, 'text': 'Now we will derive the equations for the similarity scores and the output values and show you how the only difference between regression and classification is the loss function.', 'start': 70.655, 'duration': 10.489}, {'end': 91.933, 'text': "To keep the examples manageable, we'll start with this simple training dataset for regression and this simple training dataset for classification.", 'start': 82.625, 'duration': 9.308}, {'end': 101.38, 'text': 'For regression, we are using drug dosage on the x-axis to predict drug effectiveness on the y-axis.', 'start': 93.359, 'duration': 8.021}, {'end': 110.922, 'text': 'For classification, we are using drug dosage on the x-axis to predict the probability the drug will be effective.', 'start': 102.861, 'duration': 8.061}, {'end': 122.276, 'text': 'For both regression and classification, we already know that XGBoost starts with an initial prediction that is usually 0.5.', 'start': 112.642, 'duration': 9.634}, {'end': 129.863, 'text': 'And in both cases, we can represent this prediction with a thick black line at 0.5.', 'start': 122.276, 'duration': 7.587}, {'end': 137.051, 'text': 'And the residuals, the differences between the observed and predicted values, show us how good the initial prediction is.', 'start': 129.863, 'duration': 7.188}, {'end': 145.559, 'text': 'Just like in regular, unextreme gradient boost, we can quantify how good the prediction is with a loss function.', 'start': 138.632, 'duration': 6.927}], 'summary': 'Xgboost builds trees for regression and classification using similarity scores and output values, with the only difference being the loss function.', 'duration': 106.616, 'max_score': 38.943, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ZVFeW798-2I/pics/ZVFeW798-2I38943.jpg'}], 'start': 3.09, 'title': 'Xgboost: mathematical and practical details', 'summary': 'Explores the mathematical intricacies of xgboost, covering the derivation of equations for similarity scores and output values, and also discusses practical applications in regression 
and classification, quantifying predictions and applying loss functions.', 'chapters': [{'end': 91.933, 'start': 3.09, 'title': 'Xgboost part 3: mathematical details', 'summary': 'Delves into the mathematical details of xgboost, assuming prior knowledge of tree building, gradient boost, and ridge regression. it demonstrates the derivation of equations for similarity scores and output values, highlighting the difference between regression and classification as the loss function.', 'duration': 88.843, 'highlights': ['The StatQuest explores the mathematical details of XGBoost, assuming familiarity with tree building, Gradient Boost, and ridge regression, and demonstrates the derivation of equations for similarity scores and output values.', 'The chapter also shows that the only difference between regression and classification in XGBoost is the loss function.', 'It assumes that the audience is already familiar with ridge regression, and provides links for further understanding of XGBoost and Gradient Boost in the description below.']}, {'end': 278.659, 'start': 93.359, 'title': 'Xgboost: regression and classification details', 'summary': 'Discusses using drug dosage for predicting drug effectiveness in regression and the probability of drug effectiveness in classification using xgboost, quantifying prediction with a loss function, and applying loss function to initial and new predictions.', 'duration': 185.3, 'highlights': ['Using drug dosage to predict drug effectiveness in regression and the probability of drug effectiveness in classification with XGBoost.', 'Quantifying prediction with a loss function: 1 half times the squared residual for regression and negative log likelihood for classification.', 'Applying the loss function to initial and new predictions for evaluating prediction improvement.']}], 'duration': 275.569, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ZVFeW798-2I/pics/ZVFeW798-2I3090.jpg', 'highlights': ['The chapter explores the mathematical details of XGBoost, assuming familiarity with tree building, Gradient Boost, and ridge regression.', 'Using drug dosage to predict drug effectiveness in regression and the probability of drug effectiveness in classification with XGBoost.', 'The only difference between regression and classification in XGBoost is the loss function.', 'Quantifying prediction with a loss function: 1 half times the squared residual for regression and negative log likelihood for classification.']}, {'end': 571.302, 'segs': [{'end': 411.208, 'src': 'heatmap', 'start': 280.6, 'weight': 0, 'content': [{'end': 287.424, 'text': 'Now that we have one loss function for regression and another loss function for classification,', 'start': 280.6, 'duration': 6.824}, {'end': 292.687, 'text': 'XGBoost uses those loss functions to build trees by minimizing this equation.', 'start': 287.424, 'duration': 5.263}, {'end': 301.072, 'text': 'Note, the equation in the original manuscript for XGBoost contains an extra term that I am omitting.', 'start': 294.408, 'duration': 6.664}, {'end': 309.297, 'text': 'This term gamma times t, where t is the number of terminal nodes or leaves in a tree,', 'start': 302.613, 'duration': 6.684}, {'end': 315.223, 'text': 'and gamma is a user-definable penalty is meant to encourage pruning.', 'start': 310.481, 'duration': 4.742}, {'end': 325.986, 'text': 'I say that it encourages pruning because, as we saw in XGBoost Part 1, XGBoost can prune even when gamma equals zero.', 'start': 316.643, 'duration': 9.343}, {'end': 
331.768, 'text': "I'm omitting this term because, as we saw in Parts 1 and 2,", 'start': 327.427, 'duration': 4.341}, {'end': 339.751, 'text': 'pruning takes place after the full tree is built and it plays no role in deriving the optimal output values or similarity scores.', 'start': 331.768, 'duration': 7.983}, {'end': 343.435, 'text': "So let's talk about this equation.", 'start': 341.775, 'duration': 1.66}, {'end': 347.936, 'text': 'The first part is the loss function, which we just talked about.', 'start': 344.335, 'duration': 3.601}, {'end': 352.557, 'text': 'And the second part consists of a regularization term.', 'start': 349.336, 'duration': 3.221}, {'end': 359.018, 'text': 'The goal is to find an output value for the leaf that minimizes the whole equation.', 'start': 353.937, 'duration': 5.081}, {'end': 369.68, 'text': 'And in a way that is very similar to ridge regression, we square the output value from the new tree and scale it with lambda.', 'start': 360.558, 'duration': 9.122}, {'end': 379.145, 'text': 'Later on, I will show you that, just like ridge regression, if lambda is greater than zero, then we will shrink the output value.', 'start': 371.243, 'duration': 7.902}, {'end': 382.686, 'text': 'The one half just makes the math easier.', 'start': 380.405, 'duration': 2.281}, {'end': 388.707, 'text': 'Note because we are optimizing the output value from the first tree,', 'start': 384.326, 'duration': 4.381}, {'end': 398.73, 'text': 'we can replace the prediction p sub i with the initial prediction p of zero plus the output value from the new tree.', 'start': 388.707, 'duration': 10.023}, {'end': 406.127, 'text': "Now that we understand all of the terms in this equation, let's use it to build the first tree.", 'start': 400.546, 'duration': 5.581}, {'end': 411.208, 'text': 'We start by putting all of the residuals into a single leaf.', 'start': 407.947, 'duration': 3.261}], 'summary': 'Xgboost uses a loss function and a regularization term to build trees, aiming to minimize the equation and optimize output values. 
the model can prune even with a zero gamma, and the process is similar to ridge regression.', 'duration': 89.08, 'max_score': 280.6, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ZVFeW798-2I/pics/ZVFeW798-2I280600.jpg'}, {'end': 546.968, 'src': 'embed', 'start': 515.717, 'weight': 1, 'content': [{'end': 525.502, 'text': 'Now we plot the point on the graph, and we see that negative 1 is a worse choice for the output value than 0 because it has a larger total loss.', 'start': 515.717, 'duration': 9.785}, {'end': 538.762, 'text': 'In contrast, if we set the output value to positive 1, then the new prediction is 0.5 plus 1,, which equals 1.5,', 'start': 527.343, 'duration': 11.419}, {'end': 546.968, 'text': 'and that makes the residual for y sub 1 larger, but the residuals for y sub 2 and y sub 3 are smaller.', 'start': 538.762, 'duration': 8.206}], 'summary': 'Choosing positive 1 as output value reduces total loss with smaller residuals.', 'duration': 31.251, 'max_score': 515.717, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ZVFeW798-2I/pics/ZVFeW798-2I515717.jpg'}], 'start': 280.6, 'title': 'Xgboost and gradient boosting', 'summary': "Discusses xgboost's use of loss functions for regression and classification, and explains the gradient boosting equation, including the impact of different output values on the total loss function.", 'chapters': [{'end': 339.751, 'start': 280.6, 'title': 'Xgboost loss functions and tree building', 'summary': 'Discusses how xgboost uses loss functions for regression and classification to build trees by minimizing an equation, with an omitted term for encouraging pruning.', 'duration': 59.151, 'highlights': ['XGBoost uses loss functions for regression and classification to build trees by minimizing an equation.', 'The equation in the original manuscript for XGBoost contains an extra term, gamma times t, where gamma is a user-definable penalty meant to encourage pruning.', 'Pruning in XGBoost can occur even when gamma equals zero, and it plays no role in deriving the optimal output values or similarity scores.']}, {'end': 571.302, 'start': 341.775, 'title': 'Gradient boosting equation analysis', 'summary': 'Explains the gradient boosting equation, including the loss function and regularization term, demonstrating the impact of different output values on the total loss function and identifying the best output value.', 'duration': 229.527, 'highlights': ['The chapter explains the impact of different output values on the total loss function, with negative 1 resulting in a total loss of 109.4 and positive 1 resulting in the lowest total loss of 102.4.', 'The chapter details the components of the gradient boosting equation, including the loss function and regularization term, and demonstrates the process of finding the best output value to minimize the equation.', 'The chapter compares the impact of different output values on the residuals, illustrating that positive 1 leads to the lowest total loss of 102.4, while negative 1 results in a larger total loss of 109.4.', 'The chapter simplifies the analysis by setting the regularization term to zero and demonstrates the process of finding the output value that minimizes the equation, identifying positive 1 as the best choice with the lowest total loss of 102.4.']}], 'duration': 290.702, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ZVFeW798-2I/pics/ZVFeW798-2I280600.jpg', 'highlights': ['XGBoost uses loss functions for regression 
and classification to build trees by minimizing an equation.', 'The chapter explains the impact of different output values on the total loss function, with negative 1 resulting in a total loss of 109.4 and positive 1 resulting in the lowest total loss of 102.4.', 'The equation in the original manuscript for XGBoost contains an extra term, gamma times t, where gamma is a user-definable penalty meant to encourage pruning.', 'The chapter details the components of the gradient boosting equation, including the loss function and regularization term, and demonstrates the process of finding the best output value to minimize the equation.']}, {'end': 738.123, 'segs': [{'end': 636.572, 'src': 'embed', 'start': 604.684, 'weight': 1, 'content': [{'end': 615.461, 'text': 'In other words, the more emphasis we give the regularization penalty by increasing lambda, the optimal output value gets closer to 0.', 'start': 604.684, 'duration': 10.777}, {'end': 620.083, 'text': "And this is exactly what regularization is supposed to do, so that's super cool.", 'start': 615.461, 'duration': 4.622}, {'end': 628.408, 'text': 'Bam! Now, one last thing before we solve for the optimal output value.', 'start': 620.904, 'duration': 7.504}, {'end': 636.572, 'text': 'You may remember that when regular unextreme gradient boost found the optimal output value for a leaf,', 'start': 629.848, 'duration': 6.724}], 'summary': 'Increasing lambda brings optimal output closer to 0, showcasing the effect of regularization.', 'duration': 31.888, 'max_score': 604.684, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ZVFeW798-2I/pics/ZVFeW798-2I604684.jpg'}, {'end': 711.782, 'src': 'embed', 'start': 661.463, 'weight': 0, 'content': [{'end': 669.829, 'text': 'Unextreme Gradient Boost uses second-order Taylor approximation to simplify the math when solving for the optimal output value.', 'start': 661.463, 'duration': 8.366}, {'end': 678.505, 'text': 'In contrast, XGBoost uses the second-order Taylor approximation for both regression and classification.', 'start': 671.081, 'duration': 7.424}, {'end': 685.61, 'text': 'Unfortunately, explaining Taylor series approximations is out of the scope of this StatQuest.', 'start': 680.106, 'duration': 5.504}, {'end': 696.416, 'text': "So you'll just have to take my word for it that the loss function that includes the output value can be approximated by this mess of sums and derivatives.", 'start': 686.79, 'duration': 9.626}, {'end': 702.52, 'text': "The genius of a Taylor approximation is that it's made of relatively simple parts.", 'start': 697.639, 'duration': 4.881}, {'end': 707.501, 'text': 'This part is just the loss function for the previous prediction.', 'start': 703.82, 'duration': 3.681}, {'end': 711.782, 'text': 'This is the first derivative of that loss function.', 'start': 709.181, 'duration': 2.601}], 'summary': 'Unextreme gradient boost uses 2nd-order taylor approximation, xgboost uses it for regression and classification.', 'duration': 50.319, 'max_score': 661.463, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ZVFeW798-2I/pics/ZVFeW798-2I661463.jpg'}], 'start': 572.722, 'title': 'Regularization and taylor approximation in xgboost', 'summary': 'Explores the effects of increasing lambda in xgboost, demonstrating its impact on shifting the optimal output value closer to 0. 
it also delves into the application of second-order taylor approximation in xgboost for regression and classification tasks.', 'chapters': [{'end': 738.123, 'start': 572.722, 'title': 'Regularization and taylor approximation in xgboost', 'summary': 'Explains how increasing lambda in xgboost shifts the optimal output value closer to 0, showcasing the impact of regularization, and discusses the use of second-order taylor approximation in xgboost for both regression and classification.', 'duration': 165.401, 'highlights': ['Increasing lambda in XGBoost shifts the lowest point in the parabola closer to 0, exemplifying the impact of regularization.', 'XGBoost utilizes second-order Taylor approximation for both regression and classification, simplifying the math when solving for the optimal output value.', 'The Taylor approximation in XGBoost consists of simple parts, including the loss function for the previous prediction, its first derivative, and its second derivative.']}], 'duration': 165.401, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ZVFeW798-2I/pics/ZVFeW798-2I572722.jpg', 'highlights': ['XGBoost utilizes second-order Taylor approximation for both regression and classification, simplifying the math when solving for the optimal output value.', 'Increasing lambda in XGBoost shifts the lowest point in the parabola closer to 0, exemplifying the impact of regularization.', 'The Taylor approximation in XGBoost consists of simple parts, including the loss function for the previous prediction, its first derivative, and its second derivative.']}, {'end': 909.428, 'segs': [{'end': 824.225, 'src': 'embed', 'start': 771.992, 'weight': 1, 'content': [{'end': 781.317, 'text': "Now it's worth noting that these terms do not contain the output value, and that means they have no effect on the optimal output value,", 'start': 771.992, 'duration': 9.325}, {'end': 784.493, 'text': 'so we can omit them from the optimization.', 'start': 782.293, 'duration': 2.2}, {'end': 789.775, 'text': 'Now all that remains are terms associated with the output value.', 'start': 786.154, 'duration': 3.621}, {'end': 806.478, 'text': "So let's combine all of the unsquared output value terms into a single term and combine all of the squared output value terms into a single term and move the formula to give us some space to work.", 'start': 791.655, 'duration': 14.823}, {'end': 814.199, 'text': "Now let's do what we usually do when we want a value that minimizes a function.", 'start': 808.095, 'duration': 6.104}, {'end': 819.102, 'text': '1 Take the derivative with respect to the output value.', 'start': 814.219, 'duration': 4.883}, {'end': 821.524, 'text': '2 Set the derivative equal to zero.', 'start': 819.122, 'duration': 2.402}, {'end': 824.225, 'text': 'And 3.', 'start': 823.025, 'duration': 1.2}], 'summary': 'Terms without output value have no effect. derivative set to zero for minimizing function.', 'duration': 52.233, 'max_score': 771.992, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ZVFeW798-2I/pics/ZVFeW798-2I771992.jpg'}, {'end': 878.941, 'src': 'embed', 'start': 849.932, 'weight': 0, 'content': [{'end': 854.854, 'text': 'Now we set the derivative equal to 0 and solve for the output value.', 'start': 849.932, 'duration': 4.922}, {'end': 864.674, 'text': "So we subtract the sum of the g's from both sides, and divide both sides by the sum of the H's and lambda.", 'start': 856.575, 'duration': 8.099}, {'end': 870.677, 'text': 'Hooray! 
We have finally solved for the optimal output value for the leaf.', 'start': 866.275, 'duration': 4.402}, {'end': 878.941, 'text': "Now we need to plug in the gradients, the G's, and the Hessians, the H's, for the loss function.", 'start': 872.138, 'duration': 6.803}], 'summary': 'Derivative set to 0, solved for optimal output. gradients and hessians plugged in for loss function.', 'duration': 29.009, 'max_score': 849.932, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ZVFeW798-2I/pics/ZVFeW798-2I849932.jpg'}], 'start': 739.724, 'title': 'Optimizing output value', 'summary': 'Discusses optimizing the output value in the loss function, including the regularization term, second-order taylor approximation, and solving for the optimal output value using derivatives and gradients in xgboost for regression.', 'chapters': [{'end': 909.428, 'start': 739.724, 'title': 'Optimizing output value in loss function', 'summary': 'Discusses optimizing the output value in the loss function, including the regularization term, second-order taylor approximation, and solving for the optimal output value using derivatives and gradients in xgboost for regression.', 'duration': 169.704, 'highlights': ['Solving for the optimal output value using derivatives and gradients The process involves taking the derivative with respect to the output value, setting the derivative equal to zero, and solving for the output value using the sum of gradients and Hessians.', 'Combining unsquared and squared output value terms All unsquared output value terms are combined into a single term, and all squared output value terms are combined into another single term to simplify the optimization process.', 'Omitting terms without effect on the optimal output value Terms not containing the output value are omitted from the optimization process as they have no effect on the optimal output value.']}], 'duration': 169.704, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ZVFeW798-2I/pics/ZVFeW798-2I739724.jpg', 'highlights': ['Solving for the optimal output value using derivatives and gradients', 'Combining unsquared and squared output value terms', 'Omitting terms without effect on the optimal output value']}, {'end': 1269.458, 'segs': [{'end': 1006.557, 'src': 'embed', 'start': 978.951, 'weight': 0, 'content': [{'end': 983.674, 'text': 'In other words, the denominator is the number of residuals plus lambda.', 'start': 978.951, 'duration': 4.723}, {'end': 992.759, 'text': 'So, when we are using XGBoost for regression, this is the specific formula for the output value for a leaf.', 'start': 985.455, 'duration': 7.304}, {'end': 1003.955, 'text': "To summarize what we've done so far, we started out with this data, Then we made an initial prediction, 0.5.", 'start': 994.36, 'duration': 9.595}, {'end': 1006.557, 'text': 'Then we put all of the residuals in this leaf.', 'start': 1003.955, 'duration': 2.602}], 'summary': 'Xgboost regression uses a formula for leaf output, with an initial prediction of 0.5 and residuals.', 'duration': 27.606, 'max_score': 978.951, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ZVFeW798-2I/pics/ZVFeW798-2I978951.jpg'}, {'end': 1091.49, 'src': 'embed', 'start': 1037.458, 'weight': 2, 'content': [{'end': 1050.727, 'text': 'Bam! 
Now, if we are using XGBoost for classification, then this, the negative log likelihood, is the most commonly used loss function.', 'start': 1037.458, 'duration': 13.269}, {'end': 1061.776, 'text': 'Shameless self-promotion! This is the exact same loss function that we worked with in Gradient Boost Part 4, Classification Details.', 'start': 1052.108, 'duration': 9.668}, {'end': 1069.644, 'text': 'In that StatQuest, we spent a long time deriving the first and second derivative of this equation.', 'start': 1063.378, 'duration': 6.266}, {'end': 1077.252, 'text': 'Calculating the derivatives took a long time because the output values are in terms of the log odds.', 'start': 1071.166, 'duration': 6.086}, {'end': 1086.465, 'text': 'So we converted the probabilities to log odds one step at a time rather than skipping the fun parts like we are now.', 'start': 1078.877, 'duration': 7.588}, {'end': 1091.49, 'text': "Then we took the derivatives without skipping the fun parts like we're doing here.", 'start': 1087.526, 'duration': 3.964}], 'summary': 'Xgboost commonly uses negative log likelihood for classification, as explained in a previous statquest video.', 'duration': 54.032, 'max_score': 1037.458, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ZVFeW798-2I/pics/ZVFeW798-2I1037458.jpg'}, {'end': 1249.752, 'src': 'embed', 'start': 1215.87, 'weight': 1, 'content': [{'end': 1222.992, 'text': 'remember that we derived the equation for the output value by minimizing the sum of the loss functions plus the regularization.', 'start': 1215.87, 'duration': 7.122}, {'end': 1230.293, 'text': "And let's also remember that, depending on the loss function, optimizing this part can be hard.", 'start': 1224.312, 'duration': 5.981}, {'end': 1234.034, 'text': 'so we approximated it with a second-order Taylor polynomial.', 'start': 1230.293, 'duration': 3.741}, {'end': 1249.752, 'text': 'So we expanded the summation, added the regularization term and swapped in the second-order Taylor approximation of the loss function,', 'start': 1235.494, 'duration': 14.258}], 'summary': 'Derivation of output equation by minimizing loss functions with regularization and taylor approximation.', 'duration': 33.882, 'max_score': 1215.87, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ZVFeW798-2I/pics/ZVFeW798-2I1215870.jpg'}], 'start': 909.988, 'title': 'Xgboost formulas and functions', 'summary': 'Explains the xgboost regression formula for deriving output values, including the use of negative log likelihood as the loss function, and how to calculate the optimal output value for a leaf using derivatives. it outlines the numerator and denominator for leaf output, the process of deriving output value, and the use of negative log likelihood, first and second derivatives in xgboost.', 'chapters': [{'end': 1036.257, 'start': 909.988, 'title': 'Xgboost regression formula', 'summary': 'Explains the xgboost regression formula, stating that the numerator of the output value for a leaf is the sum of the residuals, while the denominator is the number of residuals plus lambda. 
it then outlines the process of deriving the output value for a leaf using an initial prediction, residuals, loss function, and regularization.', 'duration': 126.269, 'highlights': ['The numerator of the output value for a leaf is the sum of the residuals, which cancels out all negative signs, providing a clear understanding of the role of gradients in the XGBoost regression formula.', 'The denominator of the output value for a leaf is the number of residuals plus lambda, highlighting the importance of the Hessian and regularization in determining the output value.', 'The process of deriving the output value for a leaf involves making an initial prediction, putting all the residuals in the leaf, considering the loss function, and solving for the lowest point on the parabola where the derivative is zero, demonstrating the comprehensive approach of XGBoost regression formula in optimizing the output value.']}, {'end': 1269.458, 'start': 1037.458, 'title': 'Xgboost: loss function and output value', 'summary': 'Discusses the use of negative log likelihood as the most commonly used loss function in xgboost for classification, the derivation of first and second derivatives of the loss function, and the calculation of the optimal output value for a leaf using these derivatives.', 'duration': 232, 'highlights': ['The negative log likelihood is the most commonly used loss function in XGBoost for classification, and the first and second derivatives of this equation are crucial for calculating the optimal output value for a leaf.', 'The process of deriving the first and second derivatives of the loss function involves converting probabilities to log odds, taking the derivatives, and then converting the log odds back to probabilities, which is a time-consuming process.', 'The equation for the output value is derived by plugging the first and second derivatives of the loss function into the equation, and the output value is calculated using the sum of residuals and the sum of p sub i times 1 minus p sub i in the numerator and denominator, respectively.']}], 'duration': 359.47, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ZVFeW798-2I/pics/ZVFeW798-2I909988.jpg', 'highlights': ['The process of deriving the output value for a leaf involves making an initial prediction, putting all the residuals in the leaf, considering the loss function, and solving for the lowest point on the parabola where the derivative is zero, demonstrating the comprehensive approach of XGBoost regression formula in optimizing the output value.', 'The equation for the output value is derived by plugging the first and second derivatives of the loss function into the equation, and the output value is calculated using the sum of residuals and the sum of p sub i times 1 minus p sub i in the numerator and denominator, respectively.', 'The negative log likelihood is the most commonly used loss function in XGBoost for classification, and the first and second derivatives of this equation are crucial for calculating the optimal output value for a leaf.', 'The numerator of the output value for a leaf is the sum of the residuals, which cancels out all negative signs, providing a clear understanding of the role of gradients in the XGBoost regression formula.', 'The denominator of the output value for a leaf is the number of residuals plus lambda, highlighting the importance of the Hessian and regularization in determining the output value.', 'The process of deriving the first and second derivatives of the loss 
function involves converting probabilities to log odds, taking the derivatives, and then converting the log odds back to probabilities, which is a time-consuming process.']}, {'end': 1642.598, 'segs': [{'end': 1414.393, 'src': 'embed', 'start': 1293.11, 'weight': 0, 'content': [{'end': 1302.798, 'text': 'And that makes each term negative and it flips the parabola over the horizontal line y equals zero.', 'start': 1293.11, 'duration': 9.688}, {'end': 1310.905, 'text': 'Now, the optimal output value represents the x-axis coordinate for the highest point on the parabola.', 'start': 1304.179, 'duration': 6.726}, {'end': 1318.132, 'text': 'And this y-axis coordinate for the highest point on the parabola is the similarity score.', 'start': 1312.667, 'duration': 5.465}, {'end': 1324.698, 'text': "At least, it's the similarity score described in the original XGBoost manuscript.", 'start': 1319.613, 'duration': 5.085}, {'end': 1332.228, 'text': 'However, the similarity score used in the implementations is actually 2 times that number.', 'start': 1326.343, 'duration': 5.885}, {'end': 1337.593, 'text': 'The reason for this difference will become clear once we do the algebra.', 'start': 1333.91, 'duration': 3.683}, {'end': 1347.021, 'text': "So let's do the algebra to convert this into the similarity scores we saw in XGBoost Parts 1 and 2.", 'start': 1338.934, 'duration': 8.087}, {'end': 1350.224, 'text': "First, let's plug in this solution for the output value.", 'start': 1347.021, 'duration': 3.203}, {'end': 1356.609, 'text': "Now multiply together the sums of the gradients, G's, on the left.", 'start': 1352.208, 'duration': 4.401}, {'end': 1364.172, 'text': 'Note, these negative signs cancel out, and we get the square of the sum.', 'start': 1358.01, 'duration': 6.162}, {'end': 1372.234, 'text': 'Now we square the term on the right, and let this sum cancel out this square.', 'start': 1365.792, 'duration': 6.442}, {'end': 1379.436, 'text': 'Now we add these two terms together, and we end up with this fraction.', 'start': 1372.254, 'duration': 7.182}, {'end': 1386.988, 'text': 'This is the equation for the similarity score as described in the original XGBoost manuscript.', 'start': 1381.206, 'duration': 5.782}, {'end': 1397.151, 'text': 'However, in the XGBoost implementations, this one half is omitted because the similarity score is only a relative measure.', 'start': 1388.288, 'duration': 8.863}, {'end': 1404.894, 'text': 'And as long as every similarity score is scaled the same amount, the results of the comparisons will be the same.', 'start': 1398.372, 'duration': 6.522}, {'end': 1414.393, 'text': 'This is an example of how extreme Extreme Gradient Boost is, it will do anything to reduce the amount of computation.', 'start': 1406.354, 'duration': 8.039}], 'summary': 'The similarity score in xgboost implementations is actually 2 times the value described in the original manuscript.', 'duration': 121.283, 'max_score': 1293.11, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ZVFeW798-2I/pics/ZVFeW798-2I1293110.jpg'}, {'end': 1576.203, 'src': 'embed', 'start': 1540.839, 'weight': 5, 'content': [{'end': 1549.521, 'text': 'And since there is one Hessian per residual in a leaf, cover for regression is simply the number of residuals in a leaf.', 'start': 1540.839, 'duration': 8.682}, {'end': 1557.295, 'text': 'For classification, the Hessian is p,', 'start': 1551.172, 'duration': 6.123}, {'end': 1565.098, 'text': 'So cover is equal to the sum of the previously predicted 
probability times 1 minus the previously predicted probability.', 'start': 1557.295, 'duration': 7.803}, {'end': 1576.203, 'text': 'Small bam! In summary, XGBoost builds trees by finding the output value that minimizes this equation.', 'start': 1566.659, 'duration': 9.544}], 'summary': 'Xgboost builds trees by minimizing an equation, using cover for regression and predicted probability for classification.', 'duration': 35.364, 'max_score': 1540.839, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ZVFeW798-2I/pics/ZVFeW798-2I1540839.jpg'}], 'start': 1270.543, 'title': 'Xgboost similarity score and regression', 'summary': 'Discusses the process of determining similarity score in xgboost, including transformation of parabola, algebraic conversion, and omission of one half in the equation. it also explains the similarity score equations for regression and classification, detailing the use of derivatives and the relationship between cover and the similarity score.', 'chapters': [{'end': 1414.393, 'start': 1270.543, 'title': 'Xgboost similarity score', 'summary': 'Explains the process of determining the similarity score in xgboost, including the transformation of the parabola, the algebraic conversion, and the omission of one half in the similarity score equation.', 'duration': 143.85, 'highlights': ['The optimal output value represents the x-axis coordinate for the highest point on the parabola, and this y-axis coordinate for the highest point on the parabola is the similarity score. The highest point on the parabola represents the similarity score in XGBoost, offering insight into the scoring process.', 'The similarity score used in the implementations is actually 2 times that number compared to the original XGBoost manuscript. The difference in the similarity score used in XGBoost implementations compared to the original manuscript is quantified as being 2 times the original value.', 'In the XGBoost implementations, the one half is omitted in the similarity score equation because the similarity score is only a relative measure, and the results of the comparisons will be the same as long as every similarity score is scaled the same amount. 
The omission of one half in the similarity score equation in XGBoost implementations is justified by the relative nature of the score, aiming to reduce computation while maintaining consistent results in comparisons.']}, {'end': 1642.598, 'start': 1415.895, 'title': 'Xgboost: regression, classification, and cover', 'summary': 'Explains the similarity score equations for regression and classification in xgboost, detailing the use of first and second derivatives, and the relationship between cover and the similarity score.', 'duration': 226.703, 'highlights': ['The equation for the similarity score in XGBoost for regression is the sum of the squared residuals in the numerator and the sum of h sub i plus lambda in the denominator, serving as the similarity score equation for regression.', 'For classification, the similarity score equation in XGBoost involves the sum of the squared residuals in the numerator and the sum of h sub i plus lambda in the denominator, illustrating the similarity score equation for classification.', "Cover in XGBoost is related to the minimum number of residuals in a leaf, calculated as the sum of the Hessians, the h sub i's, for regression and as the sum of the previously predicted probability times 1 minus the previously predicted probability for classification."]}], 'duration': 372.055, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/ZVFeW798-2I/pics/ZVFeW798-2I1270543.jpg', 'highlights': ['The highest point on the parabola represents the similarity score in XGBoost, offering insight into the scoring process.', 'The difference in the similarity score used in XGBoost implementations compared to the original manuscript is quantified as being 2 times the original value.', 'The omission of one half in the similarity score equation in XGBoost implementations is justified by the relative nature of the score, aiming to reduce computation while maintaining consistent results in comparisons.', 'The equation for the similarity score in XGBoost for regression is the sum of the squared residuals in the numerator and the sum of h sub i plus lambda in the denominator, serving as the similarity score equation for regression.', 'For classification, the similarity score equation in XGBoost involves the sum of the squared residuals in the numerator and the sum of h sub i plus lambda in the denominator, illustrating the similarity score equation for classification.', "Cover in XGBoost is related to the minimum number of residuals in a leaf, calculated as the sum of the Hessians, the h sub i's, for regression and as the sum of the previously predicted probability times 1 minus the previously predicted probability for classification."]}], 'highlights': ['The chapter explores the mathematical details of XGBoost, assuming familiarity with tree building, Gradient Boost, and ridge regression.', 'Using drug dosage to predict drug effectiveness in regression and the probability of drug effectiveness in classification with XGBoost.', 'The only difference between regression and classification in XGBoost is the loss function.', 'Quantifying prediction with a loss function: 1 half times the squared residual for regression and negative log likelihood for classification.', 'XGBoost uses loss functions for regression and classification to build trees by minimizing an equation.', 'The chapter explains the impact of different output values on the total loss function, with negative 1 resulting in a total loss of 109.4 and positive 1 resulting in the lowest total loss of 
102.4.', 'The equation in the original manuscript for XGBoost contains an extra term, gamma times t, where gamma is a user-definable penalty meant to encourage pruning.', 'The chapter details the components of the gradient boosting equation, including the loss function and regularization term, and demonstrates the process of finding the best output value to minimize the equation.', 'XGBoost utilizes second-order Taylor approximation for both regression and classification, simplifying the math when solving for the optimal output value.', 'Increasing lambda in XGBoost shifts the lowest point in the parabola closer to 0, exemplifying the impact of regularization.', 'The Taylor approximation in XGBoost consists of simple parts, including the loss function for the previous prediction, its first derivative, and its second derivative.', 'Solving for the optimal output value using derivatives and gradients', 'Combining unsquared and squared output value terms', 'Omitting terms without effect on the optimal output value', 'The process of deriving the output value for a leaf involves making an initial prediction, putting all the residuals in the leaf, considering the loss function, and solving for the lowest point on the parabola where the derivative is zero, demonstrating the comprehensive approach of XGBoost regression formula in optimizing the output value.', 'The equation for the output value is derived by plugging the first and second derivatives of the loss function into the equation, and the output value is calculated using the sum of residuals and the sum of p sub i times 1 minus p sub i in the numerator and denominator, respectively.', 'The negative log likelihood is the most commonly used loss function in XGBoost for classification, and the first and second derivatives of this equation are crucial for calculating the optimal output value for a leaf.', 'The numerator of the output value for a leaf is the sum of the residuals, which cancels out all negative signs, providing a clear understanding of the role of gradients in the XGBoost regression formula.', 'The denominator of the output value for a leaf is the number of residuals plus lambda, highlighting the importance of the Hessian and regularization in determining the output value.', 'The process of deriving the first and second derivatives of the loss function involves converting probabilities to log odds, taking the derivatives, and then converting the log odds back to probabilities, which is a time-consuming process.', 'The highest point on the parabola represents the similarity score in XGBoost, offering insight into the scoring process.', 'The difference in the similarity score used in XGBoost implementations compared to the original manuscript is quantified as being 2 times the original value.', 'The omission of one half in the similarity score equation in XGBoost implementations is justified by the relative nature of the score, aiming to reduce computation while maintaining consistent results in comparisons.', 'The equation for the similarity score in XGBoost for regression is the sum of the squared residuals in the numerator and the sum of h sub i plus lambda in the denominator, serving as the similarity score equation for regression.', 'For classification, the similarity score equation in XGBoost involves the sum of the squared residuals in the numerator and the sum of h sub i plus lambda in the denominator, illustrating the similarity score equation for classification.', "Cover in XGBoost is related to the minimum number of 
residuals in a leaf, calculated as the sum of the Hessians, the h sub i's, for regression and as the sum of the previously predicted probability times 1 minus the previously predicted probability for classification."]}
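To make the formulas walked through above concrete, here is a minimal Python sketch, not the actual xgboost library code; the function names and toy numbers are made up for illustration. It computes the optimal leaf output value and the similarity score from the gradients and Hessians of the loss, specialized to the squared-error loss for regression and the negative log likelihood for classification:

import numpy as np

def leaf_output_value(g, h, lam):
    # Optimal leaf output derived in the video: -(sum of gradients) / (sum of Hessians + lambda).
    return -np.sum(g) / (np.sum(h) + lam)

def similarity_score(g, h, lam):
    # Similarity score as used in the implementation (the manuscript's factor of 1/2 is dropped).
    return np.sum(g) ** 2 / (np.sum(h) + lam)

def regression_g_h(y, predicted):
    # Squared-error loss 1/2 * (y - p)^2: gradient g = -(y - p) = -residual, Hessian h = 1.
    residuals = y - predicted
    return -residuals, np.ones_like(residuals)

def classification_g_h(y, prob):
    # Negative log likelihood (output values on the log-odds scale):
    # gradient g = -(y - p), Hessian h = p * (1 - p), where p is the previously predicted probability.
    return -(y - prob), prob * (1.0 - prob)

# Regression with made-up data and the usual initial prediction of 0.5:
y = np.array([-10.0, 7.0, 8.0])
g, h = regression_g_h(y, np.full_like(y, 0.5))
print(leaf_output_value(g, h, lam=1.0))  # sum of residuals / (number of residuals + lambda)
print(similarity_score(g, h, lam=1.0))   # (sum of residuals)^2 / (number of residuals + lambda)

# Classification with made-up 0/1 labels and an initial predicted probability of 0.5:
y = np.array([0.0, 1.0, 1.0])
g, h = classification_g_h(y, np.full_like(y, 0.5))
print(leaf_output_value(g, h, lam=1.0))  # sum of residuals / (sum of p*(1-p) + lambda)
print(similarity_score(g, h, lam=1.0))   # (sum of residuals)^2 / (sum of p*(1-p) + lambda)

Cover, as described at the end of the video, is just np.sum(h) for a leaf: the number of residuals for regression, and the sum of the previously predicted probabilities times one minus those probabilities for classification.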