title

Lecture 10 - Decision Trees and Ensemble Methods | Stanford CS229: Machine Learning (Autumn 2018)

description

For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/3GdlrqJ
Raphael Townshend
PhD Candidate and CS229 Head TA
To follow along with the course schedule and syllabus, visit:
http://cs229.stanford.edu/syllabus-autumn2018.html

detail

{'title': 'Lecture 10 - Decision Trees and Ensemble Methods | Stanford CS229: Machine Learning (Autumn 2018)', 'heatmap': [{'end': 437.233, 'start': 380.992, 'weight': 0.862}, {'end': 678.668, 'start': 566.462, 'weight': 0.986}, {'end': 1028.654, 'start': 964.394, 'weight': 0.734}, {'end': 2277.851, 'start': 2221.57, 'weight': 0.749}, {'end': 3342.339, 'start': 3244.327, 'weight': 0.836}, {'end': 3437.115, 'start': 3359.2, 'weight': 1}, {'end': 3778.734, 'start': 3678.902, 'weight': 0.823}, {'end': 4509.755, 'start': 4451.564, 'weight': 0.812}, {'end': 4661.017, 'start': 4549.78, 'weight': 0.941}], 'summary': 'Covers decision trees, ensemble methods, and global skiing seasons, discussing tree-based region partitioning, loss and misclassification in machine learning, cross entropy, refining decision trees, decision tree efficiency, and ensembling methods in statistics and machine learning, along with ensemble learning methods like bagging, random forests, and boosting. It also includes specific examples and quantifiable data such as 900 positives and 100 negatives in decision tree classifications resulting in 10% incorrect classifications.', 'chapters': [{'end': 185.9, 'segs': [{'end': 97.025, 'src': 'embed', 'start': 43.853, 'weight': 0, 'content': [{'end': 49.036, 'text': "then we're gonna go over general ensembling methods and then go specifically into bagging, random forests, and boosting.", 'start': 43.853, 'duration': 5.183}, {'end': 51.798, 'text': "Okay. So let's get started.", 'start': 49.837, 'duration': 1.961}, {'end': 55.861, 'text': "So first, let's cover some decision trees.", 'start': 53.96, 'duration': 1.901}, {'end': 71.527, 'text': 'So last week, Andrew was covering SVMs, which are sort of one of the classical linear models.', 'start': 66.164, 'duration': 5.363}, {'end': 75.23, 'text': 'And it sort of brought to a close a lot of our discussion of those linear models.', 'start': 72.268, 'duration': 2.962}, {'end': 80.893, 'text': "And so today, 
we're going to be getting to decision trees, which is really one of our first examples of a nonlinear model.", 'start': 75.37, 'duration': 5.523}, {'end': 84.415, 'text': 'And so to motivate these guys, let me give you guys an example.', 'start': 81.494, 'duration': 2.921}, {'end': 86.737, 'text': "So I'm Canadian.", 'start': 85.536, 'duration': 1.201}, {'end': 87.918, 'text': 'I really like to ski.', 'start': 86.977, 'duration': 0.941}, {'end': 89.859, 'text': "So I'm going to motivate it using that.", 'start': 88.278, 'duration': 1.581}, {'end': 97.025, 'text': 'So pretend you have a classifier that, given a time and a location, tells you whether or not you can ski.', 'start': 90.543, 'duration': 6.482}], 'summary': 'Teaching decision trees with a skiing example and ensembling methods like bagging, random forests, and boosting.', 'duration': 53.172, 'max_score': 43.853, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA43853.jpg'}, {'end': 185.9, 'src': 'embed', 'start': 143.379, 'weight': 2, 'content': [{'end': 152.127, 'text': 'OK, so given this, if you might recall, the winter in the northern hemisphere generally happens in the early months of the year.', 'start': 143.379, 'duration': 8.748}, {'end': 156.27, 'text': 'So you might see that you can ski in these early months over here and have some positive data points.', 'start': 152.447, 'duration': 3.823}, {'end': 162.055, 'text': 'And then in the later months, like this.', 'start': 156.811, 'duration': 5.244}, {'end': 164.277, 'text': "And then in the middle, you can't really ski.", 'start': 162.315, 'duration': 1.962}, {'end': 170.843, 'text': "Versus in the southern hemisphere, it's basically flipped, where you can not ski in the early months.", 'start': 166.299, 'duration': 4.544}, {'end': 176.575, 'text': 'You can ski during the May, June, July, August time period.', 'start': 172.233, 'duration': 4.342}, {'end': 179.217, 'text': 'And then you cannot 
ski in the later months.', 'start': 177.536, 'duration': 1.681}, {'end': 182.478, 'text': 'And then the equator in general is just not great for skiing.', 'start': 180.157, 'duration': 2.321}, {'end': 183.719, 'text': "There's a reason I don't live there.", 'start': 182.558, 'duration': 1.161}, {'end': 185.9, 'text': 'And so you just have a bunch of negatives here.', 'start': 184.239, 'duration': 1.661}], 'summary': "In the northern hemisphere, skiing is possible in early months, while in the southern hemisphere, it's possible during May to August, and the equator is not suitable for skiing.", 'duration': 42.521, 'max_score': 143.379, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA143379.jpg'}], 'start': 4.448, 'title': 'Decision trees, ensemble methods, and global skiing seasons', 'summary': 'Covers decision trees and ensemble methods for predictive analysis, comparing them with classical linear models like SVMs. It also discusses skiing seasons in the northern and southern hemispheres, highlighting the availability of skiing in different months and noting unfavorable conditions near the equator.', 'chapters': [{'end': 142.482, 'start': 4.448, 'title': 'Decision trees and ensemble methods', 'summary': 'Covers decision trees and ensemble methods, highlighting their relevance in classification and their application in nonlinear modeling for predictive analysis, alongside a comparison with classical linear models like SVMs.', 'duration': 138.034, 'highlights': ['The chapter covers decision trees and ensemble methods', 'Decision trees are one of our first examples of a nonlinear model', 'The relevance of decision trees in classification with a specific example of a binary classifier for skiing']}, {'end': 185.9, 'start': 143.379, 'title': 'Global skiing seasons', 'summary': 'Compares skiing seasons in the northern and southern hemispheres, noting the availability of skiing in different months and highlighting the 
unfavorable skiing conditions near the equator.', 'duration': 42.521, 'highlights': ['The northern hemisphere offers skiing in the early and later months of the year, with positive data points, while skiing is unavailable in the middle months.', 'The southern hemisphere has a reversed skiing season compared to the northern hemisphere, with skiing available during May to August and unavailable in the later months.', 'The equator is generally not suitable for skiing, making it unfavorable for this activity.']}], 'duration': 181.452, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA4448.jpg', 'highlights': ['The chapter covers decision trees and ensemble methods', 'The relevance of decision trees in classification with a specific example of a binary classifier for skiing', 'The northern hemisphere offers skiing in the early and later months of the year, with positive data points, while skiing is unavailable in the middle months', 'The southern hemisphere has a reversed skiing season compared to the northern hemisphere, with skiing available during May to August and unavailable in the later months', 'Decision trees are one of our first examples of a nonlinear model', 'The equator is generally not suitable for skiing, making it unfavorable for this activity']}, {'end': 503.994, 'segs': [{'end': 281.58, 'src': 'embed', 'start': 226.219, 'weight': 0, 'content': [{'end': 229.44, 'text': 'So we want to isolate out these positive examples, for example.', 'start': 226.219, 'duration': 3.221}, {'end': 234.282, 'text': 'And in general, this problem is fairly intractable, just coming up with the optimal regions.', 'start': 230.6, 'duration': 3.682}, {'end': 249.383, 'text': 'But how we do it with decision trees is we do it in this basically greedy top-down recursive manner.', 'start': 235.442, 'duration': 13.941}, {'end': 255.247, 'text': 'And this would be recursive partitioning.', 'start': 253.165, 'duration': 2.082}, {'end': 
268.797, 'text': "And so basically, it's top down because we're starting with the overall region, and we want to slowly partition it up.", 'start': 262.112, 'duration': 6.685}, {'end': 273.04, 'text': "And then it's greedy because at each step, we want to pick the best partition possible.", 'start': 269.077, 'duration': 3.963}, {'end': 278.398, 'text': "So let's actually try and work out intuitively what a decision tree would do.", 'start': 274.455, 'duration': 3.943}, {'end': 281.58, 'text': 'So what we do is we start with the overall space.', 'start': 279.739, 'duration': 1.841}], 'summary': 'Using decision trees for partitioning data in a top-down, recursive, and greedy manner.', 'duration': 55.361, 'max_score': 226.219, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA226219.jpg'}, {'end': 437.233, 'src': 'heatmap', 'start': 356.421, 'weight': 2, 'content': [{'end': 360.685, 'text': 'And so you could imagine how, through asking these recursive questions over and over again,', 'start': 356.421, 'duration': 4.264}, {'end': 365.09, 'text': 'you could start splitting up the entire space into your individual regions, like this.', 'start': 360.685, 'duration': 4.405}, {'end': 377.191, 'text': "OK, and so to make this a little bit more formal, what we're looking for is we're looking for sort of this split function.", 'start': 369.389, 'duration': 7.802}, {'end': 379.571, 'text': 'OK, so you can sort of define a region.', 'start': 377.371, 'duration': 2.2}, {'end': 380.972, 'text': 'So you have a region.', 'start': 379.591, 'duration': 1.381}, {'end': 387.933, 'text': "And let's call that region RP in this case for our parent.", 'start': 380.992, 'duration': 6.941}, {'end': 397.716, 'text': "OK, and we're looking for looking for a split.", 'start': 387.953, 'duration': 9.763}, {'end': 405.618, 'text': 'sp such that you have an sp.', 'start': 401.056, 'duration': 4.562}, {'end': 413.301, 'text': 'You can sort of 
write out this sp function as a function of j, t.', 'start': 406.138, 'duration': 7.163}, {'end': 418.002, 'text': "So j is which feature number, and t is the threshold you're using.", 'start': 413.301, 'duration': 4.701}, {'end': 420.383, 'text': 'And so you can sort of write this out formally as sort of.', 'start': 418.342, 'duration': 2.041}, {'end': 432.011, 'text': "you're outputting a tuple where, on the one hand, you have a set x, where you have the j-th feature of x is less than the threshold.", 'start': 420.383, 'duration': 11.628}, {'end': 437.233, 'text': "And you have x as element of RP, since we're only partitioning that parent region.", 'start': 433.332, 'duration': 3.901}], 'summary': 'The process involves recursive questions to split space into regions based on a split function defined by feature number and threshold.', 'duration': 63.962, 'max_score': 356.421, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA356421.jpg'}, {'end': 503.994, 'src': 'embed', 'start': 464.07, 'weight': 5, 'content': [{'end': 468.973, 'text': 'Any questions so far? No? 
OK.', 'start': 464.07, 'duration': 4.903}, {'end': 472.435, 'text': 'So we defined now how we would do this.', 'start': 469.953, 'duration': 2.482}, {'end': 476.657, 'text': "We're trying to greedily pick these picks that are partitioning our input space.", 'start': 472.475, 'duration': 4.182}, {'end': 482.621, 'text': "And the splits are defined by which feature you're looking at and the threshold that you're applying to that feature.", 'start': 477.077, 'duration': 5.544}, {'end': 488.94, 'text': 'A natural question to ask now is how do you choose these splits??', 'start': 484.53, 'duration': 4.41}, {'end': 503.994, 'text': 'All right,', 'start': 503.353, 'duration': 0.641}], 'summary': 'Defining the process for greedily picking splits based on features and thresholds.', 'duration': 39.924, 'max_score': 464.07, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA464070.jpg'}], 'start': 188.703, 'title': 'Tree-based region partitioning', 'summary': 'Discusses how decision trees and tree-based partitioning effectively isolate positive examples and divide the space into regions for data analysis, providing a natural and effective method for the task.', 'chapters': [{'end': 281.58, 'start': 188.703, 'title': 'Decision trees for region partitioning', 'summary': 'Explains how decision trees naturally partition a space into regions to isolate positive examples, with a top-down and greedy approach, making it a very natural and effective method for the task.', 'duration': 92.877, 'highlights': ['Decision trees provide a natural way to partition the space into individual regions, allowing isolation of positive examples, making it an effective method for solving the problem (relevance score: 5)', 'The approach is top-down and greedy, as it starts with the overall region and then partitions it up, picking the best partition possible at each step (relevance score: 4)', 'The problem of isolating optimal regions is fairly 
intractable, but decision trees solve it in a recursive, top-down, and greedy manner, making it an effective method for the task (relevance score: 3)']}, {'end': 503.994, 'start': 282.18, 'title': 'Tree-based partitioning for data analysis', 'summary': 'Explains the process of tree-based partitioning for data analysis, where the space is recursively divided into regions by asking questions based on features such as latitude and month, effectively partitioning the space into individual regions.', 'duration': 221.814, 'highlights': ['The process of recursively dividing the space into regions by asking questions based on features like latitude and month effectively partitions the space into individual regions.', 'The explanation of the split function for defining regions using feature number and threshold, accompanied by the partitioning of the parent region into sets R1 and R2.', 'The query of how to choose the splits, referring to the process of greedily picking the splits based on the features and thresholds.']}], 'duration': 315.291, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA188703.jpg', 'highlights': ['Decision trees provide a natural way to partition the space into individual regions, allowing isolation of positive examples, making it an effective method for solving the problem (relevance score: 5)', 'The approach is top-down and greedy, as it starts with the overall region and then partitions it up, picking the best partition possible at each step (relevance score: 4)', 'The process of recursively dividing the space into regions by asking questions based on features like latitude and month effectively partitions the space into individual regions.', 'The problem of isolating optimal regions is fairly intractable, but decision trees solve it in a recursive, top-down, and greedy manner, making it an effective method for the task (relevance score: 3)', 'The explanation of the split function for defining 
regions using feature number and threshold, accompanied by the partitioning of the parent region into sets R1 and R2.', 'The query of how to choose the splits, referring to the process of greedily picking the splits based on the features and thresholds.']}, {'end': 1158.934, 'segs': [{'end': 678.668, 'src': 'heatmap', 'start': 503.994, 'weight': 3, 'content': [{'end': 511.401, 'text': "and so I sort of give this intuitive explanation that really what you're trying to do is you're trying to isolate out the space of positives and negatives in this case.", 'start': 503.994, 'duration': 7.407}, {'end': 517.546, 'text': "And so what it's useful to define is a loss on a region.", 'start': 512.782, 'duration': 4.764}, {'end': 524.393, 'text': 'OK, so define your loss L on R.', 'start': 518.126, 'duration': 6.267}, {'end': 539.567, 'text': "And so for now, let's define our loss as something fairly obvious, is your misclassification loss.", 'start': 533.823, 'duration': 5.744}, {'end': 542.969, 'text': "It's how many examples in your region you get wrong.", 'start': 539.747, 'duration': 3.222}, {'end': 561.641, 'text': 'And so assuming that you have given c classes total, you can define p hat c.', 'start': 544.13, 'duration': 17.511}, {'end': 593.161, 'text': 'to be the proportion of examples in R that are of class C.', 'start': 566.462, 'duration': 26.699}, {'end': 599.866, 'text': "And so, now that we've got this definition, where we have this p hat c of telling us the proportion of examples that we've got,", 'start': 594.622, 'duration': 5.244}, {'end': 606.512, 'text': 'in that case you can define the loss of any region as loss.', 'start': 599.866, 'duration': 6.646}, {'end': 619.222, 'text': "let's call it misclassification is just 1 minus max over c of p hat c.", 'start': 606.512, 'duration': 12.71}, {'end': 626.574, 'text': "Okay?. 
And so the reasoning behind this is basically you can say that for any region you've subdivided generally,", 'start': 620.069, 'duration': 6.505}, {'end': 632.639, 'text': "what you'll want to do is predict the most common class there, which is just the maximum p hat c right?", 'start': 626.574, 'duration': 6.065}, {'end': 637.303, 'text': 'And so then all the remaining probability just gets thrown onto misclassification errors.', 'start': 632.979, 'duration': 4.324}, {'end': 643.828, 'text': 'Okay?. And so then, once we do this, we want to basically pick.', 'start': 638.123, 'duration': 5.705}, {'end': 651.553, 'text': 'now that we have a loss defined, we want to pick a split that decreases the loss as much as possible.', 'start': 643.828, 'duration': 7.725}, {'end': 657.716, 'text': "So you recall I've defined this region r parent, and then these two children regions, r1 and r2.", 'start': 651.933, 'duration': 5.783}, {'end': 662.379, 'text': 'And you basically want to reduce that loss as much as possible.', 'start': 658.677, 'duration': 3.702}, {'end': 678.668, 'text': 'So you want to basically minimize loss r parent minus loss of r1 plus loss of r2.', 'start': 662.459, 'duration': 16.209}], 'summary': 'Defining loss on a region to minimize misclassification errors and decrease overall loss.', 'duration': 102.518, 'max_score': 503.994, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA503994.jpg'}, {'end': 678.668, 'src': 'embed', 'start': 638.123, 'weight': 1, 'content': [{'end': 643.828, 'text': 'Okay?. 
And so then, once we do this, we want to basically pick.', 'start': 638.123, 'duration': 5.705}, {'end': 651.553, 'text': 'now that we have a loss defined, we want to pick a split that decreases the loss as much as possible.', 'start': 643.828, 'duration': 7.725}, {'end': 657.716, 'text': "So you recall I've defined this region r parent, and then these two children regions, r1 and r2.", 'start': 651.933, 'duration': 5.783}, {'end': 662.379, 'text': 'And you basically want to reduce that loss as much as possible.', 'start': 658.677, 'duration': 3.702}, {'end': 678.668, 'text': 'So you want to basically minimize loss r parent minus loss of r1 plus loss of r2.', 'start': 662.459, 'duration': 16.209}], 'summary': 'Define regions and minimize loss for split', 'duration': 40.545, 'max_score': 638.123, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA638123.jpg'}, {'end': 789.376, 'src': 'embed', 'start': 749.184, 'weight': 2, 'content': [{'end': 754.408, 'text': "Let's get a little bit into actually why misclassification loss isn't actually the right loss to use for this problem.", 'start': 749.184, 'duration': 5.224}, {'end': 776.673, 'text': "OK, and so for a simple example, let's pretend I've sort of drawn out a tree like this.", 'start': 771.812, 'duration': 4.861}, {'end': 784.475, 'text': "Let's pretend that instead we have another set up here where we're coming into a decision node.", 'start': 777.273, 'duration': 7.202}, {'end': 789.376, 'text': 'And at this point, we have 900 positives and 100 negatives.', 'start': 784.795, 'duration': 4.581}], 'summary': 'Misclassification loss not suitable for 900 positives and 100 negatives.', 'duration': 40.192, 'max_score': 749.184, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA749184.jpg'}, {'end': 964.134, 'src': 'embed', 'start': 915.167, 'weight': 0, 'content': [{'end': 919.571, 'text': "And so that sort of 
brings up one problem with the misclassification loss is that it's not really sensitive enough.", 'start': 915.167, 'duration': 4.404}, {'end': 926.018, 'text': 'So instead, what we can do is we can define this cross entropy loss.', 'start': 921.994, 'duration': 4.024}, {'end': 947.227, 'text': "which we'll define as L cross.", 'start': 945.646, 'duration': 1.581}, {'end': 953.95, 'text': 'Let me just write this out here.', 'start': 952.809, 'duration': 1.141}, {'end': 964.134, 'text': "And so really what you're doing is you're just summing over the classes.", 'start': 961.713, 'duration': 2.421}], 'summary': 'Cross entropy loss is more sensitive than misclassification loss.', 'duration': 48.967, 'max_score': 915.167, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA915167.jpg'}, {'end': 1028.654, 'src': 'heatmap', 'start': 964.394, 'weight': 0.734, 'content': [{'end': 969.417, 'text': "And it's the proportion of elements in that class times the log of the proportion in that class.", 'start': 964.394, 'duration': 5.023}, {'end': 971.775, 'text': 'And how you can think of this.', 'start': 970.495, 'duration': 1.28}, {'end': 979.137, 'text': "is it's sort of this concept that we borrow from information theory, which is sort of like the number of bits you need to communicate,", 'start': 971.775, 'duration': 7.362}, {'end': 982.698, 'text': "to tell someone who already knows what the probabilities are what class you're looking at.", 'start': 979.137, 'duration': 3.561}, {'end': 989.74, 'text': 'And so that sounds like a mouthful, but really you can sort of think of it intuitively, as if someone already knows the probabilities, like, say,', 'start': 983.258, 'duration': 6.482}, {'end': 992.881, 'text': "it's 100% chance that it is of one class.", 'start': 989.74, 'duration': 3.141}, {'end': 998.062, 'text': "then you don't need to communicate anything to tell them exactly which class is, because it's obvious that 
it is that one class.", 'start': 992.881, 'duration': 5.181}, {'end': 1006.025, 'text': 'versus if you have a fairly even split, then you need to communicate a lot more information to tell someone exactly what class you are in.', 'start': 998.522, 'duration': 7.503}, {'end': 1028.654, 'text': 'Any questions so far? Yep? The R1, R2 for the parent class? For this case here? Yeah, yeah.', 'start': 1007.925, 'duration': 20.729}], 'summary': 'Information theory concept helps communicate class probabilities intuitively.', 'duration': 64.26, 'max_score': 964.394, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA964394.jpg'}], 'start': 503.994, 'title': 'Loss and misclassification in machine learning', 'summary': 'Discusses defining loss on a region, proportion of examples in a region for misclassification, calculation of misclassification loss based on maximum proportion of examples, selection of split to minimize loss, inadequacy of misclassification loss, and introduction of cross entropy loss with examples and explanation.', 'chapters': [{'end': 637.303, 'start': 503.994, 'title': 'Loss and misclassification in machine learning', 'summary': 'Discusses the concept of defining loss on a region and the proportion of examples in a region for misclassification, emphasizing the calculation of misclassification loss based on the maximum proportion of examples of a class in a region.', 'duration': 133.309, 'highlights': ['The definition of loss on a region is based on the intuitive explanation of isolating the space of positives and negatives.', 'The misclassification loss is calculated as how many examples in a region are classified incorrectly, and it can be defined as 1 minus the maximum proportion of examples of any class in the region.', 'The proportion of examples in a region of a particular class, denoted as p hat c, is used to define the misclassification loss.', 'The concept of predicting the most common class in 
a region based on the maximum p hat c is emphasized as the reasoning behind the calculation of misclassification loss.']}, {'end': 1158.934, 'start': 638.123, 'title': 'Decision tree loss functions', 'summary': 'Discusses the selection of split to minimize loss and the inadequacy of misclassification loss, leading to the introduction of cross entropy loss, with examples and explanation.', 'duration': 520.811, 'highlights': ['The chapter discusses the inadequacy of misclassification loss in decision tree splitting, providing examples to illustrate the limitations.', 'The introduction of cross entropy loss is presented, along with an explanation of its calculation and intuitive understanding borrowed from information theory.', 'The process of selecting a split to minimize loss in decision trees is explained, emphasizing the goal of reducing the loss of the parent region by choosing a split that decreases the loss as much as possible.']}], 'duration': 654.94, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA503994.jpg', 'highlights': ['Introduction of cross entropy loss with examples and explanation', 'Selection of split to minimize loss in decision trees', 'Inadequacy of misclassification loss in decision tree splitting', 'Calculation of misclassification loss based on maximum proportion of examples', 'Definition of loss on a region based on isolating the space of positives and negatives']}, {'end': 2177.012, 'segs': [{'end': 1243.002, 'src': 'embed', 'start': 1198.285, 'weight': 0, 'content': [{'end': 1201.248, 'text': 'You would just classify everything as positive, right?', 'start': 1198.285, 'duration': 2.963}, {'end': 1203.77, 'text': 'And so you get 100 negatives incorrect.', 'start': 1201.528, 'duration': 2.242}, {'end': 1208.466, 'text': 'if that makes sense, because this is 900 positives and 100 negatives.', 'start': 1204.923, 'duration': 3.543}, {'end': 1217.512, 'text': 'So if you just stopped here 
and just tried to classify given this whole region rp, you would end up getting 10% of your examples wrong.', 'start': 1209.086, 'duration': 8.426}, {'end': 1221.595, 'text': "In this case, we're not talking about percentages.", 'start': 1219.134, 'duration': 2.461}, {'end': 1224.718, 'text': "We're talking about absolute number of examples that we've gotten wrong.", 'start': 1221.615, 'duration': 3.103}, {'end': 1227.42, 'text': 'But you can also definitely talk in terms of percentages instead.', 'start': 1224.878, 'duration': 2.542}, {'end': 1233.724, 'text': "And then down here, once you've split it, now you've got these two subregions.", 'start': 1229.501, 'duration': 4.223}, {'end': 1239.46, 'text': 'And on this left one here, you still have more positives than negatives.', 'start': 1234.998, 'duration': 4.462}, {'end': 1243.002, 'text': "So you're still going to classify positive in this leaf.", 'start': 1239.66, 'duration': 3.342}], 'summary': 'Classifying 900 positives and 100 negatives results in 10% incorrect.', 'duration': 44.717, 'max_score': 1198.285, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA1198285.jpg'}, {'end': 1367.298, 'src': 'embed', 'start': 1335.384, 'weight': 7, 'content': [{'end': 1337.584, 'text': "And what you've got plotted up here is your loss.", 'start': 1335.384, 'duration': 2.2}, {'end': 1347.266, 'text': "For cross entropy loss, what your curve is going to end up looking like is it's going to end up looking like this strictly concave curve like this.", 'start': 1340.265, 'duration': 7.001}, {'end': 1356.593, 'text': 'And what you can do is you can look at where your children versus your parents would fall on this curve.', 'start': 1349.95, 'duration': 6.643}, {'end': 1359.154, 'text': 'So say that you have two children.', 'start': 1357.293, 'duration': 1.861}, {'end': 1361.135, 'text': 'You have one up here.', 'start': 1360.315, 'duration': 0.82}, {'end': 1363.096, 
'text': "So let's call this LR1.", 'start': 1361.235, 'duration': 1.861}, {'end': 1367.298, 'text': 'And you have one down here, LR2.', 'start': 1364.057, 'duration': 3.241}], 'summary': 'Cross entropy loss curve helps compare learning rates for children vs. parents.', 'duration': 31.914, 'max_score': 1335.384, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA1335384.jpg'}, {'end': 1524.068, 'src': 'embed', 'start': 1494.357, 'weight': 5, 'content': [{'end': 1497.898, 'text': "And in this case, here we're just assuming that we have two classes, okay?", 'start': 1494.357, 'duration': 3.541}, {'end': 1502.719, 'text': "And so what we're doing is we're just modifying the p hat c we're-,", 'start': 1498.598, 'duration': 4.121}, {'end': 1508.521, 'text': "we're changing that on the x-axis and then we're looking at what the response of the overall loss function is on the y-axis.", 'start': 1502.719, 'duration': 5.802}, {'end': 1516.945, 'text': 'And so what I just did here is this curve just represents for any p hat c what the cross entropy loss would look like.', 'start': 1509.221, 'duration': 7.724}, {'end': 1520.287, 'text': 'And so we can come back to this, for example.', 'start': 1518.586, 'duration': 1.701}, {'end': 1524.068, 'text': 'And if we look at this parent here, this guy has a 10%.', 'start': 1521.167, 'duration': 2.901}], 'summary': 'Analyzing cross entropy loss for two classes, with 10% error rate.', 'duration': 29.711, 'max_score': 1494.357, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA1494357.jpg'}, {'end': 1585.118, 'src': 'embed', 'start': 1550.122, 'weight': 4, 'content': [{'end': 1554.564, 'text': "And so you can see, since there's the same number of examples, in both of these,", 'start': 1550.122, 'duration': 4.442}, {'end': 1558.025, 'text': 'the p hat of the parent is just the average of the p hats of the children.', 'start': 
1554.564, 'duration': 3.461}, {'end': 1562.707, 'text': "And so that's how we can take this LR parent.", 'start': 1558.045, 'duration': 4.662}, {'end': 1564.147, 'text': 'This LR parent is just halfway.', 'start': 1562.727, 'duration': 1.42}, {'end': 1570.043, 'text': 'If we projected this down, Let me just erase this little bit here.', 'start': 1564.207, 'duration': 5.836}, {'end': 1585.118, 'text': "If we projected this down like this, we'd see that this point here is the midpoint.", 'start': 1574.588, 'duration': 10.53}], 'summary': "The parent p hat is the average of the children's p hats.", 'duration': 34.996, 'max_score': 1550.122, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA1550122.jpg'}, {'end': 1689.433, 'src': 'embed', 'start': 1648.182, 'weight': 6, 'content': [{'end': 1654.165, 'text': 'So any point along that line is going to lie below the original loss curve for the parent.', 'start': 1648.182, 'duration': 5.983}, {'end': 1661.331, 'text': "So you're basically, as long as you're not picking the exact same points on the probability curve and not making any gain at all in your split,", 'start': 1654.545, 'duration': 6.786}, {'end': 1664.614, 'text': "you're gonna gain some amount of information through this split.", 'start': 1661.331, 'duration': 3.283}, {'end': 1689.433, 'text': "Okay Now, this was the cross entropy loss, right? 
If instead we look at the misclassification loss over here, let's draw this one instead.", 'start': 1668.817, 'duration': 20.616}], 'summary': 'Splitting data gains information, shown by cross entropy and misclassification loss.', 'duration': 41.251, 'max_score': 1648.182, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA1648182.jpg'}, {'end': 1975.279, 'src': 'embed', 'start': 1938.546, 'weight': 1, 'content': [{'end': 1943.328, 'text': 'And what you do when you get to one of your leaves is, instead of just predicting a majority class,', 'start': 1938.546, 'duration': 4.782}, {'end': 1947.27, 'text': 'what you can do is predict the mean of the values left.', 'start': 1943.328, 'duration': 3.942}, {'end': 1958.502, 'text': "So you're predicting predict y hat where, well, for Rm.", 'start': 1947.75, 'duration': 10.752}, {'end': 1972.216, 'text': "So pretend you have a region Rm, you're predicting y hat of m, which is the sum of all the indices in Rm, yi minus y hat m.", 'start': 1958.522, 'duration': 13.694}, {'end': 1975.279, 'text': 'And you want the squared loss and then you can sort of.', 'start': 1972.216, 'duration': 3.063}], 'summary': 'Predict mean of values left instead of majority class, using squared loss.', 'duration': 36.733, 'max_score': 1938.546, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA1938546.jpg'}, {'end': 2117.94, 'src': 'embed', 'start': 2082.656, 'weight': 2, 'content': [{'end': 2085.938, 'text': 'And it turns out that you can actually basically brute force it very efficiently.', 'start': 2082.656, 'duration': 3.282}, {'end': 2089.58, 'text': "I'm going to get into sort of the details of how you do that shortly.", 'start': 2086.418, 'duration': 3.162}, {'end': 2092.802, 'text': 'But it turns out that you can just go through everything fairly quickly.', 'start': 2089.86, 'duration': 2.942}, {'end': 2094.202, 'text': "I'll get 
into that.", 'start': 2093.502, 'duration': 0.7}, {'end': 2096.244, 'text': "I think that's in a couple sections from now.", 'start': 2094.222, 'duration': 2.022}, {'end': 2102.627, 'text': 'Any other questions? No? OK.', 'start': 2098.565, 'duration': 4.062}, {'end': 2106.77, 'text': 'So this is for regression trees, right?', 'start': 2104.448, 'duration': 2.322}, {'end': 2117.94, 'text': "It turns out that um another useful extension that that you don't really get for other learning algorithms is that you can also deal with uh categorical variables fairly easily.", 'start': 2107.09, 'duration': 10.85}], 'summary': 'Efficiently brute force regression trees can handle categorical variables.', 'duration': 35.284, 'max_score': 2082.656, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA2082656.jpg'}], 'start': 1164.375, 'title': 'Decision trees and cross entropy', 'summary': 'Covers decision tree classifications with 900 positives and 100 negatives, resulting in 10% incorrect classifications, understanding cross entropy loss in binary classification, and the use of p-hat for parent and children nodes, impact of uneven splits on information gain, and handling of categorical variables in regression trees.', 'chapters': [{'end': 1259.49, 'start': 1164.375, 'title': 'Decision tree classifications', 'summary': 'Explains the classification of examples using a decision tree, starting with 900 positives and 100 negatives, resulting in 10% incorrect classifications based on the majority class, and further splitting into subregions to minimize errors in classification.', 'duration': 95.115, 'highlights': ['The initial classification results in 10% incorrect classifications based on the majority class of 900 positives and 100 negatives.', 'Further splitting into subregions allows for minimizing errors in classification, with some subregions not making any errors.']}, {'end': 1516.945, 'start': 1260.911, 'title': 
'Understanding cross entropy loss', 'summary': 'Explains the concept of cross entropy loss and its geometric interpretation in a binary classification problem, comparing the losses of children regions and the parent node.', 'duration': 256.034, 'highlights': ['The geometric perspective of cross entropy loss in a binary classification problem is explained, highlighting the curvature of the loss curve and the relationship between the losses of children regions and the parent node.', 'The concept of p hat as a proportion in the context of cross entropy loss is discussed, emphasizing the modification of p hat c on the x-axis and its impact on the overall loss function on the y-axis.', "The explanation of the average loss between two children regions and the projection of the parent node's loss is provided, illustrating the relationship between the losses of different regions in the context of cross entropy loss."]}, {'end': 1937.566, 'start': 1518.586, 'title': 'Decision trees in machine learning', 'summary': 'Discusses the concept of p-hat for parent and children nodes, the impact of uneven splits on information gain, and the use of strictly concave functions for decision splits in classification and regression trees.', 'duration': 418.98, 'highlights': ['The p-hat of the parent is the average of the p-hats of the children, impacting the LR parent and the average loss calculation.', "Uneven splits result in the average being any point along the line drawn, leading to gaining information through the split as long as there's not an exact same point on the probability curve.", 'Strictly concave functions are successfully used for decision splits in classification and regression trees, with the Gini loss curve also resembling the original cross-entropy period curve.']}, {'end': 2177.012, 'start': 1938.546, 'title': 'Regression trees and categorical variables', 'summary': 'Discusses the use of mean prediction and squared loss in regression trees, as well as the efficient 
brute force approach for finding splits, and the handling of categorical variables in regression trees.', 'duration': 238.466, 'highlights': ['Regression trees can predict the mean of the values within a region instead of just the majority class, using the squared loss for the loss function, allowing for a more accurate prediction.', 'The optimization problem of finding splits in regression trees can be efficiently solved through a brute force approach, providing a quick method for searching for these splits.', 'Regression trees can easily handle categorical variables by using subsets and asking questions based on those subsets, providing an extension not commonly available in other learning algorithms.']}], 'duration': 1012.637, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA1164375.jpg', 'highlights': ['Further splitting into subregions allows for minimizing errors in classification, with some subregions not making any errors.', 'Regression trees can predict the mean of the values within a region instead of just the majority class, using the squared loss for the loss function, allowing for a more accurate prediction.', 'The optimization problem of finding splits in regression trees can be efficiently solved through a brute force approach, providing a quick method for searching for these splits.', 'The initial classification results in 10% incorrect classifications based on the majority class of 900 positives and 100 negatives.', 'The p-hat of the parent is the average of the p-hats of the children, impacting the LR parent and the average loss calculation.', 'The concept of p hat as a proportion in the context of cross entropy loss is discussed, emphasizing the modification of p hat c on the x-axis and its impact on the overall loss function on the y-axis.', "Uneven splits result in the average being any point along the line drawn, leading to gaining information through the split as long as there's not an 
exact same point on the probability curve.", 'The geometric perspective of cross entropy loss in a binary classification problem is explained, highlighting the curvature of the loss curve and the relationship between the losses of children regions and the parent node.']}, {'end': 2560.813, 'segs': [{'end': 2205.918, 'src': 'embed', 'start': 2177.032, 'weight': 0, 'content': [{'end': 2180.894, 'text': "You could ask a question about any sort of subset of the categories you're looking at.", 'start': 2177.032, 'duration': 3.862}, {'end': 2186.816, 'text': 'So in this case, northern, this question would still split out this top part from these bottom pieces here.', 'start': 2181.694, 'duration': 5.122}, {'end': 2191.859, 'text': 'One thing to be careful about, though, is that if you have q categories,', 'start': 2188.137, 'duration': 3.722}, {'end': 2203.155, 'text': 'then you basically are considering every single possible subset of these categories.', 'start': 2197.711, 'duration': 5.444}, {'end': 2205.918, 'text': "So that's 2 to the q possible splits.", 'start': 2203.456, 'duration': 2.462}], 'summary': 'Consider asking questions about subsets of categories to split data, leading to 2 to the q possible splits.', 'duration': 28.886, 'max_score': 2177.032, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA2177032.jpg'}, {'end': 2316.399, 'src': 'heatmap', 'start': 2221.57, 'weight': 1, 'content': [{'end': 2227.073, 'text': 'It turns out that in certain very specific cases, you can still deal with a lot of categories.', 'start': 2221.57, 'duration': 5.503}, {'end': 2232.536, 'text': 'Uh, one such case is for binary classification, where then you can just-.', 'start': 2227.794, 'duration': 4.742}, {'end': 2234.457, 'text': 'the math is a little bit complicated for this one,', 'start': 2232.536, 'duration': 1.921}, {'end': 2243.483, 'text': 'but you can basically sort your categories by how many positive examples 
are in each category and then just take that as like a sorted order and then search through that linearly.', 'start': 2234.457, 'duration': 9.026}, {'end': 2245.784, 'text': 'And it turns out that that yields you an optimal solution.', 'start': 2243.563, 'duration': 2.221}, {'end': 2257.104, 'text': 'So decision trees, we can use them for regression.', 'start': 2255.123, 'duration': 1.981}, {'end': 2259.104, 'text': 'We can also use them for categorical variables.', 'start': 2257.144, 'duration': 1.96}, {'end': 2267.067, 'text': "One thing that I've not gotten into is that you can imagine that, in the limit, if you grew your tree without ever stopping,", 'start': 2260.305, 'duration': 6.762}, {'end': 2270.768, 'text': 'you could end up just having a separate region for every single data point that you have.', 'start': 2267.067, 'duration': 3.701}, {'end': 2277.851, 'text': "And so that's really, you could consider that probably overfitting if you ran it all the way to that completion.", 'start': 2272.409, 'duration': 5.442}, {'end': 2284.313, 'text': 'So you can sort of see that decision trees are fairly high variance models.', 'start': 2277.911, 'duration': 6.402}, {'end': 2290.751, 'text': "And so one thing that we're interested in doing is regularizing these high variance models.", 'start': 2286.601, 'duration': 4.15}, {'end': 2308.836, 'text': 'And generally, how people have solved this problem is through a number of heuristics.', 'start': 2305.834, 'duration': 3.002}, {'end': 2316.399, 'text': 'So one such heuristic is that if you hit a certain minimum leaf size, you stop splitting that leaf.', 'start': 2309.796, 'duration': 6.603}], 'summary': 'Decision trees can be used for regression and categorical variables. 
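The sort-then-scan trick for binary classification mentioned above can be sketched concretely: sorting the q categories by their positive fraction reduces the 2^q subset search to q - 1 prefix splits. The category names and counts below are made up for illustration, and size-weighted Gini impurity stands in for the loss:

```python
# Hypothetical (positives, negatives) counts per category
cats = {"north": (90, 10), "south": (20, 80), "east": (50, 50), "west": (5, 95)}

def weighted_gini(pos, neg):
    # size-weighted Gini impurity of a region
    n = pos + neg
    p = pos / n
    return n * 2 * p * (1 - p)

# Sort categories by their positive fraction; an optimal binary split is
# then some prefix of this order, so only q - 1 candidates need checking.
order = sorted(cats, key=lambda c: cats[c][0] / sum(cats[c]))
best = None
for i in range(1, len(order)):
    left, right = order[:i], order[i:]
    lp = sum(cats[c][0] for c in left); ln = sum(cats[c][1] for c in left)
    rp = sum(cats[c][0] for c in right); rn = sum(cats[c][1] for c in right)
    loss = weighted_gini(lp, ln) + weighted_gini(rp, rn)
    if best is None or loss < best[0]:
        best = (loss, set(left), set(right))
print(best)  # groups the low-positive categories against the high-positive ones
```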
they can be high variance, and regularization is done through heuristics like stopping leaf splitting at a minimum size.', 'duration': 56.094, 'max_score': 2221.57, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA2221570.jpg'}, {'end': 2392.987, 'src': 'embed', 'start': 2365.721, 'weight': 4, 'content': [{'end': 2371.067, 'text': "And I say this one's tempting because it's generally not actually a good idea to use this minimum decrease in loss one.", 'start': 2365.721, 'duration': 5.346}, {'end': 2377.754, 'text': 'And you can think about that by thinking that if you have any sort of higher order interactions between your variables, um,', 'start': 2371.628, 'duration': 6.126}, {'end': 2383.14, 'text': "you might have to ask one question that is not very optimal, or doesn't give you that much of an increase in loss?", 'start': 2377.754, 'duration': 5.386}, {'end': 2387.425, 'text': 'And then your follow-up question combined with that first question might give you a much better increase.', 'start': 2383.44, 'duration': 3.985}, {'end': 2392.987, 'text': "And you can sort of see that in this case, where our initial latitude question doesn't really give us that much of a gain.", 'start': 2387.745, 'duration': 5.242}], 'summary': 'Using minimum decrease in loss may not be optimal due to higher order interactions between variables.', 'duration': 27.266, 'max_score': 2365.721, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA2365721.jpg'}, {'end': 2425.304, 'src': 'embed', 'start': 2400.21, 'weight': 3, 'content': [{'end': 2406.793, 'text': 'And if we were looking at it purely from the minimum decrease and loss perspective, we might stomp too early and miss that entirely.', 'start': 2400.21, 'duration': 6.583}, {'end': 2415.62, 'text': 'And so a better way to do this kind of loss decrease is instead you grow out your full tree and then you prune it backwards 
instead.', 'start': 2408.376, 'duration': 7.244}, {'end': 2421.022, 'text': 'So you, you grow out the whole thing and then you check which nodes to prune out, pruning.', 'start': 2415.68, 'duration': 5.342}, {'end': 2425.304, 'text': 'And how you generally do this is you you take it,', 'start': 2422.503, 'duration': 2.801}], 'summary': 'To avoid premature loss, grow the full tree and prune it backwards for effective loss decrease.', 'duration': 25.094, 'max_score': 2400.21, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA2400210.jpg'}], 'start': 2177.032, 'title': 'Refining decision trees', 'summary': 'Explains subsets of categories and their impact, overfitting in decision trees due to high variance, and pruning techniques for minimizing loss, including specific techniques and considerations such as minimum leaf size and maximum depth.', 'chapters': [{'end': 2259.104, 'start': 2177.032, 'title': 'Categories and decision trees', 'summary': 'Explains the consideration of subsets of categories and the potential number of splits, emphasizing the impact of too many categories and presenting a specific case for binary classification with an optimal sorted order search.', 'duration': 82.072, 'highlights': ['The number of possible splits for q categories is 2 to the q, emphasizing the exponential growth when considering every single possible subset of categories.', 'The impact of too many categories is highlighted, with the warning that dealing with a large number of categories can quickly become intractable.', 'In the specific case of binary classification, sorting the categories by the number of positive examples and then searching linearly yields an optimal solution, presenting a practical approach for dealing with a lot of categories.']}, {'end': 2399.81, 'start': 2260.305, 'title': 'Regularizing decision trees for high variance models', 'summary': 'Discusses the overfitting issue in decision trees due to high 
variance, and the regularization techniques including minimum leaf size, maximum depth, max number of nodes, and the pitfalls of using minimum decrease in loss.', 'duration': 139.505, 'highlights': ['Decision trees can overfit if grown without stopping, resulting in a separate region for every data point, indicating the need for regularization.', 'Regularization techniques for decision trees include stopping the split when reaching a certain minimum leaf size, enforcing a maximum depth, limiting the number of nodes, and being cautious about using minimum decrease in loss as it may not capture higher-order interactions effectively.', 'Using a minimum decrease in loss as a regularization technique for decision trees may not be optimal, as it may not capture higher-order interactions effectively, and the combination of multiple questions may provide a much better increase in loss.', 'Enforcing a maximum depth and a maximum number of nodes are common heuristics used for regularizing high variance decision tree models, to prevent overfitting and ensure generalization.', 'Stopping the split when reaching a certain minimum leaf size is one of the heuristics used for regularizing high variance decision tree models, preventing overfitting by avoiding excessive growth of the tree.']}, {'end': 2560.813, 'start': 2400.21, 'title': 'Tree pruning for loss decrease', 'summary': 'Discusses the approach of growing a full tree and then pruning it backwards to minimize loss, using misclassification error on a validation set and emphasizing the need to ask multiple suboptimal questions to reach the best ones.', 'duration': 160.603, 'highlights': ['The approach of growing a full tree and then pruning it backwards is discussed, emphasizing the need to check which nodes to prune out to minimize loss.', 'The use of misclassification error on a validation set to evaluate the removal of each leaf is highlighted as a technique for loss decrease.', 'The importance of asking multiple 
suboptimal questions, especially in the presence of variable interactions and correlations, to reach the best questions is emphasized.']}], 'duration': 383.781, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA2177032.jpg', 'highlights': ['The number of possible splits for q categories is 2 to the q, emphasizing the exponential growth when considering every single possible subset of categories.', 'Decision trees can overfit if grown without stopping, resulting in a separate region for every data point, indicating the need for regularization.', 'Enforcing a maximum depth and a maximum number of nodes are common heuristics used for regularizing high variance decision tree models, to prevent overfitting and ensure generalization.', 'The approach of growing a full tree and then pruning it backwards is discussed, emphasizing the need to check which nodes to prune out to minimize loss.', 'Using a minimum decrease in loss as a regularization technique for decision trees may not be optimal, as it may not capture higher-order interactions effectively, and the combination of multiple questions may provide a much better increase in loss.']}, {'end': 3142.82, 'segs': [{'end': 2669.877, 'src': 'embed', 'start': 2638.866, 'weight': 0, 'content': [{'end': 2645.774, 'text': 'So you have n examples that you trained on with each has f features, and your resulting tree is depth d.', 'start': 2638.866, 'duration': 6.908}, {'end': 2655.285, 'text': 'So at test time, your runtime is basically just your depth d.', 'start': 2645.774, 'duration': 9.511}, {'end': 2656.666, 'text': "It's just O of d.", 'start': 2655.285, 'duration': 1.381}, {'end': 2659.13, 'text': 'right, which is your depth.', 'start': 2658.069, 'duration': 1.061}, {'end': 2669.877, 'text': 'And typically, though in not all cases, um, d is sort of about- is less than the log of your number of examples.', 'start': 2659.57, 'duration': 10.307}], 'summary': 'Trained on n 
examples with f features, resulting tree depth d. test runtime is o(d), typically d < log(n).', 'duration': 31.011, 'max_score': 2638.866, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA2638866.jpg'}, {'end': 2852.446, 'src': 'embed', 'start': 2820.27, 'weight': 1, 'content': [{'end': 2825.552, 'text': 'And it turns out that this is actually surprisingly fast,', 'start': 2820.27, 'duration': 5.282}, {'end': 2833.475, 'text': 'especially if you consider that n times f is just the size of your original design matrix or your data matrix.', 'start': 2825.552, 'duration': 7.923}, {'end': 2852.446, 'text': 'Your data matrix is of size n times f, right? And then your only- your- your runtime is going through the data matrix at most depth times.', 'start': 2833.956, 'duration': 18.49}], 'summary': 'Processing is surprisingly fast, with n times f as the size of the data matrix.', 'duration': 32.176, 'max_score': 2820.27, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA2820270.jpg'}, {'end': 3038.337, 'src': 'embed', 'start': 2978.033, 'weight': 2, 'content': [{'end': 2983.636, 'text': "Okay, So to recap so far, since we've covered a number of different things about decision trees.", 'start': 2978.033, 'duration': 5.603}, {'end': 2990.108, 'text': "There's a number of pluses and minuses to decision trees.", 'start': 2987.346, 'duration': 2.762}, {'end': 2996.353, 'text': "So on the plus side, I think this is an important point, is that they're actually pretty easy to explain.", 'start': 2990.448, 'duration': 5.905}, {'end': 3002.238, 'text': "If you're explaining what a decision tree is to a non-technical person, it's fairly obvious.", 'start': 2997.494, 'duration': 4.744}, {'end': 3003.359, 'text': "You're like, OK, you have this tree.", 'start': 3002.258, 'duration': 1.101}, {'end': 3007.962, 'text': "You're just playing 20 questions with your data and letting 
it come up with one question at a time.", 'start': 3003.379, 'duration': 4.583}, {'end': 3010.724, 'text': "They're also interpretable.", 'start': 3009.343, 'duration': 1.381}, {'end': 3013.787, 'text': 'You can just draw out the tree, especially for shorter trees.', 'start': 3010.864, 'duration': 2.923}, {'end': 3017.436, 'text': "to see exactly what it's doing.", 'start': 3016.055, 'duration': 1.381}, {'end': 3022.701, 'text': 'It can deal with categorical variables.', 'start': 3021.08, 'duration': 1.621}, {'end': 3031.89, 'text': "And it's generally pretty fast.", 'start': 3030.749, 'duration': 1.141}, {'end': 3038.337, 'text': 'However, on the negative side,', 'start': 3036.815, 'duration': 1.522}], 'summary': 'Decision trees are easy to explain, interpretable, and fast, but have some drawbacks.', 'duration': 60.304, 'max_score': 2978.033, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA2978033.jpg'}, {'end': 3114.758, 'src': 'embed', 'start': 3076.672, 'weight': 4, 'content': [{'end': 3079.974, 'text': 'I just spent all this time talking about decision trees, and I tell you guys they actually sort of suck.', 'start': 3076.672, 'duration': 3.302}, {'end': 3087.92, 'text': 'So why did I actually cover decision trees? And the answer is that, in fact, you can make decision trees a lot better through ensembling.', 'start': 3080.014, 'duration': 7.906}, {'end': 3094.946, 'text': 'And a lot of the methods, for example, the leading methods in Kaggle these days are actually built on ensembles of decision trees.', 'start': 3088.501, 'duration': 6.445}, {'end': 3101.611, 'text': 'And they really provide an ideal sort of model framework to look at through which we can examine a lot of these different ensembling methods.', 'start': 3095.726, 'duration': 5.885}, {'end': 3114.758, 'text': 'Any questions about decision trees before I move on? 
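The earlier remark that the split search can be "brute forced very efficiently" can be sketched for a single numeric feature of a regression tree: scan the O(n) distinct thresholds and score each candidate with the squared loss around the leaf means. The feature values and targets below are made up for illustration:

```python
def sse(ys):
    # squared loss around the leaf prediction y_hat_m = mean of the region
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(xs, ys):
    # brute-force scan over the distinct thresholds of one feature
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        loss = sse(left) + sse(right)
        if best is None or loss < best[0]:
            best = (loss, t)
    return best

xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
print(best_split(xs, ys))  # threshold 3 separates the two target clusters
```

In a full implementation this scan is repeated over all f features at each node, which is where the O(nfd) training cost discussed above comes from.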
Yeah.', 'start': 3104.201, 'duration': 10.557}], 'summary': 'Ensembles of decision trees improve model performance, leading methods in kaggle are built on them.', 'duration': 38.086, 'max_score': 3076.672, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA3076672.jpg'}], 'start': 2563.394, 'title': 'Decision trees', 'summary': 'Discusses the runtime of decision trees at test and train time, highlighting o(d) test time and o(nfd) training time, efficiency, interpretability, ability to handle categorical variables, while addressing downsides such as high variance, additive structure challenges, and low predictive accuracy.', 'chapters': [{'end': 2714.567, 'start': 2563.394, 'title': 'Runtime of decision trees', 'summary': 'Discusses the runtime of decision trees, explaining that at test time, the runtime is o(d), where d is the depth of the tree, typically less than log(n) where n is the number of examples, resulting in quick test times, and at train time, each point only belongs to the left or right side of a split.', 'duration': 151.173, 'highlights': ['At test time, the runtime of decision trees is O(d), where d is the depth of the tree, typically less than log(n) where n is the number of examples, resulting in quick test times.', 'At train time, each point only belongs to the left or right side of a split, never being considered on the other side, resulting in efficient training.']}, {'end': 3142.82, 'start': 2716.374, 'title': 'Costs and benefits of decision trees', 'summary': 'Discusses the costs and benefits of decision trees, highlighting their efficiency in training time (o(nfd)), interpretability, and ability to handle categorical variables, while also addressing their downsides such as high variance, challenges with additive structures, and relatively low predictive accuracy.', 'duration': 426.446, 'highlights': ['The total cost of evaluating points for training is O(nfd), where n is the number of 
points, f is the number of features, and d is the depth of the tree, making it surprisingly fast given the size of the data matrix and the logarithmic depth (log of n).', 'Decision trees are interpretable, making them easy to explain to non-technical individuals and allowing for visualization of their decision-making process, especially for shorter trees.', 'Decision trees can handle categorical variables and offer relatively fast training times, but they have downsides including high variance, challenges with additive structures, and generally lower predictive accuracy.', "Ensembling methods can significantly improve decision trees' performance, with leading methods in platforms like Kaggle being built on ensembles of decision trees, offering an ideal framework for examining different ensembling methods.", 'The cost of evaluating a point at each node is O(f), where f is the number of features, and each point is part of O(d) nodes, where d is the depth of the tree.']}], 'duration': 579.426, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA2563394.jpg', 'highlights': ["Decision trees' test time complexity is O(d), where d is the tree depth, resulting in quick test times.", 'Training time complexity is O(nfd), making it surprisingly fast given the size of the data matrix and the logarithmic depth.', 'Decision trees are interpretable, allowing for visualization of decision-making process and easy explanation to non-technical individuals.', 'Decision trees can handle categorical variables and offer relatively fast training times.', "Ensembling methods can significantly improve decision trees' performance, offering an ideal framework for examining different ensembling methods."]}, {'end': 4105.084, 'segs': [{'end': 3342.339, 'src': 'heatmap', 'start': 3244.327, 'weight': 0.836, 'content': [{'end': 3245.928, 'text': 'But you can call this iid.', 'start': 3244.327, 'duration': 1.601}, {'end': 3258.631, 'text': 'Now, 
say that your variance of one of these variables is sigma squared.', 'start': 3251.249, 'duration': 7.382}, {'end': 3280.329, 'text': "Then what you can show is that the variance of the mean of many of these random variables, or written alternatively, 1 over n, sum over i's of xi,", 'start': 3262.192, 'duration': 18.137}, {'end': 3285.431, 'text': 'is equal to sigma squared over n.', 'start': 3280.329, 'duration': 5.102}, {'end': 3290.693, 'text': 'And so each independent variable you factor in is decreasing the variance of your model.', 'start': 3285.431, 'duration': 5.262}, {'end': 3299.603, 'text': 'And so the thought is that if you can factor in a number of independent sources, you can slowly decrease your variance.', 'start': 3292.898, 'duration': 6.705}, {'end': 3305.366, 'text': "Okay, so I've said that, though this is a little bit simplistic of a way of looking at this,", 'start': 3301.064, 'duration': 4.302}, {'end': 3309.429, 'text': 'because really all these different things are factoring together, have some amount of correlation with each other.', 'start': 3305.366, 'duration': 4.063}, {'end': 3312.871, 'text': 'And so this independence assumption is oftentimes not correct.', 'start': 3309.729, 'duration': 3.142}, {'end': 3324.184, 'text': 'So if instead, you drop the independence assumption.', 'start': 3313.872, 'duration': 10.312}, {'end': 3342.339, 'text': 'So now your variables are just ID right?', 'start': 3338.956, 'duration': 3.383}], 'summary': 'Variance decreases as independent variables are factored in, but independence assumption may not be correct.', 'duration': 98.012, 'max_score': 3244.327, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA3244327.jpg'}, {'end': 3290.693, 'src': 'embed', 'start': 3251.249, 'weight': 0, 'content': [{'end': 3258.631, 'text': 'Now, say that your variance of one of these variables is sigma squared.', 'start': 3251.249, 'duration': 7.382}, {'end': 3280.329, 
'text': "Then what you can show is that the variance of the mean of many of these random variables, or written alternatively, 1 over n, sum over i's of xi,", 'start': 3262.192, 'duration': 18.137}, {'end': 3285.431, 'text': 'is equal to sigma squared over n.', 'start': 3280.329, 'duration': 5.102}, {'end': 3290.693, 'text': 'And so each independent variable you factor in is decreasing the variance of your model.', 'start': 3285.431, 'duration': 5.262}], 'summary': 'Variance of mean of random variables is sigma squared over n, decreasing with each independent variable.', 'duration': 39.444, 'max_score': 3251.249, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA3251249.jpg'}, {'end': 3437.115, 'src': 'heatmap', 'start': 3359.2, 'weight': 1, 'content': [{'end': 3396.236, 'text': 'So then you can actually write out the variance of your mean, as rho sigma squared plus 1 minus rho over n sigma squared.', 'start': 3359.2, 'duration': 37.036}, {'end': 3405.266, 'text': "And so you can see that if they're fully correlated, then this term will drop to 0, and you'll just have sigma squared again,", 'start': 3398.121, 'duration': 7.145}, {'end': 3410.69, 'text': 'because adding a bunch of fully correlated variables is just going to give you the original variables variance versus.', 'start': 3405.266, 'duration': 5.424}, {'end': 3415.154, 'text': "if they're completely de-correlated, then this term drops to 0, and you just end up with sigma squared over n,", 'start': 3410.69, 'duration': 4.464}, {'end': 3420.758, 'text': 'which gives you the initial independent, identically distributed equation.', 'start': 3415.154, 'duration': 5.604}, {'end': 3424.401, 'text': 'And so in this case, really what you want to do.', 'start': 3422.659, 'duration': 1.742}, {'end': 3432.249, 'text': "the name of the game is you want to have as many different models that you're factoring as possible to increase this n, which drives this turn 
down.", 'start': 3424.401, 'duration': 7.848}, {'end': 3437.115, 'text': 'And then, on the other hand, you also want to make sure those models are as decorrelated as possible,', 'start': 3432.87, 'duration': 4.245}], 'summary': 'Variance of mean: rho sigma² + (1-rho)/n sigma². correlation impacts variance.', 'duration': 77.915, 'max_score': 3359.2, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA3359200.jpg'}, {'end': 3491.304, 'src': 'embed', 'start': 3437.115, 'weight': 1, 'content': [{'end': 3440.438, 'text': 'so that your row goes down in this first turn goes down as well.', 'start': 3437.115, 'duration': 3.323}, {'end': 3459.426, 'text': 'Okay And so this gives rise to a number of different ways to ensemble.', 'start': 3441.64, 'duration': 17.786}, {'end': 3473.042, 'text': 'And one way you could think about doing this is you just use different algorithms.', 'start': 3469.177, 'duration': 3.865}, {'end': 3489.063, 'text': "This is actually what a lot of people in Kaggle, for example, will do, is they'll just take a neural network, a random forest,", 'start': 3483.581, 'duration': 5.482}, {'end': 3491.304, 'text': 'an SVM average them all together.', 'start': 3489.063, 'duration': 2.241}], 'summary': 'Ensembling can involve combining different algorithms like neural networks, random forests, and svm, commonly seen in kaggle competitions.', 'duration': 54.189, 'max_score': 3437.115, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA3437115.jpg'}, {'end': 3642.693, 'src': 'embed', 'start': 3603.342, 'weight': 2, 'content': [{'end': 3609.404, 'text': "But generally, we end up doing these latter two because we don't want to collect new training sets or train entirely new algorithms.", 'start': 3603.342, 'duration': 6.062}, {'end': 3612.946, 'text': "So let's cover bagging first.", 'start': 3611.605, 'duration': 1.341}, {'end': 3624.289, 'text': 'OK, so 
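The variance formula in this passage, Var(mean) = rho * sigma^2 + (1 - rho) / n * sigma^2, can be sanity-checked by simulation. The construction below is one standard way to generate equicorrelated Gaussians; the particular values of n, rho, and sigma are arbitrary choices, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, sigma, trials = 10, 0.5, 2.0, 200_000

# x_i = sigma * (sqrt(rho) * z + sqrt(1 - rho) * e_i) gives Var(x_i) = sigma^2
# and Corr(x_i, x_j) = rho for i != j
z = rng.standard_normal(trials)
e = rng.standard_normal((trials, n))
x = sigma * (np.sqrt(rho) * z[:, None] + np.sqrt(1.0 - rho) * e)

empirical = x.mean(axis=1).var()
predicted = rho * sigma**2 + (1.0 - rho) / n * sigma**2
print(empirical, predicted)  # both close to 2.2
```

Setting rho = 1 collapses the prediction to sigma^2 and rho = 0 to sigma^2 / n, matching the two limiting cases described above; driving n up only helps to the extent the models are decorrelated.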
bagging really stands for this thing.', 'start': 3621.929, 'duration': 2.36}, {'end': 3626.45, 'text': "It's called bootstrap aggregation.", 'start': 3624.39, 'duration': 2.06}, {'end': 3642.693, 'text': "And so first, let's just break down this term.", 'start': 3640.233, 'duration': 2.46}], 'summary': 'Bagging reduces need for new training sets or algorithms, utilizing bootstrap aggregation.', 'duration': 39.351, 'max_score': 3603.342, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA3603342.jpg'}, {'end': 3778.734, 'src': 'heatmap', 'start': 3678.902, 'weight': 0.823, 'content': [{'end': 3682.863, 'text': "So you just start drawing a bunch of examples from p, and that's what forms your training set at some level.", 'start': 3678.902, 'duration': 3.961}, {'end': 3687.644, 'text': 'And so ideally, for example, this different training sets approach.', 'start': 3684.103, 'duration': 3.541}, {'end': 3692.446, 'text': 'what you do is you just draw s1, s2, s3, s4, and then train your model in each one of those separately.', 'start': 3687.644, 'duration': 4.802}, {'end': 3695.406, 'text': "Unfortunately, you generally don't have the time to do that.", 'start': 3692.946, 'duration': 2.46}, {'end': 3706.343, 'text': 'And so what bootstrapping does is you assume basically that your population is your training sample.', 'start': 3696.187, 'duration': 10.156}, {'end': 3710.446, 'text': 'So you assume that your population is your training sample.', 'start': 3708.405, 'duration': 2.041}, {'end': 3720.494, 'text': 'And so now that you have this S as approximating your P, then you can draw new samples from your population by just drawing samples from S instead.', 'start': 3710.787, 'duration': 9.707}, {'end': 3725.498, 'text': "So you have bootstrap samples is what they're called.", 'start': 3721.655, 'duration': 3.843}, {'end': 3734.326, 'text': 'Z sampled from S.', 'start': 3732.184, 'duration': 2.142}, {'end': 3739.689, 
'text': 'And so how that works is you basically just take your train-, your-, your training sample.', 'start': 3735.407, 'duration': 4.282}, {'end': 3746.452, 'text': "okay, say it's of like cardinality n or something, and you just sample n times from S, and this is important.", 'start': 3739.689, 'duration': 6.763}, {'end': 3750.053, 'text': "you do it with replacement because they're pretending that this is a population.", 'start': 3746.452, 'duration': 3.601}, {'end': 3756.116, 'text': "And so doing it with replacement sort of makes that assumption hold that you're sampling from it as a population.", 'start': 3750.493, 'duration': 5.623}, {'end': 3761.219, 'text': "Okay, so that's bootstrapping.", 'start': 3759.878, 'duration': 1.341}, {'end': 3766.924, 'text': 'So you generate all these different bootstrap samples z on your- from your training set.', 'start': 3761.259, 'duration': 5.665}, {'end': 3771.568, 'text': 'And what you can do is you can take your model and train it on all these separate bootstrap samples.', 'start': 3767.244, 'duration': 4.324}, {'end': 3778.734, 'text': 'And then you can sort of look at the variability and the predictions that your model ends up making based on these different bootstrap samples.', 'start': 3772.909, 'duration': 5.825}], 'summary': 'Bootstrapping involves generating multiple bootstrap samples from a training set to train models and assess prediction variability.', 'duration': 99.832, 'max_score': 3678.902, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA3678902.jpg'}, {'end': 4019.302, 'src': 'embed', 'start': 3998.226, 'weight': 3, 'content': [{'end': 4009.635, 'text': "And it turns out that one nice thing about bootstrapping is that increasing the number of bootstrap models you're training doesn't actually cause you to overfit any more than you were beforehand.", 'start': 3998.226, 'duration': 11.409}, {'end': 4012.897, 'text': "Because all you're doing is 
you're driving down this term here.", 'start': 4009.835, 'duration': 3.062}, {'end': 4019.302, 'text': 'So more M is just less variance, right?', 'start': 4013.598, 'duration': 5.704}], 'summary': 'Bootstrapping effectively reduces overfitting by driving down variance with more models.', 'duration': 21.076, 'max_score': 3998.226, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA3998226.jpg'}], 'start': 3174.426, 'title': 'Ensembling methods in statistics and machine learning', 'summary': 'Discusses ensembling in statistics, emphasizing the importance of decorrelated models to decrease variance, covers ensembling in machine learning, and explains bagging and bootstrap aggregation for addressing the challenges of limited time and overfitting in model training.', 'chapters': [{'end': 3437.115, 'start': 3174.426, 'title': 'Ensembling in statistics', 'summary': 'Discusses ensembling in statistics, explaining how combining independent sources can decrease variance and emphasizing the importance of having decorrelated models to increase the number of independent sources and decrease the variance.', 'duration': 262.689, 'highlights': ['Combining independent sources can decrease the variance of the model, as shown by the formula variance of the mean = sigma squared / n, where n is the number of independent variables.', 'The correlation between variables affects the variance of the mean, with fully correlated variables resulting in the original variance and completely decorrelated variables resulting in sigma squared / n.', 'The goal is to have as many different models as possible to increase the number of independent sources and drive down the variance, while ensuring that the models are as decorrelated as possible.']}, {'end': 3602.942, 'start': 3437.115, 'title': 'Ensemble methods for machine learning', 'summary': 'Discusses the concept of ensembling in machine learning, covering methods like using different 
algorithms, training sets, bagging, and boosting for improved model performance with decision trees.', 'duration': 165.827, 'highlights': ['The chapter covers ensembling methods including using different algorithms, training sets, bagging, and boosting for improved model performance with decision trees.', 'Ensembling by using different algorithms such as neural networks, random forests, and SVM and averaging them together is a common practice, often yielding good results.', 'The discussion also involves the inefficiency of spending time implementing separate algorithms and the limitation of recommending a whole second training set for performance improvement.', 'The methods of bagging and boosting are then introduced, with bagging aiming to approximate having different training sets and boosting involving techniques like adaboost and xgboost for decision trees.']}, {'end': 3756.116, 'start': 3603.342, 'title': 'Understanding bagging in machine learning', 'summary': 'Explains the concept of bagging in machine learning, specifically focusing on bootstrap aggregation, which involves sampling from a training set to create bootstrap samples and train models, enabling the approximation of the true population and addressing the challenge of limited time for creating different training sets.', 'duration': 152.774, 'highlights': ['Bootstrap aggregation (bagging) involves sampling from a training set to create bootstrap samples and train models, addressing the challenge of limited time for creating different training sets.', 'Bootstrapping assumes that the population is the training sample, allowing the drawing of new samples from the population by sampling from the training sample with replacement.', 'Drawing new samples from the training sample with replacement enables the assumption that the sample is a population, facilitating the creation of bootstrap samples for model training.']}, {'end': 4105.084, 'start': 3759.878, 'title': 'Bootstrapping and bagging', 'summary': 
'Explains the concept of bootstrapping, including generating bootstrap samples, training separate models, and aggregating their outputs to decrease variance without causing overfitting, and discusses the trade-offs and bounds involved.', 'duration': 345.206, 'highlights': ["By taking bootstrap samples and training separate models, then averaging their outputs, bootstrapping can decrease variance without causing overfitting, as increasing the number of bootstrap models doesn't increase overfitting and generally improves performance.", 'The trade-off in bootstrapping comes from the lower bound on decreasing the correlation factor rho, with a limit on how much rho can be decreased due to the high correlation between bootstrap samples drawn from the same set S.', 'Increasing the number of bootstrap models can decrease the correlation factor rho and the second term of the variance equation, leading to decreased variance without causing overfitting, and generally only improves performance.']}], 'duration': 930.658, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA3174426.jpg', 'highlights': ['Combining independent sources decreases model variance, variance of mean = sigma squared / n', 'Ensembling different algorithms like neural networks, random forests, and SVM often yields good results', 'Bootstrap aggregation (bagging) addresses limited time for creating different training sets', 'Increasing the number of bootstrap models decreases correlation factor rho and variance']}, {'end': 4835.903, 'segs': [{'end': 4314.971, 'src': 'embed', 'start': 4241.105, 'weight': 1, 'content': [{'end': 4249.787, 'text': 'the decrease in variance that you get from doing this is much larger than the slight increase in bias you get from doing this randomized subsampling.', 'start': 4241.105, 'duration': 8.682}, {'end': 4252.147, 'text': 'So in a lot of cases, bagging is quite nice.', 'start': 4249.807, 'duration': 2.34}, {'end': 
4268.95, 'text': 'OK OK.', 'start': 4253.087, 'duration': 15.863}, {'end': 4272.539, 'text': "I've talked a bit about bagging.", 'start': 4270.558, 'duration': 1.981}, {'end': 4275.021, 'text': "Let's talk about decision trees plus bagging now.", 'start': 4272.839, 'duration': 2.182}, {'end': 4298.035, 'text': 'So you recall that decision trees are high variance, low bias.', 'start': 4284.747, 'duration': 13.288}, {'end': 4303.486, 'text': "And this right here sort of explains why they're pretty good fit for bagging.", 'start': 4300.205, 'duration': 3.281}, {'end': 4309.469, 'text': "Because bagging, what you're doing is you're decreasing the variance of your models for a slight increase in bias.", 'start': 4304.087, 'duration': 5.382}, {'end': 4314.971, 'text': 'And since most of your error from your decision trees is coming from the high variance side of things,', 'start': 4309.549, 'duration': 5.422}], 'summary': 'Bagging reduces variance, slightly increases bias; beneficial for decision trees.', 'duration': 73.866, 'max_score': 4241.105, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA4241105.jpg'}, {'end': 4431.525, 'src': 'embed', 'start': 4375.752, 'weight': 0, 'content': [{'end': 4381.577, 'text': 'But if you can further de-correlate your different random variables, then you can drive down that variance even further, okay?', 'start': 4375.752, 'duration': 5.825}, {'end': 4399.568, 'text': 'Um, and so the idea there is that basically for- at each split for random forests.', 'start': 4383.004, 'duration': 16.564}, {'end': 4419.338, 'text': 'at each split you consider only a fraction of your total features.', 'start': 4399.568, 'duration': 19.77}, {'end': 4431.525, 'text': "So it's sort of like for that ski example, maybe like for the first split, I only let it look at latitude.", 'start': 4426.902, 'duration': 4.623}], 'summary': 'By de-correlating random variables, variance can be reduced in random 
forests using a fraction of total features at each split.', 'duration': 55.773, 'max_score': 4375.752, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA4375752.jpg'}, {'end': 4509.755, 'src': 'heatmap', 'start': 4451.564, 'weight': 0.812, 'content': [{'end': 4459.09, 'text': 'you can think that say you have a classification example where you have one very strong predictor that gets you very good performance on its own.', 'start': 4451.564, 'duration': 7.526}, {'end': 4464.574, 'text': 'And regardless of what bootstrap sample you select, your model is probably going to use that predictor as its first split.', 'start': 4459.73, 'duration': 4.844}, {'end': 4469.238, 'text': "That's going to cause all your models to be very highly correlated right at that first split, for example.", 'start': 4465.195, 'duration': 4.043}, {'end': 4479.376, 'text': "And by instead forcing it to sample from different features instead that's going to decrease the correlation between your models.", 'start': 4469.759, 'duration': 9.617}, {'end': 4483.018, 'text': "And so it's all about de-correlating your models in this case.", 'start': 4479.897, 'duration': 3.121}, {'end': 4495.603, 'text': 'And that brings to a close a lot of our discussion of bagging.', 'start': 4493.102, 'duration': 2.501}, {'end': 4509.755, 'text': "Are there any questions regarding bagging? Now that I've covered bagging, let's get a little bit into boosting.", 'start': 4496.163, 'duration': 13.592}], 'summary': 'Using different features decreases model correlation, improving performance. 
Discussion includes bagging and boosting.', 'duration': 58.191, 'max_score': 4451.564, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA4451564.jpg'}, {'end': 4661.017, 'src': 'heatmap', 'start': 4549.78, 'weight': 0.941, 'content': [{'end': 4557.865, 'text': "Versus, you'll recall that for bagging, you were taking the average of a number of variables.", 'start': 4549.78, 'duration': 8.085}, {'end': 4562.448, 'text': 'In boosting, what happens is you train one model and then you add that prediction into your ensemble.', 'start': 4557.965, 'duration': 4.483}, {'end': 4565.209, 'text': 'And then when you train a new model, you just add that in as a prediction.', 'start': 4562.608, 'duration': 2.601}, {'end': 4567.731, 'text': "And so that's a little bit hand-wavy right now.", 'start': 4565.93, 'duration': 1.801}, {'end': 4570.673, 'text': 'So let me actually make that clear through an example.', 'start': 4567.791, 'duration': 2.882}, {'end': 4577.579, 'text': 'So say you have a data set, again, x1, x2.', 'start': 4574.858, 'duration': 2.721}, {'end': 4586.421, 'text': 'And you have some data points, maybe some pluses and minuses.', 'start': 4580.619, 'duration': 5.802}, {'end': 4593.082, 'text': 'So you have some more pluses here, and then maybe a couple minuses, and then some pluses here.', 'start': 4587.501, 'duration': 5.581}, {'end': 4599.004, 'text': "And say you're training size 1 decision trees.", 'start': 4596.003, 'duration': 3.001}, {'end': 4601.824, 'text': 'So decision stumps is what we call them.', 'start': 4599.064, 'duration': 2.76}, {'end': 4603.985, 'text': 'You only get to ask one question at a time.', 'start': 4601.844, 'duration': 2.141}, {'end': 4612.389, 'text': "And the reasoning behind this, just really quickly, is that by restricting your trees to be only depth 1,", 'start': 4604.901, 'duration': 7.488}, {'end': 4618.555, 'text': 'you basically are 
increasing their amount of bias and decreasing the amount of variance, which makes them a better fit for boosting kind of methods.', 'start': 4612.389, 'duration': 6.166}, {'end': 4624.341, 'text': 'And say that you come up with a decision boundary, say this one here.', 'start': 4620.116, 'duration': 4.225}, {'end': 4631.868, 'text': "Okay? And what you're gonna do is on this side you predict positive, right? And on this side you predict negative.", 'start': 4626.206, 'duration': 5.662}, {'end': 4636.55, 'text': "This is like a reasonable like line that you could draw here, but it's not perfect, right? You've made some mistakes.", 'start': 4631.888, 'duration': 4.662}, {'end': 4641.132, 'text': 'And in fact, what you can do is you can sort of identify these mistakes.', 'start': 4637.571, 'duration': 3.561}, {'end': 4646.094, 'text': "So if we draw this in red, right? You've got made these guys as mistakes.", 'start': 4641.392, 'duration': 4.702}, {'end': 4653.065, 'text': "And what boosting does is basically it increases the weights of the mistakes you've made.", 'start': 4647.697, 'duration': 5.368}, {'end': 4661.017, 'text': "And then for the next decision stump that you train, it's now trained on this modified set, which we might just draw it over here.", 'start': 4653.646, 'duration': 7.371}], 'summary': 'Boosting trains models sequentially, adjusting for previous mistakes, to improve ensemble predictions.', 'duration': 111.237, 'max_score': 4549.78, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA4549780.jpg'}, {'end': 4835.903, 'src': 'embed', 'start': 4805.934, 'weight': 3, 'content': [{'end': 4808.517, 'text': 'And this algorithm is actually known as AdaBoost.', 'start': 4805.934, 'duration': 2.583}, {'end': 4813.769, 'text': 'And basically through similar techniques,', 'start': 4811.508, 'duration': 2.261}, {'end': 4829.039, 'text': "you can derive algorithms such as XGBoost or gradient boosting 
machines that also allow you to basically reweight the examples you're getting right or wrong in this sort of dynamic fashion and slowly adding them in this additive fashion to your composite model.", 'start': 4813.769, 'duration': 15.27}, {'end': 4831.98, 'text': 'And that about finishes it for today.', 'start': 4830.259, 'duration': 1.721}, {'end': 4832.841, 'text': 'Thanks for coming.', 'start': 4832.201, 'duration': 0.64}, {'end': 4835.903, 'text': 'Have a great rest of your week.', 'start': 4832.861, 'duration': 3.042}], 'summary': 'Adaboost enables dynamic reweighting of examples, leading to composite models like xgboost and gradient boosting machines.', 'duration': 29.969, 'max_score': 4805.934, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA4805934.jpg'}], 'start': 4110.127, 'title': 'Ensemble learning methods', 'summary': 'Covers bagging and random forests, explaining the impact of bootstrapping on bias and variance, and boosting methods like adaboost, xgboost, and gradient boosting machines, which reduce bias and increase model additivity.', 'chapters': [{'end': 4479.376, 'start': 4110.127, 'title': 'Bagging and random forests', 'summary': 'Explains the impact of bootstrapping on bias and variance, the analogy between random variables and algorithms, and the benefits of bagging and random forests in decreasing variance with a slight increase in bias, particularly with decision trees and additional randomization in random forests.', 'duration': 369.249, 'highlights': ['Bagging decreases variance with a slight increase in bias due to random subsampling, making it beneficial in many cases.', 'Decision trees are a good fit for bagging due to their high variance and low bias, as decreasing variance has a larger benefit for models with high variance.', 'Random forests introduce even more randomization by considering only a fraction of total features at each split, effectively reducing correlation between 
models.', 'Random forests aim to de-correlate different random variables to further drive down variance, which is achieved by considering only a fraction of total features at each split.', 'Algorithm as a random variable: an algorithm can be seen as a random variable at a high level, as it provides a distribution of possible predictions based on the training sample.']}, {'end': 4835.903, 'start': 4479.897, 'title': 'Boosting: reducing bias and increasing model additivity', 'summary': 'Introduces boosting as a method that reduces bias and increases model additivity by reweighting examples based on their correctness, and it covers the adaboost algorithm and its derivatives like xgboost and gradient boosting machines.', 'duration': 356.006, 'highlights': ['Boosting reduces bias and increases model additivity by reweighting examples based on their correctness.', 'AdaBoost algorithm is introduced as a specific boosting technique.', 'Derivatives of AdaBoost like XGBoost and gradient boosting machines are mentioned.']}], 'duration': 725.776, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wr9gUr-eWdA/pics/wr9gUr-eWdA4110127.jpg', 'highlights': ['Random forests aim to de-correlate different random variables to further drive down variance, which is achieved by considering only a fraction of total features at each split.', 'Decision trees are a good fit for bagging due to their high variance and low bias, as decreasing variance has a larger benefit for models with high variance.', 'Bagging decreases variance with a slight increase in bias due to random subsampling, making it beneficial in many cases.', 'Boosting reduces bias and increases model additivity by reweighting examples based on their correctness.', 'Random forests introduce even more randomization by considering only a fraction of total features at each split, effectively reducing correlation between models.']}], 'highlights': ['Decision trees provide a natural way to partition the 
space into individual regions, allowing isolation of positive examples, making it an effective method for solving the problem (relevance score: 5)', 'The approach is top-down and greedy, as it starts with the overall region and then partitions it up, picking the best partition possible at each step (relevance score: 4)', 'Introduction of cross entropy loss with examples and explanation', "Decision trees' test time complexity is O(d), where d is the tree depth, resulting in quick test times.", "Ensembling methods can significantly improve decision trees' performance, offering an ideal framework for examining different ensembling methods.", 'The number of possible splits for q categories is 2 to the q, emphasizing the exponential growth when considering every single possible subset of categories.', 'Combining independent sources decreases model variance, variance of mean = sigma squared / n', 'Random forests aim to de-correlate different random variables to further drive down variance, which is achieved by considering only a fraction of total features at each split.', 'Decision trees are interpretable, allowing for visualization of decision-making process and easy explanation to non-technical individuals.', 'Bootstrap aggregation (bagging) addresses limited time for creating different training sets']}
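
The split-quality discussion in the highlights (misclassification loss versus cross-entropy) can be checked numerically with the lecture's example node of 900 positives and 100 negatives, where majority-vote prediction gets 10% of examples wrong. A minimal sketch; the variable names are ours, not the lecture's:

```python
import math

# Node from the lecture's example: 900 positives and 100 negatives,
# i.e. predicting the majority class misclassifies 10% of examples.
pos, neg = 900, 100
p = pos / (pos + neg)                                  # fraction of positives, 0.9

misclassification_loss = min(pos, neg) / (pos + neg)   # 0/1 loss of the majority vote
cross_entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(misclassification_loss)   # 0.1
print(round(cross_entropy, 3))  # 0.469 (bits): entropy stays sensitive near pure nodes
```

The point of comparing the two is that cross-entropy is strictly concave, so it still rewards splits that make nodes purer even when the majority label (and hence the misclassification loss) does not change.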
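
The variance claims in the segments above — Var of the mean is sigma squared over n for independent variables, and rho sigma squared plus (1 - rho)/n sigma squared under pairwise correlation rho — can be verified by simulation. A sketch; the seed and the choices n = 10, sigma = 2, rho = 0.3 are ours:

```python
import numpy as np

# Empirical check of the lecture's formula for n equicorrelated variables
# X_i with Var(X_i) = sigma^2 and pairwise correlation rho:
#   Var( (1/n) * sum_i X_i ) = rho * sigma^2 + (1 - rho) / n * sigma^2
rng = np.random.default_rng(0)
n, sigma, rho = 10, 2.0, 0.3

# Equicorrelated covariance matrix: sigma^2 * ((1 - rho) * I + rho * ones)
cov = sigma**2 * ((1 - rho) * np.eye(n) + rho * np.ones((n, n)))
draws = rng.multivariate_normal(np.zeros(n), cov, size=200_000)

empirical = draws.mean(axis=1).var()                 # variance of the n-variable mean
predicted = rho * sigma**2 + (1 - rho) / n * sigma**2
print(empirical, predicted)  # the two should agree closely
```

Setting rho = 1 collapses `predicted` to sigma squared, and rho = 0 gives sigma squared over n, matching the two limiting cases discussed in the lecture.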
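
The bootstrap procedure described above — treat the training sample S as if it were the population P, then draw samples Z from S with replacement — can be sketched in a few lines. The data, seed, and function names are illustrative, not from the lecture:

```python
import numpy as np

# Bootstrap sketch: pretend the training sample S is the population, and
# draw new "bootstrap samples" Z from S *with replacement*.
rng = np.random.default_rng(1)
S = rng.normal(loc=5.0, scale=2.0, size=100)   # stand-in training sample

def bootstrap_statistics(S, num_samples, statistic):
    """Recompute `statistic` on `num_samples` bootstrap resamples of S."""
    n = len(S)
    stats = []
    for _ in range(num_samples):
        Z = S[rng.integers(0, n, size=n)]      # sample n times, with replacement
        stats.append(statistic(Z))
    return np.array(stats)

medians = bootstrap_statistics(S, 2000, np.median)
# The spread across resamples estimates the statistic's sampling variability,
# which is exactly how the lecture motivates bagging a model's predictions.
print(medians.mean(), medians.std())
```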
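
The extra randomization random forests add on top of bagging — at each split, only a random subset of the features is eligible — is small enough to sketch directly. Using sqrt(d) candidate features is a common default; that choice is an assumption here rather than something stated in the lecture:

```python
import numpy as np

# Random-forest feature subsampling: each split may only consider a random
# subset of the d features, which decorrelates the trees in the ensemble.
rng = np.random.default_rng(2)

def candidate_features(d, k=None):
    """Feature indices a single split is allowed to consider."""
    k = k or max(1, int(np.sqrt(d)))          # sqrt(d) is a common heuristic
    return rng.choice(d, size=k, replace=False)

feats = candidate_features(16)
print(feats)  # 4 distinct feature indices out of the 16
```

This prevents one dominant predictor from being chosen as the first split in every bootstrapped tree, which is the correlation problem the lecture describes.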
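
The boosting loop described in the lecture — train a decision stump, upweight the examples it got wrong, and add each stump's prediction into the ensemble with a weight alpha — is the AdaBoost recipe, and it fits in a short from-scratch sketch. The 1-D toy dataset and all helper names below are ours:

```python
import numpy as np

# AdaBoost sketch with depth-1 trees ("decision stumps"): each round trains
# a stump on reweighted data, then increases the weights of its mistakes.

def stump_predict(X, feature, threshold, polarity):
    # polarity +1: predict +1 where X[:, feature] > threshold, else -1
    return polarity * np.where(X[:, feature] > threshold, 1, -1)

def fit_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted 0/1 error."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for polarity in (1, -1):
                err = w[stump_predict(X, j, t, polarity) != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, t, polarity)
    return best

def adaboost(X, y, rounds):
    w = np.full(len(y), 1.0 / len(y))          # start with uniform weights
    ensemble = []
    for _ in range(rounds):
        err, j, t, polarity = fit_stump(X, y, w)
        err = max(err, 1e-10)                  # guard against a perfect stump
        alpha = 0.5 * np.log((1 - err) / err)  # this stump's vote weight
        pred = stump_predict(X, j, t, polarity)
        w *= np.exp(-alpha * y * pred)         # upweight the mistakes
        w /= w.sum()
        ensemble.append((alpha, j, t, polarity))
    return ensemble

def predict(ensemble, X):
    score = sum(a * stump_predict(X, j, t, p) for a, j, t, p in ensemble)
    return np.sign(score)                      # additive model, then take the sign

# Three bands of labels: no single stump fits this, but three boosted stumps do.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.array([1, 1, 1, -1, -1, -1, -1, 1, 1, 1])
ensemble = adaboost(X, y, rounds=3)
print(predict(ensemble, X))  # recovers y exactly on this toy set
```

XGBoost and gradient boosting machines follow the same additive pattern but fit each new model to gradients of a chosen loss rather than using AdaBoost's exponential reweighting.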