title
StatQuest: Decision Trees

description
NOTE: This video has been updated and revised. The new version can be found here: https://youtu.be/_L39rN6gz7Y

This StatQuest focuses on the machine learning topic "Decision Trees". Decision trees are a simple way to convert a table of data that you have sitting around your desk into a means to predict and classify new data as it comes.

There is a minor error at 12:43: The Gini Impurity for Chest Pain should be 0.19. There is another minor error at 14:39: We should plug in 0.375 instead of 0.336.

For a complete index of all the StatQuest videos, check out: https://statquest.org/video-index/

If you'd like to support StatQuest, please consider...

Buying The StatQuest Illustrated Guide to Machine Learning!!!
PDF - https://statquest.gumroad.com/l/wvtmc
Paperback - https://www.amazon.com/dp/B09ZCKR4H6
Kindle eBook - https://www.amazon.com/dp/B09ZG79HXC

Patreon: https://www.patreon.com/statquest
...or...
YouTube Membership: https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw/join

...a cool StatQuest t-shirt or sweatshirt:
(USA/Europe): https://teespring.com/stores/statquest
(everywhere): https://www.redbubble.com/people/starmer/works/40421224-statquest-double-bam?asc=u&p=t-shirt

...buying one or two of my songs (or go large and get a whole album!)
https://joshuastarmer.bandcamp.com/

...or just donating to StatQuest!
https://www.paypal.me/statquest

Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter: https://twitter.com/joshuastarmer

0:00 Awesome song and introduction
0:15 How to use decision trees to make decisions
1:36 Descriptions of decision trees and their parts
3:29 How to build a decision tree
7:06 Calculating Gini Impurity
13:59 Numeric and continuous variables
15:31 Ranked data
16:08 Multiple choice data

#statquest #decisiontree #ML

detail
{'title': 'StatQuest: Decision Trees', 'heatmap': [], 'summary': "'statquest: decision trees' explores decision trees' applications for classifying individuals using yes-no questions, numeric, and ranked data, illustrating examples such as resting heart rate, mouse weight, and heart disease prediction among 303 patients, while also discussing measuring impurity with gini and building decision trees for patient classification based on attributes like chest pain and blocked arteries.", 'chapters': [{'end': 153.776, 'segs': [{'end': 60.674, 'src': 'embed', 'start': 34.636, 'weight': 0, 'content': [{'end': 41.861, 'text': 'In general, a decision tree asks a question and then classifies the person based on the answer.', 'start': 34.636, 'duration': 7.225}, {'end': 43.902, 'text': "It's no big deal.", 'start': 43.001, 'duration': 0.901}, {'end': 48.966, 'text': 'This decision tree is based on a yes-no question.', 'start': 45.283, 'duration': 3.683}, {'end': 53.789, 'text': 'But it is just as easy to build a tree from numeric data.', 'start': 50.327, 'duration': 3.462}, {'end': 60.674, 'text': 'If a person has a really high resting heart rate, then that person had better see a doctor.', 'start': 55.15, 'duration': 5.524}], 'summary': 'Decision tree classifies based on yes-no and numeric data, identifying need to see a doctor.', 'duration': 26.038, 'max_score': 34.636, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7VeUPuFGJHk/pics/7VeUPuFGJHk34636.jpg'}], 'start': 0.429, 'title': 'Decision trees', 'summary': 'Delves into understanding decision trees and their applications for classifying individuals based on yes-no questions, numeric data, and ranked data, with examples including decision trees based on resting heart rate and mouse weight, and a more complex decision tree combining numeric and yes-no data.', 'chapters': [{'end': 153.776, 'start': 0.429, 'title': 'Understanding decision trees', 'summary': 'Explores decision trees and their 
applications, including classifying individuals based on yes-no questions, numeric data, and ranked data, with examples of decision trees based on resting heart rate and mouse weight, also showcasing a more complex decision tree combining numeric and yes-no data.', 'duration': 153.347, 'highlights': ['Decision trees can classify individuals based on yes-no questions, numeric data, or ranked data. Decision trees can classify individuals based on yes-no questions, numeric data, or ranked data, as illustrated by examples of decision trees based on resting heart rate and mouse weight.', 'Examples of decision trees include classifying individuals based on resting heart rate and mouse weight. Examples of decision trees include classifying individuals based on resting heart rate and mouse weight, showcasing the versatility of decision trees in different scenarios.', 'Complex decision trees can combine numeric and yes-no data. Complex decision trees can combine numeric and yes-no data, as demonstrated by the example of a decision tree combining numeric data with yes-no data based on resting heart rate and eating donuts.']}], 'duration': 153.347, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7VeUPuFGJHk/pics/7VeUPuFGJHk429.jpg', 'highlights': ['Decision trees can classify individuals based on yes-no questions, numeric data, or ranked data.', 'Examples of decision trees include classifying individuals based on resting heart rate and mouse weight.', 'Complex decision trees can combine numeric and yes-no data.']}, {'end': 390.935, 'segs': [{'end': 186.932, 'src': 'embed', 'start': 155.217, 'weight': 0, 'content': [{'end': 159.02, 'text': 'For the most part, decision trees are pretty intuitive to work with.', 'start': 155.217, 'duration': 3.803}, {'end': 171.849, 'text': "You start at the top, and work your way down, and down, till you get to a point where you can't go any further, and that's how you classify a sample.", 'start': 160.641, 
'duration': 11.208}, {'end': 181.988, 'text': 'Oh no! Jargon alert! The very top of the tree is called the root node or just the root.', 'start': 173.591, 'duration': 8.397}, {'end': 186.932, 'text': 'These are called internal nodes or just nodes.', 'start': 183.75, 'duration': 3.182}], 'summary': 'Decision trees are intuitive to work with, starting at the root and classifying samples by moving down internal nodes.', 'duration': 31.715, 'max_score': 155.217, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7VeUPuFGJHk/pics/7VeUPuFGJHk155217.jpg'}, {'end': 307.812, 'src': 'embed', 'start': 277.164, 'weight': 3, 'content': [{'end': 282.847, 'text': 'Ultimately, we look at chest pain and heart disease for all 303 patients in this study.', 'start': 277.164, 'duration': 5.683}, {'end': 288.27, 'text': 'Now we do the exact same thing for good blood circulation.', 'start': 284.428, 'duration': 3.842}, {'end': 297.147, 'text': 'Lastly, we look at how blocked arteries separates the patients with and without heart disease.', 'start': 291.365, 'duration': 5.782}, {'end': 302.97, 'text': "Since we don't know if this patient had blocked arteries or not, we'll skip it.", 'start': 298.608, 'duration': 4.362}, {'end': 307.812, 'text': "However, there are alternatives that I'll discuss in a follow-up video.", 'start': 303.97, 'duration': 3.842}], 'summary': 'Study examines chest pain and heart disease in 303 patients, with a focus on good blood circulation and blocked arteries.', 'duration': 30.648, 'max_score': 277.164, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7VeUPuFGJHk/pics/7VeUPuFGJHk277164.jpg'}, {'end': 390.935, 'src': 'embed', 'start': 361.493, 'weight': 4, 'content': [{'end': 368.779, 'text': 'The total number of patients with heart disease is different for chest pain, good blood circulation and blocked arteries,', 'start': 361.493, 'duration': 7.286}, {'end': 374.223, 'text': 'because some patients had 
measurements for chest pain but not for blocked arteries, etc.', 'start': 368.779, 'duration': 5.444}, {'end': 376.025, 'text': 'Oh, no!', 'start': 375.604, 'duration': 0.421}, {'end': 378.867, 'text': "It's another one of those ghastly jargon alerts!", 'start': 376.525, 'duration': 2.342}, {'end': 388.654, 'text': 'Because none of the leaf nodes are 100% yes heart disease or 100% no heart disease.', 'start': 380.711, 'duration': 7.943}, {'end': 390.935, 'text': 'they are all considered impure.', 'start': 388.654, 'duration': 2.281}], 'summary': 'Variations in heart disease patients with different symptoms and measurements make classification impure.', 'duration': 29.442, 'max_score': 361.493, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7VeUPuFGJHk/pics/7VeUPuFGJHk361493.jpg'}], 'start': 155.217, 'title': 'Decision trees and heart disease prediction', 'summary': 'Introduces decision tree concepts and discusses creating decision trees from raw data, and examines the effectiveness of chest pain, good blood circulation, and blocked arteries in predicting heart disease among 303 patients, concluding that none of these factors are perfect predictors.', 'chapters': [{'end': 221.727, 'start': 155.217, 'title': 'Understanding decision trees', 'summary': 'Introduces the concept of decision trees, explaining the structure and terminology, and prepares to discuss the process of creating a decision tree from raw data.', 'duration': 66.51, 'highlights': ['The chapter introduces the concept of decision trees, explaining the structure and terminology, and prepares to discuss the process of creating a decision tree from raw data.', 'Decision trees are intuitive to work with, starting from the root node and classifying samples by moving down to the leaf nodes.', 'The terminology of decision trees is explained, such as root nodes, internal nodes, and leaf nodes.', 'The process of creating a decision tree from raw data is about to be discussed.']}, 
{'end': 390.935, 'start': 221.727, 'title': 'Heart disease prediction analysis', 'summary': 'Examines the effectiveness of chest pain, good blood circulation, and blocked arteries in predicting heart disease among 303 patients, with the conclusion that none of these factors are perfect predictors, as they all show impurity in separating patients with and without heart disease.', 'duration': 169.208, 'highlights': ['The chapter examines the effectiveness of chest pain, good blood circulation, and blocked arteries in predicting heart disease among 303 patients The analysis involves studying the predictive power of chest pain, good blood circulation, and blocked arteries in determining the presence of heart disease.', 'None of these factors are perfect predictors, as they all show impurity in separating patients with and without heart disease Chest pain, good blood circulation, and blocked arteries are found to be imperfect predictors of heart disease, as none of the leaf nodes resulted in 100% yes or no heart disease outcomes.']}], 'duration': 235.718, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7VeUPuFGJHk/pics/7VeUPuFGJHk155217.jpg', 'highlights': ['The chapter introduces the concept of decision trees and explains the structure and terminology.', 'Decision trees are intuitive, starting from the root node and classifying samples by moving down to the leaf nodes.', 'The terminology of decision trees is explained, including root nodes, internal nodes, and leaf nodes.', 'The chapter examines the effectiveness of chest pain, good blood circulation, and blocked arteries in predicting heart disease among 303 patients.', 'None of these factors are perfect predictors, as they all show impurity in separating patients with and without heart disease.']}, {'end': 1041.529, 'segs': [{'end': 622.287, 'src': 'embed', 'start': 595.379, 'weight': 0, 'content': [{'end': 605.785, 'text': "And since I'm such a nice guy, I'm going to cut to the chase and 
tell you that the Gini impurity for good blood circulation equals 0.360.", 'start': 595.379, 'duration': 10.406}, {'end': 612.022, 'text': 'And the Gini impurity for blocked arteries equals 0.381.', 'start': 605.785, 'duration': 6.237}, {'end': 614.623, 'text': 'Good blood circulation has the lowest impurity.', 'start': 612.022, 'duration': 2.601}, {'end': 618.705, 'text': 'It separates patients with and without heart disease the best.', 'start': 615.204, 'duration': 3.501}, {'end': 622.287, 'text': 'So we will use it at the root of the tree.', 'start': 620.226, 'duration': 2.061}], 'summary': 'Gini impurity: good circulation=0.360, blocked arteries=0.381. good circulation best separates patients with and without heart disease.', 'duration': 26.908, 'max_score': 595.379, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7VeUPuFGJHk/pics/7VeUPuFGJHk595379.jpg'}, {'end': 681.469, 'src': 'embed', 'start': 651.136, 'weight': 2, 'content': [{'end': 659.823, 'text': 'And the 133 patients with and without heart disease that ended up in this leaf node are now in this node in the tree.', 'start': 651.136, 'duration': 8.687}, {'end': 672.446, 'text': 'Now we need to figure out how well chest pain and blocked arteries separate these 164 patients, 37 with heart disease and 127 without heart disease.', 'start': 661.522, 'duration': 10.924}, {'end': 681.469, 'text': 'Just like we did before, we separate these patients based on chest pain and then calculate the Gini impurity value.', 'start': 674.466, 'duration': 7.003}], 'summary': 'Out of 164 patients, 37 had heart disease, and 127 did not. 
they were separated based on chest pain and blocked arteries to calculate gini impurity.', 'duration': 30.333, 'max_score': 651.136, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7VeUPuFGJHk/pics/7VeUPuFGJHk651136.jpg'}, {'end': 776.72, 'src': 'embed', 'start': 744.752, 'weight': 1, 'content': [{'end': 750.897, 'text': "Note, the vast majority of the patients in this node, 89%, don't have heart disease.", 'start': 744.752, 'duration': 6.145}, {'end': 755.021, 'text': "Here's how chest pain divides these patients.", 'start': 752.519, 'duration': 2.502}, {'end': 764.555, 'text': "Do these new leaves separate patients better than what we had before? Well, let's calculate the Gini impurity for it.", 'start': 756.702, 'duration': 7.853}, {'end': 768.937, 'text': "In this case, it's 0.29.", 'start': 765.415, 'duration': 3.522}, {'end': 776.72, 'text': 'The Gini impurity for this node before using chest pain to separate patients is 0.2.', 'start': 768.937, 'duration': 7.783}], 'summary': "89% of patients in this node don't have heart disease, with a gini impurity of 0.29.", 'duration': 31.968, 'max_score': 744.752, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7VeUPuFGJHk/pics/7VeUPuFGJHk744752.jpg'}, {'end': 834.172, 'src': 'embed', 'start': 807.499, 'weight': 3, 'content': [{'end': 815.423, 'text': "Second, if the node itself has the lowest score, then there's no point in separating the patients anymore and it becomes a leaf node.", 'start': 807.499, 'duration': 7.924}, {'end': 823.847, 'text': 'Third, if separating the data results in an improvement, then pick the separation with the lowest impurity value.', 'start': 816.803, 'duration': 7.044}, {'end': 828.089, 'text': 'Hooray! 
We made a decision tree.', 'start': 825.567, 'duration': 2.522}, {'end': 834.172, 'text': "So far, we've seen how to build a tree with yes-no questions at each step.", 'start': 829.65, 'duration': 4.522}], 'summary': 'Decision tree construction with lowest impurity value for separation.', 'duration': 26.673, 'max_score': 807.499, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7VeUPuFGJHk/pics/7VeUPuFGJHk807499.jpg'}], 'start': 392.796, 'title': 'Measuring impurity with gini and building decision trees for patient classification', 'summary': 'Discusses the calculation of gini impurity to measure impurity, with examples showing gini impurity values for different leaf nodes and the use of gini impurity to determine the best separation method, where good blood circulation has the lowest impurity. it also demonstrates building a decision tree for patient classification based on attributes like chest pain, blocked arteries, and patient weight, resulting in effective patient separation and gini impurity values, with a potential extension to ranked and multiple choice data.', 'chapters': [{'end': 622.287, 'start': 392.796, 'title': 'Measuring impurity with gini', 'summary': 'Discusses the calculation of gini impurity to measure impurity, with examples showing gini impurity values for different leaf nodes and the use of gini impurity to determine the best separation method, where good blood circulation has the lowest impurity.', 'duration': 229.491, 'highlights': ['Good blood circulation has the lowest impurity. The Gini impurity for good blood circulation equals 0.360.', 'Using chest pain to separate patients with and without heart disease has a total Gini impurity of 0.364. The total Gini impurity for chest pain equals 0.364.', 'The total Gini impurity for using chest pain to separate patients with and without heart disease is the weighted average of the leaf node impurities. 
The total Gini impurity for using chest pain to separate patients with and without heart disease is the weighted average of the leaf node impurities, which equals 0.364.', 'The Gini impurity for blocked arteries equals 0.381.']}, {'end': 1041.529, 'start': 625.389, 'title': 'Building decision trees for patient classification', 'summary': 'Demonstrates building a decision tree for patient classification based on attributes like chest pain, blocked arteries, and patient weight, resulting in effective patient separation and gini impurity values, with a potential extension to ranked and multiple choice data.', 'duration': 416.14, 'highlights': ['The decision tree uses attributes like chest pain and blocked arteries to effectively separate patients with 37 heart disease and 127 without, resulting in a Gini impurity value of 0.3.', 'The decision tree demonstrates the effective separation of patients using chest pain, resulting in 24 with heart disease and 25 without, and a Gini impurity value of 0.29, indicating improved patient separation.', 'The process of building a decision tree is detailed, including steps for using numeric data like patient weight and the extension to ranked and multiple choice data, providing a comprehensive approach to patient classification.']}], 'duration': 648.733, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/7VeUPuFGJHk/pics/7VeUPuFGJHk392796.jpg', 'highlights': ['Good blood circulation has the lowest impurity with a Gini impurity of 0.360.', 'Using chest pain to separate patients with and without heart disease has a total Gini impurity of 0.364.', 'The decision tree uses attributes like chest pain and blocked arteries to effectively separate patients with 37 heart disease and 127 without, resulting in a Gini impurity value of 0.3.', 'The process of building a decision tree is detailed, including steps for using numeric data like patient weight and the 
extension to ranked and multiple choice data.']}], 'highlights': ['Decision trees can classify individuals based on yes-no questions, numeric data, or ranked data.', 'Examples of decision trees include classifying individuals based on resting heart rate and mouse weight.', 'Complex decision trees can combine numeric and yes-no data.', 'The chapter introduces the concept of decision trees and explains the structure and terminology.', 'Decision trees are intuitive, starting from the root node and classifying samples by moving down to the leaf nodes.', 'The terminology of decision trees is explained, including root nodes, internal nodes, and leaf nodes.', 'The chapter examines the effectiveness of chest pain, good blood circulation, and blocked arteries in predicting heart disease among 303 patients.', 'None of these factors are perfect predictors, as they all show impurity in separating patients with and without heart disease.', 'Good blood circulation has the lowest impurity with a Gini impurity of 0.360.', 'Using chest pain to separate patients with and without heart disease has a total Gini impurity of 0.364.', 'The decision tree uses attributes like chest pain and blocked arteries to effectively separate patients with 37 heart disease and 127 without, resulting in a Gini impurity value of 0.3.', 'The process of building a decision tree is detailed, including steps for using numeric data like patient weight and the extension to ranked and multiple choice data.']}
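The Gini arithmetic the chapters above walk through (leaf impurity = 1 minus the sum of squared class proportions, and a split's total impurity = the weighted average of its leaf impurities) can be sketched in a few lines of Python. This is an illustrative sketch, not code from the video: the helper names `gini` and `weighted_gini` are mine, the node counts (37 patients with heart disease, 127 without) come from the transcript, and the leaf counts for the chest-pain split ([24, 25] and [13, 102]) are assumptions chosen to be consistent with that node's totals.

```python
def gini(counts):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(leaves):
    """Total impurity of a split: leaf impurities weighted by leaf size."""
    n = sum(sum(leaf) for leaf in leaves)
    return sum(sum(leaf) / n * gini(leaf) for leaf in leaves)

# Node from the transcript: 164 patients, [with heart disease, without].
node = [37, 127]
print(f"node impurity          = {gini(node):.3f}")   # 0.349

# Chest-pain split of that node; left leaf [24, 25] is from the transcript,
# right leaf [13, 102] is the assumed remainder (37-24, 127-25).
left, right = [24, 25], [13, 102]
print(f"split impurity (total) = {weighted_gini([left, right]):.3f}")   # 0.290
```

Under these assumed leaf counts the weighted total comes out to 0.290, matching the 0.29 quoted in the transcript for the chest-pain split, and the split is only kept if its total impurity is lower than the impurity of the unsplit node (the rule stated in the transcript for deciding when a node becomes a leaf).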