title

Neural Networks Pt. 2: Backpropagation Main Ideas

description

Backpropagation is the method we use to optimize parameters in a Neural Network. The ideas behind backpropagation are quite simple, but there are tons of details. This StatQuest focuses on explaining the main ideas in a way that is easy to understand.
NOTE: This StatQuest assumes that you already know the main ideas behind...
Neural Networks: https://youtu.be/CqOfi41LfDw
The Chain Rule: https://youtu.be/wl1myxrtQHQ
Gradient Descent: https://youtu.be/sDv4f4s2SB8
LAST NOTE: When I was researching this 'Quest, I found this page by Sebastian Raschka to be helpful: https://sebastianraschka.com/faq/docs/backprop-arbitrary.html
For a complete index of all the StatQuest videos, check out:
https://statquest.org/video-index/
If you'd like to support StatQuest, please consider...
Buying my book, The StatQuest Illustrated Guide to Machine Learning:
PDF - https://statquest.gumroad.com/l/wvtmc
Paperback - https://www.amazon.com/dp/B09ZCKR4H6
Kindle eBook - https://www.amazon.com/dp/B09ZG79HXC
Patreon: https://www.patreon.com/statquest
...or...
YouTube Membership: https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw/join
...a cool StatQuest t-shirt or sweatshirt:
https://shop.spreadshirt.com/statquest-with-josh-starmer/
...buying one or two of my songs (or go large and get a whole album!)
https://joshuastarmer.bandcamp.com/
...or just donating to StatQuest!
https://www.paypal.me/statquest
Lastly, if you want to keep up with me as I research and create new StatQuests, follow me on twitter:
https://twitter.com/joshuastarmer
0:00 Awesome song and introduction
3:55 Fitting the Neural Network to the data
6:04 The Sum of the Squared Residuals
7:23 Testing different values for a parameter
8:38 Using the Chain Rule to calculate a derivative
13:28 Using Gradient Descent
16:05 Summary
#StatQuest #NeuralNetworks #Backpropagation

detail

{'title': 'Neural Networks Pt. 2: Backpropagation Main Ideas', 'heatmap': [{'end': 707.904, 'start': 673.356, 'weight': 0.73}, {'end': 865.268, 'start': 838.259, 'weight': 0.709}], 'summary': 'Covers backpropagation in neural networks, optimization of parameter values, visualization of activation functions, and the use of gradient descent to minimize the sum of squared residuals, with a specific example showing a decrease from 20.4 to 0.46 as b sub 3 increases to 3 and achieving an optimal value for b sub 3 of 2.61.', 'chapters': [{'end': 215.473, 'segs': [{'end': 64.074, 'src': 'embed', 'start': 32.58, 'weight': 2, 'content': [{'end': 34.521, 'text': 'The links are in the description below.', 'start': 32.58, 'duration': 1.941}, {'end': 40.846, 'text': 'In the Stat Quest on Neural Networks, Part 1, Inside the Black Box,', 'start': 35.842, 'duration': 5.004}, {'end': 48.611, 'text': 'we started with a simple dataset that showed whether or not different drug dosages were effective against a virus.', 'start': 40.846, 'duration': 7.765}, {'end': 52.994, 'text': 'The low and high dosages were not effective.', 'start': 49.832, 'duration': 3.162}, {'end': 56.149, 'text': 'but the medium dosage was effective.', 'start': 53.868, 'duration': 2.281}, {'end': 64.074, 'text': 'Then we talked about how a neural network like this one fits a green squiggle to this dataset.', 'start': 57.27, 'duration': 6.804}], 'summary': 'Neural network analyzed drug dosages, found medium dosage effective against virus.', 'duration': 31.494, 'max_score': 32.58, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/IN2XmBhILt4/pics/IN2XmBhILt432580.jpg'}, {'end': 116.937, 'src': 'embed', 'start': 87.692, 'weight': 0, 'content': [{'end': 92.115, 'text': 'However, we did not talk about how to estimate the weights and biases.', 'start': 87.692, 'duration': 4.423}, {'end': 101.38, 'text': "So let's talk about how backpropagation optimizes the weights and biases in this and other 
neural networks.", 'start': 93.876, 'duration': 7.504}, {'end': 110.906, 'text': 'Backpropagation is relatively simple, but there are a ton of details, so I split it up into bite-sized pieces.', 'start': 103.682, 'duration': 7.224}, {'end': 116.937, 'text': 'In this part, we talk about the main ideas of backpropagation.', 'start': 112.614, 'duration': 4.323}], 'summary': 'Backpropagation optimizes weights and biases in neural networks, with emphasis on main ideas.', 'duration': 29.245, 'max_score': 87.692, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/IN2XmBhILt4/pics/IN2XmBhILt487692.jpg'}, {'end': 215.473, 'src': 'embed', 'start': 180.944, 'weight': 3, 'content': [{'end': 189.827, 'text': 'Conceptually, backpropagation starts with the last parameter, and works its way backwards to estimate all of the other parameters.', 'start': 180.944, 'duration': 8.883}, {'end': 199.051, 'text': 'However, we can discuss all of the main ideas behind backpropagation by just estimating the last bias, B3.', 'start': 191.248, 'duration': 7.803}, {'end': 212.371, 'text': "So, in order to start from the back, let's assume that we already have optimal values for all of the parameters except for the last bias term b,", 'start': 201.684, 'duration': 10.687}, {'end': 215.473, 'text': 'sub 3..', 'start': 212.371, 'duration': 3.102}], 'summary': 'Backpropagation estimates parameters backwards, starting with the last bias, b3.', 'duration': 34.529, 'max_score': 180.944, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/IN2XmBhILt4/pics/IN2XmBhILt4180944.jpg'}], 'start': 0.109, 'title': 'Backpropagation in neural networks', 'summary': 'Delves into the concept of backpropagation in neural networks, highlighting its role in training and its optimization of weights and biases. 
It references the previous Stat Quest on neural networks and demonstrates the process using a simple dataset to analyze drug dosages against a virus.', 'chapters': [{'end': 56.149, 'start': 0.109, 'title': 'Neural networks part 2: backpropagation', 'summary': 'Discusses the concept of backpropagation in neural networks, emphasizing its importance in the training process and referring to the previous Stat Quest on neural networks. It also mentions the use of a simple dataset to illustrate the effectiveness of different drug dosages against a virus.', 'duration': 56.04, 'highlights': ['The chapter introduces the concept of backpropagation in neural networks, emphasizing its importance in the training process.', 'The previous Stat Quest on neural networks is referenced, indicating the assumption of prior knowledge in this episode.', 'A simple dataset demonstrating the effectiveness of different drug dosages against a virus is mentioned, with specific focus on the medium dosage being effective while the low and high dosages were not.']}, {'end': 215.473, 'start': 57.27, 'title': 'Backpropagation in neural networks', 'summary': 'Explains how backpropagation optimizes the weights and biases in a neural network using the chain rule to calculate derivatives and plugging them into gradient descent to optimize parameters, with a focus on estimating the last bias term.', 'duration': 158.203, 'highlights': ['The chapter explains how backpropagation optimizes the weights and biases in a neural network using the chain rule to calculate derivatives and plugging them into gradient descent to optimize parameters.', 'Backpropagation starts with the last parameter and works its way backwards to estimate all other parameters in the neural network.', 'The chapter focuses on estimating the last bias term, B3, in order to illustrate the main ideas behind backpropagation.']}], 'duration': 215.364, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/IN2XmBhILt4/pics/IN2XmBhILt4109.jpg', 'highlights': ['The chapter introduces the concept of backpropagation in neural networks, emphasizing its importance in the training process.', 'The chapter explains how backpropagation optimizes the weights and biases in a neural network using the chain rule to calculate derivatives and plugging them into gradient descent to optimize parameters.', 'A simple dataset demonstrating the effectiveness of different drug dosages against a virus is mentioned, with specific focus on the medium dosage being effective while the low and high dosages were not.', 'Backpropagation starts with the last parameter and works its way backwards to estimate all other parameters in the neural network.', 'The previous Stat Quest on neural networks is referenced, indicating the assumption of prior knowledge in this episode.']}, {'end': 527.27, 'segs': [{'end': 269.243, 'src': 'embed', 'start': 245.014, 'weight': 2, 'content': [{'end': 252.358, 'text': 'then we get the x-axis coordinates for the activation function that are all inside this red box.', 'start': 245.014, 'duration': 7.344}, {'end': 263.701, 'text': 'And when we plug the x-axis coordinates into the activation function, which in this example is the soft plus activation function,', 'start': 253.877, 'duration': 9.824}, {'end': 269.243, 'text': 'we get the corresponding y-axis coordinates and this blue curve.', 'start': 263.701, 'duration': 5.542}], 'summary': 'The soft plus activation function generates y-axis coordinates for x-axis coordinates within a red box.', 'duration': 24.229, 'max_score': 245.014, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/IN2XmBhILt4/pics/IN2XmBhILt4245014.jpg'}, {'end': 431.001, 'src': 'embed', 'start': 401.01, 'weight': 1, 'content': [{'end': 412.78, 'text': 'Lastly, this residual is the observed value, zero, minus the predicted value from the green squiggle, negative 
2.61.', 'start': 401.01, 'duration': 11.77}, {'end': 421.888, 'text': 'Now we square each residual and add them all together to get 20.4 for the sum of the squared residuals.', 'start': 412.78, 'duration': 9.108}, {'end': 431.001, 'text': 'So when b sub 3 equals 0, the sum of the squared residuals equals 20.4.', 'start': 423.419, 'duration': 7.582}], 'summary': 'Sum of squared residuals is 20.4 when b sub 3 equals 0', 'duration': 29.991, 'max_score': 401.01, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/IN2XmBhILt4/pics/IN2XmBhILt4401010.jpg'}, {'end': 506.879, 'src': 'embed', 'start': 470.637, 'weight': 0, 'content': [{'end': 473.56, 'text': 'And that corresponds to this point on our graph.', 'start': 470.637, 'duration': 2.923}, {'end': 481.437, 'text': 'If we increase B3 to 2, then the sum of the squared residuals equals 1.11.', 'start': 475.393, 'duration': 6.044}, {'end': 491.143, 'text': 'And if we increase B3 to 3, then the sum of the squared residuals equals 0.46.', 'start': 481.437, 'duration': 9.706}, {'end': 496.947, 'text': 'And if we had time to plug in tons of values for B3, we would get this pink curve.', 'start': 491.143, 'duration': 5.804}, {'end': 506.879, 'text': 'and we could find the lowest point, which corresponds to the value for b sub 3 that results in the lowest sum of the squared residuals, here.', 'start': 498.194, 'duration': 8.685}], 'summary': 'Increasing b3 leads to decreasing sum of squared residuals: 1.11 at 2, 0.46 at 3.', 'duration': 36.242, 'max_score': 470.637, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/IN2XmBhILt4/pics/IN2XmBhILt4470637.jpg'}], 'start': 215.473, 'title': 'Neural network optimization and gradient descent', 'summary': 'Explores optimizing parameter values, visualizing activation functions, and calculating the sum of squared residuals in neural networks. 
it also discusses residual analysis, with the sum of squared residuals decreasing from 20.4 to 0.46 as b sub 3 increases to 3, and using gradient descent to minimize the sum of squared residuals.', 'chapters': [{'end': 371.252, 'start': 215.473, 'title': 'Neural network activation function and optimization', 'summary': 'Discusses the process of optimizing parameter values, visualizing activation functions, and the calculation of the sum of the squared residuals in neural networks.', 'duration': 155.779, 'highlights': ['The process of visualizing activation functions and optimizing parameter values in neural networks is explained. The transcript explains the visualization process of activation functions, such as the soft plus activation function, and the optimization of parameter values through the use of green and red color coding.', 'The multiplication of y-axis coordinates by specific values to obtain the final activation curves is demonstrated. It illustrates the multiplication of y-axis coordinates on the blue and orange curves by -1.22 and -2.3 respectively, to obtain the final blue and orange curves.', "The addition of the final bias and the calculation of the sum of squared residuals for evaluating the model's fit to the data are described. The process of adding the final bias, initializing it to 0, and evaluating the fit of the model to the data through the calculation of the sum of squared residuals is outlined."]}, {'end': 527.27, 'start': 372.753, 'title': 'Residual analysis and gradient descent', 'summary': 'Explains the concept of residuals as the difference between observed and predicted values, with the sum of squared residuals being 20.4 when b sub 3 equals 0, decreasing to 0.46 as b sub 3 increases to 3. 
the chapter also discusses using gradient descent to find the lowest point in the pink curve, which corresponds to the value for b sub 3 resulting in the lowest sum of squared residuals.', 'duration': 154.517, 'highlights': ['The sum of squared residuals is 20.4 when b sub 3 equals 0.', 'Using gradient descent, the sum of squared residuals decreases to 0.46 as b sub 3 increases to 3.', 'The chapter also discusses using gradient descent to find the lowest point in the pink curve, which corresponds to the value for b sub 3 resulting in the lowest sum of squared residuals.']}], 'duration': 311.797, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/IN2XmBhILt4/pics/IN2XmBhILt4215473.jpg', 'highlights': ['Using gradient descent, the sum of squared residuals decreases to 0.46 as b sub 3 increases to 3.', 'The sum of squared residuals is 20.4 when b sub 3 equals 0.', 'The process of visualizing activation functions and optimizing parameter values in neural networks is explained.']}, {'end': 1052.812, 'segs': [{'end': 622.674, 'src': 'embed', 'start': 527.27, 'weight': 0, 'content': [{'end': 537.234, 'text': 'Now, remember, the sum of the squared residuals equals the first residual squared plus all of the other squared residuals.', 'start': 527.27, 'duration': 9.964}, {'end': 546.658, 'text': 'Now, because this equation takes up a lot of space, we can make it smaller by using summation notation.', 'start': 538.715, 'duration': 7.943}, {'end': 552.26, 'text': 'The Greek symbol sigma tells us to sum things together.', 'start': 548.158, 'duration': 4.102}, {'end': 560.302, 'text': 'And i is an index for the observed and predicted values that starts at 1.', 'start': 553.757, 'duration': 6.545}, {'end': 568.528, 'text': 'And the index goes from 1 to the number of values n, which in this case is set to 3.', 'start': 560.302, 'duration': 8.226}, {'end': 573.191, 'text': 'So, when i equals 1, we are talking about the first residual.', 'start': 568.528, 
'duration': 4.663}, {'end': 578.255, 'text': 'When i equals 2, we are talking about the second residual.', 'start': 574.713, 'duration': 3.542}, {'end': 583.239, 'text': 'And when i equals 3, we are talking about the third residual.', 'start': 579.356, 'duration': 3.883}, {'end': 587.957, 'text': "Now let's talk a little bit more about the predicted values.", 'start': 584.716, 'duration': 3.241}, {'end': 592.497, 'text': 'Each predicted value comes from the green squiggle.', 'start': 589.517, 'duration': 2.98}, {'end': 597.718, 'text': 'And the green squiggle comes from the last part of the neural network.', 'start': 593.818, 'duration': 3.9}, {'end': 608.62, 'text': 'In other words, the green squiggle is the sum of the blue and orange curves plus b sub 3.', 'start': 599.239, 'duration': 9.381}, {'end': 614.761, 'text': 'Now remember, we want to use gradient descent to optimize b sub 3.', 'start': 608.62, 'duration': 6.141}, {'end': 622.674, 'text': 'And that means we need to take the derivative of the sum of the squared residuals with respect to b sub 3.', 'start': 614.761, 'duration': 7.913}], 'summary': 'Sum of squared residuals, prediction index, and gradient descent for optimization discussed with 3 observed values.', 'duration': 95.404, 'max_score': 527.27, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/IN2XmBhILt4/pics/IN2XmBhILt4527270.jpg'}, {'end': 707.904, 'src': 'heatmap', 'start': 673.356, 'weight': 0.73, 'content': [{'end': 687.068, 'text': 'Now we can solve for the derivative of the sum of the squared residuals with respect to the predicted values by first substituting in the equation and then use the chain rule to move the square to the front,', 'start': 673.356, 'duration': 13.712}, {'end': 697.6, 'text': 'And then we multiply that by the derivative of the stuff inside the parentheses with respect to the predicted values, negative 1..', 'start': 688.297, 'duration': 9.303}, {'end': 702.642, 'text': 'Now we simplify by 
multiplying 2 by negative 1.', 'start': 697.6, 'duration': 5.042}, {'end': 707.904, 'text': 'And we have the derivative of the sum of the squared residuals with respect to the predicted values.', 'start': 702.642, 'duration': 5.262}], 'summary': 'Derivative of sum of squared residuals solved using chain rule.', 'duration': 34.548, 'max_score': 673.356, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/IN2XmBhILt4/pics/IN2XmBhILt4673356.jpg'}, {'end': 865.268, 'src': 'heatmap', 'start': 838.259, 'weight': 0.709, 'content': [{'end': 844.263, 'text': 'Then we plug in the observed values and the values predicted by the green squiggle.', 'start': 838.259, 'duration': 6.004}, {'end': 851.928, 'text': 'Remember, we get the predicted values on the green squiggle by running the dosages through the neural network.', 'start': 845.924, 'duration': 6.004}, {'end': 858.923, 'text': 'Now we just do the math and get negative 15.7.', 'start': 853.598, 'duration': 5.325}, {'end': 865.268, 'text': 'And that corresponds to the slope for when b sub 3 equals 0.', 'start': 858.923, 'duration': 6.345}], 'summary': 'Using the neural network, the observed and predicted values lead to a slope of -15.7 for b sub 3 = 0.', 'duration': 27.009, 'max_score': 838.259, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/IN2XmBhILt4/pics/IN2XmBhILt4838259.jpg'}, {'end': 997.874, 'src': 'embed', 'start': 973.811, 'weight': 3, 'content': [{'end': 981.897, 'text': 'we use the chain rule to calculate the derivative of the sum of the squared residuals with respect to the unknown parameter,', 'start': 973.811, 'duration': 8.086}, {'end': 985.74, 'text': 'which in this case was b sub 3..', 'start': 981.897, 'duration': 3.843}, {'end': 994.17, 'text': 'Then we initialize the unknown parameter with a number, and in this case we set b sub 3 equal to 0,', 'start': 985.74, 'duration': 8.43}, {'end': 997.874, 'text': 'and used gradient descent to optimize the 
unknown parameter.', 'start': 994.17, 'duration': 3.704}], 'summary': 'Applied chain rule to find derivative of sum of squared residuals for b3, initialized b3=0, used gradient descent for optimization.', 'duration': 24.063, 'max_score': 973.811, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/IN2XmBhILt4/pics/IN2XmBhILt4973811.jpg'}], 'start': 527.27, 'title': 'Representing residuals and gradient descent', 'summary': 'Explains representing the sum of squared residuals using summation notation with an index ranging from 1 to 3 and the process of using the chain rule to calculate the derivative of the sum of squared residuals with respect to an unknown parameter and then using gradient descent to optimize the parameter, achieving an optimal value for b sub 3 of 2.61.', 'chapters': [{'end': 592.497, 'start': 527.27, 'title': 'Residuals and summation notation', 'summary': 'Explains how to represent the sum of squared residuals using summation notation, with an index ranging from 1 to 3, and the predicted values derived from the green squiggle.', 'duration': 65.227, 'highlights': ['Using summation notation to represent the sum of squared residuals, with an index ranging from 1 to 3, helps reduce the space taken up by the equation.', 'The index i starts at 1 and goes up to the number of values n, which is set to 3 in this case.', 'The predicted values are derived from the green squiggle.']}, {'end': 1052.812, 'start': 593.818, 'title': 'Backpropagation derivative and gradient descent', 'summary': 'Explains the process of using the chain rule to calculate the derivative of the sum of squared residuals with respect to an unknown parameter, and then using gradient descent to optimize the parameter, achieving an optimal value for b sub 3 of 2.61.', 'duration': 458.994, 'highlights': ['The process of using the chain rule to calculate the derivative of the sum of squared residuals with respect to an unknown parameter The chapter elaborates on using 
the chain rule to calculate the derivative of the sum of squared residuals with respect to an unknown parameter, such as b sub 3.', 'Achieving an optimal value for b sub 3 of 2.61 using gradient descent The chapter demonstrates the use of gradient descent to optimize the unknown parameter, achieving an optimal value for b sub 3 of 2.61.']}], 'duration': 525.542, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/IN2XmBhILt4/pics/IN2XmBhILt4527270.jpg', 'highlights': ['Using summation notation to represent the sum of squared residuals helps reduce equation space.', 'The index i ranges from 1 to n, set to 3 in this case.', 'The predicted values are derived from the green squiggle.', 'The chapter elaborates on using the chain rule to calculate the derivative of the sum of squared residuals with respect to an unknown parameter.', 'Demonstrates the use of gradient descent to optimize the unknown parameter, achieving an optimal value for b sub 3 of 2.61.']}], 'highlights': ['The chapter elaborates on using the chain rule to calculate the derivative of the sum of squared residuals with respect to an unknown parameter.', 'Using gradient descent, the sum of squared residuals decreases to 0.46 as b sub 3 increases to 3.', 'The sum of squared residuals is 20.4 when b sub 3 equals 0.', 'The chapter introduces the concept of backpropagation in neural networks, emphasizing its importance in the training process.', 'The chapter explains how backpropagation optimizes the weights and biases in a neural network using the chain rule to calculate derivatives and plugging them into gradient descent to optimize parameters.']}
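The chapter summaries above walk through one concrete calculation: with every parameter except b sub 3 held at its optimal value, the sum of squared residuals is 20.4 at b3 = 0, the chain-rule derivative (slope) there is about -15.7, and gradient descent settles at b3 ≈ 2.61. Here is a minimal runnable sketch of that calculation. The softplus activation and the starting point b3 = 0 come from the video itself; the specific weight and bias values for the rest of the network are assumptions taken from Part 1 of the series, not stated in this description:

```python
import math

# Dosages (inputs) and observed efficacies from the Part 1 dataset:
# low and high dosages were not effective (0), the medium dosage was (1).
dosages = [0.0, 0.5, 1.0]
observed = [0.0, 1.0, 0.0]

# Assumed optimal values for every parameter except b3
# (taken from Part 1 of the series, "Inside the Black Box").
w1, b1 = 3.34, -1.43   # top hidden node
w2, b2 = -3.53, 0.57   # bottom hidden node
w3, w4 = -1.22, -2.30  # scales on the blue and orange curves

def softplus(x):
    return math.log(1.0 + math.exp(x))

def predict(dosage, b3):
    """The 'green squiggle': scaled blue curve + scaled orange curve + b3."""
    blue = softplus(w1 * dosage + b1) * w3
    orange = softplus(w2 * dosage + b2) * w4
    return blue + orange + b3

def ssr(b3):
    """Sum of the squared residuals: sum_i (observed_i - predicted_i)^2."""
    return sum((obs - predict(d, b3)) ** 2 for d, obs in zip(dosages, observed))

def dssr_db3(b3):
    """Chain rule: d(SSR)/d(b3) = sum_i -2 * (observed_i - predicted_i),
    because d(predicted)/d(b3) = 1."""
    return sum(-2.0 * (obs - predict(d, b3)) for d, obs in zip(dosages, observed))

# Gradient descent on b3, initialized to 0.
b3, learning_rate = 0.0, 0.1
print(f"SSR at b3=0: {ssr(b3):.1f}")          # -> 20.4
print(f"slope at b3=0: {dssr_db3(b3):.1f}")   # -> -15.7
for _ in range(200):
    step = learning_rate * dssr_db3(b3)
    b3 -= step
    if abs(step) < 0.001:  # stop when the step size gets tiny
        break
print(f"optimal b3: {b3:.2f}")                # -> 2.61
```

With these assumed weights the sketch reproduces the numbers quoted in the summaries (20.4, -15.7, and 2.61), which is a useful sanity check that only the derivative with respect to b3 is needed once the other parameters are fixed.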