title

Building makemore Part 4: Becoming a Backprop Ninja

description

We take the 2-layer MLP (with BatchNorm) from the previous video and backpropagate through it manually without using PyTorch autograd's loss.backward(): through the cross entropy loss, 2nd linear layer, tanh, batchnorm, 1st linear layer, and the embedding table. Along the way, we get a strong intuitive understanding about how gradients flow backwards through the compute graph and on the level of efficient Tensors, not just individual scalars like in micrograd. This helps build competence and intuition around how neural nets are optimized and sets you up to more confidently innovate on and debug modern neural networks.
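To make the idea concrete, here is a minimal sketch (not the notebook's code) of one such manual backward pass: the gradient of the cross entropy loss with respect to the logits, checked against PyTorch autograd. The names `logits`, `Yb`, `dlogits` and the sizes 32 and 27 follow the video's conventions, but the random inputs are an assumption made purely for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, vocab_size = 32, 27  # batch size and number of classes used in the video

# Illustrative random inputs (an assumption, not the notebook's real activations)
logits = torch.randn(n, vocab_size, requires_grad=True)
Yb = torch.randint(0, vocab_size, (n,))

# Reference gradient from autograd
loss = F.cross_entropy(logits, Yb)
loss.backward()

# Manual backward pass: dloss/dlogits = (softmax(logits) - one_hot(Yb)) / n
with torch.no_grad():
    dlogits = F.softmax(logits, dim=1)
    dlogits[range(n), Yb] -= 1.0
    dlogits /= n

max_diff = (dlogits - logits.grad).abs().max().item()
print(f"max abs difference vs autograd: {max_diff:.2e}")
```

The comparison against `logits.grad` plays the role of the gradient check described in the video: if the manual expression is right, the two tensors agree up to floating point error.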
!!!!!!!!!!!!
I recommend you work through the exercise yourself, in tandem with the video: whenever you are stuck, unpause the video and see me give away the answer. This video is not really intended to be simply watched. The exercise is here:
https://colab.research.google.com/drive/1WV2oi2fh9XXyldh02wupFQX0wh5ZC-z-?usp=sharing
!!!!!!!!!!!!
Links:
- makemore on github: https://github.com/karpathy/makemore
- jupyter notebook I built in this video: https://github.com/karpathy/nn-zero-to-hero/blob/master/lectures/makemore/makemore_part4_backprop.ipynb
- colab notebook: https://colab.research.google.com/drive/1WV2oi2fh9XXyldh02wupFQX0wh5ZC-z-?usp=sharing
- my website: https://karpathy.ai
- my twitter: https://twitter.com/karpathy
- our Discord channel: https://discord.gg/3zy8kqD9Cp
Supplementary links:
- Yes you should understand backprop: https://karpathy.medium.com/yes-you-should-understand-backprop-e2f06eab496b
- BatchNorm paper: https://arxiv.org/abs/1502.03167
- Bessel's Correction: http://math.oxford.emory.edu/site/math117/besselCorrection/
- Bengio et al. 2003 MLP LM https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
Chapters:
00:00:00 intro: why you should care & fun history
00:07:26 starter code
00:13:01 exercise 1: backproping the atomic compute graph
01:05:17 brief digression: bessel's correction in batchnorm
01:26:31 exercise 2: cross entropy loss backward pass
01:36:37 exercise 3: batch norm layer backward pass
01:50:02 exercise 4: putting it all together
01:54:24 outro

detail

{'title': 'Building makemore Part 4: Becoming a Backprop Ninja', 'heatmap': [{'end': 5610.743, 'start': 5540.306, 'weight': 0.718}, {'end': 6920.121, 'start': 6853.283, 'weight': 1}], 'summary': 'Series delves into understanding and implementing backpropagation in deep learning, covering manual backpropagation, derivatives and gradients in neural networks, backpropagation process, and its implementation in pytorch, emphasizing efficient derivation of mathematical formulas and achieving good loss.', 'chapters': [{'end': 366.594, 'segs': [{'end': 86.011, 'src': 'embed', 'start': 58.992, 'weight': 2, 'content': [{'end': 64.736, 'text': "I actually have an entire blog post on this topic, but I'd like to call backpropagation a leaky abstraction.", 'start': 58.992, 'duration': 5.744}, {'end': 70.919, 'text': "And what I mean by that is backpropagation doesn't just make your neural networks just work magically.", 'start': 65.656, 'duration': 5.263}, {'end': 77.803, 'text': "It's not the case that you can just stack up arbitrary Lego blocks of differentiable functions and just cross your fingers and backpropagate and everything is great.", 'start': 71.32, 'duration': 6.483}, {'end': 80.245, 'text': "Things don't just work automatically.", 'start': 79.084, 'duration': 1.161}, {'end': 86.011, 'text': 'It is a leaky abstraction in the sense that you can shoot yourself in the foot if you do not understand its internals.', 'start': 80.945, 'duration': 5.066}], 'summary': 'Backpropagation is a leaky abstraction that requires understanding its internals.', 'duration': 27.019, 'max_score': 58.992, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI58992.jpg'}, {'end': 217.929, 'src': 'embed', 'start': 192.125, 'weight': 0, 'content': [{'end': 197.27, 'text': "And I don't think it's enough, and I'd like us to basically think about backpropagation on the level of tensors as well.", 'start': 192.125, 'duration': 5.145}, {'end': 
199.893, 'text': "And so, in summary, I think it's a good exercise.", 'start': 197.931, 'duration': 1.962}, {'end': 202.375, 'text': 'I think it is very, very valuable.', 'start': 200.433, 'duration': 1.942}, {'end': 207.6, 'text': "You're going to become better at debugging neural networks and making sure that you understand what you're doing.", 'start': 202.675, 'duration': 4.925}, {'end': 212.325, 'text': "It is going to make everything fully explicit, so you're not going to be nervous about what is hidden away from you.", 'start': 208.241, 'duration': 4.084}, {'end': 215.287, 'text': "And basically, in general, we're going to emerge stronger.", 'start': 213.025, 'duration': 2.262}, {'end': 217.929, 'text': "And so let's get into it.", 'start': 216.108, 'duration': 1.821}], 'summary': 'Backpropagation on tensors enhances neural network understanding and debugging, leading to stronger outcomes.', 'duration': 25.804, 'max_score': 192.125, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI192125.jpg'}, {'end': 254.293, 'src': 'embed', 'start': 226.894, 'weight': 3, 'content': [{'end': 230.837, 'text': 'But about 10 years ago in deep learning, this was fairly standard and in fact pervasive.', 'start': 226.894, 'duration': 3.943}, {'end': 238.062, 'text': "So at the time everyone used to write their own backward pass by hand manually, including myself, and it's just what you would do.", 'start': 231.517, 'duration': 6.545}, {'end': 244.146, 'text': "So we used to write the backward pass by hand, and now everyone just calls loss.backward. We've lost something.", 'start': 238.062, 'duration': 6.084}, {'end': 246.928, 'text': 'I wanted to give you a few examples of this.', 'start': 244.146, 'duration': 2.782}, {'end': 254.293, 'text': "So here's a 2006 paper from Geoff Hinton and Ruslan Salakhutdinov in Science that was influential at the time,", 'start': 246.928, 'duration': 7.365}], 'summary': 'About 10 years ago, 
deep learning involved writing backward passes manually, including a 2006 paper from Geoff Hinton and Ruslan Salakhutdinov.', 'duration': 27.399, 'max_score': 226.894, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI226894.jpg'}, {'end': 355.827, 'src': 'embed', 'start': 328.605, 'weight': 4, 'content': [{'end': 333.15, 'text': "And then here we take that gradient and use it for a parameter update along the lines that we're used to.", 'start': 328.605, 'duration': 4.545}, {'end': 335.312, 'text': 'Yeah, here.', 'start': 333.17, 'duration': 2.142}, {'end': 341.198, 'text': 'But you can see that basically people are meddling with these gradients directly and inline and themselves.', 'start': 336.775, 'duration': 4.423}, {'end': 343.519, 'text': "It wasn't that common to use an autograd engine.", 'start': 341.758, 'duration': 1.761}, {'end': 349.223, 'text': "Here's one more example from a paper of mine from 2014 called the fragment embeddings.", 'start': 344.14, 'duration': 5.083}, {'end': 352.445, 'text': 'And here what I was doing is I was aligning images and text.', 'start': 350.104, 'duration': 2.341}, {'end': 355.827, 'text': "And so it's kind of like CLIP if you're familiar with it.", 'start': 353.746, 'duration': 2.081}], 'summary': 'Gradients were computed and applied by hand for parameter updates because autograd engines were uncommon; a 2014 example aligns images and text.', 'duration': 27.222, 'max_score': 328.605, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI328605.jpg'}], 'start': 0.049, 'title': 'Understanding backpropagation and autograd in deep learning', 'summary': 'Covers the importance of understanding backpropagation, challenges associated, need to manually write backward pass, and the value of implementing it for better debugging and comprehension of neural networks, citing historical examples from the deep learning field.', 'chapters': [{'end': 172.511, 'start': 0.049, 'title': 'Understanding 
backpropagation in neural networks', 'summary': 'The chapter covers the importance of understanding backpropagation, the challenges associated with it, and the need to manually write the backward pass for better comprehension, with examples of potential issues and bugs in neural network implementation.', 'duration': 172.462, 'highlights': ['The importance of understanding backpropagation', 'Challenges and potential issues with backpropagation', 'Need to manually write backward pass for better comprehension']}, {'end': 366.594, 'start': 172.511, 'title': 'Backpropagation and autograd in deep learning', 'summary': 'The chapter discusses the importance of understanding backpropagation, citing historical examples from the deep learning field and emphasizing the value of implementing it for better debugging and comprehension of neural networks.', 'duration': 194.083, 'highlights': ['Understanding backpropagation aids in debugging neural networks and ensures a comprehensive grasp of the process, leading to improved performance.', "Historically, manually writing the backward pass was standard practice in deep learning, with examples such as Geoff Hinton and Ruslan Salakhutdinov's 2006 paper on Restricted Boltzmann Machines showcasing the prevalent use of MATLAB for training.", 'The chapter also touches on the implementation of contrastive divergence for parameter updates, highlighting the direct manipulation of gradients in the absence of autograd engines.']}], 'duration': 366.545, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI49.jpg', 'highlights': ['Understanding backpropagation aids in debugging neural networks and ensures a comprehensive grasp of the process, leading to improved performance.', 'The importance of understanding backpropagation', 'Challenges and potential issues with backpropagation', 'Need to manually write backward pass for better comprehension', 'The chapter also touches on the implementation of contrastive divergence for 
parameter updates, highlighting the direct manipulation of gradients in the absence of autograd engines.', "Historically, manually writing the backward pass was standard practice in deep learning, with examples such as Geoff Hinton and Ruslan Salakhutdinov's 2006 paper on Restricted Boltzmann Machines showcasing the prevalent use of MATLAB for training."]}, {'end': 761.421, 'segs': [{'end': 398.742, 'src': 'embed', 'start': 367.294, 'weight': 4, 'content': [{'end': 370.376, 'text': 'And I dug up the code from 2014 of how I implemented this.', 'start': 367.294, 'duration': 3.082}, {'end': 373.638, 'text': 'And it was already in NumPy and Python.', 'start': 371.057, 'duration': 2.581}, {'end': 376.32, 'text': "And here I'm implementing the cost function.", 'start': 374.739, 'duration': 1.581}, {'end': 381.283, 'text': 'And it was standard to implement not just the cost, but also the backward pass manually.', 'start': 376.68, 'duration': 4.603}, {'end': 386.49, 'text': "So here I'm calculating the image embeddings, sentence embeddings, the loss function.", 'start': 382.065, 'duration': 4.425}, {'end': 388.612, 'text': 'I calculate the scores.', 'start': 387.811, 'duration': 0.801}, {'end': 390.053, 'text': 'This is the loss function.', 'start': 389.112, 'duration': 0.941}, {'end': 393.337, 'text': 'And then once I have the loss function, I do the backward pass right here.', 'start': 390.774, 'duration': 2.563}, {'end': 398.742, 'text': 'So I backward through the loss function and through the neural net and I append regularization.', 'start': 393.897, 'duration': 4.845}], 'summary': 'Implemented code in NumPy and Python, calculated image and sentence embeddings, loss function, and conducted backward pass.', 'duration': 31.448, 'max_score': 367.294, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI367294.jpg'}, {'end': 424.873, 'src': 'embed', 'start': 402.064, 'weight': 7, 'content': [{'end': 410.267, 'text': 'and 
you would just write out the backward pass and then you would use a gradient checker to make sure that your numerical estimate of the gradient agrees with the one you calculated during the backpropagation.', 'start': 402.064, 'duration': 8.203}, {'end': 415.509, 'text': 'So this was very standard for a long time, but today, of course, it is standard to use an autograd engine.', 'start': 411.027, 'duration': 4.482}, {'end': 421.591, 'text': 'But it was definitely useful, and I think people sort of understood how these neural networks work on a very intuitive level.', 'start': 417.089, 'duration': 4.502}, {'end': 424.873, 'text': "And so I think it's a good exercise again, and this is where we want to be.", 'start': 422.151, 'duration': 2.722}], 'summary': 'Using autograd engine is now standard for neural networks, but understanding backpropagation is useful and intuitive.', 'duration': 22.809, 'max_score': 402.064, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI402064.jpg'}, {'end': 460.459, 'src': 'embed', 'start': 433.696, 'weight': 5, 'content': [{'end': 437.338, 'text': "So we're still going to have a two-layer multi-layer perceptron with a batch normalization layer.", 'start': 433.696, 'duration': 3.642}, {'end': 440.979, 'text': 'So the forward pass will be basically identical to this lecture,', 'start': 437.898, 'duration': 3.081}, {'end': 445.201, 'text': "but here we're going to get rid of loss.backward and instead we're going to write the backward pass manually.", 'start': 440.979, 'duration': 4.222}, {'end': 447.727, 'text': "Now here's the starter code for this lecture.", 'start': 446.226, 'duration': 1.501}, {'end': 450.61, 'text': 'We are becoming a Backprop Ninja in this notebook.', 'start': 448.428, 'duration': 2.182}, {'end': 455.174, 'text': 'And the first few cells here are identical to what we are used to.', 'start': 451.751, 'duration': 3.423}, {'end': 459.558, 'text': 'So we are doing 
some imports, loading the dataset and processing the dataset.', 'start': 455.614, 'duration': 3.944}, {'end': 460.459, 'text': 'None of this changed.', 'start': 459.838, 'duration': 0.621}], 'summary': 'Implementing a two-layer multi-layer perceptron with batch normalization and manual backward pass in neural network training.', 'duration': 26.763, 'max_score': 433.696, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI433696.jpg'}, {'end': 494.979, 'src': 'embed', 'start': 466.306, 'weight': 3, 'content': [{'end': 472.469, 'text': "So in particular, we are going to have the gradients that we estimate manually ourselves, and we're going to have gradients that PyTorch calculates.", 'start': 466.306, 'duration': 6.163}, {'end': 476.751, 'text': "And we're going to be checking for correctness, assuming of course that PyTorch is correct.", 'start': 473.069, 'duration': 3.682}, {'end': 481.313, 'text': 'Then here we have the initialization that we are quite used to.', 'start': 478.952, 'duration': 2.361}, {'end': 487.696, 'text': 'So we have our embedding table for the characters, the first layer, second layer, and a batch normalization in between.', 'start': 481.773, 'duration': 5.923}, {'end': 490.014, 'text': "And here's where we create all the parameters.", 'start': 488.533, 'duration': 1.481}, {'end': 494.979, 'text': 'Now you will note that I changed the initialization a little bit to be small numbers.', 'start': 490.875, 'duration': 4.104}], 'summary': 'Comparing manually estimated and pytorch-calculated gradients for neural network parameters, with adjusted initialization.', 'duration': 28.673, 'max_score': 466.306, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI466306.jpg'}, {'end': 542.397, 'src': 'embed', 'start': 516.996, 'weight': 2, 'content': [{'end': 522.241, 'text': "And so by making it small numbers, I'm trying to unmask those potential 
errors in these calculations.", 'start': 516.996, 'duration': 5.245}, {'end': 526.89, 'text': "You also notice that I'm using B1 in the first layer.", 'start': 523.288, 'duration': 3.602}, {'end': 529.951, 'text': "I'm using a bias despite batch normalization right afterwards.", 'start': 526.91, 'duration': 3.041}, {'end': 538.035, 'text': "So this would typically not be what you do, because we talked about the fact that you don't need a bias, but I'm doing this here just for fun,", 'start': 531.092, 'duration': 6.943}, {'end': 542.397, 'text': "because we're going to have a gradient with respect to it and we can check that we are still calculating it correctly,", 'start': 538.035, 'duration': 4.362}], 'summary': 'Testing small numbers to uncover errors in calculations and using bias despite batch normalization for gradient checking.', 'duration': 25.401, 'max_score': 516.996, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI516996.jpg'}, {'end': 591.398, 'src': 'embed', 'start': 558.132, 'weight': 1, 'content': [{'end': 561.134, 'text': 'Now the reason that the forward pass is longer is for two reasons.', 'start': 558.132, 'duration': 3.002}, {'end': 567.58, 'text': 'Number one here we just had an F dot cross entropy, but here I am bringing back a explicit implementation of the loss function.', 'start': 561.615, 'duration': 5.965}, {'end': 573.036, 'text': "And number two I've broken up the implementation into manageable chunks.", 'start': 568.533, 'duration': 4.503}, {'end': 577.838, 'text': 'So we have a lot more intermediate tensors along the way in the forward pass.', 'start': 573.736, 'duration': 4.102}, {'end': 584.963, 'text': "And that's because we are about to go backwards and calculate the gradients in this back propagation from the bottom to the top.", 'start': 578.279, 'duration': 6.684}, {'end': 591.398, 'text': "So we're going to go upwards and, just like we have, for example, the logProps 
tensor in a forward pass,", 'start': 585.876, 'duration': 5.522}], 'summary': 'The forward pass is longer due to explicit loss function and more intermediate tensors for back propagation.', 'duration': 33.266, 'max_score': 558.132, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI558132.jpg'}, {'end': 701.913, 'src': 'embed', 'start': 674.105, 'weight': 0, 'content': [{'end': 677.789, 'text': 'So, instead of breaking up BatchNorm into all the little, tiny components,', 'start': 674.105, 'duration': 3.684}, {'end': 684.435, 'text': "we're going to use pen and paper and mathematics and calculus to derive the gradient through the BatchNorm layer.", 'start': 677.789, 'duration': 6.646}, {'end': 692.783, 'text': "So we're going to calculate the backward pass through BatchNorm layer in a much more efficient expression instead of backward propagating through all of its little pieces independently.", 'start': 685.116, 'duration': 7.667}, {'end': 695.206, 'text': "So that's going to be exercise three.", 'start': 693.724, 'duration': 1.482}, {'end': 698.491, 'text': "And then in exercise four, we're going to put it all together.", 'start': 696.65, 'duration': 1.841}, {'end': 701.913, 'text': 'And this is the full code of training this two-layer MLP.', 'start': 698.971, 'duration': 2.942}], 'summary': 'Efficiently derive gradient for batchnorm, then integrate in training code.', 'duration': 27.808, 'max_score': 674.105, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI674105.jpg'}], 'start': 367.294, 'title': 'Implementing manual backpropagation', 'summary': 'Discusses the implementation of manual backpropagation through a two-layer multi-layer perceptron with batch normalization, emphasizing the need for manual backpropagation and previewing exercises involving analytical derivation of gradients for efficiency.', 'chapters': [{'end': 415.509, 'start': 367.294, 
'title': 'Implementing cost function and backward pass', 'summary': 'Discusses manually implementing the cost function and backward pass, including calculating image and sentence embeddings, loss function, and backward pass through the neural net, followed by the use of a gradient checker for numerical estimate validation.', 'duration': 48.215, 'highlights': ['Manually implementing the cost function and backward pass, including calculating image and sentence embeddings, loss function, and backward pass through the neural net, followed by the use of a gradient checker for numerical estimate validation.', 'The code was implemented in NumPy and Python in 2014.', 'Standard practice at the time was to manually implement not just the cost, but also the backward pass, and then use a gradient checker to validate the numerical estimate of the gradient.']}, {'end': 761.421, 'start': 417.089, 'title': 'Implementing manual backpropagation', 'summary': 'Covers the implementation of manual backpropagation through a two-layer multi-layer perceptron with batch normalization, introducing the process, emphasizing the need for manual backpropagation, and previewing the exercises involving analytical derivation of gradients for efficiency.', 'duration': 344.332, 'highlights': ['The chapter covers the implementation of manual backpropagation through a two-layer multi-layer perceptron with batch normalization, emphasizing the need for manual backpropagation to understand the neural network training process fully.', 'Exercises involve breaking down the loss and backpropagating through it manually, followed by analytical derivation of gradients for efficiency, and deriving gradients through the BatchNorm layer using mathematics and calculus.', 'The forward pass is expanded to include an explicit implementation of the loss function and intermediate tensors to facilitate the upcoming backpropagation process.', 'Variables are initialized to small random numbers to unmask potential errors in 
gradient calculations, and a bias is intentionally included despite batch normalization to verify correct gradient calculation.', 'The process involves checking the correctness of the manually estimated gradients against PyTorch-calculated gradients using a utility function, ensuring the accuracy of the manual backpropagation implementation.']}], 'duration': 394.127, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI367294.jpg', 'highlights': ['Exercises involve breaking down the loss and backpropagating through it manually, followed by analytical derivation of gradients for efficiency, and deriving gradients through the BatchNorm layer using mathematics and calculus.', 'The forward pass is expanded to include an explicit implementation of the loss function and intermediate tensors to facilitate the upcoming backpropagation process.', 'Variables are initialized to small random numbers to unmask potential errors in gradient calculations, and a bias is intentionally included despite batch normalization to verify correct gradient calculation.', 'The process involves checking the correctness of the manually estimated gradients against PyTorch-calculated gradients using a utility function, ensuring the accuracy of the manual backpropagation implementation.', 'Manually implementing the cost function and backward pass, including calculating image and sentence embeddings, loss function, and backward pass through the neural net, followed by the use of a gradient checker for numerical estimate validation.', 'The chapter covers the implementation of manual backpropagation through a two-layer multi-layer perceptron with batch normalization, emphasizing the need for manual backpropagation to understand the neural network training process fully.', 'The code was implemented in NumPy and Python in 2014.', 'Standard practice at the time was to manually implement not just the cost, but also the backward pass, and then use a 
gradient checker to validate the numerical estimate of the gradient.']}, {'end': 1522.982, 'segs': [{'end': 821.254, 'src': 'embed', 'start': 781.525, 'weight': 0, 'content': [{'end': 783.186, 'text': 'So we are starting here with dlogprobs.', 'start': 781.525, 'duration': 1.661}, {'end': 790.249, 'text': 'Now, dlogprobs will hold the derivative of the loss with respect to all the elements of logprobs.', 'start': 783.826, 'duration': 6.423}, {'end': 796.747, 'text': 'What is inside logprobs? The shape of this is 32 by 27.', 'start': 791.549, 'duration': 5.198}, {'end': 802.257, 'text': "So it's not going to surprise you that dlogprobs should also be an array of size 32 by 27,", 'start': 796.747, 'duration': 5.51}, {'end': 804.782, 'text': 'because we want the derivative of the loss with respect to all of its elements.', 'start': 802.257, 'duration': 2.525}, {'end': 807.787, 'text': 'So the sizes of those are always going to be equal.', 'start': 805.443, 'duration': 2.344}, {'end': 821.254, 'text': 'Now, how does logprobs influence the loss? Loss is negative logprobs indexed with range(n) and Yb, and then the mean of that.', 'start': 809.727, 'duration': 11.527}], 'summary': 'dlogprobs holds the derivative of the loss w.r.t. logprobs (32x27 array). 
loss is -logprobs indexed with range(n) and Yb, then the mean of that.', 'duration': 39.729, 'max_score': 781.525, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI781525.jpg'}, {'end': 966.314, 'src': 'embed', 'start': 936.937, 'weight': 2, 'content': [{'end': 944.48, 'text': "And so you can see that if we don't just have A, B and C, but we have 32 numbers, then d loss by d", 'start': 936.937, 'duration': 7.543}, {'end': 952.103, 'text': 'every one of those numbers is going to be one over N more generally, because N is the size of the batch, 32 in this case.', 'start': 944.48, 'duration': 7.623}, {'end': 961.232, 'text': 'So d loss by d logprobs is negative one over N in all these places.', 'start': 953.143, 'duration': 8.089}, {'end': 966.314, 'text': 'Now, what about the other elements inside logprobs? Because logprobs is a large array.', 'start': 962.292, 'duration': 4.022}], 'summary': 'In a batch of 32 numbers, d loss by d logprobs is -1/n.', 'duration': 29.377, 'max_score': 936.937, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI936937.jpg'}, {'end': 1039.945, 'src': 'embed', 'start': 1009.381, 'weight': 4, 'content': [{'end': 1017.667, 'text': "or let's just say instead of doing this because we don't want to hard-code numbers, let's do torch.zeros_like(logprobs).", 'start': 1009.381, 'duration': 8.286}, {'end': 1021.511, 'text': 'So basically this is going to create an array of zeros exactly in the shape of logprobs.', 'start': 1018.228, 'duration': 3.283}, {'end': 1027.896, 'text': 'And then we need to set the derivative of negative one over n inside exactly these locations.', 'start': 1022.792, 'duration': 5.104}, {'end': 1029.696, 'text': "So here's what we can do.", 'start': 1028.656, 'duration': 1.04}, {'end': 1039.945, 'text': 'dlogprobs, indexed in the identical way, will just be set to -1.0/n.', 'start': 
1030.438, 'duration': 9.507}], 'summary': 'Using torch.zeros_like to create an array of zeros in the shape of logprobs and setting the derivative -1/n inside these locations.', 'duration': 30.564, 'max_score': 1009.381, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI1009381.jpg'}, {'end': 1191.529, 'src': 'embed', 'start': 1162.304, 'weight': 3, 'content': [{'end': 1163.965, 'text': 'we have like a log node.', 'start': 1162.304, 'duration': 1.661}, {'end': 1173.073, 'text': 'it takes in probs and creates logprobs, and dprobs will be the local derivative of that individual operation, log, times', 'start': 1163.965, 'duration': 9.108}, {'end': 1176.816, 'text': 'the derivative of the loss with respect to its output, which in this case is dlogprobs.', 'start': 1173.073, 'duration': 3.743}, {'end': 1179.858, 'text': 'So what is the local derivative of this operation?', 'start': 1177.716, 'duration': 2.142}, {'end': 1184.442, 'text': 'Well, we are taking log element-wise and we can come here and we can see, Wolfram Alpha', 'start': 1180.379, 'duration': 4.063}, {'end': 1187.885, 'text': 'is your friend, that d by dx of log of x is just simply one over x.', 'start': 1184.442, 'duration': 3.443}, {'end': 1191.529, 'text': 'So therefore, in this case, x is probs.', 'start': 1189.127, 'duration': 2.402}], 'summary': 'A log node takes in probs and creates logprobs; its local derivative is 1/x.', 'duration': 29.225, 'max_score': 1162.304, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI1162304.jpg'}, {'end': 1457.435, 'src': 'embed', 'start': 1409.516, 'weight': 1, 'content': [{'end': 1413.078, 'text': "And now we're trying to backpropagate through this operation to counts_sum_inv.", 'start': 1409.516, 'duration': 3.562}, {'end': 1421.464, 'text': "So when we were calculating this derivative, it's important to realize that this looks 
like a single operation,", 'start': 1414.319, 'duration': 7.145}, {'end': 1424.646, 'text': 'but actually is two operations applied sequentially.', 'start': 1421.464, 'duration': 3.182}, {'end': 1434.355, 'text': 'The first operation that PyTorch did is it took this column tensor and replicated it across all the columns, basically 27 times.', 'start': 1425.347, 'duration': 9.008}, {'end': 1436.457, 'text': "So that's the first operation, it's a replication.", 'start': 1434.815, 'duration': 1.642}, {'end': 1438.959, 'text': 'And then the second operation is the multiplication.', 'start': 1436.977, 'duration': 1.982}, {'end': 1441.662, 'text': "So let's first backtrack through the multiplication.", 'start': 1439.54, 'duration': 2.122}, {'end': 1451.708, 'text': 'If these two arrays were of the same size and we just have A and B, both of them three by three, then how do we backpropagate?', 'start': 1442.859, 'duration': 8.849}, {'end': 1452.509, 'text': 'through a multiplication?', 'start': 1451.708, 'duration': 0.801}, {'end': 1457.435, 'text': 'So if we just have scalars and not tensors, then if you have C equals A times B.', 'start': 1453.27, 'duration': 4.165}], 'summary': 'Backpropagating through replication and multiplication operations in tensor calculus.', 'duration': 47.919, 'max_score': 1409.516, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI1409516.jpg'}], 'start': 761.461, 'title': 'Derivatives and backpropagation in neural networks', 'summary': "Discusses the calculation of derivative of loss with respect to logprops, involving the shape of logprops, impact on loss, and the implementation of the derivative using torch.zeros and comparison with the calculated value by pytorch. 
it also covers the backpropagation process in neural networks, including the calculation of local derivatives and the backpropagation through element-wise operations and tensor multiplication, aiming to optimize the network's performance and numerical stability.", 'chapters': [{'end': 1134.358, 'start': 761.461, 'title': 'Derivative of loss with respect to logprops', 'summary': 'Discusses the calculation of the derivative of the loss with respect to logprops, involving the shape of logprops, impact on loss, and the implementation of the derivative using torch.zeros and comparison with the calculated value by pytorch.', 'duration': 372.897, 'highlights': ['The shape of DLogProps is 32 by 27, matching the shape of LogProps, indicating the derivative loss with respect to all of its elements.', 'The derivative of loss with respect to logProps is negative one over N, where N is the size of the batch, in all locations.', "Implementation of the derivative involves setting the derivative of negative one over N in the locations identified by logProps, using Torch.zeros to create an array and then comparing the calculated value with PyTorch's value."]}, {'end': 1522.982, 'start': 1134.978, 'title': 'Backpropagation and derivatives in neural networks', 'summary': "Covers the backpropagation process in neural networks, including the calculation of local derivatives and the backpropagation through element-wise operations and tensor multiplication, aiming to optimize the network's performance and numerical stability.", 'duration': 388.004, 'highlights': ["The backpropagation process involves calculating local derivatives and backpropagating through element-wise operations and tensor multiplication to optimize the network's performance and numerical stability.", 'The local derivative of an operation involving element-wise application of log is 1 over probes, which is then multiplied by the derivative of log probes, effectively boosting the gradient of examples with low assigned 
probabilities.', 'The derivative of countSumInv involves careful consideration of tensor shapes and implicit broadcasting, with PyTorch replicating elements and performing sequential operations, ultimately requiring backpropagation through the replication process.']}], 'duration': 761.521, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI761461.jpg', 'highlights': ['The shape of dlogprobs is 32 by 27, matching the shape of logprobs, since it holds the derivative of the loss with respect to every element of logprobs.', "The backpropagation process involves calculating local derivatives and backpropagating through element-wise operations and tensor multiplication to optimize the network's performance and numerical stability.", 'The derivative of the loss with respect to logprobs is negative one over n, where n is the size of the batch, at the positions of the correct labels, and zero everywhere else.', 'The local derivative of an operation involving element-wise application of log is 1 over probs, which is then multiplied by the derivative of logprobs, effectively boosting the gradient of examples with low assigned probabilities.', "Implementation of the derivative involves using torch.zeros to create an array, setting negative one over n at the positions indexed by the correct labels, and then comparing the calculated value with PyTorch's value.", 'The derivative of countSumInv involves careful consideration of tensor shapes and implicit broadcasting, with PyTorch replicating elements and performing sequential operations, ultimately requiring backpropagation through the replication process.']}, {'end': 2297.021, 'segs': [{'end': 1550.508, 'src': 'embed', 'start': 1523.782, 'weight': 3, 'content': [{'end': 1530.247, 'text': "And we're talking about how the correct thing to do in the backward pass is we need to sum all the gradients that arrive at any one node.", 'start': 1523.782, 'duration': 6.465}, {'end': 1533.61, 'text': 'So across these different branches, the 
gradients would sum.', 'start': 1530.848, 'duration': 2.762}, {'end': 1540.496, 'text': 'So if a node is used multiple times, the gradients for all of its uses sum during backpropagation.', 'start': 1534.931, 'duration': 5.565}, {'end': 1550.508, 'text': 'So here, b1 is used multiple times in all these columns, and therefore the right thing to do here is to sum horizontally across all the rows.', 'start': 1541.621, 'duration': 8.887}], 'summary': 'In backpropagation, gradients arriving at a node should be summed, especially for nodes used multiple times.', 'duration': 26.726, 'max_score': 1523.782, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI1523782.jpg'}, {'end': 2030.996, 'src': 'embed', 'start': 2005.537, 'weight': 4, 'content': [{'end': 2011.421, 'text': "Now we have to be careful here because the shapes again are not the same and so there's an implicit broadcasting happening here.", 'start': 2005.537, 'duration': 5.884}, {'end': 2018.786, 'text': 'So normal logits has this shape 32 by 27, logits does as well, but logit maxes is only 32 by one.', 'start': 2012.182, 'duration': 6.604}, {'end': 2021.788, 'text': "So there's a broadcasting here in the minus.", 'start': 2019.326, 'duration': 2.462}, {'end': 2026.494, 'text': 'Now here, I tried to sort of write out a toy example again.', 'start': 2023.052, 'duration': 3.442}, {'end': 2030.996, 'text': 'We basically have that this is our C equals A minus B.', 'start': 2027.074, 'duration': 3.922}], 'summary': 'Logits have shapes 32x27, but logit maxes is 32x1, demonstrating implicit broadcasting.', 'duration': 25.459, 'max_score': 2005.537, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI2005537.jpg'}, {'end': 2075.617, 'src': 'embed', 'start': 2048.243, 'weight': 0, 'content': [{'end': 2062.349, 'text': "So it's very clear now that the derivatives of every one of these Cs with respect to their inputs 
are one for the corresponding A and it's a negative one for the corresponding B.", 'start': 2048.243, 'duration': 14.106}, {'end': 2072.956, 'text': "And so therefore the derivatives on the c will flow equally to the corresponding a's and then also to the corresponding b's.", 'start': 2062.349, 'duration': 10.607}, {'end': 2075.617, 'text': "but then, in addition to that, the b's are broadcast.", 'start': 2072.956, 'duration': 2.661}], 'summary': "Derivatives of cs w.r.t. inputs are 1 for a and -1 for b, flowing equally to corresponding a's and b's.", 'duration': 27.374, 'max_score': 2048.243, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI2048243.jpg'}, {'end': 2199.547, 'src': 'embed', 'start': 2172.64, 'weight': 1, 'content': [{'end': 2175.784, 'text': 'and logit max is if PyTorch agrees with us.', 'start': 2172.64, 'duration': 3.144}, {'end': 2179.909, 'text': 'So that was the derivative through this line.', 'start': 2176.685, 'duration': 3.224}, {'end': 2186.416, 'text': 'Now, before we move on, I want to pause here briefly and I want to look at these logit maxes and especially their gradients.', 'start': 2181.23, 'duration': 5.186}, {'end': 2190.02, 'text': "We've talked previously, in the previous lecture,", 'start': 2187.418, 'duration': 2.602}, {'end': 2194.804, 'text': "that the only reason we're doing this is for the numerical stability of the softmax that we are implementing here.", 'start': 2190.02, 'duration': 4.784}, {'end': 2199.547, 'text': 'And we talked about how, if you take these logits for any one of these examples,', 'start': 2195.564, 'duration': 3.983}], 'summary': 'Analyzing logit maxes and gradients for numerical stability in softmax implementation.', 'duration': 26.907, 'max_score': 2172.64, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI2172640.jpg'}, {'end': 2307.006, 'src': 'embed', 'start': 2276.831, 'weight': 2, 
'content': [{'end': 2280.633, 'text': 'then you would probably assume that the derivative through here is exactly zero.', 'start': 2276.831, 'duration': 3.802}, {'end': 2288.817, 'text': "So you would be sort of skipping this branch because it's only done for numerical stability.", 'start': 2281.713, 'duration': 7.104}, {'end': 2290.438, 'text': "But it's interesting to see that,", 'start': 2289.397, 'duration': 1.041}, {'end': 2297.021, 'text': "even if you break up everything into the full atoms and you still do the computation as you'd like with respect to numerical stability,", 'start': 2290.438, 'duration': 6.583}, {'end': 2301.863, 'text': 'the correct thing happens and you still get very, very small gradients here,', 'start': 2297.021, 'duration': 4.842}, {'end': 2307.006, 'text': 'basically reflecting the fact that the values of these do not matter with respect to the final loss.', 'start': 2301.863, 'duration': 5.143}], 'summary': 'Even when the computation is broken into atomic operations, the gradient through the logit maxes branch comes out very close to zero, reflecting that these values do not affect the final loss.', 'duration': 30.175, 'max_score': 2276.831, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI2276831.jpg'}], 'start': 1523.782, 'title': 'Backpropagation and gradients in neural networks', 'summary': 'Explains the correct approach in backpropagation, including the summing of gradients at nodes, the preservation of dimensions during summation, and the broadcasting of tensors, with examples of dimension manipulation and chain rule applications. 
It also delves into the process of backpropagation and gradients in PyTorch, discussing the calculation and verification of gradients, element-wise operations, broadcasting, and the impact of logit maxes on numerical stability, with particular focus on the derivative and gradients.', 'chapters': [{'end': 1924.94, 'start': 1523.782, 'title': 'Backpropagation and broadcasting in neural networks', 'summary': 'Explains the correct approach in backpropagation, including the summing of gradients at nodes, the preservation of dimensions during summation, and the broadcasting of tensors, with examples of dimension manipulation and chain rule applications.', 'duration': 401.158, 'highlights': ['The correct thing to do in the backward pass is to sum all the gradients that arrive at any one node.', 'We want to sum in dimension one, but we want to retain this dimension so that the countSumInv and its gradient are going to be exactly the same shape.', "countSumInv needs to be replicated again to correctly multiply dprobs, but that's going to give the correct result."]}, {'end': 2297.021, 'start': 1925.26, 'title': 'Backpropagation and gradients in PyTorch', 'summary': 'Delves into the process of backpropagation and gradients in PyTorch, discussing the calculation and verification of gradients, element-wise operations, broadcasting, and the impact of logit maxes on numerical stability, with particular focus on the derivative and gradients.', 'duration': 371.761, 'highlights': ['The local derivative of e to the x is e to the x, so the local derivative of the element-wise exponentiation that produces counts is counts itself, which can be reused in the backward pass.', 'Explanation of implicit broadcasting happening due to shapes not being the same, particularly highlighting the impact on norm_logits, logits, and logit maxes.', "Detailed explanation of derivatives of Cs with respect to their inputs, showcasing the impact on corresponding A's and B's, as well as the 
additional sum and negative local derivative for B's.", 'The process of obtaining the final dlogits, considering the impact of logit maxes and the need for a second branch into logits for the final derivative of logits.', 'Discussion on the impact of logit maxes on numerical stability, including the expected minimal gradient values and the relevance of backpropagation through this branch for numerical stability.', 'Comparison of expected zero derivative through the branch for numerical stability with the observation of extremely small gradient values, providing insights into the backpropagation process for numerical stability.']}], 'duration': 773.239, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI1523782.jpg', 'highlights': ["Detailed explanation of derivatives of Cs with respect to their inputs, showcasing the impact on corresponding A's and B's, as well as the additional sum and negative local derivative for B's.", 'Discussion on the impact of logit maxes on numerical stability, including the expected minimal gradient values and the relevance of backpropagation through this branch for numerical stability.', 'Comparison of expected zero derivative through the branch for numerical stability with the observation of extremely small gradient values, providing insights into the backpropagation process for numerical stability.', 'The correct thing to do in the backward pass is to sum all the gradients that arrive at any one node.', 'Explanation of implicit broadcasting happening due to shapes not being the same, particularly highlighting the impact on norm_logits, logits, and logit maxes.']}, {'end': 3830.854, 'segs': [{'end': 2373.234, 'src': 'embed', 'start': 2339.957, 'weight': 2, 'content': [{'end': 2344.898, 'text': "But in the backward pass, it's extremely useful to know about where those maximum values occurred.", 'start': 2339.957, 'duration': 4.941}, {'end': 2347.379, 'text': 'And we have the 
indices at which they occurred.', 'start': 2345.758, 'duration': 1.621}, {'end': 2350.599, 'text': 'And this will, of course, help us do the back propagation.', 'start': 2347.919, 'duration': 2.68}, {'end': 2356.981, 'text': 'Because what should the backward pass be here in this case? We have the logit tensor, which is 32 by 27.', 'start': 2351.339, 'duration': 5.642}, {'end': 2358.601, 'text': 'And in each row, we find the maximum value.', 'start': 2356.981, 'duration': 1.62}, {'end': 2361.862, 'text': 'And then that value gets plucked out into logit maxes.', 'start': 2359.001, 'duration': 2.861}, {'end': 2364.849, 'text': 'And so intuitively.', 'start': 2362.748, 'duration': 2.101}, {'end': 2373.234, 'text': 'basically, the derivative flowing through here then should be one times the local derivatives,', 'start': 2364.849, 'duration': 8.385}], 'summary': 'Understanding max values and indices aids backpropagation in the 32x27 logit tensor', 'duration': 33.277, 'max_score': 2339.957, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI2339957.jpg'}, {'end': 2532.678, 'src': 'embed', 'start': 2507.594, 'weight': 0, 'content': [{'end': 2513.217, 'text': 'that is an outcome of a matrix multiplication and a bias offset in this linear layer.', 'start': 2507.594, 'duration': 5.623}, {'end': 2517.415, 'text': "so I've printed out the shapes of all these intermediate tensors.", 'start': 2513.217, 'duration': 4.198}, {'end': 2521.195, 'text': "We see that logits is of course 32 by 27, as we've just seen.", 'start': 2518.175, 'duration': 3.02}, {'end': 2524.516, 'text': 'Then the H here is 32 by 64.', 'start': 2522.116, 'duration': 2.4}, {'end': 2527.097, 'text': 'So these are 64 dimensional hidden states.', 'start': 2524.516, 'duration': 2.581}, {'end': 2532.678, 'text': 'And then this W matrix projects those 64 dimensional vectors into 27 dimensions.', 'start': 2527.897, 'duration': 4.781}], 'summary': 'Outcome: 32x27 
logits, 32x64 hidden states, w: 64d to 27d', 'duration': 25.084, 'max_score': 2507.594, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI2507594.jpg'}, {'end': 3242.621, 'src': 'embed', 'start': 3216.996, 'weight': 1, 'content': [{'end': 3223.119, 'text': 'Now next up, we have the derivative for h already, and we need to backpropagate through tanh into hpreact.', 'start': 3216.996, 'duration': 6.123}, {'end': 3226.04, 'text': 'So we want to derive dhpreact.', 'start': 3224.059, 'duration': 1.981}, {'end': 3229.478, 'text': 'and here we have to backpropagate through a tanh,', 'start': 3227.358, 'duration': 2.12}, {'end': 3234.72, 'text': "and we've already done this in micrograd and we remember that tanh has a very simple backward formula.", 'start': 3229.478, 'duration': 5.242}, {'end': 3239.681, 'text': 'now, unfortunately, if I just put in d by dx of tanh of x into Wolfram Alpha, it lets us down.', 'start': 3234.72, 'duration': 4.961}, {'end': 3242.621, 'text': "it tells us that it's a hyperbolic secant function.", 'start': 3239.681, 'duration': 2.94}], 'summary': 'Derive dhpreact by backpropagating through the tanh function.', 'duration': 25.625, 'max_score': 3216.996, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI3216996.jpg'}, {'end': 3380.304, 'src': 'embed', 'start': 3352.592, 'weight': 3, 'content': [{'end': 3362.497, 'text': 'So you see how bngain and bnbias are 1 by 64, but hpreact and bnraw are 32 by 64.', 'start': 3352.592, 'duration': 9.905}, {'end': 3367.62, 'text': 'So we have to be careful with that and make sure that all the shapes work out fine and that the broadcasting is correctly backpropagated.', 'start': 3362.497, 'duration': 5.123}, {'end': 3370.521, 'text': "So in particular, let's start with dbngain.", 'start': 3368.621, 'duration': 1.9}, {'end': 3377.243, 'text': 'So dbngain should be, and here this is, again, 
element-wise multiply.', 'start': 3371.382, 'duration': 5.861}, {'end': 3380.304, 'text': 'And whenever we have a times b equals c.', 'start': 3377.563, 'duration': 2.741}], 'summary': 'Careful with 1 by 64 and 32 by 64 shapes in backpropagation.', 'duration': 27.712, 'max_score': 3352.592, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI3352592.jpg'}, {'end': 3739.576, 'src': 'embed', 'start': 3712.921, 'weight': 4, 'content': [{'end': 3716.343, 'text': 'so there is a broadcasting happening here that we have to be careful with.', 'start': 3712.921, 'duration': 3.422}, {'end': 3719.364, 'text': 'But it is just an element-wise simple multiplication.', 'start': 3717.083, 'duration': 2.281}, {'end': 3721.065, 'text': 'By now, we should be pretty comfortable with that.', 'start': 3719.504, 'duration': 1.561}, {'end': 3729.029, 'text': 'To get dbndiff, we know that this is just bnvarinv multiplied with dbnraw.', 'start': 3721.685, 'duration': 7.344}, {'end': 3739.576, 'text': 'And conversely, to get dbnvarinv, we need to take bndiff and multiply that by dbnraw.', 'start': 3731.795, 'duration': 7.781}], 'summary': 'The element-wise multiply broadcasts bnvarinv over the batch: dbndiff is bnvarinv times dbnraw, and dbnvarinv is bndiff times dbnraw, summed over the broadcast dimension.', 'duration': 26.655, 'max_score': 3712.921, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI3712921.jpg'}], 'start': 2297.021, 'title': 'Backpropagation in neural networks', 'summary': 
"Discusses backpropagation through logits in pytorch, explaining the importance of knowing maximum values' indices and the implementation of the backpropagation process, while also detailing the shapes of intermediate tensors involved in the process.", 'duration': 274.452, 'highlights': ["The importance of knowing maximum values' indices in the backward pass is explained, emphasizing its usefulness for backpropagation.", 'Explanation of an efficient one-line code for implementing the scattering of dLogit maxes to the correct positions in the logits.', 'Detailed explanation of the shapes of intermediate tensors involved in the backpropagation process, including the dimensions of the logit tensor, hidden states, W matrix, and bias vector.']}, {'end': 3331.072, 'start': 2573.133, 'title': 'Backpropagation in matrix multiplication', 'summary': 'Explains the process of backpropagation in matrix multiplication, demonstrating how to derive derivatives for a, b, c, and providing insights into the backward pass for linear layers and tanh activation functions.', 'duration': 757.939, 'highlights': ['The chapter explains the process of backpropagation in matrix multiplication, demonstrating how to derive derivatives for A, B, C, and providing insights into the backward pass for linear layers and tanh activation functions.', 'The derivative of the loss with respect to A, B, and C is derived through simple reasoning and expressed as a matrix multiplication.', 'The backward pass of a matrix multiplication is determined to be a matrix multiplication, with specific explanations provided for derivatives with respect to A, B, and C.', 'Insights into backpropagation through a linear layer, including the derivation of derivatives for H, W2, and B2.', 'The process of backpropagating through a tanh activation function is explained, including the derivation of the derivative of tanh and its implementation.']}, {'end': 3560.753, 'start': 3331.853, 'title': 'Backpropagation and broadcasting 
in matrix operations', 'summary': 'Explains the process of backpropagation and broadcasting in matrix operations, emphasizing the importance of careful handling of shapes and dimensions to ensure correct calculations and gradient flow in the neural network, with specific emphasis on element-wise multiplication and broadcasting.', 'duration': 228.9, 'highlights': ['The importance of handling shapes and dimensions in backpropagation and broadcasting in neural network calculations.', 'Explanation of element-wise multiplication and the need for careful handling of dimensions and shapes to ensure accurate backpropagation.', 'Handling of broadcasting and replication of tensors to ensure correct gradient flow in the neural network.']}, {'end': 3830.854, 'start': 3560.753, 'title': 'Backpropagation in batch normalization', 'summary': 'Explains the backpropagation process in batch normalization, including the role of the parameters bngain and bnbias and of intermediates like bnmean, bndiff, and bnvar, and the challenges related to broadcasting and variable dependencies.', 'duration': 270.101, 'highlights': ['The backpropagation process in batch normalization involves breaking down the layer into manageable pieces to backpropagate through each line individually.', 'Gradients like dbnvarinv and dbndiff involve element-wise multiplication and careful handling of broadcasting to ensure correct dimensions.', 'dbndiff comes out incorrect after only the first branch, because bndiff feeds two branches in the forward pass and the gradients from both must be summed.']}], 'duration': 1533.833, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI2297021.jpg', 'highlights': ['Detailed explanation of the shapes of intermediate tensors involved in the backpropagation process, including the dimensions of the logit tensor, hidden states, W matrix, and bias vector.', 'The process of backpropagating through a tanh activation function is explained, including 
the derivation of the derivative of tanh and its implementation.', "The importance of knowing maximum values' indices in the backward pass is explained, emphasizing its usefulness for backpropagation.", 'The importance of handling shapes and dimensions in backpropagation and broadcasting in neural network calculations.', 'dbndiff comes out incorrect after only the first branch, because bndiff feeds two branches in the forward pass and the gradients from both must be summed.']}, {'end': 5091.099, 'segs': [{'end': 3858.361, 'src': 'embed', 'start': 3831.514, 'weight': 3, 'content': [{'end': 3834.815, 'text': "It branches out into two branches and we've only done one branch of it.", 'start': 3831.514, 'duration': 3.301}, {'end': 3838.936, 'text': 'We have to continue our backpropagation and eventually come back to bndiff,', 'start': 3835.415, 'duration': 3.521}, {'end': 3842.437, 'text': "and then we'll be able to do a plus equals and get the actual correct gradient.", 'start': 3838.936, 'duration': 3.501}, {'end': 3845.758, 'text': 'For now, it is good to verify that cmp also works.', 'start': 3843.057, 'duration': 2.701}, {'end': 3849.238, 'text': "It doesn't just lie to us and tell us that everything is always correct.", 'start': 3846.058, 'duration': 3.18}, {'end': 3852.759, 'text': 'It can in fact detect when your gradient is not correct.', 'start': 3849.599, 'duration': 3.16}, {'end': 3854.48, 'text': "So that's good to see as well.", 'start': 3853.14, 'duration': 1.34}, {'end': 3858.361, 'text': "Okay, so now we have the derivative here, and we're trying to backpropagate through this line.", 'start': 3855.04, 'duration': 3.321}], 'summary': 'bndiff feeds two branches, so dbndiff stays incomplete until the second branch is backpropagated; cmp correctly flags the partial gradient as wrong.', 'duration': 26.847, 'max_score': 3831.514, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI3831514.jpg'}, {'end': 4019.313, 'src': 'embed', 'start': 3990.239, 'weight': 0, 'content': 
[{'end': 3994.082, 'text': "You can read more about the Bessel's correction and why.", 'start': 3990.239, 'duration': 3.843}, {'end': 4005.066, 'text': 'dividing by n minus one gives you a better estimate of the variance in the case where you have population sizes or samples from a population that are very small and that is indeed the case for us,', 'start': 3994.082, 'duration': 10.984}, {'end': 4012.028, 'text': 'because we are dealing with mini-batches and these mini-batches are a small sample of a larger population, which is the entire training set.', 'start': 4005.066, 'duration': 6.962}, {'end': 4019.313, 'text': 'And so it just turns out that if you just estimate it using 1 over n, that actually almost always underestimates the variance.', 'start': 4012.788, 'duration': 6.525}], 'summary': "Using bessel's correction (dividing by n-1) provides a better estimate of variance for small sample sizes like mini-batches, which tends to underestimate variance when using 1 over n.", 'duration': 29.074, 'max_score': 3990.239, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI3990239.jpg'}, {'end': 4230.953, 'src': 'embed', 'start': 4203.051, 'weight': 2, 'content': [{'end': 4208.135, 'text': "and so hopefully you're noticing that duality, that those two are kind of like the opposites of each other in the forward and backward pass.", 'start': 4203.051, 'duration': 5.084}, {'end': 4211.266, 'text': 'Now, once we understand the shapes,', 'start': 4209.264, 'duration': 2.002}, {'end': 4220.317, 'text': 'the next thing I like to do always is I like to look at a toy example in my head to sort of just like understand roughly how the variable dependencies go in the mathematical formula.', 'start': 4211.266, 'duration': 9.051}, {'end': 4230.953, 'text': 'So here we have a two-dimensional array which we are scaling by a constant, and then we are summing vertically over the columns.', 'start': 4221.538, 'duration': 9.415}], 
'summary': 'Understanding duality in forward and backward pass, analyzing toy example with 2d array.', 'duration': 27.902, 'max_score': 4203.051, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI4203051.jpg'}, {'end': 4875.298, 'src': 'embed', 'start': 4846.677, 'weight': 1, 'content': [{'end': 4849.379, 'text': 'So this is the forward pass and then this is the shapes.', 'start': 4846.677, 'duration': 2.702}, {'end': 4856.724, 'text': 'So remember that the shape here was 32 by 30 and the original shape of EMB was 32 by 3 by 10.', 'start': 4850.02, 'duration': 6.704}, {'end': 4863.569, 'text': 'So this layer in the forward pass, as you recall, did the concatenation of these three 10-dimensional character vectors.', 'start': 4856.724, 'duration': 6.845}, {'end': 4865.81, 'text': 'And so now we just want to undo that.', 'start': 4864.369, 'duration': 1.441}, {'end': 4872.476, 'text': 'So this is actually a relatively straightforward operation, because the backward pass of the what is the view?', 'start': 4866.831, 'duration': 5.645}, {'end': 4875.298, 'text': 'View is just a representation of the array.', 'start': 4873.136, 'duration': 2.162}], 'summary': 'The backward pass involves undoing a 32x30 concatenation operation in a straightforward manner.', 'duration': 28.621, 'max_score': 4846.677, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI4846677.jpg'}], 'start': 3831.514, 'title': "Backpropagation and derivative calculation, bessel's correction, and neural network backward pass", 'summary': "Covers backpropagation and derivative calculation, bessel's correction for variance estimation, and the backward pass through a neural network. 
it discusses the importance of detecting incorrect gradients, train-test mismatch in variance estimation, and efficient backpropagation operations.", 'chapters': [{'end': 3919.662, 'start': 3831.514, 'title': 'Backpropagation and derivative calculation', 'summary': 'Covers backpropagation application, derivative calculation using the power rule, and verification of correctness using cmp, highlighting the importance of detecting incorrect gradients.', 'duration': 88.148, 'highlights': ['The chapter emphasizes the importance of verifying correctness using cmp, which can detect when the gradient is not correct.', 'Explanation of backpropagation through the line and application of the power rule to calculate the derivative, showcasing the process of raising to a power and applying the chain rule.', 'Demonstration of the local and global derivatives to create the chain rule and verification of correctness through uncommenting the check, displaying the correct result.']}, {'end': 4138.594, 'start': 3919.662, 'title': "Bessel's correction in variance estimation", 'summary': "Discusses the use of bessel's correction (dividing by n-1 instead of n) for variance estimation, highlighting the train-test mismatch in using biased and unbiased estimates in batchnormalization, which can lead to discrepancies in variance calculation and suggests the consistent use of the unbiased version.", 'duration': 218.932, 'highlights': ["Bessel's Correction recommends dividing by n-1 for better variance estimation, particularly for small sample sizes, such as mini-batches, to avoid underestimation and bias. 
The biased estimate (1/N) and the unbiased estimate (1/(N-1)) are mixed in the batch normalization paper, leading to a train-test mismatch and potential discrepancies in variance calculation.", 'The use of biased and unbiased estimates in batch normalization introduces a train-test mismatch, with the biased version used during training and the unbiased version during inference, leading to a lack of consistency and potential bugs in the code.', 'The documentation for batch normalization and variance estimation is unclear and misleading, leading to confusion and potential errors in implementation, emphasizing the need for consistent use of the unbiased version for better accuracy.']}, {'end': 4712.703, 'start': 4138.594, 'title': 'Backward pass through neural network', 'summary': 'Explains how to backpropagate through a neural network by scrutinizing shapes, identifying variable dependencies, and performing operations like replication, broadcasting, and scaling to calculate derivatives, ensuring correctness and efficiency.', 'duration': 574.109, 'highlights': ['Understanding shapes and variable dependencies is crucial for backpropagation in a neural network.', 'Recognizing the duality between sum and replication/broadcasting in the forward and backward passes is important for identifying variable reuse and ensuring correctness.', 'Operations like scaling, replication, and broadcasting are key steps in calculating derivatives during the backward pass.']}, {'end': 5091.099, 'start': 4713.784, 'title': 'Backpropagation through linear layers', 'summary': 'Covers backpropagation through linear layers, explaining the steps and calculations involved, including matrix multiplications and shape manipulations, to obtain the derivatives of the respective layers and the final indexing operation, with a focus on maintaining the correct dimensions and ensuring accurate gradient routing.', 'duration': 377.315, 'highlights': ['The process involves matrix 
multiplications and shape manipulations to obtain the derivatives of the respective layers, such as obtaining dembcat by multiplying dhprebn with W1 transposed, dW1 by multiplying embcat transposed with dhprebn, and db1 by summing dhprebn over the batch dimension, emphasizing the importance of matching shapes and maintaining correct dimensions.', "Explains how the view is undone in the backward pass by reinterpreting dembcat in the shape of the original emb, so that demb has the same shape as emb.", 'Elaborates on the indexing operation, explaining the process of routing gradients backwards through the assignment using for loops and the need to deposit the gradients into dC, emphasizing the necessity of addition for multiple occurrences of the same row of C.']}], 'duration': 1259.585, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI3831514.jpg', 'highlights': ["Bessel's Correction recommends dividing by n-1 for better variance estimation, particularly for small sample sizes, such as mini-batches, to avoid underestimation and bias.", 'The process involves matrix multiplications and shape manipulations to obtain the derivatives of the respective layers, emphasizing the importance of matching shapes and maintaining correct dimensions.', 'Understanding shapes and variable dependencies is crucial for backpropagation in a neural network.', 'The chapter emphasizes the importance of verifying correctness using cmp, which can detect when the gradient is not correct.']}, {'end': 5770.791, 'segs': [{'end': 5166.463, 'src': 'embed', 'start': 5131.888, 'weight': 2, 'content': [{'end': 5132.708, 'text': "That's where they are packaged.", 'start': 5131.888, 'duration': 0.82}, {'end': 5138.973, 'text': 'So now we need to go backwards, and we just need to route demb at the position kj.', 'start': 5133.409, 'duration': 5.564}, {'end': 5145.137, 'text': "We now have these derivatives for each position, and it's 
10-dimensional.", 'start': 5140.014, 'duration': 5.123}, {'end': 5147.98, 'text': 'And you just need to go into the correct row of C.', 'start': 5145.978, 'duration': 2.002}, {'end': 5159.801, 'text': 'So dC, rather, at ix is this, but plus equals, because there could be multiple occurrences, like the same row could have been used many, many times.', 'start': 5149.238, 'duration': 10.563}, {'end': 5166.463, 'text': 'And so all of those derivatives will just go backwards through the indexing and they will add.', 'start': 5160.541, 'duration': 5.922}], 'summary': 'The 10-dimensional derivative for each position is routed back through the indexing into the matching row of dC, and repeated rows accumulate by addition.', 'duration': 34.575, 'max_score': 5131.888, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI5131888.jpg'}, {'end': 5329.753, 'src': 'embed', 'start': 5282.614, 'weight': 0, 'content': [{'end': 5286.257, 'text': 'But it will be significantly shorter than whatever we did here,', 'start': 5282.614, 'duration': 3.643}, {'end': 5287.458, 'text': 'where to get to dlogits', 'start': 5286.257, 'duration': 1.201}, {'end': 5289.079, 'text': 'we have to go all the way here.', 'start': 5287.458, 'duration': 1.621}, {'end': 5295.545, 'text': 'So all of this work can be skipped in a much, much simpler mathematical expression that you can implement here.', 'start': 5290.16, 'duration': 5.385}, {'end': 5299.248, 'text': 'So you can give it a shot yourself.', 'start': 5296.446, 'duration': 2.802}, {'end': 5304.933, 'text': 'Basically, look at what exactly is the mathematical expression of loss and differentiate with respect to the logits.', 'start': 5299.248, 'duration': 5.685}, {'end': 5308.116, 'text': 'So let me show you a hint.', 'start': 5306.374, 'duration': 1.742}, {'end': 5314.245, 'text': 'You can of course try it fully yourself, but if not, I can give you some hint of how to get started mathematically.', 'start': 5309.278, 'duration': 4.967}, {'end': 5322.491, 'text': 
"So basically what's happening here is we have logits, then there's a softmax that takes the logits and gives you probabilities.", 'start': 5316.789, 'duration': 5.702}, {'end': 5329.753, 'text': 'Then we are using the identity of the correct next character to pluck out a row of probabilities.', 'start': 5323.231, 'duration': 6.522}], 'summary': 'The loss can be differentiated analytically with respect to the logits, giving a much shorter expression for dlogits than the step-by-step backward pass.', 'duration': 47.139, 'max_score': 5282.614, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI5282614.jpg'}, {'end': 5610.743, 'src': 'heatmap', 'start': 5540.306, 'weight': 0.718, 'content': [{'end': 5542.707, 'text': 'but this otherwise should be the result.', 'start': 5540.306, 'duration': 2.401}, {'end': 5552.692, 'text': "So now, if we verify this, we see that we don't get an exact match, but at the same time the maximum difference between dlogits", 'start': 5542.707, 'duration': 9.985}, {'end': 5559.656, 'text': "from PyTorch and our dlogits here is on the order of 5e-9, so it's a tiny, tiny number.", 'start': 5552.692, 'duration': 6.964}, {'end': 5568.24, 'text': "So because of floating point wonkiness, we don't get the exact bitwise result, but we basically get the correct answer approximately.", 'start': 5559.656, 'duration': 8.584}, {'end': 5576.29, 'text': "Now I'd like to pause here briefly before we move on to the next exercise, because I'd like us to get an intuitive sense of what dlogits is,", 'start': 5569.246, 'duration': 7.044}, {'end': 5579.591, 'text': 'because it has a beautiful and very simple explanation, honestly.', 'start': 5576.29, 'duration': 3.301}, {'end': 5583.814, 'text': "So here, I'm taking dlogits and I'm visualizing it.", 'start': 5580.952, 'duration': 2.862}, {'end': 5587.135, 'text': 'And we can see that we have a batch of 32 examples of 27 characters.', 'start': 5584.434, 'duration': 2.701}, {'end': 5595.458, 'text': 
'And what is dlogits intuitively, right? dlogits is the probabilities matrix in the forward pass.', 'start': 5588.636, 'duration': 6.822}, {'end': 5600.54, 'text': 'But then here, these black squares are the positions of the correct indices where we subtracted a one.', 'start': 5595.998, 'duration': 4.542}, {'end': 5606.261, 'text': 'And so what is this doing, right? These are the derivatives on the logits.', 'start': 5601.94, 'duration': 4.321}, {'end': 5610.743, 'text': "And so let's look at just the first row here.", 'start': 5607.202, 'duration': 3.541}], 'summary': 'Verification shows a maximum difference on the order of 5e-9; dlogits is the probabilities matrix with one subtracted at the correct indices.', 'duration': 70.437, 'max_score': 5540.306, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI5540306.jpg'}], 'start': 5091.099, 'title': 'Backpropagation', 'summary': 'Explains the backward pass through the embedding lookup, which accumulates 10-dimensional derivatives into the rows of dC, and the simplification of backpropagation through the loss by analytical differentiation, giving a much shorter backward pass, with an emphasis on intuitive understanding of dlogits.', 'chapters': [{'end': 5159.801, 'start': 5091.099, 'title': 'Backward pass through the embedding lookup', 'summary': 'Explains the backward pass through the indexing operation, which accumulates the 10-dimensional derivatives into the rows of dC based on the occurrences and index positions in the forward pass of a neural network.', 'duration': 68.702, 'highlights': ['Performing a backward pass that accumulates 10-dimensional derivatives into the rows of dC based on the occurrences and index positions in the forward pass, e.g., dC at ix and demb at k, j.', 'Iterating over all elements of Xb and obtaining the index position, for example, 11 or 14, to deposit the row of C into emb at k, j in the forward pass.', 'Explaining the process of routing demb at the position k, j and updating the correct row of C, e.g., dC at 
ix, with the derivatives for each position.']}, {'end': 5770.791, 'start': 5160.541, 'title': 'Backpropagation and loss optimization', 'summary': 'Explains the simplification of backpropagation through analytical differentiation of the loss with respect to the logits and the implementation of dlogits, giving a much shorter backward pass, with an emphasis on the intuitive understanding of dlogits.', 'duration': 610.25, 'highlights': ['The simplification of backpropagation through analytical differentiation and the implementation of dlogits give a much shorter and simpler backward pass.', 'The intuitive understanding of dlogits is emphasized, showcasing its role as a force in pulling down on incorrect probabilities and pushing up on the probability of the correct character.', 'The explanation of dlogits provides an intuitive understanding of the forces of push and pull in the gradients during the training of a neural network.']}], 'duration': 679.692, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI5091099.jpg', 'highlights': ['Simplification of backpropagation through analytical differentiation and implementation of dlogits give a much shorter and simpler backward pass.', 'Intuitive understanding of dlogits emphasized, showcasing its role in pulling down on incorrect probabilities and pushing up on the probability of the correct character.', 'Explaining the process of routing demb at the position k, j and updating the correct row of C, e.g., dC at ix, with the derivatives for each position.', 'Performing a backward pass that accumulates 10-dimensional derivatives into the rows of dC based on the occurrences and index positions in the forward pass.']}, {'end': 6923.003, 'segs': [{'end': 6085.435, 'src': 'embed', 'start': 6052.455, 'weight': 3, 'content': [{'end': 6055.175, 'text': 'And remember that sigma square is just a single individual number here.', 'start': 6052.455, 'duration': 2.72}, {'end': 6069.338, 'text': 'So when 
we look at the expression for dL by d sigma square, we have to actually consider all the possible paths that we basically have,', 'start': 6056.016, 'duration': 13.322}, {'end': 6072.499, 'text': "that there's many x hats and they all feed off from,", 'start': 6069.338, 'duration': 3.161}, {'end': 6073.959, 'text': 'they all depend on sigma square.', 'start': 6072.499, 'duration': 1.46}, {'end': 6076.419, 'text': 'So sigma square has a large fan out.', 'start': 6074.679, 'duration': 1.74}, {'end': 6079.72, 'text': "There's lots of arrows coming out from sigma square into all the x hats.", 'start': 6076.539, 'duration': 3.181}, {'end': 6085.435, 'text': "and then there's a backpropagating signal from each x-hat into sigma-square,", 'start': 6080.774, 'duration': 4.661}], 'summary': 'The variable sigma square has a large fan out with many x hats depending on it.', 'duration': 32.98, 'max_score': 6052.455, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI6052455.jpg'}, {'end': 6247.039, 'src': 'embed', 'start': 6217.82, 'weight': 2, 'content': [{'end': 6223.584, 'text': 'But if it is the special case that mu is actually equal to the average, as it is in the case of batch normalization,', 'start': 6217.82, 'duration': 5.764}, {'end': 6225.445, 'text': 'that gradient will actually vanish and become zero.', 'start': 6223.584, 'duration': 1.861}, {'end': 6231.73, 'text': 'So the whole term cancels, and we just get a fairly straightforward expression here for dL by d mu.', 'start': 6226.306, 'duration': 5.424}, {'end': 6238.47, 'text': "Okay, and now we get to the craziest part, which is deriving dL by dxi, which is ultimately what we're after.", 'start': 6232.423, 'duration': 6.047}, {'end': 6247.039, 'text': "Now, let's count, first of all, how many numbers are there inside x? 
As I mentioned, there are 32 numbers, there are 32 little xi's.", 'start': 6239.31, 'duration': 7.729}], 'summary': "In batch normalization, mu equals the batch average, so the term of dL by d mu that flows through sigma square vanishes, leaving a straightforward expression for dL by d mu. There are 32 little xi's inside x.", 'duration': 29.219, 'max_score': 6217.82, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI6217820.jpg'}, {'end': 6523.329, 'src': 'embed', 'start': 6495.146, 'weight': 0, 'content': [{'end': 6499.831, 'text': "And also, in addition, what we're using is these xi hats and xj hats, and they just come from the forward pass.", 'start': 6495.146, 'duration': 4.685}, {'end': 6509.442, 'text': "And otherwise this is a simple expression and it gives us dL by d xi for all the i's, and that's ultimately what we're interested in.", 'start': 6501.218, 'duration': 8.224}, {'end': 6513.184, 'text': "So that's the end of batch norm backward pass,", 'start': 6509.442, 'duration': 3.742}, {'end': 6516.906, 'text': "analytically. Let's now implement this final result.", 'start': 6513.184, 'duration': 3.722}, {'end': 6523.329, 'text': 'Okay, so I implemented the expression into a single line of code here and you can see that the max diff is tiny.', 'start': 6516.906, 'duration': 6.423}], 'summary': "The final expression gives dL by d xi for all i's, reusing the xi hats and xj hats from the forward pass, and is implemented in a single line of code.", 'duration': 28.183, 'max_score': 6495.146, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI6495146.jpg'}, {'end': 6563.458, 'src': 'embed', 'start': 6539.148, 'weight': 1, 'content': [{'end': 6545.49, 'text': 'because you have to consider the fact that this formula here is just for a single neuron and a batch of 32 examples.', 'start': 6539.148, 'duration': 6.342}, {'end': 6549.41, 'text': "But what I'm doing here is I'm actually, we actually have 64 neurons.", 'start': 6546.21, 'duration': 3.2}, {'end': 6556.712, 'text': 
'And so this expression has to evaluate the batch norm backward pass for all of those 64 neurons in parallel and independently.', 'start': 6549.93, 'duration': 6.782}, {'end': 6563.458, 'text': 'So this has to happen basically in every single column of the inputs here.', 'start': 6557.472, 'duration': 5.986}], 'summary': 'The single-neuron formula must be evaluated in parallel and independently for all 64 neurons, that is, for every column of the input, over a batch of 32 examples.', 'duration': 24.31, 'max_score': 6539.148, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI6539148.jpg'}, {'end': 6920.121, 'src': 'heatmap', 'start': 6853.283, 'weight': 1, 'content': [{'end': 6859.149, 'text': 'each one of these layers is like three lines of code or something like that, and most of it is fairly straightforward,', 'start': 6853.283, 'duration': 5.866}, {'end': 6862.993, 'text': 'potentially with the notable exception of the batch normalization backward pass.', 'start': 6859.149, 'duration': 3.844}, {'end': 6866.959, 'text': "Otherwise it's pretty good. Okay, and that's everything I wanted to cover for this lecture.", 'start': 6862.993, 'duration': 3.966}, {'end': 6869.399, 'text': 'So hopefully you found this interesting.', 'start': 6867.599, 'duration': 1.8}, {'end': 6874.941, 'text': 'And what I liked about it, honestly, is that it gave us a very nice diversity of layers to backpropagate through.', 'start': 6869.799, 'duration': 5.142}, {'end': 6881.843, 'text': 'And I think it gives a pretty nice and comprehensive sense of how these backward passes are implemented and how they work.', 'start': 6875.661, 'duration': 6.182}, {'end': 6884.203, 'text': "And you'd be able to derive them yourself.", 'start': 6882.483, 'duration': 1.72}, {'end': 6887.804, 'text': "But of course, in practice, you probably don't want to, and you want to use the PyTorch autograd.", 'start': 6884.283, 'duration': 3.521}, {'end': 6893.506, 'text': 'But hopefully you have some intuition about how gradients flow 
backwards through the neural net, starting at the loss,', 'start': 6888.284, 'duration': 5.222}, {'end': 6897.308, 'text': 'and how they flow through all the variables and all the intermediate results.', 'start': 6894.326, 'duration': 2.982}, {'end': 6901.85, 'text': 'And if you understood a good chunk of it and if you have a sense of that,', 'start': 6898.448, 'duration': 3.402}, {'end': 6907.174, 'text': 'then you can count yourself as one of these buff doges on the left instead of the doges on the right here.', 'start': 6901.85, 'duration': 5.324}, {'end': 6914.558, 'text': "Now, in the next lecture, we're actually going to go to recurrent neural nets, LSTMs, and all the other variants of RNNs.", 'start': 6908.054, 'duration': 6.504}, {'end': 6920.121, 'text': "And we're going to start to complexify the architecture and start to achieve better log likelihoods.", 'start': 6915.278, 'duration': 4.843}], 'summary': 'The lecture covered backpropagation through diverse layers and emphasized understanding how gradients flow backwards through neural networks.', 'duration': 66.838, 'max_score': 6853.283, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI6853283.jpg'}, {'end': 6914.558, 'src': 'embed', 'start': 6882.483, 'weight': 4, 'content': [{'end': 6884.203, 'text': "And you'd be able to derive them yourself.", 'start': 6882.483, 'duration': 1.72}, {'end': 6887.804, 'text': "But of course, in practice, you probably don't want to, and you want to use the PyTorch autograd.", 'start': 6884.283, 'duration': 3.521}, {'end': 6893.506, 'text': 'But hopefully you have some intuition about how gradients flow backwards through the neural net, starting at the loss,', 'start': 6888.284, 'duration': 5.222}, {'end': 6897.308, 'text': 'and how they flow through all the variables and all the intermediate results.', 'start': 6894.326, 'duration': 2.982}, {'end': 6901.85, 'text': 'And if you understood a good chunk of it and if you 
have a sense of that,', 'start': 6898.448, 'duration': 3.402}, {'end': 6907.174, 'text': 'then you can count yourself as one of these buff doges on the left instead of the doges on the right here.', 'start': 6901.85, 'duration': 5.324}, {'end': 6914.558, 'text': "Now, in the next lecture, we're actually going to go to recurrent neural nets, LSTMs, and all the other variants of RNNs.", 'start': 6908.054, 'duration': 6.504}], 'summary': 'An intuition for how gradients flow backwards from the loss through every intermediate result is the main takeaway of the lecture.', 'duration': 32.075, 'max_score': 6882.483, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI6882483.jpg'}], 'start': 5771.372, 'title': 'Backpropagation and implementation in PyTorch', 'summary': "Explains backpropagation in batch normalization, emphasizing efficient derivation of mathematical formulas for the backward pass, for a batch of 32 examples and using the chain rule. It also covers implementing the backward pass in PyTorch, achieving similar results to using PyTorch's autograd, with a focus on achieving a good loss and understanding the flow of gradients through the neural net.", 'chapters': [{'end': 6238.47, 'start': 5771.372, 'title': 'Backpropagation in batch normalization', 'summary': 'Explains the concept of backpropagation in batch normalization, emphasizing the efficient derivation of mathematical formulas for the backward pass and highlighting the impact of different elements such as mu, sigma square, and xi.', 'duration': 467.098, 'highlights': ['The process of backpropagation in batch normalization involves deriving mathematical formulas for the backward pass, particularly focusing on elements such as sigma square, mu, and xi.', 'The expression for dL by d sigma square requires summing over all the possible paths from sigma square to x hats, indicating the extensive influence of sigma square on multiple x hats.', 'The derivation of dL by d mu involves summing up gradients from all x 
hats and sigma square, highlighting the extensive impact of mu on the batch normalization process.', "The special case where mu is equal to the average of xi's leads to the gradient term through sigma square vanishing and becoming zero, resulting in a simplified expression for dL by d mu.", 'Deriving dL by dxi is emphasized as the ultimate goal of the backpropagation process in batch normalization.']}, {'end': 6578.143, 'start': 6239.31, 'title': 'Backpropagation in batch normalization', 'summary': 'Explains the process of backpropagation in batch normalization for a batch of 32 examples, applying the chain rule to derive the final formula and then implementing it.', 'duration': 338.833, 'highlights': ['The process involves 32 examples per neuron and the application of the chain rule to derive the final formula', 'The formula must evaluate the backpropagation for all 64 neurons in parallel and independently', 'The complexity of the formula arises from the need to ensure correct broadcasting and handling of sums']}, {'end': 6923.003, 'start': 6578.223, 'title': 'Implementing backward pass in PyTorch', 'summary': "Covers implementing the backward pass in PyTorch, including manually deriving gradients, optimizing the neural net, and achieving similar results to using PyTorch's autograd, with a focus on achieving a good loss and understanding the flow of gradients through the neural net.", 'duration': 344.78, 'highlights': ["Manually deriving gradients and optimizing the neural net led to achieving a similar loss as using PyTorch's autograd, providing insight into the inner workings of loss.backward. (e.g. achieving a good loss)", 'The process involved re-initializing the neural net, implementing manual backpropagation, and achieving identical results to using loss.backward, demonstrating a comprehensive understanding of the backward pass. (e.g. 
implementing manual backpropagation)', 'The detailed code for the entire backward pass was provided, offering a clear understanding of the implementation, including the role of batch normalization and the efficiency gained by disabling loss.backward. (e.g. detailed code for the backward pass)', 'The lecture provided a diverse range of layers to backpropagate through, offering a comprehensive understanding of the implementation and flow of gradients through the neural net. (e.g. diverse range of layers to backpropagate through)']}], 'duration': 1151.631, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/q8SA3rM6ckI/pics/q8SA3rM6ckI5771372.jpg', 'highlights': ['Deriving dL by dxi is emphasized as the ultimate goal of the backpropagation process in batch normalization.', 'The process involves 32 examples per neuron and the application of the chain rule to derive the final formula.', "The special case where mu is equal to the average of xi's leads to the gradient term through sigma square vanishing and becoming zero, resulting in a simplified expression for dL by d mu.", 'The expression for dL by d sigma square requires summing over all the possible paths from sigma square to x hats, indicating the extensive influence of sigma square on multiple x hats.', "Manually deriving gradients and optimizing the neural net led to achieving a similar loss as using PyTorch's autograd, providing insight into the inner workings of loss.backward. (e.g. 
achieving a good loss)"]}], 'highlights': ['Understanding backpropagation aids in debugging neural networks and ensures a comprehensive grasp of the process, leading to improved performance.', 'Exercises involve breaking down the loss and backpropagating through it manually, followed by analytical derivation of gradients for efficiency, and deriving gradients through the BatchNorm layer using mathematics and calculus.', 'The shape of dlogprobs is 32 by 27, matching the shape of logprobs, since it holds the derivative of the loss with respect to each of its elements.', "Detailed explanation of derivatives of c with respect to its inputs, showcasing the impact on corresponding a's and b's, as well as the additional sum and negative local derivative for b's.", "Bessel's Correction recommends dividing by n-1 for better variance estimation, particularly for small sample sizes, such as mini-batches, to avoid underestimation and bias.", 'Simplification of backpropagation through analytical differentiation and implementation of dlogits give a much shorter and simpler backward pass.', 'Deriving dL by dxi is emphasized as the ultimate goal of the backpropagation process in batch normalization.', 'Performing a backward pass that accumulates 10-dimensional derivatives into the rows of dC based on the occurrences and index positions in the forward pass.']}