title
Building makemore Part 3: Activations & Gradients, BatchNorm
description
We dive into some of the internals of MLPs with multiple layers and scrutinize the statistics of the forward-pass activations, the backward-pass gradients, and some of the pitfalls when they are improperly scaled. We also look at the typical diagnostic tools and visualizations you'd want to use to understand the health of your deep network. We learn why training deep neural nets can be fragile and introduce the first modern innovation that made doing so much easier: Batch Normalization. Residual connections and the Adam optimizer remain notable todos for a later video.
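For concreteness, here is a minimal, self-contained sketch (not the notebook code from the video) of the kind of forward-pass diagnostic discussed: stack a few tanh layers with a deliberately naive weight scale and print the per-layer activation standard deviation and the fraction of saturated tanh units. The layer sizes and the 0.5 scale are arbitrary choices for illustration.

```python
import torch

# Toy forward pass: a few tanh layers with a deliberately large weight scale,
# printing the activation statistics used to judge network health.
torch.manual_seed(42)
x = torch.randn(32, 100)                  # a batch of 32 examples, 100 features
for i in range(3):
    W = torch.randn(100, 100) * 0.5       # naive scale; compare with (5/3) / 100**0.5
    x = torch.tanh(x @ W)
    saturated = (x.abs() > 0.97).float().mean().item()
    print(f"layer {i}: std {x.std().item():.2f}, saturated tanh {saturated:.1%}")
```

With the 0.5 scale, most units end up in the flat tails of tanh near -1 and 1, which is exactly the saturation problem diagnosed in the video; swapping in the (5/3)/sqrt(fan_in) scale keeps the standard deviation near 1 and the saturated fraction small.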
Links:
- makemore on github: https://github.com/karpathy/makemore
- jupyter notebook I built in this video: https://github.com/karpathy/nn-zero-to-hero/blob/master/lectures/makemore/makemore_part3_bn.ipynb
- Colab notebook: https://colab.research.google.com/drive/1H5CSy-OnisagUgDUXhHwo1ng2pjKHYSN?usp=sharing
- my website: https://karpathy.ai
- my twitter: https://twitter.com/karpathy
- Discord channel: https://discord.gg/3zy8kqD9Cp
Useful links:
- "Kaiming init" paper: https://arxiv.org/abs/1502.01852
- BatchNorm paper: https://arxiv.org/abs/1502.03167
- Bengio et al. 2003 MLP language model paper (pdf): https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
- Good paper illustrating some of the problems with batchnorm in practice: https://arxiv.org/abs/2105.07576
Exercises:
- E01: I did not get around to seeing what happens when you initialize all weights and biases to zero. Try this and train the neural net. You might think that either 1) the network trains just fine or 2) the network doesn't train at all, but it turns out that 3) the network trains, but only partially, and achieves a pretty bad final performance. Inspect the gradients and activations to figure out what is happening and why the network is only partially training, and which part is being trained exactly.
- E02: BatchNorm, unlike other normalization layers like LayerNorm/GroupNorm etc., has the big advantage that after training, the batchnorm gamma/beta (together with the running mean/std) can be "folded into" the weights of the preceding Linear layer, eliminating the need to forward it at test time. Set up a small 3-layer MLP with batchnorms, train the network, then "fold" the batchnorm gamma/beta into the preceding Linear layer's W,b by creating a new W2, b2 and erasing the batch norm. Verify that this gives the same forward pass during inference; see the sketch below. In other words, the batchnorm is there just to stabilize training and can be thrown out after training is done. Pretty cool.
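For E02, the algebra of the fold is short enough to sketch. Below is a minimal example for a single Linear -> BatchNorm1d pair, assuming PyTorch modules rather than the hand-rolled layers from the video; the helper name fold_batchnorm and the layer sizes are purely illustrative. At inference time BatchNorm computes gamma * (z - running_mean) / sqrt(running_var + eps) + beta on z = x @ W.T + b, so the scale factor can be absorbed into W and everything else into the bias.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fold_batchnorm(linear: nn.Linear, bn: nn.BatchNorm1d) -> nn.Linear:
    # In eval mode: bn(z) = gamma * (z - mean) / sqrt(var + eps) + beta, with z = x @ W.T + b.
    # Fold: scale each output row of W (and the bias) by s = gamma / sqrt(var + eps), then shift.
    s = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused = nn.Linear(linear.in_features, linear.out_features)
    fused.weight.copy_(linear.weight * s[:, None])
    b = linear.bias if linear.bias is not None else torch.zeros(linear.out_features)
    fused.bias.copy_((b - bn.running_mean) * s + bn.bias)
    return fused

# Sanity check: the fused Linear matches Linear -> BatchNorm at inference time.
lin, bn = nn.Linear(20, 30), nn.BatchNorm1d(30)
bn.eval()                                   # inference mode: use the running statistics
x = torch.randn(8, 20)
print(torch.allclose(bn(lin(x)), fold_batchnorm(lin, bn)(x), atol=1e-6))  # expect True
```

A related observation from the lecture: a bias in a Linear layer that feeds directly into a BatchNorm is redundant, since the mean subtraction cancels it, so those Linear layers are typically built with bias=False.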
Chapters:
00:00:00 intro
00:01:22 starter code
00:04:19 fixing the initial loss
00:12:59 fixing the saturated tanh
00:27:53 calculating the init scale: “Kaiming init”
00:40:40 batch normalization
01:03:07 batch normalization: summary
01:04:50 real example: resnet50 walkthrough
01:14:10 summary of the lecture
01:18:35 just kidding: part 2: PyTorch-ifying the code
01:26:51 viz #1: forward pass activations statistics
01:30:54 viz #2: backward pass gradient statistics
01:32:07 the fully linear case of no non-linearities
01:36:15 viz #3: parameter activation and gradient statistics
01:39:55 viz #4: update:data ratio over time
01:46:04 bringing back batchnorm, looking at the visualizations
01:51:34 summary of the lecture for real this time
detail
{'title': 'Building makemore Part 3: Activations & Gradients, BatchNorm', 'heatmap': [{'end': 2584.43, 'start': 2432.495, 'weight': 0.914}, {'end': 5231.847, 'start': 5147.332, 'weight': 1}, {'end': 6950.252, 'start': 6885.599, 'weight': 0.775}], 'summary': 'Covers the implementation of multilayer perceptron for character-level language modeling, optimization of 11,000 parameters over 200,000 steps with a batch size of 32, the efficiency of using torch.no_grad for improved computation performance, issues with neural net initialization, the impact of batch normalization on training deep neural nets, and the role of batch normalization in controlling activation statistics in neural networks.', 'chapters': [{'end': 201.401, 'segs': [{'end': 30.972, 'src': 'embed', 'start': 0.029, 'weight': 2, 'content': [{'end': 3.492, 'text': 'Hi everyone! Today we are continuing our implementation of MakeMore.', 'start': 0.029, 'duration': 3.463}, {'end': 8.617, 'text': 'Now in the last lecture we implemented the multilayer perceptron along the lines of Bengio et al.', 'start': 4.233, 'duration': 4.384}, {'end': 10.178, 'text': '2003 for character-level language modeling.', 'start': 8.637, 'duration': 1.541}, {'end': 16.303, 'text': 'So we followed this paper, took in a few characters in the past, and used an MLP to predict the next character in a sequence.', 'start': 10.759, 'duration': 5.544}, {'end': 23.671, 'text': "So what we'd like to do now is we'd like to move on to more complex and larger neural networks, like recurrent neural networks and their variations,", 'start': 17.365, 'duration': 6.306}, {'end': 25.372, 'text': 'like the GRU, LSTM and so on.', 'start': 23.671, 'duration': 1.701}, {'end': 30.972, 'text': 'Now, before we do that, though, we have to stick around the level of multilayer perceptron for a bit longer.', 'start': 26.468, 'duration': 4.504}], 'summary': 'Continuing makemore implementation with multilayer perceptron for character-level language modeling, planning to move to larger neural networks.', 'duration': 30.943, 'max_score': 0.029, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc29.jpg'}, {'end': 63.966, 'src': 'embed', 'start': 31.692, 'weight': 1, 'content': [{'end': 38.438, 'text': "And I'd like to do this because I would like us to have a very good intuitive understanding of the activations in the neural net during training,", 'start': 31.692, 'duration': 6.746}, {'end': 42.822, 'text': 'and especially the gradients that are flowing backwards and how they behave and what they look like.', 'start': 38.438, 'duration': 4.384}, {'end': 48.143, 'text': 'This is going to be very important to understand the history of the development of these architectures,', 'start': 43.562, 'duration': 4.581}, {'end': 54.644, 'text': "because we'll see that recurrent neural networks, while they are very expressive in that they are a universal approximator and can, in principle,", 'start': 48.143, 'duration': 6.501}, {'end': 57.945, 'text': 'implement all the algorithms,', 'start': 54.644, 'duration': 3.301}, {'end': 63.966, 'text': "we'll see that they are not very easily optimisable with the first-order gradient-based techniques that we have available to us and that we use all the time.", 'start': 57.945, 'duration': 6.021}], 'summary': 'Understanding neural net activations and gradients in training is crucial for optimizing architectures.', 'duration': 32.274, 'max_score': 31.692, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc31692.jpg'}, {'end': 166.589, 'src': 'embed', 'start': 126.183, 'weight': 0, 'content': [{'end': 130.845, 'text': "And so I've pulled them outside here so that we don't have to go in and change all these magic numbers all the time.", 'start': 126.183, 'duration': 4.662}, {'end': 138.364, 'text': 'We have the same neural net with 11, 000 parameters that we optimize now over 200, 000 steps with a batch size of 32.', 'start': 131.701, 'duration': 6.663}, {'end': 143.526, 'text': "And you'll see that I refactored the code here a little bit, but there are no functional changes.", 'start': 138.364, 'duration': 5.162}, {'end': 148.669, 'text': 'I just created a few extra variables, a few more comments, and I removed all the magic numbers.', 'start': 143.666, 'duration': 5.003}, {'end': 150.99, 'text': "And otherwise it's the exact same thing.", 'start': 149.369, 'duration': 1.621}, {'end': 155.151, 'text': 'Then when we optimize, we saw that our loss looked something like this.', 'start': 152.01, 'duration': 3.141}, {'end': 158.753, 'text': 'We saw that the train and val loss were about 2.16 and so on.', 'start': 156.032, 'duration': 2.721}, {'end': 166.589, 'text': 'Here I refactored the code a little bit for the evaluation of arbitrary splits.', 'start': 161.806, 'duration': 4.783}], 'summary': 'Refactored code to optimize neural net with 11,000 parameters over 200,000 steps, achieving train and val loss of 2.16.', 'duration': 40.406, 'max_score': 126.183, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc126183.jpg'}, {'end': 208.764, 'src': 'embed', 'start': 183.038, 'weight': 5, 'content': [{'end': 190.778, 'text': "One thing that you'll notice here is, I'm using a decorator torch.nograd, which you can also look up and read documentation of.", 'start': 183.038, 'duration': 7.74}, {'end': 201.401, 'text': 'Basically what this decorator does on top of a function is that whatever happens in this function is assumed by torch to never require any gradients.', 'start': 191.398, 'duration': 10.003}, {'end': 208.764, 'text': 'So it will not do any of the bookkeeping that it does to keep track of all the gradients in anticipation of an eventual backward pass.', 'start': 202.042, 'duration': 6.722}], 'summary': 'Using torch.nograd decorator to prevent gradients computation.', 'duration': 25.726, 'max_score': 183.038, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc183038.jpg'}], 'start': 0.029, 'title': 'Implementing and refactoring neural networks for language modeling', 'summary': 'Covers the implementation of multilayer perceptron for character-level language modeling, understanding activations and gradients, as well as refactoring mlp code by optimizing 11,000 parameters over 200,000 steps with a batch size of 32 and using a torch.nograd decorator for improved readability and flexibility.', 'chapters': [{'end': 82.574, 'start': 0.029, 'title': 'Implementing complex neural networks', 'summary': 'Discusses the implementation of multilayer perceptron for character-level language modeling and the need to understand the activations and gradients in neural nets before moving on to more complex networks like recurrent neural networks and their variations.', 'duration': 82.545, 'highlights': ['The chapter emphasizes the importance of understanding the activations and gradients in neural nets before 
transitioning to more complex networks, such as recurrent neural networks and their variations.', 'The lecture implemented multilayer perceptron for character-level language modeling following the methodology of Bengio et al. 2003.', 'The discussion highlights the limitations of first-order gradient-based techniques in optimizing recurrent neural networks due to their complex activations and gradients.']}, {'end': 201.401, 'start': 82.934, 'title': 'Refactoring mlp code for improved readability and flexibility', 'summary': 'Discusses the refactoring of mlp code by removing magic numbers, optimizing 11,000 parameters over 200,000 steps with a batch size of 32, and using a torch.nograd decorator to improve readability and flexibility.', 'duration': 118.467, 'highlights': ['The MLP code is refactored to remove magic numbers and improve flexibility by explicitly defining the dimensionality of the embedding space and the number of hidden units, resulting in the optimization of 11,000 parameters over 200,000 steps with a batch size of 32.', 'The code is cleaned up by creating extra variables, adding comments, and removing magic numbers, while maintaining the same neural net with no functional changes.', 'The refactored code involves the evaluation of arbitrary splits using a decorator torch.nograd, which ensures that the function does not require any gradients, thereby improving readability and flexibility.']}], 'duration': 201.372, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc29.jpg', 'highlights': ['Refactored MLP code optimizes 11,000 parameters over 200,000 steps with batch size 32', 'Importance of understanding activations and gradients in neural nets emphasized', 'MLP implemented for character-level language modeling following Bengio et al. 
2003', 'Limitations of first-order gradient-based techniques in optimizing RNNs highlighted', 'Code cleanup involves creating extra variables, adding comments, and removing magic numbers', 'Refactored code uses torch.nograd decorator for improved readability and flexibility']}, {'end': 862.697, 'segs': [{'end': 240.69, 'src': 'embed', 'start': 202.042, 'weight': 0, 'content': [{'end': 208.764, 'text': 'So it will not do any of the bookkeeping that it does to keep track of all the gradients in anticipation of an eventual backward pass.', 'start': 202.042, 'duration': 6.722}, {'end': 213.743, 'text': "It's almost as if all the tensors that get created here have a requires grad of false.", 'start': 209.679, 'duration': 4.064}, {'end': 217.306, 'text': "And so it just makes everything much more efficient, because you're telling Torch,", 'start': 214.583, 'duration': 2.723}, {'end': 222.49, 'text': "that I will not call dot backward on any of this computation and you don't need to maintain the graph under the hood.", 'start': 217.306, 'duration': 5.184}, {'end': 225.013, 'text': "So that's what this does.", 'start': 223.792, 'duration': 1.221}, {'end': 230.858, 'text': 'And you can also use a context manager with Torch dot no grad and you can look those up.', 'start': 225.653, 'duration': 5.205}, {'end': 240.69, 'text': 'Then here we have the sampling from a model, just as before, just a four-passive neural net, getting the distribution, sampling from it,', 'start': 233.069, 'duration': 7.621}], 'summary': "Using torch's no_grad makes computation more efficient by eliminating gradient tracking.", 'duration': 38.648, 'max_score': 202.042, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc202042.jpg'}, {'end': 290.291, 'src': 'embed', 'start': 262.697, 'weight': 3, 'content': [{'end': 269.1, 'text': "I can tell that our network is very improperly configured at initialization and there's multiple things wrong with it,", 'start': 262.697, 'duration': 6.403}, {'end': 270.241, 'text': "but let's just start with the first one.", 'start': 269.1, 'duration': 1.141}, {'end': 279.626, 'text': 'Look here on the zeroth iteration, the very first iteration, we are recording a loss of 27 and this rapidly comes down to roughly one or two or so.', 'start': 271.241, 'duration': 8.385}, {'end': 283.348, 'text': 'So I can tell that the initialization is all messed up because this is way too high.', 'start': 280.366, 'duration': 2.982}, {'end': 290.291, 'text': 'In training of neural nets, it is almost always the case that you will have a rough idea for what loss to expect at initialization.', 'start': 284.488, 'duration': 5.803}], 'summary': 'Network initialization causes high loss of 27, rapidly decreases to 1 or 2.', 'duration': 27.594, 'max_score': 262.697, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc262697.jpg'}, {'end': 332.008, 'src': 'embed', 'start': 300.663, 'weight': 4, 'content': [{'end': 308.388, 'text': "Basically at initialization, what we'd like is that there's 27 characters that could come next for any one training example.", 'start': 300.663, 'duration': 7.725}, {'end': 312.991, 'text': 'At initialization, we have no reason to believe any characters to be much more likely than others.', 'start': 309.128, 'duration': 3.863}, {'end': 319.215, 'text': "And so we'd expect that the probability distribution that comes out initially is a uniform distribution,", 'start': 313.852, 'duration': 
5.363}, {'end': 322.117, 'text': 'assigning about equal probability to all the 27 characters.', 'start': 319.215, 'duration': 2.902}, {'end': 332.008, 'text': "So basically what we'd like is the probability for any character would be roughly one over 27.", 'start': 323.518, 'duration': 8.49}], 'summary': 'At initialization, the model aims for a uniform distribution among 27 characters, with roughly 1/27 probability for each.', 'duration': 31.345, 'max_score': 300.663, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc300663.jpg'}, {'end': 379.998, 'src': 'embed', 'start': 349.966, 'weight': 6, 'content': [{'end': 355.973, 'text': "And so what's happening right now is that at initialization, the neural net is creating probability distributions that are all messed up.", 'start': 349.966, 'duration': 6.007}, {'end': 359.898, 'text': 'Some characters are very confident and some characters are very not confident.', 'start': 356.353, 'duration': 3.545}, {'end': 370.331, 'text': "And then basically what's happening is that the network is very confidently wrong and that's what makes it record very high loss.", 'start': 360.739, 'duration': 9.592}, {'end': 372.833, 'text': "So here's a smaller four-dimensional example of the issue.", 'start': 370.651, 'duration': 2.182}, {'end': 379.998, 'text': "Let's say we only have four characters, and then we have logits that come out of the neural net, and they are very, very close to zero.", 'start': 373.433, 'duration': 6.565}], 'summary': 'Neural net creates skewed probability distributions leading to high loss.', 'duration': 30.032, 'max_score': 349.966, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc349966.jpg'}, {'end': 756.608, 'src': 'embed', 'start': 728.913, 'weight': 5, 'content': [{'end': 731.515, 'text': "and so there's no hockey stick appearance.", 'start': 728.913, 'duration': 2.602}, {'end': 740.744, 'text': "so good things are happening in that both number one, loss at initialization is what we expect, and the loss doesn't look like a hockey stick,", 'start': 731.515, 'duration': 9.229}, {'end': 744.567, 'text': 'and this is true for any neural net you might train and something to look out for.', 'start': 740.744, 'duration': 3.823}, {'end': 749.167, 'text': 'And second, the loss that came out is actually quite a bit improved.', 'start': 745.666, 'duration': 3.501}, {'end': 751.507, 'text': 'Unfortunately, I erased what we had here before.', 'start': 749.587, 'duration': 1.92}, {'end': 756.608, 'text': 'I believe this was 2.12 and this was 2.16.', 'start': 751.947, 'duration': 4.661}], 'summary': 'Loss at initialization as expected, improved to 2.12-2.16', 'duration': 27.695, 'max_score': 728.913, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc728913.jpg'}], 'start': 202.042, 'title': 'Efficiency of torch.no_grad and neural net initialization issues', 'summary': 'Explains the efficiency of using torch.no_grad for improved computation performance and discusses issues with neural net initialization, including the impact of improper configuration and the need for proper initialization.', 'chapters': [{'end': 240.69, 'start': 202.042, 'title': 'Efficiency of torch.no_grad', 'summary': 'Explains the efficiency of using torch.no_grad to improve computation performance by skipping gradient bookkeeping, resulting in more efficient tensor creation and graph maintenance in torch, as 
well as the use of a context manager with torch.no_grad and sampling from a model with a four-passive neural net.', 'duration': 38.648, 'highlights': ['Skipping gradient bookkeeping using Torch.no_grad improves computation efficiency by avoiding the maintenance of the graph under the hood and setting requires_grad to false for all tensors created, resulting in a more efficient process.', 'The use of a context manager with Torch.no_grad provides additional flexibility and control over gradient computation and management within Torch.', 'Sampling from a model with a four-passive neural net is also demonstrated, showcasing practical application of the discussed concepts in the context of neural network operations.']}, {'end': 862.697, 'start': 240.69, 'title': 'Neural net initialization issues', 'summary': 'Discusses issues with neural net initialization, including high loss due to improper configuration, and the need for proper initialization to ensure expected loss and prevent hockey stick loss behavior during optimization.', 'duration': 622.007, 'highlights': ["The network's improper initialization leads to a loss of 27 on the zeroth iteration, which rapidly decreases to about one or two, indicating the need for proper initialization.", 'At initialization, the probability distribution should be uniform, assigning about equal probability to all characters, resulting in an expected loss of 3.29 instead of 27.', 'Improper initialization leads to the creation of probability distributions with extreme values, causing high losses and incorrect answers.', 'Adjusting the initialization parameters for the network, such as bias and weights, results in closer-to-expected losses and prevents hockey stick loss behavior during optimization.']}], 'duration': 660.655, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc202042.jpg', 'highlights': ['Skipping gradient bookkeeping using torch.no_grad improves computation efficiency by avoiding graph maintenance and setting requires_grad to false for all tensors (efficiency improvement).', 'The use of a context manager with torch.no_grad provides additional flexibility and control over gradient computation and management within torch (flexibility enhancement).', 'Sampling from a model with a four-passive neural net is demonstrated, showcasing practical application of the discussed concepts in the context of neural network operations (practical application).', 'Improper initialization leads to a loss of 27 on the zeroth iteration, rapidly decreasing to about one or two, indicating the need for proper initialization (impact of improper configuration).', 'At initialization, the probability distribution should be uniform, assigning about equal probability to all characters, resulting in an expected loss of 3.29 instead of 27 (importance of proper initialization).', 'Adjusting the initialization parameters for the network, such as bias and weights, results in closer-to-expected losses and prevents hockey stick loss behavior during optimization (impact of proper initialization).', 'Improper initialization leads to the creation of probability distributions with extreme values, causing high losses and incorrect answers (consequences of improper initialization).']}, {'end': 2199.172, 'segs': [{'end': 991.944, 'src': 'embed', 'start': 963.988, 'weight': 1, 'content': [{'end': 968.009, 'text': 'this is the chain rule with the local gradient, which took the form of one minus t squared.', 'start': 963.988, 'duration': 
4.021}, {'end': 973.438, 'text': 'So what happens if the outputs of your tanh are very close to negative one or one?', 'start': 969.096, 'duration': 4.342}, {'end': 979.06, 'text': "If you plug in t equals one here you're gonna get a zero multiplying out.grad.", 'start': 974.158, 'duration': 4.902}, {'end': 986.582, 'text': "No matter what out.grad is, we are killing the gradient and we're stopping effectively the back propagation through this tanh unit.", 'start': 979.78, 'duration': 6.802}, {'end': 991.944, 'text': 'Similarly, when t is negative one, this will again become zero and out.grad just stops.', 'start': 987.503, 'duration': 4.441}], 'summary': 'The chain rule with local gradient is affected when outputs of tanh are close to -1 or 1, leading to a halt in back propagation.', 'duration': 27.956, 'max_score': 963.988, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc963988.jpg'}, {'end': 1118.418, 'src': 'embed', 'start': 1089.279, 'weight': 2, 'content': [{'end': 1095.083, 'text': 'the concern here is that if all of these outputs H are in the flat regions of negative one and one,', 'start': 1089.279, 'duration': 5.804}, {'end': 1099.565, 'text': 'then the gradients that are flowing through the network will just get destroyed at this layer.', 'start': 1095.083, 'duration': 4.482}, {'end': 1106.509, 'text': 'Now there is some redeeming quality here and that we can actually get a sense of the problem here as follows.', 'start': 1101.126, 'duration': 5.383}, {'end': 1108.553, 'text': 'I wrote some code here.', 'start': 1107.793, 'duration': 0.76}, {'end': 1118.418, 'text': 'And basically what we want to do here is we want to take a look at H, take the absolute value and see how often it is in the flat region.', 'start': 1109.314, 'duration': 9.104}], 'summary': 'Concern over destructive gradients in flat regions of h outputs, tackled by analyzing absolute values.', 'duration': 29.139, 'max_score': 1089.279, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc1089279.jpg'}, {'end': 1591.633, 'src': 'embed', 'start': 1563.202, 'weight': 0, 'content': [{'end': 1567.404, 'text': 'Okay, so the optimization finished and I rerun the loss and this is the result that we get.', 'start': 1563.202, 'duration': 4.202}, {'end': 1571.466, 'text': 'And then just as a reminder, I put down all the losses that we saw previously in this lecture.', 'start': 1568.144, 'duration': 3.322}, {'end': 1574.85, 'text': 'So we see that we actually do get an improvement here.', 'start': 1572.509, 'duration': 2.341}, {'end': 1579.431, 'text': 'And just as a reminder, we started off with a validation loss of 2.17 when we started.', 'start': 1575.21, 'duration': 4.221}, {'end': 1584.132, 'text': 'By fixing the softmax being confidently wrong, we came down to 2.13.', 'start': 1580.131, 'duration': 4.001}, {'end': 1587.812, 'text': 'And by fixing the 10-inch layer being way too saturated, we came down to 2.10.', 'start': 1584.132, 'duration': 3.68}, {'end': 1591.633, 'text': 'And the reason this is happening, of course, is because our initialization is better.', 'start': 1587.812, 'duration': 3.821}], 'summary': 'Optimization led to improved validation loss from 2.17 to 2.10 by fixing softmax and 10-inch layer issues.', 'duration': 28.431, 'max_score': 1563.202, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc1563202.jpg'}, {'end': 1802.68, 'src': 'embed', 
'start': 1768.903, 'weight': 4, 'content': [{'end': 1771.886, 'text': 'But we see here that the standard deviation has expanded to three.', 'start': 1768.903, 'duration': 2.983}, {'end': 1775.769, 'text': "So the input standard deviation was one, but now we've grown to three.", 'start': 1772.626, 'duration': 3.143}, {'end': 1779.532, 'text': "And so what you're seeing in the histogram is that this Gaussian is expanding.", 'start': 1776.449, 'duration': 3.083}, {'end': 1785.307, 'text': "And so we're expanding this Gaussian from the input.", 'start': 1781.164, 'duration': 4.143}, {'end': 1786.628, 'text': "And we don't want that.", 'start': 1785.828, 'duration': 0.8}, {'end': 1790.051, 'text': 'We want most of the neural nets to have relatively similar activations.', 'start': 1786.668, 'duration': 3.383}, {'end': 1793.213, 'text': 'So unit Gaussian roughly throughout the neural net.', 'start': 1790.611, 'duration': 2.602}, {'end': 1802.68, 'text': 'And so the question is how do we scale these Ws to preserve this distribution to remain a Gaussian?', 'start': 1793.994, 'duration': 8.686}], 'summary': 'Standard deviation expanded to three, aiming for unit gaussian distribution in neural nets.', 'duration': 33.777, 'max_score': 1768.903, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc1768903.jpg'}, {'end': 1847.536, 'src': 'embed', 'start': 1817.394, 'weight': 5, 'content': [{'end': 1821.655, 'text': 'So basically these numbers here in the output Y take on more and more extreme values.', 'start': 1817.394, 'duration': 4.261}, {'end': 1830.557, 'text': "But if we scale it down, like say 0.2, then conversely, this Gaussian is getting smaller and smaller and it's shrinking.", 'start': 1822.615, 'duration': 7.942}, {'end': 1834.018, 'text': 'And you can see that the standard deviation is 0.6.', 'start': 1831.157, 'duration': 2.861}, {'end': 1839.96, 'text': 'And so the question is what do I multiply by here to exactly preserve the standard deviation to be one?', 'start': 1834.018, 'duration': 5.942}, {'end': 1847.536, 'text': 'And it turns out that the correct answer mathematically, when you work out through the variance of this multiplication here,', 'start': 1841.072, 'duration': 6.464}], 'summary': 'Scaling down by 0.2 shrinks gaussian with standard deviation 0.6.', 'duration': 30.142, 'max_score': 1817.394, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc1817394.jpg'}, {'end': 1906.602, 'src': 'embed', 'start': 1863.667, 'weight': 3, 'content': [{'end': 1866.249, 'text': "That's the same as doing a square root.", 'start': 1863.667, 'duration': 2.582}, {'end': 1876.992, 'text': 'So when you divide by the square root of 10, then we see that The output, caution, it has exactly standard deviation of one.', 'start': 1867.273, 'duration': 9.719}, {'end': 1882.776, 'text': 'Now, unsurprisingly, a number of papers have looked into how to best initialize neural networks.', 'start': 1877.532, 'duration': 5.244}, {'end': 1888.619, 'text': 'And in the case of multi-layer perceptrons, we can have fairly deep networks that have these nonlinearities in between.', 'start': 1883.536, 'duration': 5.083}, {'end': 1894.783, 'text': "And we want to make sure that the activations are well-behaved and they don't expand to infinity or shrink all the way to zero.", 'start': 1889.3, 'duration': 5.483}, {'end': 1899.947, 'text': 'And the question is how do we initialize the weights so that these activations 
take on reasonable values throughout the network?', 'start': 1895.384, 'duration': 4.563}, {'end': 1906.602, 'text': 'Now, one paper that has studied this in quite a bit of detail that is often referenced is this paper by Kaiming He et al.', 'start': 1901.021, 'duration': 5.581}], 'summary': 'Initializing neural network weights for well-behaved activations, kaiming he et al.', 'duration': 42.935, 'max_score': 1863.667, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc1863667.jpg'}, {'end': 2215.22, 'src': 'embed', 'start': 2184.887, 'weight': 7, 'content': [{'end': 2189.669, 'text': 'But there are a number of modern innovations that have made everything significantly more stable and more well behaved,', 'start': 2184.887, 'duration': 4.782}, {'end': 2193.07, 'text': "and it's become less important to initialize these networks exactly right.", 'start': 2189.669, 'duration': 3.401}, {'end': 2199.172, 'text': 'And some of those modern innovations, for example, are residual connections, which we will cover in the future,', 'start': 2194.05, 'duration': 5.122}, {'end': 2206.795, 'text': 'the use of a number of normalization layers like, for example, batch normalization layer normalization group normalization.', 'start': 2199.172, 'duration': 7.623}, {'end': 2208.396, 'text': "We're going to go into a lot of these as well.", 'start': 2207.095, 'duration': 1.301}, {'end': 2215.22, 'text': "And number three much better optimizers, not just stochastic gradient descent, the simple optimizer we're basically using here,", 'start': 2209.036, 'duration': 6.184}], 'summary': 'Modern innovations have improved network stability and behavior, including residual connections, normalization layers, and better optimizers.', 'duration': 30.333, 'max_score': 2184.887, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc2184887.jpg'}], 'start': 864.398, 'title': 'Neural network initialization', 'summary': 'Delves into the issues of gradient vanishing, dead neurons, and weight initialization in neural networks, highlighting their impacts and providing insights on addressing these issues. it also showcases the effectiveness of proper initialization through a significant validation loss improvement from 2.17 to 2.10. 
additionally, it discusses the kaiming initialization method and its implications, emphasizing the diminishing importance of precise network initialization due to modern innovations.', 'chapters': [{'end': 1131.465, 'start': 864.398, 'title': 'Tanh activation and gradient vanishing', 'summary': 'Discusses the issue of gradient vanishing due to the distribution of pre-activations in the tanh activation function, impacting the flow of gradients through the neural network, with a focus on the destructive effect on the gradient flow through the tanh layer.', 'duration': 267.067, 'highlights': ['The pre-activations feeding into the tanh layer have a broad distribution between -15 and 15, causing the tanh activation to squash and cap the values to the range of -1 and 1, leading to many extreme values.', 'The destructive effect on the gradient flow through the tanh layer is due to the squashing of gradients when the outputs are close to -1 or 1, resulting in a significant decrease in the gradient flow, ultimately impacting the loss and rendering the weights and biases ineffective.', "The concern lies in the potential destruction of gradients flowing through the network if all tanh outputs are in the flat regions of -1 and 1, highlighting the issue of gradient vanishing and its impact on the network's learning ability."]}, {'end': 1662.399, 'start': 1132.594, 'title': 'Neural network dead neurons', 'summary': 'Illustrates the issue of dead neurons in neural networks, where certain neurons remain inactive due to improper initialization, causing gradients to be zeroed out, impacting network performance. the impact of initialization on network performance is demonstrated through the improvement in validation loss from 2.17 to 2.10 by addressing the issue of saturated neurons.', 'duration': 529.805, 'highlights': ['The issue of dead neurons in neural networks is illustrated, where certain neurons remain inactive due to improper initialization, causing gradients to be zeroed out.', 'The improvement in validation loss from 2.17 to 2.10 is attributed to addressing the issue of saturated neurons, demonstrating the significant impact of initialization on network performance.', 'The impact of initialization on network performance is emphasized, particularly in deeper networks where the problem becomes more complex and less forgiving to errors.']}, {'end': 1928.45, 'start': 1663.119, 'title': 'Neural network weight initialization', 'summary': 'Discusses the importance of carefully setting the scales of weights in neural networks to maintain a gaussian distribution, with a key focus on preserving standard deviation and utilizing the square root of the fan in for weight scaling. the discussion also references the paper by kaiming he et al. 
on weight initialization in convolutional neural networks.', 'duration': 265.331, 'highlights': ['The correct method of scaling the weights to maintain a standard deviation of one involves dividing by the square root of the fan in, which is demonstrated with the example of a 10-dimensional input, resulting in well-behaved activations throughout the network.', 'The expansion of the Gaussian distribution when weights are multiplied by larger numbers, leading to increased standard deviation, is highlighted as a factor to be avoided, with a specific example showing a standard deviation growing to 15.', 'The detrimental impact of weights that cause the Gaussian distribution to shrink, resulting in a smaller standard deviation, is discussed, with the example illustrating a standard deviation of 0.6 when weights are scaled down by a factor of 0.2.', 'The paper by Kaiming He et al. on weight initialization in convolutional neural networks, particularly focusing on the ReLU nonlinearity, is referenced as a comprehensive study for understanding weight initialization in neural networks.']}, {'end': 2199.172, 'start': 1928.45, 'title': 'Kyming initialization in neural networks', 'summary': 'Discusses the kyming initialization method for neural networks, highlighting the need to compensate for the discarding of half the distribution with a gain, the implications for forward and backward passes, and the implementation in pytorch, while emphasizing the diminishing importance of precise network initialization due to modern innovations.', 'duration': 270.722, 'highlights': ['The need to compensate for the discarding of half the distribution with a gain', 'Implications for forward and backward passes', 'Implementation of Kyming initialization in PyTorch', 'Diminishing importance of precise network initialization due to modern innovations']}], 'duration': 1334.774, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc864398.jpg', 'highlights': ['Significant validation loss improvement from 2.17 to 2.10 due to addressing dead neurons', 'Destructive effect on gradient flow through tanh layer due to squashing of gradients', 'Concern about potential destruction of gradients flowing through the network due to flat tanh outputs', 'Proper weight scaling maintains standard deviation of one, demonstrated with a 10-dimensional input', 'Expansion of Gaussian distribution when weights are multiplied by larger numbers leads to increased standard deviation', 'Detrimental impact of weights causing Gaussian distribution to shrink, resulting in a smaller standard deviation', "Kaiming He et al.'s paper on weight initialization in convolutional neural networks referenced as a comprehensive study", 'Diminishing importance of precise network initialization due to modern innovations']}, {'end': 3112.536, 'segs': [{'end': 2225.286, 'src': 'embed', 'start': 2199.172, 'weight': 1, 'content': [{'end': 2206.795, 'text': 'the use of a number of normalization layers like, for example, batch normalization layer normalization group normalization.', 'start': 2199.172, 'duration': 7.623}, {'end': 2208.396, 'text': "We're going to go into a lot of these as well.", 'start': 2207.095, 'duration': 1.301}, {'end': 2215.22, 'text': "And number three much better optimizers, not just stochastic gradient descent, the simple optimizer we're basically using here,", 'start': 2209.036, 'duration': 6.184}, {'end': 2219.042, 'text': 'but slightly more complex optimizers like RMS, prop and especially 
Adam.', 'start': 2215.22, 'duration': 3.822}, {'end': 2225.286, 'text': 'And so all of these modern innovations make it less important for you to precisely calibrate the initialization of the neural net.', 'start': 2219.742, 'duration': 5.544}], 'summary': 'Modern innovations like normalization layers and better optimizers reduce the need for precise neural net initialization.', 'duration': 26.114, 'max_score': 2199.172, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc2199172.jpg'}, {'end': 2584.43, 'src': 'heatmap', 'start': 2432.495, 'weight': 0.914, 'content': [{'end': 2439.74, 'text': 'We have something that is semi-principled and will scale us to much bigger networks and something that we can sort of use as a guide.', 'start': 2432.495, 'duration': 7.245}, {'end': 2445.624, 'text': 'So I mentioned that the precise setting of these initializations is not as important today due to some modern innovations.', 'start': 2440.18, 'duration': 5.444}, {'end': 2450.447, 'text': 'And I think now is a pretty good time to introduce one of those modern innovations, and that is batch normalization.', 'start': 2446.024, 'duration': 4.423}, {'end': 2464.512, 'text': 'So, Batch Normalization came out in 2015 from a team at Google and it was an extremely impactful paper because it made it possible to train very deep neural nets quite reliably and it basically just worked.', 'start': 2451.287, 'duration': 13.225}, {'end': 2467.153, 'text': "So here's what Batch Normalization does and let's implement it.", 'start': 2465.213, 'duration': 1.94}, {'end': 2473.011, 'text': 'Basically, we have these hidden states H pre-act, right?', 'start': 2469.909, 'duration': 3.102}, {'end': 2483.438, 'text': "And we were talking about how we don't want these pre-activation states to be way too small, because then the 10H is not doing anything.", 'start': 2473.792, 'duration': 9.646}, {'end': 2486.78, 'text': "But we don't want them to be too large because then the 10H is saturated.", 'start': 2483.898, 'duration': 2.882}, {'end': 2495.146, 'text': 'In fact, we want them to be roughly Gaussian, so zero mean and a unit or one standard deviation, at least at initialization.', 'start': 2487.601, 'duration': 7.545}, {'end': 2499.692, 'text': 'So the insight from the BatchNormalization paper is okay.', 'start': 2496.129, 'duration': 3.563}, {'end': 2503.614, 'text': "you have these hidden states and you'd like them to be roughly Gaussian.", 'start': 2499.692, 'duration': 3.922}, {'end': 2507.577, 'text': 'then why not take the hidden states and just normalize them to be Gaussian?', 'start': 2503.614, 'duration': 3.963}, {'end': 2509.919, 'text': 'And it sounds kind of crazy.', 'start': 2508.938, 'duration': 0.981}, {'end': 2519.686, 'text': "but you can just do that, because standardizing hidden states so that their unit gaussian is a perfectly differentiable operation, as we'll soon see.", 'start': 2509.919, 'duration': 9.767}, {'end': 2524.53, 'text': 'and so that was kind of like the big insight in this paper, and when i first read it my mind was blown.', 'start': 2519.686, 'duration': 4.844}, {'end': 2529.834, 'text': "because you can just normalize these hidden states and if you'd like unit gaussian states in your network,", 'start': 2524.53, 'duration': 5.304}, {'end': 2534.298, 'text': 'at least initialization you can just normalize them to be unit gaussian.', 'start': 2529.834, 'duration': 4.464}, {'end': 2536.539, 'text': "so let's see how that works.", 'start': 
2534.298, 'duration': 2.241}, {'end': 2540.262, 'text': "so we're going to scroll to our pre-activations here just before they enter into the 10-h.", 'start': 2536.539, 'duration': 3.723}, {'end': 2544.384, 'text': "Now the idea again is remember, we're trying to make these roughly Gaussian.", 'start': 2541.502, 'duration': 2.882}, {'end': 2549.687, 'text': "And that's because if these are way too small numbers, then the tanh here is kind of inactive.", 'start': 2545.024, 'duration': 4.663}, {'end': 2555.85, 'text': 'But if these are very large numbers, then the tanh is way too saturated and gradient in the flow.', 'start': 2550.447, 'duration': 5.403}, {'end': 2558.131, 'text': "So we'd like this to be roughly Gaussian.", 'start': 2556.55, 'duration': 1.581}, {'end': 2566.016, 'text': 'So the insight in batch normalization, again, is that we can just standardize these activations so they are exactly Gaussian.', 'start': 2559.212, 'duration': 6.804}, {'end': 2575.026, 'text': 'So here, hpreact, has a shape of 32 by 200, 32 examples by 200 neurons in the hidden layer.', 'start': 2566.936, 'duration': 8.09}, {'end': 2584.43, 'text': 'So basically what we can do is we can take HPreact and we can just calculate the mean and the mean we want to calculate across the zero dimension.', 'start': 2576.086, 'duration': 8.344}], 'summary': 'Batch normalization enables reliable training of deep neural nets, ensuring roughly gaussian hidden states for improved performance.', 'duration': 151.935, 'max_score': 2432.495, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc2432495.jpg'}, {'end': 2473.011, 'src': 'embed', 'start': 2451.287, 'weight': 0, 'content': [{'end': 2464.512, 'text': 'So, Batch Normalization came out in 2015 from a team at Google and it was an extremely impactful paper because it made it possible to train very deep neural nets quite reliably and it basically just worked.', 'start': 2451.287, 'duration': 13.225}, {'end': 2467.153, 'text': "So here's what Batch Normalization does and let's implement it.", 'start': 2465.213, 'duration': 1.94}, {'end': 2473.011, 'text': 'Basically, we have these hidden states H pre-act, right?', 'start': 2469.909, 'duration': 3.102}], 'summary': 'Batch normalization from 2015 enabled reliable training of deep neural nets.', 'duration': 21.724, 'max_score': 2451.287, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc2451287.jpg'}, {'end': 2529.834, 'src': 'embed', 'start': 2499.692, 'weight': 2, 'content': [{'end': 2503.614, 'text': "you have these hidden states and you'd like them to be roughly Gaussian.", 'start': 2499.692, 'duration': 3.922}, {'end': 2507.577, 'text': 'then why not take the hidden states and just normalize them to be Gaussian?', 'start': 2503.614, 'duration': 3.963}, {'end': 2509.919, 'text': 'And it sounds kind of crazy.', 'start': 2508.938, 'duration': 0.981}, {'end': 2519.686, 'text': "but you can just do that, because standardizing hidden states so that their unit gaussian is a perfectly differentiable operation, as we'll soon see.", 'start': 2509.919, 'duration': 9.767}, {'end': 2524.53, 'text': 'and so that was kind of like the big insight in this paper, and when i first read it my mind was blown.', 'start': 2519.686, 'duration': 4.844}, {'end': 2529.834, 'text': "because you can just normalize these hidden states and if you'd like unit gaussian states in your network,", 'start': 2524.53, 'duration': 5.304}], 'summary': 
'Standardizing hidden states to be unit gaussian is a perfectly differentiable operation, a key insight in the paper.', 'duration': 30.142, 'max_score': 2499.692, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc2499692.jpg'}, {'end': 2630.261, 'src': 'embed', 'start': 2600.976, 'weight': 3, 'content': [{'end': 2604.881, 'text': 'And similarly, we can calculate the standard deviation of these activations.', 'start': 2600.976, 'duration': 3.905}, {'end': 2608.224, 'text': 'And that will also be 1 by 200.', 'start': 2607.003, 'duration': 1.221}, {'end': 2613.951, 'text': 'Now in this paper, they have the sort of prescription here.', 'start': 2608.224, 'duration': 5.727}, {'end': 2620.178, 'text': 'And see here, we are calculating the mean, which is just taking the average value.', 'start': 2614.712, 'duration': 5.466}, {'end': 2623.299, 'text': "of any neuron's activation.", 'start': 2621.458, 'duration': 1.841}, {'end': 2630.261, 'text': "And then their standard deviation is basically kind of like the measure of the spread that we've been using,", 'start': 2623.939, 'duration': 6.322}], 'summary': "Calculating standard deviation of activations, mean is average value of neuron's activation.", 'duration': 29.285, 'max_score': 2600.976, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc2600976.jpg'}, {'end': 2776.776, 'src': 'embed', 'start': 2745.493, 'weight': 4, 'content': [{'end': 2751.417, 'text': "So we'd like this distribution to move around, and we'd like the back propagation to tell us how the distribution should move around.", 'start': 2745.493, 'duration': 5.924}, {'end': 2759.353, 'text': 'And so, in addition to this idea of standardizing the activations at any point in the network,', 'start': 2752.472, 'duration': 6.881}, {'end': 2765.014, 'text': 'we have to also introduce this additional component in the paper here described as scale and shift.', 'start': 2759.353, 'duration': 5.661}, {'end': 2766.415, 'text': 'And so basically,', 'start': 2765.934, 'duration': 0.481}, {'end': 2776.776, 'text': "what we're doing is we're taking these normalized inputs and we are additionally scaling them by some gain and offsetting them by some bias to get our final output from this layer.", 'start': 2766.415, 'duration': 10.361}], 'summary': 'Back propagation guides distribution movement, adding scale and shift for final output', 'duration': 31.283, 'max_score': 2745.493, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc2745493.jpg'}], 'start': 2199.172, 'title': 'Neural net initialization and batch normalization', 'summary': 'Discusses the use of modern innovations such as normalization layers and better optimizers in neural net initialization and introduces the modern innovation of batch normalization, its impact on training deep neural nets, and the process of standardizing hidden states to be gaussian.', 'chapters': [{'end': 2439.74, 'start': 2199.172, 'title': 'Neural net initialization', 'summary': 'Discusses the use of modern innovations such as normalization layers and better optimizers in neural net initialization. 
it also explains a semi-principled approach for initializing neural nets, which resulted in comparable performance without relying on arbitrary magic numbers.', 'duration': 240.568, 'highlights': ['The use of modern innovations like normalization layers (batch normalization, layer normalization, group normalization) and better optimizers (RMSprop, Adam) reduces the importance of precise initialization of neural nets.', 'In practice, the speaker normalizes the weights by the square root of the fan-in, providing a semi-principled approach to initialization.', 'Demonstrates setting the standard deviation of weights by multiplying with gain over the square root of fan-in, resulting in a standard deviation of 0.3 for a specific example, offering a practical method for initialization.', 'After initializing the neural net using the proposed method, the validation loss remains comparable at 2.10, indicating the efficacy of the semi-principled approach for neural net initialization.']}, {'end': 2914.19, 'start': 2440.18, 'title': 'Batch normalization for neural nets', 'summary': 'Introduces the modern innovation of batch normalization, its impact on training deep neural nets, and the process of standardizing hidden states to be gaussian, leading to reliable training and improved results.', 'duration': 474.01, 'highlights': ['The introduction of batch normalization in 2015 by a team at Google made it possible to train very deep neural nets reliably, leading to improved results.', 'Batch normalization standardizes hidden states to be Gaussian, ensuring unit Gaussian states in the network at least at initialization.', 'The process involves calculating the mean and standard deviation of the activations and normalizing or standardizing the values by subtracting the mean and dividing by the standard deviation.', 'In addition to standardizing activations, the introduction of scale and shift allows the network to move the distribution around and optimize the backpropagation.', 'The batch normalization gain and bias values are trained with backpropagation to allow the network to adjust the distribution internally.']}, {'end': 3112.536, 'start': 2914.91, 'title': 'Batch normalization in neural nets', 'summary': 'Discusses the use of batch normalization in neural nets to control the scale of activations, stabilize training, and the trade-off of coupling examples in batches leading to jitter in hidden state activations and logits.', 'duration': 197.626, 'highlights': ['The chapter discusses the use of batch normalization in neural nets to control the scale of activations and stabilize training.', 'The trade-off of coupling examples in batches leads to jitter in hidden state activations and logits.', 'The difficulty of tuning weights in deeper neural nets is highlighted, making it intractable compared to sprinkling batch normalization layers throughout the neural net.']}], 'duration': 913.364, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc2199172.jpg', 'highlights': ['The introduction of batch normalization in 2015 by a team at Google made it possible to train very deep neural nets reliably, leading to improved results.', 'The use of modern innovations like normalization layers (batch normalization, layer normalization, group normalization) and better optimizers (RMSprop, Adam) reduces the importance of precise initialization of neural nets.', 'Batch normalization standardizes hidden states to be Gaussian, ensuring unit Gaussian states in the network at 
least at initialization.', 'The process involves calculating the mean and standard deviation of the activations and normalizing or standardizing the values by subtracting the mean and dividing by the standard deviation.', 'In addition to standardizing activations, the introduction of scale and shift allows the network to move the distribution around and optimize the backpropagation.']}, {'end': 3695.541, 'segs': [{'end': 3157.371, 'src': 'embed', 'start': 3130.043, 'weight': 0, 'content': [{'end': 3135.845, 'text': "And so what that does is that it's effectively padding out any one of these input examples and it's introducing a little bit of entropy.", 'start': 3130.043, 'duration': 5.802}, {'end': 3142.454, 'text': "And because of the padding out, it's actually kind of like a form of data augmentation which we'll cover in the future.", 'start': 3136.605, 'duration': 5.849}, {'end': 3146.5, 'text': "And it's kind of like augmenting the input a little bit and jittering it.", 'start': 3142.474, 'duration': 4.026}, {'end': 3151.447, 'text': 'And that makes it harder for the neural nets to overfit these concrete specific examples.', 'start': 3146.981, 'duration': 4.466}, {'end': 3157.371, 'text': 'So by introducing all this noise, it actually like pads out the examples and it regularizes the neural net.', 'start': 3152.068, 'duration': 5.303}], 'summary': 'Padding input examples adds entropy, augments data, and regularizes neural nets.', 'duration': 27.328, 'max_score': 3130.043, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc3130043.jpg'}, {'end': 3249.933, 'src': 'embed', 'start': 3224.383, 'weight': 1, 'content': [{'end': 3233.373, 'text': 'And some of the reason that it works quite well is again because of this regularizing effect and because it is quite effective at controlling the activations and their distributions.', 'start': 3224.383, 'duration': 8.99}, {'end': 3243.008, 'text': "So that's kind of like the brief story of batch normalization and I'd like to show you one of the other weird sort of outcomes of this coupling.", 'start': 3234.543, 'duration': 8.465}, {'end': 3249.933, 'text': "So here's one of the strange outcomes that I only glossed over previously when I was evaluating the loss on the validation set.", 'start': 3243.769, 'duration': 6.164}], 'summary': 'Batch normalization effectively controls activations and their distributions.', 'duration': 25.55, 'max_score': 3224.383, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc3224383.jpg'}, {'end': 3393.011, 'src': 'embed', 'start': 3365.723, 'weight': 2, 'content': [{'end': 3369.606, 'text': 'And so this batch normalization paper actually introduced one more idea,', 'start': 3365.723, 'duration': 3.883}, {'end': 3376.35, 'text': 'which is that we can estimate the mean and standard deviation in a running manner during training of the neural net.', 'start': 3369.606, 'duration': 6.744}, {'end': 3379.732, 'text': 'And then we can simply just have a single stage of training.', 'start': 3377.11, 'duration': 2.622}, {'end': 3383.995, 'text': 'And on the side of that training, we are estimating the running mean and standard deviation.', 'start': 3380.152, 'duration': 3.843}, {'end': 3385.496, 'text': "So let's see what that would look like.", 'start': 3384.575, 'duration': 0.921}, {'end': 3393.011, 'text': 'Let me basically take the mean here that we are estimating on the batch, and let me call this b and mean on the 
i-th iteration.', 'start': 3386.684, 'duration': 6.327}], 'summary': 'Batch normalization paper introduces running mean and standard deviation estimation during training.', 'duration': 27.288, 'max_score': 3365.723, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc3365723.jpg'}, {'end': 3670.045, 'src': 'embed', 'start': 3646.76, 'weight': 3, 'content': [{'end': 3653.482, 'text': "And this way, we've eliminated the need for this explicit stage of calibration because we are doing it inline over here.", 'start': 3646.76, 'duration': 6.722}, {'end': 3655.942, 'text': "Okay, so we're almost done with batch normalization.", 'start': 3654.242, 'duration': 1.7}, {'end': 3657.923, 'text': "There are only two more notes that I'd like to make.", 'start': 3656.082, 'duration': 1.841}, {'end': 3661.603, 'text': "Number one, I've skipped a discussion over what is this plus epsilon doing here.", 'start': 3658.483, 'duration': 3.12}, {'end': 3666.624, 'text': 'This epsilon is usually like some small fixed number, for example, one in negative five by default.', 'start': 3662.224, 'duration': 4.4}, {'end': 3670.045, 'text': "And what it's doing is that it's basically preventing a division by zero.", 'start': 3667.325, 'duration': 2.72}], 'summary': 'Eliminated explicit calibration stage, almost done with batch normalization, epsilon prevents division by zero.', 'duration': 23.285, 'max_score': 3646.76, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc3646760.jpg'}], 'start': 3112.536, 'title': 'Jittering and batch normalization in neural network training', 'summary': 'Explores the impact of jittering as a regularizer in neural network training, and discusses the significance of batch normalization, its regularizing effect, and its role in controlling activations and their distributions, leading to improved performance and eliminating the need for explicit calibration.', 'chapters': [{'end': 3187.104, 'start': 3112.536, 'title': 'Effect of jittering on neural network training', 'summary': 'Discusses how jittering acts as a regularizer in neural network training, introducing entropy and data augmentation, making it harder for overfitting, and justifying the use of batch normalization despite its undesirable properties.', 'duration': 74.568, 'highlights': ['Jittering introduces entropy and effectively pads out input examples, acting as a form of data augmentation and making it harder for neural nets to overfit.', 'The use of batch normalization is justified as jittering acts as a regularizer, making it harder for overfitting despite the undesirable properties of batch normalization.', 'The coupling of examples in the batch leads to strange results, bugs, and undesirable properties, making it unpopular in neural network training.']}, {'end': 3695.541, 'start': 3187.104, 'title': 'Understanding batch normalization', 'summary': 'Explains the significance of batch normalization, its regularizing effect, and its role in controlling activations and their distributions, along with the method for estimating mean and standard deviation in a running manner, leading to improved performance and eliminating the need for explicit calibration.', 'duration': 508.437, 'highlights': ['The regularizing effect and effectiveness of batch normalization', 'Estimating mean and standard deviation in a running manner', 'Explanation of the plus epsilon term in batch normalization']}], 'duration': 583.005, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc3112536.jpg', 'highlights': ['Jittering introduces entropy, acting as data augmentation and preventing overfitting.', "Batch normalization's regularizing effect justifies its use despite undesirable properties.", "Batch normalization's role in estimating mean and standard deviation in a running manner.", 'Batch normalization eliminates the need for explicit calibration, improving performance.']}, {'end': 4161.566, 'segs': [{'end': 3750.675, 'src': 'embed', 'start': 3720.868, 'weight': 1, 'content': [{'end': 3722.649, 'text': 'And so these biases are not doing anything.', 'start': 3720.868, 'duration': 1.781}, {'end': 3726.753, 'text': "In fact, they're being subtracted out and they don't impact the rest of the calculation.", 'start': 3722.89, 'duration': 3.863}, {'end': 3732.897, 'text': "So if you look at b1.grad, it's actually going to be zero because it's being subtracted out and doesn't actually have any effect.", 'start': 3727.273, 'duration': 5.624}, {'end': 3740.003, 'text': "And so, whenever you're using batch normalization layers, then if you have any weight layers before, like a linear or a conv or something like that,", 'start': 3733.658, 'duration': 6.345}, {'end': 3744.327, 'text': "you're better off coming here and just like not using bias.", 'start': 3740.643, 'duration': 3.684}, {'end': 3750.675, 'text': "so you don't want to use bias and then here you don't want to add it, because that's spurious.", 'start': 3744.327, 'duration': 6.348}], 'summary': 'Biases are subtracted out in batch normalization, reducing their impact to zero.', 'duration': 29.807, 'max_score': 3720.868, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc3720868.jpg'}, {'end': 3833.028, 'src': 'embed', 'start': 3793.778, 'weight': 2, 'content': [{'end': 3798.84, 'text': 'We are using batch normalization to control the statistics of activations in the neural net.', 'start': 3793.778, 'duration': 5.062}, {'end': 3807.923, 'text': 'It is common to sprinkle batch normalization layer across the neural net, and usually we will place it after layers that have multiplications, like,', 'start': 3799.74, 'duration': 8.183}, {'end': 3812.024, 'text': 'for example, a linear layer or a convolutional layer, which we may cover in the future.', 'start': 3807.923, 'duration': 4.101}, {'end': 3821.844, 'text': 'now the batch normalization internally has parameters for the gain and the bias and these are trained using backpropagation.', 'start': 3813.24, 'duration': 8.604}, {'end': 3824.485, 'text': 'it also has two buffers.', 'start': 3821.844, 'duration': 2.641}, {'end': 3831.027, 'text': 'the buffers are the running mean and the running standard deviation,', 'start': 3824.485, 'duration': 6.542}, {'end': 3833.028, 'text': 'and these are not trained using backpropagation.', 'start': 3831.027, 'duration': 2.001}], 'summary': 'Using batch normalization to control activations in neural net, with gain, bias, mean, and standard deviation parameters.', 'duration': 39.25, 'max_score': 3793.778, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc3793778.jpg'}, {'end': 4073.486, 'src': 'embed', 'start': 4045.464, 'weight': 0, 'content': [{'end': 4055.107, 'text': 'We have a weight layer, like a convolution or like a linear layer, batch normalization, and then tanh, which is
non-linearity.', 'start': 4045.464, 'duration': 9.643}, {'end': 4059.169, 'text': 'But basically, a weight layer, a normalization layer, and non-linearity.', 'start': 4055.628, 'duration': 3.541}, {'end': 4064.871, 'text': "And that's the motif that you would be stacking up when you create these deep neural networks, exactly as is done here.", 'start': 4059.669, 'duration': 5.202}, {'end': 4073.486, 'text': "And one more thing I'd like you to notice is that here when they are initializing the conv layers, like conv 1x1, the def for that is right here.", 'start': 4065.68, 'duration': 7.806}], 'summary': 'Creating deep neural networks with weight layers, normalization, and non-linearity, as exemplified in the provided code.', 'duration': 28.022, 'max_score': 4045.464, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc4045464.jpg'}], 'start': 3696.328, 'title': 'Batch normalization and resnet in pytorch', 'summary': 'Discusses the inefficiency of biases in batch normalization layers and the role of batch normalization in controlling activation statistics in neural networks. it also covers the implementation of resnet components in pytorch, emphasizing the motif of stacking weight layer, normalization layer, and non-linearity in deep neural networks.', 'chapters': [{'end': 3877.979, 'start': 3696.328, 'title': 'Optimizing batch normalization in neural networks', 'summary': 'Discusses the inefficiency of adding biases in batch normalization layers, explaining that these biases are subtracted out and have no impact on the calculation, making them wasteful. additionally, it outlines the role of batch normalization in controlling activation statistics in neural networks, including its parameters and buffers and the process of centering and scaling the input data.', 'duration': 181.651, 'highlights': ["Batch normalization biases are subtracted out and don't impact the calculation, making them wasteful. 
For example, b1.grad becomes zero as it's subtracted out and has no effect.", 'Batch normalization is used to control the statistics of activations in neural nets, often placed after layers with multiplications like linear or convolutional layers.', 'Batch normalization has parameters for gain and bias, which are trained using backpropagation, and buffers for mean and standard deviation, which are not trained using backpropagation.', 'The process of batch normalization involves calculating the mean and standard deviation of the input activations, centering the batch to be unit Gaussian, and then offsetting and scaling it by the learned bias and gain.', 'Batch normalization also maintains a running mean and standard deviation of the inputs, which are later used at inference to avoid re-estimating the mean and standard deviation.']}, {'end': 4161.566, 'start': 3879.08, 'title': 'Resnet and pytorch layers', 'summary': 'Discusses resnet, a residual neural network, and the implementation of its components in pytorch, including convolutional layers, batch normalization, and non-linearities, emphasizing the motif of stacking weight layer, normalization layer, and non-linearity in deep neural networks.', 'duration': 282.486, 'highlights': ['ResNet and its components in PyTorch are discussed, including convolutional layers, batch normalization, and non-linearities.', 'The motif of stacking weight layer, normalization layer, and non-linearity in deep neural networks is emphasized.', 'The reasons for not using biases in certain layers, such as after batch normalization, are explained.']}], 'duration': 465.238, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc3696328.jpg', 'highlights': ['The motif of stacking weight layer, normalization layer, and non-linearity in deep neural networks is emphasized.', "Batch normalization biases are subtracted out and don't impact the calculation, making them wasteful.", 'Batch normalization has parameters for gain and bias, which are trained using backpropagation.', 'Batch normalization involves calculating the mean and standard deviation of the input activations, centering the batch to be unit Gaussian, and then offsetting and scaling it by the learned bias and gain.', 'Batch normalization is used to control the statistics of activations in neural nets, often placed after layers with multiplications like linear or convolutional layers.']}, {'end': 5017.462, 'segs': [{'end': 4232.299, 'src': 'embed', 'start': 4205.902, 'weight': 0, 'content': [{'end': 4211.063, 'text': "In the same way, they have a weight and a bias, and they're talking about how they initialize it by default.", 'start': 4205.902, 'duration': 5.161}, {'end': 4219.666, 'text': 'So by default, PyTorch will initialize your weights by taking the fan-in and then doing 1 over fan-in square root.', 'start': 4211.804, 'duration': 7.862}, {'end': 4224.97, 'text': 'And then instead of a normal distribution, they are using a uniform distribution.', 'start': 4220.945, 'duration': 4.025}, {'end': 4230.417, 'text': "So it's very much the same thing, but they are using a one instead of five over three.", 'start': 4225.851, 'duration': 4.566}, {'end': 4232.299, 'text': "So there's no gain being calculated here.", 'start': 4230.597, 'duration': 1.702}], 'summary': 'Pytorch initializes weights using 1/fan-in square root from a uniform distribution.', 'duration': 26.397, 'max_score': 4205.902, 'thumbnail': 
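The two points above — a bias placed directly before a batch normalization layer is subtracted out (its gradient comes out as zero), and deep nets stack the weight layer, normalization layer, non-linearity motif — can be sketched with standard PyTorch modules. Layer sizes are assumed for illustration; this is not the ResNet code being walked through:

```python
import torch
import torch.nn as nn

# weight -> normalization -> non-linearity, repeated; the Linear layers use bias=False
# because the following BatchNorm1d subtracts the per-feature mean, so a bias there
# would be cancelled out and only waste parameters.
model = nn.Sequential(
    nn.Linear(30, 200, bias=False), nn.BatchNorm1d(200), nn.Tanh(),
    nn.Linear(200, 200, bias=False), nn.BatchNorm1d(200), nn.Tanh(),
    nn.Linear(200, 27),  # output layer: no batchnorm after it, so its bias is kept
)
x = torch.randn(32, 30)   # hypothetical batch of 32 examples with 30 features
logits = model(x)

# demonstrating the dead-bias claim on a single layer:
lin = nn.Linear(30, 200)  # bias left on, on purpose
bn = nn.BatchNorm1d(200)
bn(lin(x)).sum().backward()
print(lin.bias.grad.abs().max())  # ~0: the bias is subtracted out by the normalization
```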
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc4205902.jpg'}, {'end': 4300.291, 'src': 'embed', 'start': 4271.935, 'weight': 1, 'content': [{'end': 4277.038, 'text': 'And you basically achieve that by scaling the weights by one over the square root of fan in.', 'start': 4271.935, 'duration': 5.103}, {'end': 4278.758, 'text': "So that's what this is doing.", 'start': 4277.858, 'duration': 0.9}, {'end': 4282.86, 'text': 'And then the second thing is the batch normalization layer.', 'start': 4280.119, 'duration': 2.741}, {'end': 4285.101, 'text': "So let's look at what that looks like in PyTorch.", 'start': 4283.3, 'duration': 1.801}, {'end': 4290.025, 'text': 'So here we have a one-dimensional batch normalization layer, exactly as we are using here.', 'start': 4286.243, 'duration': 3.782}, {'end': 4292.907, 'text': 'And there are a number of keyword arguments going into it as well.', 'start': 4290.966, 'duration': 1.941}, {'end': 4295.008, 'text': 'So we need to know the number of features.', 'start': 4293.627, 'duration': 1.381}, {'end': 4296.589, 'text': 'For us, that is 200.', 'start': 4295.788, 'duration': 0.801}, {'end': 4300.291, 'text': 'And that is needed so that we can initialize these parameters here.', 'start': 4296.589, 'duration': 3.702}], 'summary': 'Scaling weights by 1 over sqrt of fan in, using batch normalization layer in pytorch with 200 features.', 'duration': 28.356, 'max_score': 4271.935, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc4271935.jpg'}, {'end': 4475.036, 'src': 'embed', 'start': 4442.63, 'weight': 2, 'content': [{'end': 4443.791, 'text': 'Otherwise, they link to the paper.', 'start': 4442.63, 'duration': 1.161}, {'end': 4449.334, 'text': "It's the same formula we've implemented, and everything is the same, exactly as we've done here.", 'start': 4443.971, 'duration': 5.363}, {'end': 4453.075, 'text': "Okay, so that's everything that I wanted to cover for this lecture.", 'start': 4450.913, 'duration': 2.162}, {'end': 4460.14, 'text': 'Really what I wanted to talk about is the importance of understanding the activations and the gradients and their statistics in neural networks.', 'start': 4453.895, 'duration': 6.245}, {'end': 4464.703, 'text': 'And this becomes increasingly important, especially as you make your neural networks bigger, larger, and deeper.', 'start': 4460.64, 'duration': 4.063}, {'end': 4468.374, 'text': 'We looked at the distributions, basically at the output layer,', 'start': 4465.913, 'duration': 2.461}, {'end': 4475.036, 'text': 'and we saw that if you have too confident mispredictions because the activations are too messed up at the last layer,', 'start': 4468.374, 'duration': 6.662}], 'summary': 'Importance of understanding activations and gradients in neural networks, especially as they get bigger and deeper.', 'duration': 32.406, 'max_score': 4442.63, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc4442630.jpg'}, {'end': 4590.26, 'src': 'embed', 'start': 4560.762, 'weight': 4, 'content': [{'end': 4562.163, 'text': 'And we saw how batch normalization works.', 'start': 4560.762, 'duration': 1.401}, {'end': 4565.744, 'text': 'This is a layer that you can sprinkle throughout your deep neural net.', 'start': 4563.003, 'duration': 2.741}, {'end': 4570.626, 'text': 'And the basic idea is, if you want roughly Gaussian activations, well,', 'start': 4566.385, 'duration': 4.241}, {'end':
4575.889, 'text': 'then take your activations and take the mean and standard deviation and center your data.', 'start': 4570.626, 'duration': 5.263}, {'end': 4580.411, 'text': 'And you can do that because the centering operation is differentiable.', 'start': 4576.669, 'duration': 3.742}, {'end': 4588.658, 'text': 'But on top of that we actually had to add a lot of bells and whistles, and that gave you a sense of the complexities of the batch normalization layer,', 'start': 4581.431, 'duration': 7.227}, {'end': 4590.26, 'text': "because now we're centering the data.", 'start': 4588.658, 'duration': 1.602}], 'summary': 'Batch normalization centers data for gaussian activations in deep neural networks.', 'duration': 29.498, 'max_score': 4560.762, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc4560762.jpg'}, {'end': 4682.044, 'src': 'embed', 'start': 4652.046, 'weight': 5, 'content': [{'end': 4656.632, 'text': 'some of the other alternatives to these layers are, for example, group normalization or layer normalization,', 'start': 4652.046, 'duration': 4.586}, {'end': 4662.18, 'text': "and those have become more common in more recent deep learning, but we haven't covered those yet.", 'start': 4656.632, 'duration': 5.548}, {'end': 4668.753, 'text': 'But definitely, batch normalization was very influential at the time when it came out, in roughly 2015,', 'start': 4663.269, 'duration': 5.484}, {'end': 4674.738, 'text': 'because it was kind of the first time that you could train reliably much deeper neural nets.', 'start': 4668.753, 'duration': 5.985}, {'end': 4682.044, 'text': 'And fundamentally, the reason for that is because this layer was very effective at controlling the statistics of the activations in a neural net.', 'start': 4675.419, 'duration': 6.625}], 'summary': 'Batch normalization, influential in 2015, enabled reliable training of much deeper neural nets.', 'duration': 29.998, 'max_score': 4652.046, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc4652046.jpg'}], 'start': 4161.566, 'title': 'Pytorch weight initialization and batch normalization', 'summary': 'Discusses weight initialization in pytorch based on fan-in and fan-out, default weight initialization using a uniform distribution, and the use of batch normalization to achieve a roughly gaussian output.
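A short sketch contrasting the manual scaling discussed here (a gain over the square root of fan-in, with the 5/3 gain used for tanh) with PyTorch's default nn.Linear initialization, which draws weights from a uniform distribution whose bound works out to 1 over the square root of fan-in. Sizes are assumptions for illustration:

```python
import torch
import torch.nn as nn

fan_in, fan_out = 200, 200

# manual Kaiming-style init: standard normal scaled by gain / sqrt(fan_in),
# using the tanh gain of 5/3
W = torch.randn(fan_in, fan_out) * (5 / 3) / fan_in**0.5
print(W.std().item())  # ~ (5/3) / sqrt(200) ~ 0.118

# PyTorch's default nn.Linear init: a uniform distribution bounded by 1/sqrt(fan_in),
# i.e. the same 1/sqrt(fan_in) scaling but with no gain and uniform rather than normal
lin = nn.Linear(fan_in, fan_out)
print(lin.weight.min().item(), lin.weight.max().item())  # within +/- 1/sqrt(200) ~ 0.0707
```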
it also details the key parameters and settings for batch normalization, including the number of features, epsilon value, momentum, and device selection, while highlighting the significance of batch normalization in controlling activation statistics in neural networks and its impact on training deeper neural networks.', 'chapters': [{'end': 4292.907, 'start': 4161.566, 'title': 'Pytorch weight initialization and batch normalization', 'summary': 'Discusses the initialization of weights in pytorch based on fan-in and fan-out, default weight initialization using a uniform distribution, and the use of batch normalization to achieve a roughly gaussian output.', 'duration': 131.341, 'highlights': ['PyTorch initializes weights based on fan-in and fan-out, using a uniform distribution instead of a normal distribution, ensuring roughly Gaussian output.', 'The use of batch normalization in PyTorch to achieve a roughly Gaussian output after the linear layer.', 'The initialization of weights in PyTorch is based on fan-in and fan-out, and biases can be disabled, particularly if the layer is followed by a normalization layer like batch norm.']}, {'end': 5017.462, 'start': 4293.627, 'title': 'Understanding batch normalization', 'summary': 'Details the key parameters and settings for batch normalization, including the number of features, epsilon value, momentum, and device selection, while highlighting the significance of batch normalization in controlling activation statistics in neural networks and its impact on training deeper neural networks.', 'duration': 723.835, 'highlights': ['The importance of understanding the activations and the gradients and their statistics in neural networks', 'Significance of batch normalization in controlling activation statistics in neural networks', 'Impact of batch normalization on training deeper neural networks']}], 'duration': 855.896, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc4161566.jpg', 'highlights': ['PyTorch initializes weights based on fan-in and fan-out, using a uniform distribution instead of a normal distribution, ensuring roughly Gaussian output.', 'The use of batch normalization in PyTorch to achieve a roughly Gaussian output after the linear layer.', 'The importance of understanding the activations and the gradients and their statistics in neural networks', 'The initialization of weights in PyTorch is based on fan-in and fan-out, and biases can be disabled, particularly if the layer is followed by a normalization layer like batch norm.', 'Significance of batch normalization in controlling activation statistics in neural networks', 'Impact of batch normalization on training deeper neural networks']}, {'end': 6956.899, 'segs': [{'end': 5072.091, 'src': 'embed', 'start': 5041.431, 'weight': 0, 'content': [{'end': 5044.413, 'text': 'But PyTorch and modules will not have a dot out attribute.', 'start': 5041.431, 'duration': 2.982}, {'end': 5052.364, 'text': 'And finally, here we are updating the buffers using, again, as I mentioned, exponential moving average, given the provided momentum.', 'start': 5045.458, 'duration': 6.906}, {'end': 5056.648, 'text': "And importantly, you'll notice that I'm using the torch.nograd context manager.", 'start': 5053.065, 'duration': 3.583}, {'end': 5064.675, 'text': "And I'm doing this because if we don't use this, then PyTorch will start building out an entire computational graph out of these tensors,", 'start': 5057.409, 'duration': 7.266}, {'end': 5067.237, 
'text': 'because it is expecting that we will eventually call a dot backward.', 'start': 5064.675, 'duration': 2.562}, {'end': 5072.091, 'text': 'but we are never going to be calling that backward on anything that includes running mean and running variance.', 'start': 5068.069, 'duration': 4.022}], 'summary': 'Updating buffers using exponential moving average with torch.nograd context manager.', 'duration': 30.66, 'max_score': 5041.431, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc5041431.jpg'}, {'end': 5231.847, 'src': 'heatmap', 'start': 5147.332, 'weight': 1, 'content': [{'end': 5151.876, 'text': 'Finally, the parameters are basically the embedding matrix and all the parameters in all the layers.', 'start': 5147.332, 'duration': 4.544}, {'end': 5156.12, 'text': "And notice here, I'm using a double list comprehension, if you want to call it that.", 'start': 5152.497, 'duration': 3.623}, {'end': 5164.047, 'text': 'But for every layer in layers and for every parameter in each of those layers, we are just stacking up all those parameters.', 'start': 5156.16, 'duration': 7.887}, {'end': 5168.091, 'text': 'Now, in total, we have 46, 000 parameters.', 'start': 5165.108, 'duration': 2.983}, {'end': 5172.115, 'text': "And I'm telling PyTorch that all of them require gradient.", 'start': 5169.492, 'duration': 2.623}, {'end': 5179.88, 'text': 'Then here, we have everything here we are actually mostly used to.', 'start': 5176.075, 'duration': 3.805}, {'end': 5183.124, 'text': 'We are sampling batch, we are doing forward pass.', 'start': 5180.741, 'duration': 2.383}, {'end': 5188.571, 'text': 'The forward pass now is just a linear application of all the layers in order, followed by the cross entropy.', 'start': 5183.585, 'duration': 4.986}, {'end': 5192.273, 'text': "And then, in the backward pass, you'll notice that for every single layer,", 'start': 5189.451, 'duration': 2.822}, {'end': 5196.595, 'text': "I now iterate over all the outputs and I'm telling PyTorch to retain the gradient of them.", 'start': 5192.273, 'duration': 4.322}, {'end': 5201.738, 'text': 'And then here we are already used to all the gradients set to none.', 'start': 5197.496, 'duration': 4.242}, {'end': 5207.902, 'text': 'do the backward to fill in the gradients, do an update using stochastic gradient, send and then track some statistics.', 'start': 5201.738, 'duration': 6.164}, {'end': 5211.491, 'text': 'and then I am going to break after a single iteration.', 'start': 5208.789, 'duration': 2.702}, {'end': 5214.212, 'text': 'Now, here in this cell, in this diagram,', 'start': 5212.091, 'duration': 2.121}, {'end': 5220.976, 'text': "I'm visualizing the histograms of the forward pass activations and I'm specifically doing it at the tanh layers.", 'start': 5214.212, 'duration': 6.764}, {'end': 5231.847, 'text': 'So iterating over all the layers except for the very last one, which is basically just the softmax layer If it is a 10H layer.', 'start': 5221.897, 'duration': 9.95}], 'summary': 'The model has 46,000 parameters and uses pytorch for forward and backward passes, with visualizations of forward pass activations.', 'duration': 84.515, 'max_score': 5147.332, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc5147332.jpg'}, {'end': 5310.741, 'src': 'embed', 'start': 5284.241, 'weight': 1, 'content': [{'end': 5290.545, 'text': "so the first layer is fairly saturated here at 20%, so you can see that it's got tails 
here.", 'start': 5284.241, 'duration': 6.304}, {'end': 5293.867, 'text': 'but then everything sort of stabilizes and if we had more layers here,', 'start': 5290.545, 'duration': 3.322}, {'end': 5300.772, 'text': 'it would actually just stabilize at around the standard deviation of about 0.65 and the saturation would be roughly 5%.', 'start': 5293.867, 'duration': 6.905}, {'end': 5307.819, 'text': 'and the reason that this stabilizes and gives us a nice distribution here is because gain is set to 5 over 3.', 'start': 5300.772, 'duration': 7.047}, {'end': 5310.741, 'text': 'Now, here this gain.', 'start': 5307.819, 'duration': 2.922}], 'summary': 'Saturation stabilizes at 20% initially, then stabilizes at 5% with a standard deviation of 0.65 due to gain set at 5 over 3.', 'duration': 26.5, 'max_score': 5284.241, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc5284241.jpg'}, {'end': 5630.28, 'src': 'embed', 'start': 5599.005, 'weight': 2, 'content': [{'end': 5603.687, 'text': "That's why before batch normalization, this was incredibly tricky to set.", 'start': 5599.005, 'duration': 4.682}, {'end': 5608.31, 'text': "In particular, if this is too large of a gain, this happens, and if it's too little of a gain.", 'start': 5604.268, 'duration': 4.042}, {'end': 5611.504, 'text': 'then this happens.', 'start': 5609.702, 'duration': 1.802}, {'end': 5622.433, 'text': 'so the opposite of that basically happens here we have a shrinking and a diffusion, depending on which direction you look at it from,', 'start': 5611.504, 'duration': 10.929}, {'end': 5630.28, 'text': "and so certainly this is not what you want, and in this case the correct setting of the gain is exactly one, just like we're doing at initialization.", 'start': 5622.433, 'duration': 7.847}], 'summary': 'Setting gain correctly at 1 prevents shrinking and diffusion in batch normalization.', 'duration': 31.275, 'max_score': 5599.005, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc5599005.jpg'}, {'end': 5951.192, 'src': 'embed', 'start': 5921.638, 'weight': 3, 'content': [{'end': 5925.179, 'text': 'which actually has roughly one in negative two standard deviation of gradients.', 'start': 5921.638, 'duration': 3.541}, {'end': 5935.162, 'text': 'And so the gradients on the last layer are currently about 100 times greater sorry, 10 times greater than all the other weights inside the neural net.', 'start': 5925.939, 'duration': 9.223}, {'end': 5936.803, 'text': "and so that's problematic,", 'start': 5935.982, 'duration': 0.821}, {'end': 5947.249, 'text': 'because in the simple stochastic gradient descent setup you would be training this last layer about 10 times faster than you would be training the other layers at initialization.', 'start': 5936.803, 'duration': 10.446}, {'end': 5951.192, 'text': 'now this actually like kind of fixes itself a little bit if you train for a bit longer.', 'start': 5947.249, 'duration': 3.943}], 'summary': 'The last layer has gradients 100 times greater, causing faster training, but it self-corrects with longer training.', 'duration': 29.554, 'max_score': 5921.638, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc5921638.jpg'}, {'end': 6222.756, 'src': 'embed', 'start': 6198.007, 'weight': 4, 'content': [{'end': 6203.27, 'text': 'But basically I like to look at the evolution of this update ratio for all my parameters usually,', 'start': 6198.007, 
'duration': 5.263}, {'end': 6208.412, 'text': "and I like to make sure that it's not too much above one in negative three roughly.", 'start': 6203.27, 'duration': 5.142}, {'end': 6212.272, 'text': 'So around negative 3 on this log plot.', 'start': 6209.751, 'duration': 2.521}, {'end': 6216.714, 'text': "If it's below negative 3, usually that means that the parameters are not training fast enough.", 'start': 6213.112, 'duration': 3.602}, {'end': 6220.115, 'text': "So if our learning rate was very low, let's do that experiment.", 'start': 6217.454, 'duration': 2.661}, {'end': 6222.756, 'text': "Let's initialize.", 'start': 6221.776, 'duration': 0.98}], 'summary': 'Monitoring update ratio for parameters to ensure efficient training, aiming for -3 on log plot.', 'duration': 24.749, 'max_score': 6198.007, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc6198007.jpg'}, {'end': 6724.666, 'src': 'embed', 'start': 6696.582, 'weight': 5, 'content': [{'end': 6698.982, 'text': 'There are three things I was hoping to achieve with this section.', 'start': 6696.582, 'duration': 2.4}, {'end': 6702.264, 'text': 'Number one, I wanted to introduce you to batch normalization,', 'start': 6699.483, 'duration': 2.781}, {'end': 6708.966, 'text': "which is one of the first modern innovations that we're looking into that helped stabilize very deep neural networks and their training.", 'start': 6702.264, 'duration': 6.702}, {'end': 6715.208, 'text': 'And I hope you understand how the batch normalization works and how it would be used in a neural network.', 'start': 6709.826, 'duration': 5.382}, {'end': 6721.584, 'text': 'Number two, I was hoping to PyTorchify some of our code and wrap it up into these modules.', 'start': 6716.161, 'duration': 5.423}, {'end': 6724.666, 'text': 'So like linear, BatchNorm1D, 10H, et cetera.', 'start': 6721.905, 'duration': 2.761}], 'summary': 'Introduce batch normalization and pytorch modules for neural networks.', 'duration': 28.084, 'max_score': 6696.582, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc6696582.jpg'}, {'end': 6786.952, 'src': 'embed', 'start': 6759.013, 'weight': 6, 'content': [{'end': 6765.818, 'text': 'I tried to introduce you to the diagnostic tools that you would use to understand whether your neural network is in a good state dynamically.', 'start': 6759.013, 'duration': 6.805}, {'end': 6774.123, 'text': 'So we are looking at the statistics and histograms and activation of the forward pass activations, the backward pass gradients,', 'start': 6766.358, 'duration': 7.765}, {'end': 6780.167, 'text': "and then also we're looking at the weights that are going to be updated as part of stochastic gradient descent, and we're looking at their means,", 'start': 6774.123, 'duration': 6.044}, {'end': 6786.952, 'text': 'standard deviations and also the ratio of gradients to data or, even better, the updates to data.', 'start': 6780.167, 'duration': 6.785}], 'summary': 'Introduction to diagnostic tools for neural network analysis.', 'duration': 27.939, 'max_score': 6759.013, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc6759013.jpg'}, {'end': 6950.252, 'src': 'heatmap', 'start': 6885.599, 'weight': 0.775, 'content': [{'end': 6889.782, 'text': 'And so you may have found some of the parts here unintuitive and maybe you were slightly confused about.', 'start': 6885.599, 'duration': 4.183}, {'end': 6892.724, 
'text': 'okay, if I change the gain here, how come that?', 'start': 6889.782, 'duration': 2.942}, {'end': 6893.864, 'text': 'we need a different learning rate?', 'start': 6892.724, 'duration': 1.14}, {'end': 6900.829, 'text': "And I didn't go into the full detail because you'd have to actually look at the backward pass of all these different layers and get an intuitive understanding of how that works.", 'start': 6894.345, 'duration': 6.484}, {'end': 6903.491, 'text': 'And I did not go into that in this lecture.', 'start': 6901.369, 'duration': 2.122}, {'end': 6907.836, 'text': 'The purpose really was just to introduce you to the diagnostic tools and what they look like.', 'start': 6904.051, 'duration': 3.785}, {'end': 6914.866, 'text': "But there's still a lot of work remaining on the intuitive level to understand the initialization, the backward pass, and how all of that interacts.", 'start': 6908.337, 'duration': 6.529}, {'end': 6918.531, 'text': "But you shouldn't feel too bad because honestly, we are.", 'start': 6915.847, 'duration': 2.684}, {'end': 6927.757, 'text': "Getting to the cutting edge of where the field is, we certainly haven't, I would say, solved initialization, and we haven't solved back propagation.", 'start': 6919.352, 'duration': 8.405}, {'end': 6930.538, 'text': 'And these are still very much an active area of research.', 'start': 6928.277, 'duration': 2.261}, {'end': 6937.322, 'text': 'People are still trying to figure out what is the best way to initialize these networks, what is the best update rule to use, and so on.', 'start': 6930.798, 'duration': 6.524}, {'end': 6941.144, 'text': "So none of this is really solved, and we don't really have all the answers to all the questions.", 'start': 6937.402, 'duration': 3.742}, {'end': 6943.406, 'text': 'to all these cases.', 'start': 6942.085, 'duration': 1.321}, {'end': 6950.252, 'text': "But at least we're making progress and at least we have some tools to tell us whether or not things are on the right track for now.", 'start': 6944.126, 'duration': 6.126}], 'summary': 'Introduction to diagnostic tools in neural networks, active research in initialization and backpropagation, ongoing progress', 'duration': 64.653, 'max_score': 6885.599, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc6885599.jpg'}], 'start': 5019.096, 'title': 'Custom pytorch layer and network analysis', 'summary': 'Discusses creating custom pytorch layers, setting gain for activation and gradient statistics, analyzing neural network training and parameter updates, introducing batch normalization, and diagnostic tools for monitoring neural network dynamics.', 'chapters': [{'end': 5450.373, 'start': 5019.096, 'title': 'Custom pytorch layer and initialization', 'summary': "Discusses the creation of custom pytorch layers, including 'dot out' attribute, buffer updates using exponential moving average, and the use of torch.nograd context manager, and the empirical determination of gain value in initialization to maintain standard deviation and saturation of activations.", 'duration': 431.277, 'highlights': ["The creation of custom PyTorch layers, including 'dot out' attribute, buffer updates using exponential moving average, and the use of torch.nograd context manager.", 'Empirical determination of the gain value in initialization to maintain the standard deviation and saturation of activations.', 'Visualization and analysis of histograms of forward pass activations at the tanh layers.']}, {'end': 
5837.115, 'start': 5450.994, 'title': 'Setting gain and understanding activation and gradient statistics', 'summary': 'Demonstrates the importance of setting the gain for activation and gradient statistics in neural networks, showcasing the impact of gain values on activation and gradient behavior, and emphasizing the significance of 10h nonlinearities in enabling neural networks to approximate arbitrary functions.', 'duration': 386.121, 'highlights': ['The gain values significantly affect the behavior of activations and gradients in neural networks, with improper gain settings leading to undesirable behaviors such as shrinking or exploding activations and asymmetrical gradient distributions.', 'The 10H nonlinearities play a crucial role in enabling neural networks to approximate arbitrary functions by transforming a linear sandwich into a neural network with distinct optimization dynamics, despite collapsing to a linear function in the forward pass.', 'The demonstration underscores the importance of considering parameters and their values and gradients during neural network training, particularly focusing on the impact of weight updates on the behavior of linear layers.']}, {'end': 6086.774, 'start': 5837.655, 'title': 'Neural network training analysis', 'summary': 'Discusses the analysis of neural network training, highlighting issues such as gradient to data ratio, trouble with last layer weights, and the importance of update to data ratio, with an emphasis on the impact on training speed and optimization.', 'duration': 249.119, 'highlights': ["The last layer's weights have a standard deviation of gradients roughly 10 times greater than other weights inside the neural net, causing potential trouble in training with simple stochastic gradient descent.", 'The gradient to data ratio is not as informative as the update to data ratio, which determines the actual change in data tensors and is crucial for understanding the impact of updates on the neural network.', 'Visualizing the scale of the gradient compared to the scale of the actual values is important for understanding the impact of the learning rate times gradient update on the data, with potential implications for training speed and optimization.', 'The mean, standard deviation, and histogram of parameters highlight potential issues in the neural network training, indicating trouble in paradise despite seemingly okay gradients.']}, {'end': 6385.042, 'start': 6087.695, 'title': 'Neural network parameter updates analysis', 'summary': 'Discusses the analysis of parameter update ratios in neural networks, focusing on the evolution of update ratios over time, the impact of learning rate on update sizes, and the manifestation of miscalibrations in neural networks through various plots.', 'duration': 297.347, 'highlights': ['The evolution of update ratios over time is monitored to ensure that they stabilize during training, with the goal of keeping the update ratio around roughly one in negative three, indicating that updates to parameters are not too large (e.g., a symptom of a slow learning rate).', 'Monitoring the update sizes relative to the magnitude of the numbers in the tensor provides insights into the learning rate calibration, where excessively small updates (e.g., 10,000 times smaller than the tensor magnitude) suggest a learning rate that is too low, while excessively large updates indicate potential miscalibrations.', 'Miscalibrations in neural networks, such as the absence of fan-in normalization for weights, can be identified 
through manifestations in activation plots, gradients, weight histograms, and the discrepancy in parameter update ratios, highlighting the importance of precise calibration for optimizing neural network performance.']}, {'end': 6759.013, 'start': 6385.963, 'title': 'Batch normalization and pytorch modules', 'summary': 'Introduces batch normalization to stabilize deep neural network training, demonstrates its placement and impact on activations, gradients, and weights, and pytorchifies code by wrapping it into modules.', 'duration': 373.05, 'highlights': ['Introduction to Batch Normalization', 'Placement and Impact of Batch Normalization', 'PyTorchifying Code into Modules']}, {'end': 6956.899, 'start': 6759.013, 'title': 'Neural network diagnostic tools', 'summary': 'Introduces diagnostic tools for monitoring neural network dynamics, including statistics, histograms, activations, gradients, and weight updates, with a focus on maintaining a specific ratio of gradients to data. it also highlights the limitations of the batchnorm layer and the need for more powerful architectures like recurrent neural networks and transformers to improve performance.', 'duration': 197.886, 'highlights': ['The chapter introduces diagnostic tools for monitoring neural network dynamics, including statistics, histograms, activations, gradients, and weight updates, with a focus on maintaining a specific ratio of gradients to data. It emphasizes the importance of analyzing these metrics over time and provides a heuristic for determining the appropriate ratio of gradients to data.', "The limitations of the BatchNorm layer are highlighted, indicating that the network's performance is not bottlenecked by optimization but rather by the context length. This prompts the need for more powerful architectures like recurrent neural networks and transformers to improve performance.", 'The ongoing challenges in the field of neural networks are discussed, including the unresolved issues of network initialization and backpropagation. 
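The update-to-data ratio diagnostic summarized above amounts to a few lines added to a training loop: for every parameter, log the standard deviation of the update (learning rate times gradient) divided by the standard deviation of the parameter values, and check that it hovers around 1e-3, i.e. -3 on a log10 plot. A minimal sketch, with assumed names (parameters, lr, ud) and stand-in tensors:

```python
import torch

lr = 0.1
# stand-ins for the real parameter list; in the actual training code these would be
# the embedding table and the weights of all the layers
parameters = [torch.randn(200, 200, requires_grad=True) for _ in range(3)]
ud = []  # one entry per training step: log10(update std / data std) for each parameter

def track_update_ratios():
    with torch.no_grad():
        ud.append([((lr * p.grad).std() / p.data.std()).log10().item()
                   for p in parameters if p.grad is not None])

# inside the training loop, call track_update_ratios() right after loss.backward()
# and the SGD update; plotting the columns of `ud` over time, values far below -3
# suggest the parameters are barely moving (learning rate too low), while values well
# above it suggest the updates are too large relative to the data they modify.
```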
It emphasizes that these areas are still active areas of research, with no definitive solutions yet, but acknowledges the positive progress in providing tools to monitor network performance.']}], 'duration': 1937.803, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/P6sfmUTpUmc/pics/P6sfmUTpUmc5019096.jpg', 'highlights': ["The creation of custom PyTorch layers, including 'dot out' attribute, buffer updates using exponential moving average, and the use of torch.nograd context manager.", 'Empirical determination of the gain value in initialization to maintain the standard deviation and saturation of activations.', 'The gain values significantly affect the behavior of activations and gradients in neural networks, with improper gain settings leading to undesirable behaviors such as shrinking or exploding activations and asymmetrical gradient distributions.', "The last layer's weights have a standard deviation of gradients roughly 10 times greater than other weights inside the neural net, causing potential trouble in training with simple stochastic gradient descent.", 'The evolution of update ratios over time is monitored to ensure that they stabilize during training, with the goal of keeping the update ratio around roughly one in negative three, indicating that updates to parameters are not too large (e.g., a symptom of a slow learning rate).', 'Introduction to Batch Normalization', 'The chapter introduces diagnostic tools for monitoring neural network dynamics, including statistics, histograms, activations, gradients, and weight updates, with a focus on maintaining a specific ratio of gradients to data. It emphasizes the importance of analyzing these metrics over time and provides a heuristic for determining the appropriate ratio of gradients to data.']}], 'highlights': ['Refactored MLP code optimizes 11,000 parameters over 200,000 steps with batch size 32', 'Skipping gradient bookkeeping using torch.no_grad improves computation efficiency by avoiding graph maintenance and setting requires_grad to false for all tensors (efficiency improvement)', 'Significant validation loss improvement from 2.17 to 2.10 due to addressing dead neurons', 'The introduction of batch normalization in 2015 by a team at Google made it possible to train very deep neural nets reliably, leading to improved results', 'Jittering introduces entropy, acting as data augmentation and preventing overfitting', 'The motif of stacking weight layer, normalization layer, and non-linearity in deep neural networks is emphasized', 'PyTorch initializes weights based on fan-in and fan-out, using a uniform distribution instead of a normal distribution, ensuring roughly Gaussian output', "The creation of custom PyTorch layers, including 'dot out' attribute, buffer updates using exponential moving average, and the use of torch.nograd context manager", 'Empirical determination of the gain value in initialization to maintain the standard deviation and saturation of activations', 'The gain values significantly affect the behavior of activations and gradients in neural networks, with improper gain settings leading to undesirable behaviors such as shrinking or exploding activations and asymmetrical gradient distributions']}
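The forward-pass activation diagnostic described in this section — per-layer histograms of the tanh outputs together with their mean, standard deviation, and percent saturation — can be sketched as below. It assumes the PyTorch-ified setup in which each custom module keeps its last output in a .out attribute; the 0.97 saturation threshold and the plotting details are illustrative choices:

```python
import torch
import matplotlib.pyplot as plt

def activation_stats(layers, saturation_threshold=0.97):
    """Print and plot statistics of the tanh outputs after a forward pass.

    `layers` is assumed to be a list of module-like objects (Linear, BatchNorm1d,
    Tanh, ...) that each stash their last output in a `.out` attribute."""
    plt.figure(figsize=(20, 4))
    legends = []
    for i, layer in enumerate(layers[:-1]):  # skip the final output layer
        if layer.__class__.__name__ == 'Tanh':
            t = layer.out
            saturated = (t.abs() > saturation_threshold).float().mean().item() * 100
            print(f'layer {i} ({layer.__class__.__name__}): '
                  f'mean {t.mean().item():+.2f}, std {t.std().item():.2f}, '
                  f'saturated: {saturated:.2f}%')
            hy, hx = torch.histogram(t.detach(), density=True)
            plt.plot(hx[:-1].detach(), hy.detach())
            legends.append(f'layer {i} ({layer.__class__.__name__})')
    plt.legend(legends)
    plt.title('activation distribution')
    plt.show()
```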