title
Lecture 6 | Training Neural Networks I
description
In Lecture 6 we discuss many practical issues for training modern neural networks. We discuss different activation functions, the importance of data preprocessing and weight initialization, and batch normalization; we also cover some strategies for monitoring the learning process and choosing hyperparameters.
Keywords: Activation functions, data preprocessing, weight initialization, batch normalization, hyperparameter search
Slides: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture6.pdf
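As a rough companion to the activation-function part of the lecture, here is a minimal sketch of sigmoid, ReLU, Leaky ReLU and ELU in plain Python. It shows the saturation behavior the lecture describes: the sigmoid gradient is largest near zero and nearly vanishes for large positive or negative inputs. The helper names and the default alpha values (0.01 for Leaky ReLU, 1.0 for ELU) are assumptions chosen for illustration, not something stated in the lecture text here.

```python
import math


def sigmoid(x):
    # 1 / (1 + e^(-x)), squashes any real input into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x)); peaks at 0.25 when x = 0
    s = sigmoid(x)
    return s * (1.0 - s)


def relu(x):
    # max(0, x): no saturation for positive inputs, zero output for negative ones
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    # Small slope alpha in the negative regime (alpha value is an assumption)
    return x if x > 0 else alpha * x


def elu(x, alpha=1.0):
    # Saturates toward -alpha for very negative inputs (alpha value is an assumption)
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


if __name__ == "__main__":
    for x in (-10.0, 0.0, 10.0):
        print(x, sigmoid_grad(x), relu(x), leaky_relu(x), elu(x))
```

Running the loop at x = -10, 0 and 10 makes the contrast visible: the sigmoid gradient is tiny at both extremes, while ReLU and its variants keep a non-saturating positive side.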
--------------------------------------------------------------------------------------
Convolutional Neural Networks for Visual Recognition
Instructors:
Fei-Fei Li: http://vision.stanford.edu/feifeili/
Justin Johnson: http://cs.stanford.edu/people/jcjohns/
Serena Yeung: http://ai.stanford.edu/~syyeung/
Computer Vision has become ubiquitous in our society, with applications in search, image understanding, apps, mapping, medicine, drones, and self-driving cars. Core to many of these applications are visual recognition tasks such as image classification, localization and detection. Recent developments in neural network (aka “deep learning”) approaches have greatly advanced the performance of these state-of-the-art visual recognition systems. This lecture collection is a deep dive into details of the deep learning architectures with a focus on learning end-to-end models for these tasks, particularly image classification. From this lecture collection, students will learn to implement, train and debug their own neural networks and gain a detailed understanding of cutting-edge research in computer vision.
Website:
http://cs231n.stanford.edu/
For additional learning opportunities please visit:
http://online.stanford.edu/
detail
{'title': 'Lecture 6 | Training Neural Networks I', 'heatmap': [{'end': 628.205, 'start': 485.737, 'weight': 0.751}, {'end': 3089.899, 'start': 3036.049, 'weight': 0.816}, {'end': 3958.016, 'start': 3902.666, 'weight': 0.775}], 'summary': 'Lecture covers neural network training, project proposal deadlines, computational graphs, mini-batch stochastic gradient descent, activation functions, dead relus, leaky relu, prelu, elu, relu neuron variants, data preprocessing, symmetry problem, weight initialization challenges, xavier initialization method, batch normalization, and neural network training techniques, emphasizing practical implementations and optimization strategies.', 'chapters': [{'end': 70.486, 'segs': [{'end': 70.486, 'src': 'embed', 'start': 4.881, 'weight': 0, 'content': [{'end': 6.001, 'text': 'Stanford University.', 'start': 4.881, 'duration': 1.12}, {'end': 12.003, 'text': "OK, let's get started.", 'start': 10.943, 'duration': 1.06}, {'end': 21.046, 'text': "OK, so today we're going to get into some of the details about how we train neural networks.", 'start': 16.545, 'duration': 4.501}, {'end': 25.328, 'text': 'So some administrative details first.', 'start': 23.347, 'duration': 1.981}, {'end': 30.089, 'text': 'Assignment one is due today, Thursday, so 1159 PM tonight on Canvas.', 'start': 26.168, 'duration': 3.921}, {'end': 35.524, 'text': "We're also going to be releasing assignment two today.", 'start': 33.203, 'duration': 2.321}, {'end': 40.025, 'text': 'And then your project proposals are due Tuesday, April 25th.', 'start': 36.664, 'duration': 3.361}, {'end': 45.066, 'text': "So you should be really starting to think about your projects now if you haven't already.", 'start': 40.725, 'duration': 4.341}, {'end': 54.809, 'text': 'How many people have decided what they want to do for their project so far? 
Okay, so some people.', 'start': 46.887, 'duration': 7.922}, {'end': 65.783, 'text': 'So yeah, everyone else, you can go to TA office hours if you want suggestions and bounce ideas off of TAs.', 'start': 54.889, 'duration': 10.894}, {'end': 70.486, 'text': 'We also have a list of projects that other people have proposed.', 'start': 66.244, 'duration': 4.242}], 'summary': 'Neural network training details and deadlines discussed. assignment 1 due tonight, assignment 2 released. project proposals due april 25th.', 'duration': 65.605, 'max_score': 4.881, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M4881.jpg'}], 'start': 4.881, 'title': 'Neural network training and project proposals', 'summary': 'Covers neural network training, assignment deadlines, and project proposal submission, stressing the importance of initiating project planning, with proposals due on tuesday, april 25th.', 'chapters': [{'end': 70.486, 'start': 4.881, 'title': 'Neural network training and project proposals', 'summary': 'Covers the details of neural network training, including assignment deadlines and project proposal submission, emphasizing the importance of starting to think about projects, with project proposals due on tuesday, april 25th.', 'duration': 65.605, 'highlights': ['Assignment one is due today, Thursday, 1159 PM on Canvas, while assignment two will be released today.', 'Project proposals are due Tuesday, April 25th, urging students to start thinking about their projects, with TA support available for ideas and suggestions.', 'Encouraging students to start thinking about their projects and seek suggestions from TAs during office hours.']}], 'duration': 65.605, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M4881.jpg', 'highlights': ['Assignment one is due today, Thursday, 1159 PM on Canvas', 'Project proposals are due Tuesday, April 25th, urging students to start 
thinking about their projects', 'Assignment two will be released today', 'Encouraging students to start thinking about their projects and seek suggestions from TAs during office hours']}, {'end': 979.773, 'segs': [{'end': 155.887, 'src': 'embed', 'start': 117.246, 'weight': 0, 'content': [{'end': 125.814, 'text': "And we've talked more explicitly about neural networks, which is a type of graph where we have these linear layers that we stack on top of each other,", 'start': 117.246, 'duration': 8.568}, {'end': 127.616, 'text': 'with nonlinearities in between.', 'start': 125.814, 'duration': 1.802}, {'end': 133.601, 'text': "And we've also talked last lecture about convolutional neural networks,", 'start': 129.539, 'duration': 4.062}, {'end': 143.804, 'text': 'which are a particular type of network that uses convolution layers to preserve the spatial structure throughout all the hierarchy of the network.', 'start': 133.601, 'duration': 10.203}, {'end': 155.887, 'text': 'And so we saw exactly how a convolution layer looked where each activation map in the convolutional layer output is produced by sliding a filter of weights over all of the spatial locations in the input.', 'start': 145.044, 'duration': 10.843}], 'summary': 'Discussion on neural networks, including convolutional neural networks with spatial preservation and filter-based activation map generation.', 'duration': 38.641, 'max_score': 117.246, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M117246.jpg'}, {'end': 236.344, 'src': 'embed', 'start': 203.126, 'weight': 2, 'content': [{'end': 208.007, 'text': 'And so the whole process we actually call a mini-batch, stochastic gradient descent,', 'start': 203.126, 'duration': 4.881}, {'end': 217.048, 'text': 'where the steps are that we continuously we sample a batch of data, we forward, prop it through our computational graph or our neural network,', 'start': 208.007, 'duration': 9.041}, {'end': 218.669, 
'text': 'we get the loss at the end,', 'start': 217.048, 'duration': 1.621}, {'end': 226.596, 'text': 'we back prop through our network to calculate the gradients and then we update the parameters or the weights in our network using this gradient.', 'start': 218.669, 'duration': 7.927}, {'end': 236.344, 'text': "Okay, so now for the next couple of lectures, we're going to talk about some of the details involved in training neural networks.", 'start': 230.479, 'duration': 5.865}], 'summary': 'Training neural networks using mini-batch stochastic gradient descent.', 'duration': 33.218, 'max_score': 203.126, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M203126.jpg'}, {'end': 296.405, 'src': 'embed', 'start': 266.284, 'weight': 4, 'content': [{'end': 270.467, 'text': "And then we'll also talk about evaluation and model ensembles.", 'start': 266.284, 'duration': 4.183}, {'end': 281.916, 'text': "So today, in the first part, we'll talk about activation functions, data preprocessing, weight initialization, batch normalization, babysitting,", 'start': 273.149, 'duration': 8.767}, {'end': 284.538, 'text': 'the learning process and hyperparameter optimization.', 'start': 281.916, 'duration': 2.622}, {'end': 289.842, 'text': 'Okay, so first, activation functions.', 'start': 287.5, 'duration': 2.342}, {'end': 296.405, 'text': 'So we saw earlier how at any particular layer we have the data coming in.', 'start': 291.843, 'duration': 4.562}], 'summary': 'Discussion on activation functions, data preprocessing, weight initialization, batch normalization, babysitting, learning process, hyperparameter optimization', 'duration': 30.121, 'max_score': 266.284, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M266284.jpg'}, {'end': 350.107, 'src': 'embed', 'start': 322.332, 'weight': 7, 'content': [{'end': 326.594, 'text': "So first the sigmoid, which we've seen before, and 
probably the one we're most comfortable with.", 'start': 322.332, 'duration': 4.262}, {'end': 332.738, 'text': 'So the sigmoid function is, as we have up here, one over one plus e to the negative x.', 'start': 327.335, 'duration': 5.403}, {'end': 337.9, 'text': "And what this does is it takes each number, that's input into this sigmoid non-linearity,", 'start': 332.738, 'duration': 5.162}, {'end': 344.504, 'text': 'so each element and element-wise squashes these into this range, zero one using this function here.', 'start': 337.9, 'duration': 6.604}, {'end': 350.107, 'text': 'And so if you get very high values as input, then the output is going to be something near one.', 'start': 345.364, 'duration': 4.743}], 'summary': 'Sigmoid function squashes input into range 0-1 using 1/(1+e^-x)', 'duration': 27.775, 'max_score': 322.332, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M322332.jpg'}, {'end': 628.205, 'src': 'heatmap', 'start': 485.737, 'weight': 0.751, 'content': [{'end': 487.899, 'text': "Yeah, it's fine in this regime, right?", 'start': 485.737, 'duration': 2.162}, {'end': 494.942, 'text': "So in this regime near 0, you're going to get a reasonable gradient here, and then it'll be fine for backprop.", 'start': 487.919, 'duration': 7.023}, {'end': 500.085, 'text': 'And then what about x equals 10? 
0, right.', 'start': 495.382, 'duration': 4.703}, {'end': 500.485, 'text': 'So again.', 'start': 500.125, 'duration': 0.36}, {'end': 506.948, 'text': 'so when x is equal to very negative or x is equal to large positive numbers,', 'start': 500.485, 'duration': 6.463}, {'end': 514.052, 'text': "then these are all regions where the sigmoid function is flat and it's going to kill off the gradient and you're not going to get gradient flow coming back.", 'start': 506.948, 'duration': 7.104}, {'end': 521.594, 'text': 'Okay, so a second problem is that the sigmoid outputs are not zero-centered.', 'start': 517.15, 'duration': 4.444}, {'end': 524.756, 'text': "And so let's take a look at why this is a problem.", 'start': 522.775, 'duration': 1.981}, {'end': 531.201, 'text': 'So consider what happens when the input to a neuron is always positive.', 'start': 526.978, 'duration': 4.223}, {'end': 534.744, 'text': "So in this case, all of our x's we're gonna say is positive.", 'start': 532.122, 'duration': 2.622}, {'end': 543.771, 'text': "It's going to be multiplied by some weight w, and then we're going to run it through our activation function.", 'start': 535.304, 'duration': 8.467}, {'end': 558.109, 'text': 'What can we say about the gradients on w? 
So think about what the local gradient is going to be, right, for this linear layer.', 'start': 545.444, 'duration': 12.665}, {'end': 566.772, 'text': 'We have dl over, whatever the activation function, the loss coming down, and then we have our local gradient,', 'start': 558.769, 'duration': 8.003}, {'end': 569.833, 'text': 'which is going to be basically x right?', 'start': 566.772, 'duration': 3.061}, {'end': 574.135, 'text': 'And so what does this mean if all of x is positive?', 'start': 570.013, 'duration': 4.122}, {'end': 577.869, 'text': "Okay so I heard it's always gonna be positive.", 'start': 576.328, 'duration': 1.541}, {'end': 580.251, 'text': "So that's almost right.", 'start': 578.69, 'duration': 1.561}, {'end': 589.717, 'text': "It's always gonna be either positive or, all positive or all negative, right? So our upstream gradient coming down is dL over our loss L.", 'start': 580.311, 'duration': 9.406}, {'end': 591.098, 'text': "It's going to be dL over dF.", 'start': 589.717, 'duration': 1.381}, {'end': 593.62, 'text': 'And this is going to be either positive or negative.', 'start': 591.559, 'duration': 2.061}, {'end': 595.481, 'text': "It's some arbitrary gradient coming down.", 'start': 593.66, 'duration': 1.821}, {'end': 608.198, 'text': "And then our local gradient that we multiply this by is, if we're going to find the gradients on W, is going to be df over dw, which is going to be x.", 'start': 596.102, 'duration': 12.096}, {'end': 616.141, 'text': 'And so if x is always positive, then the gradients on w, which is multiplying these two together, are going to always be', 'start': 608.198, 'duration': 7.943}, {'end': 620.24, 'text': 'the sign of the upstream gradient coming down.', 'start': 617.938, 'duration': 2.302}, {'end': 626.303, 'text': "And so what this means is that all the gradients of w, since they're always either positive or negative,", 'start': 620.9, 'duration': 5.403}, {'end': 628.205, 'text': "they're always gonna move 
in the same direction.", 'start': 626.303, 'duration': 1.902}], 'summary': 'Sigmoid function issues: flat regions at very negative or large positive x kill the gradient, and non-zero-centered outputs make all gradients on w share the same sign.', 'duration': 142.468, 'max_score': 485.737, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M485737.jpg'}, {'end': 771.448, 'src': 'embed', 'start': 745.12, 'weight': 6, 'content': [{'end': 750.146, 'text': "The saturated neurons can kill the gradients if we're too positive or too negative of an input.", 'start': 745.12, 'duration': 5.026}, {'end': 756.172, 'text': "They're also not zero-centered, and so we get this inefficient kind of gradient update.", 'start': 750.986, 'duration': 5.186}, {'end': 763.201, 'text': 'And then a third problem: we have an exponential function in here, so this is a little bit computationally expensive.', 'start': 756.733, 'duration': 6.468}, {'end': 765.543, 'text': 'In the grand scheme of your network,', 'start': 763.722, 'duration': 1.821}, {'end': 771.448, 'text': 'this is usually not the main problem, because we have all these convolutions and dot products that are a lot more expensive.', 'start': 765.543, 'duration': 5.905}], 'summary': 'Sigmoid problems: saturated neurons kill gradients, outputs are not zero-centered, and the exponential is somewhat expensive to compute.', 'duration': 26.328, 'max_score': 745.12, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M745120.jpg'}, {'end': 933.771, 'src': 'embed', 'start': 889.394, 'weight': 3, 'content': [{'end': 895.619, 'text': 'And in practice, using this ReLU, it converges much faster than the sigmoid and the tanh, so about six times faster.', 'start': 889.394, 'duration': 6.225}, {'end': 900.977, 'text': "And it's also turned out to be more biologically plausible than the sigmoids.", 'start': 897.373, 'duration': 3.604}, {'end': 904.361, 'text': 'So if you look at a neuron and you look at what the inputs look like,', 'start': 901.017, 
'duration': 3.344}, {'end': 911.709, 'text': 'and you look at what the outputs look like and you try and measure this in neuroscience experiments,', 'start': 904.361, 'duration': 7.348}, {'end': 917.235, 'text': "you'll see that this one is actually a closer approximation to what's happening than sigmoids.", 'start': 911.709, 'duration': 5.526}, {'end': 925.157, 'text': 'And so ReLUs were started to be used a lot around 2012, when we had AlexNet,', 'start': 918.447, 'duration': 6.71}, {'end': 930.305, 'text': 'the first major convolutional neural network that was able to do well on ImageNet and large-scale data.', 'start': 925.157, 'duration': 5.148}, {'end': 933.771, 'text': 'They used the ReLu in their experiments.', 'start': 930.886, 'duration': 2.885}], 'summary': 'Relu converges 6x faster than sigmoid, more biologically plausible. widely used since 2012, e.g. in alexnet.', 'duration': 44.377, 'max_score': 889.394, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M889394.jpg'}], 'start': 71.046, 'title': 'Neural network computations and training', 'summary': 'Discusses expressing functions as computational graphs, representing neural networks with linear layers and nonlinearities, and understanding convolutional neural networks. it also covers the process of training neural networks through mini-batch stochastic gradient descent, detailing activation functions, data preprocessing, weight initialization, batch normalization, and hyperparameter optimization. 
additionally, it delves into the issues with sigmoid and tanh activation functions, the advantages of relu, and their computational efficiency, concluding that relu converges about six times faster than sigmoid and tanh.', 'chapters': [{'end': 177.687, 'start': 71.046, 'title': 'Neural network computational graphs', 'summary': 'Discusses expressing functions as computational graphs, representing neural networks as a type of graph with linear layers and nonlinearities, and understanding convolutional neural networks and their use of convolution layers to preserve spatial structure.', 'duration': 106.641, 'highlights': ['Neural networks are represented as a type of graph with linear layers stacked on top of each other, with nonlinearities in between, and can be expressed as a computational graph.', 'Convolutional neural networks use convolution layers to preserve spatial structure, with each activation map in the output produced by sliding a filter of weights over all spatial locations in the input.']}, {'end': 361.216, 'start': 180.008, 'title': 'Training neural networks: activation functions and optimization', 'summary': 'Covers the process of training neural networks through mini-batch stochastic gradient descent, and details involved in training including activation functions, data preprocessing, weight initialization, batch normalization, babysitting the learning process, and hyperparameter optimization.', 'duration': 181.208, 'highlights': ['The chapter covers the process of training neural networks through mini-batch stochastic gradient descent', 'Details involved in training including activation functions, data preprocessing, weight initialization, batch normalization, babysitting the learning process, and hyperparameter optimization', 'Describes the sigmoid activation function and its characteristics']}, {'end': 979.773, 'start': 362.557, 'title': 'Neural network activation functions', 'summary': 'Discusses the issues with sigmoid and tanh activation 
functions, the advantages of relu, and their computational efficiency, concluding that relu is more biologically plausible and converges about six times faster than sigmoid and tanh.', 'duration': 617.216, 'highlights': ['ReLU is computationally very efficient and converges about six times faster than sigmoid and tanh.', 'ReLUs are more biologically plausible than sigmoids and were first widely used in the successful AlexNet in 2012.', 'Saturated neurons in sigmoid and tanh activation functions can kill off the gradient flow, leading to inefficient gradient updates and computational expense.']}], 'duration': 908.727, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M71046.jpg', 'highlights': ['Neural networks are represented as a type of graph with linear layers stacked on top of each other, with nonlinearities in between, and can be expressed as a computational graph.', 'Convolutional neural networks use convolution layers to preserve spatial structure, with each activation map in the output produced by sliding a filter of weights over all spatial locations in the input.', 'The chapter covers the process of training neural networks through mini-batch stochastic gradient descent.', 'ReLU is computationally very efficient and converges about six times faster than sigmoid and tanh.', 'Details involved in training including activation functions, data preprocessing, weight initialization, batch normalization, babysitting the learning process, and hyperparameter optimization.', 'ReLUs are more biologically plausible than sigmoids and were first widely used in the successful AlexNet in 2012.', 'Saturated neurons in sigmoid and tanh activation functions can kill off the gradient flow, leading to inefficient gradient updates and computational expense.', 'Describes the sigmoid activation function and its characteristics.']}, {'end': 1434.991, 'segs': [{'end': 1120.094, 'src': 'embed', 'start': 1083.999, 'weight': 0, 
'content': [{'end': 1089.624, 'text': "And so in this case you started off with an okay ReLU, but because you're making these huge updates,", 'start': 1083.999, 'duration': 5.625}, {'end': 1095.009, 'text': 'the weights jump around and then your ReLU unit in a sense gets knocked off of the data manifold.', 'start': 1089.624, 'duration': 5.385}, {'end': 1098.012, 'text': 'And so this happens through training.', 'start': 1095.769, 'duration': 2.243}, {'end': 1102.456, 'text': 'So it was fine at the beginning and then at some point it became bad and it died.', 'start': 1098.112, 'duration': 4.344}, {'end': 1108.022, 'text': "And so, in practice, if you freeze a network that you've trained and you pass the data through,", 'start': 1103.097, 'duration': 4.925}, {'end': 1113.188, 'text': 'you can see that actually as much as 10 to 20% of the network is these dead ReLUs.', 'start': 1108.022, 'duration': 5.166}, {'end': 1120.094, 'text': "And so that's a problem, but also, Most networks do have this type of problem when you use ReLU.", 'start': 1113.668, 'duration': 6.426}], 'summary': 'Training with large updates causes 10-20% relus to die, a common problem with relu networks.', 'duration': 36.095, 'max_score': 1083.999, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M1083999.jpg'}, {'end': 1346.072, 'src': 'embed', 'start': 1317.902, 'weight': 2, 'content': [{'end': 1324.285, 'text': 'And so this looks very similar to the original ReLU, and the only difference is that now, instead of being flat in the negative regime,', 'start': 1317.902, 'duration': 6.383}, {'end': 1326.927, 'text': "we're going to give a slight negative slope here.", 'start': 1324.285, 'duration': 2.642}, {'end': 1331.59, 'text': 'And so this solves a lot of the problems that we mentioned earlier.', 'start': 1327.727, 'duration': 3.863}, {'end': 1337.073, 'text': "Here we don't have any saturating regime even in the negative space.", 'start': 
1332.15, 'duration': 4.923}, {'end': 1339.294, 'text': "It's still very computationally efficient.", 'start': 1337.473, 'duration': 1.821}, {'end': 1346.072, 'text': "It still converges faster than sigmoid and tanh, so it's very similar to a ReLU, and it doesn't have this dying problem.", 'start': 1339.83, 'duration': 6.242}], 'summary': 'Modified relu avoids saturation and converges faster than sigmoid and tanh.', 'duration': 28.17, 'max_score': 1317.902, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M1317902.jpg'}, {'end': 1416.437, 'src': 'embed', 'start': 1386.739, 'weight': 1, 'content': [{'end': 1395.844, 'text': 'And this one, again, it has all of the benefits of the ReLU, but now it also is closer to zero mean outputs.', 'start': 1386.739, 'duration': 9.105}, {'end': 1405.268, 'text': "So that's actually an advantage that the Leaky ReLU, Parametric ReLU, a lot of these, they allow you to have your mean closer to zero.", 'start': 1396.504, 'duration': 8.764}, {'end': 1412.894, 'text': 'But compared with the leaky ReLU, instead of it being sloped in a negative regime,', 'start': 1406.969, 'duration': 5.925}, {'end': 1416.437, 'text': 'here you actually are building back in a negative saturation regime.', 'start': 1412.894, 'duration': 3.543}], 'summary': 'Leaky relu has benefits of relu and closer to zero mean outputs.', 'duration': 29.698, 'max_score': 1386.739, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M1386739.jpg'}], 'start': 980.273, 'title': 'Dead relus and activation functions in neural networks', 'summary': 'Discusses dead relus in neural networks, where 10-20% of the network can be affected, and explores drawbacks of the sigmoid function, introducing modifications like leaky relu, prelu, and elu, highlighting their advantages in computational efficiency, flexibility, and noise robustness.', 'chapters': [{'end': 1195.068, 'start': 
980.273, 'title': 'Dead relus in neural networks', 'summary': 'Discusses the phenomenon of dead relus in neural networks, where up to 10-20% of the network can be affected, mainly caused by bad initialization or high learning rates, impacting the gradient flow and network activation.', 'duration': 214.795, 'highlights': ['Up to 10-20% of the network can be dead ReLUs when trained and passed through the data.', 'Dead ReLUs can result from bad weight initialization, leading to a lack of data input for activation.', 'High learning rates can cause ReLUs to be knocked off the data manifold, resulting in dead ReLUs.', 'The phenomenon of dead ReLUs is a common problem in networks utilizing ReLU activation functions.', 'The issue of dead ReLUs is a research problem, but generally, networks still perform adequately despite this challenge.']}, {'end': 1434.991, 'start': 1195.088, 'title': 'Activation functions in neural networks', 'summary': 'Discusses the drawbacks of the sigmoid function, introduces modifications like leaky relu, prelu, and elu, highlighting their advantages in terms of computational efficiency, flexibility, and robustness to noise.', 'duration': 239.903, 'highlights': ['ELU has the advantage of closer to zero mean outputs compared to Leaky ReLU and PReLU.', 'PReLU introduces a parameter for the negative slope, adding flexibility to the activation function.', 'Leaky ReLU addresses the saturation and dying problems, maintaining computational efficiency.', 'Sigmoid function can lead to saturation and zero gradient with large positive inputs.']}], 'duration': 454.718, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M980273.jpg', 'highlights': ['Up to 10-20% of the network can be dead ReLUs when trained and passed through the data.', 'ELU has the advantage of closer to zero mean outputs compared to Leaky ReLU and PReLU.', 'PReLU introduces a parameter for the negative slope, adding flexibility to the 
activation function.', 'High learning rates can cause ReLUs to be knocked off the data manifold, resulting in dead ReLUs.', 'Leaky ReLU addresses the saturation and dying problems, maintaining computational efficiency.']}, {'end': 1998.778, 'segs': [{'end': 1464.486, 'src': 'embed', 'start': 1436.166, 'weight': 0, 'content': [{'end': 1445.651, 'text': 'And in a sense this is kind of something in between the ReLUs and the Leaky ReLUs, right where it has some of this shape which the Leaky ReLu does,', 'start': 1436.166, 'duration': 9.485}, {'end': 1452.375, 'text': 'which gives it closer to zero mean outputs, but then it also still has some of this more saturating behavior that ReLUs have.', 'start': 1445.651, 'duration': 6.724}, {'end': 1464.486, 'text': 'Question? So whether this parameter alpha is going to be specific for each neuron.', 'start': 1453.396, 'duration': 11.09}], 'summary': 'Comparison between relus and leaky relus, considering alpha parameter for neurons.', 'duration': 28.32, 'max_score': 1436.166, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M1436166.jpg'}, {'end': 1514.886, 'src': 'embed', 'start': 1490.354, 'weight': 2, 'content': [{'end': 1496.718, 'text': 'And so you can see that all of these, you can argue that each one may have certain benefits, certain drawbacks.', 'start': 1490.354, 'duration': 6.364}, {'end': 1498.359, 'text': 'In practice.', 'start': 1497.439, 'duration': 0.92}, {'end': 1504.963, 'text': 'people just want to run experiments on all of them and see empirically what works better, try and justify it and come up with new ones,', 'start': 1498.359, 'duration': 6.604}, {'end': 1507.504, 'text': "but they're all different things that are being experimented with.", 'start': 1504.963, 'duration': 2.541}, {'end': 1512.825, 'text': "And so let's just mention one more.", 'start': 1510.145, 'duration': 2.68}, {'end': 1514.886, 'text': 'This is the max out neuron.', 'start': 
1513.085, 'duration': 1.801}], 'summary': 'Various neural network models are being experimented with, including the max out neuron.', 'duration': 24.532, 'max_score': 1490.354, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M1490354.jpg'}, {'end': 1641.616, 'src': 'embed', 'start': 1616.699, 'weight': 3, 'content': [{'end': 1623.465, 'text': 'You can also try out tanh, but probably some of these ReLU and ReLU variants are going to be better.', 'start': 1616.699, 'duration': 6.766}, {'end': 1626.527, 'text': "And in general, don't use sigmoid.", 'start': 1624.305, 'duration': 2.222}, {'end': 1634.451, 'text': 'This is one of the earliest original activation functions and ReLU and these other variants have generally worked better since then.', 'start': 1626.987, 'duration': 7.464}, {'end': 1641.616, 'text': "Okay, so now let's talk a little bit about data pre-processing.", 'start': 1637.473, 'duration': 4.143}], 'summary': 'Relu and its variants generally outperform sigmoid as activation functions.', 'duration': 24.917, 'max_score': 1616.699, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M1616699.jpg'}, {'end': 1838.178, 'src': 'embed', 'start': 1810.425, 'weight': 4, 'content': [{'end': 1819.594, 'text': "So in general, on the training phase is where we determine our, let's say, mean, and then we apply this exact same mean to the test data.", 'start': 1810.425, 'duration': 9.169}, {'end': 1824.018, 'text': "So we'll normalize by the same empirical mean from the training data.", 'start': 1820.094, 'duration': 3.924}, {'end': 1838.178, 'text': 'Okay, so, to summarize, basically for images, we typically just do the zero mean pre-processing and we can subtract either the entire mean image.', 'start': 1825.31, 'duration': 12.868}], 'summary': 'During training, the mean is determined and applied to test data for normalization. 
for images, the zero mean pre-processing is typically used.', 'duration': 27.753, 'max_score': 1810.425, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M1810425.jpg'}], 'start': 1436.166, 'title': 'Relu neuron variants and activation functions', 'summary': 'Discusses different variants of relu neurons, including leaky relus and max out neurons, and various activation functions, emphasizing pre-processing of input data for training and testing phases in the context of images.', 'chapters': [{'end': 1514.886, 'start': 1436.166, 'title': 'Variants of relu neurons', 'summary': 'Discusses different variants of relu neurons, such as leaky relus and max out neurons, highlighting their shapes, mean outputs, and experimental benefits.', 'duration': 78.72, 'highlights': ['The chapter discusses different variants of ReLU neurons, such as Leaky ReLUs and max out neurons, highlighting their shapes, mean outputs, and experimental benefits.', 'The parameter alpha in Leaky ReLUs gives closer to zero mean outputs, and different variants of ReLU neurons may have certain benefits and drawbacks when experimented with.', "In practice, people want to run experiments on different variants of ReLU neurons to empirically determine what works better and come up with new ones, but they're all different things that are being experimented with."]}, {'end': 1998.778, 'start': 1514.966, 'title': 'Neural network activation functions', 'summary': 'Discusses various activation functions, such as relu and its variants, and emphasizes the pre-processing of input data, specifically zero-centering and normalization, for training and testing phases in the context of images.', 'duration': 483.812, 'highlights': ['The chapter discusses various activation functions, such as ReLU and its variants.', 'Emphasizes the pre-processing of input data, specifically zero-centering and normalization, for training and testing phases in the context of 
images.']}], 'duration': 562.612, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M1436166.jpg', 'highlights': ['The chapter discusses different variants of ReLU neurons, such as Leaky ReLUs and max out neurons, highlighting their shapes, mean outputs, and experimental benefits.', 'The parameter alpha in Leaky ReLUs gives closer to zero mean outputs, and different variants of ReLU neurons may have certain benefits and drawbacks when experimented with.', "In practice, people want to run experiments on different variants of ReLU neurons to empirically determine what works better and come up with new ones, but they're all different things that are being experimented with.", 'The chapter discusses various activation functions, such as ReLU and its variants.', 'Emphasizes the pre-processing of input data, specifically zero-centering and normalization, for training and testing phases in the context of images.']}, {'end': 2586.845, 'segs': [{'end': 2055.196, 'src': 'embed', 'start': 2027.63, 'weight': 1, 'content': [{'end': 2030.252, 'text': 'And we talked about how sigmoid we want to have zero mean.', 'start': 2027.63, 'duration': 2.622}, {'end': 2034.237, 'text': 'And so it does solve this for the first layer.', 'start': 2030.593, 'duration': 3.644}, {'end': 2035.758, 'text': 'that we pass it through.', 'start': 2034.817, 'duration': 0.941}, {'end': 2040.402, 'text': 'So now our inputs to the first layer of our network is going to be zero mean,', 'start': 2036.539, 'duration': 3.863}, {'end': 2048.268, 'text': "but we'll see later on that we're actually going to have this problem come up in much worse and greater form as we have deep networks.", 'start': 2040.402, 'duration': 7.866}, {'end': 2055.196, 'text': "You're going to get a lot of non-zero mean problems later on, and so in this case, this is not gonna be sufficient.", 'start': 2048.55, 'duration': 6.646}], 'summary': 'Initial layer aims for zero mean input, 
but deeper layers present non-zero mean challenges.', 'duration': 27.566, 'max_score': 2027.63, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M2027630.jpg'}, {'end': 2190.036, 'src': 'embed', 'start': 2150.935, 'weight': 2, 'content': [{'end': 2151.875, 'text': 'which is not what you want.', 'start': 2150.935, 'duration': 0.94}, {'end': 2153.756, 'text': 'You want the neurons to learn different things.', 'start': 2151.915, 'duration': 1.841}, {'end': 2160.821, 'text': "And so that's the problem when you initialize everything equally and there's basically no symmetry breaking here.", 'start': 2154.337, 'duration': 6.484}, {'end': 2164.6, 'text': "So what's the first?", 'start': 2163.519, 'duration': 1.081}, {'end': 2165.32, 'text': 'Yeah, question?', 'start': 2164.6, 'duration': 0.72}, {'end': 2185.934, 'text': 'So the question is, because the gradient also depends on our loss,', 'start': 2179.73, 'duration': 6.204}, {'end': 2190.036, 'text': "won't one backprop differently compared to the other?", 'start': 2185.934, 'duration': 4.102}], 'summary': 'Neurons need diverse learning to avoid symmetry; gradient affects backpropagation.', 'duration': 39.101, 'max_score': 2150.935, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M2150935.jpg'}, {'end': 2409.56, 'src': 'embed', 'start': 2385.68, 'weight': 0, 'content': [{'end': 2391.925, 'text': 'But the problem is that as we multiply by this w, these small numbers at each layer,', 'start': 2385.68, 'duration': 6.245}, {'end': 2398.411, 'text': 'this quickly shrinks and collapses all of these values as we multiply this over and over again.', 'start': 2391.925, 'duration': 6.486}, {'end': 2403.895, 'text': 'And so by the end we get all of these zeros, which is not what we want.', 'start': 2398.891, 'duration': 5.004}, {'end': 2406.457, 'text': 'So we get all the activations become zero.', 'start': 2404.275,
'duration': 2.182}, {'end': 2409.56, 'text': "And so now let's think about the backward pass.", 'start': 2407.538, 'duration': 2.022}], 'summary': "Multiplying by 'w' causes values to shrink, leading to zero activations.", 'duration': 23.88, 'max_score': 2385.68, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M2385680.jpg'}, {'end': 2557.72, 'src': 'embed', 'start': 2527.589, 'weight': 4, 'content': [{'end': 2535.597, 'text': "And so because here we're multiplying by w over and over again, you're getting basically the same phenomenon as we had in the forward pass,", 'start': 2527.589, 'duration': 8.008}, {'end': 2542.524, 'text': 'where everything is getting smaller and smaller and now the gradient upstream gradients are collapsing to zero as well.', 'start': 2535.597, 'duration': 6.927}, {'end': 2557.72, 'text': "Yeah, so I guess upstream and downstream can be interpreted differently depending on if you're going forward and backwards.", 'start': 2550.714, 'duration': 7.006}], 'summary': 'Multiplying by w causes gradient collapse, affecting upstream gradients.', 'duration': 30.131, 'max_score': 2527.589, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M2527589.jpg'}], 'start': 1999.219, 'title': 'Data preprocessing and symmetry problem in neural networks', 'summary': "Emphasizes the significance of data preprocessing for achieving zero mean and the drawbacks of initializing weights to zero, impacting the network's ability to learn different features. 
it also discusses the symmetry problem in neural networks caused by small random weights leading to gradient collapse and ineffective weight updates in deeper networks.", 'chapters': [{'end': 2214.624, 'start': 1999.219, 'title': 'Data preprocessing and weight initialization', 'summary': "Discusses the importance of data preprocessing for achieving zero mean and the issues with initializing weights to zero, which results in neurons learning the same thing and having no symmetry breaking, impacting the network's ability to learn different features.", 'duration': 215.405, 'highlights': ['The importance of data preprocessing for achieving zero mean and its limitation to the first layer of the network.', 'Issues with initializing weights to zero, resulting in neurons learning the same thing and having no symmetry breaking.']}, {'end': 2586.845, 'start': 2214.624, 'title': 'Symmetry problem in neural networks', 'summary': 'Discusses the symmetry problem in neural networks, highlighting the issue of small random weights causing activations to collapse to zero, leading to gradient collapse and ineffective weight updates in deeper networks.', 'duration': 372.221, 'highlights': ['The activations collapse to zero due to small random weights, causing gradient collapse and ineffective weight updates in deeper networks.', 'The mean and standard deviation of the activations quickly shrink and collapse to zero as the network multiplies by small weights at each layer.', 'The upstream gradients collapse to zero as well due to the multiplication of small weights, resulting in ineffective weight updates.', 'The backward pass also faces the issue of small gradients due to collapsed activations, leading to ineffective weight updates and gradient flows.']}], 'duration': 587.626, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M1999219.jpg', 'highlights': ['The activations collapse to zero due to small random weights, causing 
gradient collapse and ineffective weight updates in deeper networks.', 'The importance of data preprocessing for achieving zero mean and its limitation to the first layer of the network.', 'Issues with initializing weights to zero, resulting in neurons learning the same thing and having no symmetry breaking.', 'The mean and standard deviation of the activations quickly shrink and collapse to zero as the network multiplies by small weights at each layer.', 'The upstream gradients collapse to zero as well due to the multiplication of small weights, resulting in ineffective weight updates.', 'The backward pass also faces the issue of small gradients due to collapsed activations, leading to ineffective weight updates and gradient flows.']}, {'end': 2930.587, 'segs': [{'end': 2717.321, 'src': 'embed', 'start': 2671.232, 'weight': 0, 'content': [{'end': 2675.678, 'text': "And so this will have the problem that we talked about with the tanh earlier, when they're saturated,", 'start': 2671.232, 'duration': 4.446}, {'end': 2679.002, 'text': 'that all the gradients will be zero and our weights are not updating.', 'start': 2675.678, 'duration': 3.324}, {'end': 2686.592, 'text': "So basically it's really hard to get your weight initialization right.", 'start': 2681.706, 'duration': 4.886}, {'end': 2690.119, 'text': "When it's too small, they all collapse. When it's too large, they saturate.", 'start': 2686.612, 'duration': 3.507}, {'end': 2695.622, 'text': "So there's been some work in trying to figure out well what's the proper way to initialize these weights?", 'start': 2690.699, 'duration': 4.923}, {'end': 2702.205, 'text': 'And so one kind of good rule of thumb that you can use is the Xavier initialization.', 'start': 2696.322, 'duration': 5.883}, {'end': 2707.375, 'text': 'And so this is from this paper by Glorot in 2010.', 'start': 2702.745, 'duration': 4.63}, {'end': 2717.321, 'text': 'And so what this formula is is if we look at W up here, we can see that we want to
initialize them to these.', 'start': 2707.375, 'duration': 9.946}], 'summary': 'Proper weight initialization is crucial; use xavier initialization as a rule of thumb.', 'duration': 46.089, 'max_score': 2671.232, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M2671232.jpg'}, {'end': 2834.191, 'src': 'embed', 'start': 2806.546, 'weight': 1, 'content': [{'end': 2810.63, 'text': 'But the problem is that this breaks when now you use something like a ReLU.', 'start': 2806.546, 'duration': 4.084}, {'end': 2813.593, 'text': 'And so with the ReLU.', 'start': 2811.731, 'duration': 1.862}, {'end': 2821.139, 'text': "what happens is that, because it's killing half of your units, it's setting approximately half of them to zero at each time,", 'start': 2813.593, 'duration': 7.546}, {'end': 2824.242, 'text': "it's actually halving the variance that you get out of this.", 'start': 2821.139, 'duration': 3.103}, {'end': 2834.191, 'text': "And so now if you just make the same assumptions as your derivation earlier, you won't actually get the right variance coming out.", 'start': 2824.803, 'duration': 9.388}], 'summary': 'Using relu reduces variance by halving units, affecting derivation assumptions.', 'duration': 27.645, 'max_score': 2806.546, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M2806546.jpg'}], 'start': 2594.437, 'title': 'Weight initialization in neural networks', 'summary': 'Discusses challenges of weight initialization, including impact of small and large weights on activation functions, emphasizing the xavier initialization method and its adaptation for relu activations, and its significant impact on training performance.', 'chapters': [{'end': 2690.119, 'start': 2594.437, 'title': 'Weight initialization issues', 'summary': 'Discusses the challenges of weight initialization in neural networks, highlighting the impact of small and large weights on the 
activation functions, resulting in saturation and gradient vanishing problems.', 'duration': 95.682, 'highlights': ['The issue of weight initialization in neural networks is addressed, where small weights lead to the collapse of gradients, while large weights cause saturation of activation functions and gradient vanishing.', 'The impact of using large weights on the activation functions is discussed, highlighting the saturation of tanh at very negative or positive regimes, leading to gradients approaching zero and hampering weight updates.', 'The consequence of weight saturation on the distribution of activations in the network is explained, indicating that the majority of activations will converge to either -1 or +1, exacerbating the gradient vanishing problem.', 'The challenge of finding the right weight initialization is emphasized, as small weights collapse gradients and large weights cause saturation, making it difficult to achieve optimal weight initialization.']}, {'end': 2930.587, 'start': 2690.699, 'title': 'Proper weight initialization and its impact', 'summary': 'Discusses the importance of proper weight initialization, highlighting the xavier initialization method and its adaptation for relu activations, emphasizing the significant impact on training performance and the active area of research in this field.', 'duration': 239.888, 'highlights': ['Xavier initialization method ensures variance consistency and is crucial for proper weight initialization, significantly impacting training performance.', 'Adaptation for ReLU activations involves compensating for the halving of variance by accounting for the deactivation of half the neurons, significantly impacting the distribution and network performance.', 'Proper weight initialization is an active area of research, with significant impact on training performance, as minor adjustments can make a substantial difference in network training and performance.']}], 'duration': 336.15, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M2594437.jpg', 'highlights': ['Xavier initialization method ensures variance consistency and is crucial for proper weight initialization, significantly impacting training performance.', 'Adaptation for ReLU activations involves compensating for the halving of variance by accounting for the deactivation of half the neurons, significantly impacting the distribution and network performance.', 'The issue of weight initialization in neural networks is addressed, where small weights lead to the collapse of gradients, while large weights cause saturation of activation functions, emphasizing the xavier initialization method and its adaptation for relu activations, and its significant impact on training performance.', 'The impact of using large weights on the activation functions is discussed, highlighting the saturation of tanh at very negative or positive regimes, leading to gradients approaching zero and hampering weight updates.']}, {'end': 3820.922, 'segs': [{'end': 2984.117, 'src': 'embed', 'start': 2954.401, 'weight': 0, 'content': [{'end': 2958.883, 'text': "And so how does this work? 
So let's consider a batch of activations at some layer.", 'start': 2954.401, 'duration': 4.482}, {'end': 2962.31, 'text': 'So now we have all of our activations coming out.', 'start': 2959.927, 'duration': 2.383}, {'end': 2969.377, 'text': 'If we want to make this unit Gaussian, we actually can just do this empirically, right?', 'start': 2962.81, 'duration': 6.567}, {'end': 2979.207, 'text': 'We can take the mean and the variance of the current batch, and we can just normalize by this.', 'start': 2969.437, 'duration': 9.77}, {'end': 2984.117, 'text': 'Right, and so basically, instead of with weight initialization,', 'start': 2980.134, 'duration': 3.983}], 'summary': 'Normalize batch activations to make unit gaussian.', 'duration': 29.716, 'max_score': 2954.401, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M2954401.jpg'}, {'end': 3089.899, 'src': 'heatmap', 'start': 3036.049, 'weight': 0.816, 'content': [{'end': 3047.095, 'text': 'if we look at our input data and we think of this as we have n training examples in our current batch and then each example has dimension d,', 'start': 3036.049, 'duration': 11.046}, {'end': 3055.66, 'text': "we're going to compute the empirical mean and variance independently for each dimension, so each basically feature element.", 'start': 3047.095, 'duration': 8.565}, {'end': 3062.163, 'text': 'And we compute this across our batch, our current mini-batch that we have, and we normalize by this.', 'start': 3056.44, 'duration': 5.723}, {'end': 3069.998, 'text': 'And so this is usually inserted after fully connected or convolutional layers.', 'start': 3065.894, 'duration': 4.104}, {'end': 3078.908, 'text': 'We saw that when we were multiplying by w in these layers, which we do over and over again, then we can have this bad scaling effect with each one,', 'start': 3070.759, 'duration': 8.149}, {'end': 3082.391, 'text': 'and so this,
basically, is able to undo this effect.', 'start': 3078.908, 'duration': 3.483}, {'end': 3089.899, 'text': "And since we're basically just scaling by the inputs connected to each neuron, each activation,", 'start': 3083.292, 'duration': 6.607}], 'summary': 'Computes mean and variance for each dimension of input data, used to normalize after fully connected or convolutional layers.', 'duration': 53.85, 'max_score': 3036.049, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M3036049.jpg'}, {'end': 3125.264, 'src': 'embed', 'start': 3094.664, 'weight': 2, 'content': [{'end': 3099.368, 'text': 'and the only difference is that with convolutional layers we want to normalize,', 'start': 3094.664, 'duration': 4.704}, {'end': 3104.934, 'text': 'not just across all the training examples and independently for each neuron,', 'start': 3099.368, 'duration': 5.566}, {'end': 3112.498, 'text': 'each feature dimension, but we actually want to normalize jointly across both all the feature dimensions,', 'start': 3105.154, 'duration': 7.344}, {'end': 3118.46, 'text': 'all the spatial locations that we have in our activation map, as well as all of the training examples.', 'start': 3112.498, 'duration': 5.962}, {'end': 3125.264, 'text': 'And we do this because we want to obey the convolutional property and we want nearby locations to be normalized the same way.', 'start': 3119.101, 'duration': 6.163}], 'summary': 'Convolutional layers normalize jointly across feature dimensions and spatial locations to obey the convolutional property.', 'duration': 30.6, 'max_score': 3094.664, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M3094664.jpg'}, {'end': 3334.576, 'src': 'embed', 'start': 3291.357, 'weight': 1, 'content': [{'end': 3298.019, 'text': 'We compute our variance, we normalize by this mean and variance, and we have this additional scaling and shifting factor.', 'start':
3291.357, 'duration': 6.662}, {'end': 3302.364, 'text': 'And so this improves gradient flow through the network.', 'start': 3298.639, 'duration': 3.725}, {'end': 3305.528, 'text': "It's also more robust as a result.", 'start': 3303.305, 'duration': 2.223}, {'end': 3310.174, 'text': 'It works for more range of learning rates and different kinds of initialization.', 'start': 3305.628, 'duration': 4.546}, {'end': 3316.603, 'text': "So people have seen that once you put batch normalization in, it's just easier to train, and so that's why you should do this.", 'start': 3310.575, 'duration': 6.028}, {'end': 3326.77, 'text': 'And then also one thing that I just want to point out is that you can also think of this as in a way also doing some regularization.', 'start': 3317.824, 'duration': 8.946}, {'end': 3334.576, 'text': 'And so because now, at the output of each layer, each of these activations,', 'start': 3327.611, 'duration': 6.965}], 'summary': 'Batch normalization improves gradient flow, robustness, and training ease for various learning rates and initializations, acting as a form of regularization.', 'duration': 43.219, 'max_score': 3291.357, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M3291357.jpg'}, {'end': 3520.719, 'src': 'embed', 'start': 3499.367, 'weight': 4, 'content': [{'end': 3514.576, 'text': "I guess batch normalization has been used a lot for standard convolutional neural networks and there's actually papers on how do we want to do normalization for different kinds of recurrent networks or some of these networks that might also be in reinforcement learning,", 'start': 3499.367, 'duration': 15.209}, {'end': 3520.719, 'text': "and so there's different considerations that you might want to think of there, and this is still an active area of research.", 'start': 3514.576, 'duration': 6.143}], 'summary': 'Batch normalization used for cnn, research ongoing for rnn and reinforcement learning.', 
'duration': 21.352, 'max_score': 3499.367, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M3499367.jpg'}, {'end': 3617.702, 'src': 'embed', 'start': 3583.16, 'weight': 5, 'content': [{'end': 3585.461, 'text': 'even if you were just doing data pre-processing.', 'start': 3583.16, 'duration': 2.301}, {'end': 3587.922, 'text': 'this Gaussian is not losing you any structure.', 'start': 3585.461, 'duration': 2.461}, {'end': 3596.987, 'text': "It's just shifting and scaling your data into a regime that is that works well for the operations that you're going to perform on it.", 'start': 3588.002, 'duration': 8.985}, {'end': 3602.671, 'text': 'In convolutional layers, you do have some structure that you want to preserve spatially.', 'start': 3598.148, 'duration': 4.523}, {'end': 3609.016, 'text': 'If you look at your activation maps, you want them to relatively all make sense to each other.', 'start': 3604.512, 'duration': 4.504}, {'end': 3617.702, 'text': "So in this case, you do want to take that into consideration, and so now we're going to normalize, find one mean for the entire activation map.", 'start': 3609.336, 'duration': 8.366}], 'summary': 'Gaussian normalization preserves structure, suitable for convolutional layers.', 'duration': 34.542, 'max_score': 3583.16, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M3583160.jpg'}], 'start': 2934.009, 'title': 'Batch normalization in neural networks', 'summary': 'Explains batch normalization, which aims to achieve unit gaussian activations by normalizing mean and variance of each batch, recommended for practical use and to be implemented in the next homework. 
it also highlights benefits such as improved gradient flow, robustness, ease of training, and flexibility to control saturation levels and regularization effect, with consideration for different network types and batch sizes.', 'chapters': [{'end': 3186.486, 'start': 2934.009, 'title': 'Batch normalization in neural networks', 'summary': 'Explains the concept of batch normalization which aims to force unit gaussian activations by empirically calculating the mean and variance of each batch and normalizing them, applied after fully connected or convolutional layers, with specific considerations for convolutional layers. the technique is recommended for practical use and is to be implemented in the next homework.', 'duration': 252.477, 'highlights': ['The batch normalization technique aims to force unit Gaussian activations by empirically calculating the mean and variance of each batch and normalizing them, applied after fully connected or convolutional layers, with specific considerations for convolutional layers.', 'The mean and variance are calculated independently for each dimension or feature element in the batch, and the normalization is performed based on these calculations, allowing the technique to undo the bad scaling effect and be applicable to both fully connected and convolutional layers.', 'For convolutional layers, batch normalization involves normalizing jointly across all the feature dimensions, spatial locations, and training examples, aiming to obey the convolutional property and normalize nearby locations in the same way.']}, {'end': 3820.922, 'start': 3186.486, 'title': 'Batch normalization and flexibility', 'summary': 'Introduces the concept of batch normalization, highlighting the benefits of improved gradient flow, robustness, and ease of training, with the flexibility to control saturation levels and regularization effect, and the consideration for different network types and batch sizes.', 'duration': 634.436, 'highlights': ['The additional 
squashing and scaling operation in batch normalization allows for the flexibility to control saturation levels and recover the identity function, improving gradient flow and robustness in training.', 'Batch normalization provides a regularization effect by tying all inputs in a batch together, leading to improved robustness and the ability to handle a wider range of learning rates and initialization.', 'Batch normalization works effectively for standard convolutional neural networks, and while it may be less accurate with smaller batch sizes, it still provides a similar effect and can be adapted for different network types through active research.', 'The concept of batch normalization does not lead to a loss of structure, as it effectively shifts and scales the data to a regime that works well for the operations, while still preserving spatial structure in convolutional layers.', 'Batch normalization normalizes the inputs to each layer, not the weights, and transforms the data into a Gaussian distribution, providing improved training and robustness.', 'The flexibility provided by batch normalization allows for the learning of scaling and shifting parameters, while in practice, the network does not learn the identity mapping, demonstrating the continued effectiveness of batch normalization.']}], 'duration': 886.913, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M2934009.jpg', 'highlights': ['Batch normalization aims to achieve unit Gaussian activations by normalizing mean and variance of each batch, applicable to fully connected and convolutional layers.', 'Batch normalization provides improved gradient flow, robustness, and flexibility to control saturation levels and regularization effect.', 'Batch normalization involves normalizing jointly across all feature dimensions and spatial locations for convolutional layers, obeying the convolutional property.', 'Batch normalization allows for the learning of 
scaling and shifting parameters, providing improved training and robustness.', 'Batch normalization works effectively for standard convolutional neural networks and can be adapted for different network types through active research.', 'Batch normalization does not lead to a loss of structure and preserves spatial structure in convolutional layers, while transforming data into a Gaussian distribution.', 'Batch normalization provides a regularization effect by tying all inputs in a batch together, leading to improved robustness and handling a wider range of learning rates and initialization.']}, {'end': 4817.923, 'segs': [{'end': 3876.757, 'src': 'embed', 'start': 3849.554, 'weight': 0, 'content': [{'end': 3857.321, 'text': 'Okay, so the last thing I just want to mention about this is that so at test time, the batch normalization layer,', 'start': 3849.554, 'duration': 7.767}, {'end': 3864.887, 'text': 'we now take the empirical mean and variance from the training data.', 'start': 3857.321, 'duration': 7.566}, {'end': 3867.269, 'text': "So we don't recompute this at test time.", 'start': 3864.947, 'duration': 2.322}, {'end': 3875.196, 'text': "we just estimate this at training time, for example using running averages, and then we're going to use this at test time.", 'start': 3867.269, 'duration': 7.927}, {'end': 3876.757, 'text': "So we're just going to scale by that.", 'start': 3875.556, 'duration': 1.201}], 'summary': 'At test time, batch normalization uses mean and variance from training data to avoid recomputation.', 'duration': 27.203, 'max_score': 3849.554, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M3849554.jpg'}, {'end': 3958.016, 'src': 'heatmap', 'start': 3902.666, 'weight': 0.775, 'content': [{'end': 3905.188, 'text': 'So we want to zero mean the data as we talked about earlier.', 'start': 3902.666, 'duration': 2.522}, {'end': 3907.23, 'text': 'Then we want to choose the architecture.', 'start':
3905.769, 'duration': 1.461}, {'end': 3913.334, 'text': 'And so here we are starting with one hidden layer of 50 neurons for example.', 'start': 3907.39, 'duration': 5.944}, {'end': 3917.818, 'text': 'But basically we can pick any architecture that we want to start with.', 'start': 3914.535, 'duration': 3.283}, {'end': 3924.192, 'text': 'And then the first thing that we want to do is we initialize our network,', 'start': 3920.39, 'duration': 3.802}, {'end': 3928.275, 'text': 'we do a forward pass through it and we want to make sure that our loss is reasonable.', 'start': 3924.192, 'duration': 4.083}, {'end': 3937.26, 'text': "So we talked about this several lectures ago where we have basically a, let's say we have a softmax classifier that we have here.", 'start': 3928.715, 'duration': 8.545}, {'end': 3943.524, 'text': 'We know what our loss should be when our weights are small and we have generally a diffuse distribution.', 'start': 3937.761, 'duration': 5.763}, {'end': 3950.008, 'text': 'Then we want it to be, the softmax classifier loss is going to be your negative log likelihood.', 'start': 3944.825, 'duration': 5.183}, {'end': 3958.016, 'text': "which if we have 10 classes, it'll be something like negative log of one over 10, which here is around 2.3.", 'start': 3950.488, 'duration': 7.528}], 'summary': 'Data zero-meaned, 50 neuron hidden layer, softmax loss target of around 2.3.', 'duration': 55.35, 'max_score': 3902.666, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M3902666.jpg'}, {'end': 4082.957, 'src': 'embed', 'start': 4054.826, 'weight': 2, 'content': [{'end': 4057.628, 'text': "Okay, so now once you've done that, these are all sanity checks.", 'start': 4054.826, 'duration': 2.802}, {'end': 4060.392, 'text': 'Now you can start really trying to train.', 'start': 4058.792, 'duration': 1.6}, {'end': 4066.894, 'text': 'So now you can take your full training data and now start with a small amount of 
regularization.', 'start': 4060.432, 'duration': 6.462}, {'end': 4069.494, 'text': "and let's first figure out what's a good learning rate.", 'start': 4066.894, 'duration': 2.6}, {'end': 4074.435, 'text': "So learning rate is one of the most important hyperparameters and it's something that you want to adjust first.", 'start': 4069.514, 'duration': 4.921}, {'end': 4082.957, 'text': "So you want to try some value of learning rate and here I've tried one e negative six and you can see that the loss is barely changing.", 'start': 4075.015, 'duration': 7.942}], 'summary': 'Start training with small regularization, experiment with learning rates, e.g., 1e-6.', 'duration': 28.131, 'max_score': 4054.826, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M4054826.jpg'}, {'end': 4651.632, 'src': 'embed', 'start': 4602.6, 'weight': 1, 'content': [{'end': 4609.902, 'text': "And in practice, you're going to do a lot of hyperparameter optimization, a lot of cross-validation.", 'start': 4602.6, 'duration': 7.302}, {'end': 4619.584, 'text': 'And so, in order to get numbers, people will run cross-validation over tons of hyperparameters monitor all of them, see which ones are doing better,', 'start': 4610.422, 'duration': 9.162}, {'end': 4620.364, 'text': 'which ones are doing worse.', 'start': 4619.584, 'duration': 0.78}, {'end': 4621.784, 'text': 'Here we have all of these loss curves.', 'start': 4620.404, 'duration': 1.38}, {'end': 4625.605, 'text': 'Pick the right ones, readjust, and keep going through this process.', 'start': 4622.524, 'duration': 3.081}, {'end': 4635.349, 'text': "And so, as I mentioned earlier, as you're monitoring each of these loss curves, learning rate is an important one,", 'start': 4628.627, 'duration': 6.722}, {'end': 4640.43, 'text': "but you'll get a sense for how different learning rates, which learning rates are good and bad.", 'start': 4635.349, 'duration': 5.081}, {'end': 4647.211, 'text': "So 
you'll see that if you have a very high exploding one, your loss explodes, then your learning rate is too high.", 'start': 4640.79, 'duration': 6.421}, {'end': 4651.632, 'text': "If it's too kind of linear and too flat, you'll see that it's too low.", 'start': 4647.771, 'duration': 3.861}], 'summary': 'Hyperparameter optimization involves cross-validation, monitoring loss curves, and adjusting learning rates for better performance.', 'duration': 49.032, 'max_score': 4602.6, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M4602600.jpg'}, {'end': 4744.847, 'src': 'embed', 'start': 4716.434, 'weight': 4, 'content': [{'end': 4722.497, 'text': "And so there's a lot of experience at looking at these and seeing what's wrong that you'll get over time.", 'start': 4716.434, 'duration': 6.063}, {'end': 4727.379, 'text': "And so you'll usually want to monitor and visualize your accuracy.", 'start': 4723.357, 'duration': 4.022}, {'end': 4735.022, 'text': 'If you have a big gap between your training accuracy and your validation accuracy,', 'start': 4727.399, 'duration': 7.623}, {'end': 4739.284, 'text': 'it usually means that you might have overfitting and you might want to increase your regularization strength.', 'start': 4735.022, 'duration': 4.262}, {'end': 4744.847, 'text': "If you have no gap, you might want to increase your model capacity because you haven't overfit yet.", 'start': 4739.805, 'duration': 5.042}], 'summary': 'Monitoring and visualizing accuracy helps identify overfitting or underfitting.', 'duration': 28.413, 'max_score': 4716.434, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M4716434.jpg'}], 'start': 3820.922, 'title': 'Neural network training techniques', 'summary': 'Discusses the importance of batch normalization, sanity checks, and hyperparameter optimization in neural network training. 
It highlights the use of empirical mean and variance, sanity checks for overfitting, hyperparameter optimization through cross-validation and fine-tuning in the range of 10^-3 to 10^-5, and learning rate adjustments for good learning results.', 'chapters': [{'end': 3928.275, 'start': 3820.922, 'title': 'Batch normalization and network training', 'summary': 'Discusses the importance of batch normalization in neural networks, highlighting how it uses empirical mean and variance from training data at test time and outlines the process of monitoring training and adjusting hyperparameters for good learning results.', 'duration': 107.353, 'highlights': ['The batch normalization layer uses the empirical mean and variance from the training data at test time, without recomputing it (quantifiable data: empirical mean and variance from training data).', 'The process of monitoring training and adjusting hyperparameters involves preprocessing the data by zero-meaning it, choosing the network architecture, and ensuring the initial forward pass results in a reasonable loss (quantifiable data: zero-meaning the data, choosing network architecture, and evaluating loss).', "The chapter also mentions that inputs coming in as approximately Gaussian would ideally have a certain effect, but in practice, it doesn't have to be a Gaussian (quantifiable data: inputs coming in as approximately Gaussian)."]}, {'end': 4553.856, 'start': 3928.715, 'title': 'Sanity checks and hyperparameter optimization', 'summary': 'Discusses the importance of conducting sanity checks before training a model, including verifying loss values and overfitting with small data, and then delves into the process of hyperparameter optimization through cross-validation and fine-tuning in the range of 10^-3 to 10^-5, emphasizing the significance of exploring the entire range sufficiently.', 'duration': 625.141, 'highlights': ['The importance of conducting sanity checks before training a model, including verifying loss 
values and overfitting with small data.', 'The process of hyperparameter optimization through cross-validation and fine-tuning in the range of 10^-3 to 10^-5.', 'The significance of exploring the entire range sufficiently during hyperparameter optimization.']}, {'end': 4817.923, 'start': 4556.559, 'title': 'Hyperparameter optimization', 'summary': 'Explores hyperparameter optimization, including learning rate adjustments, loss curve monitoring, and the importance of tracking accuracy and weight updates in neural network training.', 'duration': 261.364, 'highlights': ['The importance of hyperparameter optimization and cross-validation in neural network training', 'Learning rate adjustments and their impact on loss curves in neural network training', 'Monitoring accuracy and the gap between training and validation accuracy to identify overfitting in neural network training', 'The significance of tracking the ratio of weight-update magnitude to weight magnitude in neural network training']}], 'duration': 997.001, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/wEoyxE0GP2M/pics/wEoyxE0GP2M3820922.jpg', 'highlights': ['The batch normalization layer uses the empirical mean and variance from the training data at test time, without recomputing it (quantifiable data: empirical mean and variance from training data)', 'The process of hyperparameter optimization through cross-validation and fine-tuning in the range of 10^-3 to 10^-5', 'The importance of conducting sanity checks before training a model, including verifying loss values and overfitting with small data', 'Learning rate adjustments and their impact on loss curves in neural network training', 'Monitoring accuracy and the gap between training and validation accuracy to identify overfitting in neural network training']}], 'highlights': ['Neural networks are represented as a type of graph with linear layers stacked on top of each other, with nonlinearities in between, and can be expressed as a computational graph.', 
'Batch normalization aims to achieve unit Gaussian activations by normalizing mean and variance of each batch, applicable to fully connected and convolutional layers.', 'Xavier initialization method ensures variance consistency and is crucial for proper weight initialization, significantly impacting training performance.', 'ReLU is computationally very efficient and converges about six times faster than sigmoid and tanh.', 'Up to 10-20% of the network can be dead ReLUs when trained and passed through the data.', 'The chapter discusses different variants of ReLU neurons, such as Leaky ReLUs and Maxout neurons, highlighting their shapes, mean outputs, and experimental benefits.', 'The activations collapse to zero due to small random weights, causing gradient collapse and ineffective weight updates in deeper networks.', 'Batch normalization provides improved gradient flow, robustness, and flexibility to control saturation levels and regularization effect.', 'The importance of conducting sanity checks before training a model, including verifying loss values and overfitting with small data', 'The process of hyperparameter optimization through cross-validation and fine-tuning in the range of 10^-3 to 10^-5']}
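The sanity checks and hyperparameter search described in the segments above can be sketched in a few lines of Python. This is a minimal illustration rather than the lecture's own code: the function names, the 10-class softmax setting, and the 0.1 gap threshold are all assumptions made for the example. It computes the loss a softmax classifier should report at initialization, samples learning rates uniformly in log space over the 10^-5 to 10^-3 range mentioned above, and gives a rough reading of the gap between training and validation accuracy.

```python
import math
import random


def expected_initial_softmax_loss(num_classes):
    """Loss of a softmax classifier that gives every class equal probability.

    With no regularization, a freshly initialized network should report
    roughly this value on its first forward pass.
    """
    return -math.log(1.0 / num_classes)


def sample_log_uniform(low_exp, high_exp, rng):
    """Draw a value whose base-10 exponent is uniform in [low_exp, high_exp]."""
    return 10 ** rng.uniform(low_exp, high_exp)


def diagnose_gap(train_acc, val_acc, gap_threshold=0.1):
    """Rough reading of the train/validation accuracy gap.

    gap_threshold is an arbitrary illustrative cutoff, not a lecture value.
    """
    if train_acc - val_acc > gap_threshold:
        return "large gap: possible overfitting, consider stronger regularization"
    return "small gap: consider increasing model capacity"


# Hypothetical usage: five candidate learning rates between 1e-5 and 1e-3.
rng = random.Random(0)
learning_rates = [sample_log_uniform(-5, -3, rng) for _ in range(5)]

initial_loss = expected_initial_softmax_loss(10)  # assumes 10 classes
```

Sampling the exponent rather than the value itself spreads the candidates evenly across orders of magnitude, which matters because the learning rate acts multiplicatively on each update.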