title
CS231n Winter 2016: Lecture 5: Neural Networks Part 2

description
Stanford Winter Quarter 2016 class: CS231n: Convolutional Neural Networks for Visual Recognition. Lecture 5. Get in touch on Twitter @cs231n, or on Reddit /r/cs231n.

detail
{'title': 'CS231n Winter 2016: Lecture 5: Neural Networks Part 2', 'heatmap': [{'end': 1136.507, 'start': 1037.705, 'weight': 0.764}], 'summary': 'The lecture covers updates on ongoing assignments, tips for deep learning project success, evolution of neural networks, drawbacks of activation functions like sigmoid, solutions for dead relu neurons, data pre-processing techniques, the importance of weight initialization and batch normalization benefits.', 'chapters': [{'end': 59.872, 'segs': [{'end': 69.264, 'src': 'embed', 'start': 42.383, 'weight': 0, 'content': [{'end': 47.705, 'text': "And also the grading scheme, all of this stuff is kind of just tentative and subject to change, because we're still trying to figure out the course.", 'start': 42.383, 'duration': 5.322}, {'end': 49.706, 'text': "It's still relatively new, and a lot of it is changing.", 'start': 47.745, 'duration': 1.961}, {'end': 52.527, 'text': 'So those are just some heads up items before we start.', 'start': 50.566, 'duration': 1.961}, {'end': 57.83, 'text': 'Now, in terms of your project proposal, by the way, which is due in roughly 10 days.', 'start': 53.544, 'duration': 4.286}, {'end': 59.872, 'text': 'I wanted to just bring up a few points,', 'start': 57.83, 'duration': 2.042}, {'end': 66.621, 'text': "because you'll be thinking about your projects and some of you might have some misconceptions about what makes a good or bad project.", 'start': 59.872, 'duration': 6.749}, {'end': 69.264, 'text': 'So just to out a few of them.', 'start': 66.882, 'duration': 2.382}], 'summary': 'Grading scheme and project proposal details, subject to change. project proposal due in 10 days.', 'duration': 26.881, 'max_score': 42.383, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA42383.jpg'}], 'start': 0.029, 'title': 'Class update and assignment information', 'summary': 'Provides an update on the ongoing assignment, announces the release of the next assignment in the coming days, discusses potential changes in due dates and grading schemes, and highlights the upcoming deadline for project proposals.', 'chapters': [{'end': 59.872, 'start': 0.029, 'title': 'Class update and assignment information', 'summary': 'Provides an update on the ongoing assignment, announces the release of the next assignment in the coming days, and discusses potential changes in due dates and grading schemes, while also highlighting the upcoming deadline for project proposals.', 'duration': 59.843, 'highlights': ["The next assignment will be released in the next few days, with a focus on making it meaty and educational, requiring students to start it as soon as it's released.", 'The due date for assignment two might be adjusted as it is slightly larger compared to the previous one, showing potential flexibility in the course structure.', "The course's grading scheme and other details are tentative and subject to change, reflecting the evolving nature of the relatively new course.", 'The deadline for project proposal submission is roughly 10 days, emphasizing the upcoming deadline for an important course component.']}], 'duration': 59.843, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA29.jpg', 'highlights': ["The next assignment will be released in the next few days, with a focus on making it meaty and educational, requiring students to start it as soon as it's released.", 'The due date for assignment two might be adjusted as it is slightly larger 
compared to the previous one, showing potential flexibility in the course structure.', 'The deadline for project proposal submission is roughly 10 days, emphasizing the upcoming deadline for an important course component.', "The course's grading scheme and other details are tentative and subject to change, reflecting the evolving nature of the relatively new course."]}, {'end': 815.095, 'segs': [{'end': 106.42, 'src': 'embed', 'start': 77.128, 'weight': 3, 'content': [{'end': 79.71, 'text': "There's hundreds of millions of parameters in a convnet, and they need training.", 'start': 77.128, 'duration': 2.582}, {'end': 83.752, 'text': 'But actually, for your purposes in the project, this is kind of a myth.', 'start': 80.41, 'duration': 3.342}, {'end': 85.873, 'text': 'This is not something you have to worry about a lot.', 'start': 84.352, 'duration': 1.521}, {'end': 87.813, 'text': "You can work with smaller data sets, and it's OK.", 'start': 85.893, 'duration': 1.92}, {'end': 93.795, 'text': "The reason it's OK is that we have this process that we'll go into in much more detail later in the class called fine tuning.", 'start': 88.273, 'duration': 5.522}, {'end': 98.897, 'text': 'And the thing is that in practice, you rarely ever train these giant convolutional networks from scratch.', 'start': 94.356, 'duration': 4.541}, {'end': 102.138, 'text': 'You almost always do this pre-training and fine tuning process.', 'start': 99.297, 'duration': 2.841}, {'end': 106.42, 'text': 'So the way this will look like is you almost always take a convolutional network.', 'start': 102.679, 'duration': 3.741}], 'summary': 'Fine tuning process allows working with smaller data sets for training convolutional networks, rarely training from scratch.', 'duration': 29.292, 'max_score': 77.128, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA77128.jpg'}, {'end': 319.108, 'src': 'embed', 'start': 292.85, 'weight': 0, 'content': [{'end': 299.454, 'text': "And so we talked about intuitions of back propagation and the fact that it's really just a recursive application of chain rule from the back of the circuit to the front.", 'start': 292.85, 'duration': 6.604}, {'end': 302.436, 'text': "where we're chaining gradients through all the local operations.", 'start': 300.114, 'duration': 2.322}, {'end': 311.863, 'text': 'We looked at some implementations of this concretely with the forward-backward API, on both computational graph and also in terms of its nodes,', 'start': 303.256, 'duration': 8.607}, {'end': 315.766, 'text': 'which also implement the same API and do forward propagation and backward propagation.', 'start': 311.863, 'duration': 3.903}, {'end': 319.108, 'text': 'We looked at specific examples in Torch and Caffe.', 'start': 316.466, 'duration': 2.642}], 'summary': 'Back propagation uses recursive chain rule for gradient chaining in Torch and Caffe.', 'duration': 26.258, 'max_score': 292.85, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA292850.jpg'}, {'end': 376.786, 'src': 'embed', 'start': 346.871, 'weight': 4, 'content': [{'end': 352.035, 'text': "Okay, so that's roughly what we're doing right now and we're going to talk in this class about this process of training neural networks effectively.", 'start': 346.871, 'duration': 5.164}, {'end': 355.593, 'text': "OK, so we're going to go into that.", 'start': 353.311, 'duration': 2.282}, {'end': 357.354, 'text': 'Before I dive into the details of it', 
'start': 355.873, 'duration': 1.481}, {'end': 364.138, 'text': 'I just wanted to pull out and give you a zoomed out view of a bit of a history of how this field evolved over time.', 'start': 357.354, 'duration': 6.784}, {'end': 369.942, 'text': 'So if you try to find where this field really comes from, when were the first neural networks proposed, and so on,', 'start': 364.698, 'duration': 5.244}, {'end': 376.786, 'text': 'you probably will go back to roughly 1960, where Frank Rosenblatt in 1957 was playing around with something called perceptron.', 'start': 369.942, 'duration': 6.844}], 'summary': "Discussion on the history and evolution of neural networks from roughly 1960 with frank rosenblatt's work on perceptrons.", 'duration': 29.915, 'max_score': 346.871, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA346871.jpg'}, {'end': 412.174, 'src': 'embed', 'start': 386.313, 'weight': 1, 'content': [{'end': 393.157, 'text': 'They actually had to build these things out from circuits and electronics in these times, for the most part.', 'start': 386.313, 'duration': 6.844}, {'end': 399.88, 'text': 'So basically, the perceptron, roughly, was this function here.', 'start': 396.598, 'duration': 3.282}, {'end': 403.33, 'text': "And it looks very similar to what we're familiar with.", 'start': 401.109, 'duration': 2.221}, {'end': 405.231, 'text': "It's just a wx plus b.", 'start': 403.41, 'duration': 1.821}, {'end': 410.494, 'text': "But then the activation function, which we're used to as a sigmoid, that activation function was actually a step function.", 'start': 405.231, 'duration': 5.263}, {'end': 412.174, 'text': 'It was either 1 or 0.', 'start': 410.934, 'duration': 1.24}], 'summary': 'In the past, perceptrons were built from circuits and electronics, using a step function activation, with either 1 or 0 outputs.', 'duration': 25.861, 'max_score': 386.313, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA386313.jpg'}], 'start': 59.872, 'title': 'Deep learning project success and neural network evolution', 'summary': 'Provides tips for project success in deep learning, emphasizing the feasibility of working with smaller datasets for fine-tuning and cautioning against overestimating computing resources. it also discusses the historical evolution of training neural networks from the 1960s to the resurgence of deep learning in 2010 and 2012, highlighting the challenges and breakthroughs that led to the success of neural networks in various applications.', 'chapters': [{'end': 247.147, 'start': 59.872, 'title': 'Tips for project success in deep learning', 'summary': 'Emphasizes that working with smaller datasets is feasible due to the fine-tuning process, and cautions against overestimating computing resources for training deep learning models.', 'duration': 187.275, 'highlights': ['Working with smaller datasets is feasible due to the fine-tuning process. Despite the common misconception that deep learning models require large datasets, the fine-tuning process allows for effective use of smaller datasets, as pre-trained networks can be adapted to new datasets by training only the last layer or through fine-tuning.', 'Caution against overestimating computing resources for training deep learning models. 
The chapter warns against being overly ambitious in proposing projects that require extensive computing resources, emphasizing the need to consider compute constraints and the time required for training deep learning models with limited resources.']}, {'end': 815.095, 'start': 248.308, 'title': 'Training neural networks evolution', 'summary': 'Discusses the historical evolution of training neural networks, from the early development of perceptrons in the 1960s to the resurgence of deep learning in 2010 and 2012, highlighting the challenges and breakthroughs that led to the success of neural networks in various applications.', 'duration': 566.787, 'highlights': ['The first multilayer perceptron networks were developed in the 1960s, but lacked backpropagation and did not work well, leading to a period of stagnation in neural network research throughout the 1970s. First multilayer perceptron networks in the 1960s, lack of backpropagation, stagnation in neural network research throughout the 1970s.', 'In 1986, the influential paper by Rumelhart, Hinton, and Williams introduced backpropagation-like rules and formulated the concept of loss function, leading to renewed excitement but still faced challenges in scaling up networks. Introduction of backpropagation-like rules and loss function in 1986, challenges in scaling up networks.', "The resurgence of neural network research in 2006 was marked by Hinton's paper, which demonstrated the successful training of deep neural networks using unsupervised pre-training with restricted Boltzmann machines. Resurgence in 2006, successful training of deep neural networks using unsupervised pre-training.", 'From 2010 onwards, neural networks gained attention for their significant improvements in speech recognition and visual recognition, leading to the explosive growth of research in the field, attributed to better initialization, activation functions, and the availability of GPUs and more data. 
Significant improvements in speech recognition and visual recognition from 2010 onwards, attributed to better initialization, activation functions, and the availability of GPUs and more data.']}], 'duration': 755.223, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA59872.jpg', 'highlights': ['Working with smaller datasets is feasible due to the fine-tuning process.', 'Caution against overestimating computing resources for training deep learning models.', "The resurgence of neural network research in 2006 was marked by Hinton's paper, which demonstrated the successful training of deep neural networks using unsupervised pre-training.", 'Significant improvements in speech recognition and visual recognition from 2010 onwards, attributed to better initialization, activation functions, and the availability of GPUs and more data.', 'In 1986, the influential paper by Rumelhart, Hinton, and Williams introduced backpropagation-like rules and formulated the concept of loss function, leading to renewed excitement but still faced challenges in scaling up networks.']}, {'end': 1463.596, 'segs': [{'end': 877.835, 'src': 'embed', 'start': 852.161, 'weight': 1, 'content': [{'end': 857.124, 'text': 'So activation function is this function f at the top of the neuron.', 'start': 852.161, 'duration': 4.963}, {'end': 859.565, 'text': 'And we saw that it can have many different forms.', 'start': 857.804, 'duration': 1.761}, {'end': 865.928, 'text': 'So sigmoid, tanh, ReLU, these are all different proposals for what these activation functions can look like.', 'start': 860.165, 'duration': 5.763}, {'end': 872.112, 'text': "And we're going to go through some pros and cons in how you think about what are good, desirable properties of an activation function.", 'start': 866.148, 'duration': 5.964}, {'end': 877.835, 'text': 'So historically, the one that has been used the most is the sigmoid nonlinearity, which looks like this.', 'start': 873.032, 'duration': 4.803}], 'summary': 'Activation functions like sigmoid, tanh, relu have different forms and properties.', 'duration': 25.674, 'max_score': 852.161, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA852161.jpg'}, {'end': 915.503, 'src': 'embed', 'start': 891.644, 'weight': 2, 'content': [{'end': 898.488, 'text': 'which are neurons that output either very close to 0 or very close to 1, Those neurons kill gradients during backpropagation.', 'start': 891.644, 'duration': 6.844}, {'end': 901.551, 'text': "And so I'd like to expand on this and show you exactly what this means.", 'start': 899.189, 'duration': 2.362}, {'end': 905.214, 'text': "And this contributes to something that we'll go into called the vanishing gradient problem.", 'start': 902.131, 'duration': 3.083}, {'end': 909.478, 'text': "So let's look at a sigmoid gate in the circuit.", 'start': 906.435, 'duration': 3.043}, {'end': 912.7, 'text': 'It receives some value x, and sigma of x comes out.', 'start': 910.098, 'duration': 2.602}, {'end': 915.503, 'text': 'And then in backprop, we have dl by d sigma.', 'start': 913.541, 'duration': 1.962}], 'summary': 'Neurons output close to 0 or 1, causing vanishing gradient problem.', 'duration': 23.859, 'max_score': 891.644, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA891644.jpg'}, {'end': 1007.209, 'src': 'embed', 'start': 979.676, 'weight': 0, 'content': [{'end': 986.665, 'text': 'But if your 
neuron is saturated, so it basically either outputted 0 or outputted 1, then the gradient will be killed.', 'start': 979.676, 'duration': 6.989}, {'end': 989.028, 'text': 'It will just be multiplied by a very tiny number.', 'start': 986.685, 'duration': 2.343}, {'end': 992.652, 'text': 'And gradient flow will stop through the sigmoid neuron.', 'start': 989.468, 'duration': 3.184}, {'end': 1000.721, 'text': "So you can imagine, if you have a large network of sigmoid neurons and many of them are in a saturated regime where they're either 0 or 1,", 'start': 993.113, 'duration': 7.608}, {'end': 1007.209, 'text': "then gradients can't back propagate through the network because they'll be stopped if your sigmoid neurons are in these saturated regimes.", 'start': 1000.721, 'duration': 6.488}], 'summary': 'Saturated neurons hinder gradient flow, impeding backpropagation in a network.', 'duration': 27.533, 'max_score': 979.676, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA979676.jpg'}, {'end': 1136.507, 'src': 'heatmap', 'start': 1037.705, 'weight': 0.764, 'content': [{'end': 1041.809, 'text': "and we're putting more basically linear classifiers that we're stacking on top of each other.", 'start': 1037.705, 'duration': 4.104}, {'end': 1048.133, 'text': "And the problem, roughly, with non-zero-centered outputs, I'll just try to give you a bit of an intuition on what goes wrong.", 'start': 1042.388, 'duration': 5.745}, {'end': 1053.257, 'text': 'So consider a neuron that computes this function.', 'start': 1051.055, 'duration': 2.202}, {'end': 1058.202, 'text': "So it's a sigmoid neuron looking at, it's just computing wx plus b.", 'start': 1053.458, 'duration': 4.744}, {'end': 1059.703, 'text': 'And what can we say about?', 'start': 1058.202, 'duration': 1.501}, {'end': 1067.75, 'text': "think about what you can say about the gradients on w during backpropagation if your x's are all positive, in this case between 0 and 1..", 'start': 1059.703, 'duration': 8.047}, {'end': 1069.632, 'text': "So maybe you're a neuron somewhere deep in the network.", 'start': 1067.75, 'duration': 1.882}, {'end': 1082.276, 'text': "What can you say about the weights if all the x's are positive numbers? 
They're kind of constrained in a way.", 'start': 1070.113, 'duration': 12.163}, {'end': 1084.503, 'text': 'Go ahead.', 'start': 1083.962, 'duration': 0.541}, {'end': 1092.806, 'text': 'Right So what you said is all the gradients of w are either all positive or all negative.', 'start': 1088.803, 'duration': 4.003}, {'end': 1096.389, 'text': 'And that is because gradient flows in from the top.', 'start': 1093.807, 'duration': 2.582}, {'end': 1103.395, 'text': "And if you think about the expression for all the w gradients, they're basically x times the gradient on f.", 'start': 1097.01, 'duration': 6.385}, {'end': 1110.26, 'text': 'And so if the gradient at the output of the neuron is positive, then all your w gradients will be positive and vice versa.', 'start': 1103.395, 'duration': 6.865}, {'end': 1114.043, 'text': 'So basically, you end up with this case where, suppose you have just two weights.', 'start': 1110.861, 'duration': 3.182}, {'end': 1116.246, 'text': 'So you have the first weight and the second weight.', 'start': 1114.784, 'duration': 1.462}, {'end': 1123.934, 'text': 'What ends up happening is either all your gradient for that, as this input goes through, and you compute your gradient in the weights.', 'start': 1116.846, 'duration': 7.088}, {'end': 1125.896, 'text': "they're either all positive or they're all negative.", 'start': 1123.934, 'duration': 1.962}, {'end': 1136.507, 'text': "And so the issue is that you're constrained in the kind of update you can make and you end up with this undesirable zigzagging path if you want to get to some parts that are outside of these regions.", 'start': 1126.537, 'duration': 9.97}], 'summary': 'Stacking linear classifiers leads to constrained gradient updates, causing undesirable zigzagging paths.', 'duration': 98.802, 'max_score': 1037.705, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA1037704.jpg'}], 'start': 815.455, 'title': 'Activation functions in neural networks', 'summary': 'Covers drawbacks of sigmoid activation function, including saturated neurons causing vanishing gradients, non-zero-centered outputs affecting weight updates, and the computational expense of the exp function. 
it discusses the evolution of activation functions in neural networks, from sigmoid to tanh to relu, highlighting their impact on convergence speed and gradient flow.', 'chapters': [{'end': 1192.132, 'start': 815.455, 'title': 'Neural network activation functions', 'summary': 'Covers the drawbacks of sigmoid activation function, including the issue of saturated neurons causing vanishing gradients, non-zero-centered outputs affecting weight updates, and the computational expense of the exp function, potentially leading to slower convergence in training neural networks.', 'duration': 376.677, 'highlights': ['The issue of saturated neurons causing vanishing gradients Saturated neurons outputting values close to 0 or 1 kill gradients during backpropagation, contributing to the vanishing gradient problem, impacting the flow of gradients through the network.', 'Non-zero-centered outputs affecting weight updates Non-zero-centered outputs of sigmoid neurons constrain the kind of updates that can be made to weights, leading to slower convergence and undesirable zigzagging paths during training.', 'Computational expense of the exp function The exp function inside the sigmoid expression is computationally expensive compared to other alternatives, potentially impacting the training time, particularly in large convolutional networks.']}, {'end': 1463.596, 'start': 1192.572, 'title': 'Activation functions in neural networks', 'summary': 'Discusses the evolution of activation functions in neural networks, from sigmoid to tanh to relu, highlighting their advantages and disadvantages, and their impact on convergence speed and gradient flow.', 'duration': 271.024, 'highlights': ['Relevance: The ReLU activation function was found to make neural networks converge much quicker in experiments, almost by a factor of 6.', 'Relevance: The tanh function is preferred to sigmoid due to its zero-centered outputs, addressing one of the problems encountered with sigmoid.', 'Relevance: The discussion on the drawbacks of the ReLU neuron, such as non-zero-centered outputs and the issue of gradient killing for inactive neurons during backpropagation.']}], 'duration': 648.141, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA815455.jpg', 'highlights': ['The ReLU activation function makes neural networks converge much quicker, by a factor of 6.', 'The tanh function is preferred to sigmoid due to its zero-centered outputs.', 'Saturated neurons outputting values close to 0 or 1 kill gradients during backpropagation, contributing to the vanishing gradient problem.']}, {'end': 2024.044, 'segs': [{'end': 1676.394, 'src': 'embed', 'start': 1636.703, 'weight': 0, 'content': [{'end': 1643.25, 'text': 'And the idea with Leaky ReLu is basically we want this kink, and we want this piecewise linearity, and we want this efficiency of ReLu.', 'start': 1636.703, 'duration': 6.547}, {'end': 1648.573, 'text': 'But the issue is that in this region, your gradients die.', 'start': 1644.131, 'duration': 4.442}, {'end': 1655.276, 'text': "So instead, let's make this slightly negatively sloped here, or slightly positively sloped, I suppose, in this region.", 'start': 1648.913, 'duration': 6.363}, {'end': 1658.378, 'text': "And so you end up with this function, and that's called a leaky ReLU.", 'start': 1655.657, 'duration': 2.721}, {'end': 1662.22, 'text': 'And so some people, there are papers showing that this works slightly better.', 'start': 1658.918, 'duration': 3.302}, {'end': 1664.981, 
'text': "You don't have this issue of neurons dying.", 'start': 1663.12, 'duration': 1.861}, {'end': 1670.384, 'text': "But I think it's not completely established that this works always better.", 'start': 1665.301, 'duration': 5.083}, {'end': 1672.852, 'text': 'And then some people play with this even more.', 'start': 1671.451, 'duration': 1.401}, {'end': 1674.733, 'text': 'So right now, this is 0.01.', 'start': 1672.992, 'duration': 1.741}, {'end': 1676.394, 'text': 'But that can actually be an arbitrary parameter.', 'start': 1674.733, 'duration': 1.661}], 'summary': 'Leaky relu introduces slight slope to prevent gradient dying, showing slight improvement in some cases.', 'duration': 39.691, 'max_score': 1636.703, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA1636703.jpg'}, {'end': 1743.868, 'src': 'embed', 'start': 1714.111, 'weight': 4, 'content': [{'end': 1719.553, 'text': 'So alpha here would be a parameter that you back propagate to in just a very normal way.', 'start': 1714.111, 'duration': 5.442}, {'end': 1724.319, 'text': 'In your computational graph, every neuron will have its alpha, just like it has its own bias.', 'start': 1720.714, 'duration': 3.605}, {'end': 1724.499, 'text': 'Go ahead.', 'start': 1724.339, 'duration': 0.16}, {'end': 1734.953, 'text': "Is there some restriction on the activation function? Can alpha be 1? Yeah, I'm not sure if they worry about this a lot.", 'start': 1726.441, 'duration': 8.512}, {'end': 1737.696, 'text': "So if alpha is 1, then you're going to get an identity.", 'start': 1734.973, 'duration': 2.723}, {'end': 1743.868, 'text': "That's probably not something that backpropagation will want, in the sense that if that was an identity,", 'start': 1739.827, 'duration': 4.041}], 'summary': 'In backpropagation, each neuron has its own alpha and can influence activation functions.', 'duration': 29.757, 'max_score': 1714.111, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA1714111.jpg'}, {'end': 1916.142, 'src': 'embed', 'start': 1887.722, 'weight': 2, 'content': [{'end': 1889.664, 'text': 'different when we use different activation functions.', 'start': 1887.722, 'duration': 1.942}, {'end': 1892.447, 'text': 'The weights that you obtain might not be the same.', 'start': 1890.084, 'duration': 2.363}, {'end': 1899.975, 'text': 'They might have different convergence rates, but they also get different weights at the end of the simulation.', 'start': 1892.948, 'duration': 7.027}, {'end': 1900.295, 'text': "That's right.", 'start': 1900.015, 'duration': 0.28}, {'end': 1901.677, 'text': 'So what is your question? 
Sorry.', 'start': 1900.315, 'duration': 1.362}, {'end': 1906.141, 'text': 'Because the loss functions are different when you use different activation functions.', 'start': 1901.857, 'duration': 4.284}, {'end': 1908.184, 'text': 'So the weights also will be different in both.', 'start': 1906.462, 'duration': 1.722}, {'end': 1910.418, 'text': "That's right.", 'start': 1910.078, 'duration': 0.34}, {'end': 1916.142, 'text': 'So the weights will end up, based on what the activation functions are, the dynamics of the back prop into those weights will be different.', 'start': 1910.458, 'duration': 5.684}], 'summary': 'Weights and convergence rates vary with different activation functions.', 'duration': 28.42, 'max_score': 1887.722, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA1887722.jpg'}, {'end': 1978.82, 'src': 'embed', 'start': 1949.661, 'weight': 5, 'content': [{'end': 1952.022, 'text': 'And also, we use specifically stochastic gradient descent.', 'start': 1949.661, 'duration': 2.361}, {'end': 1953.123, 'text': 'And it has a particular form.', 'start': 1952.062, 'duration': 1.061}, {'end': 1954.523, 'text': 'And some things play nicer.', 'start': 1953.323, 'duration': 1.2}, {'end': 1961.066, 'text': 'Some nonlinearities play nicer with the fact, like the optimization is tied, the update is tied into all of this as well.', 'start': 1954.823, 'duration': 6.243}, {'end': 1962.486, 'text': "And it's kind of all interacting together.", 'start': 1961.106, 'duration': 1.38}, {'end': 1967.968, 'text': 'And the choice of these activation functions and the choice of your updates are kind of coupled.', 'start': 1963.047, 'duration': 4.921}, {'end': 1971.75, 'text': "And it's kind of very unclear when you actually optimize this kind of complex thing.", 'start': 1968.229, 'duration': 3.521}, {'end': 1978.82, 'text': 'OK So TLDR here is that use ReLU.', 'start': 1971.77, 'duration': 7.05}], 'summary': 'Use relu for optimization in complex interactions with activation functions and updates.', 'duration': 29.159, 'max_score': 1949.661, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA1949661.jpg'}, {'end': 2016.62, 'src': 'embed', 'start': 1991.264, 'weight': 1, 'content': [{'end': 1996.886, 'text': "Of course we use it in things like long-short-term memory units, LSTMs and so on, and we'll go into that in a bit in recurrent neural networks.", 'start': 1991.264, 'duration': 5.622}, {'end': 2001.027, 'text': "but there are specific reasons why we use them there and that we'll see later in the class.", 'start': 1996.886, 'duration': 4.141}, {'end': 2012.256, 'text': "And they're used differently than what we have covered so far in this fully connected sandwich of matrix, multiply, nonlinearity and so on,", 'start': 2002.388, 'duration': 9.868}, {'end': 2013.337, 'text': 'just having a basic neural network.', 'start': 2012.256, 'duration': 1.081}, {'end': 2016.62, 'text': "OK, so that's everything I wanted to say about activation functions.", 'start': 2013.959, 'duration': 2.661}], 'summary': 'Activation functions like lstms are used in neural networks for specific reasons.', 'duration': 25.356, 'max_score': 1991.264, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA1991264.jpg'}], 'start': 1464.673, 'title': 'Relu issues and activation functions', 'summary': 'Addresses the issue of dead relu neurons, presenting solutions like leaky relu and 
p-relu, while also exploring the maxout neuron, its impact on parameters, and complexities of choosing activation functions in optimization processes.', 'chapters': [{'end': 1827.167, 'start': 1464.673, 'title': 'Issues with relu and proposed solutions', 'summary': 'The chapter discusses the issue of dead relu neurons, where up to 20% of the network can be dead if the learning rate is high, and proposes solutions like leaky relu and p-relu which aim to address the dead neuron problem and the controversy surrounding them.', 'duration': 362.494, 'highlights': ['The issue of dead ReLU neurons can cause up to 20% of the network to be dead if the learning rate is high. During training, if the learning rate is high, up to 20% of the network can end up with dead ReLU neurons that never turn on, affecting the training.', 'Proposed solutions like Leaky ReLU and PReLU aim to address the dead ReLU neuron problem. Leaky ReLU and PReLU are proposed solutions to the dead ReLU neuron problem, aiming to introduce slight slope variations to prevent neurons from dying and to give them the choice of different slopes.', 'New activation functions like exponential linear units (ELUs) are being researched to address the non-zero centered downside of ReLU. Exponential linear units (ELUs) are being researched to address the non-zero centered downside of ReLU, aiming to have zero mean outputs and claim better training results.']}, {'end': 2024.044, 'start': 1827.868, 'title': 'Neural network activation functions', 'summary': 'The chapter discusses the maxout neuron from Ian Goodfellow et al., its unique structure, impact on parameters, comparison with relu and sigmoid functions, and the complexities of choosing activation functions in optimization processes.', 'duration': 196.176, 'highlights': ['The maxout neuron from Ian Goodfellow et al. is a unique form of a neuron that computes the max of two hyperplanes, doubling the number of parameters per neuron, impacting convergence rates and weights. The maxout neuron computes max of w transpose x plus b and another set of w transpose x plus b, doubling the number of parameters per neuron and impacting convergence rates and weights.', 'The complexities of choosing activation functions in optimization processes are discussed, with a recommendation to primarily use ReLU and the limited use of tanh and sigmoid functions. The choice of activation functions impacts the dynamics of the backward flow of gradients and the optimization process, with a preference for ReLU and limited use of tanh and sigmoid functions.', 'The impact of different activation functions on weights, convergence rates, loss functions, and optimization dynamics is highlighted, emphasizing the intricate nature of the optimization process. 
Different activation functions lead to varying weights, convergence rates, and loss functions, with a complex impact on the optimization process dynamics.']}], 'duration': 559.371, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA1464673.jpg', 'highlights': ['During training, if the learning rate is high, up to 20% of the network can end up with dead ReLU neurons that never turn on, affecting the training.', 'Leaky ReLU and PReLU are proposed solutions to the dead ReLU neuron problem, aiming to introduce slight slope variations to prevent neurons from dying and to give them the choice of different slopes.', 'Exponential linear units (ELUs) are being researched to address the non-zero centered downside of ReLU, aiming to have zero mean outputs and claim better training results.', 'The maxout neuron computes max of w transpose x plus b and another set of w transpose x plus b, doubling the number of parameters per neuron and impacting convergence rates and weights.', 'The choice of activation functions impacts the dynamics of the backward flow of gradients and the optimization process, with a preference for ReLU and limited use of tanh and sigmoid functions.', 'Different activation functions lead to varying weights, convergence rates, and loss functions, with a complex impact on the optimization process dynamics.']}, {'end': 2687.175, 'segs': [{'end': 2383.79, 'src': 'embed', 'start': 2356.892, 'weight': 0, 'content': [{'end': 2360.674, 'text': 'which is right now made up of just a series of layers of the same size.', 'start': 2356.892, 'duration': 3.782}, {'end': 2363.435, 'text': 'So we have 10 layers of 500 neurons on each layer.', 'start': 2360.934, 'duration': 2.501}, {'end': 2367.838, 'text': "And I'm forward propagating with this initialization strategy for a unit Gaussian data.", 'start': 2363.976, 'duration': 3.862}, {'end': 2376.864, 'text': "And what I want to look at is what happens to the statistics of the hidden neurons' activations throughout the network with this initialization.", 'start': 2368.518, 'duration': 8.346}, {'end': 2383.79, 'text': "So we're going to look specifically at the mean and the standard deviation and we're going to plot the mean-standard deviation and we're going to plot the histograms.", 'start': 2377.264, 'duration': 6.526}], 'summary': "Analyzing 10 layers of 500 neurons each with gaussian data for hidden neuron's activation statistics.", 'duration': 26.898, 'max_score': 2356.892, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA2356892.jpg'}, {'end': 2465.48, 'src': 'embed', 'start': 2433.424, 'weight': 2, 'content': [{'end': 2435.585, 'text': 'So we have a spread of numbers between negative one and one.', 'start': 2433.424, 'duration': 2.161}, {'end': 2442.029, 'text': 'And then what ends up happening to it, this just collapses to a tight distribution at exactly zero.', 'start': 2436.386, 'duration': 5.643}, {'end': 2448.593, 'text': 'So what ends up happening with this initialization for this 10 layer network is all the tanh neurons just end up outputting just zero.', 'start': 2442.469, 'duration': 6.124}, {'end': 2453.536, 'text': 'So at the last layer, these are tiny numbers of like near zero values.', 'start': 2449.273, 'duration': 4.263}, {'end': 2458.534, 'text': 'And so all activations basically become 0.', 'start': 2454.83, 'duration': 3.704}, {'end': 2465.48, 'text': 'And why is this an issue? 
So think about what happens to the dynamics of the backward pass, to the gradients.', 'start': 2458.534, 'duration': 6.946}], 'summary': 'Initializing with small values leads to zero activations, affecting gradients.', 'duration': 32.056, 'max_score': 2433.424, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA2433424.jpg'}, {'end': 2638.674, 'src': 'embed', 'start': 2614.582, 'weight': 1, 'content': [{'end': 2621.485, 'text': 'Instead of scaling here as we scaled with 1e negative 2, we can try a different scale of the w matrix at initialization.', 'start': 2614.582, 'duration': 6.903}, {'end': 2626.266, 'text': 'So suppose I try 1.0 instead of 0.01.', 'start': 2622.045, 'duration': 4.221}, {'end': 2633.59, 'text': "We'll see another funny thing happen, because now we've overshot the other way, in a sense that You can see that.", 'start': 2626.266, 'duration': 7.324}, {'end': 2636.232, 'text': "well, maybe it's best to look at the distributions here.", 'start': 2633.59, 'duration': 2.642}, {'end': 2638.674, 'text': 'You can see that everything is completely saturated.', 'start': 2636.432, 'duration': 2.242}], 'summary': 'Adjusting w matrix scale from 0.01 to 1.0 caused oversaturation in distributions.', 'duration': 24.092, 'max_score': 2614.582, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA2614582.jpg'}], 'start': 2024.044, 'title': 'Data pre-processing and weight initialization in ml', 'summary': 'Covers data pre-processing techniques such as zero centering, normalization, and whitening, with a focus on mean centering and per-channel mean subtraction. additionally, it explores the significance of weight initialization in neural networks, highlighting the problems associated with zero and small random number initialization, and the impact on gradients and neuron performance.', 'chapters': [{'end': 2215.096, 'start': 2024.044, 'title': 'Data pre-processing in machine learning', 'summary': 'Discusses data pre-processing in machine learning, including zero centering, normalization, and whitening, and emphasizes the common practice of mean centering and per-channel mean subtraction in computer vision applications.', 'duration': 191.052, 'highlights': ['The common practice of mean centering and per-channel mean subtraction in computer vision applications In computer vision applications, the common practice is to perform mean centering by computing the mean value for each pixel over the training set and subtracting it from every image, as well as subtracting the per-channel mean, which is more convenient in practice.', 'Zero centering and normalization It is common to zero center the data by subtracting the mean along every feature, and in machine learning literature, normalization is also attempted by normalizing each dimension, for instance, by standard deviation.', 'Whitening and covariance structure manipulation Further data pre-processing techniques include making the covariance structure diagonal by applying PCA and whitening the data to squash the covariance matrix into a diagonal, although these are less commonly used in computer vision applications.']}, {'end': 2687.175, 'start': 2215.717, 'title': 'Neural network weight initialization', 'summary': "Discusses the importance of weight initialization in neural networks, explaining the issues with zero initialization, the problems with small random number initialization in deep networks, and the impact of different scales of 
weight matrix initialization, leading to vanishing gradients and oversaturation of neurons, ultimately hindering the network's training and performance.', 'duration': 471.458, 'highlights': ['The problems with zero initialization Zero initialization provides no symmetry breaking: all neurons output the same values and receive identical gradients during backpropagation, so they all update in the same way and never differentiate.', 'Issues with small random number initialization in deep networks Small random number initialization works well for small networks but leads to vanishing gradients in deep networks, causing the standard deviation to plummet and resulting in activations collapsing to near-zero values.', "Impact of different scales of weight matrix initialization Using a small scale (0.01) for weight matrix initialization results in vanishing gradients, while using a larger scale (1.0) leads to oversaturation of neurons, causing gradients to become zeros and hindering the network's performance."]}], 'duration': 663.131, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA2024044.jpg', 'highlights': ['The common practice of mean centering and per-channel mean subtraction in computer vision applications', 'Zero centering and normalization are common data pre-processing techniques', 'Issues with small random number initialization in deep networks can lead to vanishing gradients', 'The problems with zero initialization stem from the lack of symmetry breaking', 'The impact of different scales of weight matrix initialization on gradients and neuron performance']}, {'end': 3069.49, 'segs': [{'end': 2711.909, 'src': 'embed', 'start': 2687.315, 'weight': 0, 'content': [{'end': 2693.918, 'text': 'And in this particular case, it needs to be somewhere between 1 and 0.01.', 'start': 2687.315, 'duration': 6.603}, {'end': 2698.22, 'text': 'And so you can be slightly more principled instead of trying different values.', 'start': 2693.918, 'duration': 4.302}, {'end': 2699.941, 'text': 'And there are some papers written on this.', 'start': 2698.441, 'duration': 1.5}, {'end': 2706.685, 'text': 'So for example, in 2010, there was a proposal for what we now call the Xavier initialization from Glorot et al.', 'start': 2700.442, 'duration': 6.243}, {'end': 2711.909, 'text': 'went through and they looked at the expression for the variance of your neurons.', 'start': 2708.846, 'duration': 3.063}], 'summary': 'In 2010, proposal for the Xavier initialization from Glorot et al. 
suggested variance expression for neurons.', 'duration': 24.594, 'max_score': 2687.315, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA2687315.jpg'}, {'end': 2963.283, 'src': 'embed', 'start': 2938.055, 'weight': 2, 'content': [{'end': 2943.257, 'text': 'In practice, in their paper, for example, they compare having the factor of 2 or not having that factor of 2.', 'start': 2938.055, 'duration': 5.202}, {'end': 2944.817, 'text': 'And it matters when you have really deep networks.', 'start': 2943.257, 'duration': 1.56}, {'end': 2946.458, 'text': 'In this case, I think they had a few dozen layers.', 'start': 2944.897, 'duration': 1.561}, {'end': 2949.476, 'text': 'If you account for the factor of 2, you converge.', 'start': 2947.475, 'duration': 2.001}, {'end': 2951.937, 'text': "If you don't account for that factor of 2, you do nothing.", 'start': 2949.796, 'duration': 2.141}, {'end': 2953.758, 'text': "It's just zero loss.", 'start': 2952.478, 'duration': 1.28}, {'end': 2956.54, 'text': 'So very important stuff.', 'start': 2955.699, 'duration': 0.841}, {'end': 2958.941, 'text': 'You really need to think it through.', 'start': 2957.4, 'duration': 1.541}, {'end': 2960.442, 'text': 'You have to be careful with initialization.', 'start': 2958.961, 'duration': 1.481}, {'end': 2963.283, 'text': "If it's incorrectly set, bad things happen.", 'start': 2960.562, 'duration': 2.721}], 'summary': 'Accounting for a factor of 2 in deep networks leads to convergence, while not accounting for it results in zero loss. initialization precision is crucial.', 'duration': 25.228, 'max_score': 2938.055, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA2938055.jpg'}, {'end': 3069.49, 'src': 'embed', 'start': 3036.986, 'weight': 1, 'content': [{'end': 3041.53, 'text': "So I'm going to go into a technique that alleviates a lot of these problems.", 'start': 3036.986, 'duration': 4.544}, {'end': 3045.393, 'text': 'But right now, I could take some questions if there are any at this point.', 'start': 3041.65, 'duration': 3.743}, {'end': 3047.415, 'text': 'Go ahead.', 'start': 3046.814, 'duration': 0.601}, {'end': 3050.598, 'text': 'Would it make sense to standardize the gradient coming in by dividing by the variances?', 'start': 3047.435, 'duration': 3.163}, {'end': 3055.757, 'text': 'Would it make sense to standardize the gradient coming in by dividing by the variance??', 'start': 3052.093, 'duration': 3.664}, {'end': 3062.563, 'text': "Possibly, but then you're not doing back propagation, because if you meddle with the gradient, then it's not clear what your objective is anymore,", 'start': 3057.438, 'duration': 5.125}, {'end': 3064.405, 'text': "and so you're not getting necessarily gradient.", 'start': 3062.563, 'duration': 1.842}, {'end': 3066.687, 'text': "So that's maybe the only concern.", 'start': 3065.646, 'duration': 1.041}, {'end': 3067.548, 'text': "I'm not sure what would happen.", 'start': 3066.707, 'duration': 0.841}, {'end': 3069.49, 'text': 'You can try to normalize the gradient.', 'start': 3067.828, 'duration': 1.662}], 'summary': 'Technique for normalizing gradient to improve objective clarity discussed.', 'duration': 32.504, 'max_score': 3036.986, 'thumbnail': ''}], 'start': 2687.315, 'title': 'Importance of neural network initialization', 'summary': 'Emphasizes the significance of proper weight initialization in neural networks for improved performance and efficiency, recommending 
scaling the initial weights by dividing by the square root of the number of inputs and exploring data-driven techniques to ensure convergence and avoid issues such as saturation and zero loss.', 'chapters': [{'end': 2763.535, 'start': 2687.315, 'title': 'Neural network initialization', 'summary': 'Discusses the importance of neural network initialization and the recommendation to scale the initial weights by dividing by the square root of the number of inputs for every single neuron, as proposed in the Xavier initialization from Glorot et al. in 2010, to achieve better performance and efficiency.', 'duration': 76.22, 'highlights': ['The recommendation to scale the initial weights by dividing by the square root of the number of inputs for every single neuron, proposed in the Xavier initialization from Glorot et al. in 2010, to achieve better performance and efficiency.', 'The proposal for the Xavier initialization in 2010, which suggests a specific strategy for scaling the initial weights, leading to more principled approaches and eliminating the need to try different values.', 'The explanation of how neural network initialization with lower weights for neurons with lots of inputs and larger weights for neurons with a smaller number of inputs intuitively makes sense for achieving a variance of one.']}, {'end': 3069.49, 'start': 2763.535, 'title': 'Neural network initialization', 'summary': 'Discusses the importance of proper weight initialization in neural networks to ensure convergence and avoid issues such as saturation and zero loss, highlighting the impact of different nonlinearities and the factor of 2 in relu neurons, with data-driven techniques also being explored.', 'duration': 305.955, 'highlights': ['The importance of proper weight initialization in neural networks The chapter emphasizes the significance of correct weight initialization to ensure networks train effectively and avoid issues such as saturation.', 'Impact of different nonlinearities and the factor of 2 in ReLU neurons The discussion delves into the impact of nonlinearities, particularly ReLU neurons, emphasizing the need to account for the factor of 2, since ReLU halves the variance, in order to obtain proper distributions.', 'Data-driven techniques for network initialization The chapter explores data-driven techniques for network initialization, highlighting the iterative scaling of weights to achieve roughly unit Gaussian activations throughout the network.']}], 'duration': 382.175, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA2687315.jpg', 'highlights': ['The recommendation to scale the initial weights by dividing by the square root of the number of inputs for every single neuron, proposed in the Xavier initialization from Glorot et al. 
in 2010, to achieve better performance and efficiency.', 'The proposal for the Xavier initialization in 2010, which suggests a specific strategy for scaling the initial weights, leading to more principled approaches and eliminating the need to try different values.', 'The explanation of how neural network initialization with lower weights for neurons with lots of inputs and larger weights for neurons with a smaller number of inputs intuitively makes sense for achieving a variance of one.', 'Data-driven techniques for network initialization The chapter explores data-driven techniques for network initialization, highlighting the iterative scaling of weights to achieve roughly unit Gaussian activations throughout the network.', 'The importance of proper weight initialization in neural networks The chapter emphasizes the significance of correct weight initialization to ensure networks train effectively and avoid issues such as saturation.', 'Impact of different nonlinearities and the factor of 2 in ReLU neurons The discussion delves into the impact of nonlinearities, particularly ReLU neurons, emphasizing the need to account for the factor of 2, since ReLU halves the variance, in order to obtain proper distributions.']}, {'end': 3525.414, 'segs': [{'end': 3099.815, 'src': 'embed', 'start': 3071.111, 'weight': 0, 'content': [{'end': 3076.717, 'text': "I think the method I'm going to propose in a bit is actually doing something to the effect of that, but in a clean way.", 'start': 3071.111, 'duration': 5.606}, {'end': 3078.974, 'text': 'OK, cool.', 'start': 3078.453, 'duration': 0.521}, {'end': 3082.777, 'text': "So let's go into something that actually fixes a lot of these problems in practice.", 'start': 3079.674, 'duration': 3.103}, {'end': 3085.619, 'text': "It's called batch normalization, and it was only proposed last year.", 'start': 3083.397, 'duration': 2.222}, {'end': 3088.982, 'text': "And so I couldn't even cover this last year in this class, but now I can.", 'start': 3086.02, 'duration': 2.962}, {'end': 3090.644, 'text': 'And so this actually helps a lot.', 'start': 3089.002, 'duration': 1.642}, {'end': 3099.815, 'text': 'The basic idea in Batch Normalization paper is, OK, you want roughly unit Gaussian activations in every single part of your network.', 'start': 3093.547, 'duration': 6.268}], 'summary': 'Proposing batch normalization, a new method to fix network problems, introduced last year.', 'duration': 28.704, 'max_score': 3071.111, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA3071111.jpg'}, {'end': 3402.484, 'src': 'embed', 'start': 3374.236, 'weight': 1, 'content': [{'end': 3379.957, 'text': "As you sweep through different choices of your initialization scale, you'll see that with and without batch norm, you'll see a huge difference.", 'start': 3374.236, 'duration': 5.721}, {'end': 3385.579, 'text': "With batch norm, you'll see a much more, things will just work for much larger settings of the initial scale.", 'start': 3380.297, 'duration': 5.282}, {'end': 3387.319, 'text': "And so you don't have to worry about it as much.", 'start': 3385.899, 'duration': 1.42}, {'end': 3389.34, 'text': 'It really helps out with this point.', 'start': 3387.439, 'duration': 1.901}, {'end': 3397.477, 'text': 'And one kind of more subtle thing to point out here is it kind of acts as a funny form of regularization.', 'start': 3390.487, 'duration': 6.99}, {'end': 3402.484, 'text': "And it reduces need for dropout, which we'll go into in a 
bit later in the class.", 'start': 3398.258, 'duration': 4.226}], 'summary': 'Batch normalization improves performance with larger initialization scales and reduces the need for dropout.', 'duration': 28.248, 'max_score': 3374.236, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA3374236.jpg'}], 'start': 3071.111, 'title': 'Batch normalization benefits and implementation', 'summary': 'Introduces batch normalization, a technique for achieving unit gaussian activations in the network. it discusses benefits such as improved gradient flow, higher learning rates, reduced dependence on initialization, acting as regularization, and explains implementation details and impact on runtime.', 'chapters': [{'end': 3180.871, 'start': 3071.111, 'title': 'Introduction to batch normalization', 'summary': 'Introduces the concept of batch normalization, a technique proposed last year to achieve unit gaussian activations in every part of the network, providing a clean solution to various problems in practice.', 'duration': 109.76, 'highlights': ['Batch normalization is a technique proposed last year to achieve unit Gaussian activations in every part of the network, providing a clean solution to various problems in practice.', 'The method involves inserting batch normalization layers into the network to ensure unit Gaussian activations in every single feature dimension across the batch, enabling backpropagation through the process.', 'Batch normalization evaluates the empirical mean and variance along every single feature and applies it independently across the batch, effectively normalizing the activations to be unit Gaussian.']}, {'end': 3525.414, 'start': 3181.211, 'title': 'Batch normalization benefits and implementation', 'summary': 'Discusses the benefits of batch normalization, including improved gradient flow, higher learning rates, reduced dependence on initialization, and acting as a form of regularization. it also explains the implementation details and the impact on runtime.', 'duration': 344.203, 'highlights': ['Batch normalization improves the gradient flow through the network, allowing for higher learning rates and reducing strong dependence on initialization. The chapter emphasizes that batch normalization improves the gradient flow through the network, enables higher learning rates, and reduces the strong dependence on initialization. This leads to faster learning and better performance.', 'Batch normalization acts as a form of regularization by jittering the representation space and reduces the need for dropout. The transcript highlights that batch normalization acts as a form of regularization by jittering the representation space, thereby reducing the need for dropout. This regularization effect contributes to the overall performance improvement.', 'At test time, batch normalization functions differently, using pre-computed mean and variance for a deterministic function forward. The chapter explains the functioning of batch normalization at test time, emphasizing the use of pre-computed mean and variance to ensure a deterministic function forward. 
This insight provides clarity on the behavior of batch normalization during testing.']}], 'duration': 454.303, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA3071111.jpg', 'highlights': ['Batch normalization ensures unit Gaussian activations in every part of the network.', 'Batch normalization improves gradient flow, enabling higher learning rates and reducing dependence on initialization.', 'Batch normalization acts as regularization by jittering the representation space and reduces the need for dropout.', 'At test time, batch normalization uses pre-computed mean and variance for a deterministic function forward.']}, {'end': 4716.915, 'segs': [{'end': 3580.59, 'src': 'embed', 'start': 3545.373, 'weight': 2, 'content': [{'end': 3548.374, 'text': "Any other questions? That's the price we pay, I suppose.", 'start': 3545.373, 'duration': 3.001}, {'end': 3548.714, 'text': 'Go ahead.', 'start': 3548.394, 'duration': 0.32}, {'end': 3555.936, 'text': "Is there a way of telling if you don't use batch normalization, if your data's not going so well and that maybe it would be a good idea to try it?", 'start': 3548.974, 'duration': 6.962}, {'end': 3558.337, 'text': 'So yeah, so when can you tell that you maybe need batch norm?', 'start': 3555.976, 'duration': 2.361}, {'end': 3560.698, 'text': "I think I'll come back to that in a few slides.", 'start': 3558.757, 'duration': 1.941}, {'end': 3566.339, 'text': "We'll see how can you detect that your network is not healthy, and then maybe you want to try batch norm.", 'start': 3561.538, 'duration': 4.801}, {'end': 3570.521, 'text': 'OK, so the learning process, I have 20 minutes.', 'start': 3566.74, 'duration': 3.781}, {'end': 3571.201, 'text': 'I think I can do this.', 'start': 3570.561, 'duration': 0.64}, {'end': 3575.167, 'text': "Yeah, we're at 70 out of 100, so I think we're fine.", 'start': 3573.186, 'duration': 1.981}, {'end': 3577.008, 'text': "So we've pre-processed our data.", 'start': 3575.927, 'duration': 1.081}, {'end': 3579.029, 'text': "We've decided.", 'start': 3577.628, 'duration': 1.401}, {'end': 3580.59, 'text': "let's decide on some.", 'start': 3579.029, 'duration': 1.561}], 'summary': 'Discussion on the use of batch normalization and data preprocessing, with a 70% progress in the learning process.', 'duration': 35.217, 'max_score': 3545.373, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA3545373.jpg'}, {'end': 3643.237, 'src': 'embed', 'start': 3616.556, 'weight': 6, 'content': [{'end': 3623.122, 'text': 'So weights and biases initialized with just naive initialization here, because this is just a very small network.', 'start': 3616.556, 'duration': 6.566}, {'end': 3626.585, 'text': 'So I can afford to maybe do just a naive sample from a Gaussian.', 'start': 3623.182, 'duration': 3.403}, {'end': 3629.908, 'text': 'And then this is a function that basically is going to train a neural network.', 'start': 3627.346, 'duration': 2.562}, {'end': 3632.21, 'text': "And I'm not showing you the implementation, obviously.", 'start': 3630.389, 'duration': 1.821}, {'end': 3637.735, 'text': 'But just one thing, basically it returns your loss and it returns your gradients on your model parameters.', 'start': 3632.851, 'duration': 4.884}, {'end': 3643.237, 'text': "And so the first thing I might try, for example, is I disable the regularization that's passed in the end.", 'start': 3638.416, 'duration': 4.821}], 'summary': 'Training a 
small neural network with naive weight initialization and loss computation', 'duration': 26.681, 'max_score': 3616.556, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA3616556.jpg'}, {'end': 3916.171, 'src': 'embed', 'start': 3888.285, 'weight': 8, 'content': [{'end': 3891.571, 'text': "I think it's because it's just looking at which one's the highest for the accuracy.", 'start': 3888.285, 'duration': 3.286}, {'end': 3893.995, 'text': 'And so the probabilities are just shifted a little bit.', 'start': 3891.731, 'duration': 2.264}, {'end': 3896.245, 'text': 'Right, right.', 'start': 3895.925, 'duration': 0.32}, {'end': 3901.627, 'text': 'So you start out with diffused scores.', 'start': 3896.685, 'duration': 4.942}, {'end': 3903.427, 'text': "And now what's happening is you're training.", 'start': 3902.187, 'duration': 1.24}, {'end': 3905.328, 'text': 'So these scores are tiny shifting.', 'start': 3903.547, 'duration': 1.781}, {'end': 3907.428, 'text': 'Your loss is still roughly diffused.', 'start': 3905.808, 'duration': 1.62}, {'end': 3908.449, 'text': 'So you end up with the same loss.', 'start': 3907.448, 'duration': 1.001}, {'end': 3911.67, 'text': 'But now your correct answers are now a tiny bit more probable.', 'start': 3908.489, 'duration': 3.181}, {'end': 3916.171, 'text': "And so when you're actually computing the accuracy, the argmax-y class actually ends up being the correct one.", 'start': 3912.07, 'duration': 4.101}], 'summary': 'Training causes tiny shift in probabilities, increasing accuracy slightly.', 'duration': 27.886, 'max_score': 3888.285, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA3888285.jpg'}, {'end': 4039.381, 'src': 'embed', 'start': 4010.885, 'weight': 4, 'content': [{'end': 4015.847, 'text': 'And when you do this hyperparameter optimization, you can start out first with just a small number of epochs.', 'start': 4010.885, 'duration': 4.962}, {'end': 4017.228, 'text': "You don't have to run for a very long time.", 'start': 4015.867, 'duration': 1.361}, {'end': 4018.488, 'text': 'Just run for a few minutes.', 'start': 4017.248, 'duration': 1.24}, {'end': 4022.87, 'text': "You can already get a sense of what's working better than some other things.", 'start': 4018.649, 'duration': 4.221}, {'end': 4030.414, 'text': "And also, one note, when you're optimizing over regularization and learning rate, it's best to sample from log space.", 'start': 4023.771, 'duration': 6.643}, {'end': 4032.755, 'text': "You don't just want to sample from a uniform distribution.", 'start': 4030.774, 'duration': 1.981}, {'end': 4039.381, 'text': 'because these learning rates and regularization, they act multiplicatively on the dynamics of your backpropagation.', 'start': 4033.255, 'duration': 6.126}], 'summary': 'Hyperparameter optimization can start with few epochs, run for few minutes to gauge performance, and sample from log space for regularization and learning rate.', 'duration': 28.496, 'max_score': 4010.885, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA4010885.jpg'}, {'end': 4096.014, 'src': 'embed', 'start': 4068.206, 'weight': 0, 'content': [{'end': 4073.587, 'text': "I'm of course doing a second pass where I'm kind of going in and I'm changing these again a bit and I'm looking at what works.", 'start': 4068.206, 'duration': 5.381}, {'end': 4078.469, 'text': 'So I find that I can now get to 53, so some of these work
really well.', 'start': 4074.908, 'duration': 3.561}, {'end': 4081.65, 'text': 'One thing to be aware of, sometimes you get a result like this.', 'start': 4079.369, 'duration': 2.281}, {'end': 4091.833, 'text': "So 53 is working quite well and this is actually, if I see this, I'm actually worried at this point because I'm so through this cross-validation here,", 'start': 4082.35, 'duration': 9.483}, {'end': 4096.014, 'text': "I have a result here and there's something actually wrong about this result that hints at some issue.", 'start': 4091.833, 'duration': 4.181}], 'summary': 'After a second pass, 53 items are found to work well, but a result at this point hints at an issue.', 'duration': 27.808, 'max_score': 4068.206, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA4068206.jpg'}, {'end': 4377.29, 'src': 'embed', 'start': 4342.209, 'weight': 3, 'content': [{'end': 4344.511, 'text': "OK, so you're optimizing and you're looking at the loss functions.", 'start': 4342.209, 'duration': 2.302}, {'end': 4350.456, 'text': 'These loss functions can take various different forms, and you need to be able to read into what that means.', 'start': 4345.672, 'duration': 4.784}, {'end': 4356.241, 'text': "So you'll get quite good at looking at loss functions and intuiting what happens.", 'start': 4351.196, 'duration': 5.045}, {'end': 4362.946, 'text': "So this one, for example, as I was pointing out in that previous lecture, it's not as exponential as I may be used to my loss functions.", 'start': 4357.221, 'duration': 5.725}, {'end': 4366.308, 'text': "I'd like it to, you know, it looks a little too linear.", 'start': 4363.687, 'duration': 2.621}, {'end': 4369.869, 'text': 'And so that maybe tells me that the learning rate is maybe slightly too low.', 'start': 4366.868, 'duration': 3.001}, {'end': 4371.869, 'text': "So that doesn't mean the learning rate is too low.", 'start': 4370.409, 'duration': 1.46}, {'end': 4374.75, 'text': 'It just means that I might want to consider trying a higher learning rate.', 'start': 4371.889, 'duration': 2.861}, {'end': 4377.29, 'text': 'Sometimes you get all kinds of funny things.', 'start': 4375.79, 'duration': 1.5}], 'summary': 'Optimizing involves analyzing loss functions to adjust learning rates for better performance.', 'duration': 35.081, 'max_score': 4342.209, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA4342209.jpg'}], 'start': 3525.434, 'title': 'Neural network optimization', 'summary': 'Covers the importance of batch normalization, training process, and hyperparameter optimization in neural networks, highlighting the use of batch normalization in a two-layer neural network with 50 hidden neurons and the cifar-10 dataset, performing sanity checks for overfitting, and emphasizing the significance of random sampling over grid search for hyperparameter optimization.', 'chapters': [{'end': 3616.256, 'start': 3525.434, 'title': 'Batch normalization in neural networks', 'summary': 'Discusses the importance of batch normalization in neural networks, addressing its impact on the training process and the detection of network health, while also highlighting the use of batch normalization in a two-layer neural network with 50 hidden neurons and the cifar-10 dataset.', 'duration': 90.822, 'highlights': ['Batch normalization is a common practice with convolutional layers, often used after every convolutional layer, and the accumulation of multiple 
convolutional layers can lead to issues. (relevance: 5)', 'Detection of network health and the potential need for batch normalization will be addressed in upcoming slides, emphasizing the importance of recognizing when batch normalization may be necessary. (relevance: 4)', 'The speaker mentions working with the CIFAR-10 dataset and utilizing a two-layer neural network with 50 hidden neurons, providing insights into the practical application of training neural networks. (relevance: 3)', 'The process of playing with data and optimizing hyperparameters is described, focusing on the implementation and validation of a small neural network, offering a practical view of the training process. (relevance: 2)', 'The speaker briefly discusses the initial steps of pre-processing data and the decision to work with the CIFAR-10 dataset, outlining the foundational aspects of the experimentation. (relevance: 1)']}, {'end': 3942.72, 'start': 3616.556, 'title': 'Neural network training and sanity checks', 'summary': 'Discusses the process of training a neural network, including initial weight and bias initialization, loss validation, regularization impact, and the importance of performing sanity checks such as overfitting on a small data set and tuning learning rates for effective training.', 'duration': 326.164, 'highlights': ['Performing sanity checks such as overfitting on a small data set and tuning learning rates is crucial for ensuring the correctness and effectiveness of neural network training.', 'Validating loss and gradients, as well as testing the impact of regularization, are essential steps in confirming the functionality of the neural network.', 'The process of scaling up from overfitting a small data set to finding the optimal learning rate for a larger data set requires iterative experimentation and careful consideration of the learning rate scale.']}, {'end': 4716.915, 'start': 3944.101, 'title': 'Hyperparameter optimization strategies', 'summary': 'Discusses hyperparameter optimization strategies, including using binary search to narrow down cost, hierarchical approach to finding optimal hyperparameters, and the importance of sampling hyperparameters from log space. 
It also emphasizes the significance of random sampling over grid search and the need to track training data accuracies to detect overfitting.', 'duration': 772.814, 'highlights': ['The importance of sampling hyperparameters from log space to optimize over regularization and learning rate, as they act multiplicatively on the dynamics of backpropagation.', 'The hierarchical approach to hyperparameter optimization, starting with a rough idea and then narrowing down on regions that work well, resulting in a 53% accuracy in this case.', 'The significance of random sampling over grid search in hyperparameter optimization, as random sampling provides more insights into important parameters and yields better results.', 'The need to track training data accuracies in comparison to validation data to detect overfitting, indicated by a significant gap between the two accuracies.', 'The importance of monitoring the ratio of the update magnitude to the parameter magnitude, which should be roughly 1e-3; ratios much larger or smaller suggest adjusting the learning rate.']}], 'duration': 1191.481, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/gYpoJMlgyXA/pics/gYpoJMlgyXA3525434.jpg', 'highlights': ['Batch normalization is commonly applied after convolutional layers as well, addressing issues that compound as layers are stacked.', 'Detection of network health and need for batch normalization is emphasized.', 'Utilizing a two-layer neural network with 50 hidden neurons on the CIFAR-10 dataset.', 'Sanity checks like overfitting a small data subset and learning rate tuning are crucial for training.', 'Importance of sampling hyperparameters from log space for optimization.', 'Hierarchical approach to hyperparameter optimization resulted in 53% accuracy.', 'Random sampling is emphasized over grid search for better hyperparameter optimization.', 'Monitoring training and validation data accuracies to detect overfitting is essential.', 'Monitoring the ratio of update magnitude to parameter magnitude is crucial.']}], 'highlights': ["The resurgence of neural network research in 2006 was marked by Hinton's paper, which demonstrated the successful training of deep neural networks using unsupervised pre-training.", 'Significant improvements in speech recognition and visual recognition from 2010 onwards, attributed to better initialization, activation functions, and the availability of GPUs and more data.', 'The ReLU activation function makes neural networks converge much quicker, by a factor of 6.', 'During training, if the learning rate is high, up to 20% of the network can end up with dead ReLU neurons that never turn on, affecting the training.', 'Batch normalization ensures unit Gaussian activations in every part of the network.', 'The recommendation to scale the initial weights by dividing by the square root of the number of inputs to each neuron, proposed in the Xavier initialization from Glorot et al. in 2010, to achieve better performance and efficiency.']}
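
The chapters summarized above walk through several concrete techniques; the short Python/NumPy sketches below illustrate them, with assumptions flagged in each case. First, the batch-normalization chapter describes normalizing every feature dimension to unit Gaussian over the mini-batch, applying a learnable scale and shift, and switching to pre-computed statistics at test time so the forward pass is deterministic. A minimal sketch of that forward pass, assuming a simple exponential running average for the test-time statistics (the function signature and momentum value are illustrative, not the lecture's exact code):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      mode='train', eps=1e-5, momentum=0.9):
    # Sketch of batch normalization over a mini-batch x of shape (N, D):
    # normalize each feature to zero mean / unit variance across the batch,
    # then apply a learnable scale (gamma) and shift (beta).
    if mode == 'train':
        mu = x.mean(axis=0)                   # empirical mean per feature
        var = x.var(axis=0)                   # empirical variance per feature
        x_hat = (x - mu) / np.sqrt(var + eps)
        # Track running statistics for use at test time (assumed scheme).
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        # Test time: use the pre-computed statistics, so the forward pass
        # is a fixed, deterministic function of x.
        x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    out = gamma * x_hat + beta
    return out, running_mean, running_var

# Example usage with made-up sizes (batch of 128, 50 features):
x = np.random.randn(128, 50)
gamma, beta = np.ones(50), np.zeros(50)
rm, rv = np.zeros(50), np.ones(50)
out, rm, rv = batchnorm_forward(x, gamma, beta, rm, rv, mode='train')
```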
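
The "training and sanity checks" chapter recommends checking the initial loss with regularization disabled (about log(10) ≈ 2.3 for a softmax over CIFAR-10's 10 classes), confirming the loss rises when regularization is turned on, and overfitting a tiny subset of the data before any full training run. A sketch of those checks; `loss_fn` and `train_fn` are hypothetical stand-ins for whatever interface your own model code exposes, not the lecture's functions:

```python
import numpy as np

def initial_loss_checks(loss_fn, model, X, y, num_classes=10):
    # 1) With regularization off, small random weights give diffuse scores,
    #    so the softmax loss should be close to -log(1/num_classes).
    loss_no_reg, _ = loss_fn(model, X, y, reg=0.0)
    print('loss with reg=0: %.3f (expect ~%.3f)' % (loss_no_reg, np.log(num_classes)))

    # 2) Turning regularization on should make the loss strictly larger,
    #    since the regularization term is added on top of the data loss.
    loss_reg, _ = loss_fn(model, X, y, reg=1e3)
    print('loss with reg>0: %.3f (should have gone up)' % loss_reg)

def overfit_small_subset(train_fn, model, X, y, n=20):
    # 3) Take ~20 examples, disable regularization, and make sure training
    #    can drive the loss to ~0 (100% training accuracy). If a small net
    #    cannot overfit 20 examples, something is broken.
    return train_fn(model, X[:n], y[:n], reg=0.0,
                    learning_rate=1e-3, num_epochs=200)
```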
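
For hyperparameter search, the summary stresses sampling learning rate and regularization strength in log space (they act multiplicatively on the training dynamics), preferring random sampling over a grid, and working coarse-to-fine with only a few epochs per trial at first. A minimal sketch; `train_and_evaluate` is a hypothetical callable and the sampling ranges are illustrative only:

```python
import numpy as np

def coarse_random_search(train_and_evaluate, num_trials=100):
    # 'train_and_evaluate' is a hypothetical helper: given lr and reg it
    # trains briefly and returns validation accuracy.
    results = []
    for _ in range(num_trials):
        # Sample the *exponent* uniformly so values are spread in log space.
        lr = 10 ** np.random.uniform(-6, -3)    # illustrative range
        reg = 10 ** np.random.uniform(-5, 5)    # illustrative range
        val_acc = train_and_evaluate(lr=lr, reg=reg, num_epochs=5)
        results.append((val_acc, lr, reg))
        print('val_acc %.3f  lr %.2e  reg %.2e' % (val_acc, lr, reg))
    # Coarse -> fine: narrow the ranges around the best settings and rerun
    # with longer training. If the best results sit at the edge of a sampled
    # range, widen that range rather than trusting them.
    return max(results)
```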
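
The highlights also mention tracking the size of the parameter updates relative to the size of the parameters themselves, aiming for a ratio of roughly 1e-3. A sketch of that diagnostic; the warning thresholds are rule-of-thumb assumptions:

```python
import numpy as np

def update_to_param_ratio(W, dW, learning_rate):
    # Ratio of the update magnitude to the parameter magnitude. A common
    # rule of thumb is ~1e-3: much larger suggests the learning rate is
    # too high, much smaller that it is too low.
    param_scale = np.linalg.norm(W.ravel())
    update_scale = np.linalg.norm((learning_rate * dW).ravel())
    return update_scale / param_scale

# Inside a training loop (W and dW assumed to come from backprop):
# ratio = update_to_param_ratio(W, dW, learning_rate)
# if ratio > 1e-2 or ratio < 1e-4:
#     print('update/param ratio %.1e: consider adjusting the learning rate' % ratio)
```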
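
Finally, the Xavier/Glorot (2010) initialization mentioned in the last highlight scales each neuron's initial weights by one over the square root of its number of inputs, so activation variances stay roughly constant across layers. A short sketch with made-up layer sizes, including the common ReLU-oriented variant that divides by sqrt(fan_in / 2):

```python
import numpy as np

fan_in, fan_out = 512, 256   # example layer sizes, chosen only for illustration

# Xavier/Glorot-style scaling: divide by sqrt(fan_in) so the variance of a
# neuron's pre-activation stays comparable to the variance of its inputs.
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)
b = np.zeros(fan_out)

# ReLU-oriented variant: compensate for the half of the units that are
# zeroed out by dividing by sqrt(fan_in / 2) instead.
W_relu = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2.0)
```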