title
Lecture 7 | Training Neural Networks II
description
Lecture 7 continues our discussion of practical issues for training neural networks. We discuss different update rules commonly used to optimize neural networks during training, as well as different strategies for regularizing large neural networks including dropout. We also discuss transfer learning and finetuning.
Keywords: Optimization, momentum, Nesterov momentum, AdaGrad, RMSProp, Adam, second-order optimization, L-BFGS, ensembles, regularization, dropout, data augmentation, transfer learning, finetuning
Slides: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture7.pdf
--------------------------------------------------------------------------------------
Convolutional Neural Networks for Visual Recognition
Instructors:
Fei-Fei Li: http://vision.stanford.edu/feifeili/
Justin Johnson: http://cs.stanford.edu/people/jcjohns/
Serena Yeung: http://ai.stanford.edu/~syyeung/
Computer Vision has become ubiquitous in our society, with applications in search, image understanding, apps, mapping, medicine, drones, and self-driving cars. Core to many of these applications are visual recognition tasks such as image classification, localization and detection. Recent developments in neural network (aka “deep learning”) approaches have greatly advanced the performance of these state-of-the-art visual recognition systems. This lecture collection is a deep dive into details of the deep learning architectures with a focus on learning end-to-end models for these tasks, particularly image classification. From this lecture collection, students will learn to implement, train and debug their own neural networks and gain a detailed understanding of cutting-edge research in computer vision.
Website:
http://cs231n.stanford.edu/
For additional learning opportunities please visit:
http://online.stanford.edu/
detail
Summary: The lecture covers practical issues in training neural networks: managing Google Cloud instances for the assignments; a recap of activation functions, weight initialization, data normalization, and hyperparameter optimization; the failure modes of plain stochastic gradient descent; SGD with momentum and Nesterov momentum; AdaGrad, RMSProp, and Adam; and learning rate schedules and second-order optimization, before moving on to regularization techniques such as dropout and to transfer learning.

Chapters:

Chapter 1 (0:05-3:34): Optimizing instance usage and cloud training
Project proposals are due today at 11:59 pm, and grading of assignment one is underway. Assignment two has historically been the longest assignment in the class and is due a week from Thursday, so students who have not started should look at it soon. Much of assignment two runs on Google Cloud, and instances must be stopped when not in use: an instance accrues charges whenever it is on, even if nobody is SSH'd into it or running anything in a Jupyter notebook, and the course only has so many coupons to distribute. Stopping an instance means explicitly choosing "stop" from the dropdown in the Cloud dashboard at the end of each working day, and students are responsible for tracking their own spending. GPU instances cost roughly 90 cents to a dollar per hour, far more than CPU instances, so the recommended strategy is to keep two instances, one with a GPU and one without, and use the GPU instance only when it is really needed; most of assignment two needs only the CPU, while the final TensorFlow/PyTorch question needs the GPU. Two to four CPUs and 8 or 16 GB of RAM are plenty for everything in the class, and scaling up CPUs or RAM only increases cost.
Chapter 2 (3:37-13:20): Activation functions, weight initialization, data normalization, and hyperparameter optimization
A quick recap of the previous lecture: among the zoo of activation functions, the sigmoid, which was popular perhaps ten years ago, suffers from vanishing gradients near the two ends of its range, and tanh has a similar problem, so the general recommendation is to stick with ReLU as the default choice because it tends to work well for many different architectures. For weight initialization, weights that are too small make the activations shrink toward zero as they are multiplied through the layers of a deep network, so nothing learns, while weights that are too large make the activations explode; an initialization that gets the scale right, such as Xavier or MSRA initialization, keeps a nice distribution of activations throughout the network. For data preprocessing, it is typical in convolutional networks to zero-center and normalize the data so it has zero mean and unit variance. The intuition, illustrated with a binary classification problem of separating red points from blue points, is that unnormalized, uncentered data far from the origin makes the loss very sensitive to small perturbations of the classifier's weights, whereas normalized data makes the same problem much less sensitive and easier to optimize. Because normalization is so important, batch normalization adds a layer inside the network that forces intermediate activations to be zero mean and unit variance: in the forward pass, the statistics of the mini-batch are used to compute a mean and a standard deviation, and those estimates normalize the data. The lecture re-summarizes the batch normalization equations with the shapes written out explicitly as a reference for implementing it in assignment two; a minimal sketch is shown below.
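As a rough companion to the batch normalization recap above, here is a minimal NumPy sketch of the training-time forward pass, using mini-batch statistics only (the learnable scale and shift parameters gamma and beta, the epsilon constant, and the function name are conventional choices of mine, not taken from the lecture slides):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-time batch normalization for a mini-batch x of shape (N, D).

    Mini-batch statistics normalize each feature to zero mean and unit
    variance, then a learnable scale (gamma) and shift (beta) are applied.
    Test time would instead use running averages of the statistics.
    """
    mu = x.mean(axis=0)                     # per-feature mean over the batch
    var = x.var(axis=0)                     # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalized activations
    return gamma * x_hat + beta             # scale and shift

# Tiny usage example: badly scaled activations come out roughly zero mean, unit variance.
x = np.random.randn(4, 3) * 10 + 5
y = batchnorm_forward(x, np.ones(3), np.zeros(3))
print(y.mean(axis=0), y.std(axis=0))
```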
For hyperparameter optimization, random search is arguably nicer than grid search in theory: when performance is much more sensitive to one hyperparameter than another, random sampling covers the important dimension better. A coarse-to-fine strategy is recommended, starting with very wide ranges for the hyperparameters, training for only a couple of iterations, and then narrowing the ranges based on those results. The learning rate deserves particular care: setting it very low and training for a very long time should in theory always work, but in practice factors of 10 or 100 matter a lot. With the right learning rate a network might train in six hours, twelve hours, or a day, whereas being overly safe and dropping the rate by a factor of 10 or 100 can turn one day of training into a hundred days. A small sketch of log-scale random search follows.
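A minimal sketch of the coarse-to-fine random search described above, sampling the learning rate and regularization strength on a log scale; train_and_eval is a synthetic placeholder of mine standing in for a short training run, and the sampling ranges are illustrative, not values from the lecture:

```python
import math
import random

def train_and_eval(lr, reg):
    """Placeholder for 'train a few iterations and report validation accuracy'.

    A synthetic score peaked near lr = 1e-3, reg = 1e-4, used only so the
    sketch runs end to end; it is not a real model.
    """
    return -((math.log10(lr) + 3.0) ** 2 + 0.1 * (math.log10(reg) + 4.0) ** 2)

def coarse_random_search(num_trials=20, lr_exp=(-6, -2), reg_exp=(-5, 0)):
    """Coarse stage: wide ranges, few iterations, sample exponents uniformly."""
    results = []
    for _ in range(num_trials):
        lr = 10 ** random.uniform(*lr_exp)      # uniform on a log scale
        reg = 10 ** random.uniform(*reg_exp)
        results.append((train_and_eval(lr, reg), lr, reg))
    # Inspect the best few settings, then narrow the ranges and repeat (fine stage).
    return sorted(results, reverse=True)[:5]

print(coarse_random_search())
```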
Chapter 3 (13:20-21:29): Neural network optimization and problems with SGD
The main topics for today are fancier, more powerful optimization algorithms and which ones people actually use in practice; regularization strategies that make the network do additional work to reduce the gap between train and test error; and transfer learning, where you can sometimes get away with less data than you think by transferring from one problem to another. The core strategy in training neural networks is an optimization problem: a loss function assigns a value to each setting of the network weights, and the goal is to find the weights that minimize it. Plain stochastic gradient descent has several problems. If the loss has a bad condition number, meaning it is far more sensitive to changes in some parameter directions than others, SGD exhibits undesirable zigzagging behavior and makes slow progress, which is a real issue for many high-dimensional problems. Local minima and saddle points are another concern: in high-dimensional problems, saddle points are far more common than local minima, and the gradient is small in the entire region around a saddle point, so progress becomes very slow when training large networks. Finally, the stochastic part of SGD matters: computing the loss over all training examples is expensive, so the gradient is estimated from mini-batches, and those estimates are noisy.
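To make the poor-conditioning problem concrete, here is a small toy example of my own (not from the lecture): plain gradient descent on a quadratic bowl whose curvature differs by a factor of 50 between the two coordinates. The iterate zigzags along the steep direction while crawling along the shallow one:

```python
import numpy as np

# Loss L(w) = 0.5 * (a * w[0]**2 + b * w[1]**2) with condition number b / a = 50.
a, b = 1.0, 50.0

def grad(w):
    return np.array([a * w[0], b * w[1]])

w = np.array([-10.0, 2.0])
lr = 0.035  # close to the largest stable step size for the steep direction

for t in range(30):
    w = w - lr * grad(w)
    if t % 5 == 0:
        print(t, w)
# w[1] flips sign every step (zigzag) while w[0] shrinks very slowly.
```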
Chapter 4 (21:30-31:01): Noisy gradients, SGD with momentum, and Nesterov momentum
A demonstration adds random uniform noise to the gradient at every point and runs SGD with these corrupted gradients; that is not exactly what happens with real mini-batch noise, but it gives the sense that noisy gradient estimates make vanilla SGD meander around the space and potentially take a long time to reach the minimum. A student asks whether these problems all go away with full-batch gradient descent, and they largely do not: the taco-shell problem of high condition numbers remains, networks can contain explicit stochasticity in addition to mini-batch sampling noise, and saddle points are still a problem, so a fancier update rule is needed either way. SGD with momentum is a very simple fix, only a couple of extra lines of code: maintain a velocity over time, add the gradient estimate to the velocity, and step in the direction of the velocity rather than in the direction of the raw gradient. A hyperparameter rho plays the role of friction, decaying the velocity at every step; a high value like 0.9 is a common choice. The physical picture is a ball rolling down a hill and picking up speed: at a local minimum or saddle point the ball still has velocity even where the gradient is zero, so it rolls through; on poorly conditioned problems the zigzagging components tend to cancel out in the velocity; and because the velocity is built up over time, noise in the gradient estimates gets averaged out, so the path toward the minimum is much smoother than the meandering path of plain SGD. A minimal sketch of the update appears below.
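A minimal sketch of the SGD-plus-momentum update just described; grad_fn is a stand-in for whatever computes the mini-batch gradient, and the default learning rate is an illustrative value of mine:

```python
import numpy as np

def sgd_momentum(w, grad_fn, lr=1e-2, rho=0.9, num_steps=100):
    """SGD with momentum: build up a velocity and step along it.

    rho acts like friction (0.9 is the common choice mentioned in the
    lecture); the velocity starts at zero.
    """
    v = np.zeros_like(w)
    for _ in range(num_steps):
        g = grad_fn(w)
        v = rho * v + g      # decay the old velocity, add the current gradient
        w = w - lr * v       # step in the direction of the velocity
    return w

# Usage on the ill-conditioned quadratic from the earlier sketch.
a, b = 1.0, 50.0
print(sgd_momentum(np.array([-10.0, 2.0]),
                   lambda w: np.array([a * w[0], b * w[1]])))
```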
Nesterov accelerated gradient changes the order of operations: starting from the current point, step in the direction the velocity would take you, evaluate the gradient at that point, then go back to the original point and mix the two together. The interpretation is that if the velocity direction was a little bit wrong, you get to incorporate gradient information from a slightly larger part of the objective landscape. Nesterov momentum has nice theoretical properties for convex optimization, but those guarantees go out the window for non-convex problems like neural networks. In equation form, the velocity update takes a step according to the previous velocity and evaluates the gradient there, and the parameter update then steps along the new velocity, mixing information from the two points; with a change of variables the update can be rearranged so that the loss and the gradient are always evaluated at the same point, which is convenient in practice. Two common questions come up: the velocity is almost always initialized to zero, so it is not really a hyperparameter, and intuitively the velocity is a weighted sum of the gradients seen over time, a smooth moving average in which the weights on older gradients decay exponentially. A sketch of the rearranged Nesterov update follows.
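A minimal sketch of Nesterov momentum in the rearranged (change-of-variables) form mentioned above, where the loss and gradient are evaluated at the current parameters; again grad_fn is a stand-in and the learning rate is illustrative:

```python
import numpy as np

def nesterov_momentum(w, grad_fn, lr=1e-2, rho=0.9, num_steps=100):
    """Nesterov momentum, written so the gradient is taken at w itself.

    Equivalent to "step along the velocity, evaluate the gradient there,
    then mix", after a change of variables on the parameter vector.
    """
    v = np.zeros_like(w)
    for _ in range(num_steps):
        g = grad_fn(w)
        v_prev = v
        v = rho * v - lr * g               # velocity update
        w = w + v + rho * (v - v_prev)     # step with a look-ahead correction term
    return w
```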
Chapter 5 (31:02-44:47): AdaGrad, RMSProp, and Adam
Comparing trajectories on a simple problem, plain SGD makes slow progress toward the minimum, while momentum and Nesterov build up velocity, overshoot the minimum, and then correct themselves and come back. A student asks whether this overshooting is a problem for sharp minima, and the answer points to recent theoretical work: a very sharp minimum may be one that overfits, in the sense that doubling the training set would change the whole optimization landscape and such a sensitive minimum might disappear, whereas very flat minima are probably more robust to changes in the training data and likely generalize better to test data, so landing in flat minima is arguably what you want. AdaGrad, which John Duchi worked on during his PhD, takes a different approach from momentum: keep a running sum of the squared gradients seen during training and, when updating the parameters, divide the step by the square root of that accumulated quantity. If one coordinate always has large gradients and another always has small gradients, dividing by the small accumulated value accelerates movement along the slow, flat dimension, while dividing by the large accumulated value slows down progress along the wiggling dimension. The catch is that the accumulated sum grows monotonically, so the step sizes get smaller and smaller over the course of training; in the convex case that is actually a feature, because you want to slow down as you approach the minimum, but in the non-convex case it is problematic, because you can get stuck near a saddle point and stop making progress. RMSProp is a slight variation that addresses this by letting the squared-gradient estimate decay, multiplying it by a decay rate at each step before adding the new squared gradient, which looks like a momentum update over the squared gradients rather than over the gradients themselves. Sketches of both updates are shown below.
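Minimal sketches of the AdaGrad and RMSProp updates described above; the decay rate of 0.99 and the small epsilon added for numerical stability are typical values I am assuming rather than numbers quoted in the excerpt:

```python
import numpy as np

def adagrad(w, grad_fn, lr=1e-2, eps=1e-7, num_steps=100):
    """AdaGrad: divide the step by the square root of the running sum of squared gradients."""
    grad_squared = np.zeros_like(w)
    for _ in range(num_steps):
        g = grad_fn(w)
        grad_squared += g * g                           # grows monotonically
        w = w - lr * g / (np.sqrt(grad_squared) + eps)  # per-coordinate scaling
    return w

def rmsprop(w, grad_fn, lr=1e-2, decay_rate=0.99, eps=1e-7, num_steps=100):
    """RMSProp: the same idea, but the squared-gradient estimate decays over time."""
    grad_squared = np.zeros_like(w)
    for _ in range(num_steps):
        g = grad_fn(w)
        grad_squared = decay_rate * grad_squared + (1 - decay_rate) * g * g
        w = w - lr * g / (np.sqrt(grad_squared) + eps)
    return w
```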
training by accumulating squared gradients and adjusting step size, but adagrad may lead to smaller steps over time, causing potential convergence issues in non-convex cases.', 'duration': 183.186, 'max_score': 1994.887, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_JB0AO7QxSA/pics/_JB0AO7QxSA1994887.jpg'}, {'end': 2358.202, 'src': 'heatmap', 'start': 2309.11, 'weight': 0.781, 'content': [{'end': 2315.252, 'text': "we had this idea of velocity, where we're building up velocity by adding in the gradients and then stepping in the direction of the velocity.", 'start': 2309.11, 'duration': 6.142}, {'end': 2323.094, 'text': 'And we saw with AdaGrad and RMS prop that we had this other idea of building up an estimate of the squared gradients and then dividing by the squared gradients.', 'start': 2315.632, 'duration': 7.462}, {'end': 2326.716, 'text': 'So then like these both seem like good ideas on their own.', 'start': 2323.955, 'duration': 2.761}, {'end': 2330.057, 'text': "Why don't we just stick them together and use them both? Maybe that would be even better.", 'start': 2326.796, 'duration': 3.261}, {'end': 2336.223, 'text': 'And that brings us to this algorithm called Atom, or rather brings us very close to Atom.', 'start': 2331.138, 'duration': 5.085}, {'end': 2340.648, 'text': "We'll see in a couple slides that there's a slight correction we need to make here.", 'start': 2337.024, 'duration': 3.624}, {'end': 2346.631, 'text': 'So here with Atom, we maintain an estimate of the first moment, and the second moment.', 'start': 2341.269, 'duration': 5.362}, {'end': 2349.394, 'text': 'And now, in the red,', 'start': 2347.192, 'duration': 2.202}, {'end': 2358.202, 'text': 'we make this estimate of the first moment as a weighted sum of our gradients and we have this moving estimate of the second moment,', 'start': 2349.394, 'duration': 8.808}], 'summary': 'Atom algorithm combines first and second moment estimates for optimization.', 'duration': 49.092, 'max_score': 2309.11, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_JB0AO7QxSA/pics/_JB0AO7QxSA2309110.jpg'}, {'end': 2582.141, 'src': 'heatmap', 'start': 2536.008, 'weight': 0.728, 'content': [{'end': 2542.752, 'text': 'we create an unbiased estimate of those first and second moments by incorporating the current time step t.', 'start': 2536.008, 'duration': 6.744}, {'end': 2548.805, 'text': 'And now we actually make our step using these unbiased estimates rather than the original first and second moment estimates.', 'start': 2542.752, 'duration': 6.053}, {'end': 2551.736, 'text': 'So this gives us our full form of Adam.', 'start': 2549.568, 'duration': 2.168}, {'end': 2560.07, 'text': 'And by the way, Atom is a really, really good optimization algorithm and it works really well for a lot of different problems.', 'start': 2553.286, 'duration': 6.784}, {'end': 2565.252, 'text': "So that's kind of my default optimization algorithm for just about any new problem that I'm tackling.", 'start': 2560.47, 'duration': 4.782}, {'end': 2568.174, 'text': 'And in particular, if you set beta one equals .', 'start': 2565.752, 'duration': 2.422}, {'end': 2569.975, 'text': '9, beta two equals .', 'start': 2568.174, 'duration': 1.801}, {'end': 2577.418, 'text': "999, learning rate one e minus three or five e minus four, that's a great starting point for just about all the architectures I've ever worked with.", 'start': 2569.975, 'duration': 7.443}, {'end': 2579.94, 'text': 'So try that.', 'start': 
2577.899, 'duration': 2.041}, {'end': 2582.141, 'text': "That's a really good place to start in general.", 'start': 2580.6, 'duration': 1.541}], 'summary': 'Adam optimization algorithm works well for various problems, recommended default for new problems, with specific parameters yielding great starting point.', 'duration': 46.133, 'max_score': 2536.008, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_JB0AO7QxSA/pics/_JB0AO7QxSA2536008.jpg'}, {'end': 2568.174, 'src': 'embed', 'start': 2542.752, 'weight': 6, 'content': [{'end': 2548.805, 'text': 'And now we actually make our step using these unbiased estimates rather than the original first and second moment estimates.', 'start': 2542.752, 'duration': 6.053}, {'end': 2551.736, 'text': 'So this gives us our full form of Adam.', 'start': 2549.568, 'duration': 2.168}, {'end': 2560.07, 'text': 'And by the way, Atom is a really, really good optimization algorithm and it works really well for a lot of different problems.', 'start': 2553.286, 'duration': 6.784}, {'end': 2565.252, 'text': "So that's kind of my default optimization algorithm for just about any new problem that I'm tackling.", 'start': 2560.47, 'duration': 4.782}, {'end': 2568.174, 'text': 'And in particular, if you set beta one equals .', 'start': 2565.752, 'duration': 2.422}], 'summary': 'Using unbiased estimates, adam is a highly effective optimization algorithm for various problems.', 'duration': 25.422, 'max_score': 2542.752, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_JB0AO7QxSA/pics/_JB0AO7QxSA2542752.jpg'}, {'end': 2646.535, 'src': 'embed', 'start': 2612.999, 'weight': 5, 'content': [{'end': 2618.004, 'text': 'And maybe in this small two-dimensional example Adam converged about similarly to other ones,', 'start': 2612.999, 'duration': 5.005}, {'end': 2622.669, 'text': "but you can see qualitatively that it's kind of combining the behaviors of both momentum and RMS prop.", 'start': 2618.004, 'duration': 4.665}, {'end': 2627.414, 'text': 'So any questions about optimization algorithms?', 'start': 2625.091, 'duration': 2.323}, {'end': 2638.823, 'text': 'The question is what does Atom not fix?', 'start': 2636.837, 'duration': 1.986}, {'end': 2640.368, 'text': 'These neural networks are still large.', 'start': 2639.204, 'duration': 1.164}, {'end': 2641.592, 'text': 'they still take a long time to train.', 'start': 2640.368, 'duration': 1.224}, {'end': 2646.535, 'text': 'There can still be a problem.', 'start': 2645.073, 'duration': 1.462}], 'summary': 'Adam optimization combines momentum and rmsprop, but neural networks remain large and time-consuming to train.', 'duration': 33.536, 'max_score': 2612.999, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_JB0AO7QxSA/pics/_JB0AO7QxSA2612999.jpg'}], 'start': 1861.718, 'title': 'Optimization algorithms and nesterov momentum in sgd', 'summary': 'Discusses the incorporation of nesterov momentum in sgd, its behavior of overshooting the minimum and correcting itself compared to standard sgd, and explores optimization algorithms for neural networks, emphasizing the importance of avoiding sharp minima and analyzing the behaviors of algorithms such as adagrad, rmsprop, and atom for optimal convergence.', 'chapters': [{'end': 1899.68, 'start': 1861.718, 'title': 'Nesterov momentum in sgd', 'summary': 'Discusses the incorporation of nesterov momentum as an error-correcting term in sgd, showing its behavior of overshooting the minimum and correcting 
itself compared to standard sgd.', 'duration': 37.962, 'highlights': ['Nesterov momentum incorporates an error-correcting term between current and previous velocity in SGD.', 'Comparison of SGD, SGD momentum, and Nesterov momentum shows that Nesterov overshoots the minimum and then corrects itself to move towards the minima.', 'SGD takes slow progress towards the minima, while Nesterov momentum overshoots and corrects itself to converge towards the minimum.']}, {'end': 2687.044, 'start': 1911.974, 'title': 'Optimization algorithms for neural networks', 'summary': 'Discusses the optimization landscape for neural networks, emphasizing the importance of avoiding sharp minima and exploring the behaviors of various algorithms such as adagrad, rmsprop, and adam in achieving optimal convergence.', 'duration': 775.07, 'highlights': ['The importance of avoiding sharp minima and preferring flat minima for better generalization is emphasized, along with the impact of training data size on the optimization landscape.', 'The behaviors and advantages of optimization algorithms such as AdaGrad, RMSprop, and Adam are explained, detailing their impact on the optimization process and convergence.', "The discussion of Adam algorithm's features and adjustments, including its unbiased estimates and effectiveness as a default optimization algorithm for various problems, is highlighted."]}], 'duration': 825.326, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_JB0AO7QxSA/pics/_JB0AO7QxSA1861718.jpg', 'highlights': ['Nesterov momentum incorporates an error-correcting term in SGD', 'Nesterov momentum overshoots and corrects itself towards minima', 'Importance of avoiding sharp minima and preferring flat minima', 'Comparison of SGD, SGD momentum, and Nesterov momentum behaviors', 'Impact of training data size on the optimization landscape', 'Behaviors and advantages of optimization algorithms like AdaGrad, RMSprop, and Adam', "Adam algorithm's features and effectiveness as a default optimization algorithm"]}, {'end': 3301.812, 'segs': [{'end': 2718.827, 'src': 'embed', 'start': 2690.838, 'weight': 0, 'content': [{'end': 2696.76, 'text': "So another thing that we've seen in all these optimization algorithms is learning rate as a hyperparameter.", 'start': 2690.838, 'duration': 5.922}, {'end': 2704.622, 'text': "So we've seen this picture before a couple times that as you use different learning rates, sometimes if it's too high, it might explode in the yellow.", 'start': 2697.26, 'duration': 7.362}, {'end': 2709.504, 'text': "If it's a very low learning rate in the blue, it might take a very long time to converge.", 'start': 2705.383, 'duration': 4.121}, {'end': 2711.545, 'text': "And it's kind of tricky to pick the right learning rate.", 'start': 2709.804, 'duration': 1.741}, {'end': 2718.827, 'text': "But this is a little bit of a trick question because we don't actually have to stick with one learning rate throughout the course of training.", 'start': 2713.605, 'duration': 5.222}], 'summary': 'Optimization algorithms use learning rate as a hyperparameter for convergence, avoiding high or low rates for efficient training.', 'duration': 27.989, 'max_score': 2690.838, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_JB0AO7QxSA/pics/_JB0AO7QxSA2690838.jpg'}, {'end': 2767.795, 'src': 'embed', 'start': 2729.819, 'weight': 1, 'content': [{'end': 2736.861, 'text': "So sometimes you'll start with a higher learning rate near the start of training and then decay the
learning rate and make it smaller and smaller throughout the course of training.", 'start': 2729.819, 'duration': 7.042}, {'end': 2746.605, 'text': 'A couple strategies for these would be a step decay where like at 100, 000 iteration you just decay and by some factor and you keep going.', 'start': 2739.402, 'duration': 7.203}, {'end': 2750.187, 'text': 'You might see an exponential decay where you continually decay during training.', 'start': 2747.025, 'duration': 3.162}, {'end': 2756.69, 'text': 'So you might see different variations of kind of continually decaying the learning rate during training.', 'start': 2751.688, 'duration': 5.002}, {'end': 2764.674, 'text': 'And if you look at papers, especially the ResNet paper, you often see plots that look kind of like this, where the loss is kind of going down,', 'start': 2757.61, 'duration': 7.064}, {'end': 2767.795, 'text': 'then dropping and then flattening again and dropping again.', 'start': 2764.674, 'duration': 3.121}], 'summary': 'Strategies for decaying learning rate during training, including step decay and exponential decay, are commonly used in resnet paper.', 'duration': 37.976, 'max_score': 2729.819, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_JB0AO7QxSA/pics/_JB0AO7QxSA2729819.jpg'}, {'end': 2901.948, 'src': 'heatmap', 'start': 2853.192, 'weight': 0.709, 'content': [{'end': 2855.793, 'text': "what we're basically doing is computing the gradients at that point.", 'start': 2853.192, 'duration': 2.601}, {'end': 2861.035, 'text': "We're using the gradient information to compute some linear approximation to our function,", 'start': 2856.253, 'duration': 4.782}, {'end': 2863.916, 'text': 'which is kind of a first-order Taylor approximation to our function.', 'start': 2861.035, 'duration': 2.881}, {'end': 2871.139, 'text': 'And now we pretend that the first order approximation is our actual function, and we make a step to try to minimize the approximation.', 'start': 2864.454, 'duration': 6.685}, {'end': 2877.104, 'text': "But this approximation doesn't hold for very large regions, so we can't step too far in that direction.", 'start': 2872.14, 'duration': 4.964}, {'end': 2881.928, 'text': "But really the idea here is that we're only incorporating information about the first derivative of the function.", 'start': 2877.664, 'duration': 4.264}, {'end': 2884.243, 'text': 'And you can actually go a little bit fancier.', 'start': 2882.603, 'duration': 1.64}, {'end': 2891.465, 'text': "And there's this idea of second order approximation, where we take into account both first derivative and second derivative information.", 'start': 2884.724, 'duration': 6.741}, {'end': 2898.167, 'text': 'and now we make a second order, Taylor approximation to our function and kind of locally approximate our function with a quadratic.', 'start': 2891.465, 'duration': 6.702}, {'end': 2901.948, 'text': "And now with a quadratic you can step right to the minimum and you're really happy.", 'start': 2898.707, 'duration': 3.241}], 'summary': 'Using gradients to make linear and quadratic approximations to optimize functions.', 'duration': 48.756, 'max_score': 2853.192, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_JB0AO7QxSA/pics/_JB0AO7QxSA2853192.jpg'}, {'end': 3022.556, 'src': 'embed', 'start': 2992.833, 'weight': 2, 'content': [{'end': 2994.513, 'text': "So you'll sometimes see these for some problems.", 'start': 2992.833, 'duration': 1.68}, {'end': 3004.587, 'text': "So LBFGS is one particular 
second-order optimizer that has this approximate second, keeps this approximation to the Hessian that you'll sometimes see.", 'start': 2996.674, 'duration': 7.913}, {'end': 3009.169, 'text': "But in practice, it doesn't work too well for many deep learning problems.", 'start': 3005.407, 'duration': 3.762}, {'end': 3016.292, 'text': "Because these approximations, these second order approximations, don't really handle the stochastic case very much, very nicely.", 'start': 3009.529, 'duration': 6.763}, {'end': 3019.634, 'text': 'And they also tend not to work so well with non-convex problems.', 'start': 3016.312, 'duration': 3.322}, {'end': 3022.556, 'text': "I don't want to get into that right now too much.", 'start': 3020.174, 'duration': 2.382}], 'summary': 'Lbfgs is a second-order optimizer but not suitable for deep learning due to issues with stochastic cases and non-convex problems.', 'duration': 29.723, 'max_score': 2992.833, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_JB0AO7QxSA/pics/_JB0AO7QxSA2992833.jpg'}, {'end': 3132.226, 'src': 'embed', 'start': 3104.893, 'weight': 3, 'content': [{'end': 3109.956, 'text': "And now at test time, we'll run our data through all of the 10 models and average the predictions of those 10 models.", 'start': 3104.893, 'duration': 5.063}, {'end': 3119.86, 'text': 'And this tends to adding these multiple models together, tends to reduce overfitting a little bit and tends to improve performance a little bit,', 'start': 3110.716, 'duration': 9.144}, {'end': 3120.941, 'text': 'typically by a couple percent.', 'start': 3119.86, 'duration': 1.081}, {'end': 3125.383, 'text': 'So this is generally not a drastic improvement, but it is a consistent improvement.', 'start': 3121.701, 'duration': 3.682}, {'end': 3132.226, 'text': "And you'll see that in competitions like ImageNet and other things like that, using model ensembles is very common to get maximal performance.", 'start': 3125.503, 'duration': 6.723}], 'summary': 'Ensembling 10 models improves performance by a couple percent, reducing overfitting.', 'duration': 27.333, 'max_score': 3104.893, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_JB0AO7QxSA/pics/_JB0AO7QxSA3104893.jpg'}], 'start': 2690.838, 'title': 'Optimizing learning rates and model ensembles', 'summary': 'Emphasizes the significance of learning rate as a hyperparameter, examining its impact on convergence and strategies for decay. 
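The Adam update walked through in the earlier segments, together with the step and exponential decay schedules mentioned here, can be written down in a few lines. This is a minimal NumPy sketch rather than code from the lecture slides; the function names and decay constants (the drop factor and decay rate) are placeholders of my own, while beta1 = 0.9, beta2 = 0.999, and a learning rate of 1e-3 (or 5e-4) are the starting values quoted in the transcript.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: a momentum-like first moment, an RMSProp-like second
    moment, and the bias correction that uses the current time step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * grad          # first moment: weighted sum of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment: moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)                # unbiased first-moment estimate
    v_hat = v / (1 - beta2 ** t)                # unbiased second-moment estimate
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

def step_decay(lr0, epoch, drop=0.1, every=30):
    """Step decay: multiply the rate by `drop` every `every` epochs."""
    return lr0 * (drop ** (epoch // every))

def exp_decay(lr0, epoch, k=0.05):
    """Exponential decay: continually shrink the rate during training."""
    return lr0 * np.exp(-k * epoch)
```

As the transcript notes a little later, an explicit decay schedule like this is more commonly paired with SGD momentum than with Adam, and it is usually treated as a second-order hyperparameter to add only after looking at the loss curve.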
it also delves into optimization techniques and model ensembles, highlighting their benefits in reducing overfitting and improving performance on unseen data.', 'chapters': [{'end': 2844.337, 'start': 2690.838, 'title': 'Optimizing learning rates', 'summary': 'Explains the importance of learning rate as a hyperparameter in optimization algorithms, discussing the impact of different learning rates on convergence and strategies for decaying the learning rate over time, highlighting the practical implications and considerations for incorporating learning rate decay in training.', 'duration': 153.499, 'highlights': ['The importance of learning rate as a hyperparameter in optimization algorithms, with high learning rates leading to potential instability and low learning rates resulting in slow convergence.', 'Strategies for decaying the learning rate over time, such as step decay and exponential decay, to achieve better convergence and progress during training.', 'The practical implications of learning rate decay, illustrated through examples from papers like the ResNet paper, and its role in mitigating bouncing gradients and aiding progress down the landscape during training.', 'The consideration that learning rate decay is more common with SGD momentum and less common with Atom, highlighting its relevance in different optimization algorithms.', 'The recommendation to treat learning rate decay as a second-order hyperparameter and not optimize over it from the start, but rather evaluate its necessity based on the loss curve and performance of the network.', 'Insight into the idea that the discussed optimization algorithms are first-order optimization algorithms, providing context for the importance of learning rate and its decay in the training process.']}, {'end': 3301.812, 'start': 2844.989, 'title': 'Optimization techniques & model ensembles', 'summary': 'Discusses optimization techniques like first and second-order approximations, the limitations of second-order optimization in deep learning, and the benefits of model ensembles in reducing overfitting and improving performance on unseen data.', 'duration': 456.823, 'highlights': ['The limitations of second-order optimization in deep learning, as the Hessian matrix is impractical to compute and invert for networks with a large number of parameters.', 'The idea of model ensembles, where training 10 different models independently and averaging their predictions at test time tends to reduce overfitting and improve performance by a couple of percent.', 'The potential benefits of using a crazy learning rate schedule to converge the model to different regions in the objective landscape and improve performance through ensembles of the different snapshots.']}], 'duration': 610.974, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_JB0AO7QxSA/pics/_JB0AO7QxSA2690838.jpg', 'highlights': ['The importance of learning rate as a hyperparameter in optimization algorithms, with high learning rates leading to potential instability and low learning rates resulting in slow convergence.', 'Strategies for decaying the learning rate over time, such as step decay and exponential decay, to achieve better convergence and progress during training.', 'The limitations of second-order optimization in deep learning, as the Hessian matrix is impractical to compute and invert for networks with a large number of parameters.', 'The idea of model ensembles, where training 10 different models independently and averaging their predictions at test 
time tends to reduce overfitting and improve performance by a couple of percent.', 'The practical implications of learning rate decay, illustrated through examples from papers like the ResNet paper, and its role in mitigating bouncing gradients and aiding progress down the landscape during training.']}, {'end': 3945.501, 'segs': [{'end': 3329.262, 'src': 'embed', 'start': 3301.832, 'weight': 1, 'content': [{'end': 3306.014, 'text': 'And we really want some strategies to improve the performance of our single models.', 'start': 3301.832, 'duration': 4.182}, {'end': 3308.555, 'text': "And that's really this idea of regularization,", 'start': 3306.474, 'duration': 2.081}, {'end': 3315.438, 'text': 'where we add something to our model to prevent it from fitting the training data too well in the attempt to make it perform better on unseen data.', 'start': 3308.555, 'duration': 6.883}, {'end': 3323.72, 'text': "And we've seen a couple ideas, a couple methods for regularization already, where we add some explicit extra term to the loss,", 'start': 3316.318, 'duration': 7.402}, {'end': 3329.262, 'text': "where we have this one term telling the model to fit the data and another term, that's a regularization term.", 'start': 3323.72, 'duration': 5.542}], 'summary': 'Strategies for improving single models through regularization to prevent overfitting.', 'duration': 27.43, 'max_score': 3301.832, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_JB0AO7QxSA/pics/_JB0AO7QxSA3301832.jpg'}, {'end': 3361.233, 'src': 'embed', 'start': 3333.743, 'weight': 3, 'content': [{'end': 3340.408, 'text': "But, as we talked about in lecture a couple lectures ago, this L2 regularization doesn't really make, maybe,", 'start': 3333.743, 'duration': 6.665}, {'end': 3343.451, 'text': 'a lot of sense in the context of neural networks.', 'start': 3340.408, 'duration': 3.043}, {'end': 3346.153, 'text': 'So sometimes we use other things for neural networks.', 'start': 3344.051, 'duration': 2.102}, {'end': 3352.847, 'text': "So one regularization strategy that's super, super common for neural networks is this idea of dropout.", 'start': 3347.344, 'duration': 5.503}, {'end': 3354.769, 'text': 'So dropout is super simple.', 'start': 3353.428, 'duration': 1.341}, {'end': 3361.233, 'text': "Every time we do a forward pass through the network, at every layer, we're gonna randomly set some neurons to zero.", 'start': 3355.369, 'duration': 5.864}], 'summary': 'L2 regularization less relevant for neural networks, dropout commonly used instead.', 'duration': 27.49, 'max_score': 3333.743, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_JB0AO7QxSA/pics/_JB0AO7QxSA3333743.jpg'}, {'end': 3488.471, 'src': 'heatmap', 'start': 3443.644, 'weight': 0.703, 'content': [{'end': 3449.006, 'text': 'In convolutions, you have this channel dimension, and you might drop out entire channels rather than random elements.', 'start': 3443.644, 'duration': 5.362}, {'end': 3454.181, 'text': 'So dropout is kind of super simple in practice.', 'start': 3451.657, 'duration': 2.524}, {'end': 3458.086, 'text': 'It only requires adding like two lines, like one line per dropout call.', 'start': 3454.221, 'duration': 3.865}, {'end': 3466.938, 'text': "So here we have like a three layer neural network and we've added dropout and you can see that all we needed to do was add this extra line where we randomly set some things to zero.", 'start': 3458.466, 'duration': 8.472}, {'end': 3468.54, 'text': 'So this is super 
easy to implement.', 'start': 3467.278, 'duration': 1.262}, {'end': 3471.616, 'text': 'But the question is why is this even a good idea?', 'start': 3469.754, 'duration': 1.862}, {'end': 3477.481, 'text': "We're seriously messing up messing with the network at training time by setting a bunch of its values to zero.", 'start': 3472.156, 'duration': 5.325}, {'end': 3479.543, 'text': 'And how can this possibly make sense?', 'start': 3477.921, 'duration': 1.622}, {'end': 3488.471, 'text': 'So one sort of slightly hand-wavy idea that people have is that Dropout helps prevent co-adaptation of features.', 'start': 3480.143, 'duration': 8.328}], 'summary': 'Adding dropout in neural networks is simple and helps prevent co-adaptation of features.', 'duration': 44.827, 'max_score': 3443.644, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_JB0AO7QxSA/pics/_JB0AO7QxSA3443644.jpg'}, {'end': 3497.155, 'src': 'embed', 'start': 3472.156, 'weight': 0, 'content': [{'end': 3477.481, 'text': "We're seriously messing up messing with the network at training time by setting a bunch of its values to zero.", 'start': 3472.156, 'duration': 5.325}, {'end': 3479.543, 'text': 'And how can this possibly make sense?', 'start': 3477.921, 'duration': 1.622}, {'end': 3488.471, 'text': 'So one sort of slightly hand-wavy idea that people have is that Dropout helps prevent co-adaptation of features.', 'start': 3480.143, 'duration': 8.328}, {'end': 3497.155, 'text': "So maybe, if you imagine that we're trying to classify cats, maybe in some universe the network might learn one neuron for having an ear,", 'start': 3489.131, 'duration': 8.024}], 'summary': 'Network values set to zero, dropout prevents co-adaptation, neuron for cat ear', 'duration': 24.999, 'max_score': 3472.156, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_JB0AO7QxSA/pics/_JB0AO7QxSA3472156.jpg'}, {'end': 3763.509, 'src': 'heatmap', 'start': 3667.6, 'weight': 0.826, 'content': [{'end': 3675.481, 'text': "because there's four possible dropout masks and we're gonna average out the values across these four masks and we can see that the expected value of A during training is 1/2 (W1X plus W2Y).", 'start': 3667.6, 'duration': 7.881}, {'end': 3686.34, 'text': "So there's this disconnect between this average value of w one x plus w two y at test time and at training time.", 'start': 3679.099, 'duration': 7.241}, {'end': 3688.121, 'text': 'the average value is only half as much.', 'start': 3686.34, 'duration': 1.781}, {'end': 3695.182, 'text': "So one cheap thing we can do is that at test time we don't have any stochasticity.", 'start': 3688.901, 'duration': 6.281}, {'end': 3700.103, 'text': 'instead we just multiply this output by the dropout probability, and now these expected values are the same.', 'start': 3695.182, 'duration': 4.921}, {'end': 3707.664, 'text': 'So this is kind of like a local cheap approximation to this complex integral and this is what people really commonly do in practice with dropout.', 'start': 3700.663, 'duration': 7.001}, {'end': 3714.768, 'text': 'So at dropout, we have this predict function and we just multiply our outputs of the layer by the dropout probability.', 'start': 3709.845, 'duration': 4.923}, {'end': 3718.59, 'text': "So the summary of dropout is that it's really simple.", 'start': 3716.389, 'duration': 2.201}, {'end': 3723.673, 'text': "On the forward pass, you're just adding two lines to your implementation to randomly zero out some nodes.", 'start': 3718.75,
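A minimal sketch of the dropout scheme these segments describe: two extra mask lines in the training-time forward pass, and a multiplication by the keep probability p in the predict function so the test-time expectations match the training-time ones. The toy two-hidden-layer architecture and the variable names here are illustrative assumptions, not the exact code shown on the slide.

```python
import numpy as np

p = 0.5  # probability of keeping a unit active

def train_forward(X, W1, W2, W3):
    # Forward pass with dropout: the only change is the two mask lines.
    H1 = np.maximum(0, X.dot(W1))
    U1 = np.random.rand(*H1.shape) < p   # first dropout mask
    H1 = H1 * U1                         # randomly zero out some units
    H2 = np.maximum(0, H1.dot(W2))
    U2 = np.random.rand(*H2.shape) < p   # second dropout mask
    H2 = H2 * U2
    return H2.dot(W3)                    # backward pass / parameter update omitted

def predict(X, W1, W2, W3):
    # No stochasticity at test time: scale each layer's output by p instead,
    # so its expected value matches what the network saw during training.
    H1 = np.maximum(0, X.dot(W1)) * p
    H2 = np.maximum(0, H1.dot(W2)) * p
    return H2.dot(W3)
```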
'duration': 4.923}, {'end': 3729.956, 'text': 'And then at the test time prediction function, you just added one little multiplication by your probability.', 'start': 3724.113, 'duration': 5.843}, {'end': 3731.477, 'text': 'So dropout is super simple.', 'start': 3730.377, 'duration': 1.1}, {'end': 3734.559, 'text': 'It tends to work well sometimes for regularizing neural networks.', 'start': 3731.557, 'duration': 3.002}, {'end': 3740.643, 'text': 'By the way, one common trick you see sometimes is this idea of inverted dropout.', 'start': 3736.882, 'duration': 3.761}, {'end': 3748.445, 'text': 'So maybe at test time you care more about efficiency, so you want to eliminate that extra multiplication by p at test time.', 'start': 3741.183, 'duration': 7.262}, {'end': 3755.807, 'text': 'So then what you can do is at test time you use the entire weight matrix, but now at training time instead you divide by p.', 'start': 3748.925, 'duration': 6.882}, {'end': 3757.928, 'text': 'Because training is probably happening on a GPU.', 'start': 3755.807, 'duration': 2.121}, {'end': 3763.509, 'text': "you don't really care if you do one extra multiply at training time, but then at test time you kind of want this thing to be as efficient as possible.", 'start': 3757.928, 'duration': 5.581}], 'summary': 'Dropout simplifies training and prediction, improving neural network regularization through stochastic node zeroing.', 'duration': 95.909, 'max_score': 3667.6, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_JB0AO7QxSA/pics/_JB0AO7QxSA3667600.jpg'}, {'end': 3874.714, 'src': 'embed', 'start': 3847.479, 'weight': 2, 'content': [{'end': 3854.284, 'text': 'But now at test time, we kind of average out this stochasticity by using some global estimates to normalize rather than the per mini batch estimates.', 'start': 3847.479, 'duration': 6.805}, {'end': 3859.546, 'text': 'So actually, batch normalization tends to have kind of a similar regularizing effect as dropout,', 'start': 3854.924, 'duration': 4.622}, {'end': 3865.089, 'text': 'because they both introduce some kind of stochasticity or noise at training time, but then average it out at test time.', 'start': 3859.546, 'duration': 5.543}, {'end': 3869.351, 'text': 'So actually, when you train networks with batch normalization,', 'start': 3865.629, 'duration': 3.722}, {'end': 3874.714, 'text': "sometimes you don't use dropout at all and just the batch normalization adds enough of a regularizing effect to your network.", 'start': 3869.351, 'duration': 5.363}], 'summary': 'Batch normalization and dropout introduce stochasticity at training time but average it out at test time, providing regularizing effect to networks.', 'duration': 27.235, 'max_score': 3847.479, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_JB0AO7QxSA/pics/_JB0AO7QxSA3847479.jpg'}], 'start': 3301.832, 'title': 'Strategies for improving model performance and understanding dropout', 'summary': 'Discusses strategies for improving single model performance through regularization techniques such as adding an extra term to prevent overfitting and introducing dropout. 
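The inverted dropout trick mentioned just above can be sketched the same way: pay the extra division by p at training time, where the GPU makes it cheap, and leave the test-time path untouched. Again this is an illustrative snippet with my own names, assuming the same keep probability p as before.

```python
import numpy as np

p = 0.5  # keep probability

def dropout_train(h):
    # Scale by 1/p during training so the surviving activations already
    # have the right expected value.
    mask = (np.random.rand(*h.shape) < p) / p
    return h * mask

def dropout_test(h):
    # Test time just uses the full activations, with no extra multiply,
    # which keeps the prediction path as efficient as possible.
    return h
```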
it also explains the concept of dropout in neural networks, its benefits in preventing co-adaptation of features, and related strategies like batch normalization and data augmentation.', 'chapters': [{'end': 3354.769, 'start': 3301.832, 'title': 'Strategies for model performance improvement', 'summary': 'Discusses strategies for improving single model performance through regularization techniques, focusing on the concept of adding an extra term to prevent overfitting and introducing the idea of dropout as a common regularization strategy for neural networks.', 'duration': 52.937, 'highlights': ['Regularization involves adding an extra term to the model to prevent overfitting and improve performance on unseen data.', 'L2 regularization was previously used but may not be ideal for neural networks, leading to the adoption of dropout as a common regularization strategy.']}, {'end': 3945.501, 'start': 3355.369, 'title': 'Understanding dropout in neural networks', 'summary': 'Explains the concept of dropout in neural networks, where during forward pass, random neurons are set to zero, preventing co-adaptation of features and acting as a form of regularization, leading to better generalization and longer training time, and discusses related strategies like batch normalization and data augmentation.', 'duration': 590.132, 'highlights': ['During forward pass, random neurons are set to zero, preventing co-adaptation of features and acting as a form of regularization, leading to better generalization and longer training time.', "Another interpretation of dropout is that it's like doing model ensembling within a single model, learning a whole ensemble of networks simultaneously with parameters shared, and approximating the integral at test time by multiplying the output by the dropout probability.", 'Batch normalization and data augmentation also fit the paradigm of introducing stochasticity or noise at training time and averaging it out at test time, with batch normalization potentially providing enough regularization to eliminate the need for dropout, while data augmentation involves randomly transforming training images and averaging out the stochasticity during testing.']}], 'duration': 643.669, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_JB0AO7QxSA/pics/_JB0AO7QxSA3301832.jpg', 'highlights': ['Dropout prevents co-adaptation of features and acts as a form of regularization.', 'Regularization involves adding an extra term to prevent overfitting and improve performance.', 'Batch normalization and data augmentation introduce stochasticity at training time and average it out at test time.', 'L2 regularization was previously used but may not be ideal for neural networks.']}, {'end': 4526.679, 'segs': [{'end': 4022.912, 'src': 'embed', 'start': 3985.189, 'weight': 1, 'content': [{'end': 3992.455, 'text': 'And now, during training, you just apply these random transformations to your input data, and this sort of has a regularizing effect on the network,', 'start': 3985.189, 'duration': 7.266}, {'end': 3998.341, 'text': "because you're again adding some kind of stochasticity during training and then marginalizing it out at test time.", 'start': 3992.455, 'duration': 5.886}, {'end': 4004.98, 'text': "So now we've seen kind of three examples of this pattern, dropout, batch normalization, data augmentation.", 'start': 4000.157, 'duration': 4.823}, {'end': 4006.842, 'text': "But there's many other examples as well.", 'start': 4005.521, 'duration': 1.321}, {'end': 4012.285, 
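The random transformations described in this passage can be sketched as a small per-sample augmentation function. This NumPy version with a horizontal flip and a random crop is only an illustration; the crop size and flip probability are assumptions of mine, and the lecture also mentions richer variants such as color jitter.

```python
import numpy as np

def augment(img, crop=224):
    """Randomly flip and crop one training image (an H x W x 3 array larger
    than the crop size). At test time this stochasticity is averaged out by
    evaluating a fixed set of crops and flips instead."""
    if np.random.rand() < 0.5:
        img = img[:, ::-1, :]                  # horizontal flip
    H, W, _ = img.shape
    y = np.random.randint(0, H - crop + 1)     # random crop origin
    x = np.random.randint(0, W - crop + 1)
    return img[y:y + crop, x:x + crop, :]
```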
'text': "And once you kind of have this pattern in your mind, you'll kind of recognize this thing as you read other papers sometimes.", 'start': 4007.242, 'duration': 5.043}, {'end': 4016.328, 'text': "So there's another kind of related idea to dropout called dropconnect.", 'start': 4013.146, 'duration': 3.182}, {'end': 4022.912, 'text': "So with dropconnect it's the same idea, but rather than zeroing out the activations at every forward pass,", 'start': 4016.808, 'duration': 6.104}], 'summary': 'Applying random transformations during training has a regularizing effect on the network, seen in examples like dropout, batch normalization, and data augmentation.', 'duration': 37.723, 'max_score': 3985.189, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_JB0AO7QxSA/pics/_JB0AO7QxSA3985189.jpg'}, {'end': 4259.735, 'src': 'heatmap', 'start': 4212.948, 'weight': 0.726, 'content': [{'end': 4214.028, 'text': 'So the idea is really simple.', 'start': 4212.948, 'duration': 1.08}, {'end': 4220.351, 'text': "So you'll maybe first take some CNN here is kind of a BGG style architecture.", 'start': 4214.769, 'duration': 5.582}, {'end': 4227.415, 'text': "you'll take your CNN, you'll train it in a very large data set like ImageNet, where you actually have enough data to train the whole network.", 'start': 4220.351, 'duration': 7.064}, {'end': 4234.062, 'text': 'And now the idea is that you want to apply the features from this data set to some small data set that you care about.', 'start': 4228.26, 'duration': 5.802}, {'end': 4240.923, 'text': 'So maybe instead of classifying the thousand ImageNet categories, now you want to classify like 10 dog breeds or something like that.', 'start': 4234.522, 'duration': 6.401}, {'end': 4242.504, 'text': 'And you only have a small data set.', 'start': 4241.324, 'duration': 1.18}, {'end': 4245.585, 'text': 'So here our small data set only has C classes.', 'start': 4242.984, 'duration': 2.601}, {'end': 4253.727, 'text': "So, then, what you'll typically do is, for this last fully connected layer, that is, going from the last layer features to the final class scores,", 'start': 4246.125, 'duration': 7.602}, {'end': 4257.171, 'text': 'this now you need to reinitialize that matrix randomly.', 'start': 4254.628, 'duration': 2.543}, {'end': 4259.735, 'text': 'In this case, going from.', 'start': 4257.972, 'duration': 1.763}], 'summary': 'Using pre-trained cnn on imagenet to classify small data sets with fewer classes.', 'duration': 46.787, 'max_score': 4212.948, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_JB0AO7QxSA/pics/_JB0AO7QxSA4212948.jpg'}, {'end': 4308.876, 'src': 'embed', 'start': 4281.136, 'weight': 0, 'content': [{'end': 4284.678, 'text': 'And this tends to work pretty well if you only have a very small data set to work with.', 'start': 4281.136, 'duration': 3.542}, {'end': 4294.06, 'text': 'And now if you have a little bit more data, another thing you can try is actually fine tuning the whole network.', 'start': 4288.854, 'duration': 5.206}, {'end': 4299.806, 'text': 'So after that top layer converges and after you learn that last layer for your data,', 'start': 4294.52, 'duration': 5.286}, {'end': 4303.651, 'text': 'then you can consider actually trying to update the whole network as well.', 'start': 4299.806, 'duration': 3.845}, {'end': 4308.876, 'text': 'And if you have more data, then you might consider updating larger parts of the network.', 'start': 4304.151, 'duration': 4.725}], 'summary': 'Fine 
tune network for larger datasets to update the whole network with more data.', 'duration': 27.74, 'max_score': 4281.136, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_JB0AO7QxSA/pics/_JB0AO7QxSA4281136.jpg'}, {'end': 4492.873, 'src': 'embed', 'start': 4473.542, 'weight': 2, 'content': [{'end': 4484.73, 'text': "then what you should generally do is download some thing that's relatively some pre-trained model that's relatively close to the task you care about and then either reinitialize parts of that model or fine tune that model for your data.", 'start': 4473.542, 'duration': 11.188}, {'end': 4489.954, 'text': 'And that tends to work pretty well even if you have only a modest amount of training data to work with.', 'start': 4485.07, 'duration': 4.884}, {'end': 4492.873, 'text': 'And because this is such a common strategy,', 'start': 4490.872, 'duration': 2.001}], 'summary': 'Download pre-trained model, fine-tune for data, works well with modest amount of training data.', 'duration': 19.331, 'max_score': 4473.542, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_JB0AO7QxSA/pics/_JB0AO7QxSA4473542.jpg'}], 'start': 3948.55, 'title': 'Regularization and transfer learning', 'summary': 'Covers data augmentation, regularization techniques such as dropout and batch normalization, and transfer learning, highlighting methods like fractional max pooling and stochastic depth, and their impact on model performance with limited data.', 'chapters': [{'end': 4022.912, 'start': 3948.55, 'title': 'Data augmentation and regularization', 'summary': 'Discusses the concept of data augmentation as a general technique applicable to various problems, where random transformations are applied to input data during training to add stochasticity and have a regularizing effect on the network, citing examples like dropout, batch normalization, and data augmentation, and mentions a related idea called dropconnect.', 'duration': 74.362, 'highlights': ['Data augmentation is a general technique applicable to various problems, involving the application of random transformations to input data during training, adding stochasticity and having a regularizing effect on the network.', 'Examples of this pattern include dropout, batch normalization, and data augmentation, which are recognized as a related idea to dropconnect.', 'Color jittering and varying contrast or brightness are used in data augmentation, and it can be more complex by introducing color jitters that are in the PCA directions of the data space.', 'Dropconnect is a related idea to dropout, where rather than zeroing out the activations at every forward pass, the same idea is applied in a different manner.']}, {'end': 4526.679, 'start': 4022.912, 'title': 'Regularization and transfer learning', 'summary': 'Discussed different regularization methods in deep learning, including fractional max pooling and stochastic depth, as well as the concept of transfer learning and its application in improving model performance with limited data.', 'duration': 503.767, 'highlights': ['The chapter discussed different regularization methods in deep learning, including fractional max pooling and stochastic depth, as well as the concept of transfer learning and its application in improving model performance with limited data.', 'Transfer learning involves reinitializing the last fully connected layer and training a linear classifier, which tends to work well with a small dataset, and fine-tuning the whole network when more 
data is available.', 'Pre-trained models, close to the task at hand, can be reinitialized or fine-tuned for the specific dataset, leveraging the common strategy of transfer learning.']}], 'duration': 578.129, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_JB0AO7QxSA/pics/_JB0AO7QxSA3948550.jpg', 'highlights': ['Transfer learning involves reinitializing the last fully connected layer and training a linear classifier, which tends to work well with a small dataset, and fine-tuning the whole network when more data is available.', 'Data augmentation is a general technique applicable to various problems, involving the application of random transformations to input data during training, adding stochasticity and having a regularizing effect on the network.', 'Pre-trained models, close to the task at hand, can be reinitialized or fine-tuned for the specific dataset, leveraging the common strategy of transfer learning.', 'Examples of this pattern include dropout, batch normalization, and data augmentation, which are recognized as a related idea to dropconnect.']}], 'highlights': ['Using GPU instances costs around 90 cents to a dollar an hour', 'The importance of stopping Google Cloud instances to avoid unnecessary charges', "Recommended RAM size of eight or 16 gigs for the class's requirements", 'Assignment two is the longest in the class and is due a week from Thursday', 'Suggests using CPU instances for tasks like assignment two', 'The recommendation to use ReLU as a default choice due to its effectiveness for various architectures', 'The importance of proper weight initialization, such as Xavier or MSRA, to maintain a balanced distribution of activations in deeper networks', 'The problems associated with sigmoid and TanH activation functions, such as vanishing gradients near the ends', 'The importance of data normalization in Comnets, with a focus on zero centering and unit variance', 'Discussion on hyperparameter search, including grid search, random search, and coarse-defined search strategies', 'Importance of learning rate optimization and its impact on model training duration', 'Explanation of batch normalization and its impact on network optimization', 'The challenges of stochastic gradient descent (SGD) in dealing with bad condition numbers, local minima, and saddle points, impacting training efficiency and convergence', 'The concept of transfer learning, showcasing its potential to utilize less data by transferring knowledge from one problem to another', 'The chapter emphasizes the significance of training constants in neural network training', 'The chapter explores the utilization of optimization algorithms in neural network training, focusing on the ones commonly used in the field', 'The prevalence of saddle points over local minima in high-dimensional problems, causing slow progress and inefficiency in training large neural networks', 'The Nesterov accelerated gradient switches the order of operations, incorporating gradient information from a larger part of the objective landscape and has theoretical properties in convex optimization', 'Adding momentum averages out the noise in gradient estimates, resulting in a smoother path towards the minima compared to vanilla SGD', 'Noisy gradient estimates impact the convergence of stochastic gradient descent, causing it to meander around the space and potentially take a long time to reach the minima', "High condition numbers remain a problem with full batch gradient descent, leading to the persistence of the 'taco 
shell problem' and additional noise in the network due to sampling mini-batches and explicit stochasticity", 'Nesterov momentum incorporates an error-correcting term in SGD', 'Nesterov momentum overshoots and corrects itself towards minima', 'Importance of avoiding sharp minima and preferring flat minima', 'Comparison of SGD, SGD momentum, and Nesterov momentum behaviors', 'Impact of training data size on the optimization landscape', 'Behaviors and advantages of optimization algorithms like AdaGrad, RMSprop, and Atom', "Atom algorithm's features and effectiveness as a default optimization algorithm", 'The importance of learning rate as a hyperparameter in optimization algorithms, with high learning rates leading to potential instability and low learning rates resulting in slow convergence', 'Strategies for decaying the learning rate over time, such as step decay and exponential decay, to achieve better convergence and progress during training', 'The limitations of second-order optimization in deep learning, as the Hessian matrix is impractical to compute and invert for networks with a large number of parameters', 'The idea of model ensembles, where training 10 different models independently and averaging their predictions at test time tends to reduce overfitting and improve performance by a couple of percent', 'The practical implications of learning rate decay, illustrated through examples from papers like the ResNet paper, and its role in mitigating bouncing gradients and aiding progress down the landscape during training', 'Dropout prevents co-adaptation of features and acts as a form of regularization', 'Regularization involves adding an extra term to prevent overfitting and improve performance', 'Batch normalization and data augmentation introduce stochasticity at training time and average it out at test time', 'L2 regularization was previously used but may not be ideal for neural networks', 'Transfer learning involves reinitializing the last fully connected layer and training a linear classifier, which tends to work well with a small dataset, and fine-tuning the whole network when more data is available', 'Data augmentation is a general technique applicable to various problems, involving the application of random transformations to input data during training, adding stochasticity and having a regularizing effect on the network', 'Pre-trained models, close to the task at hand, can be reinitialized or fine-tuned for the specific dataset, leveraging the common strategy of transfer learning', 'Examples of this pattern include dropout, batch normalization, and data augmentation, which are recognized as a related idea to dropconnect']}
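The transfer learning recipe recapped in these highlights (start from a model pre-trained on a large dataset, reinitialize the last fully connected layer, and fine-tune more of the network as more data becomes available) might look like the following. The lecture does not prescribe a framework; this sketch assumes PyTorch/torchvision, and the class count and learning rates are placeholder values.

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

NUM_CLASSES = 10  # e.g. 10 dog breeds instead of the 1000 ImageNet categories

# Start from a model pre-trained on a large dataset such as ImageNet.
model = models.resnet18(pretrained=True)

# Case 1: very small dataset. Freeze everything and treat the network as a
# fixed feature extractor, training only a freshly initialized last layer.
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # reinitialized layer
optimizer = optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)

# Case 2: a bit more data. Fine-tune the whole network as well, typically
# with a much smaller learning rate for the pre-trained layers:
# for param in model.parameters():
#     param.requires_grad = True
# optimizer = optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
```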