title

Deep Learning-All Optimizers In One Video-SGD with Momentum,Adagrad,Adadelta,RMSprop,Adam Optimizers

description

In this video, we will revise all the optimizers:
02:11 Gradient Descent
11:42 SGD
30:53 SGD With Momentum
57:22 Adagrad
01:17:12 Adadelta And RMSprop
01:28:52 Adam Optimizer
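For quick revision alongside the timestamps above, here is a minimal pure-Python sketch (not code from the video) of the update rules covered, applied to a toy 1-D quadratic loss L(w) = (w - 3)^2. All function names, hyperparameter values, and step counts are illustrative choices, not values from the video.

```python
# Toy revision of the update rules covered in the video, minimising the
# 1-D quadratic loss L(w) = (w - 3)^2, whose gradient is dL/dw = 2*(w - 3).
import math

def grad(w):
    return 2.0 * (w - 3.0)

def sgd_momentum(w=0.0, lr=0.1, beta=0.9, steps=300):
    v = 0.0
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad(w)   # exponential weighted average of gradients
        w -= lr * v
    return w

def adagrad(w=0.0, lr=1.0, eps=1e-8, steps=200):
    g2_sum = 0.0
    for _ in range(steps):
        g = grad(w)
        g2_sum += g * g                        # accumulate ALL past squared gradients
        w -= lr / math.sqrt(g2_sum + eps) * g  # effective learning rate keeps shrinking
    return w

def rmsprop(w=0.0, lr=0.01, beta=0.9, eps=1e-8, steps=2000):
    s = 0.0
    for _ in range(steps):
        g = grad(w)
        s = beta * s + (1 - beta) * g * g      # exp. weighted average of squared gradients
        w -= lr / math.sqrt(s + eps) * g
    return w

def adam(w=0.0, lr=0.01, b1=0.9, b2=0.999, eps=1e-8, steps=2000):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g              # first moment (momentum term)
        v = b2 * v + (1 - b2) * g * g          # second moment (RMSprop-style term)
        m_hat = m / (1 - b1 ** t)              # bias correction for early steps
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

for name, fn in [("SGD+momentum", sgd_momentum), ("Adagrad", adagrad),
                 ("RMSprop", rmsprop), ("Adam", adam)]:
    print(f"{name}: w = {fn():.3f}")  # each should approach the minimum at w = 3
```

On this convex toy problem all four optimizers reach (or oscillate very close to) the minimum; the differences the video discusses show up in how the effective learning rate is adapted along the way.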
⭐ Kite is a free AI-powered coding assistant that will help you code faster and smarter. The Kite plugin integrates with all the top editors and IDEs to give you smart completions and documentation while you’re typing. I've been using Kite for a few months and I love it! https://www.kite.com/get-kite/?utm_medium=referral&utm_source=youtube&utm_campaign=krishnaik&utm_content=description-only
All Playlists In My Channel:
Complete DL Playlist: https://www.youtube.com/watch?v=9jA0KjS7V_c&list=PLZoTAELRMXVPGU70ZGsckrMdr0FteeRUi
Julia Playlist: https://www.youtube.com/watch?v=Bxp1YFA6M4s&list=PLZoTAELRMXVPJwtjTo2Y6LkuuYK0FT4Q-
Complete ML Playlist: https://www.youtube.com/playlist?list=PLZoTAELRMXVPBTrWtJkn3wWQxZkmTXGwe
Complete NLP Playlist: https://www.youtube.com/playlist?list=PLZoTAELRMXVMdJ5sqbCK2LiM0HhQVWNzm
Docker End To End Implementation: https://www.youtube.com/playlist?list=PLZoTAELRMXVNKtpy0U_Mx9N26w8n0hIbs
Live Stream Playlist: https://www.youtube.com/playlist?list=PLZoTAELRMXVNxYFq_9MuiUdn2YnlFqmMK
Machine Learning Pipelines: https://www.youtube.com/playlist?list=PLZoTAELRMXVNKtpy0U_Mx9N26w8n0hIbs
PyTorch Playlist: https://www.youtube.com/playlist?list=PLZoTAELRMXVNxYFq_9MuiUdn2YnlFqmMK
Feature Engineering: https://www.youtube.com/playlist?list=PLZoTAELRMXVPwYGE2PXD3x0bfKnR0cJjN
Live Projects: https://www.youtube.com/playlist?list=PLZoTAELRMXVOFnfSwkB_uyr4FT-327noK
Kaggle Competitions: https://www.youtube.com/playlist?list=PLZoTAELRMXVPiKOxbwaniXjHJ02bdkLWy
MongoDB With Python: https://www.youtube.com/playlist?list=PLZoTAELRMXVN_8zzsevm1bm6G-plsiO1I
MySQL With Python: https://www.youtube.com/playlist?list=PLZoTAELRMXVMd3RF7p-u7ezEysGaG9JmO
Deployment Architectures: https://www.youtube.com/playlist?list=PLZoTAELRMXVOPzVJiSJAn9Ly27Fi1-8ac
Amazon SageMaker: https://www.youtube.com/playlist?list=PLZoTAELRMXVONh5mHrXowH6-dgyWoC_Ew
If you want to support the channel, you can donate through the GPay UPI ID below.
GPay: krishnaik06@okicici
Telegram link: https://t.me/joinchat/N77M7xRvYUd403DgfE4TWw
Please join my channel as a member to get additional benefits like Data Science materials, members-only live streams, and much more:
https://www.youtube.com/channel/UCNU_lfiiWBdtULKOw6X0Dig/join
Please also subscribe to my other channel:
https://www.youtube.com/channel/UCjWY5hREA6FFYrthD0rZNIw
Connect with me here:
Twitter: https://twitter.com/Krishnaik06
Facebook: https://www.facebook.com/krishnaik06
Instagram: https://www.instagram.com/krishnaik06
#Optimizers
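As a quick sanity check of the epoch vs. iteration counting covered in the SGD chapters, this small sketch (illustrative, not code from the video) computes iterations per epoch as the dataset size divided by the batch size, rounded up for a final partial batch.

```python
# Epoch/iteration bookkeeping: with N records and a chosen batch size,
# one epoch of mini-batch SGD runs ceil(N / batch_size) iterations of
# forward + backward propagation.
import math

def iterations_per_epoch(n_records, batch_size):
    return math.ceil(n_records / batch_size)

print(iterations_per_epoch(10_000, 1))       # plain SGD: 10000 iterations per epoch
print(iterations_per_epoch(10_000, 1_000))   # mini-batch SGD: 10 iterations per epoch
print(iterations_per_epoch(10_000, 10_000))  # full-batch gradient descent: 1 iteration
```

The three calls reproduce the cases from the video: one record per step (SGD), a fixed batch size (mini-batch SGD), and the whole dataset at once (gradient descent).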

detail

{'title': 'Deep Learning-All Optimizers In One Video-SGD with Momentum,Adagrad,Adadelta,RMSprop,Adam Optimizers', 'heatmap': [{'end': 2264.196, 'start': 2195.486, 'weight': 0.754}, {'end': 5445.713, 'start': 5317.847, 'weight': 0.881}], 'summary': 'Series covers explanations of various optimizers, including adam optimizer, addressing mistakes in previous videos, and recommends revising concepts for thorough understanding in over one and a half hours. it discusses challenges of training models with large datasets, introduces mini batch sgd as a more efficient alternative, and explores convergence in gradient descent and mini batch sgd, emphasizing improved efficiency and smoother convergence towards the global minimum.', 'chapters': [{'end': 395.443, 'segs': [{'end': 90.549, 'src': 'embed', 'start': 68.404, 'weight': 0, 'content': [{'end': 76.506, 'text': 'just revise all the concepts, starting from starting till the end, because there were some minor mistakes in my previous optimizer video.', 'start': 68.404, 'duration': 8.102}, {'end': 78.226, 'text': 'so i tried to fix that.', 'start': 76.506, 'duration': 1.72}, {'end': 80.827, 'text': 'i would suggest you please go through the whole video.', 'start': 78.226, 'duration': 2.601}, {'end': 90.549, 'text': 'This one and a half hour of video will be very, very beneficial for all of you to understand the whole working of optimizers.', 'start': 82.205, 'duration': 8.344}], 'summary': 'Revised video with minor mistakes, 1.5 hours long, beneficial for understanding optimizers', 'duration': 22.145, 'max_score': 68.404, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk68404.jpg'}, {'end': 204.193, 'src': 'embed', 'start': 176.856, 'weight': 1, 'content': [{'end': 181.017, 'text': 'And finally, we are getting connected to the output layer, right? 
So this is my output layer.', 'start': 176.856, 'duration': 4.161}, {'end': 182.938, 'text': 'This is my hidden layer one.', 'start': 181.597, 'duration': 1.341}, {'end': 184.679, 'text': 'This is my input layer.', 'start': 183.598, 'duration': 1.081}, {'end': 187.54, 'text': "I'm just taking an example of a neural network.", 'start': 184.999, 'duration': 2.541}, {'end': 193.722, 'text': 'So what happens in the front propagation? We get our Y hat over here.', 'start': 188.32, 'duration': 5.402}, {'end': 197.988, 'text': 'And then based on that, we create our loss function.', 'start': 194.266, 'duration': 3.722}, {'end': 201.911, 'text': 'My loss function is basically my error function, you can see.', 'start': 198.689, 'duration': 3.222}, {'end': 204.193, 'text': 'Or I can also say this as my cost function.', 'start': 202.051, 'duration': 2.142}], 'summary': 'Neural network layers and front propagation process explained.', 'duration': 27.337, 'max_score': 176.856, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk176856.jpg'}, {'end': 285.85, 'src': 'embed', 'start': 257.845, 'weight': 3, 'content': [{'end': 260.146, 'text': 'w12. 
okay, some different different weights are there.', 'start': 257.845, 'duration': 2.301}, {'end': 263.508, 'text': 'so every weights gets updated in the back propagation.', 'start': 260.146, 'duration': 3.362}, {'end': 267.149, 'text': 'we have also understood what is the back updation formula.', 'start': 263.508, 'duration': 3.641}, {'end': 278.207, 'text': 'so we have, like w new is equal to w old, minus learning rate of derivative of loss with respect to w old.', 'start': 267.149, 'duration': 11.058}, {'end': 281.848, 'text': 'so this is the formula that we had already seen.', 'start': 278.207, 'duration': 3.641}, {'end': 285.85, 'text': 'and this particular formula is of the gradient descent.', 'start': 281.848, 'duration': 4.002}], 'summary': 'Weights are updated in back propagation using the formula w new = w old - learning rate * derivative of loss with respect to w old, representing gradient descent.', 'duration': 28.005, 'max_score': 257.845, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk257845.jpg'}], 'start': 0.069, 'title': 'Optimizing neural networks', 'summary': 'Covers the explanation of various optimizers including adam optimizer, addressing mistakes in previous videos, and recommending viewers to revise the concepts for a thorough understanding, in a video lasting over one and a half hours. 
it also explains the front and back propagation in a neural network, the use of optimizers to update weights, the concept of epoch and iteration, and the relevance of loss functions and optimizers in machine learning.', 'chapters': [{'end': 144.894, 'start': 0.069, 'title': 'Optimizing with adam: complete guide', 'summary': 'Covers the explanation of various optimizers including adam optimizer, addressing mistakes in previous videos, and recommending viewers to revise the concepts for a thorough understanding, in a video lasting over one and a half hours.', 'duration': 144.825, 'highlights': ['The video covers explanations of various optimizers including Adam optimizer, addressing mistakes in previous videos, in a video lasting over one and a half hours.', 'The creator recommends viewers to revise the concepts for a thorough understanding of optimizers.', 'The creator provides timestamps for specific topics in the video to aid viewers in accessing relevant content easily.']}, {'end': 395.443, 'start': 145.475, 'title': 'Neural network optimization', 'summary': 'Explains the front and back propagation in a neural network, the use of optimizers to update weights, the concept of epoch and iteration, and the relevance of loss functions and optimizers in machine learning.', 'duration': 249.968, 'highlights': ['The front and back propagation in a neural network involves connecting input, hidden neurons, and output neurons, and aims to reduce the loss function by using optimizers to update weights.', 'The back propagation involves updating all the weights using the back updation formula, w new = w old - learning rate * derivative of loss with respect to w old, which is a part of the gradient descent technique.', 'Different types of loss functions exist for regression and classification problem statements, but the focus of the session is on discussing various kinds of optimizers and explaining the concepts of epoch and iteration.', 'Epoch and iteration are important 
concepts in handling large datasets, where epoch refers to a complete pass through the entire dataset, and iteration refers to the number of batches of data processed within an epoch.']}], 'duration': 395.374, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk69.jpg', 'highlights': ['The video covers explanations of various optimizers including Adam optimizer, addressing mistakes in previous videos, in a video lasting over one and a half hours.', 'The front and back propagation in a neural network involves connecting input, hidden neurons, and output neurons, and aims to reduce the loss function by using optimizers to update weights.', 'The creator recommends viewers to revise the concepts for a thorough understanding of optimizers.', 'The back propagation involves updating all the weights using the back updation formula, w new = w old - learning rate * derivative of loss with respect to w old, which is a part of the gradient descent technique.']}, {'end': 1025.598, 'segs': [{'end': 622.678, 'src': 'embed', 'start': 592.138, 'weight': 2, 'content': [{'end': 595.983, 'text': 'I have to take all the 1 million records according to gradient descent and calculate my loss function.', 'start': 592.138, 'duration': 3.845}, {'end': 599.767, 'text': 'And then in the backward propagation, I have to update all the weights.', 'start': 596.483, 'duration': 3.284}, {'end': 601.89, 'text': 'Now, there is a problem in this.', 'start': 600.468, 'duration': 1.422}, {'end': 607.193, 'text': 'If I take such a huge amount of record, I will be requiring huge RAM.', 'start': 602.891, 'duration': 4.302}, {'end': 611.514, 'text': "You'll be seeing that how many weights will get updated.", 'start': 609.213, 'duration': 2.301}, {'end': 613.695, 'text': "Here, I've just developed a two-layer neural network.", 'start': 611.574, 'duration': 2.121}, {'end': 618.137, 'text': 'But if I have a deep neural network, at that time, the weight 
updation will take time.', 'start': 613.735, 'duration': 4.402}, {'end': 622.678, 'text': 'When the weight updation will take time, the convergence will happen slowly.', 'start': 619.537, 'duration': 3.141}], 'summary': 'Gradient descent with 1 million records requires huge ram and slows convergence.', 'duration': 30.54, 'max_score': 592.138, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk592138.jpg'}, {'end': 668.841, 'src': 'embed', 'start': 637.921, 'weight': 0, 'content': [{'end': 640.643, 'text': 'OK, this updation, weight updation will take time.', 'start': 637.921, 'duration': 2.722}, {'end': 645.306, 'text': 'Definitely it will take time because I have so many records, right? 1 million records.', 'start': 640.823, 'duration': 4.483}, {'end': 650.569, 'text': "And at one time I'm taking so many records and doing back propagation for every epochs.", 'start': 645.726, 'duration': 4.843}, {'end': 653.191, 'text': 'right?. First of all, to load 1 million records.', 'start': 650.569, 'duration': 2.622}, {'end': 655.592, 'text': "OK, I'm just taking an example of 1 million record.", 'start': 653.191, 'duration': 2.401}, {'end': 659.515, 'text': 'Probably in a project, you have 10 million records or 20 million records.', 'start': 655.952, 'duration': 3.563}, {'end': 664.818, 'text': 'right. just to load this particular model in the ram, you will be requiring huge ram size, right?', 'start': 660.055, 'duration': 4.763}, {'end': 668.841, 'text': 'so this is basically resource resource.', 'start': 664.818, 'duration': 4.023}], 'summary': 'Updating weight for 1 million records requires significant time and resources.', 'duration': 30.92, 'max_score': 637.921, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk637921.jpg'}, {'end': 790.742, 'src': 'embed', 'start': 759.066, 'weight': 4, 'content': [{'end': 764.01, 'text': 'Now see this, what is this iteration? 
How many iterations I have to do? That is the main thing.', 'start': 759.066, 'duration': 4.944}, {'end': 766.933, 'text': 'Suppose, if I have 10K records right?', 'start': 764.431, 'duration': 2.502}, {'end': 771.277, 'text': "And I told you in stochastic gradient descent, by default we'll be just taking one record right?", 'start': 767.113, 'duration': 4.164}, {'end': 777.903, 'text': 'So if I have 10K records for every epoch, for one epoch, I have to do 10K iterations.', 'start': 771.717, 'duration': 6.186}, {'end': 783.941, 'text': 'I have to do 10K iterations of forward and backward propagation.', 'start': 780.56, 'duration': 3.381}, {'end': 790.742, 'text': "Again, understand, why 10K? Because every time I'm taking one record and doing the forward propagation, I'm doing the back propagation.", 'start': 784.481, 'duration': 6.261}], 'summary': 'In stochastic gradient descent, with 10k records, there are 10k iterations for one epoch.', 'duration': 31.676, 'max_score': 759.066, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk759066.jpg'}, {'end': 892.653, 'src': 'embed', 'start': 865.927, 'weight': 3, 'content': [{'end': 869.749, 'text': 'And in that particular case, for every epoch, there will be an introduction of iteration.', 'start': 865.927, 'duration': 3.822}, {'end': 872.491, 'text': 'There will be an introduction of iteration that we have learned.', 'start': 870.089, 'duration': 2.402}, {'end': 875.774, 'text': 'Now, in order to prevent this, what we can do.', 'start': 873.132, 'duration': 2.642}, {'end': 881.177, 'text': 'So scientists came up with the second thing that is called as mini batch.', 'start': 876.114, 'duration': 5.063}, {'end': 883.999, 'text': 'Let me just show you over here.', 'start': 881.957, 'duration': 2.042}, {'end': 890.543, 'text': 'Scientists came up with something called as mini batch SGD.', 'start': 884.019, 'duration': 6.524}, {'end': 892.653, 'text': 'Stochastic 
gradient descent.', 'start': 891.573, 'duration': 1.08}], 'summary': 'Scientists introduced mini batch sgd to prevent introduction of iteration in each epoch.', 'duration': 26.726, 'max_score': 865.927, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk865927.jpg'}], 'start': 396.323, 'title': 'Gradient descent challenges and stochastic gradient descent', 'summary': 'Discusses the challenges of training models with large datasets, emphasizing the impact on loss function, weight updation, and computational resources. it also explains stochastic gradient descent (sgd) and introduces mini batch sgd as a more efficient alternative, emphasizing their impact on computational resources and convergence speed.', 'chapters': [{'end': 678.908, 'start': 396.323, 'title': 'Gradient descent challenges with large datasets', 'summary': 'Discusses the challenges of training a model with large datasets, highlighting the impact on loss function, weight updation, and computational resources, as well as the slow convergence rate when using gradient descent with 1 million records.', 'duration': 282.585, 'highlights': ['The weight updation process becomes computationally expensive and slow when dealing with a large dataset of 1 million records, impacting the convergence rate and requiring significant computational resources.', 'Training a model with 1 million records or more can lead to slow convergence due to the time-consuming weight updation process and the extensive computational resources required for handling such large datasets.', 'The loss function calculation and weight updation process become computationally expensive and time-consuming when dealing with a dataset of 1 million records, hindering the convergence rate and necessitating significant computational resources.']}, {'end': 1025.598, 'start': 678.908, 'title': 'Stochastic gradient descent', 'summary': 'Explains stochastic gradient descent (sgd) and introduces mini 
batch sgd as a more efficient alternative, highlighting the impact on computational resources and convergence speed.', 'duration': 346.69, 'highlights': ['In stochastic gradient descent, for every epoch, 10K iterations of forward and backward propagation are required if the batch size is set to 1, resulting in slow convergence. 10K iterations per epoch', 'Mini batch SGD reduces resource requirement by specifying a batch size, enabling fewer iterations for forward and backward propagation in each epoch. Reduced number of iterations per epoch', 'Introduction of mini batch SGD as a more efficient alternative to stochastic gradient descent, emphasizing the impact on computational resources and convergence speed. Efficiency improvement in computational resources and convergence speed']}], 'duration': 629.275, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk396323.jpg', 'highlights': ['Training a model with 1 million records or more can lead to slow convergence due to the time-consuming weight updation process and the extensive computational resources required for handling such large datasets.', 'The weight updation process becomes computationally expensive and slow when dealing with a large dataset of 1 million records, impacting the convergence rate and requiring significant computational resources.', 'The loss function calculation and weight updation process become computationally expensive and time-consuming when dealing with a dataset of 1 million records, hindering the convergence rate and necessitating significant computational resources.', 'Mini batch SGD reduces resource requirement by specifying a batch size, enabling fewer iterations for forward and backward propagation in each epoch. Reduced number of iterations per epoch', 'In stochastic gradient descent, for every epoch, 10K iterations of forward and backward propagation are required if the batch size is set to 1, resulting in slow convergence. 
10K iterations per epoch', 'Introduction of mini batch SGD as a more efficient alternative to stochastic gradient descent, emphasizing the impact on computational resources and convergence speed. Efficiency improvement in computational resources and convergence speed']}, {'end': 2029.227, 'segs': [{'end': 1131.652, 'src': 'embed', 'start': 1100.262, 'weight': 0, 'content': [{'end': 1102.463, 'text': 'We are converging at this specific point.', 'start': 1100.262, 'duration': 2.201}, {'end': 1107.947, 'text': 'But in the case of minibatch SGD, we are getting this zigzag line.', 'start': 1103.384, 'duration': 4.563}, {'end': 1112.53, 'text': 'The reason why we are getting the zigzag line, because over here, it is simple.', 'start': 1108.067, 'duration': 4.463}, {'end': 1114.992, 'text': 'We are just taking a batch of data.', 'start': 1112.57, 'duration': 2.422}, {'end': 1117.181, 'text': 'We are taking a batch of data.', 'start': 1115.64, 'duration': 1.541}, {'end': 1119.283, 'text': 'We are just taking a batch of data.', 'start': 1117.261, 'duration': 2.022}, {'end': 1121.184, 'text': "We don't know the whole data.", 'start': 1119.943, 'duration': 1.241}, {'end': 1131.652, 'text': "So if you don't know the whole data, obviously, the weight updation will take a little bit more time when compared to the gradient descent.", 'start': 1121.785, 'duration': 9.867}], 'summary': 'Minibatch sgd results in zigzag line due to partial data, causing slower weight updation compared to gradient descent.', 'duration': 31.39, 'max_score': 1100.262, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk1100262.jpg'}, {'end': 1313.677, 'src': 'embed', 'start': 1287.905, 'weight': 2, 'content': [{'end': 1292.848, 'text': "I'm just picking one record, doing the front propagation, doing the back propagation and updating the weights right?", 'start': 1287.905, 'duration': 4.943}, {'end': 1294.769, 'text': 'So that is also a disadvantage, 
right?', 'start': 1292.888, 'duration': 1.881}, {'end': 1295.835, 'text': 'Okay.', 'start': 1295.615, 'duration': 0.22}, {'end': 1298.479, 'text': 'And you can see over there, it will be very, very slow.', 'start': 1296.316, 'duration': 2.163}, {'end': 1300.142, 'text': 'The convergence will be very, very slow.', 'start': 1298.519, 'duration': 1.623}, {'end': 1304.208, 'text': 'So the researcher finally came up with something called as mini batch SGD.', 'start': 1300.603, 'duration': 3.605}, {'end': 1313.677, 'text': 'In minibatch SGD, researchers thought that, can we fix a batch size? Suppose if I have 10k records.', 'start': 1305.171, 'duration': 8.506}], 'summary': 'Mini batch sgd was introduced to address slow convergence in training, using fixed batch sizes like 10k records.', 'duration': 25.772, 'max_score': 1287.905, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk1287905.jpg'}, {'end': 1424.608, 'src': 'embed', 'start': 1399.403, 'weight': 1, 'content': [{'end': 1404.629, 'text': 'In SGD, we just take one data at a time, and we do the front propagation and back propagation.', 'start': 1399.403, 'duration': 5.226}, {'end': 1407.112, 'text': 'In mini batch, we take some batch size.', 'start': 1405.049, 'duration': 2.063}, {'end': 1410.857, 'text': 'We take some batch size, and we do the front propagation and back propagation.', 'start': 1407.432, 'duration': 3.425}, {'end': 1421.046, 'text': 'And remember, when we are taking many batches GD, many batches GD, our convergence will not be smooth like gradient descent.', 'start': 1411.519, 'duration': 9.527}, {'end': 1423.408, 'text': 'I told you why? Because we are taking the batch.', 'start': 1421.266, 'duration': 2.142}, {'end': 1424.608, 'text': "We don't know the whole data.", 'start': 1423.448, 'duration': 1.16}], 'summary': 'In stochastic gradient descent, one data at a time is taken for front and back propagation. 
in mini-batch, a batch size is used, affecting convergence in comparison to gradient descent.', 'duration': 25.205, 'max_score': 1399.403, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk1399403.jpg'}, {'end': 1565.06, 'src': 'embed', 'start': 1539.173, 'weight': 3, 'content': [{'end': 1545.357, 'text': 'uh. uh, you know, not just, you can just type out the message, but i hope it is clear.', 'start': 1539.173, 'duration': 6.184}, {'end': 1548.379, 'text': 'epoch is a round of front propagation and back propagation.', 'start': 1545.357, 'duration': 3.022}, {'end': 1551.882, 'text': 'exactly, epoch is a round of front propagation and back propagation.', 'start': 1548.379, 'duration': 3.503}, {'end': 1558.707, 'text': 'but imagine, if there are multiple iteration inside epoch, that many number of front propagation and back propagation will happen.', 'start': 1551.882, 'duration': 6.825}, {'end': 1565.06, 'text': 'okay, There should be two loops, one for batch size and one for number of iterations, to get the total number of observations.', 'start': 1558.707, 'duration': 6.353}], 'summary': 'Epoch involves front and back propagation, with multiple iterations inside, requiring two loops for batch size and number of iterations.', 'duration': 25.887, 'max_score': 1539.173, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk1539173.jpg'}, {'end': 1635.948, 'src': 'embed', 'start': 1609.761, 'weight': 4, 'content': [{'end': 1616.003, 'text': 'But if I just have batch of data like 100 data in the first epoch, that time I may take a different path.', 'start': 1609.761, 'duration': 6.242}, {'end': 1617.163, 'text': 'I may take a different path.', 'start': 1616.103, 'duration': 1.06}, {'end': 1619.664, 'text': "And finally, I'll reach that specific path itself.", 'start': 1617.223, 'duration': 2.441}, {'end': 1625.825, 'text': 'Always remember, in stochastic gradient 
descent, we are getting the data on batches.', 'start': 1620.424, 'duration': 5.401}, {'end': 1628.146, 'text': 'We are not getting the whole data like gradient descent.', 'start': 1626.186, 'duration': 1.96}, {'end': 1631.087, 'text': 'So because of that, this noise will be definitely there.', 'start': 1628.526, 'duration': 2.561}, {'end': 1635.948, 'text': 'But again, the final goal will be that it will be reaching over here.', 'start': 1632.007, 'duration': 3.941}], 'summary': 'In stochastic gradient descent, data is received in batches, causing noise, but still reaching the goal.', 'duration': 26.187, 'max_score': 1609.761, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk1609761.jpg'}, {'end': 1812.727, 'src': 'embed', 'start': 1781.273, 'weight': 5, 'content': [{'end': 1782.394, 'text': 'how many iteration will be there?', 'start': 1781.273, 'duration': 1.121}, {'end': 1789.448, 'text': 'The iteration will be equal to the total data set, divided by batch size.', 'start': 1783.015, 'duration': 6.433}, {'end': 1794.051, 'text': "So here I'm specifying some batch size.", 'start': 1791.95, 'duration': 2.101}, {'end': 1797.273, 'text': 'Suppose if I specify my batch size is 1, 000.', 'start': 1794.692, 'duration': 2.581}, {'end': 1803.396, 'text': 'So my total number of iteration will be 10, 000 divided by 1, 000.', 'start': 1797.273, 'duration': 6.123}, {'end': 1806.379, 'text': "So this will be for every epoch, I'll be having 10 iterations.", 'start': 1803.397, 'duration': 2.982}, {'end': 1812.727, 'text': 'You can select the number of epochs, like how many number of epochs will be there.', 'start': 1808.206, 'duration': 4.521}], 'summary': 'Total iterations = total data set / batch size. 
for batch size 1,000, 10 iterations per epoch.', 'duration': 31.454, 'max_score': 1781.273, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk1781273.jpg'}], 'start': 1025.598, 'title': 'Gradient descent convergence', 'summary': 'Discusses convergence in gradient descent and mini batch sgd, addressing noise impact, challenges with large datasets, and the concept of epoch, iteration, and noise, emphasizing improved efficiency and smoother convergence towards the global minimum.', 'chapters': [{'end': 1121.184, 'start': 1025.598, 'title': 'Convergence in gradient descent', 'summary': 'Explains the difference between convergence in gradient descent and mini batch sgd, highlighting the impact of noise in convergence due to the use of mini batch sgd and the limited data batch size.', 'duration': 95.586, 'highlights': ['The reason for the noise in convergence in mini batch SGD is due to the zigzag line formed, caused by taking a batch of data instead of the whole dataset, resulting in limited data representation.', 'In gradient descent, convergence occurs with a straight line, reaching the global minimum, while in mini batch SGD, the convergence exhibits a zigzag pattern, attributed to the noise caused by the use of limited data batch size.']}, {'end': 1539.173, 'start': 1121.785, 'title': 'Optimizing gradient descent for neural networks', 'summary': 'Discusses the challenges of gradient descent in handling large datasets, leading to the development of stochastic gradient descent (sgd) and mini batch sgd, which improve efficiency by updating weights based on smaller subsets of data, reducing computational expense and enabling smoother convergence towards the global minimum.', 'duration': 417.388, 'highlights': ['The challenges of gradient descent in handling large datasets led to the development of stochastic gradient descent (SGD) and mini batch SGD, which update weights based on smaller subsets of data, reducing 
computational expense and enabling smoother convergence towards the global minimum.', 'In gradient descent, the entire dataset is used for forward and backward propagation, leading to resource-intensive computations, while SGD involves updating weights for each record within an epoch, resulting in slow convergence, and mini batch SGD addresses this by fixing a batch size, improving efficiency and reducing computational expense.', 'The convergence in gradient descent is smooth, while in SGD, the convergence is affected by the use of smaller subsets of data and the consequent updates to weights, leading to less smooth convergence towards the global minimum.']}, {'end': 2029.227, 'start': 1539.173, 'title': 'Understanding epoch, iteration, and noise in gradient descent', 'summary': 'Explains the concept of epoch, iteration, and noise in gradient descent, emphasizing the impact on training models and convergence. it details the differences between stochastic gradient descent, mini batch stochastic gradient descent, and gradient descent, highlighting the effects of noise and convergence speed.', 'duration': 490.054, 'highlights': ['Epoch is a round of front propagation and back propagation, with the number of front and back propagations depending on the number of iterations inside the epoch. The concept of epoch in training models is explained, involving the occurrence of front and back propagations, with the number of propagations determined by the iterations within the epoch.', 'Stochastic gradient descent involves getting data on batches, resulting in the presence of noise due to the batch size, affecting convergence speed compared to gradient descent. 
The presence of noise in stochastic gradient descent, caused by obtaining data in batches, impacts convergence speed in contrast to gradient descent.', "The number of iterations in stochastic gradient descent is equal to the number of records for each epoch, while mini batch stochastic gradient descent's iterations are determined by the total data set divided by the batch size. The calculation of iterations in stochastic gradient descent and mini batch stochastic gradient descent is detailed, based on the number of records and the total data set divided by the batch size, respectively.", 'Mini batch stochastic gradient descent leads to more noise and slower convergence compared to gradient descent, while still resolving the main problem of resource utilization. The effects of noise and convergence speed in mini batch stochastic gradient descent are highlighted, indicating a slower convergence compared to gradient descent but addressing resource utilization challenges.']}], 'duration': 1003.629, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk1025598.jpg', 'highlights': ['Mini batch SGD exhibits zigzag convergence due to limited data representation.', 'Challenges with large datasets led to the development of SGD and mini batch SGD.', 'Mini batch SGD addresses slow convergence and computational expense.', 'Epoch involves front and back propagations, determined by the number of iterations.', 'Stochastic gradient descent introduces noise due to batch data, impacting convergence speed.', 'Iterations in mini batch SGD are determined by the total data set divided by the batch size.']}, {'end': 3667.242, 'segs': [{'end': 2128.84, 'src': 'embed', 'start': 2101.073, 'weight': 0, 'content': [{'end': 2113.68, 'text': "we'll try to reach over here and for that we basically say a topic which is called as sgd with momentum, sgd with momentum.", 'start': 2101.073, 'duration': 12.607}, {'end': 2121.004, 'text': 'so, with the 
help of this momentum terminology, this momentum basically says that this momentum what it.', 'start': 2113.68, 'duration': 7.324}, {'end': 2123.385, 'text': "uh, we'll discuss about this momentum what exactly it is.", 'start': 2121.004, 'duration': 2.381}, {'end': 2128.84, 'text': 'With the help of SGD with momentum, this noisy data that we have.', 'start': 2124.457, 'duration': 4.383}], 'summary': 'Discussing sgd with momentum for noise reduction in data.', 'duration': 27.767, 'max_score': 2101.073, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk2101073.jpg'}, {'end': 2264.196, 'src': 'heatmap', 'start': 2195.486, 'weight': 0.754, 'content': [{'end': 2203.614, 'text': 'Momentum says that we are instead of calculating our derivative of loss with respect to derivative of w old.', 'start': 2195.486, 'duration': 8.128}, {'end': 2213.944, 'text': 'in order to make this smoothen, we will compute something called as exponential weighted average.', 'start': 2203.614, 'duration': 10.33}, {'end': 2220.216, 'text': 'Exponential Weighted Average.', 'start': 2218.095, 'duration': 2.121}, {'end': 2221.657, 'text': 'We are going to compute this one.', 'start': 2220.236, 'duration': 1.421}, {'end': 2223.819, 'text': 'That is Exponential Weighted Average.', 'start': 2222.118, 'duration': 1.701}, {'end': 2229.742, 'text': 'Instead of just computing gradient descent, instead we will try to compute Exponential Weighted Average.', 'start': 2223.859, 'duration': 5.883}, {'end': 2233.084, 'text': 'Now what is this Exponential Weighted Average? 
Let me just give you a very good example.', 'start': 2229.762, 'duration': 3.322}, {'end': 2238.007, 'text': 'Okay Suppose I have a series of data.', 'start': 2233.725, 'duration': 4.282}, {'end': 2239.568, 'text': 'I have a series of data.', 'start': 2238.348, 'duration': 1.22}, {'end': 2243.171, 'text': "Like I'll say that, okay, at T1, I have A1 data.", 'start': 2240.049, 'duration': 3.122}, {'end': 2246.377, 'text': 'At T2, I have A2 data.', 'start': 2244.375, 'duration': 2.002}, {'end': 2249.12, 'text': 'At T3, I have A3 data.', 'start': 2247.038, 'duration': 2.082}, {'end': 2251.462, 'text': 'At T4, I have A4 data.', 'start': 2249.64, 'duration': 1.822}, {'end': 2254.806, 'text': 'And like this, I have Tn, I have An data.', 'start': 2252.083, 'duration': 2.723}, {'end': 2257.248, 'text': 'This is just like a time series problem.', 'start': 2255.206, 'duration': 2.042}, {'end': 2260.331, 'text': 'With respect to the future data, I have some values.', 'start': 2257.809, 'duration': 2.522}, {'end': 2261.973, 'text': 'At T1, I had A1 data.', 'start': 2260.411, 'duration': 1.562}, {'end': 2264.196, 'text': 't2, a2, a3, a4.', 'start': 2262.654, 'duration': 1.542}], 'summary': 'Using exponential weighted average to smooth derivative calculation and optimize gradient descent.', 'duration': 68.71, 'max_score': 2195.486, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk2195486.jpg'}, {'end': 2233.084, 'src': 'embed', 'start': 2203.614, 'weight': 1, 'content': [{'end': 2213.944, 'text': 'in order to make this smoothen, we will compute something called as exponential weighted average.', 'start': 2203.614, 'duration': 10.33}, {'end': 2220.216, 'text': 'Exponential Weighted Average.', 'start': 2218.095, 'duration': 2.121}, {'end': 2221.657, 'text': 'We are going to compute this one.', 'start': 2220.236, 'duration': 1.421}, {'end': 2223.819, 'text': 'That is Exponential Weighted Average.', 'start': 2222.118, 
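The time-series example above (values A1…An at steps T1…Tn) can be sketched as a short Python snippet. This is a minimal illustration of the exponentially weighted average described here; the names `ewa`, `beta`, and `series` are illustrative, not from the video:

```python
def ewa(values, beta=0.95):
    """Exponentially weighted average: v_t = beta * v_(t-1) + (1 - beta) * a_t."""
    v, out = 0.0, []
    for a in values:
        v = beta * v + (1 - beta) * a  # larger beta -> more weight on the past
        out.append(v)
    return out

# A large beta smooths the series toward its history; a small beta tracks new values.
series = [10, 12, 9, 11, 10]
smoothed = ewa(series, beta=0.95)
reactive = ewa(series, beta=0.1)
```

With beta = 0.95 each new point barely moves the average (heavy smoothing); with beta = 0.1 the average follows the newest value, matching the discussion of beta's role below.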
'duration': 1.701}, {'end': 2229.742, 'text': 'Instead of just computing gradient descent, instead we will try to compute Exponential Weighted Average.', 'start': 2223.859, 'duration': 5.883}, {'end': 2233.084, 'text': 'Now what is this Exponential Weighted Average? Let me just give you a very good example.', 'start': 2229.762, 'duration': 3.322}], 'summary': 'Introducing exponential weighted average for optimization.', 'duration': 29.47, 'max_score': 2203.614, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk2203614.jpg'}, {'end': 2618.977, 'src': 'embed', 'start': 2593.218, 'weight': 3, 'content': [{'end': 2599.383, 'text': 'VDW is nothing but it is an exponential weighted average, like whatever formula we have actually found out over here.', 'start': 2593.218, 'duration': 6.165}, {'end': 2604.567, 'text': 'Similarly, till now, remember, the weight updation formula is like this.', 'start': 2599.903, 'duration': 4.664}, {'end': 2607.088, 'text': 'We also have that bias updation formula also.', 'start': 2604.987, 'duration': 2.101}, {'end': 2609.33, 'text': 'Bias updation formula is also like this only.', 'start': 2607.589, 'duration': 1.741}, {'end': 2615.895, 'text': 'I can write it b new is equal to b old minus learning rate of derivative of loss with respect to b old.', 'start': 2609.79, 'duration': 6.105}, {'end': 2618.977, 'text': 'So my db old will also become something like this.', 'start': 2616.235, 'duration': 2.742}], 'summary': 'Vdw is an exponential weighted average. 
weight and bias updation formulas are derived using the derivative of loss and learning rate.', 'duration': 25.759, 'max_score': 2593.218, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk2593218.jpg'}, {'end': 3043.667, 'src': 'embed', 'start': 3015.535, 'weight': 2, 'content': [{'end': 3019.416, 'text': 'If I take t1 and t2 and I want to compute the exponentially weighted average,', 'start': 3015.535, 'duration': 3.881}, {'end': 3024.378, 'text': 'I will multiply beta multiplied by vt1 plus 1 minus beta multiplied by a2..', 'start': 3019.416, 'duration': 4.962}, {'end': 3029.98, 'text': "This beta, if I'm specifying it as 0.95, it is giving more importance to the previous value.", 'start': 3024.778, 'duration': 5.202}, {'end': 3036.803, 'text': 'If I am specifying a smaller value, it is giving more importance to the new value, the next upcoming values.', 'start': 3030.6, 'duration': 6.203}, {'end': 3043.667, 'text': 'Now, here in this case, I said that if beta is 0.1, then more importance is given to the next upcoming values.', 'start': 3036.863, 'duration': 6.804}], 'summary': 'Computing exponentially weighted average with beta values 0.95 and 0.1.', 'duration': 28.132, 'max_score': 3015.535, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk3015535.jpg'}, {'end': 3437.062, 'src': 'embed', 'start': 3406.552, 'weight': 5, 'content': [{'end': 3407.613, 'text': 'Every damn person.', 'start': 3406.552, 'duration': 1.061}, {'end': 3410.714, 'text': 'Everybody uses Adam.', 'start': 3408.453, 'duration': 2.261}, {'end': 3413.435, 'text': "We'll discuss about Adam also.", 'start': 3412.295, 'duration': 1.14}, {'end': 3414.336, 'text': "It's pretty much good.", 'start': 3413.455, 'duration': 0.881}, {'end': 3418.077, 'text': 'That does not mean that we should not know this.', 'start': 3415.756, 'duration': 2.321}, {'end': 3419.038, 'text': 'We should know 
this.', 'start': 3418.377, 'duration': 0.661}, {'end': 3420.118, 'text': "That's why I'm teaching you.", 'start': 3419.058, 'duration': 1.06}, {'end': 3421.919, 'text': 'But that only.', 'start': 3420.938, 'duration': 0.981}, {'end': 3426.941, 'text': 'They may ask you some interview questions with respect to that.', 'start': 3422.579, 'duration': 4.362}, {'end': 3437.062, 'text': "The fourth type of optimizer It's called as, let me write it down, adaptive Adagrad.", 'start': 3427.181, 'duration': 9.881}], 'summary': 'Discussion of adam optimizer and teaching about adaptive adagrad', 'duration': 30.51, 'max_score': 3406.552, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk3406552.jpg'}, {'end': 3617.439, 'src': 'embed', 'start': 3589.005, 'weight': 6, 'content': [{'end': 3599.968, 'text': 'in this process, i have kept my learning rate fixed, but in this process my learning rate is changing as the number of iteration is increasing.', 'start': 3589.005, 'duration': 10.963}, {'end': 3602.154, 'text': 'Yes,', 'start': 3601.954, 'duration': 0.2}, {'end': 3603.935, 'text': 'Obviously, this will be faster.', 'start': 3602.734, 'duration': 1.201}, {'end': 3605.355, 'text': 'This blue color one will be faster.', 'start': 3603.955, 'duration': 1.4}, {'end': 3606.756, 'text': 'Initially, I took a bigger step.', 'start': 3605.395, 'duration': 1.361}, {'end': 3610.857, 'text': 'I know that I want to reach at that point of time.', 'start': 3608.556, 'duration': 2.301}, {'end': 3614.838, 'text': "Suppose I'm going to just consider like this.", 'start': 3612.097, 'duration': 2.741}, {'end': 3617.439, 'text': 'This is my mountain.', 'start': 3616.619, 'duration': 0.82}], 'summary': 'Learning rate changes with iteration for faster convergence.', 'duration': 28.434, 'max_score': 3589.005, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk3589005.jpg'}], 'start': 2029.227, 
'title': 'Optimization techniques in machine learning', 'summary': 'Covers the application of sgd with momentum technique for faster convergence, the concept of exponential weighted average, its application in updating weights and biases, and the impact of exponentially weighted average on computational performance in gradient descent. additionally, it discusses the significance of adapting the learning rate in neural network training and compares static and dynamic learning rates.', 'chapters': [{'end': 2233.084, 'start': 2029.227, 'title': 'Sgd with momentum technique', 'summary': 'Discusses the application of sgd with momentum technique to smoothen noisy data for faster convergence in gradient descent, introducing the concept of exponential weighted average as a key component.', 'duration': 203.857, 'highlights': ['SGD with momentum enhances convergence by smoothening noisy data, leading to faster reaching of specific points. The technique of SGD with momentum is applied to smoothen noisy data for faster convergence, enabling the reaching of specific points more quickly.', 'Introduction of Exponential Weighted Average as a key concept in SGD with momentum. Exponential Weighted Average is introduced as a key concept in SGD with momentum, replacing the traditional computation of gradient descent and contributing to the smoothening of data.', 'Explanation of the modification in the computation of derivative of loss with respect to derivative of weights using the t and t-1 iterations. 
The computation of the derivative of loss with respect to the weights is modified by introducing t and t-1 iterations, specifying the previous and current iterations for the calculation.']}, {'end': 2990.663, 'start': 2233.725, 'title': 'Exponential weighted average', 'summary': 'discusses the concept of exponential weighted average, its formula, and the significance of the hyperparameter beta in smoothing data, emphasizing the application of this concept in updating weights and biases for optimization in machine learning.', 'duration': 756.938, 'highlights': ['The formula for the exponential weighted average is v(t) = beta × v(t-1) + (1 - beta) × a(t), with beta specifying the importance given to previous versus new data; for example, beta = 0.95 gives more weight to the earlier value A1 for smoothing. Exponential weighted average formula, importance of beta in smoothing, example of using beta value 0.95', 'Updating weights and biases involves the use of the exponentially weighted average, with the weight updation formula being w(t) = w(t-1) - learning rate × v(dw) and the bias updation formula being b(t) = b(t-1) - learning rate × v(db). Application of exponentially weighted average in updating weights and biases, weight and bias updation formulas', 'The formula for updating VDW involves the use of beta and the derivative of loss with respect to weights, with beta often initialized as 0.95 for smoothing weights. 
Formula for updating VDW, use of beta in smoothening weights']}, {'end': 3345.964, 'start': 2991.843, 'title': 'Exponentially weighted average in gradient descent', 'summary': 'Explains the concept of exponentially weighted average in gradient descent, emphasizing the use of beta values to assign weights and achieve smoothening of convergence in the optimization process, with a focus on the implementation details and the impact on computational performance.', 'duration': 354.121, 'highlights': ['The concept of exponentially weighted average involves specifying weights to the previous value and computing the average, with emphasis on assigning beta values to give importance to previous or new values. ', 'The implementation details of the exponentially weighted average in gradient descent involve the computation of VDW and VDB, updating weights and biases using learning rates, and achieving smoothening of convergence. ', 'The use of beta values, such as 0.1 or 0.95, impacts the importance given to previous or new values, influencing the smoothening of convergence in the optimization process. 
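The VDW/VDB updates and the weight/bias updation formulas summarized in this chapter can be sketched in plain Python. This is a minimal sketch of SGD with momentum as described here; the function and state names (`momentum_step`, `state`) are illustrative, not from the video:

```python
def momentum_step(w, b, dw, db, state, lr=0.01, beta=0.95):
    """One SGD-with-momentum step using exponentially weighted averages of the gradients."""
    # v_dw = beta * v_dw + (1 - beta) * dL/dw  (and likewise for the bias)
    state["vdw"] = beta * state["vdw"] + (1 - beta) * dw
    state["vdb"] = beta * state["vdb"] + (1 - beta) * db
    w -= lr * state["vdw"]  # w_new = w_old - learning_rate * v_dw
    b -= lr * state["vdb"]  # b_new = b_old - learning_rate * v_db
    return w, b

state = {"vdw": 0.0, "vdb": 0.0}
w, b = 1.0, 0.5
w, b = momentum_step(w, b, dw=2.0, db=1.0, state=state)
```

Because the raw gradients are replaced by their exponentially weighted averages, successive noisy gradients partially cancel, which is the smoothing effect the chapter attributes to momentum.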
']}, {'end': 3667.242, 'start': 3346.605, 'title': 'Optimizing learning rate in neural networks', 'summary': 'Introduces different optimization algorithms for neural network training, emphasizing the significance of adapting the learning rate, with a comparison between static and dynamic learning rates and their impact on convergence speed.', 'duration': 320.637, 'highlights': ['The best optimizer is Adam, widely used in practice, but understanding other optimization algorithms is essential for a comprehensive understanding of the mathematics behind neural network training and potential interview questions.', 'Adaptive Adagrad is introduced as an adaptive gradient descent method, highlighting the concept of dynamically changing learning rates during training to improve convergence speed.', 'Comparing the impact of fixed learning rates versus dynamically changing learning rates during training, illustrating the faster convergence achieved with dynamic learning rates through an analogy of reaching the tip of a mountain with different strategies.', 'An analogy of reaching the tip of a mountain is used to explain the concept of dynamic learning rates, emphasizing the advantage of initially taking larger steps and gradually reducing step size for faster convergence.']}], 'duration': 1638.015, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk2029227.jpg', 'highlights': ['SGD with momentum enhances convergence by smoothening noisy data, leading to faster reaching of specific points.', 'Introduction of Exponential Weighted Average as a key concept in SGD with momentum.', 'The formula for exponential weighted average is beta multiplied by vt1 plus 1 minus beta multiplied by a2, with beta specifying the importance given to previous and future data, such as 0.95 giving more weight to A1 for smoothening.', 'Updating weights and biases involves the use of the exponentially weighted average, with the weight updation formula 
being w(t) = w(t-1) - learning rate × v(dw) and the bias updation formula being b(t) = b(t-1) - learning rate × v(db).', 'The concept of exponentially weighted average involves specifying weights to the previous value and computing the average, with emphasis on assigning beta values to give importance to previous or new values.', 'The best optimizer is Adam, widely used in practice, but understanding other optimization algorithms is essential for a comprehensive understanding of the mathematics behind neural network training and potential interview questions.', 'Comparing the impact of fixed learning rates versus dynamically changing learning rates during training, illustrating the faster convergence achieved with dynamic learning rates through an analogy of reaching the tip of a mountain with different strategies.']}, {'end': 4291.186, 'segs': [{'end': 3696.929, 'src': 'embed', 'start': 3667.682, 'weight': 0, 'content': [{'end': 3670.123, 'text': 'And here, they will start taking smaller, smaller steps.', 'start': 3667.682, 'duration': 2.441}, {'end': 3671.483, 'text': 'And they will try to search everything.', 'start': 3670.183, 'duration': 1.3}, {'end': 3679.985, 'text': 'Like that, that is the most advantageous when we change the learning rate dynamically as the training is happening.', 'start': 3673.003, 'duration': 6.982}, {'end': 3686.527, 'text': 'And that is the reason why they came up with this particular concept, which is called as adaptive gradient descent.', 'start': 3681.145, 'duration': 5.382}, {'end': 3694.649, 'text': 'and in this the formula is very, very simple, very simple, not like sgd with momentum.', 'start': 3687.446, 'duration': 7.203}, {'end': 3696.929, 'text': "again, i'm telling you, not like sgd with momentum.", 'start': 3694.649, 'duration': 2.28}], 'summary': 'Adaptive gradient descent uses dynamic learning rates, unlike sgd with momentum.', 'duration': 29.247, 'max_score': 3667.682, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk3667682.jpg'}, {'end': 3959.13, 'src': 'embed', 'start': 3932.227, 'weight': 2, 'content': [{'end': 3936.349, 'text': 'Sorry, initially the jump is high, and then it is becoming smaller.', 'start': 3932.227, 'duration': 4.122}, {'end': 3940.651, 'text': 'That basically means after considering one learning rate,', 'start': 3936.769, 'duration': 3.882}, {'end': 3944.832, 'text': 'we should try to reduce this learning rate value as we are moving towards the convergence point.', 'start': 3940.651, 'duration': 4.181}, {'end': 3947.694, 'text': 'And this will happen with respect to time steps.', 'start': 3945.553, 'duration': 2.141}, {'end': 3954.265, 'text': 'So suppose that initially t is equal to 1, times step one, I initialize it as 0.01.', 'start': 3947.754, 'duration': 6.511}, {'end': 3959.13, 'text': 'In t is equal to two, I should try to reduce this and make it like this, 0.009.', 'start': 3954.265, 'duration': 4.865}], 'summary': 'Learning rate should decrease over time steps for convergence, e.g., from 0.01 to 0.009.', 'duration': 26.903, 'max_score': 3932.227, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk3932227.jpg'}, {'end': 4016.978, 'src': 'embed', 'start': 3988.166, 'weight': 3, 'content': [{'end': 3993.811, 'text': 'you can see that summation of i is equal to 1 to t derivative of loss with respect to derivative of w whole square.', 'start': 3988.166, 'duration': 5.645}, {'end': 4000.036, 'text': "When I write alpha of t, alpha of t, and I'm doing summation of i is equal to 1 to t, i is equal to 1 to t.", 'start': 3994.131, 'duration': 5.905}, {'end': 4001.918, 'text': "Suppose I'm at the t5 stage.", 'start': 4000.036, 'duration': 1.882}, {'end': 4004.775, 'text': "I'm at the fifth iteration stage.", 'start': 4002.935, 'duration': 1.84}, {'end': 4011.717, 'text': 'I also have to compute the derivative of 
loss with respect to derivative of w whole square and I have to do the summation.', 'start': 4005.396, 'duration': 6.321}, {'end': 4016.978, 'text': 'I have to find the sum of derivative of loss with respect to derivative of whole square.', 'start': 4011.737, 'duration': 5.241}], 'summary': 'Iteratively compute summation of derivatives for loss with respect to w.', 'duration': 28.812, 'max_score': 3988.166, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk3988166.jpg'}, {'end': 4236.253, 'src': 'embed', 'start': 4202.061, 'weight': 5, 'content': [{'end': 4204.503, 'text': 'And for this also, we have a technique to solve it.', 'start': 4202.061, 'duration': 2.442}, {'end': 4208.907, 'text': 'And that is what we will be discussing in our next optimizer, which is called as RMS prop.', 'start': 4204.983, 'duration': 3.924}, {'end': 4211.883, 'text': 'rms prop.', 'start': 4210.763, 'duration': 1.12}, {'end': 4214.164, 'text': 'right, there was still a disadvantage in this.', 'start': 4211.883, 'duration': 2.281}, {'end': 4222.227, 'text': 'the disadvantage is that, as i keep on doing this particular operation for a very deep neural network with respect to all the previous iterations,', 'start': 4214.164, 'duration': 8.063}, {'end': 4226.148, 'text': 'right, i have keep on updating this alpha of t.', 'start': 4222.227, 'duration': 3.921}, {'end': 4227.169, 'text': 'i keep on updating this.', 'start': 4226.148, 'duration': 1.021}, {'end': 4236.253, 'text': 'so when this value is very, very bigger, obviously this value is going to decrease And at one point of time the value will be such a minimal value.', 'start': 4227.169, 'duration': 9.084}], 'summary': 'Introducing rms prop technique to solve optimization problems in deep neural networks.', 'duration': 34.192, 'max_score': 4202.061, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk4202061.jpg'}, {'end': 
4277.658, 'src': 'embed', 'start': 4248.58, 'weight': 4, 'content': [{'end': 4250.922, 'text': 'And probably the convergence may also not happen.', 'start': 4248.58, 'duration': 2.342}, {'end': 4257.424, 'text': 'that is the disadvantages with respect to, uh, this technique which is called as adaptive gradient descent.', 'start': 4251.622, 'duration': 5.802}, {'end': 4260.864, 'text': 'that basically means that here we are changing the learning rate.', 'start': 4257.424, 'duration': 3.44}, {'end': 4265.326, 'text': 'we are changing the learning rate as the number of epochs are increasing.', 'start': 4260.864, 'duration': 4.462}, {'end': 4270.727, 'text': "so, uh, let me know, guys, whether you're following.", 'start': 4265.326, 'duration': 5.401}, {'end': 4274.168, 'text': "uh, you can message it whether you're following this or not.", 'start': 4270.727, 'duration': 3.441}, {'end': 4277.658, 'text': 'Is it beautiful??', 'start': 4276.277, 'duration': 1.381}], 'summary': 'Disadvantages of adaptive gradient descent include changing learning rate as epochs increase, with uncertainty about convergence.', 'duration': 29.078, 'max_score': 4248.58, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk4248580.jpg'}], 'start': 3667.682, 'title': 'Adaptive gradient descent', 'summary': 'Discusses the concept of adaptive gradient descent (adagrad) which dynamically adjusts the learning rate during training, utilizing a formula that incorporates a change in learning rate over time and emphasizes the reduction of the learning rate as the training progresses. 
it also explores the computation of alpha of t and its impact on reducing the learning rate, while highlighting potential disadvantages of the technique in deep neural networks, leading to the introduction of rms prop optimizer.', 'chapters': [{'end': 3988.166, 'start': 3667.682, 'title': 'Adaptive gradient descent', 'summary': 'Discusses the concept of adaptive gradient descent (adagrad) which dynamically adjusts the learning rate during training, utilizing a formula that incorporates a change in learning rate over time and emphasizes the reduction of the learning rate as the training progresses.', 'duration': 320.484, 'highlights': ['The concept of AdaGrad dynamically adjusts the learning rate during training. This highlights the central concept of AdaGrad, which dynamically changes the learning rate during training to optimize the training process.', 'The formula for AdaGrad involves a change in learning rate over time and emphasizes the reduction of the learning rate as the training progresses. The formula for AdaGrad involves a change in learning rate over time, incorporating a reduction in learning rate as the training progresses, aiming to optimize the convergence of the model.', 'The initial learning rate is 0.01, and it is progressively reduced over time to achieve convergence. 
The initial learning rate of 0.01 is progressively reduced over time, with specific examples provided, illustrating the reduction in learning rate as the training progresses to achieve convergence.']}, {'end': 4291.186, 'start': 3988.166, 'title': 'Adaptive gradient descent', 'summary': 'Discusses the concept of adaptive gradient descent, focusing on the computation of alpha of t and its impact on reducing the learning rate, while highlighting the potential disadvantage of the technique in deep neural networks, leading to the introduction of rms prop optimizer.', 'duration': 303.02, 'highlights': ['The computation of alpha of t involves the summation of the derivative of loss with respect to the derivative of w whole square from i=1 to t, impacting the reduction of learning rate. The computation of alpha of t involves the summation of the derivative of loss with respect to the derivative of w whole square from i=1 to t, impacting the reduction of learning rate.', 'The potential disadvantage of the adaptive gradient descent technique in deep neural networks is that alpha of t may become a very large number, leading to a minimal value after updation, causing w new and w old to be approximately equal. The potential disadvantage of the adaptive gradient descent technique in deep neural networks is that alpha of t may become a very large number, leading to a minimal value after updation, causing w new and w old to be approximately equal.', 'The introduction of RMS prop optimizer is mentioned as a technique to address the potential disadvantage of adaptive gradient descent in deep neural networks. 
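The AdaGrad behaviour summarized above, where alpha(t) accumulates squared gradients so the effective learning rate shrinks with each iteration, can be sketched as follows. A minimal illustration; `adagrad_step` and the constant gradient stream are illustrative assumptions, not from the video:

```python
import math

def adagrad_step(w, dw, alpha, lr=0.01, eps=1e-8):
    """AdaGrad: alpha_t = running sum of squared gradients;
    effective learning rate = lr / sqrt(alpha_t + eps)."""
    alpha += dw ** 2  # alpha_t grows monotonically as iterations increase
    w -= (lr / math.sqrt(alpha + eps)) * dw
    return w, alpha

w, alpha = 1.0, 0.0
for dw in [2.0, 2.0, 2.0]:
    w, alpha = adagrad_step(w, dw, alpha)
# Each step is smaller than the last: as alpha keeps growing in a very deep network,
# the effective learning rate can shrink until w_new is approximately w_old,
# which is the drawback RMSprop is introduced to address.
```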
The introduction of RMS prop optimizer is mentioned as a technique to address the potential disadvantage of adaptive gradient descent in deep neural networks.']}], 'duration': 623.504, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk3667682.jpg', 'highlights': ['The concept of AdaGrad dynamically adjusts the learning rate during training.', 'The formula for AdaGrad involves a change in learning rate over time and emphasizes the reduction of the learning rate as the training progresses.', 'The initial learning rate is 0.01, and it is progressively reduced over time to achieve convergence.', 'The computation of alpha of t involves the summation of the derivative of loss with respect to the derivative of w whole square from i=1 to t, impacting the reduction of learning rate.', 'The potential disadvantage of the adaptive gradient descent technique in deep neural networks is that alpha of t may become a very large number, leading to a minimal value after updation, causing w new and w old to be approximately equal.', 'The introduction of RMS prop optimizer is mentioned as a technique to address the potential disadvantage of adaptive gradient descent in deep neural networks.']}, {'end': 4603.483, 'segs': [{'end': 4342.303, 'src': 'embed', 'start': 4291.226, 'weight': 3, 'content': [{'end': 4304.548, 'text': 'Obviously, it is very, very interesting, okay? How to choose? See, guys, it will take time, okay? Does AdaGrad have momentum concepts? 
No.', 'start': 4291.226, 'duration': 13.322}, {'end': 4311.17, 'text': 'It does not have any momentum concepts, so I did not change the formula of derivative of loss with respect to derivative of w old.', 'start': 4305.508, 'duration': 5.662}, {'end': 4317.993, 'text': 'Guys, see, if you are not following, just wait for some time.', 'start': 4314.511, 'duration': 3.482}, {'end': 4321.394, 'text': 'I have already uploaded all these videos in my YouTube channel.', 'start': 4318.573, 'duration': 2.821}, {'end': 4323.115, 'text': 'You just have to watch it once again.', 'start': 4321.414, 'duration': 1.701}, {'end': 4331.801, 'text': 'How T1 comma T2 value came, sir? See, T1, T2 is nothing but as we go on doing the number of iteration.', 'start': 4325.031, 'duration': 6.77}, {'end': 4335.526, 'text': "Suppose in the first iteration, I'll specify it as T1.", 'start': 4332.582, 'duration': 2.944}, {'end': 4336.848, 'text': 'In the second iteration, T2.', 'start': 4335.566, 'duration': 1.282}, {'end': 4338.05, 'text': 'In the third iteration, T3.', 'start': 4337.188, 'duration': 0.862}, {'end': 4342.303, 'text': "So, when I'm doing this, I'm doing the summation of 1 to t right?", 'start': 4339.181, 'duration': 3.122}
videos are on youtube.', 'duration': 51.077, 'max_score': 4291.226, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk4291226.jpg'}, {'end': 4416.171, 'src': 'embed', 'start': 4386.621, 'weight': 1, 'content': [{'end': 4389.462, 'text': 'Unable to understand the algorithm and equation relationship.', 'start': 4386.621, 'duration': 2.841}, {'end': 4392.563, 'text': 'Just try to understand, guys, over here, what we are trying to do.', 'start': 4389.522, 'duration': 3.041}, {'end': 4394.424, 'text': 'What we are trying to do over here.', 'start': 4393.143, 'duration': 1.281}, {'end': 4398.145, 'text': 'We are trying to reduce this learning rate dynamically as the training is happening.', 'start': 4394.464, 'duration': 3.681}, {'end': 4402.226, 'text': 'So in order to do that, I have divided by alpha of t plus e.', 'start': 4398.645, 'duration': 3.581}, {'end': 4407.668, 'text': 'This alpha of t is basically derived by this particular equation, right? This is what the researchers have given.', 'start': 4402.226, 'duration': 5.442}, {'end': 4408.568, 'text': 'They have exploded.', 'start': 4407.708, 'duration': 0.86}, {'end': 4411.389, 'text': "I have also not proved it, you know? They've given this.", 'start': 4408.588, 'duration': 2.801}, {'end': 4416.171, 'text': 'What if we apply to a smaller neural network? 
It will definitely work well.', 'start': 4413.17, 'duration': 3.001}], 'summary': 'Discussing dynamic learning rate reduction in neural networks and its potential application to smaller networks.', 'duration': 29.55, 'max_score': 4386.621, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk4386621.jpg'}, {'end': 4496.14, 'src': 'embed', 'start': 4450.895, 'weight': 0, 'content': [{'end': 4459.257, 'text': 'uh, in interview they will ask you these basic things you know, and probably you have to tell them that how the learning rate is basically decreasing.', 'start': 4450.895, 'duration': 8.362}, {'end': 4463.338, 'text': 'you know, can you explain the part at which alpha increases?', 'start': 4459.257, 'duration': 4.081}, {'end': 4465.699, 'text': 'see alpha alpha formula.', 'start': 4463.338, 'duration': 2.361}, {'end': 4467.519, 'text': 'just focus everyone over here.', 'start': 4465.699, 'duration': 1.82}, {'end': 4475.621, 'text': 'see alpha formula is like this right summation of i is equal to 1 to t dl of the derivative of loss with respect to derivative of w.', 'start': 4467.519, 'duration': 8.102}, {'end': 4481.086, 'text': 'now, suppose my iteration has happened till t5, T5, right?', 'start': 4475.621, 'duration': 5.465}, {'end': 4483.088, 'text': 'Suppose T4 and T5..', 'start': 4481.246, 'duration': 1.842}, {'end': 4488.933, 'text': "For everything, we'll calculate the derivative of loss with respect to derivative of W for that specific time.", 'start': 4484.109, 'duration': 4.824}, {'end': 4496.14, 'text': 'Right? 
We are doing the square.', 'start': 4491.536, 'duration': 4.604}], 'summary': 'Explaining the decrease in learning rate and alpha formula in interview context.', 'duration': 45.245, 'max_score': 4450.895, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk4450895.jpg'}], 'start': 4291.226, 'title': 'Adagrad, momentum, and learning rate dynamics', 'summary': 'Discusses the absence of momentum concepts in adagrad, emphasizes the importance of understanding t1 and t2 values during iterations, delves into dynamically reducing the learning rate to mitigate the vanishing gradient problem, and highlights the increasing importance of these basic concepts for interviews.', 'chapters': [{'end': 4342.303, 'start': 4291.226, 'title': 'Adagrad and momentum concepts', 'summary': 'Discusses the absence of momentum concepts in adagrad and emphasizes the importance of watching the uploaded videos for understanding t1 and t2 values during iterations.', 'duration': 51.077, 'highlights': ['The absence of momentum concepts in AdaGrad and the unchanged formula of the derivative of loss with respect to the derivative of the double world.', 'Emphasizing the importance of watching the uploaded videos for understanding T1 and T2 values during iterations.', 'Explanation of T1, T2 as the values during different iterations and the summation of 1 to t.']}, {'end': 4603.483, 'start': 4342.583, 'title': 'Understanding learning rate dynamics', 'summary': 'Delves into the dynamics of learning rate, emphasizing the need to dynamically reduce it during training to mitigate the vanishing gradient problem. 
it discusses the derivation and application of alpha of t, showcasing its impact on the learning rate, and highlights the increasing importance of understanding these basic concepts for interviews.', 'duration': 260.9, 'highlights': ['The chapter emphasizes the need to dynamically reduce the learning rate during training to mitigate the vanishing gradient problem, which can lead to a very big alpha T and fluctuating learning rates, illustrated by decreasing values from 0.001 to 0.0060.', 'It discusses the derivation and application of alpha of t, which is dynamically calculated based on the sum of the squares of the derivatives of loss with respect to weights for each time iteration, showcasing its increasing value over time.', 'The increasing importance of understanding basic concepts related to learning rate dynamics for interviews is highlighted, emphasizing the need to explain how the learning rate is decreasing and how alpha increases in a given formula.']}], 'duration': 312.257, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk4291226.jpg', 'highlights': ['The increasing importance of understanding basic concepts related to learning rate dynamics for interviews is highlighted, emphasizing the need to explain how the learning rate is decreasing and how alpha increases in a given formula.', 'The chapter emphasizes the need to dynamically reduce the learning rate during training to mitigate the vanishing gradient problem, which can lead to a very big alpha T and fluctuating learning rates, illustrated by decreasing values from 0.001 to 0.0060.', 'It discusses the derivation and application of alpha of t, which is dynamically calculated based on the sum of the squares of the derivatives of loss with respect to weights for each time iteration, showcasing its increasing value over time.', 'Emphasizing the importance of watching the uploaded videos for understanding T1 and T2 values during iterations.', 
'Explanation of T1, T2 as the values during different iterations and the summation of 1 to t.', 'The absence of momentum concepts in AdaGrad and the unchanged formula of the derivative of loss with respect to the derivative of the weight (dL/dW).']}, {'end': 5254.841, 'segs': [{'end': 4931.097, 'src': 'embed', 'start': 4906.216, 'weight': 0, 'content': [{'end': 4912.903, 'text': 'Instead of just using alpha of t, we are changing this alpha of t with something called as exponentially weighted average.', 'start': 4906.216, 'duration': 6.687}, {'end': 4916.046, 'text': 'This notation is basically given in the research paper, so I have written it.', 'start': 4913.263, 'duration': 2.783}, {'end': 4918.873, 'text': 'Okay? Research paper.', 'start': 4917.793, 'duration': 1.08}, {'end': 4921.854, 'text': 'So instead of this, we are using exponentially weighted average.', 'start': 4919.233, 'duration': 2.621}, {'end': 4926.155, 'text': 'And here in the exponentially weighted average, I have taken 1 minus beta.', 'start': 4922.414, 'duration': 3.741}, {'end': 4928.496, 'text': 'Okay? We are restricting this.', 'start': 4926.515, 'duration': 1.981}, {'end': 4931.097, 'text': 'Because of this, only this alpha t was increasing.', 'start': 4929.056, 'duration': 2.041}], 'summary': 'Using exponentially weighted average to modify alpha of t for better performance.', 'duration': 24.881, 'max_score': 4906.216, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk4906216.jpg'}, {'end': 5056.655, 'src': 'embed', 'start': 5031.627, 'weight': 1, 'content': [{'end': 5042.299, 'text': 'bias also will be like that which has been nothing but beta Hdb plus 1 minus beta derivative of loss with respect to derivative of b whole square,', 'start': 5031.627, 'duration': 10.672}, {'end': 5048.426, 'text': 'okay. 
and if i go down little bit okay.', 'start': 5042.299, 'duration': 6.127}, {'end': 5056.655, 'text': 'after this, my weight updation formula will be wt is equal to wt minus 1, minus learning rate.', 'start': 5048.426, 'duration': 8.229}], 'summary': 'Weight updation formula: wt = wt-1 - (learning rate / sqrt(Hdw + epsilon)) * dL/dw', 'duration': 25.028, 'max_score': 5031.627, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk5031627.jpg'}, {'end': 5170.463, 'src': 'embed', 'start': 5119.388, 'weight': 2, 'content': [{'end': 5122.649, 'text': "Guys, if you're understanding 50%, you're better than me.", 'start': 5119.388, 'duration': 3.261}, {'end': 5128.451, 'text': 'Initially, when I saw this, I left it.', 'start': 5123.93, 'duration': 4.521}, {'end': 5132.672, 'text': "I said, no, right now I'm not going to study this because there's so many things.", 'start': 5129.171, 'duration': 3.501}, {'end': 5138.954, 'text': "But if you're understanding 20, 30, 40%, 50%, it is very, very good.", 'start': 5132.712, 'duration': 6.242}, {'end': 5143.095, 'text': "But I hope I'm teaching nicely, guys.", 'start': 5139.514, 'duration': 3.581}, {'end': 5146.776, 'text': "Yes or no? 
Something you're understanding or.', 'start': 5143.795, 'duration': 2.981}, {'end': 5151.051, 'text': "or Krish, you don't know how to teach like that.", 'start': 5149.27, 'duration': 1.781}, {'end': 5152.012, 'text': 'It is something like that.', 'start': 5151.131, 'duration': 0.881}, {'end': 5157.695, 'text': 'Again, I want, okay, many people are saying that they are understanding, okay.', 'start': 5152.932, 'duration': 4.763}, {'end': 5160.717, 'text': 'Guys, this always remember the first time.', 'start': 5158.296, 'duration': 2.421}, {'end': 5162.418, 'text': "you'll not be able to understand.", 'start': 5160.717, 'duration': 1.701}, {'end': 5167.521, 'text': 'you have to keep on going onto this because there are a lot of maths involved.', 'start': 5162.418, 'duration': 5.103}, {'end': 5170.463, 'text': 'The people who are deriving this, they are PhD guys.', 'start': 5167.761, 'duration': 2.702}], 'summary': 'Understanding 20-50% is good. keep going. maths involved. taught by phds.', 'duration': 51.075, 'max_score': 5119.388, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk5119388.jpg'}], 'start': 4603.483, 'title': 'Optimizing with adadelta and rmsprop', 'summary': 'Discusses the use of adadelta and rmsprop as the fifth optimizer, utilizing exponentially weighted average with a beta value to restrict the increase of alpha t, and also explains the weight updation algorithm for adaptive rms prop using mini batch iterations, emphasizing the importance of understanding at least 20-30% of the process.', 'chapters': [{'end': 4968.609, 'start': 4603.483, 'title': 'Optimizing with adadelta and rmsprop', 'summary': 'The chapter discusses the use of adadelta and rmsprop as the fifth optimizer, utilizing exponentially weighted average with a beta value to restrict the increase of alpha t, which leads to a proper weight updation formula.', 'duration': 365.126, 'highlights': ['AdaDelta and RMSProp are used as the fifth optimizer, 
utilizing exponentially weighted average with a beta value to restrict the increase of alpha t, which leads to a proper weight updation formula.', 'Exponentially weighted average with a beta value is used to restrict the increase of alpha t, preventing it from skyrocketing, ensuring proper weight updation.', 'The formula for weight updation is based on the exponentially weighted average, restricting the increase of alpha t, ensuring proper weight updation.']}, {'end': 5254.841, 'start': 4968.609, 'title': 'Weight updation algorithm explanation', 'summary': 'Explains the weight updation algorithm for adaptive rms prop using mini batch iterations, emphasizing the importance of understanding at least 20-30% of the process, and the need to grasp the overall story and reasoning behind the steps involved.', 'duration': 286.232, 'highlights': ['The weight updation algorithm involves computing derivatives of loss with respect to weights and bias, and then updating the weights and bias using the computed derivatives and learning rate.', 'Emphasizes the importance of understanding at least 20-30% of the process and the need to grasp the overall story and reasoning behind the steps involved.', 'Encourages viewers to think about the process as a story, understand the problem with respect to adaptive gradient descent, and be able to explain the working principles to interviewers.', 'Stresses the presence of complex mathematical concepts and the 
involvement of PhD-level researchers in deriving the algorithms, but urges continuous learning and understanding.', 'Encourages viewers not to worry about understanding the exact equations, but to focus on comprehending how things are working and the rationale behind each step.']}], 'duration': 651.358, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk4603483.jpg', 'highlights': ['AdaDelta and RMSProp are used as fifth optimizers, incorporating exponentially weighted average with a specified beta value to restrict the increase of alpha t, ensuring proper weight updation formula.', 'The weight updation algorithm involves computing derivatives of loss with respect to weights and bias, and then updating the weights and bias using the computed derivatives and learning rate.', 'Emphasizes the importance of understanding at least 20-30% of the process and the need to grasp the overall story and reasoning behind the steps involved.', 'Stresses the presence of complex mathematical concepts and the involvement of PhD-level researchers in deriving the algorithms, but urges continuous learning and understanding.']}, {'end': 6114.213, 'segs': [{'end': 5445.713, 'src': 'heatmap', 'start': 5317.847, 'weight': 0.881, 'content': [{'end': 5322.25, 'text': 'At one point of time, the RMS prop was the most efficient one.', 'start': 5317.847, 'duration': 4.403}, {'end': 5332.457, 'text': 'So the next one that they actually came up with was something called as the last part, which is called as Adam optimizer.', 'start': 5323.011, 'duration': 9.446}, {'end': 5337.026, 'text': 'This Adam optimizer.', 'start': 5335.906, 'duration': 1.12}, {'end': 5341.247, 'text': 'Adam optimizer probably was the best till now and still the best.', 'start': 5337.206, 'duration': 4.041}, {'end': 5348.148, 'text': 'This name is something called as adaptive moment estimation.', 'start': 5342.187, 'duration': 5.961}, {'end': 5353.829, 'text': 'This is a 
full form of the Adam optimizer.', 'start': 5349.068, 'duration': 4.761}, {'end': 5354.849, 'text': 'It was pretty much simple.', 'start': 5353.869, 'duration': 0.98}, {'end': 5359.51, 'text': 'In Adam optimizer, what did they do? They just tried to combine two things.', 'start': 5354.889, 'duration': 4.621}, {'end': 5364.191, 'text': 'One is momentum and second one is RMS prop.', 'start': 5360.37, 'duration': 3.821}, {'end': 5376.388, 'text': 'They just combined these two techniques, and they came up with an amazing way of implementing an adaptive moment estimation.', 'start': 5366.906, 'duration': 9.482}, {'end': 5380.129, 'text': 'With the help of momentum, you are actually getting smoothening.', 'start': 5377.108, 'duration': 3.021}, {'end': 5389.251, 'text': 'With the help of RMS prop, you are actually able to change your learning rate in an efficient manner,', 'start': 5382.609, 'duration': 6.642}, {'end': 5392.111, 'text': 'such that that alpha t value will also not go high.', 'start': 5389.251, 'duration': 2.86}, {'end': 5395.132, 'text': 'So both these advantages things.', 'start': 5392.591, 'duration': 2.541}, {'end': 5401.19, 'text': 'They took, and they combined, and they came up with this Adam optimizer.', 'start': 5397.728, 'duration': 3.462}, {'end': 5405.132, 'text': "So let's write down the algorithm with respect to Adam optimizer.", 'start': 5401.69, 'duration': 3.442}, {'end': 5407.814, 'text': 'Remember, we will be calculating this momentum.', 'start': 5405.553, 'duration': 2.261}, {'end': 5409.775, 'text': "We'll be calculating this RMS prop also.", 'start': 5407.854, 'duration': 1.921}, {'end': 5422.322, 'text': 'So initially, whenever I talk about momentum, I had initialized two variables like VDW, VDB.', 'start': 5410.435, 'duration': 11.887}, {'end': 5424.704, 'text': 'When I am talking about RMS prop.', 'start': 5422.743, 'duration': 1.961}, {'end': 5428.199, 'text': 'we have come up with HDW, HDB.', 'start': 5424.704, 'duration': 
3.495}, {'end': 5428.679, 'text': 'right, we have.', 'start': 5428.199, 'duration': 0.48}, {'end': 5428.979, 'text': 'we had.', 'start': 5428.679, 'duration': 0.3}, {'end': 5429.84, 'text': 'we had created this.', 'start': 5428.979, 'duration': 0.861}, {'end': 5434.024, 'text': 'you see over here, we had created this right hdw hdb.', 'start': 5429.84, 'duration': 4.184}, {'end': 5443.191, 'text': 'with respect to momentum, uh, we had written vdw, vdb, right and based on that only, we wrote all these details right.', 'start': 5434.024, 'duration': 9.167}, {'end': 5445.713, 'text': 'so now we are trying to combine both the things.', 'start': 5443.191, 'duration': 2.522}], 'summary': 'Adam optimizer combines momentum and rms prop techniques for efficient learning rate adaptation.', 'duration': 127.866, 'max_score': 5317.847, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk5317847.jpg'}, {'end': 5364.191, 'src': 'embed', 'start': 5337.206, 'weight': 0, 'content': [{'end': 5341.247, 'text': 'Adam optimizer probably was the best till now and still the best.', 'start': 5337.206, 'duration': 4.041}, {'end': 5348.148, 'text': 'This name is something called as adaptive moment estimation.', 'start': 5342.187, 'duration': 5.961}, {'end': 5353.829, 'text': 'This is a full form of the Adam optimizer.', 'start': 5349.068, 'duration': 4.761}, {'end': 5354.849, 'text': 'It was pretty much simple.', 'start': 5353.869, 'duration': 0.98}, {'end': 5359.51, 'text': 'In Adam optimizer, what did they do? 
They just tried to combine two things.', 'start': 5354.889, 'duration': 4.621}, {'end': 5364.191, 'text': 'One is momentum and second one is RMS prop.', 'start': 5360.37, 'duration': 3.821}], 'summary': 'Adam optimizer combines momentum and rms prop for improved performance.', 'duration': 26.985, 'max_score': 5337.206, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk5337206.jpg'}, {'end': 5523.122, 'src': 'embed', 'start': 5498.765, 'weight': 1, 'content': [{'end': 5506.507, 'text': 'for every gradient descent we are going to for, for, for every uh, optimizers that we are using, we are going to use a mini batch.', 'start': 5498.765, 'duration': 7.742}, {'end': 5511.558, 'text': 'mini batch basically means there will be some batch size, Batch size of 1,000.', 'start': 5506.507, 'duration': 5.051}, {'end': 5512.599, 'text': 'It may be of 10,000.', 'start': 5511.558, 'duration': 1.041}, {'end': 5513.619, 'text': 'It may be of 20,000.', 'start': 5512.599, 'duration': 1.02}, {'end': 5516.22, 'text': 'Depends on the number of data.', 'start': 5513.619, 'duration': 2.601}, {'end': 5519.161, 'text': 'Now, in the next step, this is the first step.', 'start': 5517.42, 'duration': 1.741}, {'end': 5523.122, 'text': 'We compute this one using the current minibatch.', 'start': 5519.261, 'duration': 3.861}], 'summary': 'For every gradient descent, mini batches of 1,000, 10,000, or 20,000 are used based on the data size.', 'duration': 24.357, 'max_score': 5498.765, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk5498765.jpg'}, {'end': 6066.163, 'src': 'embed', 'start': 6039.833, 'weight': 2, 'content': [{'end': 6044.155, 'text': "OK. 
there's so much math equation and sometimes we also get confused.", 'start': 6039.833, 'duration': 4.322}, {'end': 6046.956, 'text': 'So it is nothing but 1 minus beta 2t.', 'start': 6044.255, 'duration': 2.701}, {'end': 6049.597, 'text': 'So this kind of correction was brought.', 'start': 6048.096, 'duration': 1.501}, {'end': 6056.079, 'text': 'After you computed VDW, they did some of this kind of correction, and then they replaced it over here.', 'start': 6050.237, 'duration': 5.842}, {'end': 6059.28, 'text': 'They replaced it like VDW correction.', 'start': 6056.819, 'duration': 2.461}, {'end': 6062.201, 'text': 'VDW correction.', 'start': 6061.081, 'duration': 1.12}, {'end': 6064.402, 'text': 'Here also correction.', 'start': 6063.382, 'duration': 1.02}, {'end': 6066.163, 'text': 'Here also correction.', 'start': 6065.303, 'duration': 0.86}], 'summary': 'Math equations and vdw corrections were applied to replace and adjust values.', 'duration': 26.33, 'max_score': 6039.833, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk6039833.jpg'}], 'start': 5254.841, 'title': 'Evolution of adam optimizer', 'summary': "Delves into the evolution of optimization techniques, focusing on adam optimizer's combination of momentum and rms prop methods to efficiently adjust learning rates, resulting in more effective adaptive moment estimation. 
it also discusses optimizing weight update formulas by computing vdw and vdb, using different beta values in rms prop, and introducing bias correction to optimize learning rate and update bias and weights.", 'chapters': [{'end': 5523.122, 'start': 5254.841, 'title': 'Adam optimizer overview', 'summary': 'Discusses the evolution of optimization techniques in machine learning, from gradient descent to the introduction of adam optimizer, which combines momentum and rms prop methods to efficiently adjust learning rates, resulting in a more effective approach for adaptive moment estimation.', 'duration': 268.281, 'highlights': ['The Adam optimizer is the most efficient optimization technique, surpassing RMS prop, and it combines momentum and RMS prop methods to achieve adaptive moment estimation.', 'The algorithm for Adam optimizer involves initializing values to zero before starting the iteration, computing derivatives of loss with respect to weights and biases, and utilizing mini batches for the optimization process.', 'Mini batch optimization is crucial for all optimizers after gradient descent and involves using a batch size of varying numbers, such as 1,000, 10,000, or 20,000, depending on the dataset size.']}, {'end': 6114.213, 'start': 5523.563, 'title': 'Optimizing weight update formulas', 'summary': 'Discusses the computation of vdw and vdb, the use of different beta values in rms prop, and the introduction of bias correction in weight update formulas, ultimately aiming to optimize the learning rate and update bias and weights.', 'duration': 590.65, 'highlights': ['The computation of VDW and VDB involves the use of beta values in the formulas, with the same beta value used for each component. 
', 'Different beta values are used in the computation of Sdw and Sdb, introducing a hyperparameter and a bias correction in the updated equation.', 'The weight updation formula involves the introduction of a bias correction and the adjustment of the learning rate in RMS prop, aiming to optimize the weight update process.']}], 'duration': 859.372, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/TudQZtgpoHk/pics/TudQZtgpoHk5254841.jpg', 'highlights': ['The Adam optimizer combines momentum and RMS prop methods for adaptive moment estimation, surpassing RMS prop.', 'Mini batch optimization is crucial for all optimizers after gradient descent, involving varying batch sizes.', 'The computation of VDW and VDB involves the use of beta values, introducing a hyperparameter and bias correction.']}], 'highlights': ['The video covers explanations of various optimizers including Adam optimizer, addressing mistakes in previous videos, in a video lasting over one and a half hours.', 'The creator recommends viewers to revise the concepts for a thorough understanding of optimizers.', 'The best optimizer is Adam, widely used in practice, but understanding other optimization algorithms is essential for a comprehensive understanding of the mathematics behind neural network training and potential interview questions.', 'Training a model with 1 million records or more can lead to slow convergence due to the time-consuming weight updation process and the extensive computational resources required for 
handling such large datasets.', 'Mini batch SGD reduces resource requirement by specifying a batch size, enabling fewer iterations for forward and backward propagation in each epoch.', 'Mini batch SGD exhibits zigzag convergence due to limited data representation.', 'SGD with momentum enhances convergence by smoothening noisy data, leading to faster reaching of specific points.', 'The concept of AdaGrad dynamically adjusts the learning rate during training.', 'The Adam optimizer combines momentum and RMS prop methods for adaptive moment estimation, surpassing RMS prop.']}
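
The Adam update walked through in the transcript above (momentum term VDW, RMSprop-style squared-gradient term, bias correction, then the weight update) can be sketched in a few lines of Python. This is a minimal illustrative snippet, not code from the video: the function name `adam_step`, the toy loss L(w) = (w - 3)^2, and the hyperparameter values are assumptions chosen for demonstration.

```python
# Minimal sketch of one Adam update, following the transcript's notation:
# vdw = momentum (exponentially weighted gradient), sdw = RMSprop-style
# exponentially weighted squared gradient. Names and values are illustrative.

def adam_step(w, grad, vdw, sdw, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum + RMSprop accumulators with bias correction."""
    vdw = beta1 * vdw + (1 - beta1) * grad         # momentum part (smoothening)
    sdw = beta2 * sdw + (1 - beta2) * grad ** 2    # RMSprop part (bounds the step size)
    vdw_corr = vdw / (1 - beta1 ** t)              # bias correction: offsets zero initialization
    sdw_corr = sdw / (1 - beta2 ** t)              # the "1 minus beta 2 t" term from the video
    w = w - lr * vdw_corr / (sdw_corr ** 0.5 + eps)  # adaptive effective learning rate
    return w, vdw, sdw

# Toy run: minimize L(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, vdw, sdw = 0.0, 0.0, 0.0
for t in range(1, 5001):
    w, vdw, sdw = adam_step(w, 2 * (w - 3), vdw, sdw, t, lr=0.01)
print(w)  # ends up close to the minimum at w = 3
```

The bias terms VDB/SDB in the video follow the same two recurrences with dL/db in place of dL/dw; the per-parameter division by the square root of the corrected squared-gradient average is what keeps the accumulated term from growing without bound, the AdaGrad problem the transcript describes.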