title

Lecture 10 | Recurrent Neural Networks

description

In Lecture 10 we discuss the use of recurrent neural networks for modeling sequence data. We show how recurrent neural networks can be used for language modeling and image captioning, and how soft spatial attention can be incorporated into image captioning models. We discuss different architectures for recurrent neural networks, including Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU).
Keywords: Recurrent neural networks, RNN, language modeling, image captioning, soft attention, LSTM, GRU
Slides: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf
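As a rough illustration of the recurrence this lecture introduces, the sketch below implements a vanilla RNN in plain Python: the hidden state is updated as h_t = tanh(W_hh h_{t-1} + W_xh x_t), and the same weights are reused at every time step. It is a minimal sketch and not the course's code. The helper names (matvec, rnn_step, run_rnn, random_matrix), the output matrix W_hy, the layer sizes, and the omission of bias terms are all assumptions made for illustration.

```python
import math
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(W_xh, W_hh, h_prev, x):
    """One vanilla RNN update: h_t = tanh(W_hh h_{t-1} + W_xh x_t). Biases omitted."""
    from_hidden = matvec(W_hh, h_prev)
    from_input = matvec(W_xh, x)
    return [math.tanh(a + b) for a, b in zip(from_hidden, from_input)]

def run_rnn(W_xh, W_hh, W_hy, xs, h0):
    """Read inputs one at a time, reusing the same weights at every step.

    Emits one output per time step through a linear readout W_hy
    (an assumed name for the extra fully connected layer on top of h_t).
    """
    h, ys = h0, []
    for x in xs:
        h = rnn_step(W_xh, W_hh, h, x)
        ys.append(matvec(W_hy, h))
    return ys, h

def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; the scale is an arbitrary illustrative choice."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
input_size, hidden_size, output_size = 3, 4, 2  # illustrative sizes only
W_xh = random_matrix(hidden_size, input_size, rng)
W_hh = random_matrix(hidden_size, hidden_size, rng)
W_hy = random_matrix(output_size, hidden_size, rng)

# A toy sequence of five one-hot inputs.
xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0],
      [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
ys, h_final = run_rnn(W_xh, W_hh, W_hy, xs, [0.0] * hidden_size)
print(len(ys), "outputs; final hidden state:", h_final)
```

Because the loop runs once per element of xs, the same three weight matrices handle a sequence of any length, which is what lets one model cover the one-to-many, many-to-one, and many-to-many setups the lecture describes.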
--------------------------------------------------------------------------------------
Convolutional Neural Networks for Visual Recognition
Instructors:
Fei-Fei Li: http://vision.stanford.edu/feifeili/
Justin Johnson: http://cs.stanford.edu/people/jcjohns/
Serena Yeung: http://ai.stanford.edu/~syyeung/
Computer Vision has become ubiquitous in our society, with applications in search, image understanding, apps, mapping, medicine, drones, and self-driving cars. Core to many of these applications are visual recognition tasks such as image classification, localization and detection. Recent developments in neural network (aka “deep learning”) approaches have greatly advanced the performance of these state-of-the-art visual recognition systems. This lecture collection is a deep dive into details of the deep learning architectures with a focus on learning end-to-end models for these tasks, particularly image classification. From this lecture collection, students will learn to implement, train and debug their own neural networks and gain a detailed understanding of cutting-edge research in computer vision.
Website:
http://cs231n.stanford.edu/
For additional learning opportunities please visit:
http://online.stanford.edu/

detail

{'title': 'Lecture 10 | Recurrent Neural Networks', 'heatmap': [{'end': 925.804, 'start': 779.485, 'weight': 0.794}, {'end': 1196.568, 'start': 1002.91, 'weight': 0.711}, {'end': 1494.544, 'start': 1443.572, 'weight': 0.717}, {'end': 2770.164, 'start': 2674.409, 'weight': 1}, {'end': 3477.077, 'start': 3423.833, 'weight': 0.762}], 'summary': 'This Stanford University lecture covers recurrent neural networks. It opens with administrative notes (about half the students had finished assignment two), reviews CNN architectures such as AlexNet, VGG, and GoogLeNet along with the challenges of training deep models before batch normalization, discusses the flexibility of RNNs in handling variable-length input and output data, and introduces the LSTM as a remedy for the vanishing gradient problem, emphasizing its design for better gradient flow.', 'chapters': [{'end': 75.64, 'segs': [{'end': 75.64, 'src': 'embed', 'start': 29.605, 'weight': 0, 'content': [{'end': 32.267, 'text': 'So as usual, a couple administrative notes.', 'start': 29.605, 'duration': 2.662}, {'end': 37.01, 'text': "So we're working hard on assignment one grading.", 'start': 32.807, 'duration': 4.203}, {'end': 40.391, 'text': 'Those grades will probably be out sometime later today.', 'start': 37.41, 'duration': 2.981}, {'end': 44.174, 'text': 'Hopefully they can get out before the A2 deadline.', 'start': 41.572, 'duration': 2.602}, {'end': 45.314, 'text': "That's what I'm hoping for.", 'start': 44.254, 'duration': 1.06}, {'end': 55.874, 'text': "On a related note, assignment two is due today at 11:59 p.m. So who's done with that already? 
About half of you guys.", 'start': 46.455, 'duration': 9.419}, {'end': 60.916, 'text': 'So remember I did warn you when the assignment went out that it was quite long and to start early.', 'start': 56.494, 'duration': 4.422}, {'end': 63.076, 'text': 'So you were warned about that.', 'start': 61.556, 'duration': 1.52}, {'end': 65.677, 'text': 'But hopefully you guys have some late days left.', 'start': 64.056, 'duration': 1.621}, {'end': 70.018, 'text': 'Also as another reminder, the midterm will be in class on Tuesday.', 'start': 66.937, 'duration': 3.081}, {'end': 75.64, 'text': 'If you kind of look around the lecture hall, there are not enough seats in this room to seat all the enrolled students in the class.', 'start': 71.018, 'duration': 4.622}], 'summary': 'Assignment one grades are expected later today, about half the class has finished assignment two, and the midterm is in class on Tuesday.', 'duration': 46.035, 'max_score': 29.605, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ29605.jpg'}], 'start': 4.898, 'title': 'Lecture 10: Recurrent Neural Networks', 'summary': 'Opens Lecture 10 on recurrent neural networks at Stanford University with administrative notes on assignment deadlines and midterm arrangements; approximately half the students have completed assignment two.', 'chapters': [{'end': 75.64, 'start': 4.898, 'title': 'Lecture 10: Recurrent Neural Networks', 'summary': 'Opens Lecture 10 on recurrent neural networks at Stanford University with administrative notes on assignment deadlines and midterm arrangements; approximately half the students have completed assignment two.', 'duration': 70.742, 'highlights': ['Approximately half of the students have completed assignment two, which is due today at 11:59 p.m.', 'The grades for assignment one will probably be out later today, hopefully before the A2 deadline.', 'The midterm will be held in class on Tuesday, and there are not enough seats in the lecture hall for 
all enrolled students.']}], 'duration': 70.742, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ4898.jpg', 'highlights': ['The midterm will be held in class on Tuesday, and there are not enough seats in the lecture hall for all enrolled students.', 'Approximately half of the students have completed assignment two, which is due today at 11:59 p.m.', 'The grades for assignment one will probably be out later today, hopefully before the A2 deadline.']}, {'end': 422.336, 'segs': [{'end': 101.701, 'src': 'embed', 'start': 75.84, 'weight': 0, 'content': [{'end': 80.041, 'text': "So we'll actually be having the midterm in several other lecture halls across campus.", 'start': 75.84, 'duration': 4.201}, {'end': 83.602, 'text': "And we'll be sending out some more details on exactly where to go in the next couple of days.", 'start': 80.481, 'duration': 3.121}, {'end': 87.909, 'text': 'So another bit of announcement.', 'start': 85.687, 'duration': 2.222}, {'end': 92.693, 'text': "We've been working on this sort of fun bit of extra credit thing for you to play with that we're calling the training game.", 'start': 87.949, 'duration': 4.744}, {'end': 99.399, 'text': 'So this is this cool browser-based experience where you can go in and interactively train neural networks and tweak the hyperparameters during training.', 'start': 92.713, 'duration': 6.686}, {'end': 101.701, 'text': 'And this should be a really cool,', 'start': 100.18, 'duration': 1.521}], 'summary': 'The midterm will be held in multiple lecture halls, with details to be sent soon. 
A browser-based training game is being developed for interactive neural network training.', 'duration': 25.861, 'max_score': 75.84, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ75840.jpg'}, {'end': 175.452, 'src': 'embed', 'start': 148.278, 'weight': 2, 'content': [{'end': 152.842, 'text': 'We kind of walked through the timeline of some of the various winners of the ImageNet classification challenge.', 'start': 148.278, 'duration': 4.564}, {'end': 159.888, 'text': 'Kind of the breakthrough result, as we saw, was the AlexNet architecture in 2012, which was a nine-layer convolutional network.', 'start': 153.562, 'duration': 6.326}, {'end': 166.875, 'text': 'It did amazingly well and it sort of kick-started this whole deep learning revolution in computer vision and kind of brought a lot of these models into the mainstream.', 'start': 159.988, 'duration': 6.887}, {'end': 175.452, 'text': 'Then we skip ahead a couple years and saw that in the 2014 ImageNet Challenge, we had these two really interesting models, VGG and GoogLeNet,', 'start': 168.347, 'duration': 7.105}], 'summary': 'AlexNet, a nine-layer convolutional network, was the breakthrough result of the 2012 ImageNet challenge and kick-started the deep learning revolution in computer vision. 
VGG and GoogLeNet were also notable in the 2014 challenge.', 'duration': 27.174, 'max_score': 148.278, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ148278.jpg'}, {'end': 204.105, 'src': 'embed', 'start': 176.892, 'weight': 5, 'content': [{'end': 182.056, 'text': 'So VGG was, they had a 16 and a 19 layer model, and GoogLeNet was, I believe, a 22 layer model.', 'start': 176.892, 'duration': 5.164}, {'end': 190.681, 'text': 'Although one thing that is kind of interesting about these models is that the 2014 ImageNet Challenge was right before batch normalization was invented.', 'start': 183.016, 'duration': 7.665}, {'end': 198.363, 'text': 'So at this time, before the invention of batch normalization, training these relatively deep models of roughly 20 layers was very challenging.', 'start': 191.221, 'duration': 7.142}, {'end': 204.105, 'text': 'So in fact, both of these two models had to resort to a little bit of hackery in order to get their deep models to converge.', 'start': 198.763, 'duration': 5.342}], 'summary': 'VGG had 16- and 19-layer models, and GoogLeNet had a 22-layer model. 
training deep models before batch normalization was challenging.', 'duration': 27.213, 'max_score': 176.892, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ176892.jpg'}, {'end': 314.881, 'src': 'embed', 'start': 283.851, 'weight': 3, 'content': [{'end': 289.036, 'text': 'One is that if we just set all the weights in this residual block to zero, then this block is computing the identity.', 'start': 283.851, 'duration': 5.185}, {'end': 295.222, 'text': "So in some way, it's relatively easy for this model to learn not to use the layers that it doesn't need.", 'start': 289.497, 'duration': 5.725}, {'end': 301.508, 'text': 'In addition, it kind of adds this interpretation to L2 regularization in the context of these neural networks.', 'start': 296.103, 'duration': 5.405}, {'end': 307.674, 'text': "Because once you put L2 regularization, remember, on the weights of your network, that's gonna drive all the parameters towards zero.", 'start': 301.988, 'duration': 5.686}, {'end': 314.881, 'text': "And in maybe your standard convolutional architectures, driving towards zero maybe doesn't make sense, but in the context of a residual network,", 'start': 308.254, 'duration': 6.627}], 'summary': 'Residual block with zero weights computes identity, aiding in learning and l2 regularization interpretation.', 'duration': 31.03, 'max_score': 283.851, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ283851.jpg'}, {'end': 375.205, 'src': 'embed', 'start': 350.746, 'weight': 1, 'content': [{'end': 357.711, 'text': 'So then, when you look at, when you imagine stacking many of these residual blocks up on top of each other and our network ends up with hundreds of,', 'start': 350.746, 'duration': 6.965}, {'end': 359.253, 'text': 'potentially hundreds of layers,', 'start': 357.711, 'duration': 1.542}, {'end': 365.137, 'text': 'then these residual connections give a sort of 
gradient superhighway for gradients to flow backward through the entire network.', 'start': 359.253, 'duration': 5.884}, {'end': 369.18, 'text': 'And this allows it to train much easier and much faster.', 'start': 365.457, 'duration': 3.723}, {'end': 375.205, 'text': 'And actually allows these things to converge reasonably well, even when the model is potentially hundreds of layers deep.', 'start': 369.62, 'duration': 5.585}], 'summary': 'Residual connections enable faster training in deep networks with potentially hundreds of layers.', 'duration': 24.459, 'max_score': 350.746, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ350746.jpg'}, {'end': 420.574, 'src': 'embed', 'start': 391.129, 'weight': 4, 'content': [{'end': 397.73, 'text': 'So then we kind of also saw a couple other more exotic, more recent CNN architectures last time, including DenseNet and FractalNet.', 'start': 391.129, 'duration': 6.601}, {'end': 402.571, 'text': 'And once you think about these architectures in terms of gradient flow, they make a little bit more sense.', 'start': 398.17, 'duration': 4.401}, {'end': 408.632, 'text': 'These things like DenseNet and FractalNet are adding these additional shortcut or identity connections inside the model.', 'start': 403.251, 'duration': 5.381}, {'end': 412.013, 'text': 'And if you think about what happens in the backward pass for these models,', 'start': 408.992, 'duration': 3.021}, {'end': 420.574, 'text': 'these additional funny topologies are basically providing direct paths for gradients to flow from the loss at the end of the network more easily into all the different layers of the network.', 'start': 412.013, 'duration': 8.561}], 'summary': 'New cnn architectures like densenet and fractalnet use shortcut connections to improve gradient flow.', 'duration': 29.445, 'max_score': 391.129, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ391129.jpg'}], 'start': 75.84, 'title': 'CNN architectures and gradient flow in CNNs', 'summary': 'Covers the midterm arrangements, a training game for hyperparameter tuning, and the challenges of training deep CNN models before batch normalization. It also reviews breakthroughs in CNN architectures such as AlexNet, VGG, and GoogLeNet. Additionally, it discusses the advantages of residual networks, including how they ease gradient flow when training deep models, and how the same gradient-flow view explains other CNN architectures such as DenseNet and FractalNet.', 'chapters': [{'end': 247.505, 'start': 75.84, 'title': 'Midterm announcement and CNN architectures', 'summary': 'Discusses the upcoming midterm arrangements, introduces a training game for practicing hyperparameter tuning skills, and delves into the challenges of training deep CNN models before the invention of batch normalization, highlighting the breakthroughs in CNN architectures such as AlexNet, VGG, and GoogLeNet.', 'duration': 171.665, 'highlights': ['The chapter mentions the midterm will be held in multiple lecture halls and introduces a training game for interactive neural network training and hyperparameter tuning, offering extra credit for participation.', 'It explains the challenges of training deep CNN models before the invention of batch normalization, highlighting the difficulties faced by models like VGG and GoogLeNet in converging without hackery.', 'It delves into the breakthroughs in CNN architectures, including the success of the nine-layer AlexNet in the 2012 ImageNet classification challenge, followed by the deeper VGG and GoogLeNet models in the 2014 challenge.']}, {'end': 422.336, 'start': 247.945, 'title': 'Residual networks and gradient flow in CNNs', 'summary': 'Discusses the advantages of residual networks, including the ability to learn not to use unnecessary layers, the interpretation of L2 regularization, and the 
facilitation of gradient flow for training deep models, as well as the impact of gradient flow on other cnn architectures like densenet and fractalnet.', 'duration': 174.391, 'highlights': ["Residual networks have the ability to learn not to use unnecessary layers by setting the weights in the residual block to zero, driving the residual blocks towards the identity where they're not needed for classification. This property of residual networks allows the model to learn not to use layers that it doesn't need by setting the weights in the residual block to zero, which drives the residual blocks towards the identity where they're not needed for classification.", 'Residual networks facilitate gradient flow in the backward pass, creating a gradient superhighway for gradients to flow backward through the entire network, enabling easier and faster training of deep models. The residual connections in the network create a gradient superhighway for gradients to flow backward through the entire network, allowing for easier and faster training, even with potentially hundreds of layers.', 'The impact of gradient flow is crucial in machine learning and is prevalent in recurrent networks, with other CNN architectures such as DenseNet and FractalNet also benefiting from additional shortcut or identity connections to facilitate gradient flow in the backward pass. 
The idea of managing gradient flow is super important in machine learning and is prevalent in recurrent networks, with other CNN architectures like DenseNet and FractalNet adding additional shortcut or identity connections to facilitate gradient flow in the backward pass.']}], 'duration': 346.496, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ75840.jpg', 'highlights': ['The midterm will be held in multiple lecture halls and introduces a training game for interactive neural network training and hyperparameter tuning, offering extra credit for participation.', 'Residual networks facilitate gradient flow in the backward pass, creating a gradient superhighway for gradients to flow backward through the entire network, enabling easier and faster training of deep models.', 'The chapter delves into the breakthroughs in CNN architectures, including the success of the nine-layer AlexNet in the 2012 ImageNet classification challenge, followed by the deeper VGG and GoogleNet models in the 2014 challenge.', "Residual networks have the ability to learn not to use unnecessary layers by setting the weights in the residual block to zero, driving the residual blocks towards the identity where they're not needed for classification.", 'The impact of gradient flow is crucial in machine learning and is prevalent in recurrent networks, with other CNN architectures such as DenseNet and FractalNet also benefiting from additional shortcut or identity connections to facilitate gradient flow in the backward pass.', 'It explains the challenges of training deep CNN models before the invention of batch normalization, highlighting the difficulties faced by models like VGG and GoogleNet in converging without hackery.']}, {'end': 933.65, 'segs': [{'end': 513.34, 'src': 'embed', 'start': 482.123, 'weight': 0, 'content': [{'end': 485.847, 'text': 'And if you multiply that out, you see that that single layer has 38 million parameters.', 
'start': 482.123, 'duration': 3.724}, {'end': 491.634, 'text': 'So more than half of the parameters of the entire AlexNet model are just sitting in that last fully connected layer.', 'start': 486.408, 'duration': 5.226}, {'end': 498.502, 'text': 'And if you add up all the parameters in just the fully connected layers of AlexNet, including these other final fully connected layers,', 'start': 492.155, 'duration': 6.347}, {'end': 503.688, 'text': 'you see something like 59 of the 62 million parameters in AlexNet are sitting in these fully connected layers.', 'start': 498.502, 'duration': 5.186}, {'end': 510.096, 'text': 'So then, when we move to other architectures like GoogleNet and ResNets, they do away with a lot of these large,', 'start': 504.249, 'duration': 5.847}, {'end': 513.34, 'text': 'fully connected layers in favor of global average pooling at the end of the network.', 'start': 510.096, 'duration': 3.244}], 'summary': 'Alexnet has 38 million parameters in the last layer, comprising 59 out of 62 million parameters in the model; googlenet and resnets use global average pooling instead of fully connected layers.', 'duration': 31.217, 'max_score': 482.123, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ482123.jpg'}, {'end': 584.2, 'src': 'embed', 'start': 559.468, 'weight': 2, 'content': [{'end': 566.031, 'text': 'But in some context in machine learning, we want to have more flexibility in the types of data that our models can process.', 'start': 559.468, 'duration': 6.563}, {'end': 569.692, 'text': 'So, once we move to this idea of recurrent neural networks,', 'start': 566.591, 'duration': 3.101}, {'end': 574.694, 'text': 'we have a lot more opportunities to play around with the types of input and output data that our networks can handle.', 'start': 569.692, 'duration': 5.002}, {'end': 584.2, 'text': 'So once we have recurrent neural networks, we can do what we call these one-to-many models, where 
maybe our input is some object of fixed size,', 'start': 575.294, 'duration': 8.906}], 'summary': 'Recurrent neural networks offer flexibility in processing various types of input and output data.', 'duration': 24.732, 'max_score': 559.468, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ559468.jpg'}, {'end': 693.325, 'src': 'embed', 'start': 665.442, 'weight': 3, 'content': [{'end': 671.465, 'text': 'So in the context of videos, that might be making some classification decision along every frame of the video.', 'start': 665.442, 'duration': 6.023}, {'end': 682.032, 'text': 'And recurrent neural networks are this kind of general paradigm for handling variably sized sequence data that allow us to pretty naturally capture all of these different types of setups in our models.', 'start': 671.986, 'duration': 10.046}, {'end': 687.396, 'text': 'So recurrent neural networks are actually important.', 'start': 684.372, 'duration': 3.024}, {'end': 693.325, 'text': 'even for some problems that have a fixed size input and a fixed size output, recurrent neural networks can still be pretty useful.', 'start': 687.396, 'duration': 5.929}], 'summary': 'Recurrent neural networks are crucial for handling variably sized sequence data in videos, allowing for capturing different types of setups in models.', 'duration': 27.883, 'max_score': 665.442, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ665442.jpg'}, {'end': 769.82, 'src': 'embed', 'start': 743.965, 'weight': 4, 'content': [{'end': 749.97, 'text': 'where now we want the model to synthesize brand new images that look kind of like the images it saw in training,', 'start': 743.965, 'duration': 6.005}, {'end': 756.716, 'text': 'and we can use a recurrent neural network architecture to actually paint these output images sort of one piece at a time in the output.', 'start': 749.97, 'duration': 6.746}, {'end': 761.003, 
'text': 'So you can see that, even though our output is this fixed-sized image,', 'start': 757.276, 'duration': 3.727}, {'end': 766.493, 'text': 'we can have these models that are working over time to compute parts of the output, one at a time, sequentially.', 'start': 761.003, 'duration': 5.49}, {'end': 769.82, 'text': 'And we can use recurrent neural networks for that type of setup as well.', 'start': 766.513, 'duration': 3.307}], 'summary': 'Use recurrent neural network to synthesize new images resembling training data.', 'duration': 25.855, 'max_score': 743.965, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ743965.jpg'}, {'end': 925.804, 'src': 'heatmap', 'start': 779.485, 'weight': 0.794, 'content': [{'end': 791.439, 'text': 'a recurrent neural network has this little recurrent core cell and it will take some input x feed that input into the RNN and that RNN has some internal hidden state.', 'start': 779.485, 'duration': 11.954}, {'end': 797.205, 'text': 'and that internal hidden state will be updated every time that the RNN reads a new input.', 'start': 792.34, 'duration': 4.865}, {'end': 802.51, 'text': 'And that internal hidden state will be then fed back to the model the next time it reads an input.', 'start': 798.006, 'duration': 4.504}, {'end': 805.392, 'text': 'And frequently.', 'start': 804.291, 'duration': 1.101}, {'end': 808.835, 'text': 'we will want our RNNs to also produce some output at every time step.', 'start': 805.392, 'duration': 3.443}, {'end': 813.5, 'text': "so we'll have this pattern where it will read an input, update its hidden state and then produce an output.", 'start': 808.835, 'duration': 4.665}, {'end': 820.424, 'text': "So then, the question is what is the functional form of this recurrence relation that we're computing?", 'start': 816.002, 'duration': 4.422}, {'end': 826.847, 'text': "So, inside this little green RNN block, we're computing some recurrence relation with a 
function f.", 'start': 820.984, 'duration': 5.863}, {'end': 829.909, 'text': 'So this function f will depend on some weights w.', 'start': 826.847, 'duration': 3.062}, {'end': 835.651, 'text': 'It will accept the previous hidden state, ht minus one, as well as the input at the current state, xt,', 'start': 829.909, 'duration': 5.742}, {'end': 840.294, 'text': 'and this will output the next hidden state or the updated hidden state that we call ht.', 'start': 835.651, 'duration': 4.643}, {'end': 850.181, 'text': 'And now, then, as we read the next input, this new hidden state ht will then just be passed into the same function as we read the next input,', 'start': 841.734, 'duration': 8.447}, {'end': 850.801, 'text': 'xt plus one.', 'start': 850.181, 'duration': 0.62}, {'end': 855.625, 'text': 'And now, if we wanted to produce some output at every time step of this network,', 'start': 851.522, 'duration': 4.103}, {'end': 864.712, 'text': 'we might attach some additional fully connected layers that read in this ht at every time step and make that decision based on the hidden state at every time step.', 'start': 855.625, 'duration': 9.087}, {'end': 874.408, 'text': 'And one thing to note is that we use this same function fw and these same weights w at every time step of the computation.', 'start': 867.585, 'duration': 6.823}, {'end': 882.911, 'text': 'So then kind of the simplest functional form that you can imagine is what we call this vanilla recurrent neural network.', 'start': 877.109, 'duration': 5.802}, {'end': 890.294, 'text': "So here we have this same functional form from the previous slide, where we're taking in our previous hidden state and our current input,", 'start': 883.431, 'duration': 6.863}, {'end': 891.775, 'text': 'and we need to produce the next hidden state.', 'start': 890.294, 'duration': 1.481}, {'end': 900.583, 'text': 'And the kind of simplest thing you might imagine is that we have some weight matrix WXH that we multiply against the input 
XT,', 'start': 892.455, 'duration': 8.128}, {'end': 905.167, 'text': 'as well as another weight matrix WHH that we multiply against the previous hidden state.', 'start': 900.583, 'duration': 4.584}, {'end': 910.852, 'text': 'So we make these two multiplications against our two states, add them together and squash them through a tanh,', 'start': 905.567, 'duration': 5.285}, {'end': 912.774, 'text': 'so we get some kind of non-linearity in the system.', 'start': 910.852, 'duration': 1.922}, {'end': 917.497, 'text': 'You might be wondering why we use a tanh here and not some other type of nonlinearity,', 'start': 913.855, 'duration': 3.642}, {'end': 920.26, 'text': "after all that we've said negative about tanhs in previous lectures.", 'start': 917.497, 'duration': 2.763}, {'end': 925.804, 'text': "And I think we'll return a little bit to that later on when we talk about more advanced architectures like the LSTM.", 'start': 920.88, 'duration': 4.924}], 'summary': 'Recurrent neural networks update hidden states using a function f with weights w, producing output at each time step.', 'duration': 146.319, 'max_score': 779.485, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ779485.jpg'}, {'end': 835.651, 'src': 'embed', 'start': 805.392, 'weight': 5, 'content': [{'end': 808.835, 'text': 'we will want our RNNs to also produce some output at every time step.', 'start': 805.392, 'duration': 3.443}, {'end': 813.5, 'text': "so we'll have this pattern where it will read an input, update its hidden state and then produce an output.", 'start': 808.835, 'duration': 4.665}, {'end': 820.424, 'text': "So then, the question is what is the functional form of this recurrence relation that we're computing?", 'start': 816.002, 'duration': 4.422}, {'end': 826.847, 'text': "So, inside this little green RNN block, we're computing some recurrence relation with a function f.", 'start': 820.984, 'duration': 5.863}, {'end': 829.909, 
'text': 'So this function f will depend on some weights w.', 'start': 826.847, 'duration': 3.062}, {'end': 835.651, 'text': 'It will accept the previous hidden state, ht minus one, as well as the input at the current state, xt,', 'start': 829.909, 'duration': 5.742}], 'summary': 'Rnns produce output at every time step based on input and hidden state.', 'duration': 30.259, 'max_score': 805.392, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ805392.jpg'}, {'end': 905.167, 'src': 'embed', 'start': 877.109, 'weight': 6, 'content': [{'end': 882.911, 'text': 'So then kind of the simplest functional form that you can imagine is what we call this vanilla recurrent neural network.', 'start': 877.109, 'duration': 5.802}, {'end': 890.294, 'text': "So here we have this same functional form from the previous slide, where we're taking in our previous hidden state and our current input,", 'start': 883.431, 'duration': 6.863}, {'end': 891.775, 'text': 'and we need to produce the next hidden state.', 'start': 890.294, 'duration': 1.481}, {'end': 900.583, 'text': 'And the kind of simplest thing you might imagine is that we have some weight matrix WXH that we multiply against the input XT,', 'start': 892.455, 'duration': 8.128}, {'end': 905.167, 'text': 'as well as another weight matrix WHH that we multiply against the previous hidden state.', 'start': 900.583, 'duration': 4.584}], 'summary': 'Introduction to vanilla recurrent neural network with basic functional form.', 'duration': 28.058, 'max_score': 877.109, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ877109.jpg'}], 'start': 422.336, 'title': 'Cnn architectures, rnns & applications', 'summary': "Covers managing gradient flow in cnn architectures, fully connected layers' impact on parameter count, and the flexibility of rnns in handling variable length input and output data. 
it also discusses rnns' importance in handling variably sized sequence data, including videos and images, and their application in generating new images, along with the functional form of the recurrence relation and a vanilla rnn.", 'chapters': [{'end': 646.37, 'start': 422.336, 'title': 'Cnn architectures & recurrent neural networks', 'summary': 'Covers the significance of managing gradient flow in cnn architectures, the impact of fully connected layers on parameter count, and the flexibility of recurrent neural networks in handling variable length input and output data.', 'duration': 224.034, 'highlights': ['The fully connected layers in VGG and AlexNet contribute to a large number of parameters, with the last fully connected layer in AlexNet alone containing more than half of the total parameters (38 million out of 62 million).', 'GoogleNet and ResNets utilize global average pooling to reduce the parameter count by eliminating large, fully connected layers, leading to more efficient architectures.', 'Recurrent neural networks offer flexibility in processing variable length input and output data, enabling the development of one-to-many, many-to-one, and variable-length input-output models for tasks like caption generation, sentiment analysis, video classification, and machine translation.']}, {'end': 933.65, 'start': 646.83, 'title': 'Recurrent neural networks', 'summary': 'Discusses the importance of recurrent neural networks in handling variably sized sequence data, including videos and images, and their application in generating new images. 
it also explains the functional form of the recurrence relation in rnns and the simplest functional form of a vanilla recurrent neural network.', 'duration': 286.82, 'highlights': ['Recurrent neural networks are important for handling variably sized sequence data in videos and images Recurrent neural networks are a general paradigm for handling variably sized sequence data in videos and images, allowing for classification decisions along every frame of a video and sequential processing of fixed size inputs like images.', 'Application of recurrent neural networks in generating new images Recurrent neural networks can be used to synthesize brand new images that look similar to the training images, painting these output images one piece at a time sequentially.', 'Explanation of the functional form of the recurrence relation in RNNs The recurrence relation in RNNs is computed using a function f with weights w that depend on the previous hidden state and the current input, outputting the next hidden state.', 'Description of the simplest functional form of a vanilla recurrent neural network The simplest functional form involves taking the previous hidden state and the current input, multiplying them with weight matrices, adding the results, and applying a tanh nonlinearity.']}], 'duration': 511.314, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ422336.jpg', 'highlights': ['GoogleNet and ResNets utilize global average pooling to reduce parameter count, leading to more efficient architectures.', 'The last fully connected layer in AlexNet alone contains more than half of the total parameters (38 million out of 62 million).', 'Recurrent neural networks offer flexibility in processing variable length input and output data, enabling the development of one-to-many, many-to-one, and variable-length input-output models for various tasks.', 'Recurrent neural networks are important for handling variably sized sequence data 
in videos and images, allowing for classification decisions along every frame of a video and sequential processing of fixed size inputs like images.', 'Recurrent neural networks can be used to synthesize brand new images that look similar to the training images, painting these output images one piece at a time sequentially.', 'The recurrence relation in RNNs is computed using a function f with weights w that depends on the previous hidden state and the current input, outputting the next hidden state.', 'The simplest functional form of a vanilla recurrent neural network involves taking the previous hidden state and the current input, multiplying them with weight matrices, adding the results, and applying a tanh nonlinearity.']}, {'end': 1893.153, 'segs': [{'end': 975.541, 'src': 'embed', 'start': 951.687, 'weight': 1, 'content': [{'end': 959.111, 'text': 'One is this concept of having a hidden state that feeds back into itself recurrently, but I find that picture a little bit confusing.', 'start': 951.687, 'duration': 7.424}, {'end': 966.034, 'text': 'And sometimes I find it clearer to think about unrolling this computational graph for multiple time steps.', 'start': 959.851, 'duration': 6.183}, {'end': 971.337, 'text': 'And this makes the data flow of the hidden states and the inputs and the outputs and the weights maybe a little bit more clear.', 'start': 966.434, 'duration': 4.903}, {'end': 975.541, 'text': "So then at the first time step, we'll have some initial hidden state h zero.", 'start': 972.017, 'duration': 3.524}], 'summary': 'Recurrent neural networks use hidden states, unrolling computational graph for clarity.', 'duration': 23.854, 'max_score': 951.687, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ951687.jpg'}, {'end': 1196.568, 'src': 'heatmap', 'start': 1002.91, 'weight': 0.711, 'content': [{'end': 1009.495, 'text': 'And this process will repeat over and over again as we consume all of the input 
xts in our sequence of inputs.', 'start': 1002.91, 'duration': 6.585}, {'end': 1017.464, 'text': 'And now one thing to note is that we can actually make this even more explicit and write the W matrix in our computational graph.', 'start': 1010.762, 'duration': 6.702}, {'end': 1022.965, 'text': "And here you can see that we're reusing the same W matrix at every time step of the computation.", 'start': 1017.964, 'duration': 5.001}, {'end': 1030.906, 'text': "So now every time we have this little FW block, it's receiving a unique H and a unique X, but all of these blocks are taking the same W.", 'start': 1023.445, 'duration': 7.461}, {'end': 1040.372, 'text': 'And if you remember, we talked about how gradient flows in backpropagation when you reuse the same node multiple times in a computational graph.', 'start': 1031.627, 'duration': 8.745}, {'end': 1048.057, 'text': "Then remember during the backwards pass, you end up summing the gradients into the W matrix when you're computing dLoss, dW.", 'start': 1040.792, 'duration': 7.265}, {'end': 1056.662, 'text': "So if you kind of think about the backpropagation for this model, then you'll have a separate gradient for W flowing from each of those timesteps,", 'start': 1048.677, 'duration': 7.985}, {'end': 1061.205, 'text': 'and then the final gradient for W will be the sum of all of those individual per-timestep gradients.', 'start': 1056.662, 'duration': 4.543}, {'end': 1067.737, 'text': 'We can also write this yt explicitly in this computational graph.', 'start': 1063.354, 'duration': 4.383}, {'end': 1074.682, 'text': 'so then this output ht at every time step might feed into some other little neural network that can produce a yt,', 'start': 1067.737, 'duration': 6.945}, {'end': 1077.864, 'text': 'which might be some class scores or something like that, at every time step.', 'start': 1074.682, 'duration': 3.182}, {'end': 1080.661, 'text': 'We can also make the loss more explicit.', 'start': 1078.919, 'duration': 1.742}, 
{'end': 1083.483, 'text': 'So, in many cases you might imagine producing,', 'start': 1081.061, 'duration': 2.422}, {'end': 1089.428, 'text': "you might imagine that you have some ground truth label at every time step of your sequence and then you'll compute some loss,", 'start': 1083.483, 'duration': 5.945}, {'end': 1093.572, 'text': "some individual loss at every time step of these output yt's.", 'start': 1089.428, 'duration': 4.144}, {'end': 1095.113, 'text': 'And this loss might.', 'start': 1094.052, 'duration': 1.061}, {'end': 1101.699, 'text': 'it will frequently be something like a softmax loss in the case where you have maybe a ground truth label at every time step of the sequence.', 'start': 1095.113, 'duration': 6.586}, {'end': 1107.632, 'text': 'And now the final loss for this entire training step will be the sum of these individual losses.', 'start': 1102.625, 'duration': 5.007}, {'end': 1113.539, 'text': 'So now we had a scalar loss at every time step, and we just summed them up to get our final scalar loss at the top of the network.', 'start': 1108.152, 'duration': 5.387}, {'end': 1118.666, 'text': 'And now, if you think about again back propagation through this thing, in order to train the model,', 'start': 1114.04, 'duration': 4.626}, {'end': 1122.151, 'text': 'we need to compute the gradient of the loss with respect to w.', 'start': 1118.666, 'duration': 3.485}, {'end': 1126.158, 'text': "So we'll have loss flowing from that final loss into each of these time steps,", 'start': 1122.151, 'duration': 4.007}, {'end': 1130.044, 'text': 'and then each of those time steps will compute a local gradient on the weights w,', 'start': 1126.158, 'duration': 3.886}, {'end': 1133.19, 'text': 'which will all then be summed to give us our final gradient for the weights w.', 'start': 1130.044, 'duration': 3.146}, {'end': 1141.35, 'text': 'Now, if we have sort of this many to one situation where maybe we want to do something like sentiment analysis,', 'start': 
1135.526, 'duration': 5.824}, {'end': 1145.853, 'text': 'then we would typically make that decision based on the final hidden state of this network,', 'start': 1141.35, 'duration': 4.503}, {'end': 1150.096, 'text': 'because this final hidden state kind of summarizes all of the context from the entire sequence.', 'start': 1145.853, 'duration': 4.243}, {'end': 1159.583, 'text': 'Also, if we have kind of a one to many situation where we want to receive a fixed size input and then produce a variably sized output,', 'start': 1152.34, 'duration': 7.243}, {'end': 1169.606, 'text': "then you'll commonly use that fixed size input to initialize somehow the initial hidden state of the model and now the recurrent network will tick for each cell in the output.", 'start': 1159.583, 'duration': 10.023}, {'end': 1175.528, 'text': "And now as you produce your variably sized output, you'll unroll the graph for each element in the output.", 'start': 1170.006, 'duration': 5.522}, {'end': 1184.498, 'text': 'So, then, when we talk about these sequence-to-sequence models, where you might do something like machine translation,', 'start': 1178.453, 'duration': 6.045}, {'end': 1192.004, 'text': 'where you take a variably sized input and a variably sized output, you can think of this as a combination of the many-to-one plus a one-to-many.', 'start': 1184.498, 'duration': 7.506}, {'end': 1196.568, 'text': "So we'll kind of proceed in two stages, what we call an encoder and a decoder.", 'start': 1192.565, 'duration': 4.003}], 'summary': 'Recurrent neural network explained with backpropagation, loss computation, and applications like sentiment analysis and sequence-to-sequence models.', 'duration': 193.658, 'max_score': 1002.91, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ1002910.jpg'}, {'end': 1083.483, 'src': 'embed', 'start': 1017.964, 'weight': 0, 'content': [{'end': 1022.965, 'text': "And here you can see that we're reusing the 
same W matrix at every time step of the computation.", 'start': 1017.964, 'duration': 5.001}, {'end': 1030.906, 'text': "So now every time we have this little FW block, it's receiving a unique H and a unique X, but all of these blocks are taking the same W.", 'start': 1023.445, 'duration': 7.461}, {'end': 1040.372, 'text': 'And if you remember, we talked about how gradient flows in backpropagation when you reuse the same node multiple times in a computational graph.', 'start': 1031.627, 'duration': 8.745}, {'end': 1048.057, 'text': "Then remember during the backwards pass, you end up summing the gradients into the W matrix when you're computing dLoss, dW.", 'start': 1040.792, 'duration': 7.265}, {'end': 1056.662, 'text': "So if you kind of think about the backpropagation for this model, then you'll have a separate gradient for W flowing from each of those timesteps,", 'start': 1048.677, 'duration': 7.985}, {'end': 1061.205, 'text': 'and then the final gradient for W will be the sum of all of those individual per-timestep gradients.', 'start': 1056.662, 'duration': 4.543}, {'end': 1067.737, 'text': 'We can also write this yt explicitly in this computational graph.', 'start': 1063.354, 'duration': 4.383}, {'end': 1074.682, 'text': 'so then this output ht at every time step might feed into some other little neural network that can produce a yt,', 'start': 1067.737, 'duration': 6.945}, {'end': 1077.864, 'text': 'which might be some class scores or something like that, at every time step.', 'start': 1074.682, 'duration': 3.182}, {'end': 1080.661, 'text': 'We can also make the loss more explicit.', 'start': 1078.919, 'duration': 1.742}, {'end': 1083.483, 'text': 'So, in many cases you might imagine producing,', 'start': 1081.061, 'duration': 2.422}], 'summary': 'Using the same w matrix at each time step in backpropagation leads to summing gradients into the w matrix.', 'duration': 65.519, 'max_score': 1017.964, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ1017964.jpg'}, {'end': 1223.413, 'src': 'embed', 'start': 1196.988, 'weight': 4, 'content': [{'end': 1202.453, 'text': 'So here the encoder will receive the variably sized input, which might be your sentence in English,', 'start': 1196.988, 'duration': 5.465}, {'end': 1207.057, 'text': 'and then summarize that entire sentence using the final hidden state of the encoder network.', 'start': 1202.453, 'duration': 4.604}, {'end': 1215.811, 'text': "and now we're in this many to one situation where we've summarized this entire variably sized input in this single vector,", 'start': 1208.109, 'duration': 7.702}, {'end': 1223.413, 'text': 'and now we have a second decoder network, which is a one to many situation, which will input that single vector, summarizing the input sentence,', 'start': 1215.811, 'duration': 7.602}], 'summary': 'Encoder summarizes input sentence to single vector for decoder network.', 'duration': 26.425, 'max_score': 1196.988, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ1196988.jpg'}, {'end': 1272.551, 'src': 'embed', 'start': 1251.163, 'weight': 5, 'content': [{'end': 1260.525, 'text': 'So in the language modeling problem, we want to read some sequence of, we want to have our network sort of understand how to produce natural language.', 'start': 1251.163, 'duration': 9.362}, {'end': 1266.387, 'text': 'So this might happen at the character level, where our model will produce characters one at a time.', 'start': 1261.045, 'duration': 5.342}, {'end': 1270.19, 'text': 'This might also happen at the word level where our model will produce words one at a time.', 'start': 1266.708, 'duration': 3.482}, {'end': 1272.551, 'text': 'But in a very simple example,', 'start': 1270.71, 'duration': 1.841}], 'summary': 'Language modeling problem: model produces characters or words, one at a time.', 'duration': 
21.388, 'max_score': 1251.163, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ1251163.jpg'}, {'end': 1494.544, 'src': 'heatmap', 'start': 1443.572, 'weight': 0.717, 'content': [{'end': 1454.777, 'text': 'one thing that we might want to do with it is to sample from the model and actually use this trained neural network model to synthesize new text that kind of looks similar in spirit to the text that it was trained on.', 'start': 1443.572, 'duration': 11.205}, {'end': 1459.719, 'text': "So the way that this will work is we'll typically seed the model with some input prefix of text.", 'start': 1455.357, 'duration': 4.362}, {'end': 1463.241, 'text': 'In this case, the prefix is just this single letter h.', 'start': 1460.12, 'duration': 3.121}, {'end': 1467.363, 'text': "And now we'll feed that letter h through the first time step of our recurrent neural network.", 'start': 1463.241, 'duration': 4.122}, {'end': 1472.866, 'text': 'It will produce this distribution of scores over all the characters in the vocabulary.', 'start': 1467.683, 'duration': 5.183}, {'end': 1477.048, 'text': "But now, at test time, we'll use these scores to actually sample from it.", 'start': 1473.326, 'duration': 3.722}, {'end': 1487.358, 'text': "So we'll use a softmax function to convert those scores into a probability distribution and then we will sample from that probability distribution to actually synthesize the second letter in the sequence.", 'start': 1477.368, 'duration': 9.99}, {'end': 1494.544, 'text': 'And in this case, even though the scores were pretty bad, maybe we got lucky and sampled the letter E from this probability distribution.', 'start': 1488.198, 'duration': 6.346}], 'summary': 'Using a trained neural network model to synthesize text by sampling from a probability distribution.', 'duration': 50.972, 'max_score': 1443.572, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ1443572.jpg'}, {'end': 1616.644, 'src': 'embed', 'start': 1573.923, 'weight': 6, 'content': [{'end': 1580.065, 'text': "then you'll see that sometimes these trained models actually are able to produce multiple different types of reasonable output sequences,", 'start': 1573.923, 'duration': 6.142}, {'end': 1583.145, 'text': 'depending on which samples they take at the first time steps.', 'start': 1580.065, 'duration': 3.08}, {'end': 1587.887, 'text': "So it's actually kind of a benefit because we can get now more diversity in our outputs.", 'start': 1583.726, 'duration': 4.161}, {'end': 1606.561, 'text': 'Another question? Could we feed in the softmax vector instead of the one element vector? You mean at test time? Yeah,', 'start': 1589.127, 'duration': 17.434}, {'end': 1610.903, 'text': 'so the question is at test time, could we feed in this whole softmax vector rather than a one-hot vector?', 'start': 1606.561, 'duration': 4.342}, {'end': 1612.843, 'text': "So there's kind of two problems with that.", 'start': 1611.283, 'duration': 1.56}, {'end': 1616.644, 'text': "One is that that's very different from the data that it saw at training time.", 'start': 1613.403, 'duration': 3.241}], 'summary': 'Trained models produce diverse outputs based on initial samples. 
using whole softmax vector at test time differs from training data.', 'duration': 42.721, 'max_score': 1573.923, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ1573923.jpg'}, {'end': 1745.463, 'src': 'embed', 'start': 1714.207, 'weight': 8, 'content': [{'end': 1719.368, 'text': 'So in practice, what people do is this sort of approximation called truncated back propagation through time.', 'start': 1714.207, 'duration': 5.161}, {'end': 1727.231, 'text': "So here the idea is that, even though our input sequence is very, very long and even potentially infinite, what we'll do is that,", 'start': 1719.969, 'duration': 7.262}, {'end': 1728.871, 'text': "when we're training the model,", 'start': 1727.231, 'duration': 1.64}, {'end': 1738.957, 'text': "we'll step forward for some number of steps (maybe 100 is kind of a ballpark number that people frequently use), so we'll step forward for maybe 100 steps,", 'start': 1728.871, 'duration': 10.086}, {'end': 1745.463, 'text': 'compute a loss only over this sub-sequence of the data and then back-propagate through this sub-sequence and now make a gradient step.', 'start': 1738.957, 'duration': 6.506}], 'summary': 'In truncated backpropagation through time, models train on sub-sequences to handle long input sequences effectively.', 'duration': 31.256, 'max_score': 1714.207, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ1714207.jpg'}, {'end': 1882.334, 'src': 'embed', 'start': 1859.364, 'weight': 9, 'content': [{'end': 1867.026, 'text': 'So Andrej has this example of what he calls min char rnn that does all of this stuff in just like 112 lines of Python.', 'start': 1859.364, 'duration': 7.662}, {'end': 1869.127, 'text': 'So it handles building the vocabulary,', 'start': 1867.406, 'duration': 1.721}, {'end': 1876.089, 'text': 'it trains the model with truncated back propagation through time and then it can actually 
sample from that model in actually not too much code.', 'start': 1869.127, 'duration': 6.962}, {'end': 1880.411, 'text': "So even though this sounds like kind of a big scary process, it's actually not too difficult.", 'start': 1876.45, 'duration': 3.961}, {'end': 1882.334, 'text': "So I'd encourage you, if you're confused,", 'start': 1880.871, 'duration': 1.463}], 'summary': "Andrej's min char rnn example handles building vocab, training model with truncated backpropagation through time, and sampling in just 112 lines of python.", 'duration': 22.97, 'max_score': 1859.364, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ1859364.jpg'}], 'start': 933.65, 'title': 'Recurrent neural networks and rnn training', 'summary': 'Covers the concept of hidden states in recurrent neural networks, unrolling the computational graph, training models using backpropagation, and addressing challenges such as feeding softmax vectors, training long sequences, and using truncated backpropagation through time. 
it also includes a specific example of min char rnn implemented in 112 lines of python.', 'chapters': [{'end': 1141.35, 'start': 933.65, 'title': 'Recurrent neural networks', 'summary': 'Explains the concept of hidden states in recurrent neural networks, unrolling computational graph for multiple time steps, reusing weight matrix, backpropagation for model training, explicit representation of outputs and loss computation, and gradient computation for the weights in backpropagation.', 'duration': 207.7, 'highlights': ['The concept of hidden states in recurrent neural networks and unrolling computational graph for multiple time steps Explains the concept of having a hidden state that feeds back at itself recurrently and unrolling the computational graph for multiple time steps.', 'Reusing the same weight matrix at every time step of the computation Describes the reuse of the same weight matrix at every time step of the computation, impacting gradient flows in backpropagation and the computation of the final gradient for the weights.', 'Explicit representation of outputs and loss computation in the computational graph Explains the explicit representation of outputs and loss computation in the computational graph, including the representation of outputs and the computation of individual and final loss for the entire training step.', 'Backpropagation for model training and gradient computation for the weights Describes the backpropagation process for model training, including the flow of loss and local gradient computation on the weights, leading to the final gradient computation for the weights.']}, {'end': 1587.887, 'start': 1141.35, 'title': 'Recurrent neural networks', 'summary': 'Discusses the use of recurrent neural networks for sequence-to-sequence models, language modeling, and the process of training and sampling from the model, emphasizing the use of hidden states and various examples of input and output sequences.', 'duration': 446.537, 'highlights': 
['Recurrent neural networks are used in sequence-to-sequence models, where the encoder summarizes the input sequence using the final hidden state, and the decoder produces the variable-sized output. The encoder summarizes the input sequence using the final hidden state, transitioning from a variably sized input to a single vector, and the decoder produces the variable-sized output, such as in machine translation.', 'The language modeling problem involves training recurrent neural networks to predict the next character in a sequence, using softmax loss to quantify prediction errors. Training recurrent neural networks for language modeling involves predicting the next character in a sequence, using softmax loss to quantify prediction errors, and repeated training over multiple sequences to learn context-based predictions.', "Sampling from the model's probability distribution during synthesis allows for diverse and multiple reasonable output sequences, providing an advantage over taking the argmax probability. Sampling from the model's probability distribution during synthesis allows for diverse and multiple reasonable output sequences, providing an advantage over taking the argmax probability, and contributes to more diversity in the outputs."]}, {'end': 1893.153, 'start': 1589.127, 'title': 'Rnn training and backpropagation', 'summary': 'Discusses the challenges of feeding softmax vectors at test time, the issues with training long sequences, and the use of truncated backpropagation through time for training rnn models, with a specific example of min char rnn implemented in 112 lines of python.', 'duration': 304.026, 'highlights': ['The challenges of feeding softmax vectors at test time and the issues with training long sequences are discussed. 
The chapter explains that feeding softmax vectors at test time can lead to computational inefficiency, especially for large vocabularies, and training with very long sequences can be slow and memory-intensive.', 'The concept of truncated backpropagation through time is introduced as a solution for training RNN models with long sequences. Truncated backpropagation through time is presented as a way to approximate gradients without making a backward pass through potentially very large sequences of data, providing an efficient solution for training RNN models.', 'An example of min char rnn is mentioned, which implements truncated backpropagation through time and can sample from the model with relatively little code. An example of min char rnn is highlighted, demonstrating the implementation of truncated backpropagation through time and the ability to sample from the model with minimal code, making the process more accessible and manageable.']}], 'duration': 959.503, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ933650.jpg', 'highlights': ['Reusing the same weight matrix at every time step of the computation impacts gradient flows in backpropagation', 'The concept of hidden states in recurrent neural networks and unrolling computational graph for multiple time steps', 'Explicit representation of outputs and loss computation in the computational graph', 'Backpropagation for model training and gradient computation for the weights', 'The encoder summarizes the input sequence using the final hidden state, transitioning from a variably sized input to a single vector', 'Training recurrent neural networks for language modeling involves predicting the next character in a sequence', "Sampling from the model's probability distribution during synthesis allows for diverse and multiple reasonable output sequences", 'The challenges of feeding softmax vectors at test time and the issues with training long sequences', 'The 
concept of truncated backpropagation through time is introduced as a solution for training RNN models with long sequences', 'An example of min char rnn is highlighted, demonstrating the implementation of truncated backpropagation through time']}, {'end': 2548.154, 'segs': [{'end': 1938.215, 'src': 'embed', 'start': 1913.479, 'weight': 0, 'content': [{'end': 1921.781, 'text': "we took this entire text of all of Shakespeare's works and then used that to train a recurrent neural network language model on all of Shakespeare.", 'start': 1913.479, 'duration': 8.302}, {'end': 1927.046, 'text': "And you can see that at the beginning of training it's kind of producing maybe random gibberish garbage,", 'start': 1922.523, 'duration': 4.523}, {'end': 1931.39, 'text': 'but throughout the course of training it ends up producing things that seem relatively reasonable.', 'start': 1927.046, 'duration': 4.344}, {'end': 1938.215, 'text': 'And after this model has been trained pretty well, then it produces text that seems kind of Shakespeare-esque to me.', 'start': 1932.15, 'duration': 6.065}], 'summary': "A recurrent neural network trained on all of shakespeare's works produces text that becomes more shakespeare-esque as it's trained.", 'duration': 24.736, 'max_score': 1913.479, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ1913479.jpg'}, {'end': 2021.033, 'src': 'embed', 'start': 1994.925, 'weight': 1, 'content': [{'end': 1998.966, 'text': "So it's just a whole bunch of LaTeX files that are this like super dense mathematics.", 'start': 1994.925, 'duration': 4.041}, {'end': 2006.007, 'text': 'And LaTeX, because LaTeX is sort of this, lets you write equations and diagrams and everything just using plain text.', 'start': 1999.586, 'duration': 6.421}, {'end': 2013.049, 'text': 'So we can actually train our recurrent neural network language model on the raw LaTeX source code of this algebraic topology textbook.', 'start': 
2006.507, 'duration': 6.542}, {'end': 2021.033, 'text': 'And if we do that, then after we sample from the model, then we get something that seems kind of like algebraic topology.', 'start': 2013.689, 'duration': 7.344}], 'summary': 'Training recurrent neural network on latex files to generate algebraic topology content.', 'duration': 26.108, 'max_score': 1994.925, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ1994925.jpg'}, {'end': 2106.258, 'src': 'embed', 'start': 2059.65, 'weight': 2, 'content': [{'end': 2064.13, 'text': "So it kind of got the general gist of how to make those diagrams, but they actually don't make any sense.", 'start': 2059.65, 'duration': 4.48}, {'end': 2071.417, 'text': 'And actually, one of my favorite examples here is that it sometimes omits proofs.', 'start': 2066.252, 'duration': 5.165}, {'end': 2078.141, 'text': "So it'll sometimes say something like theorem, blah, blah, blah, blah, blah, proof omitted.", 'start': 2072.558, 'duration': 5.583}, {'end': 2083.726, 'text': 'So this thing kind of has gotten the gist of what some of these math textbooks look like.', 'start': 2078.523, 'duration': 5.203}, {'end': 2089.011, 'text': 'We can have a lot of fun with this.', 'start': 2087.911, 'duration': 1.1}, {'end': 2092.893, 'text': 'So we also tried training one of these models on the entire source code of the Linux kernel.', 'start': 2089.032, 'duration': 3.861}, {'end': 2096.853, 'text': "Because again, that's just this character level stuff that we can train on.", 'start': 2093.672, 'duration': 3.181}, {'end': 2101.096, 'text': 'And then when we sample this, it actually, again, looks like C source code.', 'start': 2097.395, 'duration': 3.701}, {'end': 2103.717, 'text': 'So it knows how to write if statements.', 'start': 2101.536, 'duration': 2.181}, {'end': 2106.258, 'text': 'It has pretty good code formatting skills.', 'start': 2103.837, 'duration': 2.421}], 'summary': 'Model trained 
on linux kernel source code can write c source code with if statements and good formatting skills.', 'duration': 46.608, 'max_score': 2059.65, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ2059650.jpg'}, {'end': 2175.181, 'src': 'embed', 'start': 2148.707, 'weight': 4, 'content': [{'end': 2153.449, 'text': 'Where again, during training, all we asked this model to do was try to predict the next character in the sequence.', 'start': 2148.707, 'duration': 4.742}, {'end': 2158.993, 'text': "We didn't tell it any of this structure, but somehow, just through the course of this training process,", 'start': 2154.089, 'duration': 4.904}, {'end': 2162.455, 'text': 'it learned a lot about the latent structure in the sequential data.', 'start': 2158.993, 'duration': 3.462}, {'end': 2168.199, 'text': 'Yeah, so it knows how to write code, does a lot of cool stuff.', 'start': 2165.797, 'duration': 2.402}, {'end': 2175.181, 'text': 'So I had this paper with Andrej a couple years ago, where we trained a bunch of these models,', 'start': 2170.137, 'duration': 5.044}], 'summary': 'A model was trained to predict next character; learned latent structure in sequential data; capable of writing code.', 'duration': 26.474, 'max_score': 2148.707, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ2148707.jpg'}, {'end': 2398.829, 'src': 'embed', 'start': 2367.436, 'weight': 5, 'content': [{'end': 2372.18, 'text': 'So the idea here is that the caption is this variable-length sequence that might have.', 'start': 2367.436, 'duration': 4.744}, {'end': 2375.222, 'text': 'this sequence might have different numbers of words for different captions.', 'start': 2372.18, 'duration': 3.042}, {'end': 2378.885, 'text': 'So this is a totally natural fit for a recurrent neural network language model.', 'start': 2375.642, 'duration': 3.243}, {'end': 2387.546, 'text': 'So, then, what this model 
looks like is we have some convolutional network which will take as input the image,', 'start': 2379.864, 'duration': 7.682}, {'end': 2390.147, 'text': "and we've seen a lot about how convolutional networks work at this point.", 'start': 2387.546, 'duration': 2.601}, {'end': 2398.829, 'text': 'And now that convolutional network will produce a summary vector of the image which will then feed into the first time step of one of these recurrent neural network language models,', 'start': 2390.647, 'duration': 8.182}], 'summary': 'Using recurrent neural network language model for variable-length captions in images.', 'duration': 31.393, 'max_score': 2367.436, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ2367436.jpg'}], 'start': 1895.714, 'title': 'Recurrent neural network language model', 'summary': "Explores training recurrent neural network language models on text data like shakespeare's works and algebraic topology textbook, showcasing the model's ability to generate realistic text. it also discusses training a character-level model on the linux kernel source code, highlighting its capabilities in writing c source code and recognizing gnu license, and explaining a recurrent neural network language model for image captioning.", 'chapters': [{'end': 2083.726, 'start': 1895.714, 'title': 'Recurrent neural network language model', 'summary': "Discusses training a recurrent neural network language model on text data, such as shakespeare's works and algebraic topology textbook, to generate new text that resembles the original, showcasing how the model learns to produce realistic text over the course of training.", 'duration': 188.012, 'highlights': ["The model learns to produce realistic text resembling Shakespeare's works after being trained on the entire collection, showcasing its ability to capture the style and structure of the data. 
The model trained on all of Shakespeare's works initially produces random gibberish but eventually generates text that seems relatively reasonable, and even Shakespeare-esque, demonstrating its capacity to capture the style and structure of the original data.", 'The model is capable of generating text resembling algebraic topology after being trained on the raw LaTeX source code of the textbook, demonstrating its ability to capture the mathematical language and structure of the data. After being trained on the raw LaTeX source code of an algebraic topology textbook, the model is able to generate text that resembles algebraic topology, including equations, proofs, diagrams, and references to previous lemmas, showcasing its ability to capture the mathematical language and structure of the original data.', 'The model attempts to create commutative diagrams and sometimes omits proofs, demonstrating its understanding of the general structure of the mathematical content. The model attempts to create commutative diagrams and occasionally omits proofs, showcasing its understanding of the general structure of mathematical content, despite sometimes producing nonsensical diagrams and omitting proofs, which are typical features of the original data.']}, {'end': 2548.154, 'start': 2087.911, 'title': 'Recurrent neural network language model', 'summary': 'Discusses training a character-level model on the linux kernel source code, highlighting its ability to write c source code, recognize gnu license, and learn latent structure in sequential data, and then transitions to explaining a recurrent neural network language model for image captioning.', 'duration': 460.243, 'highlights': ['The character-level model trained on the Linux kernel source code can write C source code, recognize GNU license, and learn latent structure in sequential data. 
The model demonstrates the ability to write C source code, recognize the GNU license, and learn the latent structure in the sequential data.', 'The model can also identify elements such as quotes and count the number of characters since a line break, which aids in understanding the latent structure of the data. The model can identify elements like quotes and count the number of characters since a line break, aiding in understanding the latent structure of the data.', 'The discussion transitions to a recurrent neural network language model for image captioning, which involves using a convolutional network to process the image and a recurrent neural network for generating the caption. The transition is made to a recurrent neural network language model for image captioning, involving a convolutional network to process the image and a recurrent neural network for generating the caption.']}], 'duration': 652.44, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ1895714.jpg', 'highlights': ["The model trained on all of Shakespeare's works initially produces random gibberish but eventually generates text that seems relatively reasonable, and even Shakespeare-esque, demonstrating its capacity to capture the style and structure of the original data.", 'The model trained on the raw LaTeX source code of an algebraic topology textbook is able to generate text that resembles algebraic topology, including equations, proofs, diagrams, and references to previous lemmas, showcasing its ability to capture the mathematical language and structure of the original data.', 'The model attempts to create commutative diagrams and occasionally omits proofs, showcasing its understanding of the general structure of mathematical content, despite sometimes producing nonsensical diagrams and omitting proofs, which are typical features of the original data.', 'The character-level model trained on the Linux kernel source code can write C source 
code, recognize GNU license, and learn latent structure in sequential data.', 'The model can identify elements like quotes and count the number of characters since a line break, aiding in understanding the latent structure of the data.', 'The discussion transitions to a recurrent neural network language model for image captioning, involving a convolutional network to process the image and a recurrent neural network for generating the caption.']}, {'end': 3332.161, 'segs': [{'end': 2608.274, 'src': 'embed', 'start': 2548.574, 'weight': 0, 'content': [{'end': 2566.662, 'text': 'But you can just train this model in a purely supervised way and then back propagate through to jointly train both this recurrent neural network language model and then also pass gradients back into this final layer of the CNN and additionally update the weights of the CNN to jointly tune all parts of the model to perform this task.', 'start': 2548.574, 'duration': 18.088}, {'end': 2571.146, 'text': 'So once you train these models, they actually do some pretty reasonable things.', 'start': 2567.805, 'duration': 3.341}, {'end': 2576.247, 'text': 'So these are some real results from a model, from one of these trained models.', 'start': 2571.706, 'duration': 4.541}, {'end': 2581.808, 'text': 'And it says things like a cat sitting on a suitcase on the floor, which is pretty impressive.', 'start': 2576.627, 'duration': 5.181}, {'end': 2585.749, 'text': 'It knows about cats sitting on a tree branch, which is also pretty cool.', 'start': 2582.268, 'duration': 3.481}, {'end': 2589.03, 'text': 'It knows about two people walking on the beach with surfboards.', 'start': 2586.449, 'duration': 2.581}, {'end': 2595.471, 'text': 'So these models are actually pretty powerful and can produce relatively complex captions to describe the image.', 'start': 2589.85, 'duration': 5.621}, {'end': 2599.585, 'text': 'But that being said, these models are really not perfect.', 'start': 2596.562, 'duration': 3.023}, 
{'end': 2600.466, 'text': "They're not magical.", 'start': 2599.665, 'duration': 0.801}, {'end': 2608.274, 'text': "Just like any machine learning model, if you try to run them on data that was very different from the training data, they don't work very well.", 'start': 2601.207, 'duration': 7.067}], 'summary': 'Model can produce complex image captions, but not perfect on diverse data', 'duration': 59.7, 'max_score': 2548.574, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ2548574.jpg'}, {'end': 2698.845, 'src': 'embed', 'start': 2674.409, 'weight': 4, 'content': [{'end': 2680.433, 'text': "when we're generating the words of this caption, we can allow the model to steer its attention to different parts of the image.", 'start': 2674.409, 'duration': 6.024}, {'end': 2688.858, 'text': "And I don't want to spend too much time on this, but the general way that this works is that now our convolutional network,", 'start': 2681.373, 'duration': 7.485}, {'end': 2692.18, 'text': 'rather than producing a single vector summarizing the entire image,', 'start': 2688.858, 'duration': 3.322}, {'end': 2698.845, 'text': 'now it produces some grid of vectors that give maybe one vector for each spatial location in the image.', 'start': 2692.18, 'duration': 6.665}], 'summary': 'The model can steer attention to different image areas by generating a grid of vectors from the convolutional network.', 'duration': 24.436, 'max_score': 2674.409, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ2674409.jpg'}, {'end': 2770.164, 'src': 'heatmap', 'start': 2674.409, 'weight': 1, 'content': [{'end': 2680.433, 'text': "when we're generating the words of this caption, we can allow the model to steer its attention to different parts of the image.", 'start': 2674.409, 'duration': 6.024}, {'end': 2688.858, 'text': "And I don't want to spend too much time on this, but the general way that 
this works is that now our convolutional network,", 'start': 2681.373, 'duration': 7.485}, {'end': 2692.18, 'text': 'rather than producing a single vector summarizing the entire image,', 'start': 2688.858, 'duration': 3.322}, {'end': 2698.845, 'text': 'now it produces some grid of vectors that give maybe one vector for each spatial location in the image.', 'start': 2692.18, 'duration': 6.665}, {'end': 2706.59, 'text': 'And now, when this model runs forward, in addition to sampling the vocabulary at every time step,', 'start': 2699.565, 'duration': 7.025}, {'end': 2710.552, 'text': 'it also produces a distribution over the locations in the image where it wants to look.', 'start': 2706.59, 'duration': 3.962}, {'end': 2717.517, 'text': 'And now this distribution over image locations can be seen as a kind of attention of where the model should look during training.', 'start': 2711.113, 'duration': 6.404}, {'end': 2723.081, 'text': 'So now that first hidden state computes this distribution over image locations,', 'start': 2718.157, 'duration': 4.924}, {'end': 2730.727, 'text': 'which then goes back to this set of vectors to give a single summary vector that maybe focuses the attention on one part of that image.', 'start': 2723.081, 'duration': 7.646}, {'end': 2735.831, 'text': 'And now that summary vector gets fed as an additional input at the next time step of the neural network.', 'start': 2731.247, 'duration': 4.584}, {'end': 2737.993, 'text': 'And now again, it will produce two outputs.', 'start': 2736.372, 'duration': 1.621}, {'end': 2742.657, 'text': 'One is our distribution over vocabulary words, and the other is a distribution over image locations.', 'start': 2738.273, 'duration': 4.384}, {'end': 2748.238, 'text': 'and this whole process will continue, and it will sort of do these two different things at every time step.', 'start': 2743.697, 'duration': 4.541}, {'end': 2751.259, 'text': 'And after you train the model,', 'start': 2750.099, 'duration': 1.16}, 
{'end': 2757.701, 'text': 'then you can see that it kind of will shift its attention around the image for every word that it generates in the caption.', 'start': 2751.259, 'duration': 6.442}, {'end': 2761.202, 'text': 'So here you can see that it produced the caption.', 'start': 2758.201, 'duration': 3.001}, {'end': 2763.262, 'text': 'a bird is flying over.', 'start': 2761.202, 'duration': 2.06}, {'end': 2764.623, 'text': "I can't see that far,", 'start': 2763.262, 'duration': 1.361}, {'end': 2770.164, 'text': 'but you can see that its attention is shifting around different parts of the image for each word in the caption that it generates.', 'start': 2764.623, 'duration': 5.541}], 'summary': 'Model uses attention to focus on image locations while generating captions.', 'duration': 95.755, 'max_score': 2674.409, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ2674409.jpg'}, {'end': 2838.369, 'src': 'embed', 'start': 2811.417, 'weight': 3, 'content': [{'end': 2817.384, 'text': 'So now, when you look at after you train one of these attention models and then run it on to generate captions,', 'start': 2811.417, 'duration': 5.967}, {'end': 2823.432, 'text': 'you can see that it tends to focus its attention on maybe the salient or semantically meaningful part of the image when generating captions.', 'start': 2817.384, 'duration': 6.048}, {'end': 2830.326, 'text': 'So you can see that the caption was a woman is throwing a frisbee in a park and you can see that this attention mask.', 'start': 2824.144, 'duration': 6.182}, {'end': 2838.369, 'text': 'when the model generated the word frisbee at the same time, it was focusing its attention on this image region that actually contains the frisbee.', 'start': 2830.326, 'duration': 8.043}], 'summary': 'Attention models focus on salient image parts when generating captions.', 'duration': 26.952, 'max_score': 2811.417, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ2811417.jpg'}, {'end': 3327.455, 'src': 'embed', 'start': 3299.003, 'weight': 5, 'content': [{'end': 3304.575, 'text': 'where now our gradients will shrink and shrink and shrink exponentially as we backpropagate and pick up more and more factors of this weight matrix.', 'start': 3299.003, 'duration': 5.572}, {'end': 3308.128, 'text': "That's called the vanishing gradient problem.", 'start': 3306.667, 'duration': 1.461}, {'end': 3314.19, 'text': "So there's a bit of a hack that people sometimes do to fix the exploding gradient problem called gradient clipping,", 'start': 3308.768, 'duration': 5.422}, {'end': 3322.333, 'text': 'which is just this simple heuristic saying that after we compute our gradient, if that gradient, if its L2 norm is above some threshold,', 'start': 3314.19, 'duration': 8.143}, {'end': 3327.455, 'text': 'then just clamp it down and divide, just clamp it down, so it has this maximum threshold.', 'start': 3322.333, 'duration': 5.122}], 'summary': 'Gradients that shrink exponentially during backpropagation cause the vanishing gradient problem; gradient clipping is a heuristic for the exploding gradient problem.', 'duration': 28.452, 'max_score': 3299.003, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ3299003.jpg'}], 'start': 2548.574, 'title': 'Neural network and attention models for image captioning', 'summary': 'Delves into the joint training of a recurrent neural network language model and a cnn for image captioning, highlighting its impressive capabilities and limitations, and also explores attention models, addressing challenges and solutions in the process.', 'chapters': [{'end': 2657.214, 'start': 2548.574, 'title': 'Neural network image captioning', 'summary': "Discusses the joint training of a recurrent neural network language model and a cnn to produce complex image captions, highlighting the model's impressive 
capabilities but also its limitations when faced with new data.", 'duration': 108.64, 'highlights': ["The trained models can produce relatively complex captions such as 'a cat sitting on a suitcase on the floor' and 'two people walking on the beach with surfboards.'", 'The models can struggle when faced with data very different from the training data, leading to errors like mistaking a fur coat for a cat, missing a woman doing a handstand on a beach, and describing a spider in a web as a bird sitting on a tree branch.', "The joint training of the recurrent neural network language model and the CNN allows for the backpropagation of gradients to update the weights of the CNN, resulting in the model's ability to perform image captioning tasks."]}, {'end': 3332.161, 'start': 2657.655, 'title': 'Attention models for image captioning', 'summary': 'Explains the concept of attention models in image captioning, detailing how the model steers its attention to different parts of the image, and discusses the challenges of hard attention and the solutions for the exploding gradient problem.', 'duration': 674.506, 'highlights': ['The chapter introduces the concept of attention models in image captioning, explaining how the model steers its attention to different parts of the image during the generation of captions. The attention model allows the neural network to produce a distribution over the locations in the image where it wants to look, enabling it to focus its attention on different parts of the image for each word in the caption.', 'The chapter discusses the challenges of hard attention, noting that selecting exactly one image location at each time step is not a differentiable function, requiring a more complex training approach. 
The hard attention case poses challenges as it is not a differentiable function, necessitating a more advanced training approach than vanilla backpropagation, which will be discussed in a later lecture on reinforcement learning.', 'The chapter explains the issue of the exploding gradient problem in recurrent neural networks, and introduces gradient clipping as a common solution to limit the L2 norm of the gradient to prevent it from becoming excessively large. To address the exploding gradient problem, the chapter introduces the concept of gradient clipping, which involves clamping down the gradient to a maximum threshold when its L2 norm exceeds a certain value, serving as a practical solution used in training recurrent neural networks.']}], 'duration': 783.587, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ2548574.jpg', 'highlights': ["The joint training of the recurrent neural network language model and the CNN allows for the backpropagation of gradients to update the weights of the CNN, resulting in the model's ability to perform image captioning tasks.", "The trained models can produce relatively complex captions such as 'a cat sitting on a suitcase on the floor' and 'two people walking on the beach with surfboards.'", 'The models can struggle when faced with data very different from the training data, leading to errors like mistaking a fur coat for a cat, missing a woman doing a handstand on a beach, and describing a spider in a web as a bird sitting on a tree branch.', 'The chapter introduces the concept of attention models in image captioning, explaining how the model steers its attention to different parts of the image during the generation of captions.', 'The attention model allows the neural network to produce a distribution over the locations in the image where it wants to look, enabling it to focus its attention on different parts of the image for each word in the caption.', 
'The chapter explains the issue of the exploding gradient problem in recurrent neural networks, and introduces gradient clipping as a common solution to limit the L2 norm of the gradient to prevent it from becoming excessively large.']}, {'end': 3842.879, 'segs': [{'end': 3362.077, 'src': 'embed', 'start': 3332.782, 'weight': 0, 'content': [{'end': 3337.829, 'text': "And it's a relatively useful tool for attacking this exploding gradient problem.", 'start': 3332.782, 'duration': 5.047}, {'end': 3345.393, 'text': 'But now for the vanishing gradient problem, what we typically do is we might need to move to a more complicated RNN architecture.', 'start': 3338.831, 'duration': 6.562}, {'end': 3348.513, 'text': 'So that motivates this idea of an LSTM.', 'start': 3346.233, 'duration': 2.28}, {'end': 3353.575, 'text': 'An LSTM is a slightly which stands for long short-term memory.', 'start': 3349.614, 'duration': 3.961}, {'end': 3362.077, 'text': 'is this slightly fancier recurrence relation for these recurrent neural networks that is really designed to help alleviate this problem of vanishing and exploding gradients?', 'start': 3353.575, 'duration': 8.502}], 'summary': 'Lstm is a useful tool for addressing vanishing and exploding gradient problems in rnns.', 'duration': 29.295, 'max_score': 3332.782, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ3332782.jpg'}, {'end': 3423.273, 'src': 'embed', 'start': 3396.975, 'weight': 1, 'content': [{'end': 3402.079, 'text': 'And one thing about, so remember when we had this vanilla recurrent neural network, it had this hidden state.', 'start': 3396.975, 'duration': 5.104}, {'end': 3405.801, 'text': 'And we used this recurrence relation to update the hidden state at every time step.', 'start': 3402.459, 'duration': 3.342}, {'end': 3411.045, 'text': 'Well now, in an LSTM, we actually have two, we maintain two hidden states at every time step.', 'start': 3406.262, 'duration': 
4.783}, {'end': 3418.75, 'text': 'One is this HT, which is called the hidden state, which is kind of an analogy to the hidden state that we had in the vanilla RNN.', 'start': 3411.645, 'duration': 7.105}, {'end': 3423.273, 'text': 'But an LSTM also maintains the second vector, CT, called the cell state.', 'start': 3419.19, 'duration': 4.083}], 'summary': 'Lstm uses two hidden states (ht and ct) at every time step.', 'duration': 26.298, 'max_score': 3396.975, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ3396975.jpg'}, {'end': 3477.077, 'src': 'heatmap', 'start': 3423.833, 'weight': 0.762, 'content': [{'end': 3431.796, 'text': 'and the cell state is this vector which is kind of internal, kept inside the LSTM, and it does not really get exposed to the outside world.', 'start': 3423.833, 'duration': 7.963}, {'end': 3441.02, 'text': 'And you can kind of see that through this update equation, where you can see that first we compute these, we take our two inputs,', 'start': 3432.156, 'duration': 8.864}, {'end': 3445.421, 'text': 'we use them to compute these four gates called I, F, O and G.', 'start': 3441.02, 'duration': 4.401}, {'end': 3452.764, 'text': 'We use those gates to update our cell state CT, and then we expose part of our cell state as the hidden state at the next time step.', 'start': 3445.421, 'duration': 7.343}, {'end': 3460.506, 'text': 'So this is kind of a funny functional form and I wanna walk through for a couple slides.', 'start': 3454.918, 'duration': 5.588}, {'end': 3466.834, 'text': 'exactly why do we use this architecture and why does it make sense, especially in the context of vanishing or exploding gradients?', 'start': 3460.506, 'duration': 6.328}, {'end': 3477.077, 'text': "So, then, this first thing that we do in an LSTM is that we're given this previous hidden state HT and we're given our current input vector XT,", 'start': 3467.594, 'duration': 9.483}], 'summary': 'Lstm uses gates 
to update internal cell state and expose part as hidden state for next time step, addressing vanishing or exploding gradients.', 'duration': 53.244, 'max_score': 3423.833, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ3423833.jpg'}, {'end': 3560.708, 'src': 'embed', 'start': 3534.477, 'weight': 2, 'content': [{'end': 3538.839, 'text': 'F is the forget gate, how much do we want to forget the cell memory from the previous time step.', 'start': 3534.477, 'duration': 4.362}, {'end': 3544.122, 'text': 'O is the output gate, which is how much do we want to reveal our cell to the outside world.', 'start': 3540.18, 'duration': 3.942}, {'end': 3548.363, 'text': "And G doesn't really have a nice gate, a nice name, so I usually call it the gate gate.", 'start': 3544.582, 'duration': 3.781}, {'end': 3553.865, 'text': 'And G tells us how much do we want to write into our input cell.', 'start': 3550.084, 'duration': 3.781}, {'end': 3560.708, 'text': 'And then you notice that each of these four gates are using a different non-linearity.', 'start': 3554.846, 'duration': 5.862}], 'summary': 'Lstm uses forget, output, and input gates with different nonlinearities.', 'duration': 26.231, 'max_score': 3534.477, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ3534477.jpg'}], 'start': 3332.782, 'title': 'Using lstm and rnn for gradient problems', 'summary': 'Discusses lstm as a solution for vanishing gradient problem, emphasizing its design for better gradient flow properties, also touches upon the use of more complicated rnn architectures. 
it explains lstm architecture, maintaining two hidden states and using four gates (input, forget, output, and gate) to update the cell state, making it suitable for addressing vanishing or exploding gradients.', 'chapters': [{'end': 3374.606, 'start': 3332.782, 'title': 'Rnn and lstm for gradient problems', 'summary': 'Discusses the use of lstm as a solution for the vanishing gradient problem in recurrent neural networks, emphasizing its design for better gradient flow properties. it also touches upon the use of more complicated rnn architectures for addressing the issue.', 'duration': 41.824, 'highlights': ['LSTM is designed to alleviate the vanishing and exploding gradient problem in recurrent neural networks by improving gradient flow properties, similar to sophisticated CNN architectures.', 'Complex RNN architectures are considered as a solution for addressing the vanishing gradient problem.', 'The exploding gradient problem can be addressed with gradient clipping, a relatively useful heuristic.']}, {'end': 3842.879, 'start': 3375.607, 'title': 'Understanding lstm architecture', 'summary': 'Explains the architecture of lstm cells, which maintain two hidden states and use four gates (input, forget, output, and gate) to update the cell state, making it suitable for addressing vanishing or exploding gradients.', 'duration': 467.272, 'highlights': ['LSTM cells maintain two hidden states and use four gates to update the cell state, making it suitable for addressing vanishing or exploding gradients. LSTMs maintain two hidden states (HT and CT) and use four gates (I, F, O, G) to update the cell state, addressing vanishing or exploding gradients.', 'The four gates in LSTM (I, F, O, G) are used to compute how much to input, forget, reveal, or write into the cell state, with different non-linearities and value ranges. 
The four gates (I, F, O, G) in LSTM compute how much to input, forget, reveal, or write into the cell state, using different non-linearities and value ranges.', 'LSTM cells use a weight matrix to compute the four gates, with each gate using a different part of the weight matrix. LSTM cells use a weight matrix to compute the four gates, with each gate using a different part of the weight matrix, contributing to the overall functionality of the LSTM architecture.']}], 'duration': 510.097, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ3332782.jpg', 'highlights': ['LSTM is designed to alleviate the vanishing and exploding gradient problem in recurrent neural networks by improving gradient flow properties, similar to sophisticated CNN architectures.', 'LSTM cells maintain two hidden states and use four gates to update the cell state, making it suitable for addressing vanishing or exploding gradients.', 'The four gates in LSTM (I, F, O, G) are used to compute how much to input, forget, reveal, or write into the cell state, with different non-linearities and value ranges.']}, {'end': 4386.065, 'segs': [{'end': 3954.95, 'src': 'embed', 'start': 3918.599, 'weight': 1, 'content': [{'end': 3924.401, 'text': 'So element-wise multiplication is going to be a little bit nicer than full matrix multiplication.', 'start': 3918.599, 'duration': 5.802}, {'end': 3931.124, 'text': 'Second is that that element-wise multiplication will potentially be multiplying by a different forget gate at every time step.', 'start': 3925.001, 'duration': 6.123}, {'end': 3936.686, 'text': 'So remember, in the vanilla RNN we were continually multiplying by that same weight matrix over and over again,', 'start': 3931.664, 'duration': 5.022}, {'end': 3939.907, 'text': 'which led very explicitly to these exploding or vanishing gradients.', 'start': 3936.686, 'duration': 3.221}, {'end': 3944.828, 'text': 'But now in the LSTM case, this forget gate 
can vary from each time step.', 'start': 3940.387, 'duration': 4.441}, {'end': 3950.989, 'text': "So now it's much easier for the model to avoid these problems of exploding and vanishing gradients.", 'start': 3945.188, 'duration': 5.801}, {'end': 3954.95, 'text': 'Finally, because this forget gate is coming out from a sigmoid.', 'start': 3951.929, 'duration': 3.021}], 'summary': 'Element-wise multiplication in lstm avoids exploding or vanishing gradients.', 'duration': 36.351, 'max_score': 3918.599, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ3918599.jpg'}, {'end': 4197.215, 'src': 'embed', 'start': 4174.814, 'weight': 0, 'content': [{'end': 4184.956, 'text': 'where these additive and element-wise multiplicative interactions of the cell state kind of give a similar gradient superhighway for gradients to flow backwards through the cell state in an LSTM.', 'start': 4174.814, 'duration': 10.142}, {'end': 4189.93, 'text': "And, by the way, there's this other kind of nice paper called highway networks,", 'start': 4186.328, 'duration': 3.602}, {'end': 4197.215, 'text': 'which is kind of in between this idea of this LSTM cell and these residual networks.', 'start': 4189.93, 'duration': 7.285}], 'summary': 'Lstm and highway networks create gradient superhighway for backflow in cell state.', 'duration': 22.401, 'max_score': 4174.814, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ4174814.jpg'}, {'end': 4259.937, 'src': 'embed', 'start': 4234.551, 'weight': 3, 'content': [{'end': 4239.632, 'text': 'Probably the most common, apart from the LSTM, is this GRU called the Gated Recurrent Unit.', 'start': 4234.551, 'duration': 5.081}, {'end': 4242.612, 'text': 'And you can see those update equations here.', 'start': 4240.732, 'duration': 1.88}, {'end': 4251.455, 'text': 'And it kind of has this similar flavor of the LSTM where it uses these multiplicative 
element-wise gates together with these additive interactions,', 'start': 4242.973, 'duration': 8.482}, {'end': 4253.155, 'text': 'to avoid this vanishing gradient problem.', 'start': 4251.455, 'duration': 1.7}, {'end': 4259.937, 'text': "There's also this cool paper called LSTM, A Search Space Odyssey, where very inventive title,", 'start': 4254.435, 'duration': 5.502}], 'summary': 'Gru is a common recurrent unit, similar to lstm, with multiplicative element-wise gates to avoid vanishing gradient problem.', 'duration': 25.386, 'max_score': 4234.551, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ4234551.jpg'}, {'end': 4386.065, 'src': 'embed', 'start': 4360.016, 'weight': 4, 'content': [{'end': 4365.721, 'text': 'They sometimes are susceptible to vanishing or exploding gradients, but we can address that with weight clipping and with fancier architectures.', 'start': 4360.016, 'duration': 5.705}, {'end': 4370.965, 'text': "And there's a lot of cool overlap between CNN architectures and RNN architectures.", 'start': 4367.082, 'duration': 3.883}, {'end': 4376.77, 'text': "So, next time you'll be taking the midterm, but after that we'll have oh sorry, question?", 'start': 4371.986, 'duration': 4.784}, {'end': 4382.595, 'text': 'Midterm is up to this lecture, so anything up to this point is fair game.', 'start': 4379.392, 'duration': 3.203}, {'end': 4386.065, 'text': 'And see you guys, good luck on the midterm on Tuesday.', 'start': 4384.086, 'duration': 1.979}], 'summary': 'Exploring solutions for vanishing/exploding gradients, cnn-rnn overlap, and preparing for midterm.', 'duration': 26.049, 'max_score': 4360.016, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ4360016.jpg'}], 'start': 3844.447, 'title': 'Lstm backward pass and rnn architectures', 'summary': 'Delves into the favorable gradient flow in lstm backward pass, the potential for vanishing 
gradients, and the architectural similarities between CNNs and RNNs, especially focusing on LSTM and GRU variants. It also covers managing gradient flow through additive connections and multiplicative gates in RNNs.', 'chapters': [{'end': 4216.509, 'start': 3844.447, 'title': 'LSTM backward pass analysis', 'summary': 'Discusses the backward pass in an LSTM, highlighting the favorable gradient flow due to element-wise multiplication, the remaining potential for vanishing gradients, and the similarity to ResNets and highway networks.', 'duration': 372.062, 'highlights': ["The LSTM's element-wise multiplication during the backward pass results in a more favorable gradient flow compared to a vanilla RNN's full matrix multiplication, potentially avoiding exploding or vanishing gradients. Comparison between element-wise multiplication and full matrix multiplication in LSTM and vanilla RNN.", 'The variability of the forget gate at each time step in an LSTM makes it easier to avoid exploding and vanishing gradients compared to the vanilla RNN, where the same weight matrix is repeatedly multiplied. Comparison of forget gate variability in LSTM and the repetitive multiplication in vanilla RNN.', 'The initialization of biases of the forget gate to be somewhat positive at the beginning of training in an LSTM helps in maintaining a relatively clean gradient flow through the forget gates, reducing the potential for vanishing gradients. Impact of bias initialization on gradient flow in LSTM.', 'The additive and element-wise multiplicative interactions of the cell state in an LSTM create a gradient superhighway, similar to the concept in ResNets, facilitating favorable gradient flow during backward propagation. 
Comparison of gradient flow in LSTM and ResNets.']}, {'end': 4386.065, 'start': 4217.07, 'title': 'RNN architectures & their variants', 'summary': 'Explores the architectural similarities between CNNs and RNNs, highlighting the prevalence of LSTM and GRU, with a study on tweaking LSTM equations and a Google paper on evolutionary search for RNN architectures, concluding that managing gradient flow through additive connections and multiplicative gates is crucial for RNNs.', 'duration': 168.995, 'highlights': ['LSTM and GRU are the most common variants of recurrent neural network architectures, with LSTM being studied extensively in a paper that concluded its equations are robust and effective for various problems. LSTM and GRU are prevalent in RNN architectures; a study concluded that LSTM equations are robust and effective for various problems.', 'A Google paper conducted an evolutionary search over a large number of random RNN architectures, finding no significant improvement over existing GRU or LSTM cells, emphasizing the importance of managing gradient flow through additive connections and multiplicative gates. Google paper found no significant improvement over existing GRU or LSTM cells; emphasized the importance of managing gradient flow through additive connections and multiplicative gates.', 'RNNs are highlighted as powerful tools for addressing various problems, albeit susceptible to vanishing or exploding gradients, which can be mitigated through techniques like gradient clipping and advanced architectures. RNNs are powerful tools but susceptible to vanishing or exploding gradients; techniques like gradient clipping and advanced architectures can address these issues.', "The discussion emphasizes the architectural similarities and crossover between CNN and RNN architectures, showcasing the potential for RNNs to address a wide array of problems. 
Emphasized architectural similarities and crossover between CNN and RNN architectures; highlighted RNNs' potential for addressing a wide array of problems."]}], 'duration': 541.618, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6niqTuYFZLQ/pics/6niqTuYFZLQ3844447.jpg', 'highlights': ['The additive and element-wise multiplicative interactions of the cell state in an LSTM create a gradient superhighway, similar to the concept in ResNets, facilitating favorable gradient flow during backward propagation.', "The LSTM's element-wise multiplication during the backward pass results in a more favorable gradient flow compared to a vanilla RNN's full matrix multiplication, potentially avoiding exploding or vanishing gradients.", 'The variability of the forget gate at each time step in an LSTM makes it easier to avoid exploding and vanishing gradients compared to the vanilla RNN, where the same weight matrix is repeatedly multiplied.', 'LSTM and GRU are prevalent in RNN architectures; a study concluded that LSTM equations are robust and effective for various problems.', 'RNNs are powerful tools but susceptible to vanishing or exploding gradients; techniques like gradient clipping and advanced architectures can address these issues.']}], 'highlights': ['Residual networks facilitate gradient flow in the backward pass, creating a gradient superhighway for gradients to flow backward through the entire network, enabling easier and faster training of deep models.', 'Recurrent neural networks offer flexibility in processing variable length input and output data, enabling the development of one-to-many, many-to-one, and variable-length input-output models for various tasks.', "The joint training of the recurrent neural network language model and the CNN allows for the backpropagation of gradients to update the weights of the CNN, resulting in the model's ability to perform image captioning tasks.", 'LSTM is designed to alleviate the vanishing and 
exploding gradient problem in recurrent neural networks by improving gradient flow properties, similar to sophisticated CNN architectures.', 'The additive and element-wise multiplicative interactions of the cell state in an LSTM create a gradient superhighway, similar to the concept in ResNets, facilitating favorable gradient flow during backward propagation.']}
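
The gradient-flow comparison summarized above can be illustrated with a small numerical sketch. This is not code from the lecture: it assumes a scalar stand-in for the vanilla RNN's recurrent weight matrix, it follows only the direct path through the LSTM cell state (c_t = f_t * c_{t-1} + i_t * g_t) while ignoring the paths through the gates themselves, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic sigmoid; its output always lies strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def vanilla_rnn_gradient_scale(w, steps):
    """Scalar stand-in for a vanilla RNN (an assumption for illustration):
    the same weight w multiplies the upstream gradient at every time step,
    so the overall scale is w ** steps."""
    scale = 1.0
    for _ in range(steps):
        scale *= w
    return scale


def lstm_cell_gradient_scale(forget_preactivations):
    """Gradient scale along the direct cell-state path of an LSTM.
    Each step multiplies by that step's forget gate f_t = sigmoid(a_t),
    which can differ from step to step and never exceeds 1."""
    scale = 1.0
    for a in forget_preactivations:
        scale *= sigmoid(a)
    return scale


steps = 50
# Same factor at every step: shrinks toward zero or blows up.
print("vanilla, w=0.5:", vanilla_rnn_gradient_scale(0.5, steps))
print("vanilla, w=1.5:", vanilla_rnn_gradient_scale(1.5, steps))
# Positive forget-gate pre-activations (e.g. from a positive bias at
# initialization) keep each factor close to 1, so the product decays slowly.
print("lstm, a_t=3.0:", lstm_cell_gradient_scale([3.0] * steps))
```

Because every forget gate value is below 1, the cell-state path in this sketch can still shrink the gradient but cannot blow it up, which is consistent with the note above that a positive forget-gate bias keeps the flow relatively clean early in training.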