title
CS231n Winter 2016: Lecture 10: Recurrent Neural Networks, Image Captioning, LSTM

description
Stanford Winter Quarter 2016 class: CS231n: Convolutional Neural Networks for Visual Recognition. Lecture 10. Get in touch on Twitter @cs231n, or on Reddit /r/cs231n. Our course website is http://cs231n.stanford.edu/

detail
{'title': 'CS231n Winter 2016: Lecture 10: Recurrent Neural Networks, Image Captioning, LSTM', 'heatmap': [{'end': 508.665, 'start': 417.71, 'weight': 0.752}, {'end': 631.217, 'start': 586.953, 'weight': 0.913}, {'end': 1006.619, 'start': 921.666, 'weight': 0.737}, {'end': 2057.231, 'start': 1971.791, 'weight': 0.794}, {'end': 2895.761, 'start': 2809.642, 'weight': 0.883}], 'summary': 'Covers recurrent neural networks (rnns), including understanding, language modeling, character-level implementation, math generation, integration with convolutional networks, lstm models, architectural differences, and their applications such as image captioning and natural language generation, with a likelihood achievement of 2.2 for the correct next character in language modeling.', 'chapters': [{'end': 59.23, 'segs': [{'end': 59.23, 'src': 'embed', 'start': 24.861, 'weight': 0, 'content': [{'end': 26.361, 'text': "This Wednesday, you can tell that I'm really excited.", 'start': 24.861, 'duration': 1.5}, {'end': 27.442, 'text': "I don't know if you guys are excited.", 'start': 26.381, 'duration': 1.061}, {'end': 30.398, 'text': "You don't look very excited to me.", 'start': 29.278, 'duration': 1.12}, {'end': 33.84, 'text': 'Assignment three will be out due this Wednesday.', 'start': 32.119, 'duration': 1.721}, {'end': 36.601, 'text': 'Sorry, it will be out in Wednesday.', 'start': 34.2, 'duration': 2.401}, {'end': 42.083, 'text': "It's due two weeks from now on Monday, but I think since we're shifting it, I think, to Wednesday, we plan to have released it today.", 'start': 37.221, 'duration': 4.862}, {'end': 46.324, 'text': "But we're going to be shifting it to roughly Wednesday, so we'll probably defer the deadline for it by a few days.", 'start': 42.463, 'duration': 3.861}, {'end': 49.585, 'text': "And assignment two, if I'm not mistaken, was due on Friday.", 'start': 47.084, 'duration': 2.501}, {'end': 52.046, 'text': "So if you're using three late days, then you'd be handing it in today.", 'start': 49.605, 'duration': 2.441}, {'end': 54.487, 'text': 'Hopefully not too many of you are doing that.', 'start': 52.446, 'duration': 2.041}, {'end': 58.829, 'text': 'Are people done with assignment two, or how many people are done? 
OK, most of you.', 'start': 55.007, 'duration': 3.822}, {'end': 59.23, 'text': 'OK, good.', 'start': 58.929, 'duration': 0.301}], 'summary': 'Assignment three due date shifted to wednesday, assignment two mostly completed.', 'duration': 34.369, 'max_score': 24.861, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF824861.jpg'}], 'start': 0.469, 'title': 'Recurrent neural networks', 'summary': 'Provides an overview of recurrent neural networks and upcoming assignments, including the midterm on wednesday, release of assignment three, and extension of deadlines, with most students having completed assignment two.', 'chapters': [{'end': 59.23, 'start': 0.469, 'title': 'Recurrent neural networks', 'summary': 'Discusses recurrent neural networks and upcoming assignments, including the midterm on wednesday and the release of assignment three and extension of deadlines, with most students having completed assignment two.', 'duration': 58.761, 'highlights': ['The midterm is scheduled for this Wednesday.', 'Assignment three will be released on Wednesday with a deadline extension.', 'Most students have completed assignment two.']}], 'duration': 58.761, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF8469.jpg', 'highlights': ['Assignment three will be released on Wednesday with a deadline extension.', 'The midterm is scheduled for this Wednesday.', 'Most students have completed assignment two.']}, {'end': 348.313, 'segs': [{'end': 88.685, 'src': 'embed', 'start': 59.65, 'weight': 4, 'content': [{'end': 61.491, 'text': "Great OK, so we're doing well.", 'start': 59.65, 'duration': 1.841}, {'end': 64.792, 'text': "So currently in the class, we're talking about convolutional neural networks.", 'start': 62.051, 'duration': 2.741}, {'end': 68.794, 'text': 'Last class specifically, we looked at visualizing and understanding convolutional neural networks.', 'start': 65.072, 'duration': 3.722}, {'end': 75.958, 'text': 'So we looked at a whole bunch of pretty pictures and videos and we had a lot of fun trying to interpret exactly what these convolutional networks are doing,', 'start': 69.415, 'duration': 6.543}, {'end': 77.799, 'text': "what they're learning, how they're working, and so on.", 'start': 75.958, 'duration': 1.841}, {'end': 84.522, 'text': 'And so we debugged this through several ways that you maybe can recall from last lecture.', 'start': 78.619, 'duration': 5.903}, {'end': 88.685, 'text': 'Actually, over the weekend, I stumbled by some other visualizations that are new.', 'start': 85.023, 'duration': 3.662}], 'summary': 'Class discussing convolutional neural networks and visualizing their functionality.', 'duration': 29.035, 'max_score': 59.65, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF859650.jpg'}, {'end': 220.148, 'src': 'embed', 'start': 159.161, 'weight': 0, 'content': [{'end': 163.523, 'text': "So for example, in the case of image captioning, and we'll see some of it today, you're given a fixed-sized image.", 'start': 159.161, 'duration': 4.362}, {'end': 168.725, 'text': "And then through a recurrent neural network, we're going to produce a sequence of words that describe the content of that image.", 'start': 163.723, 'duration': 5.002}, {'end': 172.927, 'text': "So that's going to be a sentence that is the caption for that image.", 'start': 169.365, 'duration': 3.562}, {'end': 178.749, 'text': "In the case of sentiment 
classification in NLP, for example, we're consuming a number of words in sequence.", 'start': 173.627, 'duration': 5.122}, {'end': 182.291, 'text': "And then we're trying to classify whether the sentiment of that sentence is positive or negative.", 'start': 178.989, 'duration': 3.302}, {'end': 188.922, 'text': 'In the case of machine translation, we can have a recurrent neural network that takes a number of words in, say,', 'start': 183.396, 'duration': 5.526}, {'end': 194.588, 'text': "English and then it's asked to produce a number of words in French, for example, as a translation.", 'start': 188.922, 'duration': 5.666}, {'end': 198.592, 'text': "So we'd feed this into a recurrent neural network in what we call sequence-to-sequence kind of setup.", 'start': 195.068, 'duration': 3.524}, {'end': 204.257, 'text': 'And so this recurrent network would just perform translation on arbitrary sentences in English into French.', 'start': 199.232, 'duration': 5.025}, {'end': 207.14, 'text': 'And in the last case, for example, we have video classification,', 'start': 204.958, 'duration': 2.182}, {'end': 211.663, 'text': 'where you might want to imagine classifying every single frame of a video with some number of classes.', 'start': 207.14, 'duration': 4.523}, {'end': 217.767, 'text': "But, crucially, you don't want the prediction to be only a function of the current time step, the current frame of the video,", 'start': 212.103, 'duration': 5.664}, {'end': 220.148, 'text': 'but also all the frames that have come before it in the video.', 'start': 217.767, 'duration': 2.381}], 'summary': 'Recurrent neural networks used for image captioning, sentiment classification, machine translation, and video classification.', 'duration': 60.987, 'max_score': 159.161, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF8159161.jpg'}], 'start': 59.65, 'title': 'Convolutional and recurrent neural networks', 'summary': 'Explores understanding convolutional neural networks and introduces recurrent neural networks, emphasizing their flexibility in processing sequences for various tasks such as image captioning, sentiment classification, and machine translation.', 'chapters': [{'end': 98.19, 'start': 59.65, 'title': 'Understanding convolutional neural networks', 'summary': 'Discusses the exploration of visualizing and understanding convolutional neural networks in the class, along with the discovery of new visualizations on twitter.', 'duration': 38.54, 'highlights': ['The class is currently focusing on understanding convolutional neural networks, specifically through visualizations and interpretation of their learning and functioning.', 'The instructor stumbled upon new visualizations on Twitter, which are intriguing and lack detailed descriptions.']}, {'end': 348.313, 'start': 98.21, 'title': 'Recurrent neural networks', 'summary': 'Introduces recurrent neural networks, highlighting their flexibility in processing sequences for tasks like image captioning, sentiment classification, machine translation, video classification, and processing fixed-sized inputs or outputs sequentially.', 'duration': 250.103, 'highlights': ['Recurrent neural networks offer flexibility in processing sequences for various tasks like image captioning, sentiment classification, machine translation, and video classification. 
Flexibility in processing sequences for various tasks', 'Recurrent neural networks enable the production of a sequence of words describing the content of an image for tasks like image captioning. Production of a sequence of words describing the content of an image', 'In sentiment classification in NLP, recurrent neural networks process a sequence of words to classify the sentiment of a sentence as positive or negative. Processing a sequence of words to classify sentiment', 'For machine translation, recurrent neural networks can translate sentences from one language to another in a sequence-to-sequence setup. Translation of sentences from one language to another in a sequence-to-sequence setup', 'In video classification, recurrent neural networks consider all frames that have come before the current time step, allowing predictions to be a function of all frames up to that point. Consideration of all frames before the current time step in video classification']}], 'duration': 288.663, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF859650.jpg', 'highlights': ['Recurrent neural networks enable image captioning by producing a sequence of words (image captioning)', 'Recurrent neural networks process sequences for sentiment classification in NLP (sentiment classification)', 'Recurrent neural networks translate sentences in a sequence-to-sequence setup for machine translation (machine translation)', 'Recurrent neural networks consider all frames before the current time step in video classification (video classification)', 'Understanding convolutional neural networks through visualizations and interpretation (convolutional neural networks)']}, {'end': 790.428, 'segs': [{'end': 381.189, 'src': 'embed', 'start': 348.613, 'weight': 1, 'content': [{'end': 350.674, 'text': "They look quite real, but they're actually made up from the model.", 'start': 348.613, 'duration': 2.061}, {'end': 356.036, 'text': 'So a recurrent neural network is basically this thing here, a box in green.', 'start': 352.175, 'duration': 3.861}, {'end': 357.637, 'text': 'And it has a state.', 'start': 356.656, 'duration': 0.981}, {'end': 361.919, 'text': 'And it basically receives, through time, it receives input vectors.', 'start': 358.257, 'duration': 3.662}, {'end': 365.28, 'text': 'So at every single time step, we can feed in an input vector into the RNN.', 'start': 362.419, 'duration': 2.861}, {'end': 367.061, 'text': 'And it has some state internally.', 'start': 365.68, 'duration': 1.381}, {'end': 372.263, 'text': 'And then it can modify that state as a function of what it receives at every single time step.', 'start': 367.521, 'duration': 4.742}, {'end': 375.385, 'text': 'And so there will, of course, be weights inside the RNN.', 'start': 372.743, 'duration': 2.642}, {'end': 381.189, 'text': 'And so when we tune those weights, the RNN will have different behavior in terms of how its state evolves as it receives these inputs.', 'start': 375.425, 'duration': 5.764}], 'summary': 'A recurrent neural network receives input vectors and can modify its state based on weights, affecting its behavior.', 'duration': 32.576, 'max_score': 348.613, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF8348613.jpg'}, {'end': 424.513, 'src': 'embed', 'start': 388.294, 'weight': 0, 'content': [{'end': 390.115, 'text': 'So we can produce these vectors on top of the RNN.', 'start': 388.294, 'duration': 1.821}, {'end': 392.396, 'text': 
"So you'll see me show pictures like this.", 'start': 390.775, 'duration': 1.621}, {'end': 396.239, 'text': "But I'd just like to note that the RNN is really just the block in the middle.", 'start': 392.496, 'duration': 3.743}, {'end': 398.741, 'text': 'It has a state, and it can receive vectors over time.', 'start': 396.259, 'duration': 2.482}, {'end': 402.103, 'text': 'And then we can base some prediction on top of its state in some applications.', 'start': 398.841, 'duration': 3.262}, {'end': 413.188, 'text': "OK So concretely, the way this will look like is the RNN has some kind of a state, which here I'm denoting as a vector h.", 'start': 402.903, 'duration': 10.285}, {'end': 416.269, 'text': 'But this can be also a collection of vectors or just a more general state.', 'start': 413.188, 'duration': 3.081}, {'end': 424.513, 'text': "And we're going to base it as a function of the previous hidden state at previous iteration time t minus 1 and the current input vector xt.", 'start': 417.71, 'duration': 6.803}], 'summary': 'Rnn produces vectors, state-based predictions, and input vectors over time.', 'duration': 36.219, 'max_score': 388.294, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF8388294.jpg'}, {'end': 508.665, 'src': 'heatmap', 'start': 417.71, 'weight': 0.752, 'content': [{'end': 424.513, 'text': "And we're going to base it as a function of the previous hidden state at previous iteration time t minus 1 and the current input vector xt.", 'start': 417.71, 'duration': 6.803}, {'end': 429.117, 'text': "And this is going to be done through a function, which I'll call a recurrence function, f.", 'start': 425.473, 'duration': 3.644}, {'end': 431.56, 'text': 'And that function will have parameters w.', 'start': 429.117, 'duration': 2.443}, {'end': 435.405, 'text': "And so as we change those w's, we're going to see that the RNN will have different behaviors.", 'start': 431.56, 'duration': 3.845}, {'end': 437.848, 'text': 'And then, of course, we want some specific behavior out of the RNN.', 'start': 435.505, 'duration': 2.343}, {'end': 439.69, 'text': "So we're going to be training those weights on data.", 'start': 438.088, 'duration': 1.602}, {'end': 441.312, 'text': "So you'll see examples of that soon.", 'start': 440.25, 'duration': 1.062}, {'end': 445.736, 'text': "For now, I'd like you to note that, The same function is used at every single time step.", 'start': 441.772, 'duration': 3.964}, {'end': 450.881, 'text': 'We have a fixed function f of weights w, and we apply that single function at every single time step.', 'start': 445.957, 'duration': 4.924}, {'end': 456.847, 'text': 'And that allows us to use the recurrent neural network on sequences without having to commit to the size of the sequence,', 'start': 451.302, 'duration': 5.545}, {'end': 461.891, 'text': 'because we apply the exact same function at every single time step, no matter how long the input or output sequences are.', 'start': 456.847, 'duration': 5.044}, {'end': 466.661, 'text': 'So, in a specific case of a recurrent neural network, a vanilla recurrent neural network,', 'start': 463.177, 'duration': 3.484}, {'end': 471.606, 'text': "the simplest way you can set this up and the simplest recurrence you can use is what I'll refer to as a vanilla RNN.", 'start': 466.661, 'duration': 4.945}, {'end': 476.831, 'text': 'In this case, the state of a recurrent neural network is just a single hidden state, H.', 'start': 472.146, 'duration': 4.685}, {'end': 485.834, 
'text': 'And then we have a recurrence formula that basically tells you how you should update your hidden state h as a function of the previous hidden state and the current input xt.', 'start': 476.831, 'duration': 9.003}, {'end': 490.996, 'text': "And in particular, in the simplest case, we're going to have these weight matrices, whh and wxh.", 'start': 486.354, 'duration': 4.642}, {'end': 496.659, 'text': "And they're going to basically project both the hidden state from the previous time step and the current input.", 'start': 491.717, 'duration': 4.942}, {'end': 498.2, 'text': 'And then those are going to add.', 'start': 497.119, 'duration': 1.081}, {'end': 499.721, 'text': 'And then we squish them with a tanh.', 'start': 498.4, 'duration': 1.321}, {'end': 502.822, 'text': "And that's how we update the hidden state at time t.", 'start': 500.241, 'duration': 2.581}, {'end': 508.665, 'text': 'So this recurrence is telling you how h will change as a function of its history and also the current input at this time step.', 'start': 502.822, 'duration': 5.843}], 'summary': 'Recurrent neural networks use a fixed function at every time step, allowing for variable sequence sizes and behavior changes with different weights.', 'duration': 90.955, 'max_score': 417.71, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF8417710.jpg'}, {'end': 557.996, 'src': 'embed', 'start': 530.219, 'weight': 3, 'content': [{'end': 535.043, 'text': 'And so one of the ways in which we can use a recurrent neural network is in the case of character-level language models.', 'start': 530.219, 'duration': 4.824}, {'end': 539.986, 'text': "And this is one of my favorite ways of explaining RNNs because it's intuitive and fun to look at.", 'start': 535.683, 'duration': 4.303}, {'end': 544.068, 'text': 'So in this case, we have character-level language models using RNNs.', 'start': 540.506, 'duration': 3.562}, {'end': 548.611, 'text': 'And the way this will work is we will feed a sequence of characters into the recurrent neural network.', 'start': 544.428, 'duration': 4.183}, {'end': 553.454, 'text': "And at every single time step, we'll ask the recurrent neural network to predict the next character in the sequence.", 'start': 549.031, 'duration': 4.423}, {'end': 557.996, 'text': "So we'll predict an entire distribution for what it thinks should come next in the sequence that it has seen so far.", 'start': 553.834, 'duration': 4.162}], 'summary': 'Rnn used for character-level language models, predicting next characters in sequence.', 'duration': 27.777, 'max_score': 530.219, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF8530219.jpg'}, {'end': 631.217, 'src': 'heatmap', 'start': 586.953, 'weight': 0.913, 'content': [{'end': 590.358, 'text': "And here I'm encoding characters using what we call a one-hot representation,", 'start': 586.953, 'duration': 3.405}, {'end': 593.842, 'text': "where we just turn on the bit that corresponds to that character's order in the vocabulary.", 'start': 590.358, 'duration': 3.484}, {'end': 601.173, 'text': "Then we're going to use the recurrence formula that I've shown you, where at every single time step, suppose we start off with h as all 0.", 'start': 594.929, 'duration': 6.244}, {'end': 605.836, 'text': 'And then we apply this recurrence to compute the hidden state vector at every single time step using this fixed recurrence formula.', 'start': 601.173, 'duration': 4.663}, {'end': 
608.758, 'text': 'So suppose here we have only three numbers in the hidden state.', 'start': 606.217, 'duration': 2.541}, {'end': 613.101, 'text': "We're going to end up with a three-dimensional representation that basically, at any point in time,", 'start': 609.159, 'duration': 3.942}, {'end': 615.603, 'text': 'summarizes all the characters that have come until then.', 'start': 613.101, 'duration': 2.502}, {'end': 619.506, 'text': "And so we've applied this recurrence at every single time step.", 'start': 616.644, 'duration': 2.862}, {'end': 623.949, 'text': "And now we're going to predict at every single time step what should be the next character in the sequence.", 'start': 619.946, 'duration': 4.003}, {'end': 631.217, 'text': "So for example, since we have four characters in this vocabulary, we're going to predict four numbers at every single time step.", 'start': 625.353, 'duration': 5.864}], 'summary': 'Using one-hot representation to encode characters, applying recurrence formula to compute hidden state, predicting next character at every time step.', 'duration': 44.264, 'max_score': 586.953, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF8586953.jpg'}, {'end': 661.613, 'src': 'embed', 'start': 636.28, 'weight': 2, 'content': [{'end': 642.444, 'text': 'And the RNN, with its current setting of weights, computed these unnormalized log probabilities here for what it thinks should come next.', 'start': 636.28, 'duration': 6.164}, {'end': 645.366, 'text': 'So it thinks that H is 1.0 likely to come next.', 'start': 642.925, 'duration': 2.441}, {'end': 646.527, 'text': 'It thinks that E is 2.2 likely.', 'start': 645.587, 'duration': 0.94}, {'end': 653.13, 'text': 'L is negative 3 likely, and O is 4.1 likely right now in terms of unnormalized log probabilities.', 'start': 647.908, 'duration': 5.222}, {'end': 657.532, 'text': 'Of course, we know that in this training sequence, we know that E should follow H.', 'start': 653.73, 'duration': 3.802}, {'end': 661.613, 'text': "So in fact, this 2.2, which I'm showing in green, is the correct answer in this case.", 'start': 657.532, 'duration': 4.081}], 'summary': 'Rnn computed unnormalized log probabilities, predicting h(1.0), e(2.2), l(-3), o(4.1).', 'duration': 25.333, 'max_score': 636.28, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF8636280.jpg'}], 'start': 348.613, 'title': 'Recurrent neural networks for language modeling', 'summary': "Covers the understanding of recurrent neural networks (rnns) and their application in language modeling, achieving a 2.2 likelihood for the correct next character 'e' in a character-level language model.", 'chapters': [{'end': 508.665, 'start': 348.613, 'title': 'Understanding recurrent neural networks', 'summary': 'Explains the concept of recurrent neural networks (rnns), highlighting their structure, behavior, and training process, and emphasizing the use of a fixed function at every time step to enable rnns to work on sequences of varying length.', 'duration': 160.052, 'highlights': ['The RNN is represented by a box with a state and receives input vectors through time, allowing it to modify its state based on the input at each time step.', 'The RNN can produce output vectors on top of its state and can be trained on data to exhibit specific behaviors.', 'A fixed function with parameters w is used at every time step, enabling the RNN to work on sequences of varying lengths without committing to a 
specific size.', 'In the case of a vanilla RNN, the state is a single hidden state H, and it uses a recurrence formula to update the hidden state based on the previous hidden state and the current input, using weight matrices whh and wxh to project and update the hidden state.']}, {'end': 790.428, 'start': 509.105, 'title': 'Recurrent neural networks for language modeling', 'summary': "Explains the use of recurrent neural networks (rnns) in character-level language models, illustrating how rnns can predict the next character in a sequence based on a training set, with a vocabulary of four characters, achieving a 2.2 likelihood for the correct next character 'e' in the example 'hello' sequence.", 'duration': 281.323, 'highlights': ["The chapter explains the use of recurrent neural networks (RNNs) in character-level language models, illustrating how RNNs can predict the next character in a sequence based on a training set, with a vocabulary of four characters, achieving a 2.2 likelihood for the correct next character 'E' in the example 'hello' sequence.", "The training sequence 'hello' is used as an example, with a vocabulary of four characters (H, E, L, O) to train the recurrent neural network to predict the next character in the sequence, demonstrating the process of feeding characters one at a time into the network and using a one-hot representation for encoding characters.", "The RNN computes unnormalized log probabilities for the next character at each time step, with the correct prediction (in this case, 'E') having a likelihood of 2.2, and the goal being to have high probabilities for the correct next character and low probabilities for other characters, which is encoded in the gradient signal of the loss function.", 'At every time step, the RNN functions as a softmax classifier over the next character, with the losses flowing down from the top and backpropagating through the network, resulting in gradients on all weight matrices, allowing the network to adjust its behavior and shape the weights to produce the correct probabilities for the next characters in the sequence.']}], 'duration': 441.815, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF8348613.jpg', 'highlights': ['The RNN can produce output vectors on top of its state and can be trained on data to exhibit specific behaviors.', 'The RNN is represented by a box with a state and receives input vectors through time, allowing it to modify its state based on the input at each time step.', "The RNN computes unnormalized log probabilities for the next character at each time step, with the correct prediction (in this case, 'E') having a likelihood of 2.2.", "The chapter explains the use of recurrent neural networks (RNNs) in character-level language models, illustrating how RNNs can predict the next character in a sequence based on a training set, with a vocabulary of four characters, achieving a 2.2 likelihood for the correct next character 'E' in the example 'hello' sequence."]}, {'end': 1428.803, 'segs': [{'end': 837.302, 'src': 'embed', 'start': 809.497, 'weight': 6, 'content': [{'end': 814.1, 'text': "We're going to go through some specific examples, which I think will clarify some of these points.", 'start': 809.497, 'duration': 4.603}, {'end': 816.782, 'text': "So let's look at a specific example.", 'start': 815.481, 'duration': 1.301}, {'end': 820.044, 'text': "In fact, if you want to train a character level language model, it's quite short.", 'start': 816.922, 
'duration': 3.122}, {'end': 827.389, 'text': 'So I wrote a gist that you can find on GitHub, where this is 100 line implementation in NumPy for a character level RNN that you can go through.', 'start': 820.484, 'duration': 6.905}, {'end': 831.892, 'text': "I'd actually like to step through this with you so you can see concretely how we could train a recurrent neural network in practice.", 'start': 827.649, 'duration': 4.243}, {'end': 834.274, 'text': "And so I'm going to step through this code with you now.", 'start': 832.712, 'duration': 1.562}, {'end': 837.302, 'text': "So we're going to go through all the blocks.", 'start': 835.362, 'duration': 1.94}], 'summary': 'The transcript discusses an example of a character level language model implemented in numpy with 100 lines, demonstrating how to train a recurrent neural network.', 'duration': 27.805, 'max_score': 809.497, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF8809497.jpg'}, {'end': 913.979, 'src': 'embed', 'start': 882.195, 'weight': 3, 'content': [{'end': 883.055, 'text': 'Here, we have a learning rate.', 'start': 882.195, 'duration': 0.86}, {'end': 885.976, 'text': 'Sequence length here is set to 25.', 'start': 884.255, 'duration': 1.721}, {'end': 888.599, 'text': "This is a parameter that you'll become aware of with RNNs.", 'start': 885.976, 'duration': 2.623}, {'end': 895.284, 'text': "Basically, the problem is if our input data is way too large, say like millions of time steps, there's no way you can put an RNN on top of all of it,", 'start': 889.059, 'duration': 6.225}, {'end': 898.507, 'text': 'because you need to maintain all of this stuff in memory so that you can do backpropagation.', 'start': 895.284, 'duration': 3.223}, {'end': 902.69, 'text': "So in fact, we won't be able to keep all of it in memory and do backprop through all of it.", 'start': 899.007, 'duration': 3.683}, {'end': 904.872, 'text': "So we'll go in chunks through our input data.", 'start': 903.15, 'duration': 1.722}, {'end': 907.414, 'text': "In this case, we're going through chunks of 25 at a time.", 'start': 905.172, 'duration': 2.242}, {'end': 913.979, 'text': "So as you'll see in a bit, we have this entire data set, but we'll be going in chunks of 25.", 'start': 907.994, 'duration': 5.985}], 'summary': 'Rnn uses chunks of 25 data points due to memory constraints.', 'duration': 31.784, 'max_score': 882.195, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF8882195.jpg'}, {'end': 1006.619, 'src': 'heatmap', 'start': 921.666, 'weight': 0.737, 'content': [{'end': 923.227, 'text': "because we'd have to remember all that stuff.", 'start': 921.666, 'duration': 1.561}, {'end': 926.49, 'text': "And so we're going in chunks here of 25.", 'start': 923.727, 'duration': 2.763}, {'end': 930.033, 'text': "And then we have all these w matrices that here I'm initializing randomly and some biases.", 'start': 926.49, 'duration': 3.543}, {'end': 932.075, 'text': 'So w, xh, hh, and hy.', 'start': 930.173, 'duration': 1.902}, {'end': 935.498, 'text': "And those are all of our parameters that we're going to train with backprop.", 'start': 932.455, 'duration': 3.043}, {'end': 941.155, 'text': "OK? 
Now I'm going to skip over the loss function here, and I'm going to skip to the bottom of the script.", 'start': 936.659, 'duration': 4.496}, {'end': 944.538, 'text': "Here we have a main loop, and I'm going to go through some of this main loop now.", 'start': 941.616, 'duration': 2.922}, {'end': 948.821, 'text': 'So there are some initializations here of various things to 0 in the beginning.', 'start': 945.378, 'duration': 3.443}, {'end': 950.081, 'text': "And then we're looping forever.", 'start': 949.101, 'duration': 0.98}, {'end': 952.723, 'text': "What we're doing here is I'm sampling a batch of data.", 'start': 950.762, 'duration': 1.961}, {'end': 957.727, 'text': 'So here is where I actually take a batch of 25 characters out of this data set.', 'start': 953.204, 'duration': 4.523}, {'end': 959.268, 'text': "So that's in the list inputs.", 'start': 958.127, 'duration': 1.141}, {'end': 963.331, 'text': 'And the list inputs basically just has 25 integers corresponding to the characters.', 'start': 959.668, 'duration': 3.663}, {'end': 967.637, 'text': "The targets, as you'll see, is just all the same characters, but offset by 1,", 'start': 964.031, 'duration': 3.606}, {'end': 970.602, 'text': "because those are the indices that we're trying to predict at every single time step.", 'start': 967.637, 'duration': 2.965}, {'end': 975.249, 'text': 'So inputs and targets are just lists of 25 characters.', 'start': 971.223, 'duration': 4.026}, {'end': 977.473, 'text': 'Targets is offset by 1 into the future.', 'start': 975.569, 'duration': 1.904}, {'end': 981.963, 'text': "So that's where we sample, basically, a batch of data.", 'start': 979.181, 'duration': 2.782}, {'end': 984.684, 'text': 'This is some sampling code.', 'start': 982.943, 'duration': 1.741}, {'end': 987.566, 'text': "So, at every single point in time as we're training this RNN,", 'start': 985.065, 'duration': 2.501}, {'end': 995.331, 'text': 'we can of course try to generate some samples of what it currently thinks characters should actually what these sequences look like.', 'start': 987.566, 'duration': 7.765}, {'end': 1001.655, 'text': "So the way we use character level RNNs in test time is that We're going to seed it with some characters.", 'start': 995.731, 'duration': 5.924}, {'end': 1005.078, 'text': 'And then this RNN basically always gives us a distribution of the next character in a sequence.', 'start': 1001.815, 'duration': 3.263}, {'end': 1006.619, 'text': 'So you can imagine sampling from it.', 'start': 1005.378, 'duration': 1.241}], 'summary': 'Training a character level rnn with batches of 25 characters, using w matrices and biases for backpropagation.', 'duration': 84.953, 'max_score': 921.666, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF8921666.jpg'}, {'end': 1086.45, 'src': 'embed', 'start': 1059.507, 'weight': 4, 'content': [{'end': 1063.029, 'text': "And then here's a parameter update where the loss function told us all the gradients.", 'start': 1059.507, 'duration': 3.522}, {'end': 1067.412, 'text': 'And here we are actually performing the update, which you should recognize as an Adagrad update.', 'start': 1063.61, 'duration': 3.802}, {'end': 1077.438, 'text': "So I have all these cached variables for the gradient squared, which I'm accumulating and then performing the Adagrad update.", 'start': 1067.912, 'duration': 9.526}, {'end': 1080.82, 'text': "So I'm going to go into the loss function and what that looks like now.", 'start': 1078.959, 
'duration': 1.861}, {'end': 1084.228, 'text': 'The loss function is this block of code.', 'start': 1082.145, 'duration': 2.083}, {'end': 1086.45, 'text': 'It really consists of a forward and a backward method.', 'start': 1084.268, 'duration': 2.182}], 'summary': 'Parameter update using adagrad method in loss function.', 'duration': 26.943, 'max_score': 1059.507, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF81059507.jpg'}, {'end': 1162.165, 'src': 'embed', 'start': 1134.14, 'weight': 1, 'content': [{'end': 1137.222, 'text': 'And then your loss is negative log probability of the correct answer.', 'start': 1134.14, 'duration': 3.082}, {'end': 1140.103, 'text': "So that's just a softmax classifier loss over there.", 'start': 1137.602, 'duration': 2.501}, {'end': 1142.044, 'text': "So that's the forward pass.", 'start': 1140.923, 'duration': 1.121}, {'end': 1143.825, 'text': "And now we're going to back propagate through the graph.", 'start': 1142.304, 'duration': 1.521}, {'end': 1150.472, 'text': 'So in the backward pass, we go backwards through that sequence from 25 all the way back to 1.', 'start': 1144.725, 'duration': 5.747}, {'end': 1155.998, 'text': "And maybe you'll recognize, I don't know how much detail I want to go in here, but you'll recognize that I'm back propagating through a softmax.", 'start': 1150.472, 'duration': 5.526}, {'end': 1158.261, 'text': "I'm back propagating through the activation functions.", 'start': 1156.438, 'duration': 1.823}, {'end': 1159.902, 'text': "I'm back propagating through all of it.", 'start': 1158.281, 'duration': 1.621}, {'end': 1162.165, 'text': "And I'm just adding up all the gradients and all the parameters.", 'start': 1160.283, 'duration': 1.882}], 'summary': 'Training process includes forward pass with softmax classifier and backpropagation through the graph, computing gradients and parameters.', 'duration': 28.025, 'max_score': 1134.14, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF81134140.jpg'}, {'end': 1342.644, 'src': 'embed', 'start': 1315.18, 'weight': 5, 'content': [{'end': 1318.161, 'text': "So this RNN, in fact, doesn't know anything about characters or language or anything like that.", 'start': 1315.18, 'duration': 2.981}, {'end': 1319.921, 'text': "It's just indices and sequences of indices.", 'start': 1318.181, 'duration': 1.74}, {'end': 1320.701, 'text': "And that's what we're modeling.", 'start': 1319.941, 'duration': 0.76}, {'end': 1323.802, 'text': 'Yeah Go ahead.', 'start': 1320.721, 'duration': 3.081}, {'end': 1328.483, 'text': 'Is there a reason that we use a constant segment size instead of using spaces as delimiters, for example?', 'start': 1324.142, 'duration': 4.341}, {'end': 1334.833, 'text': 'Can we use spaces as delimiters or something like that instead of just constant batches of 25?', 'start': 1330.087, 'duration': 4.746}, {'end': 1340.401, 'text': 'I think you maybe could, but then you have to make assumptions about language.', 'start': 1334.833, 'duration': 5.568}, {'end': 1342.644, 'text': "We'll see soon why you wouldn't actually want to do that.", 'start': 1340.781, 'duration': 1.863}], 'summary': 'Rnn works with indices, not characters or language. 
constant segment size used for modeling, instead of spaces as delimiters.', 'duration': 27.464, 'max_score': 1315.18, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF81315180.jpg'}, {'end': 1381.1, 'src': 'embed', 'start': 1353.752, 'weight': 0, 'content': [{'end': 1358.619, 'text': 'And we feed it into the RNN, and we can train the RNN to create text like it.', 'start': 1353.752, 'duration': 4.867}, {'end': 1362.544, 'text': "And so, for example, you can take all of William Shakespeare's works.", 'start': 1359.38, 'duration': 3.164}, {'end': 1364.166, 'text': 'You concatenate all of it.', 'start': 1363.285, 'duration': 0.881}, {'end': 1365.767, 'text': "It's just a giant sequence of characters.", 'start': 1364.246, 'duration': 1.521}, {'end': 1367.809, 'text': 'And you put it into the recurrent neural network.', 'start': 1366.267, 'duration': 1.542}, {'end': 1371.091, 'text': 'And you try to predict the next character in a sequence for William Shakespeare poems.', 'start': 1367.869, 'duration': 3.222}, {'end': 1376.356, 'text': 'And so when you do this, of course, in the beginning, the recurrent neural network has random parameters.', 'start': 1371.772, 'duration': 4.584}, {'end': 1378.297, 'text': "So it's just producing a garble at the very end.", 'start': 1376.456, 'duration': 1.841}, {'end': 1381.1, 'text': "So it's just random characters.", 'start': 1378.878, 'duration': 2.222}], 'summary': "Rnn can be trained to generate text, e.g., shakespeare's poems, from a sequence of characters.", 'duration': 27.348, 'max_score': 1353.752, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF81353752.jpg'}], 'start': 790.448, 'title': 'Implementing character level rnn', 'summary': "Covers implementing character level rnn in numpy with a 100 line code, focusing on data loading, initialization, chunking, sampling, loss function, the sequence length parameter, and the adagrad update. 
it also explains the forward and backward passes in a recurrent neural network, including loss computation, backpropagation, parameter updates, and text data generation based on training data statistics, mentioning the use of constant segment size and training on william shakespeare's works.", 'chapters': [{'end': 1080.82, 'start': 790.448, 'title': 'Character level rnn training', 'summary': 'Discusses the implementation of a character level rnn in numpy with a 100 line code, covering the process of data loading, initialization, chunking, sampling, and loss function, with a focus on the sequence length parameter and the adagrad update.', 'duration': 290.372, 'highlights': ['The process of training a character level RNN in practice is explained with a 100 line implementation in NumPy, covering data loading, initialization, chunking, sampling, and loss function, focusing on the sequence length parameter and the Adagrad update.', 'The sequence length parameter is set to 25 for chunking the input data, which is crucial for maintaining memory efficiency and enabling backpropagation through manageable chunks of data.', 'An Adagrad update is performed for parameter optimization, involving the accumulation of gradient squared and subsequent parameter updates based on the calculated gradients.']}, {'end': 1428.803, 'start': 1082.145, 'title': 'Recurrent neural network', 'summary': "Explains the forward and backward passes in a recurrent neural network, including the computation of loss, backpropagation, and parameter updates, as well as the process of generating new text data based on training data statistics, with a mention of using constant segment size and training on william shakespeare's works.", 'duration': 346.658, 'highlights': ["The process of generating new text data based on training data statistics is explained, with a mention of training the recurrent neural network on William Shakespeare's works and the network's learning process. The chapter explains how a recurrent neural network can be trained to create text similar to William Shakespeare's works, learning statistical patterns and refining its understanding of language without hand-coding, resulting in the ability to sample entire infinite Shakespeare based on character-level data.", 'The explanation of the forward pass, including the computation of loss, computation of the recurrence formula, and the computation of the softmax function. The chapter details the forward pass process in a recurrent neural network, including the computation of loss using the negative log probability of the correct answer, the recurrence formula, and the softmax function for normalizing and obtaining probabilities.', 'The process of backpropagation through the graph, including backpropagating through softmax and activation functions, and the accumulation of gradients on weight matrices. The chapter explains the backward pass process in a recurrent neural network, involving backpropagation through softmax and activation functions, as well as the accumulation of gradients on weight matrices using a plus equals operation.', 'The use of a constant segment size and the implications of using spaces as delimiters. 
The chapter discusses the use of a constant segment size in the recurrent neural network and the implications of using spaces as delimiters, highlighting the considerations related to making assumptions about language and the flexibility of training on various text data.']}], 'duration': 638.355, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF8790448.jpg', 'highlights': ["The process of generating new text data based on training data statistics is explained, with a mention of training the recurrent neural network on William Shakespeare's works and the network's learning process.", 'The explanation of the forward pass, including the computation of loss, computation of the recurrence formula, and the computation of the softmax function.', 'The process of backpropagation through the graph, including backpropagating through softmax and activation functions, and the accumulation of gradients on weight matrices.', 'The sequence length parameter is set to 25 for chunking the input data, crucial for maintaining memory efficiency and enabling backpropagation through manageable chunks of data.', 'An Adagrad update is performed for parameter optimization, involving the accumulation of gradient squared and subsequent parameter updates based on the calculated gradients.', 'The use of a constant segment size and the implications of using spaces as delimiters, highlighting the considerations related to making assumptions about language and the flexibility of training on various text data.', 'The process of training a character level RNN in practice is explained with a 100 line implementation in NumPy, covering data loading, initialization, chunking, sampling, and loss function, focusing on the sequence length parameter and the Adagrad update.']}, {'end': 1966.248, 'segs': [{'end': 1475.951, 'src': 'embed', 'start': 1448.725, 'weight': 0, 'content': [{'end': 1451.808, 'text': 'And so Justin took, he found this book on algebraic geometry.', 'start': 1448.725, 'duration': 3.083}, {'end': 1454.29, 'text': 'And this is just a large latex source file.', 'start': 1452.408, 'duration': 1.882}, {'end': 1458.855, 'text': 'And we took that latex source file for this algebraic geometry and fed it into the RNN.', 'start': 1454.971, 'duration': 3.884}, {'end': 1464.121, 'text': 'And the RNN can learn to basically generate mathematics So this is a sample.', 'start': 1459.015, 'duration': 5.106}, {'end': 1467.624, 'text': 'So basically, this RNN just spits out LaTeX, and then we compile it.', 'start': 1464.441, 'duration': 3.183}, {'end': 1468.765, 'text': "And of course, it doesn't work right away.", 'start': 1467.644, 'duration': 1.121}, {'end': 1470.046, 'text': 'We had to tune it a tiny bit.', 'start': 1468.845, 'duration': 1.201}, {'end': 1475.951, 'text': 'But basically, the RNN, after we tweaked some of the mistakes that it has made, you can compile it, and you can generate mathematics.', 'start': 1470.506, 'duration': 5.445}], 'summary': 'Using rnn, justin generated mathematics from algebraic geometry book with minor tuning.', 'duration': 27.226, 'max_score': 1448.725, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF81448725.jpg'}, {'end': 1566.268, 'src': 'embed', 'start': 1536.546, 'weight': 1, 'content': [{'end': 1538.447, 'text': 'And so this is generated code from the RNN.', 'start': 1536.546, 'duration': 1.901}, {'end': 1541.989, 'text': 'And you can see that basically it creates function 
declarations.', 'start': 1539.327, 'duration': 2.662}, {'end': 1542.889, 'text': 'It knows about inputs.', 'start': 1542.029, 'duration': 0.86}, {'end': 1545.151, 'text': 'Syntactically, it makes very few mistakes.', 'start': 1543.33, 'duration': 1.821}, {'end': 1548.112, 'text': 'It knows about variables and sort of how to use them sometimes.', 'start': 1545.431, 'duration': 2.681}, {'end': 1549.473, 'text': 'It invents the code.', 'start': 1548.612, 'duration': 0.861}, {'end': 1550.794, 'text': 'It creates its own bogus comments.', 'start': 1549.633, 'duration': 1.161}, {'end': 1558.442, 'text': "Like syntactically, it's very rare to find that it would open a bracket and not close it, and so on.", 'start': 1553.939, 'duration': 4.503}, {'end': 1560.424, 'text': 'This actually is relatively easy for the RNN to learn.', 'start': 1558.482, 'duration': 1.942}, {'end': 1566.268, 'text': 'And so some of the mistakes that it makes actually is that, for example, it declares some variables that it never ends up using,', 'start': 1561.265, 'duration': 5.003}], 'summary': 'RNN-generated code shows few mistakes, knows inputs/variables, but makes unused variable declarations.', 'duration': 29.722, 'max_score': 1536.546, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF81536546.jpg'}, {'end': 1602.692, 'src': 'embed', 'start': 1570.431, 'weight': 2, 'content': [{'end': 1572.293, 'text': 'But otherwise, it can do code just fine.', 'start': 1570.431, 'duration': 1.862}, {'end': 1578.898, 'text': 'It also knows how to recite the GNU GPL license character by character that is learned from data.', 'start': 1573.033, 'duration': 5.865}, {'end': 1585.246, 'text': "And it knows that after the GNU GPL license, there are some include files, there are some macros, and then there's some code.", 'start': 1579.975, 'duration': 5.271}, {'end': 1586.689, 'text': "So that's basically what it has learned.", 'start': 1585.427, 'duration': 1.262}, {'end': 1596.186, 'text': "Good? Yeah, so a min-char RNN that just I've shown you is very small, just a toy thing to show you what's going on.", 'start': 1587.23, 'duration': 8.956}, {'end': 1602.692, 'text': "Then there's a char RNN, which is a more kind of a mature implementation in Torch, which is just a min-char RNN scaled up and runs on GPU.", 'start': 1596.206, 'duration': 6.486}], 'summary': 'The AI can recite the GNU GPL license character by character and learned code structure. 
it includes a small min-char rnn and a mature char rnn in torch for gpu implementation.', 'duration': 32.261, 'max_score': 1570.431, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF81570431.jpg'}, {'end': 1638.49, 'src': 'embed', 'start': 1608.256, 'weight': 3, 'content': [{'end': 1609.597, 'text': "It's a three-layer LSTM.", 'start': 1608.256, 'duration': 1.341}, {'end': 1610.898, 'text': "And so we'll see what that means.", 'start': 1609.958, 'duration': 0.94}, {'end': 1613.2, 'text': "It's a more complex kind of form of recurrent neural network.", 'start': 1610.958, 'duration': 2.242}, {'end': 1616.631, 'text': 'OK Just to give you an idea about how this works.', 'start': 1613.22, 'duration': 3.411}, {'end': 1620.675, 'text': 'So this is from a paper that we played a lot with this with Justin last year.', 'start': 1616.971, 'duration': 3.704}, {'end': 1624.157, 'text': "And we were basically trying to pretend that we're neuroscientists.", 'start': 1621.095, 'duration': 3.062}, {'end': 1628.901, 'text': 'And we threw a character level RNN on some test text.', 'start': 1624.798, 'duration': 4.103}, {'end': 1632.064, 'text': 'And so the RNN is reading this text, this snippet of code.', 'start': 1629.522, 'duration': 2.542}, {'end': 1635.007, 'text': "And we're looking at a specific cell in the hidden state of the RNN.", 'start': 1632.545, 'duration': 2.462}, {'end': 1638.49, 'text': "We're coloring the text based on whether or not that cell is excited or not.", 'start': 1635.467, 'duration': 3.023}], 'summary': 'Three-layer lstm used in rnn for text analysis.', 'duration': 30.234, 'max_score': 1608.256, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF81608256.jpg'}, {'end': 1727.607, 'src': 'embed', 'start': 1700.645, 'weight': 4, 'content': [{'end': 1703.908, 'text': "because it wouldn't be able to spot dependencies that are much longer than that.", 'start': 1700.645, 'duration': 3.263}, {'end': 1711.015, 'text': 'But I think basically this seems to show that you can train this character-level detection cell as useful on sequences less than 100,', 'start': 1704.509, 'duration': 6.506}, {'end': 1713.558, 'text': 'and then it generalizes properly to longer sequences.', 'start': 1711.015, 'duration': 2.543}, {'end': 1724.605, 'text': 'So this cell seems to work for more than 100 steps, even if it was only trained, even if it was only able to spot the dependencies on less than 100.', 'start': 1715.72, 'duration': 8.885}, {'end': 1725.686, 'text': 'This is another data set here.', 'start': 1724.605, 'duration': 1.081}, {'end': 1727.607, 'text': "This is, I think, Leo Tolstoy's War and Peace.", 'start': 1725.866, 'duration': 1.741}], 'summary': 'Character-level detection cell trained on sequences less than 100 generalizes properly to longer sequences.', 'duration': 26.962, 'max_score': 1700.645, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF81700645.jpg'}], 'start': 1429.584, 'title': 'Rnns and mathematics generation', 'summary': 'Discusses the implementation and visualization of character-level rnns, highlighting its ability to generate mathematics and code from latex source files and c codebase with varying degrees of success, as well as its application in detecting patterns within text data and image captioning.', 'chapters': [{'end': 1586.689, 'start': 1429.584, 'title': 'Rnn generating mathematics and code', 'summary': 
'Discusses how a character-level RNN can learn to generate mathematics from LaTeX source files and code from a large C codebase, creating proofs, lemmas, and function declarations with varying degrees of success and accuracy.', 'duration': 157.105, 'highlights': ['The RNN can learn to generate mathematics from LaTeX source files for algebraic geometry and code from a 700 megabyte C codebase, creating proofs, lemmas, and function declarations. The RNN successfully generates mathematics from LaTeX source files for algebraic geometry and code from a 700 megabyte C codebase.', 'The RNN is able to create function declarations, know about inputs, variables, and how to use them, and even invent its own bogus comments. The RNN can create function declarations, understand inputs and variables, and invent its own comments, with few syntactic mistakes.', 'The RNN can recite the GNU GPL license character by character and understands the structure of include files, macros, and code, learned from data.']}, {'end': 1966.248, 'start': 1587.23, 'title': 'Character-level recurrent neural networks', 'summary': 'Discusses the implementation and visualization of character-level RNNs, highlighting its ability to detect patterns, such as quote detection and line tracking, within text data, and its application in image captioning using a combination of convolutional and recurrent neural networks.', 'duration': 379.018, 'highlights': ['The RNN is a three-layer LSTM, demonstrating its complexity and capability to track patterns within text data, such as quote detection and line tracking.', 'The RNN was trained on a sequence length of 100, but it was able to generalize properly to longer sequences, indicating its ability to learn and apply character-level detection on sequences longer than the training length.', "Various cells within the RNN were found to respond to specific conditions, such as inside if statements, inside quotes and strings, and with increasing excitement as the expression nesting deepens, showcasing the RNN's ability to identify and respond to different patterns within the text data.", 'In the context of image captioning, a combination of a convolutional neural network and a recurrent neural network is used to process and describe an image with a sequence of words, conditioning the RNN-generated model by the output of the convolutional network.']}], 'duration': 536.664, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF81429584.jpg', 'highlights': ['The RNN can learn to generate mathematics from LaTeX source files for algebraic geometry and code from a 700 megabyte C codebase, creating proofs, lemmas, and function declarations. (Relevance: 5)', 'The RNN is able to create function declarations, know about inputs, variables, and how to use them, and even invent its own bogus comments. (Relevance: 4)', 'The RNN can recite the GNU GPL license character by character and understands the structure of include files, macros, and code, learned from data. (Relevance: 3)', 'The RNN is a three-layer LSTM, demonstrating its complexity and capability to track patterns within text data, such as quote detection and line tracking. 
(Relevance: 2)', 'The RNN was trained on a sequence length of 100, but it was able to generalize properly to longer sequences, indicating its ability to learn and apply character-level detection on sequences longer than the training length. (Relevance: 1)']}, {'end': 2423.193, 'segs': [{'end': 2064.913, 'src': 'heatmap', 'start': 1966.728, 'weight': 0, 'content': [{'end': 1971.17, 'text': "And instead, we're going to redirect the representation at the top of the convolutional network into the recurrent neural network.", 'start': 1966.728, 'duration': 4.442}, {'end': 1975.893, 'text': 'So we begin the generation of the RNN with a special start vector.', 'start': 1971.791, 'duration': 4.102}, {'end': 1979.836, 'text': 'So the input to this RNN was, I think, 300 dimensional.', 'start': 1976.773, 'duration': 3.063}, {'end': 1983.639, 'text': 'And this is a special 300 dimensional vector that we always plug in at the first iteration.', 'start': 1980.236, 'duration': 3.403}, {'end': 1985.701, 'text': 'It tells the RNN that this is the beginning of the sequence.', 'start': 1983.659, 'duration': 2.042}, {'end': 1991.365, 'text': "And then we're going to perform the recurrence formula that I've shown you before for a vanilla recurrent neural network.", 'start': 1986.621, 'duration': 4.744}, {'end': 1998.251, 'text': 'So normally, we compute this recurrence, which we saw already, where we compute wxh times x plus whh times h.', 'start': 1992.086, 'duration': 6.165}, {'end': 2005.67, 'text': 'And now we want to additionally condition this recurrent neural network not only on the current input and the current hidden state,', 'start': 1998.251, 'duration': 7.419}, {'end': 2007.011, 'text': 'which we initialize with 0..', 'start': 2005.67, 'duration': 1.341}, {'end': 2008.693, 'text': 'So that term goes away at the first time step.', 'start': 2007.011, 'duration': 1.682}, {'end': 2014.177, 'text': 'But we additionally condition just by adding w i h times v.', 'start': 2009.273, 'duration': 4.904}, {'end': 2015.798, 'text': 'And so this v is the top of the comnet here.', 'start': 2014.177, 'duration': 1.621}, {'end': 2019.601, 'text': 'And we basically have this added interaction and this added weight matrix w,', 'start': 2016.298, 'duration': 3.303}, {'end': 2025.065, 'text': 'which tells us how this image information merges into the very first time step of the recurrent neural network.', 'start': 2019.601, 'duration': 5.464}, {'end': 2029.669, 'text': 'Now, there are many ways to actually play with this recurrence and many ways to actually plug in the image into their RNN.', 'start': 2025.626, 'duration': 4.043}, {'end': 2032.031, 'text': 'And this is only one of them, one of the simpler ones, perhaps.', 'start': 2029.729, 'duration': 2.302}, {'end': 2038.296, 'text': 'And at the very first time step here, this y0 vector is the distribution over the first word in a sequence.', 'start': 2032.771, 'duration': 5.525}, {'end': 2041.859, 'text': 'So the way this works, you might imagine, for example,', 'start': 2039.036, 'duration': 2.823}, {'end': 2049.025, 'text': "is you can see that these straw textures in a man's hat can be recognized by the convolutional network as straw-like stuff.", 'start': 2041.859, 'duration': 7.166}, {'end': 2051.226, 'text': 'And then, through this interaction, wih,', 'start': 2049.445, 'duration': 1.781}, {'end': 2057.231, 'text': 'it might condition the hidden state to go into a particular state where the probability of the word straw could be slightly higher.', 'start': 
2051.226, 'duration': 6.005}, {'end': 2064.913, 'text': 'So you might imagine that the straw-like textures can influence the probability of straw so one of the numbers inside y0 to be higher because there are straw textures in there.', 'start': 2058.112, 'duration': 6.801}], 'summary': 'Redirecting representation from convolutional to recurrent neural network, conditioning the rnn on image information, and influencing word probability based on image features.', 'duration': 98.185, 'max_score': 1966.728, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF81966728.jpg'}, {'end': 2049.025, 'src': 'embed', 'start': 2019.601, 'weight': 1, 'content': [{'end': 2025.065, 'text': 'which tells us how this image information merges into the very first time step of the recurrent neural network.', 'start': 2019.601, 'duration': 5.464}, {'end': 2029.669, 'text': 'Now, there are many ways to actually play with this recurrence and many ways to actually plug in the image into their RNN.', 'start': 2025.626, 'duration': 4.043}, {'end': 2032.031, 'text': 'And this is only one of them, one of the simpler ones, perhaps.', 'start': 2029.729, 'duration': 2.302}, {'end': 2038.296, 'text': 'And at the very first time step here, this y0 vector is the distribution over the first word in a sequence.', 'start': 2032.771, 'duration': 5.525}, {'end': 2041.859, 'text': 'So the way this works, you might imagine, for example,', 'start': 2039.036, 'duration': 2.823}, {'end': 2049.025, 'text': "is you can see that these straw textures in a man's hat can be recognized by the convolutional network as straw-like stuff.", 'start': 2041.859, 'duration': 7.166}], 'summary': 'Image data merging into initial time step of rnn, y0 vector represents distribution over first word in a sequence.', 'duration': 29.424, 'max_score': 2019.601, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF82019601.jpg'}, {'end': 2115.076, 'src': 'embed', 'start': 2086.737, 'weight': 3, 'content': [{'end': 2089.34, 'text': 'And so in this case, I think we were using word-level embeddings.', 'start': 2086.737, 'duration': 2.603}, {'end': 2094.483, 'text': "So the straw word is associated with a 300-dimensional vector, which we're going to learn.", 'start': 2089.54, 'duration': 4.943}, {'end': 2098.826, 'text': "We're going to learn a 300-dimensional representation for every single unique word in the vocabulary.", 'start': 2094.523, 'duration': 4.303}, {'end': 2104.83, 'text': 'And we plug in those 300 numbers into the RNN and forward it again to get a distribution over the second word in the sequence inside y1.', 'start': 2099.347, 'duration': 5.483}, {'end': 2106.852, 'text': 'So we get all these probabilities.', 'start': 2105.911, 'duration': 0.941}, {'end': 2107.852, 'text': 'We sample from it again.', 'start': 2106.952, 'duration': 0.9}, {'end': 2110.053, 'text': 'Suppose that the word hat is likely now.', 'start': 2108.332, 'duration': 1.721}, {'end': 2115.076, 'text': 'We take hats, 300 dimensional representation, plug it in, and get the distribution over there.', 'start': 2110.474, 'duration': 4.602}], 'summary': 'Using word-level embeddings to learn 300-dimensional representations for unique words and generate distributions over words in the sequence.', 'duration': 28.339, 'max_score': 2086.737, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF82086737.jpg'}, {'end': 2301.876, 'src': 
'embed', 'start': 2256.348, 'weight': 4, 'content': [{'end': 2260.211, 'text': 'It has to remember about the image what it needs to remember through the RNN.', 'start': 2256.348, 'duration': 3.863}, {'end': 2262.113, 'text': 'And it also has to produce all these outputs.', 'start': 2260.612, 'duration': 1.501}, {'end': 2263.554, 'text': 'And somehow it wants to do that.', 'start': 2262.313, 'duration': 1.241}, {'end': 2267.337, 'text': "There are some hand-wavy reasons I can give you after class for why that's true.", 'start': 2264.975, 'duration': 2.362}, {'end': 2269.892, 'text': 'I see.', 'start': 2269.332, 'duration': 0.56}, {'end': 2270.433, 'text': 'Not quite.', 'start': 2269.912, 'duration': 0.521}, {'end': 2275.976, 'text': 'So at training time, a single instance will correspond to an image and a sequence of words.', 'start': 2270.453, 'duration': 5.523}, {'end': 2279.818, 'text': 'And so we would plug in those words here, and we would plug in that image, and we.', 'start': 2276.036, 'duration': 3.782}, {'end': 2280.258, 'text': 'Yeah, so like.', 'start': 2279.818, 'duration': 0.44}, {'end': 2301.876, 'text': 'So at training time, you have all those words plugged in on the bottom.', 'start': 2298.735, 'duration': 3.141}], 'summary': 'Rnn needs to remember image details and produce outputs during training.', 'duration': 45.528, 'max_score': 2256.348, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF82256348.jpg'}, {'end': 2373.166, 'src': 'embed', 'start': 2349.618, 'weight': 5, 'content': [{'end': 2358.102, 'text': "And that's a big advantage, actually, because we can figure out what features to look for in order to better describe the images at the end.", 'start': 2349.618, 'duration': 8.484}, {'end': 2362.743, 'text': 'OK So when you train this in practice, we train this on image sentence data sets.', 'start': 2358.622, 'duration': 4.121}, {'end': 2364.823, 'text': 'One of the more common ones is called Microsoft Cocoa.', 'start': 2362.903, 'duration': 1.92}, {'end': 2370.425, 'text': "So just to give you an idea of what it looks like, it's roughly 100, 000 images and five sentence descriptions for each image.", 'start': 2365.404, 'duration': 5.021}, {'end': 2373.166, 'text': 'These were obtained using Amazon Mechanical Turk.', 'start': 2371.185, 'duration': 1.981}], 'summary': 'Training involves a dataset of 100,000 images with 5 sentence descriptions each, obtained through amazon mechanical turk.', 'duration': 23.548, 'max_score': 2349.618, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF82349618.jpg'}], 'start': 1966.728, 'title': 'Integrating convolutional network with recurrent neural network and rnn for image captioning', 'summary': 'Discusses integrating a convolutional network with a recurrent neural network, utilizing a 300-dimensional special start vector and conditioning the recurrent neural network on image information. 
it also explains the process of using recurrent neural networks to generate image captions, training on datasets like microsoft coco, yielding detailed descriptions.', 'chapters': [{'end': 2032.031, 'start': 1966.728, 'title': 'Integrating convolutional network with recurrent neural network', 'summary': 'Discusses integrating a convolutional network with a recurrent neural network, utilizing a 300-dimensional special start vector to indicate the beginning of the sequence and conditioning the recurrent neural network on image information through added interactions and weight matrix.', 'duration': 65.303, 'highlights': ['The input to the recurrent neural network is a special 300-dimensional vector indicating the start of the sequence, providing a clear indication of the sequence beginning.', 'The recurrent neural network is conditioned on the current input and hidden state, with the added interaction of the image information through the weight matrix, enhancing the integration of image data into the recurrent neural network.', 'Exploration of various methods to integrate the image information into the recurrent neural network, offering flexibility and potential for further optimization.']}, {'end': 2423.193, 'start': 2032.771, 'title': 'Rnn for image captioning', 'summary': "Explains the process of using recurrent neural networks (rnn) to generate image captions, where the rnn juggles between predicting the next word in the sequence and remembering the image information, with 300-dimensional word embeddings, and training on datasets like microsoft coco, yielding descriptions such as 'a man in a black shirt playing guitar' and 'a construction worker in orange safety west working on the road'.", 'duration': 390.422, 'highlights': ['The RNN juggles between predicting the next word in the sequence and remembering the image information The RNN has to predict the next word in the sequence and remember the image information simultaneously, which influences the distribution over words and requires juggling two tasks.', '300-dimensional word embeddings are used for different words in the vocabulary The RNN uses 300-dimensional word embeddings for each unique word in the vocabulary, which are learned and used to generate a distribution over the sequence of words.', "Training on datasets like Microsoft Coco yields descriptions such as 'a man in a black shirt playing guitar' The model is trained on datasets like Microsoft Coco, resulting in captions such as 'a man in a black shirt playing guitar' and 'a construction worker in Orange Safety West working on the road', obtained from images with five sentence descriptions each."]}], 'duration': 456.465, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF81966728.jpg', 'highlights': ['The input to the recurrent neural network is a special 300-dimensional vector indicating the start of the sequence, providing a clear indication of the sequence beginning.', 'The recurrent neural network is conditioned on the current input and hidden state, with the added interaction of the image information through the weight matrix, enhancing the integration of image data into the recurrent neural network.', 'Exploration of various methods to integrate the image information into the recurrent neural network, offering flexibility and potential for further optimization.', '300-dimensional word embeddings are used for different words in the vocabulary The RNN uses 300-dimensional word embeddings for each unique word in the 
vocabulary, which are learned and used to generate a distribution over the sequence of words.', 'The RNN juggles between predicting the next word in the sequence and remembering the image information The RNN has to predict the next word in the sequence and remember the image information simultaneously, which influences the distribution over words and requires juggling two tasks.', "Training on datasets like Microsoft Coco yields descriptions such as 'a man in a black shirt playing guitar' The model is trained on datasets like Microsoft Coco, resulting in captions such as 'a man in a black shirt playing guitar' and 'a construction worker in Orange Safety West working on the road', obtained from images with five sentence descriptions each."]}, {'end': 3278.252, 'segs': [{'end': 2503.173, 'src': 'embed', 'start': 2472.212, 'weight': 0, 'content': [{'end': 2474.033, 'text': 'And you can actually do this in a fully trainable way.', 'start': 2472.212, 'duration': 1.821}, {'end': 2479.114, 'text': 'So the RNN not only creates these words, but also decides where to look next in the image.', 'start': 2474.513, 'duration': 4.601}, {'end': 2485.118, 'text': 'And so the way this works is not only does the RNN output your probability distribution over the next word in the sequence,', 'start': 2479.634, 'duration': 5.484}, {'end': 2486.799, 'text': 'but this comnet gives you this volume.', 'start': 2485.118, 'duration': 1.681}, {'end': 2494.369, 'text': 'So say in this case, we forwarded the comnet and got a 14 by 14 by 512.', 'start': 2487.92, 'duration': 6.449}, {'end': 2495.87, 'text': 'by 512 activation volume.', 'start': 2494.369, 'duration': 1.501}, {'end': 2503.173, 'text': "And at every single time step you don't just emit that distribution, but you also emit a 512 dimensional vector.", 'start': 2496.79, 'duration': 6.383}], 'summary': 'Rnn generates words and specifies image locations, emitting 512-dimensional vectors.', 'duration': 30.961, 'max_score': 2472.212, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF82472212.jpg'}, {'end': 2647.958, 'src': 'embed', 'start': 2617.15, 'weight': 2, 'content': [{'end': 2620.613, 'text': 'So this gives you more deep stuff usually works better.', 'start': 2617.15, 'duration': 3.463}, {'end': 2625.898, 'text': "So the way we stack this up, one of the ways at least you can stack recurrent neural networks, and there's many ways.", 'start': 2621.274, 'duration': 4.624}, {'end': 2631.343, 'text': 'but this is just one of them that people use in practice, is you can straight up just plug RNNs into each other.', 'start': 2626.519, 'duration': 4.824}, {'end': 2637.869, 'text': 'So the input for one RNN is the hidden state vector of the previous RNN.', 'start': 2631.944, 'duration': 5.925}, {'end': 2640.551, 'text': 'So in this image, we have the time axis going horizontally.', 'start': 2638.389, 'duration': 2.162}, {'end': 2643.134, 'text': 'And then going upwards, we have different RNNs.', 'start': 2640.992, 'duration': 2.142}, {'end': 2647.958, 'text': 'And so in this particular image, there are three separate recurrent neural networks, each with their own set of weights.', 'start': 2643.554, 'duration': 4.404}], 'summary': 'Stack recurrent neural networks to improve performance with multiple separate rnns.', 'duration': 30.808, 'max_score': 2617.15, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF82617150.jpg'}, {'end': 2773.066, 'src': 'embed', 
'start': 2741.227, 'weight': 1, 'content': [{'end': 2744.289, 'text': 'In practice, you will actually rarely ever use a formula like this.', 'start': 2741.227, 'duration': 3.062}, {'end': 2746.009, 'text': 'A basic recurrent network is very rarely used.', 'start': 2744.349, 'duration': 1.66}, {'end': 2749.851, 'text': "Instead, you'll use what we call an LSTM, or long short-term memory.", 'start': 2746.39, 'duration': 3.461}, {'end': 2752.873, 'text': 'So this is basically used in all the papers now.', 'start': 2750.291, 'duration': 2.582}, {'end': 2757.415, 'text': "So this is the formula you'd be using also in your projects if you were to use recurrent neural networks.", 'start': 2752.913, 'duration': 4.502}, {'end': 2762.598, 'text': "What I'd like you to notice at this point is everything is exactly the same as with an RNN.", 'start': 2758.075, 'duration': 4.523}, {'end': 2765.961, 'text': "It's just that the recurrence formula is a slightly more complex function.", 'start': 2763.159, 'duration': 2.802}, {'end': 2773.066, 'text': "We're still taking the hidden vector from below in depth, like your input, and from before in time, the previous hidden state.", 'start': 2766.741, 'duration': 6.325}], 'summary': 'Lstm is widely used in papers and projects for recurrent neural networks.', 'duration': 31.839, 'max_score': 2741.227, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF82741227.jpg'}, {'end': 2895.761, 'src': 'heatmap', 'start': 2809.642, 'weight': 0.883, 'content': [{'end': 2813.685, 'text': 'If you look for LSTMs online, you can look for LSTM.', 'start': 2809.642, 'duration': 4.043}, {'end': 2818.67, 'text': "when you go on Wikipedia or you go to Google Images, you'll find diagrams like this, which is really not helping, I think, anyone.", 'start': 2813.685, 'duration': 4.985}, {'end': 2821.812, 'text': 'The first time I saw LSTMs, they really scared me.', 'start': 2820.051, 'duration': 1.761}, {'end': 2822.913, 'text': 'Like this one really scared me.', 'start': 2821.892, 'duration': 1.021}, {'end': 2824.254, 'text': "I wasn't really sure what's going on.", 'start': 2822.933, 'duration': 1.321}, {'end': 2828.138, 'text': "I understand LSTMs, and I still don't know what these two diagrams are.", 'start': 2825.255, 'duration': 2.883}, {'end': 2833.129, 'text': "OK, so I'm going to try to break down LSTM.", 'start': 2831.368, 'duration': 1.761}, {'end': 2835.591, 'text': "It's kind of a tricky thing to put into a diagram.", 'start': 2833.209, 'duration': 2.382}, {'end': 2836.832, 'text': 'You really have to step through it.', 'start': 2835.611, 'duration': 1.221}, {'end': 2839.614, 'text': 'So lecture format is perfect for an LSTM.', 'start': 2837.333, 'duration': 2.281}, {'end': 2843.589, 'text': 'OK So here we have the LSTM equations.', 'start': 2840.095, 'duration': 3.494}, {'end': 2849.393, 'text': "And I'm going to first focus on the first part here on the top, where we take these two vectors from below and from before.", 'start': 2843.87, 'duration': 5.523}, {'end': 2850.915, 'text': 'So x and h.', 'start': 2849.634, 'duration': 1.281}, {'end': 2852.736, 'text': 'h is our previous hidden state, and x is the input.', 'start': 2850.915, 'duration': 1.821}, {'end': 2855.938, 'text': 'We map them through that transformation w.', 'start': 2853.416, 'duration': 2.522}, {'end': 2862.883, 'text': "And now, if both x and h are of size n, so there's n numbers in them, we're going to end up producing 4n numbers through this w 
matrix,", 'start': 2855.938, 'duration': 6.945}, {'end': 2863.843, 'text': 'which is 4n by 2n.', 'start': 2862.883, 'duration': 0.96}, {'end': 2869.206, 'text': 'So we have these four n-dimensional vectors, i, f, o, and g.', 'start': 2864.904, 'duration': 4.302}, {'end': 2872.387, 'text': "They're short for input, forget, output, and g.", 'start': 2869.206, 'duration': 3.181}, {'end': 2873.467, 'text': "I'm not sure what that's short for.", 'start': 2872.387, 'duration': 1.08}, {'end': 2874.728, 'text': "It's just g.", 'start': 2873.507, 'duration': 1.221}, {'end': 2879.269, 'text': 'And so the i, f, and o go through sigmoid gates, and g goes through a tanh gate.', 'start': 2874.728, 'duration': 4.541}, {'end': 2885.752, 'text': 'Now the way this actually works.', 'start': 2879.93, 'duration': 5.822}, {'end': 2889.893, 'text': 'the LSTM, basically the best way to think about it, is, oh one thing I forgot to mention actually in the previous slide.', 'start': 2885.752, 'duration': 4.141}, {'end': 2895.761, 'text': 'Normally recurrent neural network just has the single H vector at every single time step.', 'start': 2892.717, 'duration': 3.044}], 'summary': 'Lstm breakdown explained with equations and vector transformation, producing 4n numbers.', 'duration': 86.119, 'max_score': 2809.642, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF82809642.jpg'}, {'end': 3021.676, 'src': 'embed', 'start': 2996.169, 'weight': 3, 'content': [{'end': 3004.775, 'text': "And since i is between 0 and 1, and g is between negative 1 and 1, we're basically adding a number between negative 1 and 1 to every cell.", 'start': 2996.169, 'duration': 8.606}, {'end': 3008.199, 'text': 'So at every single time step, we have these counters in all the cells.', 'start': 3005.515, 'duration': 2.684}, {'end': 3014.847, 'text': 'We can reset these counters to 0 with the forget gate, or we can choose to add a number between negative 1 and 1 to every single cell.', 'start': 3008.519, 'duration': 6.328}, {'end': 3017.25, 'text': "So that's how we perform the cell update.", 'start': 3015.768, 'duration': 1.482}, {'end': 3021.676, 'text': 'And then the hidden update ends up being a squashed cell, so 10 H of C.', 'start': 3017.711, 'duration': 3.965}], 'summary': 'Lstm cell update adds number between -1 and 1 to each cell at every time step.', 'duration': 25.507, 'max_score': 2996.169, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF82996169.jpg'}], 'start': 2424.334, 'title': 'Lstm models and cell state update', 'summary': 'Discusses the use of recurrent neural networks (rnns) and long short-term memory (lstm) models in image processing and natural language generation, emphasizing soft attention and the benefits of stacking rnns. 
it also explains the lstm cell state update mechanism involving gates i, f, o, and g, enabling counters to be reset or incremented.', 'chapters': [{'end': 2917.086, 'start': 2424.334, 'title': 'Recurrent neural networks and lstm models', 'summary': 'Discusses the utilization of recurrent neural networks (rnns) and long short-term memory (lstm) models in image processing and natural language generation, emphasizing the concept of soft attention and the benefits of stacking rnns in layers.', 'duration': 492.752, 'highlights': ["The chapter introduces the concept of soft attention, allowing the recurrent neural network (RNN) to reference parts of the image while generating words, resulting in a 14 by 14 probability map over the image for selective attention. The RNN can utilize soft attention to reference parts of the image while generating words, resulting in a 14 by 14 probability map for selective attention, enhancing the model's descriptive capabilities.", 'The discussion emphasizes the benefits of stacking recurrent neural networks (RNNs) in layers, demonstrating that deeper models usually yield better performance. Stacking RNNs in layers can lead to improved performance, as deeper models are often more effective in processing and analyzing data.', 'The chapter presents the transition from using basic recurrent neural network (RNN) formulas to employing long short-term memory (LSTM) models for enhanced complexity and improved performance. Transitioning from basic RNN formulas to LSTM models offers enhanced complexity and improved performance, as LSTM models provide a more complex recurrence formula for updating the hidden state.']}, {'end': 3278.252, 'start': 2917.687, 'title': 'Understanding lstm cell state update', 'summary': 'Explains the lstm cell state update mechanism, involving gates i, f, o, and g, with i, f, and o acting as binary gates and g modulating the cell state, enabling counters to be reset or incremented, and the hidden state being a squashed cell modulated by the output gate.', 'duration': 360.565, 'highlights': ['The concept of using gates i, f, and o as binary elements and g to modulate the cell state is explained, where i, f, and o act as binary gates and g modulates the cell state. i, f, and o act as binary gates', 'The update mechanism involves resetting some cells to zero using the forget gate F, and adding a number between -1 and 1 to every cell based on the interaction of i and g, where i is between 0 and 1, and g is between -1 and 1. resetting some cells to zero, adding a number between -1 and 1 to every cell', 'The hidden update is a squashed cell modulated by the output gate, allowing only some of the cell state to leak into the hidden state in a learnable way. allowing only some of the cell state to leak into the hidden state', 'The explanation of the importance of adding i times g instead of just g, as it provides a more expressive function and decouples the concepts of how much to add to the cell state and whether to add to the cell state. 
importance of adding i times g instead of just g']}], 'duration': 853.918, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF82424334.jpg', 'highlights': ["The RNN can utilize soft attention to reference parts of the image while generating words, resulting in a 14 by 14 probability map for selective attention, enhancing the model's descriptive capabilities.", 'Transitioning from basic RNN formulas to LSTM models offers enhanced complexity and improved performance, as LSTM models provide a more complex recurrence formula for updating the hidden state.', 'Stacking RNNs in layers can lead to improved performance, as deeper models are often more effective in processing and analyzing data.', 'The update mechanism involves resetting some cells to zero using the forget gate F, and adding a number between -1 and 1 to every cell based on the interaction of i and g, where i is between 0 and 1, and g is between -1 and 1.']}, {'end': 4187.548, 'segs': [{'end': 3399.94, 'src': 'embed', 'start': 3373.193, 'weight': 0, 'content': [{'end': 3377.517, 'text': 'We have these additive interactions where here the x is basically your cell.', 'start': 3373.193, 'duration': 4.324}, {'end': 3381.922, 'text': 'And we go off, we do some function, and then we choose to add to this cell state.', 'start': 3378.038, 'duration': 3.884}, {'end': 3386.866, 'text': "But the LSTMs, unlike ResNets, have also these forget gates that we're adding.", 'start': 3382.382, 'duration': 4.484}, {'end': 3390.329, 'text': 'And these forget gates can choose to shut off some parts of the signal as well.', 'start': 3386.946, 'duration': 3.383}, {'end': 3393.032, 'text': 'But otherwise, it looks very much like a ResNet.', 'start': 3391.391, 'duration': 1.641}, {'end': 3399.94, 'text': "So I think it's kind of interesting that we're converging on very similar kind of looking architecture that works both in ComNuts and in recurrent neural networks,", 'start': 3393.052, 'duration': 6.888}], 'summary': 'Lstms have forget gates like resnets, converging on similar architecture for comnuts and recurrent neural networks.', 'duration': 26.747, 'max_score': 3373.193, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF83373193.jpg'}, {'end': 3470.496, 'src': 'embed', 'start': 3442.224, 'weight': 1, 'content': [{'end': 3449.629, 'text': "But you'll never end up with what we refer to with RNNs, a problem called vanishing gradients, where these gradients just die off,", 'start': 3442.224, 'duration': 7.405}, {'end': 3451.43, 'text': 'go to zero as you back propagate through.', 'start': 3449.629, 'duration': 1.801}, {'end': 3454.611, 'text': "And I'll show you an example concretely of why this happens in a bit.", 'start': 3451.71, 'duration': 2.901}, {'end': 3457.212, 'text': 'So in an RNN, we have this vanishing gradient problem.', 'start': 3455.251, 'duration': 1.961}, {'end': 3458.492, 'text': "I'll show you why that happens.", 'start': 3457.512, 'duration': 0.98}, {'end': 3460.173, 'text': 'In an LSTM.', 'start': 3458.912, 'duration': 1.261}, {'end': 3463.614, 'text': 'because of this superhighway of just additions,', 'start': 3460.173, 'duration': 3.441}, {'end': 3470.496, 'text': "these gradients at every single time step that we inject into the LSTM from above just flow through the cells and your gradients don't end up vanishing.", 'start': 3463.614, 'duration': 6.882}], 'summary': "Rnns face vanishing gradients, solved by lstm's superhighway 
of gradients.", 'duration': 28.272, 'max_score': 3442.224, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF83442224.jpg'}, {'end': 3817.359, 'src': 'embed', 'start': 3792.04, 'weight': 3, 'content': [{'end': 3797.424, 'text': 'And the backdrop turns out to actually be that you take your gradient signal and you multiply it by the WHH matrix.', 'start': 3792.04, 'duration': 5.384}, {'end': 3801.167, 'text': 'And so we end up multiplying by WHH.', 'start': 3798.145, 'duration': 3.022}, {'end': 3806.311, 'text': 'The gradient gets multiplied by WHH, then thresholded, then multiplied by WHH, thresholded.', 'start': 3801.347, 'duration': 4.964}, {'end': 3810.654, 'text': 'And so we end up multiplying by this matrix, WHH, 50 times.', 'start': 3806.551, 'duration': 4.103}, {'end': 3817.359, 'text': 'And so the issue with this is that the gradient signal basically, OK, two things can happen.', 'start': 3811.655, 'duration': 5.704}], 'summary': 'Gradient signal multiplied by whh matrix 50 times, causing issues.', 'duration': 25.319, 'max_score': 3792.04, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF83792040.jpg'}, {'end': 3919.394, 'src': 'embed', 'start': 3889.879, 'weight': 4, 'content': [{'end': 3895.641, 'text': 'But if your gradient is above 5 in norm, then you clamp it to 5 element-wise or something like that.', 'start': 3889.879, 'duration': 5.762}, {'end': 3896.721, 'text': 'So you can do that.', 'start': 3896.081, 'duration': 0.64}, {'end': 3897.642, 'text': "It's called gradient clipping.", 'start': 3896.761, 'duration': 0.881}, {'end': 3899.643, 'text': "That's how you address the exploding gradient problem.", 'start': 3897.762, 'duration': 1.881}, {'end': 3902.424, 'text': "And then your recurrentness don't explode anymore.", 'start': 3899.683, 'duration': 2.741}, {'end': 3905.305, 'text': 'But the gradients can still vanish in a recurrent neural network.', 'start': 3903.104, 'duration': 2.201}, {'end': 3913.23, 'text': 'And LSTM is very good with the vanishing gradient problem because of these highways of cells that are only changed with additive interactions where the gradients just flow.', 'start': 3905.646, 'duration': 7.584}, {'end': 3919.394, 'text': "They never die down because you're multiplying by the same matrix again and again or something like that.", 'start': 3913.57, 'duration': 5.824}], 'summary': 'Gradient clipping prevents exploding gradients, lstms address vanishing gradient problem.', 'duration': 29.515, 'max_score': 3889.879, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF83889879.jpg'}, {'end': 4122.733, 'src': 'embed', 'start': 4095.306, 'weight': 2, 'content': [{'end': 4098.228, 'text': 'And so you might want to use it, or you can use an LSTM.', 'start': 4095.306, 'duration': 2.922}, {'end': 4099.89, 'text': 'They both kind of do about the same.', 'start': 4098.689, 'duration': 1.201}, {'end': 4108.783, 'text': 'And so summary is that RNNs are very nice, but the raw RNN does not actually work very well.', 'start': 4101.258, 'duration': 7.525}, {'end': 4110.624, 'text': 'So use LSTMs or GRUs instead.', 'start': 4109.122, 'duration': 1.502}, {'end': 4117.149, 'text': "What's nice about them is that we're having these additive interactions that allow gradients to flow much better and you don't get a vanishing gradient problem.", 'start': 4110.725, 'duration': 6.424}, {'end': 4122.733, 'text': "We 
still have to worry a bit about the exploding gradient problem, so it's common to see people clip these gradients sometimes.", 'start': 4117.849, 'duration': 4.884}], 'summary': 'Lstms and grus are better than raw rnns due to improved gradient flow and avoiding vanishing gradient problem.', 'duration': 27.427, 'max_score': 4095.306, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF84095306.jpg'}], 'start': 3278.252, 'title': 'Rnn and lstm architectural differences', 'summary': 'Explains the differences between lstm and rnn architectures, focusing on the prevention of vanishing gradients in lstm, and discussing the issues of vanishing and exploding gradients in recurrent neural networks, with an emphasis on the impact of repeated multiplication by the whh matrix on the gradient signal.', 'chapters': [{'end': 3508.233, 'start': 3278.252, 'title': 'Lstm vs rnn architectures', 'summary': 'Explains the architectural differences between lstm and rnn, highlighting the additive interactions in lstm that prevent vanishing gradients, as compared to rnn.', 'duration': 229.981, 'highlights': ['The LSTM architecture involves additive interactions and forget gates, which prevent vanishing gradients in contrast to RNN.', 'The comparison is drawn between LSTM and ResNets, highlighting the similarity in their additive interactions and their effectiveness in backpropagation.', "The LSTM's ability to prevent vanishing gradients is explained through the concept of a 'gradient superhighway' due to the additive interactions.", 'The significance of the O vector in LSTM is discussed, emphasizing its less critical importance and the flexibility to modify the architecture.', 'The potential of adding peephole connections to the LSTM architecture is mentioned, showcasing the adaptability of its design.']}, {'end': 3852.554, 'start': 3508.753, 'title': 'Vanishing gradients in recurrent neural networks', 'summary': 'Discusses the issue of vanishing gradients in recurrent neural networks, particularly with respect to lstms, demonstrating the problem of gradients dying off in rnns and the impact of repeated multiplication by the whh matrix on the gradient signal.', 'duration': 343.801, 'highlights': ['The problem of vanishing gradients in recurrent neural networks is demonstrated, showing that in RNNs, the gradient instantly dies off, leading to the inability to learn long dependencies. In the demonstration, it is shown that the gradient in RNNs vanishes after about 8-10 time steps, leading to the inability to learn long dependencies due to the correlation structure dying down.', 'The impact of repeated multiplication by the WHH matrix on the gradient signal is explained, illustrating the potential for the gradient signal to either die or explode due to the multiplication, and the implications of this on the learning process. It is explained that the repeated multiplication by the WHH matrix, occurring 50 times in the demonstration, can lead to the gradient signal either dying or exploding, highlighting the potential challenges in learning due to this repeated multiplication.', "The architecture of recurrent neural networks and the challenges associated with learning long dependencies are discussed, with the demonstration of the vanishing gradients problem and its impact on the network's ability to retain gradient information. 
The discussion includes the challenges in learning long dependencies due to vanishing gradients, showcasing the impact on the network's ability to retain gradient information and highlighting the difficulties in learning with recurrent neural networks."]}, {'end': 4187.548, 'start': 3852.574, 'title': 'Issues with gradient in recurrent neural networks', 'summary': 'Discusses the challenges with gradient dynamics in recurrent neural networks, highlighting issues such as exploding and vanishing gradients, and the effectiveness of using lstms or grus to address these problems.', 'duration': 334.974, 'highlights': ['LSTMs and GRUs are recommended over raw RNNs due to better gradient flow and avoidance of vanishing gradient problem. LSTMs and GRUs are recommended over raw RNNs due to better gradient flow and avoidance of vanishing gradient problem.', 'Exploding gradients in RNNs can be controlled by gradient clipping, where gradients above a certain norm are clamped to a specific value. Exploding gradients in RNNs can be controlled by gradient clipping, where gradients above a certain norm are clamped to a specific value.', 'LSTMs are effective in handling vanishing gradient problem due to additive interactions, while gradient clipping is still commonly utilized to address potential exploding gradients. LSTMs are effective in handling vanishing gradient problem due to additive interactions, while gradient clipping is still commonly utilized to address potential exploding gradients.']}], 'duration': 909.296, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/yCC09vCHzF8/pics/yCC09vCHzF83278252.jpg', 'highlights': ['The LSTM architecture involves additive interactions and forget gates, preventing vanishing gradients in contrast to RNN.', 'The problem of vanishing gradients in recurrent neural networks is demonstrated, showing that in RNNs, the gradient instantly dies off, leading to the inability to learn long dependencies.', 'LSTMs and GRUs are recommended over raw RNNs due to better gradient flow and avoidance of vanishing gradient problem.', 'The impact of repeated multiplication by the WHH matrix on the gradient signal is explained, illustrating the potential for the gradient signal to either die or explode due to the multiplication.', 'Exploding gradients in RNNs can be controlled by gradient clipping, where gradients above a certain norm are clamped to a specific value.']}], 'highlights': ['The RNN can learn to generate mathematics from LaTeX source files for algebraic geometry and code from a 700 megabyte C codebase, creating proofs, lemmas, and function declarations. (Relevance: 5)', 'The RNN is able to create function declarations, know about inputs, variables, and how to use them, and even invent its own bogus comments. (Relevance: 4)', 'The RNN can recite the GNU GPU license character by character and understands the structure of include files, macros, and code, learned from data. (Relevance: 3)', 'The RNN is a three-layer LSTM, demonstrating its complexity and capability to track patterns within text data, such as quote detection and line tracking. (Relevance: 2)', 'The RNN was trained on a sequence length of 100, but it was able to generalize properly to longer sequences, indicating its ability to learn and apply character-level detection on sequences longer than the training length. 
(Relevance: 1)', "The RNN can utilize soft attention to reference parts of the image while generating words, resulting in a 14 by 14 probability map for selective attention, enhancing the model's descriptive capabilities.", 'Transitioning from basic RNN formulas to LSTM models offers enhanced complexity and improved performance, as LSTM models provide a more complex recurrence formula for updating the hidden state.', 'Stacking RNNs in layers can lead to improved performance, as deeper models are often more effective in processing and analyzing data.', 'The LSTM architecture involves additive interactions and forget gates, preventing vanishing gradients in contrast to RNNs.', 'LSTMs and GRUs are recommended over raw RNNs due to better gradient flow and avoidance of the vanishing gradient problem.']}
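
The constructions walked through in the lecture segments above can be made concrete with a few short numpy sketches. First, the image-conditioned first step of the captioning RNN: the usual recurrence Wxh·x + Whh·h with an extra Wih·v term that injects the top-of-ConvNet feature at time zero. This is only a minimal sketch; the 300-dimensional start vector matches the lecture, but the 4096-dimensional image feature, the hidden size of 512, the vocabulary size, and all weight names are illustrative assumptions, not the values of the actual model.

```python
import numpy as np

np.random.seed(0)

# Illustrative sizes: 300-d start/word vectors as in the lecture; the CNN
# feature size, hidden size, and vocabulary size below are assumptions.
D_word, D_img, H, V = 300, 4096, 512, 10000

Wxh = np.random.randn(H, D_word) * 0.01   # input-to-hidden
Whh = np.random.randn(H, H) * 0.01        # hidden-to-hidden
Wih = np.random.randn(H, D_img) * 0.01    # image-to-hidden, used at t = 0 here
Why = np.random.randn(V, H) * 0.01        # hidden-to-vocabulary scores

x_start = np.random.randn(D_word)   # special <START> vector (learned in practice)
v = np.random.randn(D_img)          # stand-in for the top-of-ConvNet feature
h = np.zeros(H)                     # initial hidden state is zero

# First time step: the vanilla recurrence plus the image term Wih @ v.
h = np.tanh(Wxh @ x_start + Whh @ h + Wih @ v)

# y0: unnormalized scores over the first word; softmax gives the distribution.
y0 = Why @ h
p0 = np.exp(y0 - y0.max()); p0 /= p0.sum()
first_word = np.random.choice(V, p=p0)    # sample the first word of the caption
```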
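
The generation loop described next (sample a word, look up its learned 300-dimensional embedding, feed it back in, repeat until an end token) can be sketched the same way. The tiny vocabulary, the index chosen for the <END> token, and the hard cap on caption length are assumptions made purely so the example runs.

```python
import numpy as np

np.random.seed(1)
H, D, V = 64, 300, 50          # hidden size, embedding size, toy vocabulary
END = 0                        # assumed index of the special <END> token

W_embed = np.random.randn(V, D) * 0.01   # one learned 300-d vector per word
Wxh = np.random.randn(H, D) * 0.01
Whh = np.random.randn(H, H) * 0.01
Why = np.random.randn(V, H) * 0.01

def step(x, h):
    """One vanilla-RNN step followed by a softmax over the vocabulary."""
    h = np.tanh(Wxh @ x + Whh @ h)
    s = Why @ h
    p = np.exp(s - s.max()); p /= p.sum()
    return h, p

h = np.random.randn(H) * 0.01      # pretend this was conditioned by the image
x = np.random.randn(D)             # <START> vector
caption = []
for _ in range(20):                # hard cap on caption length (assumption)
    h, p = step(x, h)
    w = np.random.choice(V, p=p)   # sample the next word from the distribution
    if w == END:
        break
    caption.append(w)
    x = W_embed[w]                 # plug the sampled word's embedding back in
print(caption)                     # a list of word indices
```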
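
The soft-attention idea (a 14 x 14 x 512 activation volume from the ConvNet, a 512-dimensional "where to look" vector emitted by the RNN at each step, and a softmax over the 14 x 14 locations) reduces to a dot product and a normalization. This is a sketch of that lookup only, with random arrays standing in for the real activations; the weighted-average context vector at the end is one common way the attended feature is fed back, stated here as an assumption rather than the exact recipe in the paper the lecture references.

```python
import numpy as np

np.random.seed(5)
Hc, Wc, D = 14, 14, 512                 # spatial grid and feature depth
features = np.random.randn(Hc, Wc, D)   # stand-in for the 14 x 14 x 512 volume
query = np.random.randn(D)              # 512-d vector emitted by the RNN

scores = features.reshape(-1, D) @ query   # one score per spatial location
attn = np.exp(scores - scores.max())
attn /= attn.sum()                         # 196 probabilities summing to 1
attn_map = attn.reshape(Hc, Wc)            # soft attention map over the image

# One way to use it: feed back the attention-weighted average feature.
context = attn @ features.reshape(-1, D)   # 512-d context vector
```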
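
Stacking recurrent networks, where the input to layer k at time t is the hidden state of layer k-1 at the same time step and each layer keeps its own weights, is also only a few lines. The sizes below are arbitrary toy values.

```python
import numpy as np

np.random.seed(2)
T, D, H, L = 5, 16, 32, 3   # time steps, input size, hidden size, 3 stacked RNNs

# Layer 0 sees the data; layer k > 0 sees the hidden states of layer k - 1.
Wxh = [np.random.randn(H, D if k == 0 else H) * 0.1 for k in range(L)]
Whh = [np.random.randn(H, H) * 0.1 for k in range(L)]

xs = [np.random.randn(D) for _ in range(T)]      # toy input sequence
h = [np.zeros(H) for _ in range(L)]              # one hidden state per layer

for t in range(T):
    inp = xs[t]
    for k in range(L):
        h[k] = np.tanh(Wxh[k] @ inp + Whh[k] @ h[k])
        inp = h[k]          # this layer's hidden state feeds the layer above
# h[-1] is the top layer's hidden state at the final time step
```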
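
The LSTM step itself, as laid out in the lecture (concatenate the previous hidden state and the input, each of size n; multiply by a 4n x 2n matrix to get i, f, o, g; pass i, f, o through sigmoids and g through tanh; update the cell additively and gate the squashed cell with o), looks like this. Biases are omitted for brevity, which is a simplification rather than what a real implementation would do.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM step.

    x, h_prev, c_prev are n-dimensional; W is (4n x 2n) and maps the
    concatenated [h_prev; x] to the four gate pre-activations i, f, o, g.
    """
    n = x.shape[0]
    z = W @ np.concatenate([h_prev, x])        # 4n pre-activations
    i = sigmoid(z[0 * n:1 * n])                # input gate,  in (0, 1)
    f = sigmoid(z[1 * n:2 * n])                # forget gate, in (0, 1)
    o = sigmoid(z[2 * n:3 * n])                # output gate, in (0, 1)
    g = np.tanh(z[3 * n:4 * n])                # candidate,   in (-1, 1)
    c = f * c_prev + i * g                     # reset and/or add: additive cell update
    h = o * np.tanh(c)                         # squashed cell, gated by o
    return h, c

np.random.seed(3)
n = 8
W = np.random.randn(4 * n, 2 * n) * 0.1
h = c = np.zeros(n)
x = np.random.randn(n)
h, c = lstm_step(x, h, c, W)
```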
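
The GRU mentioned at the end as an alternative to the LSTM can be sketched in the same style. This is the standard GRU formulation (update gate, reset gate, interpolated hidden state), not something spelled out in the lecture, and biases are again omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wz, Wr, Wh):
    """One GRU step (standard formulation; biases omitted for brevity)."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(Wz @ hx)                       # update gate
    r = sigmoid(Wr @ hx)                       # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x]))
    return (1.0 - z) * h_prev + z * h_tilde    # interpolate old and new state

np.random.seed(6)
n = 8
Wz, Wr, Wh = (np.random.randn(n, 2 * n) * 0.1 for _ in range(3))
h = np.zeros(n)
x = np.random.randn(n)
h = gru_step(x, h, Wz, Wr, Wh)
```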
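
The vanishing/exploding-gradient argument (backpropagation through a vanilla RNN multiplies the gradient signal by Whh at every one of the, say, 50 time steps) can be reproduced numerically. The matrix scale below is an arbitrary choice made to show the vanishing case; rescaling Whh so its spectral radius exceeds 1 makes the same loop blow up instead.

```python
import numpy as np

np.random.seed(4)
H, T = 50, 50                                    # hidden size, time steps
Whh = np.random.randn(H, H) * (0.5 / np.sqrt(H)) # spectral radius roughly 0.5

grad = np.random.randn(H)        # gradient arriving at the last time step
norms = [np.linalg.norm(grad)]
for t in range(T):
    grad = Whh.T @ grad          # backprop through h_t = f(Whh @ h_{t-1} + ...)
    norms.append(np.linalg.norm(grad))

print(norms[0], norms[-1])
# With this scale the norm shrinks toward zero (vanishing gradients);
# multiplying Whh by, say, 3 makes the norm grow geometrically (exploding).
```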
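
Finally, the gradient-clipping fix for exploding gradients. The lecture mentions clamping once the gradient exceeds a norm of 5; the sketch below rescales the whole gradient to that norm, which is one common variant (element-wise clamping, also mentioned in the lecture, is another).

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale the gradient so its norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.random.randn(1000) * 3.0
print(np.linalg.norm(g), np.linalg.norm(clip_gradient(g)))   # second value <= 5
```

Clipping only caps how large an update step can get; it does nothing for vanishing gradients, which is why the LSTM/GRU cell designs above are still needed.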