title
MIT 6.S191 (2019): Recurrent Neural Networks
description
MIT Introduction to Deep Learning 6.S191: Lecture 2
Deep Sequence Modeling with Recurrent Neural Networks
Lecturer: Ava Soleimany
January 2019
For all lectures, slides and lab materials: http://introtodeeplearning.com
detail
{'title': 'MIT 6.S191 (2019): Recurrent Neural Networks', 'heatmap': [{'end': 760.965, 'start': 655.868, 'weight': 0.817}, {'end': 906.703, 'start': 876.701, 'weight': 0.707}, {'end': 991.325, 'start': 941.941, 'weight': 0.719}, {'end': 1250.827, 'start': 1203.62, 'weight': 0.898}, {'end': 1361.058, 'start': 1288.5, 'weight': 0.712}, {'end': 1478.959, 'start': 1421.521, 'weight': 1}, {'end': 1633.194, 'start': 1557.784, 'weight': 0.873}], 'summary': 'Discusses the application of neural networks to sequential data, highlighting the limitations of standard feedforward networks and introducing recurrent neural networks (rnns) as a solution, emphasizing their design criteria and internal state update mechanism. it also addresses challenges in rnn training and explores the significance of lstms in tracking long-term dependencies, culminating in rnn applications in music generation, sentiment analysis, and machine translation.', 'chapters': [{'end': 165.841, 'segs': [{'end': 37.374, 'src': 'embed', 'start': 16.568, 'weight': 1, 'content': [{'end': 31.179, 'text': "And now we're going to turn our attention to applying neural networks to problems which involve sequential processing of data and why these sorts of tasks require a different type of network architecture from what we've seen so far.", 'start': 16.568, 'duration': 14.611}, {'end': 37.374, 'text': "So before we dive in, I'd like to start off with a really simple example.", 'start': 32.85, 'duration': 4.524}], 'summary': 'Neural networks for sequential data processing explained.', 'duration': 20.806, 'max_score': 16.568, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk16568.jpg'}, {'end': 143.221, 'src': 'embed', 'start': 63.142, 'weight': 0, 'content': [{'end': 69.287, 'text': 'And I think we can all agree that we have a pretty clear sense of where the ball is going to next.', 'start': 63.142, 'duration': 6.145}, {'end': 73.73, 'text': 'So this is a really, really simple sequence modeling problem.', 'start': 70.588, 'duration': 3.142}, {'end': 81.336, 'text': "Given this image, this thought experiment of a ball's travel through space, can we predict where it's going to go next?", 'start': 74.17, 'duration': 7.166}, {'end': 86.854, 'text': 'But in reality, the truth is that sequential data is all around us.', 'start': 82.79, 'duration': 4.064}, {'end': 97.643, 'text': 'For example, audio can be split up into a sequence of sound waves, while text can be split up into a sequence of either characters or words.', 'start': 87.534, 'duration': 10.109}, {'end': 105.81, 'text': 'And beyond these two ubiquitous examples, there are many more cases in which sequential processing may be useful.', 'start': 99.064, 'duration': 6.746}, {'end': 114.537, 'text': 'from analysis of medical signals like EKGs to predicting stock trends to processing genomic data.', 'start': 106.553, 'duration': 7.984}, {'end': 120.26, 'text': "And now that we've gotten a sense of what sequential data looks like,", 'start': 115.658, 'duration': 4.602}, {'end': 130.166, 'text': "I want to turn our attention to another simple problem to motivate the types of networks that we're going to use for this task.", 'start': 120.26, 'duration': 9.906}, {'end': 140.198, 'text': "And in this case, suppose we have a language model where we're trying to train a neural network to predict the next word in a phrase or a sentence.", 'start': 131.451, 'duration': 8.747}, {'end': 143.221, 'text': 'And suppose we have this sentence.', 
'start': 141.82, 'duration': 1.401}], 'summary': 'Analyzing sequential data for predicting future trends in various fields.', 'duration': 80.079, 'max_score': 63.142, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk63142.jpg'}], 'start': 3.459, 'title': 'Deep sequence modeling', 'summary': 'Discusses the application of neural networks to sequential data, with examples like predicting the next word and ball trajectory, emphasizing the ubiquity of sequential data in various domains.', 'chapters': [{'end': 165.841, 'start': 3.459, 'title': 'Deep sequence modeling', 'summary': 'Discusses the application of neural networks to sequential data, using examples like predicting the next word in a sentence and the trajectory of a ball, and highlights the ubiquity of sequential data in various domains.', 'duration': 162.382, 'highlights': ['The ubiquity of sequential data in various domains, such as audio, text, medical signals, stock trends, and genomic data, is emphasized.', 'The example of predicting the trajectory of a ball using its previous locations to illustrate the concept of sequence modeling is presented.', 'The use of neural networks to predict the next word in a sentence as an example of sequential data processing is discussed.', 'The need for a different type of network architecture for problems involving sequential processing of data is highlighted, building on the essentials of neural networks and feedforward models.', 'The importance of considering the history of sequential data is emphasized through the example of predicting the trajectory of a ball.']}], 'duration': 162.382, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk3459.jpg', 'highlights': ['The ubiquity of sequential data in various domains, such as audio, text, medical signals, stock trends, and genomic data, is emphasized.', 'The need for a different type of network architecture for problems involving sequential processing of data is highlighted, building on the essentials of neural networks and feedforward models.', 'The importance of considering the history of sequential data is emphasized through the example of predicting the trajectory of a ball.', 'The example of predicting the trajectory of a ball using its previous locations to illustrate the concept of sequence modeling is presented.', 'The use of neural networks to predict the next word in a sentence as an example of sequential data processing is discussed.']}, {'end': 422.869, 'segs': [{'end': 192.987, 'src': 'embed', 'start': 165.841, 'weight': 1, 'content': [{'end': 169.147, 'text': 'like a feedforward network from our first lecture to do this.', 'start': 165.841, 'duration': 3.306}, {'end': 178.762, 'text': "And one problem that we're immediately going to run into is that our feedforward network can only take a fixed length vector as its input.", 'start': 170.779, 'duration': 7.983}, {'end': 183.204, 'text': 'And we have to specify the size of this input right at the start.', 'start': 179.602, 'duration': 3.602}, {'end': 192.987, 'text': "And you can imagine that this is going to be a problem for our task in general, because sometimes we'll have seen five words, sometimes seven words,", 'start': 184.044, 'duration': 8.943}], 'summary': 'Feedforward network requires fixed length input, posing a challenge for variable length tasks.', 'duration': 27.146, 'max_score': 165.841, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk165841.jpg'}, {'end': 228.977, 'src': 'embed', 'start': 202.513, 'weight': 0, 'content': [{'end': 212.22, 'text': 'And one way we can do this is to use this idea of a fixed window to force our input vector to be a certain length, in this case, 2.', 'start': 202.513, 'duration': 9.707}, {'end': 219.325, 'text': "And this means that no matter where we're trying to make our prediction, we just take the previous two words and try to predict the next word.", 'start': 212.22, 'duration': 7.105}, {'end': 228.977, 'text': 'And we can represent these two words as a fixed length vector where we take a larger vector, allocate some space for the first word,', 'start': 220.97, 'duration': 8.007}], 'summary': 'Using fixed window to predict next word, based on previous two words', 'duration': 26.464, 'max_score': 202.513, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk202513.jpg'}, {'end': 308.881, 'src': 'embed', 'start': 285.802, 'weight': 4, 'content': [{'end': 293.909, 'text': "And this representation is what's called a bag of words, where we have some vector, and each slot in this vector represents a word.", 'start': 285.802, 'duration': 8.107}, {'end': 300.494, 'text': "And the value that's in that slot represents the number of times that that word appears in the sentence.", 'start': 294.449, 'duration': 6.045}, {'end': 308.881, 'text': 'And so we have a fixed length vector over some vocabulary of words, regardless of the length of the input sentence,', 'start': 301.895, 'duration': 6.986}], 'summary': 'Bag of words representation with fixed length vector and word frequency count.', 'duration': 23.079, 'max_score': 285.802, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk285802.jpg'}, {'end': 352.35, 'src': 'embed', 'start': 327.923, 'weight': 3, 'content': [{'end': 335.805, 'text': 'And so for example, these two sentences with completely opposite semantic meanings would have the exact same bag of words representation.', 'start': 327.923, 'duration': 7.882}, {'end': 338.386, 'text': 'Same words, same counts.', 'start': 337.006, 'duration': 1.38}, {'end': 340.807, 'text': "So obviously, this isn't going to work.", 'start': 339.006, 'duration': 1.801}, {'end': 350.33, 'text': 'Another idea could be to simply extend our first idea of a fixed window, thinking that by looking at more words,', 'start': 342.168, 'duration': 8.162}, {'end': 352.35, 'text': 'we can get most of the context we need.', 'start': 350.33, 'duration': 2.02}], 'summary': 'Bag of words representation fails due to same words and counts. 
idea of fixed window extension explored.', 'duration': 24.427, 'max_score': 327.923, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk327923.jpg'}, {'end': 427.952, 'src': 'embed', 'start': 400.363, 'weight': 2, 'content': [{'end': 405.847, 'text': 'because the parameters that see the end of the vector have never seen that phrase before.', 'start': 400.363, 'duration': 5.484}, {'end': 410.029, 'text': "And the parameters from the beginning haven't been shared across the sequence.", 'start': 406.327, 'duration': 3.702}, {'end': 413.731, 'text': 'And so at a higher level.', 'start': 411.81, 'duration': 1.921}, {'end': 422.869, 'text': 'what this means is that what we learn about the sequence at one point is not going to transfer anywhere to anywhere else in the sequence if we use this representation.', 'start': 413.731, 'duration': 9.138}, {'end': 427.952, 'text': "And so hopefully by walking through this I've motivated that.", 'start': 424.05, 'duration': 3.902}], 'summary': 'Parameters at end and beginning not shared, hindering information transfer across sequence.', 'duration': 27.589, 'max_score': 400.363, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk400363.jpg'}], 'start': 165.841, 'title': 'Sequence representation challenges', 'summary': 'Discusses the limitations of feedforward networks in handling variable length inputs and proposes the use of a fixed window approach, as well as the challenges in representing sequence information for language modeling, including the limitations of fixed-length vectors and the loss of sequence information in bag-of-words representation.', 'chapters': [{'end': 239.501, 'start': 165.841, 'title': 'Handling variable length input', 'summary': 'Discusses the limitation of feedforward networks in handling variable length inputs and proposes the use of fixed window approach to address this issue, by allocating space for the previous two words to predict the next word.', 'duration': 73.66, 'highlights': ['Our feedforward network can only take a fixed length vector as its input, creating a challenge for tasks with variable input lengths.', 'The fixed window approach involves representing the previous two words as a fixed length vector to predict the next word, addressing the challenge of variable input lengths.']}, {'end': 422.869, 'start': 239.501, 'title': 'Challenges in sequence representation', 'summary': 'Discusses the challenges in representing sequence information for language modeling, including the limitations of fixed-length vectors and the loss of sequence information in bag-of-words representation, leading to the need for a more effective approach.', 'duration': 183.368, 'highlights': ['Bag-of-words representation loses sequence information and may result in the same representation for sentences with different meanings.', 'Inability of fixed window approach to effectively capture context from longer sequences, leading to limitations in representing sequence information.', 'Challenge in transferring learned information about the sequence across different parts of the input, hindering the effective representation of sequence information.']}], 'duration': 257.028, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk165841.jpg', 'highlights': ['The fixed window approach involves representing the previous two words as a fixed length vector to predict the next word, addressing the 
challenge of variable input lengths.', 'Our feedforward network can only take a fixed length vector as its input, creating a challenge for tasks with variable input lengths.', 'Challenge in transferring learned information about the sequence across different parts of the input, hindering the effective representation of sequence information.', 'Inability of fixed window approach to effectively capture context from longer sequences, leading to limitations in representing sequence information.', 'Bag-of-words representation loses sequence information and may result in the same representation for sentences with different meanings.']}, {'end': 770.212, 'segs': [{'end': 472.04, 'src': 'embed', 'start': 448.143, 'weight': 1, 'content': [{'end': 459.051, 'text': 'Specifically, our network needs to be able to handle variable length sequences, be able to track long-term dependencies in the data,', 'start': 448.143, 'duration': 10.908}, {'end': 466.296, 'text': 'maintain information about the sequence order and share the parameters it learns across the entirety of the sequence.', 'start': 459.051, 'duration': 7.245}, {'end': 472.04, 'text': "And today we're going to talk about using recurrent neural networks, or RNNs,", 'start': 467.216, 'duration': 4.824}], 'summary': 'Discussing the use of recurrent neural networks for handling variable length sequences and long-term dependencies.', 'duration': 23.897, 'max_score': 448.143, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk448143.jpg'}, {'end': 530.554, 'src': 'embed', 'start': 499.368, 'weight': 0, 'content': [{'end': 503.969, 'text': "And we already motivated why a network like this can't really handle sequential data.", 'start': 499.368, 'duration': 4.601}, {'end': 512.706, 'text': 'RNNs, in contrast, are really well-suited for handling cases where we have a sequence of inputs rather than a single input.', 'start': 505.763, 'duration': 6.943}, {'end': 520.489, 'text': "And they're great for problems like this one, in which a sequence of data is propagated through the model to give a single output.", 'start': 513.285, 'duration': 7.204}, {'end': 521.71, 'text': 'For example,', 'start': 521.13, 'duration': 0.58}, {'end': 530.554, 'text': "you can imagine training a model that takes as input a sequence of words and outputs a sentiment that's associated with that phrase or that sentence.", 'start': 521.71, 'duration': 8.844}], 'summary': 'Rnns are suited for handling sequential data, such as sentiment analysis for a sequence of words.', 'duration': 31.186, 'max_score': 499.368, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk499368.jpg'}, {'end': 644.058, 'src': 'embed', 'start': 572.608, 'weight': 2, 'content': [{'end': 581.133, 'text': 'As I mentioned before, to reiterate our standard vanilla feedforward neural network, we are going from input to output in one direction.', 'start': 572.608, 'duration': 8.525}, {'end': 586.417, 'text': "And this fundamentally can't maintain information about sequential data.", 'start': 581.814, 'duration': 4.603}, {'end': 597.271, 'text': 'RNNs, on the other hand, are networks where they have these loops in them which allow for information to persist.', 'start': 588.34, 'duration': 8.931}, {'end': 607.458, 'text': 'So, in this diagram, our RNN takes as input this vector x of t outputs a value like a prediction y hat of t,', 'start': 598.012, 'duration': 9.446}, {'end': 613.863, 'text': 'but also makes 
this computation to update an internal state, which we call h of t,', 'start': 607.458, 'duration': 6.405}, {'end': 619.387, 'text': 'and then passes this information about its state from this step of the network to the next.', 'start': 613.863, 'duration': 5.524}, {'end': 628.285, 'text': 'And we call these networks with loops in them recurrent because information is being passed internally from one time step to the next.', 'start': 621.075, 'duration': 7.21}, {'end': 638.673, 'text': "So what's going on under the hood? How is information being passed? RNNs use a simple recurrence relation in order to process sequential data.", 'start': 629.026, 'duration': 9.647}, {'end': 644.058, 'text': 'Specifically, they maintain this internal state, h of t.', 'start': 639.474, 'duration': 4.584}], 'summary': 'Rnns process sequential data by maintaining internal state, h of t.', 'duration': 71.45, 'max_score': 572.608, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk572608.jpg'}, {'end': 694.02, 'src': 'embed', 'start': 667.791, 'weight': 4, 'content': [{'end': 675.213, 'text': "And this addresses that important design criteria from earlier of why it's useful to share parameters in the context of sequence modeling.", 'start': 667.791, 'duration': 7.422}, {'end': 682.644, 'text': 'To be more specific, the RNN computation includes both a state update as well as the output.', 'start': 677.096, 'duration': 5.548}, {'end': 688.231, 'text': 'So given our input vector, we apply some function to update the hidden state.', 'start': 683.425, 'duration': 4.806}, {'end': 694.02, 'text': 'And as we saw in the first lecture, this function is a standard neural net operation.', 'start': 688.892, 'duration': 5.128}], 'summary': 'Sharing parameters in sequence modeling improves rnn computation efficiency and utilizes standard neural net operations.', 'duration': 26.229, 'max_score': 667.791, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk667791.jpg'}, {'end': 784.293, 'src': 'heatmap', 'start': 655.868, 'weight': 6, 'content': [{'end': 657.51, 'text': 'and the current input x of t.', 'start': 655.868, 'duration': 1.642}, {'end': 667.25, 'text': 'And the important thing to know here is that the same function and the same set of parameters are used at every time step.', 'start': 660.168, 'duration': 7.082}, {'end': 675.213, 'text': "And this addresses that important design criteria from earlier of why it's useful to share parameters in the context of sequence modeling.", 'start': 667.791, 'duration': 7.422}, {'end': 682.644, 'text': 'To be more specific, the RNN computation includes both a state update as well as the output.', 'start': 677.096, 'duration': 5.548}, {'end': 688.231, 'text': 'So given our input vector, we apply some function to update the hidden state.', 'start': 683.425, 'duration': 4.806}, {'end': 694.02, 'text': 'And as we saw in the first lecture, this function is a standard neural net operation.', 'start': 688.892, 'duration': 5.128}, {'end': 699.865, 'text': 'that consists of multiplication by a weight matrix and applying a nonlinearity.', 'start': 694.46, 'duration': 5.405}, {'end': 709.733, 'text': 'But in this case, since we both have the input vector x of t as well as the previous state h of t minus 1, as inputs to our function,', 'start': 700.485, 'duration': 9.248}, {'end': 711.234, 'text': 'we have two weight matrices.', 'start': 709.733, 'duration': 1.501}, {'end': 715.238, 
'text': 'And we can then apply our nonlinearity to the sum of these two terms.', 'start': 711.815, 'duration': 3.423}, {'end': 720.867, 'text': 'Finally, we generate an output at a given time step,', 'start': 716.906, 'duration': 3.961}, {'end': 728.348, 'text': 'which is a transformed version of our internal state that falls from a multiplication by a separate weight matrix.', 'start': 720.867, 'duration': 7.481}, {'end': 736.469, 'text': "So, so far we've seen RNNs as depicted as having these loops that feed back in on themselves.", 'start': 730.228, 'duration': 6.241}, {'end': 742.851, 'text': 'Another way of thinking about the RNN can be in terms of unrolling this loop across time.', 'start': 737.39, 'duration': 5.461}, {'end': 754.459, 'text': 'And if we do this, We can think of the RNN as multiple copies of the same network, where each copy is passing a message onto its descendant.', 'start': 744.471, 'duration': 9.988}, {'end': 760.965, 'text': 'And continuing this scheme throughout time.', 'start': 755.58, 'duration': 5.385}, {'end': 764.588, 'text': 'you can easily see that RNNs have this chain-like structure,', 'start': 760.965, 'duration': 3.623}, {'end': 770.212, 'text': "which really highlights how and why they're so well-suited for processing sequential data.", 'start': 764.588, 'duration': 5.624}, {'end': 784.293, 'text': 'So, in this representation we can make our weight matrices explicit, beginning with the weights that transform the inputs to the hidden state,', 'start': 771.433, 'duration': 12.86}], 'summary': 'Rnn uses shared parameters for state update and output, with chain-like structure suited for sequential data.', 'duration': 128.425, 'max_score': 655.868, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk655868.jpg'}], 'start': 424.05, 'title': 'Recurrent neural networks', 'summary': 'Introduces the limitations of feedforward neural networks for handling sequential data and highlights the design criteria for sequence modeling, leading to the discussion on using recurrent neural networks (rnns) as a framework for sequential processing and sequence modeling problems. it also discusses the limitations of standard feedforward neural networks in processing sequential data and introduces recurrent neural networks (rnns) as a solution, highlighting the internal state update mechanism and the use of recurrence relation to process sequential data. 
additionally, it explains the working of recurrent neural networks (rnns), emphasizing the use of shared parameters for updating states, the computation of output, and the chain-like structure, making them suitable for processing sequential data.', 'chapters': [{'end': 571.787, 'start': 424.05, 'title': 'Introduction to rnns', 'summary': 'Introduces the limitations of feedforward neural networks for handling sequential data and highlights the design criteria for sequence modeling, leading to the discussion on using recurrent neural networks (rnns) as a framework for sequential processing and sequence modeling problems.', 'duration': 147.737, 'highlights': ['RNNs are well-suited for handling sequential data, such as sequences of inputs, and are ideal for problems like sentiment analysis and text/music generation.', 'Design criteria for sequence modeling includes handling variable length sequences, tracking long-term dependencies, maintaining sequence order, and sharing parameters across the entirety of the sequence.', 'RNNs are fundamentally different from traditional feedforward neural networks, as they allow data to propagate bidirectionally and are capable of processing sequences of inputs rather than just a single input.']}, {'end': 644.058, 'start': 572.608, 'title': 'Understanding recurrent neural networks', 'summary': 'Discusses the limitations of standard feedforward neural networks in processing sequential data and introduces recurrent neural networks (rnns) as a solution, highlighting the internal state update mechanism and the use of recurrence relation to process sequential data.', 'duration': 71.45, 'highlights': ['RNNs use loops to allow information to persist internally, enabling them to process sequential data effectively.', "RNNs update an internal state 'h of t' based on input 'x of t' and pass this information to the next time step, facilitating the retention and utilization of sequential information.", 'Standard feedforward neural networks are unable to maintain information about sequential data, while RNNs use a simple recurrence relation to process sequential data by updating and passing an internal state.']}, {'end': 770.212, 'start': 644.058, 'title': 'Recurrent neural networks', 'summary': 'Explains the working of recurrent neural networks (rnns), emphasizing the use of shared parameters for updating states, the computation of output, and the chain-like structure, making them suitable for processing sequential data.', 'duration': 126.154, 'highlights': ['RNN computation includes both a state update and the output', 'Use of shared parameters in RNN for processing sequential data', 'RNN can be thought of as multiple copies of the same network passing messages across time']}], 'duration': 346.162, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk424050.jpg', 'highlights': ['RNNs are well-suited for handling sequential data, such as sequences of inputs, and are ideal for problems like sentiment analysis and text/music generation.', 'Design criteria for sequence modeling includes handling variable length sequences, tracking long-term dependencies, maintaining sequence order, and sharing parameters across the entirety of the sequence.', 'RNNs use loops to allow information to persist internally, enabling them to process sequential data effectively.', "RNNs update an internal state 'h of t' based on input 'x of t' and pass this information to the next time step, facilitating the retention and utilization of sequential 
information.", 'RNN computation includes both a state update and the output', 'Use of shared parameters in RNN for processing sequential data', 'RNN can be thought of as multiple copies of the same network passing messages across time', 'Standard feedforward neural networks are unable to maintain information about sequential data, while RNNs use a simple recurrence relation to process sequential data by updating and passing an internal state.', 'RNNs are fundamentally different from traditional feedforward neural networks, as they allow data to propagate bidirectionally and are capable of processing sequences of inputs rather than just a single input.']}, {'end': 1158.845, 'segs': [{'end': 821.472, 'src': 'embed', 'start': 771.433, 'weight': 0, 'content': [{'end': 784.293, 'text': 'So, in this representation we can make our weight matrices explicit, beginning with the weights that transform the inputs to the hidden state,', 'start': 771.433, 'duration': 12.86}, {'end': 791.68, 'text': 'transform the previous hidden state to the next hidden state and finally transform the hidden state to the output.', 'start': 784.293, 'duration': 7.387}, {'end': 797.926, 'text': "And it's important, once again, to note that we are using the same weight matrices at every time step.", 'start': 793.061, 'duration': 4.865}, {'end': 803.117, 'text': 'And from these outputs, we can compute a loss at each time step.', 'start': 799.554, 'duration': 3.563}, {'end': 807.16, 'text': 'And this completes what is called our forward pass through the network.', 'start': 803.477, 'duration': 3.683}, {'end': 814.706, 'text': 'And finally, to define the total loss, we simply sum the losses from all the individual time steps.', 'start': 808.802, 'duration': 5.904}, {'end': 821.472, 'text': 'And since our total loss consists of these individual contributions over time,', 'start': 815.567, 'duration': 5.905}], 'summary': 'Explanation of weight matrices and forward pass in neural network.', 'duration': 50.039, 'max_score': 771.433, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk771433.jpg'}, {'end': 906.703, 'src': 'heatmap', 'start': 876.701, 'weight': 0.707, 'content': [{'end': 883.306, 'text': 'For RNNs, our forward pass through the network consists of going forward across time,', 'start': 876.701, 'duration': 6.605}, {'end': 890.372, 'text': 'updating the cell state based on the input and the previous state, generating an output at each time step,', 'start': 883.306, 'duration': 7.066}, {'end': 896.497, 'text': 'computing a loss at each time step and then finally summing these individual losses to get the total loss.', 'start': 890.372, 'duration': 6.125}, {'end': 906.703, 'text': 'And what this means is that instead of back-propagating errors through a single feed-forward network at a single time step in RNNs,', 'start': 898.055, 'duration': 8.648}], 'summary': 'Rnn forward pass updates cell state, generates output, computes loss at each time step.', 'duration': 30.002, 'max_score': 876.701, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk876701.jpg'}, {'end': 929.899, 'src': 'embed', 'start': 906.703, 'weight': 2, 'content': [{'end': 919.395, 'text': 'errors are back-propagated at each individual time step and then across time steps all the way from where we are currently to the very beginning of the sequence.', 'start': 906.703, 'duration': 12.692}, {'end': 924.094, 'text': "And this is the reason why it's 
called back propagation through time.", 'start': 920.731, 'duration': 3.363}, {'end': 929.899, 'text': 'Because as you can see, all the errors are flowing back in time to the beginning of our data sequence.', 'start': 924.674, 'duration': 5.225}], 'summary': 'Back-propagates errors across time steps in sequence.', 'duration': 23.196, 'max_score': 906.703, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk906703.jpg'}, {'end': 991.325, 'src': 'heatmap', 'start': 941.941, 'weight': 0.719, 'content': [{'end': 947.146, 'text': 'in doing this back propagation, we have this factor w, h, h, which is a matrix.', 'start': 941.941, 'duration': 5.205}, {'end': 955.753, 'text': 'And this means that in each step, we have to perform a matrix multiplication that involves this weight matrix w.', 'start': 947.666, 'duration': 8.087}, {'end': 963.12, 'text': 'Furthermore, each cell state update results from a nonlinear activation.', 'start': 957.938, 'duration': 5.182}, {'end': 973.444, 'text': 'And what this means is that in computing the gradient in an RNN, the derivative of the loss with respect to our initial state H0,', 'start': 964.18, 'duration': 9.264}, {'end': 981.967, 'text': 'we have to make many matrix multiplications that involve the weight matrix as well as repeated use of the derivative of the activation function.', 'start': 973.444, 'duration': 8.523}, {'end': 984.308, 'text': 'Why might this be problematic?', 'start': 982.947, 'duration': 1.361}, {'end': 991.325, 'text': 'Well, if we consider these multiplication operations, if many of these values are greater than 1,', 'start': 985.543, 'duration': 5.782}], 'summary': 'Back propagation in rnn involves matrix multiplications and nonlinear activation, potentially leading to computational challenges.', 'duration': 49.384, 'max_score': 941.941, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk941941.jpg'}, {'end': 1049.363, 'src': 'embed', 'start': 1001.994, 'weight': 3, 'content': [{'end': 1010.065, 'text': "one thing that's done in practice is called gradient clipping, which basically means you scale back your gradients when they become too large.", 'start': 1001.994, 'duration': 8.071}, {'end': 1018.116, 'text': "And this is a really good practical option, especially when you have a network that's not too complicated and doesn't have many parameters.", 'start': 1010.686, 'duration': 7.43}, {'end': 1025.939, 'text': 'On the flip side, we can also have the opposite problem, where, if our matrix values are too small,', 'start': 1019.676, 'duration': 6.263}, {'end': 1028.901, 'text': "we can encounter what's called the vanishing gradient problem.", 'start': 1025.939, 'duration': 2.962}, {'end': 1034.423, 'text': "And it's really the motivating factor behind the most widely used RNN architectures.", 'start': 1029.461, 'duration': 4.962}, {'end': 1040.906, 'text': "And today we're going to address three ways in which we can alleviate the vanishing gradient problem.", 'start': 1035.664, 'duration': 5.242}, {'end': 1043.678, 'text': 'by changing the activation function.', 'start': 1041.936, 'duration': 1.742}, {'end': 1049.363, 'text': "that's used being clever about how we initialize the weights in our network and, finally,", 'start': 1043.678, 'duration': 5.685}], 'summary': 'Gradient clipping is used to scale back gradients, addressing the vanishing gradient problem in rnn architectures.', 'duration': 47.369, 'max_score': 1001.994, 
'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk1001994.jpg'}, {'end': 1118.831, 'src': 'embed', 'start': 1093.12, 'weight': 6, 'content': [{'end': 1102.266, 'text': "And this means that during training, we'll end up biasing our network to capture short-term dependencies, which may not always be a problem.", 'start': 1093.12, 'duration': 9.146}, {'end': 1108.868, 'text': 'Sometimes we only need to consider very recent information to perform our task of interest.', 'start': 1102.847, 'duration': 6.021}, {'end': 1114.23, 'text': "So to make this concrete, let's go back to our example from the beginning of the lecture.", 'start': 1109.948, 'duration': 4.282}, {'end': 1118.831, 'text': "A language model, we're trying to predict the next word in a phrase.", 'start': 1114.85, 'duration': 3.981}], 'summary': 'Training biases network to capture short-term dependencies, useful for predicting next word in language model.', 'duration': 25.711, 'max_score': 1093.12, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk1093120.jpg'}], 'start': 771.433, 'title': 'Rnn training challenges', 'summary': 'Discusses backpropagation through time in rnns, utilizing the same weight matrices at every time step and addressing the vanishing gradient problem, with practical solutions like gradient clipping and architectural changes to alleviate the issue.', 'chapters': [{'end': 929.899, 'start': 771.433, 'title': 'Backpropagation through time in rnns', 'summary': 'Explains the concept of backpropagation through time in rnns, where the same weight matrices are used at every time step, and the total loss is defined by summing the losses from all individual time steps, involving a time component in training.', 'duration': 158.466, 'highlights': ['RNNs use the same weight matrices at every time step, ensuring consistent transformation of inputs to hidden state, previous hidden state to next hidden state, and hidden state to output.', 'Total loss in RNNs is defined by summing the losses from all individual time steps, introducing a time component in network training.', 'Backpropagation through time in RNNs involves back-propagating errors at each individual time step and then across time steps from the current to the beginning of the sequence, extending the concept of backpropagation in feedforward models.']}, {'end': 1158.845, 'start': 931.633, 'title': 'Rnn vanishing gradient problem', 'summary': 'Explains the vanishing gradient problem in rnns, its implications on error propagation and the practical solutions such as gradient clipping, while addressing ways to alleviate the problem by changing activation functions, weight initialization, and rnn architecture.', 'duration': 227.212, 'highlights': ['The vanishing gradient problem in RNNs can lead to biased networks capturing only short-term dependencies, hindering error propagation and potentially impacting performance.', 'Explained the concept of gradient clipping as a practical solution to the exploding gradient problem, where gradients are scaled back when they become too large.', 'Addressed three ways to alleviate the vanishing gradient problem in RNNs, including changing the activation function, being clever about weight initialization, and fundamentally changing the RNN architecture.', 'Highlighted the implication of vanishing gradients on capturing short-term dependencies in a language model, with examples illustrating the relevance of context in 
predictions.']}], 'duration': 387.412, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk771433.jpg', 'highlights': ['RNNs use the same weight matrices at every time step, ensuring consistent transformation of inputs to hidden state, previous hidden state to next hidden state, and hidden state to output.', 'Total loss in RNNs is defined by summing the losses from all individual time steps, introducing a time component in network training.', 'Backpropagation through time in RNNs involves back-propagating errors at each individual time step and then across time steps from the current to the beginning of the sequence, extending the concept of backpropagation in feedforward models.', 'Addressed three ways to alleviate the vanishing gradient problem in RNNs, including changing the activation function, being clever about weight initialization, and fundamentally changing the RNN architecture.', 'Explained the concept of gradient clipping as a practical solution to the exploding gradient problem, where gradients are scaled back when they become too large.', 'The vanishing gradient problem in RNNs can lead to biased networks capturing only short-term dependencies, hindering error propagation and potentially impacting performance.', 'Highlighted the implication of vanishing gradients on capturing short-term dependencies in a language model, with examples illustrating the relevance of context in predictions.']}, {'end': 1409.782, 'segs': [{'end': 1250.827, 'src': 'heatmap', 'start': 1159.866, 'weight': 2, 'content': [{'end': 1166.89, 'text': "And in many cases, the gap between what's relevant and the point where that information is needed can become really, really large.", 'start': 1159.866, 'duration': 7.024}, {'end': 1173.294, 'text': 'And as that gap grows, standard RNNs become increasingly unable to connect the information.', 'start': 1167.471, 'duration': 5.823}, {'end': 1176.136, 'text': "And that's all because of the vanishing gradient problem.", 'start': 1173.714, 'duration': 2.422}, {'end': 1181.331, 'text': 'So how can we alleviate this? 
The first trick is pretty simple.', 'start': 1177.136, 'duration': 4.195}, {'end': 1185.352, 'text': 'We can change the activation function the network uses.', 'start': 1181.911, 'duration': 3.441}, {'end': 1193.716, 'text': 'And specifically, both the tanh and sigmoid activation functions have derivatives less than 1 pretty much everywhere.', 'start': 1186.013, 'duration': 7.703}, {'end': 1203.62, 'text': 'In contrast, if we use a ReLU activation function, the derivative is 1 for whenever x is greater than 0.', 'start': 1195.117, 'duration': 8.503}, {'end': 1207.442, 'text': 'And so this helps prevent the value of the derivative from shrinking our gradients.', 'start': 1203.62, 'duration': 3.822}, {'end': 1214.228, 'text': "But it's only true for when x is greater than 0.", 'start': 1208.186, 'duration': 6.042}, {'end': 1218.95, 'text': 'Another trick is to be smart in terms of how we initialize parameters in our network.', 'start': 1214.228, 'duration': 4.722}, {'end': 1229.014, 'text': 'By initialing our weights to the identity matrix, we can help prevent them from shrinking to 0 too rapidly during back propagation.', 'start': 1219.83, 'duration': 9.184}, {'end': 1247.425, 'text': "The final and most robust solution is to use a more complex type of recurrent unit that can more effectively track long-term dependencies by controlling what information is passed through and what's used to update the cell state.", 'start': 1230.697, 'duration': 16.728}, {'end': 1250.827, 'text': "Specifically, we'll use what we call gated cells.", 'start': 1248.065, 'duration': 2.762}], 'summary': 'To address vanishing gradient problem, consider using relu activation function, smart parameter initialization, and gated cells in recurrent units.', 'duration': 69.148, 'max_score': 1159.866, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk1159866.jpg'}, {'end': 1279.813, 'src': 'embed', 'start': 1254.148, 'weight': 0, 'content': [{'end': 1262.31, 'text': "And today we'll focus on one type of gated cell called a long short-term memory network, or LSTMs for short,", 'start': 1254.148, 'duration': 8.162}, {'end': 1269.351, 'text': 'which are really good at learning long-term dependencies and overcoming this vanishing gradient problem.', 'start': 1262.31, 'duration': 7.041}, {'end': 1276.132, 'text': 'And LSTMs are basically the gold standard when it comes to building RNNs in practice.', 'start': 1269.891, 'duration': 6.241}, {'end': 1279.813, 'text': "And they're very, very widely used by the deep learning community.", 'start': 1276.192, 'duration': 3.621}], 'summary': 'Lstms excel in learning long-term dependencies, overcoming vanishing gradient problem, and are widely used in deep learning.', 'duration': 25.665, 'max_score': 1254.148, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk1254148.jpg'}, {'end': 1361.058, 'src': 'heatmap', 'start': 1288.5, 'weight': 0.712, 'content': [{'end': 1295.124, 'text': 'All recurrent neural networks have this form of a series of repeating modules, the RNN being unrolled across time.', 'start': 1288.5, 'duration': 6.624}, {'end': 1301.507, 'text': 'And in a standard RNN, the repeating module contains one computation node.', 'start': 1295.884, 'duration': 5.623}, {'end': 1304.649, 'text': "In this case, it's a tanh layer.", 'start': 1301.687, 'duration': 2.962}, {'end': 1311.796, 'text': 'LSTMs also have this chain-like structure, but the repeating module is slightly 
more complex.', 'start': 1306.292, 'duration': 5.504}, {'end': 1317.419, 'text': "And don't get too frightened, hopefully, by what these flow diagrams mean.", 'start': 1312.376, 'duration': 5.043}, {'end': 1319.361, 'text': "We'll walk through it step by step.", 'start': 1317.519, 'duration': 1.842}, {'end': 1328.807, 'text': 'But the key idea here is that the repeating unit in an LSTM contains these different interacting layers that control the flow of information.', 'start': 1320.181, 'duration': 8.626}, {'end': 1343.317, 'text': 'The first key idea behind LSTMs is that they maintain an internal cell state, which will denote C of t in addition to the standard RNN state, H of t.', 'start': 1330.97, 'duration': 12.347}, {'end': 1348.421, 'text': 'And this cell state runs throughout the chain of repeating modules.', 'start': 1343.317, 'duration': 5.104}, {'end': 1353.043, 'text': 'And as you can see, there are only a couple of simple linear interactions.', 'start': 1349.001, 'duration': 4.042}, {'end': 1356.906, 'text': 'This is a pointwise multiplication, and this is addition.', 'start': 1353.163, 'duration': 3.743}, {'end': 1361.058, 'text': 'that update the value of C of T.', 'start': 1358.075, 'duration': 2.983}], 'summary': 'Rnns and lstms have repeating modules with complex structures, controlling flow of information and maintaining an internal cell state.', 'duration': 72.558, 'max_score': 1288.5, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk1288500.jpg'}, {'end': 1388.418, 'src': 'embed', 'start': 1361.058, 'weight': 4, 'content': [{'end': 1366.222, 'text': "And this means that it's really easy for information to flow along relatively unchanged.", 'start': 1361.058, 'duration': 5.164}, {'end': 1377.191, 'text': 'The second key idea that LSTMs use is that they use these structures called gates to add or remove information to the cell state.', 'start': 1367.243, 'duration': 9.948}, {'end': 1384.017, 'text': 'And gates consist of a sigmoid neural net layer followed by a pointwise multiplication.', 'start': 1378.052, 'duration': 5.965}, {'end': 1388.418, 'text': "So let's take a moment to think about what these gates are doing.", 'start': 1385.137, 'duration': 3.281}], 'summary': 'Lstms use gates to manage information flow effectively.', 'duration': 27.36, 'max_score': 1361.058, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk1361058.jpg'}], 'start': 1159.866, 'title': 'Rnns and lstms', 'summary': 'Discusses solutions to the vanishing gradient problem in rnns, proposing the use of relu activation function and weight initialization. 
it also explains the significance of lstms in effectively tracking long-term dependencies and their widespread use in the deep learning community.', 'chapters': [{'end': 1229.014, 'start': 1159.866, 'title': 'Vanishing gradient problem solution', 'summary': "Discusses the vanishing gradient problem in rnns due to the activation functions' derivatives being less than 1, and proposes solutions of using relu activation function with derivative 1 for x > 0 and initializing weights to the identity matrix to prevent rapid shrinking during back propagation.", 'duration': 69.148, 'highlights': ['Using ReLU activation function with derivative 1 for x > 0 helps prevent the value of the derivative from shrinking our gradients.', 'Smart initialization of parameters by initializing weights to the identity matrix can prevent them from shrinking to 0 too rapidly during back propagation.', 'The vanishing gradient problem in RNNs is caused by the gap between relevant information and the point where it is needed becoming large, making standard RNNs increasingly unable to connect the information.']}, {'end': 1409.782, 'start': 1230.697, 'title': 'Understanding lstms in rnns', 'summary': 'Explains the significance of lstms in rnns, highlighting their ability to effectively track long-term dependencies, overcome the vanishing gradient problem, and serve as the gold standard for building rnns, with lstms being widely utilized in the deep learning community.', 'duration': 179.085, 'highlights': ['LSTMs are the gold standard for building RNNs, widely used in the deep learning community', 'LSTMs are effective at learning long-term dependencies and overcoming the vanishing gradient problem', 'LSTMs use gated cells to control the flow of information and add or remove information to the cell state']}], 'duration': 249.916, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk1159866.jpg', 'highlights': ['LSTMs are the gold standard for building RNNs, widely used in the deep learning community', 'LSTMs are effective at learning long-term dependencies and overcoming the vanishing gradient problem', 'Using ReLU activation function with derivative 1 for x > 0 helps prevent the value of the derivative from shrinking our gradients', 'Smart initialization of parameters by initializing weights to the identity matrix can prevent them from shrinking to 0 too rapidly during back propagation', 'LSTMs use gated cells to control the flow of information and add or remove information to the cell state', 'The vanishing gradient problem in RNNs is caused by the gap between relevant information and the point where it is needed becoming large, making standard RNNs increasingly unable to connect the information']}, {'end': 1844.769, 'segs': [{'end': 1478.959, 'src': 'heatmap', 'start': 1410.382, 'weight': 4, 'content': [{'end': 1416.439, 'text': 'And so this regulates the flow of of information through the LSTM.', 'start': 1410.382, 'duration': 6.057}, {'end': 1420.461, 'text': "So now you're probably wondering, OK, these lines look really complicated.", 'start': 1417.499, 'duration': 2.962}, {'end': 1430.605, 'text': 'How do these LSTMs actually work? 
Thinking of the LSTM operations at a high level, it boils down to three key steps.', 'start': 1421.521, 'duration': 9.084}, {'end': 1439.289, 'text': 'The first step in the LSTM is to decide what information is going to be thrown away from the prior cell state.', 'start': 1432.046, 'duration': 7.243}, {'end': 1441.67, 'text': 'Forget irrelevant history right?', 'start': 1439.809, 'duration': 1.861}, {'end': 1448.474, 'text': 'The next step is to take both the prior information as well as the current input,', 'start': 1442.833, 'duration': 5.641}, {'end': 1452.815, 'text': 'process this information in some way and then selectively update the cell state.', 'start': 1448.474, 'duration': 4.341}, {'end': 1456.876, 'text': 'And our final step is to return an output.', 'start': 1454.715, 'duration': 2.161}, {'end': 1464.597, 'text': 'And for this, LSTMs are going to use an output gate to return a transformed version of the cell state.', 'start': 1457.336, 'duration': 7.261}, {'end': 1478.959, 'text': "So, now that we have a sense of these three key LSTM operations forget update output Let's walk through each step by step to get a concrete understanding of how these computations work.", 'start': 1466.398, 'duration': 12.561}], 'summary': 'Lstm regulates information flow through 3 key steps: forgetting irrelevant history, updating cell state, and returning an output.', 'duration': 54.215, 'max_score': 1410.382, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk1410382.jpg'}, {'end': 1633.194, 'src': 'heatmap', 'start': 1557.784, 'weight': 0.873, 'content': [{'end': 1566.992, 'text': "updating the LSTM to forget the gender pronoun of a sentence's past subject once it encounters a new subject in that sentence.", 'start': 1557.784, 'duration': 9.208}, {'end': 1576.981, 'text': 'Our second step is to decide what new information is going to be stored in our updated cell state and to actually execute that update.', 'start': 1569.254, 'duration': 7.727}, {'end': 1579.636, 'text': 'So there are two steps to this.', 'start': 1578.496, 'duration': 1.14}, {'end': 1588.599, 'text': 'The first is a sigmoid layer, which you can think of as gating the input, which identifies what values we should update.', 'start': 1580.337, 'duration': 8.262}, {'end': 1596.281, 'text': 'Secondly, we have a tanh layer that generates a new vector of candidate values that could be added to the state.', 'start': 1589.899, 'duration': 6.382}, {'end': 1605.724, 'text': 'And in our language model, we may decide to add the gender of a new subject in order to replace the gender of the old subject.', 'start': 1597.301, 'duration': 8.423}, {'end': 1615.307, 'text': 'Now we can actually update our old cell state c of t minus 1 into the new cell state c of t.', 'start': 1607.816, 'duration': 7.491}, {'end': 1618.472, 'text': 'Our previous two steps decided what we should do.', 'start': 1615.307, 'duration': 3.165}, {'end': 1621.176, 'text': "Now it's about actually executing that.", 'start': 1618.952, 'duration': 2.224}, {'end': 1633.194, 'text': 'So to perform this update, we first multiply our old cell state, c of t minus 1, by our forget state, our forget gate, f of t.', 'start': 1622.642, 'duration': 10.552}], 'summary': 'Updating lstm to forget gender pronouns, involving sigmoid and tanh layers.', 'duration': 75.41, 'max_score': 1557.784, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk1557784.jpg'}, {'end': 1815.599, 'src': 
'embed', 'start': 1765.092, 'weight': 0, 'content': [{'end': 1768.733, 'text': 'And that c of t is only involved in really simple computations.', 'start': 1765.092, 'duration': 3.641}, {'end': 1779.316, 'text': "And so when you link up these repeating LSTM units in a chain, what you'll see is that you get this completely uninterrupted gradient flow,", 'start': 1770.393, 'duration': 8.923}, {'end': 1784.197, 'text': 'unlike in a standard RNN, where you have to do repeated matrix multiplications.', 'start': 1779.316, 'duration': 4.881}, {'end': 1791.379, 'text': 'And this is really great for training purposes and for overcoming the vanishing gradient problem.', 'start': 1784.817, 'duration': 6.562}, {'end': 1799.062, 'text': "So to recap the key ideas behind LSTMs, we maintain a separate cell state from what's outputted.", 'start': 1792.78, 'duration': 6.282}, {'end': 1806.054, 'text': "We use gates to control the flow of information, first forgetting what's irrelevant,", 'start': 1800.131, 'duration': 5.923}, {'end': 1811.677, 'text': 'selectively updating the cell state based on both the past history and the current input,', 'start': 1806.054, 'duration': 5.623}, {'end': 1815.599, 'text': 'and then outputting some filtered version of what we just computed.', 'start': 1811.677, 'duration': 3.922}], 'summary': 'Lstms enable uninterrupted gradient flow, overcoming vanishing gradient problem, using gates for information flow control.', 'duration': 50.507, 'max_score': 1765.092, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk1765092.jpg'}], 'start': 1410.382, 'title': 'Understanding lstm operations and lstms in rnns', 'summary': 'Delves into the three key steps in lstm operations and highlights the high-level operations of lstms in rnns, offering a solution to the vanishing gradient problem.', 'chapters': [{'end': 1537.65, 'start': 1410.382, 'title': 'Understanding lstm operations', 'summary': 'Explains the three key steps in lstm operations: deciding what information to discard, processing and updating information, and returning an output using gates, with a focus on establishing intuition behind how lstms work.', 'duration': 127.268, 'highlights': ['The first step in the LSTM is to decide what information is going to be thrown away from the prior cell state, achieved using a sigmoid layer called the forget gate, parametrized by a set of weights and biases.', 'The next step involves processing the prior information and current input to selectively update the cell state, followed by using an output gate to return a transformed version of the cell state.', 'LSTMs use an output gate to return a transformed version of the cell state, with the forget gate determining what past information is relevant and irrelevant, and processing information to selectively update the cell state.']}, {'end': 1844.769, 'start': 1537.65, 'title': 'Understanding lstms in rnns', 'summary': 'Explains the working of lstms in rnns, emphasizing the three high-level operations of forgetting old information, updating the cell state, and outputting a filtered version, and also highlights the solution provided by lstms to overcome the vanishing gradient problem.', 'duration': 307.119, 'highlights': ["LSTMs maintain a separate cell state from what's outputted, use gates to control the flow of information, and allow for simple back propagation with uninterrupted gradient flow, providing a solution to the vanishing gradient problem.", 'The fundamental workings of LSTMs 
involve three high-level operations: forgetting old information, updating the cell state, and outputting a filtered version, demonstrating the internal workings of LSTMs.', 'The relationship between c of t and c of t minus 1 allows for simple back propagation with uninterrupted gradient flow, unlike in standard RNNs, providing an advantage for training purposes and overcoming the vanishing gradient problem.']}], 'duration': 434.387, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk1410382.jpg', 'highlights': ['LSTMs use an output gate to return a transformed version of the cell state, with the forget gate determining what past information is relevant and irrelevant, and processing information to selectively update the cell state.', 'The fundamental workings of LSTMs involve three high-level operations: forgetting old information, updating the cell state, and outputting a filtered version, demonstrating the internal workings of LSTMs.', "LSTMs maintain a separate cell state from what's outputted, use gates to control the flow of information, and allow for simple back propagation with uninterrupted gradient flow, providing a solution to the vanishing gradient problem.", 'The relationship between c of t and c of t minus 1 allows for simple back propagation with uninterrupted gradient flow, unlike in standard RNNs, providing an advantage for training purposes and overcoming the vanishing gradient problem.', 'The first step in the LSTM is to decide what information is going to be thrown away from the prior cell state, achieved using a sigmoid layer called the forget gate, parametrized by a set of weights and biases.', 'The next step involves processing the prior information and current input to selectively update the cell state, followed by using an output gate to return a transformed version of the cell state.']}, {'end': 2184.766, 'segs': [{'end': 1908.436, 'src': 'embed', 'start': 1846.872, 'weight': 0, 'content': [{'end': 1856.875, 'text': "Let's first imagine we're trying to learn a RNN to predict the next musical note and to use this model to generate brand new musical sequences.", 'start': 1846.872, 'duration': 10.003}, {'end': 1865.919, 'text': 'So you can imagine inputting a sequence of notes and producing an output at each time step,', 'start': 1857.556, 'duration': 8.363}, {'end': 1870.68, 'text': 'where our output at each time step is what we think is the next note in the sequence.', 'start': 1865.919, 'duration': 4.761}, {'end': 1880.419, 'text': "If you train a model like this, you can actually use it to generate brand new music that's never been heard before.", 'start': 1873.672, 'duration': 6.747}, {'end': 1898.029, 'text': 'And so for example, Right, you get the idea.', 'start': 1881.08, 'duration': 16.949}, {'end': 1899.87, 'text': 'This sounds like classical music right?', 'start': 1898.049, 'duration': 1.821}, {'end': 1908.436, 'text': 'But in reality this was music that was generated by a recurrent neural network that trained on piano pieces from Chopin.', 'start': 1900.27, 'duration': 8.166}], 'summary': "Using rnn to generate new music, trained on chopin's piano pieces.", 'duration': 61.564, 'max_score': 1846.872, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk1846872.jpg'}, {'end': 1962.445, 'src': 'embed', 'start': 1934.729, 'weight': 2, 'content': [{'end': 1941.031, 'text': "where you'll be training an RNN to generate brand new Irish folk music that has 
never been heard before.", 'start': 1934.729, 'duration': 6.302}, {'end': 1950.091, 'text': "As another cool example, where we're going from an input sequence to just a single output.", 'start': 1944.184, 'duration': 5.907}, {'end': 1961.063, 'text': 'we can train an RNN to take as input words in a sentence and actually output the sentiment or the feeling of that particular sentence,', 'start': 1950.091, 'duration': 10.972}, {'end': 1962.445, 'text': 'either positive or negative.', 'start': 1961.063, 'duration': 1.382}], 'summary': 'Train rnn to generate original irish folk music and predict sentiment from input words.', 'duration': 27.716, 'max_score': 1934.729, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk1934729.jpg'}, {'end': 2068.853, 'src': 'embed', 'start': 2040.29, 'weight': 3, 'content': [{'end': 2045.951, 'text': "And this is a huge bottleneck when you're considering large bodies of text that you're trying to translate.", 'start': 2040.29, 'duration': 5.661}, {'end': 2051.333, 'text': 'And actually, researchers devised a clever way to get around this problem.', 'start': 2046.832, 'duration': 4.501}, {'end': 2054.6, 'text': 'which is this idea of attention.', 'start': 2052.478, 'duration': 2.122}, {'end': 2062.527, 'text': 'And the basic idea here is that, instead of the decoder only having access to the final encoded state,', 'start': 2055.541, 'duration': 6.986}, {'end': 2068.853, 'text': 'it now has access to each of these states after each of the steps in the original sentence.', 'start': 2062.527, 'duration': 6.326}], 'summary': "Attention mechanism resolves translation bottleneck by providing access to each step's state.", 'duration': 28.563, 'max_score': 2040.29, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk2040290.jpg'}, {'end': 2141.596, 'src': 'embed', 'start': 2112.887, 'weight': 4, 'content': [{'end': 2120.933, 'text': 'how to train them using back propagation through time, and also looked at how gated cells can, let us, model long-term dependencies.', 'start': 2112.887, 'duration': 8.046}, {'end': 2134.279, 'text': 'And finally, we discussed three concrete applications, And so this concludes the lecture portion of our first day of 6S191,', 'start': 2121.754, 'duration': 12.525}, {'end': 2141.596, 'text': "and we're really excited now to transition to the lab portion which, as I mentioned,", 'start': 2134.279, 'duration': 7.317}], 'summary': 'Back propagation through time for training, gated cells for long-term dependencies, and 3 concrete applications. 
transitioning to lab portion.', 'duration': 28.709, 'max_score': 2112.887, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk2112887.jpg'}], 'start': 1846.872, 'title': 'Rnn applications', 'summary': "Discusses rnn's use in music generation, sentiment analysis, and machine translation, including attention mechanisms for handling large text bodies, with a focus on generating music trained on chopin's pieces.", 'chapters': [{'end': 1934.729, 'start': 1846.872, 'title': 'Rnn music generation', 'summary': 'Discusses using rnn to generate new musical sequences, showcasing its ability to produce realistic music by training on piano pieces from chopin and generating brand new music.', 'duration': 87.857, 'highlights': ["You can train a RNN model to generate brand new music that's never been heard before, using it to produce realistic music based on the learned patterns.", 'Music generated by a recurrent neural network trained on piano pieces from Chopin sounds extremely realistic, making it difficult to differentiate from music composed by a human expert.', 'The RNN model is capable of predicting the next musical note and generating brand new musical sequences by inputting a sequence of notes and producing an output at each time step.']}, {'end': 2184.766, 'start': 1934.729, 'title': 'Rnn applications and training', 'summary': 'Discusses the applications of rnns in generating music, sentiment analysis, and machine translation, with a focus on the use of attention in handling large bodies of text for translation.', 'duration': 250.037, 'highlights': ["RNNs are used to generate brand new Irish folk music, perform sentiment analysis on tweets, and for machine translation in Google's algorithm.", 'The use of attention in machine translation allows the decoder to access each state of the original sentence, addressing the bottleneck of encoding large bodies of text into a single vector.', 'The lecture also covers the training of RNNs using back propagation through time and the modeling of long-term dependencies using gated cells.']}], 'duration': 337.894, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/_h66BW-xNgk/pics/_h66BW-xNgk1846872.jpg', 'highlights': ["Music generated by RNN trained on Chopin's pieces sounds extremely realistic", 'RNN model capable of predicting next musical note and generating new sequences', 'RNNs used for generating Irish folk music, sentiment analysis, and machine translation', 'Use of attention in machine translation addresses bottleneck of encoding large text bodies', 'Training RNNs using back propagation through time and modeling long-term dependencies']}], 'highlights': ['RNNs are well-suited for handling sequential data, such as sequences of inputs, and are ideal for problems like sentiment analysis and text/music generation.', 'LSTMs are the gold standard for building RNNs, widely used in the deep learning community', 'The fixed window approach involves representing the previous two words as a fixed length vector to predict the next word, addressing the challenge of variable input lengths.', 'RNNs use loops to allow information to persist internally, enabling them to process sequential data effectively.', 'The need for a different type of network architecture for problems involving sequential processing of data is highlighted, building on the essentials of neural networks and feedforward models.', "RNNs update an internal state 'h of t' based on input 'x of t' and pass this information 
to the next time step, facilitating the retention and utilization of sequential information.', 'LSTMs use gated cells to control the flow of information and add information to or remove it from the cell state', 'RNN model capable of predicting next musical note and generating new sequences', 'RNNs used for generating Irish folk music, sentiment analysis, and machine translation', 'The importance of considering the history of sequential data is emphasized through the example of predicting the trajectory of a ball.']}
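
To make the three LSTM operations summarized above concrete (forget what is irrelevant, selectively update the cell state, output a filtered version), here is a minimal one-step LSTM sketch in NumPy. The additive form of the cell-state update is what gives the uninterrupted gradient path the lecture describes, in contrast to the repeated matrix multiplications of a standard RNN. The gate names (W["f"], W["i"], W["g"], W["o"]) and shapes are illustrative assumptions, not the lecture's notation.

```python
# A minimal sketch of one LSTM time step, assuming concatenated [input, hidden] weights.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step.

    x_t:    input at time t, shape (input_dim,)
    h_prev: previous hidden state, shape (hidden_dim,)
    c_prev: previous cell state, shape (hidden_dim,)
    W, b:   dicts of weights (hidden_dim, input_dim + hidden_dim) and biases (hidden_dim,)
    """
    z = np.concatenate([x_t, h_prev])      # act on the current input and past hidden state jointly
    f = sigmoid(W["f"] @ z + b["f"])       # forget gate: which parts of the old cell state to discard
    i = sigmoid(W["i"] @ z + b["i"])       # input gate: which new information to write
    g = np.tanh(W["g"] @ z + b["g"])       # candidate values for the cell state
    c_t = f * c_prev + i * g               # additive update -> simple gradient path from c_t to c_{t-1}
    o = sigmoid(W["o"] @ z + b["o"])       # output gate: which filtered version of the cell to emit
    h_t = o * np.tanh(c_t)                 # hidden state is a gated, squashed view of the cell state
    return h_t, c_t
```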
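For the music-generation example, where the network outputs a prediction of the next note at every time step and is later sampled to produce new sequences, a many-to-many model can be sketched in Keras as below. The vocabulary size, layer sizes, and training details are assumptions for illustration; this is not the course's lab code.

```python
# Sketch of a next-symbol prediction model for sequence generation (assumed sizes).
import tensorflow as tf

vocab_size = 128       # assumed number of distinct note/character symbols
embedding_dim = 64
rnn_units = 256

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.LSTM(rnn_units, return_sequences=True),  # one output per time step
    tf.keras.layers.Dense(vocab_size),                        # logits over the next symbol
])

model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# Training pairs each input sequence with the same sequence shifted by one step,
# so the target at time t is the symbol observed at time t + 1:
#   x = sequence[:-1],  y = sequence[1:]
# After training, new music is generated by repeatedly sampling from the predicted
# distribution and feeding the sampled symbol back in as the next input.
```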
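For the sentiment-analysis example, where an entire sentence maps to a single positive/negative label, only the final state of the recurrent layer is used. A minimal many-to-one sketch, again with assumed sizes, could be:

```python
# Sketch of a many-to-one sentiment classifier (assumed vocabulary and layer sizes).
import tensorflow as tf

vocab_size = 10000
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),
    tf.keras.layers.LSTM(64),                        # return_sequences=False: final state only
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability of positive sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```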
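The attention idea described for machine translation, in which the decoder has access to every encoder state rather than only the final encoded vector, can be illustrated with a simple dot-product attention step in NumPy. This is a generic sketch under assumed shapes, not the specific mechanism of any particular translation system.

```python
# Sketch of dot-product attention over encoder states (assumed shapes).
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """decoder_state: (d,); encoder_states: (T, d), one row per source position."""
    scores = encoder_states @ decoder_state  # similarity of the decoder query to each source state
    weights = softmax(scores)                # attention distribution over source positions
    context = weights @ encoder_states       # weighted sum of encoder states: the context vector
    return context, weights
```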
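Finally, the highlight about updating an internal state "h of t" from the input "x of t" and passing it forward corresponds to the vanilla RNN recurrence sketched below; unrolling this loop over a sequence produces the computation graph that backpropagation through time differentiates, and the repeated multiplication by the same recurrent weight matrix is what leads to the vanishing gradient problem that LSTMs address. Weight names here are illustrative assumptions.

```python
# Sketch of the vanilla RNN state update and its unrolling over a sequence.
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # h_t = tanh(W_hh @ h_{t-1} + W_xh @ x_t + b)
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

def rnn_forward(xs, h0, W_xh, W_hh, b_h):
    """Apply the same weights at every time step; the unrolled graph is what
    backpropagation through time differentiates."""
    h = h0
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        states.append(h)
    return states
```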