title
Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 6 – Language Models and RNNs
description
For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/3n7saLk
Professor Christopher Manning & PhD Candidate Abigail See, Stanford University
http://onlinehub.stanford.edu/
Professor Christopher Manning
Thomas M. Siebel Professor in Machine Learning, Professor of Linguistics and of Computer Science
Director, Stanford Artificial Intelligence Laboratory (SAIL)
To follow along with the course schedule and syllabus, visit: http://web.stanford.edu/class/cs224n/index.html#schedule
0:00 Introduction
0:33 Overview
2:50 You use Language Models every day!
5:36 n-gram Language Models: Example
10:12 Sparsity Problems with n-gram Language Models
10:58 Storage Problems with n-gram Language Models
11:34 n-gram Language Models in practice
12:53 Generating text with an n-gram Language Model
15:08 How to build a neural Language Model?
16:03 A fixed-window neural Language Model
20:57 Recurrent Neural Networks (RNN)
22:39 A RNN Language Model
32:51 Training a RNN Language Model
36:35 Multivariable Chain Rule
37:10 Backpropagation for RNNs: Proof sketch
41:23 Generating text with a RNN Language Model
51:39 Evaluating Language Models
53:30 RNNs have greatly improved perplexity
54:09 Why should we care about Language Modeling?
58:30 Recap
59:21 RNNs can be used for tagging
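To make the formulas discussed in the lecture easier to follow along with, below is a minimal sketch in Python of two of the ideas in the chapter list above: the count-based n-gram estimate with add-delta smoothing (5:36, 10:12) and one unrolled step of an RNN language model together with its per-step loss (22:39, 32:51). This is an illustrative reconstruction from the lecture's verbal description, not code distributed with the course; the function names, array shapes, and the smoothing constant delta are assumptions made for this example.

# Illustrative sketch only; names, shapes, and delta are assumptions, not course code.
from collections import Counter

import numpy as np


def ngram_counts(tokens, n=3):
    """Count n-grams and their (n-1)-word prefixes over a training corpus."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    prefixes = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    return grams, prefixes


def ngram_prob(word, prefix, grams, prefixes, vocab_size, delta=1.0):
    """P(next word | last n-1 words) = count(prefix + word) / count(prefix),
    with a small delta added so unseen n-grams still get nonzero probability."""
    return (grams[tuple(prefix) + (word,)] + delta) / (prefixes[tuple(prefix)] + delta * vocab_size)


def rnn_lm_step(h_prev, x_onehot, E, W_h, W_e, b1, U, b2):
    """One unrolled step of a simple RNN language model: embedding lookup,
    hidden-state update (sigmoid non-linearity, the example given in the lecture),
    and a softmax distribution over the vocabulary for the next word."""
    e_t = E @ x_onehot                                             # word embedding lookup
    h_t = 1.0 / (1.0 + np.exp(-(W_h @ h_prev + W_e @ e_t + b1)))   # new hidden state
    logits = U @ h_t + b2
    y_hat = np.exp(logits - logits.max())                          # numerically stable softmax
    return h_t, y_hat / y_hat.sum()


def step_loss(y_hat, true_word_index):
    """Per-step training loss: negative log probability of the true next word."""
    return -np.log(y_hat[true_word_index])

Training, as described around 32:51, would repeat rnn_lm_step over a corpus (or a batch of sentences), average the per-step losses, and backpropagate through time into W_h, W_e, E, U, and the biases.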
detail
{'title': 'Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 6 – Language Models and RNNs', 'heatmap': [{'end': 1029.404, 'start': 940.204, 'weight': 0.807}, {'end': 1316.646, 'start': 1226.519, 'weight': 0.808}, {'end': 1443.235, 'start': 1355.965, 'weight': 0.754}, {'end': 2096.214, 'start': 2012.915, 'weight': 0.933}, {'end': 2224.248, 'start': 2132.828, 'weight': 0.778}, {'end': 3163.98, 'start': 3118.788, 'weight': 0.945}], 'summary': 'The lecture covers language models and recurrent neural networks (rnns), highlighting language modeling basics, issues with n-gram language models, challenges, word embedding, rnn training, rnn language models, and their application in nlp tasks such as tagging, sentiment classification, question answering, and speech recognition.', 'chapters': [{'end': 43.628, 'segs': [{'end': 43.628, 'src': 'embed', 'start': 6.209, 'weight': 0, 'content': [{'end': 6.79, 'text': 'Hi, everyone.', 'start': 6.209, 'duration': 0.581}, {'end': 7.79, 'text': "I'm Abby.", 'start': 7.31, 'duration': 0.48}, {'end': 12.254, 'text': "I'm the head TA for this class, and I'm also a PhD student in the Stanford NLP group.", 'start': 8.031, 'duration': 4.223}, {'end': 16.798, 'text': "And today, I'm gonna be telling you about language models and recurrent neural networks.", 'start': 12.775, 'duration': 4.023}, {'end': 20.642, 'text': "So, here's an overview of what we're gonna do today.", 'start': 17.599, 'duration': 3.043}, {'end': 24.465, 'text': "Today, first, we're going to introduce a new NLP task, that's language modeling.", 'start': 20.662, 'duration': 3.803}, {'end': 31.892, 'text': "And that's going to motivate us to learn about a new family of neural networks, that is recurrent neural networks or RNNs.", 'start': 25.306, 'duration': 6.586}, {'end': 37.243, 'text': "So, I'd say that these are two of the most important ideas we're going to learn for the rest of the course.", 'start': 33.379, 'duration': 3.864}, {'end': 39.985, 'text': "So, we're going to be covering some fairly core material today.", 'start': 37.723, 'duration': 2.262}, {'end': 43.628, 'text': "So, let's start off with language modeling.", 'start': 42.107, 'duration': 1.521}], 'summary': 'Abby, head ta at stanford, introduces language models and rnns for nlp tasks.', 'duration': 37.419, 'max_score': 6.209, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U6209.jpg'}], 'start': 6.209, 'title': 'Language models and rnns', 'summary': 'Introduces language modeling and its relation to recurrent neural networks (rnns) as essential concepts for the course, presented by abby, the head ta and phd student in the stanford nlp group.', 'chapters': [{'end': 43.628, 'start': 6.209, 'title': 'Language models and recurrent neural networks', 'summary': 'Covers the introduction of language modeling as a new nlp task and its motivation to learn about recurrent neural networks (rnns), which are two important concepts for the course, presented by abby, the head ta and phd student in the stanford nlp group.', 'duration': 37.419, 'highlights': ['The chapter introduces the new NLP task of language modeling, motivating the learning about recurrent neural networks (RNNs).', 'Abby, the head TA and a PhD student in the Stanford NLP group, presents the core material of language modeling and recurrent neural networks (RNNs) in the class.']}], 'duration': 37.419, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U6209.jpg', 'highlights': ['Abby, the head TA and a PhD student in the Stanford NLP group, presents the core material of language modeling and recurrent neural networks (RNNs) in the class.', 'The chapter introduces the new NLP task of language modeling, motivating the learning about recurrent neural networks (RNNs).']}, {'end': 446.388, 'segs': [{'end': 72.937, 'src': 'embed', 'start': 44.909, 'weight': 1, 'content': [{'end': 48.312, 'text': 'Language modeling is the task of predicting what words comes next.', 'start': 44.909, 'duration': 3.403}, {'end': 51.956, 'text': 'So, given this piece of text, the students opened their blank.', 'start': 49.093, 'duration': 2.863}, {'end': 56.28, 'text': 'Could anyone shout out a word which you think might be coming next? Book.', 'start': 52.416, 'duration': 3.864}, {'end': 58.842, 'text': 'Book Mind.', 'start': 56.56, 'duration': 2.282}, {'end': 59.803, 'text': 'Mind What else?', 'start': 58.922, 'duration': 0.881}, {'end': 66.714, 'text': "I didn't quite hear them, but uh yeah, these are all likely things right?", 'start': 62.853, 'duration': 3.861}, {'end': 69.015, 'text': 'So these are some things which I thought students might be opening.', 'start': 66.734, 'duration': 2.281}, {'end': 71.116, 'text': 'Uh, students open their books, seems likely.', 'start': 69.415, 'duration': 1.701}, {'end': 72.937, 'text': 'Uh, students open their laptops.', 'start': 71.136, 'duration': 1.801}], 'summary': 'Language modeling predicts next words. students open books and laptops.', 'duration': 28.028, 'max_score': 44.909, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U44909.jpg'}, {'end': 117.722, 'src': 'embed', 'start': 87.682, 'weight': 3, 'content': [{'end': 90.243, 'text': "So, here's a more formal definition of what a language model is.", 'start': 87.682, 'duration': 2.561}, {'end': 101.751, 'text': 'Given a sequence of words, x1 up to xt, a language model is something that computes the probability distribution of the next word, xt plus 1.', 'start': 92.264, 'duration': 9.487}, {'end': 108.035, 'text': 'So, a language model comes up with the probability distribution, the conditional probability of what xt plus 1 is given the words so far.', 'start': 101.751, 'duration': 6.284}, {'end': 114.499, 'text': "And here we're assuming that xt plus 1 can be any word w from a fixed vocabulary v.", 'start': 109.036, 'duration': 5.463}, {'end': 117.722, 'text': "So, we are assuming that there's a predefined list of words that we're considering.", 'start': 114.499, 'duration': 3.223}], 'summary': 'A language model computes the probability distribution of the next word based on a sequence of words from a fixed vocabulary.', 'duration': 30.04, 'max_score': 87.682, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U87682.jpg'}, {'end': 276.721, 'src': 'embed', 'start': 246.124, 'weight': 2, 'content': [{'end': 251.066, 'text': 'So the core idea of an n-gram language model is that, in order to predict what word comes next,', 'start': 246.124, 'duration': 4.942}, {'end': 256.648, 'text': "you're going to collect a bunch of statistics about how frequent different n-grams are from some kind of training data,", 'start': 251.066, 'duration': 5.582}, {'end': 260.009, 'text': 'and then you can use those statistics to predict what next words might be likely.', 'start': 256.648, 
'duration': 3.361}, {'end': 263.55, 'text': "Here's some more detail.", 'start': 262.87, 'duration': 0.68}, {'end': 269.693, 'text': 'So, to make an n-gram language model, first you need to make a simplifying assumption, and this is your assumption.', 'start': 264.631, 'duration': 5.062}, {'end': 276.721, 'text': 'you say that the next word, Xt plus 1, depends only on the preceding n minus 1 words.', 'start': 270.799, 'duration': 5.922}], 'summary': 'N-gram language model predicts next words based on n-gram statistics.', 'duration': 30.597, 'max_score': 246.124, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U246124.jpg'}, {'end': 374.727, 'src': 'embed', 'start': 346.391, 'weight': 0, 'content': [{'end': 348.172, 'text': "And we're trying to predict what word is coming next.", 'start': 346.391, 'duration': 1.781}, {'end': 355.917, 'text': "So, because we're learning a four-gram language model, our simplifying assumption is that the next word depends only on the last three words,", 'start': 349.474, 'duration': 6.443}, {'end': 357.058, 'text': 'the last n minus one words.', 'start': 355.917, 'duration': 1.141}, {'end': 362.681, 'text': "So we're going to discard all of the context so far except for the last three words, which is students open there.", 'start': 358.039, 'duration': 4.642}, {'end': 365.602, 'text': 'So, as a reminder,', 'start': 364.882, 'duration': 0.72}, {'end': 374.727, 'text': 'our n-gram language model says that the probability of the next word being some particular word w in the vocabulary is equal to the number of times we saw students open their w,', 'start': 365.602, 'duration': 9.125}], 'summary': 'Using a four-gram model to predict next word based on the last three words.', 'duration': 28.336, 'max_score': 346.391, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U346391.jpg'}], 'start': 44.909, 'title': 'Language modeling basics', 'summary': "Introduces language modeling as the task of predicting the next word in a given text, with examples of likely next words and highlights the audience's active role. it explains the concept of a language model, its applications, the importance of n-gram language models, and the methodology of predicting the next word based on statistics.", 'chapters': [{'end': 86.702, 'start': 44.909, 'title': 'Language modeling basics', 'summary': 'Introduces language modeling as the task of predicting the next word in a given text, with examples of likely next words such as books, minds, laptops, exams, and the metaphorical meaning of opening, highlighting that the audience is actively performing language modeling.', 'duration': 41.793, 'highlights': ['The students open their books, seems likely. (Likely next word example: books)', 'Students open their minds. (Likely next word example: minds)', 'Students open their laptops. (Likely next word example: laptops)', 'Students open their exams. (Likely next word example: exams)', 'In thinking about what word comes next, you are being a language model. 
(Highlighting that the audience is actively performing language modeling)']}, {'end': 446.388, 'start': 87.682, 'title': 'Language model basics', 'summary': 'Explains the concept of a language model, its applications, and the process of learning a language model, highlighting the importance of n-gram language models and the methodology of predicting the next word based on statistics.', 'duration': 358.706, 'highlights': ['Language models compute the probability distribution of the next word, xt plus 1, and are used for everyday tasks such as texting and internet searches.', 'N-gram language models simplify the prediction of the next word by collecting statistics about the frequency of different n-grams from training data.', 'The probability of the next word in a four-gram language model is determined by the count of specific n-grams in the training corpus, affecting the prediction of the next word.', 'The order of words matters in language modeling, as demonstrated by the importance of context in predicting the next word.']}], 'duration': 401.479, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U44909.jpg', 'highlights': ['The order of words matters in language modeling, as demonstrated by the importance of context in predicting the next word.', 'In thinking about what word comes next, you are being a language model.', 'N-gram language models simplify the prediction of the next word by collecting statistics about the frequency of different n-grams from training data.', 'Language models compute the probability distribution of the next word, xt plus 1, and are used for everyday tasks such as texting and internet searches.']}, {'end': 854.81, 'segs': [{'end': 567.716, 'src': 'embed', 'start': 542.818, 'weight': 0, 'content': [{'end': 548.52, 'text': "So this technique is called smoothing, because the idea is that you're going from a very sparse probability distribution,", 'start': 542.818, 'duration': 5.702}, {'end': 553.201, 'text': "which is zero almost everywhere, with a few spikes where there's been n-grams that we've seen.", 'start': 548.52, 'duration': 4.681}, {'end': 558.523, 'text': 'It goes from that to being a more smooth probability distribution where everything has at least a small probability on it.', 'start': 553.601, 'duration': 4.922}, {'end': 567.716, 'text': 'So the second sparsity problem, which is possibly worse than the first one, is what happens if the number in uh, the denominator is zero?', 'start': 560.632, 'duration': 7.084}], 'summary': 'Smoothing technique aims to create a more even probability distribution from sparse n-gram data.', 'duration': 24.898, 'max_score': 542.818, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U542818.jpg'}, {'end': 638.906, 'src': 'embed', 'start': 614.263, 'weight': 1, 'content': [{'end': 619.628, 'text': 'So, um, another thing to 
note is that these sparsity problems get worse if you increase n.', 'start': 614.263, 'duration': 5.365}, {'end': 623.271, 'text': 'If you make n larger in your n-gram language model, and you might want to do this.', 'start': 619.628, 'duration': 3.643}, {'end': 629.357, 'text': 'For example, you might think uh, I want to have a larger context so that I can, uh, pay attention to words that happened longer ago,', 'start': 623.411, 'duration': 5.946}, {'end': 630.718, 'text': "and that's gonna make it a better predictor.", 'start': 629.357, 'duration': 1.361}, {'end': 633.04, 'text': 'So, you might think making n bigger is a good idea.', 'start': 630.998, 'duration': 2.042}, {'end': 636.063, 'text': 'But the problem is that if you do that, then these sparsity problems get worse.', 'start': 633.401, 'duration': 2.662}, {'end': 638.906, 'text': "Because, uh, let's suppose you say I want a 10-gram language model.", 'start': 636.644, 'duration': 2.262}], 'summary': 'Increasing n in n-gram language model worsens sparsity problems.', 'duration': 24.643, 'max_score': 614.263, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U614263.jpg'}, {'end': 689.593, 'src': 'embed', 'start': 663.206, 'weight': 2, 'content': [{'end': 668.847, 'text': 'So if we look at this equation, uh, you have to think about what do you need to store in order to use your n-gram language model.', 'start': 663.206, 'duration': 5.641}, {'end': 676.65, 'text': 'You need to store this count number for all of the n-grams that you observed in the corpus when you were going through the training corpus counting them.', 'start': 669.808, 'duration': 6.842}, {'end': 683.291, 'text': 'And the problem is that as you increase n, then this number of n-grams that you have to store and count increases.', 'start': 677.65, 'duration': 5.641}, {'end': 689.593, 'text': 'So another problem with increasing n is that the size of your model, of your n-gram model, uh, gets bigger.', 'start': 684.232, 'duration': 5.361}], 'summary': 'Storing count number for n-grams in language model increases as n increases, leading to larger model size.', 'duration': 26.387, 'max_score': 663.206, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U663206.jpg'}, {'end': 757.233, 'src': 'embed', 'start': 719.795, 'weight': 3, 'content': [{'end': 726.038, 'text': 'So I gave it the context of the bigram today, the, and then I asked the trigram language model what word is likely to come next?', 'start': 719.795, 'duration': 6.243}, {'end': 732.96, 'text': 'So the language model said that the top next most likely words are company bank price, Italian Emirates, etc.', 'start': 726.798, 'duration': 6.162}, {'end': 739.443, 'text': "So already just looking at these probabilities that are assigned to these different words, uh, you can see that there's a sparsity problem.", 'start': 733.901, 'duration': 5.542}, {'end': 743.245, 'text': 'For example, the top two most likely words have the exact same probability.', 'start': 739.643, 'duration': 3.602}, {'end': 746.846, 'text': 'And the reason for that is that this number is 4 over 26.', 'start': 743.685, 'duration': 3.161}, {'end': 752.51, 'text': 'So these are all quite small integers, uh, meaning that we only saw, uh, today the company and today the bank four times each.', 'start': 746.846, 'duration': 5.664}, {'end': 757.233, 'text': 'So, um, this is an example of the sparsity problem because overall these are quite low 
counts.', 'start': 753.37, 'duration': 3.863}], 'summary': "Trigram language model predicts next words with sparsity issue: 'company', 'bank', 'price', 'italian', 'emirates'.", 'duration': 37.438, 'max_score': 719.795, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U719795.jpg'}, {'end': 823.417, 'src': 'embed', 'start': 794.668, 'weight': 5, 'content': [{'end': 799.833, 'text': 'So, then price is your next words, and then you just condition on the last two words, which in this example is now the price.', 'start': 794.668, 'duration': 5.165}, {'end': 806.94, 'text': 'So now you get a new probability distribution and you can continue this process, uh, sampling and then conditioning again and sampling.', 'start': 800.714, 'duration': 6.226}, {'end': 811.284, 'text': 'So if you do this long enough, you will get a piece of text.', 'start': 808.922, 'duration': 2.362}, {'end': 816.509, 'text': 'So this is the actual text that I got when I ran this generation process with this trigram language model.', 'start': 811.364, 'duration': 5.145}, {'end': 823.417, 'text': 'So it says today the price of gold per ton while production of shoe lasts and shoe industry.', 'start': 817.375, 'duration': 6.042}], 'summary': 'By conditioning on the last two words, a trigram language model generated a piece of text mentioning the price of gold and shoe production.', 'duration': 28.749, 'max_score': 794.668, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U794668.jpg'}], 'start': 447.149, 'title': 'Issues with n-gram language models', 'summary': 'Discusses problems with n-gram language models, including the impact of simplifying assumptions on word prediction, the sparsity problem caused by zero counts, and the use of smoothing to address the issue of zero probabilities. it also explores the challenges of sparsity in n-gram language models, the impact of increasing n on sparsity, and the challenge of storage size. additionally, it analyzes the sparsity problem in a trigram language model, demonstrating the lack of granular probability distribution and showcasing the process of text generation.', 'chapters': [{'end': 558.523, 'start': 447.149, 'title': 'Issues with n-gram language models', 'summary': 'Discusses problems with n-gram language models, including the impact of simplifying assumptions on word prediction, the sparsity problem caused by zero counts, and the use of smoothing to address the issue of zero probabilities.', 'duration': 111.374, 'highlights': ["The impact of simplifying assumptions on word prediction is highlighted as a problem with n-gram language models. The context of the proctor and the clock should indicate 'exams', but the simplifying assumption leads to a higher likelihood of 'books'.", "The sparsity problem, where a zero count for a word results in zero probability, is identified as a significant issue in language models. If a phrase like 'students open their petri dishes' never occurred in the data, the probability of the next word being 'petri dishes' will be zero.", 'The technique of smoothing, involving the addition of a small number to word counts, is proposed as a partial solution to the sparsity problem. 
Adding a small delta to the count for every word in the vocabulary ensures that every possible word has at least some small probability.']}, {'end': 718.718, 'start': 560.632, 'title': 'N-gram language models', 'summary': 'Discusses sparsity problems in n-gram language models, including the issues with zero denominators and the impact of increasing n on sparsity, as well as the challenge of storage size, and provides an example of building a trigram language model over a 1.7 million word corpus.', 'duration': 158.086, 'highlights': ['The sparsity problems in n-gram language models are discussed, including the issue of zero denominators and the impact of increasing n on sparsity. The chapter details the problems that arise in n-gram language models when the denominator becomes zero and the impact of increasing n on sparsity.', 'The challenge of storing and counting n-grams in the corpus is highlighted, with an increase in n leading to a larger model size. The chapter explains the issue of storage size and the need to store and count an increasing number of n-grams as n grows, leading to a larger model size.', 'An example of building a trigram language model over a 1.7 million word corpus is provided, demonstrating the efficiency of creating the model in a few seconds on a laptop. The chapter provides an example of creating a trigram language model over a 1.7 million word corpus, emphasizing the efficiency of building the model in a few seconds on a laptop.']}, {'end': 854.81, 'start': 719.795, 'title': 'Trigram language model analysis', 'summary': 'Explores the sparsity problem in a trigram language model, demonstrating how low counts lead to a lack of granular probability distribution and showcases the process of text generation, resulting in surprisingly grammatical but incoherent output.', 'duration': 135.015, 'highlights': ['The language model assigns probabilities to next likely words, with the top two most likely words having the same small probability of 4 over 26, showcasing a sparsity problem. The top two most likely words have the exact same probability of 4 over 26, indicating a sparsity problem due to low counts.', 'The text generated from the trigram language model is surprisingly grammatical but incoherent, indicating the limitations of the model in producing meaningful output. The generated text is surprisingly grammatical but incoherent, showcasing the limitations of the trigram language model in producing coherent output.', 'The process of text generation involves conditioning on previous words and sampling from the probability distribution, demonstrating the method of generating text using a language model. 
The process of text generation involves conditioning on previous words and sampling from the probability distribution, providing a method for text generation using a language model.']}], 'duration': 407.661, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U447149.jpg', 'highlights': ['Smoothing is proposed as a partial solution to the sparsity problem.', 'The impact of increasing n on sparsity is discussed in n-gram language models.', 'The challenge of storing and counting n-grams in the corpus is highlighted.', 'The top two most likely words have the exact same probability, indicating a sparsity problem.', 'The limitations of the trigram language model in producing coherent output are showcased.', 'The process of text generation involves conditioning on previous words and sampling from the probability distribution.']}, {'end': 1240.172, 'segs': [{'end': 883.282, 'src': 'embed', 'start': 855.231, 'weight': 3, 'content': [{'end': 857.531, 'text': 'Because if you remember, this is a trigram language model.', 'start': 855.231, 'duration': 2.3}, {'end': 862.092, 'text': 'It has a memory of just the last, well, three or two words depending on how you look at it.', 'start': 857.851, 'duration': 4.241}, {'end': 867.296, 'text': 'So clearly, we need to consider more than three words at a time if we want to model language well.', 'start': 863.174, 'duration': 4.122}, {'end': 874.919, 'text': 'But as we already know, increasing N makes the sparsity problem worse for N-gram language models, and it also increases the model size.', 'start': 868.536, 'duration': 6.383}, {'end': 877.44, 'text': 'Is that a question?', 'start': 877.02, 'duration': 0.42}, {'end': 879.34, 'text': 'How does it know when to put commas?', 'start': 877.46, 'duration': 1.88}, {'end': 883.282, 'text': 'So the question is how does the N-gram language model know when to put commas??', 'start': 880.361, 'duration': 2.921}], 'summary': 'N-gram language model struggles with sparsity and model size, seeking better language modeling', 'duration': 28.051, 'max_score': 855.231, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U855231.jpg'}, {'end': 944.967, 'src': 'embed', 'start': 921.08, 'weight': 0, 'content': [{'end': 928.516, 'text': 'and then it outputs a probability distribution of what the next word might be xt plus 1..', 'start': 921.08, 'duration': 7.436}, {'end': 934.36, 'text': "Okay So when we think about what kind of neural models we met in this class so far, uh, we've already met window-based neural models.", 'start': 928.516, 'duration': 5.844}, {'end': 939.804, 'text': 'And in lecture three, we saw how you could apply a window-based neural model to, uh, named entity recognition.', 'start': 934.66, 'duration': 5.144}, {'end': 944.967, 'text': 'So in that scenario, you take some kind of window around the word that you care about, which in this, uh, example is Paris.', 'start': 940.204, 'duration': 4.763}], 'summary': 'Neural models can predict next word with probability distribution. 
window-based models applied to named entity recognition.', 'duration': 23.887, 'max_score': 921.08, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U921080.jpg'}, {'end': 1029.404, 'src': 'heatmap', 'start': 940.204, 'weight': 0.807, 'content': [{'end': 944.967, 'text': 'So in that scenario, you take some kind of window around the word that you care about, which in this, uh, example is Paris.', 'start': 940.204, 'duration': 4.763}, {'end': 950.631, 'text': 'And then, uh, you get the word embeddings for those, concatenate them, put them through some layers and then you get your decision,', 'start': 945.528, 'duration': 5.103}, {'end': 952.492, 'text': 'which is that Paris is a location.', 'start': 950.631, 'duration': 1.861}, {'end': 954.774, 'text': 'not you know a person or a organization.', 'start': 952.492, 'duration': 2.282}, {'end': 957.321, 'text': "So that's a recap of what we saw in lecture three.", 'start': 955.899, 'duration': 1.422}, {'end': 963.772, 'text': "How would we apply a model like this to language modeling? So here's how you would do it.", 'start': 958.443, 'duration': 5.329}, {'end': 966.454, 'text': "Here's an example of a fixed window neural language model.", 'start': 963.852, 'duration': 2.602}, {'end': 970.976, 'text': 'So again we have some kind of context, which is as the proctor starts at the clock,', 'start': 967.354, 'duration': 3.622}, {'end': 974.718, 'text': "the students open there and we're trying to guess what words might come next.", 'start': 970.976, 'duration': 3.742}, {'end': 978.32, 'text': 'So we have to make a similar simplifying assumption to before.', 'start': 975.618, 'duration': 2.702}, {'end': 985.203, 'text': "Uh, because it's a fixed size window, uh, we have to discard the context except for the window that we're conditioning on.", 'start': 978.34, 'duration': 6.863}, {'end': 988.025, 'text': "So let's suppose that our fixed window is of size four.", 'start': 985.744, 'duration': 2.281}, {'end': 995.033, 'text': "So what we'll do is, similarly to the uh NER model,", 'start': 990.091, 'duration': 4.942}, {'end': 1004.737, 'text': "we're going to represent these words with one-hot vectors and then we'll use those to look up the word embeddings for these words using the uh embedding lookup matrix.", 'start': 995.033, 'duration': 9.704}, {'end': 1009.519, 'text': 'So, then we get all of our word embeddings E1, 2, 3, 4, and then we concatenate them together to get E.', 'start': 1004.937, 'duration': 4.582}, {'end': 1015.577, 'text': 'we put this through a linear layer and a non-linearity function f to get some kind of hidden layer,', 'start': 1010.575, 'duration': 5.002}, {'end': 1021.46, 'text': 'and then we put this through another linear layer and the softmax function and now we have an output probability distribution y hat.', 'start': 1015.577, 'duration': 5.883}, {'end': 1025.242, 'text': "And in our case, because we're trying to predict what word comes next.", 'start': 1021.98, 'duration': 3.262}, {'end': 1029.404, 'text': 'uh, our vector y hat will be of length v, where v is the vocabulary,', 'start': 1025.242, 'duration': 4.162}], 'summary': 'Using word embeddings and neural networks to identify locations and predict words in language modeling.', 'duration': 89.2, 'max_score': 940.204, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U940204.jpg'}, {'end': 1015.577, 'src': 'embed', 'start': 995.033, 'weight': 2, 'content': 
[{'end': 1004.737, 'text': "we're going to represent these words with one-hot vectors and then we'll use those to look up the word embeddings for these words using the uh embedding lookup matrix.", 'start': 995.033, 'duration': 9.704}, {'end': 1009.519, 'text': 'So, then we get all of our word embeddings E1, 2, 3, 4, and then we concatenate them together to get E.', 'start': 1004.937, 'duration': 4.582}, {'end': 1015.577, 'text': 'we put this through a linear layer and a non-linearity function f to get some kind of hidden layer,', 'start': 1010.575, 'duration': 5.002}], 'summary': 'Word vectors represented by one-hot vectors, then concatenated and processed through linear layer to get a hidden layer.', 'duration': 20.544, 'max_score': 995.033, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U995033.jpg'}, {'end': 1076.006, 'src': 'embed', 'start': 1048.433, 'weight': 1, 'content': [{'end': 1051.714, 'text': 'So none of this should be, um, unfamiliar to you because you saw it all last week.', 'start': 1048.433, 'duration': 3.281}, {'end': 1055.016, 'text': "We're just applying a window-based model to a different task such as language modeling.", 'start': 1051.754, 'duration': 3.262}, {'end': 1061.693, 'text': 'Okay, So what are some good things about this model compared to n-gram language models?', 'start': 1057.549, 'duration': 4.144}, {'end': 1065.917, 'text': "So one uh advantage I'd say is that there's no sparsity problem.", 'start': 1062.694, 'duration': 3.223}, {'end': 1072.863, 'text': "If you remember, an n-gram language model has a sparsity problem, which is that if you've never seen a particular n-gram in training,", 'start': 1066.618, 'duration': 6.245}, {'end': 1074.985, 'text': "then you can't assign any probability to it.", 'start': 1072.863, 'duration': 2.122}, {'end': 1076.006, 'text': "You don't have any data on it.", 'start': 1075.045, 'duration': 0.961}], 'summary': 'Applying window-based model to language modeling, no sparsity problem compared to n-gram models.', 'duration': 27.573, 'max_score': 1048.433, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U1048433.jpg'}, {'end': 1127.42, 'src': 'embed', 'start': 1095.454, 'weight': 5, 'content': [{'end': 1101.597, 'text': 'So, uh, this is an advantage by, uh, comparison, you just have to store all of the word vectors for all the words in your vocabulary.', 'start': 1095.454, 'duration': 6.143}, {'end': 1106.004, 'text': 'Uh, but there are quite a lot of problems with this fixed window language model.', 'start': 1103.562, 'duration': 2.442}, {'end': 1107.366, 'text': 'So, here are some remaining problems.', 'start': 1106.164, 'duration': 1.202}, {'end': 1111.269, 'text': 'Uh, one is that your fixed window is probably too small.', 'start': 1109.087, 'duration': 2.182}, {'end': 1117.715, 'text': "No matter how big you make your fixed window, uh, you're probably going to be losing some kind of useful context that you would want to use sometimes.", 'start': 1111.649, 'duration': 6.066}, {'end': 1127.42, 'text': 'And in fact, if you try to enlarge the window size, then you also have to enlarge the size of your, uh, weight vector, sorry, your weight matrix W.', 'start': 1119.093, 'duration': 8.327}], 'summary': 'Challenges of fixed window language model: size limitations, loss of useful context, and scaling weight matrix', 'duration': 31.966, 'max_score': 1095.454, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U1095454.jpg'}], 'start': 855.231, 'title': 'Language modeling challenges and word embedding', 'summary': 'Discusses the limitations of trigram language model, the impact of increasing n on sparsity problem, and the concept of neural language model. it also explains the process of representing words with one-hot vectors, using them to look up word embeddings, applying a window-based model to language modeling, and the advantages and problems of the fixed window language model compared to n-gram language models.', 'chapters': [{'end': 995.033, 'start': 855.231, 'title': 'Challenges of n-gram language models', 'summary': 'Discusses the limitations of trigram language model, the impact of increasing n on sparsity problem, and the concept of neural language model for better language modeling.', 'duration': 139.802, 'highlights': ['The trigram language model has a memory of just the last three or two words, leading to limitations in modeling language well.', 'Increasing N exacerbates the sparsity problem for N-gram language models and also increases the model size, impacting its efficiency.', 'The concept of neural language model is introduced as a potential solution for better language modeling, by taking inputs as a sequence of words and outputting a probability distribution of the next word.']}, {'end': 1240.172, 'start': 995.033, 'title': 'Word embedding and language modeling', 'summary': 'Explains the process of representing words with one-hot vectors, using them to look up word embeddings, applying a window-based model to language modeling, and the advantages and problems of the fixed window language model compared to n-gram language models.', 'duration': 245.139, 'highlights': ['Advantages of fixed window language model over n-gram language models The fixed window language model does not have a sparsity problem and does not require storing all observed n-grams, allowing the use of any n-gram for output distribution, thus providing better predictions.', 'Disadvantages of fixed window language model The fixed window size is likely too small, resulting in the loss of useful context, and enlarging the window size increases the size of the weight matrix W, making it inefficient. 
Additionally, different weights in W multiply different word embeddings, leading to inefficiency in learning and processing similar functions.', 'Representation of words with one-hot vectors and word embeddings Words are represented with one-hot vectors and used to look up the word embeddings using an embedding lookup matrix, resulting in a concatenated word embeddings E, which are put through linear layers and a softmax function to obtain an output probability distribution y hat.']}], 'duration': 384.941, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U855231.jpg', 'highlights': ['The concept of neural language model is introduced as a potential solution for better language modeling, by taking inputs as a sequence of words and outputting a probability distribution of the next word.', 'Advantages of fixed window language model over n-gram language models The fixed window language model does not have a sparsity problem and does not require storing all observed n-grams, allowing the use of any n-gram for output distribution, thus providing better predictions.', 'Representation of words with one-hot vectors and word embeddings Words are represented with one-hot vectors and used to look up the word embeddings using an embedding lookup matrix, resulting in a concatenated word embeddings E, which are put through linear layers and a softmax function to obtain an output probability distribution y hat.', 'The trigram language model has a memory of just the last three or two words, leading to limitations in modeling language well.', 'Increasing N exacerbates the sparsity problem for N-gram language models and also increases the model size, impacting its efficiency.', 'Disadvantages of fixed window language model The fixed window size is likely too small, resulting in the loss of useful context, and enlarging the window size increases the size of the weight matrix W, making it inefficient. 
Additionally, different weights in W multiply different word embeddings, leading to inefficiency in learning and processing similar functions.']}, {'end': 1884.005, 'segs': [{'end': 1309.776, 'src': 'embed', 'start': 1281.035, 'weight': 0, 'content': [{'end': 1283.357, 'text': 'The idea is that you have a sequence of hidden states.', 'start': 1281.035, 'duration': 2.322}, {'end': 1287.302, 'text': 'Instead of just having, for example, one hidden state, as we did in the the previous model,', 'start': 1283.758, 'duration': 3.544}, {'end': 1290.265, 'text': 'we have a sequence of hidden states and we have as many of them as we have inputs.', 'start': 1287.302, 'duration': 2.963}, {'end': 1299.535, 'text': 'And the important thing is that each hidden state ht is computed based on the previous hidden state and also the input on that step.', 'start': 1291.466, 'duration': 8.069}, {'end': 1307.193, 'text': "So the reason why they're called hidden states is because you could think of this as a single state that's mutating over time.", 'start': 1301.004, 'duration': 6.189}, {'end': 1309.776, 'text': "It's kind of like several versions of the same thing.", 'start': 1307.573, 'duration': 2.203}], 'summary': 'Model involves sequence of hidden states based on previous state and input.', 'duration': 28.741, 'max_score': 1281.035, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U1281035.jpg'}, {'end': 1443.235, 'src': 'heatmap', 'start': 1355.965, 'weight': 0.754, 'content': [{'end': 1357.887, 'text': "Okay So that's a simple diagram of an RNN.", 'start': 1355.965, 'duration': 1.922}, {'end': 1359.849, 'text': "Uh, here I'm gonna give you a bit more detail.", 'start': 1358.367, 'duration': 1.482}, {'end': 1362.812, 'text': "So here's how you would apply an RNN to do language modeling.", 'start': 1359.889, 'duration': 2.923}, {'end': 1367.815, 'text': "So, uh, again, let's suppose that we have some kind of text so far.", 'start': 1364.514, 'duration': 3.301}, {'end': 1375.077, 'text': "My text is only four words long, but you can assume that it could be any length, right? 
It's just short because we can't fit more on the slide.", 'start': 1368.395, 'duration': 6.682}, {'end': 1377.837, 'text': 'So, you have some sequence of text which could be kind of long.', 'start': 1375.697, 'duration': 2.14}, {'end': 1385.399, 'text': "And again, we're going to represent these via some kind of one-hot vectors and use those to look up the word embeddings from our embedding matrix.", 'start': 1378.797, 'duration': 6.602}, {'end': 1393.93, 'text': 'So then to compute the first hidden state, H1, we need to compute it based on the previous hidden state and the current input.', 'start': 1387.307, 'duration': 6.623}, {'end': 1395.711, 'text': 'We already have the current input.', 'start': 1394.611, 'duration': 1.1}, {'end': 1399.753, 'text': "that's E1, uh, but the question is where do we get this first hidden state from right?", 'start': 1395.711, 'duration': 4.042}, {'end': 1400.834, 'text': 'What comes before H1?', 'start': 1399.793, 'duration': 1.041}, {'end': 1404.656, 'text': 'So we often call uh, the initial hidden state H0, uh, yeah,', 'start': 1401.394, 'duration': 3.262}, {'end': 1410.999, 'text': "we call it the initial hidden state and it can either be something that you learn like it's a parameter of the network and you learn, uh,", 'start': 1404.656, 'duration': 6.343}, {'end': 1414.521, 'text': "how to initialize it, or you can assume something like maybe it's the zero vector.", 'start': 1410.999, 'duration': 3.522}, {'end': 1422.924, 'text': 'So the, uh, the formula we use to compute the new hidden state based on the previous one and also the current input is, uh, written on the left.', 'start': 1416.3, 'duration': 6.624}, {'end': 1430.948, 'text': 'So you do a linear transformation on the previous hidden state and on the current inputs and then you add some kind of bias and then put it through a non-linearity,', 'start': 1423.404, 'duration': 7.544}, {'end': 1432.489, 'text': 'like, for example, the sigmoid function.', 'start': 1430.948, 'duration': 1.541}, {'end': 1434.55, 'text': 'And that gives you your new hidden state.', 'start': 1433.409, 'duration': 1.141}, {'end': 1443.235, 'text': "Okay So once you've done that, then you can compute the next hidden state, and you can keep unrolling the network like this.", 'start': 1437.752, 'duration': 5.483}], 'summary': 'Applying rnn for language modeling using one-hot vectors and word embeddings to compute hidden states.', 'duration': 87.27, 'max_score': 1355.965, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U1355965.jpg'}, {'end': 1587.266, 'src': 'embed', 'start': 1543.502, 'weight': 1, 'content': [{'end': 1550.123, 'text': 'Um, when we do like back propagation, does that mean we only update like W-E or do we update both W-H and W-E??', 'start': 1543.502, 'duration': 6.621}, {'end': 1553.984, 'text': 'So the question is uh, you say we reuse the matrix.', 'start': 1551.003, 'duration': 2.981}, {'end': 1555.924, 'text': 'do we update W-E and W-H or just one?', 'start': 1553.984, 'duration': 1.94}, {'end': 1558.744, 'text': 'So you certainly learn both W-E and W-H.', 'start': 1556.184, 'duration': 2.56}, {'end': 1563.925, 'text': "Uh, I suppose I was emphasizing W-H more, but yeah, they're both matrices that are applied repeatedly.", 'start': 1559.105, 'duration': 4.82}, {'end': 1566.946, 'text': "There was also a question about backprop, but we're gonna cover that later in this lecture.", 'start': 1564.165, 'duration': 2.781}, {'end': 1571.451, 'text': 
'Okay Moving on for now.', 'start': 1568.328, 'duration': 3.123}, {'end': 1577.076, 'text': 'Um so what are some advantages and disadvantages of this RNN language model?', 'start': 1571.471, 'duration': 5.605}, {'end': 1582.201, 'text': 'So here are some advantages that we can see, uh, in comparison to the fixed window one.', 'start': 1577.957, 'duration': 4.244}, {'end': 1587.266, 'text': 'So an obvious advantage is that this RNN can process any length of input.', 'start': 1583.823, 'duration': 3.443}], 'summary': 'During backpropagation, both w-e and w-h are updated, with rnn having the advantage of processing any input length.', 'duration': 43.764, 'max_score': 1543.502, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U1543502.jpg'}, {'end': 1714.541, 'src': 'embed', 'start': 1688.89, 'weight': 4, 'content': [{'end': 1695.576, 'text': "So, especially if you're trying to compute an RNN over a pretty long sequence of inputs, this means that the RNN can be pretty slow to compute.", 'start': 1688.89, 'duration': 6.686}, {'end': 1703.958, 'text': "Another disadvantage of RNNs is that it turns out, in practice, it's quite difficult to access information from many steps back.", 'start': 1697.657, 'duration': 6.301}, {'end': 1709.079, 'text': 'So, even though I said we should be able to remember about the proctor and the clock and use that to predict exams, not books,', 'start': 1704.378, 'duration': 4.701}, {'end': 1714.541, 'text': "it turns out that RNNs at least the ones that I've presented in this lecture, um, are not as good as that as you would think.", 'start': 1709.079, 'duration': 5.462}], 'summary': 'Rnns can be slow over long sequences, struggle to access distant information, and may not perform as expected.', 'duration': 25.651, 'max_score': 1688.89, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U1688890.jpg'}], 'start': 1240.172, 'title': 'Recurrent neural networks (rnn)', 'summary': 'Introduces recurrent neural networks (rnns) as a solution for processing variable-length inputs, highlighting their application in language modeling. 
it also explains the computation of hidden states in rnn language model, along with its advantages and disadvantages.', 'chapters': [{'end': 1377.837, 'start': 1240.172, 'title': 'Recurrent neural networks (rnn)', 'summary': 'Introduces the need for neural architectures that can process any length input and presents recurrent neural networks (rnns) as a solution, highlighting their ability to handle variable-length inputs and the application of rnns in language modeling.', 'duration': 137.665, 'highlights': ['Recurrent neural networks (RNNs) are introduced as a solution for processing any length input, addressing the limitations of fixed size neural models.', 'RNNs feature a sequence of hidden states, computed based on the previous hidden state and the input at each step, allowing them to handle variable-length inputs.', 'The same weight matrix W is applied on every time step of the RNN, enabling the processing of inputs of any length without the need for different weights on every step.', 'RNNs can produce optional outputs, denoted as y hats, on each step, offering flexibility in the computation of outputs based on the specific requirements of the task or application.', 'The application of RNNs in language modeling is illustrated, emphasizing their ability to handle sequences of text of any length for tasks such as language modeling.']}, {'end': 1884.005, 'start': 1378.797, 'title': 'Rnn language model', 'summary': 'Explains the computation of hidden states in rnn language model, advantages including processing any length of input and fixed model size, and disadvantages such as slow computation and difficulty in accessing information from many steps back.', 'duration': 505.208, 'highlights': ['Advantages of RNN language model The RNN can process any length of input, use information from many steps back, and has a fixed model size.', "Disadvantages of RNN language model The recurrent computation is slow and it's difficult to access information from many steps back in practice.", 'Computation of hidden state in RNN To compute the first hidden state, a linear transformation on the previous hidden state and the current input is done, then put through a non-linearity like the sigmoid function to get the new hidden state.', 'Learning embeddings in RNN One can choose to use pre-trained embeddings, fine-tune them, or initialize them to small random values and learn them from scratch.', 'Updating matrix in backpropagation Both the W-E and W-H matrices are learned and applied repeatedly during backpropagation.']}], 'duration': 643.833, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U1240172.jpg', 'highlights': ['RNNs feature a sequence of hidden states, computed based on the previous hidden state and the input at each step, allowing them to handle variable-length inputs.', 'The application of RNNs in language modeling is illustrated, emphasizing their ability to handle sequences of text of any length for tasks such as language modeling.', 'Advantages of RNN language model: The RNN can process any length of input, use information from many steps back, and has a fixed model size.', 'Updating matrix in backpropagation: Both the W-E and W-H matrices are learned and applied repeatedly during backpropagation.', "Disadvantages of RNN language model: The recurrent computation is slow and it's difficult to access information from many steps back in practice."]}, {'end': 2740.826, 'segs': [{'end': 1933.898, 'src': 'embed', 'start': 1902.053, 'weight': 
0, 'content': [{'end': 1909.876, 'text': 'Um so, I suppose in practice, you choose how long the inputs are in training, either based on what your data is or maybe based on, uh,', 'start': 1902.053, 'duration': 7.823}, {'end': 1910.836, 'text': 'your efficiency concerns.', 'start': 1909.876, 'duration': 0.96}, {'end': 1913.737, 'text': 'So, maybe you make it artificially shorter by chopping it up.', 'start': 1910.856, 'duration': 2.881}, {'end': 1918.032, 'text': 'Um, what was the other question? Does WH depend on the length? Yeah.', 'start': 1914.478, 'duration': 3.554}, {'end': 1922.013, 'text': 'Okay So the question was, does WH depend on the length used? So, no.', 'start': 1918.632, 'duration': 3.381}, {'end': 1929.056, 'text': "And that's one of the good things in the advantages list is that the model size doesn't increase for longer input because we just unroll the RNN,", 'start': 1922.113, 'duration': 6.943}, {'end': 1931.117, 'text': 'applying the same weights again and again for as long as we like.', 'start': 1929.056, 'duration': 2.061}, {'end': 1933.898, 'text': "There's no need to have more weights just because you have a longer input.", 'start': 1931.297, 'duration': 2.601}], 'summary': "Rnn model size doesn't increase for longer input, same weights applied repeatedly.", 'duration': 31.845, 'max_score': 1902.053, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U1902053.jpg'}, {'end': 1999.609, 'src': 'embed', 'start': 1960.174, 'weight': 3, 'content': [{'end': 1963.957, 'text': 'For example Word2Vec, and you just download them and use them, or maybe you learn them from scratch,', 'start': 1960.174, 'duration': 3.783}, {'end': 1966.839, 'text': 'in which case you decide at the beginning of training how big you want those vectors to be.', 'start': 1963.957, 'duration': 2.882}, {'end': 1968.981, 'text': "Okay I'm gonna move on for now.", 'start': 1968.1, 'duration': 0.881}, {'end': 1975.854, 'text': "So we've learned what a RNN language model is and we've learned how you would, uh, run one forwards.", 'start': 1971.485, 'duration': 4.369}, {'end': 1978.839, 'text': 'but the question remains how would you train an RNN language model??', 'start': 1975.854, 'duration': 2.985}, {'end': 1979.901, 'text': 'How would you learn it?', 'start': 1979.28, 'duration': 0.621}, {'end': 1983.944, 'text': 'So, as always in machine learning,', 'start': 1982.304, 'duration': 1.64}, {'end': 1991.046, 'text': "our answer starts with you're going to get a big corpus of text and we're gonna call that just a sequence of words x1 up to x, capital T.", 'start': 1983.944, 'duration': 7.102}, {'end': 1999.609, 'text': 'So you feed the sequence of words into the RNN language model and then the idea is that you compute the output distribution y hat t for every step t.', 'start': 1991.046, 'duration': 8.563}], 'summary': 'Training an rnn language model involves feeding a sequence of words into the model and computing the output distribution for each step.', 'duration': 39.435, 'max_score': 1960.174, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U1960174.jpg'}, {'end': 2096.214, 'src': 'heatmap', 'start': 2012.915, 'weight': 0.933, 'content': [{'end': 2015.256, 'text': "Okay So once you've done that, then you can define the loss function.", 'start': 2012.915, 'duration': 2.341}, {'end': 2016.977, 'text': 'And this should be familiar to you by now.', 'start': 2015.676, 'duration': 1.301}, {'end': 
2025.981, 'text': 'Uh. this is the cross entropy between our predicted probability distribution y hat t, and the true uh distribution, which is y hat.', 'start': 2017.517, 'duration': 8.464}, {'end': 2032.884, 'text': 'sorry, just y t, which is a one-hot vector, uh, representing the true next word, which is x t plus 1..', 'start': 2025.981, 'duration': 6.903}, {'end': 2039.567, 'text': "So as you've seen before, this, uh, cross entropy between those two vectors can be written also as a negative log probability.", 'start': 2032.884, 'duration': 6.683}, {'end': 2049.214, 'text': 'And then, lastly, if you average this cross-entropy loss across every step, uh, every t in the corpus times step t, then uh,', 'start': 2041.564, 'duration': 7.65}, {'end': 2051.697, 'text': 'this gives you your overall loss for the entire training set.', 'start': 2049.214, 'duration': 2.483}, {'end': 2061.797, 'text': 'Okay. So, just to make that even more clear with the picture uh, suppose that our corpus is the students open their exams, et cetera,', 'start': 2056.155, 'duration': 5.642}, {'end': 2062.777, 'text': 'and it goes on for a long time.', 'start': 2061.797, 'duration': 0.98}, {'end': 2070.4, 'text': "Then what we'd be doing is we'd be running our RNN over this text, and then on every step, we would be predicting the probability distribution y hats.", 'start': 2063.157, 'duration': 7.243}, {'end': 2074.141, 'text': 'And then from each of those, you can calculate what your loss is, which is the JT.', 'start': 2070.58, 'duration': 3.561}, {'end': 2081.744, 'text': 'And then, uh, on the first step, the loss would be the negative log probability of the next words, which is in this example, students, and so on.', 'start': 2074.92, 'duration': 6.824}, {'end': 2084.864, 'text': 'Each of those is the negative log probability of the next word.', 'start': 2082.184, 'duration': 2.68}, {'end': 2090.992, 'text': "And then once you've computed all of those, you can add them all up and average them, and then this gives you your final loss.", 'start': 2085.949, 'duration': 5.043}, {'end': 2096.214, 'text': "Okay So there's a caveat here.", 'start': 2095.053, 'duration': 1.161}], 'summary': 'Defining loss function using cross entropy and averaging for overall loss.', 'duration': 83.299, 'max_score': 2012.915, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U2012915.jpg'}, {'end': 2039.567, 'src': 'embed', 'start': 2017.517, 'weight': 5, 'content': [{'end': 2025.981, 'text': 'Uh. 
this is the cross entropy between our predicted probability distribution y hat t, and the true uh distribution, which is y hat.', 'start': 2017.517, 'duration': 8.464}, {'end': 2032.884, 'text': 'sorry, just y t, which is a one-hot vector, uh, representing the true next word, which is x t plus 1..', 'start': 2025.981, 'duration': 6.903}, {'end': 2039.567, 'text': "So as you've seen before, this, uh, cross entropy between those two vectors can be written also as a negative log probability.", 'start': 2032.884, 'duration': 6.683}], 'summary': 'Cross entropy measures prediction accuracy using negative log probability.', 'duration': 22.05, 'max_score': 2017.517, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U2017517.jpg'}, {'end': 2224.248, 'src': 'heatmap', 'start': 2132.828, 'weight': 0.778, 'content': [{'end': 2139.13, 'text': "but that's actually a batch of sentences and then you compute the gradients with respect to that batch of sentences, update your weights and repeat.", 'start': 2132.828, 'duration': 6.302}, {'end': 2146.193, 'text': 'Any questions at this point? Okay.', 'start': 2140.391, 'duration': 5.802}, {'end': 2148.015, 'text': 'So, uh, moving on to backprop.', 'start': 2146.674, 'duration': 1.341}, {'end': 2150.937, 'text': "Don't worry, there won't be as much backprop as there was last week.", 'start': 2148.075, 'duration': 2.862}, {'end': 2153.039, 'text': 'But uh, there is an interesting question here, right?', 'start': 2151.097, 'duration': 1.942}, {'end': 2158.643, 'text': 'So the uh characteristic thing about RNNs is that they apply the same weight matrix repeatedly.', 'start': 2153.419, 'duration': 5.224}, {'end': 2162.146, 'text': "So the question is what's the derivative of our loss function?", 'start': 2159.304, 'duration': 2.842}, {'end': 2167.751, 'text': "Let's say on step t, what's the derivative of that loss with respect to the repeated weight matrix wh?", 'start': 2162.286, 'duration': 5.465}, {'end': 2176.257, 'text': 'So the answer is that the derivative of the loss, uh, the gradient with respect to the repeated weight,', 'start': 2169.355, 'duration': 6.902}, {'end': 2179.497, 'text': 'is the sum of the gradient with respect to each time it appears.', 'start': 2176.257, 'duration': 3.24}, {'end': 2181.198, 'text': "And that's what that equation says.", 'start': 2179.978, 'duration': 1.22}, {'end': 2185.659, 'text': 'So on the right, the notation with the vertical line in the i is saying uh,', 'start': 2181.438, 'duration': 4.221}, {'end': 2190.1, 'text': 'the derivative of the loss with respect to wh when it appears on the ith step.', 'start': 2185.659, 'duration': 4.441}, {'end': 2197.488, 'text': "Okay So, so why is that true? 
Uh, to sketch why this is true, uh, I'm gonna remind you of the multivariable chain rule.", 'start': 2191.144, 'duration': 6.344}, {'end': 2202.331, 'text': 'So, uh, this is a screenshot from a Khan Academy article on the multivariable chain rule.', 'start': 2198.208, 'duration': 4.123}, {'end': 2206.113, 'text': "And, uh, I advise you to check it out if you want to learn more because it's very easy to understand.", 'start': 2202.751, 'duration': 3.362}, {'end': 2215.98, 'text': 'Uh, and what it says is given a function f, which depends on x and y, which are both themselves functions of some variable t,', 'start': 2206.134, 'duration': 9.846}, {'end': 2224.248, 'text': 'then if you want to get the derivative of f with respect to t, then you need to do the chain rule across x and y separately and then add them up.', 'start': 2215.98, 'duration': 8.268}], 'summary': 'Rnns apply the same weight matrix repeatedly, and the derivative of the loss with respect to the repeated weight matrix is the sum of the gradient with respect to each time it appears.', 'duration': 91.42, 'max_score': 2132.828, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U2132828.jpg'}, {'end': 2341.553, 'src': 'embed', 'start': 2315.818, 'weight': 6, 'content': [{'end': 2324.025, 'text': 'So, this algorithm of computing each of these, uh, each of these gradients with respect to the previous one is called backpropagation through time.', 'start': 2315.818, 'duration': 8.207}, {'end': 2327.606, 'text': 'And, um, I always think that this sounds way more sci-fi than it is.', 'start': 2324.625, 'duration': 2.981}, {'end': 2330.208, 'text': "It sounds like it's time travel or something, but it's actually pretty simple.", 'start': 2327.646, 'duration': 2.562}, {'end': 2336.17, 'text': "Uh, it's just the name you give to applying the backprop algorithm to a recurrent neural network.", 'start': 2330.228, 'duration': 5.942}, {'end': 2341.553, 'text': 'Any questions at this point? 
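For reference, the quantities described in this stretch of the lecture can be written out compactly. This is only a restatement of what the transcript says, using the slide notation: J^(t) is the step-t loss, y^(t) is the one-hot true next word, y-hat^(t) is the predicted distribution, and W_h is the repeated weight matrix.

J^{(t)}(\theta) = \mathrm{CE}\big(y^{(t)}, \hat{y}^{(t)}\big) = -\sum_{w \in V} y^{(t)}_w \log \hat{y}^{(t)}_w = -\log \hat{y}^{(t)}_{x_{t+1}}

J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta)

\frac{\partial J^{(t)}}{\partial W_h} = \sum_{i=1}^{t} \left. \frac{\partial J^{(t)}}{\partial W_h} \right|_{(i)} \quad \text{(multivariable chain rule: sum over each appearance of } W_h \text{)}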
Yep.', 'start': 2339.012, 'duration': 2.541}], 'summary': 'Backpropagation through time is the algorithm for recurrent neural networks.', 'duration': 25.735, 'max_score': 2315.818, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U2315818.jpg'}, {'end': 2586.349, 'src': 'embed', 'start': 2557.099, 'weight': 7, 'content': [{'end': 2558.78, 'text': "Okay So, uh, let's have some fun with this.", 'start': 2557.099, 'duration': 1.681}, {'end': 2563.521, 'text': 'Uh, you can generate, uh, text using an RNN language model.', 'start': 2559.22, 'duration': 4.301}, {'end': 2570.964, 'text': 'If you train the RNN language model on any kind of text, then you can use it to generate text in that style.', 'start': 2564.181, 'duration': 6.783}, {'end': 2575.505, 'text': 'And in fact, this has become a whole kind of genre of Internet humor that you might have seen.', 'start': 2571.644, 'duration': 3.861}, {'end': 2582.528, 'text': 'So, uh, for example, here is an RNN language model trained on Obama speeches, and I found this in a blog post online.', 'start': 2576.005, 'duration': 6.523}, {'end': 2586.349, 'text': "So, here's the text that the RNN language model generated.", 'start': 2583.588, 'duration': 2.761}], 'summary': 'Rnn language model can generate text in a style trained on specific content, leading to internet humor.', 'duration': 29.25, 'max_score': 2557.099, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U2557099.jpg'}], 'start': 1885.126, 'title': 'Rnn training and language model', 'summary': "Discusses the impact of input length on training and the model size in rnn, highlighting that the model size doesn't increase for longer input lengths. it also explains the process of training an rnn language model, including various aspects of the training process and model utilization.", 'chapters': [{'end': 1933.898, 'start': 1885.126, 'title': 'Rnn training length and model size', 'summary': "Discusses the impact of input length on training and the model size in rnn, highlighting that the model size doesn't increase for longer input lengths as the rnn unrolls and applies the same weights for as long as needed.", 'duration': 48.772, 'highlights': ["The model size doesn't increase for longer input in RNN as it unrolls and applies the same weights for as long as needed, leading to no need for more weights despite longer input lengths.", 'The input length during training can be chosen based on data or efficiency concerns, allowing the inputs to be artificially shortened by chopping them up.', "The question of whether the model's performance depends on the input length was addressed, with the response being that it does not."]}, {'end': 2740.826, 'start': 1936.539, 'title': 'Training rnn language model', 'summary': 'Explains the process of training an rnn language model, including choosing the dimension of word vectors, computing the output distribution, defining the loss function, backpropagation through time, and using the model for text generation.', 'duration': 804.287, 'highlights': ['The process of training an RNN language model starts with choosing the dimension of word vectors, which can be pre-trained or learned from scratch. 
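To make the training procedure described in this chapter concrete, here is a minimal sketch of one training step. PyTorch is an assumption of this example rather than something stated in the lecture, and the module names, sizes, and random toy batch are purely illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RNNLanguageModel(nn.Module):
    """Minimal RNN language model: embed -> recurrent layer -> logits over the vocabulary."""
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)            # word vectors, learned from scratch here
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)  # the same weights are applied at every step
        self.out = nn.Linear(hidden_dim, vocab_size)                # hidden state -> next-word scores

    def forward(self, tokens):                  # tokens: (batch, seq_len) word indices
        h, _ = self.rnn(self.embed(tokens))     # h: (batch, seq_len, hidden_dim), one hidden state per step
        return self.out(h)                      # output distribution (as logits) at every step

vocab_size = 10_000
model = RNNLanguageModel(vocab_size)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One step of training on a toy batch of sentences (random indices stand in for real text).
batch = torch.randint(0, vocab_size, (32, 20))
inputs, targets = batch[:, :-1], batch[:, 1:]   # predict word t+1 from words 1..t
optimizer.zero_grad()
logits = model(inputs)
# Cross entropy of the predicted distribution against the true next word,
# averaged over every time step in the batch: this plays the role of J(theta).
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                 # backpropagation through time
optimizer.step()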
The dimension of word vectors is chosen by either using pre-trained word vectors like Word2Vec or learning them from scratch, determining the size of the vectors at the beginning of training.', 'The RNN language model computes the output distribution for every step, predicting the probability of the next words at each step. The RNN language model computes the output distribution y hat t for every step, predicting the probability of the next words on every step in the sequence of words.', 'The loss function is defined as the cross-entropy between the predicted probability distribution and the true distribution, averaged across the entire training set. The loss function is defined as the cross-entropy between the predicted probability distribution and the true distribution, averaged across the entire training set, representing the overall loss.', 'Backpropagation through time is used to compute the gradient with respect to the recurrent weight matrix by accumulating the sum as the algorithm progresses. Backpropagation through time is used to compute the gradient with respect to the recurrent weight matrix by accumulating the sum as the algorithm progresses, allowing the computation of gradients with respect to previous ones.', 'The RNN language model can be used for text generation, where the model samples words based on the probability distribution from the previous step and uses the sampled word as input for the next step. The RNN language model can be used for text generation, sampling words based on the probability distribution from the previous step and using the sampled word as input for the next step, generating text in the style of the trained model.']}], 'duration': 855.7, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U1885126.jpg', 'highlights': ["The model size doesn't increase for longer input in RNN as it unrolls and applies the same weights for as long as needed, leading to no need for more weights despite longer input lengths.", 'The input length during training can be chosen based on data or efficiency concerns, allowing the inputs to be artificially shortened by chopping them up.', "The question of whether the model's performance depends on the input length was addressed, with the response being that it does not.", 'The process of training an RNN language model starts with choosing the dimension of word vectors, which can be pre-trained or learned from scratch.', 'The RNN language model computes the output distribution for every step, predicting the probability of the next words at each step.', 'The loss function is defined as the cross-entropy between the predicted probability distribution and the true distribution, averaged across the entire training set, representing the overall loss.', 'Backpropagation through time is used to compute the gradient with respect to the recurrent weight matrix by accumulating the sum as the algorithm progresses.', 'The RNN language model can be used for text generation, sampling words based on the probability distribution from the previous step and using the sampled word as input for the next step, generating text in the style of the trained model.']}, {'end': 3488.757, 'segs': [{'end': 2827.785, 'src': 'embed', 'start': 2785.967, 'weight': 3, 'content': [{'end': 2790.729, 'text': "Like for example, shape mixture into the moderate oven is grammatical but it doesn't make any sense.", 'start': 2785.967, 'duration': 4.762}, {'end': 2792.904, 'text': 'Okay Last example.', 'start': 
2791.863, 'duration': 1.041}, {'end': 2797.108, 'text': "So, here is an RNN language model that's trained on paint color names.", 'start': 2793.585, 'duration': 3.523}, {'end': 2804.616, 'text': "And, uh, this is an example of a character level language model because it's predicting what character comes next, not what word comes next.", 'start': 2797.869, 'duration': 6.747}, {'end': 2806.998, 'text': "And this is why it's able to come up with new words.", 'start': 2805.036, 'duration': 1.962}, {'end': 2811.86, 'text': 'Another thing to note is that this language model was trained to be conditioned on some kind of input.', 'start': 2808.099, 'duration': 3.761}, {'end': 2815.741, 'text': 'So here the input is the color itself, I think, represented by the three numbers.', 'start': 2812.26, 'duration': 3.481}, {'end': 2820.523, 'text': "that's probably RGB numbers, and it generated some names for the colors.", 'start': 2815.741, 'duration': 4.782}, {'end': 2822.103, 'text': 'And I think these are pretty funny.', 'start': 2821.143, 'duration': 0.96}, {'end': 2825.044, 'text': 'My favorite one is Stanky Bean, which is in the bottom right.', 'start': 2822.163, 'duration': 2.881}, {'end': 2827.785, 'text': "Um, so it's, it's pretty creative.", 'start': 2825.845, 'duration': 1.94}], 'summary': 'An rnn language model trained on paint color names can generate new words and funny color names, like stanky bean.', 'duration': 41.818, 'max_score': 2785.967, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U2785967.jpg'}, {'end': 2902.034, 'src': 'embed', 'start': 2871.177, 'weight': 5, 'content': [{'end': 2875.197, 'text': 'the examples that people put online were hand-selected by humans to be the funniest examples.', 'start': 2871.177, 'duration': 4.02}, {'end': 2881.979, 'text': "Like I think all of the examples I've shown today were definitely hand-selected by humans as the funniest examples that the RNN came up with.", 'start': 2875.497, 'duration': 6.482}, {'end': 2885.139, 'text': 'And in some cases, they might even have been edited by a human.', 'start': 2882.359, 'duration': 2.78}, {'end': 2888.44, 'text': 'So, uh, yeah, you do need to be a little bit skeptical when you look at these examples.', 'start': 2885.699, 'duration': 2.741}, {'end': 2896.306, 'text': 'Yep In the Harry Potter one, there was an opening quote and then there was a closing quote.', 'start': 2889.66, 'duration': 6.646}, {'end': 2902.034, 'text': 'So do you expect that when it puts an opening quote and it keeps putting more words?', 'start': 2896.927, 'duration': 5.107}], 'summary': 'Online examples were hand-selected by humans for humor, requiring skepticism in their interpretation.', 'duration': 30.857, 'max_score': 2871.177, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U2871177.jpg'}, {'end': 3163.98, 'src': 'heatmap', 'start': 3118.788, 'weight': 0.945, 'content': [{'end': 3121.729, 'text': "So if you look at it, you can see that that's what this formula is saying.", 'start': 3118.788, 'duration': 2.941}, {'end': 3127.692, 'text': "It's saying that for every uh word, xt, lowercase, t in the corpus.", 'start': 3122.59, 'duration': 5.102}, {'end': 3131.753, 'text': "uh, we're computing the probability of that word, given everything that came so far.", 'start': 3127.692, 'duration': 4.061}, {'end': 3133.154, 'text': "but it's inverse, so it's one over that.", 'start': 3131.753, 'duration': 1.401}, {'end': 3141.299, 
'text': "And then lastly, we're normalizing this big, uh, product by the number of words, which is capital T.", 'start': 3133.914, 'duration': 7.385}, {'end': 3147.783, 'text': "And the reason why we're doing that is because if we didn't do that, then perplexity would just get smaller and smaller as your corpus got bigger.", 'start': 3141.299, 'duration': 6.484}, {'end': 3149.885, 'text': 'So we need to normalize by that factor.', 'start': 3148.524, 'duration': 1.361}, {'end': 3158.358, 'text': 'So, you can actually show that this, uh, perplexity is equal to the exponential of the cross entropy loss J Theta.', 'start': 3152.236, 'duration': 6.122}, {'end': 3163.98, 'text': "So, if you remember, cross entropy loss J Theta is, uh, the training objective that we're using to train the language model.", 'start': 3158.538, 'duration': 5.442}], 'summary': 'Formula calculates word probability, normalized by corpus size.', 'duration': 45.192, 'max_score': 3118.788, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U3118788.jpg'}, {'end': 3240.063, 'src': 'embed', 'start': 3211.223, 'weight': 1, 'content': [{'end': 3216.027, 'text': 'Uh, so RNNs have been pretty successful in recent years in improving perplexity.', 'start': 3211.223, 'duration': 4.804}, {'end': 3223.372, 'text': 'So, uh, this is a results table from a recent, um, Facebook research paper about RNN language models.', 'start': 3216.427, 'duration': 6.945}, {'end': 3226.674, 'text': "And uh, you don't have to understand all of the details of this table,", 'start': 3223.912, 'duration': 2.762}, {'end': 3235.44, 'text': "but what it's telling you is that on the uh top row we have an N-gram language model and then in the subsequent rows we have some increasingly complex and large RNNs.", 'start': 3226.674, 'duration': 8.766}, {'end': 3240.063, 'text': 'And you can see that the perplexity numbers are decreasing because, uh, lower is better.', 'start': 3235.9, 'duration': 4.163}], 'summary': 'Rnns improving perplexity, shown in facebook research paper results table.', 'duration': 28.84, 'max_score': 3211.223, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U3211223.jpg'}, {'end': 3310.82, 'src': 'embed', 'start': 3285.327, 'weight': 0, 'content': [{'end': 3291.229, 'text': 'You have to understand grammar, you have to understand syntax and you have to understand, uh, logic and reasoning,', 'start': 3285.327, 'duration': 5.902}, {'end': 3293.609, 'text': 'and you have to understand something about you know real-world knowledge.', 'start': 3291.229, 'duration': 2.38}, {'end': 3297.571, 'text': 'You have to understand a lot of things in order to be able to do language modeling properly.', 'start': 3293.95, 'duration': 3.621}, {'end': 3305.216, 'text': "So the reason why we care about it as a benchmark task is because if you're able to build a model which is a better language model than the ones that came before it,", 'start': 3298.251, 'duration': 6.965}, {'end': 3310.82, 'text': 'then you must have made some kind of progress on at least some of those sub-components of natural language understanding.', 'start': 3305.216, 'duration': 5.604}], 'summary': 'To build a better language model, one must understand grammar, syntax, logic, reasoning, and real-world knowledge.', 'duration': 25.493, 'max_score': 3285.327, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U3285327.jpg'}, {'end': 
3339.013, 'src': 'embed', 'start': 3312.381, 'weight': 2, 'content': [{'end': 3318.945, 'text': "So another more tangible reason why you might care about language modeling is that it's a sub-component of many, many NLP tasks, uh,", 'start': 3312.381, 'duration': 6.564}, {'end': 3322.928, 'text': 'especially those which involve generating text or estimating the probability of text.', 'start': 3318.945, 'duration': 3.983}, {'end': 3325.106, 'text': "So, here's a bunch of examples.", 'start': 3324.065, 'duration': 1.041}, {'end': 3327.167, 'text': 'Uh, one is predictive typing.', 'start': 3325.966, 'duration': 1.201}, {'end': 3331.329, 'text': "That's the example that we showed at the beginning of the lecture with typing on your phone or searching on Google.", 'start': 3327.227, 'duration': 4.102}, {'end': 3335.211, 'text': 'Uh, this is also very useful for people who have um movement disabilities.', 'start': 3331.869, 'duration': 3.342}, {'end': 3339.013, 'text': 'uh, because there are these systems that help people communicate, uh, using fewer movements.', 'start': 3335.211, 'duration': 3.802}], 'summary': 'Language modeling is essential for nlp tasks like predictive typing and aiding people with movement disabilities.', 'duration': 26.632, 'max_score': 3312.381, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U3312381.jpg'}], 'start': 2740.826, 'title': 'Rnn language models and language modeling', 'summary': "Discusses the challenges of rnn language models in generating coherent output and emphasizes the application of language modeling, the importance of language modeling as a benchmark task, and the role of rnns in improving language models' perplexity across various nlp tasks.", 'chapters': [{'end': 2888.44, 'start': 2740.826, 'title': 'Rnn language models', 'summary': 'Discusses the challenges of rnn language models in generating coherent output, such as nonsensical recipe sentences and bizarre paint color names, while highlighting the need for skepticism when evaluating humorous examples.', 'duration': 147.614, 'highlights': ['The RNN language model struggles with generating coherent output, as seen in its nonsensical recipe sentences and bizarre paint color names.', 'The language model is trained on paint color names and utilizes a character level approach, resulting in the creation of new, often bizarre, color names.', "It's important to approach examples of language model output with skepticism, as many of the showcased examples are hand-selected by humans for humor and may have been edited."]}, {'end': 3488.757, 'start': 2889.66, 'title': 'Language modeling and rnns', 'summary': "Discusses the application of language modeling, the importance of language modeling as a benchmark task, and the role of rnns in improving language models' perplexity, with a particular emphasis on various nlp tasks, such as predictive typing, speech recognition, handwriting recognition, authorship identification, machine translation, summarization, and dialogue generation.", 'duration': 599.097, 'highlights': ["RNNs have been successful in improving perplexity in language models, as demonstrated in a recent research paper from Facebook, showcasing the decreasing perplexity numbers with increasingly complex and large RNNs. 
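Since perplexity keeps coming up in these results, here is the definition described a little earlier, restated in one line; P_LM is the model's predicted probability and T is the number of words in the corpus, and lower is better.

\text{perplexity} = \prod_{t=1}^{T} \left( \frac{1}{P_{\mathrm{LM}}\!\left(x^{(t+1)} \mid x^{(t)}, \ldots, x^{(1)}\right)} \right)^{1/T} = \exp\!\big(J(\theta)\big)

The normalization by the number of words is the 1/T exponent, which is why perplexity does not simply shrink as the corpus gets bigger.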
Recent research paper from Facebook demonstrates RNNs' success in improving perplexity in language models, with decreasing perplexity numbers for increasingly complex and large RNNs.", 'Language modeling serves as a benchmark task for measuring progress in understanding language, requiring understanding of grammar, syntax, logic, reasoning, and real-world knowledge. Language modeling is a benchmark task for measuring progress in understanding language, necessitating comprehension of grammar, syntax, logic, reasoning, and real-world knowledge.', 'Language modeling is a sub-component of many NLP tasks, including predictive typing, speech recognition, handwriting recognition, spelling and grammar correction, authorship identification, machine translation, summarization, and dialogue generation. Language modeling is a sub-component of various NLP tasks such as predictive typing, speech recognition, handwriting recognition, spelling and grammar correction, authorship identification, machine translation, summarization, and dialogue generation.']}], 'duration': 747.931, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U2740826.jpg', 'highlights': ['Language modeling is a benchmark task for measuring progress in understanding language, necessitating comprehension of grammar, syntax, logic, reasoning, and real-world knowledge.', 'RNNs have been successful in improving perplexity in language models, as demonstrated in a recent research paper from Facebook, showcasing the decreasing perplexity numbers with increasingly complex and large RNNs.', 'Language modeling is a sub-component of various NLP tasks such as predictive typing, speech recognition, handwriting recognition, spelling and grammar correction, authorship identification, machine translation, summarization, and dialogue generation.', 'The RNN language model struggles with generating coherent output, as seen in its nonsensical recipe sentences and bizarre paint color names.', 'The language model is trained on paint color names and utilizes a character level approach, resulting in the creation of new, often bizarre, color names.', "It's important to approach examples of language model output with skepticism, as many of the showcased examples are hand-selected by humans for humor and may have been edited."]}, {'end': 4098.13, 'segs': [{'end': 3582.045, 'src': 'embed', 'start': 3552.526, 'weight': 0, 'content': [{'end': 3557.628, 'text': 'but actually it turns out that you can use RNNs for um a lot of other different things that are not language modeling.', 'start': 3552.526, 'duration': 5.102}, {'end': 3559.689, 'text': "So, here's a few examples of that.", 'start': 3558.608, 'duration': 1.081}, {'end': 3563.771, 'text': 'uh, you can use an RNN to do a tagging task.', 'start': 3561.169, 'duration': 2.602}, {'end': 3568.895, 'text': 'So, some examples of tagging tasks are part of speech tagging and named entity recognition.', 'start': 3564.352, 'duration': 4.543}, {'end': 3571.337, 'text': 'So, pictured here is part of speech tagging.', 'start': 3569.595, 'duration': 1.742}, {'end': 3575.3, 'text': 'And this is the task where you have some kind of input text such as uh,', 'start': 3571.777, 'duration': 3.523}, {'end': 3582.045, 'text': 'the startled cat knocked over the vase and your job is to um label or tag each word with its parts of speech.', 'start': 3575.3, 'duration': 6.745}], 'summary': 'Rnns can be used for various tasks like tagging; e.g. 
part of speech tagging and named entity recognition.', 'duration': 29.519, 'max_score': 3552.526, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U3552526.jpg'}, {'end': 3657.162, 'src': 'embed', 'start': 3628.008, 'weight': 1, 'content': [{'end': 3634.19, 'text': "Uh, sentiment classification is when you have some kind of input text such as, let's say, overall I enjoyed the movie a lot,", 'start': 3628.008, 'duration': 6.182}, {'end': 3637.892, 'text': "and then you're trying to classify that as being positive or negative or neutral sentiment.", 'start': 3634.19, 'duration': 3.702}, {'end': 3639.953, 'text': 'So, in this example, this is positive sentiment.', 'start': 3638.292, 'duration': 1.661}, {'end': 3647.74, 'text': 'So one way you might use an RNN to tackle this task is, uh, you might encode the text using the RNN.', 'start': 3641.178, 'duration': 6.562}, {'end': 3657.162, 'text': 'And then really what you want is some kind of sentence encoding so that you can output your label for the sentence right?', 'start': 3649.16, 'duration': 8.002}], 'summary': 'Sentiment classification involves classifying input text as positive, negative, or neutral sentiment, and rnn can be used to encode the text for this task.', 'duration': 29.154, 'max_score': 3628.008, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U3628008.jpg'}, {'end': 3748.31, 'src': 'embed', 'start': 3702.829, 'weight': 2, 'content': [{'end': 3709.711, 'text': 'is to do something like maybe take an element-wise max or an element-wise mean of all these hidden states to get your sentence encoding.', 'start': 3702.829, 'duration': 6.882}, {'end': 3714.553, 'text': 'Um, and, uh, this tends to work better than just using the final hidden state.', 'start': 3710.492, 'duration': 4.061}, {'end': 3717.334, 'text': 'Uh, there are some other more advanced things you can do as well.', 'start': 3714.573, 'duration': 2.761}, {'end': 3724.678, 'text': 'Okay Another thing that you can use RNNs for is as a general purpose encoder module.', 'start': 3720.356, 'duration': 4.322}, {'end': 3727.18, 'text': "Uh so here's an example.", 'start': 3725.779, 'duration': 1.401}, {'end': 3728.62, 'text': "that's question answering.", 'start': 3727.18, 'duration': 1.44}, {'end': 3729.141, 'text': 'but really,', 'start': 3728.62, 'duration': 0.521}, {'end': 3737.385, 'text': 'this idea of RNNs as a general purpose encoder module is very common and you use it in a lot of different um deep learning architectures for NLP.', 'start': 3729.141, 'duration': 8.244}, {'end': 3740.807, 'text': "So here's an example which is question answering.", 'start': 3739.266, 'duration': 1.541}, {'end': 3748.31, 'text': "Uh, so let's suppose that the, the task is you've got some kind of context which in this, uh, situation is the Wikipedia article on Beethoven.", 'start': 3741.327, 'duration': 6.983}], 'summary': 'Using rnns for general purpose encoder module; better than using final hidden state.', 'duration': 45.481, 'max_score': 3702.829, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U3702829.jpg'}, {'end': 3840.002, 'src': 'embed', 'start': 3816.507, 'weight': 4, 'content': [{'end': 3823.29, 'text': 'So you could have um taken uh element wise, max or mean, like we showed in the previous slide, to get a single vector for the question,', 'start': 3816.507, 'duration': 6.783}, {'end': 3824.191, 'text': "but 
often you don't do that.', 'start': 3823.29, 'duration': 0.901}, {'end': 3827.392, 'text': "Often you'll, uh, do something else which uses the hidden states directly.", 'start': 3824.231, 'duration': 3.161}, {'end': 3836.597, 'text': 'So the general point here is that RNNs are quite powerful as a way to represent, uh, a sequence of text, uh, for further computation.', 'start': 3828.933, 'duration': 7.664}, {'end': 3840.002, 'text': 'Okay. Last example.', 'start': 3839.161, 'duration': 0.841}], 'summary': 'RNNs represent sequences of text for further computation.', 'duration': 23.495, 'max_score': 3816.507, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U3816507.jpg'}, {'end': 3922.879, 'src': 'embed', 'start': 3878.066, 'weight': 5, 'content': [{'end': 3881.867, 'text': "And in this case, the utterance might be something like, what's the weather? Question mark.", 'start': 3878.066, 'duration': 3.801}, {'end': 3898.421, 'text': 'Yeah. Okay.', 'start': 3884.769, 'duration': 13.652}, {'end': 3904.603, 'text': 'So the question is: in speech recognition, we often use word error rate to evaluate, but would you use perplexity to evaluate?', 'start': 3898.441, 'duration': 6.162}, {'end': 3906.764, 'text': "Um, I don't actually know much about that.", 'start': 3905.663, 'duration': 1.101}, {'end': 3922.879, 'text': 'Do you know, Chris, what they use in, uh, speech recognition as an evaluation metric? So, I mean, Right.', 'start': 3906.784, 'duration': 16.095}], 'summary': 'Discussion on whether perplexity is used to evaluate speech recognition, with uncertainty about its usage.', 'duration': 44.813, 'max_score': 3878.066, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U3878066.jpg'}, {'end': 4018.636, 'src': 'embed', 'start': 3962.33, 'weight': 6, 'content': [{'end': 3966.152, 'text': 'Machine translation is also an example of a conditional language model,', 'start': 3962.33, 'duration': 3.822}, {'end': 3969.474, 'text': "and we're going to see that in much more detail in the lecture next week on machine translation.", 'start': 3966.152, 'duration': 3.322}, {'end': 3971.635, 'text': 'All right.', 'start': 3971.355, 'duration': 0.28}, {'end': 3974.197, 'text': 'Are there any more questions?
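As a sketch of the sentence-encoding recipe mentioned above, taking an element-wise max or mean over all the hidden states rather than only the final one, here is an illustrative PyTorch version. The framework, names, and sizes are assumptions for the example, e.g. a three-way sentiment label.

import torch
import torch.nn as nn

class RNNSentenceClassifier(nn.Module):
    """Encode a sentence with an RNN, pool the hidden states into one vector, then classify."""
    def __init__(self, vocab_size, num_classes, embed_dim=100, hidden_dim=256, pooling="mean"):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)
        self.pooling = pooling

    def forward(self, tokens):                              # tokens: (batch, seq_len)
        hidden_states, _ = self.rnn(self.embed(tokens))     # (batch, seq_len, hidden_dim)
        if self.pooling == "max":                           # element-wise max over time
            sentence_vec = hidden_states.max(dim=1).values
        elif self.pooling == "mean":                        # element-wise mean over time
            sentence_vec = hidden_states.mean(dim=1)
        else:                                               # fall back to just the final hidden state
            sentence_vec = hidden_states[:, -1, :]
        return self.classifier(sentence_vec)                # logits over, say, {positive, negative, neutral}

# Toy usage: a batch of 8 sentences of length 12, 3 sentiment classes.
model = RNNSentenceClassifier(vocab_size=10_000, num_classes=3)
logits = model(torch.randint(0, 10_000, (8, 12)))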
We have a bit of extra time, I think.', 'start': 3971.656, 'duration': 2.541}, {'end': 3979.28, 'text': 'Yeah I have a question about RNN.', 'start': 3977.199, 'duration': 2.081}, {'end': 3999.01, 'text': 'network you want to run them through, uh, five recurring layers.', 'start': 3995.009, 'duration': 4.001}, {'end': 4004.412, 'text': 'Do people mix and match like that, or are these, uh sort of specialized, so they really only use those quasi-solids?', 'start': 3999.35, 'duration': 5.062}, {'end': 4008.573, 'text': 'Uh, the question is do you ever combine RNNs with other types of architectures?', 'start': 4005.412, 'duration': 3.161}, {'end': 4009.873, 'text': 'So I think the answer is yes.', 'start': 4008.693, 'duration': 1.18}, {'end': 4011.454, 'text': 'Um, you might, you know.', 'start': 4010.113, 'duration': 1.341}, {'end': 4018.636, 'text': 'uh, have, you might have other types of architectures, uh, to produce the vectors that are going to be the input to your RNN,', 'start': 4011.454, 'duration': 7.182}], 'summary': 'Discussion on combining rnns with other architectures for input vectors.', 'duration': 56.306, 'max_score': 3962.33, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U3962330.jpg'}, {'end': 4074.052, 'src': 'embed', 'start': 4047.305, 'weight': 8, 'content': [{'end': 4054.469, 'text': 'So the reason why those are called vanilla RNNs is because there are actually other more complex kinds of RNN flavors.', 'start': 4047.305, 'duration': 7.164}, {'end': 4059.972, 'text': "So for example, there's GRU and LSTM, and we're gonna learn about both of those next week.", 'start': 4054.889, 'duration': 5.083}, {'end': 4065.155, 'text': "And another thing we're going to learn about next week is that you can actually get multilayer RNNs,", 'start': 4060.693, 'duration': 4.462}, {'end': 4068.137, 'text': 'which is when you stack multiple RNNs on top of each other.', 'start': 4065.155, 'duration': 2.982}, {'end': 4074.052, 'text': "So you're gonna learn about those, but we hope that by the time you reach the end of this course,", 'start': 4069.289, 'duration': 4.763}], 'summary': 'Vanilla rnns are just one type of rnn, with more complex flavors like gru and lstm, and multilayer rnns that stack multiple rnns on top of each other to be covered next week.', 'duration': 26.747, 'max_score': 4047.305, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U4047305.jpg'}], 'start': 3488.777, 'title': 'Rnns in nlp', 'summary': 'Explores the application of recurrent neural networks (rnn) and language models in nlp tasks, including tagging and sentiment classification, as well as their use as general purpose encoders in question answering and speech recognition. it also covers speech recognition evaluations using word error rates and perplexity, conditional language models, and the combination of rnns with other architectures.', 'chapters': [{'end': 3727.18, 'start': 3488.777, 'title': 'Rnns and language models', 'summary': 'Explains the relationship between recurrent neural networks (rnn) and language models, their applications in tasks like tagging and sentiment classification, and different methods for sentence encoding.', 'duration': 238.403, 'highlights': ['Recurrent neural networks (RNN) are used to build language models and can be used for tasks like part of speech tagging and named entity recognition. 
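A minimal sketch of the tagging setup follows: the same recurrent machinery, but a tag is predicted from the hidden state at every position. PyTorch and all names and sizes here are illustrative; swapping nn.RNN for nn.GRU or nn.LSTM, or setting num_layers greater than 1, gives the fancier and multilayer variants mentioned at the end of the lecture.

import torch
import torch.nn as nn

class RNNTagger(nn.Module):
    """Predict a tag (e.g. part of speech or named-entity label) for every word in the input."""
    def __init__(self, vocab_size, num_tags, embed_dim=100, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # nn.GRU / nn.LSTM are drop-in alternatives; num_layers > 1 stacks RNNs on top of each other.
        self.rnn = nn.RNN(embed_dim, hidden_dim, num_layers=1, batch_first=True)
        self.tag_out = nn.Linear(hidden_dim, num_tags)

    def forward(self, tokens):                              # tokens: (batch, seq_len)
        hidden_states, _ = self.rnn(self.embed(tokens))     # one hidden state per word
        return self.tag_out(hidden_states)                  # (batch, seq_len, num_tags): a tag distribution per word

tagger = RNNTagger(vocab_size=10_000, num_tags=17)          # e.g. the 17 universal POS tags
tag_logits = tagger(torch.randint(0, 10_000, (4, 9)))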
RNN used for language models, part of speech tagging, named entity recognition.', 'RNNs can be utilized for sentence classification tasks such as sentiment classification, where the text is encoded using RNN and a single vector is obtained to represent the sentiment. RNN used for sentiment classification, text encoding, obtaining single vector for sentiment representation.', 'The final hidden state of RNN can be used as a sentence encoding, but it is usually more effective to use methods like element-wise max or mean of all hidden states. Comparison of using final hidden state vs. element-wise max or mean for sentence encoding.']}, {'end': 3877.725, 'start': 3727.18, 'title': 'Rnn as a general purpose encoder', 'summary': 'Discusses the use of rnns as a general purpose encoder module in deep learning architectures for nlp, with examples including question answering and speech recognition.', 'duration': 150.545, 'highlights': ['RNNs are commonly used as a general purpose encoder module in various deep learning architectures for NLP, such as question answering and speech recognition.', 'The example of question answering involves using an RNN to process the question and then using the hidden states from the RNN as a representation of the question, which is a part of the default final project.', 'RNNs serve as a powerful way to represent a sequence of text for further computation, such as generating text in RNN language models for applications like speech recognition.']}, {'end': 4098.13, 'start': 3878.066, 'title': 'Speech recognition and conditional language models', 'summary': "Discusses speech recognition evaluations using word error rates and perplexity, conditional language models, and combining rnns with other types of architectures, while also highlighting the terminology of 'vanilla rnn' and the complexity of rnn flavors and multilayer rnns.", 'duration': 220.064, 'highlights': ['The chapter discusses speech recognition evaluations using word error rates and perplexity, where both are used to evaluate speech recognition. Perplexity is mentioned as an alternative evaluation method.', 'Conditional language models are explained with the example of machine translation, which will be discussed in detail in the next lecture on machine translation.', 'The possibility of combining RNNs with other types of architectures is addressed, explaining that it is feasible to use other architectures to produce input vectors for RNNs or to use the output of RNNs as input for a different type of neural network.', "The terminology of 'vanilla RNN' is introduced, indicating that it refers to the RNNs described in the lecture, and the complexity of RNN flavors and multilayer RNNs is mentioned, with the promise of learning about GRU, LSTM, and stacked bidirectional LSTM with residual connections and self-attention in the next week's lecture."]}], 'duration': 609.353, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/iWea12EAu6U/pics/iWea12EAu6U3488777.jpg', 'highlights': ['RNN used for language models, part of speech tagging, named entity recognition.', 'RNN used for sentiment classification, text encoding, obtaining single vector for sentiment representation.', 'Comparison of using final hidden state vs. 
element-wise max or mean for sentence encoding.', 'RNNs are commonly used as a general purpose encoder module in various deep learning architectures for NLP.', 'RNNs serve as a powerful way to represent a sequence of text for further computation.', 'Speech recognition evaluations using word error rates and perplexity.', 'Conditional language models explained with the example of machine translation.', 'Combining RNNs with other types of architectures is feasible.', "Introduction of 'vanilla RNN' and complexity of RNN flavors and multilayer RNNs."]}], 'highlights': ['RNNs feature a sequence of hidden states, computed based on the previous hidden state and the input at each step, allowing them to handle variable-length inputs.', 'The RNN language model computes the output distribution for every step, predicting the probability of the next words at each step.', 'The lecture covers language models and recurrent neural networks (RNNs), highlighting language modeling basics, issues with n-gram language models, challenges, word embedding, RNN training, RNN language models, and their application in NLP tasks such as tagging, sentiment classification, question answering, and speech recognition.', 'The concept of neural language model is introduced as a potential solution for better language modeling, by taking inputs as a sequence of words and outputting a probability distribution of the next word.', 'Language modeling is a benchmark task for measuring progress in understanding language, necessitating comprehension of grammar, syntax, logic, reasoning, and real-world knowledge.']}
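Finally, here is a sketch of the generation procedure summarized in the highlights: sample a word from the output distribution at the current step and feed that sample back in as the next input. PyTorch again; the TinyRNNLM module, the start token, and the temperature knob are purely illustrative, and an untrained model will of course generate gibberish.

import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def sample_text(model, start_id, num_steps, temperature=1.0):
    """Repeatedly sample the next word from the model's predicted distribution
    and use the sampled word as the input for the following step."""
    tokens = torch.tensor([[start_id]])                     # (batch=1, seq_len=1)
    for _ in range(num_steps):
        # For simplicity the whole prefix is re-run each step; a real implementation
        # would carry the hidden state forward instead.
        logits = model(tokens)[:, -1, :]                    # distribution over the next word only
        probs = F.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)   # sample rather than argmax, for varied text
        tokens = torch.cat([tokens, next_id], dim=1)        # the sample becomes the next input
    return tokens.squeeze(0).tolist()

# Illustrative model with the same (batch, seq_len) -> (batch, seq_len, vocab) interface
# as the training sketch earlier in this section.
class TinyRNNLM(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)
    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.out(h)

generated_ids = sample_text(TinyRNNLM(), start_id=0, num_steps=20)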