title
Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 8 – Translation, Seq2Seq, Attention
description
For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/3Cbvt8s
Professor Christopher Manning & PhD Candidate Abigail See, Stanford University
http://onlinehub.stanford.edu/
Professor Christopher Manning
Thomas M. Siebel Professor in Machine Learning, Professor of Linguistics and of Computer Science
Director, Stanford Artificial Intelligence Laboratory (SAIL)
To follow along with the course schedule and syllabus, visit: http://web.stanford.edu/class/cs224n/index.html#schedule
0:00 Introduction
1:07 Overview
2:46 1950s: Early Machine Translation
5:02 1990s-2010s: Statistical Machine Translation
8:51 What is alignment?
9:23 Alignment is complex
11:27 Learning alignment for SMT
12:36 Decoding for SMT
17:28 What is Neural Machine Translation?
20:54 Sequence-to-sequence is versatile!
27:56 Training a Neural Machine Translation system
32:05 Exhaustive search decoding
35:14 Beam search decoding: example
37:50 Beam search decoding: stopping criterion
39:25 Beam search decoding: finishing up
44:41 Disadvantages of NMT?
46:52 How do we evaluate Machine Translation?
50:32 MT progress over time
51:34 NMT: the biggest success story of NLP Deep Learning
53:31 So is Machine Translation solved?
detail
{'title': 'Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 8 – Translation, Seq2Seq, Attention', 'heatmap': [{'end': 3925.224, 'start': 3734.88, 'weight': 1}], 'summary': 'The lecture covers machine translation, including history, statistical and neural machine translation, advantages of nmt, advancements, and challenges. it also explains attention in deep learning and its role in improving sequence-to-sequence models for machine translation.', 'chapters': [{'end': 45.163, 'segs': [{'end': 58.095, 'src': 'embed', 'start': 26.923, 'weight': 0, 'content': [{'end': 30.466, 'text': "Uh, if you missed it, don't get up now, it's fine.", 'start': 26.923, 'duration': 3.543}, {'end': 32.567, 'text': "There'll be time to sign in after the lecture.", 'start': 30.506, 'duration': 2.061}, {'end': 37.151, 'text': 'Uh, and then, if you have any kind of questions about special cases with the attendance policy, uh,', 'start': 32.587, 'duration': 4.564}, {'end': 40.834, 'text': 'you should check out the Piazza post that we put up last night with some clarifications.', 'start': 37.151, 'duration': 3.683}, {'end': 45.163, 'text': 'Uh, the other reminder is assignment four content is going to be covered today.', 'start': 41.92, 'duration': 3.243}, {'end': 47.545, 'text': "So you're gonna have everything you need to do assignment four at the end of today.", 'start': 45.183, 'duration': 2.362}, {'end': 51.089, 'text': 'And, uh, do get started early because the model takes four hours to train.', 'start': 48.166, 'duration': 2.923}, {'end': 58.095, 'text': "The other announcement is that we're going to be sending out our mid-quarter feedback survey sometime in the next few days probably.", 'start': 52.55, 'duration': 5.545}], 'summary': 'Lecture covers assignment four, model takes 4 hrs to train, mid-quarter feedback survey coming soon', 'duration': 31.172, 'max_score': 26.923, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c26923.jpg'}], 'start': 6.078, 'title': 'Machine translation lecture', 'summary': 'Covers announcements on attendance, signing in with tas, and the coverage of assignment four content for the machine translation lecture.', 'chapters': [{'end': 45.163, 'start': 6.078, 'title': 'Machine translation lecture', 'summary': 'Covers announcements on attendance, signing in with tas, and the coverage of assignment four content.', 'duration': 39.085, 'highlights': ['The chapter emphasizes the importance of signing in with TAs for attendance.', 'The lecture covers the content of assignment four.', 'Students are reminded to check the Piazza post for clarifications on the attendance policy.']}], 'duration': 39.085, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c6078.jpg', 'highlights': ['The lecture covers the content of assignment four.', 'The chapter emphasizes the importance of signing in with TAs for attendance.', 'Students are reminded to check the Piazza post for clarifications on the attendance policy.']}, {'end': 483.877, 'segs': [{'end': 110.057, 'src': 'embed', 'start': 80.519, 'weight': 0, 'content': [{'end': 84.822, 'text': 'And the connection here is that machine translation is a major use case of sequence-to-sequence.', 'start': 80.519, 'duration': 4.303}, {'end': 89.105, 'text': "After that, we're going to introduce a new neural technique called attention.", 'start': 86.123, 'duration': 2.982}, {'end': 92.368, 'text': 'And this is something that improves sequence-to-sequence a lot.', 'start': 89.926, 'duration': 2.442}, {'end': 100.754, 'text': 'Okay So, uh, section one of this is gonna be about, uh, a bit of machine translation history, pre-neural machine translation.', 'start': 94.352, 'duration': 6.402}, {'end': 110.057, 'text': 'So machine translation or MT, uh, is the task of translating a sentence X, uh, which we call the source language,', 'start': 103.415, 'duration': 6.642}], 'summary': 'Machine translation and sequence-to-sequence connection, introducing attention technique to improve translation.', 'duration': 29.538, 'max_score': 80.519, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c80519.jpg'}, {'end': 192.769, 'src': 'embed', 'start': 164.952, 'weight': 2, 'content': [{'end': 169.674, 'text': 'So, the beginning of machine translation as an AI task began in the early 1950s.', 'start': 164.952, 'duration': 4.722}, {'end': 174.896, 'text': 'So, um, in particular, there was lots of work translating Russian to English.', 'start': 170.735, 'duration': 4.161}, {'end': 179.278, 'text': 'uh, because the West was very interested in listening to what the Russians were saying during the Cold War.', 'start': 174.896, 'duration': 4.382}, {'end': 187.948, 'text': "And, uh, we've got a fun video here which shows the state of machine translation in 1954.", 'start': 180.239, 'duration': 7.709}, {'end': 192.769, 'text': "They hadn't reckoned with ambiguity when they set out to use computers to translate languages.", 'start': 187.948, 'duration': 4.821}], 'summary': 'Machine translation began in 1950s, focused on russian to english due to cold war interest. video shows state of translation in 1954.', 'duration': 27.817, 'max_score': 164.952, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c164952.jpg'}, {'end': 313.599, 'src': 'embed', 'start': 283.388, 'weight': 4, 'content': [{'end': 286.329, 'text': 'and they were essentially mostly just looking up the Russian words.', 'start': 283.388, 'duration': 2.941}, {'end': 287.789, 'text': 'uh, looking up their English counterparts.', 'start': 286.329, 'duration': 1.46}, {'end': 291.15, 'text': 'And they were storing these big bilingual dictionaries on these large magnetic tapes.', 'start': 287.929, 'duration': 3.221}, {'end': 296.252, 'text': 'Um, so certainly it was a huge technical feat at the time, uh, but they, uh,', 'start': 291.83, 'duration': 4.422}, {'end': 299.333, 'text': 'some people were probably too optimistic about how quickly it would replace humans.', 'start': 296.252, 'duration': 3.081}, {'end': 305.855, 'text': 'So jumping forward several decades in time, uh, now I want to tell you about statistical machine translation.', 'start': 300.632, 'duration': 5.223}, {'end': 313.599, 'text': "So the core idea of statistical machine translation is that you're going to learn a probabilistic model from the data in order to do the translation.", 'start': 306.715, 'duration': 6.884}], 'summary': 'Early machine translation used big bilingual dictionaries stored on magnetic tapes. statistical machine translation learns a probabilistic model from data.', 'duration': 30.211, 'max_score': 283.388, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c283388.jpg'}, {'end': 473.307, 'src': 'embed', 'start': 443.981, 'weight': 1, 'content': [{'end': 448.443, 'text': "So, I'm gonna tell you more about how we would learn this translation model that needs to be learned from parallel data.", 'start': 443.981, 'duration': 4.462}, {'end': 456.114, 'text': 'So, we need a large amount of parallel data in order to learn this translation model.', 'start': 451.431, 'duration': 4.683}, {'end': 460.777, 'text': 'And an early example of a parallel corpus is the Rosetta Stone.', 'start': 456.935, 'duration': 3.842}, {'end': 465.601, 'text': 'So, this is a stone that has the same text written in three different languages.', 'start': 461.618, 'duration': 3.983}, {'end': 473.307, 'text': 'And, uh, this is a hugely important artifact for, um, the- the people who were trying to understand ancient Egyptian.', 'start': 466.421, 'duration': 6.886}], 'summary': 'Translation model learned from parallel data, requiring large amounts of parallel data. an example is the rosetta stone, with text in three languages.', 'duration': 29.326, 'max_score': 443.981, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c443981.jpg'}], 'start': 45.183, 'title': 'Machine translation history', 'summary': 'Covers the history of machine translation, including the early efforts during the cold war era, the overestimation of ai capabilities in 1954, and the evolution towards statistical machine translation, emphasizing the use of parallel data for learning translation models.', 'chapters': [{'end': 243.8, 'start': 45.183, 'title': 'Nlp: machine translation and sequence-to-sequence', 'summary': 'Covers the introduction of machine translation and sequence-to-sequence models, including the history of machine translation, the ambiguity in translations, and the early efforts in machine translation during the cold war era.', 'duration': 198.617, 'highlights': ['Machine translation is a major use case of sequence-to-sequence, and attention improves sequence-to-sequence significantly.', "Early efforts in machine translation during the Cold War era focused on translating Russian to English, driven by the need to monitor the Russians' activities.", 'The initial attempts at machine translation in the 1950s were ambitious but faced challenges with ambiguities and limitations in speed.', 'The chapter introduces a new task in NLP, machine translation, and a new neural architecture called sequence-to-sequence.']}, {'end': 483.877, 'start': 246.317, 'title': 'Machine translation evolution', 'summary': 'Discusses the historical context of machine translation, highlighting the overestimation of ai capabilities in 1954, and the evolution towards statistical machine translation, emphasizing the use of parallel data for learning translation models.', 'duration': 237.56, 'highlights': ['The early machine translation systems were mostly rule-based and relied on bilingual dictionaries between Russian and English, stored on large magnetic tapes.', 'Statistical machine translation involves learning a probabilistic model from data to find the best translation, breaking down the probability into translation and language models.', 'The importance of parallel data in learning translation models is illustrated with the example of the Rosetta Stone, an early parallel corpus that aided in understanding ancient Egyptian.']}], 'duration': 438.694, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c45183.jpg', 'highlights': ['Machine translation is a major use case of sequence-to-sequence, and attention improves sequence-to-sequence significantly.', 'The importance of parallel data in learning translation models is illustrated with the example of the Rosetta Stone, an early parallel corpus that aided in understanding ancient Egyptian.', 'The initial attempts at machine translation in the 1950s were ambitious but faced challenges with ambiguities and limitations in speed.', "Early efforts in machine translation during the Cold War era focused on translating Russian to English, driven by the need to monitor the Russians' activities.", 'The early machine translation systems were mostly rule-based and relied on bilingual dictionaries between Russian and English, stored on large magnetic tapes.']}, {'end': 1024.093, 'segs': [{'end': 526.629, 'src': 'embed', 'start': 484.778, 'weight': 2, 'content': [{'end': 487.32, 'text': 'So, this is a- this is a really important parallel corpus.', 'start': 484.778, 'duration': 2.542}, {'end': 491.384, 'text': "And, uh, if you're ever in London, you can go to the British Museum and see this in person.", 'start': 487.84, 'duration': 3.544}, {'end': 495.123, 'text': 'So, the idea is that you get your parallel data.', 'start': 493.041, 'duration': 2.082}, {'end': 499.126, 'text': "Obviously, you need a larger amount than is on the stone and hopefully it shouldn't be written on a stone either.", 'start': 495.283, 'duration': 3.843}, {'end': 504.111, 'text': 'Uh, but you can use this to learn your statistical machine translation model.', 'start': 499.146, 'duration': 4.965}, {'end': 510.955, 'text': "So the idea is that you're trying to learn this conditional probability distribution of X given Y.", 'start': 505.97, 'duration': 4.985}, {'end': 513.116, 'text': 'So what we do is we actually break this down even further.', 'start': 510.955, 'duration': 2.161}, {'end': 519.241, 'text': 'We actually want to consider the probability of X and A given Y, where A is the alignment.', 'start': 513.797, 'duration': 5.444}, {'end': 526.629, 'text': 'So the idea of alignment is this is, uh, how the words in the English sentence and the French sentence correspond to each other.', 'start': 519.982, 'duration': 6.647}], 'summary': 'Parallel corpus at british museum for statistical machine translation model training.', 'duration': 41.851, 'max_score': 484.778, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c484778.jpg'}, {'end': 603.993, 'src': 'embed', 'start': 573.101, 'weight': 4, 'content': [{'end': 574.882, 'text': 'So, this is what we call many to one alignment.', 'start': 573.101, 'duration': 1.781}, {'end': 578.343, 'text': 'Uh, it can go in the other direction too.', 'start': 577.042, 'duration': 1.301}, {'end': 579.723, 'text': 'Alignment can be one to many.', 'start': 578.483, 'duration': 1.24}, {'end': 584.585, 'text': 'So here we have a single English word implemented, which has a one to many alignment,', 'start': 580.283, 'duration': 4.302}, {'end': 587.706, 'text': "because there's a three word French fa- phrase that corresponds to it.", 'start': 584.585, 'duration': 3.121}, {'end': 592.309, 'text': 'So, on the left and the right, we have two ways of depicting the same alignment.', 'start': 588.908, 'duration': 3.401}, {'end': 596.23, 'text': "It's either, uh, a kind of, uh, a chart or it can be a, a graph.", 'start': 592.409, 'duration': 3.821}, {'end': 603.993, 'text': "So, here's another example, um, of a one-to-many, uh, sorry.", 'start': 599.071, 'duration': 4.922}], 'summary': 'Many-to-one and one-to-many alignments can be depicted as charts or graphs.', 'duration': 30.892, 'max_score': 573.101, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c573101.jpg'}, {'end': 837.018, 'src': 'embed', 'start': 807.679, 'weight': 1, 'content': [{'end': 812.76, 'text': "Uh, but along the way, you're going to discard hypotheses that are too low prob- uh, probability.", 'start': 807.679, 'duration': 5.081}, {'end': 817.062, 'text': "So you're going- you're gonna search, but you're going to discard and prune the tree as you go,", 'start': 813.34, 'duration': 3.722}, {'end': 820.443, 'text': "to make sure that you're not keeping too many hypotheses, uh, on each step.", 'start': 817.062, 'duration': 3.381}, {'end': 826.725, 'text': 'So, this process of finding your best sequence is also called decoding.', 'start': 823.184, 'duration': 3.541}, {'end': 830.374, 'text': "So, here's an overview of how that works for SMT.", 'start': 828.493, 'duration': 1.881}, {'end': 837.018, 'text': 'Uh, this is an example, uh, where you have this German sentence that translates to, he does not go home.', 'start': 831.335, 'duration': 5.683}], 'summary': 'Discarding low probability hypotheses during search to find best sequence in smt.', 'duration': 29.339, 'max_score': 807.679, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c807679.jpg'}, {'end': 942.465, 'src': 'embed', 'start': 912.682, 'weight': 0, 'content': [{'end': 913.563, 'text': 'Uh, was it effective?', 'start': 912.682, 'duration': 0.881}, {'end': 920.869, 'text': 'Uh so, SMT was a huge research field, uh, from the 1990s to about maybe uh, 2013..', 'start': 914.324, 'duration': 6.545}, {'end': 923.731, 'text': 'And, uh, the best systems during this time were extremely complex.', 'start': 920.869, 'duration': 2.862}, {'end': 926.173, 'text': 'They were extremely sophisticated and impressive systems.', 'start': 923.791, 'duration': 2.382}, {'end': 929.635, 'text': 'And, uh, SMT made the best machine translation systems in the world.', 'start': 926.433, 'duration': 3.202}, {'end': 931.737, 'text': 'But they were very complex.', 'start': 930.436, 'duration': 1.301}, {'end': 936.36, 'text': "So for example, you know, there were hundreds of important details that we haven't mentioned here at all.", 'start': 931.857, 'duration': 4.503}, {'end': 942.465, 'text': "There were many, many techniques to make it, uh, more complex and more, um, sophisticated than what I've described today.", 'start': 936.7, 'duration': 5.765}], 'summary': 'Smt was a major research field from 1990s to 2013, producing highly complex and sophisticated machine translation systems.', 'duration': 29.783, 'max_score': 912.682, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c912682.jpg'}], 'start': 484.778, 'title': 'Statistical machine translation', 'summary': "Covers statistical machine translation including alignment techniques and model overview, demonstrating different alignment types and discussing smt's effectiveness, complexity, and resource requirements.", 'chapters': [{'end': 641.648, 'start': 484.778, 'title': 'Statistical machine translation alignment', 'summary': 'Discusses the use of a parallel corpus to train a statistical machine translation model, focusing on learning the conditional probability distribution of x given y, and demonstrates different types of alignments such as one-to-one, many-to-one, and one-to-many, with examples and explanations.', 'duration': 156.87, 'highlights': ['The chapter emphasizes the use of a parallel corpus to train a statistical machine translation model, which requires a larger amount of data than what is available on a stone inscription.', 'It explains the concept of learning the conditional probability distribution of X given Y and further breaks it down to consider the probability of X and A given Y, where A represents alignment.', 'Demonstrates different types of alignments such as one-to-one, many-to-one, and one-to-many, with examples like a spurious word in French, and a fertile word with no single word equivalent in English.']}, {'end': 1024.093, 'start': 641.708, 'title': 'Statistical machine translation overview', 'summary': 'Discusses statistical machine translation (smt) overview, probability distribution of alignments, learning translation model, decoding process, effectiveness of smt, and its complexity and resource requirements.', 'duration': 382.385, 'highlights': ['SMT was a huge research field from the 1990s to about 2013, and the best systems during this time were extremely complex and made the best machine translation systems in the world.', 'The SMT systems were extremely sophisticated and impressive, requiring hundreds of important details and many separately designed sub-components, which led to a significant amount of human effort to maintain.', 'The process of finding the best sequence in SMT, also called decoding, involves using a heuristic search algorithm to search for the best translation Y and discarding hypotheses that are too low probability.']}], 'duration': 539.315, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c484778.jpg', 'highlights': ['SMT systems were extremely sophisticated and impressive, requiring hundreds of important details and many separately designed sub-components, leading to a significant amount of human effort to maintain.', 'The process of finding the best sequence in SMT, also called decoding, involves using a heuristic search algorithm to search for the best translation Y and discarding hypotheses that are too low probability.', 'The chapter emphasizes the use of a parallel corpus to train a statistical machine translation model, which requires a larger amount of data than what is available on a stone inscription.', 'SMT was a huge research field from the 1990s to about 2013, and the best systems during this time were extremely complex and made the best machine translation systems in the world.', 'Demonstrates different types of alignments such as one-to-one, many-to-one, and one-to-many, with examples like a spurious word in French, and a fertile word with no single word equivalent in English.', 'It explains the concept of learning the conditional probability distribution of X given Y and further breaks it down to consider the probability of X and A given Y, where A represents alignment.']}, {'end': 1819.042, 'segs': [{'end': 1263.105, 'src': 'embed', 'start': 1221.953, 'weight': 0, 'content': [{'end': 1228.497, 'text': 'Uh, but this thing with the, the pink dotted arrows where you feed the word back in, this is what you do to generate text at test time.', 'start': 1221.953, 'duration': 6.544}, {'end': 1230.539, 'text': 'Any questions on this?', 'start': 1229.958, 'duration': 0.581}, {'end': 1239.757, 'text': 'Uh oh, another thing I should note is that you need two separate sets of word embeddings.', 'start': 1234.97, 'duration': 4.787}, {'end': 1243.301, 'text': 'right?. You need word embeddings for French words and you need English word embeddings.', 'start': 1239.757, 'duration': 3.544}, {'end': 1245.985, 'text': "That's kind of two separate sets, two separate vocabularies.", 'start': 1243.362, 'duration': 2.623}, {'end': 1248.048, 'text': 'Um, yeah.', 'start': 1247.106, 'duration': 0.942}, {'end': 1254.537, 'text': 'Okay So, as a side note, uh, this architecture called sequence-to-sequence is actually pretty versatile.', 'start': 1249.773, 'duration': 4.764}, {'end': 1256.679, 'text': "It's not just a machine translation architecture.", 'start': 1254.637, 'duration': 2.042}, {'end': 1263.105, 'text': 'Uh, you can, uh, uh, phrase quite a few NLP tasks as sequence-to-sequence tasks.', 'start': 1256.699, 'duration': 6.406}], 'summary': 'Sequence-to-sequence architecture is versatile, used for nlp tasks, requires separate sets of word embeddings for different languages.', 'duration': 41.152, 'max_score': 1221.953, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c1221953.jpg'}, {'end': 1367.487, 'src': 'embed', 'start': 1341.236, 'weight': 3, 'content': [{'end': 1345.601, 'text': "But it's a conditional language model, because it's also conditioning on your source sentence,", 'start': 1341.236, 'duration': 4.365}, {'end': 1348.164, 'text': 'which is represented by the encoding of the source sentence.', 'start': 1345.601, 'duration': 2.563}, {'end': 1353.254, 'text': 'So you could look, uh, you could view it like this.', 'start': 1351.232, 'duration': 2.022}, {'end': 1360.16, 'text': 'NMT is directly calculating the probability of the target sentence y given the source sentence x.', 'start': 1353.915, 'duration': 6.245}, {'end': 1366.466, 'text': "So, if you look at this, you'll see that this is just, uh, breaking down the probability of the sequence y, which we suppose is of length, uh,", 'start': 1360.16, 'duration': 6.306}, {'end': 1367.487, 'text': 'capital T.', 'start': 1366.466, 'duration': 1.021}], 'summary': 'Nmt calculates probability of target sentence given source sentence.', 'duration': 26.251, 'max_score': 1341.236, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c1341236.jpg'}, {'end': 1414.566, 'src': 'embed', 'start': 1390.526, 'weight': 1, 'content': [{'end': 1398.272, 'text': "So the reason I'm highlighting this is because if you remember in SMT, uh, we didn't directly learn the translation model P of y given x.", 'start': 1390.526, 'duration': 7.746}, {'end': 1401.835, 'text': 'We broke it down into, uh, uh, smaller components.', 'start': 1398.272, 'duration': 3.563}, {'end': 1406.099, 'text': 'Whereas here in NMT, we are directly learning this model.', 'start': 1402.416, 'duration': 3.683}, {'end': 1409.301, 'text': "And this is in some ways an advantage because it's simpler to do.", 'start': 1406.699, 'duration': 2.602}, {'end': 1412.344, 'text': "You don't have to learn all of these different systems and optimize them separately.", 'start': 1409.381, 'duration': 2.963}, {'end': 1414.566, 'text': "It's, uh, kind of simpler and easier.", 'start': 1412.744, 'duration': 1.822}], 'summary': 'Nmt directly learns translation model, simpler and easier.', 'duration': 24.04, 'max_score': 1390.526, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c1390526.jpg'}, {'end': 1635.065, 'src': 'embed', 'start': 1606.492, 'weight': 2, 'content': [{'end': 1611.773, 'text': 'people view training end-to-end as favorable because the idea is that you can optimize the system as a whole.', 'start': 1606.492, 'duration': 5.281}, {'end': 1617.895, 'text': 'You might think that if you optimize the parts separately, then when you put them together, they will not be optimal together necessarily.', 'start': 1612.133, 'duration': 5.762}, {'end': 1624.799, 'text': 'So if possible, directly optimizing the thing that you care about with respect to all of the parameters is more likely to succeed.', 'start': 1618.175, 'duration': 6.624}, {'end': 1627.221, 'text': 'However, there is a notion of pre-training.', 'start': 1625.339, 'duration': 1.882}, {'end': 1635.065, 'text': "And as you said, maybe you'd want to learn your decoder RNN as a kind of language model, an unconditional language model by itself.", 'start': 1627.541, 'duration': 7.524}], 'summary': 'Optimizing the system as a whole is favorable compared to optimizing the parts separately. pre-training is also considered.', 'duration': 28.573, 'max_score': 1606.492, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c1606492.jpg'}], 'start': 1025.233, 'title': 'Nmt and end-to-end training in nlp', 'summary': 'Introduces neural machine translation (nmt) using a single neural network architecture called sequence-to-sequence, highlighting its applications in nlp tasks and its advantage over traditional translation models. it also explains the end-to-end training process in nlp, addressing issues like handling the end token and the flexibility of training separate components.', 'chapters': [{'end': 1428.842, 'start': 1025.233, 'title': 'Neural machine translation', 'summary': 'Introduces neural machine translation (nmt) as a single neural network architecture called sequence-to-sequence, involving two rnns, and highlights its applications in nlp tasks, its versatility, and its advantage over traditional translation models.', 'duration': 403.609, 'highlights': ['Neural machine translation (NMT) utilizes a single neural network architecture called sequence-to-sequence, involving two RNNs to map a source sentence to a target sentence.', 'Sequence-to-sequence architecture is versatile, applicable to NLP tasks such as summarization, dialogue, parsing, and code generation.', 'NMT directly calculates the probability of the target sentence given the source sentence, simplifying the learning process compared to traditional translation models like Statistical Machine Translation (SMT).', 'During test time, the decoder generates text by feeding the word back in, using separate sets of word embeddings for the source and target languages.', 'NMT simplifies the learning process by directly learning the translation model, unlike SMT, where the model is broken down into smaller components.']}, {'end': 1819.042, 'start': 1429.202, 'title': 'Training end-to-end in nlp', 'summary': 'Explains the training process in nlp, where the encoder rnn processes the source sentence, the decoder rnn predicts the next word, and the system is trained end-to-end with backpropagation, addressing issues like handling the end token and the flexibility of training separate components.', 'duration': 389.84, 'highlights': ['The system is trained end-to-end with backpropagation, optimizing the entire system with respect to all parameters, viewed as favorable for achieving optimal performance.', 'The process of handling the end token and training time versus test time is explained, with the distinction between feeding the produced token back in during training and using the target sentence from the corpus to compute loss.', 'The practical implementation of handling sentences of different lengths in a parallel corpus is discussed, involving padding short sentences and avoiding the use of hidden states from the padding.']}], 'duration': 793.809, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c1025233.jpg', 'highlights': ['Sequence-to-sequence architecture is versatile, applicable to NLP tasks such as summarization, dialogue, parsing, and code generation.', 'NMT simplifies the learning process by directly learning the translation model, unlike SMT, where the model is broken down into smaller components.', 'The system is trained end-to-end with backpropagation, optimizing the entire system with respect to all parameters, viewed as favorable for achieving optimal performance.', 'NMT directly calculates the probability of the target sentence given the source sentence, simplifying the learning process compared to traditional translation models like Statistical Machine Translation (SMT).', 'During test time, the decoder generates text by feeding the word back in, using separate sets of word embeddings for the source and target languages.']}, {'end': 2532.655, 'segs': [{'end': 1938.646, 'src': 'embed', 'start': 1909.51, 'weight': 1, 'content': [{'end': 1910.891, 'text': "So that's the main problem with greedy decoding.", 'start': 1909.51, 'duration': 1.381}, {'end': 1913.072, 'text': "There's kind of no way to backtrack, no way to go back.", 'start': 1910.911, 'duration': 2.161}, {'end': 1916.055, 'text': 'So how can we fix this?', 'start': 1914.834, 'duration': 1.221}, {'end': 1922.96, 'text': 'And this relates back to, uh, what I told you earlier about how we might use a a kind of searching algorithm to do decoding in SMT.', 'start': 1916.695, 'duration': 6.265}, {'end': 1929.863, 'text': 'Uh, but first, you might, uh, think exhaustive search is a good idea.', 'start': 1926.142, 'duration': 3.721}, {'end': 1932.844, 'text': "Well, probably not because it's still a bad idea for the same reasons as before.", 'start': 1930.023, 'duration': 2.821}, {'end': 1938.646, 'text': 'So if you did want to do exhaustive search and search through the space of all possible French translations, uh,', 'start': 1933.304, 'duration': 5.342}], 'summary': 'Greedy decoding lacks backtracking, so we may use searching algorithm for decoding in smt.', 'duration': 29.136, 'max_score': 1909.51, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c1909510.jpg'}, {'end': 2126.114, 'src': 'embed', 'start': 2099.786, 'weight': 0, 'content': [{'end': 2104.527, 'text': "that is guaranteed to find the optimal solution, but it's just completely infeasible because it's so ex- expensive.", 'start': 2099.786, 'duration': 4.741}, {'end': 2110.729, 'text': 'So beam search is not guaranteed to find the optimal solution, but it is much more efficient than exhaustive search, of course.', 'start': 2105.367, 'duration': 5.362}, {'end': 2116.583, 'text': "Okay So, um, here's an example of beam search decoding in action.", 'start': 2112.779, 'duration': 3.804}, {'end': 2119.987, 'text': "Uh, so let's suppose that beam size equals k, uh, is two.", 'start': 2117.104, 'duration': 2.883}, {'end': 2126.114, 'text': 'And then as a reminder, we have, uh, this is the score that you apply to a partial, uh, hypothesis.', 'start': 2120.988, 'duration': 5.126}], 'summary': 'Beam search is more efficient than exhaustive search, with a beam size of 2.', 'duration': 26.328, 'max_score': 2099.786, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c2099786.jpg'}, {'end': 2164.729, 'src': 'embed', 'start': 2138.239, 'weight': 3, 'content': [{'end': 2145.161, 'text': 'So, having computed that probability distribution using our seek-to-seek model, then we just take the top k, that is top two possible options.', 'start': 2138.239, 'duration': 6.922}, {'end': 2148.042, 'text': "So, let's suppose that the top two are the words he and I.", 'start': 2145.581, 'duration': 2.461}, {'end': 2155.205, 'text': 'So the idea is that we can compute the score of these two hypotheses, uh, by using the formula above.', 'start': 2149.242, 'duration': 5.963}, {'end': 2159.066, 'text': "It's just the log probability of this word given the context so far.", 'start': 2155.765, 'duration': 3.301}, {'end': 2164.729, 'text': "So here, let's say that he has a score of minus 0.7 and I has a score of minus 0.9.", 'start': 2160.567, 'duration': 4.162}], 'summary': 'Using seek-to-seek model, top 2 options he and i have scores -0.7 and -0.9.', 'duration': 26.49, 'max_score': 2138.239, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c2138239.jpg'}, {'end': 2357.113, 'src': 'embed', 'start': 2331.898, 'weight': 4, 'content': [{'end': 2337.901, 'text': "So there's, uh, multiple possible stopping criterions, but two common ones are, you might say uh,", 'start': 2331.898, 'duration': 6.003}, {'end': 2343.905, 'text': "we're gonna stop doing beam search once we reach time, step t, where t is some uh predefined threshold that you choose.", 'start': 2337.901, 'duration': 6.004}, {'end': 2350.848, 'text': "So you might say uh, we're gonna stop, beam search after 30 steps, because we don't want any output sentences that are longer than 30 words,", 'start': 2344.265, 'duration': 6.583}, {'end': 2351.349, 'text': 'for example.', 'start': 2350.848, 'duration': 0.501}, {'end': 2357.113, 'text': "Or you might say, uh, we're gonna stop doing beam search once we've collected at least n completed hypotheses.", 'start': 2352.009, 'duration': 5.104}], 'summary': 'Beam search can be stopped at a predefined threshold or after collecting a specified number of completed hypotheses.', 'duration': 25.215, 'max_score': 2331.898, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c2331898.jpg'}, {'end': 2447.575, 'src': 'embed', 'start': 2419.538, 'weight': 5, 'content': [{'end': 2422.64, 'text': "Or I guess if we're adding longer probabilities, you're going to get a more negative value.", 'start': 2419.538, 'duration': 3.102}, {'end': 2430.245, 'text': "So it's not quite that you'll definitely choose the shortest hypothesis because you could overall have a lower score.", 'start': 2423.06, 'duration': 7.185}, {'end': 2435.829, 'text': "But there's definitely going to be a bias towards shorter translations because they'll in general have lower scores.", 'start': 2430.485, 'duration': 5.344}, {'end': 2438.37, 'text': 'So the way you can fix this is pretty simple.', 'start': 2436.789, 'duration': 1.581}, {'end': 2439.991, 'text': 'You just normalize by length.', 'start': 2438.79, 'duration': 1.201}, {'end': 2447.575, 'text': "So instead of using the score that we have above, you're going to use, uh, the score divided by T where T is the length of that hypothesis.", 'start': 2440.371, 'duration': 7.204}], 'summary': 'Bias towards shorter translations due to lower scores; can be fixed by normalizing score by length.', 'duration': 28.037, 'max_score': 2419.538, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c2419538.jpg'}], 'start': 1819.842, 'title': 'Neural machine translation decoding and beam search', 'summary': 'Discusses the limitations of greedy decoding in neural machine translation and introduces beam search with a beam size of 5 to 10 to track the k most probable partial translations. it also explains beam search process, stopping criteria, bias towards shorter translations, and score normalization by length.', 'chapters': [{'end': 2119.987, 'start': 1819.842, 'title': 'Neural machine translation decoding', 'summary': 'Discusses the limitations of greedy decoding in neural machine translation and introduces beam search decoding as a more efficient alternative with a beam size of 5 to 10, aiming to track the k most probable partial translations on each step of the decoder.', 'duration': 300.145, 'highlights': ['Beam search decoding is introduced as a more efficient alternative to greedy decoding in neural machine translation with a beam size of 5 to 10, aiming to track the k most probable partial translations on each step of the decoder.', 'Exhaustive search, while guaranteed to find the optimal solution, is deemed completely infeasible due to its exponential complexity.', 'The limitations of greedy decoding in neural machine translation are discussed, highlighting the inability to backtrack and the need for a more sophisticated decoding method.']}, {'end': 2532.655, 'start': 2120.988, 'title': 'Beam search and stopping criteria', 'summary': 'Explains the process of beam search in machine translation, using an example with top-k hypotheses, and discusses stopping criteria including steps taken and completed hypotheses, while addressing the issue of bias towards shorter translations and the normalization of scores by length.', 'duration': 411.667, 'highlights': ['The chapter provides a detailed explanation of the process of beam search and stopping criteria in machine translation, using an example with top-k hypotheses and discussing the issue of bias towards shorter translations.', 'The process involves computing the probability distribution of the next word using a seek-to-seek model and selecting the top k possible options based on scores calculated from log probabilities.', 'The stopping criteria for beam search includes predefined thresholds such as a maximum number of steps or a minimum number of completed hypotheses, with the aim of controlling the output length and the number of translations.', 'Addressing the bias towards shorter translations, the chapter outlines the normalization of scores by length as a solution to ensure fair comparison and selection of the top hypothesis.']}], 'duration': 712.813, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c1819842.jpg', 'highlights': ['Beam search decoding is introduced as a more efficient alternative to greedy decoding with a beam size of 5 to 10.', 'The limitations of greedy decoding in neural machine translation are discussed, emphasizing the inability to backtrack.', 'Exhaustive search is deemed completely infeasible due to its exponential complexity.', 'The process of beam search involves computing the probability distribution of the next word using a seek-to-seek model.', 'The stopping criteria for beam search includes predefined thresholds such as a maximum number of steps or a minimum number of completed hypotheses.', 'The chapter outlines the normalization of scores by length as a solution to address the bias towards shorter translations.']}, {'end': 3009.719, 'segs': [{'end': 2593.528, 'src': 'embed', 'start': 2561.403, 'weight': 0, 'content': [{'end': 2565.385, 'text': 'Uh, NMT systems tend to give better output than SMT systems in several ways.', 'start': 2561.403, 'duration': 3.982}, {'end': 2568.586, 'text': 'One is that the output often tends to be more fluent.', 'start': 2565.905, 'duration': 2.681}, {'end': 2575.37, 'text': 'Uh, this is probably because NMT, uh, this is probably because RNNs are particularly good at learning language models as you learned last week.', 'start': 2569.167, 'duration': 6.203}, {'end': 2579.892, 'text': "Uh, another way that they're better is they often use, uh, the context better.", 'start': 2576.47, 'duration': 3.422}, {'end': 2585.175, 'text': "That is, uh, they're better at conditioning on the source sentence and using that to change the output.", 'start': 2580.272, 'duration': 4.903}, {'end': 2593.528, 'text': "Another way they're better is they often, uh, are more able to generalize what they learn about phrases and how to translate them.", 'start': 2586.601, 'duration': 6.927}], 'summary': 'Nmt systems outperform smt systems in fluency, context usage, and generalization.', 'duration': 32.125, 'max_score': 2561.403, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c2561403.jpg'}, {'end': 2617.213, 'src': 'embed', 'start': 2594.649, 'weight': 1, 'content': [{'end': 2603.138, 'text': "if it sees an example of how to translate a certain source phrase and then later it sees a slightly different version of that source phrase, it's, uh,", 'start': 2594.649, 'duration': 8.489}, {'end': 2606.942, 'text': 'more able to generalize what it learns about the first phrase than SMT systems were.', 'start': 2603.138, 'duration': 3.804}, {'end': 2617.213, 'text': "Another big advantage of NMT systems compared to SMT that we talked about before is that it's a single neural network that can be optimized end-to-end.", 'start': 2610.003, 'duration': 7.21}], 'summary': 'Nmt systems generalize better than smt, with a single neural network.', 'duration': 22.564, 'max_score': 2594.649, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c2594649.jpg'}, {'end': 2677.548, 'src': 'embed', 'start': 2638.679, 'weight': 4, 'content': [{'end': 2641.361, 'text': "uh, there's relatively less engineering effort than MT.", 'start': 2638.679, 'duration': 2.682}, {'end': 2644.864, 'text': "And MT is certainly not easy, but it's, it's less complicated than SMT.", 'start': 2641.641, 'duration': 3.223}, {'end': 2647.786, 'text': "In particular, there's no feature engineering.", 'start': 2646.045, 'duration': 1.741}, {'end': 2652.17, 'text': "You don't have to define what features of, uh, linguistic phenomena that you want to capture.", 'start': 2648.587, 'duration': 3.583}, {'end': 2657.474, 'text': 'You can mostly just view it as a sequence of words, although, uh, there are different views on that.', 'start': 2652.81, 'duration': 4.664}, {'end': 2665.724, 'text': 'Uh, lastly, a great thing about NMT is that you can use pretty much the same method for all language pairs.', 'start': 2660.402, 'duration': 5.322}, {'end': 2671.346, 'text': "So if you've uh, you know built your French to English translation system and now you want to build a Spanish to English one, uh,", 'start': 2666.184, 'duration': 5.162}, {'end': 2674.567, 'text': 'you can probably use basically the same architecture and the same method,', 'start': 2671.346, 'duration': 3.221}, {'end': 2677.548, 'text': 'as long as you can go find a big enough parallel corpus of Spanish to English.', 'start': 2674.567, 'duration': 2.981}], 'summary': 'Nmt requires less engineering effort, no feature engineering, and is applicable to multiple language pairs.', 'duration': 38.869, 'max_score': 2638.679, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c2638679.jpg'}, {'end': 2759.213, 'src': 'embed', 'start': 2728.079, 'weight': 8, 'content': [{'end': 2732.922, 'text': "And, you know, that's by no means easy to interpret, but it was at least more interpretable than NMT.", 'start': 2728.079, 'duration': 4.843}, {'end': 2738.012, 'text': 'Another disadvantage is NMT is pretty difficult to control.', 'start': 2735.269, 'duration': 2.743}, {'end': 2742.796, 'text': 'So, uh, for example, if your NMT system is, uh.', 'start': 2738.592, 'duration': 4.204}, {'end': 2747.161, 'text': "doing a particular error, it's not very easy for you, the uh, the programmer,", 'start': 2742.796, 'duration': 4.365}, {'end': 2751.024, 'text': 'to specify some kind of rule or guideline that you want the NMT system to follow.', 'start': 2747.161, 'duration': 3.863}, {'end': 2759.213, 'text': 'So, for example, if you want to say, I want to always translate this word in this way, um, when- when this other thing is present.', 'start': 2751.485, 'duration': 7.728}], 'summary': 'Nmt is more interpretable than rnn, but difficult to control or specify rules for, making it challenging for programmers.', 'duration': 31.134, 'max_score': 2728.079, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c2728079.jpg'}, {'end': 2837.242, 'src': 'embed', 'start': 2807.895, 'weight': 7, 'content': [{'end': 2810.477, 'text': "doing things that their uh designers certainly didn't intend.", 'start': 2807.895, 'duration': 2.582}, {'end': 2822.087, 'text': 'Okay So, uh, how do we evaluate MT? Uh, every good NLP task needs to have an automatic metric so that we can, uh, measure our progress.', 'start': 2813.376, 'duration': 8.711}, {'end': 2829.757, 'text': 'So, the, uh, most commonly used evaluation metric for MT is called BLUE, and that stands for Bilingual Evaluation Understudy.', 'start': 2822.648, 'duration': 7.109}, {'end': 2837.242, 'text': "So the main idea is that Blue is gonna compare the translation that's produced by your machine translation system.", 'start': 2830.635, 'duration': 6.607}], 'summary': 'Evaluate mt using blue, a commonly used metric for machine translation.', 'duration': 29.347, 'max_score': 2807.895, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c2807895.jpg'}], 'start': 2538.564, 'title': 'Advantages of nmt', 'summary': 'Discusses the advantages of neural machine translation (nmt) over statistical machine translation (smt), including more fluent output, improved context utilization, enhanced ability to generalize, end-to-end optimization, simplicity, and language pair versatility, while highlighting disadvantages such as lower interpretability and difficulty in control and debugging.', 'chapters': [{'end': 2594.649, 'start': 2538.564, 'title': 'Advantages of nmt over smt', 'summary': 'Discusses the advantages of neural machine translation (nmt) over statistical machine translation (smt), highlighting better performance, including more fluent output, improved context utilization, and enhanced ability to generalize.', 'duration': 56.085, 'highlights': ['NMT systems tend to give better output than SMT systems, with more fluent results and improved context utilization.', 'RNNs are particularly good at learning language models, contributing to the fluency of NMT output.', 'NMT systems are more able to generalize what they learn about phrases and how to translate them.']}, {'end': 3009.719, 'start': 2594.649, 'title': 'Advantages and disadvantages of nmt', 'summary': 'Discusses the advantages of nmt over smt, including generalization from source phrase variations, end-to-end optimization, simplicity, convenience, less human engineering efforts, and language pair versatility, while highlighting the disadvantages such as lower interpretability, difficulty in control and debugging, and the use of blue as an evaluation metric.', 'duration': 415.07, 'highlights': ['NMT systems can generalize better than SMT systems from a source phrase to slightly different versions of that phrase.', 'NMT is a single neural network that can be optimized end-to-end, offering simplicity, convenience, and less human engineering efforts.', 'NMT requires much less human engineering effort and has no feature engineering, making it less complicated than SMT.', 'NMT allows using the same method for all language pairs, offering versatility in language translation systems.', 'NMT is less interpretable and difficult to control, making it challenging to attribute errors, debug, and impose rules.', 'BLUE is the most commonly used evaluation metric for MT, comparing machine translation to human translations based on n-gram precision and adding a brevity penalty.']}], 'duration': 471.155, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c2538564.jpg', 'highlights': ['NMT systems tend to give better output than SMT systems, with more fluent results and improved context utilization.', 'NMT systems are more able to generalize what they learn about phrases and how to translate them.', 'NMT systems can generalize better than SMT systems from a source phrase to slightly different versions of that phrase.', 'NMT is a single neural network that can be optimized end-to-end, offering simplicity, convenience, and less human engineering efforts.', 'NMT allows using the same method for all language pairs, offering versatility in language translation systems.', 'RNNs are particularly good at learning language models, contributing to the fluency of NMT output.', 'NMT requires much less human engineering effort and has no feature engineering, making it less complicated than SMT.', 'BLUE is the most commonly used evaluation metric for MT, comparing machine translation to human translations based on n-gram precision and adding a brevity penalty.', 'NMT is less interpretable and difficult to control, making it challenging to attribute errors, debug, and impose rules.']}, {'end': 3648.299, 'segs': [{'end': 3151.044, 'src': 'embed', 'start': 3110.433, 'weight': 0, 'content': [{'end': 3118.86, 'text': 'In particular, in 2014, the first Seek to Seek paper was published, and in 2016, Google Translate switches from SMT to NMT.', 'start': 3110.433, 'duration': 8.427}, {'end': 3122.043, 'text': 'This is a pretty remarkable turnaround for just two years.', 'start': 3119.4, 'duration': 2.643}, {'end': 3129.735, 'text': 'So, this is amazing not just because it was a quick turnaround, but also if you think about the level of human effort involved.', 'start': 3124.153, 'duration': 5.582}, {'end': 3137.558, 'text': 'Uh, these SMT systems, for example, the Google Translate SMT system was built by doubtless hundreds of engineers over many years.', 'start': 3130.355, 'duration': 7.203}, {'end': 3144.9, 'text': 'And this, uh, this SMT system was outperformed by an NMT system that was trained by, uh, you know, relatively few,', 'start': 3138.218, 'duration': 6.682}, {'end': 3146.821, 'text': 'like a handful of engineers in a few months.', 'start': 3144.9, 'duration': 1.921}, {'end': 3151.044, 'text': "So I'm not- I'm not diminishing how difficult it is to, um, build NMT systems.", 'start': 3147.521, 'duration': 3.523}], 'summary': 'By 2016, google translate switched to nmt, outperforming smt with far less human effort.', 'duration': 40.611, 'max_score': 3110.433, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c3110433.jpg'}, {'end': 3248.055, 'src': 'embed', 'start': 3218.01, 'weight': 3, 'content': [{'end': 3221.834, 'text': 'Uh, NMT definitely is not doing machine translation perfectly.', 'start': 3218.01, 'duration': 3.824}, {'end': 3226.099, 'text': 'So, um, just to highlight some of the difficulties that remain with NMT.', 'start': 3222.295, 'duration': 3.804}, {'end': 3228.481, 'text': 'Uh, one is out of vocabulary words.', 'start': 3226.659, 'duration': 1.822}, {'end': 3231.825, 'text': "Um, this is a kind of basic problem but it's, it's, it's pretty tricky.", 'start': 3228.882, 'duration': 2.943}, {'end': 3232.425, 'text': 'you know.', 'start': 3232.185, 'duration': 0.24}, {'end': 3237.408, 'text': "what do you do if you're trying to translate a sentence that contains a word that is not in your source vocabulary?", 'start': 3232.425, 'duration': 4.983}, {'end': 3240.63, 'text': "Or what if you're trying to produce a word that's not in your target vocabulary?", 'start': 3237.749, 'duration': 2.881}, {'end': 3248.055, 'text': "Um, there's certainly been lots of work on doing this and you're going to hear later in the class how you might try to attack this with, for example.", 'start': 3241.171, 'duration': 6.884}], 'summary': 'Nmt struggles with out-of-vocabulary words, presenting difficulties in translation.', 'duration': 30.045, 'max_score': 3218.01, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c3218010.jpg'}, {'end': 3537.642, 'src': 'embed', 'start': 3505.953, 'weight': 2, 'content': [{'end': 3510.198, 'text': "Okay So that's the first puzzle piece, but the other- the other puzzle piece is the nonsensical input.", 'start': 3505.953, 'duration': 4.245}, {'end': 3515.003, 'text': "So when the input isn't really Somali or any kind of text right?", 'start': 3510.698, 'duration': 4.305}, {'end': 3520.549, 'text': "It's just the same syllable over and over, then the NMT system doesn't really have anything sensible to condition on.", 'start': 3515.043, 'duration': 5.506}, {'end': 3522.291, 'text': "It's basically nonsense, it's just noise.", 'start': 3520.649, 'duration': 1.642}, {'end': 3528.035, 'text': "So what does the NMT system do, right? It can't really use, it can't really condition on the, uh, source sentence.", 'start': 3522.751, 'duration': 5.284}, {'end': 3531.438, 'text': 'So what it does is it just uses the English language model right?', 'start': 3528.075, 'duration': 3.363}, {'end': 3537.642, 'text': 'You can think of it as like the English language model, the decoder RNN just kind of goes into autopilot and starts generating random text.', 'start': 3531.498, 'duration': 6.144}], 'summary': 'Nonsensical input hampers nmt system, leading to random text generation.', 'duration': 31.689, 'max_score': 3505.953, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c3505953.jpg'}, {'end': 3574.223, 'src': 'embed', 'start': 3552.373, 'weight': 4, 'content': [{'end': 3560.597, 'text': 'Um, so this is an example why, uh, neural machine translation in particular makes these kinds of errors, uh, because the system is uninterpretable.', 'start': 3552.373, 'duration': 8.224}, {'end': 3563.718, 'text': "So you don't know that this is gonna happen until it happens,", 'start': 3560.997, 'duration': 2.721}, {'end': 3567.52, 'text': "and perhaps Google didn't know this was gonna happen until it happened and it got reported.", 'start': 3563.718, 'duration': 3.802}, {'end': 3574.223, 'text': "Um so this is one downside of uninterpretability is that really weird effects can happen and you don't see them coming,", 'start': 3567.54, 'duration': 6.683}], 'summary': 'Neural machine translation can lead to unexpected errors due to uninterpretability.', 'duration': 21.85, 'max_score': 3552.373, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c3552373.jpg'}, {'end': 3632.087, 'src': 'embed', 'start': 3603.045, 'weight': 5, 'content': [{'end': 3607.988, 'text': "Um, yeah, and there's a lot of examples of these online where you do different kinds of nonsense syllables in different languages.", 'start': 3603.045, 'duration': 4.943}, {'end': 3613.162, 'text': "So there's a lot of challenges remaining in NMT.", 'start': 3610.908, 'duration': 2.254}, {'end': 3616.015, 'text': 'and the research continues.', 'start': 3614.554, 'duration': 1.461}, {'end': 3620.679, 'text': 'So, NMT I think remains one of the flagship tasks for NLP deep learning.', 'start': 3616.576, 'duration': 4.103}, {'end': 3627.044, 'text': 'And in fact, NMT research has pioneered many of the successful innovations of NLP deep learning in general.', 'start': 3621.78, 'duration': 5.264}, {'end': 3632.087, 'text': 'So, today in 2019, NMT research continues to thrive.', 'start': 3628.164, 'duration': 3.923}], 'summary': 'Nmt research in 2019 continues to thrive with challenges remaining and pioneering innovations in nlp deep learning.', 'duration': 29.042, 'max_score': 3603.045, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c3603045.jpg'}], 'start': 3009.719, 'title': 'Nmt advancements and challenges', 'summary': 'Highlights the rapid success of neural machine translation (nmt) over statistical machine translation (smt), becoming the leading method by 2016. it also discusses challenges such as out of vocabulary words, domain mismatch, maintaining context over longer text, low-resource language pairs, biases in training data, and the uninterpretability of nmt when using low-resource languages like somali, with the bible as a significant resource.', 'chapters': [{'end': 3477.434, 'start': 3009.719, 'title': 'Nmt success in machine translation', 'summary': 'Highlights how neural machine translation (nmt) has rapidly outperformed statistical machine translation (smt) in a few years, with nmt being the leading method for machine translation by 2016, despite being a fringe research activity in 2014, and discusses the challenges and errors in nmt such as out of vocabulary words, domain mismatch, maintaining context over longer text, low-resource language pairs, and biases in the training data.', 'duration': 467.715, 'highlights': ['NMT became the leading standard method for machine translation in 2016, outperforming SMT despite being a fringe research activity in 2014, showcasing a remarkable turnaround in just two years.', 'NMT systems outperformed SMT systems with significantly less human effort involved, with an NMT system trained by a handful of engineers in a few months surpassing the performance of an SMT system built by hundreds of engineers over many years.', 'Challenges and errors in NMT include out of vocabulary words, domain mismatch, maintaining context over longer text, low-resource language pairs, as well as biases in the training data.']}, {'end': 3648.299, 'start': 3480.216, 'title': 'Challenges in nmt and bible-based training', 'summary': 'Discusses the challenges in neural machine translation (nmt) using low-resource languages such as somali, where the bible is one of the best resources for parallel text, leading to nonsensical outputs when the input is not coherent, illustrating the uninterpretability of nmt and the ongoing research in nmt.', 'duration': 168.083, 'highlights': ['NMT using the Bible as training text for low-resource languages like Somali can lead to nonsensical outputs when the input is not coherent.', 'NMT system generates random text when it encounters nonsensical input, resembling the behavior of language models trained on specific styles.', 'Uninterpretability of the NMT system can lead to unexpected errors, making it challenging to predict and explain the outcomes.', 'Ongoing research in NMT continues to address challenges and pioneer innovations in NLP deep learning.']}], 'duration': 638.58, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c3009719.jpg', 'highlights': ['NMT became the leading standard method for machine translation in 2016, outperforming SMT despite being a fringe research activity in 2014, showcasing a remarkable turnaround in just two years.', 'NMT systems outperformed SMT systems with significantly less human effort involved, with an NMT system trained by a handful of engineers in a few months surpassing the performance of an SMT system built by hundreds of engineers over many years.', 'NMT using the Bible as training text for low-resource languages like Somali can lead to nonsensical outputs when the input is not coherent.', 'Challenges and errors in NMT include out of vocabulary words, domain mismatch, maintaining context over longer text, low-resource language pairs, as well as biases in the training data.', 'Uninterpretability of the NMT system can lead to unexpected errors, making it challenging to predict and explain the outcomes.', 'Ongoing research in NMT continues to address challenges and pioneer innovations in NLP deep learning.', 'NMT system generates random text when it encounters nonsensical input, resembling the behavior of language models trained on specific styles.']}, {'end': 4609.477, 'segs': [{'end': 3925.224, 'src': 'heatmap', 'start': 3734.88, 'weight': 1, 'content': [{'end': 3740.264, 'text': "If some information about the source sentence isn't in that vector, then there's no way the decoder is gonna be able to translate it correctly.", 'start': 3734.88, 'duration': 5.384}, {'end': 3743.246, 'text': 'So this is the, uh, this is an informational bottleneck.', 'start': 3740.944, 'duration': 2.302}, {'end': 3748.289, 'text': "It's putting kind of too much pressure on this single vector to be a good representation of the encoder.", 'start': 3743.986, 'duration': 4.303}, {'end': 3751.795, 'text': 'So, this is the motivation for attention.', 'start': 3750.054, 'duration': 1.741}, {'end': 3756.056, 'text': 'Attention is a neural technique and it provides a solution to the bottleneck problem.', 'start': 3752.635, 'duration': 3.421}, {'end': 3759.638, 'text': 'The core idea is that on each step of the decoder,', 'start': 3756.737, 'duration': 2.901}, {'end': 3765.3, 'text': "you're gonna use a direct connection to the encoder to focus on a particular part of the source sequence.", 'start': 3759.638, 'duration': 5.662}, {'end': 3771.382, 'text': "So, first I'm gonna show you what attention is via a diagram.", 'start': 3768.181, 'duration': 3.201}, {'end': 3774.783, 'text': "So, that's kind of an intuitive explanation and then I'm gonna show you the equations later.", 'start': 3771.402, 'duration': 3.381}, {'end': 3778.525, 'text': "So, here's how sequence to sequence with attention works.", 'start': 3775.944, 'duration': 2.581}, {'end': 3784.577, 'text': 'So on the first step of our decoder, uh, we have our first decoder hidden state.', 'start': 3780.356, 'duration': 4.221}, {'end': 3794.339, 'text': 'So what we do is we take the dot products between that decoder hidden state and the first encoder hidden state and then we get something called an attention score,', 'start': 3785.577, 'duration': 8.762}, {'end': 3795.5, 'text': "which I'm representing by a dot.", 'start': 3794.339, 'duration': 1.161}, {'end': 3796.46, 'text': "So that's a scalar.", 'start': 3795.52, 'duration': 0.94}, {'end': 3802.701, 'text': 'And in fact, we take the dot product between the decoder hidden state and all of the encoder hidden states.', 'start': 3797.7, 'duration': 5.001}, {'end': 3809.083, 'text': 'So this means that we get one attention score, one scalar for each of these, uh, source words effectively.', 'start': 3803.242, 'duration': 5.841}, {'end': 3816.966, 'text': 'So next, what we do is we take those four numbers scores and we apply the softmax uh distribution.', 'start': 3810.819, 'duration': 6.147}, {'end': 3820.59, 'text': 'uh, the softmax function to them, and then we get a probability distribution.', 'start': 3816.966, 'duration': 3.624}, {'end': 3825.536, 'text': "So, here I'm going to represent that probability distribution as a bar chart.", 'start': 3821.652, 'duration': 3.884}, {'end': 3829.461, 'text': 'Um, and we call this the attention distribution, and this one sums up to one.', 'start': 3826.357, 'duration': 3.104}, {'end': 3834.439, 'text': 'So here, you can see that most of the probability mass is on the first word.', 'start': 3830.836, 'duration': 3.603}, {'end': 3841.565, 'text': "And that kind of makes sense because our first word essentially means he, and, uh, we're gonna be producing the word he first in our target sentence.", 'start': 3835.059, 'duration': 6.506}, {'end': 3850.632, 'text': "So once we've got this attention distribution, uh, we're going to use it to produce something called the attention output.", 'start': 3843.026, 'duration': 7.606}, {'end': 3860.072, 'text': 'So the idea is that the attention output is a weighted sum of the encoder hidden states, and the weighting is the attention distribution.', 'start': 3851.806, 'duration': 8.266}, {'end': 3864.196, 'text': "So I've got these dotted arrows that go from the attention distribution to the attention output.", 'start': 3860.893, 'duration': 3.303}, {'end': 3868.179, 'text': "Probably there should be dotted arrows also from the encoder RNN, but that's hard to depict.", 'start': 3864.536, 'duration': 3.643}, {'end': 3873.503, 'text': "But the idea is that you're summing up these encoder, RNN, uh, hidden states,", 'start': 3869.06, 'duration': 4.443}, {'end': 3876.906, 'text': "but you're gonna weight each one according to how much attention distribution it has on it.", 'start': 3873.503, 'duration': 3.403}, {'end': 3881.145, 'text': 'So this means that your attention output, which is a single vector,', 'start': 3878.123, 'duration': 3.022}, {'end': 3885.128, 'text': 'is going to be mostly containing information from the hidden states that had high attention.', 'start': 3881.145, 'duration': 3.983}, {'end': 3888.631, 'text': "In this case, it's gonna be mostly information from the first hidden state.", 'start': 3885.489, 'duration': 3.142}, {'end': 3898.257, 'text': "So, after you do this, you're going to use the attention output to influence your prediction of the next word.", 'start': 3893.314, 'duration': 4.943}, {'end': 3903.901, 'text': 'So what you usually do is you concatenate the attention output with your decoder hidden states and then, uh,', 'start': 3898.698, 'duration': 5.203}, {'end': 3908.465, 'text': 'use that kind of concatenated pair in the way you would have used the decoder hidden state alone before.', 'start': 3903.901, 'duration': 4.564}, {'end': 3913.128, 'text': "So, that way you can get your probability distribution, uh, y hat one of what's coming next.", 'start': 3909.285, 'duration': 3.843}, {'end': 3917.031, 'text': 'So, as before, you can use that to sample your next word.', 'start': 3915.169, 'duration': 1.862}, {'end': 3920.501, 'text': 'So, on the next step, you just do the same thing again.', 'start': 3918.399, 'duration': 2.102}, {'end': 3922.442, 'text': "You've got your second decoder hidden state.", 'start': 3920.781, 'duration': 1.661}, {'end': 3925.224, 'text': 'Again, you take dot product with all of the encoder hidden states.', 'start': 3922.843, 'duration': 2.381}], 'summary': 'Attention technique solves bottleneck by focusing on encoder parts, weighting information for accurate translation.', 'duration': 190.344, 'max_score': 3734.88, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c3734880.jpg'}, {'end': 3982.913, 'src': 'embed', 'start': 3954.87, 'weight': 3, 'content': [{'end': 3959.355, 'text': 'Here you have a much more flexible soft notion of alignment where, uh,', 'start': 3954.87, 'duration': 4.485}, {'end': 3963.179, 'text': 'each word kind of has a distribution over the corresponding words in the source sentence.', 'start': 3959.355, 'duration': 3.824}, {'end': 3971.967, 'text': 'So another thing to note kind of a side note is that sometimes, uh, we take the attention output from the previous hidden state, uh,', 'start': 3964.864, 'duration': 7.103}, {'end': 3976.29, 'text': 'and we kind of feed it into the decoder again along with the usual word.', 'start': 3971.967, 'duration': 4.323}, {'end': 3982.913, 'text': 'So that would mean you take the attention output from the first step and kind of concatenate it to the word vector for he and then use it in the decoder.', 'start': 3976.63, 'duration': 6.283}], 'summary': 'Flexible soft alignment with word distributions in source sentence.', 'duration': 28.043, 'max_score': 3954.87, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c3954870.jpg'}, {'end': 4136.3, 'src': 'embed', 'start': 4104.102, 'weight': 0, 'content': [{'end': 4111.968, 'text': "And the main reason why it improves it is because it turns out it's super useful to allow the decoder to focus on certain parts of the source sentence when it's translating.", 'start': 4104.102, 'duration': 7.866}, {'end': 4113.929, 'text': 'And you can see why this makes sense right?', 'start': 4112.368, 'duration': 1.561}, {'end': 4118.712, 'text': "Because there's a very natural notion of alignment, and if you can focus on the specific word or words that you're translating,", 'start': 4113.969, 'duration': 4.743}, {'end': 4119.774, 'text': 'you can probably do a better job.', 'start': 4118.712, 'duration': 1.062}, {'end': 4124.274, 'text': 'Another reason why attention is cool is that it solves the bottleneck problem.', 'start': 4121.294, 'duration': 2.98}, {'end': 4135.099, 'text': "Uh, we were noticing that the problem with having a single vector that has to represent the entire source sentence and that's the only way information can pass from encoder to decoder means that if that encoding isn't very good,", 'start': 4124.814, 'duration': 10.285}, {'end': 4136.3, 'text': "then uh, you're not gonna do well.", 'start': 4135.099, 'duration': 1.201}], 'summary': 'Attention improves translation by allowing the decoder to focus on specific parts of the source sentence, solving the bottleneck problem and enhancing translation accuracy.', 'duration': 32.198, 'max_score': 4104.102, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c4104102.jpg'}, {'end': 4186.109, 'src': 'embed', 'start': 4153.096, 'weight': 2, 'content': [{'end': 4160.92, 'text': 'Uh, the reason why attention helps is because you have these, uh, direct connections between the decoder and the encoder kind of over many time steps.', 'start': 4153.096, 'duration': 7.824}, {'end': 4162.72, 'text': "So, it's like a shortcut connection.", 'start': 4161.3, 'duration': 1.42}, {'end': 4169.584, 'text': "And just as we learned last time about, uh, skip connections being useful for reducing vanishing gradient, here it's the same notion.", 'start': 4162.861, 'duration': 6.723}, {'end': 4173.466, 'text': 'We have these, uh, long distance direct connections that help the gradients flow better.', 'start': 4169.604, 'duration': 3.862}, {'end': 4178.046, 'text': 'Another great thing about attention is it provides some interpretability.', 'start': 4175.666, 'duration': 2.38}, {'end': 4186.109, 'text': "Uh, if you look at the attention distribution after you've produced your translation, uh, you can see what the decoder was focusing on, on each step.", 'start': 4178.067, 'duration': 8.042}], 'summary': 'Attention creates direct connections for better gradient flow and interpretability in translation.', 'duration': 33.013, 'max_score': 4153.096, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c4153095.jpg'}], 'start': 3648.919, 'title': 'Attention in deep learning', 'summary': 'Explains attention in deep learning, its unsupervised learning ability, general application in various tasks, and methods of computing attention scores. it also emphasizes its significant role in improving sequence-to-sequence models for machine translation (mt) and obtaining a fixed-size representation from arbitrary sets of representations, addressing the informational bottleneck problem and providing interpretability through attention distribution.', 'chapters': [{'end': 4024.558, 'start': 3648.919, 'title': 'Improving sequence to sequence with attention', 'summary': 'Explains how attention addresses the informational bottleneck problem in sequence-to-sequence models by allowing the decoder to focus on specific parts of the source sequence, using scalar attention scores and a probability distribution to generate a weighted sum of encoder hidden states for better prediction.', 'duration': 375.639, 'highlights': ['Attention addresses the informational bottleneck problem in sequence-to-sequence models by allowing the decoder to focus on specific parts of the source sequence.', 'Scalar attention scores and a probability distribution are used to generate a weighted sum of encoder hidden states for better prediction.', 'Soft alignment in attention allows for a flexible distribution over corresponding words in the source sentence, providing a more nuanced approach compared to hard binary alignment in SMT systems.']}, {'end': 4262.046, 'start': 4024.578, 'title': 'Attention in nmt', 'summary': 'Explains the concept of attention in nmt, highlighting its advantages such as significant nmt performance improvement, solving the bottleneck problem, helping with the vanishing gradient problem, and providing interpretability through attention distribution.', 'duration': 237.468, 'highlights': ['Attention significantly improves NMT performance by allowing the decoder to focus on specific parts of the source sentence, enhancing translation quality.', 'Attention solves the bottleneck problem by allowing the decoder to directly access the encoder and source sentence, eliminating the restriction of a single vector representation for the entire source sentence.', 'Attention helps with the vanishing gradient problem through direct connections between the decoder and the encoder over many time steps, facilitating better gradient flow.', "Attention provides interpretability through attention distribution, allowing the visualization of the decoder's focus on each step, resulting in a soft version of alignment without the need for explicit training."]}, {'end': 4609.477, 'start': 4262.586, 'title': 'Understanding attention in deep learning', 'summary': 'Discusses the concept of attention in deep learning, emphasizing its unsupervised learning ability, general application in various architectures and tasks, and the different methods of computing attention scores. it also highlights its role in improving the sequence-to-sequence model for mt and its significance in obtaining a fixed size representation from an arbitrary set of representations.', 'duration': 346.891, 'highlights': ["Attention's unsupervised learning ability and general application", 'Improvement of sequence-to-sequence model for MT', 'Different methods of computing attention scores', 'Role of attention in obtaining a fixed size representation from an arbitrary set of representations']}], 'duration': 960.558, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/XXtpJxZBa2c/pics/XXtpJxZBa2c3648919.jpg', 'highlights': ['Attention addresses the informational bottleneck problem in sequence-to-sequence models', 'Attention significantly improves NMT performance by allowing the decoder to focus on specific parts of the source sentence', 'Attention helps with the vanishing gradient problem through direct connections between the decoder and the encoder over many time steps', 'Soft alignment in attention allows for a flexible distribution over corresponding words in the source sentence', "Attention provides interpretability through attention distribution, allowing the visualization of the decoder's focus on each step"]}], 'highlights': ['NMT outperformed SMT with significantly less human effort involved, showcasing a remarkable turnaround in just two years.', 'NMT simplifies the learning process by directly learning the translation model, unlike SMT, where the model is broken down into smaller components.', 'Attention significantly improves NMT performance by allowing the decoder to focus on specific parts of the source sentence.', 'NMT systems tend to give better output than SMT systems, with more fluent results and improved context utilization.', 'NMT became the leading standard method for machine translation in 2016, outperforming SMT despite being a fringe research activity in 2014.', 'NMT allows using the same method for all language pairs, offering versatility in language translation systems.', 'NMT is a single neural network that can be optimized end-to-end, offering simplicity, convenience, and less human engineering efforts.', 'NMT systems are more able to generalize what they learn about phrases and how to translate them.', 'NMT requires much less human engineering effort and has no feature engineering, making it less complicated than SMT.', 'Attention helps with the vanishing gradient problem through direct connections between the decoder and the encoder over many time steps.']}