title
Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 14 – Transformers and Self-Attention

description
For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/3niIw41

Professor Christopher Manning, Stanford University
Ashish Vaswani & Anna Huang, Google
http://onlinehub.stanford.edu/

Professor Christopher Manning
Thomas M. Siebel Professor in Machine Learning, Professor of Linguistics and of Computer Science
Director, Stanford Artificial Intelligence Laboratory (SAIL)

To follow along with the course schedule and syllabus, visit: http://web.stanford.edu/class/cs224n/index.html#schedule

0:00 Introduction
2:07 Learning Representations of Variable Length Data
2:28 Recurrent Neural Networks
4:51 Convolutional Neural Networks?
14:06 Attention is Cheap!
16:05 Attention head: Who
16:26 Attention head: Did What?
16:35 Multihead Attention
17:34 Machine Translation: WMT-2014 BLEU
19:07 Frameworks
19:31 Importance of Residuals
23:26 Non-local Means
26:18 Image Transformer Layer
30:56 Raw representations in music and language
37:52 Attention: a weighted average
40:08 Closer look at relative attention
42:41 A Jazz sample from Music Transformer
44:42 Convolutions and Translational Equivariance
45:12 Relative positions Translational Equivariance
50:21 Sequential generation breaks modes.
50:32 Active Research Area

#naturallanguageprocessing #deeplearning

detail
{'title': 'Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 14 – Transformers and Self-Attention', 'heatmap': [{'end': 457.601, 'start': 383.56, 'weight': 0.857}, {'end': 683.476, 'start': 612.944, 'weight': 0.918}, {'end': 842.936, 'start': 707.783, 'weight': 0.927}, {'end': 975.796, 'start': 902.9, 'weight': 0.797}], 'summary': 'The lecture introduces invited speakers discussing self-attention for generative models and its applications in music, explores self-attention in representation learning, discusses the success of transformer architecture in machine translation, and its use in modeling text and image relationships, as well as its application in music generation and memory efficiency, demonstrating its impact on various domains including nlp.', 'chapters': [{'end': 73.748, 'segs': [{'end': 51.632, 'src': 'embed', 'start': 5.603, 'weight': 0, 'content': [{'end': 11.148, 'text': "Okay So, I'm delighted to introduce, um, our first lot of invited speakers.", 'start': 5.603, 'duration': 5.545}, {'end': 14.47, 'text': "And so, we're gonna have two invited speakers, um, today.", 'start': 11.468, 'duration': 3.002}, {'end': 21.916, 'text': "So, starting off, um, we're gonna have Ashish Vaswani, who's gonna be talking about self-attention for generative models.", 'start': 14.55, 'duration': 7.366}, {'end': 28.922, 'text': "And in particular, um, we'll introduce some of the work on transformers that he is well known for along with his colleagues.", 'start': 21.996, 'duration': 6.926}, {'end': 39.288, 'text': "Um, and then as a sort of, um, a special edition, we're then also gonna have, Anna Huang talking about some applications of this work.", 'start': 29.422, 'duration': 9.866}, {'end': 43.67, 'text': 'There are actually at least a couple of people in the class who are actually interested in music applications.', 'start': 39.448, 'duration': 4.222}, {'end': 48.551, 'text': 'So this will be your one chance in the course to see music applications of deep learning.', 'start': 43.97, 'duration': 4.581}, {'end': 51.632, 'text': "Okay Um, so I'll hand it over to Ashish.", 'start': 48.951, 'duration': 2.681}], 'summary': 'Two invited speakers today. 
ashish vaswani on self-attention for generative models and anna huang on applications of this work.', 'duration': 46.029, 'max_score': 5.603, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/5vcj8kSwBCY/pics/5vcj8kSwBCY5603.jpg'}], 'start': 5.603, 'title': 'Invited speakers on self-attention', 'summary': 'Introduces two invited speakers, ashish vaswani and anna huang, who will discuss self-attention for generative models and its applications in music, aiming to engage the large class and introduce the work on transformers.', 'chapters': [{'end': 73.748, 'start': 5.603, 'title': 'Invited speakers on self-attention', 'summary': 'Introduces two invited speakers, ashish vaswani and anna huang, who will discuss self-attention for generative models and its applications in music, aiming to engage the large class and introduce the work on transformers.', 'duration': 68.145, 'highlights': ['Anna Huang to present applications of self-attention in music, catering to the interests of a significant portion of the class.', 'Ashish Vaswani to discuss self-attention for generative models, known for his work on transformers.']}], 'duration': 68.145, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/5vcj8kSwBCY/pics/5vcj8kSwBCY5603.jpg', 'highlights': ['Anna Huang to present applications of self-attention in music, catering to the interests of a significant portion of the class.', 'Ashish Vaswani to discuss self-attention for generative models, known for his work on transformers.']}, {'end': 999.031, 'segs': [{'end': 146.417, 'src': 'embed', 'start': 98.867, 'weight': 0, 'content': [{'end': 105.773, 'text': "and is there a model that exists that's a very good that has the inductive biases to model these properties that exist in my data set.", 'start': 98.867, 'duration': 6.906}, {'end': 109.696, 'text': 'So, hopefully, over the course of this lecture,', 'start': 106.213, 'duration': 3.483}, {'end': 118.164, 'text': 'Anna and I will convince you that self-attention indeed does have some has the ability to model some inductive biases that potentially could be useful for the problems that you care about.', 'start': 109.696, 'duration': 8.468}, {'end': 125.891, 'text': 'So this talk is going to be on learning representations, primarily of variable length data.', 'start': 120.649, 'duration': 5.242}, {'end': 129.892, 'text': 'Well, we have images, but most of it is going to be variable length data.', 'start': 125.931, 'duration': 3.961}, {'end': 137.714, 'text': 'And all of us care about this problem because in deep learning, deep learning is all about representation learning.', 'start': 130.532, 'duration': 7.182}, {'end': 146.417, 'text': 'And building the right tools for learning representations is an important factor in achieving empirical success.', 'start': 138.274, 'duration': 8.143}], 'summary': 'Self-attention can model inductive biases for variable length data in representation learning.', 'duration': 47.55, 'max_score': 98.867, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/5vcj8kSwBCY/pics/5vcj8kSwBCY98867.jpg'}, {'end': 232.375, 'src': 'embed', 'start': 198.876, 'weight': 5, 'content': [{'end': 206.537, 'text': 'sequentially and at each position, at each time step, they produce a continuous representation.', 'start': 198.876, 'duration': 7.661}, {'end': 210.198, 'text': "that's a summarization of everything that they've actually crunched through.", 'start': 206.537, 'duration': 3.661}, {'end': 222.644, 'text': 'Now, 
so, In the realm of large data, having parallel models is quite beneficial.', 'start': 212.438, 'duration': 10.206}, {'end': 225.868, 'text': 'In fact, I was actually reading Oliver Selfridge.', 'start': 223.044, 'duration': 2.824}, {'end': 232.375, 'text': 'He was a professor at MIT, and he wrote the precursor recursive to deep nets.', 'start': 225.988, 'duration': 6.387}], 'summary': 'Parallel models in large data benefit, as discussed by oliver selfridge, an mit professor.', 'duration': 33.499, 'max_score': 198.876, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/5vcj8kSwBCY/pics/5vcj8kSwBCY198876.jpg'}, {'end': 336.941, 'src': 'embed', 'start': 291.613, 'weight': 2, 'content': [{'end': 298.558, 'text': "So there's been excellent work, precursor to self-attention, that actually surmounted some of these difficulties.", 'start': 291.613, 'duration': 6.945}, {'end': 299.539, 'text': 'So what were these difficulties?', 'start': 298.578, 'duration': 0.961}, {'end': 310.007, 'text': "Basically, it's a convolutional sequence models where you have these limited receptive field convolutions that again consume the sentence not sequentially but in depth,", 'start': 299.579, 'duration': 10.428}, {'end': 314.59, 'text': 'and they produce representations of your variable length sequences.', 'start': 310.007, 'duration': 4.583}, {'end': 320.774, 'text': "And they're trivial to parallelize because you can apply these convolutions simultaneously at every position.", 'start': 316.351, 'duration': 4.423}, {'end': 322.555, 'text': 'Each layer is trivial to parallelize.', 'start': 321.194, 'duration': 1.361}, {'end': 326.097, 'text': 'The serial dependencies are only in the number of layers.', 'start': 323.455, 'duration': 2.642}, {'end': 333.76, 'text': 'you can get these local dependencies efficiently because at a single application of a convolution,', 'start': 329.559, 'duration': 4.201}, {'end': 336.941, 'text': 'can consume all the information inside its local receptive field.', 'start': 333.76, 'duration': 3.181}], 'summary': 'Self-attention precursor overcomes limitations of convolutional sequence models.', 'duration': 45.328, 'max_score': 291.613, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/5vcj8kSwBCY/pics/5vcj8kSwBCY291613.jpg'}, {'end': 457.601, 'src': 'heatmap', 'start': 359.907, 'weight': 4, 'content': [{'end': 362.168, 'text': 'But they were a great development and they actually pushed a lot of research.', 'start': 359.907, 'duration': 2.261}, {'end': 368.151, 'text': 'like Wave RNN, for example, is a classic success story of convolutional sequence models, even ByteNet.', 'start': 362.168, 'duration': 5.983}, {'end': 378.997, 'text': 'So far, attention has been one of the most important components, the sort of content-based memory retrieval mechanism.', 'start': 371.252, 'duration': 7.745}, {'end': 383.54, 'text': "And it's content-based because you have your decoder that attends to all this content.", 'start': 379.037, 'duration': 4.503}, {'end': 390.845, 'text': "That's your encoder, and then just sort of decides what information to absorb based on how similar this content is to every position in the memory.", 'start': 383.56, 'duration': 7.285}, {'end': 394.728, 'text': 'So this has been a very critical mechanism in neural machine translation.', 'start': 391.466, 'duration': 3.262}, {'end': 398.911, 'text': 'So now the question that we asked was why not just use attention for representations?', 'start': 395.068, 'duration': 
3.843}, {'end': 405.697, 'text': "Here's what sort of a rough framework of this representation mechanism would look like.", 'start': 401.913, 'duration': 3.784}, {'end': 409.341, 'text': 'Just sort of repeating what attention is essentially.', 'start': 406.798, 'duration': 2.543}, {'end': 413.706, 'text': 'Now, I imagine you want to represent the word, re-represent the word represent.', 'start': 409.361, 'duration': 4.345}, {'end': 420.674, 'text': 'You want to construct its new representation, and then first, you attend or you compare yourself, you compare your content.', 'start': 413.746, 'duration': 6.928}, {'end': 422.556, 'text': 'In the beginning, it could just be a word embedding.', 'start': 420.774, 'duration': 1.782}, {'end': 427.499, 'text': 'You compare a content with all your words, uh, with all- with all the embeddings, and based on these-,', 'start': 422.856, 'duration': 4.643}, {'end': 433.784, 'text': 'based on these compatibilities or these comparisons you produce a- you produce a weighted combination of your entire neighborhood.', 'start': 427.499, 'duration': 6.285}, {'end': 437.806, 'text': 'And based on that weighted combination, you- you summarize all that information.', 'start': 434.224, 'duration': 3.582}, {'end': 442.53, 'text': "So, it's like you're re-expressing yourself in ser- in terms of a weighted combination of your entire neighborhood.", 'start': 437.867, 'duration': 4.663}, {'end': 443.891, 'text': "That's what attention does.", 'start': 442.87, 'duration': 1.021}, {'end': 448.414, 'text': 'And you can add feed forward layers to basically sort of compute new features for you.', 'start': 444.011, 'duration': 4.403}, {'end': 457.601, 'text': 'Um now, um So the first part is going to be about how some of the properties of self-attention actually help us in text generation,', 'start': 449.254, 'duration': 8.347}], 'summary': 'The transcript discusses the importance of attention mechanism in neural machine translation and proposes using attention for representations.', 'duration': 45.79, 'max_score': 359.907, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/5vcj8kSwBCY/pics/5vcj8kSwBCY359907.jpg'}, {'end': 683.476, 'src': 'heatmap', 'start': 612.944, 'weight': 0.918, 'content': [{'end': 616.525, 'text': 'And notice that attention is permutation invariant.', 'start': 612.944, 'duration': 3.581}, {'end': 619.166, 'text': 'So you just change the order of your positions.', 'start': 616.585, 'duration': 2.581}, {'end': 623.088, 'text': "You change the order of your words and it's not going to affect the actual output.", 'start': 619.186, 'duration': 3.902}, {'end': 626.749, 'text': 'So in order to maintain order, we add position representations.', 'start': 623.468, 'duration': 3.281}, {'end': 632.711, 'text': "And there's two kinds that we tried in the paper, these fantastic sinusoids with Noam Chazir invented.", 'start': 627.469, 'duration': 5.242}, {'end': 636.492, 'text': 'And we also use learned representations, which are very plain vanilla.', 'start': 632.731, 'duration': 3.761}, {'end': 637.933, 'text': 'Both of them work equally well.', 'start': 636.953, 'duration': 0.98}, {'end': 642.714, 'text': 'And so first we have so the encoder looks as follows, right?', 'start': 639.373, 'duration': 3.341}, {'end': 649.836, 'text': 'So we have a self-attention layer that just recomputes the representation for every position simultaneously using attention.', 'start': 643.054, 'duration': 6.782}, {'end': 654.078, 'text': 'Then we have a feed 
forward layer and we also have residual connections,', 'start': 650.276, 'duration': 3.802}, {'end': 657.158, 'text': "and I'll sort of give you a glimpse of what these residual connections might be bringing.", 'start': 654.078, 'duration': 3.08}, {'end': 662.822, 'text': 'That is between every, every layer and the input we have a skip connection that just adds the activations.', 'start': 657.519, 'duration': 5.303}, {'end': 667.825, 'text': 'And then this tuple of self-attention and feed forward layer just essentially repeats.', 'start': 663.702, 'duration': 4.123}, {'end': 673.749, 'text': 'Now on the decoder side, we have a sort of standard encoder decoder architecture.', 'start': 668.466, 'duration': 5.283}, {'end': 677.292, 'text': 'On the decoder side, we mimic a language model using self-attention.', 'start': 674.07, 'duration': 3.222}, {'end': 683.476, 'text': 'And the way to mimic a language model using self-attention is to impose causality by just masking out the positions that you can look at.', 'start': 677.592, 'duration': 5.884}], 'summary': 'Self-attention is permutation invariant, uses position representations, and employs both sinusoidal and learned representations, with skip connections between layers.', 'duration': 70.532, 'max_score': 612.944, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/5vcj8kSwBCY/pics/5vcj8kSwBCY612944.jpg'}, {'end': 667.825, 'src': 'embed', 'start': 639.373, 'weight': 8, 'content': [{'end': 642.714, 'text': 'And so first we have so the encoder looks as follows, right?', 'start': 639.373, 'duration': 3.341}, {'end': 649.836, 'text': 'So we have a self-attention layer that just recomputes the representation for every position simultaneously using attention.', 'start': 643.054, 'duration': 6.782}, {'end': 654.078, 'text': 'Then we have a feed forward layer and we also have residual connections,', 'start': 650.276, 'duration': 3.802}, {'end': 657.158, 'text': "and I'll sort of give you a glimpse of what these residual connections might be bringing.", 'start': 654.078, 'duration': 3.08}, {'end': 662.822, 'text': 'That is between every, every layer and the input we have a skip connection that just adds the activations.', 'start': 657.519, 'duration': 5.303}, {'end': 667.825, 'text': 'And then this tuple of self-attention and feed forward layer just essentially repeats.', 'start': 663.702, 'duration': 4.123}], 'summary': 'Encoder includes self-attention, feed forward layer, and residual connections for simultaneous position representation recomputation.', 'duration': 28.452, 'max_score': 639.373, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/5vcj8kSwBCY/pics/5vcj8kSwBCY639373.jpg'}, {'end': 842.936, 'src': 'heatmap', 'start': 707.783, 'weight': 0.927, 'content': [{'end': 714.524, 'text': 'But so now on the decoder side, we have this causal self-attention layer followed by encoder-decoder attention,', 'start': 707.783, 'duration': 6.741}, {'end': 718.725, 'text': 'where we actually attend to the last layer of the encoder and a feed-forward layer,', 'start': 714.524, 'duration': 4.201}, {'end': 722.646, 'text': 'and this tripled repeats a few times and at the end we have the standard cross-entropy loss.', 'start': 718.725, 'duration': 3.921}, {'end': 737.57, 'text': 'So sort of staring at the particular variant of the attention mechanism that we use, we went for simplicity and speed.', 'start': 726.983, 'duration': 10.587}, {'end': 744.154, 'text': 'So how do you actually compute attention? 
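The segments above note that self-attention is permutation invariant, so position representations (either fixed sinusoids or learned embeddings) are added to the inputs. A minimal NumPy sketch of the sinusoidal variant, following the formulation in the "Attention Is All You Need" paper; the function name and layout are illustrative, not the lecture's code:

import numpy as np

def sinusoidal_position_encodings(max_len, d_model):
    # Each position gets a d_model-dimensional vector: even indices use sine,
    # odd indices use cosine, with geometrically increasing wavelengths.
    assert d_model % 2 == 0
    positions = np.arange(max_len)[:, None]                 # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # (1, d_model // 2)
    angles = positions / np.power(10000.0, dims / d_model)  # (max_len, d_model // 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# As the talk later notes when discussing residual connections, the position
# information is added to the input embeddings once, at the bottom of the stack,
# rather than injected at every layer:
# x = word_embeddings + sinusoidal_position_encodings(seq_len, d_model)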
So imagine you want to re-represent position E2.', 'start': 738.21, 'duration': 5.944}, {'end': 749.818, 'text': "And we're going to first linearly transform it into a query.", 'start': 744.874, 'duration': 4.944}, {'end': 756.002, 'text': "And then we're going to linearly transform every position in your neighborhood or let's say every position at the input,", 'start': 750.558, 'duration': 5.444}, {'end': 758.804, 'text': 'because this is the encoder side to a key.', 'start': 756.002, 'duration': 2.802}, {'end': 762.867, 'text': "And these linear transformations can actually be thought as features, and I'll talk more about it later on.", 'start': 759.424, 'duration': 3.443}, {'end': 765.469, 'text': "So it's basically a bilinear form.", 'start': 763.267, 'duration': 2.202}, {'end': 771.513, 'text': "You're projecting these vectors into a space where just a dot product is a good proxy for similarity.", 'start': 765.489, 'duration': 6.024}, {'end': 773.154, 'text': 'So now you have your logits.', 'start': 772.233, 'duration': 0.921}, {'end': 775.696, 'text': 'You just do a softmax computer convex combination.', 'start': 773.174, 'duration': 2.522}, {'end': 783.001, 'text': "And now, based on this convex combination, you're going to then re-express E2 or, in terms of this convex combination,", 'start': 776.076, 'duration': 6.925}, {'end': 785.142, 'text': 'of all the vectors of all these positions.', 'start': 783.001, 'duration': 2.141}, {'end': 790.186, 'text': 'And before doing the convex combination, we again do a linear transformation to produce values.', 'start': 785.503, 'duration': 4.683}, {'end': 797.351, 'text': 'And then we do a second linear transformation just to mix this information and pass it through a feed forward layer.', 'start': 790.686, 'duration': 6.665}, {'end': 804.673, 'text': 'And all of this can be expressed basically in two matrix multiplications.', 'start': 799.392, 'duration': 5.281}, {'end': 809.014, 'text': "And the squared factor is just to make sure that these dot products don't blow up.", 'start': 805.053, 'duration': 3.961}, {'end': 810.134, 'text': "It's just a scaling factor.", 'start': 809.054, 'duration': 1.08}, {'end': 815.675, 'text': "And why is this mechanism attractive? Well, it's just really fast.", 'start': 810.774, 'duration': 4.901}, {'end': 817.136, 'text': 'You can do this very quickly on a GPU.', 'start': 815.695, 'duration': 1.441}, {'end': 821.937, 'text': 'And you can do it simultaneously for all positions with just two matmuls and a softmax.', 'start': 817.476, 'duration': 4.461}, {'end': 834.868, 'text': "On the decoder side, it's exactly the same except we impose causality by just adding minus 10E9 to the logits.", 'start': 823.917, 'duration': 10.951}, {'end': 837.911, 'text': 'So you just get zero probabilities on those positions.', 'start': 835.749, 'duration': 2.162}, {'end': 842.936, 'text': 'So we just impose causality by adding these highly negative values on the attention logits.', 'start': 838.291, 'duration': 4.645}], 'summary': 'Causal self-attention layer and encoder-decoder attention used for simplicity and speed in computing attention. 
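The attention computation walked through in this segment (project a position to a query, project every position to a key and a value, take scaled dot products, softmax into a convex combination, then mix the values) fits in a few lines. A minimal single-head NumPy sketch, not the lecture's actual implementation; causality is imposed the way the talk describes, by adding a very negative number (minus 10e9) to the logits of future positions:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(X, Wq, Wk, Wv, causal=False):
    # X: (L, d_model) input representations; Wq/Wk/Wv: learned linear projections.
    Q = X @ Wq                       # queries, (L, d_k)
    K = X @ Wk                       # keys,    (L, d_k)
    V = X @ Wv                       # values,  (L, d_v)
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)  # first matmul; sqrt(d_k) keeps dot products from blowing up
    if causal:
        L = logits.shape[0]
        future = np.triu(np.ones((L, L), dtype=bool), k=1)
        logits = np.where(future, -1e9, logits)   # ~zero probability on future positions
    weights = softmax(logits, axis=-1)            # convex combination per position
    return weights @ V                            # second matmul: re-express each position via its neighbourhood

In the full model this is followed by an output projection and the position-wise feed-forward layer, with residual connections around each sub-layer.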
mechanism allows for fast computation on gpu and imposition of causality on the decoder side.', 'duration': 135.153, 'max_score': 707.783, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/5vcj8kSwBCY/pics/5vcj8kSwBCY707783.jpg'}, {'end': 984.362, 'src': 'heatmap', 'start': 902.9, 'weight': 9, 'content': [{'end': 905.181, 'text': "And as you can see, it's about four times faster than an RNN.", 'start': 902.9, 'duration': 2.281}, {'end': 914.289, 'text': 'and faster than a convolutional model where you have a kernel of filter with three.', 'start': 908.164, 'duration': 6.125}, {'end': 919.033, 'text': "So, there's still one problem.", 'start': 916.11, 'duration': 2.923}, {'end': 920.554, 'text': "Now, here's something.", 'start': 919.533, 'duration': 1.021}, {'end': 923.677, 'text': 'So, in language typically, we want to know who did what to whom right?', 'start': 920.774, 'duration': 2.903}, {'end': 930.583, 'text': 'So now imagine you applied a convolutional filter, because you actually have different linear transformations based on relative distances,', 'start': 924.017, 'duration': 6.566}, {'end': 933.205, 'text': 'like this linear transformation on the word.', 'start': 930.583, 'duration': 2.622}, {'end': 941.069, 'text': 'who can learn this concept of who and pick out different information from this embedding of the word I?', 'start': 933.205, 'duration': 7.864}, {'end': 949.093, 'text': 'The red linear transformation can pick out different information from kicked and the blue linear transformation can pick out different information from ball.', 'start': 941.069, 'duration': 8.024}, {'end': 953.235, 'text': 'Now, when you have a single attention layer, this is difficult,', 'start': 949.713, 'duration': 3.522}, {'end': 957.297, 'text': "because it's just a convex combination and you have the same linear transformation everywhere.", 'start': 953.235, 'duration': 4.062}, {'end': 959.998, 'text': "All it's available to is just mixing proportions.", 'start': 957.397, 'duration': 2.601}, {'end': 963.1, 'text': "So you can't pick out different pieces of information from different places.", 'start': 960.298, 'duration': 2.802}, {'end': 975.796, 'text': 'Well, what if we had one attention layer for who? 
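The question at the end of this segment ("what if we had one attention layer for who?") is what multi-head attention answers: run several attention heads in parallel, each with its own projections and a reduced per-head dimensionality, then concatenate and mix their outputs. A rough sketch reusing the scaled_dot_product_attention helper from the previous snippet; head count and sizes below are illustrative (the standard base configuration uses 8 heads of size 64 for d_model = 512):

def multi_head_attention(X, heads, Wo, causal=False):
    # heads: list of (Wq, Wk, Wv) triples, one per head, each projecting
    # d_model down to a small per-head dimension so the total FLOP count
    # stays close to that of a single large head.
    head_outputs = [scaled_dot_product_attention(X, Wq, Wk, Wv, causal=causal)
                    for (Wq, Wk, Wv) in heads]
    concat = np.concatenate(head_outputs, axis=-1)  # (L, num_heads * d_head)
    return concat @ Wo                              # mix the heads back to d_model

# Illustrative usage:
rng = np.random.default_rng(0)
L, d_model, d_head, num_heads = 10, 512, 64, 8
X = rng.normal(size=(L, d_model))
heads = [tuple(0.02 * rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(num_heads)]
Wo = 0.02 * rng.normal(size=(num_heads * d_head, d_model))
out = multi_head_attention(X, heads, Wo, causal=True)  # (10, 512)

Each head carries its own softmax, which matches the later remark that the number of softmaxes grows with the number of heads while the overall FLOP count stays roughly unchanged.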
So you can think of an attention layer as something like a feature detector almost.', 'start': 964.008, 'duration': 11.788}, {'end': 978.338, 'text': 'Because it carries with it a linear transformation.', 'start': 976.617, 'duration': 1.721}, {'end': 984.362, 'text': "so it's projecting them in a space which starts caring, maybe, about syntax, or it's projecting in a space which starts caring about who or what.", 'start': 978.338, 'duration': 6.024}], 'summary': 'Proposes using attention layers to pick out specific information in language processing, improving comprehension and accuracy.', 'duration': 81.462, 'max_score': 902.9, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/5vcj8kSwBCY/pics/5vcj8kSwBCY902900.jpg'}], 'start': 76.008, 'title': 'Self-attention in representation learning', 'summary': 'Explores the potential of self-attention in learning representations for variable length data, discussing its contrast with recurrent neural networks, limitations of convolutional sequence models, and the impact on generative tasks and machine translation.', 'chapters': [{'end': 253.282, 'start': 76.008, 'title': 'Learning representations with self-attention', 'summary': 'Discusses the potential of self-attention in learning representations for variable length data in deep learning, contrasting it with recurrent neural networks, and emphasizes the importance of building the right tools for representation learning.', 'duration': 177.274, 'highlights': ["Self-attention's ability to model inductive biases in data sets could be useful for various problems.", 'Importance of building the right tools for learning representations as a key factor in achieving empirical success in deep learning.', 'Discussion on the dominance of recurrent neural networks in learning representations, particularly for variable length data.', 'Limitation of parallelization in recurrent neural networks due to sequential processing, as contrasted with the benefits of parallel models for large data.']}, {'end': 590.441, 'start': 255.143, 'title': 'Self-attention in representation learning', 'summary': 'Discusses the limitations of convolutional sequence models in language representation and the success of self-attention in overcoming these difficulties, highlighting its impact on generative tasks and machine translation.', 'duration': 335.298, 'highlights': ["Self-attention's success in overcoming limitations of convolutional sequence models", 'Efficiency and parallelization of convolutional sequence models', 'Importance of attention mechanism in neural machine translation']}, {'end': 999.031, 'start': 591.201, 'title': 'Transformer model and self-attention', 'summary': "Discusses the transformer model, emphasizing self-attention's permutation invariance, position representations, encoder-decoder architecture, attention mechanism computation, and computational advantages over rnns and convolutions.", 'duration': 407.83, 'highlights': ['Attention mechanism computation', 'Advantages of attention over RNNs and convolutions', 'Position representations and encoder-decoder architecture', 'Multi-head attention for feature detection']}], 'duration': 923.023, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/5vcj8kSwBCY/pics/5vcj8kSwBCY76008.jpg', 'highlights': ["Self-attention's ability to model inductive biases in data sets could be useful for various problems.", 'Importance of building the right tools for learning representations as a key factor in achieving empirical success 
in deep learning.', "Self-attention's success in overcoming limitations of convolutional sequence models", 'Discussion on the dominance of recurrent neural networks in learning representations, particularly for variable length data.', 'Importance of attention mechanism in neural machine translation', 'Limitation of parallelization in recurrent neural networks due to sequential processing, as contrasted with the benefits of parallel models for large data.', 'Advantages of attention over RNNs and convolutions', 'Attention mechanism computation', 'Position representations and encoder-decoder architecture', 'Multi-head attention for feature detection', 'Efficiency and parallelization of convolutional sequence models']}, {'end': 1245.897, 'segs': [{'end': 1078.802, 'src': 'embed', 'start': 999.312, 'weight': 0, 'content': [{'end': 1006.756, 'text': 'And for efficiency, instead of actually having these dimensions operate in a large space, we just reduce the dimensionality of all these heads.', 'start': 999.312, 'duration': 7.444}, {'end': 1009.937, 'text': 'And we operate these attention layers in parallel, sort of bridging the gap.', 'start': 1007.116, 'duration': 2.821}, {'end': 1011.638, 'text': "Now, here's a little quiz.", 'start': 1010.418, 'duration': 1.22}, {'end': 1024.218, 'text': 'Is there a combination of heads or is there a configuration in which you can actually exactly simulate a convolution, probably with more parameters?', 'start': 1016.147, 'duration': 8.071}, {'end': 1029.645, 'text': 'I think there should be a simple way to show that if you had more heads, or heads were a function of positions,', 'start': 1024.258, 'duration': 5.387}, {'end': 1033.23, 'text': 'you could probably just simulate a convolution, but although with a lot of parameters.', 'start': 1029.645, 'duration': 3.585}, {'end': 1037.054, 'text': 'So it can, in the limit, it can actually simulate a convolution.', 'start': 1033.91, 'duration': 3.144}, {'end': 1041.398, 'text': 'And also we can continue to enjoy the benefits of parallelism.', 'start': 1037.855, 'duration': 3.543}, {'end': 1044.823, 'text': 'but we did increase the number of soft maxes because each head then carries with it a soft max.', 'start': 1041.398, 'duration': 3.425}, {'end': 1049.989, 'text': "but the amount of flops didn't change because we, instead of actually having these heads operate in very large dimensions,", 'start': 1044.823, 'duration': 5.166}, {'end': 1051.33, 'text': "they're operating in very small dimensions.", 'start': 1049.989, 'duration': 1.341}, {'end': 1057.493, 'text': 'So when we applied this on machine translation,', 'start': 1052.991, 'duration': 4.502}, {'end': 1063.416, 'text': 'we were able to dramatically outperform previous results on English-German and English-French translation.', 'start': 1057.493, 'duration': 5.923}, {'end': 1068.338, 'text': 'So we had a pretty standard setup, 32, 000 word vocabularies, word piece encodings.', 'start': 1063.756, 'duration': 4.582}, {'end': 1073.66, 'text': 'WMT 2014 was our test set, 2013 was the dev set.', 'start': 1070.799, 'duration': 2.861}, {'end': 1078.802, 'text': 'And some of these results were much stronger than even previous ensemble models.', 'start': 1074.4, 'duration': 4.402}], 'summary': 'Reducing dimensionality of attention heads led to outperforming previous results on english-german and english-french translation.', 'duration': 79.49, 'max_score': 999.312, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/5vcj8kSwBCY/pics/5vcj8kSwBCY999312.jpg'}, {'end': 1136.496, 'src': 'embed', 'start': 1108.157, 'weight': 6, 'content': [{'end': 1113.022, 'text': 'because stochastic gradient descent could just train this architecture really well, because the gradient dynamics and attention are very simple.', 'start': 1108.157, 'duration': 4.865}, {'end': 1114.323, 'text': 'Attention is just a linear combination.', 'start': 1113.062, 'duration': 1.261}, {'end': 1119.448, 'text': "And I think that's actually favorable.", 'start': 1114.863, 'duration': 4.585}, {'end': 1122.01, 'text': 'But hopefully, as we go on.', 'start': 1119.648, 'duration': 2.362}, {'end': 1130.454, 'text': "Well, I'd also like to point out that we do explicitly model all pairwise connections.", 'start': 1122.811, 'duration': 7.643}, {'end': 1136.496, 'text': 'And it has this advantage of modeling very clear relationships directly between any two words.', 'start': 1130.494, 'duration': 6.002}], 'summary': 'Stochastic gradient descent trains architecture well due to simple attention and explicit modeling of pairwise connections.', 'duration': 28.339, 'max_score': 1108.157, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/5vcj8kSwBCY/pics/5vcj8kSwBCY1108157.jpg'}, {'end': 1205.226, 'src': 'embed', 'start': 1172.91, 'weight': 4, 'content': [{'end': 1175.672, 'text': 'So, uh, we have these residual- residual connections.', 'start': 1172.91, 'duration': 2.762}, {'end': 1176.853, 'text': 'uh, between um.', 'start': 1175.672, 'duration': 1.181}, {'end': 1183.721, 'text': 'So we have these residual connections that go from here to here to here, here to here, like between every pair of layers.', 'start': 1178.68, 'duration': 5.041}, {'end': 1184.841, 'text': "And it's interesting.", 'start': 1184.161, 'duration': 0.68}, {'end': 1190.643, 'text': 'So what we do is we just add the position information, add the input to the model.', 'start': 1185.461, 'duration': 5.182}, {'end': 1195.864, 'text': "And we don't infuse or we don't inject position information at every layer.", 'start': 1191.383, 'duration': 4.481}, {'end': 1205.226, 'text': 'So, when we severed these residual connections and we stared at these layers, stared at these attention distributions as a center,', 'start': 1196.404, 'duration': 8.822}], 'summary': 'Residual connections added between every pair of layers to input position information in the model.', 'duration': 32.316, 'max_score': 1172.91, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/5vcj8kSwBCY/pics/5vcj8kSwBCY1172910.jpg'}], 'start': 999.312, 'title': 'Transformer architecture advantages', 'summary': 'Discusses the success of transformer architecture in machine translation, outperforming previous results on english-german and english-french translation, achieving state-of-the-art, and highlighting the advantages of residual connections and attention mechanisms.', 'chapters': [{'end': 1051.33, 'start': 999.312, 'title': 'Efficient dimensionality reduction in attention layers', 'summary': 'Discusses the efficient reduction of dimensionality in attention layers through parallel operation, which can simulate a convolution with more parameters and increased soft maxes while maintaining the benefits of parallelism.', 'duration': 52.018, 'highlights': ['Efficient reduction of dimensionality in attention layers through parallel operation, which can simulate a convolution with more parameters and increased soft 
maxes while maintaining the benefits of parallelism', 'The possibility of simulating a convolution with more parameters by having more heads or heads as a function of positions', 'Operating attention layers in parallel to bridge the gap and reduce dimensionality', 'The increase in the number of soft maxes due to each head carrying with it a soft max, without changing the amount of flops']}, {'end': 1245.897, 'start': 1052.991, 'title': 'Transformer architecture advantages', 'summary': 'Discusses the success of transformer architecture in machine translation, outperforming previous results on english-german and english-french translation, achieving state-of-the-art, and highlighting the advantages of residual connections and attention mechanisms.', 'duration': 192.906, 'highlights': ['The transformer architecture outperformed previous results on English-German and English-French translation, achieving state-of-the-art.', 'Residual connections carry position information through the model, aiding in attention focus.', 'The simplicity of attention mechanisms and gradient dynamics made the architecture favorable for stochastic gradient descent training.']}], 'duration': 246.585, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/5vcj8kSwBCY/pics/5vcj8kSwBCY999312.jpg', 'highlights': ['Transformer architecture outperformed previous results on English-German and English-French translation, achieving state-of-the-art.', 'Efficient reduction of dimensionality in attention layers through parallel operation, simulating a convolution with more parameters and increased soft maxes while maintaining the benefits of parallelism.', 'Operating attention layers in parallel to bridge the gap and reduce dimensionality.', 'The possibility of simulating a convolution with more parameters by having more heads or heads as a function of positions.', 'Residual connections carry position information through the model, aiding in attention focus.', 'The increase in the number of soft maxes due to each head carrying with it a soft max, without changing the amount of flops.', 'The simplicity of attention mechanisms and gradient dynamics made the architecture favorable for stochastic gradient descent training.']}, {'end': 1837, 'segs': [{'end': 1294.176, 'src': 'embed', 'start': 1248.84, 'weight': 2, 'content': [{'end': 1259.07, 'text': 'Okay, so now we saw that being able to model both long and short term relationships, long and short distance relationships, with attention,', 'start': 1248.84, 'duration': 10.23}, {'end': 1261.112, 'text': 'is beneficial for text generation.', 'start': 1259.07, 'duration': 2.042}, {'end': 1268.8, 'text': 'What kind of inductive biases actually appear or what kind of phenomena appear in images?', 'start': 1262.614, 'duration': 6.186}, {'end': 1275.125, 'text': "And something that we constantly see in images and music is this notion of repeating structure that's very similar to each other.", 'start': 1268.84, 'duration': 6.285}, {'end': 1277.747, 'text': 'You have these motifs that repeat in different scales.', 'start': 1275.165, 'duration': 2.582}, {'end': 1283.79, 'text': "So, for example, there's another artificial but beautiful example of self-similarity, where you have this Van Gogh painting,", 'start': 1278.207, 'duration': 5.583}, {'end': 1286.212, 'text': 'where this texture or these little objects just repeat', 'start': 1283.79, 'duration': 2.422}, {'end': 1291.435, 'text': 'These different pieces of the image are very similar to each other, but they might 
have different scales.', 'start': 1286.512, 'duration': 4.923}, {'end': 1294.176, 'text': "Again, in music, here's a motif that repeats.", 'start': 1292.315, 'duration': 1.861}], 'summary': 'Modeling long and short term relationships with attention is beneficial for text generation, while images and music exhibit repeating structures and motifs at different scales.', 'duration': 45.336, 'max_score': 1248.84, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/5vcj8kSwBCY/pics/5vcj8kSwBCY1248840.jpg'}, {'end': 1410.127, 'src': 'embed', 'start': 1383.974, 'weight': 4, 'content': [{'end': 1389.517, 'text': 'But it was actually interesting to see how it even naturally modeled self-similarity.', 'start': 1383.974, 'duration': 5.543}, {'end': 1391.737, 'text': 'And people have used self-similarity in image generation.', 'start': 1389.777, 'duration': 1.96}, {'end': 1397.08, 'text': 'This is this really cool work by Efros, where they actually see OK in the training set.', 'start': 1391.757, 'duration': 5.323}, {'end': 1399.781, 'text': 'what are those patches that are really similar to me?', 'start': 1397.08, 'duration': 2.701}, {'end': 1402.862, 'text': "And based on the patches that are really similar to me, I'm going to fill up the information.", 'start': 1399.821, 'duration': 3.041}, {'end': 1405.003, 'text': "So it's like actually doing image generation.", 'start': 1403.102, 'duration': 1.901}, {'end': 1410.127, 'text': "There's this really classic work called non-local means, where they do image denoising,", 'start': 1405.863, 'duration': 4.264}], 'summary': "Self-similarity modeled in image generation, efros' work uses patches for filling information, non-local means for image denoising.", 'duration': 26.153, 'max_score': 1383.974, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/5vcj8kSwBCY/pics/5vcj8kSwBCY1383974.jpg'}, {'end': 1604.576, 'src': 'embed', 'start': 1578.646, 'weight': 0, 'content': [{'end': 1584.148, 'text': 'We had two-dimensional position representations along with a very similar attention mechanism.', 'start': 1578.646, 'duration': 5.502}, {'end': 1590.091, 'text': 'And we tried both super resolution and unconditional and conditional image generation.', 'start': 1586.389, 'duration': 3.702}, {'end': 1596.633, 'text': 'This was Nia Ndike, Parmar, I, and a few other authors from Brain.', 'start': 1591.011, 'duration': 5.622}, {'end': 1598.934, 'text': 'And we presented at ICML.', 'start': 1597.914, 'duration': 1.02}, {'end': 1604.576, 'text': 'We were able to achieve better perplexity than existing models.', 'start': 1601.235, 'duration': 3.341}], 'summary': 'Presented at icml achieving better perplexity with 2d positions and attention mechanism.', 'duration': 25.93, 'max_score': 1578.646, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/5vcj8kSwBCY/pics/5vcj8kSwBCY1578646.jpg'}, {'end': 1726.149, 'src': 'embed', 'start': 1697.972, 'weight': 1, 'content': [{'end': 1699.852, 'text': "So there's only a few options you can have at the output.", 'start': 1697.972, 'duration': 1.88}, {'end': 1702.333, 'text': 'And our super resolution results were much better.', 'start': 1700.133, 'duration': 2.2}, {'end': 1706.414, 'text': 'We were able to get better facial orientation and structure than previous work.', 'start': 1702.413, 'duration': 4.001}, {'end': 1708.635, 'text': 'And these are samples at different temperatures.', 'start': 1706.974, 'duration': 1.661}, {'end': 1716.561, 'text': 'And when we 
quantified this with actual human evaluators, we flashed an image and said is this real?', 'start': 1709.415, 'duration': 7.146}, {'end': 1717.082, 'text': 'is this false?', 'start': 1716.561, 'duration': 0.521}, {'end': 1722.967, 'text': 'And we were able to fool humans like four times better than previous results on super resolution.', 'start': 1717.522, 'duration': 5.445}, {'end': 1726.149, 'text': 'Again these results are like.', 'start': 1723.607, 'duration': 2.542}], 'summary': 'Super resolution produced better facial orientation and structure, fooling humans 4 times more than previous results.', 'duration': 28.177, 'max_score': 1697.972, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/5vcj8kSwBCY/pics/5vcj8kSwBCY1697972.jpg'}], 'start': 1248.84, 'title': 'Modeling text and image relationships', 'summary': 'Discusses the importance of modeling long and short term relationships in text generation, and the presence of repeating structures in images and music. it also explores the use of self-attention in modeling images, achieving better perplexity than existing models on imagenet, and superior super resolution results, with the potential for applying it in classification and other tasks.', 'chapters': [{'end': 1294.176, 'start': 1248.84, 'title': 'Text and image relationships', 'summary': 'Discusses the importance of modeling long and short term relationships in text generation, and the presence of repeating structures in images and music.', 'duration': 45.336, 'highlights': ['The importance of modeling long and short term relationships with attention for text generation is emphasized.', 'The presence of repeating structures in images and music, such as motifs that repeat in different scales, is highlighted.', "Examples of self-similarity in Van Gogh's painting and motifs in music are mentioned."]}, {'end': 1837, 'start': 1294.977, 'title': 'Self-attention in image modeling', 'summary': 'Explores the use of self-attention in modeling images, achieving better perplexity than existing models on imagenet, and superior super resolution results, with the potential for applying it in classification and other tasks.', 'duration': 542.023, 'highlights': ['The chapter achieves better perplexity than existing models on ImageNet', 'Superior super resolution results compared to previous work', 'Exploration of self-similarity in images and its potential applications']}], 'duration': 588.16, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/5vcj8kSwBCY/pics/5vcj8kSwBCY1248840.jpg', 'highlights': ['The chapter achieves better perplexity than existing models on ImageNet', 'Superior super resolution results compared to previous work', 'The importance of modeling long and short term relationships with attention for text generation is emphasized', 'The presence of repeating structures in images and music, such as motifs that repeat in different scales, is highlighted', 'Exploration of self-similarity in images and its potential applications', "Examples of self-similarity in Van Gogh's painting and motifs in music are mentioned"]}, {'end': 2471.986, 'segs': [{'end': 1909.473, 'src': 'embed', 'start': 1837.1, 'weight': 0, 'content': [{'end': 1839.422, 'text': "There's also a lot of self-similarity in music.", 'start': 1837.1, 'duration': 2.322}, {'end': 1842.304, 'text': 'So, we can imagine Transformer being a good model for it.', 'start': 1839.462, 'duration': 2.842}, {'end': 1849.129, 'text': "We're going to show how we can add more to the 
self-attention,", 'start': 1844.325, 'duration': 4.804}, {'end': 1854.073, 'text': 'to think more about kind of relational information and how that could help music generation.', 'start': 1849.129, 'duration': 4.944}, {'end': 1860.98, 'text': "So, uh, first I want to clarify what is the raw representation that we're working with right now.", 'start': 1855.839, 'duration': 5.141}, {'end': 1863.241, 'text': 'So analogous to language.', 'start': 1861.36, 'duration': 1.881}, {'end': 1863.881, 'text': 'you can think about.', 'start': 1863.241, 'duration': 0.64}, {'end': 1871.503, 'text': "there's text and somebody is reading out a text, so they add their kind of own intonations to it, and then you have sound waves coming out.", 'start': 1863.881, 'duration': 7.622}, {'end': 1872.063, 'text': "that's speech.", 'start': 1871.503, 'duration': 0.56}, {'end': 1880.949, 'text': "So for music there's a very, very similar kind of uh, the line of generation where you say the composer has an idea,", 'start': 1872.583, 'duration': 8.366}, {'end': 1885.314, 'text': 'writes down the score and then a performer performs it, and then you get sound.', 'start': 1880.949, 'duration': 4.365}, {'end': 1889.598, 'text': "So what we're gonna focus on today is mostly um.", 'start': 1885.814, 'duration': 3.784}, {'end': 1893.462, 'text': "you can think of the score, but it's actually a performance.", 'start': 1889.598, 'duration': 3.864}, {'end': 1906.691, 'text': "um, in that it's a symbolic representation where MIDI pianos were used and professional amateur musicians were performing on the pianos.", 'start': 1893.462, 'duration': 13.229}, {'end': 1909.473, 'text': 'so we have the recorded information of their playing.', 'start': 1906.691, 'duration': 2.782}], 'summary': 'Transformer model to enhance self-attention for music generation using midi pianos.', 'duration': 72.373, 'max_score': 1837.1, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/5vcj8kSwBCY/pics/5vcj8kSwBCY1837100.jpg'}], 'start': 1837.1, 'title': 'Using transformer model for music generation', 'summary': 'Discusses enhancing self-similarity in music, focusing on relational information using symbolic representation of midi pianos and recorded information, and the impact of relative attention on capturing relational information in music generation.', 'chapters': [{'end': 1909.473, 'start': 1837.1, 'title': 'Transformer model for music generation', 'summary': "Discusses using transformer model to enhance self-similarity in music, focusing on relational information for music generation, particularly using symbolic representation of midi pianos and recorded information of musicians' playing.", 'duration': 72.373, 'highlights': ['Transformer model can enhance self-similarity in music generation by adding more to self-attention and considering relational information.', 'The raw representation for music generation involves a symbolic representation where MIDI pianos were used and professional amateur musicians performed on the pianos.', "Composer's idea is written down as a score, which is then performed by musicians, resulting in sound."]}, {'end': 2471.986, 'start': 1909.753, 'title': 'Music modeling with transformer', 'summary': 'Discusses the challenges of modeling music as a sequential process, the limitations of recurrent neural networks, the improvements brought by the transformer model, and the impact of relative attention on capturing relational information in music generation.', 'duration': 562.233, 'highlights': ['The 
transformer model improves music modeling by addressing the challenge of embedding a long sequence into a fixed-length vector and handling repetitions at a distance, demonstrating the deterioration of the model beyond the trained length and the consistent repetition achieved by the music transformer.', "The music transformer effectively captures relational information in music generation, as evidenced by the model's ability to look at relevant parts even if they were not immediately preceding, and the visualization of self-attention at a note-to-note level, showcasing structural moments and clean sections in the attention.", 'Relative attention in the transformer model adds an additional term to consider the distance and content between positions, influencing the similarity between positions, and has shown significant improvement in translation with short sequences but requires adaptation for longer music samples.']}], 'duration': 634.886, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/5vcj8kSwBCY/pics/5vcj8kSwBCY1837100.jpg', 'highlights': ['The transformer model improves music modeling by addressing the challenge of embedding a long sequence into a fixed-length vector and handling repetitions at a distance, demonstrating the deterioration of the model beyond the trained length and the consistent repetition achieved by the music transformer.', 'The transformer model can enhance self-similarity in music generation by adding more to self-attention and considering relational information.', "The music transformer effectively captures relational information in music generation, as evidenced by the model's ability to look at relevant parts even if they were not immediately preceding, and the visualization of self-attention at a note-to-note level, showcasing structural moments and clean sections in the attention.", 'The raw representation for music generation involves a symbolic representation where MIDI pianos were used and professional amateur musicians performed on the pianos.', 'Relative attention in the transformer model adds an additional term to consider the distance and content between positions, influencing the similarity between positions, and has shown significant improvement in translation with short sequences but requires adaptation for longer music samples.', "Composer's idea is written down as a score, which is then performed by musicians, resulting in sound."]}, {'end': 3216.874, 'segs': [{'end': 2520.705, 'src': 'embed', 'start': 2472.066, 'weight': 1, 'content': [{'end': 2475.589, 'text': "So it's like 2, 000 tokens need to be able to fit in memory.", 'start': 2472.066, 'duration': 3.523}, {'end': 2485.354, 'text': "So this was a problem, uh, because the original formulation relied on building this 3D tensor that's, uh, uh, that's very large in memory.", 'start': 2476.269, 'duration': 9.085}, {'end': 2487.675, 'text': 'Um, and and why this is the case??', 'start': 2485.374, 'duration': 2.301}, {'end': 2495.219, 'text': "It's because, for every pair, uh, you, you look up what, the, what, the, so you can compute what the relative distance is,", 'start': 2487.775, 'duration': 7.444}, {'end': 2497.98, 'text': 'and then you look up an embedding that corresponds to that distance.', 'start': 2495.219, 'duration': 2.761}, {'end': 2506.948, 'text': 'So, um, for like this this uh length by length, like L by L uh matrix, you need like uh to collect embeddings for each of the positions,', 'start': 2498.541, 'duration': 8.407}, {'end': 2508.952, 'text': "and 
that's uh depth D.", 'start': 2506.948, 'duration': 2.004}, {'end': 2509.834, 'text': 'So that gives us the 3D.', 'start': 2508.952, 'duration': 0.882}, {'end': 2520.705, 'text': 'what we realize is you can actually just directly multiply the queries and the embedding distances and they, uh,', 'start': 2511.843, 'duration': 8.862}], 'summary': 'The problem involves fitting 2,000 tokens in memory due to building a large 3d tensor, with a depth of d.', 'duration': 48.639, 'max_score': 2472.066, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/5vcj8kSwBCY/pics/5vcj8kSwBCY2472066.jpg'}, {'end': 2775.44, 'src': 'embed', 'start': 2747.44, 'weight': 0, 'content': [{'end': 2754.366, 'text': "So there's a lot of it seems like this might be an interesting direction to pursue if you wanna push self-attention in images for self-supervised learning.", 'start': 2747.44, 'duration': 6.926}, {'end': 2760.871, 'text': 'I guess, on self-supervised learning, the generative modeling work that I talked about before,', 'start': 2756.067, 'duration': 4.804}, {'end': 2766.976, 'text': 'in and of itself just having probabilistic models of images is, I mean, I guess, the best model of an image is.', 'start': 2760.871, 'duration': 6.105}, {'end': 2769.297, 'text': 'I go to Google search and I pick up an image and I just give it to you.', 'start': 2766.976, 'duration': 2.321}, {'end': 2775.44, 'text': 'But I guess generative models of images are useful because if you want to do something like self-supervised learning,', 'start': 2769.737, 'duration': 5.703}], 'summary': 'Exploring self-attention in images for self-supervised learning.', 'duration': 28, 'max_score': 2747.44, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/5vcj8kSwBCY/pics/5vcj8kSwBCY2747440.jpg'}, {'end': 3198.934, 'src': 'embed', 'start': 3169.166, 'weight': 4, 'content': [{'end': 3170.047, 'text': "It's kind of cute.", 'start': 3169.166, 'duration': 0.881}, {'end': 3174.79, 'text': "It's been used in speech, but I don't know if there's been some really big success stories of self-attention in speech.", 'start': 3170.407, 'duration': 4.383}, {'end': 3182.35, 'text': 'Again, similar issues where you have very large positions to do self-attention over.', 'start': 3175.55, 'duration': 6.8}, {'end': 3189.605, 'text': 'So yeah, self-supervision, if it works, it would be very beneficial.', 'start': 3183.84, 'duration': 5.765}, {'end': 3191.327, 'text': "We wouldn't need large label data sets.", 'start': 3189.625, 'duration': 1.702}, {'end': 3192.908, 'text': 'Understanding transfer.', 'start': 3192.048, 'duration': 0.86}, {'end': 3198.934, 'text': 'Transfer is becoming a reality in NLP with BERT and some of these other models.', 'start': 3192.948, 'duration': 5.986}], 'summary': 'Self-attention in speech may offer benefits, reducing need for large labeled datasets.', 'duration': 29.768, 'max_score': 3169.166, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/5vcj8kSwBCY/pics/5vcj8kSwBCY3169166.jpg'}], 'start': 2472.066, 'title': 'Memory efficiency in self-attention', 'summary': "Discusses optimizing memory usage in 3d tensor, enabling fitting of 2000 tokens, and explores self-attention's applications in various domains like music, machine translation, image processing, generative modeling, and more, emphasizing its role in scaling memory efficiency and addressing multiple challenges in nlp.", 'chapters': [{'end': 2540.269, 'start': 2472.066, 'title': 'Efficient memory usage 
in 3d tensor', 'summary': 'Discusses the optimization of memory usage in the 3d tensor by realizing the possibility of directly multiplying queries and embedding distances, reducing the need for large memory, thereby enabling the fitting of 2000 tokens in memory.', 'duration': 68.203, 'highlights': ['The realization that directly multiplying queries and embedding distances eliminates the need for building a large 3D tensor, enabling the fitting of 2000 tokens in memory', 'The need to collect embeddings for each position in an L by L matrix, requiring a depth D, resulting in a large memory usage', "The problem of the original formulation relying on a 3D tensor that's very large in memory, hindering the fitting of 2000 tokens"]}, {'end': 3216.874, 'start': 2540.269, 'title': 'Self-attention and its applications', 'summary': 'Discusses the benefits of self-attention in scaling memory efficiency, its applications in music, machine translation, image processing, generative modeling, graph modeling, and parallel training, while also highlighting its role in addressing multimodal outputs, transfer learning, and the potential for self-supervision in nlp.', 'duration': 676.605, 'highlights': ["Self-attention's benefits in scaling memory efficiency and its applications in music, machine translation, image processing, generative modeling, and graph modeling.", 'Addressing multimodal outputs and transfer learning.', 'The potential for self-supervision in NLP, and multitask learning.']}], 'duration': 744.808, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/5vcj8kSwBCY/pics/5vcj8kSwBCY2472066.jpg', 'highlights': ["Self-attention's applications in music, machine translation, image processing, generative modeling, and more", 'The realization that directly multiplying queries and embedding distances eliminates the need for building a large 3D tensor, enabling the fitting of 2000 tokens in memory', 'The need to collect embeddings for each position in an L by L matrix, requiring a depth D, resulting in a large memory usage', "The problem of the original formulation relying on a 3D tensor that's very large in memory, hindering the fitting of 2000 tokens", "Self-attention's role in scaling memory efficiency and addressing multiple challenges in NLP"]}], 'highlights': ['Transformer architecture outperformed previous results on English-German and English-French translation, achieving state-of-the-art.', 'The lecture introduces invited speakers discussing self-attention for generative models and its applications in music, catering to the interests of a significant portion of the class.', 'The transformer model improves music modeling by addressing the challenge of embedding a long sequence into a fixed-length vector and handling repetitions at a distance, demonstrating the deterioration of the model beyond the trained length and the consistent repetition achieved by the music transformer.', "Self-attention's applications in music, machine translation, image processing, generative modeling, and more", 'The importance of modeling long and short term relationships with attention for text generation is emphasized', 'The transformer model can enhance self-similarity in music generation by adding more to self-attention and considering relational information.', "The music transformer effectively captures relational information in music generation, as evidenced by the model's ability to look at relevant parts even if they were not immediately preceding, and the visualization of self-attention 
at a note-to-note level, showcasing structural moments and clean sections in the attention.", 'The realization that directly multiplying queries and embedding distances eliminates the need for building a large 3D tensor, enabling the fitting of 2000 tokens in memory', 'Efficient reduction of dimensionality in attention layers through parallel operation, simulating a convolution with more parameters and increased soft maxes while maintaining the benefits of parallelism.', 'Operating attention layers in parallel to bridge the gap and reduce dimensionality.']}
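One of the highlights above is the memory saving for relative attention: rather than materialising an L x L x D tensor of per-pair relative-position embeddings, the queries can be multiplied directly against a single L x D matrix of relative embeddings and the result shifted ("skewed") into place, which is what lets roughly 2,000 tokens fit in memory. A NumPy sketch of that idea; the skewing details follow my reading of the Music Transformer paper, not code shown in the lecture:

import numpy as np

def relative_logits_naive(Q, E):
    # O(L^2 * D) memory: build the full (L, L, D) tensor of relative embeddings.
    # Q: (L, D) queries. E: (L, D) embeddings for relative distances -(L-1)..0,
    # so E[L-1] is distance 0. Returns S_rel[i, j] = Q[i] . E[j - i + L - 1].
    L, D = Q.shape
    R = np.zeros((L, L, D))
    for i in range(L):
        for j in range(L):
            r = j - i + L - 1
            if 0 <= r < L:
                R[i, j] = E[r]
    return np.einsum('ld,lmd->lm', Q, R)

def relative_logits_skewed(Q, E):
    # O(L * D) memory: one matmul against the embedding matrix, then a
    # pad-and-reshape trick aligns column r with relative distance j - i.
    L, _ = Q.shape
    qe = Q @ E.T                              # (L, L)
    padded = np.pad(qe, ((0, 0), (1, 0)))     # prepend a zero column -> (L, L+1)
    return padded.reshape(L + 1, L)[1:, :]    # drop first row -> aligned (L, L)

# The two agree on the causal (j <= i) entries; future positions are masked anyway.
rng = np.random.default_rng(0)
L, D = 6, 4
Q, E = rng.normal(size=(L, D)), rng.normal(size=(L, D))
mask = np.tril(np.ones((L, L), dtype=bool))
assert np.allclose(relative_logits_naive(Q, E)[mask], relative_logits_skewed(Q, E)[mask])

These relative logits are added to the usual content logits Q K^T before the softmax, which is the "additional term" for distance described in the music-transformer segments.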