title
Let's build GPT: from scratch, in code, spelled out.
description
We build a Generatively Pretrained Transformer (GPT), following the paper "Attention is All You Need" and OpenAI's GPT-2 / GPT-3. We talk about connections to ChatGPT, which has taken the world by storm. We watch GitHub Copilot, itself a GPT, help us write a GPT (meta :D!). I recommend people watch the earlier makemore videos to get comfortable with the autoregressive language modeling framework and the basics of tensors and PyTorch nn, which we take for granted in this video.
Links:
- Google colab for the video: https://colab.research.google.com/drive/1JMLa53HDuA-i7ZBmqV7ZnA3c_fvtXnx-?usp=sharing
- GitHub repo for the video: https://github.com/karpathy/ng-video-lecture
- Playlist of the whole Zero to Hero series so far: https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ
- nanoGPT repo: https://github.com/karpathy/nanoGPT
- my website: https://karpathy.ai
- my twitter: https://twitter.com/karpathy
- our Discord channel: https://discord.gg/3zy8kqD9Cp
Supplementary links:
- Attention is All You Need paper: https://arxiv.org/abs/1706.03762
- OpenAI GPT-3 paper: https://arxiv.org/abs/2005.14165
- OpenAI ChatGPT blog post: https://openai.com/blog/chatgpt/
- The GPU I'm training the model on is from Lambda GPU Cloud, which I think is the best and easiest way to spin up an on-demand GPU instance in the cloud that you can ssh to: https://lambdalabs.com . If you prefer to work in notebooks, I think the easiest path today is Google Colab.
Suggested exercises:
- EX1: The n-dimensional tensor mastery challenge: Combine the `Head` and `MultiHeadAttention` into one class that processes all the heads in parallel, treating the heads as another batch dimension (answer is in nanoGPT).
- EX2: Train the GPT on your own dataset of choice! What other data could be fun to blabber on about? (A fun advanced suggestion if you like: train a GPT to do addition of two numbers, i.e. a+b=c. You may find it helpful to predict the digits of c in reverse order, as the typical addition algorithm (that you're hoping it learns) would proceed right to left too. You may want to modify the data loader to simply serve random problems and skip the generation of train.bin, val.bin. You may want to mask out the loss at the input positions of a+b that just specify the problem, using y=-1 in the targets (see CrossEntropyLoss ignore_index, and the masking sketch after this list). Does your Transformer learn to add? Once you have this, swole doge project: build a calculator clone in GPT, for all of +-*/. Not an easy problem. You may need Chain of Thought traces.)
- EX3: Find a dataset that is very large, so large that you can't see a gap between train and val loss. Pretrain the transformer on this data, then initialize with that model and finetune it on tiny shakespeare with a smaller number of steps and lower learning rate. Can you obtain a lower validation loss by the use of pretraining?
- EX4: Read some transformer papers and implement one additional feature or change that people seem to use. Does it improve the performance of your GPT?
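For the loss-masking part of EX2, here is a minimal sketch of the idea in PyTorch, assuming a character-level setup; the toy vocabulary and the get_addition_example helper are hypothetical names for illustration, not from the video or the repo:

    import random
    import torch
    import torch.nn.functional as F

    vocab = "0123456789+="                       # toy character vocabulary for the addition task
    stoi = {ch: i for i, ch in enumerate(vocab)}

    def get_addition_example(max_digits=3):
        a = random.randint(0, 10**max_digits - 1)
        b = random.randint(0, 10**max_digits - 1)
        prompt = f"{a}+{b}="
        answer = str(a + b)[::-1]                # digits of c in reverse order, easier to learn
        ids = [stoi[ch] for ch in prompt + answer]
        x = torch.tensor(ids[:-1])               # input tokens
        y = torch.tensor(ids[1:])                # next-token targets
        y[: len(prompt) - 1] = -1                # no loss where we are still reading the problem
        return x, y

    x, y = get_addition_example()
    logits = torch.randn(len(y), len(vocab))     # stand-in for model outputs of shape (T, vocab_size)
    loss = F.cross_entropy(logits, y, ignore_index=-1)   # positions with y == -1 contribute nothing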
Chapters:
00:00:00 intro: ChatGPT, Transformers, nanoGPT, Shakespeare
baseline language modeling, code setup
00:07:52 reading and exploring the data
00:09:28 tokenization, train/val split
00:14:27 data loader: batches of chunks of data
00:22:11 simplest baseline: bigram language model, loss, generation
00:34:53 training the bigram model
00:38:00 port our code to a script
Building the "self-attention"
00:42:13 version 1: averaging past context with for loops, the weakest form of aggregation
00:47:11 the trick in self-attention: matrix multiply as weighted aggregation
00:51:54 version 2: using matrix multiply
00:54:42 version 3: adding softmax
00:58:26 minor code cleanup
01:00:18 positional encoding
01:02:00 THE CRUX OF THE VIDEO: version 4: self-attention
01:11:38 note 1: attention as communication
01:12:46 note 2: attention has no notion of space, operates over sets
01:13:40 note 3: there is no communication across batch dimension
01:14:14 note 4: encoder blocks vs. decoder blocks
01:15:39 note 5: attention vs. self-attention vs. cross-attention
01:16:56 note 6: "scaled" self-attention. why divide by sqrt(head_size)
Building the Transformer
01:19:11 inserting a single self-attention block to our network
01:21:59 multi-headed self-attention
01:24:25 feedforward layers of transformer block
01:26:48 residual connections
01:32:51 layernorm (and its relationship to our previous batchnorm)
01:37:49 scaling up the model! creating a few variables. adding dropout
Notes on Transformer
01:42:39 encoder vs. decoder vs. both (?) Transformers
01:46:22 super quick walkthrough of nanoGPT, batched multi-headed self-attention
01:48:53 back to ChatGPT, GPT-3, pretraining vs. finetuning, RLHF
01:54:32 conclusions
Corrections:
00:57:00 Oops "tokens from the _future_ cannot communicate", not "past". Sorry! :)
01:20:05 Oops I should be using the head_size for the normalization, not C
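In code, the second correction amounts to scaling the attention scores by head_size (the key/query dimension) rather than by C (the embedding dimension); a minimal self-contained sketch, with shapes as in the video and illustrative values:

    import torch
    B, T, C, head_size = 4, 8, 32, 16
    q = torch.randn(B, T, head_size)                  # queries
    k = torch.randn(B, T, head_size)                  # keys
    # the correction: normalize by head_size, not by C
    wei = q @ k.transpose(-2, -1) * head_size**-0.5   # (B, T, T) attention scores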
detail
Summary: The video builds a GPT from scratch, covering ChatGPT and how it generates text, Transformer-based language models, an efficient self-attention implementation, the self-attention mechanism itself, and optimizing and scaling up the network, with quantifiable progress (loss reduction, validation-loss improvement) tracked as each component is added.
Part 1 (~00:00-07:45) intro: ChatGPT, Transformers, nanoGPT, Shakespeare
- ChatGPT is a system that lets you interact with an AI and give it text-based tasks (for example, writing a haiku about why people should understand AI); it generates its answer left to right, word by word, and gives slightly different responses to the same prompt because it is probabilistic.
- The Transformer architecture, proposed in the 2017 "Attention is All You Need" paper, is the neural network doing the heavy lifting under the hood; with minor changes it has been copy-pasted into a huge number of AI applications over the last five years, including the core of ChatGPT.
- nanoGPT is a simple repository for training Transformers on any given text: just two files of roughly 300 lines each, one defining the GPT model and one training it on a text dataset. Trained on the OpenWebText dataset, it reproduces the performance of GPT-2.
- In this video we train on the Tiny Shakespeare dataset and then generate infinite Shakespeare-like text (the same code works on any text dataset you like); the goal is to understand how ChatGPT works under the hood, and all that is required is proficiency in Python and a basic understanding of calculus and statistics.
Part 2 (~07:45-32:03) baseline language modeling: data, tokenization, bigram model
- The Tiny Shakespeare dataset is a roughly one-megabyte file of about one million characters; at the character level the vocabulary size is 65.
- Tokenization here is the simplest possible scheme: a lookup table from characters to integers and back, which gives us both the encoder and the decoder. In practice there are many other tokenizers, for example Google's SentencePiece (a subword scheme: neither whole words nor individual characters) and byte pair encoding; the trade-off is between codebook size and sequence length. The entire training text is then tokenized into a PyTorch tensor.
- The data is split 90% train / 10% validation so we can tell to what extent the model is overfitting, rather than just memorizing the exact Shakespeare text.
- Training operates on chunks of at most block_size characters; within each chunk, every context length from 1 up to block_size is a training example, so the Transformer gets used to contexts of every size and can later start generation from as little as one character.
- A batch dimension stacks many chunks into a single tensor purely for efficiency, to keep the GPU busy with parallel work; the chunks are processed completely independently and never talk to each other.
- The simplest baseline is a bigram language model: a token embedding table whose rows are read off directly as logits for the next character, with the loss evaluated by cross-entropy (after reshaping the logits to what PyTorch expects) and a generate function for sampling. A sketch of the data pipeline follows below.
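A minimal sketch of the character-level tokenizer, the 90/10 split and the batched data loader described above; it follows the video's code closely, but treat the exact block_size and batch_size values as illustrative:

    import torch

    text = open('input.txt', 'r', encoding='utf-8').read()   # the Tiny Shakespeare text
    chars = sorted(set(text))                                 # 65 unique characters
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    encode = lambda s: [stoi[c] for c in s]                   # string -> list of integers
    decode = lambda l: ''.join(itos[i] for i in l)            # list of integers -> string

    data = torch.tensor(encode(text), dtype=torch.long)
    n = int(0.9 * len(data))                                  # 90% train, 10% validation
    train_data, val_data = data[:n], data[n:]

    block_size, batch_size = 8, 4                             # small values, as early in the video

    def get_batch(split):
        d = train_data if split == 'train' else val_data
        ix = torch.randint(len(d) - block_size, (batch_size,))
        x = torch.stack([d[i:i + block_size] for i in ix])            # (B, T) inputs
        y = torch.stack([d[i + 1:i + block_size + 1] for i in ix])    # (B, T) targets, shifted by one
        return x, y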
Part 3 (~32:03-41:27) training the bigram model
- Sampling 100 tokens from the untrained model produces garbage, because the model is completely random. The generate function is written to be general (it keeps feeding the whole growing context back in), which is wasteful for a bigram model that only ever looks at the last character, but it will pay off later.
- We train with the Adam optimizer; a typical good learning rate is around 3e-4, but for a very small network like this one you can get away with much higher rates such as 1e-3 or even higher.
- The usual training loop (sample a batch, evaluate the loss, zero the gradients, backpropagate, update the parameters) takes the loss from about 4.7 down through 4.6 and 4.5 over the first 100 iterations, and eventually to roughly 2.5, at which point the samples are a dramatic improvement but still far from Shakespeare.
- The limitation of this simplest possible model is that the tokens do not talk to each other: the prediction for what comes next looks only at the very last character. Getting the tokens to communicate about their context is what kicks off the Transformer. A sketch of the training loop follows below.
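A minimal sketch of that training loop, reusing get_batch from the earlier sketch; model is assumed to be the bigram language model from the video, returning (logits, loss) when called as model(xb, yb), and the optimizer here is AdamW (the transcript says Adam; either works for this sketch):

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)   # a high lr is fine for a tiny network

    for step in range(10000):
        xb, yb = get_batch('train')            # sample a batch of (inputs, targets)
        logits, loss = model(xb, yb)           # forward pass; cross-entropy computed inside the model
        optimizer.zero_grad(set_to_none=True)  # clear old gradients
        loss.backward()                        # backpropagate
        optimizer.step()                       # update parameters

    print(loss.item())                         # for the bigram model this drifts from ~4.7 toward ~2.5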
Part 4 (~41:27-1:08:29) the mathematical trick and a first self-attention head
- The code so far is packaged into a roughly 120-line script, bigram.py, which prints the train and val loss (both converging to about 2.5) and a sample at the end, so we are in a good position to iterate.
- The key constraint: information flows only from the previous context to the current time step; a token cannot use information from the future it is about to predict. The weakest form of aggregation (version 1) is to average, with for-loops, the vectors of all preceding tokens together with the current one.
- The trick: the same averaging can be done as a matrix multiply with a lower-triangular T-by-T weight matrix, i.e. a batched weighted aggregation in which the lower-triangular entries say how much of each past element fuses into each position; the result is identical to the loop version (version 2).
- Version 3 adds a softmax: set the upper-triangular entries to -inf and softmax each row, which reproduces the same normalized weights. Crucially, the weights can now become data-dependent affinities between tokens (some tokens will find other tokens more or less interesting), which is exactly what self-attention needs. A sketch of this trick follows below.
- Minor additions before attention: an embedding dimension n_embd (32, suggested by GitHub Copilot), a linear layer mapping token embeddings to logits, and positional embeddings added to the token embeddings. Position does not help the translation-invariant bigram model yet, but it starts to matter once self-attention arrives.
- THE CRUX OF THE VIDEO: a single self-attention head. In a working example with a 4-by-8 arrangement of tokens and 32 channels, every token emits a query ("what am I looking for?") and a key ("what do I contain?"); the affinities between tokens are the dot products of queries with keys (a matrix multiply against the transposed keys), which turns the fixed triangular averaging into a data-dependent weighted aggregation.
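A minimal sketch of the masking-plus-softmax aggregation trick described above, applied to random token vectors x of shape (B, T, C):

    import torch
    import torch.nn.functional as F

    B, T, C = 4, 8, 32
    x = torch.randn(B, T, C)                          # a batch of token vectors

    tril = torch.tril(torch.ones(T, T))               # lower-triangular structure
    wei = torch.zeros(T, T)                           # uniform affinities to start with
    wei = wei.masked_fill(tril == 0, float('-inf'))   # tokens from the future cannot communicate
    wei = F.softmax(wei, dim=-1)                      # each row becomes a distribution summing to 1
    out = wei @ x                                     # (T, T) @ (B, T, C) -> (B, T, C): average of the past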
Part 5 (~1:08:29-1:19:53) notes on the attention mechanism
- When a query and a key dot-product to a high affinity, the softmax makes that position aggregate a lot of information from the corresponding token. The raw affinities take on arbitrary values (roughly -2 to +2 here); the upper-triangular mask keeps future positions from being used, and exponentiating and normalizing (softmax) turns each row into a distribution that sums to one, so the amount aggregated from each past token is data-dependent.
- What gets aggregated is not the raw x but a value vector v, produced by yet another linear projection of x, so the output of a single head has dimension head_size (16 here). Intuitively: the key says "here is what I contain", the query says "here is what I am looking for", and the value says "if you find me interesting, here is what I will communicate to you".
- Attention is a communication mechanism: think of nodes in a directed graph, where every node aggregates a weighted sum of information from the nodes that point to it. It has no notion of space (hence the positional embeddings), and there is no communication across the batch dimension.
- In self-attention the keys, queries and values are all produced from the same nodes; cross-attention is used when the queries read information from a separate source of nodes. Attention in general is therefore more general than the self-attention used here.
- "Scaled" attention divides the affinities by sqrt(head_size) to control their variance at initialization; without it the softmax becomes too peaky and every node effectively aggregates information from a single other node.
- With all that said, the code wraps a single head of self-attention into a Head module. A sketch of such a head follows below.
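A minimal sketch of a single scaled self-attention head as described above, written as an nn.Module in the style of the video's Head class; the hyperparameters are illustrative:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Head(nn.Module):
        """One head of masked (decoder-style) self-attention."""
        def __init__(self, n_embd, head_size, block_size):
            super().__init__()
            self.key = nn.Linear(n_embd, head_size, bias=False)
            self.query = nn.Linear(n_embd, head_size, bias=False)
            self.value = nn.Linear(n_embd, head_size, bias=False)
            self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        def forward(self, x):                                  # x: (B, T, n_embd)
            B, T, C = x.shape
            k = self.key(x)                                    # (B, T, head_size)
            q = self.query(x)                                  # (B, T, head_size)
            # affinities, scaled by sqrt(head_size) (per the correction above: head_size, not C)
            wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5  # (B, T, T)
            wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # mask out the future
            wei = F.softmax(wei, dim=-1)                       # data-dependent weights
            v = self.value(x)                                  # (B, T, head_size)
            return wei @ v                                     # (B, T, head_size)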
building the Transformer block

The single head is dropped into the network, with generation cropped so that no more than block_size elements are ever passed in. The learning rate in the training script is also decreased, because self-attention cannot tolerate very high learning rates, and the number of iterations is increased to compensate. Training this brings the validation loss from about 2.5 down to about 2.4: a modest improvement, and the generated text is still not amazing, but the single self-attention head is clearly doing some useful communication. There is still a long way to go.

Next, the "Attention Is All You Need" paper uses multi-head attention, which is simply multiple attentions applied in parallel with their results concatenated; with the 32-dimensional embedding that means four heads of 8 dimensions each, giving the tokens several communication channels, and it brings the validation loss down to roughly 2.28. Once the tokens have gathered information by talking to each other, they also need to think on that data individually, and that is what the feed-forward network does: a small per-token MLP applied to every position independently. Adding it brings the validation loss down further, to 2.24. The outputs still look fairly terrible, but the situation keeps improving. Minimal sketches of both modules follow.
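These sketches assume the Head class and n_embd from the previous snippet; the lecture's exact code may differ slightly (for instance, a projection layer is added a little later, when residual connections appear).

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Several heads of self-attention run in parallel; their outputs are concatenated."""
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

    def forward(self, x):
        # concatenate over the channel dimension: num_heads * head_size channels out
        return torch.cat([h(x) for h in self.heads], dim=-1)

class FeedForward(nn.Module):
    """Per-token computation: after gathering information via attention,
    each position processes it independently with a small MLP."""
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_embd, n_embd), nn.ReLU())

    def forward(self, x):
        return self.net(x)

# e.g. four 8-dimensional heads concatenate back into a 32-dimensional vector
sa_heads = MultiHeadAttention(4, n_embd // 4)
ffwd = FeedForward(n_embd)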
As a preview of the full architecture, the network now starts to intersperse communication with computation. A Block bundles the two: the communication is done by multi-headed self-attention, and the computation by the feed-forward network applied to all the tokens independently. The Block takes the embedding dimension and the number of heads we would like, and blocks can then be stacked.

Simply stacking blocks makes the network deep enough that it becomes harder to optimize, and two details from the paper help a great deal. The first is skip connections, also called residual connections, introduced in "Deep Residual Learning for Image Recognition" (2015): you transform the data, but then add the result back onto the previous features. Visualizing the computation as flowing from top to bottom, there is a residual pathway that each block only forks off from and adds back into, so gradients flow unimpeded from the supervision all the way back to the input, a gradient superhighway that makes optimization much easier; the sub-layers also get linear projections back into this residual pathway. The second detail is that the inner layer of the feed-forward network is four times the embedding size (4 * n_embd), projected back down to n_embd on the way out, which grows the computation living on the side of the residual pathway. Training with these changes brings the validation loss all the way down to 2.08. A sketch of the updated block follows.
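A sketch of the block with both tricks applied, reusing MultiHeadAttention from above (the lecture also gives that module a linear projection back into the residual stream, omitted here because the concatenated heads already add up to n_embd channels):

import torch.nn as nn

class FeedForward(nn.Module):
    """Feed-forward with a 4x wider inner layer and a projection back down to n_embd."""
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),   # grow the computation on the side branch
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),   # project back into the residual pathway
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """Communication (multi-head self-attention) followed by computation (feed-forward),
    each wired around the residual 'gradient superhighway'."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)

    def forward(self, x):
        x = x + self.sa(x)     # skip connection around attention
        x = x + self.ffwd(x)   # skip connection around the per-token MLP
        return x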
To recap this part: the bigram baseline sat at a validation loss of roughly 2.5; a single self-attention head brings it to about 2.4, multi-head attention to 2.28, the feed-forward layer to 2.24, and residual connections plus the four-times-wider feed-forward to 2.08, with the skip connections doing most of the work of keeping the deeper network optimizable.
layernorm and scaling up the model

The second optimization that helps very deep networks is layer normalization, available in PyTorch as nn.LayerNorm. Layer norms are added inside each block, plus one more at the end of the transformer, before the final linear layer, to complete the decoder-only architecture. Because each layer norm carries trainable gamma and beta parameters, its outputs will not necessarily remain unit Gaussian; the optimization determines what they should be. With the layer norms incorporated, training brings the validation loss down to 2.06, a bit better than the previous 2.08.
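The lecture places these layer norms in the now-common pre-norm arrangement, applied to the input of each sub-layer; a sketch, reusing the modules above:

import torch.nn as nn

class Block(nn.Module):
    """Pre-norm Transformer block: LayerNorm (with trainable gamma/beta) before each sub-layer."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa   = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1  = nn.LayerNorm(n_embd)   # gamma is .weight, beta is .bias
        self.ln2  = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

# a stack of blocks, plus the extra LayerNorm at the end of the transformer
# (the layer count here is an arbitrary small value for illustration)
blocks = nn.Sequential(*[Block(n_embd, n_head=4) for _ in range(3)])
ln_f = nn.LayerNorm(n_embd)   # final layer norm, right before the language-model head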
After a handful of cosmetic changes that make the model easy to scale, the hyperparameters are turned up: batch size 64; block size 256, so that 256 characters of context are used to predict the 257th; an embedding dimension of 384 with six heads, so each head is the standard 64-dimensional; six layers; dropout of 0.2; and a somewhat lower learning rate, because the network is now much bigger. Training this run takes roughly 15 minutes on an A100 GPU (without a reasonably good GPU it will be hard to reproduce), and the validation loss drops from roughly 2.07 to 1.48, a large improvement obtained purely by scaling up the same code. The settings are collected below.
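Gathered into one place, the scaled-up configuration looks roughly like this; the learning-rate value is an assumption, since the lecture only says it was brought down a little.

import torch

batch_size = 64        # independent sequences processed in parallel
block_size = 256       # context length: 256 characters predict the 257th
n_embd     = 384       # embedding dimension
n_head     = 6         # 384 / 6 = 64 dimensions per head
n_layer    = 6         # number of Transformer blocks
dropout    = 0.2       # 20% of activations dropped during training
learning_rate = 3e-4   # assumed value; lowered because the network is much bigger
device = 'cuda' if torch.cuda.is_available() else 'cpu'  # an A100 finishes in ~15 minutes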
the trained model and the full Transformer architecture

Sampling about 10,000 characters from the scaled-up model and writing them to a file, the output is now much more recognizable as the same kind of text as the input file: someone is always speaking in that manner, and the predictions take on exactly that form, even though what they say is still nonsense. It is a demonstration of what is possible with a character-level transformer trained on roughly one million characters of Shakespeare.

Comparing against the diagram in "Attention Is All You Need", what has been implemented is a decoder-only transformer: there is no encoder and no cross-attention. The triangular mask in the attention is what makes it a decoder and gives it the autoregressive property needed for language modeling. The original paper uses an encoder-decoder architecture because it is a machine translation paper, concerned with a different setting: it is handed tokens that encode, say, a French sentence, and is expected to decode the English translation, with cross-attention letting the decoder condition on the encoder's output. For generating text in the style of a single dataset, the decoder half is all that is needed; nanoGPT is organized the same way, and its train.py and model.py contain essentially the code written here, with some differences in structure and functionality.

Printing the number of parameters shows that this little Shakespeare transformer has about 10 million of them, trained on a dataset of roughly 1 million characters, i.e. roughly 1 million character-level tokens. OpenAI's models use a different vocabulary: not individual characters but subword chunks of words. A sketch of inspecting the result follows.
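In this sketch, m, decode and input.txt stand for the trained model, the character-level decoder and the dataset file from the lecture (assumed names), and tiktoken is used only to illustrate the subword comparison.

import torch
import tiktoken

# sample ~10,000 characters from the trained model and write them to a file
context = torch.zeros((1, 1), dtype=torch.long, device=device)
open('more.txt', 'w').write(decode(m.generate(context, max_new_tokens=10000)[0].tolist()))

# count parameters: roughly 10 M for the scaled-up model
print(sum(p.numel() for p in m.parameters()) / 1e6, 'M parameters')

# character-level vs subword token counts for the same ~1M-character dataset
text = open('input.txt', 'r', encoding='utf-8').read()
enc = tiktoken.get_encoding('gpt2')          # OpenAI-style BPE with ~50k subword chunks
print(len(text), 'characters')               # ~1,000,000 character-level tokens
print(len(enc.encode(text)), 'BPE tokens')   # roughly 300,000 subword tokens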
character level.", 'start': 6589.597, 'duration': 1.08}, {'end': 6594.098, 'text': 'They use these subword chunks of words.', 'start': 6590.957, 'duration': 3.141}], 'summary': 'A transformer with 10 million parameters generates shakespeare using 1 million tokens.', 'duration': 27.153, 'max_score': 6566.945, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kCc8FmEb1nY/pics/kCc8FmEb1nY6566945.jpg'}, {'end': 6763.579, 'src': 'embed', 'start': 6739.853, 'weight': 4, 'content': [{'end': 6746.922, 'text': 'so the second fine-tuning stage is to actually align it to be an assistant, and this is the second stage.', 'start': 6739.853, 'duration': 7.069}, {'end': 6752.409, 'text': 'and so this ChatGPT blog post from OpenAI talks a little bit about how this stage is achieved.', 'start': 6746.922, 'duration': 5.487}, {'end': 6757.237, 'text': "we basically There's roughly three steps to this stage.", 'start': 6752.409, 'duration': 4.828}, {'end': 6763.579, 'text': 'So what they do here is they start to collect training data that looks specifically like what an assistant would do.', 'start': 6758.097, 'duration': 5.482}], 'summary': 'Second stage of fine-tuning aligns model to be an assistant, with three steps in the process.', 'duration': 23.726, 'max_score': 6739.853, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kCc8FmEb1nY/pics/kCc8FmEb1nY6739853.jpg'}, {'end': 6900.007, 'src': 'embed', 'start': 6872.492, 'weight': 1, 'content': [{'end': 6875.034, 'text': "Okay, and that's everything that I wanted to cover today.", 'start': 6872.492, 'duration': 2.542}, {'end': 6882.219, 'text': 'so we trained, to summarize, a decoder only transformer following this famous paper.', 'start': 6875.034, 'duration': 7.185}, {'end': 6887.243, 'text': "attention is all you need from 2017, and so that's basically a GPT.", 'start': 6882.219, 'duration': 5.024}, {'end': 6896.486, 'text': 'we trained it on a tiny Shakespeare and got sensible results, and All of the training code is roughly 200 lines of code.', 'start': 6887.243, 'duration': 9.243}, {'end': 6900.007, 'text': 'I will be releasing this code base.', 'start': 6897.266, 'duration': 2.741}], 'summary': 'Trained a decoder-only transformer on tiny shakespeare data with 200 lines of code', 'duration': 27.515, 'max_score': 6872.492, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/kCc8FmEb1nY/pics/kCc8FmEb1nY6872492.jpg'}, {'end': 6975.363, 'src': 'embed', 'start': 6947.216, 'weight': 8, 'content': [{'end': 6949.598, 'text': "you don't want something that's just a document completer.", 'start': 6947.216, 'duration': 2.382}, {'end': 6953.12, 'text': 'you have to complete further stages of fine-tuning, which we did not cover.', 'start': 6949.598, 'duration': 3.522}, {'end': 6958.84, 'text': 'And that could be simple supervised fine-tuning, or it can be something more fancy like we see in ChaiGPT.', 'start': 6954.279, 'duration': 4.561}, {'end': 6964.601, 'text': 'We actually train a reward model and then do rounds of PPO to align it with respect to the reward model.', 'start': 6959.14, 'duration': 5.461}, {'end': 6967.002, 'text': "So there's a lot more that can be done on top of it.", 'start': 6965.421, 'duration': 1.581}, {'end': 6969.882, 'text': "I think for now we're starting to get to about two hours, Mark.", 'start': 6967.542, 'duration': 2.34}, {'end': 6972.923, 'text': "So I'm going to kind of finish here.", 'start': 6970.402, 'duration': 2.521}, {'end': 6975.363, 'text': 'I 
The remaining steps go beyond plain language modeling: a simple supervised finetune can be enough for some uses, but the ChatGPT recipe also trains a reward model and then runs rounds of PPO, a policy-gradient reinforcement-learning method, to align the model with respect to that reward model, which is what turns a document completer into a question answerer. None of that is covered in this lecture; there is a lot more that can be built on top.

To summarize: the lecture trains a decoder-only transformer, following the famous 2017 paper "Attention Is All You Need", which is basically a GPT. Trained on tiny Shakespeare it produces sensible results, the whole training script is roughly 200 lines of code, and the code base is being released along with the notebook and Google Colab. At around the two-hour mark, that is where the lecture wraps up.
overall highlights

- The dataset is split 90% / 10% into train and validation data to keep an eye on overfitting, and the data loader serves random chunks so that the transformer is trained on every context size from a single token up to block_size, which it needs for later inference (see the sketch after this list).
- The bigram baseline makes two things obvious: an untrained model babbles garbage, so the quality of the generated text has to come from training, and the early implementation uses the model's capabilities inefficiently because the tokens are not yet talking to each other.
- The mathematical trick inside self-attention, weighted aggregation over past tokens done with a masked matrix multiply and a softmax, is what makes an efficient implementation possible.
- Each architectural step pays for itself in validation loss, and scaling the model up takes it from roughly 2.07 to 1.48; nanoGPT packages the same ideas, reproducing GPT-2 performance in about 300 lines of code.
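A minimal sketch of that data pipeline, assuming data is the full tiny-shakespeare text already encoded as a 1-D tensor of integer token ids and that batch_size and block_size are defined as above:

import torch

# 90% / 10% split of the encoded text into train and validation data
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

def get_batch(split):
    """Sample a random batch of (context, target) chunks of length block_size."""
    d = train_data if split == 'train' else val_data
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i + block_size] for i in ix])          # inputs
    y = torch.stack([d[i + 1:i + block_size + 1] for i in ix])  # targets, shifted by one
    return x, y

# every chunk implicitly contains block_size training examples, with contexts
# ranging from a single token up to block_size tokens:
xb, yb = get_batch('train')
for t in range(block_size):
    context = xb[0, :t + 1]   # the first t+1 tokens ...
    target  = yb[0, t]        # ... are trained to predict the next one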