title

The spelled-out intro to language modeling: building makemore

description

We implement a bigram character-level language model, which we will further complexify in follow-up videos into a modern Transformer language model, like GPT. In this video, the focus is on (1) introducing torch.Tensor and its subtleties and use in efficiently evaluating neural networks and (2) the overall framework of language modeling, which includes model training, sampling, and the evaluation of a loss (e.g. the negative log likelihood for classification).
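As a quick illustration of the loss mentioned above, here is a minimal sketch of the negative log likelihood using made-up probabilities (these numbers are not from the video):

```python
import torch

# Hypothetical probabilities a model assigns to the *correct* label
# at four positions in a dataset (made-up numbers for illustration).
probs = torch.tensor([0.2, 0.5, 0.1, 0.9])

# Negative log likelihood: average -log(p) over the examples.
# Lower is better; a perfect model (p = 1 everywhere) scores 0.
nll = -probs.log().mean()
print(nll.item())  # roughly 1.18
```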
Links:
- makemore on github: https://github.com/karpathy/makemore
- jupyter notebook I built in this video: https://github.com/karpathy/nn-zero-to-hero/blob/master/lectures/makemore/makemore_part1_bigrams.ipynb
- my website: https://karpathy.ai
- my twitter: https://twitter.com/karpathy
- (new) Neural Networks: Zero to Hero series Discord channel: https://discord.gg/3zy8kqD9Cp , for people who'd like to chat more and go beyond youtube comments
Useful links for practice:
- Python + Numpy tutorial from CS231n https://cs231n.github.io/python-numpy-tutorial/ . We use torch.tensor instead of numpy.array in this video. Their design (e.g. broadcasting, data types, etc.) is so similar that practicing one is basically practicing the other; just be careful with some of the APIs - how various functions are named, what arguments they take, etc. - these details can vary.
- PyTorch tutorial on Tensor https://pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html
- Another PyTorch intro to Tensor https://pytorch.org/tutorials/beginner/nlp/pytorch_tutorial.html
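To make the numpy/torch parallel above concrete, here is a tiny sketch (not from the video) of the same row normalization in both libraries; note that only the argument names differ:

```python
import numpy as np
import torch

a_np = np.array([[1., 2., 3.], [4., 5., 6.]])
a_pt = torch.tensor([[1., 2., 3.], [4., 5., 6.]])

# Divide each row by its sum; the (2, 1) column of sums broadcasts
# across the columns of the (2, 3) array in both libraries.
p_np = a_np / a_np.sum(axis=1, keepdims=True)  # numpy: axis=, keepdims=
p_pt = a_pt / a_pt.sum(dim=1, keepdim=True)    # torch: dim=, keepdim=
```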
Exercises:
E01: train a trigram language model, i.e. take two characters as an input to predict the 3rd one. Feel free to use either counting or a neural net. Evaluate the loss; did it improve over a bigram model?
E02: split up the dataset randomly into 80% train set, 10% dev set, 10% test set. Train the bigram and trigram models only on the training set. Evaluate them on dev and test splits. What can you see?
E03: use the dev set to tune the strength of smoothing (or regularization) for the trigram model - i.e. try many possibilities and see which one works best based on the dev set loss. What patterns can you see in the train and dev set loss as you tune this strength? Take the best setting of the smoothing and evaluate on the test set once and at the end. How good of a loss do you achieve?
E04: we saw that our 1-hot vectors merely select a row of W, so producing these vectors explicitly feels wasteful. Can you delete our use of F.one_hot in favor of simply indexing into rows of W?
E05: look up and use F.cross_entropy instead. You should achieve the same result. Can you think of why we'd prefer to use F.cross_entropy instead?
E06: meta-exercise! Think of a fun/interesting exercise and complete it.
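For E04 and E05, the key identities can be sketched as follows; this is a toy setup with assumed shapes and variable names, not the notebook's actual code:

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g)  # one linear layer's weights
xs = torch.tensor([0, 5, 13])           # example input char indices
ys = torch.tensor([5, 13, 0])           # example target char indices

# E04: multiplying a one-hot vector by W just selects a row of W,
# so plain indexing gives identical logits without building the vectors.
logits_onehot = F.one_hot(xs, num_classes=27).float() @ W
logits_index = W[xs]

# E05: F.cross_entropy fuses log-softmax and negative log likelihood;
# it is more numerically stable and never materializes the probabilities.
loss = F.cross_entropy(logits_index, ys)
```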
Chapters:
00:00:00 intro
00:03:03 reading and exploring the dataset
00:06:24 exploring the bigrams in the dataset
00:09:24 counting bigrams in a python dictionary
00:12:45 counting bigrams in a 2D torch tensor ("training the model")
00:18:19 visualizing the bigram tensor
00:20:54 deleting spurious (S) and (E) tokens in favor of a single . token
00:24:02 sampling from the model
00:36:17 efficiency! vectorized normalization of the rows, tensor broadcasting
00:50:14 loss function (the negative log likelihood of the data under our model)
01:00:50 model smoothing with fake counts
01:02:57 PART 2: the neural network approach: intro
01:05:26 creating the bigram dataset for the neural net
01:10:01 feeding integers into neural nets? one-hot encodings
01:13:53 the "neural net": one linear layer of neurons implemented with matrix multiplication
01:18:46 transforming neural net outputs into probabilities: the softmax
01:26:17 summary, preview to next steps, reference to micrograd
01:35:49 vectorized loss
01:38:36 backward and update, in PyTorch
01:42:55 putting everything together
01:47:49 note 1: one-hot encoding really just selects a row of the next Linear layer's weight matrix
01:50:18 note 2: model smoothing as regularization loss
01:54:31 sampling from the neural net
01:56:16 conclusion
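The counting pipeline outlined in the chapters above (count bigrams in a 2D tensor, smooth with fake counts, normalize rows via broadcasting, sample with torch.multinomial) can be condensed into a short sketch; the three-word list stands in for names.txt, and the variable names are assumptions rather than the notebook's exact code:

```python
import torch

words = ["emma", "olivia", "ava"]  # stand-in for names.txt
chars = sorted(set("".join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi["."] = 0  # a single start/end token, as in the video
itos = {i: s for s, i in stoi.items()}

# Count bigrams in a 2D tensor: rows index the first character,
# columns the second ("training the model").
N = torch.zeros((len(stoi), len(stoi)), dtype=torch.int32)
for w in words:
    cs = ["."] + list(w) + ["."]
    for c1, c2 in zip(cs, cs[1:]):
        N[stoi[c1], stoi[c2]] += 1

# Model smoothing with fake counts (+1), then vectorized row
# normalization: the (V, 1) row sums broadcast across the columns.
P = (N + 1).float()
P /= P.sum(1, keepdim=True)

# Sample one name: start at the "." row, draw the next character from
# that row's distribution, stop when "." is drawn again.
g = torch.Generator().manual_seed(2147483647)
ix, out = 0, []
while True:
    ix = torch.multinomial(P[ix], num_samples=1,
                           replacement=True, generator=g).item()
    if ix == 0:
        break
    out.append(itos[ix])
name = "".join(out)
```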

detail

{'title': 'The spelled-out intro to language modeling: building makemore', 'heatmap': [{'end': 6642.079, 'start': 6567.612, 'weight': 1}], 'summary': 'Delves into character-level language modeling using a dataset of 32,000 names and various neural network models, discussing bigram language model construction, tensor manipulation, probability matrix optimization, and training with a culminating training set loss of 2.4. it also covers neural network operations, optimization through gradient-based backpropagation with a decrease in loss from 3.76 to 3.72, scalability, and regularization in neural networks.', 'chapters': [{'end': 210.923, 'segs': [{'end': 99.803, 'src': 'embed', 'start': 69.734, 'weight': 0, 'content': [{'end': 75.215, 'text': 'So here are some example generations from the neural network once we train it on our dataset.', 'start': 69.734, 'duration': 5.481}, {'end': 79.144, 'text': "So here's some example unique names that it will generate.", 'start': 76.4, 'duration': 2.744}, {'end': 85.053, 'text': 'Dontel, irot, zendy, and so on.', 'start': 79.865, 'duration': 5.188}, {'end': 89.18, 'text': "And so all these sort of sound name-like, but they're not, of course, names.", 'start': 85.774, 'duration': 3.406}, {'end': 94.241, 'text': 'So under the hood, MakeMore is a character-level language model.', 'start': 90.819, 'duration': 3.422}, {'end': 99.803, 'text': 'So what that means is that it is treating every single line here as an example,', 'start': 94.901, 'duration': 4.902}], 'summary': 'Neural network generates unique names like dontel, irot, zendy from dataset.', 'duration': 30.069, 'max_score': 69.734, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo69734.jpg'}, {'end': 160.423, 'src': 'embed', 'start': 132.454, 'weight': 2, 'content': [{'end': 140.276, 'text': 'So very simple bigram and bag of word models, multilayered perceptrons, recurrent neural networks, all the way to modern 
transformers.', 'start': 132.454, 'duration': 7.822}, {'end': 147.418, 'text': 'In fact, the transformer that we will build will be basically the equivalent transformer to GPT-2 if you have heard of GPT.', 'start': 140.936, 'duration': 6.482}, {'end': 149.158, 'text': "So that's kind of a big deal.", 'start': 148.218, 'duration': 0.94}, {'end': 154.6, 'text': "It's a modern network and by the end of the series you will actually understand how that works.", 'start': 149.438, 'duration': 5.162}, {'end': 160.423, 'text': 'on the level of characters, now, to give you a sense of the extensions here after characters,', 'start': 154.6, 'duration': 5.823}], 'summary': 'Intro to various neural network models from bigram to modern transformers, including equivalent to gpt-2. series will cover understanding them.', 'duration': 27.969, 'max_score': 132.454, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo132454.jpg'}, {'end': 210.923, 'src': 'embed', 'start': 171.187, 'weight': 3, 'content': [{'end': 177.589, 'text': "And then we're probably going to go into images and image text networks such as DALI, stable diffusion and so on.", 'start': 171.187, 'duration': 6.402}, {'end': 181.991, 'text': 'But for now, we have to start here, character-level language modeling.', 'start': 178.27, 'duration': 3.721}, {'end': 182.671, 'text': "Let's go.", 'start': 182.371, 'duration': 0.3}, {'end': 186.653, 'text': 'So like before, we are starting with a completely blank Jupyter Notebook page.', 'start': 183.592, 'duration': 3.061}, {'end': 190.954, 'text': 'The first thing is I would like to basically load up the dataset, names.txt.', 'start': 187.193, 'duration': 3.761}, {'end': 194.275, 'text': "So we're going to open up names.txt for reading.", 'start': 191.955, 'duration': 2.32}, {'end': 198.517, 'text': "And we're going to read in everything into a massive string.", 'start': 195.616, 'duration': 2.901}, {'end': 203.918, 'text': 
"And then because it's a massive string, we'd only like the individual words and put them in the list.", 'start': 199.895, 'duration': 4.023}, {'end': 210.923, 'text': "So let's call split lines on that string to get all of our words as a Python list of strings.", 'start': 204.638, 'duration': 6.285}], 'summary': 'Starting character-level language modeling with names.txt dataset', 'duration': 39.736, 'max_score': 171.187, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo171187.jpg'}], 'start': 0.369, 'title': 'Character-level language modeling and building makemore', 'summary': 'Explores character-level language modeling, utilizing a dataset of 32,000 names and various neural network models including gpt-2 transformer, while also discussing the process of generating unique names and exploring image text networks such as dali and stable diffusion.', 'chapters': [{'end': 154.6, 'start': 0.369, 'title': 'Building makemore: generating unique names', 'summary': 'Discusses building makemore, a character-level language model, to generate unique names, utilizing a dataset of 32,000 names, and encompassing various neural network models, including the equivalent of gpt-2 transformer.', 'duration': 154.231, 'highlights': ['MakeMore is a character-level language model that learns to generate unique names from a dataset of 32,000 names, offering potential assistance in finding new and distinct names for various applications, such as naming a baby. (Relevance Score: 5)', "The dataset used to train MakeMore consists of 32,000 randomly sourced names from a government website, showcasing the model's ability to learn and generate new name-like variations. 
(Relevance Score: 4)", "The implementation of MakeMore involves various neural network models, starting from simple bigram and bag of word models to advanced transformers, including a modern equivalent of GPT-2, providing a comprehensive understanding of the model's functionality. (Relevance Score: 3)"]}, {'end': 210.923, 'start': 154.6, 'title': 'Character-level language modeling', 'summary': 'Discusses the process of character-level language modeling, starting with loading the dataset names.txt, reading it into a massive string, and extracting individual words into a python list of strings, aiming to generate documents of words and explore image text networks such as dali and stable diffusion.', 'duration': 56.323, 'highlights': ['The chapter aims to generate documents of words and explore image text networks such as DALI and stable diffusion.', 'The process begins with loading the dataset names.txt and reading it into a massive string.', 'The next step involves extracting individual words from the massive string and putting them in a Python list of strings.']}], 'duration': 210.554, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo369.jpg', 'highlights': ['MakeMore is a character-level language model that learns to generate unique names from a dataset of 32,000 names, offering potential assistance in finding new and distinct names for various applications, such as naming a baby. (Relevance Score: 5)', "The dataset used to train MakeMore consists of 32,000 randomly sourced names from a government website, showcasing the model's ability to learn and generate new name-like variations. (Relevance Score: 4)", "The implementation of MakeMore involves various neural network models, starting from simple bigram and bag of word models to advanced transformers, including a modern equivalent of GPT-2, providing a comprehensive understanding of the model's functionality. 
(Relevance Score: 3)", 'The chapter aims to generate documents of words and explore image text networks such as DALI and stable diffusion.', 'The process begins with loading the dataset names.txt and reading it into a massive string.', 'The next step involves extracting individual words from the massive string and putting them in a Python list of strings.']}, {'end': 853.62, 'segs': [{'end': 264.031, 'src': 'embed', 'start': 212.224, 'weight': 0, 'content': [{'end': 220.91, 'text': "So basically we can look at, for example, the first 10 words and we have that it's a list of Emma, Olivia, Ava, and so on.", 'start': 212.224, 'duration': 8.686}, {'end': 226.234, 'text': 'And if we look at the top of the page here, that is indeed what we see.', 'start': 221.69, 'duration': 4.544}, {'end': 228.855, 'text': "So that's good.", 'start': 226.254, 'duration': 2.601}, {'end': 233.984, 'text': 'This list actually makes me feel that this is probably sorted by frequency.', 'start': 229.861, 'duration': 4.123}, {'end': 238.167, 'text': 'But okay, so these are the words.', 'start': 235.725, 'duration': 2.442}, {'end': 241.569, 'text': "Now we'd like to actually like learn a little bit more about this data set.", 'start': 238.507, 'duration': 3.062}, {'end': 243.21, 'text': "Let's look at the total number of words.", 'start': 241.969, 'duration': 1.241}, {'end': 245.311, 'text': 'We expect this to be roughly 32, 000.', 'start': 243.43, 'duration': 1.881}, {'end': 249.935, 'text': 'And then what is the, for example, shortest word? 
So min of.', 'start': 245.312, 'duration': 4.623}, {'end': 264.031, 'text': 'len of each word for w in words, so the shortest word will be length two, and max of len w for w in words, so the longest word will be 15 characters.', 'start': 251.144, 'duration': 12.887}], 'summary': 'Analyzing a dataset of 32,000 words, finding shortest (2 characters) and longest (15 characters) words.', 'duration': 51.807, 'max_score': 212.224, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo212224.jpg'}, {'end': 376.928, 'src': 'embed', 'start': 338.243, 'weight': 4, 'content': [{'end': 339.904, 'text': "And then of course, we don't have just an individual word.", 'start': 338.243, 'duration': 1.661}, {'end': 341.965, 'text': 'We actually have 32, 000 of these.', 'start': 340.404, 'duration': 1.561}, {'end': 343.945, 'text': "And so there's a lot of structure here to model.", 'start': 342.285, 'duration': 1.66}, {'end': 349.907, 'text': "Now in the beginning, what I'd like to start with is I'd like to start with building a bigram language model.", 'start': 345.025, 'duration': 4.882}, {'end': 355.99, 'text': "Now in a bigram language model, we're always working with just two characters at a time.", 'start': 351.462, 'duration': 4.528}, {'end': 362.942, 'text': "So we're only looking at one character that we are given, and we're trying to predict the next character in the sequence.", 'start': 356.851, 'duration': 6.091}, {'end': 369.783, 'text': 'so what characters are likely to follow are what characters are likely to follow a and so on,', 'start': 364.019, 'duration': 5.764}, {'end': 376.928, 'text': "and we're just modeling that kind of a little local structure and we're forgetting the fact that we may have a lot more information.", 'start': 369.783, 'duration': 7.145}], 'summary': 'Modeling 32,000 words using bigram language model to predict next characters.', 'duration': 38.685, 'max_score': 338.243, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo338243.jpg'}, {'end': 791.995, 'src': 'embed', 'start': 765.098, 'weight': 6, 'content': [{'end': 772.682, 'text': "Now it's actually going to be significantly more convenient for us to keep this information in a two-dimensional array instead of a Python dictionary.", 'start': 765.098, 'duration': 7.584}, {'end': 784.774, 'text': "So we're going to store this information in a 2D array and the rows are going to be the first character of the bigram and the columns are going to be the second character.", 'start': 773.763, 'duration': 11.011}, {'end': 791.995, 'text': 'Each entry in this two-dimensional array will tell us how often that first character follows the second character in the dataset.', 'start': 785.414, 'duration': 6.581}], 'summary': 'Storing bigram information in a 2d array for convenient analysis.', 'duration': 26.897, 'max_score': 765.098, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo765098.jpg'}], 'start': 212.224, 'title': 'Analyzing word list data and building bigram language model', 'summary': 'Discusses analyzing a word list and extracting key statistics such as the total number of words (roughly 32,000), shortest word length (2 characters), and longest word length (15 characters), as well as building a bigram language model for character-level language prediction and manipulating a two-dimensional array to represent character bigram counts using pytorch.', 'chapters': [{'end': 264.031, 'start': 212.224, 'title': 'Analyzing word list data', 'summary': "Discusses analyzing a word list, confirming it's sorted by frequency, and extracting key statistics such as the total number of words (roughly 32,000), shortest word length (2 characters), and longest word length (15 characters).", 'duration': 51.807, 'highlights': ['The total number of words in the dataset is expected to be roughly 32,000.', 'The 
longest word in the dataset is 15 characters in length.', 'The shortest word in the dataset is 2 characters in length.', 'The list of words appears to be sorted by frequency.']}, {'end': 853.62, 'start': 264.812, 'title': 'Building bigram language model', 'summary': 'Discusses building a bigram language model for character-level language prediction, analyzing the statistical structure of character sequences, and using pytorch to create and manipulate a two-dimensional array to represent character bigram counts.', 'duration': 588.808, 'highlights': ['The chapter discusses building a bigram language model for character-level language prediction.', 'Analyzing the statistical structure of character sequences is an important aspect of the discussion.', 'Using PyTorch to create and manipulate a two-dimensional array to represent character bigram counts.']}], 'duration': 641.396, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo212224.jpg', 'highlights': ['The total number of words in the dataset is roughly 32,000.', 'The longest word in the dataset is 15 characters in length.', 'The shortest word in the dataset is 2 characters in length.', 'The list of words appears to be sorted by frequency.', 'The chapter discusses building a bigram language model for character-level language prediction.', 'Analyzing the statistical structure of character sequences is an important aspect of the discussion.', 'Using PyTorch to create and manipulate a two-dimensional array to represent character bigram counts.']}, {'end': 2176.186, 'segs': [{'end': 882.336, 'src': 'embed', 'start': 854.764, 'weight': 4, 'content': [{'end': 860.066, 'text': 'Now, tensors allow us to really manipulate all the individual entries and do it very efficiently.', 'start': 854.764, 'duration': 5.302}, {'end': 865.289, 'text': 'So for example, if we want to change this bit, we have to index into the tensor.', 'start': 860.887, 'duration': 4.402}, {'end': 
872.672, 'text': "And in particular here, this is the first row because it's zero indexed.", 'start': 865.909, 'duration': 6.763}, {'end': 878.114, 'text': 'So this is row index one and column index zero, one, two, three.', 'start': 872.932, 'duration': 5.182}, {'end': 882.336, 'text': 'So a at one comma three, we can set that to one.', 'start': 878.914, 'duration': 3.422}], 'summary': 'Tensors enable efficient manipulation of entries. indexing into the tensor for specific changes, such as setting a value at a specific index.', 'duration': 27.572, 'max_score': 854.764, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo854764.jpg'}, {'end': 1128.518, 'src': 'embed', 'start': 1099.921, 'weight': 2, 'content': [{'end': 1104.105, 'text': "So let's erase this ugly mess and let's try to visualize it a bit more nicer.", 'start': 1099.921, 'duration': 4.184}, {'end': 1107.868, 'text': "So for that, we're going to use a library called matplotlib.", 'start': 1104.965, 'duration': 2.903}, {'end': 1110.75, 'text': 'So matplotlib allows us to create figures.', 'start': 1108.989, 'duration': 1.761}, {'end': 1114.353, 'text': 'So we can do things like plt im show of the counter array.', 'start': 1111.17, 'duration': 3.183}, {'end': 1120.552, 'text': 'So this is the 28 by 28 array, and this is the structure.', 'start': 1116.249, 'duration': 4.303}, {'end': 1123.254, 'text': 'But even this, I would say, is still pretty ugly.', 'start': 1121.053, 'duration': 2.201}, {'end': 1128.518, 'text': "So we're going to try to create a much nicer visualization of it, and I wrote a bunch of code for that.", 'start': 1123.995, 'duration': 4.523}], 'summary': 'Using matplotlib library to create a visualization of a 28x28 array for improved clarity.', 'duration': 28.597, 'max_score': 1099.921, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo1099921.jpg'}, {'end': 1269.953, 'src': 
'embed', 'start': 1245.432, 'weight': 3, 'content': [{'end': 1253.347, 'text': 'So what is the structure of this array? We have all these counts and we see that some of them occur often and some of them do not occur often.', 'start': 1245.432, 'duration': 7.915}, {'end': 1257.889, 'text': "Now, if you scrutinize this carefully, you will notice that we're not actually being very clever.", 'start': 1254.108, 'duration': 3.781}, {'end': 1264.111, 'text': "That's because when you come over here, you'll notice that, for example, we have an entire row of completely zeros.", 'start': 1258.769, 'duration': 5.342}, {'end': 1269.953, 'text': "And that's because the end character is never possibly going to be the first character of a bigram,", 'start': 1264.771, 'duration': 5.182}], 'summary': 'The array structure includes counts, with some occurring often and others not, including entire rows of zeros indicating impossible bigram combinations.', 'duration': 24.521, 'max_score': 1245.432, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo1245432.jpg'}, {'end': 1469.332, 'src': 'embed', 'start': 1433.169, 'weight': 1, 'content': [{'end': 1436.232, 'text': 'And in between we have the structure of what characters follow each other.', 'start': 1433.169, 'duration': 3.063}, {'end': 1441.255, 'text': 'So this is the counts array of our entire dataset.', 'start': 1437.148, 'duration': 4.107}, {'end': 1448.867, 'text': 'So this array actually has all the information necessary for us to actually sample from this bigram character level language model.', 'start': 1441.795, 'duration': 7.072}, {'end': 1451.519, 'text': 'And, roughly speaking,', 'start': 1449.818, 'duration': 1.701}, {'end': 1458.144, 'text': "what we're going to do is we're just going to start following these probabilities and these counts and we're going to start sampling from the model.", 'start': 1451.519, 'duration': 6.625}, {'end': 1464.068, 'text': 'So in the 
beginning, of course, we start with the dot, the start token dot.', 'start': 1458.945, 'duration': 5.123}, {'end': 1469.332, 'text': "So to sample the first character of a name, we're looking at this row here.", 'start': 1464.709, 'duration': 4.623}], 'summary': 'Bigram character level language model uses counts array to sample characters.', 'duration': 36.163, 'max_score': 1433.169, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo1433169.jpg'}, {'end': 1553.642, 'src': 'embed', 'start': 1528.258, 'weight': 0, 'content': [{'end': 1534.122, 'text': "now these are the counts, and now what we'd like to do is we'd like to basically um sample from this.", 'start': 1528.258, 'duration': 5.864}, {'end': 1538.098, 'text': 'Since these are the raw counts, we actually have to convert this to probabilities.', 'start': 1535.137, 'duration': 2.961}, {'end': 1541.499, 'text': 'So we create a probability vector.', 'start': 1539.238, 'duration': 2.261}, {'end': 1548.58, 'text': "So we'll take n of zero and we'll actually convert this to float first.", 'start': 1543.019, 'duration': 5.561}, {'end': 1553.642, 'text': 'Okay, so these integers are converted to float, floating point numbers.', 'start': 1550.121, 'duration': 3.521}], 'summary': 'Convert raw counts to probabilities by creating a probability vector.', 'duration': 25.384, 'max_score': 1528.258, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo1528258.jpg'}, {'end': 1610.933, 'src': 'embed', 'start': 1581.201, 'weight': 6, 'content': [{'end': 1586.865, 'text': 'It sums to one, and this is giving us the probability for any single character to be the first character of a word.', 'start': 1581.201, 'duration': 5.664}, {'end': 1590.238, 'text': 'So now we can try to sample from this distribution.', 'start': 1588.075, 'duration': 2.163}, {'end': 1595.105, 'text': "To sample from these distributions, we're going to use 
torch.multinomial, which I've pulled up here.", 'start': 1590.839, 'duration': 4.266}, {'end': 1605.249, 'text': 'So torch.multinomial returns samples from the multinomial probability distribution, which is a complicated way of saying.', 'start': 1596.327, 'duration': 8.922}, {'end': 1610.933, 'text': 'you give me probabilities and I will give you integers which are sampled according to the probability distribution.', 'start': 1605.249, 'duration': 5.684}], 'summary': 'The probability distribution gives the chance for a character to be the first letter in a word, and torch.multinomial is used to sample from these distributions.', 'duration': 29.732, 'max_score': 1581.201, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo1581201.jpg'}, {'end': 2086.022, 'src': 'embed', 'start': 2060.666, 'weight': 9, 'content': [{'end': 2066.913, 'text': 'The reason these samples are so terrible is that bigram language model is actually just like really terrible.', 'start': 2060.666, 'duration': 6.247}, {'end': 2069.056, 'text': 'We can generate a few more here.', 'start': 2067.954, 'duration': 1.102}, {'end': 2077.025, 'text': "And you can see that they're kind of like, they're name-like a little bit, like Keanu, Riley, et cetera, but they're just like totally messed up.", 'start': 2070.157, 'duration': 6.868}, {'end': 2086.022, 'text': "And I mean, the reason that this is so bad, like we're generating age as a name, but you have to think through it from the model's eyes.", 'start': 2078.732, 'duration': 7.29}], 'summary': 'Bigram language model produces terrible samples, generating name-like outputs like keanu, riley, etc.', 'duration': 25.356, 'max_score': 2060.666, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo2060666.jpg'}, {'end': 2179.547, 'src': 'embed', 'start': 2154.324, 'weight': 8, 'content': [{'end': 2160.748, 'text': "okay. 
so it's this is what you have from a model that is completely untrained, where everything is equally likely.", 'start': 2154.324, 'duration': 6.424}, {'end': 2162.469, 'text': "so it's obviously garbage.", 'start': 2160.748, 'duration': 1.721}, {'end': 2168.593, 'text': 'and then, if we have a trained model which is trained on just bigrams, this is what we get.', 'start': 2162.469, 'duration': 6.124}, {'end': 2170.614, 'text': 'so you can see that it is more name-like.', 'start': 2168.593, 'duration': 2.021}, {'end': 2171.995, 'text': 'it is actually working.', 'start': 2170.614, 'duration': 1.381}, {'end': 2176.186, 'text': "it's just bigram is so terrible and we have to do better.", 'start': 2171.995, 'duration': 4.191}, {'end': 2179.547, 'text': 'Now next, I would like to fix an inefficiency that we have going on here.', 'start': 2176.686, 'duration': 2.861}], 'summary': 'Untrained model yields garbage results, trained bigram model shows improvement, but still inadequate.', 'duration': 25.223, 'max_score': 2154.324, 'thumbnail': ''}], 'start': 854.764, 'title': 'Manipulating tensors and language modeling', 'summary': 'Covers manipulating arrays using tensors, creating and visualizing bigram arrays, discussing bigram character-level language model structure and efficiency, converting counts to probabilities using torch, and torchlight multinomial sampling for language modeling, with a focus on training impact.', 'chapters': [{'end': 1244.291, 'start': 854.764, 'title': 'Manipulating tensors and visualizing arrays', 'summary': 'Explains how to manipulate arrays using tensors efficiently, creates a 28x28 array representing counts of bigrams in a dataset, and visualizes the array using matplotlib to represent bigrams and their occurrences.', 'duration': 389.527, 'highlights': ['Creating a 28x28 array to represent counts of bigrams in a dataset', 'Visualizing the array using matplotlib to represent bigrams and their occurrences', 'Manipulating arrays efficiently 
using tensors']}, {'end': 1528.258, 'start': 1245.432, 'title': 'Bigram character-level language model', 'summary': "Explains the structure of a bigram character-level language model, highlighting the inefficiencies in the initial array configuration, proposing a more efficient 27x27 array with a special token 'dot', and detailing the process of sampling characters based on the model's probabilities and counts.", 'duration': 282.826, 'highlights': ["The chapter explains the inefficiencies in the initial array configuration, including entire rows and columns of zeros, and proposes a more efficient 27x27 array with a special token 'dot'.", "The chapter details the process of sampling characters based on the model's probabilities and counts, using the example of sampling the first character of a name from the first row of counts.", 'The chapter discusses the structure of the array, showcasing the counts for all the first letters and the ending characters, providing the necessary information for sampling from the bigram character-level language model.']}, {'end': 1730.895, 'start': 1528.258, 'title': 'Converting counts to probabilities using torch', 'summary': 'Discusses converting raw counts to probabilities using torch, creating a proper probability distribution, and sampling from the distribution using torch.multinomial, ensuring deterministic results.', 'duration': 202.637, 'highlights': ['The chapter covers the process of converting raw counts to probabilities using torch, ensuring a proper probability distribution, and sampling from the distribution using torch.multinomial for deterministic results.', 'Demonstrates the creation of a probability vector by converting integers to float, and then normalizing the counts to obtain smaller numbers representing probabilities, with the sum of probabilities being equal to one.', 'Explains the use of torch.multinomial to draw samples from the probability distributions, specifying the number of samples, enabling replacement, 
and ensuring deterministic results by using a generator object.']}, {'end': 2176.186, 'start': 1731.396, 'title': 'Torchlight multinomial sampling', 'summary': 'Discusses the process of torchlight multinomial sampling, where a distribution is used to generate samples, with a focus on the probabilities of each element in the tensor and the challenges of bigram language model. the chapter also demonstrates the impact of training on model performance.', 'duration': 444.79, 'highlights': ['The probability for the first element in the tensor is 60%, with 60% of the 20 samples expected to be zero.', "The bigram language model generates names that are somewhat name-like but are largely nonsensical due to the model's lack of understanding of context.", 'Using a uniform distribution, where everything is equally likely, results in garbage output, indicating the impact of training on model performance.']}], 'duration': 1321.422, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo854764.jpg', 'highlights': ['Covering the process of converting raw counts to probabilities using torch', 'Creating a 28x28 array to represent counts of bigrams in a dataset', 'Visualizing the array using matplotlib to represent bigrams and their occurrences', 'Discussing the structure of the array, showcasing the counts for all the first letters and the ending characters', 'Manipulating arrays efficiently using tensors', "Detailing the process of sampling characters based on the model's probabilities and counts", 'Explaining the use of torch.multinomial to draw samples from the probability distributions', 'Demonstrating the creation of a probability vector by converting integers to float, and then normalizing the counts', 'Using a uniform distribution, where everything is equally likely, results in garbage output', "The bigram language model generates names that are somewhat name-like but largely nonsensical due to the model's lack of 
understanding of context"]}, {'end': 3013.989, 'segs': [{'end': 2220.183, 'src': 'embed', 'start': 2192.849, 'weight': 0, 'content': [{'end': 2196.77, 'text': "And we just keep renormalizing these rows over and over again and it's extremely inefficient and wasteful.", 'start': 2192.849, 'duration': 3.921}, {'end': 2203.451, 'text': "So what I'd like to do is I'd like to actually prepare a matrix capital P that will just have the probabilities in it.", 'start': 2197.51, 'duration': 5.941}, {'end': 2208.094, 'text': "So, in other words, it's going to be the same as the capital N matrix here of counts,", 'start': 2204.071, 'duration': 4.023}, {'end': 2216.12, 'text': 'but every single row will have the row of probabilities that is normalized to one, indicating the probability distribution for the next character,', 'start': 2208.094, 'duration': 8.026}, {'end': 2220.183, 'text': "given the character before it as defined by which row we're in.", 'start': 2216.12, 'duration': 4.063}], 'summary': 'Create a matrix p with probabilities normalized to one for efficient calculations.', 'duration': 27.334, 'max_score': 2192.849, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo2192849.jpg'}, {'end': 2552.53, 'src': 'embed', 'start': 2528.394, 'weight': 2, 'content': [{'end': 2536.077, 'text': "So if you just search broadcasting semantics in Torch, you'll notice that there's a special definition for what's called broadcasting,", 'start': 2528.394, 'duration': 7.683}, {'end': 2542.86, 'text': 'that for whether or not these two arrays can be combined in a binary operation like division.', 'start': 2536.077, 'duration': 6.783}, {'end': 2547.827, 'text': 'So the first condition is each tensor has at least one dimension, which is the case for us.', 'start': 2544.045, 'duration': 3.782}, {'end': 2552.53, 'text': 'And then when iterating over the dimension sizes, starting at the trailing dimension.', 'start': 2548.668, 'duration': 
3.862}], 'summary': 'Broadcasting semantics in torch define rules for combining arrays in binary operations like division.', 'duration': 24.136, 'max_score': 2528.394, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo2528394.jpg'}, {'end': 2672.218, 'src': 'embed', 'start': 2645.411, 'weight': 4, 'content': [{'end': 2649.213, 'text': "We expect this to be one, because it's not normalized.", 'start': 2645.411, 'duration': 3.802}, {'end': 2657.419, 'text': 'And then we expect this now, because if we actually correctly normalize all the rows, we expect to get the exact same result here.', 'start': 2650.454, 'duration': 6.965}, {'end': 2658.42, 'text': "So let's run this.", 'start': 2657.819, 'duration': 0.601}, {'end': 2660.121, 'text': "It's the exact same result.", 'start': 2659.34, 'duration': 0.781}, {'end': 2662.469, 'text': 'So this is correct.', 'start': 2661.468, 'duration': 1.001}, {'end': 2664.671, 'text': 'So now I would like to scare you a little bit.', 'start': 2663.049, 'duration': 1.622}, {'end': 2666.412, 'text': 'You actually have to like.', 'start': 2665.532, 'duration': 0.88}, {'end': 2672.218, 'text': 'I basically encourage you very strongly to read through broadcasting semantics and I encourage you to treat this with respect.', 'start': 2666.412, 'duration': 5.806}], 'summary': 'Normalization testing yields consistent results, emphasizing the importance of understanding broadcasting semantics.', 'duration': 26.807, 'max_score': 2645.411, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo2645411.jpg'}, {'end': 3013.989, 'src': 'embed', 'start': 2988.401, 'weight': 3, 'content': [{'end': 2994.526, 'text': "We don't want to be doing this here because this creates a completely new tensor that we store into P.", 'start': 2988.401, 'duration': 6.125}, {'end': 2996.907, 'text': 'We prefer to use in place operations if possible.', 'start': 
2994.526, 'duration': 2.381}, {'end': 3001.934, 'text': 'So this would be an in place operation has the potential to be faster.', 'start': 2997.948, 'duration': 3.986}, {'end': 3013.989, 'text': "It doesn't create new memory under the hood and then let's erase this we don't need it and let's also Just do fewer, just so I'm not wasting space.", 'start': 3001.934, 'duration': 12.055}], 'summary': 'Prefer in-place operations for faster performance and efficient memory usage.', 'duration': 25.588, 'max_score': 2988.401, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo2988401.jpg'}], 'start': 2176.686, 'title': 'Optimizing probability matrix calculation and tensor manipulation in torch', 'summary': 'Addresses inefficiency in fetching and recalculating rows of probabilities from a matrix by proposing the creation of a precomputed matrix capital p containing normalized probabilities, and discusses the practice of manipulating n-dimensional tensors, particularly focusing on broadcasting, and the importance of understanding broadcasting semantics in torch for efficient array operations.', 'chapters': [{'end': 2232.172, 'start': 2176.686, 'title': 'Optimizing probability matrix calculation', 'summary': 'Addresses the inefficiency in fetching and recalculating rows of probabilities from a matrix, proposing the creation of a precomputed matrix capital p containing normalized probabilities to optimize the process.', 'duration': 55.486, 'highlights': ['Creating a precomputed matrix capital P containing normalized probabilities will optimize the process by avoiding the repetitive renormalization of rows, enhancing efficiency and reducing wasteful computations.', 'The proposed approach will eliminate the need to fetch and recalculate rows of probabilities from the counts matrix, leading to a significant reduction in inefficient and wasteful computations.']}, {'end': 3013.989, 'start': 2232.192, 'title': 'Tensor manipulation 
and broadcasting in torch', 'summary': 'Discusses the practice of manipulating n-dimensional tensors, particularly focusing on broadcasting, and the importance of understanding broadcasting semantics in torch for efficient array operations, highlighting the potential bugs and the need for caution and respect in implementing broadcasting. it emphasizes the use of in place operations for efficiency.', 'duration': 781.797, 'highlights': ['The importance of understanding broadcasting semantics in torch for efficient array operations and the potential bugs related to it.', 'The significance of using in place operations for efficiency and memory management.', 'The need to normalize every single row through broadcasting and the potential bugs that can arise if broadcasting is not correctly implemented.']}], 'duration': 837.303, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo2176686.jpg', 'highlights': ['Creating a precomputed matrix capital P containing normalized probabilities will optimize the process by avoiding the repetitive renormalization of rows, enhancing efficiency and reducing wasteful computations.', 'The proposed approach will eliminate the need to fetch and recalculate rows of probabilities from the counts matrix, leading to a significant reduction in inefficient and wasteful computations.', 'The importance of understanding broadcasting semantics in torch for efficient array operations and the potential bugs related to it.', 'The significance of using in place operations for efficiency and memory management.', 'The need to normalize every single row through broadcasting and the potential bugs that can arise if broadcasting is not correctly implemented.']}, {'end': 4072.462, 'segs': [{'end': 3046.003, 'src': 'embed', 'start': 3014.81, 'weight': 4, 'content': [{'end': 3016.451, 'text': "Okay so we're actually in a pretty good spot now.", 'start': 3014.81, 'duration': 1.641}, {'end': 3027.254, 'text': 
'We trained a bigram language model and we trained it really just by counting how frequently any pairing occurs and then normalizing so that we get a nice probability distribution.', 'start': 3017.091, 'duration': 10.163}, {'end': 3033.616, 'text': 'So really these elements of this array p are really the parameters of our bigram language model,', 'start': 3028.054, 'duration': 5.562}, {'end': 3036.117, 'text': 'giving us and summarizing the statistics of these bigrams.', 'start': 3033.616, 'duration': 2.501}, {'end': 3039.899, 'text': 'So we train the model and then we know how to sample from the model.', 'start': 3037.077, 'duration': 2.822}, {'end': 3046.003, 'text': 'We just iteratively sample the next character and feed it in each time and get the next character.', 'start': 3040.259, 'duration': 5.744}], 'summary': 'Trained bigram language model for character prediction.', 'duration': 31.193, 'max_score': 3014.81, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo3014810.jpg'}, {'end': 3085.913, 'src': 'embed', 'start': 3059.119, 'weight': 6, 'content': [{'end': 3064.382, 'text': 'and as an example, so in the training set we can evaluate now the training loss.', 'start': 3059.119, 'duration': 5.263}, {'end': 3072.047, 'text': 'and this training loss is telling us about sort of the quality of this model in a single number, just like we saw in micrograd.', 'start': 3064.382, 'duration': 7.665}, {'end': 3077.191, 'text': "so let's try to think through the quality of the model and how we would evaluate it.", 'start': 3072.047, 'duration': 5.144}, {'end': 3082.134, 'text': "basically, what we're going to do is we're going to copy paste this code that we previously used for counting.", 'start': 3077.191, 'duration': 4.943}, {'end': 3085.913, 'text': 'okay, And let me just print these bigrams first.', 'start': 3082.134, 'duration': 3.779}], 'summary': 'Model training loss evaluates model quality, similar to 
micrograd, using a single number.', 'duration': 26.794, 'max_score': 3059.119, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo3059119.jpg'}, {'end': 3236.002, 'src': 'embed', 'start': 3205.979, 'weight': 8, 'content': [{'end': 3208.861, 'text': 'And so the product of all these probabilities is the likelihood.', 'start': 3205.979, 'duration': 2.882}, {'end': 3217.147, 'text': "And it's really telling us about the probability of the entire data set assigned by the model that we've trained.", 'start': 3209.361, 'duration': 7.786}, {'end': 3218.868, 'text': 'And that is a measure of quality.', 'start': 3217.787, 'duration': 1.081}, {'end': 3224.452, 'text': 'So the product of these should be as high as possible when you are training the model.', 'start': 3219.549, 'duration': 4.903}, {'end': 3228.275, 'text': 'And when you have a good model, your product of these probabilities should be very high.', 'start': 3224.552, 'duration': 3.723}, {'end': 3236.002, 'text': 'Now, because the product of these probabilities is an unwieldy thing to work with, you can see that all of them are between zero and one.', 'start': 3230.5, 'duration': 5.502}], 'summary': 'Product of probabilities measures quality, should be high when training a model.', 'duration': 30.023, 'max_score': 3205.979, 'thumbnail': ''}, {'end': 3453.488, 'src': 'embed', 'start': 3432.209, 'weight': 3, 'content': [{'end': 3442.204, 'text': 'and so the negative log likelihood is a very nice loss function, because the lowest it can get is zero, and the higher it is, the worse off.', 'start': 3432.209, 'duration': 9.995}, {'end': 3443.824, 'text': "the predictions are that you're making.", 'start': 3442.204, 'duration': 1.62}, {'end': 3451.407, 'text': 'And then one more modification to this that sometimes people do is that, for convenience, they actually like to normalize by.', 'start': 3444.825, 'duration': 6.582}, {'end': 3453.488, 'text': 'they like to make 
it an average instead of a sum.', 'start': 3451.407, 'duration': 2.081}], 'summary': 'Negative log likelihood is a loss function with minimum value of zero and worsens with higher predictions. some normalize by averaging instead of sum.', 'duration': 21.279, 'max_score': 3432.209, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo3432209.jpg'}, {'end': 3514.843, 'src': 'embed', 'start': 3482.407, 'weight': 1, 'content': [{'end': 3485.69, 'text': 'So our loss function for the training set assigned by the model is 2.4..', 'start': 3482.407, 'duration': 3.283}, {'end': 3487.892, 'text': "That's the quality of this model.", 'start': 3485.69, 'duration': 2.202}, {'end': 3490.594, 'text': 'And the lower it is, the better off we are.', 'start': 3488.673, 'duration': 1.921}, {'end': 3492.336, 'text': 'And the higher it is, the worse off we are.', 'start': 3490.815, 'duration': 1.521}, {'end': 3501.084, 'text': 'And the job of our training is to find the parameters that minimize the negative log likelihood loss.', 'start': 3493.557, 'duration': 7.527}, {'end': 3504.939, 'text': 'and that would be like a high quality model.', 'start': 3502.978, 'duration': 1.961}, {'end': 3507.38, 'text': 'Okay, so to summarize, I actually wrote it out here.', 'start': 3505.519, 'duration': 1.861}, {'end': 3514.843, 'text': 'So our goal is to maximize likelihood, which is the product of all the probabilities assigned by the model.', 'start': 3508.16, 'duration': 6.683}], 'summary': 'Training aims to minimize loss function for higher model quality and maximize likelihood of probabilities.', 'duration': 32.436, 'max_score': 3482.407, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo3482407.jpg'}, {'end': 3595.713, 'src': 'embed', 'start': 3569.58, 'weight': 7, 'content': [{'end': 3575.505, 'text': 'And so the optimization problem here and here are actually equivalent because this is 
just scaling.', 'start': 3569.58, 'duration': 5.925}, {'end': 3576.406, 'text': 'You can look at it that way.', 'start': 3575.625, 'duration': 0.781}, {'end': 3579.709, 'text': 'And so these are two identical optimization problems.', 'start': 3577.166, 'duration': 2.543}, {'end': 3585.254, 'text': 'Maximizing the log likelihood is equivalent to minimizing the negative log likelihood.', 'start': 3582.051, 'duration': 3.203}, {'end': 3593.052, 'text': 'And then in practice, people actually minimize the average negative log likelihood to get numbers like 2.4.', 'start': 3586.294, 'duration': 6.758}, {'end': 3595.713, 'text': 'And then this summarizes the quality of your model.', 'start': 3593.052, 'duration': 2.661}], 'summary': 'Maximizing log likelihood=minimizing negative log likelihood; average nll=2.4.', 'duration': 26.133, 'max_score': 3569.58, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo3569580.jpg'}, {'end': 3742.429, 'src': 'embed', 'start': 3712.239, 'weight': 2, 'content': [{'end': 3716.382, 'text': "and roughly what's happening is that we will, we will add some fake counts.", 'start': 3712.239, 'duration': 4.143}, {'end': 3721.106, 'text': 'so imagine adding a count of one to everything.', 'start': 3716.382, 'duration': 4.724}, {'end': 3727.899, 'text': 'so we add a count of one like this and then we recalculate the probabilities.', 'start': 3721.106, 'duration': 6.793}, {'end': 3730.201, 'text': "and that's model smoothing, and you can add as much as you like.", 'start': 3727.899, 'duration': 2.302}, {'end': 3733.022, 'text': 'you can add five, and that will give you a smoother model.', 'start': 3730.201, 'duration': 2.821}, {'end': 3742.429, 'text': "and the more you add here, the more uniform model you're going to have, and the less you add, the more peaked model you are going to have, of course.", 'start': 3733.022, 'duration': 9.407}], 'summary': 'Model smoothing involves adding counts to 
adjust probabilities for a smoother or more peaked model.', 'duration': 30.19, 'max_score': 3712.239, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo3712239.jpg'}, {'end': 3945.615, 'src': 'embed', 'start': 3915.51, 'weight': 0, 'content': [{'end': 3918.092, 'text': "because we have the loss function and we're going to minimize it.", 'start': 3915.51, 'duration': 2.582}, {'end': 3923.577, 'text': "So we're going to tune the weights so that the neural net is correctly predicting the probabilities for the next character.", 'start': 3918.552, 'duration': 5.025}, {'end': 3925.439, 'text': "So let's get started.", 'start': 3924.518, 'duration': 0.921}, {'end': 3934.068, 'text': 'The first thing I want to do is I want to compile the training set of this neural network, right? So create the training set of all the bigrams.', 'start': 3925.699, 'duration': 8.369}, {'end': 3945.615, 'text': "And here I'm going to copy paste this code because this code iterates over all the bigrams.", 'start': 3937.749, 'duration': 7.866}], 'summary': 'Minimize loss function by tuning weights for neural net to predict character probabilities.', 'duration': 30.105, 'max_score': 3915.51, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo3915510.jpg'}], 'start': 3014.81, 'title': 'Training bigram language models', 'summary': 'Discusses training a bigram language model by counting pair frequency, normalizing to obtain probability distribution, evaluating model quality using training loss, and culminating in a training set loss of 2.4. 
it also covers bigram character level language model optimization, negative log likelihood minimization, model evaluation, probability estimation, model smoothing, and transition to a neural network framework for language modeling.', 'chapters': [{'end': 3077.191, 'start': 3014.81, 'title': 'Bigram language model training', 'summary': "Discusses training a bigram language model by counting the frequency of pairings and normalizing to obtain a probability distribution, evaluating the model's quality using training loss as a single number.", 'duration': 62.381, 'highlights': ['The model is trained by counting the frequency of pairings and normalizing to obtain a probability distribution.', "Evaluating the model's quality using training loss as a single number."]}, {'end': 3548.19, 'start': 3077.191, 'title': 'Model training and evaluation', 'summary': 'Discusses the calculation of probabilities for bigrams, the use of log likelihood as a measure of model quality, and the derivation of negative log likelihood as a loss function for training, culminating in a training set loss of 2.4.', 'duration': 470.999, 'highlights': ['The product of all these probabilities is the likelihood, which should be as high as possible when training the model, and when the model is good, the product of these probabilities should be very high.', 'The negative log likelihood is a very nice loss function, as the lowest it can get is zero, and the higher it is, the worse off the predictions are that the model is making.', 'The loss function for the training set assigned by the model is 2.4, and the lower it is, the better off, indicating the quality of the model.']}, {'end': 4072.462, 'start': 3548.19, 'title': 'Bigram character level language model', 'summary': 'Discusses the optimization of likelihood, negative log likelihood minimization, model evaluation, probability estimation, model smoothing, training a bigram character level language model, and transitioning to a neural network 
framework for language modeling.', 'duration': 524.272, 'highlights': ['The quality of the model is summarized by the negative log likelihood, aiming to minimize it, with a target of 2.4 as a measure of model performance.', 'Model smoothing involves adding fake counts to achieve a smoother model, with the option to adjust the level of smoothing, ensuring no zero probabilities in the probability matrix, and eliminating instances of infinite loss.', 'Transitioning to a neural network framework for language modeling involves compiling the training set of all the bigrams, where the inputs and targets are denoted by integers, and using gradient-based optimization to tune the parameters of the network for correct probability prediction.', 'Maximizing the likelihood is equivalent to maximizing the log likelihood, and maximizing the log likelihood is equivalent to minimizing the negative log likelihood, which is practiced by minimizing the average negative log likelihood to attain numbers like 2.4 as a measure of model performance.', 'The optimization problem of maximizing the likelihood and maximizing the log likelihood are equivalent due to scaling, making them two identical optimization problems.']}], 'duration': 1057.652, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo3014810.jpg', 'highlights': ['Transitioning to a neural network framework for language modeling involves compiling the training set of all the bigrams, using gradient-based optimization to tune the parameters of the network for correct probability prediction', 'The loss function for the training set assigned by the model is 2.4, and the lower it is, the better off, indicating the quality of the model', 'Model smoothing involves adding fake counts to achieve a smoother model, ensuring no zero probabilities in the probability matrix, and eliminating instances of infinite loss', 'The negative log likelihood is a very nice loss function, as the lowest 
it can get is zero, and the higher it is, the worse off the predictions are that the model is making', 'The model is trained by counting the frequency of pairings and normalizing to obtain a probability distribution', 'The quality of the model is summarized by the negative log likelihood, aiming to minimize it, with a target of 2.4 as a measure of model performance', "Evaluating the model's quality using training loss as a single number", 'Maximizing the likelihood is equivalent to maximizing the log likelihood, and maximizing the log likelihood is equivalent to minimizing the negative log likelihood, which is practiced by minimizing the average negative log likelihood to attain numbers like 2.4 as a measure of model performance', 'The product of all these probabilities is the likelihood, which should be as high as possible when training the model, and when the model is good, the product of these probabilities should be very high']}, {'end': 5539.358, 'segs': [{'end': 4175.06, 'src': 'embed', 'start': 4123.72, 'weight': 2, 'content': [{'end': 4125.781, 'text': "it's just like it doesn't?", 'start': 4123.72, 'duration': 2.061}, {'end': 4126.781, 'text': "it doesn't make sense.", 'start': 4125.781, 'duration': 1}, {'end': 4132.883, 'text': 'so the actual difference, as far as I can tell, is explained eventually in this random thread that you can google, and really it comes down to.', 'start': 4126.781, 'duration': 6.102}, {'end': 4138.663, 'text': 'I believe that, um, where is this?', 'start': 4132.883, 'duration': 5.78}, {'end': 4144.566, 'text': 'torch.tensor infers the dtype, the data type, automatically, while torch.Tensor just returns a float tensor.', 'start': 4138.663, 'duration': 5.903}, {'end': 4147.907, 'text': 'I would recommend to stick to lowercase torch.tensor.', 'start': 4144.566, 'duration': 3.341}, {'end': 4156.593, 'text': 'So, um, indeed, we see that when I construct this with a capital T, the data type here of x is float32.', 'start': 4147.907, 
'duration': 8.686}, {'end': 4165.036, 'text': "But with lowercase torch.tensor, you see how x.dtype is now integer.", 'start': 4158.252, 'duration': 6.784}, {'end': 4173.379, 'text': "So it's advised that you use lowercase t, and you can read more about it if you like in some of these threads.", 'start': 4166.877, 'duration': 6.502}, {'end': 4175.06, 'text': 'But basically,', 'start': 4174.359, 'duration': 0.701}], 'summary': 'Use torch.tensor for automatic data type inference, not capital-T torch.Tensor.', 'duration': 51.34, 'max_score': 4123.72, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo4123720.jpg'}, {'end': 4269.802, 'src': 'embed', 'start': 4241.846, 'weight': 3, 'content': [{'end': 4246.209, 'text': "So instead, a common way of encoding integers is what's called one-hot encoding.", 'start': 4241.846, 'duration': 4.363}, {'end': 4256.857, 'text': 'In one-hot encoding, we take an integer like 13 and we create a vector that is all zeros except for the 13th dimension, which we turn to a one.', 'start': 4247.153, 'duration': 9.704}, {'end': 4260.018, 'text': 'And then that vector can feed into a neural net.', 'start': 4257.557, 'duration': 2.461}, {'end': 4269.802, 'text': 'Now, conveniently, PyTorch actually has something called the one_hot function inside torch.nn.functional.', 'start': 4261.239, 'duration': 8.563}], 'summary': 'One-hot encoding converts integers to vectors for neural nets.', 'duration': 27.956, 'max_score': 4241.846, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo4241846.jpg'}, {'end': 5103.389, 'src': 'embed', 'start': 5077.681, 'weight': 4, 'content': [{'end': 5082.907, 'text': 'we made sure that the output of this neural net now are probabilities, or we can interpret them to be probabilities.', 'start': 5077.681, 'duration': 5.226}, {'end': 5090.114, 'text': 'So our WX here gave us logits, and then we interpret those 
to be log counts.', 'start': 5084.188, 'duration': 5.926}, {'end': 5096.7, 'text': 'We exponentiate to get something that looks like counts, and then we normalize those counts to get a probability distribution.', 'start': 5090.875, 'duration': 5.825}, {'end': 5099.726, 'text': 'And all of these are differentiable operations.', 'start': 5097.645, 'duration': 2.081}, {'end': 5103.389, 'text': "So what we've done now is we are taking inputs,", 'start': 5100.527, 'duration': 2.862}], 'summary': 'Neural net output interpreted as probabilities through differentiable operations.', 'duration': 25.708, 'max_score': 5077.681, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo5077681.jpg'}, {'end': 5231.108, 'src': 'embed', 'start': 5198.755, 'weight': 5, 'content': [{'end': 5205.518, 'text': "And I'm generating weights for 27 neurons, and each neuron here receives 27 inputs.", 'start': 5198.755, 'duration': 6.763}, {'end': 5212.661, 'text': "Then here we're going to plug in all the input examples, Xs, into the neural net.", 'start': 5208.739, 'duration': 3.922}, {'end': 5214.402, 'text': 'So here, this is a forward pass.', 'start': 5212.761, 'duration': 1.641}, {'end': 5219.564, 'text': 'First, we have to encode all of the inputs into one-hot representations.', 'start': 5215.822, 'duration': 3.742}, {'end': 5231.108, 'text': 'So we have 27 classes, we pass in these integers, and xenc becomes an array that is five by 27, zeros except for a few ones.', 'start': 5220.566, 'duration': 10.542}], 'summary': 'Generating 27 neuron weights with 27 inputs for a forward pass into a neural network.', 'duration': 32.353, 'max_score': 5198.755, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo5198755.jpg'}, {'end': 5517.571, 'src': 'embed', 'start': 5489.717, 'weight': 0, 'content': [{'end': 5494.199, 'text': 'just the one word is 3.76, which is actually a fairly high loss.', 'start': 
5489.717, 'duration': 4.482}, {'end': 5495.98, 'text': 'This is not a very good setting of Ws.', 'start': 5494.26, 'duration': 1.72}, {'end': 5497.781, 'text': "Now here's what we can do.", 'start': 5497.021, 'duration': 0.76}, {'end': 5501.428, 'text': "We're currently getting 3.76.", 'start': 5498.782, 'duration': 2.646}, {'end': 5504.369, 'text': 'We can actually come here and we can change our W.', 'start': 5501.428, 'duration': 2.941}, {'end': 5505.269, 'text': 'We can resample it.', 'start': 5504.369, 'duration': 0.9}, {'end': 5507.949, 'text': 'So let me just add one to have a different seed.', 'start': 5505.769, 'duration': 2.18}, {'end': 5510.69, 'text': 'And then we get a different W.', 'start': 5508.869, 'duration': 1.821}, {'end': 5511.79, 'text': 'And then we can rerun this.', 'start': 5510.69, 'duration': 1.1}, {'end': 5517.571, 'text': "And with this different setting of Ws, we now get 3.37.", 'start': 5513.01, 'duration': 4.561}], 'summary': 'Initial loss of 3.76 reduced to 3.37 by changing the W setting', 'duration': 27.854, 'max_score': 5489.717, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo5489717.jpg'}], 'start': 4075.188, 'title': 'Neural network operations', 'summary': 'Discusses torch tensor usage, integer encoding for neural networks, probability distribution transformations, and forward pass calculations, achieving an average loss of 3.76 and an improved loss of 3.37.', 'chapters': [{'end': 4196.453, 'start': 4075.188, 'title': 'Torch tensor caution', 'summary': 'The chapter discusses the caution required when using torch.tensor and torch.Tensor, highlighting the differences in data types and the recommendation to use lowercase t for inferring the data type automatically.', 'duration': 121.265, 'highlights': ['The difference between torch.tensor and torch.Tensor is not clearly explained in the documentation, leading to confusion among users.', 'Using lowercase torch.tensor infers the data type 
automatically, while capital-T torch.Tensor returns a float tensor, as demonstrated by the examples with float32 and integer data types for x.', 'It is recommended to stick to lowercase torch.tensor due to its ability to automatically infer the data type, providing a clearer and more intuitive approach for users.']}, {'end': 4650.71, 'start': 4198.194, 'title': 'Encoding integers into vectors for neural networks', 'summary': "The chapter discusses encoding integers into one-hot vectors for feeding into a neural network, utilizing PyTorch's one_hot function, handling data types, and conducting matrix multiplication to evaluate neuron activations on input examples.", 'duration': 452.516, 'highlights': ["PyTorch's one_hot function is used to encode integers into one-hot vectors, with the resulting shape of the encoded vectors being 5 by 27.", "Data type handling is crucial as integers need to be converted to floating-point numbers for feeding into neural nets, and PyTorch's one_hot function returns a 64-bit integer, necessitating a conversion to float32 for compatibility.", 'Matrix multiplication is employed to evaluate neuron activations on the input examples, resulting in a 5 by 27 output representing the firing rate of 27 neurons on each of the five examples.']}, {'end': 5176.835, 'start': 4652.051, 'title': 'Neural net probability distribution', 'summary': 'The chapter discusses the process of transforming the output of a neural net into a probability distribution, utilizing logarithmic counts, exponentiation, and normalization, resulting in differentiable operations that yield probability distributions.', 'duration': 524.784, 'highlights': ['The neural net output is transformed into log counts, which are then exponentiated to obtain counts, and subsequently normalized to form a probability distribution, enabling the differentiation of operations.', 'The process of transforming the neural net output into probability distributions involves interpreting the output as log counts, 
exponentiating the counts, and then normalizing them to obtain probabilities.', "The neural net's output is interpreted as log counts, which are exponentiated to yield counts, and then normalized to generate a probability distribution, allowing for the differentiation of operations.", 'The neural net output undergoes a transformation to obtain probability distributions by interpreting the output as log counts, exponentiating the counts, and normalizing them, facilitating differentiable operations.']}, {'end': 5539.358, 'start': 5177.355, 'title': 'Neural net forward pass and loss calculation', 'summary': 'Details the process of a forward pass in a neural net, including encoding inputs, calculating probabilities using softmax, and evaluating loss based on negative log likelihood, with an average loss of 3.76 and an improved loss of 3.37 after adjusting the weights.', 'duration': 362.003, 'highlights': ['Neural net forward pass involves encoding inputs into one hot representations, calculating probabilities using softmax, and evaluating loss based on negative log likelihood.', 'The average negative log likelihood, which is the loss, for the neural net on a specific word is 3.76, indicating a high loss.', "Adjusting the weights results in an improved loss of 3.37, demonstrating the impact of weight adjustments on the neural net's performance.", "The process involves backpropagation through differentiable operations such as multiplication, addition, exponentiation, summation, and division, enabling learning and adjustment of the neural net's weights and biases.", "The neural net assigns probabilities to the correct characters, with varying likelihoods and negative log likelihoods, influencing the overall loss and the network's performance."]}], 'duration': 1464.17, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo4075188.jpg', 'highlights': ['The average loss for the neural net is 3.76', 'The improved loss after 
weight adjustment is 3.37', 'The lowercase torch.tensor infers the data type automatically', "PyTorch's one_hot function encodes integers into 5 by 27 vectors", 'Neural net output is transformed into log counts and normalized to form probability distributions', 'Neural net forward pass involves encoding inputs into one-hot representations, calculating probabilities using softmax, and evaluating loss based on negative log likelihood', 'The difference between torch.tensor and torch.Tensor is not clearly explained in the documentation']}, {'end': 6457.13, 'segs': [{'end': 5591.204, 'src': 'embed', 'start': 5559.25, 'weight': 2, 'content': [{'end': 5564.513, 'text': "The way you optimize a neural net is you start with some random guess, and we're gonna commit to this one, even though it's not very good.", 'start': 5559.25, 'duration': 5.263}, {'end': 5567.315, 'text': 'But now the big deal is we have a loss function.', 'start': 5565.314, 'duration': 2.001}, {'end': 5573.12, 'text': 'So this loss is made up only of differentiable operations.', 'start': 5568.436, 'duration': 4.684}, {'end': 5583.7, 'text': 'And we can minimize the loss by tuning Ws by computing the gradients of the loss with respect to these W matrices.', 'start': 5574.375, 'duration': 9.325}, {'end': 5591.204, 'text': 'And so then we can tune W to minimize the loss and find a good setting of W using gradient-based optimization.', 'start': 5585.221, 'duration': 5.983}], 'summary': 'Neural net optimization involves minimizing loss using gradient-based optimization.', 'duration': 31.954, 'max_score': 5559.25, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo5559250.jpg'}, {'end': 5739.572, 'src': 'embed', 'start': 5711.255, 'weight': 1, 'content': [{'end': 5714.017, 'text': 'grad started at zero, but backpropagation filled it in.', 'start': 5711.255, 'duration': 2.762}, {'end': 5715.698, 'text': 'And then in the update,', 'start': 5714.677, 'duration':
1.021}, {'end': 5726.426, 'text': 'we iterated all the parameters and we simply did a parameter update where every single element of our parameters was notched in the opposite direction of the gradient.', 'start': 5715.698, 'duration': 10.728}, {'end': 5730.825, 'text': "And so we're going to do the exact same thing here.", 'start': 5727.682, 'duration': 3.143}, {'end': 5739.572, 'text': "So I'm going to pull this up on the side here so that we have it available.", 'start': 5730.845, 'duration': 8.727}], 'summary': 'Backpropagation filled in gradients to update parameters.', 'duration': 28.317, 'max_score': 5711.255, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo5711255.jpg'}, {'end': 6340.22, 'src': 'embed', 'start': 6309.901, 'weight': 0, 'content': [{'end': 6313.423, 'text': "And so that's actually roughly the vicinity of what we expect to achieve.", 'start': 6309.901, 'duration': 3.522}, {'end': 6320.346, 'text': 'But before we achieved it by counting, and here we are achieving roughly the same result, but with gradient-based optimization.', 'start': 6313.943, 'duration': 6.403}, {'end': 6325.629, 'text': 'So we come to about 2.46, 2.45, etc.', 'start': 6321.107, 'duration': 4.522}, {'end': 6329.751, 'text': "And that makes sense because fundamentally, we're not taking in any additional information.", 'start': 6326.449, 'duration': 3.302}, {'end': 6333.153, 'text': "We're still just taking in the previous character and trying to predict the next one.", 'start': 6330.051, 'duration': 3.102}, {'end': 6340.22, 'text': 'But instead of doing it explicitly by counting and normalizing, we are doing it with gradient-based learning.', 'start': 6333.773, 'duration': 6.447}], 'summary': 'Achieving 2.46, 2.45, etc.
with gradient-based learning, similar to the previous counting method.', 'duration': 30.319, 'max_score': 6309.901, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo6309901.jpg'}, {'end': 6422.359, 'src': 'embed', 'start': 6392.802, 'weight': 3, 'content': [{'end': 6395.163, 'text': 'And those logits will go through the exact same transformation.', 'start': 6392.802, 'duration': 2.361}, {'end': 6403.046, 'text': 'We are going to take them through a softmax, calculate the loss function and the negative log likelihood, and do gradient-based optimization.', 'start': 6395.643, 'duration': 7.403}, {'end': 6411.789, 'text': 'And so actually, as we complexify the neural nets and work all the way up to transformers, none of this will really fundamentally change.', 'start': 6403.746, 'duration': 8.043}, {'end': 6413.31, 'text': 'None of this will fundamentally change.', 'start': 6412.09, 'duration': 1.22}, {'end': 6417.533, 'text': 'The only thing that will change is the way we do the forward pass,', 'start': 6413.75, 'duration': 3.783}, {'end': 6422.359, 'text': 'where we take in some previous characters and calculate logits for the next character in a sequence.', 'start': 6417.533, 'duration': 4.826}], 'summary': 'Neural nets will undergo softmax transformation, loss calculation, and gradient-based optimization; only the forward pass changes as the model is complexified.', 'duration': 29.557, 'max_score': 6392.802, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo6392802.jpg'}], 'start': 5541.06, 'title': 'Neural net and language model optimization', 'summary': 'Covers the optimization of neural nets by minimizing loss through gradient-based optimization and backpropagation, resulting in a decrease in loss from 3.76 to 3.72.
Additionally, it discusses the optimization of a bigram language model, initially working with 5 bigrams but later expanding to 228,000 bigrams, achieving a loss of around 2.46 through gradient-based optimization and the potential for further complexity in neural net design.', 'chapters': [{'end': 5739.572, 'start': 5541.06, 'title': 'Neural net optimization', 'summary': 'Discusses the process of optimizing a neural net by minimizing the loss through gradient-based optimization, using differentiable operations and backpropagation to update parameters for improved performance.', 'duration': 198.512, 'highlights': ['The process of optimizing a neural net involves minimizing the loss by tuning Ws through gradient-based optimization and differentiable operations.', 'The loss function is made up only of differentiable operations, and the gradients of the loss with respect to the W matrices are computed to tune W and minimize the loss.', 'The neural net is updated through backpropagation, where the gradients are set to zero, and then loss.backward is called to initiate backpropagation for parameter updates.']}, {'end': 6178.602, 'start': 5739.992, 'title': 'Neural network training process', 'summary': 'Covers the process of calculating loss using negative log likelihood for classification, accessing and updating gradients for weight optimization, resulting in a decreasing loss from 3.76 to 3.72 through gradient descent.', 'duration': 438.61, 'highlights': ['Calculating Loss using Negative Log Likelihood', 'Accessing and Updating Gradients for Weight Optimization', 'Efficient Way to Access Probabilities using PyTorch']}, {'end': 6457.13, 'start': 6179.583, 'title': 'Bigram language model optimization', 'summary': 'Discusses the optimization of a bigram language model, initially working with 5 bigrams but later expanding to 228,000 bigrams, achieving a loss of around 2.46 through gradient-based optimization and the potential for further complexity in neural net design.',
'duration': 277.547, 'highlights': ['The number of examples is explicitly stated to be five when working with Emma, and later expands to 228,000 bigrams.', 'The optimization process using gradient descent leads to a decrease in loss, achieving roughly 2.46, showcasing the efficiency of the gradient-based approach.', 'The chapter discusses the potential for complexity in neural net design, with the prospect of working up to transformers while maintaining the same fundamental gradient-based optimization process.']}], 'duration': 916.07, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo5541060.jpg', 'highlights': ['The optimization process using gradient descent leads to a decrease in loss, achieving roughly 2.46', 'The neural net is updated through backpropagation, where the gradients are set to zero', 'The loss function is made up only of differentiable operations, and the gradients of the loss with respect to the W matrices are computed', 'The chapter discusses the potential for complexity in neural net design, with the prospect of working up to transformers']}, {'end': 7064.257, 'segs': [{'end': 6482.238, 'src': 'embed', 'start': 6457.53, 'weight': 0, 'content': [{'end': 6463.212, 'text': 'So this is fundamentally an unscalable approach, and the neural network approach is significantly more scalable,', 'start': 6457.53, 'duration': 5.682}, {'end': 6466.353, 'text': "and it's something that actually we can improve on over time.", 'start': 6463.212, 'duration': 3.141}, {'end': 6468.314, 'text': "So that's where we will be digging next.", 'start': 6466.874, 'duration': 1.44}, {'end': 6470.375, 'text': 'I wanted to point out two more things.',
'start': 6468.654, 'duration': 1.721}, {'end': 6478.477, 'text': 'Number one, I want you to notice that this x-enc here, this is made up of one-hot vectors.', 'start': 6471.235, 'duration': 7.242}, {'end': 6482.238, 'text': 'And then those one-hot vectors are multiplied by this W matrix.', 'start': 6479.097, 'duration': 3.141}], 'summary': 'Neural network approach is more scalable than the unscalable approach, using one-hot vectors and w matrix.', 'duration': 24.708, 'max_score': 6457.53, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo6457530.jpg'}, {'end': 6520.2, 'src': 'embed', 'start': 6497.881, 'weight': 1, 'content': [{'end': 6508.049, 'text': 'then because of the way the matrix multiplication works, Multiplying that one-hot vector with W actually ends up plucking out the fifth row of W.', 'start': 6497.881, 'duration': 10.168}, {'end': 6517.078, 'text': "Logits would become just the fifth row of W, and that's because of the way the matrix multiplication works.", 'start': 6508.049, 'duration': 9.029}, {'end': 6520.2, 'text': "So that's actually what ends up happening.", 'start': 6517.078, 'duration': 3.122}], 'summary': 'Matrix multiplication selects the fifth row of w as logits.', 'duration': 22.319, 'max_score': 6497.881, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo6497881.jpg'}, {'end': 6642.079, 'src': 'heatmap', 'start': 6567.612, 'weight': 1, 'content': [{'end': 6573.794, 'text': 'So this w here is literally the same as this array here.', 'start': 6567.612, 'duration': 6.182}, {'end': 6578.696, 'text': 'But w, remember, is the log counts, not the counts.', 'start': 6575.194, 'duration': 3.502}, {'end': 6585.438, 'text': "So it's more precise to say that w exponentiated, w.exp, is this array.", 'start': 6579.036, 'duration': 6.402}, {'end': 6594.017, 'text': 'But this array was filled in by counting and by basically populating the counts 
of bigrams,', 'start': 6586.331, 'duration': 7.686}, {'end': 6602.484, 'text': 'whereas in the gradient-based framework we initialize it randomly and then we let the loss guide us to arrive at the exact same array.', 'start': 6594.017, 'duration': 8.467}, {'end': 6613.893, 'text': 'So this array exactly here is basically the array W at the end of optimization, except we arrived at it piece by piece by following the loss.', 'start': 6603.345, 'duration': 10.548}, {'end': 6617.662, 'text': "And that's why we also obtain the same loss function at the end.", 'start': 6615.14, 'duration': 2.522}, {'end': 6621.805, 'text': 'And the second note is, if I come here, remember the smoothing,', 'start': 6618.062, 'duration': 3.743}, {'end': 6630.291, 'text': 'where we added fake counts to our counts in order to smooth out and make more uniform the distributions of these probabilities.', 'start': 6621.805, 'duration': 8.486}, {'end': 6636.375, 'text': 'And that prevented us from assigning zero probability to any one bigram.', 'start': 6631.191, 'duration': 5.184}, {'end': 6642.079, 'text': "Now, if I increase the count here, what's happening to the probability?", 'start': 6637.295, 'duration': 4.784}], 'summary': 'W exponentiated is the array filled in by counting bigrams, with loss guiding to the same array. 
adding fake counts prevents zero probability.', 'duration': 74.467, 'max_score': 6567.612, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo6567612.jpg'}, {'end': 6630.291, 'src': 'embed', 'start': 6603.345, 'weight': 2, 'content': [{'end': 6613.893, 'text': 'So this array exactly here is basically the array W at the end of optimization, except we arrived at it piece by piece by following the loss.', 'start': 6603.345, 'duration': 10.548}, {'end': 6617.662, 'text': "And that's why we also obtain the same loss function at the end.", 'start': 6615.14, 'duration': 2.522}, {'end': 6621.805, 'text': 'And the second note is, if I come here, remember the smoothing,', 'start': 6618.062, 'duration': 3.743}, {'end': 6630.291, 'text': 'where we added fake counts to our counts in order to smooth out and make more uniform the distributions of these probabilities.', 'start': 6621.805, 'duration': 8.486}], 'summary': 'Optimization process yields array w with same loss function, using smoothing for uniform distributions.', 'duration': 26.946, 'max_score': 6603.345, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo6603345.jpg'}, {'end': 6821.542, 'src': 'embed', 'start': 6792.219, 'weight': 4, 'content': [{'end': 6794.962, 'text': 'And now this optimization actually has two components.', 'start': 6792.219, 'duration': 2.743}, {'end': 6799.367, 'text': 'Not only is it trying to make all the probabilities work out, but in addition to that,', 'start': 6795.423, 'duration': 3.944}, {'end': 6803.432, 'text': "there's an additional component that simultaneously tries to make all w's be zero.", 'start': 6799.367, 'duration': 4.065}, {'end': 6806.115, 'text': "Because if w's are not zero, you feel a loss.", 'start': 6804.012, 'duration': 2.103}, {'end': 6809.979, 'text': 'And so minimizing this, the only way to achieve that is for w to be zero.', 'start': 6806.375, 'duration': 
3.604}, {'end': 6817.281, 'text': 'And so you can think of this as adding like a spring force or like a gravity force that pushes W to be zero.', 'start': 6810.8, 'duration': 6.481}, {'end': 6821.542, 'text': 'So W wants to be zero and the probabilities want to be uniform,', 'start': 6817.961, 'duration': 3.581}], 'summary': "Optimization aims to make probabilities uniform and w's zero to minimize loss.", 'duration': 29.323, 'max_score': 6792.219, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo6792219.jpg'}, {'end': 6963.753, 'src': 'embed', 'start': 6941.482, 'weight': 3, 'content': [{'end': 6949.807, 'text': 'So if I run this, kind of anticlimactic or climatic, depending how you look at it, but we get the exact same result.', 'start': 6941.482, 'duration': 8.325}, {'end': 6954.389, 'text': "And that's because this is the identical model.", 'start': 6952.408, 'duration': 1.981}, {'end': 6963.753, 'text': "Not only does it achieve the same loss, but as I mentioned, these are identical models, and this W is the log counts of what we've estimated before.", 'start': 6954.769, 'duration': 8.984}], 'summary': 'Running the identical model yields the same results, with w representing the log counts of previous estimates.', 'duration': 22.271, 'max_score': 6941.482, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo6941482.jpg'}], 'start': 6457.53, 'title': 'Neural network scalability and regularization', 'summary': 'Discusses the scalability of neural network approach, use of one-hot vectors and matrix multiplication, and the concept of label smoothing and regularization in neural networks, illustrating the achieved results through gradient-based optimization and different training methods.', 'chapters': [{'end': 6621.805, 'start': 6457.53, 'title': 'Neural network approach and matrix multiplication', 'summary': 'Discusses the scalability of the neural network 
approach, the use of one-hot vectors and matrix multiplication in obtaining logits, and the process of obtaining the same array through gradient-based optimization.', 'duration': 164.275, 'highlights': ['The neural network approach is significantly more scalable than the current approach.', 'The process of obtaining logits through matrix multiplication using one-hot vectors.', 'The process of obtaining the same array through gradient-based optimization.']}, {'end': 7064.257, 'start': 6621.805, 'title': 'Regularization in neural networks', 'summary': 'Discusses the concept of label smoothing and regularization in neural networks, demonstrating how adding fake counts to the counts matrix and applying a regularization loss can lead to a more uniform distribution of probabilities, ultimately achieving the same results through different training methods.', 'duration': 442.452, 'highlights': ['The concept of label smoothing is demonstrated through adding fake counts to the counts matrix to achieve a more uniform distribution of probabilities.', 'Initializing the weights (Ws) to be zero results in probabilities turning out to be exactly uniform, equivalent to label smoothing.', 'The addition of a regularization loss in the loss function incentivizes the weights to be near zero, which leads to a more smooth distribution.', 'The optimization process in the gradient-based framework involves both making the probabilities uniform and simultaneously pushing the weights (Ws) towards zero, controlled by the regularization strength.', 'Training the model using two different methods, counting up the frequency of bigrams and using the negative log likelihood loss in a gradient-based framework, yields the same result and model.']}], 'duration': 606.727, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/PaCmpygFfXo/pics/PaCmpygFfXo6457530.jpg', 'highlights': ['The neural network approach is significantly more scalable than the current approach.', 'The 
process of obtaining logits through matrix multiplication using one-hot vectors.', 'The process of obtaining the same array through gradient-based optimization.', 'Training the model using two different methods yields the same result and model.', 'The optimization process involves making the probabilities uniform and pushing the weights towards zero.']}], 'highlights': ['MakeMore learns to generate unique names from a dataset of 32,000 names, offering potential assistance in finding new and distinct names. (Relevance Score: 5)', "The dataset used to train MakeMore consists of 32,000 randomly sourced names from a government website, showcasing the model's ability to learn and generate new name-like variations. (Relevance Score: 4)", 'Creating a precomputed matrix capital P containing normalized probabilities will optimize the process by avoiding the repetitive renormalization of rows, enhancing efficiency and reducing wasteful computations. (Relevance Score: 3)', 'Transitioning to a neural network framework for language modeling involves compiling the training set of all the bigrams, using gradient-based optimization to tune the parameters of the network for correct probability prediction. (Relevance Score: 2)', 'The optimization process using gradient descent leads to a decrease in loss, achieving roughly 2.46. (Relevance Score: 1)']}
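The torch.tensor vs. torch.Tensor distinction discussed in the summaries above can be verified directly. A minimal sketch, assuming PyTorch's default dtype settings (the example values are arbitrary):

```python
import torch

# torch.tensor (lowercase) infers the dtype from the data:
xi = torch.tensor([1, 2, 3])    # integer data  -> int64
xf = torch.tensor([1.0, 2.0])   # float data    -> float32

# torch.Tensor (capital T) is a legacy constructor that always
# produces the default float dtype, regardless of the data:
xc = torch.Tensor([1, 2, 3])    # -> float32

print(xi.dtype, xf.dtype, xc.dtype)
```

This is why the video recommends sticking with the lowercase torch.tensor: the dtype follows the data instead of being silently coerced to float.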
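The forward pass described above (one-hot encoding, matrix multiply for logits, softmax, negative log likelihood) can be sketched as follows. This assumes the video's character mapping ('.'=0, a=1, ..., z=26), under which the five bigrams of ".emma." give the 5-example training set; the seed is the one used in the video, but the exact loss value depends on RNG ordering, so it is not asserted here:

```python
import torch
import torch.nn.functional as F

# The 5 bigrams of ".emma." with '.'=0, a=1, ..., z=26:
xs = torch.tensor([0, 5, 13, 13, 1])   # inputs:  . e m m a
ys = torch.tensor([5, 13, 13, 1, 0])   # targets: e m m a .

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g)

# one_hot returns int64, so cast to float before the matrix multiply.
xenc = F.one_hot(xs, num_classes=27).float()   # shape (5, 27)

logits = xenc @ W          # (5, 27) log-counts
# Multiplying a one-hot vector by W just plucks out the matching row:
# example 1 has input index 5, so its logits are row 5 of W.
assert torch.allclose(logits[1], W[5])

# Softmax: exponentiate to get counts, normalize to get probabilities.
counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)   # each row sums to 1

# Average negative log likelihood of the correct next characters.
loss = -probs[torch.arange(5), ys].log().mean()
print(loss.item())
```

The inline assertion demonstrates the "two more things" observation from the transcript: with one-hot inputs, the matrix multiply is just a row lookup into W.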