title
Building makemore Part 5: Building a WaveNet

description
We take the 2-layer MLP from the previous video and make it deeper with a tree-like structure, arriving at a convolutional neural network architecture similar to WaveNet (2016) from DeepMind. In the WaveNet paper, the same hierarchical architecture is implemented more efficiently using causal dilated convolutions (not yet covered). Along the way we get a better sense of torch.nn, what it is and how it works under the hood, and of what a typical deep learning development process looks like (a lot of reading of documentation, keeping track of multidimensional tensor shapes, moving between Jupyter notebooks and repository code, ...).

Links:
- makemore on GitHub: https://github.com/karpathy/makemore
- Jupyter notebook built in this video: https://github.com/karpathy/nn-zero-to-hero/blob/master/lectures/makemore/makemore_part5_cnn1.ipynb
- Colab notebook: https://colab.research.google.com/drive/1CXVEmCO_7r7WYZGb5qnjfyxTvQa13g5X?usp=sharing
- my website: https://karpathy.ai
- my twitter: https://twitter.com/karpathy
- our Discord channel: https://discord.gg/3zy8kqD9Cp

Supplementary links:
- WaveNet 2016 from DeepMind: https://arxiv.org/abs/1609.03499
- Bengio et al. 2003 MLP LM: https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

Chapters:
intro
- 00:00:00 intro
- 00:01:40 starter code walkthrough
- 00:06:56 let's fix the learning rate plot
- 00:09:16 pytorchifying our code: layers, containers, torch.nn, fun bugs
implementing wavenet
- 00:17:11 overview: WaveNet
- 00:19:33 dataset bump the context size to 8
- 00:19:55 re-running baseline code on block_size 8
- 00:21:36 implementing WaveNet
- 00:37:41 training the WaveNet: first pass
- 00:38:50 fixing batchnorm1d bug
- 00:45:21 re-training WaveNet with bug fix
- 00:46:07 scaling up our WaveNet
conclusions
- 00:46:58 experimental harness
- 00:47:44 WaveNet but with "dilated causal convolutions"
- 00:51:34 torch.nn
- 00:52:28 the development process of building deep neural nets
- 00:54:17 going forward
- 00:55:26 improve on my loss! how far can we improve a WaveNet on this data?

detail
Overall summary: The series continues with the multi-layer perceptron character-level language model from the previous parts. The code is first "PyTorch-ified" into layers and containers, then the network is made deeper with a tree-like hierarchy in the style of WaveNet. Along the way a subtle BatchNorm1d bug is found and fixed, the context length is bumped from 3 to 8 characters, and the validation loss improves from 2.10 to 2.02, and to 1.993 after scaling the network up.

Chapter 1 (00:00-02:32): Implementing the makemore language model
Summary: The starting point is the multi-layer perceptron character-level language model from part 3. The goal is to complexify the architecture: take more than three characters of context as input, and stop feeding them all into a single hidden layer, which squashes too much information too quickly. Instead, a deeper model should progressively fuse the context to make its guess about the next character in the sequence. The modeling setup stays identical: an autoregressive model predicting the next character, but with a hierarchical, tree-like architecture that, as it gets more complex, ends up looking very much like WaveNet.
Key points:
- The dataset consists of 182,000 examples of three characters trying to predict the fourth; every word is broken up into these small prediction problems.
- The starter code is very similar to where part 3 ended, built around layer modules such as a Linear class.
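For reference, here is a minimal sketch of how that dataset of (three-character context, next character) examples is built, in the style of the earlier makemore parts; the names.txt file, the '.' boundary token, and the variable names are assumptions carried over from those parts rather than code shown in this video.

```python
import torch

# Assumes a names.txt with one name per line, as in the earlier makemore videos.
words = open('names.txt', 'r').read().splitlines()
chars = sorted(list(set(''.join(words))))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi['.'] = 0                      # '.' marks the start/end of a word
itos = {i: s for s, i in stoi.items()}

def build_dataset(words, block_size=3):
    X, Y = [], []
    for w in words:
        context = [0] * block_size          # pad with '.' tokens
        for ch in w + '.':
            X.append(context)               # block_size characters of context
            Y.append(stoi[ch])              # the character to predict
            context = context[1:] + [stoi[ch]]  # slide the window
    return torch.tensor(X), torch.tensor(Y)

X, Y = build_dataset(words)
print(X.shape, Y.shape)  # (N, 3) contexts and (N,) targets; the lecture's training split has 182,000 such rows
```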
Chapter 2 (02:33-09:24): Starter code walkthrough: layer modules and a better loss plot
Summary: The layer modules developed in part 3 (Linear, BatchNorm1d, Tanh) were deliberately given APIs and signatures very similar to their counterparts in torch.nn; for example, our Linear mirrors torch.nn.Linear in both signature and, as far as we are aware, functionality. BatchNorm deserves special attention because it carries most of the complexity:
- It has a training flag, so it behaves differently at training time and at evaluation time; forgetting to put it into the right mode is a common source of bugs.
- It couples the computation across the examples in a batch. Normally the batch is just an efficiency device, but BatchNorm uses it to control the activation statistics.
- Its running mean and variance need time to settle into a steady state during training.
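To make the torch.nn parallel concrete, here is a condensed sketch of hand-rolled Linear, BatchNorm1d, and Tanh modules in the spirit of the lecture's layer code; details such as the initialization constants are illustrative, and the notebook linked above has the authoritative versions. Note that this BatchNorm1d reduces over dimension 0 only, which is exactly the subtlety that resurfaces as a bug later in the video.

```python
import torch

class Linear:
    def __init__(self, fan_in, fan_out, bias=True):
        self.weight = torch.randn(fan_in, fan_out) / fan_in**0.5   # scaled init
        self.bias = torch.zeros(fan_out) if bias else None
    def __call__(self, x):
        self.out = x @ self.weight
        if self.bias is not None:
            self.out += self.bias
        return self.out
    def parameters(self):
        return [self.weight] + ([] if self.bias is None else [self.bias])

class BatchNorm1d:
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps, self.momentum = eps, momentum
        self.training = True                                        # the train/eval flag discussed above
        self.gamma, self.beta = torch.ones(dim), torch.zeros(dim)
        self.running_mean, self.running_var = torch.zeros(dim), torch.ones(dim)
    def __call__(self, x):
        if self.training:
            xmean = x.mean(0, keepdim=True)                         # statistics coupled across the batch
            xvar = x.var(0, keepdim=True)
        else:
            xmean, xvar = self.running_mean, self.running_var
        self.out = self.gamma * (x - xmean) / torch.sqrt(xvar + self.eps) + self.beta
        if self.training:
            with torch.no_grad():                                   # running stats settle into a steady state
                self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * xmean
                self.running_var = (1 - self.momentum) * self.running_var + self.momentum * xvar
        return self.out
    def parameters(self):
        return [self.gamma, self.beta]

class Tanh:
    def __call__(self, x):
        self.out = torch.tanh(x)
        return self.out
    def parameters(self):
        return []
```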
Back in the starter code, the torch RNG is now initialized just once, globally, for simplicity (instead of passing generators around). The network itself should look familiar: an embedding table C followed by a list of layers, Linear -> BatchNorm1d -> Tanh -> a final Linear output layer. Two practical observations from re-running it:
- The loss curve is very noisy ("thick") because a batch size of 32 is too few examples; any single batch can be very lucky or unlucky. The plot is fixed by averaging the logged losses: reshaping the list of per-step losses into rows of 1,000 consecutive steps and taking the mean along each row (about 200 points for a 200,000-step run) gives a much nicer curve, and makes the effect of the learning-rate decay visible: the decay removes a lot of energy from the system and lets the optimization settle into a local minimum.
- Because of the BatchNorm layers, all layers must be put into evaluation mode (training = False) before evaluating the trained network. With that, the baseline validation loss is 2.10: fairly good, but there is still a way to go.
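A sketch of that plot fix, assuming the per-step losses were appended to a Python list lossi over a 200,000-step run; the stand-in data below is only for illustration.

```python
import torch
import matplotlib.pyplot as plt

# lossi is assumed to hold one loss value per training step, e.g. 200,000 entries.
lossi = torch.randn(200000).abs().tolist()            # stand-in data for illustration

# Average every 1,000 consecutive steps: view as (-1, 1000) rows, then mean along each row.
smoothed = torch.tensor(lossi).view(-1, 1000).mean(1)  # shape: (200,)
plt.plot(smoothed)
plt.show()
```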
Chapter 3 (09:25-17:25): PyTorch-ifying the code: Embedding, Flatten, and containers
Summary: The special-cased embedding lookup and the flattening of the context are turned into proper modules. An Embedding module does the indexing operation in its forward pass and stores the old lookup table C as self.weight; a Flatten module does the reshaping. They are named after their PyTorch counterparts on purpose: torch.nn.Embedding takes the number of embeddings and the dimensionality of the embedding just like ours (plus many keyword arguments we are not using yet), and torch.nn.Flatten exists as well. With these modules the forward pass simplifies substantially: the integer tensor xb, holding the identities of the input characters, can feed directly into the first layer.

The remaining simplification is to stop maintaining the modules in a naked list of layers and instead use the idea of containers from torch.nn, which we are essentially rebuilding from scratch: containers such as Sequential are simply ways of organizing layers into lists or dicts and calling them in order. Along the way another "fun bug" is introduced inline by not properly maintaining the training/evaluation state of BatchNorm. After rerunning everything, the training loss is 2.05 and the validation loss is 2.10; because the two are so similar, the model is not overfitting much on this task.
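A condensed sketch of the Embedding, Flatten, and Sequential modules in the same hand-rolled style (again, the notebook holds the authoritative versions):

```python
import torch

class Embedding:
    def __init__(self, num_embeddings, embedding_dim):
        self.weight = torch.randn(num_embeddings, embedding_dim)   # the old lookup table C
    def __call__(self, ix):
        self.out = self.weight[ix]         # pure indexing: (B, T) ints -> (B, T, embedding_dim)
        return self.out
    def parameters(self):
        return [self.weight]

class Flatten:
    def __call__(self, x):
        self.out = x.view(x.shape[0], -1)  # (B, T, C) -> (B, T*C)
        return self.out
    def parameters(self):
        return []

class Sequential:
    def __init__(self, layers):
        self.layers = layers
    def __call__(self, x):
        for layer in self.layers:          # call each layer in order
            x = layer(x)
        self.out = x
        return self.out
    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]
```

With these in place the whole model can be written as a single Sequential([...]) and invoked as model(xb).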
Chapter 4 (17:25-34:29): Implementing the hierarchical, WaveNet-like model
Summary: The WaveNet idea is to fuse information from the context slowly rather than all at once: characters are first fused into bigram representations, those are fused into four-character chunks, and so on, in a tree-like hierarchical manner that fuses the previous context progressively as the network gets deeper. In the WaveNet paper this is drawn as a stack of dilated causal convolution layers, which sounds scary but is really just an implementation detail that makes the same computation fast (revisited at the end of the video).

Before building the hierarchy, the context length (block_size) is bumped from 3 to 8 and the flat baseline is re-run. The number of parameters grows by about 10,000 because the first Linear layer now takes all eight characters' embeddings at once, and this network crushes way too much information way too fast; still, simply scaling up the context improves the validation loss from 2.10 to 2.02, and the sampled names also improve qualitatively.

To implement the hierarchical scheme, it helps to stare at the shapes. For a batch of 4 examples, the Embedding layer plucks out a learned 10-dimensional vector for each of the 8 input characters and returns a 4 x 8 x 10 tensor. The old Flatten squashed this to 4 x 80; instead, we now want a 4 x 4 x 20 tensor in which every two consecutive characters are packed together on the last dimension, so that the following Linear layer processes the four character pairs of each example in parallel.
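One way to convince yourself that this pairing can be done with a cheap view rather than explicit indexing and concatenation, using the same 4 x 8 x 10 shapes as the lecture's example batch:

```python
import torch

B, T, C = 4, 8, 10
e = torch.randn(B, T, C)                      # output of the Embedding layer

# Explicit construction: take the even characters (0, 2, 4, 6) and the odd characters (1, 3, 5, 7)
# and concatenate their embeddings along the last dimension.
explicit = torch.cat([e[:, ::2, :], e[:, 1::2, :]], dim=2)   # (4, 4, 20)

# Equivalent, and cheaper: a view that packs every two consecutive characters.
viewed = e.view(B, T // 2, C * 2)                            # (4, 4, 20)

print(torch.all(explicit == viewed).item())   # True: same elements, and the view needs no copy
```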
The Flatten layer is therefore generalized into a FlattenConsecutive module whose constructor takes n, the number of consecutive elements to concatenate in the last dimension of the output (remembered as self.n). It deliberately departs from torch.flatten, whose keyword arguments work differently. One detail: when the grouping consumes the whole time dimension, the view would leave a spurious middle dimension of size 1; in that case the tensor is passed through squeeze, the PyTorch function that removes dimensions of size one (either all of them, or a specific dimension you name), so the layer returns a plain two-dimensional tensor exactly as the old Flatten did.
Highlights for this chapter:
- Scaling the context from 3 to 8 characters improves the validation loss from 2.10 to 2.02 even before any architectural change.
- The hierarchical scheme processes consecutive character pairs in parallel (4 x 4 x 20 instead of 4 x 80 after the first grouping).
- The grouping can be implemented by indexing even/odd positions and torch.cat, but a simple view is equivalent and cheaper.
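A sketch of such a FlattenConsecutive module, consistent with the behaviour described above (minor details may differ from the notebook's version):

```python
import torch

class FlattenConsecutive:
    def __init__(self, n):
        self.n = n                      # how many consecutive time steps to fold together
    def __call__(self, x):
        B, T, C = x.shape
        x = x.view(B, T // self.n, C * self.n)   # pack n consecutive embeddings on the last dim
        if x.shape[1] == 1:
            x = x.squeeze(1)            # drop a spurious middle dimension of size 1
        self.out = x
        return self.out
    def parameters(self):
        return []

# Example: a (4, 8, 10) embedding tensor grouped in pairs becomes (4, 4, 20).
e = torch.randn(4, 8, 10)
print(FlattenConsecutive(2)(e).shape)   # torch.Size([4, 4, 20])
print(FlattenConsecutive(8)(e).shape)   # torch.Size([4, 80]) after the squeeze
```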
Chapter 5 (34:29-47:19): Training the WaveNet, a BatchNorm1d bug, and scaling up
Summary: To keep the tensor shapes honest, a small snippet iterates over all the layers of the model and prints each layer's class name together with the shape of its output, confirming the expected shapes after every layer. The network is then restructured hierarchically with FlattenConsecutive(2) between the Linear/BatchNorm/Tanh blocks, so that the block_size-8 context is fused two characters at a time over three levels. For comparison, the figure in the WaveNet paper shows four such layers with a total receptive field of 16 characters, i.e. a block size of 16. Good channel counts still need to be chosen; with 68 hidden units the parameter count comes out to roughly 22,000, comparable to the flat baseline.
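Putting the pieces together, a sketch of the hierarchical model and of the shape-inspection loop. It reuses the Sequential, Embedding, FlattenConsecutive, Linear, BatchNorm1d, and Tanh sketches from earlier in this summary, and the sizes (27 tokens, 10-dimensional embeddings, 68 hidden units, block size 8) follow the numbers mentioned in the lecture; treat the exact code as illustrative.

```python
import torch

vocab_size, n_embd, n_hidden, block_size = 27, 10, 68, 8

model = Sequential([
    Embedding(vocab_size, n_embd),
    FlattenConsecutive(2), Linear(n_embd * 2,   n_hidden), BatchNorm1d(n_hidden), Tanh(),
    FlattenConsecutive(2), Linear(n_hidden * 2, n_hidden), BatchNorm1d(n_hidden), Tanh(),
    FlattenConsecutive(2), Linear(n_hidden * 2, n_hidden), BatchNorm1d(n_hidden), Tanh(),
    Linear(n_hidden, vocab_size),
])
print(sum(p.nelement() for p in model.parameters()))   # ~22k parameters

# Inspect the output shape after every layer for a small example batch.
# Note: with the BatchNorm1d sketch above, the statistics inside the hierarchy are
# computed over the batch dimension only; the fix for that is discussed next.
xb = torch.randint(0, vocab_size, (4, block_size))
model(xb)
for layer in model.layers:
    print(layer.__class__.__name__, ':', tuple(layer.out.shape))
```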
Training this first pass surfaces the BatchNorm1d bug: the hand-rolled BatchNorm was written with two-dimensional inputs in mind, and inside the hierarchy it now receives three-dimensional activations, so it ends up maintaining many separate means and variances that are each estimated from only 32 numbers. After the fix, retraining gives a nice curve and a slight improvement in validation loss, from 2.029 to 2.022; it is not clear the difference is statistically significant, but a small gain is expected precisely because the layer no longer keeps all those poorly estimated statistics. Finally, the exact same architecture is scaled up to 76,000 parameters. Training takes a lot longer, but the validation loss reaches 1.993, crossing into sub-2.0 territory. At this point the experiments are slow enough that the hyperparameters and learning rates are being set somewhat in the dark; what is missing is an experimental harness on which many experiments could be run and the architecture tuned properly.
Highlights for this chapter:
- Fixing the BatchNorm1d bug improves the validation loss from 2.029 to 2.022 (small, possibly not statistically significant).
- Scaling the network to 76,000 parameters brings the validation loss to 1.993, crossing the 2.0 threshold, at the cost of much longer training and harder hyperparameter tuning.
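A small demonstration of the reduction at the heart of that fix; the shapes below (batch 32, 4 groups, 68 channels) are illustrative.

```python
import torch

x = torch.randn(32, 4, 68)                 # (batch, time, channels) activations inside the hierarchy

# Buggy reduction: only over the batch dimension -> 4*68 separate statistics,
# each estimated from just 32 numbers.
print(x.mean(0, keepdim=True).shape)       # torch.Size([1, 4, 68])

# Fixed reduction: over batch and time -> one statistic per channel,
# each estimated from 32*4 numbers.
print(x.mean((0, 1), keepdim=True).shape)  # torch.Size([1, 1, 68])
```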
Chapter 6 (47:20-56:21): Dilated causal convolutions, the development process, and what's next
Summary: What was implemented is the hierarchical structure of WaveNet, not its full forward pass: the paper's more elaborate blocks with gated linear layers, residual connections and skip connections were left out. The video then previews how this relates to the convolutional neural networks used in the WaveNet paper: the use of convolutions is strictly for efficiency and does not change the model that was implemented. Take a name from the training set, "D'Andre", which has seven letters. With the model as implemented, predicting every position of it means eight independent calls to the model, one per sliding window; convolutions let you slide the same model efficiently over the input sequence, so that this for-loop runs not in Python but inside CUDA kernels, with intermediate results reused between overlapping windows. A convolution can be thought of simply as a for loop applying a little linear filter over the space of an input sequence.

On the development process: a lot of time goes into the PyTorch documentation and into multidimensional "array gymnastics", keeping track of tensor shapes. Prototyping happens in a Jupyter notebook; once something works it is pasted into VS Code, and experiments are kicked off from the code repository.
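The video only gestures at this, so the following is purely illustrative and not the lecture's code: a single torch.nn.Conv1d with kernel size 2 is the "little linear filter applied over space", fusing neighbouring positions at every offset in one call, and its dilation argument is what the deeper layers of a dilated stack vary (a fully causal version would additionally pad on the left so outputs never look ahead).

```python
import torch
import torch.nn as nn

# Illustrative only: one dilated "pair-fusing" layer applied over a whole sequence.
B, C_in, T = 1, 10, 8                 # one example, 10 embedding channels, 8 time steps
x = torch.randn(B, C_in, T)           # Conv1d expects (batch, channels, time)

# kernel_size=2 fuses two neighbouring positions, much like FlattenConsecutive(2) + Linear;
# dilation controls how far apart those two positions are in deeper layers (1, 2, 4, ...).
fuse_pairs = nn.Conv1d(in_channels=C_in, out_channels=68, kernel_size=2, dilation=1)

y = fuse_pairs(x)                     # all sliding positions computed in one call
print(y.shape)                        # torch.Size([1, 68, 7])
```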
'text': "Maybe it's possible to actually take the original network with just one hidden layer and make it big enough and actually beat my fancy hierarchical network.", 'start': 3348.6, 'duration': 7.345}, {'end': 3356.625, 'text': "It's not obvious.", 'start': 3356.025, 'duration': 0.6}, {'end': 3362.148, 'text': 'That would be kind of embarrassing if this did not do better even once you torture it a little bit.', 'start': 3357.606, 'duration': 4.542}, {'end': 3367.972, 'text': 'Maybe you can read the WaveNet paper and try to figure out how some of these layers work and implement them yourselves using what we have.', 'start': 3362.969, 'duration': 5.003}, {'end': 3375.309, 'text': 'And of course, you can always tune some of the initialization or some of the optimization and see if you can improve it that way.', 'start': 3369.222, 'duration': 6.087}, {'end': 3378.573, 'text': "So I'd be curious if people can come up with some ways to beat this.", 'start': 3375.89, 'duration': 2.683}, {'end': 3380.355, 'text': "And yeah, that's it for now.", 'start': 3379.675, 'duration': 0.68}], 'summary': 'Exploring ways to improve network performance by tuning, implementing new layers, and seeking better results.', 'duration': 31.755, 'max_score': 3348.6, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t3YJ5hKiMQ0/pics/t3YJ5hKiMQ03348600.jpg'}], 'start': 2839.591, 'title': 'Implementing neural networks', 'summary': 'Covers the implementation of wavenet architecture using convolutions and the development process of building deep neural networks, along with potential future lectures on various neural network components.', 'chapters': [{'end': 2877.772, 'start': 2839.591, 'title': 'Implementing wavenet architecture', 'summary': 'Discusses the implementation of the wavenet architecture, highlighting the use of convolutions for efficiency and the omission of specific forward pass components.', 'duration': 38.181, 'highlights': ['The implementation of the WaveNet architecture involves the use of convolutions for efficiency, without altering the model structure.', 'The specific forward pass components, such as complicated linear layers and gated linear layers with residual and skip connections, were not included in the implemented structure.']}, {'end': 3380.836, 'start': 2878.572, 'title': 'Implementing convolutional neural networks', 'summary': 'Discusses the implementation of convolutional neural networks, with a focus on efficiently sliding models over input sequences to calculate outputs, the development process of building deep neural networks, and potential future lectures on dilated causal convolutional layers, residual and skip connections, experimental harness setup, and recurrent neural networks.', 'duration': 502.264, 'highlights': ['The chapter explains how convolutional neural networks efficiently slide models over input sequences to calculate outputs, allowing for variable reuse and efficient for loop hiding in CUDA kernels, with potential applications in dilated causal convolutional layers and residual and skip connections.', 'The development process of building deep neural networks involves spending time in the PyTorch documentation, dealing with multidimensional array gymnastics, and prototype implementation in Jupyter Notebooks before transferring to the code repository for training with VS Code.', 'Potential future lectures on implementing dilated causal convolutional layers, exploring residual and skip connections, setting up an experimental harness, and covering 
recurrent neural networks, LSTMs, GRUs, and transformers.']}], 'duration': 541.245, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/t3YJ5hKiMQ0/pics/t3YJ5hKiMQ02839591.jpg', 'highlights': ['The development process involves spending time in the PyTorch documentation and prototype implementation in Jupyter Notebooks before transferring to the code repository for training with VS Code.', 'The implementation of the WaveNet architecture involves the use of convolutions for efficiency, without altering the model structure.', 'Convolutional neural networks efficiently slide models over input sequences to calculate outputs, allowing for variable reuse and efficient for loop hiding in CUDA kernels.', 'Potential future lectures on implementing dilated causal convolutional layers, exploring residual and skip connections, setting up an experimental harness, and covering recurrent neural networks, LSTMs, GRUs, and transformers.']}], 'highlights': ['Significant performance improvement achieved by adjusting loss function due to batch size', 'Achieved validation performance improvement from 2.029 to 2.022', "The reshaping of the linear layer to process consecutive elements in parallel, aiming for a structure of 4 by 4 by 20 instead of 4 by 80, demonstrates the optimization for efficiency in the model's implementation", 'The hierarchical approach in WaveNet fuses characters into bigram representations, then into four-character level chunks, allowing the network to capture context more effectively', 'The architecture is being complexified to take more characters in a sequence as an input, not just three, and to make a deeper model that progressively fuses information to predict the next character in a sequence']}