title
CS231n Winter 2016: Lecture 7: Convolutional Neural Networks

description
Stanford Winter Quarter 2016 class: CS231n: Convolutional Neural Networks for Visual Recognition. Lecture 7. Get in touch on Twitter @cs231n, or on Reddit /r/cs231n.

detail
{'title': 'CS231n Winter 2016: Lecture 7: Convolutional Neural Networks', 'heatmap': [{'end': 379.611, 'start': 329.803, 'weight': 0.709}, {'end': 2087.648, 'start': 1801.987, 'weight': 0.724}, {'end': 3935.077, 'start': 3887.481, 'weight': 0.873}], 'summary': 'The lecture provides an introduction to convolutional neural networks, covers the training process and visualization of convolutional layers, explains the application of filter sizes, discusses image padding, and explores the evolution of CNN architectures, including specific examples like LeNet-5, AlexNet, and VGGNet, with emphasis on key architectural details and performance metrics.', 'chapters': [{'end': 46.025, 'segs': [{'end': 46.025, 'src': 'embed', 'start': 1.434, 'weight': 0, 'content': [{'end': 4.817, 'text': 'So today, we finally get to cover convolutional neural networks.', 'start': 1.434, 'duration': 3.383}, {'end': 6.259, 'text': "So we're super excited.", 'start': 5.398, 'duration': 0.861}, {'end': 8.981, 'text': 'But first, let me dive into some administrative items.', 'start': 6.879, 'duration': 2.102}, {'end': 12.425, 'text': 'As a reminder, again, Assignment 2 is due next Friday.', 'start': 9.962, 'duration': 2.463}, {'end': 18.431, 'text': 'How is Assignment 2 going, by the way? Did people finish the fully connected stuff at all? Some people? OK.', 'start': 13.546, 'duration': 4.885}, {'end': 22.051, 'text': 'How about, did anyone finish BatchNorm? OK, OK.', 'start': 18.451, 'duration': 3.6}, {'end': 24.673, 'text': 'All right.', 'start': 22.091, 'duration': 2.582}, {'end': 31.997, 'text': 'Another thing to worry about perhaps for you guys is the project proposal, which is due very soon on Saturday.', 'start': 24.873, 'duration': 7.124}, {'end': 33.218, 'text': "It's ungraded.", 'start': 32.557, 'duration': 0.661}, {'end': 34.158, 'text': 'We just want a paragraph.', 'start': 33.278, 'duration': 0.88}, {'end': 36.36, 'text': 'We want to make sure that you guys are on the right track.', 'start': 34.178, 'duration': 2.182}, {'end': 37.64, 'text': "You've thought about the project.", 'start': 36.62, 'duration': 1.02}, {'end': 39.301, 'text': 'You have a rough proposal for what you want to do.', 'start': 37.68, 'duration': 1.621}, {'end': 41.903, 'text': 'You can also send us a few possibilities.', 'start': 39.742, 'duration': 2.161}, {'end': 43.023, 'text': 'And by few, I mean two.', 'start': 42.063, 'duration': 0.96}, {'end': 46.025, 'text': "Don't send us hundreds.", 'start': 43.184, 'duration': 2.841}], 'summary': 'Covered convolutional neural networks, Assignment 2 due next Friday, project proposal due soon', 'duration': 44.591, 'max_score': 1.434, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ1434.jpg'}], 'start': 1.434, 'title': 'Introduction to convolutional neural networks', 'summary': 'Provides an introduction to convolutional neural networks, along with reminders for the Assignment 2 due date and the ungraded project proposal deadline, emphasizing the need for a rough proposal by Saturday.', 'chapters': [{'end': 46.025, 'start': 1.434, 'title': 'Introduction to convolutional neural networks', 'summary': 'Covers the introduction to convolutional neural networks, reminding students about the due date for Assignment 2 and the approaching deadline for the ungraded project proposal, emphasizing the need for a rough proposal by Saturday.', 'duration': 44.591, 'highlights': ['The ungraded project proposal is due soon on Saturday, requiring a rough proposal for
the project and limiting the submission to a maximum of two possibilities.', 'Assignment 2 is due next Friday, prompting a check on the progress of the fully connected portion and BatchNorm, with some students indicating partial completion.', 'The instructor emphasizes the importance of having a rough proposal for the project, ensuring that students have considered and thought about their project ideas.']}], 'duration': 44.591, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ1434.jpg', 'highlights': ['Assignment 2 due next Friday, check progress of fully connected portion and BatchNorm', 'Ungraded project proposal due Saturday, requires rough proposal, limited to two possibilities', 'Emphasize importance of rough project proposal, ensuring students have considered their ideas']}, {'end': 571.073, 'segs': [{'end': 122.999, 'src': 'embed', 'start': 68.432, 'weight': 3, 'content': [{'end': 72.134, 'text': "But Adadelta is kind of analogous to the other parameter updates that I covered last time.", 'start': 68.432, 'duration': 3.702}, {'end': 74.076, 'text': 'We also talked about dropout.', 'start': 72.955, 'duration': 1.121}, {'end': 77.579, 'text': 'And I introduced briefly convolutional neural networks.', 'start': 74.957, 'duration': 2.622}, {'end': 80.021, 'text': 'And I talked about some of the history of the field and how this developed.', 'start': 77.639, 'duration': 2.382}, {'end': 86.406, 'text': 'So I talked particularly about the experiments of Hubel and Wiesel in the 1960s with the cat visual cortex.', 'start': 80.981, 'duration': 5.425}, {'end': 93.251, 'text': 'And there are takeaways from a lot of this research, which is that the cortex is arranged hierarchically,', 'start': 86.966, 'duration': 6.285}, {'end': 97.494, 'text': 'with these simple to complex cells and more and more complex things happening over time.', 'start': 93.251, 'duration': 4.243}, {'end': 103.005, 'text': "And so today, we'll get to dive into some of these models in detail.", 'start': 99.622, 'duration': 3.383}, {'end': 104.466, 'text': "And we'll talk about convolutional neural networks.", 'start': 103.045, 'duration': 1.421}, {'end': 109.329, 'text': "So first, as I did with neural networks, I'd like to talk about convnets without all the brain stuff.", 'start': 105.426, 'duration': 3.903}, {'end': 111.431, 'text': 'So no analogies to neurons or anything like that.', 'start': 109.349, 'duration': 2.082}, {'end': 113.493, 'text': "We'll just see what the operations are mathematically.", 'start': 111.471, 'duration': 2.022}, {'end': 119.617, 'text': "And then we'll go into how you can interpret them in terms of neurons being connected in some kind of a simulated brain tissue or something like that.", 'start': 113.833, 'duration': 5.784}, {'end': 122.999, 'text': 'So we start off with some image.', 'start': 120.658, 'duration': 2.341}], 'summary': 'Overview of convolutional neural networks with historical context and the hierarchical arrangement of the cortex.', 'duration': 54.567, 'max_score': 68.432, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ68432.jpg'}, {'end': 379.611, 'src': 'heatmap', 'start': 329.803, 'weight': 0.709, 'content': [{'end': 332.083, 'text': 'And so this activation map is computed independently from the first.', 'start': 329.803, 'duration': 2.28}, {'end': 333.484, 'text': 'These are all independent filters.', 'start': 332.123, 'duration': 1.361}, {'end': 338.325,
'text': "So what we'll end up with now is we'll have an entire set of filters.", 'start': 334.364, 'duration': 3.961}, {'end': 342.546, 'text': "So suppose, for example, that this convolutional layer will have six filters, and that's just a hyperparameter.", 'start': 338.485, 'duration': 4.061}, {'end': 344.367, 'text': 'So suppose we had six of them.', 'start': 343.166, 'duration': 1.201}, {'end': 349.528, 'text': "Then what we'll do is we'll slide every one of them independently through the input volume, computing that product along the way.", 'start': 344.827, 'duration': 4.701}, {'end': 351.869, 'text': "And that's actually called the convolution operation.", 'start': 349.548, 'duration': 2.321}, {'end': 362.805, 'text': 'And that gives us this entire 28 by 28 by 6 set of activation maps that are stacked together along the depth dimension.', 'start': 354.622, 'duration': 8.183}, {'end': 370.688, 'text': 'And so what this convolutional layer has done is it has looked at this image and it has re-represented this image, the 32 by 32, by 3,', 'start': 363.525, 'duration': 7.163}, {'end': 373.889, 'text': 'in terms of the activations on this image of those filters.', 'start': 370.688, 'duration': 3.201}, {'end': 379.611, 'text': 'So we end up with a re-representation of the image of size 28 by 28 by 6.', 'start': 374.449, 'duration': 5.162}], 'summary': 'Convolutional layer computes 6 independent filters, resulting in a 28x28x6 set of activation maps.', 'duration': 49.808, 'max_score': 329.803, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ329803.jpg'}, {'end': 379.611, 'src': 'embed', 'start': 354.622, 'weight': 6, 'content': [{'end': 362.805, 'text': 'And that gives us this entire 28 by 28 by 6 set of activation maps that are stacked together along the depth dimension.', 'start': 354.622, 'duration': 8.183}, {'end': 370.688, 'text': 'And so what this convolutional layer has done is it has looked at this image and it has re-represented this image, the 32 by 32, by 3,', 'start': 363.525, 'duration': 7.163}, {'end': 373.889, 'text': 'in terms of the activations on this image of those filters.', 'start': 370.688, 'duration': 3.201}, {'end': 379.611, 'text': 'So we end up with a re-representation of the image of size 28 by 28 by 6.', 'start': 374.449, 'duration': 5.162}], 'summary': 'Convolutional layer re-represents 32x32x3 image as 28x28x6 activations.', 'duration': 24.989, 'max_score': 354.622, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ354622.jpg'}, {'end': 429.151, 'src': 'embed', 'start': 395.631, 'weight': 1, 'content': [{'end': 399.522, 'text': 'And these filters will have some spatial extent, like, say, 5 by 5.', 'start': 395.631, 'duration': 3.891}, {'end': 403.523, 'text': 'And so this conv layer, short for convolution, will be sliding through.', 'start': 399.522, 'duration': 4.001}, {'end': 407.584, 'text': 'We get, say, a 28 by 28 by 6 volume instead of the original volume.', 'start': 403.763, 'duration': 3.821}, {'end': 410.124, 'text': 'And then this will feed into the next convolutional layer.', 'start': 408.024, 'duration': 2.1}, {'end': 415.505, 'text': "And of course, in the middle, always we're going to be applying activation functions as we did before with neural networks.", 'start': 410.764, 'duration': 4.741}, {'end': 418.045, 'text': 'So we perform convolutions, which are these linear operations.', 'start': 415.885, 'duration': 2.16}, {'end': 421.566, 
'text': 'Then we threshold all the activations at 0.', 'start': 418.646, 'duration': 2.92}, {'end': 429.151, 'text': 'And then we proceed again with another convolutional layer with its own filters, maybe of different size.', 'start': 421.566, 'duration': 7.585}], 'summary': 'Convolutional layer creates 28x28x6 volume from 5x5 filters, followed by activation functions and subsequent layers.', 'duration': 33.52, 'max_score': 395.631, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ395631.jpg'}, {'end': 530.919, 'src': 'embed', 'start': 500.376, 'weight': 0, 'content': [{'end': 506.521, 'text': 'through back propagation, tune themselves to become all these little blobs of edge pieces and little color pieces and blobs.', 'start': 500.376, 'duration': 6.145}, {'end': 510.825, 'text': 'And so these are basically the filters in the first convolutional layer when you visualize them.', 'start': 507.242, 'duration': 3.583}, {'end': 515.707, 'text': 'So all of these filters will be looking for these things in the original image when we convolve through.', 'start': 511.325, 'duration': 4.382}, {'end': 525.895, 'text': "And as you go into deeper and deeper convolutional layers and we're performing these successive operations of conv and conv on top of each other you'll end up eventually with the second convolutional layer,", 'start': 516.808, 'duration': 9.087}, {'end': 526.376, 'text': 'for example.', 'start': 525.895, 'duration': 0.481}, {'end': 530.919, 'text': "It's going to be doing dot products over the outputs of the first conv layer.", 'start': 526.836, 'duration': 4.083}], 'summary': 'Neural networks use back propagation to tune filters for image recognition.', 'duration': 30.543, 'max_score': 500.376, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ500376.jpg'}], 'start': 46.886, 'title': 'Convolutional neural networks', 'summary': "Covers the training process of neural networks, including parameter updates and the introduction of convolutional neural networks, with a brief history and key takeaways from the research on hierarchical cortex arrangement.
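To make the sliding-filter arithmetic above concrete, here is a minimal NumPy sketch of a naive convolutional forward pass (illustrative only, not the lecture's code; array names are hypothetical): six 5x5x3 filters slide over a 32x32x3 input at stride 1 with no padding, each producing one 28x28 activation map, and the six maps are stacked along depth before the threshold-at-0 (ReLU) step described in the transcript.

```python
import numpy as np

# Hypothetical shapes matching the lecture's example:
# a 32x32x3 input volume and six 5x5x3 filters, stride 1, no padding.
x = np.random.randn(32, 32, 3)          # input volume (H, W, D)
w = np.random.randn(6, 5, 5, 3)         # six filters (K, F, F, D)
b = np.zeros(6)                         # one bias per filter

K, F = w.shape[0], w.shape[1]
H_out = x.shape[0] - F + 1              # (32 - 5)/1 + 1 = 28
W_out = x.shape[1] - F + 1
out = np.zeros((H_out, W_out, K))       # 28x28x6 output volume

# Slide every filter independently over every spatial position,
# taking a dot product with the local 5x5x3 region it covers.
for k in range(K):                      # each filter -> one activation map
    for i in range(H_out):
        for j in range(W_out):
            patch = x[i:i+F, j:j+F, :]  # the neuron's receptive field
            out[i, j, k] = np.sum(patch * w[k]) + b[k]

out = np.maximum(out, 0)                # ReLU: threshold activations at 0
print(out.shape)                        # (28, 28, 6)
```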
It also explains the working of convolutional neural networks, including the convolutional layer's operation, filter sliding, and activation map creation, leading to the re-representation of the input image, and the subsequent building of a feature hierarchy through successive convolutional layers.", 'chapters': [{'end': 122.999, 'start': 46.886, 'title': 'Neural network training and convolutional neural networks', 'summary': 'Covers the training process of neural networks, including parameter updates and the introduction of convolutional neural networks, with a brief history and key takeaways from the research on hierarchical cortex arrangement.', 'duration': 76.113, 'highlights': ['The cortex is arranged hierarchically, with simple to complex cells and more complex developments over time.', 'Covered parameter updates including Adam and Adadelta, with Adadelta being analogous to other updates.', 'Introduction of convolutional neural networks and the mathematical operations before discussing their interpretation in terms of neurons.']}, {'end': 571.073, 'start': 123.699, 'title': 'Convolutional neural networks', 'summary': "Explains the working of convolutional neural networks, including the convolutional layer's operation, filter sliding, and activation map creation, leading to the re-representation of the input image, and the subsequent building of a feature hierarchy through successive convolutional layers.", 'duration': 447.374, 'highlights': ['The convolutional layer operates by sliding small spatial filters (e.g., 5x5x3) through the input volume, computing dot products, and generating activation maps, producing a re-representation of the input image (28x28x6).', 'Convolutional layers use independent filter banks to create activation maps, with each filter being independently slid through the input volume, resulting in stacked activation maps (28x28x6) along the depth dimension.', 'Successive convolutional layers build a feature hierarchy by tuning filters to detect specific features in the input image, leading to the creation of templates for various object pieces through dot products.', 'Visualization of filters in the first convolutional layer reveals their tuning to become edge and color pieces, while deeper layers perform dot products over the outputs of previous layers, assembling larger object pieces.']}], 'duration': 524.187, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ46886.jpg', 'highlights': ['Successive convolutional layers build a feature hierarchy by tuning filters to detect specific features in the input image, leading to the creation of templates for various object pieces through dot products.', 'The convolutional layer operates by sliding small spatial filters (e.g., 5x5x3) through the input volume, computing dot products, and generating activation maps, producing a re-representation of the input image (28x28x6).', 'Visualization of filters in the first convolutional layer reveals their tuning to become edge and color pieces, while deeper layers perform dot products over the outputs of previous layers, assembling larger object pieces.', 'The cortex is arranged hierarchically, with simple to complex cells and more complex developments over time.', 'Covered parameter updates including Adam and Adadelta, with Adadelta being analogous to other updates.', 'Introduction of convolutional neural networks and the mathematical operations before discussing their interpretation in terms of
neurons.', 'Convolutional layers use independent filter banks to create activation maps, with each filter being independently slid through the input volume, resulting in stacked activation maps (28x28x6) along the depth dimension.']}, {'end': 817.623, 'segs': [{'end': 625.884, 'src': 'embed', 'start': 590.764, 'weight': 2, 'content': [{'end': 592.905, 'text': 'But they are operating over the outputs of these filters.', 'start': 590.764, 'duration': 2.141}, {'end': 595.186, 'text': "So that's just a subtle point I wanted to bring up.", 'start': 593.565, 'duration': 1.621}, {'end': 602.151, 'text': "And so you end up with an image that's very similar to maybe what Hubel and Wiesel may have imagined, where you have these simple cells looking for,", 'start': 595.886, 'duration': 6.265}, {'end': 605.893, 'text': 'say, for example, a bar of a specific orientation somewhere specifically in the image.', 'start': 602.151, 'duration': 3.742}, {'end': 615.84, 'text': "And then we're building up the hierarchy of these features and composing them together spatially to get more and more complex responses of different kinds of objects,", 'start': 607.254, 'duration': 8.586}, {'end': 616.161, 'text': 'and so on.', 'start': 615.84, 'duration': 0.321}, {'end': 625.544, 'text': "And so, to put this another way, Let's, for example, consider this example where we have this input image here, which is a small piece of a car,", 'start': 617.282, 'duration': 8.262}, {'end': 625.884, 'text': 'I believe.', 'start': 625.544, 'duration': 0.34}], 'summary': 'Neural networks compose features spatially, building hierarchy for image recognition.', 'duration': 35.12, 'max_score': 590.764, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ590764.jpg'}, {'end': 700.285, 'src': 'embed', 'start': 660.591, 'weight': 0, 'content': [{'end': 663.192, 'text': "Because there's orange stuff there, so this filter gets happy about that part.", 'start': 660.591, 'duration': 2.601}, {'end': 666.494, 'text': 'And so these are the activation maps produced by these filters.', 'start': 663.913, 'duration': 2.581}, {'end': 670.517, 'text': "We'll stack them up along the depth and then we'll feed that into the next convolutional layer,", 'start': 666.835, 'duration': 3.682}, {'end': 672.979, 'text': 'which will be putting together combinations of these guys.', 'start': 670.517, 'duration': 2.462}, {'end': 674.36, 'text': 'over and over again.', 'start': 673.659, 'duration': 0.701}, {'end': 678.124, 'text': 'And so a convolutional network will basically have a layout like this.', 'start': 675.141, 'duration': 2.983}, {'end': 681.587, 'text': "And we'll see how we'll be arranging all of this soon.", 'start': 679.045, 'duration': 2.542}, {'end': 690.476, 'text': 'But there will be basically three core building blocks, a convolutional layer, a rectifier layer, which is like a non-linearity, just thresholding.', 'start': 682.008, 'duration': 8.468}, {'end': 692.959, 'text': "There will be pooling operations, which we'll go into in a bit.", 'start': 690.997, 'duration': 1.962}, {'end': 695.721, 'text': 'And there is a fully connected layer at the very end.', 'start': 693.559, 'duration': 2.162}, {'end': 698.163, 'text': "which we'll also go into in a bit.", 'start': 696.862, 'duration': 1.301}, {'end': 700.285, 'text': 'But basically, the image feeds in here.', 'start': 698.364, 'duration': 1.921}], 'summary': 'Convolutional network consists of activation maps, convolutional layers, rectifier 
layers, pooling operations, and a fully connected layer.', 'duration': 39.694, 'max_score': 660.591, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ660591.jpg'}], 'start': 572.114, 'title': 'Visualizing and operating convolutional layers', 'summary': 'Covers the visualization of convolutional layers, focusing on feature hierarchy and complex responses within neural networks. it also details the layout and operations of convolutional layers in a network, including filters, activation maps, and 3d volumes, addressing common questions about their organizational layout and filter choices.', 'chapters': [{'end': 625.884, 'start': 572.114, 'title': 'Visualizing convolutional layers', 'summary': 'Explains the visualization of convolutional layers, emphasizing the hierarchy of features and the composition of complex responses within neural networks, with a subtle point on the operation of filters and spatial composition.', 'duration': 53.77, 'highlights': ['The filters in the convolutional layer are operating over the outputs of these filters, creating an image similar to what Hubel and Wiesel imagined, where simple cells look for specific features in the image.', 'The chapter also mentions the composition of features to obtain more complex responses of different kinds of objects within the neural network.', 'The subtle point about the operation of filters and spatial composition is highlighted, emphasizing the importance of understanding the processing within the convolutional layers.']}, {'end': 817.623, 'start': 626.865, 'title': 'Convolutional neural networks', 'summary': 'Details the layout and operations of convolutional layers in a convolutional network, including the number of filters, activation maps, and the process of creating 3d volumes of higher abstraction, and it addresses common questions about the organizational layout and the choice of the number of filters.', 'duration': 190.758, 'highlights': ['The layout of a convolutional network consists of three core building blocks: convolutional layer, rectifier layer, and pooling operations, eventually leading to a fully connected layer at the end.', 'The first convolutional layer in the example comprises 32 filters of 5x5 spatial dimensions, producing activation maps with white indicating high activations and black indicating low activations.', 'The process involves stacking the activation maps and feeding them into the next convolutional layer, creating 3D volumes of higher abstraction, and ultimately generating class scores for different classes through a fully connected layer.', 'The chapter addresses common questions about the organizational layout and the choice of the number of filters, providing insights into the spatial dimension and the process of local connectivity to the input in a convolutional network.']}], 'duration': 245.509, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ572114.jpg', 'highlights': ['The process involves stacking the activation maps and feeding them into the next convolutional layer, creating 3D volumes of higher abstraction, and ultimately generating class scores for different classes through a fully connected layer.', 'The layout of a convolutional network consists of three core building blocks: convolutional layer, rectifier layer, and pooling operations, eventually leading to a fully connected layer at the end.', 'The filters in the convolutional layer are operating over the outputs of 
these filters, creating an image similar to what Hubel and Wiesel imagined, where simple cells look for specific features in the image.', 'The chapter also mentions the composition of features to obtain more complex responses of different kinds of objects within the neural network.']}, {'end': 1587.289, 'segs': [{'end': 1082.699, 'src': 'embed', 'start': 1050.686, 'weight': 1, 'content': [{'end': 1053.868, 'text': 'And so basically, if you have a 3 by 3 filter, you want to 0 pad with 1.', 'start': 1050.686, 'duration': 3.182}, {'end': 1055.889, 'text': 'If you have a 5 by 5 filter, 0 pad with 2.', 'start': 1053.868, 'duration': 2.021}, {'end': 1058.49, 'text': 'If you have a 7 by 7 filter, you 0 pad with 3.', 'start': 1055.889, 'duration': 2.601}, {'end': 1064.492, 'text': 'OK. so in those cases, if you use exactly that zero padding with that filter size and using stride one,', 'start': 1058.49, 'duration': 6.002}, {'end': 1066.613, 'text': "you'll always achieve the same output volume spatially.", 'start': 1064.492, 'duration': 2.121}, {'end': 1068.854, 'text': "And we'll see why that is very nice in a bit.", 'start': 1066.713, 'duration': 2.141}, {'end': 1069.074, 'text': 'Go ahead.', 'start': 1068.874, 'duration': 0.2}, {'end': 1075.036, 'text': 'Could you explain why zero padding makes it a 7 by 7 output? Yeah.', 'start': 1069.314, 'duration': 5.722}, {'end': 1076.276, 'text': "OK So we've zero padded.", 'start': 1075.056, 'duration': 1.22}, {'end': 1082.699, 'text': 'And now basically what would happen now is that instead of a 7 by 7 input, we really ended up with a 9 by 9 input.', 'start': 1077.017, 'duration': 5.682}], 'summary': 'Zero padding with filter sizes achieves consistent output volume spatially', 'duration': 32.013, 'max_score': 1050.686, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ1050686.jpg'}, {'end': 1304.706, 'src': 'embed', 'start': 1279.961, 'weight': 0, 'content': [{'end': 1288.623, 'text': 'So, as a summary of a convolutional layer, to just put this, with a lot of parameters here you always accept a volume of W1H1D1,', 'start': 1279.961, 'duration': 8.662}, {'end': 1292.443, 'text': 'and a convolutional layer produces a volume of activation W2H2D2..', 'start': 1288.623, 'duration': 3.82}, {'end': 1297.644, 'text': 'And a convolutional layer takes four hyperparameters, K, F, S, and P.', 'start': 1293.084, 'duration': 4.56}, {'end': 1300.885, 'text': 'So the number of filters you want, the spatial extent of these filters,', 'start': 1297.644, 'duration': 3.241}, {'end': 1304.706, 'text': 'the stride at which you want to apply them and the amount of zero padding you want to do on the borders.', 'start': 1300.885, 'duration': 3.821}], 'summary': 'Convolutional layer takes w1h1d1 and produces w2h2d2 with hyperparameters k, f, s, and p.', 'duration': 24.745, 'max_score': 1279.961, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ1279961.jpg'}, {'end': 1361.151, 'src': 'embed', 'start': 1334.581, 'weight': 3, 'content': [{'end': 1338.083, 'text': 'is just by doing this convolution of sliding this filter through and computing dot products.', 'start': 1334.581, 'duration': 3.502}, {'end': 1340.704, 'text': 'In terms of common settings of these hyperparameters.', 'start': 1338.883, 'duration': 1.821}, {'end': 1344.306, 'text': "that you'll see in practice and we'll see a lot of examples of case studies by the end of this lecture.", 'start': 1340.704, 
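As a quick sanity check of the zero-padding rule just stated (a sketch under my own variable names, not code from the lecture): zero-padding a 7x7 input with 1 gives a 9x9 input, and a 3x3 filter at stride 1 then produces a 7x7 output again, so p = (F - 1)/2 preserves the spatial size, exactly as described for 3x3/5x5/7x7 filters.

```python
import numpy as np

x = np.random.randn(7, 7)                  # 7x7 input (one channel, for brevity)
F = 3                                      # 3x3 filter
p = (F - 1) // 2                           # the "same" padding rule: 3->1, 5->2, 7->3

x_pad = np.pad(x, p)                       # zero-pad every border -> 9x9
out_size = (x_pad.shape[0] - F) // 1 + 1   # (9 - 3)/1 + 1 = 7, with stride 1
print(x_pad.shape, out_size)               # (9, 9) 7 -- spatial size preserved
```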
'duration': 3.602}, {'end': 1348.507, 'text': 'k usually is chosen as a power of 2 for computational reasons.', 'start': 1345.586, 'duration': 2.921}, {'end': 1355.189, 'text': 'Because if some libraries, when they see powers of 2 in terms of number of, say, your dimensions or number of kernels,', 'start': 1349.007, 'duration': 6.182}, {'end': 1361.151, 'text': 'sometimes they go into a special subroutine that is very, very efficient to perform in a vectorized form.', 'start': 1355.189, 'duration': 5.962}], 'summary': 'The number of filters k is often chosen as a power of 2 for computational efficiency', 'duration': 26.57, 'max_score': 1334.581, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ1334581.jpg'}, {'end': 1421.522, 'src': 'embed', 'start': 1391.933, 'weight': 5, 'content': [{'end': 1393.514, 'text': 'with a padding of 0.', 'start': 1391.933, 'duration': 1.581}, {'end': 1399.497, 'text': "And I'd just like to make the point that this sometimes confuses people to see a 1 by 1 filter with convolution that seems to not make sense.", 'start': 1393.514, 'duration': 5.983}, {'end': 1402.558, 'text': "I'd just like to point out that this actually does make sense in convolutional layers.", 'start': 1399.857, 'duration': 2.701}, {'end': 1405.58, 'text': "And the way to think about it is suppose you're working with this example.", 'start': 1403.118, 'duration': 2.462}, {'end': 1408.601, 'text': 'So 56 by 56 by 64 volume coming in.', 'start': 1406.2, 'duration': 2.401}, {'end': 1410.002, 'text': 'And doing 1 by 1 conv.', 'start': 1409.281, 'duration': 0.721}, {'end': 1415.735, 'text': 'with 32 filters would give you the same sized output, except for 32 in depth now.', 'start': 1411.55, 'duration': 4.185}, {'end': 1421.522, 'text': "And each filter is one by one spatially, but you have to remember that we're doing these dot products through the full depth of the volume.", 'start': 1416.576, 'duration': 4.946}], 'summary': 'Using 1x1 convolution with 32 filters on a 56x56x64 volume results in an output of the same size but with a depth of 32.', 'duration': 29.589, 'max_score': 1391.933, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ1391933.jpg'}, {'end': 1498.148, 'src': 'embed', 'start': 1470.825, 'weight': 4, 'content': [{'end': 1475.507, 'text': 'Why is the size of f always odd?
You almost always see 1, 3, 5, 7, 11 sometimes.', 'start': 1470.825, 'duration': 4.682}, {'end': 1476.608, 'text': "And that's mostly what you'll see.", 'start': 1475.587, 'duration': 1.021}, {'end': 1479.529, 'text': "You won't see even numbers for the sizes of filters, just because,", 'start': 1476.948, 'duration': 2.581}, {'end': 1490.24, 'text': 'Just because odd filters have this nice representation of three is the smallest thing that makes sense in terms of having something on the left and on the right of a filter.', 'start': 1482.253, 'duration': 7.987}, {'end': 1491.762, 'text': 'You can do two by two filters.', 'start': 1490.641, 'duration': 1.121}, {'end': 1494.384, 'text': "I've seen some people do it, but it's not very common.", 'start': 1491.802, 'duration': 2.582}, {'end': 1498.148, 'text': 'So the lowest people usually use is three by three just for convenience.', 'start': 1494.764, 'duration': 3.384}], 'summary': 'Odd filter sizes, like 1, 3, 5, 7, 11, are commonly used due to better representation and convenience.', 'duration': 27.323, 'max_score': 1470.825, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ1470825.jpg'}], 'start': 818.063, 'title': 'Convolutional layers and filter sizes', 'summary': 'Explains the application of filter sizes in convolutional layers, emphasizing the significance of 3x3, 5x5, and 1x1 filters, and the rationale behind using odd-sized filters over even ones, with practical examples and considerations for depth and spatial dimensions.', 'chapters': [{'end': 1365.912, 'start': 818.063, 'title': 'Convolutional layers and spatial dimension control', 'summary': 'Explains the concept of convolutional layers, including spatial dimension control through stride, padding, and filter size, and the calculation of the output volume size and the number of parameters in a convolutional layer, with examples and practical considerations.', 'duration': 547.849, 'highlights': ['The output volume size from a convolutional layer is calculated using the formula (n-f+2p)/s + 1, where n is the input size, f is the filter size, p is the padding, and s is the stride.', 'Padding with zeros can be used to preserve the spatial size of the output volume, preventing rapid decrease in spatial size and maintaining a fixed-sized representation.', 'The number of parameters in a convolutional layer is calculated as (f * f * d_in + 1) * k, where f is the filter size, d_in is the input depth, and k is the number of filters, including biases.', 'Common settings for the number of filters (k) in practice often involve choosing powers of 2 for computational efficiency.']},
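Both formulas in these highlights are easy to verify in a few lines of Python (the helper names below are my own, not from the lecture); the calls reproduce the section's own numbers: a 32x32 input with a 5x5 filter, no padding, stride 1 gives 28; six 5x5x3 filters cost (5*5*3 + 1)*6 = 456 parameters; padding 1 keeps a 7x7 input at 7x7; and a 1x1 conv leaves 56x56 unchanged spatially while the depth becomes k.

```python
def conv_output_size(n, f, p, s):
    """Spatial output size of a conv layer: (n - f + 2p)/s + 1."""
    assert (n - f + 2 * p) % s == 0, "filter does not tile the input evenly"
    return (n - f + 2 * p) // s + 1

def conv_num_params(f, d_in, k):
    """Weights plus one bias per filter: (f*f*d_in + 1) * k."""
    return (f * f * d_in + 1) * k

print(conv_output_size(32, 5, 0, 1))   # 28, as in the 32x32x3 -> 28x28x6 example
print(conv_num_params(5, 3, 6))        # 456 parameters for six 5x5x3 filters
print(conv_output_size(7, 3, 1, 1))    # 7: zero padding preserves spatial size
print(conv_output_size(56, 1, 0, 1))   # 56: a 1x1 conv keeps 56x56; depth becomes k
```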
{'end': 1587.289, 'start': 1367.913, 'title': 'Convolutional layers and filter sizes', 'summary': 'Explains the application of filter sizes in convolutional layers, emphasizing the significance of 3x3, 5x5, and 1x1 filters, and the rationale behind using odd-sized filters over even ones, with an example of a 1x1 convolutional layer and addressing the concerns regarding depth and spatial dimensions.', 'duration': 219.376, 'highlights': ['The significance of 3x3, 5x5, and 1x1 filters and the rationale behind using odd-sized filters over even ones.', 'Explanation of the 1x1 convolutional layer and addressing concerns regarding depth and spatial dimensions.', 'Clarification on the rationale behind using odd-sized filters over even ones.']}], 'duration': 769.226, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ818063.jpg', 'highlights': ['The output volume size from a convolutional layer is calculated using the formula (n-f+2p)/s + 1', 'Padding with zeros can be used to preserve the spatial size of the output volume', 'The number of parameters in a convolutional layer is calculated as (f * f * d_in + 1) * k', 'Common settings for the number of filters (k) in practice often involve choosing powers of 2 for computational efficiency', 'The significance of 3x3, 5x5, and 1x1 filters and the rationale behind using odd-sized filters over even ones', 'Explanation of the 1x1 convolutional layer and addressing concerns regarding depth and spatial dimensions', 'Clarification on the rationale behind using odd-sized filters over even ones']}, {'end': 2377.94, 'segs': [{'end': 1652.939, 'src': 'embed', 'start': 1587.83, 'weight': 0, 'content': [{'end': 1596.941, 'text': "I think it means like if you cut out something in the picture, but there's actually something you can expect is continuous, so you can use that value.", 'start': 1587.83, 'duration': 9.111}, {'end': 1599.457, 'text': "if there's something in the picture.", 'start': 1598.075, 'duration': 1.382}, {'end': 1609.689, 'text': 'So when we do this padding and this is our input image then when we do this padding around the full input image,', 'start': 1599.857, 'duration': 9.832}, {'end': 1615.176, 'text': "so you actually don't know what's outside of that image, we'd be padding with zeros, because you don't know actually what's in that image.", 'start': 1609.689, 'duration': 5.487}, {'end': 1618.446, 'text': 'I see.', 'start': 1618.246, 'duration': 0.2}, {'end': 1621.529, 'text': "Yeah, so you're saying that we could maybe fill this with immediate neighbors or something like that.", 'start': 1618.526, 'duration': 3.003}, {'end': 1622.89, 'text': 'In practice, that's true.', 'start': 1622.069, 'duration': 0.821}, {'end': 1625.613, 'text': "I don't think people end up doing that a lot.", 'start': 1623.01, 'duration': 2.603}, {'end': 1627.174, 'text': "But yeah, that's something you might imagine.", 'start': 1625.713, 'duration': 1.461}, {'end': 1627.915, 'text': "I don't think it's common.", 'start': 1627.294, 'duration': 0.621}, {'end': 1629.436, 'text': 'Go ahead.', 'start': 1629.276, 'duration': 0.16}, {'end': 1635.371, 'text': 'Thank you.', 'start':
1635.111, 'duration': 0.26}, {'end': 1637.992, 'text': 'So are we always working with squares? The answer is yes.', 'start': 1635.451, 'duration': 2.541}, {'end': 1643.415, 'text': 'So this came up in a very early class where someone asked, say, for ImageNet we have all these images of different sizes, like, say,', 'start': 1638.533, 'duration': 4.882}, {'end': 1644.175, 'text': 'rectangles and so on.', 'start': 1643.415, 'duration': 0.76}, {'end': 1648.417, 'text': 'We always resize to just squares, just by default.', 'start': 1644.515, 'duration': 3.902}, {'end': 1652.939, 'text': "And we'll see how we can process non-rectangular images later, I think.", 'start': 1648.797, 'duration': 4.142}], 'summary': 'Padding zeros in input images, working with squares, and resizing to squares by default.', 'duration': 65.109, 'max_score': 1587.83, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ1587830.jpg'}, {'end': 1778.352, 'src': 'embed', 'start': 1753.579, 'weight': 3, 'content': [{'end': 1759.664, 'text': "So if you look, for example, at the API of spatial convolutional layer in Torch, you'll see that they require a whole bunch of parameters here.", 'start': 1753.579, 'duration': 6.085}, {'end': 1762.526, 'text': 'So they require, for example, an input plane.', 'start': 1760.384, 'duration': 2.142}, {'end': 1766.127, 'text': 'An input plane actually is not one of these hyperparameters.', 'start': 1763.126, 'duration': 3.001}, {'end': 1768.108, 'text': 'That is the depth of the input volume.', 'start': 1766.227, 'duration': 1.881}, {'end': 1772.09, 'text': "And they need to know that when you construct a layer, because they're about to initialize the filters.", 'start': 1768.428, 'duration': 3.662}, {'end': 1778.352, 'text': "And so these filters, when you initialize memory for them, they need to know what's the input depth, because that determines how large they will be.", 'start': 1772.55, 'duration': 5.802}], 'summary': "Torch's spatial convolutional layer requires input parameters like input plane for filter initialization.", 'duration': 24.773, 'max_score': 1753.579, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ1753579.jpg'}, {'end': 2087.648, 'src': 'heatmap', 'start': 1801.987, 'weight': 0.724, 'content': [{'end': 1804.369, 'text': 'And pad here is what padding you want.', 'start': 1801.987, 'duration': 2.382}, {'end': 1806.411, 'text': 'So you see those four hyperparameters.', 'start': 1804.909, 'duration': 1.502}, {'end': 1809.073, 'text': 'And also here, you have to pass in the input depth volume as well.', 'start': 1806.871, 'duration': 2.202}, {'end': 1811.174, 'text': "You'll see the same in, say, cafe.", 'start': 1809.693, 'duration': 1.481}, {'end': 1814.436, 'text': "I don't want to go into this too much, but you'll see, like, num output.", 'start': 1811.674, 'duration': 2.762}, {'end': 1815.497, 'text': 'This is number of filters.', 'start': 1814.516, 'duration': 0.981}, {'end': 1818.699, 'text': 'What is the kernel size? What is the stride? 
So these kinds of parameters.', 'start': 1815.677, 'duration': 3.022}, {'end': 1821.461, 'text': 'And I was going to go through Lasagne, but I think in the interest of time.', 'start': 1819.3, 'duration': 2.161}, {'end': 1824.543, 'text': 'They all take the same four hyperparameters, trust me.', 'start': 1822.602, 'duration': 1.941}, {'end': 1826.565, 'text': "And so that's what defines a convolutional layer.", 'start': 1825.044, 'duration': 1.521}, {'end': 1830.767, 'text': "OK, so I'll go into the brain neuron view of the convolutional layer.", 'start': 1827.443, 'duration': 3.324}, {'end': 1837.436, 'text': "Instead of talking about filters, let's try to talk about how neurons are wired in this simulated brain or something like that.", 'start': 1830.968, 'duration': 6.468}, {'end': 1839.518, 'text': 'So basically, this.', 'start': 1838.878, 'duration': 0.64}, {'end': 1847.126, 'text': "filter here, as we're sliding through the image, at this particular position, this filter is computing this dot product.", 'start': 1842.125, 'duration': 5.001}, {'end': 1849.807, 'text': "And this, of course, is very analogous to what we've seen before.", 'start': 1847.566, 'duration': 2.241}, {'end': 1853.688, 'text': 'So these neurons, they computed w transpose x plus b on their inputs.', 'start': 1849.827, 'duration': 3.861}, {'end': 1859.309, 'text': 'So we can interpret the output of the filter at this position as just a neuron that is fixed in space over there.', 'start': 1854.028, 'duration': 5.281}, {'end': 1862.91, 'text': 'And it happens to be looking at a small local region in the input image.', 'start': 1859.829, 'duration': 3.081}, {'end': 1865.39, 'text': "And it's computing w transpose x plus b.", 'start': 1863.33, 'duration': 2.06}, {'end': 1869.233, 'text': 'So its connections here are in this particular image.', 'start': 1865.39, 'duration': 3.843}, {'end': 1871.436, 'text': "And it doesn't have connections to the other parts of the image.", 'start': 1869.493, 'duration': 1.943}, {'end': 1872.878, 'text': "It's a local connectivity pattern.", 'start': 1871.616, 'duration': 1.262}, {'end': 1878.686, 'text': "And we would also sometimes say that this neuron's receptive field is five by five.", 'start': 1873.759, 'duration': 4.927}, {'end': 1885.093, 'text': "That's the size of the region of the input volume that it's looking at.", 'start': 1878.866, 'duration': 6.227}, {'end': 1887.135, 'text': "So that's just some terminology.", 'start': 1885.874, 'duration': 1.261}, {'end': 1893.579, 'text': "And also what's interesting is that as we slide the filter through with these weights, we use the same weights throughout,", 'start': 1887.535, 'duration': 6.044}, {'end': 1895.901, 'text': "because it's just one filter sliding through the conv volume.", 'start': 1893.579, 'duration': 2.322}, {'end': 1901.165, 'text': 'And so you can imagine that for one activation map, we think of that as a grid of neurons arranged in a 28 by 28 grid.', 'start': 1896.401, 'duration': 4.764}, {'end': 1908.052, 'text': 'And these neurons are all looking at their own little 5 by 5 patch in the input volume.', 'start': 1904.167, 'duration': 3.885}, {'end': 1911.776, 'text': "But all of them share parameters, because it's one filter computing all the outputs.", 'start': 1908.432, 'duration': 3.344}, {'end': 1916.241, 'text': "So all the neurons have the same weight, w, that they're using.", 'start': 1912.337, 'duration': 3.904}, {'end': 1918.644, 'text': "But they're all looking at a
slightly different part of the image.", 'start': 1916.321, 'duration': 2.323}, {'end': 1922.569, 'text': 'So they all share weights, and they have local connectivity.', 'start': 1920.506, 'duration': 2.063}, {'end': 1924.251, 'text': 'Those are the two most important parts.', 'start': 1922.609, 'duration': 1.642}, {'end': 1929.81, 'text': "And OK, so that's neurons that share weights in one activation map.", 'start': 1925.507, 'duration': 4.303}, {'end': 1932.031, 'text': 'But of course, we have actually several filters.', 'start': 1930.35, 'duration': 1.681}, {'end': 1933.852, 'text': 'So for example, we have five different filters.', 'start': 1932.271, 'duration': 1.581}, {'end': 1942.537, 'text': 'So we actually end up having the full view of this is that you have a 3D volume of neurons arranged in this 3D spatial layout.', 'start': 1934.513, 'duration': 8.024}, {'end': 1947.901, 'text': 'And all of them are looking at the input volume in a local pattern and sharing parameters across space here.', 'start': 1943.138, 'duration': 4.763}, {'end': 1950.983, 'text': "But across depth, they're all different neurons.", 'start': 1948.681, 'duration': 2.302}, {'end': 1957.069, 'text': 'So these five neurons here are all looking at the same patch of the input volume, but they all have different weights.', 'start': 1951.164, 'duration': 5.905}, {'end': 1961.514, 'text': 'But they share those weights spatially with their friends in the same depth slice.', 'start': 1957.49, 'duration': 4.024}, {'end': 1965.565, 'text': "So that's the neuron way of looking at what this is doing.", 'start': 1963.223, 'duration': 2.342}, {'end': 1970.189, 'text': 'We have this 3D arrangement of neurons, and they have local connectivity, and they share patterns in this funny way.', 'start': 1965.825, 'duration': 4.364}, {'end': 1976.875, 'text': 'And the reason they share parameters is a nice.', 'start': 1970.99, 'duration': 5.885}, {'end': 1981.619, 'text': "advantage of both the local connectivity and the parameter sharing is that it's basically controlling the capacity of the model.", 'start': 1976.875, 'duration': 4.744}, {'end': 1986.464, 'text': "So it makes sense that neurons spatially would want to compute similar things, like say they're looking for little edges.", 'start': 1982.159, 'duration': 4.305}, {'end': 1990.449, 'text': 'You might imagine that a vertical edge, or looking for a vertical edge in the middle of an image,', 'start': 1986.945, 'duration': 3.504}, {'end': 1993.392, 'text': 'is just as useful as looking for a vertical edge anywhere else spatially.', 'start': 1990.449, 'duration': 2.943}, {'end': 1998.519, 'text': 'And so it makes sense as a way of controlling overfitting to share those parameters spatially.', 'start': 1993.733, 'duration': 4.786}, {'end': 2007.262, 'text': 'So there will be all of these 28 by 28 grid of neurons looking for just, say, a vertical bar at all those spatial positions.', 'start': 1999.019, 'duration': 8.243}, {'end': 2012.564, 'text': "And they have local connectivity, because that's partly inspired by some of the experiments in, say, Hubel and Wiesel and so on.", 'start': 2007.782, 'duration': 4.782}, {'end': 2016.925, 'text': "We don't want full global connectivity, because then you have way too many parameters.", 'start': 2013.484, 'duration': 3.441}, {'end': 2021.167, 'text': 'So small filters, and then we make a large depth of the network.', 'start': 2017.406, 'duration': 3.761}, {'end': 2025.13, 'text': "OK, so right now I've covered the conv layers.", 
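The four hyperparameters K, F, S, P named here are exactly what a framework's conv layer constructor asks for, along with the input depth needed to allocate the filter memory. As a rough modern analogue of the Torch and Caffe calls discussed above (a hedged sketch using PyTorch's nn.Conv2d, which the lecture itself does not use), the layer's weight tensor also makes the parameter sharing visible: one F-by-F-by-depth filter per output slice, reused at every spatial position.

```python
import torch.nn as nn

# K=6 filters, F=5, S=1, P=0 on an input volume of depth 3 (e.g. 32x32x3).
conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5, stride=1, padding=0)

print(conv.weight.shape)   # torch.Size([6, 3, 5, 5]): one 5x5x3 filter per depth slice
print(conv.bias.shape)     # torch.Size([6]): one bias per filter

# All 28x28 neurons in one activation map share these 5*5*3 = 75 weights,
# versus 32*32*3 = 3072 weights each for a fully connected neuron.
n_params = sum(p.numel() for p in conv.parameters())
print(n_params)            # 456 = (5*5*3 + 1) * 6
```

Constructing the layer requires in_channels up front for the same reason the transcript gives for Torch: the filter memory cannot be allocated without knowing the input depth.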
'start': 2022.349, 'duration': 2.781}, {'end': 2026.671, 'text': 'We know what the ReLU layers are.', 'start': 2025.59, 'duration': 1.081}, {'end': 2029.592, 'text': "So there's pool and FC to go that I'll just talk about briefly.", 'start': 2026.731, 'duration': 2.861}, {'end': 2031.253, 'text': 'The pooling neurons.', 'start': 2030.392, 'duration': 0.861}, {'end': 2034.614, 'text': 'what they do is, as I mentioned, the conv layers.', 'start': 2031.253, 'duration': 3.361}, {'end': 2038.496, 'text': "as we'll see in a lot of the case studies at the end of the class, when we'll be doing conv operations,", 'start': 2034.614, 'duration': 3.882}, {'end': 2041.457, 'text': "we won't be shrinking the volume size spatially.", 'start': 2038.496, 'duration': 2.961}, {'end': 2043.238, 'text': "So we'll be preserving the spatial size.", 'start': 2041.777, 'duration': 1.461}, {'end': 2048.681, 'text': 'the reducing of the spatial size will in many cases be handled by the pooling layers.', 'start': 2044.338, 'duration': 4.343}, {'end': 2057.35, 'text': 'And intuitively what pooling layers are is they take your input volume and they just squish it spatially by just doing a downsampling operation.', 'start': 2049.243, 'duration': 8.107}, {'end': 2061.893, 'text': 'This downsampling operation happens on every single activation map independently.', 'start': 2058.01, 'duration': 3.883}, {'end': 2069.799, 'text': "So say you have a 224, 224, 64 input, you'd end up with half of that spatially, so 112, 112, 64.", 'start': 2062.293, 'duration': 7.506}, {'end': 2074.922, 'text': 'And every one of these depth slices of activation maps are basically just downsampled down and inserted back.', 'start': 2069.801, 'duration': 5.121}, {'end': 2078.043, 'text': "And so it's just a squish operation of downsampling.", 'start': 2075.322, 'duration': 2.721}, {'end': 2080.304, 'text': "Now mathematically, really what we're performing.", 'start': 2078.543, 'duration': 1.761}, {'end': 2085.847, 'text': 'the most common form of actually doing the downsampling that turns out to work best, and you can imagine other things as well,', 'start': 2080.304, 'duration': 5.543}, {'end': 2087.648, 'text': 'but the most commonly used is max pooling.', 'start': 2085.847, 'duration': 1.801}], 'summary': 'Neurons in convolutional layers share parameters and have local connectivity, pooling layers perform downsampling.', 'duration': 285.661, 'max_score': 1801.987, 'thumbnail': ''}, {'end': 1998.519, 'src': 'embed', 'start': 1970.99, 'weight': 4, 'content': [{'end': 1976.875, 'text': 'And the reason they share parameters is a nice.', 'start': 1970.99, 'duration': 5.885}, {'end': 1981.619, 'text': "advantage of both the local connectivity and the parameter sharing is that it's basically controlling the capacity of the model.", 'start': 1976.875, 'duration': 4.744}, {'end': 1986.464, 'text': "So it makes sense that neurons spatially would want to compute similar things, like say they're looking for little edges.", 'start': 1982.159, 'duration': 4.305}, {'end': 1990.449, 'text': 'You might imagine that a vertical edge, or looking for a vertical edge in the middle of an image,', 'start': 1986.945, 'duration': 3.504}, {'end': 1993.392, 'text': 'is just as useful as looking for a vertical edge anywhere else spatially.', 'start': 1990.449, 'duration': 2.943}, {'end': 1998.519, 'text': 'And so it makes sense as a way of controlling overfitting to share those parameters spatially.', 'start': 1993.733, 'duration': 4.786}], 'summary': 'Sharing parameters
controls model capacity, reduces overfitting.', 'duration': 27.529, 'max_score': 1970.99, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ1970990.jpg'}, {'end': 2074.922, 'src': 'embed', 'start': 2044.338, 'weight': 7, 'content': [{'end': 2048.681, 'text': 'the reducing of the spatial size will in many cases be handled by the pooling layers.', 'start': 2044.338, 'duration': 4.343}, {'end': 2057.35, 'text': 'And intuitively what pooling layers are is they take your input volume and they just squish it spatially by just doing a downsampling operation.', 'start': 2049.243, 'duration': 8.107}, {'end': 2061.893, 'text': 'This downsampling operation happens on every single activation map independently.', 'start': 2058.01, 'duration': 3.883}, {'end': 2069.799, 'text': "So say you have a 224, 224, 64 input, you'd end up with half of that spatially, so 112, 112, 64.", 'start': 2062.293, 'duration': 7.506}, {'end': 2074.922, 'text': 'And every one of these depth slices of activation maps are basically just downsampled down and inserted back.', 'start': 2069.801, 'duration': 5.121}], 'summary': 'Pooling layers downsample input volume by half spatially, e.g., from 224x224x64 to 112x112x64.', 'duration': 30.584, 'max_score': 2044.338, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ2044338.jpg'}, {'end': 2367.689, 'src': 'embed', 'start': 2339.103, 'weight': 5, 'content': [{'end': 2343.405, 'text': "And yeah, so at the end here, we're actually correctly classifying 80% of the CIFAR-10 data.", 'start': 2339.103, 'duration': 4.302}, {'end': 2346.187, 'text': 'And so you guys can look through this and play with it.', 'start': 2344.346, 'duration': 1.841}, {'end': 2350.429, 'text': 'The project is called ConvNetJS, and this is a CIFAR-10 demo for ConvNetJS.', 'start': 2346.707, 'duration': 3.722}, {'end': 2363.085, 'text': 'OK. Did you use GPU in this? Did I use GPU? No, this is all just JavaScript for loops.', 'start': 2350.449, 'duration': 12.636}, {'end': 2367.689, 'text': "I think there's like this convolutional layer, for example, I think it's like a six or seven nested loops or something like that.", 'start': 2363.365, 'duration': 4.324}], 'summary': 'ConvNetJS project achieves 80% CIFAR-10 classification using JavaScript for loops.', 'duration': 28.586, 'max_score': 2339.103, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ2339103.jpg'}], 'start': 1587.83, 'title': 'Image padding and CNN overview', 'summary': 'Discusses the concept of image padding and continuous data, emphasizing zeros and immediate neighbors.
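A minimal NumPy sketch of the max-pooling downsampling described above (illustrative only, not the lecture's code): 2x2 max pooling at stride 2 keeps the maximum of each 2x2 spatial block independently per depth slice, so a 224x224x64 input becomes 112x112x64.

```python
import numpy as np

x = np.random.randn(224, 224, 64)            # input volume (H, W, D)

# 2x2 max pooling with stride 2: group each 2x2 spatial block, keep the max.
# The depth dimension is untouched; every activation map pools independently.
H, W, D = x.shape
pooled = x.reshape(H // 2, 2, W // 2, 2, D).max(axis=(1, 3))

print(pooled.shape)                          # (112, 112, 64)
```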
It also provides an overview of CNNs, covering the structure of layers, parameters, pooling, and achieving 80% accuracy on the CIFAR-10 dataset without GPU usage.', 'chapters': [{'end': 1627.915, 'start': 1587.83, 'title': 'Image padding and continuous data', 'summary': 'Discusses the concept of image padding and the consideration of continuous data when filling the padding with values, with an emphasis on the use of zeros and the potential use of immediate neighbors, highlighting that the latter approach is uncommon.', 'duration': 40.085, 'highlights': ['The concept of image padding and the consideration of continuous data when filling the padding with values is discussed.', 'The use of zeros for padding is emphasized due to the unknown content outside the input image.', 'The potential use of immediate neighbors for padding is mentioned, with the note that it is not commonly practiced.']}, {'end': 2377.94, 'start': 1629.276, 'title': 'Convolutional neural networks overview', 'summary': 'Provides an overview of convolutional neural networks (CNNs), highlighting the use of squares as the default image format, the structure of convolutional layers, parameters used in spatial convolutional layers, the neuron view of the convolutional layer, the purpose and operation of pooling layers, and the fully connected layer at the end. It also includes a demonstration of training a CNN in JavaScript, achieving 80% accuracy on the CIFAR-10 dataset without GPU usage.', 'duration': 748.664, 'highlights': ['CNNs use squares as the default image format, and non-rectangular images are resized to squares for processing.', 'Explanation of parameters used in spatial convolutional layers, including input plane, output plane, kernel width, kernel height, stride, and padding.', "Description of the neuron view of the convolutional layer, including local connectivity, shared weights, and the purpose of parameter sharing in controlling the model's capacity.", 'Explanation of pooling layers, their purpose in downsampling the spatial size of activation maps, and the operation of max pooling and average pooling.', 'Demonstration of training a CNN in JavaScript without GPU usage and achieving 80% accuracy on the CIFAR-10 dataset, showcasing the visualization of filters, activation maps, and gradients.
']}], 'duration': 790.11, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ1587830.jpg', 'highlights': ['The concept of image padding and the consideration of continuous data when filling the padding with values is discussed.', 'The use of zeros for padding is emphasized due to the unknown content outside the input image.', 'CNNs use squares as the default image format, and non-rectangular images are resized to squares for processing.', 'Explanation of parameters used in spatial convolutional layers, including input plane, output plane, kernel width, kernel height, stride, and padding.', "Description of the neuron view of the convolutional layer, including local connectivity, shared weights, and the purpose of parameter sharing in controlling the model's capacity.", 'Demonstration of training a CNN in JavaScript without GPU usage and achieving 80% accuracy on the CIFAR-10 dataset, showcasing the visualization of filters, activation maps, and gradients.', 'The potential use of immediate neighbors for padding is mentioned, with the note that it is not commonly practiced.', 'The purpose and operation of pooling layers, specifically their role in downsampling the spatial size of activation maps, and the explanation of max pooling and average pooling are provided.']}, {'end': 2742.649, 'segs': [{'end': 2409.106, 'src': 'embed', 'start': 2380.405, 'weight': 0, 'content': [{'end': 2383.387, 'text': "So at this point, I'm going to go into lots of case studies for convolutional networks.", 'start': 2380.405, 'duration': 2.982}, {'end': 2387.05, 'text': "And we'll see all the winning convolutional networks for all of ImageNet competitions.", 'start': 2383.467, 'duration': 3.583}, {'end': 2390.012, 'text': "And we'll see how people actually wire up these convolutional networks in practice.", 'start': 2387.13, 'duration': 2.882}, {'end': 2393.334, 'text': 'Before I dive into that, I can take a few more questions regarding convnets.', 'start': 2390.693, 'duration': 2.641}, {'end': 2393.595, 'text': 'So go ahead.', 'start': 2393.355, 'duration': 0.24}, {'end': 2404.382, 'text': 'What is the intuition behind actually stacking the filters as opposed to doing something like a linear combination?', 'start': 2393.615, 'duration': 10.767}, {'end': 2409.106, 'text': "So you're saying instead of stacking filters like one on top of another in terms of conv, you'd like to do something else instead?", 'start': 2404.422, 'duration': 4.684}], 'summary': 'Discussing case studies of convolutional networks and ImageNet competition winners.', 'duration': 28.701, 'max_score': 2380.405, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ2380405.jpg'}, {'end': 2463.818, 'src': 'embed', 'start': 2434.104, 'weight': 1, 'content': [{'end': 2439.008, 'text': 'So in practice, anything you can back prop through, you could put in a convnet or into a neural net in general.', 'start': 2434.104, 'duration': 4.904}, {'end': 2444.352, 'text': 'So we use these functions because they happen to train efficiently and maybe partly historically.', 'start': 2439.508, 'duration': 4.844}, {'end': 2446.935, 'text': "But I also have trouble thinking about what else you'd put in there.", 'start': 2444.392, 'duration': 2.543}, {'end': 2451.118, 'text': 'So conv layer is the main kind of workhorse of convolutional networks.', 'start': 2448.116, 'duration': 3.002}, {'end':
2452.599, 'text': "It's doing all the most important computation.", 'start': 2451.178, 'duration': 1.421}, {'end': 2454.501, 'text': 'Go ahead.', 'start': 2454.321, 'duration': 0.18}, {'end': 2463.818, 'text': 'Yeah, so these layers of very big depth, like volumes of very big depth.', 'start': 2459.775, 'duration': 4.043}], 'summary': 'Convolutional networks utilize conv layers for important computations and efficient training.', 'duration': 29.714, 'max_score': 2434.104, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ2434104.jpg'}, {'end': 2554.151, 'src': 'embed', 'start': 2522.015, 'weight': 2, 'content': [{'end': 2524.496, 'text': "It's like 50 slides in and I can't find it.", 'start': 2522.015, 'duration': 2.481}, {'end': 2526.257, 'text': 'Do you remember it? No? OK.', 'start': 2524.516, 'duration': 1.741}, {'end': 2526.737, 'text': "We'll find it.", 'start': 2526.317, 'duration': 0.42}, {'end': 2531.568, 'text': 'There are some techniques that have been developed over the last two years.', 'start': 2529.488, 'duration': 2.08}, {'end': 2538.09, 'text': "So these visualizations are visualizing what the neurons are responding to, but they're not visualizing what those filters are.", 'start': 2531.729, 'duration': 6.361}, {'end': 2540.13, 'text': "And there's no good way to visualize that, in fact.", 'start': 2538.17, 'duration': 1.96}, {'end': 2546.092, 'text': "So in ConvNetJS, I visualize them just by still insisting on just like, here's the weights.", 'start': 2540.75, 'duration': 5.342}, {'end': 2549.072, 'text': "But you can't interpret them when you look at them, because it doesn't make sense.", 'start': 2546.132, 'duration': 2.94}, {'end': 2554.151, 'text': "Because they don't directly connect to the image, right?
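Since the conv layer is called the workhorse of these networks, a short numpy sketch of what a single filter actually computes may help; this is an illustrative single-filter version, not the lecture's code:

```python
import numpy as np

def conv_single_filter(x, w, b=0.0, stride=1):
    """Slide one f x f x D filter over x (H x W x D), taking a dot product
    at every spatial position; this produces one activation map."""
    h_in, w_in, _ = x.shape
    f = w.shape[0]
    out_h = (h_in - f) // stride + 1
    out_w = (w_in - f) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i*stride:i*stride+f, j*stride:j*stride+f, :]
            out[i, j] = np.sum(patch * w) + b  # the dot product plus bias
    return out

x = np.random.randn(32, 32, 3)
w = np.random.randn(5, 5, 3)            # one 5x5x3 filter
print(conv_single_filter(x, w).shape)   # (28, 28); six such filters stack to 28x28x6
```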
Okay.", 'start': 2550.093, 'duration': 4.058}], 'summary': 'Difficulty finding 50 slides, discussing visualization techniques for neural responses and filters.', 'duration': 32.136, 'max_score': 2522.015, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ2522015.jpg'}, {'end': 2626.603, 'src': 'embed', 'start': 2600.005, 'weight': 4, 'content': [{'end': 2603.906, 'text': "So spatially we've lost a tiny bit of information, but we still know there's a car wheel somewhere there.", 'start': 2600.005, 'duration': 3.901}, {'end': 2608.528, 'text': 'And so we are throwing away a tiny bit of information as we do every single pooling layer.', 'start': 2604.747, 'duration': 3.781}, {'end': 2618.051, 'text': 'I think the whole idea of this content is to abstract information from high dimensional information you only abstract a label.', 'start': 2608.548, 'duration': 9.503}, {'end': 2619.956, 'text': "Yeah, that's right.", 'start': 2619.495, 'duration': 0.461}, {'end': 2622.879, 'text': "So I mean, to some extent, we're getting this image.", 'start': 2620.016, 'duration': 2.863}, {'end': 2625.422, 'text': 'We want to throw away some information at some point, maybe.', 'start': 2622.919, 'duration': 2.503}, {'end': 2626.603, 'text': "I mean, it's not really clear.", 'start': 2625.442, 'duration': 1.161}], 'summary': 'Pooling layers discard some information to abstract high-dimensional data.', 'duration': 26.598, 'max_score': 2600.005, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ2600005.jpg'}, {'end': 2668.56, 'src': 'embed', 'start': 2639.818, 'weight': 3, 'content': [{'end': 2641.24, 'text': "At the same time there are papers we'll go into.", 'start': 2639.818, 'duration': 1.422}, {'end': 2645.764, 'text': 'this called, for example, is spatial information preserved in convnets?', 'start': 2641.24, 'duration': 4.524}, {'end': 2647.226, 'text': 'And they study it and in fact it is.', 'start': 2645.844, 'duration': 1.382}, {'end': 2650.189, 'text': "And so this almost seems like we're throwing away spatial information,", 'start': 2647.666, 'duration': 2.523}, {'end': 2654.453, 'text': 'but convnets are still very good at precisely figuring out where things are in the input image.', 'start': 2650.189, 'duration': 4.264}, {'end': 2656.616, 'text': "And so that seems like a paradox, and we'll go into that in a bit.", 'start': 2654.614, 'duration': 2.002}, {'end': 2658.117, 'text': 'Well, we have too many questions.', 'start': 2657.176, 'duration': 0.941}, {'end': 2658.337, 'text': 'Go ahead.', 'start': 2658.177, 'duration': 0.16}, {'end': 2668.56, 'text': 'Is there ever depth reduction before we can Is there ever depth reduction before the colloquially-continuated layer? There is.', 'start': 2658.357, 'duration': 10.203}], 'summary': 'Spatial information is preserved in convnets, despite seeming paradoxical. depth reduction before the colloquially-continuated layer exists.', 'duration': 28.742, 'max_score': 2639.818, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ2639818.jpg'}], 'start': 2380.405, 'title': 'Convolutional networks and pooling in convnets', 'summary': 'Explores case studies of winning convolutional networks for imagenet competitions, discusses the intuition behind stacking filters, practical implementation of convolutional networks, and visualization techniques for conv layers. 
it also delves into the impact of pooling on spatial information in convnets, highlighting trade-offs in information preservation, the paradox of spatial information in convolutional neural networks, and touching on depth reduction and padding issues.', 'chapters': [{'end': 2562.021, 'start': 2380.405, 'title': 'Convolutional networks: case studies & visualization', 'summary': 'Explores case studies of winning convolutional networks for imagenet competitions, discusses the intuition behind stacking filters, the practical implementation of convolutional networks, and the visualization techniques for conv layers.', 'duration': 181.616, 'highlights': ['The chapter explores the winning convolutional networks for all ImageNet competitions, providing practical insights into how these networks are wired up in practice.', 'The discussion delves into the intuition behind stacking filters in convolutional networks, emphasizing the preference for dot products due to efficient backpropagation and the maintenance of a local receptive field.', 'The practical implementation of convolutional networks is examined, highlighting the use of functions that can be efficiently backpropagated through and the main workhorse role of the conv layer.', 'The visualization techniques for conv layers are addressed, with a focus on the challenges of visualizing the filters and the development of techniques over the last two years for visualizing what the neurons are responding to.']}, {'end': 2742.649, 'start': 2562.021, 'title': 'Pooling and spatial information in convnets', 'summary': 'Discusses the impact of pooling on spatial information in convnets, highlighting the trade-offs in information preservation and the paradox of spatial information in convolutional neural networks, while also touching on depth reduction and padding issues.', 'duration': 180.628, 'highlights': ['Pooling operations result in the loss of spatial information, impacting the precise positioning of features, while still retaining abstracted labels. Pooling layers discard spatial information, leading to a loss of precise feature positioning, while still retaining abstracted labels.', 'Convnets exhibit a paradox in which they are adept at determining spatial locations despite the spatial information loss from pooling layers. Research indicates that convnets excel at determining spatial locations despite the spatial information loss from pooling layers, presenting a paradox.', 'The discussion also briefly touches on depth reduction and padding issues in convnets. 
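On the depth-reduction question raised in the Q&A: the usual mechanism is a 1x1 convolution, which remaps depth at every pixel without touching the spatial extent. A hedged numpy sketch (not from the lecture):

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: an independent linear map across depth at each
    spatial position. x: H x W x D_in, w: D_in x D_out."""
    return np.einsum('hwc,cd->hwd', x, w)

x = np.random.randn(28, 28, 256)
w = np.random.randn(256, 64)
print(conv1x1(x, w).shape)  # (28, 28, 64): depth reduced, spatial size intact
```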
The conversation briefly addresses depth reduction and padding issues in convnets, raising concerns about the impact of padding with zeros on filter statistics.']}], 'duration': 362.244, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ2380405.jpg', 'highlights': ['The chapter explores the winning convolutional networks for all ImageNet competitions, providing practical insights into how these networks are wired up in practice.', 'The practical implementation of convolutional networks is examined, highlighting the use of functions that can be efficiently backpropagated through and the main workhorse role of the conv layer.', 'The visualization techniques for conv layers are addressed, with a focus on the challenges of visualizing the filters and the development of techniques over the last two years for visualizing what the neurons are responding to.', 'Convnets exhibit a paradox in which they are adept at determining spatial locations despite the spatial information loss from pooling layers. Research indicates that convnets excel at determining spatial locations despite the spatial information loss from pooling layers, presenting a paradox.', 'Pooling operations result in the loss of spatial information, impacting the precise positioning of features, while still retaining abstracted labels. Pooling layers discard spatial information, leading to a loss of precise feature positioning, while still retaining abstracted labels.']}, {'end': 3475.328, 'segs': [{'end': 2796.123, 'src': 'embed', 'start': 2765.731, 'weight': 4, 'content': [{'end': 2767.572, 'text': 'You can see this is a figure from the paper.', 'start': 2765.731, 'duration': 1.841}, {'end': 2769.592, 'text': 'We received a 32 by 32 image.', 'start': 2767.992, 'duration': 1.6}, {'end': 2775.815, 'text': 'Then they had six kernels, six filters that were 5 by 5.', 'start': 2770.673, 'duration': 5.142}, {'end': 2777.455, 'text': 'So they used 5 by 5 throughout this architecture.', 'start': 2775.815, 'duration': 1.64}, {'end': 2782.497, 'text': 'So six 5 by 5 filters, which brought it down to 28 by 28.', 'start': 2777.796, 'duration': 4.701}, {'end': 2786.479, 'text': 'And then they did subsampling layer, S2, which is subsampling or max pooling.', 'start': 2782.497, 'duration': 3.982}, {'end': 2796.123, 'text': 'So they subsampled it and then they did 16 convolutional filters again five by five, applied at stride one, and so they got 10 by 10,', 'start': 2787.341, 'duration': 8.782}], 'summary': '32x32 image processed with 6x5x5 kernels, resulting in 10x10 output', 'duration': 30.392, 'max_score': 2765.731, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ2765731.jpg'}, {'end': 2836.109, 'src': 'embed', 'start': 2808.585, 'weight': 2, 'content': [{'end': 2812.646, 'text': 'And they used five by five filters applied at stride one, and the pooling layers were two by two applied at stride two.', 'start': 2808.585, 'duration': 4.061}, {'end': 2817.783, 'text': "I don't want to dwell on this architecture too much, because we have a few more interesting architectures as well.", 'start': 2813.9, 'duration': 3.883}, {'end': 2820.625, 'text': "So we'll go into much more detail in AlexNet.", 'start': 2819.004, 'duration': 1.621}, {'end': 2828.23, 'text': 'So AlexNet is the architecture from in 2012 that famously won the ImageNet competition by a big margin.', 'start': 2821.225, 'duration': 7.005}, {'end': 2830.992, 'text': 'So its 
inputs were 227 by 227 by 3 images.', 'start': 2829.291, 'duration': 1.701}, {'end': 2836.109, 'text': 'It had this architecture here.', 'start': 2835.069, 'duration': 1.04}], 'summary': 'Alexnet won the 2012 imagenet competition with a large margin using an architecture with 227x227x3 input images.', 'duration': 27.524, 'max_score': 2808.585, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ2808585.jpg'}, {'end': 3243.371, 'src': 'embed', 'start': 3216.344, 'weight': 0, 'content': [{'end': 3223.566, 'text': 'And the performance on the ImageNet test set, at the end, for a single convolutional network, was 18.2% error.', 'start': 3216.344, 'duration': 7.222}, {'end': 3230.287, 'text': 'If you form a seven convolutional network ensemble, you get 15.4% error.', 'start': 3225.606, 'duration': 4.681}, {'end': 3235.269, 'text': "So if you remember from last lecture, you're supposed to get 2% extra when you do an ensemble.", 'start': 3230.887, 'duration': 4.382}, {'end': 3240.49, 'text': "In this case, we're seeing a bit better than 2%, so the approximate rule.", 'start': 3235.589, 'duration': 4.901}, {'end': 3243.371, 'text': "But yeah, so that's the AlexNet.", 'start': 3241.81, 'duration': 1.561}], 'summary': 'Ensemble of seven convolutional networks achieved 15.4% error on imagenet test set, surpassing 2% improvement expected from ensemble.', 'duration': 27.027, 'max_score': 3216.344, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ3216344.jpg'}, {'end': 3470.783, 'src': 'embed', 'start': 3440.829, 'weight': 1, 'content': [{'end': 3447.672, 'text': "let's throughout the convolutional network just commit to using 3 by 3 conv stride 1 with pad 1 and 2 by 2 max pool of stride 2..", 'start': 3440.829, 'duration': 6.843}, {'end': 3450.113, 'text': "That's the only spatial dimensions we're using.", 'start': 3447.672, 'duration': 2.441}, {'end': 3454.075, 'text': "So throughout the entire conv net, we're just using these guys.", 'start': 3450.894, 'duration': 3.181}, {'end': 3455.816, 'text': "And now it's just about how many you put in there.", 'start': 3454.175, 'duration': 1.641}, {'end': 3459.718, 'text': 'And so it turned out that a 16-layer model, I think this one, ended up performing best.', 'start': 3456.336, 'duration': 3.382}, {'end': 3462.999, 'text': "I'm going to go into some of the details of this architecture.", 'start': 3461.138, 'duration': 1.861}, {'end': 3470.783, 'text': "I'd just like to point out that their error went down from 11.2%, which was the previous year, and it got it down to 7.3% error at this point.", 'start': 3463.379, 'duration': 7.404}], 'summary': 'Using 3x3 convolutions and 2x2 max pool, error went down from 11.2% to 7.3%.', 'duration': 29.954, 'max_score': 3440.829, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ3440829.jpg'}], 'start': 2743.07, 'title': 'Evolution of convolutional neural network architectures', 'summary': 'The chapter discusses the evolution of convolutional neural network architectures, focusing on specific examples such as lenet-5 and alexnet, highlighting key architectural details and numerical values like filter sizes, image dimensions, and number of parameters.
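The AlexNet sizes above can be checked with the same sizing formula; a short sketch, assuming the 11x11 stride-4 first layer and 3x3 stride-2 pooling from the AlexNet paper:

```python
def conv_out(n, f, s, p=0):
    # (n - f + 2p)/s + 1, the sizing formula used throughout the lecture
    return (n - f + 2 * p) // s + 1

# AlexNet conv1: 227x227x3 input, 96 filters of 11x11 applied at stride 4
print(conv_out(227, 11, 4))  # 55 -> the output volume is 55 x 55 x 96

# followed by 3x3 max pooling at stride 2
print(conv_out(55, 3, 2))    # 27 -> 27 x 27 x 96
```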
it also covers the evolution of convolutional networks from alexnet to vggnet, highlighting key architectural features, hyperparameters, and performance metrics, including an 18.2% error rate for alexnet and a 7.3% error rate for vggnet.', 'chapters': [{'end': 3143.31, 'start': 2743.07, 'title': 'Evolution of convolutional neural network architectures', 'summary': 'The chapter discusses the evolution of convolutional neural network architectures, focusing on specific examples such as lenet-5 and alexnet, highlighting key architectural details and numerical values like filter sizes, image dimensions, and number of parameters.', 'duration': 400.24, 'highlights': ['The architecture of LeNet-5 from the 1990s consisted of 32x32 input images, six 5x5 filters, subsampling layers, and fully connected layers.', 'AlexNet, the 2012 ImageNet competition winner, featured 227x227x3 input images, 96 11x11 filters in the first convolutional layer (approximately 35,000 parameters in that layer alone), and a series of convolutional, pooling, and fully connected layers, ultimately producing class scores for 1,000 categories in ImageNet.', 'The evolution of architectures typically follows a pattern of convolutional, pooling, and fully connected layers, with variations in filter sizes, spatial dimensions, and number of parameters, ultimately transforming input images into class scores through differentiable operations.']}, {'end': 3475.328, 'start': 3143.33, 'title': 'Evolution of convolutional networks', 'summary': 'Covers the evolution of convolutional networks from alexnet to vggnet, highlighting key architectural features, hyperparameters, and performance metrics, including an 18.2% error rate for alexnet and a 7.3% error rate for vggnet.', 'duration': 331.998, 'highlights': ['VGGNet achieved a significant improvement with a 7.3% error rate, using a 16-layer model with 3 by 3 conv stride 1 and 2 by 2 max pool of stride 2 throughout the entire conv net. VGGNet achieved a 7.3% error rate, a significant improvement from the previous year, using a 16-layer model with specific architectural choices of 3 by 3 conv stride 1 and 2 by 2 max pool of stride 2 throughout the entire conv net.', 'AlexNet performance on the ImageNet test set resulted in an 18.2% error rate for a single convolutional network and a 15.4% error rate for a seven convolutional network ensemble. AlexNet achieved an 18.2% error rate for a single convolutional network and a 15.4% error rate for a seven convolutional network ensemble on the ImageNet test set, surpassing the expected 2% improvement with the ensemble.', 'The evolution from AlexNet to VGGNet involved architectural enhancements, including changes in filter sizes, stride parameters, and the use of rectified linear units and normalization layers.
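The parameter counts quoted above can be verified directly; a small sketch (LeNet-5's single-channel grayscale input is an assumption, as the transcript does not spell it out):

```python
def conv_weights(f, d_in, k):
    # Weights in a conv layer: f*f*d_in per filter, times k filters (biases excluded)
    return f * f * d_in * k

# AlexNet conv1: 96 filters of 11x11 over a 3-channel input
print(conv_weights(11, 3, 96))  # 34848, i.e. the ~35,000 parameters cited above

# LeNet-5 conv1: 6 filters of 5x5 over a single-channel 32x32 image
print(conv_weights(5, 1, 6))    # 150, orders of magnitude smaller
```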
The evolution from AlexNet to VGGNet involved architectural enhancements such as changes in filter sizes, stride parameters, and the use of rectified linear units and normalization layers, resulting in improved performance.']}], 'duration': 732.258, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ2743070.jpg', 'highlights': ['AlexNet achieved an 18.2% error rate for a single convolutional network and a 15.4% error rate for a seven convolutional network ensemble on the ImageNet test set, surpassing the expected 2% improvement with the ensemble.', 'VGGNet achieved a significant improvement with a 7.3% error rate, using a 16-layer model with specific architectural choices of 3 by 3 conv stride 1 and 2 by 2 max pool of stride 2 throughout the entire conv net.', 'The evolution from AlexNet to VGGNet involved architectural enhancements such as changes in filter sizes, stride parameters, and the use of rectified linear units and normalization layers, resulting in improved performance.', 'AlexNet, the 2012 ImageNet competition winner, featured 227x227x3 input images, 96 11x11 filters in the first convolutional layer (approximately 35,000 parameters in that layer alone), and a series of convolutional, pooling, and fully connected layers, ultimately producing class scores for 1,000 categories in ImageNet.', 'The architecture of LeNet-5 from the 1990s consisted of 32x32 input images, six 5x5 filters, subsampling layers, and fully connected layers.']}, {'end': 4732.994, 'segs': [{'end': 3626.199, 'src': 'embed', 'start': 3586.157, 'weight': 1, 'content': [{'end': 3587.958, 'text': 'For the backward pass, we also need the gradients.', 'start': 3586.157, 'duration': 1.801}, {'end': 3592.381, 'text': 'So we would end up with a rough footprint of about 200 megabytes per image,', 'start': 3588.479, 'duration': 3.902}, {'end': 3595.343, 'text': 'just to give you an idea of what the footprint of some of this computation is.', 'start': 3592.381, 'duration': 2.962}, {'end': 3601.968, 'text': 'And the total number of parameters is 140 million at the end when you add up all the parameters throughout this.', 'start': 3596.324, 'duration': 5.644}, {'end': 3609.933, 'text': 'So there are some fun things to note about the asymmetry between where all the memory is and where all the parameters are in the network.', 'start': 3603.068, 'duration': 6.865}, {'end': 3619.814, 'text': "OK?
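A sketch of the consequence of committing to 3x3 conv / 2x2 pool everywhere, assuming the standard 224x224 VGG input (the real network repeats the conv step two or three times per stage, but since it preserves spatial size the trace is unchanged):

```python
def vgg_spatial_trace(n=224, stages=5):
    """Trace spatial size through VGG-style stages: 3x3 conv, stride 1,
    pad 1 preserves size; 2x2 max pool, stride 2 halves it."""
    sizes = [n]
    for _ in range(stages):
        n = (n - 3 + 2 * 1) // 1 + 1  # conv: size unchanged
        n = (n - 2) // 2 + 1          # pool: size halved
        sizes.append(n)
    return sizes

print(vgg_spatial_trace())  # [224, 112, 56, 28, 14, 7] -> a 7x7 volume before the FC layers
```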
So in particular, you'll note that most of the memory of this is actually in the very first few convolutional layers.", 'start': 3611.087, 'duration': 8.727}, {'end': 3626.199, 'text': "You've taken 64 kernels, and you went through the image, and you ended up with 224 by 224 by 64.", 'start': 3620.274, 'duration': 5.925}], 'summary': 'Backward pass requires gradients, footprint of 200mb per image, 140m parameters.', 'duration': 40.042, 'max_score': 3586.157, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ3586157.jpg'}, {'end': 3790.164, 'src': 'embed', 'start': 3763.563, 'weight': 3, 'content': [{'end': 3768.187, 'text': "But I'll just point out that GoogleNet was the winner of the 2014 challenge.", 'start': 3763.563, 'duration': 4.624}, {'end': 3770.329, 'text': 'It had a 6.7% top five error.', 'start': 3768.267, 'duration': 2.062}, {'end': 3777.774, 'text': 'So if you remember, original AlexNet was at an error of, where are we, 15.4%.', 'start': 3770.929, 'duration': 6.845}, {'end': 3779.036, 'text': "So we've come down quite a bit.", 'start': 3777.774, 'duration': 1.262}, {'end': 3782.318, 'text': "And we're now at roughly 6.7%.", 'start': 3779.516, 'duration': 2.802}, {'end': 3784.66, 'text': 'And the GoogleNet, you can go through this in a paper.', 'start': 3782.318, 'duration': 2.342}, {'end': 3786.702, 'text': "I don't think I want to spend too much time on this.", 'start': 3784.88, 'duration': 1.822}, {'end': 3790.164, 'text': 'They have these inception layers instead of convolutional layers.', 'start': 3786.762, 'duration': 3.402}], 'summary': "The 2014 challenge winner achieved a 6.7% top five error, surpassing the original alexnet's 15.4% error, and googlenet uses inception layers instead of convolutional layers.", 'duration': 26.601, 'max_score': 3763.563, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ3763563.jpg'}, {'end': 3935.077, 'src': 'heatmap', 'start': 3887.481, 'weight': 0.873, 'content': [{'end': 3890.663, 'text': "And we'll go into that in later lectures.", 'start': 3887.481, 'duration': 3.182}, {'end': 3892.284, 'text': 'But around 5%.', 'start': 3891.244, 'duration': 1.04}, {'end': 3896.668, 'text': 'But if you take an ensemble of humans and you train them for a long time, you can get down to maybe 2%, 3% or so.', 'start': 3892.284, 'duration': 4.384}, {'end': 3900.511, 'text': "It's my estimate.", 'start': 3899.85, 'duration': 0.661}, {'end': 3901.592, 'text': "But we'll see that in a bit.", 'start': 3900.631, 'duration': 0.961}, {'end': 3905.325, 'text': 'So these networks are working very well.', 'start': 3903.883, 'duration': 1.442}, {'end': 3909.189, 'text': "And you'll see ResNet, by the way, the winner of 2015, which is, I think, my next slide.", 'start': 3905.365, 'duration': 3.824}, {'end': 3912.833, 'text': 'So 3.6% top five error.', 'start': 3910.47, 'duration': 2.363}, {'end': 3918.278, 'text': 'So this is residual networks from Microsoft Research Asia, work by Kaiming He and colleagues.', 'start': 3913.473, 'duration': 4.805}, {'end': 3923.124, 'text': 'And they did not, in fact, just win ImageNet 2015.', 'start': 3919.059, 'duration': 4.065}, {'end': 3925.406, 'text': 'They won a whole bunch of competitions at the same time.', 'start': 3923.124, 'duration': 2.282}, {'end': 3930.131, 'text': 'So all of them first places in a whole bunch of just very important competitions here.', 'start': 3925.867, 'duration': 4.264}, {'end': 3932.073, 'text':
"And it's with one architecture.", 'start': 3930.912, 'duration': 1.161}, {'end': 3933.735, 'text': 'And so that was very interesting.', 'start': 3932.434, 'duration': 1.301}, {'end': 3935.077, 'text': "And I'm going to tell you about it now.", 'start': 3933.775, 'duration': 1.302}], 'summary': 'Resnet achieved a 3.6% top five error, winning multiple 2015 competitions.', 'duration': 47.596, 'max_score': 3887.481, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ3887481.jpg'}, {'end': 3918.278, 'src': 'embed', 'start': 3892.284, 'weight': 0, 'content': [{'end': 3896.668, 'text': 'But if you take an ensemble of humans and you train them for a long time, you can get down to maybe 2%, 3% or so.', 'start': 3892.284, 'duration': 4.384}, {'end': 3900.511, 'text': "It's my estimate.", 'start': 3899.85, 'duration': 0.661}, {'end': 3901.592, 'text': "But we'll see that in a bit.", 'start': 3900.631, 'duration': 0.961}, {'end': 3905.325, 'text': 'So these networks are working very well.', 'start': 3903.883, 'duration': 1.442}, {'end': 3909.189, 'text': "And you'll see ResNet, by the way, the winner of 2015, which is, I think, my next slide.", 'start': 3905.365, 'duration': 3.824}, {'end': 3912.833, 'text': 'So 3.6% top five error.', 'start': 3910.47, 'duration': 2.363}, {'end': 3918.278, 'text': 'So this is residual networks from Microsoft Research Asia, work by and colleagues.', 'start': 3913.473, 'duration': 4.805}], 'summary': 'Ensemble of humans trained for a long time can achieve 2-3% error rate in image recognition.', 'duration': 25.994, 'max_score': 3892.284, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ3892284.jpg'}, {'end': 4075.319, 'src': 'embed', 'start': 4041.196, 'weight': 4, 'content': [{'end': 4042.516, 'text': 'You want to do it in a ResNet way.', 'start': 4041.196, 'duration': 1.32}, {'end': 4044.258, 'text': "And so we'll see what that is in a bit.", 'start': 4042.997, 'duration': 1.261}, {'end': 4048.14, 'text': 'So this is just a visualization of the ResNet, 152 layers.', 'start': 4044.938, 'duration': 3.202}, {'end': 4052.784, 'text': 'So the VGGNet here was 20, so it basically dwarfs all the other previous architectures.', 'start': 4048.721, 'duration': 4.063}, {'end': 4056.966, 'text': "Just to give you an idea about the scale, and we'll go into computational considerations soon,", 'start': 4053.584, 'duration': 3.382}, {'end': 4060.289, 'text': 'but you want roughly two to three weeks of training time on eight GPU machine.', 'start': 4056.966, 'duration': 3.323}, {'end': 4064.892, 'text': "So if you guys are not getting some good numbers sometimes working on your laptop, then don't feel too bad,", 'start': 4060.889, 'duration': 4.003}, {'end': 4067.454, 'text': 'because this stuff just takes a while to train.', 'start': 4064.892, 'duration': 2.562}, {'end': 4075.319, 'text': "And even though it's 150 layers, it's actually faster than a VGG net, which is quite interesting.", 'start': 4069.475, 'duration': 5.844}], 'summary': 'Resnet has 152 layers, requiring 2-3 weeks of training on an 8-gpu machine, faster than vgg net.', 'duration': 34.123, 'max_score': 4041.196, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ4041196.jpg'}, {'end': 4229.482, 'src': 'embed', 'start': 4199.623, 'weight': 5, 'content': [{'end': 4201.304, 'text': 'So this is a ResNet module.', 'start': 4199.623, 'duration': 1.681}, {'end': 
4208.745, 'text': "What's nice about this is you're computing just these deltas to these Xs,", 'start': 4203.902, 'duration': 4.843}, {'end': 4222.515, 'text': 'and so one way to look at why this might be nice is if you think about the gradient flow of backwards through these layers in here the gradient has to kind of go through all these weights and backprop through them.', 'start': 4208.745, 'duration': 13.77}, {'end': 4225.258, 'text': 'But in here, the gradient flows in through here.', 'start': 4223.056, 'duration': 2.202}, {'end': 4229.482, 'text': "And since you're doing addition, remember, addition just distributes the gradient equally to all of its children.", 'start': 4225.338, 'duration': 4.144}], 'summary': 'Resnet module computes deltas to xs, enabling efficient gradient flow through layers.', 'duration': 29.859, 'max_score': 4199.623, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ4199623.jpg'}, {'end': 4307.853, 'src': 'embed', 'start': 4282.574, 'weight': 6, 'content': [{'end': 4288.38, 'text': 'just to give you an idea about, again, the hyperparameters that people use in practice: they use batch normalization after every single conv layer.', 'start': 4282.574, 'duration': 5.806}, {'end': 4293.563, 'text': "So there's many conv layers, many batch normalizations everywhere.", 'start': 4288.6, 'duration': 4.963}, {'end': 4295.845, 'text': 'They use the Xavier over 2 initialization.', 'start': 4294.284, 'duration': 1.561}, {'end': 4301.829, 'text': 'If you remember the paper that proposed this over-2 initialization that I talked about when you use relu layers,', 'start': 4296.245, 'duration': 5.584}, {'end': 4304.791, 'text': "that's actually the same person, Kaiming He, that proposed that initialization.", 'start': 4301.829, 'duration': 2.962}, {'end': 4307.853, 'text': 'They use SGD momentum with 0.9.', 'start': 4305.811, 'duration': 2.042}], 'summary': 'Common hyperparameters include batch normalization after each conv layer, Xavier initialization over 2, and SGD momentum of 0.9.', 'duration': 25.279, 'max_score': 4282.574, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ4282574.jpg'}, {'end': 4576.189, 'src': 'embed', 'start': 4549.349, 'weight': 7, 'content': [{'end': 4554.212, 'text': "It's 48 because they compute, I think, 48 different types of features that have to do with the specific rules of Go.", 'start': 4549.349, 'duration': 4.863}, {'end': 4557.335, 'text': "And so it's not the raw array, which would have been much nicer.", 'start': 4554.813, 'duration': 2.522}, {'end': 4560.176, 'text': 'But they compute some kinds of features on every single position here.', 'start': 4557.635, 'duration': 2.541}, {'end': 4565, 'text': 'So 19 by 19 by 48 input consisting of 48 feature types.', 'start': 4560.737, 'duration': 4.263}, {'end': 4570.083, 'text': 'And then they do convolutions, kernel size 5, stride 1.', 'start': 4565.7, 'duration': 4.383}, {'end': 4571.144, 'text': 'applies rectifier.', 'start': 4570.083, 'duration': 1.061}, {'end': 4576.189, 'text': "Anyway, so I'm just making the point that you can basically read some of this, and you can distill what the architecture roughly is.", 'start': 4571.965, 'duration': 4.224}], 'summary': 'The architecture involves computing 48 different features for a 19x19x48 input and applying convolutions with kernel size 5 and stride 1.', 'duration': 26.84, 'max_score': 4549.349, 'thumbnail':
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ4549349.jpg'}, {'end': 4656.47, 'src': 'embed', 'start': 4625.232, 'weight': 8, 'content': [{'end': 4629.114, 'text': 'And the way you do spatial reduction is by doing strided convolutions instead of doing max pooling.', 'start': 4625.232, 'duration': 3.882}, {'end': 4636.678, 'text': "There's a trend towards smaller filter sizes, like 3x3, but having many more layers.", 'start': 4630.734, 'duration': 5.944}, {'end': 4640.1, 'text': 'But a typical architecture for now looks like conv relu,', 'start': 4637.378, 'duration': 2.722}, {'end': 4647.464, 'text': "sometimes pool and then fully connected layers maybe at the end and some softmax at the end and some guidelines based on what you've seen.", 'start': 4640.1, 'duration': 7.364}, {'end': 4650.986, 'text': 'But recent advances, I think, are kind of challenging this paradigm.', 'start': 4648.045, 'duration': 2.941}, {'end': 4656.47, 'text': "Like ResNet and GoogleNet, they kind of play with the architecture a bit, and it's a little more funky than just stacking these layers in a sequence.", 'start': 4651.307, 'duration': 5.163}], 'summary': 'Spatial reduction with strided convolutions, trend towards smaller filter sizes, and challenging traditional architectures with recent advances like resnet and googlenet.', 'duration': 31.238, 'max_score': 4625.232, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ4625232.jpg'}], 'start': 3476.788, 'title': 'Convolutional neural networks', 'summary': "The chapter delves into memory and parameter distribution in cnns, with a total memory for forward pass of 93mb, 200mb per image for backward pass, and 140m parameters. innovations in googlenet, including inception module, and resnet's success in 2015 challenge are explored. it also explains resnet modules, advantages, and training hyperparameters. additionally, it discusses the impressive performance of deepmind's go-playing network and recent trends in cnn architecture.", 'chapters': [{'end': 3711.637, 'start': 3476.788, 'title': 'Neural network convolutional layers analysis', 'summary': 'The chapter discusses the memory and parameter distribution in a convolutional neural network, highlighting that the total memory for forward pass is 93 megabytes, and the rough footprint for backward pass is about 200 megabytes per image, with a total of 140 million parameters at the end.', 'duration': 234.849, 'highlights': ['The total memory for forward pass is 93 megabytes, and for backward pass, the rough footprint is about 200 megabytes per image. Total memory for forward pass, rough footprint for backward pass', 'The total number of parameters at the end is 140 million, with the majority of the memory concentrated in the first few convolutional layers and most parameters in the first fully connected layer. Total parameters at the end, concentration of memory and parameters in specific layers', 'The first fully connected layer accumulates a huge amount of parameters, with 100 million parameters just in that layer alone. Huge amount of parameters in the first fully connected layer']}, {'end': 4158.724, 'start': 3712.177, 'title': 'Googlenet and resnet', 'summary': "The chapter discusses the key innovations in googlenet, including the introduction of the inception module, leading to a 6.7% top five error in the 2014 challenge, and the significant reduction in parameters to 5 million compared to vggnet's 140 million and alexnet's 60 million.
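The memory/parameter asymmetry described above can be reproduced with back-of-the-envelope arithmetic; a sketch under VGG-like assumptions (float32 activations, a 7x7x512 volume feeding a 4096-unit fully connected layer):

```python
def activation_mb(h, w, d, bytes_per_value=4):
    # Memory for one activation volume, assuming float32 values
    return h * w * d * bytes_per_value / 1024**2

# Early conv layers dominate activation memory:
print(activation_mb(224, 224, 64))  # 12.25 MB for a single 224x224x64 volume

# The first fully connected layer dominates the parameter count:
print(7 * 7 * 512 * 4096)           # 102760448 weights -- the ~100 million quoted above
```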
it also explores the success of resnet in the 2015 challenge, achieving a remarkable 3.6% top five error and the importance of its unique architecture with skip connections.", 'duration': 446.547, 'highlights': ["GoogleNet introduced the inception module, resulting in a 6.7% top five error in the 2014 challenge, a significant reduction from the original AlexNet's 15.4% error.", "GoogleNet achieved a substantial reduction in parameters to 5 million compared to VGGNet's 140 million and AlexNet's 60 million, while maintaining much better accuracy.", 'ResNet, the winner of the 2015 challenge, achieved a remarkable 3.6% top five error and presented a unique architecture with skip connections, showcasing consistent improvement in both train and test error as the number of layers scaled up.', 'The ResNet architecture, with 152 layers, showcased significant advancements in comparison to previous architectures such as VGGNet, and despite its depth, it was faster to train.', "The ResNet's unique architecture included skip connections and emphasized the importance of optimizing layers in a 'ResNet way' to achieve consistent improvement in both train and test error."]}, {'end': 4494.579, 'start': 4159.504, 'title': 'Understanding residual networks', 'summary': 'Explains the concept of resnet modules in neural networks, highlighting the computation of deltas to input xs, the advantages of gradient flow, and the training hyperparameters used, such as batch normalization, initialization, and learning rate.', 'duration': 335.075, 'highlights': ['ResNet modules compute deltas to input Xs, allowing for advantageous gradient flow and faster training close to the image. The ResNet module computes deltas to input Xs, enabling advantageous gradient flow and faster training close to the image, which is beneficial in optimizing the neural network.', 'Training hyperparameters include batch normalization after every convolutional layer, Xavier over 2 initialization, SGD momentum of 0.9, learning rate of 0.1, and mini-batches of size 256. Training hyperparameters include batch normalization after every convolutional layer, Xavier over 2 initialization, SGD momentum of 0.9, learning rate of 0.1, and mini-batches of size 256, showcasing the practical approach used in training ResNets.', 'Residual networks do not use dropout due to the claim in the BatchNorm paper that it reduces the need for dropout. Residual networks do not use dropout, as suggested in the BatchNorm paper, which claims that dropout is less necessary when batch normalization is applied.']}, {'end': 4732.994, 'start': 4495.299, 'title': 'Advances in convolutional neural networks', 'summary': "Discusses the impressive performance of deepmind's go-playing network, emphasizing the policy network architecture and recent trends in cnn architecture, and the computational performance comparison between different cnn models.", 'duration': 237.695, 'highlights': ["The policy network of DeepMind's Go-playing network has an input of 19x19x48, consisting of 48 feature types, and utilizes convolutions with kernel size 5 and stride 1. The policy network of DeepMind's Go-playing network uses an input of 19x19x48 with 48 feature types, employing convolutions with a kernel size of 5 and stride 1.", 'Recent trends in CNN architecture involve a move towards smaller filter sizes, like 3x3, and a reduction in fully connected layers, aiming for networks that are purely convolutional.
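The "deltas to the Xs" description of the ResNet module above amounts to y = x + F(x); a toy sketch (the real residual function F is a conv/BN/ReLU stack, stood in for here by a placeholder):

```python
import numpy as np

def residual_block(x, residual_fn):
    """ResNet skip connection: the stacked layers learn only a delta on
    top of x. On the backward pass, the addition distributes the incoming
    gradient equally to both branches, so one copy flows unimpeded through
    the skip path back toward the image."""
    return x + residual_fn(x)

# Toy residual function standing in for the conv/BN/ReLU stack:
f = lambda x: 0.1 * np.tanh(x)
x = np.random.randn(4, 4)
print(residual_block(x, f).shape)  # same shape as x, as the addition requires
```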
Recent trends in CNN architecture include a shift towards smaller filter sizes, such as 3x3, and a decrease in fully connected layers, with a goal of achieving purely convolutional networks.', "ResNet is faster than VGGNet in terms of computational performance, with VGGNet having a forward pass time on the order of tens or hundreds of milliseconds on a GPU. ResNet demonstrates superior computational performance compared to VGGNet, with VGGNet's forward pass time being on the order of tens or hundreds of milliseconds on a GPU."]}], 'duration': 1256.206, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/LxfUGhug-iQ/pics/LxfUGhug-iQ3476788.jpg', 'highlights': ["ResNet's unique architecture with skip connections achieved a remarkable 3.6% top five error in the 2015 challenge", 'The total memory for forward pass is 93 megabytes, and for backward pass, the rough footprint is about 200 megabytes per image', 'The total number of parameters at the end is 140 million, with the majority of the memory concentrated in the first few convolutional layers', 'GoogleNet introduced the inception module, resulting in a 6.7% top five error in the 2014 challenge', 'ResNet, with 152 layers, showcased significant advancements in comparison to previous architectures such as VGGNet', 'ResNet modules compute deltas to input Xs, enabling advantageous gradient flow and faster training close to the image', 'Training hyperparameters include batch normalization after every convolutional layer, Xavier over 2 initialization, SGD momentum of 0.9, learning rate of 0.1, and mini-batches of size 256', "The policy network of DeepMind's Go-playing network uses an input of 19x19x48 with 48 feature types, employing convolutions with a kernel size of 5 and stride 1", 'Recent trends in CNN architecture involve a move towards smaller filter sizes, such as 3x3, and a decrease in fully connected layers, aiming for networks that are purely convolutional']}], 'highlights': ['The lecture provides an introduction to convolutional neural networks, covers the training process and visualization of convolutional layers, explains the application of filter sizes, discusses image padding, and explores the evolution of cnn architectures, including specific examples like lenet-5, alexnet, and vggnet, with emphasis on key architectural details and performance metrics.', "ResNet's unique architecture with skip connections achieved a remarkable 3.6% top five error in the 2015 challenge", 'AlexNet achieved an 18.2% error rate for a single convolutional network and a 15.4% error rate for a seven convolutional network ensemble on the ImageNet test set, surpassing the expected 2% improvement with the ensemble.', 'The output volume size from a convolutional layer is calculated using the formula (n-f+2p)/s + 1', 'The total number of parameters at the end is 140 million, with the majority of the memory concentrated in the first few convolutional layers', 'The process involves stacking the activation maps and feeding them into the next convolutional layer, creating 3D volumes of higher abstraction, and ultimately generating class scores for different classes through a fully connected layer.', 'The visual cortex is arranged hierarchically, with simple to complex cells and more complex developments over time.', 'The concept of image padding and the consideration of continuous data when filling the padding with values is discussed.', 'The layout of a convolutional network consists of three core building blocks: convolutional
layer, rectifier layer, and pooling operations, eventually leading to a fully connected layer at the end.', 'The convolutional layer operates by sliding small spatial filters (e.g., 5x5x3) through the input volume, computing dot products, and generating activation maps, producing a re-representation of the input image (28x28x6).']}
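As a closing illustration of the highlighted trends, the same sizing formula (n-f+2p)/s + 1 shows how a strided convolution can replace max pooling for spatial reduction, and how padding can instead preserve size (a sketch, not lecture code; the pad values below are assumptions chosen to match the stated output sizes):

```python
def conv_out(n, f, s, p):
    # (n - f + 2p)/s + 1
    return (n - f + 2 * p) // s + 1

# Classic downsampling: 2x2 max pool at stride 2
print(conv_out(56, 2, 2, 0))  # 28

# The newer trend: a 3x3 conv at stride 2 with pad 1 downsamples the same way,
# but with learned weights rather than a fixed max
print(conv_out(56, 3, 2, 1))  # 28

# Padding can instead preserve size, e.g. a 5x5 conv with pad 2 keeps a
# 19x19 Go board at 19x19 (pad 2 is an assumption consistent with that output)
print(conv_out(19, 5, 1, 2))  # 19
```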