title
Lecture 5 | Convolutional Neural Networks
description
In Lecture 5 we move from fully-connected neural networks to convolutional neural networks. We discuss some of the key historical milestones in the development of convolutional networks, including the perceptron, the neocognitron, LeNet, and AlexNet. We introduce convolution, pooling, and fully-connected layers which form the basis for modern convolutional networks.
Keywords: Convolutional neural networks, perceptron, neocognitron, LeNet, AlexNet, convolution, pooling, fully-connected layers
Slides: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture5.pdf
--------------------------------------------------------------------------------------
Convolutional Neural Networks for Visual Recognition
Instructors:
Fei-Fei Li: http://vision.stanford.edu/feifeili/
Justin Johnson: http://cs.stanford.edu/people/jcjohns/
Serena Yeung: http://ai.stanford.edu/~syyeung/
Computer Vision has become ubiquitous in our society, with applications in search, image understanding, apps, mapping, medicine, drones, and self-driving cars. Core to many of these applications are visual recognition tasks such as image classification, localization, and detection. Recent developments in neural network ("deep learning") approaches have greatly advanced the performance of state-of-the-art visual recognition systems. This lecture collection is a deep dive into the details of these deep learning architectures, with a focus on learning end-to-end models for visual recognition tasks, particularly image classification. From this lecture collection, students will learn to implement, train, and debug their own neural networks and gain a detailed understanding of cutting-edge research in computer vision.
Website:
http://cs231n.stanford.edu/
For additional learning opportunities please visit:
http://online.stanford.edu/
detail
Summary: The lecture covers the history, evolution, and applications of convolutional neural networks (CNNs), including their use in image retrieval, object detection, self-driving cars, face recognition, medical image interpretation, and neural style artwork. It then explains the basics of CNNs: convolution in image processing, hierarchical feature learning, zero padding, the impact of stride, downsampling, activation map sizes, and pooling layers, with a focus on preserving spatial structure and reducing representation size.

Introduction (0:05-1:13)
Administrative details: assignment one is due Thursday, April 20th, at 11:59 p.m. on Canvas, and assignment two will be released the same Thursday. A quick review of the previous lecture: fully-connected neural networks, and how learning intermediate templates (for example, templates for a red car versus a yellow car) addresses the mode problem, with the templates combined into a final score function for each class. Today's topic is the title of the class: convolutional neural networks.

A brief history of neural networks (1:15-8:17)
In 1957, Frank Rosenblatt developed the Mark I Perceptron machine, the first implementation of the perceptron algorithm. It computed score functions of the familiar form w times x plus a bias, but its outputs were either one or zero, and its weights w were adjusted with an update rule rather than by gradient-based learning. The Adaline and Madaline networks of the early 1960s stacked such units into multilayer architectures. Backpropagation was first introduced in 1986 by Rumelhart, with the chain-rule equations and weight-update rules we are now familiar with; this gave the first principled way to train these kinds of network architectures. Even so, training still could not scale to very large networks, and for a period there was little new progress or popular use of them. The field was reinvigorated around the 2000s: a 2006 paper by Geoff Hinton and Ruslan Salakhutdinov showed that deep networks could be trained effectively with careful initialization. Around 2012 came the first really strong results: acoustic modeling for speech recognition out of Hinton's lab, and then the landmark ImageNet paper by Alex Krizhevsky, also in Hinton's lab, which introduced the first convolutional neural network architecture to dramatically reduce the error on the ImageNet classification benchmark. Since then, ConvNets have become widely used in all kinds of applications. In parallel, Hubel and Wiesel's experiments in the 1950s established the topographical mapping and hierarchical organization of neurons in the visual cortex, observations that directly motivated convolutional architectures.
Evolution and applications of ConvNets (8:17-13:59)
By 1998, Yann LeCun had shown the first example of applying backpropagation and gradient-based learning to train convolutional neural networks that did really well on document recognition. In particular, LeNet recognized the digits of zip codes well enough to be used widely in the postal service. Digits, however, are a fairly simple and limited set to recognize, and these networks could not yet scale to more challenging and complex data. In 2012, Alex Krizhevsky gave the modern incarnation of convolutional neural networks, colloquially called AlexNet, which scaled to large datasets such as ImageNet and exploited the parallel processing power of modern GPUs; GPUs for embedded systems, such as those in a self-driving car, now run ConvNets as well. Today ConvNets appear everywhere: image retrieval, object detection and segmentation, self-driving cars, face recognition (input a face image, output a likelihood of who the person is), video classification using both image content and temporal information, pose recognition (locating shoulders, elbows, and other joints), game playing via reinforcement learning (Atari games and Go), interpretation and diagnosis of medical images, classification of galaxies, street-sign recognition, whale recognition (from a recent Kaggle challenge), labeling streets and buildings on aerial maps, and image captioning, where, given an image, the network generates a sentence describing its content. ConvNets also produce artwork: Deep Dream hallucinates objects and concepts in an image, and neural style transfer re-renders an image in the style of a particular artist.

From fully-connected to convolutional layers (13:59-19:15)
Recall the fully-connected layer from last lecture: it operates on vectors. Take a 32x32x3 image, stretch all the pixels out into a 3072-dimensional vector, and multiply it by a weight matrix, for example a 10x3072 matrix. Each of the 10 rows is dotted with the 3072-dimensional input to produce one number, the value of that neuron, so this layer produces 10 neuron outputs.
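As a quick illustration of that arithmetic, here is a minimal NumPy sketch of the fully-connected layer (the 32x32x3 input, 3072-dimensional flattening, and 10 outputs are the lecture's numbers; the random weights stand in for trained parameters):

    import numpy as np

    x = np.random.randn(32, 32, 3)    # input image, 32x32x3
    W = np.random.randn(10, 3072)     # weight matrix: one row per output neuron
    b = np.random.randn(10)           # one bias per neuron

    x_flat = x.reshape(3072)          # stretch all pixels into a 3072-dim vector
    scores = W.dot(x_flat) + b        # each row dotted with the input gives one activation
    print(scores.shape)               # (10,)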
The main difference in a convolutional layer is that it preserves spatial structure. Instead of stretching the 32x32x3 image into one long vector, we keep the three-dimensional volume, and the weights become small filters, in this case a 5x5x3 filter, which we slide over the image spatially, computing a dot product at every spatial location. Filters always extend the full depth of the input volume; they are small only in their spatial extent (5x5 instead of the full 32x32). At each position, we overlay the filter on a chunk of the image and multiply each filter element with the corresponding input element: 5 times 5 times 3 = 75 multiplications, plus the bias term. This is the dot product w transpose x plus b; writing w transpose is just notation that makes the math work out as a dot product once the filter and the image patch are each stretched into a one-by-n vector, with no deeper intuition behind it. The actual operation is simply elementwise multiplication and summation at a spatial location.
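The dot product at a single spatial location, sketched in NumPy (a 5x5x3 filter on a 5x5x3 patch, matching the lecture's shapes; the bias value is an arbitrary placeholder):

    import numpy as np

    patch = np.random.randn(5, 5, 3)  # 5x5x3 chunk of the input volume
    w = np.random.randn(5, 5, 3)      # the filter extends the full input depth
    b = 0.1                           # bias term (placeholder value)

    # elementwise multiply-and-sum: 5*5*3 = 75 multiplications plus the bias,
    # equivalent to w.reshape(-1) @ patch.reshape(-1) + b
    value = np.sum(w * patch) + b
    print(value)                      # one number: this neuron's value here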
Convolution, activation maps, and stacking layers (19:15-26:06)
Sliding the filter over all spatial locations of the input produces an activation map. A side note on terminology: to match the signal-processing definition of convolution, the kernel would be flipped by 180 degrees; here "convolution" is used in a looser sense for this sliding dot product, and the precise equation appears later. Each filter produces its own activation map. With a 32x32 input and a 5x5 filter, each map is 28x28, so six 5x5 filters yield six activation maps, an output volume of 6x28x28. A ConvNet is then basically a sequence of these convolutional layers stacked on top of one another, the same way the simple linear layers were stacked in the plain neural network, interspersed with activation functions such as ReLU and usually some pooling layers: conv, ReLU, conv, ReLU, pool, and so on, each output becoming the input to the next convolutional layer.
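Here is a naive sketch of a full convolutional layer under the lecture's numbers (a 32x32x3 input, six 5x5x3 filters, stride 1, no padding); the triple loop is for clarity, and real implementations vectorize it:

    import numpy as np

    x = np.random.randn(32, 32, 3)           # input volume
    filters = np.random.randn(6, 5, 5, 3)    # six 5x5x3 filters
    biases = np.random.randn(6)

    out = np.zeros((6, 28, 28))              # (32 - 5) / 1 + 1 = 28
    for k in range(6):                       # one activation map per filter
        for i in range(28):                  # slide vertically...
            for j in range(28):              # ...and horizontally
                window = x[i:i+5, j:j+5, :]  # 5x5x3 chunk under the filter
                out[k, i, j] = np.sum(filters[k] * window) + biases[k]
    print(out.shape)                         # (6, 28, 28)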
Because each layer has many filters, and each filter produces an activation map, a stack of these layers ends up learning a hierarchy of filters. Filters at the earlier layers usually represent low-level features such as edges; at the mid-level come more complex features such as corners and blobs; and later layers respond to still more complex, concept-like structures.

Learned filters, network architecture, and spatial dimensions (26:12-38:30)
Each stacked convolutional layer thus starts from simpler features and aggregates them into more complex features later on. In practice this is compatible with what Hubel and Wiesel noticed in their experiments: simple cells at the earlier stages of processing, followed by more complex cells later on. Visualizations make this concrete: for conv1, the first convolutional layer, each cell of the visualization grid corresponds to one neuron and shows the input that maximizes that neuron's activation, that is, the sort of image that would make that neuron fire with the largest value.
Looking at example activation maps: the top row shows what the 5x5 filters of a trained ConvNet actually look like, and convolving each one over an image produces an activation map that responds to a specific feature, such as an oriented edge. Putting the pieces together, the total convolutional neural network passes an input image through a sequence of layers: a convolutional layer, then a nonlinear layer (ReLU is very commonly used and is discussed more later), repeating conv and ReLU with occasional pooling layers, and ending with fully-connected layers that compute the final class scores.
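A minimal sketch of the other two pieces of that pipeline, ReLU and a 2x2 max-pool with stride 2 (pooling is covered in detail later in the lecture; the 28x28 map size is assumed from the example above):

    import numpy as np

    a = np.random.randn(28, 28)              # one activation map from a conv layer
    relu = np.maximum(a, 0)                  # elementwise ReLU nonlinearity
    # 2x2 max pooling, stride 2: halves each spatial dimension
    pooled = relu.reshape(14, 2, 14, 2).max(axis=(1, 3))
    print(pooled.shape)                      # (14, 14)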
Now the spatial arithmetic: how does a 32x32x3 input convolved with a 5x5x3 filter yield a 28x28 activation map? To keep it simple, assume a 7x7 input and a 3x3 filter. Place the filter in the upper-left corner, multiply all the overlapping values together to get the first output value, then slide it over one position at a time: it fits in five positions horizontally and five vertically, so the output is 5x5. With a stride of 2, it fits three times, giving a 3x3 output. With a stride of 3 it does not fit: sliding over by three no longer lands the filter nicely within the image, so in practice that setting just does not work. In general, the spatial dimension of the output is (N - F) / stride + 1 for input size N and filter size F; you can see this by placing the filter at the very last possible position and counting how many stride-sized moves fit in before it. The fractional result for stride 3 on a 7x7 input signals the mismatch. Related to this is the question of what happens at the corners: in practice we zero-pad the border of the input, which makes it possible to center a filter on the corner and edge pixels of the actual input image. With the same 7x7 input, a 3x3 filter, stride 1, and a one-pixel border of zero padding, the output is 7x7, so the spatial size is preserved.
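The size arithmetic as a small helper function (a sketch; a fractional result, as with stride 3 on a 7x7 input, means the filter does not fit):

    def conv_output_size(n, f, stride, pad=0):
        """Spatial output size: (N - F + 2P) / stride + 1."""
        return (n - f + 2 * pad) / stride + 1

    print(conv_output_size(7, 3, stride=1))         # 5.0  -> 5x5 output
    print(conv_output_size(7, 3, stride=2))         # 3.0  -> 3x3 output
    print(conv_output_size(7, 3, stride=3))         # 2.33... -> doesn't fit
    print(conv_output_size(7, 3, stride=1, pad=1))  # 7.0  -> size preserved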
Zero padding in practice and layer sizing (38:30-48:09)
This is the common practice: for the usual filter sizes, zero-pad just enough to maintain the same spatial size, so pad by one for 3x3 filters, by two for 5x5, and by three for 7x7 (the arithmetic works out from the same formula). The motivation is that without padding the activation maps shrink with every layer; a deep network would quickly collapse the representation, and each shrink also loses some of the edge and corner information.
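The preserve-the-size rule as a one-liner (assuming stride 1 and an odd filter size F, padding by (F - 1) / 2 keeps the spatial size unchanged):

    def same_pad(f):
        """Zero padding that preserves spatial size at stride 1 (odd F)."""
        return (f - 1) // 2

    for f in (3, 5, 7):
        print(f, '->', same_pad(f))   # 3 -> 1, 5 -> 2, 7 -> 3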
'start': 2589.151, 'duration': 1.941}, {'end': 2599.436, 'text': "And so who can tell me what's the output volume size of this? So you can think about the formula earlier.", 'start': 2592.312, 'duration': 7.124}, {'end': 2605.822, 'text': 'Sorry, what was it? 32 by 32 by 10.', 'start': 2600.236, 'duration': 5.586}], 'summary': 'Discussing computation examples for input volume and filters with stride and padding, resulting in an output volume size of 32x32x10.', 'duration': 47.437, 'max_score': 2558.385, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/bNb2fEVKeEo/pics/bNb2fEVKeEo2558385.jpg'}, {'end': 2729.964, 'src': 'heatmap', 'start': 2686.771, 'weight': 0.713, 'content': [{'end': 2689.892, 'text': 'but implicitly we also have the depth in here right?', 'start': 2686.771, 'duration': 3.121}, {'end': 2691.213, 'text': "It's gonna go through the whole volume.", 'start': 2690.132, 'duration': 1.081}, {'end': 2694.629, 'text': 'So I heard, yeah, 750 I heard.', 'start': 2692.848, 'duration': 1.781}, {'end': 2696.289, 'text': 'Almost there.', 'start': 2695.809, 'duration': 0.48}, {'end': 2701.67, 'text': 'this is kind of a trick question, because also, remember, we usually always have a bias term, right?', 'start': 2696.289, 'duration': 5.381}, {'end': 2707.932, 'text': 'So in practice, each filter has five by five by three weights, plus our one bias term.', 'start': 2701.87, 'duration': 6.062}, {'end': 2713.553, 'text': "We have 76 parameters per filter, and then we have 10 of these total, and so there's 760 total parameters.", 'start': 2707.952, 'duration': 5.601}, {'end': 2725.58, 'text': "Okay, and so here's just a summary of the convolutional layer that you guys can read a little bit more carefully later on.", 'start': 2718.314, 'duration': 7.266}, {'end': 2729.003, 'text': 'but we have our input volume of a certain dimension.', 'start': 2725.58, 'duration': 3.423}, {'end': 2729.964, 'text': 'we have all these choices.', 'start': 2729.003, 'duration': 0.961}], 'summary': 'Convolutional layer has 10 filters with 760 parameters in total.', 'duration': 43.193, 'max_score': 2686.771, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/bNb2fEVKeEo/pics/bNb2fEVKeEo2686771.jpg'}, {'end': 2813.826, 'src': 'heatmap', 'start': 2729.964, 'weight': 3, 'content': [{'end': 2736.109, 'text': 'we have our filters right where we have number of filters, the filter size, the stride of the size, the amount of zero padding,', 'start': 2729.964, 'duration': 6.145}, {'end': 2738.611, 'text': 'and you basically can use all of these.', 'start': 2736.109, 'duration': 2.502}, {'end': 2746.458, 'text': 'go through the computations that we talked about earlier in order to find out what your output volume is actually going to be and how many total parameters that you have.', 'start': 2738.611, 'duration': 7.847}, {'end': 2757.594, 'text': 'And so some common settings of this, we talked earlier about common filter sizes of three by three, five by five.', 'start': 2749.15, 'duration': 8.444}, {'end': 2761.275, 'text': 'Stride is usually one and two is pretty common.', 'start': 2758.494, 'duration': 2.781}, {'end': 2769.559, 'text': "And then your padding p is going to be whatever fits, like whatever will preserve your spatial extent is what's common.", 'start': 2761.695, 'duration': 7.864}, {'end': 2778.705, 'text': 'And then the total number of filters k, usually we use powers of two just to be nice, so 32, 64.', 'start': 2770.399, 'duration': 8.306}, {'end': 
2781.888, 'text': "128 and so on, 512, these are pretty common numbers that you'll see.", 'start': 2778.706, 'duration': 3.182}, {'end': 2788.433, 'text': 'And just as an aside, we can also do a one by one convolution.', 'start': 2784.45, 'duration': 3.983}, {'end': 2795.793, 'text': 'this still makes perfect sense where, given a one by one convolution, we still slide it over each spatial extent.', 'start': 2789.549, 'duration': 6.244}, {'end': 2802.437, 'text': "but now the spatial region is not really five by five, it's just kind of the trivial case of one by one.", 'start': 2795.793, 'duration': 6.644}, {'end': 2805.94, 'text': 'but we are still having this filter go through the entire depth.', 'start': 2802.437, 'duration': 3.503}, {'end': 2810.883, 'text': 'So this is going to be a dot product through the entire depth of your input volume.', 'start': 2806.68, 'duration': 4.203}, {'end': 2813.826, 'text': 'And so the output here right.', 'start': 2811.964, 'duration': 1.862}], 'summary': 'Discussing filters, sizes, stride, padding, and total parameters for convolutional neural networks, with common settings and filter sizes of 3x3 and 5x5, stride of 1 or 2, and total number of filters typically being powers of two such as 32, 64, 128, and 512, including the concept of one by one convolution.', 'duration': 48.741, 'max_score': 2729.964, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/bNb2fEVKeEo/pics/bNb2fEVKeEo2729964.jpg'}], 'start': 2310.211, 'title': 'Zero padding in convolutional neural networks', 'summary': 'Explains the purpose of zero padding in convolutional neural networks, including the recommended zero padding for common filter sizes, the importance of maintaining activation map size, and computation of output volume size and parameters in a convolutional layer.', 'chapters': [{'end': 2501.151, 'start': 2310.211, 'title': 'Convolutional neural networks basics', 'summary': 'Explains the concept of zero padding, its purpose to maintain the input size, the common filter sizes used in practice, and the recommended zero padding for each filter size.', 'duration': 190.94, 'highlights': ['Zero padding is used to maintain the same input size as before, ensuring the full size output and applying the filter at corner and edge regions.', 'Common filter sizes used in practice are three by three, five by five, and seven by seven, with corresponding zero padding of one, two, and three, respectively.', 'Using zero padding, along with choosing filter and stride size, helps in maintaining the desired output size and processing corner and edge regions of the image.']}, {'end': 2889.134, 'start': 2504.666, 'title': 'Zero padding in convolutional neural networks', 'summary': 'Explains the importance of zero padding in convolutional neural networks to maintain the size of activation maps and prevent information loss, illustrated with examples and formulas, and discusses the computation of output volume size and the number of parameters in a convolutional layer.', 'duration': 384.468, 'highlights': ['The importance of zero padding in convolutional neural networks', 'Computation of output volume size and number of parameters in a convolutional layer', 'Use of common settings in convolutional layers']}], 'duration': 578.923, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/bNb2fEVKeEo/pics/bNb2fEVKeEo2310211.jpg', 'highlights': ['Zero padding ensures full size output and applies filter at corner and edge regions.', 'Common filter sizes: 3x3, 
5x5, 7x7 with corresponding zero padding of 1, 2, 3.', 'Importance of zero padding in maintaining desired output size and processing edge regions.', 'Computation of output volume size and number of parameters in a convolutional layer.', 'Use of common settings in convolutional layers.']}, {'end': 3383.945, 'segs': [{'end': 2967.483, 'src': 'embed', 'start': 2935.78, 'weight': 2, 'content': [{'end': 2940.883, 'text': "And so, at one sense, it's kind of the resolution at which you slide it on.", 'start': 2935.78, 'duration': 5.103}, {'end': 2948.648, 'text': 'And usually the reason behind this is because when we have a larger stride, what we end up getting as the output is a down-sampled image right?', 'start': 2940.964, 'duration': 7.684}, {'end': 2957.093, 'text': "And so what this down-sampled image lets us have is both it's a way it's kind of like pooling in a sense,", 'start': 2949.228, 'duration': 7.865}, {'end': 2960.275, 'text': "but it's just a different and sometimes works better way of doing pooling.", 'start': 2957.093, 'duration': 3.182}, {'end': 2967.483, 'text': 'is one of the intuitions behind this, right? Because you get the same effect of downsampling your image.', 'start': 2961.216, 'duration': 6.267}], 'summary': 'Larger stride produces a down-sampled output, achieving a pooling-like effect.', 'duration': 31.703, 'max_score': 2935.78, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/bNb2fEVKeEo/pics/bNb2fEVKeEo2935780.jpg'}, {'end': 3030.686, 'src': 'embed', 'start': 3004.485, 'weight': 0, 'content': [{'end': 3008.953, 'text': 'number of parameters you have, the size of your model, overfitting, things like that.', 'start': 3004.485, 'duration': 4.468}, {'end': 3014.283, 'text': 'And so these are kind of some of the things that you want to think about with choosing your stride.', 'start': 3009.153, 'duration': 5.13}, {'end': 3030.686, 'text': 'Okay.
so now if we look a little bit at kind of the brain neuron view of a convolutional layer similar to what we looked at for the neurons in the last lecture.', 'start': 3018.414, 'duration': 12.272}], 'summary': 'Consider parameters, model size, overfitting when choosing stride for convolutional layer.', 'duration': 26.201, 'max_score': 3004.485, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/bNb2fEVKeEo/pics/bNb2fEVKeEo3004485.jpg'}, {'end': 3132.248, 'src': 'embed', 'start': 3100.417, 'weight': 3, 'content': [{'end': 3108.319, 'text': 'because this is the receptive field is basically the you know input field that this field of vision that this neuron is receiving right?', 'start': 3100.417, 'duration': 7.902}, {'end': 3112.56, 'text': "And so yeah, so that's just another common term that you'll hear for this.", 'start': 3108.419, 'duration': 4.141}, {'end': 3115.581, 'text': 'And then again, remember each of these, five by five filters.', 'start': 3113.06, 'duration': 2.521}, {'end': 3119.202, 'text': "we're sliding them over to the spatial locations, but they're the same.", 'start': 3115.581, 'duration': 3.621}, {'end': 3121.603, 'text': 'they share the same parameters.', 'start': 3120.002, 'duration': 1.601}, {'end': 3132.248, 'text': "Okay, and so you know, as we talked about, like, what we're gonna get at this output is going to be this volume right where spatially we have.", 'start': 3125.505, 'duration': 6.743}], 'summary': 'Describing the receptive field and sliding 5x5 filters for volume output.', 'duration': 31.831, 'max_score': 3100.417, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/bNb2fEVKeEo/pics/bNb2fEVKeEo3100417.jpg'}, {'end': 3190.712, 'src': 'heatmap', 'start': 3100.417, 'weight': 0.739, 'content': [{'end': 3108.319, 'text': 'because this is the receptive field is basically the you know input field that this field of vision that this neuron is receiving right?', 'start': 3100.417, 'duration': 7.902}, {'end': 3112.56, 'text': "And so yeah, so that's just another common term that you'll hear for this.", 'start': 3108.419, 'duration': 4.141}, {'end': 3115.581, 'text': 'And then again, remember each of these, five by five filters.', 'start': 3113.06, 'duration': 2.521}, {'end': 3119.202, 'text': "we're sliding them over to the spatial locations, but they're the same.", 'start': 3115.581, 'duration': 3.621}, {'end': 3121.603, 'text': 'they share the same parameters.', 'start': 3120.002, 'duration': 1.601}, {'end': 3132.248, 'text': "Okay, and so you know, as we talked about, like, what we're gonna get at this output is going to be this volume right where spatially we have.", 'start': 3125.505, 'duration': 6.743}, {'end': 3135.79, 'text': "you know, let's say 28 by 28, and then our number of filters is the depth.", 'start': 3132.248, 'duration': 3.542}, {'end': 3142.693, 'text': "And so, for example, with five filters, what we're gonna get out is this 3D grid that's 28 by 28 by five.", 'start': 3136.31, 'duration': 6.383}, {'end': 3153.764, 'text': 'And so, if you look at the filters across in one spatial location of the activation volume and going through depth, these five neurons,', 'start': 3143.334, 'duration': 10.43}, {'end': 3160.392, 'text': "all of these neurons, basically the way you can interpret this is they're all looking at the same region in the input volume,", 'start': 3153.764, 'duration': 6.628}, {'end': 3162.074, 'text': "but they're just looking for different things.", 'start': 3160.392, 'duration':
1.682}, {'end': 3166.299, 'text': 'So there are different filters applied to the same spatial location in the image.', 'start': 3162.094, 'duration': 4.205}, {'end': 3174.561, 'text': 'And so just a reminder again, kind of comparing with the fully connected layer that we talked about earlier.', 'start': 3169.057, 'duration': 5.504}, {'end': 3183.767, 'text': 'In that case, if we look at each of the neurons in our activation or output, each of the neurons was connected to the entire stretched out input.', 'start': 3175.322, 'duration': 8.445}, {'end': 3190.712, 'text': 'So it looked at the entire full input volume compared to now where each one just looks at this local spatial region.', 'start': 3183.787, 'duration': 6.925}], 'summary': 'Neurons in a convolutional neural network use local spatial regions to extract different features from the input volume.', 'duration': 90.295, 'max_score': 3100.417, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/bNb2fEVKeEo/pics/bNb2fEVKeEo3100417.jpg'}, {'end': 3207.743, 'src': 'embed', 'start': 3169.057, 'weight': 5, 'content': [{'end': 3174.561, 'text': 'And so just a reminder again, kind of comparing with the fully connected layer that we talked about earlier.', 'start': 3169.057, 'duration': 5.504}, {'end': 3183.767, 'text': 'In that case, if we look at each of the neurons in our activation or output, each of the neurons was connected to the entire stretched out input.', 'start': 3175.322, 'duration': 8.445}, {'end': 3190.712, 'text': 'So it looked at the entire full input volume compared to now where each one just looks at this local spatial region.', 'start': 3183.787, 'duration': 6.925}, {'end': 3207.743, 'text': 'Question. Okay, so the question is within a given layer, are the filters completely symmetric?', 'start': 3192.634, 'duration': 15.109}], 'summary': 'Comparing fully connected layer to local spatial region; discussing filter symmetry within a layer.', 'duration': 38.686, 'max_score': 3169.057, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/bNb2fEVKeEo/pics/bNb2fEVKeEo3169057.jpg'}, {'end': 3312.537, 'src': 'embed', 'start': 3282.499, 'weight': 1, 'content': [{'end': 3290.364, 'text': 'So we talked about this earlier when someone asked the question of why we would want to make the representation smaller.', 'start': 3282.499, 'duration': 7.865}, {'end': 3293.805, 'text': 'And so this is again for it to have fewer.', 'start': 3291.444, 'duration': 2.361}, {'end': 3303.451, 'text': 'it affects the number of parameters that you have at the end, as well as basically does some invariance over a given region.', 'start': 3293.805, 'duration': 9.646}, {'end': 3312.537, 'text': 'And so what the pooling layer does is it does exactly just down samples and it takes your input volume.', 'start': 3304.331, 'duration': 8.206}], 'summary': 'Reducing representation size affects parameters and induces invariance over a region.', 'duration': 30.038, 'max_score': 3282.499, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/bNb2fEVKeEo/pics/bNb2fEVKeEo3282499.jpg'}], 'start': 2889.134, 'title': 'Convolutional neural networks', 'summary': 'Delves into understanding convolutional neural networks, discussing the impact of stride on downsampling, activation map size, and total parameters, and the trade-offs related to model size and overfitting. 
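The local-connectivity and parameter-sharing points above translate directly into code. Below is a deliberately naive NumPy sketch of one conv layer from this neuron view; it is illustrative only (real frameworks use optimized convolution routines), and every name in it is hypothetical.

```python
import numpy as np

def naive_conv(x, w, b):
    # One filter slid over one input volume, stride 1, no padding.
    # x: (H, W, D) input; w: (F, F, D) filter; b: scalar bias.
    # The same weights w are reused at every spatial position (parameter
    # sharing), and each output neuron sees only a local FxFxD patch of
    # the input: its receptive field.
    H, W, D = x.shape
    F = w.shape[0]
    out = np.zeros((H - F + 1, W - F + 1))
    for i in range(H - F + 1):
        for j in range(W - F + 1):
            out[i, j] = np.sum(x[i:i + F, j:j + F, :] * w) + b
    return out

x = np.random.randn(32, 32, 3)
filters = [np.random.randn(5, 5, 3) for _ in range(5)]
# Five 5x5x3 filters give five activation maps, stacked into a 28x28x5
# volume; the five values at any (i, j) all look at the same input region
# but each looks for something different.
volume = np.stack([naive_conv(x, w, b=0.0) for w in filters], axis=-1)
assert volume.shape == (28, 28, 5)
```

Contrast this with a fully connected neuron, which would need a weight for every one of the 32*32*3 = 3072 inputs, where a single 5x5x3 filter carries only 76 parameters including its bias.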
it also explains the concept of convolutional layers, receptive fields, and pooling layers, highlighting spatial structure preservation and representation size reduction achieved through pooling, with an example of max pooling.', 'chapters': [{'end': 3030.686, 'start': 2889.134, 'title': 'Understanding convolutional neural networks', 'summary': 'Discusses the intuition behind choosing the stride for convolutional operations, highlighting how it impacts downsampling, activation map size, and total number of parameters, and the trade-offs related to model size and overfitting.', 'duration': 141.552, 'highlights': ['Choosing the stride for convolutional operations impacts downsampling, activation map size, and total number of parameters, influencing trade-offs related to model size and overfitting.', 'Larger stride results in a down-sampled image, leading to downsized activation maps and reduced total number of parameters, affecting trade-offs related to model size and overfitting.', 'Smaller stride reduces the total number of parameters, which influences trade-offs related to model size and overfitting.']}, {'end': 3383.945, 'start': 3031.406, 'title': 'Convolutional neural network', 'summary': 'Explains the concept of convolutional layers, receptive fields, and pooling layers, highlighting the spatial structure preservation and reduction in representation size achieved through pooling, with an example of max pooling.', 'duration': 352.539, 'highlights': ['The pooling layers down sample the representations spatially, for example, from 224 by 224 by 64 to 112 by 112, without affecting the depth, and use techniques like max pooling to reduce the size of the input volume. This helps in reducing the number of parameters and achieving invariance over a given region.', 'The convolutional layers involve taking a dot product between a filter and a specific part of the image at every spatial location, resulting in a 3D grid of activation maps, such as 28 by 28 by 5, where each filter is applied to the same spatial location in the image.', 'The concept of receptive field is introduced, explaining that each neuron is connected to a local region spatially of the image, and the filters are slid over the spatial locations, sharing the same parameters.', 'The local connectivity of neurons in convolutional layers is highlighted, preserving the spatial structure and enabling reasoning on activation maps in later layers, compared to fully connected layers that look at the entire input volume.', 'The process of sliding filters over the entire input volume, weighted by the filter parameters, to generate activation maps is explained, demonstrating how each neuron looks at a local spatial region and is triggered at different spatial locations in the image.']}], 'duration': 494.811, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/bNb2fEVKeEo/pics/bNb2fEVKeEo2889134.jpg', 'highlights': ['Choosing the stride impacts downsampling, activation map size, and total parameters, influencing trade-offs related to model size and overfitting.', 'Pooling layers spatially down sample representations, using techniques like max pooling to reduce input volume size and achieve invariance over a given region.', 'Larger stride results in down-sampled image, leading to downsized activation maps and reduced total parameters, affecting trade-offs related to model size and overfitting.', 'The concept of receptive field is introduced, explaining that each neuron is connected to a local region spatially of the 
image, and the filters are slid over the spatial locations, sharing the same parameters.', 'Smaller stride reduces the total number of parameters, influencing trade-offs related to model size and overfitting.', 'The local connectivity of neurons in convolutional layers is highlighted, preserving the spatial structure and enabling reasoning on activation maps in later layers.']}, {'end': 4133.429, 'segs': [{'end': 3431.352, 'src': 'embed', 'start': 3408.29, 'weight': 5, 'content': [{'end': 3415.395, 'text': 'and so it makes sense to kind of look at this region and just get one value to represent this region and then just look at the next region, and so on.', 'start': 3408.29, 'duration': 7.105}, {'end': 3429.171, 'text': "Yeah, question? So the question is why is max pooling better than just doing something like average pooling? Yes, that's a good point.", 'start': 3415.856, 'duration': 13.315}, {'end': 3431.352, 'text': 'Average pooling is also something that you can do.', 'start': 3429.571, 'duration': 1.781}], 'summary': 'Max pooling is better than average pooling for feature representation.', 'duration': 23.062, 'max_score': 3408.29, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/bNb2fEVKeEo/pics/bNb2fEVKeEo3408290.jpg'}, {'end': 3499.646, 'src': 'embed', 'start': 3472.122, 'weight': 4, 'content': [{'end': 3476.264, 'text': 'whether it happens anywhere in this region, we want to fire it with a high value.', 'start': 3472.122, 'duration': 4.142}, {'end': 3478.125, 'text': 'A question.', 'start': 3477.805, 'duration': 0.32}, {'end': 3493.721, 'text': 'So the question is, since pooling and stride both have the same effect of downsampling, can you just use stride instead of pooling and so on?', 'start': 3486.374, 'duration': 7.347}, {'end': 3499.646, 'text': 'Yeah, and so in practice, I think, looking at more recent neural network architectures,', 'start': 3494.141, 'duration': 5.505}], 'summary': 'Exploring downsampling techniques in neural networks.', 'duration': 27.524, 'max_score': 3472.122, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/bNb2fEVKeEo/pics/bNb2fEVKeEo3472122.jpg'}, {'end': 3779.734, 'src': 'embed', 'start': 3751.287, 'weight': 3, 'content': [{'end': 3753.148, 'text': 'And so, starting from the beginning, we have our car.', 'start': 3751.287, 'duration': 1.861}, {'end': 3761.674, 'text': 'After the convolutional layer, we now have these activation maps of each of the filters slid spatially over the input image.', 'start': 3755.709, 'duration': 5.965}, {'end': 3767.898, 'text': 'Then we pass that through a ReLU, so you can see the values coming out from there, and then going all the way over.', 'start': 3761.994, 'duration': 5.904}, {'end': 3779.734, 'text': "And so what you get for the pooling layer is that it's really just taking the output of the ReLU layer that came just before it and then it's pooling it.", 'start': 3767.938, 'duration': 11.796}], 'summary': 'Explaining the process of convolutional layer and pooling in neural network.', 'duration': 28.447, 'max_score': 3751.287, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/bNb2fEVKeEo/pics/bNb2fEVKeEo3751287.jpg'}, {'end': 3947.343, 'src': 'embed', 'start': 3893.473, 'weight': 1, 'content': [{'end': 3899.594, 'text': 'each value is representing how much a relatively complex sort of template is firing.', 'start': 3893.473, 'duration': 6.121}, {'end': 3904.275, 'text': 'And so because of that, now you can just have a fully
connected layer.', 'start': 3900.494, 'duration': 3.781}, {'end': 3909.477, 'text': "You're just aggregating all of this information together to get a score for your class.", 'start': 3904.295, 'duration': 5.182}, {'end': 3919.055, 'text': 'So each of these values is how much a pretty complicated, complex concept is firing.', 'start': 3912.049, 'duration': 7.006}, {'end': 3931.21, 'text': "So the question is, when do you know you've done enough pooling to do the classification? And the answer is, you just try and see.", 'start': 3924.804, 'duration': 6.406}, {'end': 3936.054, 'text': 'So in practice, these are all design choices.', 'start': 3932.27, 'duration': 3.784}, {'end': 3939.397, 'text': 'And you can think about this a little bit intuitively.', 'start': 3936.234, 'duration': 3.163}, {'end': 3946.563, 'text': "You want to pool, but if you pool too much, you're going to have very few values representing your entire image.", 'start': 3939.757, 'duration': 6.806}, {'end': 3947.343, 'text': 'and so on.', 'start': 3946.983, 'duration': 0.36}], 'summary': 'Determining optimal pooling for classification involves testing design choices to avoid underrepresentation in image representation.', 'duration': 53.87, 'max_score': 3893.473, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/bNb2fEVKeEo/pics/bNb2fEVKeEo3893473.jpg'}, {'end': 4086.344, 'src': 'embed', 'start': 4059.775, 'weight': 0, 'content': [{'end': 4064.416, 'text': "There's been a trend towards having smaller filters and deeper architectures.", 'start': 4059.775, 'duration': 4.641}, {'end': 4067.837, 'text': "So we'll talk more about case studies for some of these later on.", 'start': 4065.076, 'duration': 2.761}, {'end': 4073.459, 'text': "There's also been a trend towards getting rid of these pooling and fully connected layers entirely.", 'start': 4068.837, 'duration': 4.622}, {'end': 4078.9, 'text': 'So just keeping these, just having conv layers, very deep networks of conv layers.', 'start': 4073.739, 'duration': 5.161}, {'end': 4086.344, 'text': "So again, we'll discuss all of this. And then typical architectures again look like this.", 'start': 4078.94, 'duration': 7.404}], 'summary': 'Trend towards smaller filters, deeper architectures, and removing pooling and fully connected layers in favor of conv layers in neural networks.', 'duration': 26.569, 'max_score': 4059.775, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/bNb2fEVKeEo/pics/bNb2fEVKeEo4059775.jpg'}], 'start': 3388.936, 'title': 'CNN and pooling', 'summary': 'Explains max pooling, average pooling, and stride use in recent neural network architectures, as well as the working of convolutional neural networks, including convolution, ReLU activation, and hierarchical feature extraction, with trends towards smaller filters and alternative network structures.', 'chapters': [{'end': 3701.909, 'start': 3388.936, 'title': 'Convolutional neural networks and pooling', 'summary': 'Explains the concept of max pooling, comparing it to average pooling and discusses the use of stride for downsampling in recent neural network architectures, along with design choices for pooling layers and the transition to fully connected layers.', 'duration': 312.973, 'highlights': ['The chapter explains the concept of max pooling, comparing it to average pooling.', 'Discussion on the use of stride for downsampling in recent neural network architectures.', 'Design choices for pooling layers and the transition to fully connected layers.']},
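Since the max-pooling operation discussed in this chapter is entirely mechanical, a short sketch may help. This is a minimal NumPy version assuming the common 2x2 window with stride 2; the name max_pool is illustrative.

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    # Max pooling over a (H, W, D) activation volume. Each output value is
    # the max over its window ("did this feature fire anywhere in the
    # region?"); depth is untouched, only the spatial extent shrinks.
    H, W, D = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w, D))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size,
                       j * stride:j * stride + size, :]
            out[i, j, :] = window.max(axis=(0, 1))
    return out

# The downsampling quoted earlier: 2x2 pooling with stride 2 takes a
# 224x224x64 volume to 112x112x64.
acts = np.random.randn(224, 224, 64)
assert max_pool(acts).shape == (112, 112, 64)
# Average pooling is the same loop with window.mean(axis=(0, 1)).
```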
{'end': 4133.429, 'start': 3705.652, 'title': 'Understanding convolutional neural networks', 'summary': 'Explains the working of convolutional neural networks, highlighting the process of convolution, ReLU activation, and pooling, as well as the hierarchical feature extraction in the network, and the significance of design choices in pooling and fully connected layers, with a mention of trends towards smaller filters, deeper architectures, and alternative network structures.', 'duration': 427.777, 'highlights': ['Convolutional layers generate activation maps by sliding spatial filters over the input image, followed by ReLU activation and pooling to downsample the output.', 'Pooling layers downsample the ReLU layer output by keeping the max value in each pooling window, producing a spatially smaller representation of the input.', 'Each value in the pooling layer represents a higher-level concept, with the network progressing from detecting simple structures like edges to more complex features like corners and templates.', 'Design choices in pooling and fully connected layers are crucial, requiring a balance to avoid overly reducing the representation of the input data, with the need for cross-validation and experimentation to determine optimal configurations.', 'Trends in convolutional neural networks include a shift towards smaller filters, deeper architectures, and the exploration of alternative network structures with conv layers and the elimination of pooling and fully connected layers.']}], 'duration': 744.493, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/bNb2fEVKeEo/pics/bNb2fEVKeEo3388936.jpg', 'highlights': ['Trends in convolutional neural networks include a shift towards smaller filters, deeper architectures, and the exploration of alternative network structures with conv layers and the elimination of pooling and fully connected layers.', 'Each value in the pooling layer represents a higher-level concept, with the network progressing from detecting simple structures like edges to more complex features like corners and templates.', 'Design choices in pooling and fully connected layers are crucial, requiring a balance to avoid overly reducing the representation of the input data, with the need for cross-validation and experimentation to determine optimal configurations.', 'Convolutional layers generate activation maps by sliding spatial filters over the input image, followed by ReLU activation and pooling to downsample the output.', 'Discussion on the use of stride for downsampling in recent neural network architectures.', 'The chapter explains the concept of max pooling, comparing it to average pooling.']}], 'highlights': ['ConvNets are applied to video and are able to classify videos by looking at both images as well as temporal information. (relevance: 5)', 'The chapter introduces the topic of Convolutional Neural Networks, a key aspect of the lecture. (relevance: 5)', 'CNNs utilize filters, such as a 5x5x3 filter, to slide over the image spatially and compute dot products at every spatial location, maintaining the structure of the three-dimensional input. (relevance: 4)', 'The first introduction of backpropagation was in 1986 with Rumelhart, laying the foundation for a principled way to train network architectures. (relevance: 4)', 'The lecture provides details about the assignment deadlines, with assignment one due on April 20th and the release of assignment two.
(relevance: 4)', 'The applications of ConvNets encompass image retrieval, object detection, and segmentation, with significant importance in self-driving cars, all powered by modern GPUs. (relevance: 4)', 'The hierarchy of filters in a convolutional network progresses from low-level features to mid-level features to higher-level features. (relevance: 3)', 'The architecture of a convolutional neural network involves a sequence of layers including convolutional, ReLU, pooling, and fully connected layers, culminating in a final score function. (relevance: 3)', 'Choosing the stride impacts downsampling, activation map size, and total parameters, influencing trade-offs related to model size and overfitting. (relevance: 3)', 'Trends in convolutional neural networks include a shift towards smaller filters, deeper architectures, and the exploration of alternative network structures with conv layers and the elimination of pooling and fully connected layers. (relevance: 3)']}
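As a closing illustration of the typical architecture the lecture ends on ([conv -> ReLU] repeated, pool in between, then fully connected layers producing class scores), here is a self-contained NumPy sketch of one forward pass. The specific sizes (two stages of 32 and 64 filters on a 32x32x3 input) are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv(x, w, b):
    # Stride-1 convolution, zero-padded to preserve height and width.
    # x: (H, W, D); w: (K, F, F, D) for K filters; b: (K,) biases.
    K, F = w.shape[0], w.shape[1]
    P = (F - 1) // 2
    xp = np.pad(x, ((P, P), (P, P), (0, 0)))
    H, W, _ = x.shape
    out = np.zeros((H, W, K))
    for k in range(K):
        for i in range(H):
            for j in range(W):
                out[i, j, k] = np.sum(xp[i:i + F, j:j + F, :] * w[k]) + b[k]
    return out

def relu(a):
    return np.maximum(0.0, a)

def pool(x):
    # 2x2 max pool, stride 2, via a reshape trick (H and W must be even).
    H, W, D = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, D).max(axis=(1, 3))

# [conv -> ReLU -> pool] twice, then a fully connected layer to 10 scores.
x = rng.standard_normal((32, 32, 3))             # a CIFAR-10-sized image
w1, b1 = 0.01 * rng.standard_normal((32, 5, 5, 3)), np.zeros(32)
w2, b2 = 0.01 * rng.standard_normal((64, 5, 5, 32)), np.zeros(64)
h1 = pool(relu(conv(x, w1, b1)))                 # 16x16x32
h2 = pool(relu(conv(h1, w2, b2)))                # 8x8x64
w_fc = 0.01 * rng.standard_normal((10, h2.size))
scores = w_fc @ h2.ravel()                       # one score per class
assert scores.shape == (10,)
```

Each stage halves the spatial extent while the depth grows, matching the lecture's picture of later layers firing on more complex features over larger effective receptive fields.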