title
MIT 6.S191: Convolutional Neural Networks

description
MIT Introduction to Deep Learning 6.S191: Lecture 3
Convolutional Neural Networks for Computer Vision
Lecturer: Alexander Amini
2023 Edition
For all lectures, slides, and lab materials: http://introtodeeplearning.com

Lecture Outline
0:00 - Introduction
2:37 - Amazing applications of vision
5:35 - What computers "see"
12:38 - Learning visual features
17:51 - Feature extraction and convolution
22:23 - The convolution operation
27:30 - Convolutional neural networks
34:29 - Non-linearity and pooling
40:07 - End-to-end code example
41:23 - Applications
43:18 - Object detection
51:36 - End-to-end self-driving cars
54:08 - Summary

Subscribe to stay up to date with new deep learning lectures at MIT, or follow us @MITDeepLearning on Twitter and Instagram to stay fully-connected!!
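The outline's segment on the convolution operation (slide a small filter over the image, take the element-wise product with each patch, and sum) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the lecture's lab code; the function name, stride default, and example filter are assumptions.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image`; at each position, element-wise
    multiply the kernel with the underlying patch and sum the result,
    as described in the lecture's convolution walkthrough."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    # Output dimensions for a "valid" convolution (no padding).
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh,
                          j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# A 3x3 edge-detection filter of the kind shown in the lecture's
# "change the weights, change the feature" examples (hypothetical values).
edge_filter = np.array([[-1., -1., -1.],
                        [-1.,  8., -1.],
                        [-1., -1., -1.]])
```

On a constant image this edge filter responds with zeros everywhere, since its weights sum to zero; in a real CNN these weights are learned rather than hand-designed.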

detail
{'title': 'MIT 6.S191: Convolutional Neural Networks', 'heatmap': [{'end': 962.254, 'start': 926.456, 'weight': 0.709}, {'end': 1763.592, 'start': 1723.324, 'weight': 0.849}, {'end': 2035.991, 'start': 1922.774, 'weight': 1}, {'end': 2099.175, 'start': 2053.601, 'weight': 0.729}, {'end': 2221.188, 'start': 2187.196, 'weight': 0.769}, {'end': 2354.018, 'start': 2316.368, 'weight': 0.851}, {'end': 2957.553, 'start': 2882.824, 'weight': 0.736}], 'summary': 'Covers computer vision and deep learning, exploring its significance and application in fields like autonomous driving and healthcare, while delving into computer vision fundamentals, spatial structure and convolution in image processing, convolutional neural networks, CNN operations, ReLU activation, and CNNs in various applications, emphasizing feature learning and image classification.', 'chapters': [{'end': 344.201, 'segs': [{'end': 67.396, 'src': 'embed', 'start': 32.478, 'weight': 3, 'content': [{'end': 40.321, 'text': 'Now, I believe that sight, and specifically, like I said, vision, is one of the most important human senses that we all have.', 'start': 32.478, 'duration': 7.843}, {'end': 41.982, 'text': 'In fact, sighted', 'start': 41.102, 'duration': 0.88}, {'end': 49.745, 'text': 'people rely on vision quite a lot in our day-to-day lives, for everything from walking around, navigating the world,', 'start': 41.982, 'duration': 7.763}, {'end': 53.647, 'text': 'interacting and sensing other emotions in our colleagues and peers.', 'start': 49.745, 'duration': 3.902}, {'end': 67.396, 'text': "And today we're going to learn about how we can use deep learning and machine learning to build powerful vision systems that can both see and predict what is where by only looking at raw visual inputs.", 'start': 54.147, 'duration': 13.249}], 'summary': 'Sight is crucial for humans, and deep learning helps build powerful vision systems.', 'duration': 34.918, 'max_score': 32.478, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB432478.jpg'}, {'end': 104.257, 'src': 'embed', 'start': 76.403, 'weight': 1, 'content': [{'end': 81.746, 'text': 'But at its core, vision is actually so much more than just understanding what is where.', 'start': 76.403, 'duration': 5.343}, {'end': 83.607, 'text': 'It also goes much deeper.', 'start': 82.266, 'duration': 1.341}, {'end': 84.908, 'text': 'Take this scene, for example.', 'start': 83.727, 'duration': 1.181}, {'end': 91.411, 'text': 'We can build computer vision systems that can identify, of course, all of the objects in this environment,', 'start': 85.008, 'duration': 6.403}, {'end': 95.534, 'text': 'starting first with the yellow taxi or the van parked on the side of the road.', 'start': 91.411, 'duration': 4.123}, {'end': 104.257, 'text': 'But we also need to understand each of these objects at a much deeper level, not just where they are, but actually predicting the future,', 'start': 96.374, 'duration': 7.883}], 'summary': 'Computer vision goes beyond object recognition to predict the future.', 'duration': 27.854, 'max_score': 76.403, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB476403.jpg'}, {'end': 187.885, 'src': 'embed', 'start': 158.413, 'weight': 2, 'content': [{'end': 165.655, 'text': 'And deep learning, in particular, is really leading this revolution of computer vision and achieving sight of computers.', 'start': 158.413, 'duration': 7.242}, {'end': 167.436, 'text': 'For example,', 'start': 166.636, 'duration': 0.8}, {'end': 176.62, 'text': 'allowing robots to pick up on these key visual cues in their environment critical for really navigating the world together with us as humans.', 'start': 167.436, 'duration': 9.184}, {'end': 180.982, 'text': "These algorithms that you're going to learn about today have become so mainstreamed, in fact,", 'start': 177.12, 'duration': 3.862}, {'end': 187.885, 
'text': "that they're fitting on all of your smartphones in your pockets, processing every single image that you take, enhancing those images,", 'start': 180.982, 'duration': 6.903}], 'summary': 'Deep learning is revolutionizing computer vision, enabling robots to navigate using visual cues; algorithms are now mainstream on smartphones.', 'duration': 29.472, 'max_score': 158.413, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB4158413.jpg'}, {'end': 226.734, 'src': 'embed', 'start': 200.842, 'weight': 0, 'content': [{'end': 209.365, 'text': 'And, like I said, deep learning has taken this field as a whole by storm over the past decade or so because of its ability critically,', 'start': 200.842, 'duration': 8.523}, {'end': 217.588, 'text': 'like we were talking about yesterday its ability to learn directly from raw data and those raw image inputs in what it sees in its environment.', 'start': 209.365, 'duration': 8.223}, {'end': 226.734, 'text': 'and learn explicitly how to perform, like we talked about yesterday, what is called feature extraction of those images in the environment.', 'start': 218.328, 'duration': 8.406}], 'summary': 'Deep learning has revolutionized the field by learning directly from raw data, especially in image processing.', 'duration': 25.892, 'max_score': 200.842, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB4200842.jpg'}], 'start': 9.226, 'title': 'Computer vision and deep learning', 'summary': 'Explores the significance of computer vision and its application in deep learning and machine learning, impacting fields such as autonomous driving, healthcare, and accessibility applications.', 'chapters': [{'end': 53.647, 'start': 9.226, 'title': 'Building computers for vision', 'summary': 'Discusses the importance of vision and how computers can achieve the sense of sight, a crucial human sense relied upon for navigating the world and 
interpreting emotions in others.', 'duration': 44.421, 'highlights': ['Computers achieving the sense of sight and vision is a key topic in the course, emphasizing the importance of vision for human interaction and navigation.', 'Humans rely on vision for day-to-day activities, such as walking, navigating, and interpreting emotions in colleagues and peers.']}, {'end': 344.201, 'start': 54.147, 'title': 'Deep learning for computer vision', 'summary': 'Discusses the use of deep learning and machine learning in computer vision to understand, predict, and process raw visual inputs, impacting various fields such as autonomous driving, healthcare, and accessibility applications.', 'duration': 290.054, 'highlights': ['Deep learning and machine learning are utilized to build powerful vision systems that can understand and predict what is where by processing raw visual inputs.', 'Computer vision systems need to identify and predict the future of objects in a scene, such as predicting the movement of a yellow taxi versus a parked white van.', 'The use of deep learning algorithms in computer vision has become mainstream, impacting fields like biology, medicine, autonomous driving, and accessibility applications.', 'Deep learning enables end-to-end approaches in autonomous driving, allowing vehicles to learn how to steer, throttle, and brake based on raw image inputs and sensing modalities.', 'Computer vision algorithms are also being applied to assist visually impaired individuals, such as detecting trails for navigation during runs.']}], 'duration': 334.975, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB49226.jpg', 'highlights': ['Deep learning enables end-to-end approaches in autonomous driving, impacting fields like biology, medicine, and accessibility applications.', 'Computer vision systems need to identify and predict the future of objects in a scene, such as predicting the movement of a yellow taxi versus a 
parked white van.', 'The use of deep learning algorithms in computer vision has become mainstream, impacting fields like biology, medicine, autonomous driving, and accessibility applications.', 'Computers achieving the sense of sight and vision is a key topic in the course, emphasizing the importance of vision for human interaction and navigation.', 'Humans rely on vision for day-to-day activities, such as walking, navigating, and interpreting emotions in colleagues and peers.']}, {'end': 940.122, 'segs': [{'end': 433.707, 'src': 'embed', 'start': 366.379, 'weight': 1, 'content': [{'end': 373.342, 'text': 'If we think of sight as coming to computers through images, then how can a computer even start to process those images?', 'start': 366.379, 'duration': 6.963}, {'end': 377.324, 'text': 'Well, to a computer, images are just numbers right?', 'start': 374.202, 'duration': 3.122}, {'end': 381.466, 'text': 'And suppose, for example, we have a picture here of Abraham Lincoln.', 'start': 378.284, 'duration': 3.182}, {'end': 385.341, 'text': 'OK, this picture is made up of what are called pixels.', 'start': 382.6, 'duration': 2.741}, {'end': 388.862, 'text': 'Every pixel is just a dot in this image.', 'start': 385.801, 'duration': 3.061}, {'end': 394.163, 'text': 'And since this is a grayscale image, each of these pixels is just a single number.', 'start': 389.422, 'duration': 4.741}, {'end': 400.045, 'text': 'Now, we can represent our image now as this two-dimensional matrix of numbers.', 'start': 394.823, 'duration': 5.222}, {'end': 407.307, 'text': 'And because, like I said, this is a grayscale image, every pixel is corresponding to just one number at that matrix location.', 'start': 400.725, 'duration': 6.582}, {'end': 410.649, 'text': "Now assume, for example, we didn't have a grayscale image.", 'start': 408.007, 'duration': 2.642}, {'end': 411.57, 'text': 'We had a color image.', 'start': 410.689, 'duration': 0.881}, {'end': 413.531, 'text': 'That would be an RGB 
image.', 'start': 412.05, 'duration': 1.481}, {'end': 418.035, 'text': 'So now every pixel is going to be composed not just of one number, but of three numbers.', 'start': 414.032, 'duration': 4.003}, {'end': 422.478, 'text': 'So you can think of that as kind of a 3D matrix instead of a 2D matrix,', 'start': 418.075, 'duration': 4.403}, {'end': 427.622, 'text': 'where you almost have three two-dimensional matrix that are stacked on top of each other.', 'start': 422.478, 'duration': 5.144}, {'end': 433.707, 'text': 'So now, with this basis of basically numerical representations of images,', 'start': 428.543, 'duration': 5.164}], 'summary': 'Images are represented as numerical matrices; grayscale images have single numbers per pixel while rgb images have three numbers per pixel.', 'duration': 67.328, 'max_score': 366.379, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB4366379.jpg'}, {'end': 515.318, 'src': 'embed', 'start': 480.197, 'weight': 3, 'content': [{'end': 481.718, 'text': 'These are discrete different classes.', 'start': 480.197, 'duration': 1.521}, {'end': 485.261, 'text': "So let's consider first the task of image classification.", 'start': 481.758, 'duration': 3.503}, {'end': 491.144, 'text': 'In this task, we want to predict an individual label for every single image.', 'start': 486.201, 'duration': 4.943}, {'end': 496.827, 'text': 'And this label that we predict is going to be one of n different possible labels that could be considered.', 'start': 491.204, 'duration': 5.623}, {'end': 500.629, 'text': "So for example, let's say we have a bunch of images of US presidents.", 'start': 497.247, 'duration': 3.382}, {'end': 507.633, 'text': 'And we want to build a classification pipeline to tell us which president is in this particular image that you see on the screen.', 'start': 501.329, 'duration': 6.304}, {'end': 515.318, 'text': 'Now, the goal of our model in this case is going to be basically to 
output a probability score,', 'start': 508.493, 'duration': 6.825}], 'summary': 'Task: image classification to predict labels for images of us presidents using probability scores.', 'duration': 35.121, 'max_score': 480.197, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB4480197.jpg'}, {'end': 783.506, 'src': 'embed', 'start': 752.428, 'weight': 0, 'content': [{'end': 757.456, 'text': "building on top of previous features that it's learned to build more and more complex set of features.", 'start': 752.428, 'duration': 5.028}, {'end': 767.103, 'text': "Now, we're going to see exactly how neural networks can do this in the image domain as part of this lecture.", 'start': 759.901, 'duration': 7.202}, {'end': 775.784, 'text': 'But specifically, neural networks will allow us to learn these visual features from visual data if we construct them cleverly.', 'start': 767.703, 'duration': 8.081}, {'end': 783.506, 'text': "And the key point here is that actually, the models and the architectures that we learned about in yesterday's lecture and so far in this course,", 'start': 775.964, 'duration': 7.542}], 'summary': 'Neural networks can learn complex visual features from data, improving on previous capabilities.', 'duration': 31.078, 'max_score': 752.428, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB4752428.jpg'}, {'end': 842.177, 'src': 'embed', 'start': 819.718, 'weight': 2, 'content': [{'end': 827.363, 'text': "Now let's say that we want to directly, without any modifications, use a fully connected network, like we learned about in lecture one,", 'start': 819.718, 'duration': 7.645}, {'end': 829.325, 'text': 'with an image processing pipeline.', 'start': 827.363, 'duration': 1.962}, {'end': 833.387, 'text': 'So directly taking an image and feeding it to a fully connected network.', 'start': 829.485, 'duration': 3.902}, {'end': 836.891, 'text': 'Could we do 
something like that? Actually, in this case, we could.', 'start': 833.848, 'duration': 3.043}, {'end': 842.177, 'text': 'The way we would have to do it is remember that, because our image is a two-dimensional array,', 'start': 836.911, 'duration': 5.266}], 'summary': 'Using fully connected network for image processing is feasible.', 'duration': 22.459, 'max_score': 819.718, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB4819718.jpg'}], 'start': 344.921, 'title': 'Computer vision fundamentals', 'summary': 'Covers computer vision basics including image representation, grayscale vs color images, image classification, regression, limitations of fully connected networks, and the role of neural networks in learning visual features.', 'chapters': [{'end': 433.707, 'start': 344.921, 'title': 'Computer vision basics', 'summary': 'Discusses how computers process images, representing images as numerical matrices, and the difference between grayscale and color images.', 'duration': 88.786, 'highlights': ['Computers process images as numerical representations, with grayscale images represented as a 2D matrix of single numbers and color images as a 3D matrix of three numbers.', 'The chapter delves into the core question of how computers see and process images, aiming to build a computer to perform similar tasks as humans, such as recognizing images.', 'The discussion highlights the concept of representing images as pixels and the transformation of color images into a 3D matrix, providing fundamental insights into computer vision.', 'The chapter focuses on the fundamental question of how computers process images, emphasizing the representation of images as numerical data and the distinction between grayscale and color images.']}, {'end': 940.122, 'start': 433.707, 'title': 'Image classification and computer vision', 'summary': 'Discusses the tasks of image classification and regression, the concept of features in computer 
vision, the limitations of fully connected networks for image processing, and the role of neural networks in learning visual features hierarchically.', 'duration': 506.415, 'highlights': ['Neural networks can learn visual features hierarchically, building on top of previous features to create a more complex set of features.', 'The limitations of using fully connected networks for image processing due to the loss of spatial information and the inefficiency in handling a large number of parameters.', 'The concept of features in computer vision, where classification is done by detecting patterns and identifying when certain patterns occur over others.', 'The tasks of image classification and regression, where image classification involves predicting an individual label for every image, and regression involves predicting a continuous value.']}], 'duration': 595.201, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB4344921.jpg', 'highlights': ['Neural networks learn visual features hierarchically', 'Computers process images as numerical representations', 'Limitations of fully connected networks for image processing', 'Image classification predicts individual label for every image', 'Grayscale images represented as a 2D matrix of single numbers', 'Color images represented as a 3D matrix of three numbers']}, {'end': 1214.265, 'segs': [{'end': 969.596, 'src': 'embed', 'start': 940.122, 'weight': 1, 'content': [{'end': 945.964, 'text': "that's very unique about images, here into our input and here into our model, most importantly.", 'start': 940.122, 'duration': 5.842}, {'end': 956.113, 'text': "So to do this, let's represent our 2D image as its original form, as a two-dimensional array of numbers.", 'start': 947.972, 'duration': 8.141}, {'end': 962.254, 'text': 'One way that we can use spatial structure here, inherent to our input,', 'start': 957.134, 'duration': 5.12}, {'end': 968.415, 'text': 'is to connect what are 
called basically these patches of our input to neurons in the hidden layer.', 'start': 962.254, 'duration': 6.161}, {'end': 969.596, 'text': 'So, for example,', 'start': 968.555, 'duration': 1.041}], 'summary': 'Utilize spatial structure of 2d images for connecting input patches to neurons in hidden layer.', 'duration': 29.474, 'max_score': 940.122, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB4940122.jpg'}, {'end': 1070.981, 'src': 'embed', 'start': 1040.185, 'weight': 2, 'content': [{'end': 1047.589, 'text': 'In this way, we essentially preserve all of that very key and rich spatial information inherent to our input.', 'start': 1040.185, 'duration': 7.404}, {'end': 1053.231, 'text': 'But remember that the ultimate task here is not only to just preserve that spatial information.', 'start': 1047.628, 'duration': 5.603}, {'end': 1058.374, 'text': 'we want to ultimately learn features, learn those patterns, so that we can detect and classify these images.', 'start': 1053.231, 'duration': 5.143}, {'end': 1070.981, 'text': 'And we can do this by weighting, right? 
Weighting the connections between the patches of our input in order to detect what those certain features are.', 'start': 1058.914, 'duration': 12.067}], 'summary': 'Preserve spatial information, learn features for image detection and classification.', 'duration': 30.796, 'max_score': 1040.185, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB41040185.jpg'}, {'end': 1133.126, 'src': 'embed', 'start': 1106.155, 'weight': 3, 'content': [{'end': 1111.678, 'text': "We're going to apply this same four by four, let's call this not a patch anymore, let's use the terminology filter.", 'start': 1106.155, 'duration': 5.523}, {'end': 1121.483, 'text': "We'll apply this same four by four filter in the input and use the result of that operation to define the state of the neuron in the next layer right?", 'start': 1111.878, 'duration': 9.605}, {'end': 1126.684, 'text': "And now we're going to shift our filter by, let's say, two pixels to the right.", 'start': 1121.563, 'duration': 5.121}, {'end': 1133.126, 'text': "And that's going to define the next neuron in the adjacent location in the future layer.", 'start': 1127.565, 'duration': 5.561}]
1165.816, 'text': "Because ultimately, that's our final goal.", 'start': 1163.854, 'duration': 1.962}, {'end': 1168.497, 'text': "That's our real goal for this class is to extract those patterns.", 'start': 1165.876, 'duration': 2.621}, {'end': 1173.521, 'text': "So let's make this very concrete by walking through maybe a concrete example.", 'start': 1169.078, 'duration': 4.443}, {'end': 1184.665, 'text': 'So suppose, for example, we want to build a convolutional algorithm to detect or classify an X in an image.', 'start': 1175.362, 'duration': 9.303}], 'summary': 'Understanding how convolution allows learning of data patterns.', 'duration': 35.501, 'max_score': 1149.164, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB41149164.jpg'}], 'start': 940.122, 'title': 'Spatial structure and convolution in image processing', 'summary': 'Delves into utilizing spatial structure and convolution in image processing for preserving spatial information, learning features, and classifying images. 
it also provides a practical example of using convolution for pattern recognition.', 'chapters': [{'end': 1058.374, 'start': 940.122, 'title': 'Spatial structure in image processing', 'summary': 'Discusses the utilization of spatial structure in image processing, highlighting the connection of patches from the input to neurons in the hidden layer to preserve spatial information and learn features for image detection and classification.', 'duration': 118.252, 'highlights': ['Utilization of spatial structure by connecting patches from input to neurons in the hidden layer', 'Preservation of rich spatial information inherent to the input', 'Sliding patch pixel by pixel across the input image to respond with another image on the output layer']}, {'end': 1214.265, 'start': 1058.914, 'title': 'Convolution in neural networks', 'summary': 'Explains the concept of convolution as a mathematical operation used in neural networks, illustrating how it allows the learning of features and patterns in data, with a practical example of classifying an x in a black and white image.', 'duration': 155.351, 'highlights': ['Convolution is a mathematical operation used in neural networks to learn features and patterns in data, such as detecting or classifying an X in an image.', 'The process involves applying a four by four filter to the input, shifting it by pixels, and sliding over both the input image and the output neurons in the secondary layer.', 'The 16 different weights in the four by four pixel patch contribute to defining the state of the neuron in the next layer.']}], 'duration': 274.143, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB4940122.jpg', 'highlights': ['Convolution used in neural networks to learn features and patterns', 'Utilization of spatial structure by connecting patches from input to neurons', 'Preservation of rich spatial information inherent to the input', 'Applying a four by four filter to the input, 
shifting it by pixels']}, {'end': 2136.359, 'segs': [{'end': 1304.511, 'src': 'embed', 'start': 1276.758, 'weight': 1, 'content': [{'end': 1283.264, 'text': 'And we can use filters to pick up on when these small patches or small images occur.', 'start': 1276.758, 'duration': 6.506}, {'end': 1288.946, 'text': 'So in the case of Xs, these filters may represent semantic things,', 'start': 1284.125, 'duration': 4.821}, {'end': 1296.389, 'text': 'for example the diagonal lines or the crossings that capture all of the important characteristics of the X.', 'start': 1288.946, 'duration': 7.443}, {'end': 1304.511, 'text': "So we'll probably capture these features in the arms and the center of our letter in any image of an X,", 'start': 1296.389, 'duration': 8.122}], 'summary': 'Using filters to capture semantic features of x in images.', 'duration': 27.753, 'max_score': 1276.758, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB41276758.jpg'}, {'end': 1376.946, 'src': 'embed', 'start': 1352.378, 'weight': 0, 'content': [{'end': 1362.181, 'text': 'Convolution preserves all of that spatial information in our input by learning image features in those smaller squares of regions that preserve our input data.', 'start': 1352.378, 'duration': 9.803}, {'end': 1371.984, 'text': 'So, just to give another concrete example, to perform this operation, we need to do an element-wise multiplication between the filter matrix,', 'start': 1362.221, 'duration': 9.763}, {'end': 1376.946, 'text': 'those miniature patches, as well as the patch of our input image.', 'start': 1371.984, 'duration': 4.962}], 'summary': 'Convolution preserves spatial information in input by learning image features.', 'duration': 24.568, 'max_score': 1352.378, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB41352378.jpg'}, {'end': 1598.043, 'src': 'embed', 'start': 1571.124, 'weight': 7, 'content': [{'end': 
1576.106, 'text': 'And simply by changing the weights that are present in these three by three matrices,', 'start': 1571.124, 'duration': 4.982}, {'end': 1580.048, 'text': 'you can see the variability of different types of features that we can detect.', 'start': 1576.106, 'duration': 3.942}, {'end': 1587.672, 'text': 'So for example, we can design filters that can sharpen an image, make the edges sharper in the image.', 'start': 1580.248, 'duration': 7.424}, {'end': 1590.213, 'text': 'We can design filters that will extract edges.', 'start': 1587.872, 'duration': 2.341}, {'end': 1595.716, 'text': 'We can do stronger edge detection by, again, modifying the weights in all of those filters.', 'start': 1590.894, 'duration': 4.822}, {'end': 1598.043, 'text': 'So I hope,', 'start': 1597.543, 'duration': 0.5}], 'summary': 'Changing weights in matrices detects various image features.', 'duration': 26.919, 'max_score': 1571.124, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB41571124.jpg'}, {'end': 1703.138, 'src': 'embed', 'start': 1675.111, 'weight': 8, 'content': [{'end': 1682.433, 'text': "well, what you ultimately create by creating convolutional layers and convolutional networks is what's called a CNN, a convolutional neural network.", 'start': 1675.111, 'duration': 7.322}, {'end': 1686.235, 'text': "And that's going to be the core architecture of today's class.", 'start': 1683.174, 'duration': 3.061}, {'end': 1691.417, 'text': "So let's consider a very simple CNN that was designed for image classification.", 'start': 1686.715, 'duration': 4.702}, {'end': 1696.459, 'text': 'The task here again is to learn the features directly from the raw data.', 'start': 1691.897, 'duration': 4.562}, {'end': 1703.138, 'text': 'and use these learn features for classification towards some task of object detection that we want to perform.', 'start': 1697.34, 'duration': 5.798}], 'summary': "Cnn is the core of today's class, 
designed for image classification and object detection.", 'duration': 28.027, 'max_score': 1675.111, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB41675111.jpg'}, {'end': 1772.777, 'src': 'heatmap', 'start': 1719.882, 'weight': 9, 'content': [{'end': 1723.124, 'text': 'Convolutions are used to generate these feature maps.', 'start': 1719.882, 'duration': 3.242}, {'end': 1728.948, 'text': 'So they take as input both the previous image as well as some filter that they want to detect.', 'start': 1723.324, 'duration': 5.624}, {'end': 1735.193, 'text': 'And they output a feature map of how this filter is related to the original image.', 'start': 1729.108, 'duration': 6.085}, {'end': 1741.059, 'text': 'The second step is, like yesterday, applying a nonlinearity to the result of these feature maps.', 'start': 1735.873, 'duration': 5.186}, {'end': 1747.046, 'text': 'That injects some nonlinear activations to our neural networks, allows it to deal with nonlinear data.', 'start': 1741.56, 'duration': 5.486}, {'end': 1749.708, 'text': 'Third step is pooling,', 'start': 1748.027, 'duration': 1.681}, {'end': 1763.592, 'text': 'which is essentially a downsampling operation to allow our networks to deal with larger and larger scale images by progressively downscaling their size so that our filters can progressively grow in receptive field.', 'start': 1749.708, 'duration': 13.884}, {'end': 1772.777, 'text': 'And finally, feeding all of these resulting features to some neural network to infer the class scores.', 'start': 1765.194, 'duration': 7.583}], 'summary': 'Convolutions generate feature maps, nonlinearity adds activations, pooling downsamples, and features are fed to a neural network for class scores.', 'duration': 52.895, 'max_score': 1719.882, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB41719882.jpg'}, {'end': 1873.304, 'src': 'embed', 'start': 1847.541, 
'weight': 2, 'content': [{'end': 1853.847, 'text': "Now, what's really special here and what I really want to stress is the local connectivity.", 'start': 1847.541, 'duration': 6.306}, {'end': 1862.134, 'text': 'Every single neuron in this hidden layer only sees a certain patch of inputs in its previous layer.', 'start': 1854.407, 'duration': 7.727}, {'end': 1870.322, 'text': 'So if I point at just this one neuron in the output layer, this neuron only sees the inputs at this red square.', 'start': 1862.194, 'duration': 8.128}, {'end': 1873.304, 'text': "It doesn't see any of the other inputs in the rest of the image.", 'start': 1870.342, 'duration': 2.962}], 'summary': 'Local connectivity emphasizes specific input patches for each neuron.', 'duration': 25.763, 'max_score': 1847.541, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB41847541.jpg'}, {'end': 1930.979, 'src': 'embed', 'start': 1896.949, 'weight': 4, 'content': [{'end': 1900.41, 'text': "Now, let's define this actual computation that's going on.", 'start': 1896.949, 'duration': 3.461}, {'end': 1907.732, 'text': 'For a neuron in a hidden layer, its inputs are those neurons that fell within its patch in the previous layer.', 'start': 1900.89, 'duration': 6.842}, {'end': 1914.274, 'text': 'We can apply this matrix of weights, here denoted as a 4 by 4 filter that you can see on the left-hand side.', 'start': 1908.052, 'duration': 6.222}, {'end': 1917.215, 'text': 'And in this case, we do an element-wise multiplication.', 'start': 1914.794, 'duration': 2.421}, {'end': 1921.994, 'text': 'We add the outputs, we apply a bias, and we add that non-linearity.', 'start': 1918.151, 'duration': 3.843}, {'end': 1930.979, 'text': "That's the core steps that we take in really all of these neural networks that you're learning about in today's and this week's class, to be honest.", 'start': 1922.774, 'duration': 8.205}], 'summary': 'Neural network computes inputs using 
weights, bias, and non-linearity.', 'duration': 34.03, 'max_score': 1896.949, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB41896949.jpg'}, {'end': 2035.991, 'src': 'heatmap', 'start': 1922.774, 'weight': 1, 'content': [{'end': 1930.979, 'text': "That's the core steps that we take in really all of these neural networks that you're learning about in today's and this week's class, to be honest.", 'start': 1922.774, 'duration': 8.205}, {'end': 1937.323, 'text': 'Now remember that this element-wise multiplication and addition operation, that sliding operation,', 'start': 1931.8, 'duration': 5.523}, {'end': 1940.666, 'text': "that's called convolution and that's the basis of these layers.", 'start': 1937.323, 'duration': 3.343}, {'end': 1946.769, 'text': "So that defines how neurons in convolutional layers are connected, how they're mathematically formulated.", 'start': 1941.286, 'duration': 5.483}, {'end': 1949.791, 'text': 'But within a single convolutional layer.', 'start': 1947.309, 'duration': 2.482}, {'end': 1957.455, 'text': "it's also really important to understand that a single layer could actually try to detect multiple sets of filters.", 'start': 1949.791, 'duration': 7.664}, {'end': 1961.837, 'text': 'Maybe you want to detect in one image multiple features, not just one feature.', 'start': 1957.935, 'duration': 3.902}, {'end': 1969.885, 'text': "But if you were detecting faces, you don't only want to detect eyes, you want to detect You know, eyes, noses, mouths, ears.", 'start': 1961.917, 'duration': 7.968}, {'end': 1975.298, 'text': 'All of those things are critical patterns that define a face and can help you classify a face.', 'start': 1970.246, 'duration': 5.052}, {'end': 1984.012, 'text': 'So what we need to think of is actually convolution operations that can output a volume of different images.', 'start': 1976.707, 'duration': 7.305}, {'end': 1992.078, 'text': 'Every slice of this volume 
effectively denotes a different filter that can be identified in our original input.', 'start': 1984.372, 'duration': 7.706}, {'end': 1999.182, 'text': 'And each of those filters is going to basically correspond to a specific pattern or feature in our image as well.', 'start': 1992.538, 'duration': 6.644}, {'end': 2005.029, 'text': 'Think of the connections in these neurons in terms of their receptive field once again.', 'start': 2000.185, 'duration': 4.844}, {'end': 2011.554, 'text': 'The locations within the input of that node that they were connected to in the previous layer.', 'start': 2005.769, 'duration': 5.785}, {'end': 2023.043, 'text': 'These parameters really define what I like to think of as the spatial arrangement of information that propagates throughout the network and throughout the convolutional layers in particular.', 'start': 2012.074, 'duration': 10.969}, {'end': 2035.991, 'text': "Now I think, just to summarize what we've seen and how connections in these types of neural networks are defined, and, let's say,", 'start': 2024.218, 'duration': 11.773}], 'summary': 'Neural networks use convolution to detect multiple features, defining the spatial arrangement of information.', 'duration': 113.217, 'max_score': 1922.774, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB41922774.jpg'}, {'end': 1969.885, 'src': 'embed', 'start': 1941.286, 'weight': 3, 'content': [{'end': 1946.769, 'text': "So that defines how neurons in convolutional layers are connected, how they're mathematically formulated.", 'start': 1941.286, 'duration': 5.483}, {'end': 1949.791, 'text': 'But within a single convolutional layer.', 'start': 1947.309, 'duration': 2.482}, {'end': 1957.455, 'text': "it's also really important to understand that a single layer could actually try to detect multiple sets of filters.", 'start': 1949.791, 'duration': 7.664}, {'end': 1961.837, 'text': 'Maybe you want to detect in one image multiple 
features, not just one feature.', 'start': 1957.935, 'duration': 3.902}, {'end': 1969.885, 'text': "But if you were detecting faces, you don't only want to detect eyes, you want to detect You know, eyes, noses, mouths, ears.", 'start': 1961.917, 'duration': 7.968}], 'summary': 'Convolutional layers can detect multiple features in a single layer, such as eyes, noses, mouths, and ears.', 'duration': 28.599, 'max_score': 1941.286, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB41941286.jpg'}, {'end': 2122.832, 'src': 'heatmap', 'start': 2053.601, 'weight': 5, 'content': [{'end': 2056.243, 'text': 'The remaining steps are very critical as well,', 'start': 2053.601, 'duration': 2.642}, {'end': 2063.827, 'text': "but I want to maybe pause for a second and make sure that everyone's on the same page with the convolutional operation and the definition of convolutional layers.", 'start': 2056.243, 'duration': 7.584}, {'end': 2069.193, 'text': 'Awesome, okay.', 'start': 2068.453, 'duration': 0.74}, {'end': 2080.377, 'text': 'So the next step here is to take those resulting feature maps that our convolutional layers extract and apply a nonlinearity to the output volume of the convolutional layer.', 'start': 2069.914, 'duration': 10.463}, {'end': 2087.699, 'text': 'So as we discussed in the first lecture, applying these nonlinearities is really critical because it allows us to deal with nonlinear data.', 'start': 2080.437, 'duration': 7.262}, {'end': 2091.281, 'text': 'And because image data in particular is extremely nonlinear.', 'start': 2087.76, 'duration': 3.521}, {'end': 2099.175, 'text': "That's a critical component of what makes convolutional neural networks actually operational in practice.", 'start': 2091.862, 'duration': 7.313}, {'end': 2102.799, 'text': 'In particular for convolutional neural networks.', 'start': 2100.478, 'duration': 2.321}, {'end': 2109.303, 'text': 'the activation function that is really 
really common for these models is the ReLU activation function.', 'start': 2102.799, 'duration': 6.504}, {'end': 2113.386, 'text': 'We talked a little bit about this in lecture one and two yesterday.', 'start': 2110.444, 'duration': 2.942}, {'end': 2116.388, 'text': 'The ReLU activation function, you can see it on the right hand side.', 'start': 2113.566, 'duration': 2.822}, {'end': 2122.832, 'text': 'Think of this function as a pixel by pixel operation that replaces basically all negative values with zero.', 'start': 2116.868, 'duration': 5.964}], 'summary': 'Convolutional layers extract feature maps, relu activation is critical for dealing with nonlinear image data.', 'duration': 69.231, 'max_score': 2053.601, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB42053601.jpg'}], 'start': 1214.966, 'title': 'Using convolutional neural networks', 'summary': 'Explains the use of convolutions to detect features in images, the process of convolution in neural networks, and the application of convolutional neural networks for image classification.', 'chapters': [{'end': 1396.615, 'start': 1214.966, 'title': 'Using convolutions to detect features', 'summary': 'Explains the use of convolutions to detect defining features like diagonal lines and crossings in images, enabling comparison at a patch level, which can result in improved detection of similarities.', 'duration': 181.649, 'highlights': ['Convolution preserves spatial information in the input by learning image features in smaller regions, improving detection of similarities at a patch level.', 'Using convolutions allows for comparison of important patches that define an X, leading to improved inference of similarity between images.', 'Filters in convolutions can capture semantic features like diagonal lines and crossings, enabling the detection of important characteristics of an X.']}, {'end': 1846.641, 'start': 1397.535, 'title': 'Understanding convolution in 
neural networks', 'summary': 'Explains the process of convolution in neural networks, illustrating how it is used to detect patterns in images through element-wise multiplication and addition, resulting in feature maps, and how different filters can be used to achieve different results, ultimately leading to the creation of convolutional neural networks (cnns) for image classification.', 'duration': 449.106, 'highlights': ['Convolution process in neural networks', 'Demonstrating the impact of different filters on feature detection', 'Creation of convolutional neural networks (CNNs)']}, {'end': 2136.359, 'start': 1847.541, 'title': 'Understanding convolutional neural networks', 'summary': 'Explains the local connectivity of neurons in hidden layers, the computation process for a neuron, the formation of multiple filters within a single convolutional layer, and the application of relu activation function in convolutional neural networks.', 'duration': 288.818, 'highlights': ['The local connectivity of neurons in hidden layers is stressed, with each neuron only seeing a certain patch of inputs in its previous layer, crucial for scaling models to large images.', 'The computation process for a neuron in a hidden layer involves applying a matrix of weights, element-wise multiplication, addition of outputs, application of bias, and non-linearity, forming the core steps in neural networks.', 'In a single convolutional layer, multiple sets of filters can be detected to identify different features in an image, with each filter corresponding to a specific pattern or feature in the image.', 'The ReLU activation function, common in convolutional neural networks, operates as a pixel by pixel operation, replacing negative values with zero and retaining positive values, thus serving as a thresholding function for nonlinear image data.']}], 'duration': 921.393, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB41214966.jpg', 
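The per-neuron computation described in this chapter — take the patch in the receptive field, multiply it element-wise by a filter, sum, add a bias, apply a non-linearity — can be sketched in a few lines of NumPy. This is an illustrative sketch, not the lecture's lab code; the 4x4 averaging filter and the toy image are made up:

```python
import numpy as np

def conv2d_single_filter(image, kernel, bias=0.0):
    """One filter slid over a 2D image: element-wise multiply the
    patch by the filter, sum, add a bias, then apply a ReLU."""
    H, W = image.shape
    k = kernel.shape[0]                      # assume a square k x k filter
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + k, j:j + k]  # the neuron's receptive field
            out[i, j] = np.sum(patch * kernel) + bias
    return np.maximum(out, 0.0)              # ReLU non-linearity

# a 4x4 averaging filter over a toy 5x5 image
img = np.arange(25, dtype=float).reshape(5, 5)
feat = conv2d_single_filter(img, np.ones((4, 4)) / 16)
print(feat.shape)  # (2, 2): each output neuron sees only one 4x4 patch
```

Sliding several such filters over the same input yields a volume of feature maps, one slice per filter, which is the multi-filter picture the lecture describes.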
'highlights': ['Using convolutions allows for comparison of important patches that define an X, leading to improved inference of similarity between images.', 'Filters in convolutions can capture semantic features like diagonal lines and crossings, enabling the detection of important characteristics of an X.', 'The local connectivity of neurons in hidden layers is stressed, with each neuron only seeing a certain patch of inputs in its previous layer, crucial for scaling models to large images.', 'In a single convolutional layer, multiple sets of filters can be detected to identify different features in an image, with each filter corresponding to a specific pattern or feature in the image.', 'The computation process for a neuron in a hidden layer involves applying a matrix of weights, element-wise multiplication, addition of outputs, application of bias, and non-linearity, forming the core steps in neural networks.', 'The ReLU activation function, common in convolutional neural networks, operates as a pixel by pixel operation, replacing negative values with zero and retaining positive values, thus serving as a thresholding function for nonlinear image data.', 'Convolution preserves spatial information in the input by learning image features in smaller regions, improving detection of similarities at a patch level.', 'Demonstrating the impact of different filters on feature detection', 'Creation of convolutional neural networks (CNNs)', 'Convolution process in neural networks']}, {'end': 2398.838, 'segs': [{'end': 2221.188, 'src': 'heatmap', 'start': 2157.53, 'weight': 0, 'content': [{'end': 2164.114, 'text': "The other popular belief is that ReLU activation functions, well, it's not a belief.", 'start': 2157.53, 'duration': 6.584}, {'end': 2168.137, 'text': "They are extremely easy to compute, and they're very easy and computationally efficient.", 'start': 2164.154, 'duration': 3.983}, {'end': 2170.178, 'text': 'Their gradients are very cleanly defined.', 'start': 
2168.457, 'duration': 1.721}, {'end': 2174.24, 'text': "They're constants, except for a piecewise non-linearity.", 'start': 2170.198, 'duration': 4.042}, {'end': 2177.422, 'text': 'So that makes them very popular for these domains.', 'start': 2175.341, 'duration': 2.081}, {'end': 2183.793, 'text': 'Now, the next key operation in a CNN is that of pooling.', 'start': 2179.949, 'duration': 3.844}, {'end': 2187.196, 'text': 'Now, pooling is an operation that is at its core.', 'start': 2184.293, 'duration': 2.903}, {'end': 2188.417, 'text': 'it serves one purpose,', 'start': 2187.196, 'duration': 1.221}, {'end': 2195.603, 'text': 'and that is to reduce the dimensionality of the image progressively as you go deeper and deeper through your convolutional layers.', 'start': 2188.417, 'duration': 7.186}, {'end': 2202.911, 'text': 'Now you can really start to reason about this is that When you decrease the dimensionality of your features,', 'start': 2196.124, 'duration': 6.787}, {'end': 2207.576, 'text': "you're effectively increasing the dimensionality of your filters right now,", 'start': 2202.911, 'duration': 4.665}, {'end': 2215.585, 'text': 'because every filter that you slide over a smaller image is capturing a larger receptive field that occurred previously in that network.', 'start': 2207.576, 'duration': 8.009}, {'end': 2221.188, 'text': "So a very common technique for pooling is what's called maximum pooling or max pooling for short.", 'start': 2216.306, 'duration': 4.882}], 'summary': 'Relu activation functions are popular due to easy computation and clean gradients. 
pooling reduces image dimensionality for deeper convolutional layers.', 'duration': 50.046, 'max_score': 2157.53, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB42157530.jpg'}, {'end': 2255.399, 'src': 'embed', 'start': 2228.87, 'weight': 2, 'content': [{'end': 2235.232, 'text': 'But instead of doing this convolution operation, what these patches will do is simply take the maximum of that patch location.', 'start': 2228.87, 'duration': 6.362}, {'end': 2240.434, 'text': 'So think of this as kind of activating the maximum value that comes from that location.', 'start': 2235.632, 'duration': 4.802}, {'end': 2243.495, 'text': 'and propagating only the maximums.', 'start': 2241.714, 'duration': 1.781}, {'end': 2251.878, 'text': 'I encourage all of you actually to think of maybe brainstorm other ways that we could perform even better pooling operations than max pooling.', 'start': 2244.435, 'duration': 7.443}, {'end': 2255.399, 'text': 'There are many common ways, but you could think of some.', 'start': 2252.078, 'duration': 3.321}], 'summary': 'Max pooling selects maximum values from patch locations, suggesting potential for improved pooling methods.', 'duration': 26.529, 'max_score': 2228.87, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB42228870.jpg'}, {'end': 2297.959, 'src': 'embed', 'start': 2273.137, 'weight': 4, 'content': [{'end': 2280.627, 'text': "And now we're ready to really start to put them together and form and construct a CNN all the way from the ground up.", 'start': 2273.137, 'duration': 7.49}, {'end': 2287.453, 'text': 'And with CNNs, we can layer these operations one after the other, right, starting first with convolutions, non linearities,', 'start': 2280.687, 'duration': 6.766}, {'end': 2292.996, 'text': 'and then pooling and repeating these over and over again to learn these hierarchies of features.', 'start': 2287.453, 'duration': 
5.543}, {'end': 2297.959, 'text': "and that's exactly how we obtain pictures like this, which we started yesterday's lecture with,", 'start': 2292.996, 'duration': 4.963}], 'summary': 'Construct a cnn from scratch to learn hierarchies of features for image recognition.', 'duration': 24.822, 'max_score': 2273.137, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB42273137.jpg'}, {'end': 2346.732, 'src': 'embed', 'start': 2316.368, 'weight': 3, 'content': [{'end': 2321.29, 'text': 'First is the feature learning pipeline, which we learn the features that we want to detect.', 'start': 2316.368, 'duration': 4.922}, {'end': 2325.952, 'text': 'And then the second part is actually detecting those features and doing the classification.', 'start': 2321.61, 'duration': 4.342}, {'end': 2333.884, 'text': 'Now the convolutional and pooling layers output from the first part of that model.', 'start': 2328.14, 'duration': 5.744}, {'end': 2341.208, 'text': 'the goal of those convolutional and pooling layers is to output the high level features that are extracted from our input.', 'start': 2333.884, 'duration': 7.324}, {'end': 2346.732, 'text': 'But the next step is to actually use those features and detect their presence in order to classify the image.', 'start': 2341.288, 'duration': 5.444}], 'summary': 'Feature learning and detection pipeline extracts high-level features and classifies images.', 'duration': 30.364, 'max_score': 2316.368, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB42316368.jpg'}, {'end': 2354.018, 'src': 'heatmap', 'start': 2316.368, 'weight': 0.851, 'content': [{'end': 2321.29, 'text': 'First is the feature learning pipeline, which we learn the features that we want to detect.', 'start': 2316.368, 'duration': 4.922}, {'end': 2325.952, 'text': 'And then the second part is actually detecting those features and doing the classification.', 'start': 
2321.61, 'duration': 4.342}, {'end': 2333.884, 'text': 'Now the convolutional and pooling layers output from the first part of that model.', 'start': 2328.14, 'duration': 5.744}, {'end': 2341.208, 'text': 'the goal of those convolutional and pooling layers is to output the high level features that are extracted from our input.', 'start': 2333.884, 'duration': 7.324}, {'end': 2346.732, 'text': 'But the next step is to actually use those features and detect their presence in order to classify the image.', 'start': 2341.288, 'duration': 5.444}, {'end': 2354.018, 'text': 'So we can feed these outputted features into the fully connected layers that we learned about in lecture one,', 'start': 2347.412, 'duration': 6.606}], 'summary': 'Feature learning, detection, and classification process using convolutional and pooling layers to extract high-level features for image classification.', 'duration': 37.65, 'max_score': 2316.368, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB42316368.jpg'}, {'end': 2390.071, 'src': 'embed', 'start': 2361.445, 'weight': 5, 'content': [{'end': 2365.989, 'text': 'And we can do this by using a function called a softmax function.', 'start': 2361.445, 'duration': 4.544}, {'end': 2374.677, 'text': 'You can think of a softmax function as simply a normalizing function whose output represents that of a categorical probability distribution.', 'start': 2366.71, 'duration': 7.967}, {'end': 2382.504, 'text': 'So another way to think of this is basically if you have an array of numbers, you want to collapse and those numbers could take any real number form.', 'start': 2374.777, 'duration': 7.727}, {'end': 2384.906, 'text': 'you want to collapse that into some probability distribution.', 'start': 2382.504, 'duration': 2.402}, {'end': 2390.071, 'text': 'A probability distribution has several properties, namely that all of its values have to sum to one.', 'start': 2384.927, 'duration': 5.144}], 
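The softmax described in this segment can be written directly from its definition: exponentiate each score and normalize so the outputs are non-negative and sum to one. A minimal sketch (the max-subtraction is a standard numerical-stability trick, not something the lecture spells out):

```python
import numpy as np

def softmax(logits):
    """Collapse an array of real numbers into a categorical probability
    distribution: non-negative entries that sum to one."""
    z = logits - np.max(logits)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

scores = np.array([2.0, 1.0, 0.1])  # raw class scores from a dense layer
probs = softmax(scores)
# probs is non-negative and sums to 1 (up to float rounding)
```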
'summary': 'Softmax function normalizes array of numbers into a probability distribution, ensuring the sum of values is one.', 'duration': 28.626, 'max_score': 2361.445, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB42361445.jpg'}], 'start': 2136.399, 'title': 'Cnn operations and relu activation', 'summary': 'Discusses the popularity of the relu activation function in cnn, its computational efficiency, and the purpose of pooling in reducing image dimensionality, focusing on maximum pooling. it also covers key operations and the formation process of convolutional neural networks, emphasizing feature learning and image classification.', 'chapters': [{'end': 2251.878, 'start': 2136.399, 'title': 'Cnn operations and relu activation', 'summary': 'Discusses the popularity of the relu activation function in cnn due to its intuitive mechanism, computational efficiency, and cleanly defined gradients and explains the purpose of pooling in reducing the dimensionality of the image progressively, with a focus on maximum pooling.', 'duration': 115.479, 'highlights': ['The ReLU activation function is popular in CNN due to its intuitive mechanism, computational efficiency, and cleanly defined gradients, making it popular for these domains.', 'Pooling serves the purpose of reducing the dimensionality of the image progressively as you go deeper through convolutional layers, effectively increasing the dimensionality of filters and capturing larger receptive fields.', 'Max pooling is a common technique where small patches take the maximum value of that patch location, effectively activating and propagating only the maximums.']}, {'end': 2398.838, 'start': 2252.078, 'title': 'Convolutional neural networks', 'summary': 'Discusses the key operations of convolutional neural networks, the process of forming a cnn from the ground up, and the two main parts of a cnn: feature learning pipeline and image classification, with the goal of 
outputting high-level features and using them for image classification.', 'duration': 146.76, 'highlights': ['The two main parts of a CNN are the feature learning pipeline and the image classification. The convolutional and pooling layers output high-level features extracted from the input, which are then used for image classification.', 'The CNN is constructed by layering operations such as convolutions, non-linearities, and pooling, to learn hierarchies of features.', 'The function used for classifying the image is the softmax function, which normalizes the output into a categorical probability distribution.']}], 'duration': 262.439, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB42136399.jpg', 'highlights': ['The ReLU activation function is popular in CNN due to its intuitive mechanism, computational efficiency, and cleanly defined gradients, making it popular for these domains.', 'Pooling serves the purpose of reducing the dimensionality of the image progressively as you go deeper through convolutional layers, effectively increasing the dimensionality of filters and capturing larger receptive fields.', 'Max pooling is a common technique where small patches take the maximum value of that patch location, effectively activating and propagating only the maximums.', 'The two main parts of a CNN are the feature learning pipeline and the image classification. 
The convolutional and pooling layers output high-level features extracted from the input, which are then used for image classification.', 'The CNN is constructed by layering operations such as convolutions, non-linearities, and pooling, to learn hierarchies of features.', 'The function used for classifying the image is the softmax function, which normalizes the output into a categorical probability distribution.']}, {'end': 3311.418, 'segs': [{'end': 2480.398, 'src': 'embed', 'start': 2447.096, 'weight': 0, 'content': [{'end': 2450.681, 'text': 'The next set of convolutional operations now will contain 64 features.', 'start': 2447.096, 'duration': 3.585}, {'end': 2457.072, 'text': "We'll keep progressively growing and expanding our set of patterns that we're identifying in this image.", 'start': 2450.721, 'duration': 6.351}, {'end': 2466.329, 'text': "Next, we can finally flatten those resulting features that we've identified and feed all of this through our dense layers, our fully connected layers,", 'start': 2458.484, 'duration': 7.845}, {'end': 2467.61, 'text': 'that we learned about in lecture one.', 'start': 2466.329, 'duration': 1.281}, {'end': 2471.633, 'text': "These will allow us to predict those final, let's say, 10 classes.", 'start': 2468.13, 'duration': 3.503}, {'end': 2480.398, 'text': 'If we have 10 different final possible classes in our image, this layer will account for that and allow us to output, using softmax,', 'start': 2471.693, 'duration': 8.705}], 'summary': 'Using convolutional operations to identify 64 features and predict 10 classes in an image.', 'duration': 33.302, 'max_score': 2447.096, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB42447096.jpg'}, {'end': 2522.209, 'src': 'embed', 'start': 2491.7, 'weight': 3, 'content': [{'end': 2496.564, 'text': "But in reality, one thing I really want to stress in today's class, especially towards the end,", 'start': 2491.7, 'duration': 
4.864}, {'end': 2502.089, 'text': "is that this same architecture and same building blocks that we've talked about so far are extensible.", 'start': 2496.564, 'duration': 5.525}, {'end': 2509.416, 'text': 'And they extend to so many different applications and model types that we can imagine.', 'start': 2502.549, 'duration': 6.867}, {'end': 2516.243, 'text': 'So, for example, when we considered the CNN for classification, we saw that it really had two parts right?', 'start': 2509.536, 'duration': 6.707}, {'end': 2522.209, 'text': 'The first part being feature extraction, learning what features to look for, and the second part being the classification,', 'start': 2516.283, 'duration': 5.926}], 'summary': 'Neural network architecture is extensible to various applications and model types, such as cnn for classification.', 'duration': 30.509, 'max_score': 2491.7, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB42491700.jpg'}, {'end': 2609.157, 'src': 'embed', 'start': 2582.035, 'weight': 4, 'content': [{'end': 2589.143, 'text': "just to tie up the classification story, there's a significant impact in domains like health care, medical decision making,", 'start': 2582.035, 'duration': 7.108}, {'end': 2596.512, 'text': 'where deep learning models are being applied to the analysis of medical scans across a whole host of different medical imagery.', 'start': 2589.143, 'duration': 7.369}, {'end': 2609.157, 'text': 'Now classification tells us basically a discrete prediction of what our image contains, but we can actually go much deeper into this problem as well.', 'start': 2598.706, 'duration': 10.451}], 'summary': 'Deep learning models impact healthcare by analyzing medical scans for classification and discrete prediction.', 'duration': 27.122, 'max_score': 2582.035, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB42582035.jpg'}, {'end': 2871.691, 'src': 'embed', 
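The end-to-end pipeline sketched in this section — stacked convolutions (32 then 64 filters in the lecture), non-linearities, pooling, a flatten, dense layers, and a softmax over the final classes — can be mocked up as a single NumPy forward pass. Everything here (the 4-filter bank, sizes, random weights) is an illustrative stand-in; the course labs use a deep learning framework instead:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, kernels):
    """'Valid' convolution of one 2D image with a bank of k x k filters."""
    n, k = kernels.shape[0], kernels.shape[1]
    H, W = x.shape
    out = np.zeros((n, H - k + 1, W - k + 1))
    for f in range(n):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[f, i, j] = np.sum(x[i:i + k, j:j + k] * kernels[f])
    return out

def relu(x):
    """Non-linearity: zero out negative activations."""
    return np.maximum(x, 0.0)

def max_pool(x, p=2):
    """p x p max pooling per feature map: keep only each patch's maximum,
    shrinking spatial dimensions while widening the receptive field."""
    n, H, W = x.shape
    x = x[:, :H - H % p, :W - W % p]
    return x.reshape(n, H // p, p, W // p, p).max(axis=(2, 4))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# toy forward pass: 8x8 image -> 4 filters -> ReLU -> pool -> dense -> softmax
img = rng.random((8, 8))
feats = max_pool(relu(conv2d(img, rng.standard_normal((4, 3, 3)))))
flat = feats.reshape(-1)                     # flatten for the dense layer
W_fc = rng.standard_normal((10, flat.size))  # fully connected: 10 classes
probs = softmax(W_fc @ flat)
print(probs.shape)  # (10,)
```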
'start': 2845.834, 'weight': 5, 'content': [{'end': 2853.978, 'text': 'Ideally, we want one model that is able to both figure out where to attend to and do that classification afterwards.', 'start': 2845.834, 'duration': 8.144}, {'end': 2862.305, 'text': "So there have been many variants that have been proposed in this field of object detection, but I want to just for the purpose of today's class,", 'start': 2855.68, 'duration': 6.625}, {'end': 2865.407, 'text': 'introduce you to one of the most popular ones.', 'start': 2862.305, 'duration': 3.102}, {'end': 2871.691, 'text': 'Now, this is a model called R-CNN, or Faster R-CNN.', 'start': 2866.708, 'duration': 4.983}], 'summary': 'Introducing the popular Faster R-CNN model for object detection.', 'duration': 25.857, 'max_score': 2845.834, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB42845834.jpg'}, {'end': 2957.553, 'src': 'heatmap', 'start': 2882.824, 'weight': 0.736, 'content': [{'end': 2888.531, 'text': 'so that you could learn how to feed or where to feed into the downstream neural network.', 'start': 2882.824, 'duration': 5.707}, {'end': 2894.136, 'text': 'Now, this means that we can feed in the image to what are called these region proposal networks.', 'start': 2889.292, 'duration': 4.844}, {'end': 2903.324, 'text': 'The goal of these networks is to propose certain regions in the image that you should attend to and then feed just those regions into the downstream CNNs.', 'start': 2894.456, 'duration': 8.868}, {'end': 2913.272, 'text': 'So the goal here is to directly try to learn or extract all of those key regions and process them through the later part of the model.', 'start': 2904.044, 'duration': 9.228}, {'end': 2917.635, 'text': 'Each of these regions are processed with their own independent feature extractors.', 'start': 2913.452, 'duration': 4.183}, {'end': 2924.841, 'text': 'And then a classifier can be used to aggregate them
all and perform feature detection as well as object detection.', 'start': 2918.576, 'duration': 6.265}, {'end': 2930.626, 'text': 'Now, the beautiful thing about this is that this requires only a single pass through the network.', 'start': 2925.442, 'duration': 5.184}, {'end': 2932.487, 'text': "So it's extraordinarily fast.", 'start': 2930.686, 'duration': 1.801}, {'end': 2934.629, 'text': 'It can easily run in real time.', 'start': 2932.607, 'duration': 2.022}, {'end': 2939.033, 'text': "And it's very commonly used in many industry applications as well.", 'start': 2934.909, 'duration': 4.124}, {'end': 2940.594, 'text': 'It can even run on your smartphone.', 'start': 2939.073, 'duration': 1.521}, {'end': 2950.506, 'text': 'So in classification, we just saw how we can predict, you know, not only a single image per, or sorry, a single object per image.', 'start': 2942.338, 'duration': 8.168}, {'end': 2957.553, 'text': 'We saw an object detection potentially inferring multiple objects with bounding boxes in your image.', 'start': 2950.706, 'duration': 6.847}], 'summary': 'Region proposal networks extract key regions for fast object detection in real time, commonly used in industry applications and even on smartphones.', 'duration': 74.729, 'max_score': 2882.824, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB42882824.jpg'}, {'end': 2917.635, 'src': 'embed', 'start': 2894.456, 'weight': 1, 'content': [{'end': 2903.324, 'text': 'The goal of these networks is to propose certain regions in the image that you should attend to and then feed just those regions into the downstream CNNs.', 'start': 2894.456, 'duration': 8.868}, {'end': 2913.272, 'text': 'So the goal here is to directly try to learn or extract all of those key regions and process them through the later part of the model.', 'start': 2904.044, 'duration': 9.228}, {'end': 2917.635, 'text': 'Each of these regions are processed with their own independent 
feature extractors.', 'start': 2913.452, 'duration': 4.183}], 'summary': 'Networks propose key image regions to process with independent feature extractors.', 'duration': 23.179, 'max_score': 2894.456, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB42894456.jpg'}, {'end': 2989.791, 'src': 'embed', 'start': 2963.077, 'weight': 6, 'content': [{'end': 2967.879, 'text': 'Segmentation is the task of classification, but now done at every single pixel.', 'start': 2963.077, 'duration': 4.802}, {'end': 2972.322, 'text': 'This takes the idea of object detection, with bounding boxes, to the extreme.', 'start': 2967.939, 'duration': 4.383}, {'end': 2976.224, 'text': "Now, instead of drawing boxes, we're not even going to consider boxes.", 'start': 2972.942, 'duration': 3.282}, {'end': 2982.647, 'text': "We're going to learn how to classify every single pixel in this image in isolation.", 'start': 2976.304, 'duration': 6.343}, {'end': 2985.849, 'text': "So it's a huge number of classifications that we're going to do.", 'start': 2983.027, 'duration': 2.822}, {'end': 2987.07, 'text': "And we'll do this.", 'start': 2986.329, 'duration': 0.741}, {'end': 2989.791, 'text': 'Well, first let me show this example.', 'start': 2988.45, 'duration': 1.341}], 'summary': 'Segmentation classifies every pixel in an image, enabling isolation and classification of a huge number of pixels.', 'duration': 26.714, 'max_score': 2963.077, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB42963077.jpg'}, {'end': 3187.284, 'src': 'embed', 'start': 3160.92, 'weight': 8, 'content': [{'end': 3166.623, 'text': 'All of them are using the same underlying building blocks of convolutions, nonlinearities, and pooling.', 'start': 3160.92, 'duration': 5.703}, {'end': 3174.252, 'text': 'The only difference is that, after we perform those feature extractions, how do we take those features and learn our
ultimate task?', 'start': 3167.124, 'duration': 7.128}, {'end': 3178.057, 'text': 'So, for example, in the case of probabilistic control commands,', 'start': 3174.312, 'duration': 3.745}, {'end': 3187.284, 'text': 'we would want to take those learned features and understand how to predict the parameters of a full continuous probability distribution,', 'start': 3178.057, 'duration': 9.227}], 'summary': 'Deep learning models use common building blocks for feature extraction and differ in learning tasks, such as predicting parameters of continuous probability distributions.', 'duration': 26.364, 'max_score': 3160.92, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB43160920.jpg'}, {'end': 3233.001, 'src': 'embed', 'start': 3201.231, 'weight': 2, 'content': [{'end': 3203.613, 'text': 'essentially of the car, is a single model.', 'start': 3201.231, 'duration': 2.382}, {'end': 3205.474, 'text': "It's learned entirely end to end.", 'start': 3203.673, 'duration': 1.801}, {'end': 3208.976, 'text': 'We never told the car, for example, what a lane marker is.', 'start': 3205.554, 'duration': 3.422}, {'end': 3211.459, 'text': 'or the rules of the road.', 'start': 3209.756, 'duration': 1.703}, {'end': 3217.467, 'text': 'It was able to observe a lot of human driving data, extract these patterns, these features,', 'start': 3211.979, 'duration': 5.488}, {'end': 3225.439, 'text': 'from what makes a good human driver different from a bad human driver, and learn how to imitate those same types of actions that are occurring.', 'start': 3217.467, 'duration': 7.972}, {'end': 3233.001, 'text': 'So that, without any human intervention or human rules that we impose on these systems,', 'start': 3226.119, 'duration': 6.882}], 'summary': 'A single model car learned to drive by observing human driving data and imitating good driving actions, without human intervention or imposed rules.', 'duration': 31.77, 'max_score': 3201.231, 
'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB43201231.jpg'}, {'end': 3298.851, 'src': 'embed', 'start': 3264.356, 'weight': 7, 'content': [{'end': 3269.161, 'text': 'They all tie back to this core concept of feature extraction and detection.', 'start': 3264.356, 'duration': 4.805}, {'end': 3271.462, 'text': 'And after you do that feature extraction,', 'start': 3269.841, 'duration': 1.621}, {'end': 3279.104, 'text': 'you can really crop off the rest of your network and apply it to many different heads for many different tasks and applications that you might care about.', 'start': 3271.462, 'duration': 7.642}, {'end': 3283.526, 'text': "We've touched on a few today, but there are really so, so many in different domains.", 'start': 3279.204, 'duration': 4.322}, {'end': 3285.166, 'text': "And with that, I'll conclude.", 'start': 3284.146, 'duration': 1.02}, {'end': 3298.851, 'text': "And very shortly, we'll just be talking about generative modeling, which is a really central part of today's and this week's lecture series.", 'start': 3285.526, 'duration': 13.325}], 'summary': 'Feature extraction enables versatile application in various domains and tasks.', 'duration': 34.495, 'max_score': 3264.356, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB43264356.jpg'}], 'start': 2398.858, 'title': 'CNNs in various applications', 'summary': 'Discusses CNN architecture, object detection, and applications in self-driving cars, emphasizing impact in healthcare, techniques in object detection and segmentation, and the use of CNNs in self-driving cars with the potential for broader applications.', 'chapters': [{'end': 2596.512, 'start': 2398.858, 'title': 'Convolutional neural network architecture', 'summary': 'Discusses the architecture and flexibility of convolutional neural networks, illustrating the process of feature extraction, classification, and the 
extensibility of CNNs to various applications, with emphasis on their impact in healthcare.', 'duration': 197.654, 'highlights': ['CNN architecture involves feature extraction, pooling, and classification, with specific examples of layer configurations such as 32 and 64 features, and fully connected layers for predicting classes.', 'The extensibility of CNN architecture is emphasized, allowing for the extraction of features that can be utilized in various applications beyond image classification, such as segmentation and image captioning.', 'The significant impact of CNNs in healthcare and medical decision making is mentioned, specifically in the analysis of medical scans across different imagery domains.']}, {'end': 3095.745, 'start': 2598.706, 'title': 'Object detection and segmentation in neural networks', 'summary': 'Discusses the challenges and techniques in object detection and segmentation in neural networks, covering the need for bounding boxes, region proposal networks, and the concept of semantic segmentation, demonstrating how neural networks can identify and classify objects at the pixel level.', 'duration': 497.039, 'highlights': ['Region proposal networks are used to identify high-attention locations in the image, significantly speeding up the process of feeding regions to the convolutional neural network, resulting in fast and efficient object detection.', 'The Faster R-CNN model learns to classify and propose the locations of boxes, enabling it to feed only the key regions into downstream neural networks, resulting in an efficient single-pass process that can run in real time.', 'Semantic segmentation networks classify every single pixel in the image, allowing for the differentiation of objects at the pixel level, demonstrating applications in various fields including healthcare for segmenting cancerous regions or infected blood parts.']}, {'end': 3311.418, 'start': 3096.505, 'title': 'Applications of CNNs in self-driving cars', 'summary': 'Explains 
the use of convolutional neural networks (CNNs) in self-driving cars, where a single end-to-end model can learn to drive entirely from scratch by observing human driving data and extracting patterns and features, with applications extending beyond the examples provided.', 'duration': 214.913, 'highlights': ['A single end-to-end CNN model can learn to drive entirely from scratch, by observing human driving data and extracting patterns and features, without any human intervention or rules imposed on the system.', 'Applications of CNNs extend far beyond the provided examples, with the core concept of feature extraction and detection being applicable to various tasks and domains.', 'The neural network architecture for autonomous navigation uses the same underlying building blocks of convolutions, nonlinearities, and pooling, with the only difference being how the features are used to learn the ultimate task.']}], 'duration': 912.56, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/NmLK_WQBxB4/pics/NmLK_WQBxB42398858.jpg', 'highlights': ['CNN architecture involves feature extraction, pooling, and classification, with specific examples of layer configurations such as 32 and 64 features, and fully connected layers for predicting classes.', 'Region proposal networks are used to identify high-attention locations in the image, significantly speeding up the process of feeding regions to the convolutional neural network, resulting in fast and efficient object detection.', 'A single end-to-end CNN model can learn to drive entirely from scratch, by observing human driving data and extracting patterns and features, without any human intervention or rules imposed on the system.', 'The extensibility of CNN architecture is emphasized, allowing for the extraction of features that can be utilized in various applications beyond image classification, such as segmentation and image captioning.', 'The significant impact of CNNs in healthcare and medical decision 
making is mentioned, specifically in the analysis of medical scans across different imagery domains.', 'The Faster R-CNN model learns to classify and propose the locations of boxes, enabling it to feed only the key regions into downstream neural networks, resulting in an efficient single-pass process that can run in real time.', 'Semantic segmentation networks classify every single pixel in the image, allowing for the differentiation of objects at the pixel level, demonstrating applications in various fields including healthcare for segmenting cancerous regions or infected blood parts.', 'Applications of CNNs extend far beyond the provided examples, with the core concept of feature extraction and detection being applicable to various tasks and domains.', 'The neural network architecture for autonomous navigation uses the same underlying building blocks of convolutions, nonlinearities, and pooling, with the only difference being how the features are used to learn the ultimate task.']}], 'highlights': ['Deep learning enables end-to-end approaches in autonomous driving, impacting fields like biology, medicine, and accessibility applications.', 'The use of deep learning algorithms in computer vision has become mainstream, impacting fields like biology, medicine, autonomous driving, and accessibility applications.', 'CNN architecture involves feature extraction, pooling, and classification, with specific examples of layer configurations such as 32 and 64 features, and fully connected layers for predicting classes.', 'A single end-to-end CNN model can learn to drive entirely from scratch, by observing human driving data and extracting patterns and features, without any human intervention or rules imposed on the system.', 'The significant impact of CNNs in healthcare and medical decision making is mentioned, specifically in the analysis of medical scans across different imagery domains.', 'Region proposal networks are used to identify high-attention locations in the image, 
significantly speeding up the process of feeding regions to the convolutional neural network, resulting in fast and efficient object detection.', 'The extensibility of CNN architecture is emphasized, allowing for the extraction of features that can be utilized in various applications beyond image classification, such as segmentation and image captioning.', 'The ReLU activation function is popular in CNNs due to its intuitive mechanism, computational efficiency, and cleanly defined gradients.', 'The two main parts of a CNN are the feature learning pipeline and the image classification. The convolutional and pooling layers output high-level features extracted from the input, which are then used for image classification.', 'The ReLU activation function, common in convolutional neural networks, operates pixel by pixel, replacing negative values with zero and retaining positive values, thus serving as a simple nonlinear thresholding function.']}
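The highlights above describe the three building blocks the lecture keeps returning to: convolution, ReLU, and pooling. A minimal plain-Python sketch of those operations follows (not the lecture's code; real CNNs learn their kernels during training, whereas the vertical-edge kernel used here is hand-picked purely for illustration):

```python
def relu(x):
    # ReLU: replace negative values with zero, keep positives (pixel-wise threshold).
    return max(0.0, x)

def conv2d(image, kernel):
    # "Valid" 2D convolution (cross-correlation, as in most deep learning libraries):
    # slide the kernel over the image, summing elementwise products at each position.
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def max_pool(feature_map, size=2):
    # Non-overlapping max pooling: downsample by keeping only the strongest
    # activation in each size x size window.
    return [
        [
            max(feature_map[i + a][j + b]
                for a in range(size) for b in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]

if __name__ == "__main__":
    image = [[1, 1, 0, 0]] * 4        # left half bright, right half dark
    kernel = [[1, -1], [1, -1]]       # hand-picked vertical-edge detector
    fmap = [[relu(v) for v in row] for row in conv2d(image, kernel)]
    print(fmap)                       # strong response along the middle column (the edge)
    print(max_pool(fmap))             # pooled map retains the peak activation
```

Stacking conv + ReLU + pooling layers, then flattening into fully connected layers, gives exactly the two-part structure the summary mentions: a feature learning pipeline followed by a classification head.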