title
MIT 6.S191 (2021): Convolutional Neural Networks
description
MIT Introduction to Deep Learning 6.S191: Lecture 3
Convolutional Neural Networks for Computer Vision
Lecturer: Alexander Amini
January 2021
For all lectures, slides, and lab materials: http://introtodeeplearning.com
Lecture Outline
0:00 - Introduction
2:47 - Amazing applications of vision
7:56 - What computers "see"
14:02 - Learning visual features
18:50 - Feature extraction and convolution
22:20 - The convolution operation
27:27 - Convolutional neural networks
34:05 - Non-linearity and pooling
38:59 - End-to-end code example
40:25 - Applications
42:02 - Object detection
50:52 - End-to-end self driving cars
54:00 - Summary
Subscribe to stay up to date with new deep learning lectures at MIT, or follow us @MITDeepLearning on Twitter and Instagram to stay fully-connected!!
detail
{'title': 'MIT 6.S191 (2021): Convolutional Neural Networks', 'heatmap': [{'end': 1012.015, 'start': 904.161, 'weight': 0.725}, {'end': 1210.582, 'start': 1105.599, 'weight': 0.835}, {'end': 1383.92, 'start': 1341.499, 'weight': 0.712}, {'end': 1550.144, 'start': 1510.938, 'weight': 0.71}, {'end': 1849.856, 'start': 1711.687, 'weight': 0.733}, {'end': 2119.553, 'start': 1913.68, 'weight': 0.914}, {'end': 2286.757, 'start': 2244.095, 'weight': 0.859}, {'end': 2387.463, 'start': 2315.966, 'weight': 0.877}, {'end': 2452.935, 'start': 2417.404, 'weight': 0.714}, {'end': 2926.264, 'start': 2814.209, 'weight': 0.757}], 'summary': 'Discusses computer vision with deep learning, showcasing applications in navigation, photography, medicine, and autonomous driving, convolution operations for feature extraction, and the application of convolutional neural networks in image classification, object detection, healthcare, and robotics.', 'chapters': [{'end': 353.719, 'segs': [{'end': 63.568, 'src': 'embed', 'start': 33.614, 'weight': 0, 'content': [{'end': 38.155, 'text': 'to interpreting facial expressions and understanding very complex human emotions.', 'start': 33.614, 'duration': 4.541}, {'end': 43.276, 'text': "I think it's safe to say that vision is a huge part of everyday human life.", 'start': 38.855, 'duration': 4.421}, {'end': 57.204, 'text': "And today we're going to learn about how we can use deep learning to build very powerful computer vision systems and actually predict what is where by only looking and specifically looking at only raw visual inputs.", 'start': 44.076, 'duration': 13.128}, {'end': 63.568, 'text': 'I like to think that this is a very super simple definition of what vision at its core really means.', 'start': 57.864, 'duration': 5.704}], 'summary': 'Using deep learning to predict and interpret visual inputs for computer vision systems.', 'duration': 29.954, 'max_score': 33.614, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E33614.jpg'}, {'end': 197.888, 'src': 'embed', 'start': 159.135, 'weight': 3, 'content': [{'end': 165.259, 'text': 'And building these vision algorithms really does require an understanding of all of these very subtle details.', 'start': 159.135, 'duration': 6.124}, {'end': 171.918, 'text': 'Now, deep learning is bringing forward an incredible revolution, or evolution as well,', 'start': 166.331, 'duration': 5.587}, {'end': 180.528, 'text': 'of computer vision algorithms and applications ranging from allowing robots to use visual cues to perform things like navigation,', 'start': 171.918, 'duration': 8.61}, {'end': 192.365, 'text': "And these algorithms that you're going to learn about today in this class have become so mainstreamed and so compressed that they are all fitting and running in each of our pockets,", 'start': 181.678, 'duration': 10.687}, {'end': 197.888, 'text': 'in our telephones, for processing photos and videos, and detecting faces for greater convenience.', 'start': 192.365, 'duration': 5.523}], 'summary': 'Deep learning is revolutionizing computer vision algorithms, fitting into our pockets and enabling various applications.', 'duration': 38.753, 'max_score': 159.135, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E159135.jpg'}, {'end': 361.955, 'src': 'embed', 'start': 333.265, 'weight': 2, 'content': [{'end': 335.246, 'text': "And we'll see more about that later in the lecture as well.", 'start': 333.265, 'duration': 1.981}, {'end': 338.64, 'text': "We're seeing, like I mentioned,", 'start': 337.298, 'duration': 1.342}, {'end': 346.97, 'text': 'a lot of applications in medicine and healthcare where we can take these raw images and scans of patients and learn to detect things like breast cancer,', 'start': 338.64, 'duration': 8.33}, {'end': 353.719, 'text': "skin cancer and now, most recently, taking 
scans of patients' lungs to detect COVID-19.", 'start': 346.97, 'duration': 6.749}, {'end': 361.955, 'text': 'Finally, I want to share this inspiring story of how computer vision is being used to help the visually impaired.', 'start': 356.653, 'duration': 5.302}], 'summary': 'Computer vision detects breast cancer, skin cancer, and COVID-19 in lung scans, aids visually impaired.', 'duration': 28.69, 'max_score': 333.265, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E333265.jpg'}], 'start': 10.285, 'title': 'Computer vision with deep learning', 'summary': 'Discusses the importance of vision in human life and how deep learning can be used to build powerful computer vision systems, revolutionizing applications in navigation, photography, medicine, and autonomous driving, with examples such as facial recognition and end-to-end learning for steering control in autonomous vehicles.', 'chapters': [{'end': 139.822, 'start': 10.285, 'title': 'Computer vision with deep learning', 'summary': 'Discusses the importance of vision in human life and how deep learning can be used to build powerful computer vision systems to predict and understand objects and their future movements. 
It emphasizes the complexity of understanding not just what an image is of, but also predicting and anticipating future events.', 'duration': 129.537, 'highlights': ['Deep learning used to build powerful computer vision systems to predict and understand objects and their future movements.', 'Emphasizes the complexity of understanding not just what an image is of, but also predicting and anticipating future events.', 'Vision is a crucial human sense, relied upon for tasks such as navigation, object recognition, interpreting facial expressions, and understanding complex emotions.', 'Discusses the subtleties of understanding the movements and behaviors of objects in a scene, such as differentiating between a stationary white truck and a potentially stationary yellow taxi based on subtle cues.', 'Challenges involved in reasoning about subtle cues in a scene, which is an extraordinarily challenging problem in the real world.']}, {'end': 353.719, 'start': 140.322, 'title': 'Revolutionizing computer vision with deep learning', 'summary': 'Describes how deep learning has revolutionized computer vision, enabling applications in navigation, photography, medicine, and autonomous driving, with examples such as facial recognition and end-to-end learning for steering control in autonomous vehicles.', 'duration': 213.397, 'highlights': ['The algorithms taught in the class have become mainstream and are used in mobile devices for processing photos and videos, and detecting faces, making them widely accessible (quantifiable data not available).', 'Deep learning has enabled applications in medicine and healthcare, such as detecting breast cancer, skin cancer, and COVID-19 in patient scans (quantifiable data not available).', 'Deep learning has revolutionized computer vision, allowing end-to-end learning for tasks like facial detection and autonomous driving, where a single neural network learns entirely from data, differing from traditional approaches (quantifiable data not 
available).', 'Deep learning has transformed computer vision, allowing direct learning from raw image inputs and feature extraction based on massive data, leading to its widespread adoption in various fields (quantifiable data not available).', 'The chapter highlights the broad impact of deep learning in revolutionizing computer vision, enabling applications in diverse areas such as navigation, photography, medicine, and autonomous driving, with specific examples provided (quantifiable data not available).']}], 'duration': 343.434, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E10285.jpg', 'highlights': ['Deep learning used to build powerful computer vision systems to predict and understand objects and their future movements.', 'Vision is a crucial human sense, relied upon for tasks such as navigation, object recognition, interpreting facial expressions, and understanding complex emotions.', 'Deep learning has enabled applications in medicine and healthcare, such as detecting breast cancer, skin cancer, and COVID-19 in patient scans.', 'The algorithms taught in the class have become mainstream and are used in mobile devices for processing photos and videos, and detecting faces, making them widely accessible.', 'The chapter highlights the broad impact of deep learning in revolutionizing computer vision, enabling applications in diverse areas such as navigation, photography, medicine, and autonomous driving, with specific examples provided.']}, {'end': 1183.062, 'segs': [{'end': 385.691, 'src': 'embed', 'start': 356.653, 'weight': 0, 'content': [{'end': 361.955, 'text': 'Finally, I want to share this inspiring story of how computer vision is being used to help the visually impaired.', 'start': 356.653, 'duration': 5.302}, {'end': 364.036, 'text': 'So, in this project.', 'start': 362.575, 'duration': 1.461}, {'end': 374.399, 'text': 'actually, researchers built a deep learning enabled device that can detect a 
trail for running and provide audible feedback to the visually impaired users such that they can run.', 'start': 364.036, 'duration': 10.363}, {'end': 378.241, 'text': 'And now to demonstrate this, let me just share this very brief video.', 'start': 374.8, 'duration': 3.441}, {'end': 385.691, 'text': "The machine learning algorithm that we have detects the line and can tell whether the line is to the runner's left, right, or center.", 'start': 380.629, 'duration': 5.062}], 'summary': 'Computer vision aids visually impaired runners with deep learning device detecting running trails and providing audible feedback.', 'duration': 29.038, 'max_score': 356.653, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E356653.jpg'}, {'end': 425.321, 'src': 'embed', 'start': 399.096, 'weight': 3, 'content': [{'end': 402.997, 'text': "From human eyes, it's very obvious, it's very obvious to recognize the line.", 'start': 399.096, 'duration': 3.901}, {'end': 406.337, 'text': 'Teaching a machine learning model to do that is not that easy.', 'start': 403.257, 'duration': 3.08}, {'end': 410.198, 'text': "You step left and right as you're running, so there's like a shake to the line, left and right.", 'start': 406.477, 'duration': 3.721}, {'end': 413.478, 'text': 'As soon as you start going outdoors, now the light is a lot more variable.', 'start': 410.218, 'duration': 3.26}, {'end': 414.499, 'text': 'Tree shadows,', 'start': 413.499, 'duration': 1}, {'end': 422.2, 'text': 'falling leaves and also the line on the ground can be very narrow and there may be only a few pixels for the computer vision model to recognize.', 'start': 414.499, 'duration': 7.701}, {'end': 425.321, 'text': 'There was no tether.', 'start': 422.22, 'duration': 3.101}], 'summary': 'Teaching a machine learning model to recognize lines in outdoor settings is challenging due to variable lighting and limited pixels for recognition.', 'duration': 26.225, 'max_score': 
399.096, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E399096.jpg'}, {'end': 481.55, 'src': 'embed', 'start': 455.044, 'weight': 2, 'content': [{'end': 463.789, 'text': "it's really remarkable to see how deep learning is being applied to some of these problems, focused on really doing good and just helping people.", 'start': 455.044, 'duration': 8.745}, {'end': 470.714, 'text': 'Here in this case, the visually impaired a man who has never run without his guide dog before,', 'start': 463.849, 'duration': 6.865}, {'end': 476.978, 'text': 'is now able to run independently through the trails with the aid of this computer vision system.', 'start': 470.714, 'duration': 6.264}, {'end': 481.55, 'text': 'And, like I said, we often take these tasks for granted.', 'start': 478.649, 'duration': 2.901}], 'summary': 'Deep learning aids visually impaired man to run independently through trails.', 'duration': 26.506, 'max_score': 455.044, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E455044.jpg'}, {'end': 575.638, 'src': 'embed', 'start': 546.909, 'weight': 5, 'content': [{'end': 554.737, 'text': 'Now, what does the computer see? 
So we can represent this image as a two-dimensional matrix of these numbers, one number for each pixel in the image.', 'start': 546.909, 'duration': 7.828}, {'end': 555.898, 'text': 'And this is it.', 'start': 555.118, 'duration': 0.78}, {'end': 557.84, 'text': 'This is how a computer sees an image.', 'start': 555.938, 'duration': 1.902}, {'end': 565.448, 'text': 'Like I said, if we have an RGB image, not a grayscale image, we can represent this by a three-dimensional array.', 'start': 558.621, 'duration': 6.827}, {'end': 568.972, 'text': 'Now we have three two-dimensional arrays stacked on top of each other.', 'start': 565.668, 'duration': 3.304}, {'end': 575.638, 'text': 'One of those two-dimensional arrays corresponds to the red channel, one for the green, one for the blue, representing this RGB image.', 'start': 569.272, 'duration': 6.366}], 'summary': 'Computer represents image as 2D/3D matrix, with 3 arrays for RGB channels.', 'duration': 28.729, 'max_score': 546.909, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E546909.jpg'}, {'end': 711.546, 'src': 'embed', 'start': 684.333, 'weight': 6, 'content': [{'end': 692.338, 'text': "So, if we're building an image classification pipeline, our model needs to know, one, what the features are, and two,", 'start': 684.333, 'duration': 8.005}, {'end': 695.24, 'text': 'it needs to be able to detect those features in a brand new image.', 'start': 692.338, 'duration': 2.902}, {'end': 704.704, 'text': 'So for example, if we want to detect human faces, some features that we might want to be able to identify would be noses, eyes, and mouths.', 'start': 696.161, 'duration': 8.543}, {'end': 711.546, 'text': 'Whereas like if we want to detect cars, we might be looking at certain things in the image like wheels, license plates, and headlights.', 'start': 705.664, 'duration': 5.882}], 'summary': 'Image classification model must identify features like noses, 
eyes, mouths for faces, and wheels, license plates, headlights for cars.', 'duration': 27.213, 'max_score': 684.333, 'thumbnail': ''}, {'end': 1012.015, 'src': 'heatmap', 'start': 904.161, 'weight': 0.725, 'content': [{'end': 909.505, 'text': 'So, in Lecture 1 we learned about these fully connected neural networks, also called dense neural networks,', 'start': 904.161, 'duration': 5.344}, {'end': 916.969, 'text': 'where you can have multiple hidden layers stacked on top of each other and each neuron in each hidden layer is connected to every neuron in the previous layer.', 'start': 909.505, 'duration': 7.464}, {'end': 923.254, 'text': "Now let's say we want to use a fully connected network to perform image classification.", 'start': 917.89, 'duration': 5.364}, {'end': 929.719, 'text': "And we're going to try and motivate the use of something better than this by first starting with what we already know.", 'start': 923.454, 'duration': 6.265}, {'end': 931.18, 'text': "And we'll see the limitations of this.", 'start': 929.819, 'duration': 1.361}, {'end': 935.363, 'text': 'So in this case, remember, our input is this two-dimensional image.', 'start': 932, 'duration': 3.363}, {'end': 937.364, 'text': "It's a vector, a two-dimensional vector,", 'start': 935.403, 'duration': 1.961}, {'end': 943.348, 'text': 'but it can be collapsed into a one-dimensional vector if we just stack all of those dimensions on top of each other of pixel values.', 'start': 937.364, 'duration': 5.984}, {'end': 952.137, 'text': "And what we're going to do is feed in that vector of pixel values to our hidden layer connected to all neurons in the next layer.", 'start': 944.429, 'duration': 7.708}, {'end': 960.107, 'text': 'Now, here you should already appreciate something, and that is that all spatial information that we had in this image is automatically gone.', 'start': 952.397, 'duration': 7.71}, {'end': 960.928, 'text': "It's lost.", 'start': 960.287, 'duration': 0.641}, {'end': 965.291, 'text': 
'Because now, since we have flattened this two-dimensional image into one dimension,', 'start': 961.308, 'duration': 3.983}, {'end': 970.615, 'text': 'we have now basically removed any spatial information that we previously had by the next layer.', 'start': 965.291, 'duration': 5.324}, {'end': 976.219, 'text': 'And our network now has to relearn all of that very important spatial information.', 'start': 971.095, 'duration': 5.124}, {'end': 980.061, 'text': 'For example, that one pixel is closer to its neighboring pixel.', 'start': 976.259, 'duration': 3.802}, {'end': 985.005, 'text': "That's something very important in our input, but it's lost immediately in a fully connected layer.", 'start': 980.642, 'duration': 4.363}, {'end': 1001.429, 'text': 'So the question is how can we build some structure into our model so that we can actually inform the learning process and provide some prior information to the model and help it learn this very complicated and large input image?', 'start': 986.803, 'duration': 14.626}, {'end': 1012.015, 'text': "So to do this, let's keep our representation of our image, our 2D image, as an array, a two-dimensional array of pixel values.", 'start': 1004.11, 'duration': 7.905}], 'summary': 'Fully connected networks lose spatial information in image classification, motivating the need for better models with prior information.', 'duration': 107.854, 'max_score': 904.161, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E904161.jpg'}, {'end': 1158.882, 'src': 'embed', 'start': 1129.958, 'weight': 7, 'content': [{'end': 1137.649, 'text': "Well, In practice there's an operation called a convolution, and we'll first think about this at a high level.", 'start': 1129.958, 'duration': 7.691}, {'end': 1144.895, 'text': 'Suppose we have a 4x4 patch or a filter, which will consist of 16 weights.', 'start': 1138.35, 'duration': 6.545}, {'end': 1156.941, 'text': "We're going to apply this same 
filter to 4x4 patches in the input and use the result of that operation to define the state of the neuron in the next layer.", 'start': 1146.797, 'duration': 10.144}, {'end': 1158.882, 'text': 'So the neuron in the next layer.', 'start': 1157.581, 'duration': 1.301}], 'summary': 'Convolution operation uses 4x4 filter with 16 weights to define neuron state.', 'duration': 28.924, 'max_score': 1129.958, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E1129958.jpg'}], 'start': 356.653, 'title': 'Computer vision for visually impaired', 'summary': 'Discusses the development of a deep learning enabled device using computer vision to assist visually impaired individuals in running independently, showcasing the capabilities of deep learning. It also explains computer image processing basics including image classification, regression tasks, and convolution operations.', 'chapters': [{'end': 501.04, 'start': 356.653, 'title': 'Computer vision for visually impaired', 'summary': 'Discusses the development of a deep learning enabled device using computer vision to assist visually impaired individuals in running independently, allowing a man to run without a guide dog, showcasing the capabilities of deep learning in solving real-world problems.', 'duration': 144.387, 'highlights': ['The development of a deep learning enabled device using computer vision to assist visually impaired individuals in running independently.', 'The device can detect a trail for running and provide audible feedback to the visually impaired users, enabling them to run without assistance.', 'The application of deep learning in solving real-world problems and helping people, specifically the visually impaired, in performing tasks that are often taken for granted by sighted individuals.', 'The challenges faced in training a machine learning model to recognize and follow a trail for running, including variations in lighting, tree shadows, and narrow 
lines on the ground.', 'The remarkable impact of the computer vision system, allowing a visually impaired man to run independently through trails without his guide dog for the first time in decades.']}, {'end': 1183.062, 'start': 501.96, 'title': 'Computer image processing basics', 'summary': 'Explains how a computer processes images as two-dimensional arrays of numbers, how classification and regression tasks are performed, and the use of convolution operations to preserve spatial structure and learn visual features in the image.', 'duration': 681.102, 'highlights': ['Computers process images as two-dimensional arrays of numbers, represented by pixels, with grayscale images having one number per pixel and color images represented by three numbers (RGB).', 'Classification tasks involve predicting a label for each image, requiring the detection of unique features in the images and the ability to detect and distinguish them in a new image.', 'Convolution operations are used to preserve spatial structure and learn visual features in images by applying filters to patches of the input, allowing the detection of particular features.']}], 'duration': 826.409, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E356653.jpg', 'highlights': ['The development of a deep learning enabled device using computer vision to assist visually impaired individuals in running independently.', 'The device can detect a trail for running and provide audible feedback to the visually impaired users, enabling them to run without assistance.', 'The remarkable impact of the computer vision system, allowing a visually impaired man to run independently through trails without his guide dog for the first time in decades.', 'The challenges faced in training a machine learning model to recognize and follow a trail for running, including variations in lighting, tree shadows, and narrow lines on the ground.', 'The application of deep learning in 
solving real-world problems and helping people, specifically the visually impaired, in performing tasks that are often taken for granted by sighted individuals.', 'Computers process images as two-dimensional arrays of numbers, represented by pixels, with grayscale images having one number per pixel and color images represented by three numbers (RGB).', 'Classification tasks involve predicting a label for each image, requiring the detection of unique features in the images and the ability to detect and distinguish them in a new image.', 'Convolution operations are used to preserve spatial structure and learn visual features in images by applying filters to patches of the input, allowing the detection of particular features.']}, {'end': 1486.284, 'segs': [{'end': 1320.937, 'src': 'embed', 'start': 1232.001, 'weight': 0, 'content': [{'end': 1235.205, 'text': 'So how can we do that? We want to detect the features that define an X.', 'start': 1232.001, 'duration': 3.204}, {'end': 1242.707, 'text': 'So instead, we want our model to basically compare images of a piece of an X, piece by piece.', 'start': 1236.366, 'duration': 6.341}, {'end': 1249.249, 'text': "And the really important pieces that it should look for are exactly what we've been calling the features.", 'start': 1243.268, 'duration': 5.981}, {'end': 1258.091, 'text': 'If our model can find those important features, those rough features that define the X in the same positions, roughly the same positions,', 'start': 1249.989, 'duration': 8.102}, {'end': 1266.434, 'text': 'then it can get a lot better at understanding the similarity between different examples of X, even in the presence of these types of deformities.', 'start': 1258.091, 'duration': 8.343}, {'end': 1271.958, 'text': "So let's suppose each feature is like a mini-image.", 'start': 1268.976, 'duration': 2.982}, {'end': 1273.038, 'text': "it's a patch, right?", 'start': 1271.958, 'duration': 1.08}, {'end': 1282.585, 'text': "It's also a small array, a 
small two-dimensional array of values, and we'll use these filters to pick up on the features common to the x's.", 'start': 1273.999, 'duration': 8.586}, {'end': 1285.527, 'text': 'In the case of this x, for example,', 'start': 1283.625, 'duration': 1.902}, {'end': 1294.983, 'text': 'The filters we might want to pay attention to might represent things like the diagonal lines on the edge as well as the crossing point.', 'start': 1286.653, 'duration': 8.33}, {'end': 1296.785, 'text': 'you can see in the second patch here.', 'start': 1294.983, 'duration': 1.802}, {'end': 1303.613, 'text': "So we'll probably want to capture these features in the arms and the center of the X in order to detect all of these different variations.", 'start': 1297.365, 'duration': 6.248}, {'end': 1310.348, 'text': 'So note that these smaller matrices of filters, like we can see on the top row here,', 'start': 1305.083, 'duration': 5.265}, {'end': 1320.937, 'text': "these represent the filters of weights that we're going to use as part of our convolution operation in order to detect the corresponding features in the input image.", 'start': 1310.348, 'duration': 10.589}], 'summary': 'Model compares x features through filters to understand similarity, despite deformities.', 'duration': 88.936, 'max_score': 1232.001, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E1232001.jpg'}, {'end': 1383.92, 'src': 'heatmap', 'start': 1341.499, 'weight': 0.712, 'content': [{'end': 1345.322, 'text': 'And that is exactly what the operation of convolution is all about.', 'start': 1341.499, 'duration': 3.823}, {'end': 1355.61, 'text': 'The idea of convolution is to preserve the spatial relationship between pixels by learning image features in small little patches of image data.', 'start': 1346.343, 'duration': 9.267}, {'end': 1366.815, 'text': 'Now, to do this, we need to perform an element-wise multiplication between the filter matrix and the patch of 
the input image of the same dimension.', 'start': 1356.191, 'duration': 10.624}, {'end': 1374.397, 'text': "So if we have a patch of 3 by 3, we're going to compare that to an input filter, or our filter, which is also of size 3 by 3 with learned weights.", 'start': 1366.875, 'duration': 7.522}, {'end': 1383.92, 'text': 'So in this case, our filter, which you can see on the top left, all of its entries are of either a positive 1 or negative 1.', 'start': 1375.237, 'duration': 8.683}], 'summary': 'Convolution operation preserves spatial relationships between pixels by learning image features in small patches of image data using element-wise multiplication.', 'duration': 42.421, 'max_score': 1341.499, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E1341499.jpg'}, {'end': 1374.397, 'src': 'embed', 'start': 1346.343, 'weight': 4, 'content': [{'end': 1355.61, 'text': 'The idea of convolution is to preserve the spatial relationship between pixels by learning image features in small little patches of image data.', 'start': 1346.343, 'duration': 9.267}, {'end': 1366.815, 'text': 'Now, to do this, we need to perform an element-wise multiplication between the filter matrix and the patch of the input image of the same dimension.', 'start': 1356.191, 'duration': 10.624}, {'end': 1374.397, 'text': "So if we have a patch of 3 by 3, we're going to compare that to an input filter, or our filter, which is also of size 3 by 3 with learned weights.", 'start': 1366.875, 'duration': 7.522}], 'summary': 'Convolution preserves spatial relationships by learning image features in small patches of data.', 'duration': 28.054, 'max_score': 1346.343, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E1346343.jpg'}, {'end': 1459.812, 'src': 'embed', 'start': 1411.724, 'weight': 6, 'content': [{'end': 1418.15, 'text': 'Now the next step as part of the convolution operation is to add all of those 
element-wise multiplications together.', 'start': 1411.724, 'duration': 6.426}, {'end': 1421.633, 'text': 'So the result here after we add those outputs is going to be nine.', 'start': 1418.59, 'duration': 3.043}, {'end': 1432.655, 'text': 'So what this means now, Actually, before we get to that, let me start with another very brief example.', 'start': 1423.795, 'duration': 8.86}, {'end': 1439.817, 'text': 'Suppose we want to compute the convolution now, not of a very large image, but this is just of a 5x5 image.', 'start': 1433.235, 'duration': 6.582}, {'end': 1442.117, 'text': 'Our filter here is 3x3.', 'start': 1440.397, 'duration': 1.72}, {'end': 1452.439, 'text': 'So we can slide this 3x3 filter over the entirety of our input image, and performing this element-wise multiplication, and then adding the outputs.', 'start': 1442.597, 'duration': 9.842}, {'end': 1455.169, 'text': "Let's see what this looks like.", 'start': 1454.309, 'duration': 0.86}, {'end': 1459.812, 'text': "So let's start by sliding this filter over the top left hand side of our input.", 'start': 1455.63, 'duration': 4.182}], 'summary': 'In a 5x5 image, sliding a 3x3 filter results in a convolution output of nine.', 'duration': 48.088, 'max_score': 1411.724, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E1411724.jpg'}], 'start': 1185.084, 'title': 'Convolution operations', 'summary': 'Explains how the convolution operator facilitates feature extraction, using examples of classifying black and white images, and emphasizes preserving spatial relationships and performing element-wise multiplication in image processing and neural networks.', 'chapters': [{'end': 1273.038, 'start': 1185.084, 'title': 'Convolution operator: feature extraction', 'summary': 'Explains how the convolution operator allows us to extract features, using a simple example of classifying the letter x in black and white images, and emphasizes the importance of detecting 
and comparing important features for robust classification.', 'duration': 87.954, 'highlights': ['Our model needs to compare images of a piece of an X, piece by piece, to detect the important features that define an X, in order to build a classifier robust to deformities. (relevance score: 5)', 'The convolution operator allows us to extract features by finding important features, roughly in the same positions, to understand the similarity between different examples of X, even in the presence of deformities. (relevance score: 4)', 'The chapter explains the importance of detecting features that define an X and the need for the model to compare images of a piece of an X, piece by piece to achieve robust classification. (relevance score: 3)']}, {'end': 1366.815, 'start': 1273.999, 'title': 'Convolution operation in image processing', 'summary': 'Explains the concept of convolution in image processing, using small matrices of filters to detect features like diagonal lines and crossing points, with a focus on preserving spatial relationships between pixels and performing element-wise multiplication.', 'duration': 92.816, 'highlights': ['The smaller matrices of filters represent the weights used in the convolution operation to detect features in the input image.', 'The convolution operation aims to preserve spatial relationships between pixels and learn image features in small patches of image data.', 'The filters are used to capture features like diagonal lines and crossing points in the input image, with a focus on detecting various variations.']}, {'end': 1486.284, 'start': 1366.875, 'title': 'Convolution operation in neural networks', 'summary': 'Explains the process of performing element-wise multiplication and adding the outputs in a convolution operation, with examples of 3x3 and 5x5 image patches and their corresponding filters, resulting in a 3x3 matrix of all ones and an output of 4 for a specific patch.', 'duration': 119.409, 'highlights': ['Performing 
element-wise multiplication and adding the outputs in a convolution operation', 'Example of sliding a 3x3 filter over a 5x5 image and computing the element-wise multiplication and addition of the outputs']}], 'duration': 301.2, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E1185084.jpg', 'highlights': ['Our model needs to compare images of a piece of an X, piece by piece, to detect the important features that define an X, in order to build a classifier robust to deformities. (relevance score: 5)', 'The convolution operator allows us to extract features by finding important features, roughly in the same positions, to understand the similarity between different examples of X, even in the presence of deformities. (relevance score: 4)', 'The chapter explains the importance of detecting features that define an X and the need for the model to compare images of a piece of an X, piece by piece to achieve robust classification. (relevance score: 3)', 'The smaller matrices of filters represent the weights used in the convolution operation to detect features in the input image.', 'The convolution operation aims to preserve spatial relationships between pixels and learn image features in small patches of image data.', 'The filters are used to capture features like diagonal lines and crossing points in the input image, with a focus on detecting various variations.', 'Performing element-wise multiplication and adding the outputs in a convolution operation', 'Example of sliding a 3x3 filter over a 5x5 image and computing the element-wise multiplication and addition of the outputs']}, {'end': 2197.268, 'segs': [{'end': 1538.598, 'src': 'embed', 'start': 1510.938, 'weight': 1, 'content': [{'end': 1516.839, 'text': 'So, for example, wherever we see this pattern conveyed in the original input image,', 'start': 1510.938, 'duration': 5.901}, {'end': 1523.2, 'text': "that's where this feature map is going to have the highest 
value and that's where we need to actually activate maximally.", 'start': 1516.839, 'duration': 6.361}, {'end': 1532.355, 'text': "Now that we've gone through the mechanism of the convolution operation, let's see how different filters can be used to produce feature maps.", 'start': 1523.88, 'duration': 8.475}, {'end': 1538.598, 'text': "So consider this picture of a woman's face.", 'start': 1533.635, 'duration': 4.963}], 'summary': 'The convolution operation highlights patterns in input images to produce feature maps.', 'duration': 27.66, 'max_score': 1510.938, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E1510938.jpg'}, {'end': 1550.144, 'src': 'heatmap', 'start': 1510.938, 'weight': 0.71, 'content': [{'end': 1516.839, 'text': 'So, for example, wherever we see this pattern conveyed in the original input image,', 'start': 1510.938, 'duration': 5.901}, {'end': 1523.2, 'text': "that's where this feature map is going to have the highest value and that's where we need to actually activate maximally.", 'start': 1516.839, 'duration': 6.361}, {'end': 1532.355, 'text': "Now that we've gone through the mechanism of the convolution operation, let's see how different filters can be used to produce feature maps.", 'start': 1523.88, 'duration': 8.475}, {'end': 1538.598, 'text': "So consider this picture of a woman's face.", 'start': 1533.635, 'duration': 4.963}, {'end': 1540.779, 'text': "This woman's name is Lena.", 'start': 1539.238, 'duration': 1.541}, {'end': 1545.061, 'text': 'And the output of applying these three convolutional filters.', 'start': 1541.439, 'duration': 3.622}, {'end': 1550.144, 'text': "So you can see the three filters that we're considering on the bottom right hand corner of each image.", 'start': 1545.121, 'duration': 5.023}], 'summary': 'Explaining how convolutional filters produce feature maps in image processing.', 'duration': 39.206, 'max_score': 1510.938, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E1510938.jpg'}, {'end': 1590.868, 'src': 'embed', 'start': 1561.035, 'weight': 3, 'content': [{'end': 1567.002, 'text': 'So we can learn to sharpen the image by applying this very specific type of sharpening filter.', 'start': 1561.035, 'duration': 5.967}, {'end': 1576.173, 'text': 'We can learn to detect edges or we can learn to detect very strong edges in this image simply by modifying these filters.', 'start': 1567.362, 'duration': 8.811}, {'end': 1577.936, 'text': 'So these filters are not learned filters.', 'start': 1576.194, 'duration': 1.742}, {'end': 1585.005, 'text': "These are constructed filters, and there's been a ton of research historically about hand-engineering these filters.", 'start': 1577.936, 'duration': 7.069}, {'end': 1590.868, 'text': 'But what convolutional neural networks want to do is actually to learn the weights defining these filters.', 'start': 1585.646, 'duration': 5.222}], 'summary': 'Convolutional neural networks learn the weights that define image filters instead of relying on hand-engineered ones.', 'duration': 29.833, 'max_score': 1561.035, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E1561035.jpg'}, {'end': 1849.856, 'src': 'heatmap', 'start': 1711.687, 'weight': 0.733, 'content': [{'end': 1717.429, 'text': "The first part is what we've already gotten some exposure to in the first part of this lecture, and that is the convolution operation.", 'start': 1711.687, 'duration': 5.742}, {'end': 1724.432, 'text': 'And that allows us, like we saw earlier, to generate these feature maps and detect features in our image.', 'start': 1718.61, 'duration': 5.822}, {'end': 1727.713, 'text': 'The second part is applying a non-linearity.', 'start': 1725.392, 'duration': 2.321}, {'end': 1736.537, 'text': 'And we saw the importance of non-linearities in the first and the second lecture in order to help us 
deal with these features that we extract being highly non-linear.', 'start': 1727.893, 'duration': 8.644}, {'end': 1741.327, 'text': 'Thirdly, we need to apply some sort of pooling operation.', 'start': 1737.901, 'duration': 3.426}, {'end': 1744.493, 'text': 'This is another word for a downsampling operation.', 'start': 1741.468, 'duration': 3.025}, {'end': 1749.903, 'text': 'And this allows us to scale down the size of each feature map.', 'start': 1745.134, 'duration': 4.769}, {'end': 1759.716, 'text': "Now the computation of a class of scores, which is what we're doing when we define an image classification task,", 'start': 1750.973, 'duration': 8.743}, {'end': 1766.839, 'text': 'is actually performed using these features that we obtain through convolution, non-linearity and pooling,', 'start': 1759.716, 'duration': 7.123}, {'end': 1772.801, 'text': 'and then passing those learned features into a fully connected network or a dense layer,', 'start': 1766.839, 'duration': 5.962}, {'end': 1776.162, 'text': 'like we learned about in the first part of the class in the first lecture.', 'start': 1772.801, 'duration': 3.361}, {'end': 1778.743, 'text': 'And we can train this model end to end.', 'start': 1776.782, 'duration': 1.961}, {'end': 1789.534, 'text': 'from image input to class prediction output, using fully connected layers and convolutional layers end-to-end, where we learn,', 'start': 1779.403, 'duration': 10.131}, {'end': 1795.18, 'text': 'as part of the convolutional layers, the sets of weights of the filters for each convolutional layer and,', 'start': 1789.534, 'duration': 5.646}, {'end': 1800.506, 'text': 'as well as the weights that define these fully connected layers that actually perform our classification task in the end.', 'start': 1795.18, 'duration': 5.326}, {'end': 1810.841, 'text': "And we'll go through each one of these operations in a bit more detail to really break down the basics and the architecture of these convolutional neural networks.", 
'start': 1801.167, 'duration': 9.674}, {'end': 1817.001, 'text': "So first we'll consider the convolution operation of a CNN.", 'start': 1813.479, 'duration': 3.522}, {'end': 1823.444, 'text': 'And as before, each neuron in the hidden layer will compute a weighted sum of each of its inputs.', 'start': 1817.241, 'duration': 6.203}, {'end': 1826.005, 'text': 'Like we saw in the dense layers.', 'start': 1824.164, 'duration': 1.841}, {'end': 1836.63, 'text': "we'll also need to add on a bias to allow us to shift the activation function and apply and activate it with some non-linearity so that we can handle non-linear data relationships.", 'start': 1826.005, 'duration': 10.625}, {'end': 1841.512, 'text': "Now what's really special here is that the local connectivity is preserved.", 'start': 1837.31, 'duration': 4.202}, {'end': 1849.856, 'text': 'Each neuron in the hidden layer, you can see in the middle, only sees a very specific patch of its inputs.', 'start': 1841.952, 'duration': 7.904}], 'summary': 'Convolutional neural networks involve convolution, non-linearity, and pooling operations to process image features and perform image classification using learned features.', 'duration': 138.169, 'max_score': 1711.687, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E1711687.jpg'}, {'end': 1776.162, 'src': 'embed', 'start': 1750.973, 'weight': 0, 'content': [{'end': 1759.716, 'text': "Now the computation of a class of scores, which is what we're doing when we define an image classification task,", 'start': 1750.973, 'duration': 8.743}, {'end': 1766.839, 'text': 'is actually performed using these features that we obtain through convolution, non-linearity and pooling,', 'start': 1759.716, 'duration': 7.123}, {'end': 1772.801, 'text': 'and then passing those learned features into a fully connected network or a dense layer,', 'start': 1766.839, 'duration': 5.962}, {'end': 1776.162, 'text': 'like we learned about in the 
first part of the class in the first lecture.', 'start': 1772.801, 'duration': 3.361}], 'summary': 'Image classification scores computed using convolution, non-linearity, and pooling, passed to a fully connected network.', 'duration': 25.189, 'max_score': 1750.973, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E1750973.jpg'}, {'end': 2119.553, 'src': 'heatmap', 'start': 1913.68, 'weight': 0.914, 'content': [{'end': 1921.664, 'text': 'Remember, our element-wise multiplication and addition is exactly that convolutional operation that we talked about earlier.', 'start': 1913.68, 'duration': 7.984}, {'end': 1926.026, 'text': 'So if you look up the definition of what convolution means, it is actually that exactly.', 'start': 1921.724, 'duration': 4.302}, {'end': 1929.608, 'text': "It's element-wise multiplication and then a summation of all of the results.", 'start': 1926.046, 'duration': 3.562}, {'end': 1936.332, 'text': 'And this actually defines also how convolutional layers are connected to these ideas.', 'start': 1930.489, 'duration': 5.843}, {'end': 1942.619, 'text': 'But with this single convolutional layer, how can we have multiple filters?', 'start': 1938.097, 'duration': 4.522}, {'end': 1948.962, 'text': 'So all we saw in the previous slide is how we can take this input image and learn a single feature map.', 'start': 1942.659, 'duration': 6.303}, {'end': 1952.263, 'text': 'But in reality, there are many types of features in our image.', 'start': 1949.442, 'duration': 2.821}, {'end': 1961.487, 'text': 'How can we use convolutional layers to learn a stack or many different types of features that could be useful for performing our type of task?', 'start': 1952.543, 'duration': 8.944}, {'end': 1965.289, 'text': 'How can we use this to do multiple feature extraction?', 'start': 1961.527, 'duration': 3.762}, {'end': 1969.051, 'text': 'Now the output layer is still convolution,', 'start': 1965.749, 
'duration': 3.302}, {'end': 1976.876, 'text': 'but now it has a volume dimension where the height and the width are spatial dimensions dependent upon the dimensions of the input layer,', 'start': 1969.051, 'duration': 7.825}, {'end': 1978.918, 'text': 'the dimensions of the filter, the stride.', 'start': 1976.876, 'duration': 2.042}, {'end': 1983.541, 'text': "how much we're skipping on each time that we apply the filter?", 'start': 1978.918, 'duration': 4.623}, {'end': 1992.852, 'text': "But we also need to think about the connections of the neurons in these layers in terms of their, what's called, receptive field.", 'start': 1984.889, 'duration': 7.963}, {'end': 2001.236, 'text': "The locations of their input in the model, in the path of the model that they're connected to.", 'start': 1993.433, 'duration': 7.803}, {'end': 2012.261, 'text': 'Now, these parameters actually define the spatial arrangement of how the neurons are connected in the convolutional layers and how those connections are really defined.', 'start': 2001.796, 'duration': 10.465}, {'end': 2018.224, 'text': 'So the output of a convolutional layer in this case will have this volume dimension.', 'start': 2013.14, 'duration': 5.084}, {'end': 2024.589, 'text': "So instead of having one filter map that we slide along our image, now we're going to have a volume of filters.", 'start': 2018.444, 'duration': 6.145}, {'end': 2031.875, 'text': 'Each filter is going to be slid across the image and compute this convolution operation piece by piece for each filter.', 'start': 2024.889, 'duration': 6.986}, {'end': 2040.161, 'text': 'The result of each convolution operation defines the feature map that that filter will activate maximally.', 'start': 2032.075, 'duration': 8.086}, {'end': 2045.164, 'text': "So now we're well on our way to actually defining what a CNN is.", 'start': 2040.962, 'duration': 4.202}, {'end': 2048.646, 'text': 'And the next step would actually be to apply that nonlinearity.', 'start': 
2045.345, 'duration': 3.301}, {'end': 2056.331, 'text': 'After each convolution operation, we need to actually apply this nonlinear activation function to the output volume of that layer.', 'start': 2049.407, 'duration': 6.924}, {'end': 2062.358, 'text': 'And this is very, very similar, like I said, in the first and we saw also in the second lecture.', 'start': 2057.11, 'duration': 5.248}, {'end': 2065.743, 'text': 'And we do this because image data is highly nonlinear.', 'start': 2062.839, 'duration': 2.904}, {'end': 2073.293, 'text': 'A common example in the image domain is to use an activation function of ReLU, which is the rectified linear unit.', 'start': 2066.364, 'duration': 6.929}, {'end': 2081.857, 'text': 'This is a pixel-wise operation that replaces all negative values with zero and keeps all positive values with whatever their value was.', 'start': 2073.893, 'duration': 7.964}, {'end': 2087.438, 'text': 'We can think of this really as a thresholding operation, so anything less than zero gets thresholded to zero.', 'start': 2082.157, 'duration': 5.281}, {'end': 2096.603, 'text': 'Negative values indicate negative detection of a convolution, but this non-linearity actually kind of clamps that to some sense.', 'start': 2088.239, 'duration': 8.364}, {'end': 2104.99, 'text': 'And that is a nonlinear operation, so it does satisfy our ability to learn nonlinear dynamics as part of our neural network model.', 'start': 2097.901, 'duration': 7.089}, {'end': 2110.69, 'text': 'So the next operation in convolutional neural networks is that of pooling.', 'start': 2106.909, 'duration': 3.781}, {'end': 2119.553, 'text': 'Pooling is an operation that is commonly used to reduce the dimensionality of our inputs and of our feature maps,', 'start': 2111.41, 'duration': 8.143}], 'summary': 'Convolutional layers use element-wise multiplication and summation to learn multiple feature maps for image data, followed by the application of nonlinearity and pooling to reduce 
dimensionality.', 'duration': 205.873, 'max_score': 1913.68, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E1913680.jpg'}, {'end': 2146.41, 'src': 'embed', 'start': 2122.113, 'weight': 4, 'content': [{'end': 2127.915, 'text': 'Now a common technique and a common type of pooling that is commonly used in practice is called max pooling.', 'start': 2122.113, 'duration': 5.802}, {'end': 2129.796, 'text': 'as shown in this example.', 'start': 2128.795, 'duration': 1.001}, {'end': 2133.639, 'text': 'Max pooling is actually super simple and intuitive.', 'start': 2130.457, 'duration': 3.182}, {'end': 2141.486, 'text': "It's simply taking the maximum over these 2x2 filters in our patches and sliding that patch over our input.", 'start': 2134.5, 'duration': 6.986}, {'end': 2146.41, 'text': 'Very similar to convolutions, but now, instead of applying an element-wise multiplication and summation,', 'start': 2141.506, 'duration': 4.904}], 'summary': 'Max pooling is a common technique where the maximum is taken over 2x2 filters in patches and slid over the input.', 'duration': 24.297, 'max_score': 2122.113, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E2122113.jpg'}], 'start': 1486.324, 'title': 'Convolutional neural networks', 'summary': 'Explains the application of convolutional filters in producing feature maps and the utilization of convolutional neural networks in extracting local features from images for image classification tasks.', 'chapters': [{'end': 1585.005, 'start': 1486.324, 'title': 'Convolutional filters and feature maps', 'summary': 'Explains how convolutional filters are applied to images to produce feature maps, demonstrating the activation and learning capabilities of different filters and their impact on feature detection.', 'duration': 98.681, 'highlights': ["Convolutional filters are used to produce feature maps by detecting activation in 
different parts of the input image, with the example of using three filters on an image of a woman's face.", 'The feature map indicates the areas in the input image that are activated by a specific filter, with the highest value corresponding to the pattern conveyed in the original input image.', 'By adjusting the weights of the filters, different features can be detected in the image, such as sharpening, edge detection, or detecting strong edges, showcasing the learning capabilities of the filters.', 'The filters used are not learned filters but constructed filters, with the ability to hand-engineer specific features through a ton of historical research.']}, {'end': 2197.268, 'start': 1585.646, 'title': 'Convolutional neural networks', 'summary': 'Explains how convolutional neural networks utilize the convolution operation, non-linearity, and pooling to extract local features from images, preserving spatial structure and enabling image classification tasks.', 'duration': 611.622, 'highlights': ['Convolution operation allows networks to learn what features to extract from images, such as edge detection or certain geometric objects.', 'Convolutional neural networks utilize convolution, non-linearity, and pooling operations to extract features and perform image classification tasks.', 'Pooling operation, such as max pooling, reduces dimensionality while preserving spatial invariance by taking the maximum over patches of the input.']}], 'duration': 710.944, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E1486324.jpg', 'highlights': ['Convolutional neural networks utilize convolution, non-linearity, and pooling operations to extract features and perform image classification tasks.', "Convolutional filters are used to produce feature maps by detecting activation in different parts of the input image, with the example of using three filters on an image of a woman's face.", 'The feature map indicates the areas in 
the input image that are activated by a specific filter, with the highest value corresponding to the pattern conveyed in the original input image.', 'By adjusting the weights of the filters, different features can be detected in the image, such as sharpening, edge detection, or detecting strong edges, showcasing the learning capabilities of the filters.', 'Pooling operation, such as max pooling, reduces dimensionality while preserving spatial invariance by taking the maximum over patches of the input.']}, {'end': 2511.479, 'segs': [{'end': 2286.757, 'src': 'heatmap', 'start': 2220.165, 'weight': 0, 'content': [{'end': 2224.065, 'text': 'And with CNNs, just to remind you once again, we can layer these operations.', 'start': 2220.165, 'duration': 3.9}, {'end': 2231.287, 'text': 'The whole point of this is that we want to learn this hierarchy of features present in the image data, starting from the low-level features,', 'start': 2224.185, 'duration': 7.102}, {'end': 2237.99, 'text': 'composing those together, to mid-level features and then again to high-level features that can be used to accomplish our task.', 'start': 2231.287, 'duration': 6.703}, {'end': 2243.515, 'text': 'Now, a CNN built for image classification can be broken down into two parts.', 'start': 2239.271, 'duration': 4.244}, {'end': 2252.521, 'text': 'First, the feature learning part where we actually try to learn the features in our input image that can be used to perform our specific task.', 'start': 2244.095, 'duration': 8.426}, {'end': 2259.187, 'text': "That feature learning part is actually done through those pieces that we've been seeing so far in this lecture the convolution,", 'start': 2253.162, 'duration': 6.025}, {'end': 2262.169, 'text': 'the non-linearity and the pooling to preserve the spatial invariance.', 'start': 2259.187, 'duration': 2.982}, {'end': 2274.053, 'text': 'Now the second part, the convolutional layers and pooling provide output of the first part is those high level 
features of the input.', 'start': 2264.81, 'duration': 9.243}, {'end': 2281.735, 'text': 'Now the second part is actually using those features to perform our classification or whatever our task is.', 'start': 2274.633, 'duration': 7.102}, {'end': 2286.757, 'text': 'In this case, the task is to output the class probabilities that are present in the input image.', 'start': 2281.775, 'duration': 4.982}], 'summary': 'CNNs learn a hierarchy of features in image data for classification, outputting class probabilities.', 'duration': 61.57, 'max_score': 2220.165, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E2220165.jpg'}, {'end': 2389.024, 'src': 'heatmap', 'start': 2315.966, 'weight': 3, 'content': [{'end': 2319.967, 'text': 'whose output actually represents a categorical probability distribution.', 'start': 2315.966, 'duration': 4.001}, {'end': 2325.668, 'text': 'It sums to 1, so it is a proper categorical distribution.', 'start': 2320.547, 'duration': 5.121}, {'end': 2331.43, 'text': 'And each element in this is strictly between 0 and 1.', 'start': 2326.488, 'duration': 4.942}, {'end': 2334.229, 'text': "So it's all positive, and it does sum to 1.", 'start': 2331.43, 'duration': 2.799}, {'end': 2339.11, 'text': 'So it makes it very well suited for the second part if your task is image classification.', 'start': 2334.229, 'duration': 4.881}, {'end': 2341.351, 'text': "So now let's put this all together.", 'start': 2340.17, 'duration': 1.181}, {'end': 2345.432, 'text': 'What does an end-to-end convolutional neural network look like?', 'start': 2341.531, 'duration': 3.901}, {'end': 2356.614, 'text': 'We start by defining our feature extraction head, which starts with a convolutional layer with 32 feature maps and a filter size of 3 by 3 pixels.', 'start': 2346.232, 'duration': 10.382}, {'end': 2360.315, 'text': 'And we downsample this using a max pooling operation.', 'start': 2357.374, 'duration': 
2.941}, {'end': 2364.216, 'text': 'with a pooling size of 2 and a stride of 2.', 'start': 2361.115, 'duration': 3.101}, {'end': 2369.638, 'text': 'This is exactly the same as what we saw when we were first introducing the convolution operation.', 'start': 2364.216, 'duration': 5.422}, {'end': 2380.042, 'text': 'Next, we feed these 32 feature maps into the next set of convolutional and pooling layers.', 'start': 2370.639, 'duration': 9.403}, {'end': 2387.463, 'text': "Now we're increasing this from 32 feature maps to 64 feature maps and still downscaling our image as a result.", 'start': 2380.162, 'duration': 7.301}, {'end': 2389.024, 'text': "So we're downscaling the image,", 'start': 2387.483, 'duration': 1.541}], 'summary': 'The softmax output represents a categorical probability distribution, suitable for image classification. The first convolutional layer has 32 feature maps with a filter size of 3 by 3 pixels, followed by downsampling through max pooling.', 'duration': 42.792, 'max_score': 2315.966, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E2315966.jpg'}, {'end': 2452.935, 'src': 'heatmap', 'start': 2417.404, 'weight': 0.714, 'content': [{'end': 2422.851, 'text': "And note here that we're using the activation function of Softmax to make sure that these outputs are a categorical distribution.", 'start': 2417.404, 'duration': 5.447}, {'end': 2425.518, 'text': 'Okay, awesome.', 'start': 2424.577, 'duration': 0.941}, {'end': 2430.645, 'text': "So, so far we've talked about how we can use CNNs for image classification tasks.", 'start': 2425.659, 'duration': 4.986}, {'end': 2439.176, 'text': 'This architecture is actually so powerful because it extends to a number of different tasks, not just image classification.', 'start': 2431.426, 'duration': 7.75}, {'end': 2446.426, 'text': 'And the reason for that is that you can really take this feature extraction head, this feature learning part,', 'start': 
2439.777, 'duration': 6.649}, {'end': 2452.935, 'text': "and you can put onto the second part so many different end networks, whatever end network you'd like to use.", 'start': 2446.426, 'duration': 6.509}], 'summary': 'CNNs use a softmax output for classification, and the feature extractor extends to many other tasks.', 'duration': 35.531, 'max_score': 2417.404, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E2417404.jpg'}, {'end': 2511.479, 'src': 'embed', 'start': 2472.495, 'weight': 4, 'content': [{'end': 2477.799, 'text': 'We can introduce new architectures specifically for things like image and object detection,', 'start': 2472.495, 'duration': 5.304}, {'end': 2481.481, 'text': 'semantic segmentation and even things like image captioning.', 'start': 2477.799, 'duration': 3.682}, {'end': 2486.205, 'text': 'You can use this as an input to some of the sequential networks that we saw in lecture two, even.', 'start': 2481.541, 'duration': 4.664}, {'end': 2495.973, 'text': "So let's look at and dive a bit deeper into each of these different types of tasks that we could use our convolutional neural networks for.", 'start': 2488.55, 'duration': 7.423}, {'end': 2498.534, 'text': 'In the case of classification, for example,', 'start': 2496.353, 'duration': 2.181}, {'end': 2508.338, 'text': 'there is a significant impact in medicine and healthcare when deep learning models are actually being applied to the analysis of entire inputs of medical image scans.', 'start': 2498.534, 'duration': 9.804}, {'end': 2511.479, 'text': 'Now, this is an example of a paper that was published in Nature.', 'start': 2508.718, 'duration': 2.761}], 'summary': 'The feature extractor supports new architectures for object detection, semantic segmentation, and image captioning; in classification, deep learning has had significant impact in medicine and healthcare, as in a paper published in Nature.', 'duration': 38.984, 'max_score': 2472.495, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E2472495.jpg'}], 'start': 2197.308, 'title': 
'Convolutional neural networks', 'summary': 'Explains the key operations of a convolutional neural network, feature learning in input images, and the use of a dense neural network for classification. It emphasizes the hierarchy of features, application of the softmax function, the feature extraction process in CNNs, and potential tasks beyond image classification, such as object detection and medical image analysis.', 'chapters': [{'end': 2345.432, 'start': 2197.308, 'title': 'Convolutional neural networks', 'summary': 'The chapter explains the key operations of a convolutional neural network, the process of learning features in input images, and the use of a dense neural network for classification, emphasizing the hierarchy of features and the application of the softmax function for a categorical probability distribution.', 'duration': 148.124, 'highlights': ['The second part of a CNN involves using the high-level features obtained from the feature learning part for classification, achieved by feeding the features into a fully connected or dense neural network to output class probabilities using the softmax function.', 'CNNs aim to learn a hierarchy of features in image data, starting from low-level features to mid-level features and then to high-level features, which are utilized to accomplish specific tasks.', 'The feature learning part of a CNN is accomplished through convolution, non-linearity, and pooling operations to preserve spatial invariance in the input image.']}, {'end': 2511.479, 'start': 2346.232, 'title': 'CNNs for image classification', 'summary': 'Explains the feature extraction process in CNNs, starting with 32 feature maps and increasing to 64 feature maps while downscaling the image, culminating in the potential for various tasks beyond image classification, such as object detection, semantic segmentation, and medical image analysis.', 'duration': 165.247, 'highlights': ['The feature extraction head in CNNs begins with a convolutional layer with 32 feature maps and a filter 
size of 3 by 3 pixels, followed by downsampling using max pooling with a size of 2 and a stride of 2.', 'The transition to the next set of layers involves increasing the feature maps to 64 while still downscaling the image, allowing for expansion in the dimensional space while downsampling irrelevant spatial information.', 'The flexibility of CNNs extends to various tasks beyond image classification, allowing for the potential application in object detection, semantic segmentation, medical image analysis, and sequential networks.', 'Deep learning models applied to medical image scans have a significant impact on analysis, as demonstrated in a published paper in Nature.']}], 'duration': 314.171, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E2197308.jpg', 'highlights': ['The feature learning part of a CNN involves convolution, non-linearity, and pooling operations to preserve spatial invariance.', 'CNNs aim to learn a hierarchy of features in image data, starting from low-level to mid-level to high-level features.', 'The second part of a CNN uses high-level features for classification through a fully connected or dense neural network with the softmax function.', 'The feature extraction head in CNNs starts with a convolutional layer with 32 feature maps and a 3x3 filter size, followed by max pooling.', 'CNNs have potential applications in object detection, semantic segmentation, medical image analysis, and sequential networks.', 'Deep learning models applied to medical image scans have a significant impact on analysis, as demonstrated in a published paper in Nature.', 'The transition to the next set of layers in CNNs involves increasing feature maps to 64 while still downscaling the image.']}, {'end': 3022.318, 'segs': [{'end': 2541.262, 'src': 'embed', 'start': 2512.6, 'weight': 0, 'content': [{'end': 2520.379, 'text': 'for actually demonstrating that a CNN can outperform expert radiologists at detecting 
breast cancer directly from mammogram images.', 'start': 2512.6, 'duration': 7.779}, {'end': 2530.336, 'text': 'Instead of giving a binary prediction of what an output is, like cancer or not cancer, or what type of objects,', 'start': 2523.052, 'duration': 7.284}, {'end': 2534.658, 'text': 'for example, in this image we may say that this is an image of a taxi,', 'start': 2530.336, 'duration': 4.322}, {'end': 2541.262, 'text': 'we may want to ask our neural network to do something at a bit finer resolution and tell us for this image.', 'start': 2534.658, 'duration': 6.604}], 'summary': 'A CNN can outperform expert radiologists at detecting breast cancer from mammogram images.', 'duration': 28.662, 'max_score': 2512.6, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E2512600.jpg'}, {'end': 2643.23, 'src': 'embed', 'start': 2617.385, 'weight': 4, 'content': [{'end': 2622.307, 'text': 'So our network should not be constrained to only outputting a single output or a certain number of outputs.', 'start': 2617.385, 'duration': 4.922}, {'end': 2627.729, 'text': 'It needs to have a flexible range of how it can dynamically infer the objects in the scene.', 'start': 2622.367, 'duration': 5.362}, {'end': 2635.306, 'text': 'So what is one maybe naive solution to tackle this very complicated problem, and how can CNNs be used to do that?', 'start': 2629.482, 'duration': 5.824}, {'end': 2643.23, 'text': "So what we can do is start with this image and let's consider the simplest way possible to do this problem.", 'start': 2636.206, 'duration': 7.024}], 'summary': 'The network should have a flexible output range for object inference. 
CNNs can be used to tackle this complicated problem.', 'duration': 25.845, 'max_score': 2617.385, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E2617385.jpg'}, {'end': 2729.216, 'src': 'embed', 'start': 2679.57, 'weight': 5, 'content': [{'end': 2684.514, 'text': 'Then we pick another box in the scene, and we pass that through the network to predict its class.', 'start': 2679.57, 'duration': 4.944}, {'end': 2688.977, 'text': 'And we can keep doing this with different boxes in the scene and keep doing it.', 'start': 2684.854, 'duration': 4.123}, {'end': 2698.144, 'text': "And over time, we can basically have many different class predictions of all of these boxes as they're passed through our classification network.", 'start': 2689.178, 'duration': 8.966}, {'end': 2702.389, 'text': 'In some sense, if each of these boxes gives us a predicted class,', 'start': 2699.005, 'duration': 3.384}, {'end': 2710.018, 'text': 'we can pick the boxes that do have a class in them and use those as a box where an object is found.', 'start': 2702.389, 'duration': 7.629}, {'end': 2713.662, 'text': 'If no object is found, we can simply discard it and move on to the next box.', 'start': 2710.278, 'duration': 3.384}, {'end': 2721.214, 'text': "So what's the problem with this?
Well, one is that there are way too many inputs.", 'start': 2715.253, 'duration': 5.961}, {'end': 2729.216, 'text': 'This basically results in boxes and considering a number of boxes that have way too many scales, way too many positions, too many sizes.', 'start': 2721.614, 'duration': 7.602}], 'summary': 'Using network to predict classes of boxes in scene, facing issue of many inputs and scales/sizes.', 'duration': 49.646, 'max_score': 2679.57, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E2679570.jpg'}, {'end': 2814.189, 'src': 'embed', 'start': 2787.78, 'weight': 7, 'content': [{'end': 2793.182, 'text': 'So, if we extract in this case 2,000 regions we have here, we have to feed this.', 'start': 2787.78, 'duration': 5.402}, {'end': 2797.424, 'text': 'we have to run this network 2,000 times to get the answer just for the single image.', 'start': 2793.182, 'duration': 4.242}, {'end': 2805.846, 'text': "It also tends to be very brittle because in practice, how are we doing this region proposal?
Well, it's entirely heuristic-based.", 'start': 2798.064, 'duration': 7.782}, {'end': 2814.189, 'text': "It's not being learned with a neural network, and it's also, even more importantly perhaps, it's detached from the feature extraction part.", 'start': 2806.307, 'duration': 7.882}], 'summary': 'Processing 2,000 regions requires 2,000 runs, posing brittleness due to heuristic-based region proposal detached from feature extraction.', 'duration': 26.409, 'max_score': 2787.78, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E2787780.jpg'}, {'end': 2965.559, 'src': 'heatmap', 'start': 2814.209, 'weight': 1, 'content': [{'end': 2822.572, 'text': 'So our feature extraction is learning one piece, but our region proposal piece of the network or of this architecture is completely detached.', 'start': 2814.209, 'duration': 8.363}, {'end': 2828.954, 'text': 'The model cannot learn to predict regions that may be specific to a given task.', 'start': 2823.792, 'duration': 5.162}, {'end': 2831.456, 'text': 'That makes it very brittle for some applications.', 'start': 2829.355, 'duration': 2.101}, {'end': 2841.02, 'text': 'Now, many variants have been proposed to actually tackle some of these issues and advance this forward to accomplish object detection.', 'start': 2832.936, 'duration': 8.084}, {'end': 2848.344, 'text': "But I'd like to touch on one extremely quickly just to point you in this direction for those of you who are interested.", 'start': 2841.801, 'duration': 6.543}, {'end': 2852.766, 'text': "And that's the Faster R-CNN method to actually learn these region proposals.", 'start': 2848.384, 'duration': 4.382}, {'end': 2862.311, 'text': 'The idea here is, instead of feeding in this image to a heuristic-based region proposal network or method,', 'start': 2853.346, 'duration': 8.965}, {'end': 2869.075, 'text': 'we can have a part of our network that is trained to identify the proposal regions of our model
of our image.', 'start': 2862.311, 'duration': 6.764}, {'end': 2877.457, 'text': 'And that allows us to directly understand or identify these regions in our original image,', 'start': 2870.055, 'duration': 7.402}, {'end': 2883.219, 'text': 'where there are candidate patches that we should explore for our classification and for our object detection.', 'start': 2877.457, 'duration': 5.762}, {'end': 2891.542, 'text': 'Now, each of these regions then are processed with their own feature extractor as part of our neural network and individuals in their CNN heads.', 'start': 2883.739, 'duration': 7.803}, {'end': 2900.951, 'text': 'Then, after these features for each of these proposals are extracted, we can do a normal classification over each of these individual regions.', 'start': 2892.529, 'duration': 8.422}, {'end': 2907.813, 'text': 'Very similar as before, but now the huge advantage of this is that it only requires a single forward pass through the model.', 'start': 2901.291, 'duration': 6.522}, {'end': 2917.195, 'text': 'We only feed in this image once we have a region proposal network that extracts the regions and all of these regions are fed on to perform classification on the rest of the image.', 'start': 2908.373, 'duration': 8.822}, {'end': 2920.016, 'text': "So it's super, super fast compared to the previous method.", 'start': 2917.795, 'duration': 2.221}, {'end': 2926.264, 'text': 'So in classification, we predict one class for an entire image of the model.', 'start': 2921.701, 'duration': 4.563}, {'end': 2933.87, 'text': 'In object detection, we predict bounding boxes over all of the objects in order to localize them and identify them.', 'start': 2926.765, 'duration': 7.105}, {'end': 2936.031, 'text': 'We can go even further than this.', 'start': 2934.61, 'duration': 1.421}, {'end': 2941.956, 'text': "And in this idea, we're still using CNNs to predict this output as well.", 'start': 2936.372, 'duration': 5.584}, {'end': 2950.382, 'text': 'But instead of 
predicting bounding boxes, which are rather coarse, we can task our network to also, here, predict an entire image as well.', 'start': 2942.056, 'duration': 8.326}, {'end': 2954.046, 'text': 'Now, one example of this would be for semantic segmentation,', 'start': 2950.962, 'duration': 3.084}, {'end': 2960.893, 'text': 'where the input is an RGB image just a normal RGB image and the output would be pixel-wise probabilities.', 'start': 2954.046, 'duration': 6.847}, {'end': 2962.555, 'text': 'For every single pixel.', 'start': 2961.134, 'duration': 1.421}, {'end': 2965.559, 'text': 'what is the probability that it belongs to a given class?', 'start': 2962.555, 'duration': 3.004}], 'summary': 'Faster rcnn method uses a region proposal network to achieve super fast object detection and classification.', 'duration': 64.268, 'max_score': 2814.209, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E2814209.jpg'}], 'start': 2512.6, 'title': 'Object detection techniques', 'summary': 'Covers various object detection techniques, including the use of cnns for fine-resolution object detection and classification, an overview of object detection technique using random boxes and a classification network, and the use of region proposal networks in object detection. 
the drawbacks of heuristic-based region proposal networks are discussed, and a potential solution, faster rcnn, is introduced to improve accuracy and reduce processing time.', 'chapters': [{'end': 2635.306, 'start': 2512.6, 'title': 'Cnn for object detection', 'summary': 'Discusses how cnns can be used to perform fine-resolution object detection and classification, emphasizing the flexibility to infer a dynamic number of objects in a scene.', 'duration': 122.706, 'highlights': ['CNNs can outperform expert radiologists in detecting breast cancer from mammogram images, showcasing their potential for medical image analysis.', 'Object detection using CNNs involves localizing and classifying multiple objects in an image, which is a more complex task compared to simple classification.', 'The network needs to be flexible enough to infer a dynamic number of objects in a scene, accommodating various types of objects and classes, and outputting multiple outputs as needed.']}, {'end': 2747.751, 'start': 2636.206, 'title': 'Object detection technique overview', 'summary': 'Introduces a technique for object detection using random boxes and a classification network, but it faces challenges due to the large number of inputs and scales involved.', 'duration': 111.545, 'highlights': ['By using random boxes and a classification network, the technique iterates through different boxes in an image to predict their class, but faces the challenge of handling a large number of inputs and scales.', 'The classification network is used to predict the class of each random box placed over the image, allowing for the identification of boxes containing objects based on their predicted classes.', 'The approach of iterating through different boxes in an image and predicting their classes provides a simple way to start addressing the object detection problem, but it encounters limitations due to the excessive number of inputs and scales involved.']}, {'end': 3022.318, 'start': 2747.751, 'title': 
'Region proposal networks in object detection', 'summary': 'Discusses the use of heuristic-based region proposal networks in object detection, highlighting the drawbacks of the method and introducing faster rcnn as a potential solution, which reduces the processing time and improves accuracy by integrating region proposal with the classification network.', 'duration': 274.567, 'highlights': ['Faster RCNN method reduces processing time by requiring only a single forward pass through the model, improving efficiency compared to previous methods.', 'Introduction of semantic segmentation as a further advancement in object detection, allowing pixel-wise classification and probability prediction for each class.', 'Heuristic-based region proposal networks result in slow processing, requiring multiple passes through the model, making it inefficient for object detection.']}], 'duration': 509.718, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E2512600.jpg', 'highlights': ['CNNs can outperform expert radiologists in detecting breast cancer from mammogram images, showcasing their potential for medical image analysis.', 'Faster RCNN method reduces processing time by requiring only a single forward pass through the model, improving efficiency compared to previous methods.', 'Object detection using CNNs involves localizing and classifying multiple objects in an image, which is a more complex task compared to simple classification.', 'Introduction of semantic segmentation as a further advancement in object detection, allowing pixel-wise classification and probability prediction for each class.', 'The network needs to be flexible enough to infer a dynamic number of objects in a scene, accommodating various types of objects and classes, and outputting multiple outputs as needed.', 'The classification network is used to predict the class of each random box placed over the image, allowing for the identification of boxes 
containing objects based on their predicted classes.', 'The approach of iterating through different boxes in an image and predicting their classes provides a simple way to start addressing the object detection problem, but it encounters limitations due to the excessive number of inputs and scales involved.', 'Heuristic-based region proposal networks result in slow processing, requiring multiple passes through the model, making it inefficient for object detection.', 'By using random boxes and a classification network, the technique iterates through different boxes in an image to predict their class, but faces the challenge of handling a large number of inputs and scales.']}, {'end': 3354.535, 'segs': [{'end': 3066.493, 'src': 'embed', 'start': 3043.545, 'weight': 2, 'content': [{'end': 3051.432, 'text': 'cancerous regions on medical scans or even identifying parts of the blood that are infected with diseases like, in this case, malaria.', 'start': 3043.545, 'duration': 7.887}, {'end': 3060.365, 'text': "Let's see one final example here of how we can use convolutional feature extraction to perform yet another task.", 'start': 3053.916, 'duration': 6.449}, {'end': 3066.493, 'text': 'This task is different from the first three that we saw with classification, object detection, and semantic segmentation.', 'start': 3060.705, 'duration': 5.788}], 'summary': 'Convolutional feature extraction can be used to identify cancerous regions on medical scans and infected blood parts.', 'duration': 22.948, 'max_score': 3043.545, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E3043545.jpg'}, {'end': 3115.143, 'src': 'embed', 'start': 3085.982, 'weight': 0, 'content': [{'end': 3093.484, 'text': "And it's also going to see a noisy representation of street view maps, something that you might see, for example, from Google Maps on your smartphone.", 'start': 3085.982, 'duration': 7.502}, {'end': 3098.428, 'text': 'And it will be 
tasked not to predict the classification problem or object detection,', 'start': 3094.224, 'duration': 4.204}, {'end': 3107.476, 'text': 'but rather learn a full probability distribution over the space of all possible control commands that this vehicle could take in this given situation.', 'start': 3098.428, 'duration': 9.048}, {'end': 3115.143, 'text': 'Now how does it do that actually? This entire model is actually using everything that we learned about in this lecture today.', 'start': 3109.158, 'duration': 5.985}], 'summary': 'Model learns full probability distribution over possible control commands for vehicle in a given situation.', 'duration': 29.161, 'max_score': 3085.982, 'thumbnail': ''}, {'end': 3246.358, 'src': 'embed', 'start': 3212.729, 'weight': 1, 'content': [{'end': 3215.531, 'text': 'So this is really powerful because a human can actually enter the car,', 'start': 3212.729, 'duration': 2.802}, {'end': 3222.816, 'text': 'input a desired destination and the end-to-end CNN will output the control commands to actuate the vehicle towards that destination.', 'start': 3215.531, 'duration': 7.285}, {'end': 3239.376, 'text': 'Note here that the vehicle is able to successfully recognize when it approaches the intersections and take the correct control commands to actually navigate that vehicle through these brand new environments that it has never seen before and never driven before in its training dataset.', 'start': 3223.516, 'duration': 15.86}, {'end': 3246.358, 'text': "And the impact of CNNs has been very wide reaching beyond these examples as well that I've explained here today.", 'start': 3240.356, 'duration': 6.002}], 'summary': 'End-to-end cnn enables vehicle to navigate new environments, impacting beyond examples.', 'duration': 33.629, 'max_score': 3212.729, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E3212729.jpg'}, {'end': 3312.2, 'src': 'embed', 'start': 3267.563, 'weight': 4, 
'content': [{'end': 3273.773, 'text': 'We saw that we can build up these convolutions into the basic architecture, defining convolutional neural networks,', 'start': 3267.563, 'duration': 6.21}, {'end': 3276.997, 'text': 'and discussed how CNNs can be used for classification.', 'start': 3273.773, 'duration': 3.224}, {'end': 3290.163, 'text': 'Finally, we talked about a lot of the extensions and applications of how you can use these basic convolutional neural network architectures as a feature extraction module and then use this to perform your task at hand.', 'start': 3278.154, 'duration': 12.009}, {'end': 3312.2, 'text': "And a bit about how we can actually visualize the behavior of our neural network and actually understand a bit about what it's doing under the hood through ways of some of these semantic segmentation maps and really getting a more fine-grained perspective of the very high resolution classification of these input images that it's seeing.", 'start': 3290.703, 'duration': 21.497}], 'summary': 'Cnns used for classification and feature extraction, with visualization of network behavior.', 'duration': 44.637, 'max_score': 3267.563, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E3267563.jpg'}], 'start': 3022.318, 'title': 'Cnn applications in healthcare and robotics', 'summary': 'Explores the applications of convolutional neural networks (cnns) in healthcare, emphasizing semantic segmentation for identifying disease-infected regions on medical scans, and in robotics for continuous control in self-driving cars, highlighting the optimization of control commands and the impact of cnns in computer vision.', 'chapters': [{'end': 3066.493, 'start': 3022.318, 'title': 'Convolutional feature extraction', 'summary': 'Discusses the power of semantic segmentation in healthcare applications, such as segmenting cancerous regions on medical scans and identifying disease-infected parts of the blood, showcasing 
the potential of convolutional feature extraction.', 'duration': 44.175, 'highlights': ['The semantic segmentation idea can be applied to healthcare for segmenting cancerous regions on medical scans or identifying disease-infected parts of the blood, like malaria.', 'The chapter showcases the potential of convolutional feature extraction in healthcare applications, emphasizing its ability to segment cancerous regions on medical scans and identify disease-infected parts of the blood.', 'The transcript discusses the application of semantic segmentation in healthcare, particularly for identifying disease-infected parts of the blood and segmenting cancerous regions on medical scans.']}, {'end': 3354.535, 'start': 3066.613, 'title': 'End-to-end cnn for continuous robotic control', 'summary': 'Discusses using end-to-end cnn for continuous robotic control in self-driving cars, optimizing probability distribution over control commands, visualizing probability distribution on the map, and the wide-reaching impact of cnns in computer vision.', 'duration': 287.922, 'highlights': ['The model uses end-to-end CNN to optimize a probability distribution over the space of all possible control commands for the vehicle, not focusing on classification or object detection.', 'The end-to-end model predicts control commands to actuate the vehicle towards the desired destination, with the ability to navigate new environments and recognize intersections.', 'CNNs have had a wide-reaching impact in computer vision, covering foundations, basic architecture, classification, feature extraction, and visualization of neural network behavior.', 'The upcoming lab will focus on computer vision, including building convolutional neural networks, facial detection systems, and using unsupervised generative models for fair and unbiased algorithms.']}], 'duration': 332.217, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/AjtX1N_VT9E/pics/AjtX1N_VT9E3022318.jpg', 'highlights': 
['The model uses end-to-end CNN to optimize a probability distribution over the space of all possible control commands for the vehicle, not focusing on classification or object detection.', 'The end-to-end model predicts control commands to actuate the vehicle towards the desired destination, with the ability to navigate new environments and recognize intersections.', 'The semantic segmentation idea can be applied to healthcare for segmenting cancerous regions on medical scans or identifying disease-infected parts of the blood, like malaria.', 'The chapter showcases the potential of convolutional feature extraction in healthcare applications, emphasizing its ability to segment cancerous regions on medical scans and identify disease-infected parts of the blood.', 'CNNs have had a wide-reaching impact in computer vision, covering foundations, basic architecture, classification, feature extraction, and visualization of neural network behavior.']}], 'highlights': ['Deep learning revolutionizes computer vision in diverse areas like navigation, photography, medicine, and autonomous driving.', 'Deep learning enables applications in medicine, detecting breast cancer, skin cancer, and COVID-19 in patient scans.', 'Computer vision system assists visually impaired individuals in running independently through trails.', 'Convolution operations preserve spatial structure and learn visual features in images by applying filters to patches of the input.', 'CNNs aim to learn a hierarchy of features in image data, starting from low-level to mid-level to high-level features.', 'CNNs have potential applications in object detection, semantic segmentation, medical image analysis, and sequential networks.', 'CNNs can outperform expert radiologists in detecting breast cancer from mammogram images.', 'End-to-end CNN model predicts control commands to actuate the vehicle towards the desired destination.']}