title
Lecture 12 | Visualizing and Understanding

description
In Lecture 12 we discuss methods for visualizing and understanding the internal mechanisms of convolutional networks. We also discuss the use of convolutional networks for generating new images, including DeepDream and artistic style transfer. Keywords: Visualization, t-SNE, saliency maps, class visualizations, fooling images, feature inversion, DeepDream, style transfer Slides: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture12.pdf -------------------------------------------------------------------------------------- Convolutional Neural Networks for Visual Recognition Instructors: Fei-Fei Li: http://vision.stanford.edu/feifeili/ Justin Johnson: http://cs.stanford.edu/people/jcjohns/ Serena Yeung: http://ai.stanford.edu/~syyeung/ Computer Vision has become ubiquitous in our society, with applications in search, image understanding, apps, mapping, medicine, drones, and self-driving cars. Core to many of these applications are visual recognition tasks such as image classification, localization and detection. Recent developments in neural network (aka “deep learning”) approaches have greatly advanced the performance of these state-of-the-art visual recognition systems. This lecture collection is a deep dive into details of the deep learning architectures with a focus on learning end-to-end models for these tasks, particularly image classification. From this lecture collection, students will learn to implement, train and debug their own neural networks and gain a detailed understanding of cutting-edge research in computer vision. Website: http://cs231n.stanford.edu/ For additional learning opportunities please visit: http://online.stanford.edu/

detail
{'title': 'Lecture 12 | Visualizing and Understanding', 'heatmap': [{'end': 364.061, 'start': 318.41, 'weight': 0.705}, {'end': 549.518, 'start': 495.692, 'weight': 0.756}, {'end': 819.569, 'start': 768.505, 'weight': 0.788}, {'end': 1280.221, 'start': 1211.586, 'weight': 0.785}, {'end': 1369.271, 'start': 1316.96, 'weight': 0.732}, {'end': 1638.438, 'start': 1495.555, 'weight': 0.726}, {'end': 2009.46, 'start': 1909.195, 'weight': 0.887}, {'end': 2152.398, 'start': 2047.071, 'weight': 0.91}, {'end': 2276.875, 'start': 2225.439, 'weight': 0.866}, {'end': 2373.335, 'start': 2314.682, 'weight': 0.837}, {'end': 3183.838, 'start': 3087.292, 'weight': 0.86}, {'end': 3502.451, 'start': 3409.619, 'weight': 0.963}, {'end': 3776.532, 'start': 3684.599, 'weight': 0.817}, {'end': 4093.444, 'start': 4045.624, 'weight': 0.75}, {'end': 4368.126, 'start': 4314.841, 'weight': 0.775}], 'summary': 'The lecture covers visualizing convolutional networks, computer vision tasks, visualizing neural network features, image analysis in neural networks, visualization techniques, image synthesis, deep dream visualizations, neural texture synthesis, and style transfer techniques with practical examples and insights.', 'chapters': [{'end': 145.834, 'segs': [{'end': 77.88, 'src': 'embed', 'start': 50.792, 'weight': 2, 'content': [{'end': 55.473, 'text': 'You all got registration emails for Gradescope probably in the last week, something like that.', 'start': 50.792, 'duration': 4.681}, {'end': 57.174, 'text': 'We saw a couple questions on Piazza.', 'start': 55.493, 'duration': 1.681}, {'end': 59.235, 'text': "So we've decided to use Gradescope to grade the midterms.", 'start': 57.274, 'duration': 1.961}, {'end': 61.656, 'text': "So don't be confused if you get some emails about that.", 'start': 59.415, 'duration': 2.241}, {'end': 66.678, 'text': 'Another reminder is that assignment three was released last week on Friday.', 'start': 62.857, 'duration': 3.821}, {'end': 71.359, 'text': "It'll be 
due a week from this Friday on the 26th.", 'start': 67.318, 'duration': 4.041}, {'end': 77.88, 'text': 'Assignment three is almost entirely brand new this year, so we apologize for taking a little bit longer than expected to get it out.', 'start': 71.839, 'duration': 6.041}], 'summary': 'Gradescope used for midterms grading, assignment three due on the 26th.', 'duration': 27.088, 'max_score': 50.792, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY50792.jpg'}, {'end': 145.834, 'src': 'embed', 'start': 108.923, 'weight': 0, 'content': [{'end': 114.947, 'text': 'Due to the high interest in HyperQuest and due to the conflicts with the milestone submission time,', 'start': 108.923, 'duration': 6.024}, {'end': 117.81, 'text': 'we decided to extend that deadline for extra credit through Sunday.', 'start': 114.947, 'duration': 2.863}, {'end': 124.358, 'text': 'So anyone who does at least 12 runs on HyperQuest by Sunday will get a little bit of extra credit in the class.', 'start': 118.49, 'duration': 5.868}, {'end': 129.826, 'text': 'Also those of you who are at the top of the leaderboard doing really well will get maybe a little bit extra, extra credit.', 'start': 124.779, 'duration': 5.047}, {'end': 132.39, 'text': 'So thanks for participating.', 'start': 131.128, 'duration': 1.262}, {'end': 133.952, 'text': 'We got a lot of interest and that was really cool.', 'start': 132.41, 'duration': 1.542}, {'end': 137.389, 'text': 'Final reminder is about the poster session.', 'start': 135.848, 'duration': 1.541}, {'end': 140.931, 'text': 'So the poster session will be on June 6th.', 'start': 137.789, 'duration': 3.142}, {'end': 142.532, 'text': 'That date is finalized.', 'start': 141.532, 'duration': 1}, {'end': 145.834, 'text': "I don't remember the exact time, but it is June 6th,", 'start': 142.872, 'duration': 2.962}], 'summary': 'Deadline for hyperquest extended to sunday for at least 12 runs, with extra credit 
reward. poster session on june 6th.', 'duration': 36.911, 'max_score': 108.923, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY108923.jpg'}], 'start': 4.821, 'title': 'Visualizing convolutional networks', 'summary': 'Discusses administrative updates, project milestones, midterm grades, assignment three, hyperquest activity, and the poster session, and extends the hyperquest deadline for extra credit through sunday for at least 12 runs.', 'chapters': [{'end': 145.834, 'start': 4.821, 'title': 'Cs231n lecture 12: visualizing convolutional networks', 'summary': 'Discusses administrative updates, including project milestones, midterm grades, assignment three, hyperquest activity, and the poster session, while highlighting the extension of the hyperquest deadline for extra credit through sunday for at least 12 runs.', 'duration': 141.013, 'highlights': ['The poster session will be on June 6th.', 'The deadline for extra credit on HyperQuest has been extended through Sunday for at least 12 runs.', 'Assignment three was released last week and will be due on the 26th.', 'Midterm grades will be available on Gradescope this week.', 'The chapter discusses administrative updates, including project milestones, midterm grades, assignment three, HyperQuest activity, and the poster session.']}], 'duration': 141.013, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY4821.jpg', 'highlights': ['The deadline for extra credit on HyperQuest has been extended through Sunday for at least 12 runs.', 'The poster session will be on June 6th.', 'Assignment three was released last week and will be due on the 26th.', 'Midterm grades will be available on Gradescope this week.', 'The chapter discusses administrative updates, including project milestones, midterm grades, assignment three, HyperQuest activity, and the poster session.']}, {'end': 526.339, 'segs': [{'end': 196.647, 'src': 
'embed', 'start': 168.063, 'weight': 0, 'content': [{'end': 169.743, 'text': 'We talked about semantic segmentation,', 'start': 168.063, 'duration': 1.68}, {'end': 177.567, 'text': 'which is this problem where you want to assign labels to every pixel in the input image but does not differentiate the object instances in those images.', 'start': 169.743, 'duration': 7.824}, {'end': 180.309, 'text': 'We talked about classification plus localization.', 'start': 178.128, 'duration': 2.181}, {'end': 186.336, 'text': 'where, in addition to a class label, you also want to draw a box or perhaps several boxes in the image,', 'start': 180.729, 'duration': 5.607}, {'end': 192.162, 'text': "where the distinction here is that in the classification plus localization setup you have some fixed number of objects that you're looking for.", 'start': 186.336, 'duration': 5.826}, {'end': 196.647, 'text': 'So we also saw that this type of paradigm can be applied to things like pose recognition,', 'start': 192.523, 'duration': 4.124}], 'summary': 'Discussed semantic segmentation and classification plus localization for object recognition.', 'duration': 28.584, 'max_score': 168.063, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY168063.jpg'}, {'end': 235.876, 'src': 'embed', 'start': 200.071, 'weight': 1, 'content': [{'end': 207.014, 'text': "We also talked about object detection, where you start with some fixed set of category labels that you're interested in, like dogs and cats,", 'start': 200.071, 'duration': 6.943}, {'end': 212.075, 'text': 'and then the task is to draw boxes around every instance of those objects that appear in the input image.', 'start': 207.014, 'duration': 5.061}, {'end': 216.957, 'text': 'And object detection is really distinct from classification plus localization,', 'start': 212.976, 'duration': 3.981}, {'end': 221.479, 'text': "because with object detection we don't know ahead of time how many object 
instances we're looking for in the image.", 'start': 216.957, 'duration': 4.522}, {'end': 227.942, 'text': "and we saw that there's this whole family of methods based on RCNN, fast RCNN, faster RCNN,", 'start': 222.119, 'duration': 5.823}, {'end': 231.564, 'text': 'as well as these single shot detection methods for addressing this problem of object detection.', 'start': 227.942, 'duration': 3.622}, {'end': 235.876, 'text': 'Then, finally, we talked pretty briefly about instance segmentation,', 'start': 232.512, 'duration': 3.364}], 'summary': 'Object detection involves drawing boxes around objects in images, using methods like rcnn and faster rcnn.', 'duration': 35.805, 'max_score': 200.071, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY200071.jpg'}, {'end': 364.061, 'src': 'heatmap', 'start': 318.41, 'weight': 0.705, 'content': [{'end': 321.895, 'text': 'And can we try to gain intuition for how convnets are working?', 'start': 318.41, 'duration': 3.485}, {'end': 323.817, 'text': "what types of things in the image they're looking for?", 'start': 321.895, 'duration': 1.922}, {'end': 327.522, 'text': 'What kinds of techniques do we have for analyzing these internals of the network?', 'start': 324.378, 'duration': 3.144}, {'end': 332.139, 'text': 'So one relatively simple thing is the first layer.', 'start': 329.218, 'duration': 2.921}, {'end': 338.94, 'text': "So we've talked about this before, but recall that the first convolutional layer consists of filters.", 'start': 332.519, 'duration': 6.421}, {'end': 344.682, 'text': 'that so, for example, in AlexNet, the first convolutional layer consists of a number of convolutional filters.', 'start': 338.94, 'duration': 5.742}, {'end': 349.103, 'text': 'Each convolutional filter has shape three by 11 by 11.', 'start': 345.162, 'duration': 3.941}, {'end': 352.107, 'text': 'and these convolutional filters get slid over the input image.', 'start': 349.103, 'duration': 
3.004}, {'end': 356.753, 'text': 'we take inner products between some chunk of the image and the weights of the convolutional filter,', 'start': 352.107, 'duration': 4.646}, {'end': 360.537, 'text': 'and that gives us our outputs after that first convolutional layer.', 'start': 356.753, 'duration': 3.784}, {'end': 364.061, 'text': 'So in AlexNet then we have 64 of these filters.', 'start': 361.618, 'duration': 2.443}], 'summary': 'Understanding the workings of convnets, analyzing network internals, 64 filters in alexnet', 'duration': 45.651, 'max_score': 318.41, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY318410.jpg'}, {'end': 356.753, 'src': 'embed', 'start': 332.519, 'weight': 3, 'content': [{'end': 338.94, 'text': "So we've talked about this before, but recall that the first convolutional layer consists of filters.", 'start': 332.519, 'duration': 6.421}, {'end': 344.682, 'text': 'that so, for example, in AlexNet, the first convolutional layer consists of a number of convolutional filters.', 'start': 338.94, 'duration': 5.742}, {'end': 349.103, 'text': 'Each convolutional filter has shape three by 11 by 11.', 'start': 345.162, 'duration': 3.941}, {'end': 352.107, 'text': 'and these convolutional filters get slid over the input image.', 'start': 349.103, 'duration': 3.004}, {'end': 356.753, 'text': 'we take inner products between some chunk of the image and the weights of the convolutional filter,', 'start': 352.107, 'duration': 4.646}], 'summary': 'First layer in alexnet has 3x11x11 convolutional filters.', 'duration': 24.234, 'max_score': 332.519, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY332519.jpg'}, {'end': 504.86, 'src': 'embed', 'start': 471.469, 'weight': 4, 'content': [{'end': 487.119, 'text': 'But this really only, oh sorry, was there a question? 
Yeah, so these are showing the learned weights of the first convolutional layer.', 'start': 471.469, 'duration': 15.65}, {'end': 500.516, 'text': 'Oh so the question is, why does visualizing the weights of the filters tell you what the filter is looking for?', 'start': 495.692, 'duration': 4.824}, {'end': 504.86, 'text': 'So this intuition comes from sort of template matching and inner products,', 'start': 501.257, 'duration': 3.603}], 'summary': "Visualizing learned weights reveals filter's search criteria", 'duration': 33.391, 'max_score': 471.469, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY471469.jpg'}], 'start': 145.834, 'title': 'Computer vision and convolutional networks', 'summary': 'Covers various computer vision tasks including semantic segmentation, classification plus localization, object detection, and instance segmentation, and explores the inner workings of convolutional networks, presented on june 6th. it also focuses on visualizing the learned weights of the first convolutional layer and identifying the features they are looking for, such as oriented edges and opposing colors.', 'chapters': [{'end': 257.298, 'start': 145.834, 'title': 'Computer vision tasks overview', 'summary': 'Covered various computer vision tasks including semantic segmentation, classification plus localization, object detection, and instance segmentation, with a focus on differentiating the tasks and their applications, presented on june 6th.', 'duration': 111.464, 'highlights': ['The lecture covered semantic segmentation, classification plus localization, object detection, and instance segmentation. The lecture provided an overview of various computer vision tasks.', 'The lecture explained the distinction between classification plus localization and object detection, highlighting the fixed number of objects in the former and the uncertainty about the number of instances in the latter. 
Differentiating between classification plus localization and object detection based on the number of objects and instances.', 'The methods for addressing the problem of object detection were discussed, including RCNN, fast RCNN, faster RCNN, and single shot detection methods. Explanation of the various methods such as RCNN, fast RCNN, faster RCNN, and single shot detection for object detection.', 'The lecture briefly covered instance segmentation, which combines aspects of both semantic segmentation and object detection to detect instances and label pixels belonging to each instance. Overview of instance segmentation, combining semantic segmentation and object detection to detect instances and label pixels.', 'The poster session is scheduled for June 6th, with no questions on the admin notes. Confirmation of the poster session date and absence of questions on admin notes.']}, {'end': 526.339, 'start': 257.298, 'title': 'Understanding convolutional networks', 'summary': 'Explores the inner workings of convolutional networks, particularly focusing on visualizing the learned weights of the first convolutional layer and identifying the features they are looking for, such as oriented edges and opposing colors.', 'duration': 269.041, 'highlights': ['The first convolutional layer consists of filters which look for oriented edges, opposing colors, and various positions in the input, resembling the early layers of the human visual system. The first convolutional layer consists of filters which look for oriented edges, opposing colors, and various positions in the input, resembling the early layers of the human visual system.', 'Visualizing the learned weights of the first convolutional layer provides insight into what the filters are looking for, based on template matching and inner products. 
Visualizing the learned weights of the first convolutional layer provides insight into what the filters are looking for, based on template matching and inner products.', 'No matter the architecture or training data, the first convolutional weights of most convolutional networks end up looking for oriented edges and opposing colors in the input image. No matter the architecture or training data, the first convolutional weights of most convolutional networks end up looking for oriented edges and opposing colors in the input image.']}], 'duration': 380.505, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY145834.jpg', 'highlights': ['The lecture covered semantic segmentation, classification plus localization, object detection, and instance segmentation.', 'The methods for addressing the problem of object detection were discussed, including RCNN, fast RCNN, faster RCNN, and single shot detection methods.', 'The lecture explained the distinction between classification plus localization and object detection, highlighting the fixed number of objects in the former and the uncertainty about the number of instances in the latter.', 'The first convolutional layer consists of filters which look for oriented edges, opposing colors, and various positions in the input, resembling the early layers of the human visual system.', 'Visualizing the learned weights of the first convolutional layer provides insight into what the filters are looking for, based on template matching and inner products.']}, {'end': 1464.01, 'segs': [{'end': 581.176, 'src': 'embed', 'start': 549.899, 'weight': 4, 'content': [{'end': 554.942, 'text': "So generally, whenever you're looking at image, whenever you're thinking about image data and training convolutional networks,", 'start': 549.899, 'duration': 5.043}, {'end': 557.184, 'text': 'you generally put a convolutional layer at the first step.', 'start': 554.942, 'duration': 2.242}, {'end': 574.093, 
'text': "Yeah, so the question is can we do this same type of procedure in the middle of the network? That's actually the next slide, so good anticipation.", 'start': 567.97, 'duration': 6.123}, {'end': 581.176, 'text': "So if we draw this exact same visualization for the intermediate convolutional layers, it's actually a lot less interpretable.", 'start': 575.013, 'duration': 6.163}], 'summary': 'Convolutional layers are typically placed at the start of image data training, but their interpretability decreases in intermediate layers.', 'duration': 31.277, 'max_score': 549.899, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY549899.jpg'}, {'end': 621.734, 'src': 'embed', 'start': 595.862, 'weight': 5, 'content': [{'end': 600.205, 'text': "So up at the top, we're visualizing these first layer weights for this network just like we saw in the previous slide.", 'start': 595.862, 'duration': 4.343}, {'end': 606.088, 'text': "But now the second layer weights, after we do a convolution, then there's some ReLU and some other non-linearity perhaps.", 'start': 600.725, 'duration': 5.363}, {'end': 614.192, 'text': 'But the second convolutional layer now receives this 16 channel input and does a seven by seven convolution with 20 convolutional filters.', 'start': 606.468, 'duration': 7.724}, {'end': 619.953, 'text': "And we've actually, so the problem here is that you can't really visualize these directly as images.", 'start': 615.132, 'duration': 4.821}, {'end': 621.734, 'text': 'So you can try.', 'start': 620.393, 'duration': 1.341}], 'summary': 'Neural network has 16 channel input and 20 convolutional filters.', 'duration': 25.872, 'max_score': 595.862, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY595862.jpg'}, {'end': 716.921, 'src': 'embed', 'start': 688.471, 'weight': 6, 'content': [{'end': 696.074, 'text': "but it doesn't really give you a good intuition 
for what they're looking at, because these filters are not connected directly to the input image.", 'start': 688.471, 'duration': 7.603}, {'end': 701.316, 'text': 'Instead, recall that the second layer convolutional filters are connected to the output of the first layer.', 'start': 696.514, 'duration': 4.802}, {'end': 710.139, 'text': 'So this is giving you a visualization of what type of activation pattern after the first layer convolution would cause the second layer convolution to maximally activate.', 'start': 701.796, 'duration': 8.343}, {'end': 716.921, 'text': "But that's not very interpretable because we don't have a good sense for what those first layer convolutions look like in terms of image pixels.", 'start': 710.579, 'duration': 6.342}], 'summary': 'Second layer convolutions visualize activation patterns after first layer convolutions.', 'duration': 28.45, 'max_score': 688.471, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY688471.jpg'}, {'end': 764.442, 'src': 'embed', 'start': 736.438, 'weight': 7, 'content': [{'end': 742.081, 'text': 'So in practice, those weights could be unbounded, they could have any range, but to get nice visualizations, we need to scale those.', 'start': 736.438, 'duration': 5.643}, {'end': 746.344, 'text': 'These visualizations also do not take into account the biases in these layers,', 'start': 742.862, 'duration': 3.482}, {'end': 751.427, 'text': 'so you should keep that in mind and not take these types of visualizations too literally.', 'start': 746.344, 'duration': 5.083}, {'end': 755.116, 'text': 'Now at the last layer.', 'start': 754.115, 'duration': 1.001}, {'end': 758.298, 'text': "remember, when we're looking at the last layer of a convolutional network,", 'start': 755.116, 'duration': 3.182}, {'end': 764.442, 'text': 'we have these maybe thousand class scores that are telling us what are the predicted scores for each of the classes in our training data set.', 
'start': 758.298, 'duration': 6.144}], 'summary': 'Visualize weights, consider biases; last layer predicts thousand class scores.', 'duration': 28.004, 'max_score': 736.438, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY736438.jpg'}, {'end': 819.569, 'src': 'heatmap', 'start': 768.505, 'weight': 0.788, 'content': [{'end': 773.868, 'text': 'In the case of AlexNet, we have some 4096 dimensional feature representation of our image.', 'start': 768.505, 'duration': 5.363}, {'end': 777.671, 'text': 'that then gets fed into that final affine layer to predict our final class scores.', 'start': 773.868, 'duration': 3.803}, {'end': 787.474, 'text': "And another kind of route for tackling the problem of visualizing and understanding convnets is to try to understand what's happening at the last layer of our convolutional network.", 'start': 778.451, 'duration': 9.023}, {'end': 791.215, 'text': 'So what we can do is take some data set of images,', 'start': 787.954, 'duration': 3.261}, {'end': 798.537, 'text': 'run a bunch of images through our trained convolutional network and record that 4096 dimensional vector for each of those images,', 'start': 791.215, 'duration': 7.322}, {'end': 805.199, 'text': 'and now go through and try to figure out and visualize that last hidden layer rather than the first convolutional layer.', 'start': 798.537, 'duration': 6.662}, {'end': 809.202, 'text': 'So one thing you might imagine is trying a nearest neighbor approach.', 'start': 805.999, 'duration': 3.203}, {'end': 815.867, 'text': 'So remember, way back in the second lecture, we saw this graphic on the left where we had a nearest neighbor classifier,', 'start': 809.642, 'duration': 6.225}, {'end': 819.569, 'text': 'where we were looking at nearest neighbors in pixel space between CIFAR-10 images.', 'start': 815.867, 'duration': 3.702}], 'summary': 'Alexnet uses 4096 dimensional feature representation, and visualizes the last 
hidden layer using a nearest neighbor approach.', 'duration': 51.064, 'max_score': 768.505, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY768505.jpg'}, {'end': 974.967, 'src': 'embed', 'start': 926.867, 'weight': 1, 'content': [{'end': 932.413, 'text': 'However, in the feature space which is learned by the network, those two images end up being very close to each other,', 'start': 926.867, 'duration': 5.546}, {'end': 937.558, 'text': 'which means that somehow this last layer of features is capturing some of the semantic content of these images.', 'start': 932.413, 'duration': 5.145}, {'end': 939.38, 'text': "So that's really cool and really exciting.", 'start': 937.959, 'duration': 1.421}, {'end': 940.601, 'text': 'And in general,', 'start': 940.141, 'duration': 0.46}, {'end': 945.867, 'text': "looking at these kind of nearest neighbor visualizations is a really quick and easy way to visualize something about what's going on here.", 'start': 940.601, 'duration': 5.266}, {'end': 969.744, 'text': 'Yes, so the question is that through the standard supervised learning procedure for training classification networks,', 'start': 962.52, 'duration': 7.224}, {'end': 972.906, 'text': "there's nothing in the loss encouraging these features to be close together.", 'start': 969.744, 'duration': 3.162}, {'end': 974.967, 'text': "So that's true.", 'start': 974.007, 'duration': 0.96}], 'summary': 'Neural network feature space captures semantic content of images, encouraging visualizations for understanding.', 'duration': 48.1, 'max_score': 926.867, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY926867.jpg'}, {'end': 1116.539, 'src': 'embed', 'start': 1076.884, 'weight': 0, 'content': [{'end': 1082.43, 'text': 'which is a nonlinear dimensionality reduction method that people in deep learning often use for visualizing features.', 'start': 1076.884, 'duration': 5.546}, 
{'end': 1092.74, 'text': 'So here, as just an example of what t-SNE can do, this visualization here is showing a t-SNE dimensionality reduction on the MNIST dataset.', 'start': 1083.05, 'duration': 9.69}, {'end': 1097.064, 'text': 'So MNIST, remember, is this dataset of handwritten digits between zero and nine.', 'start': 1093.12, 'duration': 3.944}, {'end': 1100.047, 'text': 'Each image is a grayscale image, a 28 by 28 grayscale image.', 'start': 1097.425, 'duration': 2.622}, {'end': 1116.539, 'text': "So now we've used t-SNE to take that 28 times 28 dimensional feature space of the raw pixels for MNIST and now compress it down to two dimensions and then visualize each of those MNIST digits in this compressed two dimensional representation.", 'start': 1103.03, 'duration': 13.509}], 'summary': 'T-sne used to visualize mnist dataset in 2d', 'duration': 39.655, 'max_score': 1076.884, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY1076884.jpg'}, {'end': 1280.221, 'src': 'heatmap', 'start': 1211.586, 'weight': 0.785, 'content': [{'end': 1229.66, 'text': 'Was there a question? 
Yeah, so the basic idea is that we have an image, so now we end up with three different pieces of information about each image.', 'start': 1211.586, 'duration': 18.074}, {'end': 1233.203, 'text': 'We have the pixels of the image, we have the 4096-dimensional vector,', 'start': 1229.7, 'duration': 3.503}, {'end': 1245.793, 'text': 'then we use t-SNE to convert the 4096-dimensional vector into a two-dimensional coordinate and then we take the original pixels of the image and place it at the two-dimensional coordinate corresponding to the dimensionality-reduced version of the 4096-dimensional feature.', 'start': 1233.203, 'duration': 12.59}, {'end': 1248.896, 'text': 'Yeah, a little bit involved here.', 'start': 1247.935, 'duration': 0.961}, {'end': 1250.017, 'text': 'Question in the front?', 'start': 1249.497, 'duration': 0.52}, {'end': 1258.817, 'text': 'Question is roughly how much variance do these two dimensions explain?', 'start': 1255.916, 'duration': 2.901}, {'end': 1263.377, 'text': "Well, I'm not sure of the exact number and it gets a little bit muddy when you're talking about t-SNE,", 'start': 1259.117, 'duration': 4.26}, {'end': 1265.798, 'text': "because it's a non-linear dimensionality reduction technique.", 'start': 1263.377, 'duration': 2.421}, {'end': 1269.639, 'text': "So I'd have to look offline and I'm not sure of exactly how much it explains.", 'start': 1266.138, 'duration': 3.501}, {'end': 1280.221, 'text': "Question? Can you do the same analysis of other layers of the network? 
And yes you can, but no I don't have those visualizations here, sorry.", 'start': 1270.299, 'duration': 9.922}], 'summary': 'Using t-sne to convert 4096-dimensional vectors into 2d coordinates for image analysis.', 'duration': 68.635, 'max_score': 1211.586, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY1211586.jpg'}, {'end': 1369.271, 'src': 'heatmap', 'start': 1316.96, 'weight': 0.732, 'content': [{'end': 1321.481, 'text': "And again, at the link, there's a couple more visualizations of this nature that address that a little bit.", 'start': 1316.96, 'duration': 4.521}, {'end': 1333.708, 'text': 'Okay, so another thing that you can do for some of these intermediate features is so we talked a couple slides ago that visualizing the weights of these intermediate layers is not so interpretable,', 'start': 1323.041, 'duration': 10.667}, {'end': 1340.232, 'text': 'but actually visualizing the activation maps of those intermediate layers is kind of interpretable in some cases.', 'start': 1333.708, 'duration': 6.524}, {'end': 1347.457, 'text': 'So again in the example of AlexNet, remember, the conv5 layer of AlexNet gives us this 128 by.', 'start': 1340.753, 'duration': 6.704}, {'end': 1361.746, 'text': 'the conv5 features for any image is now a 128 by 13 by 13 dimensional tensor, but we can think of that as 128 different 13 by 13 2D grids.', 'start': 1349.359, 'duration': 12.387}, {'end': 1369.271, 'text': 'So now we can actually go and visualize each of those 13 by 13 elements, slices of the feature map as a grayscale image.', 'start': 1362.247, 'duration': 7.024}], 'summary': 'Visualize activation maps of intermediate layers for interpretability and understanding.', 'duration': 52.311, 'max_score': 1316.96, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY1316960.jpg'}, {'end': 1347.457, 'src': 'embed', 'start': 1323.041, 'weight': 3, 'content': [{'end': 
1333.708, 'text': 'Okay, so another thing that you can do for some of these intermediate features is so we talked a couple slides ago that visualizing the weights of these intermediate layers is not so interpretable,', 'start': 1323.041, 'duration': 10.667}, {'end': 1340.232, 'text': 'but actually visualizing the activation maps of those intermediate layers is kind of interpretable in some cases.', 'start': 1333.708, 'duration': 6.524}, {'end': 1347.457, 'text': 'So again in the example of AlexNet, remember, the conv5 layer of AlexNet gives us this 128 by.', 'start': 1340.753, 'duration': 6.704}], 'summary': 'Visualizing activation maps of intermediate layers in AlexNet is interpretable.', 'duration': 24.416, 'max_score': 1323.041, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY1323041.jpg'}], 'start': 526.339, 'title': 'Visualizing neural network features', 'summary': 'Explores visualizing weights of convolutional layers, understanding convolutional networks, nearest neighbor visualization, and dimensionality reduction techniques. It covers challenges in direct visualization, feature scaling, and insights into clustering and semantic similarity utilizing t-SNE for nonlinear dimensionality reduction.', 'chapters': [{'end': 716.921, 'start': 526.339, 'title': 'Visualizing convolutional layers', 'summary': 'Explains visualizing weights of convolutional layers in neural networks, focusing on the interpretability of the first and second layers and the challenges in directly visualizing the weights.', 'duration': 190.582, 'highlights': ['The first layer in convolutional networks is always a convolutional layer for image data. The first layer in convolutional networks is always a convolutional layer for image data.', 'The second layer in convolutional networks receives a 16 channel input and applies a seven by seven convolution with 20 convolutional filters. 
The second layer in convolutional networks receives a 16 channel input and applies a seven by seven convolution with 20 convolutional filters.', 'Visualizing the weights of the second layer convolutional filters as images does not provide a good intuition due to their indirect connection to the input image. Visualizing the weights of the second layer convolutional filters as images does not provide a good intuition due to their indirect connection to the input image.']}, {'end': 926.407, 'start': 718.402, 'title': 'Understanding convolutional networks', 'summary': 'Discusses visualizing and understanding convolutional networks, including the scaling of weights for visualizations, the 4096 dimensional feature representation, and the computation of nearest neighbors in the feature space vs. pixel space.', 'duration': 208.005, 'highlights': ['The visualization techniques for convolutional networks involve scaling the weights to the 0-255 range to obtain visualizations, but these do not account for biases in the layers. scaling the weights to the 0-255 range', 'The 4096 dimensional feature representation of images in AlexNet precedes the final affine layer for predicting class scores. 4096 dimensional feature representation', 'Computing nearest neighbors in the 4096 dimensional feature space reveals that the semantic content of images tends to be similar despite differences in pixel space. 
computing nearest neighbors in the 4096 dimensional feature space']}, {'end': 1211.186, 'start': 926.867, 'title': 'Nearest neighbor visualization and dimensionality reduction', 'summary': 'Explores the use of nearest neighbor visualization to understand the semantic content captured by the last layer of features in a network, and the application of t-SNE for nonlinear dimensionality reduction to visualize feature spaces, offering insights into clustering and semantic similarity.', 'duration': 284.319, 'highlights': ['The last layer of features in the network captures semantic content, as demonstrated by the images being very close to each other in the feature space, despite not being explicitly encouraged during standard supervised learning. Semantic content captured by the last layer of features', 'The application of t-SNE for nonlinear dimensionality reduction to visualize feature spaces provides insights into clustering and semantic similarity, as demonstrated by the natural clusters appearing when applied to the MNIST dataset and the visualization of the learned feature space. Insights into clustering and semantic similarity from t-SNE visualization', 'The use of nearest neighbor visualization offers a quick and easy way to understand the relationships and similarities between images, providing valuable insights into the feature space learned by the network. 
Quick and easy understanding of feature space relationships and similarities']}, {'end': 1464.01, 'start': 1211.586, 'title': 'Visualizing neural network features', 'summary': 'Discusses using t-sne to visualize high-dimensional image features in 2d, exploring the interpretability of activation maps of intermediate layers, and the potential insights gained from visualizing the image recognition process.', 'duration': 252.424, 'highlights': ['Using t-SNE to visualize high-dimensional image features in 2D, demonstrating the process of converting a 4096-dimensional vector into a two-dimensional coordinate system.', 'Exploring the interpretability of activation maps of intermediate layers in neural networks, showcasing the visualization of different 13 by 13 2D grids to understand what each feature in the convolutional layer is looking for.', "Discussing the potential insights gained from visualizing the image recognition process, such as identifying specific features activated by human faces in a neural network's intermediate layer.", 'Addressing the potential overlap and density issues in the feature space when using dimensionality reduction, emphasizing the need to consider the distribution of features in different parts of the space.']}], 'duration': 937.671, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY526339.jpg', 'highlights': ['The application of t-SNE for nonlinear dimensionality reduction to visualize feature spaces provides insights into clustering and semantic similarity, as demonstrated by the natural clusters appearing when applied to the MNIST dataset and the visualization of the learned feature space.', 'The use of nearest neighbor visualization offers a quick and easy way to understand the relationships and similarities between images, providing valuable insights into the feature space learned by the network.', 'The last layer of features in the network captures semantic content, as 
demonstrated by the images being very close to each other in the feature space, despite not being explicitly encouraged during standard supervised learning.', 'Exploring the interpretability of activation maps of intermediate layers in neural networks, showcasing the visualization of different 13 by 13 2D grids to understand what each feature in the convolutional layer is looking for.', 'The first layer in convolutional networks is always a convolutional layer for image data.', 'The second layer in convolutional networks receives a 16 channel input and applies a seven by seven convolution with 20 convolutional filters.', 'Visualizing the weights of the second layer convolutional filters as images does not provide a good intuition due to their indirect connection to the input image.', 'The visualization techniques for convolutional networks involve scaling the weights to the 0-255 range to obtain visualizations, but these do not account for biases in the layers.']}, {'end': 2043.068, 'segs': [{'end': 1638.438, 'src': 'heatmap', 'start': 1495.555, 'weight': 0.726, 'content': [{'end': 1502.879, 'text': 'At each layer in the convolutional network, our input image is like three by 224 by 224, and then it goes through many stages of convolution.', 'start': 1495.555, 'duration': 7.324}, {'end': 1507.601, 'text': 'And then after each convolutional layer is some three-dimensional chunk of numbers,', 'start': 1503.299, 'duration': 4.302}, {'end': 1510.003, 'text': 'which are the outputs from that layer of the convolutional network.', 'start': 1507.601, 'duration': 2.402}, {'end': 1518.067, 'text': 'And that entire three-dimensional chunk of numbers, which are the output of the previous convolutional layer, we call an activation volume,', 'start': 1510.443, 'duration': 7.624}, {'end': 1521.629, 'text': 'and then one of those slices is an activation map.', 'start': 1518.067, 'duration': 3.562}, {'end': 1538.365, 'text': 'The question is if the image is k by k, will the 
activation map be k by k?', 'start': 1534.462, 'duration': 3.903}, {'end': 1541.988, 'text': 'Not always, because there can be subsampling due to strided convolution and pooling.', 'start': 1538.445, 'duration': 3.543}, {'end': 1547.232, 'text': 'But in general, the size of each activation map will be linear in the size of the input image.', 'start': 1542.408, 'duration': 4.824}, {'end': 1562.762, 'text': 'So another kind of useful thing we can do for visualizing intermediate features is visualizing what types of patches from input images cause maximal activation in different neurons.', 'start': 1550.417, 'duration': 12.345}, {'end': 1571.546, 'text': "So what we've done here is that we pick, maybe again, the conv5 layer from AlexNet and remember each of these activation volumes at conv5,", 'start': 1563.322, 'duration': 8.224}, {'end': 1573.426, 'text': 'and AlexNet gives us a 128 by 13 by 13 chunk of numbers.', 'start': 1571.546, 'duration': 1.88}, {'end': 1579.491, 'text': "then we'll pick one of those 128 channels, maybe channel 17,.", 'start': 1575.587, 'duration': 3.904}, {'end': 1585.536, 'text': "and now what we'll do is run many images through this convolutional network and then, for each of those images,", 'start': 1579.491, 'duration': 6.045}, {'end': 1591.056, 'text': 'record the conv5 features and then look at the Right.', 'start': 1585.536, 'duration': 5.52}, {'end': 1597.261, 'text': 'so then look at the parts of that 17th feature map that are maximally activated over our data set of images.', 'start': 1591.056, 'duration': 6.205}, {'end': 1604.926, 'text': 'And now because again this is a convolutional layer, each of those neurons in the convolutional layer has some small receptive field in the input.', 'start': 1597.841, 'duration': 7.085}, {'end': 1608.689, 'text': "Each of those neurons is not looking at the whole image, they're only looking at some subset of the image.", 'start': 1605.047, 'duration': 3.642}, {'end': 1617.896, 'text': "So, 
then, what we'll do is visualize the patches from this large data set of images corresponding to the maximal activations of that feature,", 'start': 1609.09, 'duration': 8.806}, {'end': 1619.958, 'text': 'of that particular feature in that particular layer.', 'start': 1617.896, 'duration': 2.062}, {'end': 1625.463, 'text': 'and then we can sort these patches by their activation at that particular layer.', 'start': 1620.638, 'duration': 4.825}, {'end': 1632.491, 'text': "So here is some examples from this network called the network doesn't matter,", 'start': 1626.124, 'duration': 6.367}, {'end': 1635.715, 'text': 'but these are some visualizations of these kind of maximally activating patches.', 'start': 1632.491, 'duration': 3.224}, {'end': 1638.438, 'text': 'So each row.', 'start': 1636.316, 'duration': 2.122}], 'summary': 'Convolutional network analyzes input images, generating activation volumes and maps, and visualizes maximal activation patches for different neurons.', 'duration': 142.883, 'max_score': 1495.555, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY1495555.jpg'}, {'end': 1632.491, 'src': 'embed', 'start': 1605.047, 'weight': 0, 'content': [{'end': 1608.689, 'text': "Each of those neurons is not looking at the whole image, they're only looking at some subset of the image.", 'start': 1605.047, 'duration': 3.642}, {'end': 1617.896, 'text': "So, then, what we'll do is visualize the patches from this large data set of images corresponding to the maximal activations of that feature,", 'start': 1609.09, 'duration': 8.806}, {'end': 1619.958, 'text': 'of that particular feature in that particular layer.', 'start': 1617.896, 'duration': 2.062}, {'end': 1625.463, 'text': 'and then we can sort these patches by their activation at that particular layer.', 'start': 1620.638, 'duration': 4.825}, {'end': 1632.491, 'text': "So here is some examples from this network called the network doesn't matter,", 
'start': 1626.124, 'duration': 6.367}], 'summary': 'Neurons analyze image subsets and visualize patches from dataset based on maximal activations in a specific layer.', 'duration': 27.444, 'max_score': 1605.047, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY1605047.jpg'}, {'end': 1849.604, 'src': 'embed', 'start': 1823.009, 'weight': 1, 'content': [{'end': 1829.835, 'text': 'So when we block out the region of the image corresponding to this go cart in front, then the predicted probability for the go cart class drops a lot.', 'start': 1823.009, 'duration': 6.826}, {'end': 1837.562, 'text': 'So that gives us some sense that the network is actually caring a lot about these pixels in the input image in order to make its classification decision.', 'start': 1830.215, 'duration': 7.347}, {'end': 1849.604, 'text': "Question? Yeah, so the question is that what's going on in the background?", 'start': 1838.423, 'duration': 11.181}], 'summary': "Blocking region of go cart in image reduces predicted probability, indicating network's emphasis on those pixels.", 'duration': 26.595, 'max_score': 1823.009, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY1823009.jpg'}, {'end': 2026.555, 'src': 'heatmap', 'start': 1909.195, 'weight': 2, 'content': [{'end': 1911.998, 'text': "So it's more for your understanding than for improving performance per se.", 'start': 1909.195, 'duration': 2.803}, {'end': 1920.067, 'text': 'So another related idea is this concept of a saliency map, which is something that you will see in your homeworks.', 'start': 1914.284, 'duration': 5.783}, {'end': 1927.772, 'text': 'So again we have the same question of given an input image of a dog in this case and the predicted class label of dog.', 'start': 1920.448, 'duration': 7.324}, {'end': 1931.254, 'text': 'we wanna know which pixels in the input image are important for classification.', 'start': 
1927.772, 'duration': 3.482}, {'end': 1938.858, 'text': 'We saw masking as one way to get at this question, but saliency maps are another way for attacking this problem.', 'start': 1931.614, 'duration': 7.244}, {'end': 1951.162, 'text': "And the question is and one relatively simple idea from Karen Simonyan's paper a couple years ago is this is just computing the gradient of the predicted class score with respect to the pixels of the input image?", 'start': 1939.418, 'duration': 11.744}, {'end': 1958.885, 'text': 'And this will directly tell us in this sort of first order approximation sense for each pixel in the input image.', 'start': 1951.603, 'duration': 7.282}, {'end': 1963.367, 'text': 'if we wiggle that pixel a little bit, then how much will the classification score for the class change?', 'start': 1958.885, 'duration': 4.482}, {'end': 1969.398, 'text': 'And this is another way to get at this question of which pixels in the input matter for the classification.', 'start': 1963.874, 'duration': 5.524}, {'end': 1979.205, 'text': 'And when we run, for example, compute, a saliency map for this dog, we see kind of a nice outline of a dog in the image,', 'start': 1970.479, 'duration': 8.726}, {'end': 1983.768, 'text': 'which tells us that these are probably the pixels that that network is actually looking at for this image.', 'start': 1979.205, 'duration': 4.563}, {'end': 1991.574, 'text': 'And when we repeat this kind of process for different images, we get some sense that the network is sort of looking at the right regions,', 'start': 1984.889, 'duration': 6.685}, {'end': 1992.574, 'text': 'which is somewhat comforting.', 'start': 1991.574, 'duration': 1}, {'end': 2001.111, 'text': 'The question is, do people use saliency maps for semantic segmentation? 
The answer is yes.', 'start': 1997.267, 'duration': 3.844}, {'end': 2005.936, 'text': 'That actually was, yeah, you guys are really on top of it this lecture.', 'start': 2001.852, 'duration': 4.084}, {'end': 2009.46, 'text': "So that was another component again in Karen's paper,", 'start': 2006.677, 'duration': 2.783}, {'end': 2017.308, 'text': "where there's this idea that maybe you can use these saliency maps to perform semantic segmentation without any labeled data for these segments.", 'start': 2009.46, 'duration': 7.848}, {'end': 2023.773, 'text': "So here they're using this grab-cut segmentation algorithm, which I don't really want to get into the details of,", 'start': 2018.829, 'duration': 4.944}, {'end': 2026.555, 'text': "but it's kind of an interactive segmentation algorithm that you can use.", 'start': 2023.773, 'duration': 2.782}], 'summary': 'Saliency maps show important pixels for classification, used for semantic segmentation.', 'duration': 117.36, 'max_score': 1909.195, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY1909195.jpg'}], 'start': 1464.07, 'title': 'Image analysis in neural networks', 'summary': 'Covers visualization of convolutional network activations, occlusion experiments for identifying important image parts, and the use of saliency maps for semantic segmentation, providing insights into image analysis methods in neural networks.', 'chapters': [{'end': 1735.15, 'start': 1464.07, 'title': 'Convolutional network activation visualization', 'summary': 'Explains how a convolutional network can learn features useful for classification tasks, visualizing intermediate features by analyzing activation maps, and identifying the types of features neurons might be looking for in input images.', 'duration': 271.08, 'highlights': ['The convolutional network can learn features useful for the classification task, even different from the explicit classification task it was trained for.', 'The 
size of each activation map in the convolutional network will be linear in the size of the input image, with subsampling due to strided convolution and pooling.', 'Visualizing intermediate features involves identifying the patches from input images that cause maximal activation in different neurons, and sorting these patches by their activation at a particular layer.', 'The visualizations of maximally activating patches provide insights into the types of features that neurons might be looking for in input images, such as circly things, text in different colors, curving edges of different colors and orientations, and humans or human faces.', 'Neurons in a higher layer of the network have larger receptive fields, looking for larger structures in the input image.']}, {'end': 1864.369, 'start': 1736.411, 'title': 'Neural network occlusion experiment', 'summary': "Discusses an occlusion experiment to identify important parts of the input image for a neural network classification, revealing that when certain regions of the image are occluded, the network's predicted probability for a specific class drastically changes, indicating the significance of those occluded pixels in the classification decision.", 'duration': 127.958, 'highlights': ["By occluding parts of the input image and observing the change in predicted probability, it was revealed that certain regions, such as the go-cart in a sample image, significantly influence the network's classification decision.", 'The occlusion experiment involved sliding an occluded patch over every position in the input image and recording the predicted probability for each position, ultimately generating a heat map to visualize the importance of different parts of the input image.', "The occlusion experiment is based on the idea that if blocking out a part of the image causes a drastic change in the network's predicted probability, then that part of the input image is crucial for the classification decision."]}, {'end': 2043.068, 
'start': 1864.469, 'title': 'Understanding saliency maps in neural networks', 'summary': 'Discusses the concept of saliency maps and their usefulness in understanding neural networks, including the computation of gradients for determining important pixels in images and their application in semantic segmentation without labeled data.', 'duration': 178.599, 'highlights': ["Saliency maps are used to understand the computations of neural networks by determining important pixels in input images. The chapter explains that saliency maps help in understanding the computations of neural networks by identifying important pixels in input images, providing insights into the network's decision-making process.", 'Computing the gradient of the predicted class score with respect to the pixels of the input image helps determine the importance of each pixel for classification. The process of computing the gradient of the predicted class score with respect to the pixels of the input image allows for the determination of the importance of each pixel for classification, offering a first-order approximation of the impact on the classification score when the pixel is altered.', 'Saliency maps can be used for semantic segmentation without labeled data, in combination with grab-cut segmentation algorithm. 
The chapter discusses the application of saliency maps for semantic segmentation without labeled data, particularly in combination with the grab-cut segmentation algorithm, enabling the segmentation of objects in images without requiring labeled data.']}], 'duration': 578.998, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY1464070.jpg', 'highlights': ['The visualizations of maximally activating patches provide insights into the types of features that neurons might be looking for in input images, such as circly things, text in different colors, curving edges of different colors and orientations, and humans or human faces.', "By occluding parts of the input image and observing the change in predicted probability, it was revealed that certain regions, such as the go-cart in a sample image, significantly influence the network's classification decision.", 'Saliency maps can be used for semantic segmentation without labeled data, in combination with grab-cut segmentation algorithm.']}, {'end': 2469.806, 'segs': [{'end': 2175.904, 'src': 'heatmap', 'start': 2047.071, 'weight': 0, 'content': [{'end': 2053.164, 'text': "So I'm not sure how practical this is, but it is pretty cool that it works at all.", 'start': 2047.071, 'duration': 6.093}, {'end': 2057.072, 'text': 'But it probably works much less than something trained explicitly to segment with supervision.', 'start': 2053.404, 'duration': 3.668}, {'end': 2063.302, 'text': 'So kind of another related idea is this idea of guided backpropagation.', 'start': 2058.978, 'duration': 4.324}, {'end': 2069.966, 'text': 'So again, we still want to answer the question of for one particular image.', 'start': 2063.722, 'duration': 6.244}, {'end': 2072.388, 'text': 'then now, instead of looking at the class score,', 'start': 2069.966, 'duration': 2.422}, {'end': 2083.516, 'text': 'we want to know we want to pick some intermediate neuron in the network and ask again which parts 
of the input image influence the score of that internal neuron in the network.', 'start': 2072.388, 'duration': 11.128}, {'end': 2088.641, 'text': 'And then you could imagine again you could imagine computing a saliency map for this right?', 'start': 2084.216, 'duration': 4.425}, {'end': 2093.405, 'text': 'That rather than computing the gradient of the class score with respect to the pixels of the image,', 'start': 2088.981, 'duration': 4.424}, {'end': 2098.41, 'text': 'you could compute the gradient of some intermediate value in the network with respect to the pixels of the image.', 'start': 2093.405, 'duration': 5.005}, {'end': 2105.037, 'text': 'And that would tell us, again, which pixels in the input image influence that value of that particular neuron.', 'start': 2098.751, 'duration': 6.286}, {'end': 2107.72, 'text': 'And that would be using normal backpropagation.', 'start': 2105.778, 'duration': 1.942}, {'end': 2114.642, 'text': "but it turns out that there's a slight tweak that we can do to this backpropagation procedure that ends up giving some slightly cleaner images.", 'start': 2108.28, 'duration': 6.362}, {'end': 2120.523, 'text': "So that's this idea of guided backpropagation that again comes from Zeiler and Fergus' 2014 paper.", 'start': 2115.102, 'duration': 5.421}, {'end': 2124.124, 'text': "And I don't really want to get into the details too much here,", 'start': 2121.384, 'duration': 2.74}, {'end': 2136.608, 'text': "but it's kind of a weird tweak where you change the way that you backpropagate through ReLU nonlinearities and you sort of only backpropagate positive gradients through ReLUs and you do not backpropagate negative gradients through the ReLUs.", 'start': 2124.124, 'duration': 12.484}, {'end': 2145.874, 'text': "So you're no longer computing the true gradient, instead you're kind of only keeping track of positive influences throughout the entire network.", 'start': 2137.128, 'duration': 8.746}, {'end': 2152.398, 'text': "So maybe you 
should read through these papers referenced here if you want a little bit more details about why that's a good idea.", 'start': 2146.874, 'duration': 5.524}, {'end': 2160.4, 'text': 'But empirically, when you do guided back propagation as opposed to regular back propagation, you tend to get much cleaner,', 'start': 2153.498, 'duration': 6.902}, {'end': 2166.561, 'text': 'nicer images that tell you which pixels of the input image influence that particular neuron.', 'start': 2160.4, 'duration': 6.161}, {'end': 2175.904, 'text': "So again we're seeing the same visualization that we saw a few slides ago of the maximally activating patches for different neurons in this convolutional network.", 'start': 2167.122, 'duration': 8.782}], 'summary': 'Guided backpropagation yields cleaner images, providing insight into the influence of pixels on specific neurons in the network.', 'duration': 128.833, 'max_score': 2047.071, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY2047071.jpg'}, {'end': 2276.875, 'src': 'heatmap', 'start': 2225.439, 'weight': 0.866, 'content': [{'end': 2232.424, 'text': "They're telling us for a fixed input image which pixels or which parts of that input image influence the value of the neuron.", 'start': 2225.439, 'duration': 6.985}, {'end': 2243.971, 'text': 'Another question you might answer is remove this reliance on some input image and instead just ask what type of input in general would cause this neuron to activate?', 'start': 2232.884, 'duration': 11.087}, {'end': 2248.474, 'text': 'And we can answer this question using a technique called gradient ascent.', 'start': 2244.531, 'duration': 3.943}, {'end': 2254.277, 'text': 'So remember we always use gradient descent to train our convolutional networks by minimizing the loss.', 'start': 2249.054, 'duration': 5.223}, {'end': 2269.807, 'text': 'Instead, now we want to fix the weights of our trained convolutional network and instead synthesize 
an image by performing gradient ascent on the pixels of the image to try and maximize the score of some intermediate neuron or of some class.', 'start': 2254.778, 'duration': 15.029}, {'end': 2276.875, 'text': "So in the process of gradient ascent, we're no longer optimizing over the weights of the network.", 'start': 2270.928, 'duration': 5.947}], 'summary': 'Using gradient ascent to synthesize image for neuron activation.', 'duration': 51.436, 'max_score': 2225.439, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY2225439.jpg'}, {'end': 2389.01, 'src': 'heatmap', 'start': 2314.682, 'weight': 2, 'content': [{'end': 2321.666, 'text': 'One, we want it to maximally activate some score or some neuron value, but we also want it to look like a natural image.', 'start': 2314.682, 'duration': 6.984}, {'end': 2325.829, 'text': 'And we want it to kind of have the kind of statistics that we typically see in natural images.', 'start': 2321.926, 'duration': 3.903}, {'end': 2332.433, 'text': 'So these regularization term in this objective is something to enforce our generated image to look relatively natural.', 'start': 2326.369, 'duration': 6.064}, {'end': 2335.635, 'text': "And we'll see a couple different examples of regularizers as we go through.", 'start': 2332.813, 'duration': 2.822}, {'end': 2344.219, 'text': "but the general strategy for this is actually pretty simple and you'll again implement a lot of things of this nature on your assignment three.", 'start': 2336.956, 'duration': 7.263}, {'end': 2350.321, 'text': "but what we'll do is start with some initial image, either initializing to zeros or to uniform or Gaussian noise,", 'start': 2344.219, 'duration': 6.102}, {'end': 2359.245, 'text': "but initialize your image in some way and now repeat where you'll forward your image through your network and compute the score or neuron value that you're interested.", 'start': 2350.321, 'duration': 8.924}, {'end': 
2373.335, 'text': 'Now back propagate to compute the gradient of that neuron score with respect to the pixels of the image and then make a small gradient descent or gradient ascent update to the pixels of the image itself to try and maximize that score.', 'start': 2359.805, 'duration': 13.53}, {'end': 2376.597, 'text': 'And now repeat this process over and over again until you have a beautiful image.', 'start': 2373.735, 'duration': 2.862}, {'end': 2381.888, 'text': 'And then we talked about this image regularizer.', 'start': 2379.347, 'duration': 2.541}, {'end': 2389.01, 'text': 'Well, a very simple idea for an image regularizer is simply to penalize the L2 norm of our generated image.', 'start': 2382.208, 'duration': 6.802}], 'summary': 'Maximizing neuron value while maintaining natural image look, using l2 norm as regularizer.', 'duration': 74.328, 'max_score': 2314.682, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY2314682.jpg'}], 'start': 2043.068, 'title': 'Visualization techniques in neural networks', 'summary': 'Covers guided backpropagation for cleaner images and its practical implications, and visualizing neuron activation using gradient ascent and regularization terms, with examples of synthesized images for specific classes.', 'chapters': [{'end': 2145.874, 'start': 2043.068, 'title': 'Guided backpropagation in neural networks', 'summary': 'Discusses the concept of guided backpropagation, a technique that computes the gradient of intermediate values in a network with respect to the pixels of the image, resulting in cleaner images, as well as its practical implications in comparison to supervised training.', 'duration': 102.806, 'highlights': ['Guided backpropagation computes the gradient of intermediate values in a network with respect to the pixels of the image, resulting in cleaner images compared to normal backpropagation.', 'The practical implications of guided backpropagation are discussed, 
indicating that it may work less effectively than methods explicitly trained with supervision.']}, {'end': 2469.806, 'start': 2146.874, 'title': 'Visualizing neuron activation in cnns', 'summary': 'Discusses the use of guided back propagation and gradient ascent to visualize which parts of input images influence neuron activation, generating images to maximize neuron scores, and incorporating regularization terms to enforce natural image properties, with examples of synthesized images for specific classes.', 'duration': 322.932, 'highlights': ['Using guided back propagation to visualize the influence of input image pixels on neuron activation, resulting in cleaner and more informative images Guided back propagation provides cleaner and more informative images by showing which pixels of the input image influence specific neurons.', 'Applying gradient ascent to synthesize images that maximize the score of specific neurons or classes, with the incorporation of regularization terms to enforce natural image properties Gradient ascent is used to synthesize images that maximize neuron scores, with a focus on enforcing natural image properties through regularization terms.', 'Incorporating regularization terms to enforce natural image properties and prevent overfitting to network peculiarities Regularization terms are used to enforce natural image properties and prevent overfitting to network peculiarities, ensuring the synthesized images appear relatively natural.']}], 'duration': 426.738, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY2043068.jpg', 'highlights': ['Guided backpropagation computes the gradient of intermediate values in a network with respect to the pixels of the image, resulting in cleaner images compared to normal backpropagation.', 'Using guided back propagation to visualize the influence of input image pixels on neuron activation, resulting in cleaner and more informative images.', 'Applying 
gradient ascent to synthesize images that maximize the score of specific neurons or classes, with the incorporation of regularization terms to enforce natural image properties.']}, {'end': 3258.22, 'segs': [{'end': 2627.756, 'src': 'embed', 'start': 2571.37, 'weight': 0, 'content': [{'end': 2578.155, 'text': 'And there was another paper from Jason Yosinski and some of his collaborators where they added some additional implicit regularizers.', 'start': 2571.37, 'duration': 6.785}, {'end': 2584.099, 'text': "So, in addition to this L2 norm constraint, in addition, we'll also periodically, during optimization,", 'start': 2578.616, 'duration': 5.483}, {'end': 2590.224, 'text': 'do some Gaussian blurring on the image or also clip some small pixel values all the way to zero,', 'start': 2584.099, 'duration': 6.125}, {'end': 2594.208, 'text': 'or also clip some of the pixel values with low gradients to zero.', 'start': 2590.644, 'duration': 3.564}, {'end': 2597.592, 'text': 'So you can see this as kind of a projected gradient descent algorithm,', 'start': 2594.649, 'duration': 2.943}, {'end': 2604.059, 'text': "where periodically we're projecting our generated image onto some nicer set of images with some nicer properties.", 'start': 2597.592, 'duration': 6.467}, {'end': 2607.622, 'text': 'For example, spatial smoothness with respect to the Gaussian blurring.', 'start': 2604.419, 'duration': 3.203}, {'end': 2612.346, 'text': 'So when you do this, you tend to get much nicer images that are much clearer to see.', 'start': 2608.283, 'duration': 4.063}, {'end': 2621.392, 'text': 'So now these flamingos look like flamingos, the ground beetle is starting to look more beetle-like, or this black swan maybe looks like a black swan.', 'start': 2612.806, 'duration': 8.586}, {'end': 2627.756, 'text': 'These billiard tables actually look kind of impressive now, where you can definitely see this billiard table structure.', 'start': 2622.213, 'duration': 5.543}], 'summary': 
'Incorporating implicit regularizers in optimization improved image quality, resulting in clearer and recognizable images.', 'duration': 56.386, 'max_score': 2571.37, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY2571370.jpg'}, {'end': 2751.281, 'src': 'embed', 'start': 2723.556, 'weight': 3, 'content': [{'end': 2730.577, 'text': 'So here they were trying to explicitly take account, take this multimodality into account in the optimization procedure, where they did indeed,', 'start': 2723.556, 'duration': 7.021}, {'end': 2732.178, 'text': 'I think, see the initial.', 'start': 2730.577, 'duration': 1.601}, {'end': 2733.778, 'text': 'so they, for each of the classes,', 'start': 2732.178, 'duration': 1.6}, {'end': 2742.239, 'text': 'you run a clustering algorithm to try to separate the classes into different modes and then initialize with something that is close to one of those modes.', 'start': 2733.778, 'duration': 8.461}, {'end': 2745.36, 'text': 'And then when you do that, you kind of account for this multimodality.', 'start': 2742.699, 'duration': 2.661}, {'end': 2751.281, 'text': 'So for intuition on the right here, these eight images are all of grocery stores,', 'start': 2745.84, 'duration': 5.441}], 'summary': 'Optimization procedure accounts for multimodality using clustering algorithm for grocery store images.', 'duration': 27.725, 'max_score': 2723.556, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY2723556.jpg'}, {'end': 2876.412, 'src': 'embed', 'start': 2838.859, 'weight': 4, 'content': [{'end': 2842.721, 'text': "So that's sort of one cool thing that we can do with this strategy,", 'start': 2838.859, 'duration': 3.862}, {'end': 2849.325, 'text': 'but this idea of trying to synthesize images by using gradients on the image pixels is actually super powerful.', 'start': 2842.721, 'duration': 6.604}, {'end': 2853.628, 'text': 'And another 
really cool thing we can do with this is this concept of a fooling image.', 'start': 2849.926, 'duration': 3.702}, {'end': 2859.754, 'text': 'So what we can do is pick some arbitrary image and then try to maximize the.', 'start': 2854.168, 'duration': 5.586}, {'end': 2868.884, 'text': "so say we've taken a picture of an elephant and then we tell the network that we want to change the image to maximize the score of koala bear instead.", 'start': 2859.754, 'duration': 9.13}, {'end': 2876.412, 'text': "So then what we're doing is trying to change that image of an elephant to try and instead cause the network to classify it as a koala bear.", 'start': 2869.384, 'duration': 7.028}], 'summary': 'Synthesizing images using gradients is powerful for fooling image classification networks.', 'duration': 37.553, 'max_score': 2838.859, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY2838859.jpg'}, {'end': 3057.449, 'src': 'embed', 'start': 3024.864, 'weight': 6, 'content': [{'end': 3028.547, 'text': 'The question is what does having understanding these intermediate neurons?', 'start': 3024.864, 'duration': 3.683}, {'end': 3030.889, 'text': 'how does that help our understanding of the final classification?', 'start': 3028.547, 'duration': 2.342}, {'end': 3038.394, 'text': 'So this is actually this whole field of trying to visualize intermediates is kind of in response to a common criticism of deep learning.', 'start': 3031.889, 'duration': 6.505}, {'end': 3044.019, 'text': "So a common criticism of deep learning is like you've got this big black box network, you've trained it on gradient descent,", 'start': 3038.835, 'duration': 5.184}, {'end': 3045.74, 'text': "you get a good number and that's great.", 'start': 3044.019, 'duration': 1.721}, {'end': 3050.704, 'text': "but we don't trust the network because we don't understand, as people, why it's making the decisions that it's making.", 'start': 3045.74, 'duration': 
4.964}, {'end': 3057.449, 'text': 'So a lot of these type of visualization techniques were developed to try and address that and try to understand, as people,', 'start': 3051.184, 'duration': 6.265}], 'summary': 'Visualizing intermediate neurons aids in understanding deep learning decisions.', 'duration': 32.585, 'max_score': 3024.864, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY3024864.jpg'}, {'end': 3075.963, 'src': 'embed', 'start': 3051.184, 'weight': 7, 'content': [{'end': 3057.449, 'text': 'So a lot of these type of visualization techniques were developed to try and address that and try to understand, as people,', 'start': 3051.184, 'duration': 6.265}, {'end': 3061.072, 'text': 'why the networks are making their various classification decisions a bit more.', 'start': 3057.449, 'duration': 3.623}, {'end': 3066.896, 'text': 'Because if you contrast a deep convolutional neural network with other machine learning,', 'start': 3061.412, 'duration': 5.484}, {'end': 3070.318, 'text': 'techniques like linear models are much easier to interpret in general,', 'start': 3066.896, 'duration': 3.422}, {'end': 3075.963, 'text': 'because you can look at the weights and kind of understand the interpretation between how much each input feature affect the decision.', 'start': 3070.318, 'duration': 5.645}], 'summary': "Visualization techniques aim to aid in understanding neural networks' classification decisions.", 'duration': 24.779, 'max_score': 3051.184, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY3051184.jpg'}, {'end': 3183.838, 'src': 'heatmap', 'start': 3087.292, 'weight': 0.86, 'content': [{'end': 3093.397, 'text': 'So a lot of this is sort of, in response to that criticism, to say that, yes, they are these large, complex models,', 'start': 3087.292, 'duration': 6.105}, {'end': 3096.92, 'text': 'but they are still doing some interesting and 
interpretable things under the hood.', 'start': 3093.397, 'duration': 3.523}, {'end': 3100.323, 'text': "They're not just totally going out and randomly classifying things.", 'start': 3097.18, 'duration': 3.143}, {'end': 3101.504, 'text': 'They are doing something meaningful.', 'start': 3100.403, 'duration': 1.101}, {'end': 3110.429, 'text': 'So another cool thing we can do with this gradient-based optimization of images is this idea of deep dream.', 'start': 3104.747, 'duration': 5.682}, {'end': 3115.17, 'text': 'So this was a really cool blog post that came out from Google a year or two ago.', 'start': 3110.949, 'duration': 4.221}, {'end': 3120.591, 'text': 'And the idea is that this is, so we talked about scientific value, this is almost entirely for fun.', 'start': 3115.59, 'duration': 5.001}, {'end': 3123.372, 'text': 'So the point of this exercise is mostly to generate cool images.', 'start': 3120.731, 'duration': 2.641}, {'end': 3129.736, 'text': 'And kind of as a side, you also get some sense for what features images are looking at, or these networks are looking at.', 'start': 3124.232, 'duration': 5.504}, {'end': 3136.421, 'text': 'So what we can do is we take our input image, we run it through the convolutional network up to some layer and now we back,', 'start': 3130.057, 'duration': 6.364}, {'end': 3140.204, 'text': 'propagate and set the gradient at that layer equal to the activation value.', 'start': 3136.421, 'duration': 3.783}, {'end': 3144.865, 'text': 'and now back propagate back to the image, update the image, and repeat, repeat, repeat.', 'start': 3140.604, 'duration': 4.261}, {'end': 3151.347, 'text': 'So this has the interpretation of trying to amplify existing features that were detected by the network in this image.', 'start': 3145.346, 'duration': 6.001}, {'end': 3154.028, 'text': 'right? 
Because whatever features existed on that layer,', 'start': 3151.347, 'duration': 2.681}, {'end': 3159.19, 'text': 'now we set the gradient equal to the feature and we just tell the network to amplify whatever features you already saw in that image.', 'start': 3154.028, 'duration': 5.162}, {'end': 3165.932, 'text': 'And by the way, you can also see this as trying to maximize the L2 norm of the features at that layer of the image.', 'start': 3159.85, 'duration': 6.082}, {'end': 3169.974, 'text': 'And when you do this, the code ends up looking really simple.', 'start': 3166.793, 'duration': 3.181}, {'end': 3175.296, 'text': 'So your code for many of your homework assignments will probably be about this complex or maybe even a little bit less so.', 'start': 3170.374, 'duration': 4.922}, {'end': 3180.177, 'text': "So the idea is that, but there's a couple tricks here that you'll also see in your assignments.", 'start': 3175.856, 'duration': 4.321}, {'end': 3183.838, 'text': 'So one trick is to jitter the image before you compute your gradients.', 'start': 3180.637, 'duration': 3.201}], 'summary': 'Deep dream generates cool images and helps understand network features.', 'duration': 96.546, 'max_score': 3087.292, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY3087292.jpg'}, {'end': 3129.736, 'src': 'embed', 'start': 3104.747, 'weight': 8, 'content': [{'end': 3110.429, 'text': 'So another cool thing we can do with this gradient-based optimization of images is this idea of deep dream.', 'start': 3104.747, 'duration': 5.682}, {'end': 3115.17, 'text': 'So this was a really cool blog post that came out from Google a year or two ago.', 'start': 3110.949, 'duration': 4.221}, {'end': 3120.591, 'text': 'And the idea is that this is, so we talked about scientific value, this is almost entirely for fun.', 'start': 3115.59, 'duration': 5.001}, {'end': 3123.372, 'text': 'So the point of this exercise is mostly to generate 
cool images.', 'start': 3120.731, 'duration': 2.641}, {'end': 3129.736, 'text': 'And kind of as a side, you also get some sense for what features images are looking at, or these networks are looking at.', 'start': 3124.232, 'duration': 5.504}], 'summary': 'Using gradient-based optimization for deep dream images, primarily for fun and generating cool images.', 'duration': 24.989, 'max_score': 3104.747, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY3104747.jpg'}], 'start': 2470.287, 'title': 'Image synthesis and deep learning visualizations', 'summary': 'Explores the impact of regularizers on image synthesis, emphasizing how they enhance image quality, and discusses multimodality optimization, fooling images, and deep learning visualizations. it also covers generating deep dream images and addressing the interpretability of deep learning networks.', 'chapters': [{'end': 2704.262, 'start': 2470.287, 'title': 'Regularizers in image synthesis', 'summary': 'Discusses the impact of regularizers on image synthesis, highlighting how the addition of regularizers improves the quality of generated images, making them more recognizable, and how different random initializations of input images can be used to synthesize images that maximally activate intermediate neurons of the network.', 'duration': 233.975, 'highlights': ['Adding regularizers improves the quality of generated images, making them more recognizable, for example, flamingos, ground beetles, black swans, and billiard tables become clearer and more defined.', 'Different random initializations of input images can be used to synthesize images that maximally activate intermediate neurons of the network, offering insight into the preferences and receptive fields of these neurons.', 'The addition of regularizers, such as Gaussian blurring and clipping low gradient pixel values to zero, enhances the visualized images, leading to clearer and more structured 
representations.', 'The use of regularizers, like L2 norm constraint and Gaussian blurring, aids in combatting multimodality in image synthesis, contributing to the improvement of visualizations and generating clearer images.']}, {'end': 2933.632, 'start': 2707.343, 'title': 'Image synthesis and fooling images', 'summary': 'Discusses the use of multimodality optimization and image synthesis using gradients for generating realistic images, and the concept of fooling images by changing image classification without visible changes.', 'duration': 226.289, 'highlights': ['Multimodality optimization was addressed by explicitly accounting for it in the optimization procedure, resulting in nicer results through synthesized images.', 'The strategy of synthesizing images using gradients on the image pixels is a powerful concept, allowing for the generation of realistic images by adding additional priors towards modeling natural images.', "The concept of fooling images was demonstrated by changing an image's classification without visible changes, as seen in the example of an elephant morphing into a koala bear without any visible transformation."]}, {'end': 3258.22, 'start': 2934.488, 'title': 'Understanding deep learning visualizations', 'summary': 'Covers the importance of understanding intermediate neurons in deep learning, addressing the criticism of black box networks, and discusses the use of gradient-based optimization for generating deep dream images with examples of unique patterns.', 'duration': 323.732, 'highlights': ['The importance of understanding intermediate neurons in deep learning Understanding intermediate neurons helps in grasping the final classification decisions made by deep learning networks, addressing the criticism of their black box nature.', 'Addressing the criticism of black box networks in deep learning The visualization techniques were developed to address the criticism of deep learning networks being black box models, aiming to make their 
decisions more interpretable and understandable to humans.', 'Use of gradient-based optimization for generating deep dream images The technique involves running an input image through a convolutional network, back propagating and setting the gradient at a layer equal to the activation value, aiming to amplify existing features detected by the network in the image.']}], 'duration': 787.933, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY2470287.jpg', 'highlights': ['Adding regularizers improves the quality of generated images, making them more recognizable, for example, flamingos, ground beetles, black swans, and billiard tables become clearer and more defined.', 'The addition of regularizers, such as Gaussian blurring and clipping low gradient pixel values to zero, enhances the visualized images, leading to clearer and more structured representations.', 'The use of regularizers, like L2 norm constraint and Gaussian blurring, aids in combatting multimodality in image synthesis, contributing to the improvement of visualizations and generating clearer images.', 'Multimodality optimization was addressed by explicitly accounting for it in the optimization procedure, resulting in nicer results through synthesized images.', 'The strategy of synthesizing images using gradients on the image pixels is a powerful concept, allowing for the generation of realistic images by adding additional priors towards modeling natural images.', "The concept of fooling images was demonstrated by changing an image's classification without visible changes, as seen in the example of an elephant morphing into a koala bear without any visible transformation.", 'The importance of understanding intermediate neurons in deep learning Understanding intermediate neurons helps in grasping the final classification decisions made by deep learning networks, addressing the criticism of their black box nature.', 'Addressing the criticism of black box 
networks in deep learning The visualization techniques were developed to address the criticism of deep learning networks being black box models, aiming to make their decisions more interpretable and understandable to humans.', 'Use of gradient-based optimization for generating deep dream images The technique involves running an input image through a convolutional network, back propagating and setting the gradient at a layer equal to the activation value, aiming to amplify existing features detected by the network in the image.']}, {'end': 4012.849, 'segs': [{'end': 3337.558, 'src': 'embed', 'start': 3308.427, 'weight': 0, 'content': [{'end': 3313.768, 'text': "So here they're doing a kind of multi-scale processing where they start with a small image, run Deep Dream on the small image,", 'start': 3308.427, 'duration': 5.341}, {'end': 3318.209, 'text': 'then make it bigger and continue Deep Dream on the larger image, and kind of repeat with this multi-scale processing.', 'start': 3313.768, 'duration': 4.441}, {'end': 3325.11, 'text': 'And then you can get, and then maybe after you complete the final scale, then you restart from the beginning and you just go wild on this thing.', 'start': 3318.649, 'duration': 6.461}, {'end': 3326.951, 'text': 'And you can get some really crazy images.', 'start': 3325.511, 'duration': 1.44}, {'end': 3331.013, 'text': 'So these examples were all from networks trained on ImageNet.', 'start': 3328.031, 'duration': 2.982}, {'end': 3337.558, 'text': "There's another data set from MIT called the MIT Places data set that instead of 1,000 categories of objects,", 'start': 3331.313, 'duration': 6.245}], 'summary': 'Multi-scale processing using deep dream creates crazy images from imagenet data set', 'duration': 29.131, 'max_score': 3308.427, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY3308427.jpg'}, {'end': 3367.721, 'src': 'embed', 'start': 3342.601, 'weight': 1, 'content': [{'end': 
3350.106, 'text': 'And now if we repeat this deep dream procedure using a network trained on MIT Places, we get some really cool visualizations as well.', 'start': 3342.601, 'duration': 7.505}, {'end': 3356.852, 'text': 'So now, instead of dog slugs and admiral dogs and this kind of stuff, instead, we often get these kind of roof shapes,', 'start': 3350.666, 'duration': 6.186}, {'end': 3362.236, 'text': 'of these kind of Japanese-style buildings or these different types of bridges or mountain ranges.', 'start': 3356.852, 'duration': 5.384}, {'end': 3364.098, 'text': 'really cool, beautiful visualizations.', 'start': 3362.236, 'duration': 1.862}, {'end': 3367.721, 'text': 'So the code for Deep Dream is online, released by Google.', 'start': 3365.179, 'duration': 2.542}], 'summary': 'Deep dream procedure using mit places network produces beautiful visualizations of roof shapes, japanese-style buildings, bridges, and mountain ranges.', 'duration': 25.12, 'max_score': 3342.601, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY3342601.jpg'}, {'end': 3502.451, 'src': 'heatmap', 'start': 3409.619, 'weight': 0.963, 'content': [{'end': 3414.18, 'text': "But in practice, many implementations you'll see not explicitly compute that, instead they'll just send the gradient back.", 'start': 3409.619, 'duration': 4.561}, {'end': 3420.864, 'text': 'So another kind of useful thing that we can do is this concept of feature inversion.', 'start': 3416.4, 'duration': 4.464}, {'end': 3427.15, 'text': 'So this again gives us a sense for what types of elements of the image are captured at different layers of the network.', 'start': 3421.325, 'duration': 5.825}, {'end': 3432.095, 'text': "So what we're gonna do now is we're going to take an image, run that image through the network,", 'start': 3427.65, 'duration': 4.445}, {'end': 3439.662, 'text': "record the feature value for one of those images and now we're gonna try to reconstruct 
that image from its feature representation.", 'start': 3432.095, 'duration': 7.567}, {'end': 3445.704, 'text': 'And now, based on what that reconstructed image looks like,', 'start': 3440.262, 'duration': 5.442}, {'end': 3450.545, 'text': "that'll give us some sense for what type of information about the image was captured in that feature vector.", 'start': 3445.704, 'duration': 4.841}, {'end': 3457.047, 'text': 'So again, we can do this with gradient ascent, with some regularizer where now, rather than maximizing some score.', 'start': 3451.065, 'duration': 5.982}, {'end': 3464.309, 'text': 'instead, we want to minimize the distance between this cached feature vector and between the computed features of our generated image,', 'start': 3457.047, 'duration': 7.262}, {'end': 3468.791, 'text': 'to try and again synthesize a new image that matches the feature vector that we computed before.', 'start': 3464.309, 'duration': 4.482}, {'end': 3476.328, 'text': "And another kind of regularizer that you frequently see here is the total variation regularizer that you'll also see on your homework.", 'start': 3470.505, 'duration': 5.823}, {'end': 3478.089, 'text': 'So here the total variation.', 'start': 3476.789, 'duration': 1.3}, {'end': 3485.893, 'text': 'regularizer is penalizing differences between adjacent pixels, both adjacent in left and right and adjacent top to bottom,', 'start': 3478.089, 'duration': 7.804}, {'end': 3488.675, 'text': 'to again try to encourage spatial smoothness in the generated image.', 'start': 3485.893, 'duration': 2.782}, {'end': 3494.448, 'text': 'So now, if we do this idea of feature inversion, so this visualization, here on the left,', 'start': 3489.867, 'duration': 4.581}, {'end': 3502.451, 'text': "we're showing some original image the elephants or the fruit at the left, and then we run the image through a VGG16 network,", 'start': 3494.448, 'duration': 8.003}], 'summary': 'Implementing feature inversion to reconstruct images from their 
feature representations and using regularizers to encourage spatial smoothness in generated images.', 'duration': 92.832, 'max_score': 3409.619, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY3409619.jpg'}, {'end': 3502.451, 'src': 'embed', 'start': 3470.505, 'weight': 3, 'content': [{'end': 3476.328, 'text': "And another kind of regularizer that you frequently see here is the total variation regularizer that you'll also see on your homework.", 'start': 3470.505, 'duration': 5.823}, {'end': 3478.089, 'text': 'So here the total variation.', 'start': 3476.789, 'duration': 1.3}, {'end': 3485.893, 'text': 'regularizer is penalizing differences between adjacent pixels, both adjacent in left and right and adjacent top to bottom,', 'start': 3478.089, 'duration': 7.804}, {'end': 3488.675, 'text': 'to again try to encourage spatial smoothness in the generated image.', 'start': 3485.893, 'duration': 2.782}, {'end': 3494.448, 'text': 'So now, if we do this idea of feature inversion, so this visualization, here on the left,', 'start': 3489.867, 'duration': 4.581}, {'end': 3502.451, 'text': "we're showing some original image the elephants or the fruit at the left, and then we run the image through a VGG16 network,", 'start': 3494.448, 'duration': 8.003}], 'summary': 'Total variation regularizer promotes spatial smoothness in image generation.', 'duration': 31.946, 'max_score': 3470.505, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY3470505.jpg'}, {'end': 3776.532, 'src': 'heatmap', 'start': 3684.599, 'weight': 0.817, 'content': [{'end': 3689.783, 'text': 'So in order to perform neural texture synthesis, they used this concept of a Gram matrix.', 'start': 3684.599, 'duration': 5.184}, {'end': 3696.21, 'text': "So what we're going to do is we're going to take our input texture in this case some pictures of rocks,", 'start': 3690.484, 'duration': 5.726}, {'end': 
3703.838, 'text': 'and then take that input texture and pass it through some convolutional neural network and pull out the convolutional features at some layer of the network.', 'start': 3696.21, 'duration': 7.628}, {'end': 3713.047, 'text': "So maybe then this convolutional feature volume that we've talked about might be h by w by c, or sorry, C by H by W at that layer of the network.", 'start': 3704.258, 'duration': 8.789}, {'end': 3716.39, 'text': 'So you can think of this as an H by W spatial grid,', 'start': 3713.427, 'duration': 2.963}, {'end': 3722.715, 'text': 'and at each point of the grid we have this C dimensional feature vector describing kind of the rough appearance of that image at that point.', 'start': 3716.39, 'duration': 6.325}, {'end': 3729.557, 'text': "And now we're gonna use this activation map to compute a descriptor of the texture of this input image.", 'start': 3724.212, 'duration': 5.345}, {'end': 3734.822, 'text': "So what we're going to do is pick out two of these different feature columns in the input volume.", 'start': 3729.997, 'duration': 4.825}, {'end': 3742.73, 'text': 'Each of these feature columns will be a C-dimensional vector and now take the outer product between those two vectors to give us a C by C matrix.', 'start': 3735.162, 'duration': 7.568}, {'end': 3749.176, 'text': 'This C by C matrix now tells us something about the co-occurrence of the different features at those two points in the image.', 'start': 3743.31, 'duration': 5.866}, {'end': 3759.725, 'text': 'So if element i, j in the C by C matrix is large, that means that both elements i and j of those two input vectors were large and something like that.', 'start': 3750.442, 'duration': 9.283}, {'end': 3768.688, 'text': 'So this somehow captures some second order statistics about which features in that feature map tend to activate together at different spatial positions.', 'start': 3760.105, 'duration': 8.583}, {'end': 3776.532, 'text': "And now we're going to repeat 
this procedure using all different pairs of feature vectors from all different points.", 'start': 3771.469, 'duration': 5.063}], 'summary': 'Neural texture synthesis uses gram matrix to extract texture descriptors from image features.', 'duration': 91.933, 'max_score': 3684.599, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY3684599.jpg'}, {'end': 3941.517, 'src': 'embed', 'start': 3911.384, 'weight': 2, 'content': [{'end': 3916.529, 'text': 'Go forward, compute your gram matrices, compute your losses, back prop, get gradients on the image, and repeat.', 'start': 3911.384, 'duration': 5.145}, {'end': 3921.835, 'text': "And once you do this, eventually you'll end up generating a texture that matches your input texture quite nicely.", 'start': 3917.03, 'duration': 4.805}, {'end': 3929.333, 'text': 'So this was all from a NIPS 2015 paper by a group in Germany, and they had some really cool results for texture synthesis.', 'start': 3922.591, 'duration': 6.742}, {'end': 3934.715, 'text': "So here on the top, we're showing four different input textures and now on the bottom,", 'start': 3929.913, 'duration': 4.802}, {'end': 3941.517, 'text': "we're doing this texture synthesis approach by gram matrix matching,", 'start': 3934.715, 'duration': 6.802}], 'summary': 'Texture synthesis using gram matrix matching from nips 2015 paper by a group in germany resulted in generating textures that match input textures nicely.', 'duration': 30.133, 'max_score': 3911.384, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY3911384.jpg'}, {'end': 3977.568, 'src': 'embed', 'start': 3951.103, 'weight': 4, 'content': [{'end': 3956.307, 'text': "we generally get splotches of the right colors, but the overall spatial structure doesn't get preserved so much.", 'start': 3951.103, 'duration': 5.204}, {'end': 3963.233, 'text': 'And now, as we move to down farther in the image and you 
compute these Gram matrices at higher layers,', 'start': 3956.868, 'duration': 6.365}, {'end': 3966.456, 'text': 'you see that they tend to reconstruct larger patterns from the input image.', 'start': 3963.233, 'duration': 3.223}, {'end': 3969.039, 'text': 'For example, these whole rocks or these whole cranberries.', 'start': 3966.877, 'duration': 2.162}, {'end': 3977.568, 'text': 'And now this works pretty well, that now we can synthesize these new images that kind of match the general spatial statistics of the inputs,', 'start': 3970.019, 'duration': 7.549}], 'summary': 'Gram matrices at higher layers reconstruct larger patterns from the input image, preserving general spatial statistics.', 'duration': 26.465, 'max_score': 3951.103, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY3951103.jpg'}], 'start': 3258.22, 'title': 'Deep dream visualizations and neural texture synthesis', 'summary': 'Explains the generation of deep dream visualizations using a network trained on imagenet and mit places datasets, discussing feature inversion, regularizers, and the concept of neural texture synthesis. 
it also explores texture synthesis, gram matrix matching, and the impressive results of a nips 2015 paper.', 'chapters': [{'end': 3488.675, 'start': 3258.22, 'title': 'Deep dream visualizations', 'summary': 'Explains how deep dream visualizations are generated using a network trained on imagenet and mit places datasets, resulting in hallucinated dog-like features and scenic elements, and it also discusses the concept of feature inversion and the use of regularizers in the process.', 'duration': 230.455, 'highlights': ['The network trained on ImageNet, with 200 categories of dogs, often hallucinates dog-like features in visualizations, morphed with other animals.', 'Deep Dream can produce visually striking images by performing multi-scale processing and running the procedure on images at different scales.', 'Using a network trained on MIT Places dataset, deep dream visualizations capture scenic elements like roof shapes, Japanese-style buildings, bridges, and mountain ranges.', 'Feature inversion involves reconstructing an image from its feature representation, providing insights into the information captured at different layers of the network, and it involves minimizing the distance between the cached feature vector and computed features of the generated image using gradient ascent.', 'The use of the total variation regularizer in feature inversion aims to encourage spatial smoothness in the generated image by penalizing differences between adjacent pixels in both horizontal and vertical directions.']}, {'end': 4012.849, 'start': 3489.867, 'title': 'Neural texture synthesis and style transfer', 'summary': 'Explores feature inversion, texture synthesis, and neural texture synthesis, showing how high network layers preserve spatial structure and semantic information, and how gram matrix matching can be used for texture synthesis, resulting in a nips 2015 paper with impressive results.', 'duration': 522.982, 'highlights': ['The NIPS 2015 paper presented a novel 
approach utilizing gram matrix matching for texture synthesis, achieving impressive results for synthesizing new images matching the spatial statistics of input textures. The NIPS 2015 paper introduced a method using Gram matrix matching and gradient descent to synthesize new images matching the spatial statistics of input textures, yielding impressive results for texture synthesis.', 'The visualization demonstrates how higher layers of the neural network preserve the general spatial structure of the input images but lose low-level details, indicating the preservation of semantic information and small changes in color and texture. The visualization shows that higher layers of the neural network preserve the general spatial structure of input images while losing low-level details, suggesting the preservation of semantic information and small changes in color and texture.', 'The chapter discusses the concept of feature inversion and its application in reconstructing images based on features from different layers of the VGG16 network, revealing the amount of information stored in the features at different layers. 
The chapter explores feature inversion, demonstrating the reconstruction of images based on features from different layers of the VGG16 network to understand the information stored in the features at different layers.']}], 'duration': 754.629, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY3258220.jpg', 'highlights': ['Deep Dream can produce visually striking images by performing multi-scale processing and running the procedure on images at different scales.', 'Using a network trained on MIT Places dataset, deep dream visualizations capture scenic elements like roof shapes, Japanese-style buildings, bridges, and mountain ranges.', 'The NIPS 2015 paper presented a novel approach utilizing gram matrix matching for texture synthesis, achieving impressive results for synthesizing new images matching the spatial statistics of input textures.', 'The use of the total variation regularizer in feature inversion aims to encourage spatial smoothness in the generated image by penalizing differences between adjacent pixels in both horizontal and vertical directions.', 'The visualization demonstrates how higher layers of the neural network preserve the general spatial structure of the input images but lose low-level details, indicating the preservation of semantic information and small changes in color and texture.']}, {'end': 4544.694, 'segs': [{'end': 4069.102, 'src': 'embed', 'start': 4045.624, 'weight': 0, 'content': [{'end': 4054.031, 'text': 'And now something really interesting happens when you combine this idea of texture synthesis by gram matrix matching with feature inversion by feature matching.', 'start': 4045.624, 'duration': 8.407}, {'end': 4057.934, 'text': 'And then this brings us to this really cool algorithm called style transfer.', 'start': 4054.492, 'duration': 3.442}, {'end': 4062.257, 'text': "So in style transfer, we're gonna take two images as input.", 'start': 4058.875, 'duration': 3.382}, {'end': 
4069.102, 'text': "One, we're gonna take a content image that will guide like what type of thing we want, what we generally want our output to look like.", 'start': 4062.798, 'duration': 6.304}], 'summary': 'Combining texture synthesis and feature inversion leads to style transfer algorithm using two images as input.', 'duration': 23.478, 'max_score': 4045.624, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY4045624.jpg'}, {'end': 4101.916, 'src': 'heatmap', 'start': 4045.624, 'weight': 2, 'content': [{'end': 4054.031, 'text': 'And now something really interesting happens when you combine this idea of texture synthesis by gram matrix matching with feature inversion by feature matching.', 'start': 4045.624, 'duration': 8.407}, {'end': 4057.934, 'text': 'And then this brings us to this really cool algorithm called style transfer.', 'start': 4054.492, 'duration': 3.442}, {'end': 4062.257, 'text': "So in style transfer, we're gonna take two images as input.", 'start': 4058.875, 'duration': 3.382}, {'end': 4069.102, 'text': "One, we're gonna take a content image that will guide like what type of thing we want, what we generally want our output to look like.", 'start': 4062.798, 'duration': 6.304}, {'end': 4074.705, 'text': 'Also a style image that will tell us what is the general texture or style that we want our generated image to have.', 'start': 4069.742, 'duration': 4.963}, {'end': 4080.349, 'text': 'And then we will jointly do feature we will generate a new image by minimizing the feature.', 'start': 4075.366, 'duration': 4.983}, {'end': 4083.371, 'text': 'reconstruction loss of the content image and the gram matrix.', 'start': 4080.349, 'duration': 3.022}, {'end': 4084.932, 'text': 'reconstruction loss of the style image.', 'start': 4083.371, 'duration': 1.561}, {'end': 4093.444, 'text': 'And when we do these two things, we get a really cool image that kind of renders the content image kind of in the 
artistic style of the style image.', 'start': 4085.532, 'duration': 7.912}, {'end': 4095.206, 'text': 'And now this is really cool.', 'start': 4094.285, 'duration': 0.921}, {'end': 4097.529, 'text': 'And you can get these really beautiful figures.', 'start': 4096.046, 'duration': 1.483}, {'end': 4101.916, 'text': "So again, what this kind of looks like is that you'll take your style image and your content image,", 'start': 4098.229, 'duration': 3.687}], 'summary': 'Combining texture synthesis and feature matching leads to style transfer, creating cool artistic images.', 'duration': 56.292, 'max_score': 4045.624, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY4045624.jpg'}, {'end': 4165.345, 'src': 'embed', 'start': 4134.736, 'weight': 1, 'content': [{'end': 4139.718, 'text': "So in Deep Dream, you don't have a lot of control about exactly what types of things are gonna happen coming out at the end.", 'start': 4134.736, 'duration': 4.982}, {'end': 4145.481, 'text': 'You just kind of pick different layers of the networks, maybe set different numbers of iterations, and then dog slugs pop up everywhere.', 'start': 4140.438, 'duration': 5.043}, {'end': 4151.341, 'text': 'But with style transfer, you get a lot more fine-grained control over what you want the result to look like.', 'start': 4146.441, 'duration': 4.9}, {'end': 4158.304, 'text': 'By now picking different style images with the same content image, you can generate a whole different types of results, which is really cool.', 'start': 4151.381, 'duration': 6.923}, {'end': 4165.345, 'text': "Also, you can play around with the hyperparameters here, because we're minimizing this feature.", 'start': 4159.084, 'duration': 6.261}], 'summary': 'Deep dream offers limited control, while style transfer provides more fine-grained control over image results and allows for generation of different outputs by selecting different style images.', 'duration': 30.609, 
'max_score': 4134.736, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY4134736.jpg'}, {'end': 4252.702, 'src': 'embed', 'start': 4225.041, 'weight': 4, 'content': [{'end': 4228.604, 'text': 'And you can do a similar type of multi-scale processing in style transfer as well.', 'start': 4225.041, 'duration': 3.563}, {'end': 4233.888, 'text': 'So then we can compute images like this that are super high resolution.', 'start': 4229.244, 'duration': 4.644}, {'end': 4239.572, 'text': 'This is I think a 4K image of our favorite school rendered in the style of Starry Night.', 'start': 4233.908, 'duration': 5.664}, {'end': 4242.594, 'text': 'But this is actually super expensive to compute.', 'start': 4240.833, 'duration': 1.761}, {'end': 4246.257, 'text': 'I think this one took four GPUs, so a little expensive.', 'start': 4242.614, 'duration': 3.643}, {'end': 4252.702, 'text': 'We can also do other style images and get some really cool results from the same content image, again at high resolution.', 'start': 4246.977, 'duration': 5.725}], 'summary': 'Multi-scale style transfer can produce high-resolution images, e.g., a 4k image of a school in the style of starry night, but it is computationally expensive, requiring four gpus.', 'duration': 27.661, 'max_score': 4225.041, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY4225041.jpg'}, {'end': 4368.126, 'src': 'heatmap', 'start': 4314.841, 'weight': 0.775, 'content': [{'end': 4321.603, 'text': "So I had a paper about this last year and the idea is that we're gonna fix some style that we care about at the beginning,", 'start': 4314.841, 'duration': 6.762}, {'end': 4327.985, 'text': 'in this case starry night and now, rather than running a separate optimization procedure for each image that we want to synthesize,', 'start': 4321.603, 'duration': 6.382}, {'end': 4334.807, 'text': "instead we're going to train a single 
feedforward network that can input the content image and then directly output the stylized result.", 'start': 4327.985, 'duration': 6.822}, {'end': 4346.312, 'text': 'And now the way that we train this network is that we compute these same content and style losses during training of our feedforward network and use that same gradient to update the weights of the feedforward network.', 'start': 4335.687, 'duration': 10.625}, {'end': 4352.895, 'text': "And now this thing takes maybe a few hours to train, but once it's trained, then, in order to produce stylized images,", 'start': 4346.732, 'duration': 6.163}, {'end': 4355.256, 'text': 'you just need to do a single forward pass through the trained network.', 'start': 4352.895, 'duration': 2.361}, {'end': 4365.744, 'text': 'So I have code for this online and you can see that it ends up looking about relatively comparable quality in some cases to this very slow optimization based method.', 'start': 4356.056, 'duration': 9.688}, {'end': 4368.126, 'text': "but now it runs in real time it's about a thousand times faster.", 'start': 4365.744, 'duration': 2.382}], 'summary': 'A feedforward network produces stylized images a thousand times faster than optimization-based methods, with comparable quality.', 'duration': 53.285, 'max_score': 4314.841, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY4314841.jpg'}, {'end': 4355.256, 'src': 'embed', 'start': 4335.687, 'weight': 3, 'content': [{'end': 4346.312, 'text': 'And now the way that we train this network is that we compute these same content and style losses during training of our feedforward network and use that same gradient to update the weights of the feedforward network.', 'start': 4335.687, 'duration': 10.625}, {'end': 4352.895, 'text': "And now this thing takes maybe a few hours to train, but once it's trained, then, in order to produce stylized images,", 'start': 4346.732, 'duration': 6.163}, {'end': 4355.256, 
'text': 'you just need to do a single forward pass through the trained network.', 'start': 4352.895, 'duration': 2.361}], 'summary': 'Feedforward network trained in a few hours to create stylized images efficiently.', 'duration': 19.569, 'max_score': 4335.687, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY4335687.jpg'}, {'end': 4444.462, 'src': 'embed', 'start': 4417.571, 'weight': 5, 'content': [{'end': 4423.514, 'text': 'The only difference is that now this final layer produces a three channel output for the RGB of that final image.', 'start': 4417.571, 'duration': 5.943}, {'end': 4427.937, 'text': 'And inside this network we have batch normalization in the various layers.', 'start': 4425.075, 'duration': 2.862}, {'end': 4433.46, 'text': 'But in this paper they introduce, they swap out the batch normalization for something else called instance normalization.', 'start': 4428.477, 'duration': 4.983}, {'end': 4434.821, 'text': 'It tends to give you much better results.', 'start': 4433.54, 'duration': 1.281}, {'end': 4444.462, 'text': 'So one drawback of these types of methods is that we are now training one new style transfer network for every style that we want to apply.', 'start': 4435.998, 'duration': 8.464}], 'summary': 'Using instance normalization improves results and requires training one network per style.', 'duration': 26.891, 'max_score': 4417.571, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY4417571.jpg'}], 'start': 4012.849, 'title': 'Style transfer techniques', 'summary': 'Explains artistic style transfer algorithm combining texture synthesis and feature inversion, allowing for fine-grained control over the output. 
It also discusses various techniques for style transfer, including a feedforward approach that runs about a thousand times faster than optimization-based methods with comparable quality in some cases.', 'chapters': [{'end': 4151.341, 'start': 4012.849, 'title': 'Artistic style transfer algorithm', 'summary': 'Explains the process of combining texture synthesis and feature inversion to create a style transfer algorithm, which allows for the generation of beautiful images that render the content in the artistic style of a chosen image, with fine-grained control over the output.', 'duration': 138.492, 'highlights': ['The combination of texture synthesis by gram matrix matching with feature inversion by feature matching results in a cool algorithm called style transfer. By combining texture synthesis with feature inversion, the algorithm of style transfer is created, allowing for the rendering of content in the artistic style of a chosen image.', 'The style transfer algorithm provides more control over the generated image compared to Deep Dream, allowing for fine-grained control over the desired result. The style transfer algorithm offers a higher level of control over the output image in comparison to Deep Dream, providing greater precision in achieving the desired visual outcome.', 'The style transfer algorithm allows for the generation of beautiful images by jointly minimizing the feature reconstruction loss of the content image and the gram matrix reconstruction loss of the style image. 
Through the joint minimization of feature and gram matrix reconstruction losses, the style transfer algorithm enables the creation of visually appealing images that incorporate both content and artistic style.']}, {'end': 4544.694, 'start': 4151.381, 'title': 'Style transfer and deep dream techniques', 'summary': 'Discusses various hyperparameters and techniques for style transfer, including multi-scale processing, real-time feedforward networks for style transfer, and instance normalization; the feedforward approach runs about a thousand times faster than the optimization-based methods with comparable quality in some cases.', 'duration': 393.313, 'highlights': ['Real-time feedforward network for style transfer: Introducing a real-time feedforward network for style transfer that runs a thousand times faster and produces comparable quality results to the slow optimization-based methods.', 'Multi-scale processing in style transfer: Multi-scale processing can generate super high-resolution images, although it is computationally expensive.', 'Instance normalization for better results: Replacing batch normalization with instance normalization in feedforward networks tends to give much better results.']}], 'duration': 531.845, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/6wcs6szJWMY/pics/6wcs6szJWMY4012849.jpg', 'highlights': ['The style transfer algorithm combines texture synthesis with feature inversion, allowing for rendering content in the artistic style of a chosen image.', 'The style transfer algorithm offers a higher level of control over the output image compared to Deep Dream, providing greater precision in achieving the desired visual outcome.', 'The joint minimization of feature and gram matrix reconstruction losses enables the creation of visually appealing images that incorporate both content and artistic style.', 'Introducing a real-time feedforward network for style transfer that runs a thousand times faster and produces comparable 
quality results to the slow optimization-based methods.', 'Multi-scale processing in style transfer can generate super high-resolution images, although it is computationally expensive.', 'Replacing batch normalization with instance normalization in feedforward networks tends to give much better results.']}], 'highlights': ['The lecture covers visualizing convolutional networks, computer vision tasks, visualizing neural network features, image analysis in neural networks, visualization techniques, image synthesis, deep dream visualizations, neural texture synthesis, and style transfer techniques with practical examples and insights.', 'The methods for addressing the problem of object detection were discussed, including R-CNN, Fast R-CNN, Faster R-CNN, and single shot detection methods.', 'The application of t-SNE for nonlinear dimensionality reduction to visualize feature spaces provides insights into clustering and semantic similarity, as demonstrated by the natural clusters appearing when applied to the MNIST dataset and the visualization of the learned feature space.', 'Guided backpropagation computes the gradient of intermediate values in a network with respect to the pixels of the image, resulting in cleaner images compared to normal backpropagation.', 'Adding regularizers improves the quality of generated images, making them more recognizable; for example, flamingos, ground beetles, black swans, and billiard tables become clearer and more defined.', 'Deep Dream can produce visually striking images by performing multi-scale processing and running the procedure on images at different scales.', 'The NIPS 2015 paper presented a novel approach utilizing gram matrix matching for texture synthesis, achieving impressive results for synthesizing new images matching the spatial statistics of input textures.', 'The style transfer algorithm combines texture synthesis with feature inversion, allowing for rendering content in the artistic style of a chosen image.']}
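
The losses named in the highlights above (feature reconstruction loss for content, Gram matrix reconstruction loss for style, and the total variation regularizer) can be sketched roughly as follows. This is a minimal NumPy illustration, not the lecture's or any paper's code: the function names, the (C, H, W) feature layout, the Gram normalization by C*H*W, the squared-difference form of the penalties, and the weight parameters are all assumptions made for the example, and the feature arrays stand in for activations that would normally come from a pretrained network such as VGG16.

```python
import numpy as np


def gram_matrix(features):
    """Gram matrix of a (C, H, W) feature map: channel-by-channel co-activation,
    summed over all spatial positions. Normalizing by C*H*W is an assumed choice."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)


def content_loss(gen_feats, content_feats):
    """Feature reconstruction loss: squared distance between feature maps."""
    return float(np.sum((gen_feats - content_feats) ** 2))


def style_loss(gen_feats, style_feats):
    """Gram matrix reconstruction loss: squared distance between Gram matrices."""
    diff = gram_matrix(gen_feats) - gram_matrix(style_feats)
    return float(np.sum(diff ** 2))


def total_variation(image):
    """Penalize differences between vertically and horizontally adjacent pixels
    of a (C, H, W) image (squared differences are an assumed variant)."""
    dh = image[:, 1:, :] - image[:, :-1, :]
    dw = image[:, :, 1:] - image[:, :, :-1]
    return float(np.sum(dh ** 2) + np.sum(dw ** 2))


def style_transfer_objective(gen_content_feats, content_feats,
                             gen_style_feats, style_feats, image,
                             content_weight=1.0, style_weight=1.0, tv_weight=0.0):
    """Weighted sum of the three terms; the weights are hypothetical hyperparameters."""
    return (content_weight * content_loss(gen_content_feats, content_feats)
            + style_weight * style_loss(gen_style_feats, style_feats)
            + tv_weight * total_variation(image))
```

One property worth noticing: because the Gram matrix sums over spatial positions, shuffling the spatial locations of a feature map leaves the style loss at (numerically) zero while the content loss becomes large. That is consistent with the description above of Gram matching capturing texture statistics while feature matching preserves spatial structure.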