title
Lecture 3 | Loss Functions and Optimization

description
Lecture 3 continues our discussion of linear classifiers. We introduce the idea of a loss function to quantify our unhappiness with a model’s predictions, and discuss two commonly used loss functions for image classification: the multiclass SVM loss and the multinomial logistic regression loss. We introduce the idea of regularization as a mechanism to fight overfitting, with weight decay as a concrete example. We introduce the idea of optimization and the stochastic gradient descent algorithm. We also briefly discuss the use of feature representations in computer vision.

Keywords: Image classification, linear classifiers, SVM loss, regularization, multinomial logistic regression, optimization, stochastic gradient descent

Slides: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture3.pdf

--------------------------------------------------------------------------------------

Convolutional Neural Networks for Visual Recognition

Instructors:
Fei-Fei Li: http://vision.stanford.edu/feifeili/
Justin Johnson: http://cs.stanford.edu/people/jcjohns/
Serena Yeung: http://ai.stanford.edu/~syyeung/

Computer Vision has become ubiquitous in our society, with applications in search, image understanding, apps, mapping, medicine, drones, and self-driving cars. Core to many of these applications are visual recognition tasks such as image classification, localization and detection. Recent developments in neural network (aka “deep learning”) approaches have greatly advanced the performance of these state-of-the-art visual recognition systems. This lecture collection is a deep dive into details of the deep learning architectures with a focus on learning end-to-end models for these tasks, particularly image classification. From this lecture collection, students will learn to implement, train and debug their own neural networks and gain a detailed understanding of cutting-edge research in computer vision.

Website: http://cs231n.stanford.edu/

For additional learning opportunities please visit: http://online.stanford.edu/

detail
{'title': 'Lecture 3 | Loss Functions and Optimization', 'heatmap': [{'end': 762.193, 'start': 671.94, 'weight': 0.996}, {'end': 1307.711, 'start': 1208.278, 'weight': 1}, {'end': 2425.13, 'start': 2374.261, 'weight': 0.74}, {'end': 2786.065, 'start': 2687.83, 'weight': 0.751}], 'summary': 'CS231n Lecture 3 covers administrative updates on assignments and project ideas, challenges of image recognition including use of k-nearest neighbor classifier, computation of multi-class SVM loss yielding a total loss of 5.3, significance of regularization in machine learning, types of regularization, loss functions in deep learning, optimization methods for deep learning, and the use of feature representations and transforms for image classification.', 'chapters': [{'end': 138.207, 'segs': [{'end': 51.441, 'src': 'embed', 'start': 27.343, 'weight': 0, 'content': [{'end': 33.568, 'text': "And since we were a little bit late in getting this assignment out to you guys, we've decided to change the due date to Thursday,", 'start': 27.343, 'duration': 6.225}, {'end': 35.809, 'text': 'April 20th at 11:59', 'start': 33.568, 'duration': 2.241}, {'end': 42.655, 'text': 'p.m. This will give you a full two weeks from the assignment release date to go and actually finish and work on it.', 'start': 35.809, 'duration': 6.846}, {'end': 47.859, 'text': "So we'll update the syllabus for this new due date a little bit later today.", 'start': 43.115, 'duration': 4.744}, {'end': 51.441, 'text': 'And as a reminder, when you complete the assignment,', 'start': 49.375, 'duration': 2.066}], 'summary': 'Due date changed to April 20th, 11:59 p.m., providing two weeks from release.', 'duration': 24.098, 'max_score': 27.343, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc27343.jpg'}, {'end': 138.207, 'src': 'embed', 'start': 71.708, 'weight': 1, 'content': [{'end': 78.37, 'text': 'So we went out and solicited examples of project ideas from various people in the Stanford community or affiliated to Stanford.', 'start': 71.708, 'duration': 6.662}, {'end': 84.353, 'text': 'And they came up with some interesting suggestions for what they want students in the class to work on.', 'start': 79.19, 'duration': 5.163}, {'end': 89.519, 'text': 'So check out this pinned post on Piazza and if you want to work on any of these projects,', 'start': 84.914, 'duration': 4.605}, {'end': 93.043, 'text': 'then feel free to contact the project mentors directly about these things.', 'start': 89.519, 'duration': 3.524}, {'end': 97.207, 'text': "Additionally, we've posted office hours on the course website.", 'start': 94.584, 'duration': 2.623}, {'end': 102.954, 'text': "This is a Google calendar, so this is something that people have been asking about, and now it's up there.", 'start': 97.328, 'duration': 5.626}, {'end': 107.817, 'text': 'The final administrative note is about Google Cloud.', 'start': 105.296, 'duration': 2.521}, {'end': 112.638, 'text': "As a reminder, because we're supported by Google Cloud in this class,", 'start': 108.577, 'duration': 4.061}, {'end': 118.119, 'text': "we're able to give each of you an additional $100 credit for Google Cloud to work on your assignments and projects.", 'start': 112.638, 'duration': 5.481}, {'end': 123.76, 'text': 'And the exact details of how to redeem that credit will go out later today, most likely on Piazza.', 'start': 119.159, 'duration': 4.601}, {'end': 130.401, 'text': "So I guess if there's no questions about administrative stuff, then we'll 
move on to course content.", 'start': 125.54, 'duration': 4.861}, {'end': 134.242, 'text': 'Okay, cool.', 'start': 133.782, 'duration': 0.46}, {'end': 138.207, 'text': 'So recall from last time in lecture two,', 'start': 135.925, 'duration': 2.282}], 'summary': 'Solicited project ideas, offered $100 google cloud credit, and announced office hours.', 'duration': 66.499, 'max_score': 71.708, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc71708.jpg'}], 'start': 4.898, 'title': 'Cs231n lecture 3: loss functions & optimization', 'summary': 'Covers administrative updates on assignment due date, project ideas, office hours, and google cloud credits, including a two-week extension for assignment one and $100 credit for google cloud.', 'chapters': [{'end': 138.207, 'start': 4.898, 'title': 'Cs231n lecture 3: loss functions & optimization', 'summary': 'Includes administrative updates on assignment due date, project ideas, office hours, and google cloud credits, providing a two-week extension for assignment one and $100 credit for google cloud.', 'duration': 133.309, 'highlights': ['Stanford University offering a two-week extension for assignment one, moving the due date to Thursday, April 20th at 11.59 p.m.', 'Providing $100 credit for Google Cloud to work on assignments and projects, with details on redemption to be communicated via Piazza.', 'Highlighting several example project ideas from the Stanford community on Piazza, encouraging students to contact project mentors directly if interested.', "Posting office hours on the course website, addressing students' previous inquiries about access to office hours."]}], 'duration': 133.309, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc4898.jpg', 'highlights': ['Stanford University offering a two-week extension for assignment one, due date: Thursday, April 20th at 11.59 p.m.', 'Providing $100 credit for Google Cloud to work on assignments and projects, details on redemption via Piazza.', 'Highlighting several example project ideas from the Stanford community on Piazza.', "Posting office hours on the course website, addressing students' previous inquiries."]}, {'end': 779.564, 'segs': [{'end': 166.472, 'src': 'embed', 'start': 138.207, 'weight': 0, 'content': [{'end': 143.751, 'text': 'we were really talking about the challenges of recognition and trying to hone in on this idea of a data-driven approach.', 'start': 138.207, 'duration': 5.544}, {'end': 148.214, 'text': "We talked about this idea of image classification, talked about why it's hard.", 'start': 144.571, 'duration': 3.643}, {'end': 155.479, 'text': "There's this semantic gap between the giant grid of numbers that the computer sees and the actual image that you see.", 'start': 148.314, 'duration': 7.165}, {'end': 162.044, 'text': 'We talked about various challenges regarding this around illumination, deformation, et cetera, and why this is actually a really, really hard problem.', 'start': 155.939, 'duration': 6.105}, {'end': 166.472, 'text': "even though it's super easy for people to do with their human eyes and human visual system.", 'start': 162.384, 'duration': 4.088}], 'summary': 'Challenges of image recognition explained; data-driven approach emphasized.', 'duration': 28.265, 'max_score': 138.207, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc138207.jpg'}, {'end': 203.542, 'src': 'embed', 'start': 175.487, 'weight': 
2, 'content': [{'end': 181.091, 'text': 'We talked about the CIFAR-10 dataset, where you can see an example of these images on the upper left here,', 'start': 175.487, 'duration': 5.604}, {'end': 185.194, 'text': 'where CIFAR-10 gives you these 10 different categories airplane, automobile, whatnot.', 'start': 181.091, 'duration': 4.103}, {'end': 194.982, 'text': 'And we talked about how the k-nearest neighbor classifier can be used to learn decision boundaries to separate these data points into classes based on the training data.', 'start': 186.055, 'duration': 8.927}, {'end': 203.542, 'text': 'This also led us to a discussion of the idea of cross-validation and setting hyperparameters by dividing your data into trained validation and test sets.', 'start': 195.997, 'duration': 7.545}], 'summary': 'Discussed cifar-10 dataset with 10 categories, k-nearest neighbor classifier, and cross-validation for setting hyperparameters.', 'duration': 28.055, 'max_score': 175.487, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc175487.jpg'}, {'end': 294.6, 'src': 'embed', 'start': 271.805, 'weight': 1, 'content': [{'end': 281.322, 'text': 'means the classifier thinks that the cat is more likely for that image and lower values for maybe the dog or car class indicate lower probabilities of those classes being present in the image.', 'start': 271.805, 'duration': 9.517}, {'end': 286.433, 'text': 'Also. so I think this point was a little bit unclear last time.', 'start': 282.791, 'duration': 3.642}, {'end': 294.6, 'text': 'that linear classification has this interpretation as learning templates per class, where, if you look at the diagram on the lower left,', 'start': 286.433, 'duration': 8.167}], 'summary': 'The classifier assigns higher probabilities to the cat class and lower probabilities to the dog or car classes in the image.', 'duration': 22.795, 'max_score': 271.805, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc271805.jpg'}, {'end': 478.22, 'src': 'embed', 'start': 446.784, 'weight': 4, 'content': [{'end': 450.608, 'text': 'we need some way to quantify the badness of any particular W.', 'start': 446.784, 'duration': 3.824}, {'end': 455.391, 'text': 'And this function that takes in a W,', 'start': 451.649, 'duration': 3.742}, {'end': 461.495, 'text': "looks at the scores and then tells us how bad quantitatively is that W is something that we'll call a loss function.", 'start': 455.391, 'duration': 6.104}, {'end': 468.659, 'text': "And in this lecture we'll see a couple examples of different loss functions that you can use for this image classification problem.", 'start': 461.515, 'duration': 7.144}, {'end': 478.22, 'text': "So then, once we've got this idea of a loss function, this allows us to quantify for any given value of w, how good or bad is it?", 'start': 470.097, 'duration': 8.123}], 'summary': 'Quantify the badness of w using loss function for image classification.', 'duration': 31.436, 'max_score': 446.784, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc446784.jpg'}, {'end': 671.94, 'src': 'embed', 'start': 641.385, 'weight': 3, 'content': [{'end': 648.128, 'text': 'So, as a first example of a concrete loss function that is a nice thing to work with.', 'start': 641.385, 'duration': 6.743}, {'end': 652.049, 'text': "in image classification we'll talk about the multi-class SVM loss.", 'start': 648.128, 'duration': 3.921}, 
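
To make the setup described in this chapter concrete, here is a minimal NumPy sketch (not code shown in the lecture; the function and variable names are assumptions for illustration) of the two pieces just introduced: a linear score function f(x, W) = Wx that produces one score per class, and a dataset-level loss that averages a per-example loss L_i(f(x_i, W), y_i) over the training set. The per-example losses themselves (multi-class SVM and softmax) are sketched further down this page and can be passed in as `per_example_loss`.

```python
import numpy as np

def linear_scores(W, x):
    """Scores of a linear classifier: one score per class for a single example.
    W has shape (num_classes, num_features); x is a flattened image of shape (num_features,)."""
    return W.dot(x)

def dataset_loss(W, X, y, per_example_loss):
    """Average the per-example loss L_i(f(x_i, W), y_i) over the whole training set.
    X has shape (num_examples, num_features); y holds the integer label of each example;
    per_example_loss maps (scores, correct_class) to a scalar."""
    losses = [per_example_loss(linear_scores(W, X[i]), y[i]) for i in range(X.shape[0])]
    return float(np.mean(losses))
```
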
{'end': 664.514, 'text': 'You may have seen the binary SVM or support vector machine in CS229, and the multi-class SVM is a generalization of that to handle multiple classes.', 'start': 653.009, 'duration': 11.505}, {'end': 670.899, 'text': 'In the binary SVM case, as you may have seen in 229, you only had two classes.', 'start': 666.615, 'duration': 4.284}, {'end': 671.94, 'text': 'Each example.', 'start': 671.139, 'duration': 0.801}], 'summary': 'Introduction to multi-class svm loss for image classification.', 'duration': 30.555, 'max_score': 641.385, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc641385.jpg'}, {'end': 762.193, 'src': 'heatmap', 'start': 671.94, 'weight': 0.996, 'content': [{'end': 677.384, 'text': 'x was gonna be classified as either a positive or negative example, but now we have 10 categories,', 'start': 671.94, 'duration': 5.444}, {'end': 680.386, 'text': 'so we need to generalize this notion to handle multiple classes.', 'start': 677.384, 'duration': 3.002}, {'end': 689.21, 'text': "So this loss function has kind of a funny functional form, so we'll walk through it in quite a bit of detail over the next couple of slides.", 'start': 681.227, 'duration': 7.983}, {'end': 695.032, 'text': 'But what this is saying is that the loss li for any individual example.', 'start': 689.95, 'duration': 5.082}, {'end': 702.375, 'text': "the way we'll compute it is we're gonna perform a sum over all of the categories y except for the true category yi.", 'start': 695.032, 'duration': 7.343}, {'end': 712.084, 'text': "So we're gonna sum over all the incorrect categories and then we're gonna compare the score of the correct category and the score of the incorrect category.", 'start': 703.635, 'duration': 8.449}, {'end': 722.656, 'text': 'And now, if the score for the correct category is greater than the score of the incorrect category, greater than the incorrect score,', 'start': 712.745, 'duration': 9.911}, {'end': 725.318, 'text': 'by some safety margin that we set to one?', 'start': 722.656, 'duration': 2.662}, {'end': 731.941, 'text': "If that's the case, that means that the true score is much, or the score for the true category is.", 'start': 725.979, 'duration': 5.962}, {'end': 737.002, 'text': "if it's much larger than any of the false categories, then we'll get a loss of zero.", 'start': 731.941, 'duration': 5.061}, {'end': 746.986, 'text': "And we'll sum this up over all of the incorrect categories for our image, and this will give us our final loss for this one example in the data set.", 'start': 737.703, 'duration': 9.283}, {'end': 751.167, 'text': "And again, we'll take the average of this loss over the whole training data set.", 'start': 747.426, 'duration': 3.741}, {'end': 762.193, 'text': 'So this kind of gets this kind of like if-then statement, like if the true class is class score is much larger than the others.', 'start': 752.306, 'duration': 9.887}], 'summary': 'The loss function handles multiple classes, computing a loss for each example and averaging it over the training dataset.', 'duration': 90.253, 'max_score': 671.94, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc671940.jpg'}], 'start': 138.207, 'title': 'Image recognition challenges', 'summary': 'Discusses challenges of image recognition, including difficulties in classification, use of k-nearest neighbor classifier, cifar-10 dataset, cross-validation, and linear classification. 
it explains interpretation of linear classification, quantifying classifier performance using loss functions, and introduces multi-class svm loss as an example for image classification.', 'chapters': [{'end': 253.894, 'start': 138.207, 'title': 'Challenges of data-driven approach', 'summary': 'Discusses the challenges of image recognition, focusing on the data-driven approach, including the difficulties in image classification, the use of k-nearest neighbor classifier, the cifar-10 dataset, cross-validation, and linear classification.', 'duration': 115.687, 'highlights': ['The challenges of image recognition and the data-driven approach', 'Use of k-nearest neighbor classifier and CIFAR-10 dataset', 'Cross-validation and setting hyperparameters', 'Linear classification as a building block towards neural networks']}, {'end': 779.564, 'start': 253.894, 'title': 'Linear classification and loss functions', 'summary': 'Explains the interpretation of linear classification, the process of quantifying the performance of the classifier using a loss function, and introduces the multi-class svm loss as a concrete example of a loss function for image classification.', 'duration': 525.67, 'highlights': ['The chapter explains the interpretation of linear classification as learning templates per class and learning linear decision boundaries between pixels in high dimensional space.', 'Introduces the concept of quantifying the performance of the classifier using a loss function and explains the need to determine the least bad value of w using an optimization procedure.', 'Introduces the multi-class SVM loss as a concrete example of a loss function for image classification and provides a detailed explanation of its functional form and computation process.']}], 'duration': 641.357, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc138207.jpg', 'highlights': ['Introduction to challenges of image recognition and data-driven approach', 'Explanation of linear classification and learning templates per class', 'Use of k-nearest neighbor classifier and CIFAR-10 dataset', 'Introduction of multi-class SVM loss as a concrete example for image classification', 'Explanation of quantifying classifier performance using loss functions']}, {'end': 1115.3, 'segs': [{'end': 827.084, 'src': 'embed', 'start': 781.289, 'weight': 2, 'content': [{'end': 789.971, 'text': 'And by the way, this style of loss function where we take max of zero and some other quantity is often referred to as some type of a hinge loss.', 'start': 781.289, 'duration': 8.682}, {'end': 794.332, 'text': 'And this name comes from the shape of the graph when you go and plot it.', 'start': 790.771, 'duration': 3.561}, {'end': 799.013, 'text': 'So here the x-axis corresponds to the syi.', 'start': 794.952, 'duration': 4.061}, {'end': 803.074, 'text': 'That is the score of the true class for some training example.', 'start': 799.393, 'duration': 3.681}, {'end': 805.075, 'text': 'And now the y-axis is the loss.', 'start': 803.474, 'duration': 1.601}, {'end': 812.358, 'text': 'and you can see that as the score for the true category, for the true category for this example, increases,', 'start': 805.615, 'duration': 6.743}, {'end': 820.521, 'text': 'then the loss will go down linearly until we get to this to above this safety margin, and after which the loss will be zero,', 'start': 812.358, 'duration': 8.163}, {'end': 822.782, 'text': "because we've already correctly classified this example.", 'start': 820.521, 
'duration': 2.261}, {'end': 827.084, 'text': "So let's oh question?", 'start': 825.883, 'duration': 1.201}], 'summary': 'The hinge loss function is used for classification, reducing loss as score for true category increases.', 'duration': 45.795, 'max_score': 781.289, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc781289.jpg'}, {'end': 936.605, 'src': 'embed', 'start': 896.697, 'weight': 1, 'content': [{'end': 904.06, 'text': 'And if the true score is not high enough, greater than any of the other scores, then we will incur some loss and that will be bad.', 'start': 896.697, 'duration': 7.363}, {'end': 912.694, 'text': 'So this might make a little bit more sense if you walk through an explicit example for this tiny three example data set.', 'start': 906.829, 'duration': 5.865}, {'end': 919.219, 'text': "So here, remember, I've sort of removed the case-based notation and just switching back to the zero, one notation.", 'start': 913.454, 'duration': 5.765}, {'end': 927.459, 'text': 'And now, if we look at, if we think about computing this multi-class SVM loss for just this first training example on the left,', 'start': 920.134, 'duration': 7.325}, {'end': 930.601, 'text': "then remember we're going to loop over all of the incorrect classes.", 'start': 927.459, 'duration': 3.142}, {'end': 933.463, 'text': 'So for this example, cat is the correct class.', 'start': 931.142, 'duration': 2.321}, {'end': 936.605, 'text': "So we're gonna loop over the car and frog classes.", 'start': 933.964, 'duration': 2.641}], 'summary': 'Training example incurs loss if true score not high enough.', 'duration': 39.908, 'max_score': 896.697, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc896697.jpg'}, {'end': 1063.711, 'src': 'embed', 'start': 1033.127, 'weight': 0, 'content': [{'end': 1036.451, 'text': 'Compare frog and car, incur a lot of loss because the score is very low.', 'start': 1033.127, 'duration': 3.324}, {'end': 1041.471, 'text': 'And then our loss for this example is 12.9.', 'start': 1036.992, 'duration': 4.479}, {'end': 1046.558, 'text': 'And then our final loss for the entire data set is the average of these losses across the different examples.', 'start': 1041.471, 'duration': 5.087}, {'end': 1049.522, 'text': 'So when you sum those out, it comes to about 5.3.', 'start': 1046.738, 'duration': 2.784}, {'end': 1054.629, 'text': "So then it's sort of, this is our quantitative measure that our classifier is 5.3 bad on this data set.", 'start': 1049.522, 'duration': 5.107}, {'end': 1063.711, 'text': "Was there a question? Yeah, the question is how do you choose the plus one? 
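
As a concrete companion to the worked example above, here is a short NumPy sketch (an illustration, not the lecture's code) of the per-example multi-class SVM loss, L_i = sum over j ≠ y_i of max(0, s_j − s_{y_i} + 1). The score values mirror the cat/car/frog toy example on the lecture slides as best as can be recalled, so treat them as illustrative; with these numbers the per-example losses come out to 2.9, 0.0 and 12.9, and their average is about 5.3, matching the total quoted above.

```python
import numpy as np

def svm_loss_single(scores, correct_class):
    """Multi-class SVM (hinge) loss for one example:
    sum over the incorrect classes of max(0, s_j - s_correct + 1)."""
    margins = np.maximum(0, scores - scores[correct_class] + 1.0)
    margins[correct_class] = 0.0            # do not count the correct class itself
    return margins.sum()

# Toy class scores for three images (columns: cat, car, frog class scores).
scores_per_image = np.array([
    [3.2, 5.1, -1.7],   # cat image
    [1.3, 4.9,  2.0],   # car image
    [2.2, 2.5, -3.1],   # frog image
])
labels = np.array([0, 1, 2])   # correct class for each image: cat, car, frog

losses = [svm_loss_single(s, y) for s, y in zip(scores_per_image, labels)]
print(losses)               # ~[2.9, 0.0, 12.9]
print(np.mean(losses))      # ~5.27, i.e. the "5.3" total mentioned above
```
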
That's actually a really great question.", 'start': 1054.649, 'duration': 9.062}], 'summary': 'Comparison between frog and car resulted in an average loss of 5.3, with a specific loss of 12.9 for one example.', 'duration': 30.584, 'max_score': 1033.127, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc1033127.jpg'}], 'start': 781.289, 'title': 'Hinge loss, multi-class svm loss', 'summary': "Discusses hinge loss in classification, where the loss function is determined by the true class score, and multi-class svm loss computation, yielding a total loss of 5.3, and the use of an arbitrary constant 'plus one' for score differentiation.", 'chapters': [{'end': 827.084, 'start': 781.289, 'title': 'Hinge loss and classification', 'summary': 'Introduces the concept of hinge loss in the context of classification, explaining how the loss function is determined by the score of the true class for a training example, and how it is zero once the example is correctly classified.', 'duration': 45.795, 'highlights': ['The loss function is referred to as a hinge loss and is determined by taking the maximum of zero and another quantity, with the graph of this function being shaped by the score of the true class for a training example and the resulting loss.', 'As the score for the true category of a training example increases, the loss decreases linearly until it reaches a certain margin, beyond which the loss becomes zero as the example is correctly classified.']}, {'end': 1115.3, 'start': 835.236, 'title': 'Multi-class svm loss', 'summary': "Explains the computation of multi-class svm loss, where the total loss for the entire data set is 5.3, and the arbitrary constant 'plus one' is chosen to ensure the correct score is much greater than the incorrect scores.", 'duration': 280.064, 'highlights': ['The total loss for the entire data set is 5.3.', "The arbitrary constant 'plus one' is chosen to ensure the correct score is much greater than the incorrect scores.", 'The explanation of the computation of multi-class SVM loss.']}], 'duration': 334.011, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc781289.jpg', 'highlights': ['The total loss for the entire data set is 5.3.', "The arbitrary constant 'plus one' is chosen to ensure the correct score is much greater than the incorrect scores.", 'The loss function is referred to as a hinge loss and is determined by taking the maximum of zero and another quantity, with the graph of this function being shaped by the score of the true class for a training example and the resulting loss.', 'As the score for the true category of a training example increases, the loss decreases linearly until it reaches a certain margin, beyond which the loss becomes zero as the example is correctly classified.', 'The explanation of the computation of multi-class SVM loss.']}, {'end': 1746.046, 'segs': [{'end': 1193.215, 'src': 'embed', 'start': 1147.297, 'weight': 0, 'content': [{'end': 1153.121, 'text': 'so the answer is that if we jiggle the scores for this car image a little bit, the loss will not change.', 'start': 1147.297, 'duration': 5.824}, {'end': 1160.926, 'text': 'So the SVM loss, remember, the only thing it cares about is getting the correct score to be greater than one more than the incorrect scores.', 'start': 1153.661, 'duration': 7.265}, {'end': 1165.789, 'text': 'But in this case the car score is already quite a bit larger than the others.', 'start': 1161.386, 
'duration': 4.403}, {'end': 1173.534, 'text': 'so if the scores for this example change just a little bit, this margin of one will still be retained and the loss will not change.', 'start': 1165.789, 'duration': 7.745}, {'end': 1174.495, 'text': "We'll still get zero loss.", 'start': 1173.574, 'duration': 0.921}, {'end': 1185.733, 'text': "The next question, what's the min and max possible loss for SVM? Well, I hear some murmurs.", 'start': 1177.059, 'duration': 8.674}, {'end': 1193.215, 'text': 'So the minimum loss is zero, because if you can imagine that across all the classes, if our correct score was much larger,', 'start': 1186.293, 'duration': 6.922}], 'summary': 'Svm loss remains unchanged if car score is already larger. min loss is 0.', 'duration': 45.918, 'max_score': 1147.297, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc1147297.jpg'}, {'end': 1307.711, 'src': 'heatmap', 'start': 1208.278, 'weight': 1, 'content': [{'end': 1210.158, 'text': 'So the min is zero and the max is infinity.', 'start': 1208.278, 'duration': 1.88}, {'end': 1216.451, 'text': 'Another question sort of when you initialize these things and start training from scratch.', 'start': 1211.749, 'duration': 4.702}, {'end': 1219.832, 'text': 'usually you kind of initialize w with some small random values.', 'start': 1216.451, 'duration': 3.381}, {'end': 1225.235, 'text': 'As a result, your scores tend to be sort of small uniform random values at the beginning of training.', 'start': 1220.293, 'duration': 4.942}, {'end': 1231.698, 'text': "And then the question is that if all of your s's, if all of the scores are approximately zero and approximately equal,", 'start': 1225.835, 'duration': 5.863}, {'end': 1234.939, 'text': "then what kind of loss do you expect when you're using the multiclass SVM??", 'start': 1231.698, 'duration': 3.241}, {'end': 1242.68, 'text': 'Yeah, so the answer is number of classes minus one.', 'start': 1237.657, 'duration': 5.023}, {'end': 1251.785, 'text': "because remember that if we're looping over all of the incorrect classes, so we're looping over c minus one classes within each of those classes,", 'start': 1242.68, 'duration': 9.105}, {'end': 1253.986, 'text': "the two s's will be about the same.", 'start': 1251.785, 'duration': 2.201}, {'end': 1257.188, 'text': "so we'll get a loss of one because of the margin and we'll get c minus one.", 'start': 1253.986, 'duration': 3.202}, {'end': 1268.477, 'text': "So this is actually kind of useful because when you this is a useful debugging strategy when you're using these things that when you start off training you should think about what you expect your loss to be,", 'start': 1257.808, 'duration': 10.669}, {'end': 1273.962, 'text': 'and if the loss you actually see at the start of training at that first iteration is not equal to C minus one,', 'start': 1268.477, 'duration': 5.485}, {'end': 1276.844, 'text': 'in this case that means you probably have a bug and you should go check your code.', 'start': 1273.962, 'duration': 2.882}, {'end': 1280.007, 'text': 'So this is actually kind of a useful thing to be checking in practice.', 'start': 1277.244, 'duration': 2.763}, {'end': 1284.824, 'text': 'Another question what happens if so?', 'start': 1282.199, 'duration': 2.625}, {'end': 1289.514, 'text': "I said we're summing an SVM over the incorrect classes.", 'start': 1284.824, 'duration': 4.69}, {'end': 1291.819, 'text': 'what happens if this sum is also over the correct class?', 'start': 
1289.514, 'duration': 2.305}, {'end': 1292.761, 'text': 'if we just go over everything?', 'start': 1291.819, 'duration': 0.942}, {'end': 1298.746, 'text': 'Yeah, so the answer is that the loss increases by one.', 'start': 1296.325, 'duration': 2.421}, {'end': 1307.711, 'text': "And I think the reason that we do this in practice is because normally loss of zero has this nice interpretation that you're not losing at all.", 'start': 1299.947, 'duration': 7.764}], 'summary': 'Initial w values result in small random scores, leading to specific loss expectations in multiclass svm training.', 'duration': 99.433, 'max_score': 1208.278, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc1208278.jpg'}, {'end': 1365.058, 'src': 'embed', 'start': 1325.253, 'weight': 2, 'content': [{'end': 1332.878, 'text': "So another question, what if we used mean instead of sum here? Yeah, the answer is that it doesn't change.", 'start': 1325.253, 'duration': 7.625}, {'end': 1337.14, 'text': 'So the number of classes is gonna be fixed ahead of time when we select our data set.', 'start': 1333.418, 'duration': 3.722}, {'end': 1340.382, 'text': "So that's just rescaling the whole loss function by a constant.", 'start': 1337.461, 'duration': 2.921}, {'end': 1341.723, 'text': "So it doesn't really matter.", 'start': 1340.703, 'duration': 1.02}, {'end': 1344.345, 'text': "It'll sort of wash out with all the other scale things,", 'start': 1341.763, 'duration': 2.582}, {'end': 1349.168, 'text': "because we don't actually care about the true values of the scores or the true value of the loss, for that matter.", 'start': 1344.345, 'duration': 4.823}, {'end': 1351.975, 'text': "So now here's another example.", 'start': 1350.854, 'duration': 1.121}, {'end': 1357.079, 'text': 'What if we change this loss formulation and we actually added a square term on top of this max?', 'start': 1352.335, 'duration': 4.744}, {'end': 1362.163, 'text': 'Would this end up being the same problem or would this be a different classification algorithm??', 'start': 1358.24, 'duration': 3.923}, {'end': 1365.058, 'text': 'Yes, this would be different.', 'start': 1364.158, 'duration': 0.9}], 'summary': "Using mean instead of sum doesn't change, fixed classes ahead, adding square term changes algorithm", 'duration': 39.805, 'max_score': 1325.253, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc1325253.jpg'}, {'end': 1734.277, 'src': 'embed', 'start': 1673.804, 'weight': 4, 'content': [{'end': 1675.906, 'text': 'We care about the performance on the test data.', 'start': 1673.804, 'duration': 2.102}, {'end': 1683.552, 'text': 'So now if we have some new data come in that sort of follows the same trend, then this very wiggly blue line is gonna be totally wrong.', 'start': 1676.647, 'duration': 6.905}, {'end': 1684.652, 'text': 'And, in fact,', 'start': 1684.152, 'duration': 0.5}, {'end': 1694.539, 'text': 'what we probably would have preferred the classifier to do was maybe predict this straight green line rather than this very complex wiggly line to perfectly fit all the training data.', 'start': 1684.652, 'duration': 9.887}, {'end': 1698.982, 'text': 'And this is kind of a core fundamental problem in machine learning.', 'start': 1695.541, 'duration': 3.441}, {'end': 1702.582, 'text': 'And the way we usually solve it is this concept of regularization.', 'start': 1699.482, 'duration': 3.1}, {'end': 1706.283, 'text': "So here, we're gonna add an 
additional term to the loss function.", 'start': 1703.142, 'duration': 3.141}, {'end': 1711.704, 'text': 'In addition to the data loss, which will tell our classifier that it should fit the training data,', 'start': 1706.783, 'duration': 4.921}, {'end': 1721.105, 'text': "we'll also typically add another term to the loss function called a regularization term, which encourages the model to somehow pick a simpler W,", 'start': 1711.704, 'duration': 9.401}, {'end': 1724.386, 'text': 'whereas the concept of simple kind of depends on the task and the model.', 'start': 1721.105, 'duration': 3.281}, {'end': 1734.277, 'text': "but the whole idea is that there's this whole idea of Occam's razor, which is kind of this fundamental idea in scientific discovery more broadly,", 'start': 1725.146, 'duration': 9.131}], 'summary': 'Regularization in machine learning tackles overfitting with simpler models.', 'duration': 60.473, 'max_score': 1673.804, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc1673804.jpg'}], 'start': 1118.24, 'title': 'Svm loss function and regularization in machine learning', 'summary': "Delves into the svm loss function, revealing that score changes for car image do not affect loss, with the minimum loss being zero and the maximum being infinity, and the use of mean instead of sum in the loss function yielding unchanged results. additionally, it explores the significance of regularization in machine learning, addressing the risks of overfitting, occam's razor, and advocating for a regularization term in the loss function to promote the selection of a simpler w.", 'chapters': [{'end': 1465.961, 'start': 1118.24, 'title': 'Understanding svm loss function', 'summary': "Explains the svm loss function, stating that changing the scores for the car image will not change the loss, the min loss is zero, the max loss is infinity, and using mean instead of sum in the loss function doesn't change the result.", 'duration': 347.721, 'highlights': ['The loss will not change if the scores for the car image change slightly, retaining a margin of one and resulting in zero loss.', 'The minimum loss is zero, and the maximum loss is infinity for the SVM.', "Using mean instead of sum in the loss function doesn't change the result due to fixed number of classes in the dataset.", 'Adding a square term to the max in the loss formulation would result in a different classification algorithm, changing the trade-offs between good and badness in a non-linear way.']}, {'end': 1746.046, 'start': 1466.502, 'title': 'Regularization in machine learning', 'summary': "Discusses the importance of regularization in machine learning, highlighting the potential issues with overfitting and the concept of occam's razor, and emphasizes the need for a regularization term in the loss function to encourage the model to pick a simpler w.", 'duration': 279.544, 'highlights': ['The concept of regularization is introduced to address the problem of overfitting in machine learning, emphasizing the importance of adding a regularization term to the loss function, which encourages the model to pick a simpler W, ultimately aiming to prevent unintuitive behavior and improve performance on test data.', 'The discussion addresses the potential issues with overfitting in machine learning, illustrating the example of fitting a very wiggly curve to perfectly classify all training data points, leading to incorrect predictions on new data, highlighting the need for the classifier to 
prioritize generalized performance over fitting the training data perfectly.', "The transcript delves into the fundamental concept of Occam's razor, emphasizing the preference for simpler explanations when faced with competing hypotheses, and how this idea translates to the need for regularization in machine learning to encourage the selection of simpler W, which is more likely to generalize well to new observations."]}], 'duration': 627.806, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc1118240.jpg', 'highlights': ['The loss will not change if the scores for the car image change slightly, retaining a margin of one and resulting in zero loss.', 'The minimum loss is zero, and the maximum loss is infinity for the SVM.', "Using mean instead of sum in the loss function doesn't change the result due to fixed number of classes in the dataset.", 'Adding a square term to the max in the loss formulation would result in a different classification algorithm, changing the trade-offs between good and badness in a non-linear way.', 'The concept of regularization is introduced to address the problem of overfitting in machine learning, emphasizing the importance of adding a regularization term to the loss function, which encourages the model to pick a simpler W, ultimately aiming to prevent unintuitive behavior and improve performance on test data.', 'The discussion addresses the potential issues with overfitting in machine learning, illustrating the example of fitting a very wiggly curve to perfectly classify all training data points, leading to incorrect predictions on new data, highlighting the need for the classifier to prioritize generalized performance over fitting the training data perfectly.', "The transcript delves into the fundamental concept of Occam's razor, emphasizing the preference for simpler explanations when faced with competing hypotheses, and how this idea translates to the need for regularization in machine learning to encourage the selection of simpler W, which is more likely to generalize well to new observations."]}, {'end': 2268.352, 'segs': [{'end': 1797.818, 'src': 'embed', 'start': 1746.686, 'weight': 0, 'content': [{'end': 1756.728, 'text': "And the way that we operationalize this intuition in machine learning is typically through some explicit regularization penalty that's often written down as R.", 'start': 1746.686, 'duration': 10.042}, {'end': 1765.45, 'text': "So then, your standard loss function usually has these two terms a data loss and a regularization loss, and there's some hyperparameter here, lambda,", 'start': 1756.728, 'duration': 8.722}, {'end': 1766.77, 'text': 'that trades off between the two.', 'start': 1765.45, 'duration': 1.32}, {'end': 1771.552, 'text': 'And we talked about hyperparameters and cross-validation in the last lecture.', 'start': 1767.83, 'duration': 3.722}, {'end': 1778.435, 'text': "so this regularization hyperparameter lambda will be one of the more important ones that you'll need to tune when training these models in practice.", 'start': 1771.552, 'duration': 6.883}, {'end': 1791.193, 'text': 'Question? 
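
The objective just described, a data loss plus a regularization penalty weighted by lambda, is easy to write down. Below is a hedged NumPy sketch (the names `full_loss`, `l2_regularization` and `lam` are assumptions, and the toy data is random, not from the lecture) that adds an L2 weight-decay penalty R(W) = sum of W squared to the averaged per-example loss, with lambda trading off the two terms.

```python
import numpy as np

def svm_loss_single(scores, correct_class):
    """Per-example multi-class SVM loss, same form as sketched earlier on this page."""
    margins = np.maximum(0, scores - scores[correct_class] + 1.0)
    margins[correct_class] = 0.0
    return margins.sum()

def l2_regularization(W):
    """L2 penalty R(W) = sum of squared weights, i.e. weight decay."""
    return np.sum(W * W)

def full_loss(W, X, y, per_example_loss, lam):
    """Total objective: average data loss over the training set plus lambda * R(W)."""
    data_loss = np.mean([per_example_loss(W.dot(x), yi) for x, yi in zip(X, y)])
    return data_loss + lam * l2_regularization(W)

# Tiny random problem: 3 classes, 4 features, 5 examples.
rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((3, 4))   # small random init, so all scores are near zero
X = rng.standard_normal((5, 4))
y = rng.integers(0, 3, size=5)

# With such a small W the data loss lands near C - 1 = 2 (the sanity check from the Q&A above).
print(full_loss(W, X, y, svm_loss_single, lam=0.0))   # data loss only
print(full_loss(W, X, y, svm_loss_single, lam=0.1))   # data loss plus weight decay
```
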
Yeah,', 'start': 1780.436, 'duration': 10.757}, {'end': 1797.818, 'text': "so the question is what's the connection between this lambda rw term and actually forcing this wiggly line to become a straight green line?", 'start': 1791.193, 'duration': 6.625}], 'summary': 'Regularization penalty balances data loss and regularization loss using hyperparameter lambda in machine learning.', 'duration': 51.132, 'max_score': 1746.686, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc1746686.jpg'}, {'end': 2041.57, 'src': 'embed', 'start': 2008.399, 'weight': 3, 'content': [{'end': 2013.622, 'text': 'But now the question is, if you look at these two examples, then which one would L2 regression prefer?', 'start': 2008.399, 'duration': 5.223}, {'end': 2020.673, 'text': 'Yeah, so L2 regression would prefer W2 because it has a smaller norm.', 'start': 2016.172, 'duration': 4.501}, {'end': 2029.756, 'text': 'So the answer is that the L2 regression kind of measures complexity of the classifier in this kind of relatively coarse way,', 'start': 2021.234, 'duration': 8.522}, {'end': 2041.57, 'text': "where the idea is that Remember the W's in linear classification had this interpretation of how much does this value of the vector x correspond to this output class?", 'start': 2029.756, 'duration': 11.814}], 'summary': 'L2 regression prefers w2 due to its smaller norm, measuring classifier complexity relatively coarsely.', 'duration': 33.171, 'max_score': 2008.399, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc2008399.jpg'}, {'end': 2158.601, 'src': 'embed', 'start': 2114.004, 'weight': 4, 'content': [{'end': 2121.307, 'text': 'But maybe if we set the W, you could construct a similar example to this where W1 would be preferred by L1 regularization.', 'start': 2114.004, 'duration': 7.303}, {'end': 2127.39, 'text': 'I guess the general intuition behind L1 is that it generally prefers sparse solutions,', 'start': 2122.348, 'duration': 5.042}, {'end': 2133.753, 'text': "that it kind of drives all your entries of W to zero for most of the entries, except for a couple where it's allowed to deviate from zero.", 'start': 2127.39, 'duration': 6.363}, {'end': 2137.574, 'text': 'So kind of the way of measuring complexity.', 'start': 2134.433, 'duration': 3.141}, {'end': 2145.457, 'text': 'for L1 is maybe the number of non-zero entries, and then for L2, it thinks that things that spread the W across all the values are less complex.', 'start': 2137.574, 'duration': 7.883}, {'end': 2147.978, 'text': 'So it kind of depends on your data, depends on your problem.', 'start': 2145.917, 'duration': 2.061}, {'end': 2151.759, 'text': "Oh, and by the way, if you're a hardcore Bayesian,", 'start': 2149.218, 'duration': 2.541}, {'end': 2158.601, 'text': 'then using L2 regularization has kind of this nice interpretation of map inference under a Gaussian prior on the parameter vector.', 'start': 2151.759, 'duration': 6.842}], 'summary': 'L1 regularization prefers sparse solutions, driving most w entries to zero, while l2 regularization spreads w across all values.', 'duration': 44.597, 'max_score': 2114.004, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc2114004.jpg'}], 'start': 1746.686, 'title': 'Regularization in machine learning', 'summary': 'Discusses the significance of the regularization hyperparameter lambda in training machine learning models, the trade-off 
between data loss and regularization loss, and various types of regularization such as l1 and l2, their impact on model complexity, and preference for smaller norms.', 'chapters': [{'end': 1797.818, 'start': 1746.686, 'title': 'Regularization in machine learning', 'summary': 'Discusses the importance of the regularization hyperparameter lambda in training machine learning models, how it connects to forcing a wiggly line to become a straight green line, and the trade-off between data loss and regularization loss.', 'duration': 51.132, 'highlights': ['The regularization hyperparameter lambda is crucial in training machine learning models, and it requires careful tuning during the training process.', 'The connection between the regularization hyperparameter lambda and forcing a wiggly line to become a straight green line is a key point to understand in machine learning.', 'In machine learning, the standard loss function consists of data loss and regularization loss, with a hyperparameter lambda that trades off between the two.']}, {'end': 2268.352, 'start': 1798.578, 'title': 'Regularization and model complexity', 'summary': 'Discusses various types of regularization, including l1 and l2, which penalize the complexity of the model by encouraging sparsity in the weight vector and measuring complexity based on the norm of the weight vector. it also explains how l2 regularization measures the complexity of a model by preferring smaller norms and spreading the influence across all values in the input vector.', 'duration': 469.774, 'highlights': ['L2 regularization measures complexity by preferring smaller norms and spreading the influence across all values in the input vector.', 'L1 regularization encourages sparsity in the weight vector and measures complexity based on the number of non-zero entries in the weight vector.', 'The chapter explains that L2 regularization is commonly used and has a nice interpretation of map inference under a Gaussian prior on the parameter vector for hardcore Bayesians.']}], 'duration': 521.666, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc1746686.jpg', 'highlights': ['The regularization hyperparameter lambda is crucial in training ML models, requiring careful tuning.', 'The connection between lambda and forcing a wiggly line to become a straight line is key in ML.', 'The standard loss function consists of data loss and regularization loss, with a hyperparameter lambda.', 'L2 regularization measures complexity by preferring smaller norms and spreading influence.', 'L1 regularization encourages sparsity in the weight vector and measures complexity based on non-zero entries.', 'L2 regularization is commonly used and has a nice interpretation of map inference under a Gaussian prior.']}, {'end': 3112.01, 'segs': [{'end': 2334.772, 'src': 'embed', 'start': 2286.899, 'weight': 0, 'content': [{'end': 2293.444, 'text': 'But I mean, kind of my intuition is that they all tend to work kind of similarly in practice, at least in the context of deep learning.', 'start': 2286.899, 'duration': 6.545}, {'end': 2298.688, 'text': "So we'll kind of stick with this one particular formulation of the multi-class SVM loss in this class.", 'start': 2294.164, 'duration': 4.524}, {'end': 2304.265, 'text': "But of course there's many different loss functions you might imagine.", 'start': 2301.283, 'duration': 2.982}, {'end': 2312.071, 'text': 'So another really popular choice in addition to the multi-class SVM loss,', 'start': 
2304.886, 'duration': 7.185}, {'end': 2318.696, 'text': 'another really popular choice in deep learning is this multinomial logistic regression or a softmax loss.', 'start': 2312.071, 'duration': 6.625}, {'end': 2327.142, 'text': 'And this one is probably actually a bit more common in the context of deep learning, but I decided to present it second for some reason.', 'start': 2319.716, 'duration': 7.426}, {'end': 2334.772, 'text': "So remember in the context of the multi-class SVM loss, we didn't actually have an interpretation for those scores.", 'start': 2328.587, 'duration': 6.185}], 'summary': 'Common loss functions in deep learning: multi-class svm and softmax, with softmax being more common.', 'duration': 47.873, 'max_score': 2286.899, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc2286899.jpg'}, {'end': 2425.13, 'src': 'heatmap', 'start': 2374.261, 'weight': 0.74, 'content': [{'end': 2383.626, 'text': 'So we use this so-called softmax function, where we take all of our scores, we exponentiate them so that now they become positive,', 'start': 2374.261, 'duration': 9.365}, {'end': 2386.447, 'text': 'then we renormalize them by the sum of those exponents.', 'start': 2383.626, 'duration': 2.821}, {'end': 2393.291, 'text': 'So now, after we send our scores through this softmax function, now we end up with this probability distribution,', 'start': 2386.847, 'duration': 6.444}, {'end': 2395.972, 'text': 'where now we have probabilities over our classes,', 'start': 2393.291, 'duration': 2.681}, {'end': 2402.175, 'text': 'where each probability is between zero and one and the sum of probabilities across all classes sum to one.', 'start': 2395.972, 'duration': 6.203}, {'end': 2408.507, 'text': 'And now kind of the interpretation is that we kind of want to.', 'start': 2404.246, 'duration': 4.261}, {'end': 2416.848, 'text': "there's this computed probability distribution that's implied by our scores and we want to compare this with the target or true probability distribution.", 'start': 2408.507, 'duration': 8.341}, {'end': 2425.13, 'text': 'So if we know that the thing is a cat, then the target probability distribution would put all of the probability mass on cat.', 'start': 2417.409, 'duration': 7.721}], 'summary': 'Softmax function normalizes scores into a probability distribution over classes.', 'duration': 50.869, 'max_score': 2374.261, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc2374261.jpg'}, {'end': 2595.22, 'src': 'embed', 'start': 2565.768, 'weight': 1, 'content': [{'end': 2570.856, 'text': 'So now we asked several questions to try to gain intuition about the multi-class SVM loss,', 'start': 2565.768, 'duration': 5.088}, {'end': 2576.545, 'text': "and it's kind of useful to think about some of the same questions to contrast with the softmax loss.", 'start': 2570.856, 'duration': 5.689}, {'end': 2580.832, 'text': "So then the question is what's the min and max value of the softmax loss??", 'start': 2577.747, 'duration': 3.085}, {'end': 2587.014, 'text': 'Okay, maybe not so sure.', 'start': 2585.293, 'duration': 1.721}, {'end': 2589.276, 'text': "there's too many logs and sums and stuff going on in here.", 'start': 2587.014, 'duration': 2.262}, {'end': 2595.22, 'text': 'But remember our, so the answer is that the min loss is zero and the max loss is infinity.', 'start': 2589.776, 'duration': 5.444}], 'summary': 'The min value of softmax loss is 0, while the max loss is 
infinity.', 'duration': 29.452, 'max_score': 2565.768, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc2565768.jpg'}, {'end': 2786.065, 'src': 'heatmap', 'start': 2687.83, 'weight': 0.751, 'content': [{'end': 2689.711, 'text': "But again, you'll never really get here,", 'start': 2687.83, 'duration': 1.881}, {'end': 2698.575, 'text': 'because the only way you can actually get this probability to be zero is if e to the correct class score is zero.', 'start': 2689.711, 'duration': 8.864}, {'end': 2701.456, 'text': 'and that can only happen if that correct class score is minus infinity.', 'start': 2698.575, 'duration': 2.881}, {'end': 2705.818, 'text': "So again, you'll never actually get to these minimum and maximum values with finite precision.", 'start': 2701.936, 'duration': 3.882}, {'end': 2716.258, 'text': 'So then, remember we had this sort of debugging sanity check question in the context of the multi-class SVM, and we can ask the same for the softmax.', 'start': 2708.973, 'duration': 7.285}, {'end': 2721.302, 'text': "If all the s's are small and about zero, then what is the loss here?", 'start': 2716.959, 'duration': 4.343}, {'end': 2721.962, 'text': 'Yeah, answer?', 'start': 2721.502, 'duration': 0.46}, {'end': 2726.665, 'text': 'So minus log of one over c?', 'start': 2724.624, 'duration': 2.041}, {'end': 2734.931, 'text': "I think that's yeah, yeah, so then it would be minus log of one over c, which is because log, you can flip the thing.", 'start': 2726.665, 'duration': 8.266}, {'end': 2735.672, 'text': "so then it's just log of c.", 'start': 2734.931, 'duration': 0.741}, {'end': 2738.274, 'text': "Yeah, so it's just log of C.", 'start': 2736.892, 'duration': 1.382}, {'end': 2740.096, 'text': 'And again, this is kind of a nice debugging thing.', 'start': 2738.274, 'duration': 1.822}, {'end': 2744.04, 'text': "If you're training a model with this softmax loss, you should check the first iteration.", 'start': 2740.136, 'duration': 3.904}, {'end': 2747.624, 'text': "If it's not log C, then something's gone wrong.", 'start': 2744.781, 'duration': 2.843}, {'end': 2753.913, 'text': 'So then we can kind of compare and contrast these two loss functions a bit.', 'start': 2750.151, 'duration': 3.762}, {'end': 2757.455, 'text': 'That in terms of linear classification, the setup looks the same.', 'start': 2754.394, 'duration': 3.061}, {'end': 2762.138, 'text': "We've got this W matrix that gets multiplied against our input to produce this vector of scores.", 'start': 2757.715, 'duration': 4.423}, {'end': 2769.142, 'text': 'And now the difference between the two loss functions is how we choose to interpret those scores to quantitatively measure the badness afterwards.', 'start': 2762.558, 'duration': 6.584}, {'end': 2776.827, 'text': 'So for SVM, we were gonna go in and look at the margins between the true and the scores of the correct class and the scores of the incorrect class.', 'start': 2769.663, 'duration': 7.164}, {'end': 2786.065, 'text': "whereas for this softmax or cross-entropy loss we're gonna go and compute a probability distribution and then look at the minus log probability of the correct class.", 'start': 2777.367, 'duration': 8.698}], 'summary': 'Comparing softmax and svm loss functions, debugging tips for softmax loss, and interpreting scores for each loss function.', 'duration': 98.235, 'max_score': 2687.83, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc2687830.jpg'}, {'end': 2865.755, 'src': 'embed', 'start': 2831.714, 'weight': 2, 'content': [{'end': 2838.058, 'text': 'because the only thing that the SVM loss cared about was getting that correct score to be greater than a margin above the incorrect scores.', 'start': 2831.714, 'duration': 6.344}, {'end': 2841.62, 'text': 'But now the softmax loss is actually quite different in this respect.', 'start': 2838.598, 'duration': 3.022}, {'end': 2846.323, 'text': 'The softmax loss actually always wants to drive that probability mass all the way to one.', 'start': 2842.04, 'duration': 4.283}, {'end': 2854.568, 'text': "So, even if you're giving very high score to the correct class and very low score to all the incorrect classes,", 'start': 2846.903, 'duration': 7.665}, {'end': 2865.755, 'text': 'softmax will want you to pile more and more probability mass on the correct class and continue to push the score of that correct class up towards infinity and the score of the incorrect classes down towards minus infinity.', 'start': 2854.568, 'duration': 11.187}], 'summary': 'Softmax loss drives probability mass to one for correct class and pushes incorrect class scores towards minus infinity.', 'duration': 34.041, 'max_score': 2831.714, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc2831714.jpg'}, {'end': 2953.055, 'src': 'embed', 'start': 2928.242, 'weight': 4, 'content': [{'end': 2938.83, 'text': "And then we'll often augment this loss function with a regularization term that tries to trade off between fitting the training data and preferring simpler models.", 'start': 2928.242, 'duration': 10.588}, {'end': 2943.751, 'text': 'So this is kind of a pretty generic overview of a lot of what we call supervised learning.', 'start': 2939.35, 'duration': 4.401}, {'end': 2953.055, 'text': "And what we'll see in deep learning as we sort of move forward is that generally you'll want to specify some function f that could be very complex in structure,", 'start': 2944.292, 'duration': 8.763}], 'summary': 'Supervised learning involves augmenting loss function with a regularization term to balance data fitting and model simplicity.', 'duration': 24.813, 'max_score': 2928.242, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc2928242.jpg'}], 'start': 2270.427, 'title': 'Loss functions in deep learning', 'summary': 'Discusses multi-class svm and softmax loss in deep learning, emphasizing score interpretation, probability distribution computation, and theoretical minimum and maximum values of softmax loss. 
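
A compact NumPy sketch of the softmax (cross-entropy) loss described in this chapter, written as an illustration rather than taken from the lecture: exponentiate the scores, normalize them into a probability distribution, and take the negative log probability of the correct class. Subtracting the maximum score is a standard numerical-stability trick that is not discussed in the transcript; it leaves the probabilities unchanged. The last two lines reproduce the debugging check mentioned above: with all scores near zero the loss is log(C).

```python
import numpy as np

def softmax_loss_single(scores, correct_class):
    """Softmax / cross-entropy loss for one example:
    L_i = -log( exp(s_correct) / sum_j exp(s_j) )."""
    shifted = scores - np.max(scores)      # stability shift; does not change the probabilities
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    return -np.log(probs[correct_class])

# Illustrative cat/car/frog scores for a single image.
scores = np.array([3.2, 5.1, -1.7])
print(softmax_loss_single(scores, correct_class=0))    # loss when "cat" is the true class

# Sanity check from the lecture: if all scores are ~0, the loss should be log(C).
zero_scores = np.zeros(3)
print(softmax_loss_single(zero_scores, 0), np.log(3))  # both ~1.0986
```
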
it also compares the loss functions of multi-class svm and softmax, highlighting differences and practical implications, with a brief overview of supervised learning and optimization challenges.', 'chapters': [{'end': 2705.818, 'start': 2270.427, 'title': 'Multi-class svm and softmax loss', 'summary': 'Discusses the multi-class svm loss and softmax loss in the context of deep learning, emphasizing the interpretation of scores, probability distribution computation, and the theoretical minimum and maximum values of the softmax loss function.', 'duration': 435.391, 'highlights': ['The chapter discusses the interpretation of scores, probability distribution computation, and the theoretical minimum and maximum values of the softmax loss function.', 'The multi-class SVM loss and softmax loss are popular choices in deep learning for classification tasks.', 'Explanation of the theoretical minimum and maximum values of the softmax loss function.']}, {'end': 3112.01, 'start': 2708.973, 'title': 'Understanding loss functions in deep learning', 'summary': 'Discusses the comparison between the loss functions of multi-class svm and softmax, highlighting their differences and practical implications, with a brief overview of supervised learning and the challenges in optimization.', 'duration': 403.037, 'highlights': ['The softmax loss function is -log(1/c) when all scores are small and about zero, and it aims to drive the probability mass of the correct class towards one, unlike the SVM loss.', 'The practical implications of using softmax or SVM loss functions may not significantly differ in many deep learning applications, despite their distinct operational behaviors.', 'The chapter provides an overview of supervised learning, emphasizing the use of a linear classifier, a loss function, and a regularization term to compute and optimize predictions.']}], 'duration': 841.583, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc2270427.jpg', 'highlights': ['The multi-class SVM loss and softmax loss are popular choices in deep learning for classification tasks.', 'The chapter discusses the interpretation of scores, probability distribution computation, and the theoretical minimum and maximum values of the softmax loss function.', 'The softmax loss function is -log(1/c) when all scores are small and about zero, and it aims to drive the probability mass of the correct class towards one, unlike the SVM loss.', 'The practical implications of using softmax or SVM loss functions may not significantly differ in many deep learning applications, despite their distinct operational behaviors.', 'The chapter provides an overview of supervised learning, emphasizing the use of a linear classifier, a loss function, and a regularization term to compute and optimize predictions.', 'Explanation of the theoretical minimum and maximum values of the softmax loss function.']}, {'end': 3617.306, 'segs': [{'end': 3158.298, 'src': 'embed', 'start': 3134.264, 'weight': 2, 'content': [{'end': 3142.347, 'text': "but what you can do is kind of feel with your foot and figure out what is the local geometry like which way, if I'm standing right here,", 'start': 3134.264, 'duration': 8.083}, {'end': 3143.968, 'text': 'which way will take me a little bit downhill?', 'start': 3142.347, 'duration': 1.621}, {'end': 3150.632, 'text': 'So then you can kind of feel with your feet and feel where is the slope of the ground taking me down a little bit in this direction.', 'start': 3144.568, 
'duration': 6.064}, {'end': 3152.554, 'text': 'And you can take a step in that direction.', 'start': 3150.652, 'duration': 1.902}, {'end': 3158.298, 'text': "And then you'll go down a little bit, feel again with your feet to figure out which way is down and then repeat over and over again,", 'start': 3152.874, 'duration': 5.424}], 'summary': 'Use foot to feel local geometry, find downhill direction, repeat process.', 'duration': 24.034, 'max_score': 3134.264, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc3134264.jpg'}, {'end': 3498.182, 'src': 'embed', 'start': 3471.293, 'weight': 0, 'content': [{'end': 3478.702, 'text': "So in summary, this numerical gradient is something that's simple and kind of makes sense, but you won't really use it in practice.", 'start': 3471.293, 'duration': 7.409}, {'end': 3484.709, 'text': "In practice, you'll always take an analytic gradient and use that when actually performing these gradient computations.", 'start': 3479.203, 'duration': 5.506}, {'end': 3490.74, 'text': 'However, one interesting note is that these numeric gradients are actually a very useful debugging tool.', 'start': 3485.578, 'duration': 5.162}, {'end': 3498.182, 'text': "So say you've written some code and you wrote some code that computes the loss and the gradients of the loss.", 'start': 3491.38, 'duration': 6.802}], 'summary': 'Numeric gradients are not used in practice, but serve as a useful debugging tool for computing loss and gradients.', 'duration': 26.889, 'max_score': 3471.293, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc3471293.jpg'}, {'end': 3571.615, 'src': 'embed', 'start': 3546.101, 'weight': 1, 'content': [{'end': 3551.384, 'text': 'but turns out to be at the heart of how we train even these very biggest, most complex deep learning algorithms.', 'start': 3546.101, 'duration': 5.283}, {'end': 3552.785, 'text': "And that's gradient descent.", 'start': 3551.844, 'duration': 0.941}, {'end': 3558.648, 'text': 'So gradient descent is like first we initialize our w as some random thing.', 'start': 3553.405, 'duration': 5.243}, {'end': 3567.193, 'text': "then, while true, we'll compute our loss and our gradient and then we'll update our weights in the opposite of the gradient direction.", 'start': 3558.648, 'duration': 8.545}, {'end': 3571.615, 'text': 'because remember that the gradient was pointing in the direction of greatest increase of the function.', 'start': 3567.793, 'duration': 3.822}], 'summary': 'Gradient descent is crucial for training deep learning algorithms by optimizing weights in the opposite direction of the gradient.', 'duration': 25.514, 'max_score': 3546.101, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc3546101.jpg'}], 'start': 3114.178, 'title': 'Optimizing movement in a landscape and understanding gradient in deep learning', 'summary': 'Discusses using local geometry for landscape navigation and explains the concept of gradient in deep learning, emphasizing the significance of learning rate and analytic gradients over numerical gradients.', 'chapters': [{'end': 3169.927, 'start': 3114.178, 'title': 'Optimizing movement in a landscape', 'summary': 'Discusses using local geometry to navigate a landscape, where a simple algorithm involving feeling the slope with your feet and taking steps in the downhill direction tends to work really well in practice.', 'duration': 55.749, 
'highlights': ['By using the local geometry of the landscape, one can navigate by feeling the slope with their feet and taking steps in the downhill direction, which tends to work well in practice.', 'This algorithm seems relatively simple but is effective in practice if all the details are considered.']}, {'end': 3617.306, 'start': 3170.467, 'title': 'Understanding gradient in deep learning', 'summary': 'Explains the concept of gradient, its importance in deep learning, and the use of analytic gradients over numerical gradients, emphasizing the significance of the learning rate in gradient descent.', 'duration': 446.839, 'highlights': ['Gradient is a vector of partial derivatives, pointing in the direction of greatest increase of the function, and is crucial for updating parameter vectors in deep learning.', 'Analytic gradients are preferred over numerical gradients as they are exact and efficient, saving computation time and allowing for direct computation of the gradient in one step.', 'Gradient descent, a simple algorithm based on computing loss and gradient, is fundamental in training complex deep learning algorithms, with the learning rate being a crucial hyperparameter.']}], 'duration': 503.128, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc3114178.jpg', 'highlights': ['Analytic gradients are preferred over numerical gradients as they are exact and efficient, saving computation time and allowing for direct computation of the gradient in one step.', 'Gradient descent, a simple algorithm based on computing loss and gradient, is fundamental in training complex deep learning algorithms, with the learning rate being a crucial hyperparameter.', 'By using the local geometry of the landscape, one can navigate by feeling the slope with their feet and taking steps in the downhill direction, which tends to work well in practice.']}, {'end': 4070.078, 'segs': [{'end': 3691.205, 'src': 'embed', 'start': 3663.338, 'weight': 1, 'content': [{'end': 3666.74, 'text': 'So what this looks like in practice is that if we repeat this thing over and over again,', 'start': 3663.338, 'duration': 3.402}, {'end': 3675.949, 'text': 'then we will start off at some point and then eventually just taking tiny gradient steps each time.', 'start': 3669.42, 'duration': 6.529}, {'end': 3680.335, 'text': "you'll see that the parameter will arc in towards the center of this region of minima.", 'start': 3675.949, 'duration': 4.386}, {'end': 3683.579, 'text': "And that's really what you want, because you want to get to low loss.", 'start': 3680.955, 'duration': 2.624}, {'end': 3691.205, 'text': 'And by the way, as a bit of a teaser, we saw on the previous slide this example of very simple gradient descent,', 'start': 3684.8, 'duration': 6.405}], 'summary': 'Repeating process leads to parameter convergence towards minima for low loss.', 'duration': 27.867, 'max_score': 3663.338, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc3663338.jpg'}, {'end': 3772.944, 'src': 'embed', 'start': 3736.261, 'weight': 2, 'content': [{'end': 3740.622, 'text': 'So one of these is gradient descent with momentum.', 'start': 3736.261, 'duration': 4.361}, {'end': 3742.943, 'text': 'The other is this Adam optimizer.', 'start': 3740.942, 'duration': 2.001}, {'end': 3745.243, 'text': "And we'll see more details about those later in the course.", 'start': 3743.063, 'duration': 2.18}, {'end': 3749.944, 'text': 'But the idea is that we 
have this very basic algorithm called gradient descent,', 'start': 3745.823, 'duration': 4.121}, {'end': 3753.125, 'text': 'where we use the gradient at every time step to determine where to step next.', 'start': 3749.944, 'duration': 3.181}, {'end': 3758.566, 'text': 'And there exist different update rules which tell us how exactly do we use that gradient information.', 'start': 3753.545, 'duration': 5.021}, {'end': 3762.667, 'text': "But it's all kind of the same basic algorithm of trying to go downhill at every time step.", 'start': 3758.866, 'duration': 3.801}, {'end': 3772.944, 'text': "But there's actually one more little wrinkle that we should talk about.", 'start': 3770.202, 'duration': 2.742}], 'summary': 'Introduction to gradient descent and the Adam optimizer in machine learning.', 'duration': 36.683, 'max_score': 3736.261, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc3736261.jpg'}, {'end': 3864.842, 'src': 'embed', 'start': 3828.89, 'weight': 0, 'content': [{'end': 3833.612, 'text': 'So if our n was like a million, this would be super, super slow and we would have to wait a very,', 'start': 3828.89, 'duration': 4.722}, {'end': 3836.653, 'text': 'very long time before we make any individual update to w.', 'start': 3833.612, 'duration': 3.041}, {'end': 3841.998, 'text': 'So in practice, we tend to use what is called stochastic gradient descent, where,', 'start': 3837.293, 'duration': 4.705}, {'end': 3846.803, 'text': 'rather than computing the loss and gradient over the entire training set, instead,', 'start': 3841.998, 'duration': 4.805}, {'end': 3851.428, 'text': 'at every iteration we sample some small set of training examples called a minibatch.', 'start': 3846.803, 'duration': 4.625}, {'end': 3855.332, 'text': 'Typically, usually this is a power of two by convention, like 32, 64, 128 are kind of common numbers.', 'start': 3851.828, 'duration': 3.504}, {'end': 3864.842, 'text': "And then we'll use this small mini batch to compute an estimate of the full sum and an estimate of the true gradient.", 'start': 3858.335, 'duration': 6.507}], 'summary': 'Stochastic gradient descent uses small mini-batches, e.g., 32, 64, 128, to estimate the full sum and true gradient.', 'duration': 35.952, 'max_score': 3828.89, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc3828890.jpg'}], 'start': 3619.591, 'title': 'Optimization methods for deep learning', 'summary': 'Covers gradient descent, gradient descent with momentum, the Adam optimizer, and stochastic gradient descent, which improves training efficiency and scalability for deep neural networks.', 'chapters': [{'end': 3772.944, 'start': 3619.591, 'title': 'Gradient descent and optimization', 'summary': 'Explains the concept of gradient descent, where the negative gradient direction is used to minimize the loss function, and introduces slightly more advanced update rules like gradient descent with momentum and the Adam optimizer.', 'duration': 153.353, 'highlights': ['The chapter discusses the concept of gradient descent, where the negative gradient direction is used to move towards the minima of the loss function, ultimately aiming to achieve low loss.', 'It introduces slightly more advanced update rules such as gradient descent with momentum and the Adam optimizer, which are commonly used in practice for training models.', 'The chapter emphasizes the importance of using different update rules to determine how to use gradient information, while 
still following the fundamental algorithm of going downhill at every time step.']}, {'end': 4070.078, 'start': 3773.504, 'title': 'Stochastic gradient descent', 'summary': 'Discusses the challenges of computing loss functions and gradients over a large training set, leading to the introduction of stochastic gradient descent, which samples mini-batches to estimate the full sum and true gradient, improving training efficiency and scalability for deep neural networks.', 'duration': 296.574, 'highlights': ['Stochastic Gradient Descent is used to address the computational challenges of computing loss and gradients over a large training set, with the example of the ImageNet dataset having 1.3 million samples, leading to slow computation.', 'Stochastic Gradient Descent samples a small set of training examples called a minibatch, typically a power of two like 32, 64, or 128, to efficiently compute an estimate of the full sum and true gradient.', 'The chapter encourages the audience to play with an interactive web demo to build intuition about training linear classifiers via gradient descent, allowing adjustments to decision boundaries, weights, biases, and step size.', 'Feeding raw pixel values into linear classifiers tends not to work well due to issues like multimodality, which historically motivated a two-stage approach of feature extraction followed by a linear classifier before deep neural networks became standard.']}], 'duration': 450.487, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc3619591.jpg', 'highlights': ['Stochastic Gradient Descent samples a small set of training examples called a minibatch, typically a power of two like 32, 64, or 128, to efficiently compute an estimate of the full sum and true gradient.', 'The chapter discusses the concept of gradient descent, where the negative gradient direction is used to move towards the minima of the loss function, ultimately aiming to achieve low loss.', 'It introduces slightly more advanced update rules such as gradient descent with momentum and the Adam optimizer, which are commonly used in practice for training models.', 'The chapter emphasizes the importance of using different update rules to determine how to use gradient information, while still following the fundamental algorithm of going downhill at every time step.', 'Stochastic Gradient Descent is used to address the computational challenges of computing loss and gradients over a large training set, with the example of the ImageNet dataset having 1.3 million samples, leading to slow computation.']}, {'end': 4478.127, 'segs': [{'end': 4149.72, 'src': 'embed', 'start': 4119.688, 'weight': 0, 'content': [{'end': 4126.532, 'text': 'then this kind of complex data set actually might become linearly separable and actually could be classified correctly by a linear classifier.', 'start': 4119.688, 'duration': 6.844}, {'end': 4132.755, 'text': 'And the whole trick here now is to kind of figure out what is the right feature transform, that is, sort of computing,', 'start': 4127.152, 'duration': 5.603}, {'end': 4134.975, 'text': 'the right quantities for the problem that you care about.', 'start': 4132.755, 'duration': 2.22}, {'end': 4136.616, 'text': 'So like for images,', 'start': 4135.536, 'duration': 1.08}, {'end': 4140.256, 'text': "maybe converting your pixels to polar coordinates doesn't make sense,", 'start': 4136.616, 'duration': 3.64}, {'end': 4149.72, 'text': 'but you actually can try to write down feature representations of images that might make sense and actually might help you out 
and might do better than putting in raw pixels into the classifier.', 'start': 4140.256, 'duration': 9.464}], 'summary': 'Complex datasets can be linearly separable; finding the right feature transforms is crucial for better classification.', 'duration': 30.032, 'max_score': 4119.688, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc4119688.jpg'}, {'end': 4255.368, 'src': 'embed', 'start': 4232.118, 'weight': 4, 'content': [{'end': 4239.843, 'text': 'And now your full feature vector will be these different bucketed histograms of edge orientations across all the different eight by eight regions in the image.', 'start': 4232.118, 'duration': 7.725}, {'end': 4246.805, 'text': 'So this kind of gives you some, this is in some sense dual to the color histogram classifier that we saw before.', 'start': 4240.643, 'duration': 6.162}, {'end': 4250.987, 'text': 'So color histogram is kind of saying globally what colors exist in the image.', 'start': 4247.285, 'duration': 3.702}, {'end': 4255.368, 'text': 'And this is kind of saying overall what types of edge information exist in the image.', 'start': 4251.347, 'duration': 4.021}], 'summary': 'Feature vector includes bucketed histograms of edge orientations across eight by eight regions in the image, providing overall edge information.', 'duration': 23.25, 'max_score': 4232.118, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc4232118.jpg'}, {'end': 4309.067, 'src': 'embed', 'start': 4271.479, 'weight': 2, 'content': [{'end': 4278.245, 'text': "then you can see that in this region we've got kind of a lot of diagonal edges that this histogram of oriented gradient feature representation is capturing.", 'start': 4271.479, 'duration': 6.766}, {'end': 4284.891, 'text': 'So this was a super common feature representation and was used a lot for object recognition actually not too long ago.', 'start': 4279.486, 'duration': 5.405}, {'end': 4291.395, 'text': 'Another feature representation that you might see kind of out there is this idea of bag of words.', 'start': 4286.731, 'duration': 4.664}, {'end': 4296.259, 'text': "So here we've got this, this is kind of taking inspiration from natural language processing.", 'start': 4292.176, 'duration': 4.083}, {'end': 4298.541, 'text': "So if you've got like a paragraph,", 'start': 4296.679, 'duration': 1.862}, {'end': 4305.226, 'text': 'then kind of a way that you might represent a paragraph by a feature vector is kind of counting up the occurrences of different words in that paragraph.', 'start': 4298.541, 'duration': 6.685}, {'end': 4309.067, 'text': 'So we want to kind of take that intuition and apply it to images in some way.', 'start': 4305.926, 'duration': 3.141}], 'summary': 'Histogram of oriented gradient and bag of words are common feature representations used for object recognition in images.', 'duration': 37.588, 'max_score': 4271.479, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc4271479.jpg'}, {'end': 4419.637, 'src': 'embed', 'start': 4403.046, 'weight': 1, 'content': [{'end': 4418.096, 'text': 'would be that you would take your image and then compute these different feature representations of your image things like bag of words or histogram of oriented gradients concatenate a whole bunch of features together and then feed these feature extractors down into some linear classifier.', 'start': 4403.046, 'duration': 15.05}, {'end': 
4419.637, 'text': "I'm simplifying a little bit.", 'start': 4418.537, 'duration': 1.1}], 'summary': 'Using various feature representations of images, a linear classifier is fed with concatenated features.', 'duration': 16.591, 'max_score': 4403.046, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc4403046.jpg'}], 'start': 4070.078, 'title': 'Image feature representations', 'summary': 'Discusses the use of feature representations and transforms for image classification, including color histograms, histogram of oriented gradients, and bag of words, emphasizing their role in making complex datasets linearly separable and their historical relevance in image processing.', 'chapters': [{'end': 4176.522, 'start': 4070.078, 'title': 'Feature representation for image classification', 'summary': 'Discusses the use of feature representations and feature transforms to make complex datasets linearly separable for image classification, exemplifying how a color histogram can be used to represent global color distribution in an image.', 'duration': 106.444, 'highlights': ['Using feature representations and feature transforms to make complex datasets linearly separable for image classification', 'Exemplifying how a color histogram can be used to represent global color distribution in an image']}, {'end': 4478.127, 'start': 4176.522, 'title': 'Feature representations in image processing', 'summary': 'Discusses common feature representations in image processing, including color histograms, histogram of oriented gradients, and bag of words, with a focus on their data-driven nature and their historical relevance in object recognition and image classification pipelines.', 'duration': 301.605, 'highlights': ['Histogram of oriented gradients feature representation captures local edge orientations by dividing the image into eight by eight pixel regions and computing dominant edge directions, which was a common feature representation used for object recognition.', 'Bag of words feature representation involves defining a vocabulary of visual words through clustering and then encoding the image based on the occurrences of these visual words, which was historically relevant and was worked on by Fei-Fei when she was a grad student.', 'The image classification pipeline, historically, involved computing different feature representations such as bag of words and histogram of oriented gradients, concatenating them, and feeding them into a linear classifier, which was later replaced by convolutional neural networks that learned features directly from the data.']}], 'duration': 408.049, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/h7iBpEHGVNc/pics/h7iBpEHGVNc4070078.jpg', 'highlights': ['Using feature representations and feature transforms to make complex datasets linearly separable for image classification', 'The image classification pipeline historically involved computing different feature representations such as bag of words and histogram of oriented gradients, concatenating them, and feeding them into a linear classifier', 'Bag of words feature representation involves defining a vocabulary of visual words through clustering and then encoding the image based on the occurrences of these visual words, which was historically relevant and was worked on by Fei-Fei when she was a grad student', 'Histogram of oriented gradients feature representation captures local edge orientations by dividing the image into eight by eight pixel regions and 
computing dominant edge directions, which was a common feature representation used for object recognition', 'Exemplifying how a color histogram can be used to represent global color distribution in an image']}], 'highlights': ['Total loss for entire data set is 5.3', 'Stanford University offers a two-week extension for assignment one', 'Introduction to challenges of image recognition and data-driven approach', 'Regularization term encourages model to pick a simpler W', 'Analytic gradients are preferred over numerical gradients', 'Introduction of multi-class SVM loss as a concrete example for image classification', 'L2 regularization measures complexity by preferring smaller norms', 'Using feature representations and feature transforms for image classification', 'Stochastic Gradient Descent samples a small set of training examples', 'The softmax loss function aims to drive the probability mass of the correct class towards one']}
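The softmax behavior described in the segments above (the loss is the negative log of the probability assigned to the correct class, it equals -log(1/C) when all scores are near zero, and it keeps pushing the correct-class probability toward one) can be checked numerically. The sketch below is a minimal NumPy illustration, not code from the lecture or the assignment; the function name softmax_loss and the toy scores are made up for the example.

import numpy as np

def softmax_loss(scores, correct_class):
    # Subtract the max score for numerical stability before exponentiating.
    shifted = scores - np.max(scores)
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    # The loss is the negative log probability assigned to the correct class.
    return -np.log(probs[correct_class])

# With all scores near zero the probabilities are uniform, so the loss is
# -log(1/C) = log(C); for C = 10 classes that is about 2.3.
print(softmax_loss(np.zeros(10), correct_class=3))                    # ~2.303

# Piling score onto the correct class keeps shrinking the loss toward zero,
# which is why softmax keeps pushing scores apart even once they are "correct".
print(softmax_loss(np.array([10.0, -10.0, -10.0]), correct_class=0))  # ~4e-9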
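The transcript above recommends the numeric gradient as a debugging tool rather than as the quantity you optimize with. A common way to use it is a gradient check: compare a centered finite-difference estimate against the analytic gradient your code produces. The sketch below is a generic version under that assumption; the toy function f(w) = sum(w**2), whose analytic gradient 2*w is known exactly, stands in for a real loss such as the SVM or softmax loss.

import numpy as np

def numeric_gradient(f, w, h=1e-5):
    # Centered finite-difference estimate of df/dw, one coordinate at a time.
    grad = np.zeros_like(w)
    for i in range(w.size):
        old = w.flat[i]
        w.flat[i] = old + h
        f_plus = f(w)
        w.flat[i] = old - h
        f_minus = f(w)
        w.flat[i] = old                      # restore the original value
        grad.flat[i] = (f_plus - f_minus) / (2 * h)
    return grad

# Gradient check on a function whose derivative we know exactly.
w = np.random.randn(3, 4)
f = lambda w: np.sum(w ** 2)                 # analytic gradient is 2 * w
analytic = 2 * w
numeric = numeric_gradient(f, w)
print(np.max(np.abs(analytic - numeric)))    # should be tiny, e.g. < 1e-6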
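The vanilla gradient descent loop and its minibatch variant summarized above fit in a few lines. In this sketch, loss_and_grad is a placeholder for any differentiable loss (for example an SVM or softmax loss plus regularization) that returns the loss value and the gradient dW; the learning rate, batch size, and iteration count are arbitrary illustrative values, not settings from the course.

import numpy as np

def train_sgd(loss_and_grad, W, X_train, y_train,
              learning_rate=1e-3, batch_size=64, num_iters=1000):
    num_train = X_train.shape[0]
    for it in range(num_iters):
        # Sample a small minibatch (e.g. 32 / 64 / 128 examples) instead of
        # touching all N training examples on every update.
        idx = np.random.choice(num_train, batch_size, replace=False)
        loss, dW = loss_and_grad(W, X_train[idx], y_train[idx])
        # Step opposite the gradient, since the gradient points in the
        # direction of greatest increase of the loss.
        W -= learning_rate * dW
    return W

# Full-batch gradient descent is the same loop with the sampling line removed
# and loss_and_grad evaluated on the entire training set at every iteration.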
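The feature-based pipeline described above (compute descriptors such as a color histogram or a histogram of oriented gradients, concatenate them, and feed the resulting vector to a linear classifier) can be illustrated with the simplest of those descriptors. This is a rough sketch of a per-channel color histogram, not the exact feature code used in the course; the bin count and value range are arbitrary choices.

import numpy as np

def color_histogram(image, bins=16):
    # image: H x W x 3 array of uint8 pixel values in [0, 255].
    # Bucket each color channel separately and concatenate the histograms.
    feats = []
    for c in range(3):
        hist, _ = np.histogram(image[..., c], bins=bins, range=(0, 256))
        feats.append(hist)
    return np.concatenate(feats).astype(np.float32)

img = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)
x = color_histogram(img)       # 3 channels * 16 bins = 48-dimensional feature
print(x.shape)                 # (48,)

# In the historical two-stage pipeline, several such descriptors would be
# concatenated and the resulting vector, rather than raw pixels, would be fed
# into a linear classifier trained with the SVM or softmax loss discussed above.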