title

Lecture 14 - Support Vector Machines

description

Support Vector Machines - One of the most successful learning algorithms; getting a complex model at the price of a simple one. Lecture 14 of 18 of Caltech's Machine Learning Course - CS 156 by Professor Yaser Abu-Mostafa. View course materials in iTunes U Course App - https://itunes.apple.com/us/course/machine-learning/id515364596 and on the course website - http://work.caltech.edu/telecourse.html
Produced in association with Caltech Academic Media Technologies under the Attribution-NonCommercial-NoDerivs Creative Commons License (CC BY-NC-ND). To learn more about this license, http://creativecommons.org/licenses/by-nc-nd/3.0/
This lecture was recorded on May 17, 2012, in Hameetman Auditorium at Caltech, Pasadena, CA, USA.
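The margin concept at the center of this lecture can be sketched numerically: the distance from a point x_n to the plane w·x + b = 0 is |w·x_n + b| / ‖w‖, and the margin of a separating plane is that distance for its nearest data point. A minimal numpy illustration (the plane and data values below are made up for the example, not taken from the lecture):

```python
import numpy as np

def point_plane_distance(X, w, b):
    # Distance from each row x_n of X to the plane w.x + b = 0:
    # |w.x_n + b| / ||w||, as derived in the lecture.
    return np.abs(X @ w + b) / np.linalg.norm(w)

def margin(X, w, b):
    # The margin of a separating plane is its distance to the nearest point.
    return point_plane_distance(X, w, b).min()

# Hypothetical linearly separable 2D data and a candidate plane x1 + x2 = 0
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
w = np.array([1.0, 1.0])
b = 0.0
print(margin(X, w, b))  # nearest point sits at distance sqrt(2) ≈ 1.414
```

Maximizing this quantity over (w, b), under the normalization that the nearest point has |w·x_n + b| = 1, is the optimization problem the lecture reformulates for quadratic programming.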

detail

{'title': 'Lecture 14 - Support Vector Machines', 'heatmap': [{'end': 3566.974, 'start': 3517.242, 'weight': 1}], 'summary': 'Lecture on support vector machines covers topics such as validation and cross-validation in machine learning, support vector machines, margin importance, plane geometry, euclidean distance, lagrange formulation in optimization, quadratic programming, svm and support vectors, kernel methods, and svm in nonlinear transformation.', 'chapters': [{'end': 226.293, 'segs': [{'end': 36.679, 'src': 'embed', 'start': 0.703, 'weight': 0, 'content': [{'end': 3.567, 'text': 'The following program is brought to you by Caltech.', 'start': 0.703, 'duration': 2.864}, {'end': 16.145, 'text': 'Welcome back.', 'start': 15.644, 'duration': 0.501}, {'end': 25.014, 'text': 'Last time we talked about validation, which is a very important technique in machine learning for estimating the out-of-sample performance.', 'start': 18.351, 'duration': 6.663}, {'end': 31.457, 'text': 'And the idea is that we start from the data set that is given to us that has n points.', 'start': 26.535, 'duration': 4.922}, {'end': 36.679, 'text': 'We set aside k points for validation, for just estimation.', 'start': 32.377, 'duration': 4.302}], 'summary': 'Validation is crucial in machine learning for estimating out-of-sample performance using k points from a dataset of n points.', 'duration': 35.976, 'max_score': 0.703, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU703.jpg'}, {'end': 107.894, 'src': 'embed', 'start': 77.592, 'weight': 3, 'content': [{'end': 81.573, 'text': 'And there is a question of how accurate an estimate this would be for Eout.', 'start': 77.592, 'duration': 3.981}, {'end': 88.115, 'text': 'And we found out that k cannot be too small and cannot be too big in order for this estimate to be reliable.', 'start': 82.013, 'duration': 6.102}, {'end': 94.317, 'text': 'And we ended up with a rule of thumb of about 20% 
of the data set go to validation.', 'start': 88.535, 'duration': 5.782}, {'end': 95.657, 'text': 'That will give you a reasonable estimate.', 'start': 94.357, 'duration': 1.3}, {'end': 99.19, 'text': 'Now, this was an unbiased estimate.', 'start': 97.509, 'duration': 1.681}, {'end': 100.13, 'text': 'So we get an E out.', 'start': 99.21, 'duration': 0.92}, {'end': 107.894, 'text': 'We can get better than E out, or worse than E out in general, as far as E val estimating the performance of g minus.', 'start': 100.15, 'duration': 7.744}], 'summary': 'To estimate eout, use about 20% of the data for validation, ensuring k is neither too small nor too big for reliable results.', 'duration': 30.302, 'max_score': 77.592, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU77592.jpg'}, {'end': 177.616, 'src': 'embed', 'start': 145.955, 'weight': 1, 'content': [{'end': 148.516, 'text': 'just to pin down the bias.', 'start': 145.955, 'duration': 2.561}, {'end': 152.898, 'text': 'And we realized that as we increase the number of examples, the bias goes down.', 'start': 149.297, 'duration': 3.601}, {'end': 154.639, 'text': 'The difference between the two curves goes down.', 'start': 152.958, 'duration': 1.681}, {'end': 161.204, 'text': 'And indeed, if you have a reasonable size validation set, you can afford to estimate a couple of parameters for sure,', 'start': 155.559, 'duration': 5.645}, {'end': 163.005, 'text': 'without contaminating the data too much.', 'start': 161.204, 'duration': 1.801}, {'end': 167.969, 'text': "So you can assume that the measurement you're getting from the validation set is a reliable estimate.", 'start': 163.305, 'duration': 4.664}, {'end': 177.616, 'text': 'Then, because the number of examples turned out to be an issue, we introduced the cross-validation, which is, by and large,', 'start': 169.594, 'duration': 8.022}], 'summary': 'Increasing examples decreases bias; cross-validation addresses 
sample size issues.', 'duration': 31.661, 'max_score': 145.955, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU145955.jpg'}, {'end': 206.053, 'src': 'embed', 'start': 183.398, 'weight': 2, 'content': [{'end': 190.86, 'text': 'So in this case, illustrating a case where you have tenfold cross-validation, you divide the data set into 10 parts.', 'start': 183.398, 'duration': 7.462}, {'end': 195.724, 'text': 'You train on 9, and validate on the 10th, and keep that estimate of the error.', 'start': 191.4, 'duration': 4.324}, {'end': 200.388, 'text': 'And you keep repeating as you choose the validation subset to be one of those.', 'start': 196.224, 'duration': 4.164}, {'end': 206.053, 'text': 'So you have 10 runs, and each of them gives you an estimate on a small number of examples, one-tenth of the examples.', 'start': 200.448, 'duration': 5.605}], 'summary': 'Illustrates tenfold cross-validation, training on 9 parts, validating on 10th, repeated 10 times for error estimation.', 'duration': 22.655, 'max_score': 183.398, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU183398.jpg'}], 'start': 0.703, 'title': 'Validation and cross-validation in ml', 'summary': 'Discusses the technique of validation in machine learning, where a subset of data is set aside for validation to estimate out-of-sample performance, and the impact of validation set size on bias is highlighted. 
it also covers the use of tenfold cross-validation to estimate out-of-sample error with a reliable average estimate.', 'chapters': [{'end': 76.572, 'start': 0.703, 'title': 'Validation in machine learning', 'summary': 'Discusses the technique of validation in machine learning, where a subset of data is set aside for validation to estimate out-of-sample performance, and a leap of faith is made to come up with the best possible hypothesis using the validation error.', 'duration': 75.869, 'highlights': ['The technique of validation involves setting aside a subset of data for estimating out-of-sample performance, which is crucial in machine learning.', 'The process includes training with the remaining points after setting aside a subset for validation, resulting in a labeled hypothesis g- for estimation.', 'A leap of faith is made when putting back all examples to come up with the best possible hypothesis, using the validation error to estimate out-of-sample performance on the delivered hypothesis.']}, {'end': 226.293, 'start': 77.592, 'title': 'Validation and cross-validation in ml', 'summary': 'Discusses the importance of validation and cross-validation in machine learning, highlighting the impact of validation set size on bias and the use of tenfold cross-validation to estimate out-of-sample error with a reliable average estimate.', 'duration': 148.701, 'highlights': ['The impact of validation set size on bias is demonstrated, showing that a reasonable size validation set can provide reliable estimates with reduced bias as the number of examples increases.', 'The use of tenfold cross-validation is explained, where the data set is divided into 10 parts, training on 9 and validating on the 10th, to provide a reliable average estimate of out-of-sample error despite different subsets each time.', 'A rule of thumb is provided, suggesting that about 20% of the data set should go to validation to obtain a reliable estimate for Eout.', 'The validation error is shown to have 
a slight positive or optimistic bias when used for model selection, impacting its reliability as an unbiased estimate of the out-of-sample error.']}], 'duration': 225.59, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU703.jpg', 'highlights': ['The technique of validation involves setting aside a subset of data for estimating out-of-sample performance, crucial in machine learning.', 'The impact of validation set size on bias is demonstrated, showing that a reasonable size validation set can provide reliable estimates with reduced bias as the number of examples increases.', 'The use of tenfold cross-validation is explained, providing a reliable average estimate of out-of-sample error despite different subsets each time.', 'A rule of thumb is provided, suggesting that about 20% of the data set should go to validation to obtain a reliable estimate for Eout.']}, {'end': 885.185, 'segs': [{'end': 273.331, 'src': 'embed', 'start': 246.216, 'weight': 0, 'content': [{'end': 249.16, 'text': 'And validation is the method of choice in that case, in order to make that.', 'start': 246.216, 'duration': 2.944}, {'end': 255.318, 'text': "So we move on to today's lecture, which is support vector machines.", 'start': 251.416, 'duration': 3.902}, {'end': 262.123, 'text': 'Support vector machines are arguably the most successful classification method in machine learning.', 'start': 255.959, 'duration': 6.164}, {'end': 267.387, 'text': 'And they are very nice, because there is a principled derivation for the method.', 'start': 263.144, 'duration': 4.243}, {'end': 273.331, 'text': 'There is a very nice optimization package that you can use in order to get the solution.', 'start': 268.027, 'duration': 5.304}], 'summary': 'Support vector machines are the most successful classification method in machine learning.', 'duration': 27.115, 'max_score': 246.216, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU246216.jpg'}, {'end': 329.524, 'src': 'embed', 'start': 289.208, 'weight': 1, 'content': [{'end': 293.414, 'text': 'And we will ask a question of maximizing the margin, getting the best possible margin.', 'start': 289.208, 'duration': 4.206}, {'end': 297.439, 'text': 'And after formulating the problem, we are going to go and get the solution.', 'start': 294.335, 'duration': 3.104}, {'end': 299.061, 'text': 'And we are going to do that analytically.', 'start': 297.499, 'duration': 1.562}, {'end': 301.403, 'text': 'It will be a constrained optimization problem.', 'start': 299.542, 'duration': 1.861}, {'end': 307.245, 'text': 'And we faced one before in regularization, where we gave a geometrical solution, if you will.', 'start': 301.483, 'duration': 5.762}, {'end': 314.228, 'text': 'This time, we are going to do it analytically, because the formulation is simply too complicated to have an intuitive geometric solution for.', 'start': 307.585, 'duration': 6.643}, {'end': 320.801, 'text': 'And finally, we are going to expand from the linear case to the nonlinear case in the usual way,', 'start': 315.399, 'duration': 5.402}, {'end': 329.524, 'text': 'thus expanding all of the machinery to a case where you can deal with nonlinear surfaces instead of just a line in a separable case,', 'start': 320.801, 'duration': 8.723}], 'summary': 'Analytically solve constrained optimization problem to maximize margin.', 'duration': 40.316, 'max_score': 289.208, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU289208.jpg'}, {'end': 513.207, 'src': 'embed', 'start': 487.546, 'weight': 3, 'content': [{'end': 494.151, 'text': 'Now, it is quite intuitive that the bigger margin is better, because think of a process that is generating the data.', 'start': 487.546, 'duration': 6.605}, {'end': 496.353, 'text': "And let's say that there is noise in 
it.", 'start': 494.992, 'duration': 1.361}, {'end': 503.579, 'text': 'If you have the bigger margin, the chances are the new point will still be on the correct side of the line.', 'start': 497.714, 'duration': 5.865}, {'end': 510.466, 'text': 'Whereas if I use this one, there is a chance that the next red point will be here, and it will be misclassified.', 'start': 505.302, 'duration': 5.164}, {'end': 513.207, 'text': "Again, I'm not giving any proofs.", 'start': 511.346, 'duration': 1.861}], 'summary': 'Bigger margin reduces chance of misclassification in data points.', 'duration': 25.661, 'max_score': 487.546, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU487546.jpg'}, {'end': 673.501, 'src': 'embed', 'start': 641.134, 'weight': 4, 'content': [{'end': 647.497, 'text': 'So effectively, by requiring the margin to be at least something, I am putting a restriction on the growth function.', 'start': 641.134, 'duration': 6.363}, {'end': 651.639, 'text': 'Fat margin implies fewer dichotomies possible.', 'start': 648.477, 'duration': 3.162}, {'end': 661.554, 'text': 'And therefore, if we manage to separate the points with a FAD dichotomy, we can say that FAD dichotomies have a smaller VC dimension,', 'start': 652.608, 'duration': 8.946}, {'end': 664.495, 'text': "smaller growth function, than if I didn't restrict them at all.", 'start': 661.554, 'duration': 2.941}, {'end': 673.501, 'text': 'And although this is all informal, we will come at the end of the lecture to a result that estimates the out-of-sample error based on the margin.', 'start': 665.916, 'duration': 7.585}], 'summary': 'Requiring a fat margin restricts growth function, leading to fewer dichotomies and smaller vc dimension, with implications for out-of-sample error estimation.', 'duration': 32.367, 'max_score': 641.134, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU641134.jpg'}, {'end': 
789.465, 'src': 'embed', 'start': 760.41, 'weight': 5, 'content': [{'end': 762.291, 'text': 'The first one is to normalize w.', 'start': 760.41, 'duration': 1.881}, {'end': 763.151, 'text': 'What do I mean by that?', 'start': 762.291, 'duration': 0.86}, {'end': 769.013, 'text': 'For all the points in the data set near and far.', 'start': 765.412, 'duration': 3.601}, {'end': 776.236, 'text': 'when you take w, transpose times xn, you will get a number that is different from 0..', 'start': 769.013, 'duration': 7.223}, {'end': 780.078, 'text': 'And indeed, it will agree with the label yn, because the points are linearly separable.', 'start': 776.236, 'duration': 3.842}, {'end': 783.759, 'text': "So I can take the absolute value of this and claim that it's greater than 0 for every point.", 'start': 780.118, 'duration': 3.641}, {'end': 789.465, 'text': 'Now, I would like to relate w to the margin, or to the distance.', 'start': 785.083, 'duration': 4.382}], 'summary': 'Normalize w to ensure w transpose times xn is greater than 0 for every point.', 'duration': 29.055, 'max_score': 760.41, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU760410.jpg'}], 'start': 226.733, 'title': 'Support vector machines and margin importance', 'summary': 'Covers the concept of support vector machines, a powerful classification method in machine learning, explaining the notion of margin and the process of maximizing it to obtain the best possible solution. 
it also discusses the significance of a bigger margin in classification, highlighting its impact on reducing misclassifications and its relationship with vc analysis.', 'chapters': [{'end': 452.939, 'start': 226.733, 'title': 'Support vector machines', 'summary': 'Covers the concept of support vector machines, a powerful classification method in machine learning, explaining the notion of margin and the process of maximizing it to obtain the best possible solution, and expanding from linear to nonlinear cases.', 'duration': 226.206, 'highlights': ['Support vector machines are arguably the most successful classification method in machine learning. Highlighting the significance of support vector machines as a highly successful classification method.', 'Introducing the notion of the margin, which is the main concept in support vector machines, and the process of maximizing the margin to obtain the best possible solution. Emphasizing the importance of the margin concept and the process of maximizing it to achieve the best solution in support vector machines.', 'Expanding from the linear case to the nonlinear case in support vector machines to handle nonlinear surfaces instead of just a line in a separable case. Highlighting the expansion from linear to nonlinear cases in support vector machines to deal with nonlinear surfaces.']}, {'end': 885.185, 'start': 454.3, 'title': 'Importance of bigger margin', 'summary': 'Discusses the significance of a bigger margin in classification, highlighting its impact on reducing misclassifications and its relationship with vc analysis, where a fat margin implies fewer dichotomies and a smaller vc dimension, leading to better out-of-sample performance.', 'duration': 430.885, 'highlights': ['The bigger margin is better as it reduces misclassifications, increasing the chances of new points being correctly classified. 
A fat margin implies fewer dichotomies, resulting in a smaller VC dimension and better out-of-sample performance.', 'The relationship between a fat margin and the growth function is explained, emphasizing that a fat margin restricts the growth function, leading to fewer possible dichotomies and a smaller VC dimension.', 'The process of normalizing w and its relation to the margin, emphasizing the need for scale invariance and the significance of wxn as the signal for classification.']}], 'duration': 658.452, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU226733.jpg', 'highlights': ['Support vector machines are arguably the most successful classification method in machine learning.', 'Introducing the notion of the margin and the process of maximizing it to obtain the best possible solution.', 'Expanding from the linear case to the nonlinear case in support vector machines to handle nonlinear surfaces.', 'The bigger margin is better as it reduces misclassifications, increasing the chances of new points being correctly classified.', 'The relationship between a fat margin and the growth function is explained, emphasizing that a fat margin restricts the growth function.', 'The process of normalizing w and its relation to the margin, emphasizing the need for scale invariance and the significance of wxn as the signal for classification.']}, {'end': 1629.673, 'segs': [{'end': 932.491, 'src': 'embed', 'start': 902.38, 'weight': 3, 'content': [{'end': 905.701, 'text': "I'm comparing the performance of different planes on the same point.", 'start': 902.38, 'duration': 3.321}, {'end': 907.541, 'text': 'So I have to have the same yardstick.', 'start': 906.041, 'duration': 1.5}, {'end': 910.562, 'text': "And the yardstick I'm going to use is the Euclidean distance.", 'start': 907.821, 'duration': 2.741}, {'end': 914.382, 'text': "So I'm going to take this as a constraint.", 'start': 912.642, 'duration': 1.74}, {'end': 
919.823, 'text': "And when I solve for it, I will find out that the problem I'm now solving is much easier to solve.", 'start': 914.442, 'duration': 5.381}, {'end': 923.424, 'text': 'And then I can get the plane, and the plane will be general under this normalization.', 'start': 920.163, 'duration': 3.261}, {'end': 926.428, 'text': 'The second one is pure technicality.', 'start': 924.707, 'duration': 1.721}, {'end': 932.491, 'text': 'Remember that we had x being in Euclidean space, r to the d.', 'start': 927.148, 'duration': 5.343}], 'summary': 'Comparing plane performance using euclidean distance as a yardstick and solving for an easier problem.', 'duration': 30.111, 'max_score': 902.38, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU902380.jpg'}, {'end': 1001.648, 'src': 'embed', 'start': 969.755, 'weight': 2, 'content': [{'end': 974.701, 'text': 'So the vector w now is the old vector, w1 up to wd.', 'start': 969.755, 'duration': 4.946}, {'end': 978.345, 'text': 'And you take out w0.', 'start': 976.182, 'duration': 2.163}, {'end': 985.552, 'text': 'And in order not to confuse it and call it w because it has a different role, we are going to call it here b for bias.', 'start': 978.885, 'duration': 6.667}, {'end': 995.825, 'text': 'So now the equation for the plane is w, our new w, times x, plus b equals 0.', 'start': 987.301, 'duration': 8.524}, {'end': 997.346, 'text': 'And there is no x0.', 'start': 995.825, 'duration': 1.521}, {'end': 1001.648, 'text': 'x0 is the one that used to be multiplied by b, also known as w0.', 'start': 997.506, 'duration': 4.142}], 'summary': 'Equation for the plane is w times x plus b equals 0', 'duration': 31.893, 'max_score': 969.755, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU969755.jpg'}, {'end': 1230.332, 'src': 'embed', 'start': 1202.62, 'weight': 1, 'content': [{'end': 1211.651, 'text': 'Because what do we 
have? The distance between xn and the plane, and we put them here, is what? It can be computed as follows.', 'start': 1202.62, 'duration': 9.031}, {'end': 1214.274, 'text': 'Take any point, one point on the plane.', 'start': 1212.372, 'duration': 1.902}, {'end': 1215.636, 'text': 'We just call it generic x.', 'start': 1214.534, 'duration': 1.102}, {'end': 1224.83, 'text': 'And then you take the projection of the vector going from here to here.', 'start': 1218.408, 'duration': 6.422}, {'end': 1230.332, 'text': 'You project it on the direction which is orthogonal to the plane, and that will be your distance.', 'start': 1225.49, 'duration': 4.842}], 'summary': 'Computing the distance between a point and a plane using orthogonal projection.', 'duration': 27.712, 'max_score': 1202.62, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU1202620.jpg'}, {'end': 1403.503, 'src': 'embed', 'start': 1374.255, 'weight': 0, 'content': [{'end': 1380.517, 'text': "This I can use in order now to choose what combination of w's will give me the best possible margin, which is the next one.", 'start': 1374.255, 'duration': 6.262}, {'end': 1382.978, 'text': "So let's now formulate the problem.", 'start': 1381.538, 'duration': 1.44}, {'end': 1385.139, 'text': 'Here is the optimization problem that resulted.', 'start': 1383.298, 'duration': 1.841}, {'end': 1389.405, 'text': 'We are maximizing the margin.', 'start': 1387.543, 'duration': 1.862}, {'end': 1392.409, 'text': 'The margin happens to be 1 over the norm, so that is what we are maximizing.', 'start': 1389.445, 'duration': 2.964}, {'end': 1394.712, 'text': 'Subject to what?', 'start': 1393.911, 'duration': 0.801}, {'end': 1403.503, 'text': 'Subject to the fact that for the nearest point, which happens to have the smallest value of those guys,', 'start': 1396.955, 'duration': 6.548}], 'summary': 'Optimizing to maximize margin, subject to nearest point constraint.', 'duration': 
29.248, 'max_score': 1374.255, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU1374255.jpg'}], 'start': 886.387, 'title': 'Plane geometry and euclidean distance', 'summary': 'Covers the use of euclidean distance for comparing plane performance, the separation of bias term from weight vector in support vector machines, and the computation of distance using vectors to maximize margin through quadratic programming.', 'chapters': [{'end': 1032.367, 'start': 886.387, 'title': 'Euclidean distance and plane representation', 'summary': 'Discusses the importance of using euclidean distance for comparing the performance of different planes on the same point, as well as the technicality of separating the bias term from the weight vector for analysis of support vector machines.', 'duration': 145.98, 'highlights': ['The importance of using Euclidean distance for comparing the performance of different planes on the same point is emphasized, as it provides a consistent yardstick for evaluation.', 'Separating the bias term from the weight vector is crucial for the analysis of support vector machines, as w1 up to wd play a different role from w0, leading to the convention of representing the plane equation as w transpose x plus b equals 0.']}, {'end': 1629.673, 'start': 1032.406, 'title': 'Geometry of plane and distance computation', 'summary': 'Explains the geometry of a plane and the computation of distance using vectors, emphasizing the orthogonal relationship between the vector w and the plane, and the formulation of an optimization problem to maximize the margin, leading to quadratic programming.', 'duration': 597.267, 'highlights': ['The vector w is orthogonal to the plane, as it must be orthogonal to every vector on the plane, leading to the conclusion that w is orthogonal to the plane. 
The vector w is shown to be orthogonal to the plane, as it must be orthogonal to every vector on the plane, resulting in the conclusion that w is orthogonal to the plane.', 'The distance between xn and the plane can be computed by taking the projection of the vector going from xn to a point on the plane on the direction orthogonal to the plane, and normalizing it by its norm to obtain the distance. The computation of the distance between xn and the plane involves taking the projection of the vector from xn to a point on the plane on the direction orthogonal to the plane, and normalizing it by its norm to obtain the distance.', 'The formulation of an optimization problem to maximize the margin involves maximizing 1 over the norm of w, subject to constraints related to the nearest point, resulting in the equivalent problem of minimizing 1 over the norm of w under friendly constraints, leading to the use of quadratic programming. The formulation of an optimization problem to maximize the margin involves maximizing 1 over the norm of w, subject to constraints related to the nearest point, resulting in the equivalent problem of minimizing 1 over the norm of w under friendly constraints, leading to the use of quadratic programming.']}], 'duration': 743.286, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU886387.jpg', 'highlights': ['The formulation of an optimization problem to maximize the margin involves maximizing 1 over the norm of w, subject to constraints related to the nearest point, resulting in the equivalent problem of minimizing 1 over the norm of w under friendly constraints, leading to the use of quadratic programming.', 'The distance between xn and the plane can be computed by taking the projection of the vector going from xn to a point on the plane on the direction orthogonal to the plane, and normalizing it by its norm to obtain the distance.', 'Separating the bias term from the weight vector is 
crucial for the analysis of support vector machines, as w1 up to wd play a different role from w0, leading to the convention of representing the plane equation as w transpose x plus b equals 0.', 'The importance of using Euclidean distance for comparing the performance of different planes on the same point is emphasized, as it provides a consistent yardstick for evaluation.']}, {'end': 2074.597, 'segs': [{'end': 1664.777, 'src': 'embed', 'start': 1629.673, 'weight': 2, 'content': [{'end': 1631.296, 'text': 'which means that the minimum is 1..', 'start': 1629.673, 'duration': 1.623}, {'end': 1633.339, 'text': 'And therefore, this problem is equivalent to this problem.', 'start': 1631.296, 'duration': 2.043}, {'end': 1636.05, 'text': 'This is really very nice.', 'start': 1634.929, 'duration': 1.121}, {'end': 1639.431, 'text': 'So we started from a concept and geometry and simplification.', 'start': 1636.47, 'duration': 2.961}, {'end': 1643.032, 'text': 'Now we end up with this very friendly statement that we are going to solve.', 'start': 1639.491, 'duration': 3.541}, {'end': 1648.775, 'text': 'And when you solve it, you are going to get the separating plane with the best possible margin.', 'start': 1643.393, 'duration': 5.382}, {'end': 1652.494, 'text': "Let's look at the solution.", 'start': 1651.574, 'duration': 0.92}, {'end': 1657.715, 'text': "Formally speaking, let's put it in a constraint optimization question.", 'start': 1654.015, 'duration': 3.7}, {'end': 1664.777, 'text': 'The constraint optimization here, you minimize this objective function, subject to these constraints.', 'start': 1658.255, 'duration': 6.522}], 'summary': 'Problem simplification leads to constraint optimization with a minimum of 1.', 'duration': 35.104, 'max_score': 1629.673, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU1629673.jpg'}, {'end': 1703.137, 'src': 'embed', 'start': 1676.8, 'weight': 5, 'content': [{'end': 1680.582, 
'text': 'Now when you have a constraint optimization, we have a bunch of constraints here.', 'start': 1676.8, 'duration': 3.782}, {'end': 1684.345, 'text': 'And we will need to go an analytic route in order to solve it.', 'start': 1681.263, 'duration': 3.082}, {'end': 1686.166, 'text': "The geometry won't help us very much.", 'start': 1684.645, 'duration': 1.521}, {'end': 1691.349, 'text': 'So what we are going to do here, we are going to ask ourselves, oh, constraint optimization.', 'start': 1687.287, 'duration': 4.062}, {'end': 1692.45, 'text': 'I heard of Lagrange.', 'start': 1691.469, 'duration': 0.981}, {'end': 1699.435, 'text': 'You form a Lagrangian, and then all of a sudden the constraint becomes unconstrained, and you solve it, and you get the multiplier as lambda.', 'start': 1693.171, 'duration': 6.264}, {'end': 1702.277, 'text': 'Lambda is pretty much what we got in regularization before.', 'start': 1699.835, 'duration': 2.442}, {'end': 1703.137, 'text': 'We did it geometrically.', 'start': 1702.337, 'duration': 0.8}], 'summary': 'Analytically solve constraint optimization using lagrange method and get multiplier lambda.', 'duration': 26.337, 'max_score': 1676.8, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU1676800.jpg'}, {'end': 1756.943, 'src': 'embed', 'start': 1728.371, 'weight': 1, 'content': [{'end': 1732.337, 'text': "I can comment on that in the Q&A session, because it's a very nice approach.", 'start': 1728.371, 'duration': 3.966}, {'end': 1739.668, 'text': 'And that approach was derived independently by two sets of people, Karush, which is the first K, and Kuhn-Tucker, which is the KT.', 'start': 1732.778, 'duration': 6.89}, {'end': 1744.615, 'text': 'And the Lagrangian under inequality constraint is referred to as KKT.', 'start': 1740.309, 'duration': 4.306}, {'end': 1748.27, 'text': 'Now, let us try to solve this.', 'start': 1746.146, 'duration': 2.124}, {'end': 1756.943, 'text': "And 
I'd like, before I actually go through the mathematics of it, to remind you that we actually saw this before in the constraint optimization.", 'start': 1748.83, 'duration': 8.113}], 'summary': 'Discussion of the kkt approach and constraint optimization in mathematics.', 'duration': 28.572, 'max_score': 1728.371, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU1728371.jpg'}, {'end': 1809.122, 'src': 'embed', 'start': 1787.475, 'weight': 4, 'content': [{'end': 1796.078, 'text': "And the condition for the solution showed that the gradient of your objective function, the thing you're trying to minimize,", 'start': 1787.475, 'duration': 8.603}, {'end': 1798.899, 'text': 'becomes something that is related to the constraint itself.', 'start': 1796.078, 'duration': 2.821}, {'end': 1800.359, 'text': 'In this case, normal.', 'start': 1798.939, 'duration': 1.42}, {'end': 1809.122, 'text': 'The most important aspect to realize is that when you solve the constraint problem here, the end result was that the gradient is not 0.', 'start': 1801.32, 'duration': 7.802}], 'summary': 'Solving the constraint problem resulted in a non-zero gradient for the objective function.', 'duration': 21.647, 'max_score': 1787.475, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU1787475.jpg'}, {'end': 1850.502, 'src': 'embed', 'start': 1825.768, 'weight': 3, 'content': [{'end': 1833.51, 'text': 'But one of the benefits of reminding you of the regularization is that there is a conceptual dichotomy, no pun intended,', 'start': 1825.768, 'duration': 7.742}, {'end': 1837.692, 'text': 'between the regularization and the SVM.', 'start': 1833.51, 'duration': 4.182}, {'end': 1841.033, 'text': 'SVM is what we are doing here, maximizing the margin and regularization.', 'start': 1838.272, 'duration': 2.761}, {'end': 1846.478, 'text': "So let's look at both cases and ask ourselves what are we 
optimizing and what is the constraint?", 'start': 1841.573, 'duration': 4.905}, {'end': 1850.502, 'text': 'If you remember, in regularization we already have the equation.', 'start': 1848.14, 'duration': 2.362}], 'summary': 'Regularization and svm are conceptually dichotomous; svm maximizes margin and regularization optimizes with constraints.', 'duration': 24.734, 'max_score': 1825.768, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU1825768.jpg'}, {'end': 1940.18, 'src': 'embed', 'start': 1910.376, 'weight': 0, 'content': [{'end': 1919.48, 'text': 'we are not going to do much beyond getting a clean version of the Lagrangian and then passing it on to a package of quadratic programming to give us a solution.', 'start': 1910.376, 'duration': 9.104}, {'end': 1923.701, 'text': 'But at least, arriving there is important.', 'start': 1921.1, 'duration': 2.601}, {'end': 1924.561, 'text': "So let's look at it.", 'start': 1923.821, 'duration': 0.74}, {'end': 1926.342, 'text': 'We are minimizing.', 'start': 1925.502, 'duration': 0.84}, {'end': 1932.874, 'text': 'This is our objective function, subject to constraints of this form.', 'start': 1928.311, 'duration': 4.563}, {'end': 1940.18, 'text': 'First step, take the inequality constraints and put them in the 0 form.', 'start': 1934.215, 'duration': 5.965}], 'summary': 'Objective: minimize objective function subject to constraints. 
Part of the process is solved using quadratic programming.', 'duration': 29.804, 'max_score': 1910.376, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU1910376.jpg'}], 'start': 1629.673, 'title': 'Lagrange formulation in optimization', 'summary': 'Discusses the Lagrange formulation in optimization, emphasizing the use of Lagrange multipliers to handle inequality constraints and maximize with respect to the alphas while ensuring non-negativity, and the relationship between constraint optimization and regularization.', 'chapters': [{'end': 1748.27, 'start': 1629.673, 'title': 'Constraint optimization and Lagrange', 'summary': 'The chapter discusses constraint optimization in the context of a separating plane with the best possible margin, using Lagrange multipliers to handle inequality constraints, which leads to the KKT approach, derived independently by Karush and by Kuhn and Tucker.', 'duration': 118.597, 'highlights': ['The chapter discusses constraint optimization in the context of a separating plane with the best possible margin. The problem aims to find the separating plane with the best possible margin, which is a key concept discussed.', 'Using Lagrange multipliers to handle inequality constraints, which leads to the KKT approach. The chapter explains the use of Lagrange multipliers to handle inequality constraints, leading to the KKT approach, derived independently by Karush and by Kuhn and Tucker.', 'The concepts of geometry and simplification at the beginning lead to formulating the problem as a constraint optimization question. The discussion begins with concepts of geometry and simplification, eventually formulating the problem as a constraint optimization question.']}, {'end': 1873.273, 'start': 1748.83, 'title': 'Lagrangian analysis and optimization', 'summary': 'Discusses the relationship between constraint optimization and regularization, emphasizing the impact of inequality constraints on gradient and the conceptual dichotomy
between svm and regularization, with a focus on the optimization of w transpose w.', 'duration': 124.443, 'highlights': ["Regularization's conceptual dichotomy with SVM emphasizes the optimization of W transpose W and En under related constraints, depicting a conceptual dichotomy between the two methods.", 'Solving a constraint problem results in the gradient being related to the constraint, as opposed to being 0 in an unconstrained problem, showcasing the impact of the constraint on the gradient.', 'Reminding the audience of the regularization concept, emphasizing the relationship between inequality constraints and weight decay, providing a contextual understanding for the subsequent analysis.', 'The constraint optimization previously seen under inequality constraints, specifically regularization, serves as a foundational perspective for the current analysis of constraint problems.']}, {'end': 2074.597, 'start': 1874.093, 'title': 'Lagrange formulation in optimization', 'summary': 'Discusses the lagrange formulation in optimization, focusing on converting inequality constraints into a 0 form, introducing lagrange multipliers, and the concept of maximizing with respect to alphas while ensuring they are non-negative.', 'duration': 200.504, 'highlights': ['The Lagrange formulation involves converting inequality constraints into a 0 form, introducing Lagrange multipliers, and maximizing with respect to alphas while ensuring they are non-negative. This process involves converting inequality constraints into a 0 form, introducing Lagrange multipliers for every point in the set, and maximizing with respect to alphas while ensuring they are non-negative.', 'The concept of maximizing with respect to alphas is particularly interesting due to the need for non-negativity, unlike the case of equality constraints where only gradient equals 0 was considered. 
Inequality constraints require attention to maximizing with respect to alphas while ensuring non-negativity, unlike equality constraints where only gradient equals 0 was considered for both maximum and minimum.', 'The formulation involves a compromise as both the constraint and objective function blend in the Lagrangian, leading to a clean version of the Lagrangian for quadratic programming. The blending of the constraint and objective function in the Lagrangian leads to a compromise, resulting in a clean version of the Lagrangian for quadratic programming.']}], 'duration': 444.924, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU1629673.jpg', 'highlights': ['The Lagrange formulation involves converting inequality constraints into a 0 form, introducing Lagrange multipliers, and maximizing with respect to alphas while ensuring they are non-negative.', 'Using Lagrange multipliers to handle inequality constraints, which leads to the KKT approach.', 'The chapter discusses constraint optimization in the context of a separating plane with the best possible margin.', "Regularization's conceptual dichotomy with SVM emphasizes the optimization of W transpose W and En under related constraints.", 'Solving a constraint problem results in the gradient being related to the constraint, as opposed to being 0 in an unconstrained problem.', 'The concepts of geometry and simplification at the beginning lead to formulating the problem as a constraint optimization question.']}, {'end': 2706.58, 'segs': [{'end': 2211.63, 'src': 'embed', 'start': 2180.356, 'weight': 1, 'content': [{'end': 2185.598, 'text': "I'm going to go back and substitute with these conditions in the original Lagrangian,", 'start': 2180.356, 'duration': 5.242}, {'end': 2194.543, 'text': 'such that the maximization with respect to alpha which is the tricky part, because alpha has a range will become free of w and b.', 'start': 2185.598, 'duration': 8.945}, 
{'end': 2197.885, 'text': 'And that formulation is referred to as the dual formulation of the problem.', 'start': 2194.543, 'duration': 3.342}, {'end': 2200.586, 'text': "So let's substitute.", 'start': 2199.566, 'duration': 1.02}, {'end': 2204.888, 'text': 'Here are what I got from the last slide.', 'start': 2202.567, 'duration': 2.321}, {'end': 2211.63, 'text': 'This is the This one I got from the gradient with respect to w equals 0.', 'start': 2204.968, 'duration': 6.662}], 'summary': 'Substitute conditions in the lagrangian to achieve a free maximization with respect to alpha, known as the dual formulation.', 'duration': 31.274, 'max_score': 2180.356, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU2180356.jpg'}, {'end': 2369.382, 'src': 'embed', 'start': 2344.743, 'weight': 4, 'content': [{'end': 2352.007, 'text': 'Now this is a very nice quantity to have, because this is a very simple quadratic form in the vector alpha.', 'start': 2344.743, 'duration': 7.264}, {'end': 2354.229, 'text': 'Alpha here appears as a linear guy.', 'start': 2352.387, 'duration': 1.842}, {'end': 2355.71, 'text': 'Here appears as a quadratic guy.', 'start': 2354.389, 'duration': 1.321}, {'end': 2356.21, 'text': "That's all.", 'start': 2355.79, 'duration': 0.42}, {'end': 2357.951, 'text': 'Now I need to put the constraints.', 'start': 2356.73, 'duration': 1.221}, {'end': 2361.3, 'text': 'I put back the things I took out.', 'start': 2359.919, 'duration': 1.381}, {'end': 2365.261, 'text': "And let's look at the maximization with respect to alpha.", 'start': 2362.2, 'duration': 3.061}, {'end': 2368.022, 'text': 'Subject to, non-negative 1.', 'start': 2365.961, 'duration': 2.061}, {'end': 2369.382, 'text': 'This is a KKT condition.', 'start': 2368.022, 'duration': 1.36}], 'summary': 'Simple quadratic form in vector alpha, maximize with respect to alpha under non-negativity constraint.', 'duration': 24.639, 'max_score': 2344.743, 
'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU2344743.jpg'}, {'end': 2439.341, 'src': 'embed', 'start': 2409.486, 'weight': 0, 'content': [{'end': 2414.208, 'text': "Now, if I didn't have those annoying constraints, I would be basically done.", 'start': 2409.486, 'duration': 4.722}, {'end': 2416.489, 'text': "Because I look at this, that's pretty easy.", 'start': 2414.548, 'duration': 1.941}, {'end': 2420.091, 'text': 'I can express one of the alphas in terms of the rest of the alphas.', 'start': 2416.77, 'duration': 3.321}, {'end': 2420.952, 'text': 'Factor it out.', 'start': 2420.411, 'duration': 0.541}, {'end': 2423.713, 'text': 'Substitute for that alpha here.', 'start': 2422.392, 'duration': 1.321}, {'end': 2427.955, 'text': 'And all of a sudden, I have a purely unconstrained optimization of a quadratic one.', 'start': 2424.493, 'duration': 3.462}, {'end': 2428.655, 'text': 'I solve it.', 'start': 2428.055, 'duration': 0.6}, {'end': 2431.997, 'text': "I get something, maybe a pseudo-inverse of something, and I'm done.", 'start': 2428.795, 'duration': 3.202}, {'end': 2439.341, 'text': "But I cannot do that simply because I'm restricted to those choices and therefore I have to work with a constraint optimization,", 'start': 2432.597, 'duration': 6.744}], 'summary': 'Constrained optimization leads to extra work and complexity.', 'duration': 29.855, 'max_score': 2409.486, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU2409486.jpg'}, {'end': 2475.103, 'src': 'embed', 'start': 2449.287, 'weight': 3, 'content': [{'end': 2460.214, 'text': 'The purpose of the slide here is to translate the objective and the constraints we had into the coefficients that you are going to pass onto a package called quadratic programming.', 'start': 2449.287, 'duration': 10.927}, {'end': 2461.875, 'text': 'This is a practical slide.', 'start': 2460.654, 'duration': 1.221}, 
{'end': 2471.181, 'text': 'First, what we are doing is maximizing with respect to alpha this quantity that we found, subject to a bunch of constraints.', 'start': 2464.298, 'duration': 6.883}, {'end': 2475.103, 'text': 'Quadratic programming packages come usually with minimization.', 'start': 2472.562, 'duration': 2.541}], 'summary': 'Maximizing alpha quantity subject to constraints for quadratic programming', 'duration': 25.816, 'max_score': 2449.287, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU2449287.jpg'}, {'end': 2526.637, 'src': 'embed', 'start': 2497.249, 'weight': 5, 'content': [{'end': 2500.091, 'text': "You're not passing alphas to the quadratic programming.", 'start': 2497.249, 'duration': 2.842}, {'end': 2503.633, 'text': 'Quadratic programming works with a vector of variables that you call alpha.', 'start': 2500.111, 'duration': 3.522}, {'end': 2516.259, 'text': 'What you are passing are the coefficients of your particular problems that are decided by these numbers that the quadratic programming will take and then will be able to give you the alphas that would minimize this quantity.', 'start': 2504.073, 'duration': 12.186}, {'end': 2518.06, 'text': 'So this is what it looks like.', 'start': 2516.699, 'duration': 1.361}, {'end': 2524.055, 'text': 'I have a quadratic term, alpha transpose alpha.', 'start': 2521.693, 'duration': 2.362}, {'end': 2526.637, 'text': 'And these are the coefficients in the double summation.', 'start': 2524.455, 'duration': 2.182}], 'summary': 'Quadratic programming requires passing alphas as variables to minimize the quantity.', 'duration': 29.388, 'max_score': 2497.249, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU2497249.jpg'}, {'end': 2568.969, 'src': 'embed', 'start': 2540.848, 'weight': 7, 'content': [{'end': 2544.171, 'text': 'And quadratic programming asks you for the quadratic term and asks you for 
the linear term.', 'start': 2540.848, 'duration': 3.323}, {'end': 2549.336, 'text': 'Well, the linear term, just to be formal, happens to be, since we are just taking minus alpha.', 'start': 2544.571, 'duration': 4.765}, {'end': 2552.919, 'text': "it's minus 1 transpose alpha, which is the sum of those guys.", 'start': 2549.336, 'duration': 3.583}, {'end': 2556.082, 'text': 'So this is the bunch of linear coefficients that you pass.', 'start': 2553.299, 'duration': 2.783}, {'end': 2559.264, 'text': 'And then the constraints.', 'start': 2558.063, 'duration': 1.201}, {'end': 2561.745, 'text': 'You put the constraints again in the same way, subject to.', 'start': 2559.344, 'duration': 2.401}, {'end': 2563.506, 'text': 'So there is a part which asks you for constraints.', 'start': 2561.745, 'duration': 1.761}, {'end': 2566.928, 'text': 'And here, again, the constraints, you care about the coefficients of the constraints.', 'start': 2563.886, 'duration': 3.042}, {'end': 2568.969, 'text': 'So this is a linear equality constraint.', 'start': 2566.948, 'duration': 2.021}], 'summary': 'Quadratic programming involves quadratic and linear terms, along with constraints and coefficients.', 'duration': 28.121, 'max_score': 2540.848, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU2540848.jpg'}, {'end': 2655.81, 'src': 'embed', 'start': 2625.234, 'weight': 8, 'content': [{'end': 2627.294, 'text': 'It fits one of the standard optimization tools.', 'start': 2625.234, 'duration': 2.06}, {'end': 2631.616, 'text': 'It happens to be a convex function in this case, so the quadratic programming would be very successful.', 'start': 2627.354, 'duration': 4.262}, {'end': 2633.616, 'text': "And then we'll pass it in.", 'start': 2632.796, 'duration': 0.82}, {'end': 2635.016, 'text': "We'll get a number back.", 'start': 2633.676, 'duration': 1.34}, {'end': 2636.897, 'text': 'Just a word of warning before we go there.', 'start': 2635.196,
'duration': 1.701}, {'end': 2644.679, 'text': "You look at the size of this matrix, and it's N by N.", 'start': 2637.157, 'duration': 7.522}, {'end': 2648.46, 'text': 'So the dimension of the matrix depends on the number of examples.', 'start': 2644.679, 'duration': 3.781}, {'end': 2651.107, 'text': 'Well, if you have 100 examples, no sweat.', 'start': 2649.366, 'duration': 1.741}, {'end': 2652.748, 'text': 'If you have 1,000 examples, no sweat.', 'start': 2651.167, 'duration': 1.581}, {'end': 2655.81, 'text': 'If you have a million examples, this is really trouble.', 'start': 2653.409, 'duration': 2.401}], 'summary': 'Optimization tool fits convex function, but matrix size creates trouble with a million examples.', 'duration': 30.576, 'max_score': 2625.234, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU2625234.jpg'}, {'end': 2689.451, 'src': 'embed', 'start': 2667.018, 'weight': 9, 'content': [{'end': 2676.144, 'text': 'quadratic programming will have a pretty hard time finding the solution to the level where there are tons of heuristics to solve this problem when the number of examples is big.', 'start': 2667.018, 'duration': 9.126}, {'end': 2679.046, 'text': "It's a practical consideration, but it's an important consideration.", 'start': 2676.304, 'duration': 2.742}, {'end': 2689.451, 'text': "But basically, if you are working with problems in the typical machine learning problem, where you have, let's say, not more than 10,000,", 'start': 2679.886, 'duration': 9.565}], 'summary': 'Quadratic programming struggles with large datasets, requiring heuristics for practical solutions.', 'duration': 22.433, 'max_score': 2667.018, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU2667018.jpg'}], 'start': 2074.597, 'title': 'Quadratic programming', 'summary': 'Covers maximizing and minimizing in quadratic programming, deriving conditions for optimization, 
transforming to a dual formulation, optimizing a simple quadratic form, and minimizing alpha using quadratic programming, while addressing challenges with a large number of examples.', 'chapters': [{'end': 2342.269, 'start': 2074.597, 'title': 'Maximizing and minimizing in quadratic programming', 'summary': 'Discusses the process of maximizing and minimizing in quadratic programming, including the derivation of conditions for optimization with respect to w and b, and the transformation to a dual formulation of the problem, resulting in a final formula as a function of alpha only.', 'duration': 267.672, 'highlights': ['The chapter discusses the derivation of conditions for optimization with respect to w and b, resulting in a final formula as a function of alpha only, after eliminating w and b from the problem.', 'The process involves taking the gradient of the Lagrangian with respect to w and b, and then substituting these conditions in the original Lagrangian to transform it into a dual formulation of the problem.', 'The optimization process leads to a final formula as a function of alpha only, which is a crucial transformation in the quadratic programming problem.']}, {'end': 2497.229, 'start': 2344.743, 'title': 'Quadratic form optimization and translation', 'summary': 'Discusses the optimization of a simple quadratic form in the vector alpha, subject to constraints, and the translation of the objective and constraints into coefficients for a quadratic programming package.', 'duration': 152.486, 'highlights': ['The chapter discusses the optimization of a simple quadratic form in the vector alpha, subject to constraints. The optimization involves expressing alphas in terms of the rest of the alphas, leading to a purely unconstrained quadratic optimization.', 'The translation of the objective and constraints into coefficients for a quadratic programming package. 
This involves maximizing with respect to alpha, subject to a bunch of constraints, and translating it into minimization by taking its negative value.', 'The use of quadratic programming for the solution. The solution involves utilizing a package called quadratic programming and isolating coefficients from the alphas to prepare for the next step.']}, {'end': 2706.58, 'start': 2497.249, 'title': 'Quadratic programming for alpha minimization', 'summary': 'Explains the process of using quadratic programming to minimize the quantity alpha, with coefficients and constraints, and warns about the challenges when dealing with a large number of examples.', 'duration': 209.331, 'highlights': ['The process involves using quadratic programming to minimize the quantity alpha, with coefficients derived from training data and linear equality constraints, with a range of alphas between 0 and infinity. The chapter emphasizes the challenges when dealing with a large number of examples due to the practical considerations of solving a dense matrix with heuristics.', 'The dimension of the matrix depends on the number of examples, with a warning that dealing with a huge number of examples can pose significant challenges in solving the problem using quadratic programming, leading to the need for hierarchical methods to address the complexity.', 'The coefficients and constraints are essential components passed to quadratic programming, including the linear term, linear equality constraint coefficients, and the range of alphas, which is crucial for obtaining the minimized alpha. 
The chapter highlights the practical considerations when working with problems in typical machine learning scenarios, emphasizing challenges when the number of examples exceeds 10,000.']}], 'duration': 631.983, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU2074597.jpg', 'highlights': ['The optimization process leads to a final formula as a function of alpha only, which is a crucial transformation in the quadratic programming problem.', 'The process involves taking the gradient of the Lagrangian with respect to w and b, and then substituting these conditions in the original Lagrangian to transform it into a dual formulation of the problem.', 'The chapter discusses the derivation of conditions for optimization with respect to w and b, resulting in a final formula as a function of alpha only, after eliminating w and b from the problem.', 'The translation of the objective and constraints into coefficients for a quadratic programming package. This involves maximizing with respect to alpha, subject to a bunch of constraints, and translating it into minimization by taking its negative value.', 'The chapter discusses the optimization of a simple quadratic form in the vector alpha, subject to constraints. The optimization involves expressing alphas in terms of the rest of the alphas, leading to a purely unconstrained quadratic optimization.', 'The use of quadratic programming for the solution. 
The solution involves utilizing a package called quadratic programming and isolating coefficients from the alphas to prepare for the next step.', 'The process involves using quadratic programming to minimize the quantity alpha, with coefficients derived from training data and linear equality constraints, with a range of alphas between 0 and infinity.', 'The coefficients and constraints are essential components passed to quadratic programming, including the linear term, linear equality constraint coefficients, and the range of alphas, which is crucial for obtaining the minimized alpha.', 'The dimension of the matrix depends on the number of examples, with a warning that dealing with a huge number of examples can pose significant challenges in solving the problem using quadratic programming, leading to the need for hierarchical methods to address the complexity.', 'The chapter emphasizes the challenges when dealing with a large number of examples due to the practical considerations of solving a dense matrix with heuristics.', 'The chapter highlights the practical considerations when working with problems in typical machine learning scenarios, emphasizing challenges when the number of examples exceeds 10,000.']}, {'end': 3471.403, 'segs': [{'end': 2777.796, 'src': 'embed', 'start': 2734.117, 'weight': 0, 'content': [{'end': 2735.998, 'text': 'So the solution is vector of alphas.', 'start': 2734.117, 'duration': 1.881}, {'end': 2743.833, 'text': 'And the first thing is that it is very easy to get the w, because, luckily, the formula for w being this,', 'start': 2738.147, 'duration': 5.686}, {'end': 2747.857, 'text': 'was one of the constraints we got from solving the original one.', 'start': 2743.833, 'duration': 4.024}, {'end': 2750.82, 'text': 'When we got the gradient with respect to w, we found out this is the thing.', 'start': 2747.877, 'duration': 2.943}, {'end': 2755.765, 'text': 'So you get the alphas, you plug them in, and then you get the w.', 'start': 
2751.101, 'duration': 4.664}, {'end': 2757.647, 'text': 'So you get the vector of weights you want.', 'start': 2755.765, 'duration': 1.882}, {'end': 2766.828, 'text': 'Now I would like to tell you a condition which is very important and will be the key to defining support vectors in this case,', 'start': 2760.424, 'duration': 6.404}, {'end': 2771.431, 'text': 'which is another KKT condition that will be satisfied at the minimum, which is the following', 'start': 2766.828, 'duration': 4.603}, {'end': 2773.513, 'text': 'Quadratic programming will hand you alpha.', 'start': 2772.292, 'duration': 1.221}, {'end': 2776.575, 'text': "Let's say that alpha is the same length as the number of examples.", 'start': 2773.673, 'duration': 2.902}, {'end': 2777.796, 'text': "Let's say you have 1,000 examples.", 'start': 2776.595, 'duration': 1.201}], 'summary': 'Using vector of alphas and quadratic programming to get w and support vectors, with 1,000 examples.', 'duration': 43.679, 'max_score': 2734.117, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU2734117.jpg'}, {'end': 2976.105, 'src': 'embed', 'start': 2944.617, 'weight': 2, 'content': [{'end': 2947.079, 'text': 'Now we come to an interesting definition.', 'start': 2944.617, 'duration': 2.462}, {'end': 2951.182, 'text': 'Alpha is largely zeros, interior points.', 'start': 2948.72, 'duration': 2.462}, {'end': 2957.147, 'text': 'The most important points in the game are the points that actually define the plane and the margin.', 'start': 2952.023, 'duration': 5.124}, {'end': 2960.21, 'text': "And these are the ones for which alpha n's are positive.", 'start': 2957.748, 'duration': 2.462}, {'end': 2965.901, 'text': 'And these are called support vectors.', 'start': 2961.991, 'duration': 3.91}, {'end': 2971.623, 'text': 'So I have n points, and I classify them, and I got the maximum margin.', 'start': 2967.181, 'duration': 4.442}, {'end': 2976.105, 'text': "And because it's 
the maximum margin, it touched on some of the plus 1 and some of the minus 1 points.', 'start': 2971.803, 'duration': 4.302}], 'summary': 'Support vector machine identifies key points for maximum margin.', 'duration': 31.488, 'max_score': 2944.617, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU2944617.jpg'}, {'end': 3075.977, 'src': 'embed', 'start': 3046.638, 'weight': 3, 'content': [{'end': 3049.04, 'text': 'They will get alpha equals 0 in this case.', 'start': 3046.638, 'duration': 2.402}, {'end': 3054.724, 'text': 'And the support vectors achieve the margin exactly.', 'start': 3052.222, 'duration': 2.502}, {'end': 3056.245, 'text': 'They are the critical points.', 'start': 3055.144, 'duration': 1.101}, {'end': 3062.309, 'text': 'The other guys, their margin, if you will, is bigger or much bigger.', 'start': 3056.605, 'duration': 5.704}, {'end': 3069.974, 'text': 'And for the support vectors, you satisfy this with equal 1, since all of this is fine.', 'start': 3065.092, 'duration': 4.882}, {'end': 3075.977, 'text': 'Now we used to compute w in terms of the summation of alpha_n y_n x_n,', 'start': 3070.995, 'duration': 4.982}], 'summary': 'Support vectors achieve margin exactly with alpha equals 0.', 'duration': 29.339, 'max_score': 3046.638, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU3046638.jpg'}, {'end': 3278.234, 'src': 'embed', 'start': 3249.879, 'weight': 4, 'content': [{'end': 3255.742, 'text': "And I'd like to see what happens to the problem of support vector machines, as we stated it and solved it,", 'start': 3249.879, 'duration': 5.863}, {'end': 3259.024, 'text': 'when you actually move to the higher dimensional space.', 'start': 3255.742, 'duration': 3.282}, {'end': 3262.025, 'text': 'Is the problem becoming more difficult? Does it hold?
Et cetera.', 'start': 3259.404, 'duration': 2.621}, {'end': 3263.006, 'text': "So let's look at it.", 'start': 3262.306, 'duration': 0.7}, {'end': 3266.944, 'text': "So we're going to work with Z instead of X.", 'start': 3264.622, 'duration': 2.322}, {'end': 3271.128, 'text': "And we're going to work in the script Z space instead of the X space.", 'start': 3266.944, 'duration': 4.184}, {'end': 3273.991, 'text': "So let's first put what we are doing.", 'start': 3271.548, 'duration': 2.443}, {'end': 3278.234, 'text': 'Analytically after doing all of the stuff, and I even forgot what the details are.', 'start': 3274.411, 'duration': 3.823}], 'summary': 'Analyzing the impact of moving to higher dimensional space on support vector machines.', 'duration': 28.355, 'max_score': 3249.879, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU3249879.jpg'}], 'start': 2707.961, 'title': 'SVM and support vectors', 'summary': 'Covers solving the SVM with quadratic programming, deriving the solution and support vectors, with an example of 900 alphas being 0 out of 1,000 examples. It also explains the concept of support vectors, their significance, and the impact of nonlinear transformations on the dimensionality of the problem.', 'chapters': [{'end': 2821.404, 'start': 2707.961, 'title': 'Solving SVM with quadratic programming', 'summary': 'Explains how alpha, obtained from quadratic programming, is used to derive the solution for the original problem, obtaining the vector of weights and defining support vectors based on the KKT condition, with an example of more than 900 alphas being 0 out of 1,000 examples.', 'duration': 113.443, 'highlights': ['The solution obtained from quadratic programming is a vector of alphas, which is used to derive the vector of weights for the original problem.
Solving the original problem with the obtained alphas, obtaining the vector of weights.', 'A KKT condition is used to define support vectors, where out of 1,000 alphas obtained, more than 900 are 0. Defining support vectors based on the KKT condition, with an example of more than 900 alphas being 0 out of 1,000 examples.']}, {'end': 3471.403, 'start': 2821.644, 'title': 'Support vector machines and support vectors', 'summary': 'Explains the concept of support vectors and their significance in support vector machines, highlighting the conditions for interior points, the identification of support vectors, the computation of weight vector, and the impact of nonlinear transformations on the dimensionality of the problem.', 'duration': 649.759, 'highlights': ['The identification of support vectors is crucial, as they are the points that define the plane and margin, with positive alpha values, and their role in achieving the maximum margin is emphasized.', 'The computation of the weight vector involves the summation of alpha_n y_n x_n, and the significance of support vectors in satisfying the margin condition is explained in detail.', 'The impact of nonlinear transformations on the dimensionality of the problem is highlighted, emphasizing that the difficulty of solving the problem remains constant regardless of the dimensionality of the space.']}], 'duration': 763.442, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU2707961.jpg', 'highlights': ['Solving the original problem with the obtained alphas, obtaining the vector of weights.', 'Defining support vectors based on the KKT condition, with an example of more than 900 alphas being 0 out of 1,000 examples.', 'The identification of support vectors is crucial, as they are the points that define the plane and margin, with positive alpha values.', 'The computation of the weight vector involves the summation of alpha_n y_n x_n, and the significance of support
vectors in satisfying the margin condition is explained in detail.', 'The impact of nonlinear transformations on the dimensionality of the problem is highlighted, emphasizing that the difficulty of solving the problem remains constant regardless of the dimensionality of the space.']}, {'end': 4065.872, 'segs': [{'end': 3579.927, 'src': 'heatmap', 'start': 3517.242, 'weight': 2, 'content': [{'end': 3521.205, 'text': 'So support vectors live in the space you are doing the process in.', 'start': 3517.242, 'duration': 3.963}, {'end': 3522.286, 'text': 'In this case, the z-space.', 'start': 3521.305, 'duration': 0.981}, {'end': 3527.323, 'text': 'In the X space, there is an interpretation.', 'start': 3524.921, 'duration': 2.402}, {'end': 3529.986, 'text': "So let's look at the X space here.", 'start': 3527.663, 'duration': 2.323}, {'end': 3533.248, 'text': 'I have these guys, not linearly separable.', 'start': 3530.006, 'duration': 3.242}, {'end': 3536.932, 'text': 'And you decided to go to a high-dimensional Z space.', 'start': 3533.749, 'duration': 3.183}, {'end': 3537.872, 'text': "I'm not going to tell you what.", 'start': 3536.972, 'duration': 0.9}, {'end': 3541.536, 'text': 'And you solved that support vector machine.', 'start': 3539.414, 'duration': 2.122}, {'end': 3542.336, 'text': 'You got the alphas.', 'start': 3541.556, 'duration': 0.78}, {'end': 3544.478, 'text': 'You got the line or the hyperplane in that space.', 'start': 3542.396, 'duration': 2.082}, {'end': 3548.061, 'text': 'And then you are putting the boundary here that corresponds to this guy.', 'start': 3544.919, 'duration': 3.142}, {'end': 3549.783, 'text': 'And this is what the boundary looks like.', 'start': 3548.462, 'duration': 1.321}, {'end': 3555.742, 'text': 'Now we have alarm bells overfitting, overfitting, overfitting.', 'start': 3552.939, 'duration': 2.803}, {'end': 3561.268, 'text': "Whenever you see something like that, you say, OK, what? 
That's the big advantage you get out of support vectors.", 'start': 3556.003, 'duration': 5.265}, {'end': 3562.249, 'text': 'So I get this surface.', 'start': 3561.308, 'duration': 0.941}, {'end': 3566.974, 'text': 'This surface is simply what the line in the Z space with the best margin got.', 'start': 3562.83, 'duration': 4.144}, {'end': 3567.635, 'text': "That's all.", 'start': 3567.334, 'duration': 0.301}, {'end': 3573.18, 'text': 'If I look at what the support vectors are in the Z space, they happen to correspond to points here.', 'start': 3568.956, 'duration': 4.224}, {'end': 3574.402, 'text': 'They are just data points.', 'start': 3573.22, 'duration': 1.182}, {'end': 3579.927, 'text': 'So let me identify them here as pre-images of support vectors.', 'start': 3575.302, 'duration': 4.625}], 'summary': 'Support vectors in z space create boundary to prevent overfitting.', 'duration': 62.685, 'max_score': 3517.242, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU3517242.jpg'}, {'end': 3688.047, 'src': 'embed', 'start': 3659.198, 'weight': 3, 'content': [{'end': 3663.442, 'text': 'And you have a handle on how good the generalization is just by counting the number of support vectors.', 'start': 3659.198, 'duration': 4.244}, {'end': 3665.725, 'text': 'And that will get us.', 'start': 3664.182, 'duration': 1.543}, {'end': 3669.671, 'text': 'This is a good point I forgot to mention.', 'start': 3666.887, 'duration': 2.784}, {'end': 3676.461, 'text': 'The distance between the support vectors and the surface here are not the margin.', 'start': 3670.913, 'duration': 5.548}, {'end': 3679.402, 'text': 'The margins are in the linear space, et cetera.', 'start': 3676.781, 'duration': 2.621}, {'end': 3682.124, 'text': 'They are likely, these guys, to be close to the surface.', 'start': 3679.762, 'duration': 2.362}, {'end': 3684.125, 'text': "But the distance wouldn't be the same.", 'start': 3682.604, 'duration': 1.521}, 
{'end': 3688.047, 'text': "And there are perhaps other points that look like they should be support vectors, and they aren't.", 'start': 3684.645, 'duration': 3.402}], 'summary': 'Support vector counting measures generalization quality in linear space.', 'duration': 28.849, 'max_score': 3659.198, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU3659198.jpg'}, {'end': 3730.56, 'src': 'embed', 'start': 3704.578, 'weight': 1, 'content': [{'end': 3709.239, 'text': 'So you are doing classification, and you are using the classification error, the binary error.', 'start': 3704.578, 'duration': 4.661}, {'end': 3712.04, 'text': 'So this is the probability of error in classifying an out-of-sample point.', 'start': 3709.259, 'duration': 2.781}, {'end': 3716.942, 'text': 'The statement here is very much what you expect.', 'start': 3713.121, 'duration': 3.821}, {'end': 3723.857, 'text': 'You have the number of support vectors, which happens to be the number of effective parameters, the alphas that survived.', 'start': 3718.755, 'duration': 5.102}, {'end': 3725.398, 'text': 'This is your guy.', 'start': 3724.658, 'duration': 0.74}, {'end': 3726.799, 'text': 'You divide it by n.', 'start': 3725.818, 'duration': 0.981}, {'end': 3728.059, 'text': 'Well, n minus 1 in this case.', 'start': 3726.799, 'duration': 1.26}, {'end': 3730.56, 'text': 'And that will give you an upper bound on E out.', 'start': 3728.6, 'duration': 1.96}], 'summary': 'Using classification error to find probability of error in out-of-sample points with an upper bound on e out.', 'duration': 25.982, 'max_score': 3704.578, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU3704578.jpg'}, {'end': 3812.649, 'src': 'embed', 'start': 3787.921, 'weight': 0, 'content': [{'end': 3793.384, 'text': 'You asked me after you were done with this entire machinery, how many support vectors did you get?', 'start': 
3787.921, 'duration': 5.463}, {'end': 3799.95, 'text': 'If you have 1,000 data points and you get 10 support vectors, you are in pretty good shape,', 'start': 3794.264, 'duration': 5.686}, {'end': 3803.013, 'text': 'regardless of the dimensionality of the space that you visited.', 'start': 3799.95, 'duration': 3.063}, {'end': 3807.297, 'text': "Because then 10 over 1,000, that's a pretty good bound on E out.", 'start': 3804.554, 'duration': 2.743}, {'end': 3812.649, 'text': "On the other hand, it doesn't say that now I can go to any dimensional space and things will be fine,", 'start': 3808.527, 'duration': 4.122}], 'summary': '10 support vectors out of 1,000 data points is a good e out bound.', 'duration': 24.728, 'max_score': 3787.921, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU3787921.jpg'}, {'end': 3861.808, 'src': 'embed', 'start': 3836.681, 'weight': 4, 'content': [{'end': 3844.103, 'text': 'But this is the main theoretical result that makes people use support vectors, and support vectors with the nonlinear transformation.', 'start': 3836.681, 'duration': 7.422}, {'end': 3852.265, 'text': "You don't pay for the computation of going to the higher dimension, and you don't get to pay for the generalization that goes with that.", 'start': 3844.423, 'duration': 7.842}, {'end': 3857.187, 'text': 'And then, when we go to kernel methods, which is a modification of this next time,', 'start': 3852.766, 'duration': 4.421}, {'end': 3861.808, 'text': 'you are not even going to pay for the simple computational price of getting the inner product.', 'start': 3857.187, 'duration': 4.621}], 'summary': 'Support vectors and kernel methods offer efficient computation and generalization.', 'duration': 25.127, 'max_score': 3836.681, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU3836681.jpg'}, {'end': 3938.084, 'src': 'embed', 'start': 3899.081, 'weight': 5, 
'content': [{'end': 3900.442, 'text': 'And the other way will be the kernel.', 'start': 3899.081, 'duration': 1.361}, {'end': 3908.269, 'text': 'But that will open another set of possibilities of working in a set of spaces we never imagined touching and still getting,', 'start': 3901.403, 'duration': 6.866}, {'end': 3913.433, 'text': 'not only the computation being the same, but also the generalization being dependent on something that we can measure,', 'start': 3908.269, 'duration': 5.164}, {'end': 3916.876, 'text': 'which is the number of support vectors.', 'start': 3913.433, 'duration': 3.443}, {'end': 3926.424, 'text': 'I will stop here and take questions after a short break.', 'start': 3917.036, 'duration': 9.388}, {'end': 3927.485, 'text': "Let's start the Q&A.", 'start': 3926.704, 'duration': 0.781}, {'end': 3938.084, 'text': 'OK, so can you please first explain again why you can normalize w transpose x plus b to be 1? OK.', 'start': 3929.375, 'duration': 8.709}], 'summary': 'Exploring new possibilities in computation through kernel, with generalization dependent on measurable support vectors.', 'duration': 39.003, 'max_score': 3899.081, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU3899081.jpg'}], 'start': 3471.683, 'title': 'Support vector machines and kernel methods', 'summary': "Discusses support vectors in z space, highlighting their impact on generalization behavior and providing an upper bound on classification error. 
It also introduces kernel methods' ability to work in infinite-dimensional spaces while maintaining the same computational cost.', 'chapters': [{'end': 3770.212, 'start': 3471.683, 'title': 'Support vector machines in z space', 'summary': 'Explains the concept of support vectors in the z space, highlighting how the number of support vectors impacts the generalization behavior, providing an upper bound on the classification error in terms of the number of support vectors and the sample size.', 'duration': 298.529, 'highlights': ['The number of support vectors impacts the generalization behavior, with a smaller number of support vectors indicating better generalization performance. When using a million-dimensional space, having only four support vectors implies that the generalization behavior effectively goes with the four parameters, indicating a handle on the generalization performance.', 'The upper bound on the classification error is related to the number of support vectors and the sample size, with the probability of error in classifying an out-of-sample point being bounded by the number of effective parameters divided by n-1. The generalization result provides an upper bound on the classification error, where the probability of error in classifying an out-of-sample point is related to the number of support vectors, the number of effective parameters, and the sample size.', 'The support vectors achieve the margin in the Z space, determining their status as support vectors, and the distance between the support vectors and the surface in the Z space is likely to be close but not identical to the margin.
']}, {'end': 4065.872, 'start': 3770.852, 'title': 'Support vector machines and kernel methods', 'summary': 'Discusses the importance of support vectors in svm, emphasizing that the number of support vectors determines generalization; it also introduces the concept of kernel methods and their ability to work in infinite-dimensional spaces while maintaining the same computational cost.', 'duration': 295.02, 'highlights': ['Support vectors in SVM are crucial for generalization, with a lower ratio of support vectors to data points indicating better performance, regardless of the dimensionality of the space visited.', 'The computational and generalization costs in SVM are not affected by the dimensionality of the space, but rather depend on the number of support vectors obtained.', 'Kernel methods allow working in infinite-dimensional spaces without incurring additional computational costs, and the generalization remains dependent on the number of support vectors.']}], 'duration': 594.189, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU3471683.jpg', 'highlights': ['The number of support vectors impacts generalization behavior, with a smaller number indicating better performance.', 'The upper bound on classification error is related to the number of support vectors and sample size.', 'Support vectors achieve the margin in the Z space, determining their status and distance from the surface.', 'Support vectors in SVM are crucial for generalization, with a lower ratio indicating better performance.', 'Computational and generalization costs in SVM depend on the number of support vectors obtained.', 'Kernel methods allow working in infinite-dimensional spaces without additional computational costs.']},
{'end': 4442.263, 'segs': [{'end': 4111.988, 'src': 'embed', 'start': 4069.774, 'weight': 0, 'content': [{'end': 4076.161, 'text': 'So many people are curious, what happens when the points are not linearly separable? There are two cases.', 'start': 4069.774, 'duration': 6.387}, {'end': 4081.326, 'text': 'One of them, they are horribly not linearly separable, like that.', 'start': 4076.822, 'duration': 4.504}, {'end': 4084.449, 'text': 'And in this case, you go to a nonlinear transformation, as we have seen.', 'start': 4081.426, 'duration': 3.023}, {'end': 4088.474, 'text': 'And then there is a slightly not linearly separable, as we have seen before.', 'start': 4084.91, 'duration': 3.564}, {'end': 4095.498, 'text': 'And in that case, you will see that the method I described today is called hard margin SVM.', 'start': 4089.114, 'duration': 6.384}, {'end': 4098.198, 'text': 'Hard margin because the margin is satisfied strictly.', 'start': 4095.838, 'duration': 2.36}, {'end': 4104.723, 'text': 'And then you are going to get another version of it, which is called soft margin, that allows for few errors and penalizes for them.', 'start': 4098.719, 'duration': 6.004}, {'end': 4106.904, 'text': 'And that will be covered next.', 'start': 4105.082, 'duration': 1.822}, {'end': 4111.988, 'text': "But basically, it's very much in parallel with the perceptron.", 'start': 4107.705, 'duration': 4.283}], 'summary': 'Nonlinear transformation used for not linearly separable cases in svm.', 'duration': 42.214, 'max_score': 4069.774, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU4069774.jpg'}, {'end': 4237.375, 'src': 'embed', 'start': 4210.857, 'weight': 2, 'content': [{'end': 4218.502, 'text': 'the fact that the expectation is that almost all of them will be 0 makes it more or less that the effective number of parameters are the ones that end up being non-zero.', 'start': 4210.857, 'duration': 7.645}, {'end': 4223.986, 
'text': "Again, this is not an accurate statement, but it's a very reasonable statement.", 'start': 4218.722, 'duration': 5.264}, {'end': 4231.671, 'text': 'So the number of non-zero parameters which correspond to the VC dimension also happens to be the number of the support vectors, by definition,', 'start': 4224.386, 'duration': 7.285}, {'end': 4237.375, 'text': 'because support vectors are the ones that correspond to the non-zero Lagrange multipliers.', 'start': 4231.671, 'duration': 5.704}], 'summary': 'The number of non-zero parameters correlates with vc dimension and support vectors.', 'duration': 26.518, 'max_score': 4210.857, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU4210857.jpg'}, {'end': 4325.455, 'src': 'embed', 'start': 4300.163, 'weight': 3, 'content': [{'end': 4306.726, 'text': 'But different measures for the margin, with one norm, two norm, and other things, have been applied.', 'start': 4300.163, 'duration': 6.563}, {'end': 4311.869, 'text': 'And there is really no compelling reason to prefer one over the other in terms of performance.', 'start': 4307.227, 'duration': 4.642}, {'end': 4316.032, 'text': 'So it really is the analytic properties that usually dictate the choice.', 'start': 4312.07, 'duration': 3.962}, {'end': 4325.455, 'text': 'Is there any pruning method that can maybe get rid of some of the support vectors or not really?', 'start': 4317.785, 'duration': 7.67}], 'summary': 'Various margin measures applied, no preference for performance, analytic properties dictate choice.', 'duration': 25.292, 'max_score': 4300.163, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU4300163.jpg'}, {'end': 4442.263, 'src': 'embed', 'start': 4417.134, 'weight': 4, 'content': [{'end': 4419.195, 'text': 'And obviously, that is not good, because you are fitting the noise.', 'start': 4417.134, 'duration': 2.061}, {'end': 4425.137, 'text': 'But 
in those cases, and in almost all of the cases, use the soft version of this, which is remarkably similar.', 'start': 4419.535, 'duration': 5.602}, {'end': 4428.099, 'text': "It's different assumptions, but the solution is remarkably similar.", 'start': 4425.218, 'duration': 2.881}, {'end': 4434.422, 'text': 'And therefore, in that case, you will be as vulnerable or not vulnerable to noise as you would by using other methods.', 'start': 4428.679, 'duration': 5.743}, {'end': 4439.797, 'text': "I think that's it.", 'start': 4438.515, 'duration': 1.282}, {'end': 4440.84, 'text': 'Very good.', 'start': 4440.479, 'duration': 0.361}, {'end': 4442.263, 'text': 'So we will see you next week.', 'start': 4441.12, 'duration': 1.143}], 'summary': 'Use the soft version for similar results with different assumptions.', 'duration': 25.129, 'max_score': 4417.134, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU4417134.jpg'}], 'start': 4069.774, 'title': 'Svm in nonlinear transformation', 'summary': 'Discusses linearly and not linearly separable points in svm, introducing hard margin svm for strictly satisfied margins and soft margin svm for allowing and penalizing few errors. 
It also covers the relationship between the number of support vectors and vc dimension in svm, the impact of margin and norm variation, and the handling of noisy datasets.', 'chapters': [{'end': 4148.212, 'start': 4069.774, 'title': 'Nonlinear transformation in svm', 'summary': 'The chapter discusses the cases of linearly separable and not linearly separable points in svm, introducing hard margin svm for strictly satisfied margins and soft margin svm for allowing and penalizing few errors, with a focus on nonlinear transformations and their advantages.', 'duration': 78.438, 'highlights': ['The method described is called hard margin SVM, where the margin is satisfied strictly, and then there is a version called soft margin that allows for few errors and penalizes for them.', 'Nonlinear transformation is applied when the points are horribly not linearly separable, providing attractive positive properties, and is used together with the soft version to handle outliers and reduce the complexity of the boundary.', 'Perceptron is used for linearly separable points, while a nonlinear transformation is applied if the points are terribly not linearly separable.']}, {'end': 4442.263, 'start': 4152.033, 'title': 'Support vectors and vc dimension', 'summary': 'The chapter discusses the relationship between the number of support vectors and vc dimension in svm, the impact of margin and norm variation, and the handling of noisy datasets, highlighting the practical implications and theoretical considerations.', 'duration': 290.23, 'highlights': ['The number of non-zero parameters, which correspond to the VC dimension, also happens to be the number of the support vectors, by definition.
The effective number of parameters, which determines the VC dimension, is related to the number of support vectors, indicating the practical implementation of VC theory.', 'Different measures for the margin, with one norm, two norm, and other things, have been applied, and there is no compelling reason to prefer one over the other in terms of performance. The choice of margin measure depends on analytic properties rather than performance, highlighting the theoretical basis for selecting a specific margin norm.', "The SVM is not particularly susceptible to noise and can handle noisy datasets similar to other methods, especially when using the soft version with different assumptions but remarkably similar solutions. The SVM's resilience to noise is emphasized, particularly when utilizing the soft version with similar solutions, offering practical insights into handling noisy datasets with SVM."]}], 'duration': 372.489, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/eHsErlPJWUU/pics/eHsErlPJWUU4069774.jpg', 'highlights': ['Nonlinear transformation is applied when the points are horribly not linearly separable, providing attractive positive properties, and is used together with the soft version to handle outliers and reduce the complexity of the boundary.', 'The method described is called hard margin SVM, where the margin is satisfied strictly, and then there is a version called soft margin that allows for few errors and penalizes for them.', 'The number of non-zero parameters, which correspond to the VC dimension, also happens to be the number of the support vectors, by definition.', 'Different measures for the margin, with one norm, two norm, and other things, have been applied, and there is no compelling reason to prefer one over the other in terms of performance.', 'The SVM is not particularly susceptible to noise and can handle noisy datasets similar to other methods, especially when using the soft version with different assumptions 
but remarkably similar solutions.']}], 'highlights': ['The use of tenfold cross-validation is explained, providing a reliable average estimate of out-of-sample error despite different subsets each time.', 'The technique of validation involves setting aside a subset of data for estimating out-of-sample performance, crucial in machine learning.', 'A rule of thumb is provided, suggesting that about 20% of the data set should go to validation to obtain a reliable estimate for Eout.', 'The impact of validation set size on bias is demonstrated, showing that a reasonable size validation set can provide reliable estimates with reduced bias as the number of examples increases.', 'Support vector machines are arguably the most successful classification method in machine learning.', 'Introducing the notion of the margin and the process of maximizing it to obtain the best possible solution.', 'The bigger margin is better as it reduces misclassifications, increasing the chances of new points being correctly classified.', 'The formulation of an optimization problem to maximize the margin involves maximizing 1 over the norm of w, subject to constraints related to the nearest point, resulting in the equivalent problem of minimizing one-half w transpose w under friendly constraints, leading to the use of quadratic programming.', 'The distance between xn and the plane can be computed by taking the projection of the vector going from xn to a point on the plane on the direction orthogonal to the plane, and normalizing it by its norm to obtain the distance.', 'The Lagrange formulation involves converting inequality constraints into a 0 form, introducing Lagrange multipliers, and maximizing with respect to alphas while ensuring they are non-negative.', 'The optimization process leads to a final formula as a function of alpha only, which is a crucial transformation in the quadratic programming problem.', 'The process involves taking the gradient of the Lagrangian with respect to w and b,
and then substituting these conditions in the original Lagrangian to transform it into a dual formulation of the problem.', 'Solving the original problem with the obtained alphas, obtaining the vector of weights.', 'The number of support vectors impacts generalization behavior, with a smaller number indicating better performance.', 'Support vectors achieve the margin in the Z space, determining their status and distance from the surface.', 'Nonlinear transformation is applied when the points are horribly not linearly separable, providing attractive positive properties, and is used together with the soft version to handle outliers and reduce the complexity of the boundary.', 'Kernel methods allow working in infinite-dimensional spaces without additional computational costs.']}
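The dual recipe that runs through these summaries (maximize L(alpha) = sum alpha_n - 1/2 sum alpha_n alpha_m y_n y_m x_n.x_m subject to alpha_n >= 0 and sum alpha_n y_n = 0, identify support vectors by the KKT condition alpha_n > 0, then w = sum alpha_n y_n x_n and b from any support vector, with #SV/(N-1) as the generalization handle) can be sketched in a few lines. This is an illustrative sketch only, not code from the lecture: the three-point dataset is hypothetical, the equality constraint is eliminated by hand for this particular data, and a real implementation would hand the dual to a quadratic-programming solver.

```python
# Toy hard-margin SVM solved through its dual, in plain Python.
# Hypothetical 2-D dataset: one negative point, two positives.
X = [(0.0, 0.0), (2.0, 0.0), (3.0, 1.0)]
y = [-1, +1, +1]
N = len(X)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# The equality constraint -a0 + a1 + a2 = 0 gives a0 = a1 + a2, so for this
# dataset's Gram matrix the dual reduces to
#   L(a1, a2) = 2(a1 + a2) - 2 a1^2 - 6 a1 a2 - 5 a2^2,  a1, a2 >= 0.
a1 = a2 = 0.0
eta = 0.05
for _ in range(2000):
    g1 = 2.0 - 4.0 * a1 - 6.0 * a2   # dL/da1
    g2 = 2.0 - 6.0 * a1 - 10.0 * a2  # dL/da2
    a1 = max(0.0, a1 + eta * g1)     # gradient ascent, projected onto a >= 0
    a2 = max(0.0, a2 + eta * g2)

alphas = [a1 + a2, a1, a2]           # recover a0 from the equality constraint

# KKT condition: only points with alpha > 0 are support vectors; the third
# point sits outside the margin, so its alpha comes out 0.
support = [n for n in range(N) if alphas[n] > 1e-6]

# w = sum_n alpha_n y_n x_n, and b from y_n (w . x_n + b) = 1 on a support vector.
w = [sum(alphas[n] * y[n] * X[n][d] for n in range(N)) for d in range(2)]
sv = support[0]
b = y[sv] - dot(w, X[sv])

# The lecture's generalization handle: an upper bound that goes like
# (number of support vectors) / (N - 1), e.g. 10 SVs out of 1,000 points.
bound = len(support) / (N - 1)

print("alphas =", alphas, "support =", support)
print("w =", w, "b =", b, "bound =", bound)
```

On this toy data the ascent recovers w = (1, 0) and b = -1 (the separating line with margin 1), and the third point's alpha is driven to 0, a small-scale version of the "more than 900 of 1,000 alphas are 0" behavior described above.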