title
Lecture 7 - Kernels | Stanford CS229: Machine Learning Andrew Ng (Autumn 2018)

description
For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/3GftN16
Andrew Ng, Adjunct Professor of Computer Science: https://www.andrewng.org/
To follow along with the course schedule and syllabus, visit: http://cs229.stanford.edu/syllabus-autumn2018.html
0:00 Introduction
0:10 Support vector machine algorithm
2:47 Derivation of the classification problem
7:47 Decision boundary
11:58 The representer theorem
13:20 Logistic Regression
26:31 The dual optimization problem
28:48 Apply kernels
28:56 Kernel trick
31:45 A kernel function
33:56 No free lunch theorem
34:40 Examples of kernels
54:13 Kernel matrix
59:16 Gaussian kernel
59:39 The Gaussian kernel
1:11:57 Dual form
1:13:35 Examples of SVM kernels
1:14:13 Handwritten digit classification
1:15:39 Protein sequence classifier
1:17:03 Design a feature vector

detail
{'title': 'Lecture 7 - Kernels | Stanford CS229: Machine Learning Andrew Ng (Autumn 2018)', 'heatmap': [{'end': 585.759, 'start': 530.307, 'weight': 0.748}, {'end': 724.776, 'start': 670.755, 'weight': 0.915}, {'end': 1115.585, 'start': 1060.865, 'weight': 1}, {'end': 2605.789, 'start': 2552.616, 'weight': 0.726}, {'end': 3185.299, 'start': 3134.874, 'weight': 0.897}, {'end': 4816.787, 'start': 4774.55, 'weight': 0.792}], 'summary': 'The lecture series on kernels in machine learning covers support vector machine algorithm, optimal margin classifier, svm decision boundary and optimization, logistic regression parameter update, svm optimization and high-dimensional feature mapping, kernel functions and computation, support vector machines, and kernel methods in machine learning.', 'chapters': [{'end': 57.292, 'segs': [{'end': 35.183, 'src': 'embed', 'start': 4.537, 'weight': 1, 'content': [{'end': 4.857, 'text': 'All right.', 'start': 4.537, 'duration': 0.32}, {'end': 5.438, 'text': 'Good morning.', 'start': 4.957, 'duration': 0.481}, {'end': 7.3, 'text': "Um, let's get started.", 'start': 5.838, 'duration': 1.462}, {'end': 13.125, 'text': 'So, uh, today you see the support vector machine algorithm.', 'start': 8.2, 'duration': 4.925}, {'end': 17.249, 'text': "Um, and this is one of my favorite algorithms, because it's very turnkey right?", 'start': 13.405, 'duration': 3.844}, {'end': 22.514, 'text': 'If you have a classification problem, um, you just kind of run it and it more or less works.', 'start': 17.269, 'duration': 5.245}, {'end': 30.76, 'text': 'Uh, So in particular, talk a bit more about the, um, optimization problem, uh, that you have to solve for the support vector machine.', 'start': 22.974, 'duration': 7.786}, {'end': 35.183, 'text': 'Um, then talk about something called the representer theorem.', 'start': 30.78, 'duration': 4.403}], 'summary': 'Introduction to support vector machine algorithm for classification problems', 'duration': 30.646, 'max_score': 4.537, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg4537.jpg'}, {'end': 73.021, 'src': 'embed', 'start': 43.787, 'weight': 0, 'content': [{'end': 47.248, 'text': 'or a 100 billion dimensional, or even infinite dimensional feature spaces.', 'start': 43.787, 'duration': 3.461}, {'end': 53.59, 'text': 'And this will teach you how to represent feature vectors and how to represent parameters that may be, you know,', 'start': 47.508, 'duration': 6.082}, {'end': 57.292, 'text': '100 billion dimensional or 100 trillion dimensional or infinite dimensional.', 'start': 53.59, 'duration': 3.702}, {'end': 65.597, 'text': 'Um and uh, based on this, with derived kernels, which is the mechanism for working with these incredibly high dimensional fe- feature spaces,', 'start': 58.112, 'duration': 7.485}, {'end': 73.021, 'text': 'and then, uh hopefully time permitting, wrap up with a few examples of uh concrete implementations of these ideas.', 'start': 65.597, 'duration': 7.424}], 'summary': 'Teaching representation of feature vectors and parameters in high-dimensional spaces, with derived kernels and concrete implementations.', 'duration': 29.234, 'max_score': 43.787, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg43787.jpg'}], 'start': 4.537, 'title': 'Support vector machine algorithm', 'summary': 'Introduces the support vector machine algorithm, emphasizing its turnkey nature for classification problems, discussing the 
optimization problem and the representer theorem, which allows working in potentially very high dimensional feature spaces.', 'chapters': [{'end': 57.292, 'start': 4.537, 'title': 'Support vector machine algorithm', 'summary': 'Introduces the support vector machine algorithm, emphasizing its turnkey nature for classification problems, discussing the optimization problem and the representer theorem, which allows working in potentially very high dimensional feature spaces.', 'duration': 52.755, 'highlights': ['The representer theorem allows working in potentially very high dimensional feature spaces, such as 100 billion dimensional or infinite dimensional, and teaches how to represent feature vectors and parameters.', 'The support vector machine algorithm is highlighted as a favorite due to its turnkey nature for classification problems, making it more or less work when run.']}], 'duration': 52.755, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg4537.jpg', 'highlights': ['The representer theorem allows working in potentially very high dimensional feature spaces, such as 100 billion dimensional or infinite dimensional, and teaches how to represent feature vectors and parameters.', 'The support vector machine algorithm is highlighted as a favorite due to its turnkey nature for classification problems, making it more or less work when run.']}, {'end': 400.857, 'segs': [{'end': 153.906, 'src': 'embed', 'start': 129.656, 'weight': 1, 'content': [{'end': 138.977, 'text': 'You will make um so, of all of your M training examples, which one has the least or has the worst possible geometric margin?', 'start': 129.656, 'duration': 9.321}, {'end': 144.721, 'text': 'And the support vector- the optimal margin classifier, um will try to make this as big as possible.', 'start': 139.177, 'duration': 5.544}, {'end': 146.322, 'text': 'And, by the way,', 'start': 145.441, 'duration': 0.881}, {'end': 153.906, 'text': "what we'll- what you'll see later on is that um optimal margin classifier is basically this algorithm and optimal margin classifier plus kernels,", 'start': 146.322, 'duration': 7.584}], 'summary': 'Identify the training example with the least geometric margin for optimal margin classifier with kernels.', 'duration': 24.25, 'max_score': 129.656, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg129656.jpg'}, {'end': 293.279, 'src': 'embed', 'start': 261.327, 'weight': 0, 'content': [{'end': 269.351, 'text': 'So this constraint says every example has geometric margin greater than equal to Gamma.', 'start': 261.327, 'duration': 8.024}, {'end': 271.332, 'text': "This is- this is what it's saying.", 'start': 269.511, 'duration': 1.821}, {'end': 278.415, 'text': "And you want to set Gamma as big as possible, which means that you're maximizing the worst case geometric margin.", 'start': 272.172, 'duration': 6.243}, {'end': 280.63, 'text': 'Does this make sense? 
Right?', 'start': 279.209, 'duration': 1.421}, {'end': 293.279, 'text': 'So so if I- so the only way to make Gamma say 17 or 20 or whatever, is if every training example has geometric margin bigger than 17,, right?', 'start': 280.67, 'duration': 12.609}], 'summary': 'Maximize worst case geometric margin by setting gamma as large as possible, ensuring every training example has margin > 17.', 'duration': 31.952, 'max_score': 261.327, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg261327.jpg'}, {'end': 331.244, 'src': 'embed', 'start': 303.547, 'weight': 2, 'content': [{'end': 314.415, 'text': 'So this optimization problem maximizes the- causes, um, uh, causes you to find W and B with as big a geometric margin as possible,', 'start': 303.547, 'duration': 10.868}, {'end': 317.196, 'text': 'as big a worst case geometric margin as possible.', 'start': 314.415, 'duration': 2.781}, {'end': 324.201, 'text': 'Okay? Um, and so does this make sense actually? Right? Okay.', 'start': 317.997, 'duration': 6.204}, {'end': 325.742, 'text': 'Actually, raise your hand if this makes sense.', 'start': 324.281, 'duration': 1.461}, {'end': 327.722, 'text': 'Uh, oh, good.', 'start': 327.162, 'duration': 0.56}, {'end': 328.943, 'text': 'Okay Well, many of you.', 'start': 327.822, 'duration': 1.121}, {'end': 329.303, 'text': 'All right.', 'start': 329.063, 'duration': 0.24}, {'end': 331.244, 'text': 'Let me, let me see it in a slightly different way.', 'start': 329.343, 'duration': 1.901}], 'summary': 'Optimization problem seeks w and b with large geometric margin.', 'duration': 27.697, 'max_score': 303.547, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg303547.jpg'}], 'start': 58.112, 'title': 'Optimal margin classifier', 'summary': 'Covers the optimal margin classifier and the derivation of the classification problem, emphasizing the maximization of the worst-case geometric margin through an optimization problem.', 'chapters': [{'end': 400.857, 'start': 58.112, 'title': 'Optimal margin classifier', 'summary': 'Covers the optimal margin classifier and the derivation of the classification problem, emphasizing the maximization of the worst-case geometric margin through an optimization problem.', 'duration': 342.745, 'highlights': ['The optimal margin classifier aims to find the decision boundary with the greatest geometric margin, calculated by a specific formula, and seeks to maximize the worst-case geometric margin.', 'The optimization problem focuses on maximizing the worst-case geometric margin by finding values for W and B that define the decision boundary.', 'The optimization problem enforces that every training example has a geometric margin greater than or equal to a specific value, driving up the worst-case geometric margin as much as possible.']}], 'duration': 342.745, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg58112.jpg', 'highlights': ['The optimization problem enforces a geometric margin >= specific value', 'The optimal margin classifier aims to maximize the worst-case geometric margin', 'The optimization problem focuses on maximizing the worst-case geometric margin']}, {'end': 798.944, 'segs': [{'end': 524.323, 'src': 'embed', 'start': 495.835, 'weight': 2, 'content': [{'end': 504.823, 'text': "Um, and so that's the decision boundary where the, uh, SVM will predict positive everywhere here and predict negative everywhere to 
the lower left.", 'start': 495.835, 'duration': 8.988}, {'end': 512.289, 'text': 'And this straight line, you know, stays the same even if you multiply these parameters by any constant, okay?', 'start': 505.523, 'duration': 6.766}, {'end': 524.323, 'text': 'Um and so, um, to simplify this, uh, notice that you could choose anything you want for the norm of w right?', 'start': 513.274, 'duration': 11.049}], 'summary': 'Svm predicts positive everywhere above decision boundary.', 'duration': 28.488, 'max_score': 495.835, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg495835.jpg'}, {'end': 602.412, 'src': 'heatmap', 'start': 530.307, 'weight': 1, 'content': [{'end': 539.854, 'text': 'But you have the flexibility to scale the parameters w and b, you know, up or down by any fixed constant without changing the decision boundary.', 'start': 530.307, 'duration': 9.547}, {'end': 555.026, 'text': 'And so the trick to simplify this equation into that one is if you choose to scale the norm of w to be equal to 1 over Gamma um.', 'start': 540.999, 'duration': 14.027}, {'end': 585.759, 'text': 'uh, because if you do that, then this optimization objective becomes um, maximize 1 over norm of w, subject to right?', 'start': 555.026, 'duration': 30.733}, {'end': 593.874, 'text': 'Uh, so if you substitute norm of w equals 1 of Gamma um, and so that cancels out, Uh.', 'start': 586.06, 'duration': 7.814}, {'end': 596.239, 'text': 'and so you end up with this optimization problem.', 'start': 593.874, 'duration': 2.365}, {'end': 602.412, 'text': 'Instead of maximizing 1 over norm w, you can minimize 1 half the norm of w squared subject to this.', 'start': 596.419, 'duration': 5.993}], 'summary': 'Scaling parameters w and b affects the decision boundary; simplifying the optimization problem by substituting the norm of w and minimizing 1 half the norm w squared.', 'duration': 47.386, 'max_score': 530.307, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg530307.jpg'}, {'end': 724.776, 'src': 'heatmap', 'start': 670.755, 'weight': 0.915, 'content': [{'end': 688.19, 'text': 'uh, what we will assume is that W can be represented as a sum, as a linear combination of the training examples, okay?', 'start': 670.755, 'duration': 17.435}, {'end': 691.733, 'text': 'So um, in order to derive the support vector machine,', 'start': 688.891, 'duration': 2.842}, {'end': 701.379, 'text': "we're gonna make an additional restriction that the parameters w can be expressed as a linear combination of the training examples, right?", 'start': 691.733, 'duration': 9.646}, {'end': 707.363, 'text': 'So um, and it turns out that when X i is, you know, 100 trillion dimensional,', 'start': 701.619, 'duration': 5.744}, {'end': 713.307, 'text': 'doing this will let us derive algorithms that work even in these 100 trillion or in these infinite dimensional feature spaces.', 'start': 707.363, 'duration': 5.944}, {'end': 717.374, 'text': "Now, I'm describing this, uh, just as an assumption.", 'start': 714.093, 'duration': 3.281}, {'end': 724.776, 'text': "It turns out that there's a theorem called the representer theorem that shows that you can make this assumption without losing any performance.", 'start': 717.814, 'duration': 6.962}], 'summary': 'Support vector machine algorithms can work in infinite dimensional feature spaces without losing performance, due to the representer theorem.', 'duration': 54.021, 'max_score': 670.755, 
'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg670755.jpg'}, {'end': 784.294, 'src': 'embed', 'start': 753.407, 'weight': 0, 'content': [{'end': 757.009, 'text': "we're actually gonna assume that w i can be written right?", 'start': 753.407, 'duration': 3.602}, {'end': 760.032, 'text': 'So in- in this example, this is plus minus 1, right?', 'start': 757.05, 'duration': 2.982}, {'end': 764.435, 'text': 'So um, this makes some of the math a little bit downstream come out easier.', 'start': 760.172, 'duration': 4.263}, {'end': 771.6, 'text': 'But this- but this is still saying that w is- can be represented as a linear combination of the training examples.', 'start': 764.475, 'duration': 7.125}, {'end': 779.793, 'text': "Okay So, um, let me just describe less formally why this is a reasonable assumption, but it's actually not an assumption.", 'start': 772.21, 'duration': 7.583}, {'end': 784.294, 'text': 'The representative theorem proves that, you know, this is just true at the optimal value of w.', 'start': 779.813, 'duration': 4.481}], 'summary': 'W can be represented as a linear combination of training examples, confirmed by the representative theorem.', 'duration': 30.887, 'max_score': 753.407, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg753407.jpg'}], 'start': 400.857, 'title': 'Svm decision boundary and optimization', 'summary': 'Explains the invariance of the decision boundary, transforming the optimization problem, and the assumption about parameters w, providing insights into svm decision boundary and optimization techniques.', 'chapters': [{'end': 798.944, 'start': 400.857, 'title': 'Svm decision boundary and optimization', 'summary': 'Explains the invariance of the decision boundary with respect to scaling of parameters, the transformation of the optimization problem to maximize 1 over the norm of w to minimizing 1/2 the norm of w squared, and the assumption that the parameters w can be expressed as a linear combination of the training examples.', 'duration': 398.087, 'highlights': ['The decision boundary remains the same even if parameters w and b are multiplied by any constant.', 'The optimization problem is transformed from maximizing 1 over the norm of w to minimizing 1/2 the norm of w squared.', 'The assumption that the parameters w can be expressed as a linear combination of the training examples is proven by the representer theorem.']}], 'duration': 398.087, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg400857.jpg', 'highlights': ['The assumption that the parameters w can be expressed as a linear combination of the training examples is proven by the representer theorem.', 'The optimization problem is transformed from maximizing 1 over the norm of w to minimizing 1/2 the norm of w squared.', 'The decision boundary remains the same even if parameters w and b are multiplied by any constant.']}, {'end': 1508.031, 'segs': [{'end': 833.971, 'src': 'embed', 'start': 798.944, 'weight': 2, 'content': [{'end': 807.629, 'text': "Um, and I'm going to refer to logistic regression.", 'start': 798.944, 'duration': 8.685}, {'end': 815.313, 'text': 'right where um suppose that you run logistic regression with uh gradient descent, say stochastic gradient descent.', 'start': 807.629, 'duration': 7.684}, {'end': 826.668, 'text': 'then you would initialize the parameters to be equal to 0 at first and then for each 
iteration of uh stochastic gradient descent, right,', 'start': 815.313, 'duration': 11.355}, {'end': 827.428, 'text': 'you would update.', 'start': 826.668, 'duration': 0.76}, {'end': 833.971, 'text': 'Theta gets updated as Theta minus the learning rate times.', 'start': 827.428, 'duration': 6.543}], 'summary': 'Referencing logistic regression with stochastic gradient descent for parameter updating.', 'duration': 35.027, 'max_score': 798.944, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg798944.jpg'}, {'end': 884.25, 'src': 'embed', 'start': 860.772, 'weight': 5, 'content': [{'end': 868.557, 'text': "If theta starts off 0 and if if, on every iteration of gradient descent, you're adding a multiple of some training example,", 'start': 860.772, 'duration': 7.785}, {'end': 876.504, 'text': 'then no matter how many iterations you run gradient descent, Theta is still a linear combination of your training examples.', 'start': 868.557, 'duration': 7.947}, {'end': 884.25, 'text': 'okay?. And- and again I did this with Theta, uh, the-, the-, it was really Theta 0, Theta 1 up to Theta n, right?,', 'start': 876.504, 'duration': 7.746}], 'summary': 'Gradient descent with theta starting at 0 results in linear combination of training examples.', 'duration': 23.478, 'max_score': 860.772, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg860772.jpg'}, {'end': 1051.47, 'src': 'embed', 'start': 1026.285, 'weight': 0, 'content': [{'end': 1031.586, 'text': 'This is, uh, a fact of, I guess, geometry or something or linear algebra, right?', 'start': 1026.285, 'duration': 5.301}, {'end': 1035.047, 'text': 'Where, uh, the vector w 2, 1..', 'start': 1031.626, 'duration': 3.421}, {'end': 1042.002, 'text': 'So the vector w, you know, is sort of 2 to the right and then 1 up, is always at, well, All right.', 'start': 1035.047, 'duration': 6.955}, {'end': 1046.445, 'text': 'Uh, the vector w is always at 90 degrees, um, to the decision boundary.', 'start': 1042.021, 'duration': 4.424}, {'end': 1051.47, 'text': 'And the decision boundary separates where you predict positive from where you predict negative.', 'start': 1046.486, 'duration': 4.984}], 'summary': 'In linear algebra, vector w (2, 1) is always at 90 degrees to the decision boundary.', 'duration': 25.185, 'max_score': 1026.285, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg1026285.jpg'}, {'end': 1115.585, 'src': 'heatmap', 'start': 1060.865, 'weight': 1, 'content': [{'end': 1071.133, 'text': "to take a simple example, let's say you have um, two training examples a positive example and a negative example, right?", 'start': 1060.865, 'duration': 10.268}, {'end': 1072.814, 'text': 'Then by x2, right?', 'start': 1071.193, 'duration': 1.621}, {'end': 1080.279, 'text': 'Uh, the- the linear algebra way of saying this is that the vector w lies in the span of the training examples.', 'start': 1072.974, 'duration': 7.305}, {'end': 1083.429, 'text': 'okay?. 
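The claim that the parameters stay in the span of the training examples is easy to check numerically. Below is a minimal NumPy sketch (the synthetic data and hyperparameters are illustrative, not from the lecture): run stochastic gradient descent for logistic regression from theta = 0 while recording a coefficient beta_i for each example, then verify theta = sum_i beta_i x^(i).

```python
import numpy as np

# Synthetic data: m examples, n features, labels in {0, 1}.
rng = np.random.default_rng(0)
m, n = 20, 5
X = rng.normal(size=(m, n))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.zeros(n)   # parameters, initialized to zero
beta = np.zeros(m)    # per-example coefficients: theta = sum_i beta[i] * X[i]
alpha = 0.1           # learning rate

for epoch in range(50):
    for i in range(m):
        # SGD update: theta := theta - alpha * (h(x_i) - y_i) * x_i
        grad_scale = sigmoid(theta @ X[i]) - y[i]
        theta -= alpha * grad_scale * X[i]
        beta[i] -= alpha * grad_scale   # same update, recorded as a coefficient on x_i

# theta is (up to round-off) exactly a linear combination of the training examples.
assert np.allclose(theta, beta @ X)
print("max difference:", np.max(np.abs(theta - beta @ X)))
```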
Oh, and-, and- and um.', 'start': 1080.279, 'duration': 3.15}, {'end': 1092.514, 'text': 'the way to picture this is that W sets the direction of the decision boundary and as you vary B, then the position, see the- the relative position.', 'start': 1083.429, 'duration': 9.085}, {'end': 1096.997, 'text': 'you know, setting different values of B will move the decision boundary back and forth like this,', 'start': 1092.514, 'duration': 4.483}, {'end': 1101.04, 'text': 'and W uh pins the direction of the decision boundary.', 'start': 1096.997, 'duration': 4.043}, {'end': 1108.084, 'text': 'Okay?. Um, and just one last example for- for why this might be true.', 'start': 1101.06, 'duration': 7.024}, {'end': 1115.585, 'text': 'um is, uh.', 'start': 1108.084, 'duration': 7.501}], 'summary': 'Vector w lies in the span of training examples, setting decision boundary direction.', 'duration': 54.72, 'max_score': 1060.865, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg1060865.jpg'}, {'end': 1299.995, 'src': 'embed', 'start': 1254.604, 'weight': 4, 'content': [{'end': 1266.511, 'text': 'so that the norm of w squared is as small as possible and so that the uh this is bigger than or equal to 1, right?', 'start': 1254.604, 'duration': 11.907}, {'end': 1271.914, 'text': 'Um, for every value of i.', 'start': 1266.691, 'duration': 5.223}, {'end': 1280.826, 'text': "So let's see, norm of w squared, this is just equal to w, transpose w.", 'start': 1271.914, 'duration': 8.912}, {'end': 1292.731, 'text': 'Um, and so if you plug in this definition of w, you know, into these equations you have as the optimization objective min of 1.', 'start': 1280.826, 'duration': 11.905}, {'end': 1299.995, 'text': 'half um sum for y equals 1 through m.', 'start': 1292.731, 'duration': 7.264}], 'summary': 'Minimize ||w||^2 subject to w^t w and w are plugged into the optimization objective.', 'duration': 45.391, 'max_score': 1254.604, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg1254604.jpg'}, {'end': 1488.692, 'src': 'embed', 'start': 1463.072, 'weight': 1, 'content': [{'end': 1471.241, 'text': 'the only place that the feature vector disappears is in this inner product.', 'start': 1463.072, 'duration': 8.169}, {'end': 1480.145, 'text': 'right?. Um, and it turns out, when we talk about the kernel trick, when we talk about the application of kernels, it turns out that, um,', 'start': 1471.241, 'duration': 8.904}, {'end': 1487.031, 'text': "if you can compute this very efficiently, that's when you can get away with manipulating even infinite dimensional feature vectors.", 'start': 1480.145, 'duration': 6.886}, {'end': 1488.692, 'text': "We'll- we'll get to this in a second.", 'start': 1487.331, 'duration': 1.361}], 'summary': 'Efficient computation allows manipulation of infinite feature vectors.', 'duration': 25.62, 'max_score': 1463.072, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg1463072.jpg'}], 'start': 798.944, 'title': 'Logistic regression parameter update and understanding gradient descent', 'summary': 'Discusses logistic regression parameter update using gradient descent and the intuition behind gradient descent and support vector machine. 
it highlights the linear combination of training examples, decision boundary orientation, and efficient computation in high-dimensional feature spaces.', 'chapters': [{'end': 908.891, 'start': 798.944, 'title': 'Logistic regression parameter update', 'summary': 'Discusses the parameter update in logistic regression using gradient descent, demonstrating how the parameters are updated and proving that they remain a linear combination of the training examples.', 'duration': 109.947, 'highlights': ['The parameters in logistic regression are updated using gradient descent by initializing them to zero and iteratively updating them by adding or subtracting a constant multiple of the training examples, resulting in the parameters remaining a linear combination of the training examples.', 'The process involves updating the parameters theta by subtracting the learning rate times Xi on each iteration of stochastic gradient descent.', 'The discussion emphasizes that no matter how many iterations are run, the parameters theta still remain a linear combination of the training examples.']}, {'end': 1508.031, 'start': 908.891, 'title': 'Understanding gradient descent and support vector machine', 'summary': "Explains the intuition behind gradient descent and support vector machine, highlighting the linear combination of training samples, the decision boundary's orientation, and the optimization objective, emphasizing the kernel trick's efficient computation in high-dimensional feature spaces.", 'duration': 599.14, 'highlights': ['The vector W is always at 90 degrees to the decision boundary, separating positive and negative predictions.', 'The optimization objective involves minimizing the norm of W squared and ensuring a specific constraint for every value of i.', 'The inner product between feature vectors is crucial for expressing algorithms and efficiently computing even infinite dimensional feature vectors, demonstrating the significance of the kernel trick.']}], 'duration': 709.087, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg798944.jpg', 'highlights': ['The vector W is always at 90 degrees to the decision boundary, separating positive and negative predictions.', 'The inner product between feature vectors is crucial for expressing algorithms and efficiently computing even infinite dimensional feature vectors, demonstrating the significance of the kernel trick.', 'The parameters in logistic regression are updated using gradient descent by initializing them to zero and iteratively updating them by adding or subtracting a constant multiple of the training examples, resulting in the parameters remaining a linear combination of the training examples.', 'The process involves updating the parameters theta by subtracting the learning rate times Xi on each iteration of stochastic gradient descent.', 'The optimization objective involves minimizing the norm of W squared and ensuring a specific constraint for every value of i.', 'The discussion emphasizes that no matter how many iterations are run, the parameters theta still remain a linear combination of the training examples.']}, {'end': 2070.992, 'segs': [{'end': 1625.215, 'src': 'embed', 'start': 1598.897, 'weight': 0, 'content': [{'end': 1607.243, 'text': "Um, the way to simplify that optimization problem to this one, uh, that's actually done by, um, using convex optimization theory.", 'start': 1598.897, 'duration': 8.346}, {'end': 1612.626, 'text': "Uh, and- and- and again, the derivation is 
written in the lecture notes but I don't want to do that here.", 'start': 1607.263, 'duration': 5.363}, {'end': 1619.091, 'text': 'If- if you want, think of it as doing a bunch more algebra to simplify that problem to this one and coincidentally cancel all b along the way.', 'start': 1612.726, 'duration': 6.365}, {'end': 1625.215, 'text': "It's a little more complicated than that but- but, right, full derivation is given in the lecture notes.", 'start': 1619.111, 'duration': 6.104}], 'summary': 'Simplify optimization problem using convex optimization theory, detailed derivation in lecture notes.', 'duration': 26.318, 'max_score': 1598.897, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg1598897.jpg'}, {'end': 1764.289, 'src': 'embed', 'start': 1700.125, 'weight': 1, 'content': [{'end': 1708.749, 'text': 'And so, once again, you know, once you have stored the alphas in your computer memory, um, you can make predictions using just inner products again,', 'start': 1700.125, 'duration': 8.624}, {'end': 1717.092, 'text': 'right?. And so the entire algorithm both the optimization objective you need to deal with during training as well as how you make predictions is, um, uh,', 'start': 1708.749, 'duration': 8.343}, {'end': 1722.815, 'text': 'is expressed only in terms of inner products, okay?', 'start': 1717.092, 'duration': 5.723}, {'end': 1738.317, 'text': "So we're now ready to apply kernels, and sometimes, in machine learning people, sometimes you call this a kernel trick.", 'start': 1722.835, 'duration': 15.482}, {'end': 1741.84, 'text': 'Let me just lay out a recipe for, for what this means.', 'start': 1738.417, 'duration': 3.423}, {'end': 1748.987, 'text': 'Step 1 is write your whole algorithm.', 'start': 1742.921, 'duration': 6.066}, {'end': 1760.268, 'text': 'um, in terms of X i, xj, in terms of inner products, uh, and instead of carrying the superscript you know x i,', 'start': 1748.987, 'duration': 11.281}, {'end': 1764.289, 'text': "x j I'm sometimes gonna write inner product between x and z,", 'start': 1760.268, 'duration': 4.021}], 'summary': 'Algorithm expressed in inner products, ready to apply kernels.', 'duration': 64.164, 'max_score': 1700.125, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg1700125.jpg'}, {'end': 1837.889, 'src': 'embed', 'start': 1815.211, 'weight': 3, 'content': [{'end': 1827.802, 'text': 'And so you could, um, take this 1D feature and expand it to a high dimensional feature vector with x, x squared, x cubed x to the fourth right?', 'start': 1815.211, 'duration': 12.591}, {'end': 1830.784, 'text': 'So this would be one way of defining high-dimensional feature mapping.', 'start': 1827.822, 'duration': 2.962}, {'end': 1837.889, 'text': 'Or another one could be if you have two features, x1 and x2, uh, corresponding to the size of the halls and the number of bedrooms.', 'start': 1831.324, 'duration': 6.565}], 'summary': 'High dimensional feature mapping involves expanding 1d feature to a vector with x, x squared, x cubed, x to the fourth power or using multiple features like x1 and x2.', 'duration': 22.678, 'max_score': 1815.211, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg1815211.jpg'}, {'end': 1938.134, 'src': 'embed', 'start': 1905.342, 'weight': 4, 'content': [{'end': 1906.843, 'text': 'So this is called the kernel function.', 'start': 1905.342, 'duration': 1.501}, {'end': 1915.425, 
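Because w is assumed to be sum_i alpha_i y^(i) phi(x^(i)), the prediction w^T phi(x) + b can be computed entirely from kernel evaluations, which is what makes the kernel trick possible. A minimal sketch, with made-up alpha_i and b standing in for the output of an actual dual solver:

```python
import numpy as np

def quadratic_kernel(x, z):
    # K(x, z) = (x^T z)^2, computed in O(n) time -- no explicit phi(x) needed.
    return float(x @ z) ** 2

def svm_predict(x, X_train, y_train, alpha, b, kernel=quadratic_kernel):
    """Sign of w^T phi(x) + b with w = sum_i alpha_i y_i phi(x_i),
    expressed purely through kernel evaluations."""
    score = sum(alpha[i] * y_train[i] * kernel(X_train[i], x)
                for i in range(len(alpha))) + b
    return 1 if score >= 0 else -1

# Toy usage with hypothetical multipliers (in practice alpha and b come from the dual solver).
X_train = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.0]])
y_train = np.array([1, 1, -1])
alpha = np.array([0.5, 0.3, 0.8])   # hypothetical dual variables
b = -0.1                            # hypothetical intercept
print(svm_predict(np.array([1.5, 1.5]), X_train, y_train, alpha, b))
```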
'text': "And what we're going to do is we'll see that there are clever tricks so that you can compute the inner product between x and z,", 'start': 1907.623, 'duration': 7.802}, {'end': 1919.307, 'text': 'even when phi of x and phi of z are incredibly high dimensional, right?', 'start': 1915.425, 'duration': 3.882}, {'end': 1922.108, 'text': "We'll see an example of this in a in in very, very soon.", 'start': 1919.327, 'duration': 2.781}, {'end': 1930.09, 'text': 'And then step four is, um, replace x, z in algorithm.', 'start': 1923.588, 'duration': 6.502}, {'end': 1938.134, 'text': 'with k of xz, okay?', 'start': 1934.833, 'duration': 3.301}], 'summary': 'Kernel function computes high-dimensional inner product between x and z.', 'duration': 32.792, 'max_score': 1905.342, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg1905342.jpg'}, {'end': 2057.025, 'src': 'embed', 'start': 2035.19, 'weight': 5, 'content': [{'end': 2044.314, 'text': "Yeah, I think the no free lunch theorem is a fascinating theoretical concept, but I think that it's been I don't know it's been less useful actually,", 'start': 2035.19, 'duration': 9.124}, {'end': 2047.015, 'text': 'because I think we have inductive biases that turn out to be useful.', 'start': 2044.314, 'duration': 2.701}, {'end': 2051.802, 'text': "There's a- there's a famous theorem in learning theory called no free lunch.", 'start': 2049.261, 'duration': 2.541}, {'end': 2053.763, 'text': 'it was like 20 years ago.', 'start': 2051.802, 'duration': 1.961}, {'end': 2057.025, 'text': 'that basically says that in the worst case, learning algorithms do not work.', 'start': 2053.763, 'duration': 3.262}], 'summary': "The 'no free lunch' theorem in learning theory states that learning algorithms may not work in the worst case scenario, and has been in existence for about 20 years.", 'duration': 21.835, 'max_score': 2035.19, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg2035190.jpg'}], 'start': 1508.091, 'title': 'Svm optimization and high-dimensional feature mapping', 'summary': 'Delves into svm algorithm optimization using parameters alpha and b, and the kernel trick, alongside transforming low-dimensional features to high-dimensional feature vectors and the implications of the no free lunch theorem in learning theory.', 'chapters': [{'end': 1814.43, 'start': 1508.091, 'title': 'Svm optimization and kernel trick', 'summary': 'Discusses the optimization of svm algorithm using parameters alpha and b and the application of the kernel trick, where the entire algorithm is expressed in terms of inner products and a mapping from original input features to a higher dimensional set of features.', 'duration': 306.339, 'highlights': ['The optimization of the SVM algorithm is expressed in terms of parameters Alpha and B, and a further simplification of the optimization problem is achieved using convex optimization theory.', 'The entire algorithm, including the optimization objective during training and prediction making, is expressed only in terms of inner products.', 'The application of the kernel trick involves writing the algorithm in terms of inner products and mapping the original input features to a higher dimensional set of features.']}, {'end': 2070.992, 'start': 1815.211, 'title': 'High-dimensional feature mapping', 'summary': 'Discusses transforming low-dimensional features to high-dimensional feature vectors, computing kernel functions efficiently, 
and the implications of the no free lunch theorem in learning theory.', 'duration': 255.781, 'highlights': ['The chapter explains the process of transforming low-dimensional features to high-dimensional feature vectors, with examples of mapping 1D and 2D inputs to high-dimensional feature spaces.', 'The discussion includes the efficient computation of the kernel function, enabling the computation of inner products between high-dimensional feature vectors without explicitly computing the feature vectors.', 'The implications of the no free lunch theorem in learning theory are mentioned, highlighting its theoretical significance and the practical limitations in the real-world data distributions.']}], 'duration': 562.901, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg1508091.jpg', 'highlights': ['The optimization of the SVM algorithm is expressed in terms of parameters Alpha and B, and a further simplification of the optimization problem is achieved using convex optimization theory.', 'The entire algorithm, including the optimization objective during training and prediction making, is expressed only in terms of inner products.', 'The application of the kernel trick involves writing the algorithm in terms of inner products and mapping the original input features to a higher dimensional set of features.', 'The chapter explains the process of transforming low-dimensional features to high-dimensional feature vectors, with examples of mapping 1D and 2D inputs to high-dimensional feature spaces.', 'The discussion includes the efficient computation of the kernel function, enabling the computation of inner products between high-dimensional feature vectors without explicitly computing the feature vectors.', 'The implications of the no free lunch theorem in learning theory are mentioned, highlighting its theoretical significance and the practical limitations in the real-world data distributions.']}, {'end': 2647.427, 'segs': [{'end': 2115.303, 'src': 'embed', 'start': 2071.232, 'weight': 4, 'content': [{'end': 2072.813, 'text': "So that's the learning algorithms turned out okay.", 'start': 2071.232, 'duration': 1.581}, {'end': 2079.844, 'text': 'Um, All right.', 'start': 2076.975, 'duration': 2.869}, {'end': 2082.607, 'text': "Let's go through one example of kernels.", 'start': 2080.185, 'duration': 2.422}, {'end': 2090.293, 'text': "Um, so for this example, let's say that your original input features was three-dimensional, x1, x2, x3.", 'start': 2083.188, 'duration': 7.105}, {'end': 2099.219, 'text': "And let's say I'm gonna choose the feature mapping phi of x to be, um, all, uh, so pairwise, um, monomial terms.", 'start': 2090.993, 'duration': 8.226}, {'end': 2115.303, 'text': "So I'm gonna choose x1 times x1, x1, x2, x1, x3, x2, x1, Okay?", 'start': 2099.479, 'duration': 15.824}], 'summary': 'Learning algorithms performed well, discussing kernel example with 3-dimensional input features.', 'duration': 44.071, 'max_score': 2071.232, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg2071232.jpg'}, {'end': 2244.933, 'src': 'embed', 'start': 2138.277, 'weight': 0, 'content': [{'end': 2142.778, 'text': 'In practice, think of x as a thousand-dimensional, and so this is now a million, right?', 'start': 2138.277, 'duration': 4.501}, {'end': 2146.519, 'text': 'Or think of this as maybe 10, 000, and this is now like a 100 million, okay?', 'start': 2142.838, 'duration': 3.681}, {'end': 2148.54, 
'text': 'So n squared features is much bigger.', 'start': 2146.539, 'duration': 2.001}, {'end': 2171.297, 'text': 'Um, and then similarly, phi of z is going to be z1, z1, z1, z2, Okay.', 'start': 2149.48, 'duration': 21.817}, {'end': 2177.518, 'text': "So we've gone from n features like 10, 000 features to n squared features which we, in this case, 100 million features.", 'start': 2171.317, 'duration': 6.201}, {'end': 2187.861, 'text': 'Um so, because there are n squared elements.', 'start': 2179.119, 'duration': 8.742}, {'end': 2208.064, 'text': 'right, you would need order n squared time to compute phi of x or, uh, to compute phi of x, transpose phi of z explicitly, right?', 'start': 2187.861, 'duration': 20.203}, {'end': 2212.948, 'text': 'So if you want to compute the inner product between phi of x and phi of z, and you do it explicitly in the obvious way,', 'start': 2208.104, 'duration': 4.844}, {'end': 2218.612, 'text': "it'll take n squared time to just compute all of these inner products and then do the right and then, uh, compute this.", 'start': 2212.948, 'duration': 5.664}, {'end': 2220.433, 'text': 'uh, com- compute this right?', 'start': 2218.612, 'duration': 1.821}, {'end': 2223.696, 'text': "And it could- it's actually n squared over 2, because a lot of these things are duplicated.", 'start': 2220.493, 'duration': 3.203}, {'end': 2225.437, 'text': "but that's the order n squared.", 'start': 2223.696, 'duration': 1.741}, {'end': 2238.249, 'text': "but let's see if we can find a better way to do that.", 'start': 2236.288, 'duration': 1.961}, {'end': 2244.933, 'text': 'So what we want is to write out the kernel of x comma z.', 'start': 2239.53, 'duration': 5.403}], 'summary': 'Switching from n to n squared features increases computation time to order n squared', 'duration': 106.656, 'max_score': 2138.277, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg2138277.jpg'}, {'end': 2444.049, 'src': 'embed', 'start': 2420.122, 'weight': 3, 'content': [{'end': 2429.834, 'text': "So this proves that, um, you've turned what was previously an order n squared time calculation into an order n time calculation.", 'start': 2420.122, 'duration': 9.712}, {'end': 2442.147, 'text': 'Which means that, um, if n was 10, 000, instead of needing to manipulate 100, 000 dimensional vectors to come up with these.', 'start': 2430.615, 'duration': 11.532}, {'end': 2443.248, 'text': 'Oh, this is my phone buzzing.', 'start': 2442.227, 'duration': 1.021}, {'end': 2444.049, 'text': 'This is really loud.', 'start': 2443.288, 'duration': 0.761}], 'summary': 'Reduced calculation time from n^2 to n, making it more efficient.', 'duration': 23.927, 'max_score': 2420.122, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg2420122.jpg'}, {'end': 2605.789, 'src': 'heatmap', 'start': 2552.616, 'weight': 0.726, 'content': [{'end': 2563.06, 'text': 'uh, if you choose this to the power of d right?', 'start': 2552.616, 'duration': 10.444}, {'end': 2566.481, 'text': 'Um, notice that this still is an order.', 'start': 2563.66, 'duration': 2.821}, {'end': 2570.36, 'text': 'n time computation right?', 'start': 2566.481, 'duration': 3.879}, {'end': 2572.161, 'text': 'X, transpose z, takes all the n time.', 'start': 2570.581, 'duration': 1.58}, {'end': 2573.082, 'text': 'you add a number to it.', 'start': 2572.161, 'duration': 0.921}, {'end': 2574.282, 'text': 'then you take this to the power of d.', 'start': 2573.082, 
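The claim that K(x, z) = (x^T z)^2 equals phi(x)^T phi(z) for the pairwise-monomial feature map, while costing only O(n) instead of O(n^2), can be verified directly:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000                          # original feature dimension
x, z = rng.normal(size=n), rng.normal(size=n)

# Explicit feature map: all n^2 pairwise products x_i * x_j  (O(n^2) time and memory).
phi_x = np.outer(x, x).ravel()
phi_z = np.outer(z, z).ravel()
explicit = phi_x @ phi_z

# Kernel trick: K(x, z) = (x^T z)^2, computed in O(n) time.
kernel = (x @ z) ** 2

print(explicit, kernel)           # the two numbers agree up to floating-point round-off
assert np.isclose(explicit, kernel)
```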
'duration': 1.2}, {'end': 2576.183, 'text': 'So you can compute this in all the n time.', 'start': 2574.282, 'duration': 1.901}, {'end': 2588.67, 'text': "But this corresponds to now phi of x has all, um, the number of terms turns out to be n plus d choose d, but it doesn't matter.", 'start': 2576.884, 'duration': 11.786}, {'end': 2593.292, 'text': 'It turns out this contains all features of, uh, monomials.', 'start': 2589.39, 'duration': 3.902}, {'end': 2600.468, 'text': 'up to, uh, order d.', 'start': 2598.027, 'duration': 2.441}, {'end': 2605.789, 'text': "So, by which I mean um, if-, if-, if-, let's say, d is equal to 5, right?", 'start': 2600.468, 'duration': 5.321}], 'summary': 'Compute phi of x in n time, containing n plus d choose d terms, representing monomials up to order d.', 'duration': 53.173, 'max_score': 2552.616, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg2552616.jpg'}], 'start': 2071.232, 'title': 'Kernel functions and computation', 'summary': 'Discusses the computation of kernels using feature mapping, reducing time complexity from n squared to order n and examples of kernels and their impact on feature modification, such as reducing dimensional vectors from 100,000 to 10,000 and the addition of constants resulting in the creation of new features.', 'chapters': [{'end': 2419.542, 'start': 2071.232, 'title': 'Kernel computation in machine learning', 'summary': 'Explains the computation of kernels using feature mapping, where it shows that the kernel of x comma z can be computed as x transpose z squared, reducing the time complexity from n squared to order n.', 'duration': 348.31, 'highlights': ['The feature mapping phi of x transforms three-dimensional input features to nine-dimensional features and can scale to millions for larger dimensions.', 'The time complexity to compute phi of x or phi of x transpose phi of z explicitly is of order n squared.', 'The kernel of x comma z can be computed as x transpose z squared, reducing the time complexity from n squared to order n.']}, {'end': 2647.427, 'start': 2420.122, 'title': 'Kernel functions and feature modification', 'summary': 'Discusses the transformation of computational time complexity from order n squared to order n, with examples of kernels and their impact on feature modification, such as reducing dimensional vectors from 100,000 to 10,000 and the addition of constants resulting in the creation of new features.', 'duration': 227.305, 'highlights': ['The transformation of computational time complexity from order n squared to order n', 'Impact of choosing specific kernels and feature modification', 'Explanation of the impact of adding a number to the computation']}], 'duration': 576.195, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg2071232.jpg', 'highlights': ['The feature mapping phi of x transforms three-dimensional input features to nine-dimensional features and can scale to millions for larger dimensions.', 'The kernel of x comma z can be computed as x transpose z squared, reducing the time complexity from n squared to order n.', 'The time complexity to compute phi of x or phi of x transpose phi of z explicitly is of order n squared.', 'The transformation of computational time complexity from order n squared to order n', 'Impact of choosing specific kernels and feature modification', 'Explanation of the impact of adding a number to the computation']}, {'end': 3538.79, 'segs': [{'end': 2697.349, 
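The polynomial kernel discussed here, K(x, z) = (x^T z + c)^d, is likewise an O(n) computation, while the implicit feature space contains on the order of (n + d choose d) monomials of degree up to d. A small sketch that evaluates the kernel and counts those implicit features:

```python
import numpy as np
from math import comb

def poly_kernel(x, z, c=1.0, d=3):
    # (x^T z + c)^d -- O(n) work regardless of how large the implicit feature space is.
    return (float(x @ z) + c) ** d

def implicit_feature_count(n, d):
    # Number of monomials of degree <= d in n variables: C(n + d, d).
    return comb(n + d, d)

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
print(poly_kernel(x, z, c=1.0, d=3))

for n, d in [(3, 3), (10_000, 5)]:
    # e.g. n = 10,000, d = 5 gives roughly 8.3e17 implicit features
    print(n, d, implicit_feature_count(n, d))
```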
'src': 'embed', 'start': 2648.087, 'weight': 0, 'content': [{'end': 2654.331, 'text': 'So- and there are- it turns out there are n plus d choose d, which is, uh, roughly n plus d to the power of d, very roughly.', 'start': 2648.087, 'duration': 6.244}, {'end': 2656.533, 'text': 'So this is a very, very large number of features.', 'start': 2654.411, 'duration': 2.122}, {'end': 2662.717, 'text': "Um, but your computation doesn't blow up exponentially even as d increases.", 'start': 2657.373, 'duration': 5.344}, {'end': 2680.082, 'text': 'Okay?. So, um, what the support vector machine is is, um, taking the optimal margin classifier that we derived earlier and applying the kernel trick to it,', 'start': 2662.737, 'duration': 17.345}, {'end': 2694.508, 'text': 'uh, in which we already had the- so well, so, optimal margin classifier plus the kernel trick.', 'start': 2680.082, 'duration': 14.426}, {'end': 2697.349, 'text': 'right, that is the support vector machine, okay?', 'start': 2694.508, 'duration': 2.841}], 'summary': 'Support vector machine handles n plus d choose d features, roughly n plus d to the power of d, with kernel trick.', 'duration': 49.262, 'max_score': 2648.087, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg2648087.jpg'}, {'end': 2799.455, 'src': 'embed', 'start': 2765.152, 'weight': 3, 'content': [{'end': 2771.955, 'text': 'So each kernel function of um, uh, uh, yes, up to trivial differences, right?', 'start': 2765.152, 'duration': 6.803}, {'end': 2776.557, 'text': 'If you have a feature mapping where the features are commu-, are permuted or something, then the kernel function stays the same.', 'start': 2772.055, 'duration': 4.502}, {'end': 2780.499, 'text': 'Uh, uh, so there are trivial trun- function transformations like that.', 'start': 2777.517, 'duration': 2.982}, {'end': 2785.081, 'text': 'But, uh, if you have a totally different feature mapping, you would expect to need a totally different kernel function.', 'start': 2780.599, 'duration': 4.482}, {'end': 2799.455, 'text': "Yeah So I wanted to- let's see.", 'start': 2785.101, 'duration': 14.354}], 'summary': 'Kernel functions remain the same with permuted features; different mapping requires different kernel function.', 'duration': 34.303, 'max_score': 2765.152, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg2765152.jpg'}, {'end': 2914.029, 'src': 'embed', 'start': 2888.076, 'weight': 2, 'content': [{'end': 2892.577, 'text': "Does that make sense? Right? 
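The kernel behind pictures like this is typically the Gaussian kernel discussed next, K(x, z) = exp(-||x - z||^2 / (2 sigma^2)), whose implicit feature space is infinite dimensional. Mercer's condition from later in the lecture, that the kernel matrix on any finite set of points must be positive semi-definite, can be spot-checked numerically; a minimal NumPy sketch:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2)) -- the Gaussian (RBF) kernel.
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

# Mercer's condition: for any finite set of points, the kernel (Gram) matrix
# K_ij = K(x_i, x_j) must be symmetric positive semi-definite.
rng = np.random.default_rng(2)
points = rng.normal(size=(50, 4))          # 50 arbitrary points in R^4
K = np.array([[gaussian_kernel(xi, xj) for xj in points] for xi in points])

eigvals = np.linalg.eigvalsh(K)            # eigenvalues of the symmetric matrix
print("smallest eigenvalue:", eigvals.min())   # >= 0 up to floating-point error
```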
So you're taking the data, all right.", 'start': 2888.076, 'duration': 4.501}, {'end': 2900.338, 'text': "Um, so you're taking the data, uh, mapping it to a much higher dimensional feature space.", 'start': 2893.498, 'duration': 6.84}, {'end': 2904.241, 'text': 'Three-dimensional is visualization, but in practice can be 100 trillion dimensions.', 'start': 2900.378, 'duration': 3.863}, {'end': 2912.228, 'text': 'And then finding a linear decision boundary in that 100 trillion dimensional space, uh, which is going to be a hyperplane like a, like a straight,', 'start': 2905.022, 'duration': 7.206}, {'end': 2914.029, 'text': 'you know, like a plane or a straight line or a plane.', 'start': 2912.228, 'duration': 1.801}], 'summary': 'Data is mapped to a high-dimensional space, up to 100 trillion dimensions, to find a linear decision boundary.', 'duration': 25.953, 'max_score': 2888.076, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg2888076.jpg'}, {'end': 2975.429, 'src': 'embed', 'start': 2934.499, 'weight': 4, 'content': [{'end': 2950.429, 'text': 'But you find that if you use a SVM kernel, you know, um, Right? You can learn very non-linear decision boundaries like that.', 'start': 2934.499, 'duration': 15.93}, {'end': 2954.491, 'text': 'But that is a linear decision boundary in a very high dimensional space.', 'start': 2951.35, 'duration': 3.141}, {'end': 2960.554, 'text': 'But when you project it back down to, you know, 2D, you end up with a very non-linear decision boundary like that.', 'start': 2954.531, 'duration': 6.023}, {'end': 2964.455, 'text': 'Okay? All right.', 'start': 2960.594, 'duration': 3.861}, {'end': 2967.817, 'text': 'So Yeah.', 'start': 2966.876, 'duration': 0.941}, {'end': 2975.429, 'text': 'Oh, sure.', 'start': 2975.089, 'duration': 0.34}], 'summary': 'Svm kernel can learn non-linear decision boundaries in high-dimensional space.', 'duration': 40.93, 'max_score': 2934.499, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg2934499.jpg'}, {'end': 3185.299, 'src': 'heatmap', 'start': 3134.874, 'weight': 0.897, 'content': [{'end': 3143.198, 'text': 'So it turns out that, um, a function like that k of xz, you can use it as a kernel function.', 'start': 3134.874, 'duration': 8.324}, {'end': 3160.915, 'text': 'uh, only if there exists some phi, such that K of x z equals phi of x, transpose phi of z, right?', 'start': 3143.198, 'duration': 17.717}, {'end': 3163.636, 'text': 'So we derived the whole algorithm, assuming this to be true.', 'start': 3161.215, 'duration': 2.421}, {'end': 3168.037, 'text': "And it turns out if you plug in the kernel function, for which you know this isn't true,", 'start': 3164.176, 'duration': 3.861}, {'end': 3177.04, 'text': 'then all of the derivation we wrote down breaks down and the optimization problem, you know, um uh, can have very strange solutions, right?', 'start': 3168.037, 'duration': 9.003}, {'end': 3180.841, 'text': "Uh that- that don't correspond to a good classification, a good classifier at all.", 'start': 3177.28, 'duration': 3.561}, {'end': 3185.299, 'text': 'Um, and so this puts some constraints on what kernel functions we could choose.', 'start': 3181.596, 'duration': 3.703}], 'summary': 'Kernel function constraints can impact algorithm and classifier performance.', 'duration': 50.425, 'max_score': 3134.874, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg3134874.jpg'}, {'end': 3472.004, 'src': 'embed', 'start': 3422.717, 'weight': 5, 'content': [{'end': 3431.101, 'text': 'Um, and so, more generally, it turns out that this is also a sufficient condition, um,', 'start': 3422.717, 'duration': 8.384}, {'end': 3437.148, 'text': 'for a kernel function to- for a function K to be a valid kernel function.', 'start': 3431.101, 'duration': 6.047}, {'end': 3440.01, 'text': 'So let me just write this out.', 'start': 3438.529, 'duration': 1.481}, {'end': 3455.866, 'text': "This is called a Mercer's theorem, M-E-R-C-E-R, right? Um, so K is a valid kernel.", 'start': 3442.152, 'duration': 13.714}, {'end': 3464.8, 'text': 'So K is a valid kernel function, i.e..', 'start': 3461.778, 'duration': 3.022}, {'end': 3472.004, 'text': 'there exists phi such that K of xz equals phi of x.', 'start': 3464.8, 'duration': 7.204}], 'summary': "Mercer's theorem establishes sufficient condition for valid kernel function.", 'duration': 49.287, 'max_score': 3422.717, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg3422717.jpg'}], 'start': 2648.087, 'title': 'Support vector machines and kernel functions', 'summary': "Delves into support vector machines, utilizing kernel trick for optimal margin classifier, enabling computation in high dimensional feature spaces with linear scaling, and the significance of kernel functions in feature mappings. it also covers the visualization of data in higher dimensions, use of kernel functions for non-linear decision boundaries, and mercer's theorem validation.", 'chapters': [{'end': 2785.081, 'start': 2648.087, 'title': 'Support vector machine and kernel trick', 'summary': 'Discusses the support vector machine, which applies the kernel trick to the optimal margin classifier, allowing computation in high dimensional feature spaces with only linear scaling, and the importance of kernel functions in relation to feature mappings.', 'duration': 136.994, 'highlights': ['The support vector machine applies the kernel trick to the optimal margin classifier, allowing computation in high dimensional feature spaces with only linear scaling.', 'The number of features in the n plus d choose d scenario is roughly n plus d to the power of d, resulting in a very large number of features.', 'The kernel function applies only to the specific feature mapping, with trivial transformations for permuted features and the need for a totally different kernel function for a different feature mapping.']}, {'end': 3538.79, 'start': 2785.101, 'title': 'Support vector machines and kernel functions', 'summary': "Explains support vector machines, visualizing data in higher dimensions, the use of kernel functions to create non-linear decision boundaries, and mercer's theorem for validating kernel functions.", 'duration': 753.689, 'highlights': ['The chapter explains the concept of support vector machines and visualizes data in higher dimensions to create a linear decision boundary in a 100 trillion dimensional space, resulting in a non-linear decision boundary in the original feature space.', 'It discusses the use of kernel functions to create non-linear decision boundaries by mapping similar data points closer and dissimilar points farther, with an example of a kernel function using e to the negative x minus z squared over 2 sigma squared.', "The explanation of Mercer's theorem outlines the sufficient condition for a function 
to be a valid kernel function, stating that the corresponding kernel matrix must be positive semi-definite for any set of d points, and provides a test for validating kernel functions."]}], 'duration': 890.703, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg2648087.jpg', 'highlights': ['The support vector machine applies the kernel trick to the optimal margin classifier, allowing computation in high dimensional feature spaces with only linear scaling.', 'The number of features in the n plus d choose d scenario is roughly n plus d to the power of d, resulting in a very large number of features.', 'The chapter explains the concept of support vector machines and visualizes data in higher dimensions to create a linear decision boundary in a 100 trillion dimensional space, resulting in a non-linear decision boundary in the original feature space.', 'The kernel function applies only to the specific feature mapping, with trivial transformations for permuted features and the need for a totally different kernel function for a different feature mapping.', 'It discusses the use of kernel functions to create non-linear decision boundaries by mapping similar data points closer and dissimilar points farther, with an example of a kernel function using e to the negative x minus z squared over 2 sigma squared.', "The explanation of Mercer's theorem outlines the sufficient condition for a function to be a valid kernel function, stating that the corresponding kernel matrix must be positive semi-definite for any set of d points, and provides a test for validating kernel functions."]}, {'end': 4819.669, 'segs': [{'end': 3586.653, 'src': 'embed', 'start': 3539.69, 'weight': 8, 'content': [{'end': 3553.754, 'text': 'Um, and it turns out that, oh, the kernel I wrote up there, um, that one k of x, z, right?', 'start': 3539.69, 'duration': 14.064}, {'end': 3555.894, 'text': 'It turns out this is a valid kernel.', 'start': 3554.114, 'duration': 1.78}, {'end': 3557.575, 'text': 'This is called the Gaussian kernel.', 'start': 3556.095, 'duration': 1.48}, {'end': 3564.08, 'text': 'This is, uh, probably the most widely used kernel.', 'start': 3561.118, 'duration': 2.962}, {'end': 3568.322, 'text': 'Um, well, uh, uh, actually the- well, uh, let me.', 'start': 3564.82, 'duration': 3.502}, {'end': 3580.709, 'text': 'Well, the- the- actually the most widely used kernel is- is maybe the linear kernel.', 'start': 3576.667, 'duration': 4.042}, {'end': 3586.653, 'text': 'Um, which just uses K of x, z.', 'start': 3582.69, 'duration': 3.963}], 'summary': 'The gaussian kernel is a widely used valid kernel in machine learning, while the linear kernel is also popular.', 'duration': 46.963, 'max_score': 3539.69, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg3539690.jpg'}, {'end': 3640.337, 'src': 'embed', 'start': 3608.165, 'weight': 3, 'content': [{'end': 3610.566, 'text': "Uh, you're not taking advantage of kernels in other words.", 'start': 3608.165, 'duration': 2.401}, {'end': 3617.291, 'text': 'Uh, but after the linear kernel, the Gaussian kernel is probably the most widely used kernel, uh, the one I wrote.', 'start': 3611.267, 'duration': 6.024}, {'end': 3626.786, 'text': 'up there, and this corresponds to a feature dimensional space that is um infinite dimensional, right?', 'start': 3618.012, 'duration': 8.774}, {'end': 3633.691, 'text': 'And, uh, uh, this this actually, this particular kernel function, 
corresponds to using all monomial features.', 'start': 3627.386, 'duration': 6.305}, {'end': 3640.337, 'text': 'So, if you have, uh, you know, x1 and also x1, x2 and x1 squared, x2 and x1 squared, x5 to the 10 and so on.', 'start': 3633.771, 'duration': 6.566}], 'summary': 'Gaussian kernel widely used, infinite dimensional feature space.', 'duration': 32.172, 'max_score': 3608.165, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg3608165.jpg'}, {'end': 3711.576, 'src': 'embed', 'start': 3687.276, 'weight': 1, 'content': [{'end': 3695.739, 'text': 'Um, it was really popularized by the support vector machine where you know, researchers uh, I guess Vladimir Vapnik and Corinna Cortes,', 'start': 3687.276, 'duration': 8.463}, {'end': 3701.503, 'text': 'uh found that applying this kernel trick to the support vector machine makes for a very effective learning algorithm.', 'start': 3695.739, 'duration': 5.764}, {'end': 3703.745, 'text': 'But the kernel trick is actually more general.', 'start': 3701.923, 'duration': 1.822}, {'end': 3711.576, 'text': 'And if you have any learning algorithm that you can write in terms of inner products like this, then you can apply the kernel trick to it.', 'start': 3703.846, 'duration': 7.73}], 'summary': 'Kernel trick popularized by support vector machine for effective learning', 'duration': 24.3, 'max_score': 3687.276, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg3687276.jpg'}, {'end': 3767.515, 'src': 'embed', 'start': 3739.189, 'weight': 0, 'content': [{'end': 3745.033, 'text': 'So, linear regression, logistic regression, uh, everything in the generalized linear model family, the perceptron algorithm,', 'start': 3739.189, 'duration': 5.844}, {'end': 3752.158, 'text': 'all of the- all of those algorithms, um, uh, you can actually apply the kernel trick too, which means that you can, um,', 'start': 3745.033, 'duration': 7.125}, {'end': 3755.521, 'text': 'apply linear regression in an infinite dimensional feature space, if you wish.', 'start': 3752.158, 'duration': 3.363}, {'end': 3761.593, 'text': "Um, and later in this class, we'll talk about principal components analysis, which you heard of.", 'start': 3757.992, 'duration': 3.601}, {'end': 3767.515, 'text': "But when we talk about principal components analysis, turns out that's yet another algorithm that can be written only in terms of inner products.", 'start': 3761.633, 'duration': 5.882}], 'summary': 'Various algorithms in the generalized linear model family and perceptron can be applied with kernel trick to achieve infinite dimensional feature space.', 'duration': 28.326, 'max_score': 3739.189, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg3739189.jpg'}, {'end': 4069.304, 'src': 'embed', 'start': 4039.528, 'weight': 6, 'content': [{'end': 4049.973, 'text': 'Um, and the SVM is asking for it to not just classify it correctly, but classify it correctly with a- with a functional margin of at least 1.', 'start': 4039.528, 'duration': 10.445}, {'end': 4057.557, 'text': "Um, and if you allow xi i to be positive, then that's relaxing that constraint.", 'start': 4049.973, 'duration': 7.584}, {'end': 4069.304, 'text': "Um, but you don't want the xi i's to be too big, which is why you add to the optimization cost function a cost for making xi i too big.", 'start': 4059.137, 'duration': 10.167}], 'summary': 'SVM aims for correct classification with a 
functional margin of at least 1, optimizing cost function to avoid large xi i values.', 'duration': 29.776, 'max_score': 4039.528, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg4039528.jpg'}, {'end': 4254.46, 'src': 'embed', 'start': 4225.057, 'weight': 2, 'content': [{'end': 4227.098, 'text': 'to have a huge impact on your decision boundary.', 'start': 4225.057, 'duration': 2.041}, {'end': 4237.469, 'text': "And so the L1 norm soft margin SVM, um allows the SVM to still keep the decision boundary closer to the blue line even when there's one outlier,", 'start': 4227.219, 'duration': 10.25}, {'end': 4240.412, 'text': 'and it makes it um much more robust to outliers.', 'start': 4237.469, 'duration': 2.943}, {'end': 4242.815, 'text': 'Okay, Um.', 'start': 4242.595, 'duration': 0.22}, {'end': 4254.46, 'text': 'And then, if you go through the uh representer theorem derivation, uh, you know, represent w as a function of the alphas and so on um,', 'start': 4246.215, 'duration': 8.245}], 'summary': 'L1 norm soft margin SVM keeps boundary close to blue line, more robust to outliers.', 'duration': 29.403, 'max_score': 4225.057, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg4225057.jpg'}, {'end': 4343.213, 'src': 'embed', 'start': 4320.416, 'weight': 9, 'content': [{'end': 4328.044, 'text': 'The only change is that you end up with this additional condition, right? The- the constraints within alpha are between 0 and C.', 'start': 4320.416, 'duration': 7.628}, {'end': 4339.629, 'text': 'Um, and it turns out that, uh, uh, today there are very good, you know, packages, uh, software packages which are solving that for you.', 'start': 4331.721, 'duration': 7.908}, {'end': 4343.213, 'text': 'I, I, I, I think once upon a time they were doing machine learning.', 'start': 4340.27, 'duration': 2.943}], 'summary': 'Software packages now solve constraints within alpha for machine learning.', 'duration': 22.797, 'max_score': 4320.416, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg4320416.jpg'}, {'end': 4473.166, 'src': 'embed', 'start': 4446.271, 'weight': 4, 'content': [{'end': 4452.293, 'text': 'And it turns out that I guess early days of SVMs you know, one of the proof points of SVMs was um,', 'start': 4446.271, 'duration': 6.022}, {'end': 4456.034, 'text': 'a few of the machine learning was doing a lot of work on handwritten digit classification.', 'start': 4452.293, 'duration': 3.741}, {'end': 4458.035, 'text': "So that's a, so a.", 'start': 4456.114, 'duration': 1.921}, {'end': 4464.997, 'text': 'a digit is a matrix of pixels with values that are, you know, 0 or 1, or maybe grayscale values, right?', 'start': 4458.035, 'duration': 6.962}, {'end': 4473.166, 'text': 'And so if you take a list of pixel intensity values and list them so this is 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,', 'start': 4465.097, 'duration': 8.069}], 'summary': 'Early svms were used for handwritten digit classification with pixel intensity values.', 'duration': 26.895, 'max_score': 4446.271, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg4446271.jpg'}, {'end': 4784.806, 'src': 'embed', 'start': 4755.961, 'weight': 5, 'content': [{'end': 4765.566, 'text': 'there can be innovative kernels to use uh in order to measure the similarity of two amino acid sequences or the similarity of two of whatever else,', 'start': 
4755.961, 'duration': 9.605}, {'end': 4774.53, 'text': 'and then to use that to um further classifier, even on very strange shaped object which you know, do not come um as a feature.', 'start': 4765.566, 'duration': 8.964}, {'end': 4784.806, 'text': 'Okay?. So, um, uh, and, and I think actually another example, uh, if the input x is a histogram, you know, maybe you have two different countries,', 'start': 4774.55, 'duration': 10.256}], 'summary': 'Innovative kernels can be used to measure similarity and classify diverse objects.', 'duration': 28.845, 'max_score': 4755.961, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg4755961.jpg'}, {'end': 4816.787, 'src': 'heatmap', 'start': 4774.55, 'weight': 0.792, 'content': [{'end': 4784.806, 'text': 'Okay?. So, um, uh, and, and I think actually another example, uh, if the input x is a histogram, you know, maybe you have two different countries,', 'start': 4774.55, 'duration': 10.256}, {'end': 4787.268, 'text': "you have histograms of people's, uh, demographics.", 'start': 4784.806, 'duration': 2.462}, {'end': 4796.333, 'text': 'It turns out that there is a kernel that taking the min of the two histograms and then summing up to compute a kernel function that inputs two histograms and measures how similar they are.', 'start': 4787.368, 'duration': 8.965}, {'end': 4800.275, 'text': 'So there are many different kernel functions for many different unique types of inputs.', 'start': 4796.353, 'duration': 3.922}, {'end': 4802.056, 'text': 'you might want to classify, okay?', 'start': 4800.275, 'duration': 1.781}, {'end': 4804.502, 'text': "So that's of SVMs.", 'start': 4803.061, 'duration': 1.441}, {'end': 4806.163, 'text': 'uh, very useful algorithm.', 'start': 4804.502, 'duration': 1.661}, {'end': 4810.985, 'text': "And what we'll do on Wednesday is continue with more advice on.", 'start': 4806.503, 'duration': 4.482}, {'end': 4816.787, 'text': "Now that you know all of these learning algorithms, we'll talk about bias and variance to give you more advice on how to actually apply them.", 'start': 4811.005, 'duration': 5.782}], 'summary': 'Kernel function measures similarity of histograms for classification in svms.', 'duration': 42.237, 'max_score': 4774.55, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg4774550.jpg'}], 'start': 3539.69, 'title': 'Kernel methods in machine learning', 'summary': "Discusses the gaussian kernel and its use in creating infinite-dimensional feature spaces, the kernel trick's general applicability to learning algorithms, and the application of support vector machines (svm) in handwritten digit classification and protein sequence classification.", 'chapters': [{'end': 3659.449, 'start': 3539.69, 'title': 'Gaussian kernel in machine learning', 'summary': 'Discusses the concepts of the gaussian kernel and its use in machine learning, highlighting its application in creating infinite-dimensional feature spaces and its widespread usage after the linear kernel.', 'duration': 119.759, 'highlights': ['The Gaussian kernel is probably the most widely used kernel after the linear kernel.', 'The Gaussian kernel corresponds to a feature dimensional space that is infinite dimensional.', 'The linear kernel is maybe the most widely used kernel, which just uses K of x, z.']}, {'end': 4254.46, 'start': 3659.449, 'title': 'The kernel trick and soft margin svm', 'summary': 'Discusses the general applicability of the kernel trick to 
various learning algorithms, its popularization by the support vector machine, and the modification to the basic algorithm through the l1 norm soft margin svm, which relaxes the constraint on the functional margin, and makes the svm more robust to outliers.', 'duration': 595.011, 'highlights': ['The kernel trick can be applied to various learning algorithms, including linear regression, logistic regression, perceptron algorithm, and Kernel Principal Components Analysis, enabling them to operate in an infinite dimensional feature space (Relevance: 5)', 'The support vector machine popularized the kernel trick, making it a very effective learning algorithm, and it is widely applied in practice (Relevance: 4)', 'The L1 norm soft margin SVM relaxes the constraint on the functional margin, allowing for a more robust decision boundary and making the SVM more adaptable to noisy and non-linearly separable datasets (Relevance: 3)']}, {'end': 4819.669, 'start': 4254.46, 'title': 'Support vector machines and kernels', 'summary': 'Explains the optimization problem of svm, the significance of parameter c, and the use of svm kernels, including polynomial and gaussian kernels. it also discusses the application of svms in handwritten digit classification and protein sequence classification.', 'duration': 565.209, 'highlights': ["SVM's optimization problem becomes simpler by adding an additional condition on the alpha values, and modern software packages efficiently solve this problem, reducing the need to worry about numerical optimization.", 'The parameter C in SVM needs to be chosen to balance the accuracy of training examples, and the chapter promises further discussion on how to select this parameter.', 'SVM kernels, including polynomial and Gaussian kernels, are effective in handwritten digit classification, as demonstrated by the success of SVM with a kernel like this on the MNIST dataset.', 'The design of kernels, such as for protein sequence classification, involves representing amino acid sequences as feature vectors and using innovative kernels tailored to specific input data to measure similarity and classify effectively.', 'SVMs are a useful algorithm with innovative kernel functions for various types of inputs, and further advice on applying learning algorithms, including bias and variance, will be discussed.']}], 'duration': 1279.979, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/8NYoQiRANpg/pics/8NYoQiRANpg3539690.jpg', 'highlights': ['The kernel trick can be applied to various learning algorithms, enabling them to operate in an infinite dimensional feature space (Relevance: 5)', 'The support vector machine popularized the kernel trick, making it a very effective learning algorithm (Relevance: 4)', 'The L1 norm soft margin SVM allows for a more robust decision boundary and makes the SVM more adaptable to noisy and non-linearly separable datasets (Relevance: 3)', 'The Gaussian kernel corresponds to a feature dimensional space that is infinite dimensional', 'SVM kernels, including polynomial and Gaussian kernels, are effective in handwritten digit classification', 'The design of kernels for protein sequence classification involves using innovative kernels tailored to specific input data to measure similarity and classify effectively', 'The parameter C in SVM needs to be chosen to balance the accuracy of training examples', 'The Gaussian kernel is probably the most widely used kernel after the linear kernel', 'The linear kernel is maybe the most widely used 
kernel, which just uses K of x, z', "SVM's optimization problem becomes simpler by adding an additional condition on the alpha values, and modern software packages efficiently solve this problem"]}], 'highlights': ['The kernel trick can be applied to various learning algorithms, enabling them to operate in an infinite dimensional feature space (Relevance: 5)', 'The support vector machine popularized the kernel trick, making it a very effective learning algorithm (Relevance: 4)', 'The L1 norm soft margin SVM allows for a more robust decision boundary and makes the SVM more adaptable to noisy and non-linearly separable datasets (Relevance: 3)', 'The optimization problem enforces a geometric margin >= specific value', 'The optimal margin classifier aims to maximize the worst-case geometric margin', 'The optimization problem focuses on maximizing the worst-case geometric margin', 'The representer theorem allows working in potentially very high dimensional feature spaces, such as 100 billion dimensional or infinite dimensional, and teaches how to represent feature vectors and parameters', 'The assumption that the parameters w can be expressed as a linear combination of the training examples is proven by the representer theorem', 'The optimization problem is transformed from maximizing 1 over the norm of w to minimizing 1/2 the norm of w squared', 'The decision boundary remains the same even if parameters w and b are multiplied by any constant', 'The vector W is always at 90 degrees to the decision boundary, separating positive and negative predictions', 'The inner product between feature vectors is crucial for expressing algorithms and efficiently computing even infinite dimensional feature vectors, demonstrating the significance of the kernel trick', 'The parameters in logistic regression are updated using gradient descent by initializing them to zero and iteratively updating them by adding or subtracting a constant multiple of the training examples, resulting in the parameters remaining a linear combination of the training examples', 'The process involves updating the parameters theta by subtracting the learning rate times Xi on each iteration of stochastic gradient descent', 'The optimization objective involves minimizing the norm of W squared and ensuring a specific constraint for every value of i', 'The discussion emphasizes that no matter how many iterations are run, the parameters theta still remain a linear combination of the training examples', 'The optimization of the SVM algorithm is expressed in terms of parameters Alpha and B, and a further simplification of the optimization problem is achieved using convex optimization theory', 'The entire algorithm, including the optimization objective during training and prediction making, is expressed only in terms of inner products', 'The application of the kernel trick involves writing the algorithm in terms of inner products and mapping the original input features to a higher dimensional set of features', 'The chapter explains the process of transforming low-dimensional features to high-dimensional feature vectors, with examples of mapping 1D and 2D inputs to high-dimensional feature spaces', 'The discussion includes the efficient computation of the kernel function, enabling the computation of inner products between high-dimensional feature vectors without explicitly computing the feature vectors', 'The implications of the no free lunch theorem in learning theory are mentioned, highlighting its theoretical significance and the practical 
limitations in the real-world data distributions', 'The feature mapping phi of x transforms three-dimensional input features to nine-dimensional features and can scale to millions for larger dimensions', 'The kernel of x comma z can be computed as x transpose z squared, reducing the time complexity from n squared to order n', 'The time complexity to compute phi of x or phi of x transpose phi of z explicitly is of order n squared', 'The transformation of computational time complexity from order n squared to order n', 'The support vector machine applies the kernel trick to the optimal margin classifier, allowing computation in high dimensional feature spaces with only linear scaling', 'The number of features in the n plus d choose d scenario is roughly n plus d to the power of d, resulting in a very large number of features', 'The chapter explains the concept of support vector machines and visualizes data in higher dimensions to create a linear decision boundary in a 100 trillion dimensional space, resulting in a non-linear decision boundary in the original feature space', 'The kernel function applies only to the specific feature mapping, with trivial transformations for permuted features and the need for a totally different kernel function for a different feature mapping', 'It discusses the use of kernel functions to create non-linear decision boundaries by mapping similar data points closer and dissimilar points farther, with an example of a kernel function using e to the negative x minus z squared over 2 sigma squared', "The explanation of Mercer's theorem outlines the sufficient condition for a function to be a valid kernel function, stating that the corresponding kernel matrix must be positive semi-definite for any set of d points, and provides a test for validating kernel functions"]}
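To make the Mercer's-theorem test summarized above concrete, here is a minimal numerical sketch (our own illustration, not code from the lecture): it builds the kernel matrix of the Gaussian kernel on a handful of points and checks that its eigenvalues are non-negative. Passing this check on one finite sample does not prove a candidate function is a valid kernel, but failing it on any sample proves it is not. The helper names gaussian_kernel and gram_matrix are assumptions of this sketch.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def gram_matrix(kernel, points):
    # Kernel (Gram) matrix: K[i, j] = kernel(x_i, x_j)
    m = len(points)
    K = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            K[i, j] = kernel(points[i], points[j])
    return K

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))                  # six arbitrary points in R^3
K = gram_matrix(gaussian_kernel, X)
print("min eigenvalue:", np.linalg.eigvalsh(K).min())  # >= -1e-10 means numerically PSD here
```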
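The segments above note that any algorithm which can be written purely in terms of inner products, the perceptron included, can be kernelized. Below is a minimal sketch of a kernelized perceptron, assuming labels in {-1, +1} and a user-supplied kernel; the parameter vector is never formed explicitly, only per-example coefficients alpha. All function names here are our own, not from the lecture.

```python
import numpy as np

def kernel_perceptron(X, y, kernel, epochs=5):
    """X: (m, n) inputs; y: (m,) labels in {-1, +1}; kernel(x, z) -> float."""
    m = X.shape[0]
    alpha = np.zeros(m)                                    # one coefficient per training example
    K = np.array([[kernel(a, b) for b in X] for a in X])   # Gram matrix, K[i, j] = K(x_i, x_j)
    for _ in range(epochs):
        for i in range(m):
            # Implicitly, w = sum_j alpha_j * y_j * phi(x_j), so the functional
            # margin on example i needs only kernel evaluations.
            if y[i] * ((alpha * y) @ K[:, i]) <= 0:
                alpha[i] += 1.0                            # mistake-driven update
    return alpha

def kp_predict(X_train, y_train, alpha, kernel, x):
    k = np.array([kernel(xj, x) for xj in X_train])
    return np.sign((alpha * y_train) @ k)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(np.sum(X ** 2, axis=1) > 1.0, 1.0, -1.0)      # labels not linearly separable in R^2
quadratic = lambda a, b: (a @ b + 1.0) ** 2                # a simple polynomial kernel
alpha = kernel_perceptron(X, y, quadratic)
print(kp_predict(X, y, alpha, quadratic, np.array([2.0, 0.0])))  # typically 1.0 (outside unit circle)
```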
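For reference, the l1-norm soft-margin problem described above, with slack variables xi_i and trade-off parameter C, is conventionally written as follows; this is the standard textbook form rather than a transcription of the board work.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\lVert w\rVert^{2} \;+\; C\sum_{i=1}^{m}\xi_i \\
\text{s.t.}\quad & y^{(i)}\bigl(w^{\top}x^{(i)} + b\bigr) \;\ge\; 1-\xi_i,
\qquad \xi_i \ge 0,\quad i=1,\dots,m.
\end{aligned}
```

In the corresponding dual, the only change relative to the separable case is the box constraint 0 <= alpha_i <= C on each dual variable, which is the "additional condition" mentioned in the transcript.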
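As a hedged illustration of the "turnkey" usage described above (a packaged solver handles the dual optimization; the main knobs left to you are C and the kernel), the sketch below fits a soft-margin SVM with a Gaussian (RBF) kernel to the small 8x8 handwritten-digit dataset bundled with scikit-learn. This is not the lecture's code or the MNIST experiment it refers to, and the particular values of C and gamma are arbitrary choices one would normally tune.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)            # each row = 64 pixel intensities of an 8x8 digit
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = SVC(kernel="rbf", C=10.0, gamma=0.001)   # Gaussian kernel; C penalizes slack
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```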
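The closing segments describe kernels for inputs that do not arrive as fixed-length feature vectors: a histogram kernel that takes the elementwise min of two histograms and sums it, and string kernels for amino acid sequences based on counting short subsequences. Two minimal sketches of those ideas follow; the function names are ours, and the k-mer kernel is just one simple variant of the counting idea, not necessarily the exact kernel used in practice.

```python
import numpy as np
from collections import Counter

def histogram_intersection_kernel(p, q):
    # "Take the min of the two histograms and sum up."
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.minimum(p, q).sum()

def kmer_count_kernel(s, t, k=4):
    # Inner product of k-mer (length-k substring) count vectors. The explicit
    # feature vector would have 20**k entries for protein sequences, but the
    # kernel value only needs the substrings that actually occur.
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(cs[w] * ct[w] for w in cs.keys() & ct.keys())

print(histogram_intersection_kernel([3, 0, 2], [1, 4, 2]))   # 1 + 0 + 2 = 3.0
print(kmer_count_kernel("MKVLAAGMKV", "AAGMKVLL", k=3))
```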
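One of the recap highlights above states that gradient descent on logistic regression, initialized at zero, keeps the parameters a linear combination of the training examples. The sketch below is a small numerical check of that statement using batch gradient ascent on the log-likelihood; variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 20, 5
X = rng.normal(size=(m, n))                   # m training examples in R^n
y = (rng.random(m) < 0.5).astype(float)       # labels in {0, 1}

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

theta = np.zeros(n)
beta = np.zeros(m)                            # running coefficients on each training example
lr = 0.1
for _ in range(200):
    residual = y - sigmoid(X @ theta)         # shape (m,)
    theta += lr * (X.T @ residual)            # each step adds (lr * residual_i) * x^{(i)}
    beta += lr * residual                     # ...so record exactly those multiples

print(np.allclose(theta, X.T @ beta))         # True: theta stays in the span of the x^{(i)}
```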
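Another recap highlight states that for the feature map made of all pairwise products x_i * x_j (nine features when the input is three-dimensional), the inner product phi(x)^T phi(z) equals (x^T z)^2, so it can be computed in O(n) time instead of the O(n^2) needed to write out phi. A quick numerical check of that identity:

```python
import numpy as np

def phi(x):
    # All pairwise products x_i * x_j, flattened: n**2 features (9 for n = 3).
    return np.outer(x, x).ravel()

rng = np.random.default_rng(2)
x, z = rng.normal(size=3), rng.normal(size=3)

explicit = phi(x) @ phi(z)            # forms the n**2-dimensional features: O(n^2) work
trick = (x @ z) ** 2                  # kernel trick: O(n) work
print(np.allclose(explicit, trick))   # True
```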