title

Support Vector Machines - The Math of Intelligence (Week 1)

description

Support Vector Machines are a very popular type of machine learning model used for classification when you have a small dataset. We'll go through when to use them, how they work, and build our own using numpy. This is part of Week 1 of The Math of Intelligence. It's a re-recorded version of a video I released a day ago (the audio/video quality is better in this one).
Code for this video:
https://github.com/llSourcell/Classifying_Data_Using_a_Support_Vector_Machine
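To give a flavor of what the from-scratch build in the repo above looks like: the video describes five (x, y, bias) data points, the first two labeled -1 and the last three labeled 1, trained by minimizing the hinge loss plus an L2 regularizer with stochastic gradient descent. The sketch below follows that recipe; the specific coordinates, learning rate, and decaying regularization strength are illustrative choices, not necessarily the exact values used in the repo:

```python
import numpy as np

# Toy training set: each row is (x, y, bias); labels are -1 or 1,
# matching the format described in the video (first two negative,
# last three positive). The coordinates here are illustrative.
X = np.array([
    [-2.0, 4.0, -1.0],
    [ 4.0, 1.0, -1.0],
    [ 1.0, 6.0, -1.0],
    [ 2.0, 4.0, -1.0],
    [ 6.0, 2.0, -1.0],
])
y = np.array([-1, -1, 1, 1, 1])

def svm_sgd(X, y, epochs=10000, eta=1.0):
    """Train a linear SVM with SGD on hinge loss + L2 regularizer."""
    w = np.zeros(X.shape[1])
    for epoch in range(1, epochs + 1):
        lam = 1.0 / epoch  # regularization strength, decayed over time
        for xi, yi in zip(X, y):
            if yi * np.dot(xi, w) < 1:
                # Point is misclassified or inside the margin:
                # step toward it and apply the regularizer.
                w = w + eta * (yi * xi - 2 * lam * w)
            else:
                # Point is on the correct side of the margin:
                # only the regularizer term applies.
                w = w - eta * 2 * lam * w
    return w

w = svm_sgd(X, y)
preds = np.sign(X @ w)
```

Because the toy data is linearly separable, the learned weights end up classifying every training point correctly (`preds` matches `y`); the bias is folded into the weight vector via the third feature, which is the same trick the video uses.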
Please Subscribe! And like. And comment. That's what keeps me going.
Course Syllabus:
https://github.com/llSourcell/The_Math_of_Intelligence
Join us in the Wizards Slack channel:
http://wizards.herokuapp.com/
More Learning resources:
https://www.analyticsvidhya.com/blog/2015/10/understaing-support-vector-machine-example-code/
http://www.robots.ox.ac.uk/~az/lectures/ml/lect2.pdf
http://machinelearningmastery.com/support-vector-machines-for-machine-learning/
http://www.cs.columbia.edu/~kathy/cs4701/documents/jason_svm_tutorial.pdf
http://www.statsoft.com/Textbook/Support-Vector-Machines
https://www.youtube.com/watch?v=_PwhiWxHK8o
And please support me on Patreon:
https://www.patreon.com/user?u=3191693
Follow me:
Twitter: https://twitter.com/sirajraval
Facebook: https://www.facebook.com/sirajology
Instagram: https://www.instagram.com/sirajraval/
Signup for my newsletter for exciting updates in the field of AI:
https://goo.gl/FZzJ5w
Hit the Join button above to sign up to become a member of my channel for access to exclusive content!
Join my AI community: http://chatgptschool.io/
Sign up for my AI Sports betting Bot, WagerGPT! (500 spots available):
https://www.wagergpt.co
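The lecture summary below centers on the hinge loss and the regularized objective (minimize lambda * ||w||^2 plus the summed hinge loss over all n data points). As a quick reference, that objective can be written in a few lines of numpy; this is an illustrative sketch, with variable names of my own choosing:

```python
import numpy as np

def hinge_loss(w, X, y):
    """Total hinge loss: sum over points of max(0, 1 - y_i * <x_i, w>)."""
    return np.sum(np.maximum(0.0, 1.0 - y * (X @ w)))

def objective(w, X, y, lam):
    """Regularized SVM objective: lam * ||w||^2 + total hinge loss.

    lam is the tuning knob from the lecture: it trades off how wide
    the margin is against how well we fit the training points.
    """
    return lam * np.dot(w, w) + hinge_loss(w, X, y)

w = np.array([1.0, 1.0])
# A point well inside the correct side of the margin contributes zero loss:
loss_correct = hinge_loss(w, np.array([[2.0, 2.0]]), np.array([1]))
# The same point with the opposite label contributes 1 - y*<x,w> = 5:
loss_wrong = hinge_loss(w, np.array([[2.0, 2.0]]), np.array([-1]))
```

Note that points on the correct side of the margin contribute nothing to the loss; only the support vectors (and any violators) drive the optimization, which is why the margin-maximizing framing and the hinge-loss framing coincide.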

detail

{'title': 'Support Vector Machines - The Math of Intelligence (Week 1)', 'heatmap': [{'end': 921.255, 'start': 857.485, 'weight': 0.838}, {'end': 1133.692, 'start': 1033.146, 'weight': 0.746}, {'end': 1277.469, 'start': 1251.281, 'weight': 0.707}], 'summary': 'Covers support vector machines, highlighting their applications for classification, regression, and outlier detection, as well as their suitability for small datasets (1000 rows or less). it explains concepts like hyperplane, kernel trick, and hinge loss, and discusses weight updating, learning rate, and model convergence in svm using stochastic gradient descent.', 'chapters': [{'end': 431.966, 'segs': [{'end': 55.134, 'src': 'embed', 'start': 24.727, 'weight': 0, 'content': [{'end': 31.87, 'text': "And the way we're going to optimize the support vector machine, this type of machine learning model, is to use gradient descent.", 'start': 24.727, 'duration': 7.143}, {'end': 33.871, 'text': "So that's what we're going to do.", 'start': 32.551, 'duration': 1.32}, {'end': 35.312, 'text': 'And this is what it looks like.', 'start': 34.251, 'duration': 1.061}, {'end': 37.314, 'text': 'It looks like this.', 'start': 36.212, 'duration': 1.102}, {'end': 38.716, 'text': "So say we've got two classes.", 'start': 37.514, 'duration': 1.202}, {'end': 45.224, 'text': "We have one class that's going to be denoted by red dots, and the other class is going to be denoted by blue dots.", 'start': 38.756, 'duration': 6.468}, {'end': 55.134, 'text': 'So if we were to plot both classes on a 2D graph, an XY graph, then we could draw a line, a decision boundary,', 'start': 45.545, 'duration': 9.589}], 'summary': 'Optimizing support vector machine using gradient descent for two classes on a 2d graph.', 'duration': 30.407, 'max_score': 24.727, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/g8D5YL6cOSE/pics/g8D5YL6cOSE24727.jpg'}, {'end': 262.957, 'src': 'embed', 'start': 237.557, 'weight': 1, 'content': 
[{'end': 247.445, 'text': 'The idea is that for a human, given some metrics like their age and their pulse rate, obviously, we can predict what their emotions will be.', 'start': 237.557, 'duration': 9.888}, {'end': 249.107, 'text': "It's emotion classification.", 'start': 247.505, 'duration': 1.602}, {'end': 255.031, 'text': "We're using an SVM in this repository as well, as well as scikit-learn to implement that SVM.", 'start': 249.647, 'duration': 5.384}, {'end': 262.957, 'text': 'Check out those two repositories once you really understand the math behind support vector machines from this video and from the associated code.', 'start': 256.172, 'duration': 6.785}], 'summary': 'Using svm to predict human emotions based on age and pulse rate.', 'duration': 25.4, 'max_score': 237.557, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/g8D5YL6cOSE/pics/g8D5YL6cOSE237557.jpg'}, {'end': 324.958, 'src': 'embed', 'start': 276.368, 'weight': 3, 'content': [{'end': 281.211, 'text': 'Well, as a rule of thumb, SVMs are great if you have small data sets.', 'start': 276.368, 'duration': 4.843}, {'end': 287.235, 'text': "So I'm saying like a thousand rows or less of data, right? 
A thousand data points or less.", 'start': 281.291, 'duration': 5.944}, {'end': 293.139, 'text': 'If we have that, then SVMs are great for classification and they are very popular.', 'start': 288.136, 'duration': 5.003}, {'end': 299.563, 'text': 'However, other algorithms random forests, deep neural networks, etc.', 'start': 294.56, 'duration': 5.003}, {'end': 305.166, 'text': 'require more data, But almost always come up with a very robust model,', 'start': 299.563, 'duration': 5.603}, {'end': 310.689, 'text': 'and the decision of which classifier to use depends on both your problem and your data.', 'start': 305.166, 'duration': 5.523}, {'end': 316.272, 'text': 'and As you build this mathematical intuition, all of these choices will become very clear to you.', 'start': 310.689, 'duration': 5.583}, {'end': 318.613, 'text': "So we're starting off with the support vector machine, okay?", 'start': 316.272, 'duration': 2.341}, {'end': 324.958, 'text': 'So I also have this quote by this famous, a computer science professor, or Donald Knuth, who know?', 'start': 319.513, 'duration': 5.445}], 'summary': 'Svms are great for small data sets (<= 1000 rows) for classification, but other algorithms like random forests and deep neural networks require more data, yet produce robust models. choice of classifier depends on problem and data, as mathematical intuition develops over time.', 'duration': 48.59, 'max_score': 276.368, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/g8D5YL6cOSE/pics/g8D5YL6cOSE276368.jpg'}, {'end': 415.023, 'src': 'embed', 'start': 391.502, 'weight': 2, 'content': [{'end': 398.949, 'text': 'So what is a support vector machine? 
So this thing can be used for both classification, is it this, is it this, is it this, and regression.', 'start': 391.502, 'duration': 7.447}, {'end': 401.431, 'text': "Got these points, what's the next point in the series?", 'start': 399.169, 'duration': 2.262}, {'end': 410.9, 'text': 'So, given two or more labeled classes of data remember we are using supervised learning it can create a discriminative classifier.', 'start': 402.092, 'duration': 8.808}, {'end': 415.023, 'text': 'That is a classifier that can discriminate between different classes.', 'start': 410.96, 'duration': 4.063}], 'summary': 'Support vector machine used for classification and regression, creating discriminative classifier.', 'duration': 23.521, 'max_score': 391.502, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/g8D5YL6cOSE/pics/g8D5YL6cOSE391502.jpg'}], 'start': 0.049, 'title': 'Support vector machines', 'summary': "Covers building a support vector machine for classification using gradient descent, with applications including classification, regression, outlier detection, and supervised classification. it provides examples for handwritten digit and emotion classification, and emphasizes svm's suitability for small data sets (1000 rows or less) compared to other algorithms.", 'chapters': [{'end': 255.031, 'start': 0.049, 'title': 'Support vector machine basics', 'summary': 'Covers building a support vector machine to classify two classes of data using gradient descent, with potential use cases including classification, regression, outlier detection, and supervised classification, and provides examples for handwritten digit classification and emotion classification using svm.', 'duration': 254.982, 'highlights': ['The chapter covers building a support vector machine to classify two classes of data using gradient descent. 
Siraj explains building a support vector machine using gradient descent to classify two classes of data, emphasizing the importance of maintaining quality standards on the channel.', 'Potential use cases include classification, regression, outlier detection, and supervised classification. Siraj discusses potential use cases for support vector machines, including classification, regression, outlier detection, and supervised classification.', 'Provides examples for handwritten digit classification and emotion classification using SVM. Siraj provides examples of using SVM for handwritten digit classification and emotion classification, showcasing the practical application of the support vector machine.']}, {'end': 431.966, 'start': 256.172, 'title': 'Support vector machines: overview & application', 'summary': 'Introduces support vector machines (svms) as a powerful classification algorithm, highlighting its suitability for small data sets (1000 rows or less) and emphasizing the importance of choosing the right classifier based on the problem and data, in comparison to other algorithms like random forests and deep neural networks.', 'duration': 175.794, 'highlights': ['SVMs are great for classification with small data sets, typically 1000 rows or less, and are very popular (quantifiable: suitable for small data sets).', 'The decision of which classifier to use depends on both the problem and the data, emphasizing the importance of choosing the right classifier based on the problem and data (quantifiable: importance of choosing the right classifier).', 'Premature optimization is discouraged, highlighting the importance of using the minimum amount of work necessary to achieve the desired results (quantifiable: discouraging unnecessary use of complex models).', 'SVMs can be used for both classification and regression, and can create a discriminative classifier for supervised learning (quantifiable: application of SVMs in both classification and regression, creation of 
discriminative classifier).']}], 'duration': 431.917, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/g8D5YL6cOSE/pics/g8D5YL6cOSE49.jpg', 'highlights': ['The chapter covers building a support vector machine to classify two classes of data using gradient descent.', 'Provides examples for handwritten digit classification and emotion classification using SVM.', 'SVMs can be used for both classification and regression, and can create a discriminative classifier for supervised learning.', 'The decision of which classifier to use depends on both the problem and the data, emphasizing the importance of choosing the right classifier based on the problem and data.', 'SVMs are great for classification with small data sets, typically 1000 rows or less, and are very popular.']}, {'end': 740.618, 'segs': [{'end': 478.593, 'src': 'embed', 'start': 432.186, 'weight': 1, 'content': [{'end': 439.408, 'text': "Anyway, so the way we build this hyperplane and we'll talk about that term but the way we build this hyperplane, or line,", 'start': 432.186, 'duration': 7.222}, {'end': 448.293, 'text': 'this decision boundary between the classes, is by maximizing the margin, that is, the space between that line and both of those classes.', 'start': 439.408, 'duration': 8.885}, {'end': 449.774, 'text': 'What do I mean by that?', 'start': 449.033, 'duration': 0.741}, {'end': 454.057, 'text': 'When I say both of those classes, what I actually mean are the points in each.', 'start': 450.054, 'duration': 4.003}, {'end': 455.238, 'text': 'check out this image.', 'start': 454.057, 'duration': 1.181}, {'end': 460.242, 'text': 'are the points in each of those classes that are closest to the decision boundary?', 'start': 455.238, 'duration': 5.004}, {'end': 463.745, 'text': 'And these points are called support vectors.', 'start': 460.742, 'duration': 3.003}, {'end': 466.947, 'text': 'Okay, we call them support vectors because they are vectors.', 'start': 464.205, 
'duration': 2.742}, {'end': 473.932, 'text': 'they are data point vectors that support the creation of this hyperplane that our support vector machine will create.', 'start': 466.947, 'duration': 6.985}, {'end': 477.773, 'text': 'So we are maximizing the margin.', 'start': 475.992, 'duration': 1.781}, {'end': 478.593, 'text': 'And why do we do that?', 'start': 477.813, 'duration': 0.78}], 'summary': 'Building a hyperplane to maximize margin for support vector machine.', 'duration': 46.407, 'max_score': 432.186, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/g8D5YL6cOSE/pics/g8D5YL6cOSE432186.jpg'}, {'end': 544.187, 'src': 'embed', 'start': 500.521, 'weight': 0, 'content': [{'end': 509.114, 'text': 'And the only way to do that, to maximize the space with which a new data point can fall into its correct class category,', 'start': 500.521, 'duration': 8.593}, {'end': 514.422, 'text': 'is to maximize the space between data points and put a line right in the middle of that space.', 'start': 509.114, 'duration': 5.308}, {'end': 515.363, 'text': "You see what I'm saying?", 'start': 514.682, 'duration': 0.681}, {'end': 520.328, 'text': "Can't get feedback, but I'm just gonna assume that that that that was intuitive, right.", 'start': 516.284, 'duration': 4.044}, {'end': 521.347, 'text': 'so right.', 'start': 520.328, 'duration': 1.019}, {'end': 522.509, 'text': 'so small margin.', 'start': 521.347, 'duration': 1.162}, {'end': 528.414, 'text': "We're maximizing the margin and we're trying to draw a decision boundary, a line of best, not a line of best fit,", 'start': 522.509, 'duration': 5.905}, {'end': 532.278, 'text': 'But a line of best classification between both of those.', 'start': 528.414, 'duration': 3.864}, {'end': 534.62, 'text': 'and we call this line a hyper plane.', 'start': 532.278, 'duration': 2.342}, {'end': 537.422, 'text': 'and Okay, so what is a hyperplane?', 'start': 534.62, 'duration': 2.802}, {'end': 540.304, 'text': 
'well, a hyperplane is a decision surface.', 'start': 537.422, 'duration': 2.882}, {'end': 542.305, 'text': 'so given n dimensions, right.', 'start': 540.304, 'duration': 2.001}, {'end': 544.187, 'text': "so let's say our data is n dimensional.", 'start': 542.305, 'duration': 1.882}], 'summary': 'Maximize space between data points to draw a hyperplane for better classification.', 'duration': 43.666, 'max_score': 500.521, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/g8D5YL6cOSE/pics/g8D5YL6cOSE500521.jpg'}, {'end': 621.276, 'src': 'embed', 'start': 590.662, 'weight': 4, 'content': [{'end': 593.483, 'text': 'And so we can extrapolate this to many dimensions.', 'start': 590.662, 'duration': 2.821}, {'end': 602.587, 'text': "So if we had a 400 dimensional space, which we often do in machine learning, our data doesn't just have two or three features, it has many,", 'start': 593.703, 'duration': 8.884}, {'end': 603.488, 'text': 'many features right?', 'start': 602.587, 'duration': 0.901}, {'end': 605.989, 'text': "It's not so neatly packaged for us to visualize.", 'start': 603.528, 'duration': 2.461}, {'end': 610.391, 'text': "And that's where techniques like dimensionality reduction and all this come into play, which we'll talk about.", 'start': 606.369, 'duration': 4.022}, {'end': 621.276, 'text': "But right now, if we have a 400 dimensional space graph of points, then a hyperplane would be 399 dimensions, which we can't really visualize.", 'start': 610.851, 'duration': 10.425}], 'summary': 'In machine learning, dealing with high-dimensional data (400 dimensions) presents visualization challenges and necessitates techniques like dimensionality reduction.', 'duration': 30.614, 'max_score': 590.662, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/g8D5YL6cOSE/pics/g8D5YL6cOSE590662.jpg'}, {'end': 725.985, 'src': 'embed', 'start': 701.494, 'weight': 5, 'content': [{'end': 709.476, 'text': "But there's a way to 
map that input space into a feature space such that the hyperplane that you draw is linear, even though it wouldn't be otherwise.", 'start': 701.494, 'duration': 7.982}, {'end': 711.337, 'text': "And so that's called the kernel trick.", 'start': 709.876, 'duration': 1.461}, {'end': 712.837, 'text': "And we'll talk about that later.", 'start': 711.597, 'duration': 1.24}, {'end': 716.718, 'text': "So we're only talking about linear classification.", 'start': 713.477, 'duration': 3.241}, {'end': 718.679, 'text': 'for support vector machines.', 'start': 717.358, 'duration': 1.321}, {'end': 722.963, 'text': 'Supervised linear classification, right? As opposed to unsupervised.', 'start': 719.12, 'duration': 3.843}, {'end': 725.985, 'text': "Anyway, there's so many different ways that we can frame this problem right?", 'start': 723.503, 'duration': 2.482}], 'summary': 'Introduction to mapping input space for linear hyperplane in support vector machines for supervised linear classification.', 'duration': 24.491, 'max_score': 701.494, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/g8D5YL6cOSE/pics/g8D5YL6cOSE701494.jpg'}], 'start': 432.186, 'title': 'Support vector machine and hyperplane in machine learning', 'summary': 'Explains the concepts of support vector machine and hyperplane in machine learning, emphasizing the technique of maximizing the margin to create a decision boundary and the use of hyperplane as a decision surface in n-dimensional space. 
it also introduces the kernel trick and the challenges of visualizing data in high dimensions.', 'chapters': [{'end': 522.509, 'start': 432.186, 'title': 'Support vector machine', 'summary': 'Explains the concept of a support vector machine, emphasizing the technique of maximizing the margin to create a decision boundary that accurately classifies data points, ensuring new data points have a maximum likelihood of falling into their correct class category.', 'duration': 90.323, 'highlights': ['Maximizing the margin is crucial for drawing a line that is in the absolute perfect middle spot between both sets of data, ensuring new data points have the maximum likelihood of falling on the correct side of the decision boundary. This emphasizes the importance of maximizing the margin to position the decision boundary accurately, ensuring new data points have a maximum likelihood of falling on the correct side. This is crucial for classification accuracy.', "Support vectors are the data point vectors that support the creation of the hyperplane, aiding in maximizing the margin for accurate classification. Support vectors are crucial for creating the hyperplane and maximizing the margin, contributing to accurate classification. These points are pivotal for the support vector machine's functionality.", 'Building a hyperplane and maximizing the margin involves drawing a line that separates the classes and ensures the maximum space between data points, crucial for accurate classification. The process of building a hyperplane and maximizing the margin involves drawing a line that separates the classes and ensures the maximum space between data points, which is essential for accurate classification. 
This technique ensures a clear decision boundary.']}, {'end': 740.618, 'start': 522.509, 'title': 'Hyperplane in machine learning', 'summary': 'Explains the concept of a hyperplane in machine learning, which acts as a decision surface in n-dimensional space, enabling linear classification and support vector machines. it also introduces the idea of the kernel trick and the challenges of visualizing data in high dimensions.', 'duration': 218.109, 'highlights': ['A hyperplane acts as a decision surface in n-dimensional space, with dimensions being n minus one, enabling linear classification and support vector machines.', 'In a 400-dimensional space, a hyperplane would be 399 dimensions, illustrating the challenges of visualizing high-dimensional data.', 'The concept of the kernel trick is introduced as a method to map input space into a feature space, enabling linear hyperplane classification for nonlinear data.', 'The chapter emphasizes the challenges of visualizing data in high dimensions and highlights the excitement and ongoing discoveries in the field of machine learning.', 'The discussion is focused on linear classification and support vector machines, with a mention of the distinction between supervised and unsupervised learning.']}], 'duration': 308.432, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/g8D5YL6cOSE/pics/g8D5YL6cOSE432186.jpg', 'highlights': ['Maximizing the margin is crucial for accurate classification, ensuring new data points fall on the correct side.', "Support vectors aid in maximizing the margin for accurate classification, pivotal for the support vector machine's functionality.", 'Building a hyperplane and maximizing the margin involves drawing a line that separates the classes and ensures maximum space between data points, essential for accurate classification.', 'A hyperplane acts as a decision surface in n-dimensional space, enabling linear classification and support vector machines.', 'The challenges of 
visualizing high-dimensional data are emphasized, highlighting the excitement and ongoing discoveries in machine learning.', 'The kernel trick is introduced as a method to enable linear hyperplane classification for nonlinear data.']}, {'end': 981.015, 'segs': [{'end': 767.25, 'src': 'embed', 'start': 741.667, 'weight': 1, 'content': [{'end': 749.555, 'text': "So, no matter what model you're using a random forest, a support vector machine, a deep neural network in the end we are approximating.", 'start': 741.667, 'duration': 7.888}, {'end': 754.38, 'text': 'we are guessing iteratively, we are educated guessing.', 'start': 749.555, 'duration': 4.825}, {'end': 758.545, 'text': "I'm trying to think of a different word for approximation, but we are approximating a function.", 'start': 754.38, 'duration': 4.165}, {'end': 762.427, 'text': 'Right, we are trying to find what is the, what is that optimal function?', 'start': 759.145, 'duration': 3.282}, {'end': 767.25, 'text': 'and that function represents the relationship? 
Between all the variables in our data.', 'start': 762.427, 'duration': 4.823}], 'summary': 'In machine learning, models approximate functions to find optimal relationships between variables in data.', 'duration': 25.583, 'max_score': 741.667, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/g8D5YL6cOSE/pics/g8D5YL6cOSE741667.jpg'}, {'end': 810.636, 'src': 'embed', 'start': 781.638, 'weight': 0, 'content': [{'end': 783.699, 'text': 'And its coefficients are its weights.', 'start': 781.638, 'duration': 2.061}, {'end': 791.064, 'text': "And they are being updated over time through some optimization technique, be that gradient descent, usually, or Newton's method,", 'start': 784.06, 'duration': 7.004}, {'end': 793.025, 'text': "which we'll learn about, or whatever it is.", 'start': 791.064, 'duration': 1.961}, {'end': 797.968, 'text': "So whatever it is, we're just trying to approximate a function.", 'start': 793.806, 'duration': 4.162}, {'end': 802.151, 'text': "This is a way of thinking, approximating a function, whatever we're using.", 'start': 797.988, 'duration': 4.163}, {'end': 803.572, 'text': 'Decision forest.', 'start': 802.751, 'duration': 0.821}, {'end': 806.733, 'text': "whatever we're using, it's all about approximating a function.", 'start': 803.572, 'duration': 3.161}, {'end': 808.494, 'text': 'decision trees, all right.', 'start': 806.733, 'duration': 1.761}, {'end': 810.636, 'text': "So let's go ahead and get and get to building right.", 'start': 808.494, 'duration': 2.142}], 'summary': 'The coefficients are being updated over time through optimization techniques to approximate a function using methods such as gradient descent and decision trees.', 'duration': 28.998, 'max_score': 781.638, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/g8D5YL6cOSE/pics/g8D5YL6cOSE781638.jpg'}, {'end': 921.255, 'src': 'heatmap', 'start': 849.058, 'weight': 2, 'content': [{'end': 857.485, 'text': 'we have this set 
of x, y coordinate pairs that we can plot on a graph, and Each of these data points has an associated label, an output label.', 'start': 849.058, 'duration': 8.427}, {'end': 861.669, 'text': 'that output label is either a negative one or a one.', 'start': 857.485, 'duration': 4.184}, {'end': 865.312, 'text': "Okay, so for the first two they're going to be negative one and for the last three,", 'start': 861.669, 'duration': 3.643}, {'end': 866.173, 'text': "They're going to be one.", 'start': 865.332, 'duration': 0.841}, {'end': 867.534, 'text': 'So these last three.', 'start': 866.173, 'duration': 1.361}, {'end': 873.24, 'text': 'so what we can do is we can plot these examples on a 2D graph and Okay,', 'start': 867.534, 'duration': 5.706}, {'end': 880.953, 'text': "so we can say let's plot the first two with the negative marker and let's plot the last three with the positive marker.", 'start': 873.24, 'duration': 7.713}, {'end': 883.598, 'text': 'Okay, and so when we plot it, it looks like this.', 'start': 881.595, 'duration': 2.003}, {'end': 890.458, 'text': "And what we're also going to do is we're going to plot a possible hyperplane.", 'start': 885.134, 'duration': 5.324}, {'end': 894.14, 'text': "That is a hyperplane that is just a line that we don't know.", 'start': 890.858, 'duration': 3.282}, {'end': 895.501, 'text': "It's just our naive guess.", 'start': 894.34, 'duration': 1.161}, {'end': 897.582, 'text': "We don't know if it's the optimal hyperplane.", 'start': 895.601, 'duration': 1.981}, {'end': 898.503, 'text': "In fact, it's not.", 'start': 897.622, 'duration': 0.881}, {'end': 904.947, 'text': 'But it just so happens to perfectly separate our training data classes, just for us to just see what it looks like.', 'start': 898.903, 'duration': 6.044}, {'end': 906.609, 'text': 'This is just for that example.', 'start': 904.987, 'duration': 1.622}, {'end': 909.21, 'text': "Okay, so that's that.", 'start': 907.729, 'duration': 1.481}, {'end': 911.791, 
'text': 'so now what we can do is get into the math.', 'start': 909.21, 'duration': 2.581}, {'end': 912.611, 'text': "so I hope you're ready for this.", 'start': 911.791, 'duration': 0.82}, {'end': 914.732, 'text': "alright. So let's get into our calculus.", 'start': 912.611, 'duration': 2.121}, {'end': 921.255, 'text': 'alright. so right, machine learning, machine learning is all about optimizing for an objective function,', 'start': 914.732, 'duration': 6.523}], 'summary': 'Using x, y coordinate pairs, we plot data points with associated output labels of -1 and 1, then visualize a hyperplane that separates the classes.', 'duration': 45.082, 'max_score': 849.058, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/g8D5YL6cOSE/pics/g8D5YL6cOSE849058.jpg'}, {'end': 963.371, 'src': 'embed', 'start': 935.671, 'weight': 3, 'content': [{'end': 939.015, 'text': 'Our loss function in this case is going to be called the hinge loss.', 'start': 935.671, 'duration': 3.344}, {'end': 944.561, 'text': 'So the hinge loss is a very popular type of loss function for support vector machines.', 'start': 939.515, 'duration': 5.046}, {'end': 952.606, 'text': 'okay and the class of algorithms that support vector machines fall under our maximum margin classification algorithms.', 'start': 945.262, 'duration': 7.344}, {'end': 957.128, 'text': 'right, we are trying to maximize the margin, that is, the distance between classes,', 'start': 952.606, 'duration': 4.522}, {'end': 963.371, 'text': 'such that we can draw the best decision boundary between those classes that best separates both of those classes.', 'start': 957.128, 'duration': 6.243}], 'summary': 'Using hinge loss in support vector machines to maximize margin for best class separation.', 'duration': 27.7, 'max_score': 935.671, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/g8D5YL6cOSE/pics/g8D5YL6cOSE935671.jpg'}], 'start': 741.667, 'title': 'Approximating functions in machine 
learning', 'summary': "Explains the goal of approximating a function in machine learning, regardless of the model used, with coefficients being updated through optimization techniques like gradient descent or newton's method.", 'chapters': [{'end': 827.766, 'start': 741.667, 'title': 'Approximating functions in machine learning', 'summary': "Explains that in machine learning, regardless of the model used, the primary goal is to approximate a function that represents the relationship between variables in the data, with the coefficients being updated over time through optimization techniques like gradient descent or newton's method.", 'duration': 86.099, 'highlights': ['In machine learning, the primary goal is to approximate a function that represents the relationship between variables in the data. Machine learning models aim to approximate a function that represents the relationship between variables in the data, with the ultimate goal of learning from the data.', "The coefficients of the function in machine learning models are updated over time through optimization techniques like gradient descent or Newton's method. The coefficients of the function in machine learning models are updated over time through optimization techniques like gradient descent or Newton's method, which aids in achieving the best approximation of the function.", 'The chapter emphasizes that regardless of the specific model used, the core concept in machine learning is approximating a function. 
The chapter stresses that regardless of the specific model used, the fundamental concept in machine learning remains the approximation of a function to represent the data relationship.']}, {'end': 981.015, 'start': 827.766, 'title': 'Support vector machines', 'summary': 'Introduces the concept of support vector machines by explaining the format of the data, plotting the data points on a graph, and defining the hinge loss function for support vector machines.', 'duration': 153.249, 'highlights': ['The chapter introduces the concept of support vector machines by explaining the format of the data The data is in the form of X, Y, and bias, with five data points. The x and y coordinates are used to plot the data points on a graph, and each point has an associated output label of either -1 or 1.', 'Defining the hinge loss function for support vector machines The hinge loss is a popular loss function for support vector machines, falling under maximum margin classification algorithms. It aims to maximize the margin between classes to draw the best decision boundary.', 'Plotting the data points on a graph and discussing the possible hyperplane The data points are plotted on a 2D graph, with the first two points labeled as negative and the last three as positive. A possible hyperplane is plotted to separate the training data classes.']}], 'duration': 239.348, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/g8D5YL6cOSE/pics/g8D5YL6cOSE741667.jpg', 'highlights': ["The coefficients of the function in machine learning models are updated over time through optimization techniques like gradient descent or Newton's method.", 'The chapter emphasizes that regardless of the specific model used, the core concept in machine learning is approximating a function.', 'The data is in the form of X, Y, and bias, with five data points. 
The x and y coordinates are used to plot the data points on a graph, and each point has an associated output label of either -1 or 1.', 'The hinge loss is a popular loss function for support vector machines, falling under maximum margin classification algorithms. It aims to maximize the margin between classes to draw the best decision boundary.', 'Plotting the data points on a graph and discussing the possible hyperplane. The data points are plotted on a 2D graph, with the first two points labeled as negative and the last three as positive. A possible hyperplane is plotted to separate the training data classes.']}, {'end': 1225.948, 'segs': [{'end': 1133.692, 'src': 'heatmap', 'start': 1033.146, 'weight': 0.746, 'content': [{'end': 1041.913, 'text': 'And remember, this y and this f of x, both of these values, these scalar values, these single values, are going to be a single number.', 'start': 1033.146, 'duration': 8.767}, {'end': 1043.934, 'text': "And that's why we can multiply them.", 'start': 1042.252, 'duration': 1.682}, {'end': 1051.426, 'text': 'And so our objective function, then, is going to consist of the loss function, which notice how it looks a little different,', 'start': 1045.015, 'duration': 6.411}, {'end': 1055.893, 'text': "but it's really the same thing: this 1 minus y times xw.", 'start': 1051.426, 'duration': 4.467}, {'end': 1058.617, 'text': "It's the same as this loss up here, it's just a different way of denoting it.", 'start': 1055.933, 'duration': 2.684}, {'end': 1069.106, 'text': 'And we can say the sigma term means that we are going to take the sum of terms where the number of terms is n and n is the number of data points that we have.', 'start': 1061.141, 'duration': 7.965}, {'end': 1076.07, 'text': "So, for all five data points, we'll find the loss of each of those data points using this loss function, the hinge loss,", 'start': 1069.386, 'duration': 6.684}, {'end': 1083.134, 'text': "and we'll sum them all up together, and that total 
sum will represent our total loss for our data.", 'start': 1076.07, 'duration': 7.064}, {'end': 1084.275, 'text': "right? That's a single number.", 'start': 1083.134, 'duration': 1.141}, {'end': 1090.229, 'text': "it's going to be a single number and then, once we have that, we're going to define our objective function.", 'start': 1085.287, 'duration': 4.942}, {'end': 1098.813, 'text': 'so our objective function in this case is going to be denoted by this min, lambda, w, okay, with the square sign, and so what is this?', 'start': 1090.229, 'duration': 8.584}, {'end': 1105.896, 'text': 'so our objective function is going to be denoted by the loss plus this regularizer term, which is denoted right here with this min and the lambda.', 'start': 1098.813, 'duration': 7.083}, {'end': 1114.991, 'text': 'so a regularizer is a tuning knob, and what the regularizer does is it tells us how best to fit our data.', 'start': 1105.896, 'duration': 9.095}, {'end': 1123.523, 'text': "So if the regularizer term is too high, then our model will be overfit to the training data, and it's not gonna generalize well to new data points.", 'start': 1115.131, 'duration': 8.392}, {'end': 1124.724, 'text': "It's going to be overfit.", 'start': 1123.603, 'duration': 1.121}, {'end': 1129.331, 'text': 'But if the regularizer term is too low, then our model is going to be underfit.', 'start': 1125.065, 'duration': 4.266}, {'end': 1133.692, 'text': "So that means it's going to be too generalized, and it will have a large training error.", 'start': 1129.631, 'duration': 4.061}], 'summary': 'Objective function consists of loss function and regularizer term, optimizing model fit.', 'duration': 100.546, 'max_score': 1033.146, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/g8D5YL6cOSE/pics/g8D5YL6cOSE1033146.jpg'}, {'end': 1098.813, 'src': 'embed', 'start': 1061.141, 'weight': 0, 'content': [{'end': 1069.106, 'text': 'And we can say the sigma term means that we are 
going to take the sum of terms where the number of terms is n and n is the number of data points that we have.', 'start': 1061.141, 'duration': 7.965}, {'end': 1076.07, 'text': "So, for all five data points, we'll find the loss of each of those data points using this loss function, the hinge loss,", 'start': 1069.386, 'duration': 6.684}, {'end': 1083.134, 'text': "and we'll sum them all up together, and that total sum will represent our total loss for our data.", 'start': 1076.07, 'duration': 7.064}, {'end': 1084.275, 'text': "right? That's a single number.", 'start': 1083.134, 'duration': 1.141}, {'end': 1090.229, 'text': "it's going to be a single number and then, once we have that, we're going to define our objective function.", 'start': 1085.287, 'duration': 4.942}, {'end': 1098.813, 'text': 'so our objective function in this case is going to be denoted by this min, lambda, w, okay, with the square sign, and so what is this?', 'start': 1090.229, 'duration': 8.584}], 'summary': 'The total loss for five data points is represented by a single number which is used to define the objective function.', 'duration': 37.672, 'max_score': 1061.141, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/g8D5YL6cOSE/pics/g8D5YL6cOSE1061141.jpg'}, {'end': 1141.715, 'src': 'embed', 'start': 1115.131, 'weight': 5, 'content': [{'end': 1123.523, 'text': "So if the regularizer term is too high, then our model will be overfit to the training data, and it's not gonna generalize well to new data points.", 'start': 1115.131, 'duration': 8.392}, {'end': 1124.724, 'text': "It's going to be overfit.", 'start': 1123.603, 'duration': 1.121}, {'end': 1129.331, 'text': 'But if the regularizer term is too low, then our model is going to be underfit.', 'start': 1125.065, 'duration': 4.266}, {'end': 1133.692, 'text': "So that means it's going to be too generalized, and it will have a large training error.", 'start': 1129.631, 'duration': 4.061}, {'end': 1141.715, 
'text': 'So we need the perfect regularizer term for our model to be as generalizable as possible and fit to our training data.', 'start': 1133.953, 'duration': 7.762}], 'summary': 'Optimizing regularizer term minimizes overfitting and underfitting, improving model generalizability.', 'duration': 26.584, 'max_score': 1115.131, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/g8D5YL6cOSE/pics/g8D5YL6cOSE1115131.jpg'}, {'end': 1202.781, 'src': 'embed', 'start': 1158.724, 'weight': 1, 'content': [{'end': 1163.248, 'text': 'our objective function consists of our regularizer and our loss function.', 'start': 1158.724, 'duration': 4.524}, {'end': 1164.97, 'text': 'we add them both together.', 'start': 1163.248, 'duration': 1.722}, {'end': 1168.753, 'text': 'so what we want to do is we want to optimize for this objective.', 'start': 1164.97, 'duration': 3.783}, {'end': 1176.299, 'text': "and by optimizing for this objective, we're going to find the optimal regularizer term and we're going to minimize the loss.", 'start': 1168.753, 'duration': 7.546}, {'end': 1179.202, 'text': "so we're going to do two things by optimizing for this objective.", 'start': 1176.299, 'duration': 2.903}, {'end': 1186.188, 'text': "And so the way we're going to optimize is we're going to perform gradient descent, right?", 'start': 1179.882, 'duration': 6.306}, {'end': 1194.034, 'text': "And so the way we're going to perform gradient descent is by taking the partial derivative of both of these two terms, of both of these terms.", 'start': 1186.248, 'duration': 7.786}, {'end': 1202.781, 'text': "We're gonna take the partial derivative of the regularizer and we're gonna take the partial derivative of the loss term.", 'start': 1194.454, 'duration': 8.327}], 'summary': 'Optimizing objective function using gradient descent to find optimal regularizer and minimize loss.', 'duration': 44.057, 'max_score': 1158.724, 'thumbnail': 
'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/g8D5YL6cOSE/pics/g8D5YL6cOSE1158724.jpg'}], 'start': 981.015, 'title': 'Hinge loss, regularization, and objective function', 'summary': 'Covers the hinge loss function in machine learning, along with its role in calculating total loss and optimizing the objective function. It also discusses the importance of finding the optimal regularizer term to avoid overfitting or underfitting the model and the process of optimizing the objective function through gradient descent.', 'chapters': [{'end': 1098.813, 'start': 981.015, 'title': 'Hinge loss and objective function', 'summary': 'Explains the hinge loss function in machine learning, which is defined as 1 minus y times f of x, set to zero when negative, and its role in calculating the total loss for a set of data points using the objective function.', 'duration': 117.798, 'highlights': ['The hinge loss function is defined as 1 minus y times f of x, with the condition that if the result is negative, it is set to zero, and it is used to calculate the total loss for a set of data points.', 'The objective function consists of the hinge loss function summed over all data points, denoted by the sigma term, and is represented as the total loss for the data set.', 'The objective function is denoted as min lambda w with the square sign, and it plays a crucial role in optimizing the model parameters for machine learning.']}, {'end': 1225.948, 'start': 1098.813, 'title': 'Optimizing regularization and loss function', 'summary': 'Explains the importance of finding the optimal regularizer term to avoid overfitting or underfitting the model, and the process of optimizing the objective function through gradient descent to minimize the loss and find the optimal regularizer term.', 'duration': 127.135, 'highlights': ['The regularizer term determines the balance between overfitting and underfitting, with a too high term leading to overfitting and a too low term leading to underfitting.', 'The objective 
function consists of the regularizer and the loss function, and the goal is to optimize for this objective through gradient descent to find the optimal regularizer term and minimize the loss.', 'Gradient descent is used to optimize the objective function by taking the partial derivatives of both the regularizer and the loss term, following the power rule for partial derivatives.']}], 'duration': 244.933, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/g8D5YL6cOSE/pics/g8D5YL6cOSE981015.jpg', 'highlights': ['The objective function is denoted as min lambda w with the square sign, and it plays a crucial role in optimizing the model parameters for machine learning.', 'The objective function consists of the regularizer and the loss function, and the goal is to optimize for this objective through gradient descent to find the optimal regularizer term and minimize the loss.', 'Gradient descent is used to optimize the objective function by taking the partial derivatives of both the regularizer and the loss term, following the power rule for partial derivatives.', 'The hinge loss function is defined as 1 minus y times f of x, with the condition that if the result is negative, it is set to zero, and it is used to calculate the total loss for a set of data points.', 'The objective function consists of the hinge loss function summed over all data points, denoted by the sigma term, and is represented as the total loss for the data set.', 'The regularizer term determines the balance between overfitting and underfitting, with a too high term leading to overfitting and a too low term leading to underfitting.']}, {'end': 1794.095, 'segs': [{'end': 1290.995, 'src': 'heatmap', 'start': 1251.281, 'weight': 0, 'content': [{'end': 1258.462, 'text': 'And what do I mean by a certain way? 
I mean we can update our weights by using both the regularizer term and the loss function term.', 'start': 1251.281, 'duration': 7.181}, {'end': 1261.763, 'text': "Okay, because this isn't zero, it's gonna be negative y times x.", 'start': 1258.822, 'duration': 2.941}, {'end': 1270.386, 'text': "But if we have correctly classified, then this value is going to be zero. It's going to be a zero.", 'start': 1262.603, 'duration': 7.783}, {'end': 1277.469, 'text': "so we don't need to update our loss, we only update our weights using our regularizer term,", 'start': 1270.386, 'duration': 7.083}, {'end': 1279.63, 'text': 'and so this term right here is a learning rate, by the way.', 'start': 1277.469, 'duration': 2.161}, {'end': 1285.112, 'text': "So it's weights plus learning rate times the regularizer term. The learning rate, by the way,", 'start': 1279.63, 'duration': 5.482}, {'end': 1290.995, 'text': "is another tuning knob for how fast we learn.", 'start': 1285.112, 'duration': 5.883}], 'summary': 'Weights are updated using regularizer and loss function terms, with learning rate as a tuning knob.', 'duration': 65.047, 'max_score': 1251.281, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/g8D5YL6cOSE/pics/g8D5YL6cOSE1251281.jpg'}, {'end': 1341.682, 'src': 'embed', 'start': 1314.157, 'weight': 2, 'content': [{'end': 1317.9, 'text': 'Um, now let me plug into some power here.', 'start': 1314.157, 'duration': 3.743}, {'end': 1322.185, 'text': "Okay, so now let's get into the code for this, right? 
We've talked about the math.", 'start': 1317.92, 'duration': 4.265}, {'end': 1323.566, 'text': "Let's get into the code.", 'start': 1322.585, 'duration': 0.981}, {'end': 1329.152, 'text': 'So for the code part, we can say, all right, well, we want to initialize a support vector machine.', 'start': 1324.127, 'duration': 5.025}, {'end': 1332.055, 'text': "We're gonna perform stochastic gradient descent, by the way.", 'start': 1329.172, 'duration': 2.883}, {'end': 1337.639, 'text': "But we're going to initialize a support vector machine with a set of weight vectors.", 'start': 1332.856, 'duration': 4.783}, {'end': 1341.682, 'text': "And these weight vectors are the coefficients of the model that we're trying to approximate.", 'start': 1337.759, 'duration': 3.923}], 'summary': 'Initializing support vector machine with weight vectors for stochastic gradient descent.', 'duration': 27.525, 'max_score': 1314.157, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/g8D5YL6cOSE/pics/g8D5YL6cOSE1314157.jpg'}, {'end': 1537.523, 'src': 'embed', 'start': 1508.572, 'weight': 4, 'content': [{'end': 1509.994, 'text': "So we've trained our model, right?", 'start': 1508.572, 'duration': 1.422}, {'end': 1517.584, 'text': "We've trained it on those five toy data points, and so now what we can do is we can now plot this model.", 'start': 1510.034, 'duration': 7.55}, {'end': 1525.031, 'text': 'We can plot this model and we can add testing data as well?', 'start': 1518.125, 'duration': 6.906}, {'end': 1525.932, 'text': 'right, but we have.', 'start': 1525.031, 'duration': 0.901}, {'end': 1533.98, 'text': 'we had a misclassification case and we had a correct classification case and, depending on whether it was misclassified or classified correctly,', 'start': 1525.932, 'duration': 8.048}, {'end': 1537.523, 'text': 'we updated our weights using a different strategy.', 'start': 1533.98, 'duration': 3.543}], 'summary': 'Trained model on 5 toy data points, updated 
weights based on classification results.', 'duration': 28.951, 'max_score': 1508.572, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/g8D5YL6cOSE/pics/g8D5YL6cOSE1508572.jpg'}], 'start': 1225.948, 'title': 'Weight updating and SVM in code', 'summary': 'Discusses weight updating and learning rate in classification, emphasizing the regularizer term and learning rate in model convergence. It also covers implementing an SVM using stochastic gradient descent, zero initialization, weight updating, and visualization of classification accuracy.', 'chapters': [{'end': 1314.157, 'start': 1225.948, 'title': 'Weight updating and learning rate in classification', 'summary': 'Discusses the conditions for misclassification and classification in updating weights for a learning model, emphasizing the role of the regularizer term and learning rate in determining the convergence of the model.', 'duration': 88.209, 'highlights': ['The regularizer term and loss function term are used to update weights based on misclassification or correct classification, with the weights updated by the regularizer term alone when correctly classified, and by both terms when misclassified.', 'The learning rate serves as a tuning knob for the speed of learning, with a too high learning rate leading to overshooting the minima and a too low learning rate resulting in slow convergence or non-convergence.']}, {'end': 1794.095, 'start': 1314.157, 'title': 'Support vector machines in code', 'summary': 'Covers implementing a support vector machine using stochastic gradient descent, initializing weight vectors with zeros, updating weights based on misclassification and correct classification, and plotting the model to visualize the classification accuracy.', 'duration': 479.938, 'highlights': ['Implementing a support vector machine using stochastic gradient descent. The code involves initializing a support vector machine with weight vectors, learning rate, number of epochs, and a list of 
errors, and updating weights using the stochastic gradient descent method.', 'Updating weights based on misclassification and correct classification. The weights are updated based on misclassification and correct classification cases using specific strategies and a regularizer term, aiming to minimize the error value over iterations.', 'Visualizing the classification accuracy by plotting the model. The model is plotted to visualize the accuracy of classifying both training and testing data, ensuring that the positive and negative labeled data points are correctly classified and lie on the correct side of the decision boundary.']}], 'duration': 568.147, 'thumbnail': 'https://coursnap.oss-ap-southeast-1.aliyuncs.com/video-capture/g8D5YL6cOSE/pics/g8D5YL6cOSE1225948.jpg', 'highlights': ['The regularizer term and loss function term update weights based on misclassification or correct classification.', 'The learning rate serves as a tuning knob for the speed of learning.', 'Implementing a support vector machine using stochastic gradient descent.', 'Updating weights based on misclassification and correct classification.', 'Visualizing the classification accuracy by plotting the model.']}], 'highlights': ['SVMs are great for classification with small data sets, typically 1000 rows or less, and are very popular.', 'Provides examples for handwritten digit classification and emotion classification using SVM.', 'The decision of which classifier to use depends on both the problem and the data, emphasizing the importance of choosing the right classifier based on the problem and data.', 'The chapter covers building a support vector machine to classify two classes of data using gradient descent.', 'Maximizing the margin is crucial for accurate classification, ensuring new data points fall on the correct side.', "Support vectors aid in maximizing the margin for accurate classification, pivotal for the support vector machine's functionality.", 'The kernel trick is introduced as a 
method to enable linear hyperplane classification for nonlinear data.', 'The hinge loss is a popular loss function for support vector machines, falling under maximum margin classification algorithms. It aims to maximize the margin between classes to draw the best decision boundary.', 'The objective function consists of the regularizer and the loss function, and the goal is to optimize for this objective through gradient descent to find the optimal regularizer term and minimize the loss.', 'The regularizer term and loss function term update weights based on misclassification or correct classification.', 'The learning rate serves as a tuning knob for the speed of learning.']}